ECO 440/640 — Problem Set 3

For this assignment, use the asec2016 data that I gave you earlier in the semester.

Goals for this assignment

There will be four things for you to get out of this assignment:

Dealing with overplotting

Use ggplot() to compare hourly wages to years of schooling in a plot. Here is an example without a title (you can add one with ggplot, or add one in the body of your assignment, which is typically how it is done so that fonts and figure numbering stay consistent):

    ggplot(data=asec2016, aes(x=school, y=hourwage)) +
        geom_point() +
        scale_y_continuous("Hourly wage", labels=scales::dollar) +
        scale_x_continuous("Years of school")
aes() stands for “aesthetics”, which apparently makes sense (but I have never quite understood how). Inside aes(), you can also put things like group=factor(female), color=factor(female) to separate the points by sex (but that would be useless in this case). What is wrong with this graph? Identify what we mean by “overplotting” in the graph.
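
As an aside on the color=factor(female) idea: here is a minimal sketch of what that would look like (it assumes asec2016 contains a 0/1 indicator called female, which is my reading of the text above):

    ggplot(data=asec2016, aes(x=school, y=hourwage, color=factor(female))) +
        geom_point() +
        scale_y_continuous("Hourly wage", labels=scales::dollar) +
        scale_x_continuous("Years of school")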

Here we will look at a few ways to deal with overplotting. The strategies come in a few forms:

  1. Make a bigger graph
  2. Change the shape or size of points
  3. Make points partially transparent (in R you need ggplot to do this easily)
  4. Plot conditional densities
  5. Use box and whisker plots
  6. Use logs
  7. Plot just a subsample of the data
  8. Remove outliers
  9. Introduce measurement error on purpose if one of the variables takes on a small set of possible values (“jittering”)

Artificial measurement error

Let's start with the last idea. Here is a little trick to deal with overplotting when you plot wages against schooling:

    asec2016$schoolMessy = asec2016$school + rnorm(dim(asec2016)[1], sd=.1)
Make sure you understand what that is doing. It is creating a new variable with a little bit of random error added onto the schooling variable. This way the space between 11 and 12 years on a graph will fill up with points instead of them all being stuck at exactly 11 or exactly 12 years. If you want to keep the points closer to their actual values, change the sd=.1 to a smaller value.
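
As an aside, ggplot2 can also jitter at plot time with geom_jitter(), which is roughly the plot-time analogue of schoolMessy (it uses uniform rather than normal noise). A minimal sketch:

    ggplot(data=asec2016, aes(x=school, y=hourwage)) +
        geom_jitter(width=.1, height=0) +   # jitter horizontally only; leave wages alone
        scale_y_continuous("Hourly wage", labels=scales::dollar) +
        scale_x_continuous("Years of school")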

Use a subsample

We should first ask exactly which data we are interested in. Create a sample of just white people who worked full-time and at least 48 weeks during the last year (a sketch of this step appears below). Then drop from the sample anyone who earned less than the 1st percentile or more than the 99th percentile. The quantile() function will help with this:

    asecSmall[asecSmall$hourwage > quantile(asecSmall$hourwage, probs=.01) &
              asecSmall$hourwage < quantile(asecSmall$hourwage, probs=.99), ]
gives a subset of asecSmall for which hourwage is between the 1st and 99th percentiles (as always, read the help file if you are confused). This is not how you should deal with all your data. It may be okay in this context to help us look at the graphs we will produce and to quickly eliminate cases of people who strangely report earning $.00004 per hour, but you should never just throw out outliers without thinking about why you are doing it.
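
For the first step (constructing asecSmall itself), here is one possible sketch. The variable names race, fullpart, and wkswork1 and their codes are guesses about what is in asec2016; check names(asec2016) and the codebook for the real ones:

    # NOTE: race, fullpart, and wkswork1 are hypothetical names/codes -- verify them
    # which() also conveniently drops rows where these variables are NA
    asecSmall = asec2016[which(asec2016$race==1 & asec2016$fullpart==1 &
                               asec2016$wkswork1>=48), ]

Then apply the quantile subset above to asecSmall and assign the result back to asecSmall.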

Now that we have a sample with fairly well-defined qualities, you should check how many observations we have by using dim(). Alternatively, you can count how many observations are not missing schooling information (your teacher was nice enough to tell you that he already dropped anyone who was missing wage information) with sum(!is.na(asecSmall$school)); this adds up all the TRUE values (where TRUE gets interpreted as 1 and FALSE as 0) from checking whether school is not NA (an exclamation point means “not”). Is this still too many to view on one graph? Probably. Take a random sample of 400-800 people from that sample (this is a small enough sample that when we plot it with some transparency we should be able to differentiate most of the points). You can use the sample() function for this and then use its output to index the rows of the data.frame you are using:

    # draw 600 distinct row numbers (without replacement) and keep those rows
    asecSmall2 = asecSmall[sample(dim(asecSmall)[1], 600), ]
Again, you should make sure you understand this, because I am providing you tools to use in the future. Note that unless you call set.seed() beforehand, you will get a different sample every time you run this. You do not have to set a seed; I mention it only for your information.
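
If you do want a reproducible sample, a minimal sketch (the seed value itself is arbitrary):

    set.seed(640)  # any integer works; it just fixes the random draws
    asecSmall2 = asecSmall[sample(dim(asecSmall)[1], 600), ]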


To get semi-transparent points, put alpha=.2 inside geom_point(). The value must be between 0 and 1 and controls how opaque the points are (0 is fully transparent and 1 is fully opaque).

Look at the plots

Use ggplot() to compare hourly wages to years of schooling in a plot with the sample you constructed above. Create two separate graphs: one with school and one with schoolMessy (or whatever you called the comparable variable). Use transparency. Remember to label your graphs properly (including axis labels with units, figure numbers, and captions).
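
For instance, the schoolMessy version might look something like this (a sketch using the names from above; the school version is identical except for the x variable):

    ggplot(data=asecSmall2, aes(x=schoolMessy, y=hourwage)) +
        geom_point(alpha=.2) +
        scale_y_continuous("Hourly wage", labels=scales::dollar) +
        scale_x_continuous("Years of school")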


Explain how these graphs show you that there is heteroskedasticity. Why is there heteroskedasticity in this case (hint: heteroskedasticity is very common when we deal with variables that are bounded below by zero like income, population, house floor area, or number of cigarettes smoked)? These really are different questions, and they are both very important (more important than generating regression tables). Does using the log of wages help solve the problem (check this with a graph)?
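
One way to check the log question (a sketch; you could equally well create a log-wage variable first or use scale_y_log10()):

    ggplot(data=asecSmall2, aes(x=schoolMessy, y=log(hourwage))) +
        geom_point(alpha=.2) +
        scale_y_continuous("Log hourly wage") +
        scale_x_continuous("Years of school")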

Answer this question: how valuable (in terms of wages) is a year of schooling? (No, this is not a bullet point for you to write one sentence on; treat this like your boss asked you this question and you have to submit an answer.) Use the entire sample of white, full-time workers (not just 600 people). Write out the regression you will use. You can choose to use logs or not, but be aware that anyone estimating this effect in real life would use the log of wages, whereas heteroskedasticity will be more obvious if you do not use logs (perhaps making this assignment more informative to you). Identify exactly which coefficients you are interested in and what you want to know about them. Produce a (standard format) regression table with at least two columns: one with the default standard error estimates and one with White's heteroskedasticity-consistent standard errors. Make sure it is clear in the table which is which. Note that the coefficients are not different across the two columns (we would normally include only one column and either report two separate standard errors or include only the robust standard errors, but reporting both standard errors in the same column would be more work for you, so you do not need to do this).
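
A minimal sketch of the mechanics (this assumes the sandwich and stargazer packages; type="HC1" matches Stata's default robust option, but other HC types are defensible):

    library(sandwich)   # vcovHC() for White's heteroskedasticity-consistent variance
    library(stargazer)

    reg = lm(log(hourwage) ~ school, data=asecSmall)
    robustSE = sqrt(diag(vcovHC(reg, type="HC1")))  # robust standard errors

    # same model twice: default SEs in column 1, robust SEs in column 2
    stargazer(reg, reg, se=list(NULL, robustSE),
              column.labels=c("Default SEs", "Robust SEs"),
              type="text")  # switch to type="latex" or "html" for your write-up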

It is not that interesting to test the hypothesis that a year of schooling has no effect on wages (since almost no one expects the effect to be 0). Instead we are probably interested in getting confidence intervals so we can plan for our futures. Construct the relevant confidence intervals using both standard error estimators. stargazer can include confidence intervals; confint() can calculate them for the usual standard errors, while coefci() (in the lmtest package) can calculate them for the robust standard errors (or you could just add and subtract 1.96*SE from the coefficient estimate). Are there economically significant differences?
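
A sketch of the two interval calculations, continuing with the reg object from above:

    library(lmtest)     # coefci()
    library(sandwich)   # vcovHC()

    confint(reg, level=.95)                      # intervals using the usual SEs
    coefci(reg, vcov.=vcovHC(reg, type="HC1"))   # intervals using White's robust SEs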