########################################################
#### Week 6 Lab: Dealing With Messy Data - Outliers ####
####          Friday, October 13th, 2017            ####
########################################################

################################################################################################
#### Case Analysis Part 1: Examine the data ####################################################
################################################################################################

# We're going to work with the Prestige dataset again, but we've modified it, so download
# the file from the website.
library(lmSupport)
d = dfReadDat('Lab6_prestigedata.dat')

#### 1. Univariate Statistics ####

# Explore the data. This should be the first step in any analysis.
varDescribe(d)
# Do the means look appropriate? Generally, yes.
# Do the SDs make sense? Mostly, yes.
# Is this the range we would expect for these variables? Argh, no!

# We can plot education and see that one case is way out there.
varPlot(d$education)
# This useful command gives you the descriptives of the variable and shows you a frequency plot.

# Identify which case has this anomalous value.
d[d$education == -10.7, ]
# This is to say: in the data frame d, find the row with education equal to -10.7.

# Remove it.
d = dfRemoveCases(d, 103)
# This is to say: remove case 103 from dataset d.
# If this were our own data, we might go back and see if we could correct this error.

# What might be the problem with coding the case removal this way?
?dfRemoveCases
# These are row numbers, which depend on the order the data are in when we read them in.
# This means that if we reorder the data file but keep this code the same, we're deleting
# observations that have no problems.

# What could we do instead?
d = d[d$education > 0, ]
d = dfRemoveCases(d, which(d$education == -10.7))
# which() returns the row numbers of the observations that meet a given criterion,
# so the removal no longer depends on the order of the file.

#### 2. Univariate Plots ####

# Let's say that we're interested in predicting income from education and women.
# So, those are the variables we're going to focus on.
varPlot(d$education) # this looks much better now
varPlot(d$income)    # should we remove the "outliers" here? Probably not (not a mistake or miscoding)
varPlot(d$women)

# This is an extremely informal and simplistic way to detect outliers. All you're going to find
# here are highly anomalous values, and you'll only be able to remove things based on impossibility.

#### 3. Bivariate correlations ####

library(psych)
corr.test(d[, c('education', 'income', 'women')])
# This gives us the correlation matrix showing the correlations between education, income, and
# women for the data frame d. Here, we are getting a sense of how our variables are related to
# each other (very similar to last week). What if you need more precise values for these statistics?
print(corr.test(d[, c('education', 'income', 'women')]), digits = 4)

library(car) # spm() (the scatterplot matrix function) comes from the car package
spm(d[, c('education', 'income', 'women')])
# We can look at the scatterplot matrix to assess departures from (univariate) normality and to
# visually inspect violations of bivariate assumptions. We wouldn't make decisions based on these
# plots, but they might suggest some problematic cases.

# What's a simple thing we could have done to make this whole section a bit easier?
# We could have saved our vars of interest as an object, as in the sketch below.
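# A minimal sketch of that idea. The object name 'vars' is our own choice here, not something
# from the lab materials; everything else uses the same functions as above.
vars = c('education', 'income', 'women')
varDescribe(d[, vars])  # descriptives for just the focal variables
corr.test(d[, vars])    # correlation matrix for the same set
spm(d[, vars])          # scatterplot matrix without retyping the variable names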
########################################################################################################
#### Case Analysis Part 2: Diagnostic functions ########################################################
########################################################################################################

# The case analysis function in lmSupport is modelCaseAnalysis(), which takes a model object and
# one of 7 Type options, including 'RESIDUALS', 'UNIVARIATE', 'HATVALUES', and 'COOKSD'.
# Remember that we're interested in predicting income from education and women.
mod1 = lm(income ~ education + women, data = d)
modelSummary(mod1)
# What did we learn? Like last week, years of education has a positive relationship with income
# and percent women has a negative relationship with income.

# quartz()  # opens a new graphics window on Mac
# windows() # opens a new graphics window on Windows

#### 4. LEVERAGE ####

# Leverage = hat values
# What is the hat value an index of?
# It's how far a given case's values on the predictor variables are from the centroid of the data.
xHats = modelCaseAnalysis(mod1, Type = 'HATVALUES')
# Here's how this works. Click on points that you would consider to be problematic (far beyond the
# red line, separated from the rest of the data, etc.). It will look like nothing happened, but trust
# us, something did. When you're done, click 'Finish.' R will store information about these points in
# an object we've named xHats. You can call this object to remind yourself which points are problematic.

# Do any of these points look extreme?
# Is one point separated from the distribution of hat values?
xHats                # gives us the row name and the hat value (measure of leverage)
d[xHats$Rownames, ]  # gives us the complete data record for that row
# Sewing machine operators have high leverage.
# Why might this be?
# They have low education and a high percentage of women.
# So, are we ready to remove this occupation from our data?
# No!

#### 5. REGRESSION OUTLIERS ####

# Regression outliers = large residuals
# What is this an index of?
# It indicates how dramatically a given observation deviates from the model we've estimated.
# The RESIDUALS graph plots a histogram of the (studentized) residuals for the model. Because the
# studentized residuals are distributed as t, one can calculate a p-value for each point that tests
# whether that point differs significantly from the model's predictions. These p-values are
# automatically Bonferroni corrected for the fact that you are doing a large number of tests
# (one for every observation in your data frame).
xResids = modelCaseAnalysis(mod1, Type = 'RESIDUALS')
# What does it mean for these studentized residuals to be large? How could they affect our model estimates?
# It means the observation differs substantially from the predictions made by our model. Regression
# outliers can (but don't always) change our parameter estimates, and they do inflate the standard
# errors of our parameter estimates.
xResids               # gives you row names and studentized residual values
d[xResids$Rownames, ]
# Two cases are beyond the red cutoff and are pretty separated from the bulk of the data: general
# managers and physicians.
# The next step is to look at their data: both appear to have very high occupational incomes
# compared to the rest of the data.
varDescribe(d$income)
# Should we remove these observations from our data?
# It depends! On factors we will elaborate on later.
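# If you want to see the numbers behind these last two plots, base R computes the same
# diagnostics directly. This is just a cross-check sketch alongside modelCaseAnalysis(), not
# part of the lab workflow itself.
lev  = hatvalues(mod1) # leverage (hat value) for each observation
tres = rstudent(mod1)  # studentized (deleted) residuals
# Bonferroni-corrected p-values for the studentized residuals, as described above:
n = nobs(mod1)
p.bonf = pmin(1, 2 * pt(abs(tres), df = df.residual(mod1) - 1, lower.tail = FALSE) * n)
head(sort(p.bonf))     # the smallest corrected p-values flag candidate regression outliers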
#### 6. INFLUENCE ####

# Influence: Cook's distance
# What two characteristics does influence take into account?
# Leverage and studentized residuals. Observations with a large Cook's D are extreme on both of these.
# Is it necessary for an influential point to have high leverage? A large studentized residual?
# Not necessarily, but it will have to be high on at least one of these dimensions. Remember,
# Cook's D basically indicates how much our parameter estimates would change if we removed
# a given observation.
xCooks = modelCaseAnalysis(mod1, Type = 'COOKSD')
xCooks
d[xCooks$Rownames, ]
# Here, several points exceed the cutoff. This cutoff is only a rough rule of thumb, though.
# What's more important is whether or not the points are part of our distribution of Cook's
# distances.
# How many points show a large gap? Two. And which are they? Two of the cases we were already
# suspicious of!
# But what are we missing here?
# We don't know which parameters are being affected by these influential points.

# Influence: dfbetas
# DFBetas quantify influence on a given parameter. They help us sort out which points are affecting
# which parameters.
modelCaseAnalysis(mod1, Type = 'DFBETAS')
# You'll notice this works somewhat differently than the others. Here, we get a figure for each of
# the parameters. You also want to record the row numbers of the observations as you go.
# Case 024 is influential on the intercept (sort of) and on education (sort of); there is no real
# problem for percent women. In the scatterplots, we also identify case 002.

# Influence: the influence plot
xInfs = modelCaseAnalysis(mod1, Type = 'INFLUENCEPLOT')
# What's going on here? What's the circle size? What are the axes?
# The size of the circle indicates the Cook's D value.
# The X axis is the hat value.
# The Y axis is the studentized residual.
# BEFORE YOU CLICK THE POINTS, guess which circles represent the three 'problematic' points we've
# identified in our previous analyses.
xInfs
d[xInfs$Rownames, ]
# From our Cook's D plot, we already saw that general managers and physicians had high influence.
# Notice: sewing machine operators have high leverage, but that point is not a regression outlier.
# While we might want to investigate this case further, it seems less potentially problematic.

#### So..... now what?? ####

# Through our analyses, we found that physicians and general managers are ill-fit by our model, have
# extreme values on the outcome variable (income), and have a large influence on our parameter
# estimates. Furthermore, including them inflates the standard errors of our parameter estimates.
# We also identified one case that has high leverage but is well-fit by our model.

#### KEY QUESTIONS ####

# 1. Can we explain this result by some kind of error in coding? Can we correct the error(s)?
# We certainly can't in this case, since we're not the researchers, so continue.

# 2. What is the impact of removing the cases?
d2 = dfRemoveCases(d, c(which(d$name == 'general.managers'), which(d$name == 'physicians')))
mod2 = lm(income ~ education + women, data = d2)
modelSummary(mod1) # including the problem observations
modelSummary(mod2) # excluding the problem observations
# What changes? Conclusions? Precision?
# In this case, our conclusions do not change, but our precision improves. The decision about whether
# to remove these observations therefore carries less weight in this example. If we did choose to
# remove them, we would have to report this in our Results section, and we'd have to provide some
# reason for doing so (see below).

# Another, much less common option is "bringing your outliers to the fence," which means changing the
# value of the observation to the value at the residual cutoff, so that it is no longer an outlier.
# This allows you to maintain power while addressing extreme values. You'd want to have a good policy
# for doing this, though; a rough sketch follows.
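# A minimal sketch of the "fence" idea. The data frame d3 and the fence value are our own
# illustrative choices (the lab doesn't prescribe a specific cutoff); in practice you would derive
# the fence from the residual cutoff discussed above.
d3 = d
income.fence = 15000                      # hypothetical ceiling for income
d3$income = pmin(d3$income, income.fence) # cases above the fence are pulled back to it
mod3 = lm(income ~ education + women, data = d3)
modelSummary(mod3)                        # compare with mod1 (all cases) and mod2 (cases removed)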
########################################################################################
#### CONSIDERING OPEN SCIENCE STANDARDS ################################################
########################################################################################

# This is a new section as of Fall 2016. Edited 2017.

# Recently, some research practices have been called into question in light of the replication
# crisis. Outlier analysis is one area strongly affected by this development in the field. Basically,
# you'll want to think about and answer a lot of questions regarding outliers BEFORE you even start
# collecting data, and you'll have to have good reasons for the decisions you make.

# Here are a couple of questions you might ask before you even start:

# What would we consider too extreme an observation on a predictor variable?
# You should try to answer this question in terms of practical raw values as opposed to hat values
# (e.g., we might decide an IQ score below 70 is too low for inclusion in our study).

# Can we identify some other variable that might contribute to outliers?
# One example of this is recording a subject's duration on a questionnaire and setting cutoffs
# for what 'acceptable' times are based on pilot testing (e.g., throwing out anyone who completes
# it in under 5 minutes).

# What is the process we'll use to identify outliers, and what will we do with them when we find them?
# It's very difficult to accuse you of p-hacking if you can point to your pre-registration and show
# that all of the actions you took to deal with outliers are wholly consistent with what you said you
# were going to do. (A small sketch of pre-registered exclusion rules appears at the end of this file.)

# Once you have collected your data and identified outliers, there are a few more questions you
# might ask about these observations to help justify a decision to exclude them:

# Was there human error in entering this value?
# In certain cases, you can go back to the source and see whether this observation is simply due
# to a data entry error. We love situations like this (because they're pretty easy to fix).

# Was anything about that participant's experimental session weird?
# Maybe the power went out, they got a phone call, or they fell asleep. If this is a relevant concern,
# someone should keep a notebook and jot down observations of strange occurrences while running
# a given participant.

# Does it make sense that their value is extreme?
# If we ran the Prestige study in America in 2017, we would have an even more skewed income
# distribution: people in top positions are paid astronomically more than working-class people. We
# might be able to tell a compelling story about why these extreme observations have so much
# influence, and our (theory-based) argument could give us a reason to exclude them. It can look
# quite suspicious, though.

# Based on the claims we're trying to make with our research, can we exclude these observations?
# Removing extreme observations can, in certain cases, affect the external validity of our research.
# For example, if we exclude top earners when we analyze the prestige data, we can't claim that
# our conclusions apply to those professions. However, we will have more precise parameter estimates
# for the large majority of occupations we've included in our sample. We have to acknowledge these
# shortcomings by being precise with our conclusions.
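# A minimal sketch of pre-registered exclusion rules applied in code. The data frame and the
# variable names (IQ, duration.mins) are hypothetical, as are the cutoffs; the point is simply that
# the rules are written down before data collection and then applied mechanically afterwards.
d.raw = data.frame(IQ = c(102, 68, 115, 94), duration.mins = c(12, 9, 3, 14))
d.clean = d.raw[d.raw$IQ >= 70 & d.raw$duration.mins >= 5, ]
nrow(d.raw) - nrow(d.clean) # report how many cases the exclusion rules removed (here, 2)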