########################################################
#### Week 6 Lab: Dealing With Messy Data - Outliers ####
#### Friday, October 13th, 2017                      ####
########################################################

################################################################################################
#### Case Analysis Part 1: Examine the data ####################################################
################################################################################################

# We're going to work with the Prestige dataset again, but we've modified it, so download
# the file from the website.

library(lmSupport)

d = dfReadDat()


#### 1. Univariate Statistics ####

# Explore the data. This should be the first step in any analysis.
# Do the means look appropriate?
# Do the SDs make sense?
# Is this the range we would expect for these variables?

# We can plot education
varPlot()

# This useful command gives you the descriptives of the variable and shows you a frequency plot.

# Identify which case has this anomalous value
# What does this mean to R?

# Remove it.
d = dfRemoveCases()

# What does this mean?
# If this were our own data, we might go back and see if we could correct this error.

# What might be the problem with coding the case removal this way?
?dfRemoveCases
# What could we do instead?


#### 2. Univariate Plots ####

# Let's say that we're interested in predicting income from education and women.
# So, those are the variables we're going to focus on.

# This is an extremely informal and simplistic way to detect outliers. All you're going to find
# here are highly anomalous values, and you'll only be able to remove things based on impossibility.


#### 3. Bivariate correlations ####

library(psych)
corr.test()

# This gives us the correlation matrix that shows the correlations between education, income, and
# women for the data frame d. Here, we are getting a sense of how our variables are related to each
# other (very similar to last week). What if you need more precise values for these statistics?

# We can look at the scatterplot matrix to assess departures from (univariate) normality and to
# visually inspect violations of bivariate assumptions. We wouldn't make decisions based on these
# plots, but they might suggest some problematic cases.

# What's a simple thing we could have done to make this whole section a bit easier?


########################################################################################################
#### Case Analysis Part 2: Diagnostic functions ########################################################
########################################################################################################

# The case analysis function in lmSupport is modelCaseAnalysis(), followed by a model object and
# one of 7 options, including RESIDUALS, UNIVARIATE, HATVALUES, and COOKSD.

# Remember that we're interested in predicting income from education and women.

# What did we learn?

# If you want, open a larger plot window
# quartz()   # for Macs
# windows()  # for PCs


#### 4. LEVERAGE ####

# Leverage = ?
# What is the hat value an index of?

xHats = modelCaseAnalysis()

# Here's how this works. Click on points that you would consider to be problematic (far beyond the red
# line, separated from the rest of the data, etc.). It will look like nothing happened, but trust us,
# something did. When you're done, click 'Finish.' R will store information about these points in
# an object we've named xHats.
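# A minimal sketch of what this call might look like once filled in. The model object name (m) is
# our own placeholder, and the exact argument names should be checked against ?modelCaseAnalysis.
m = lm(income ~ education + women, data = d)        # the model described in the comments above
xHats = modelCaseAnalysis(m, Type = 'HATVALUES')    # hat values (leverage), one per case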
# You can call this object to remind yourself what points are problematic.

# Do any of these points look extreme?
# Is one point separated from the distribution of hat values?
# Why might this be?

# So, are we ready to remove this occupation from our data?


#### 5. REGRESSION OUTLIERS ####

# Regression outliers = ?
# What is this an index of?

# The RESIDUALS graph plots a histogram of the (studentized) residuals for an individual model.
# Because the studentized residuals are distributed as t, one can calculate a p-value for each
# point that tests whether a given point significantly differs from the model. These p-values
# are automatically Bonferroni corrected for the fact that you are doing a large number of tests
# (one for every observation in your dataframe).

xResids = modelCaseAnalysis()

# What does it mean for these studentized residuals to be large? How could they affect our model estimates?
# Should we remove these observations from our data?


#### 6. INFLUENCE ####

# Influence: Cook's D
# What two characteristics does influence take into account?
# Is it necessary for an influential point to have high leverage? A large studentized residual?

xCooks = modelCaseAnalysis()

# Here, several points exceed the cutoff. This cutoff is only a rough rule of thumb, though.
# What's more important is whether or not the points are part of our distribution of Cook's
# distance.
# How many points fall beyond a large gap in that distribution? And which are they?
# But what are we missing here?

# Influence: dfBetas
# dfBetas index influence on each parameter separately. They help us sort out which points are
# affecting which parameters.

modelCaseAnalysis()

# You'll notice this works somewhat differently than the others. Here, we get a figure for each of
# the parameters. You also want to record the row numbers of the observations as you go.

# Influence: The influence plot

xInfs = modelCaseAnalysis()

# What's going on here? What's the circle size? What are the axes?
# The size of the circle
# The X axis
# The Y axis

# BEFORE YOU CLICK THE POINTS, guess which circles represent the three 'problematic' points we've
# identified in our previous analyses.


#### So..... now what?? ####

# Through our analyses, we found ___ and ___ are ill-fit by our model, have
# extreme values on the outcome variable (income), and have a large influence on our parameter
# estimates. Furthermore, including them inflates the standard error of our parameter estimates.
# We also identified one case that has high leverage but is well-fit by our model.


#### KEY QUESTIONS ####

# 1. Can we explain this result from some kind of error in coding? Can we correct the error(s)?

# 2. What is the impact of removing the cases?
#    What changes? Conclusions? Precision?

# Another, much less common option is "bringing your outliers to the fence," which means recoding the
# observation's value to the boundary implied by the residual cutoff, so that it no longer counts as
# an outlier. This allows you to maintain power while addressing extreme values. You'd want to have a
# good policy for doing this, though.
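# A sketch of how the steps above might look once filled in, continuing with the placeholder model m
# from the leverage section. The Type strings follow the options listed earlier, and the row names
# passed to dfRemoveCases() are placeholders for whichever occupations you actually flagged.
xResids = modelCaseAnalysis(m, Type = 'RESIDUALS')   # studentized residuals with Bonferroni-corrected tests
xCooks  = modelCaseAnalysis(m, Type = 'COOKSD')      # Cook's distance for each case

# Impact of removal (key question 2): refit without the flagged cases and compare estimates and SEs.
dTrim = dfRemoveCases(d, c('occupation.A', 'occupation.B'))   # placeholder row names
mTrim = lm(income ~ education + women, data = dTrim)
modelSummary(m)       # with the flagged cases included
modelSummary(mTrim)   # with them removed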
########################################################################################
#### CONSIDERING OPEN SCIENCE STANDARDS ################################################
########################################################################################

# This is a new section as of Fall 2016. Edited 2017.

# Recently, some research practices have been called into question in light of the replication crisis.
# Outlier analysis is one area strongly impacted by this development in the field. Basically,
# you'll want to think about and answer a lot of questions regarding outliers BEFORE you even start
# collecting data, and you'll have to have good reasons for the decisions you make.

# Here are a couple of questions you might ask before you even start:

# What would we consider too extreme of an observation on a predictor variable?
# You should try to answer this question in terms of practical raw values as opposed to hat values
# (e.g., we might decide an IQ score below 70 is too low for inclusion in our study).

# Can we identify some other variable that might contribute to outliers?
# One example of this is recording a subject's duration on a questionnaire and setting cutoff points
# for what 'acceptable' times are based on pilot testing (e.g., throwing out anyone who completes
# it in under 5 minutes). A sketch of applying rules like these appears at the end of this file.

# What is the process we'll use to identify outliers, and what will we do with them when we find them?
# It's very difficult to accuse you of p-hacking if you can point to your pre-registration and show
# that all of the actions you took to deal with outliers are wholly consistent with what you said you
# were going to do.

# Once you have collected your data and identified outliers, there are a few more questions you
# might ask about these observations that can help justify your decision to exclude them:

# Was there human error in entering this value?
# In certain cases, you can go back to the source and see whether this observation is simply due
# to data entry error. We love situations like this (because they're pretty easy to fix).

# Was anything about that participant's experimental session weird?
# Maybe the power went out, they got a phone call, or they fell asleep. If this is a relevant concern,
# someone should keep a notebook where they jot down observations of strange occurrences when
# running a given participant.

# Does it make sense that their value is extreme?
# If we ran the Prestige study in America in 2017, we would have an even more skewed income graph:
# people in top positions are paid astronomically more than working-class people. We might be able
# to tell a compelling story about why these extreme observations have so much influence, and our
# (theory-based) argument could give us reason to exclude them. This can look quite suspicious, though.

# Based on the claims we're trying to make with our research, can we exclude these observations?
# Removing extreme observations can, in certain cases, affect the external validity of our research.
# For example, if we exclude top earners when we analyze the Prestige data, we can't claim that
# our conclusions apply to those professions. However, we will have more precise parameter estimates
# for the large majority of occupations we've included in our sample. We have to note these
# shortcomings by being precise with our conclusions.
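# A minimal sketch of applying pre-registered, raw-value exclusion rules of the kind described above,
# before any modeling. The data frame (dStudy) and variables (DurationMins, IQ) are hypothetical
# (they are not in the Prestige data), and the cutoffs simply echo the examples given: under 5
# minutes on the questionnaire, or an IQ score below 70.
dStudy = dStudy[dStudy$DurationMins >= 5 & dStudy$IQ >= 70, ]   # keep only cases meeting both rules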