########################################################
#### Week 6 Lab: Dealing With Messy Data - Outliers ####
#### Friday, October 13th, 2017                      ####
########################################################

################################################################################################
#### Case Analysis Part 1: Examine the data ####################################################
################################################################################################

# We're going to work with the Prestige dataset again, but we've modified it, so download
# the file from the website.

library(lmSupport)

d = dfReadDat()


#### 1. Univariate Statistics ####

# Explore the data. This should be the first step in any analysis.
# Do the means look appropriate?
# Do the SDs make sense?
# Is this the range we would expect for these variables?

# We can plot education
varPlot()

# This useful command gives you the descriptives of the variable and shows you a frequency plot.

# Identify which case has this anomalous value
# What does this mean to R?

# Remove it.
d = dfRemoveCases()

# What does this mean?
# If this were our own data, we might go back and see if we could correct this error.

# What might be the problem with coding the case removal this way?
?dfRemoveCases
# What could we do instead?


#### 2. Univariate Plots ####

# Let's say that we're interested in predicting income from education and women.
# So, those are the variables we're going to focus on.

# This is an extremely informal and simplistic way to detect outliers. All you're going to find
# here are highly anomalous values, and you'll only be able to remove things based on impossibility.


#### 3. Bivariate correlations ####

library(psych)
corr.test()

# This gives us the correlation matrix that shows the correlations between education, income, and
# women for the data frame d. Here, we are getting a sense of how our variables are related to each
# other (very similar to last week). What if you need more precise values for these statistics?

# We can look at the scatterplot matrix to assess departures from (univariate) normality and to
# visually inspect violations of bivariate assumptions. We wouldn't make decisions based on these
# plots, but they might suggest some problematic cases.

# What's a simple thing we could have done to make this whole section a bit easier?


########################################################################################################
#### Case Analysis Part 2: Diagnostic functions ########################################################
########################################################################################################

# The case analysis function in lmSupport is modelCaseAnalysis(), followed by a model object and
# one of 7 options, including RESIDUALS, UNIVARIATE, HATVALUES, and COOKSD.

# Remember that we're interested in predicting income from education and women.

# What did we learn?

# If you want, open a larger plot window
# quartz()   # for Macs
# windows()  # for PCs


#### 4. LEVERAGE ####

# Leverage = ?
# What is the hat value an index of?

xHats = modelCaseAnalysis()

# Here's how this works. Click on points that you would consider to be problematic (far beyond the red
# line, separated from the rest of the data, etc.). It will look like nothing happened, but trust us,
# something did. When you're done, click 'Finish.' R will store information about these points in
# an object we've named xHats.
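# A minimal sketch of what this call might look like once filled in. The model object name (m) is
# our own placeholder, and the exact argument names should be checked against ?modelCaseAnalysis.
m = lm(income ~ education + women, data = d)        # the model described in the comments above
xHats = modelCaseAnalysis(m, Type = 'HATVALUES')    # hat values (leverage), one per case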
# You can call this object to remind yourself what points are problematic.

# Do any of these points look extreme?
# Is one point separated from the distribution of hat values?
# Why might this be?

# So, are we ready to remove this occupation from our data?


#### 5. REGRESSION OUTLIERS ####

# Regression outliers = ?
# What is this an index of?

# The RESIDUALS graph plots a histogram of the (studentized) residuals for an individual model.
# Because the studentized residuals are distributed as t, one can calculate a p-value for each
# point that tests whether a given point significantly differs from the model. These p-values
# are automatically Bonferroni corrected for the fact that you are doing a large number of tests
# (one for every observation in your dataframe).

xResids = modelCaseAnalysis()

# What does it mean for these studentized residuals to be large? How could they affect our model estimates?
# Should we remove these observations from our data?


#### 6. INFLUENCE ####

# Influence: Cook's D
# What two characteristics does influence take into account?
# Is it necessary for an influential point to have high leverage? A large studentized residual?

xCooks = modelCaseAnalysis()

# Here, several points exceed the cutoff. This cutoff is only a rough rule of thumb, though.
# What's more important is whether or not the points are part of our distribution of Cook's
# distance.
# How many points fall beyond a large gap in that distribution? And which are they?
# But what are we missing here?

# Influence: dfBetas
# dfBetas index influence on each parameter separately. They help us sort out which points are
# affecting which parameters.

modelCaseAnalysis()

# You'll notice this works somewhat differently than the others. Here, we get a figure for each of
# the parameters. You also want to record the row numbers of the observations as you go.

# Influence: The influence plot

xInfs = modelCaseAnalysis()

# What's going on here? What's the circle size? What are the axes?
# The size of the circle
# The X axis
# The Y axis

# BEFORE YOU CLICK THE POINTS, guess which circles represent the three 'problematic' points we've
# identified in our previous analyses.


#### So..... now what?? ####

# Through our analyses, we found ___ and ___ are ill-fit by our model, have
# extreme values on the outcome variable (income), and have a large influence on our parameter
# estimates. Furthermore, including them inflates the standard error of our parameter estimates.
# We also identified one case that has high leverage but is well-fit by our model.


#### KEY QUESTIONS ####

# 1. Can we explain this result from some kind of error in coding? Can we correct the error(s)?

# 2. What is the impact of removing the cases?
#    What changes? Conclusions? Precision?

# Another, much less common option is "bringing your outliers to the fence," which means recoding the
# observation's value to the boundary implied by the residual cutoff, so that it no longer counts as
# an outlier. This allows you to maintain power while addressing extreme values. You'd want to have a
# good policy for doing this, though.
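# A sketch of how the steps above might look once filled in, continuing with the placeholder model m
# from the leverage section. The Type strings follow the options listed earlier, and the row names
# passed to dfRemoveCases() are placeholders for whichever occupations you actually flagged.
xResids = modelCaseAnalysis(m, Type = 'RESIDUALS')   # studentized residuals with Bonferroni-corrected tests
xCooks  = modelCaseAnalysis(m, Type = 'COOKSD')      # Cook's distance for each case

# Impact of removal (key question 2): refit without the flagged cases and compare estimates and SEs.
dTrim = dfRemoveCases(d, c('occupation.A', 'occupation.B'))   # placeholder row names
mTrim = lm(income ~ education + women, data = dTrim)
modelSummary(m)       # with the flagged cases included
modelSummary(mTrim)   # with them removed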
########################################################################################
#### CONSIDERING OPEN SCIENCE STANDARDS ################################################
########################################################################################

# This is a new section as of Fall 2016. Edited 2017.

# Recently, some research practices have been called into question in light of the replication crisis.
# Outlier analysis is one area strongly impacted by this development in the field. Basically,
# you'll want to think about and answer a lot of questions regarding outliers BEFORE you even start
# collecting data, and you'll have to have good reasons for the decisions you make.

# Here are a couple of questions you might ask before you even start:

# What would we consider too extreme of an observation on a predictor variable?
# You should try to answer this question in terms of practical raw values as opposed to hat values
# (e.g., we might decide an IQ score below 70 is too low for inclusion in our study).

# Can we identify some other variable that might contribute to outliers?
# One example of this is recording a subject's duration on a questionnaire and setting cutoff points
# for what 'acceptable' times are based on pilot testing (e.g., throwing out anyone who completes
# it in under 5 minutes). A sketch of applying rules like these appears at the end of this file.

# What is the process we'll use to identify outliers, and what will we do with them when we find them?
# It's very difficult to accuse you of p-hacking if you can point to your pre-registration and show
# that all of the actions you took to deal with outliers are wholly consistent with what you said you
# were going to do.

# Once you have collected your data and identified outliers, there are a few more questions you
# might ask about these observations that can help justify your decision to exclude them:

# Was there human error in entering this value?
# In certain cases, you can go back to the source and see whether this observation is simply due
# to data entry error. We love situations like this (because they're pretty easy to fix).

# Was anything about that participant's experimental session weird?
# Maybe the power went out, they got a phone call, or they fell asleep. If this is a relevant concern,
# someone should keep a notebook where they jot down observations of strange occurrences when
# running a given participant.

# Does it make sense that their value is extreme?
# If we ran the Prestige study in America in 2017, we would have an even more skewed income graph:
# people in top positions are paid astronomically more than working-class people. We might be able
# to tell a compelling story about why these extreme observations have so much influence, and our
# (theory-based) argument could give us reason to exclude them. This can look quite suspicious, though.

# Based on the claims we're trying to make with our research, can we exclude these observations?
# Removing extreme observations can, in certain cases, affect the external validity of our research.
# For example, if we exclude top earners when we analyze the Prestige data, we can't claim that
# our conclusions apply to those professions. However, we will have more precise parameter estimates
# for the large majority of occupations we've included in our sample. We have to note these
# shortcomings by being precise with our conclusions.
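# A minimal sketch of applying pre-registered, raw-value exclusion rules of the kind described above,
# before any modeling. The data frame (dStudy) and variables (DurationMins, IQ) are hypothetical
# (they are not in the Prestige data), and the cutoffs simply echo the examples given: under 5
# minutes on the questionnaire, or an IQ score below 70.
dStudy = dStudy[dStudy$DurationMins >= 5 & dStudy$IQ >= 70, ]   # keep only cases meeting both rules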