########################################################
#### Week 6 Lab: Dealing With Messy Data - Outliers ####
####          Friday, October 13th, 2017            ####
########################################################

################################################################################################
#### Case Analysis Part 1: Examine the data ####################################################
################################################################################################

# We're going to work with the Prestige dataset again, but we've modified it, so download
# the file from the website.
library(lmSupport)
d = dfReadDat('Lab6_prestigedata.dat')

#### 1. Univariate Statistics ####

# Explore the data. This should be the first step in any analysis.
varDescribe(d)
# Do the means look appropriate? Generally, yes.
# Do the SDs make sense? Mostly, yes.
# Is this the range we would expect for these variables? Argh, no!

# We can plot education and see that one case is way out there.
varPlot(d$education)
# This useful command gives you the descriptives of the variable and shows you a frequency plot.

# Identify which case has this anomalous value.
d[d$education == -10.7, ]
# This is to say: in the data frame d, find the row with education equal to -10.7.

# Remove it.
d = dfRemoveCases(d, 103)
# This is to say: remove case 103 from dataset d.
# If this were our own data, we might go back and see if we could correct this error.

# What might be the problem with coding the case removal this way?
?dfRemoveCases
# These are row numbers, which depend on the order the data are in when we read them in.
# This means that if we reorder the data file but keep this code the same, we're deleting
# observations that have no problems.

# What could we do instead?
d = d[d$education > 0, ]
d = dfRemoveCases(d, which(d$education == -10.7))
# which() returns the row numbers of the observations that meet a given criterion,
# so the removal no longer depends on the order of the file.

#### 2. Univariate Plots ####

# Let's say that we're interested in predicting income from education and women.
# So, those are the variables we're going to focus on.
varPlot(d$education) # this looks much better now
varPlot(d$income)    # should we remove the "outliers" here? Probably not (not a mistake or miscoding)
varPlot(d$women)

# This is an extremely informal and simplistic way to detect outliers. All you're going to find
# here are highly anomalous values, and you'll only be able to remove things based on impossibility.

#### 3. Bivariate correlations ####

library(psych)
corr.test(d[, c('education', 'income', 'women')])
# This gives us the correlation matrix showing the correlations between education, income, and
# women for the data frame d. Here, we are getting a sense of how our variables are related to
# each other (very similar to last week). What if you need more precise values for these statistics?
print(corr.test(d[, c('education', 'income', 'women')]), digits = 4)

library(car) # spm() (the scatterplot matrix function) comes from the car package
spm(d[, c('education', 'income', 'women')])
# We can look at the scatterplot matrix to assess departures from (univariate) normality and to
# visually inspect violations of bivariate assumptions. We wouldn't make decisions based on these
# plots, but they might suggest some problematic cases.

# What's a simple thing we could have done to make this whole section a bit easier?
# We could have saved our vars of interest as an object, as in the sketch below.
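# A minimal sketch of that idea. The object name 'vars' is our own choice here, not something
# from the lab materials; everything else uses the same functions as above.
vars = c('education', 'income', 'women')
varDescribe(d[, vars])  # descriptives for just the focal variables
corr.test(d[, vars])    # correlation matrix for the same set
spm(d[, vars])          # scatterplot matrix without retyping the variable names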
########################################################################################################
#### Case Analysis Part 2: Diagnostic functions ########################################################
########################################################################################################

# The case analysis function in lmSupport is modelCaseAnalysis(), which takes a model object and
# one of 7 Type options, including 'RESIDUALS', 'UNIVARIATE', 'HATVALUES', and 'COOKSD'.
# Remember that we're interested in predicting income from education and women.
mod1 = lm(income ~ education + women, data = d)
modelSummary(mod1)
# What did we learn? Like last week, years of education has a positive relationship with income
# and percent women has a negative relationship with income.

# quartz()  # opens a new graphics window on Mac
# windows() # opens a new graphics window on Windows

#### 4. LEVERAGE ####

# Leverage = hat values
# What is the hat value an index of?
# It's how far a given case's values on the predictor variables are from the centroid of the data.
xHats = modelCaseAnalysis(mod1, Type = 'HATVALUES')
# Here's how this works. Click on points that you would consider to be problematic (far beyond the
# red line, separated from the rest of the data, etc.). It will look like nothing happened, but trust
# us, something did. When you're done, click 'Finish.' R will store information about these points in
# an object we've named xHats. You can call this object to remind yourself which points are problematic.

# Do any of these points look extreme?
# Is one point separated from the distribution of hat values?
xHats                # gives us the row name and the hat value (measure of leverage)
d[xHats$Rownames, ]  # gives us the complete data record for that row
# Sewing machine operators have high leverage.
# Why might this be?
# They have low education and a high percentage of women.
# So, are we ready to remove this occupation from our data?
# No!

#### 5. REGRESSION OUTLIERS ####

# Regression outliers = large residuals
# What is this an index of?
# It indicates how dramatically a given observation deviates from the model we've estimated.
# The RESIDUALS graph plots a histogram of the (studentized) residuals for the model. Because the
# studentized residuals are distributed as t, one can calculate a p-value for each point that tests
# whether that point differs significantly from the model's predictions. These p-values are
# automatically Bonferroni corrected for the fact that you are doing a large number of tests
# (one for every observation in your data frame).
xResids = modelCaseAnalysis(mod1, Type = 'RESIDUALS')
# What does it mean for these studentized residuals to be large? How could they affect our model estimates?
# It means the observation differs substantially from the predictions made by our model. Regression
# outliers can (but don't always) change our parameter estimates, and they do inflate the standard
# errors of our parameter estimates.
xResids               # gives you row names and studentized residual values
d[xResids$Rownames, ]
# Two cases are beyond the red cutoff and are pretty separated from the bulk of the data: general
# managers and physicians.
# The next step is to look at their data: both appear to have very high occupational incomes
# compared to the rest of the data.
varDescribe(d$income)
# Should we remove these observations from our data?
# It depends! On factors we will elaborate on later.
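# If you want to see the numbers behind these last two plots, base R computes the same
# diagnostics directly. This is just a cross-check sketch alongside modelCaseAnalysis(), not
# part of the lab workflow itself.
lev  = hatvalues(mod1) # leverage (hat value) for each observation
tres = rstudent(mod1)  # studentized (deleted) residuals
# Bonferroni-corrected p-values for the studentized residuals, as described above:
n = nobs(mod1)
p.bonf = pmin(1, 2 * pt(abs(tres), df = df.residual(mod1) - 1, lower.tail = FALSE) * n)
head(sort(p.bonf))     # the smallest corrected p-values flag candidate regression outliers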
#### 6. INFLUENCE ####

# Influence: Cook's distance
# What two characteristics does influence take into account?
# Leverage and studentized residuals. Observations with a large Cook's D are extreme on both of these.
# Is it necessary for an influential point to have high leverage? A large studentized residual?
# Not necessarily, but it will have to be high on at least one of these dimensions. Remember,
# Cook's D basically indicates how much our parameter estimates would change if we removed
# a given observation.
xCooks = modelCaseAnalysis(mod1, Type = 'COOKSD')
xCooks
d[xCooks$Rownames, ]
# Here, several points exceed the cutoff. This cutoff is only a rough rule of thumb, though.
# What's more important is whether or not the points are part of our distribution of Cook's
# distances.
# How many points show a large gap? Two. And which are they? Two of the cases we were already
# suspicious of!
# But what are we missing here?
# We don't know which parameters are being affected by these influential points.

# Influence: dfbetas
# DFBetas quantify influence on a given parameter. They help us sort out which points are affecting
# which parameters.
modelCaseAnalysis(mod1, Type = 'DFBETAS')
# You'll notice this works somewhat differently than the others. Here, we get a figure for each of
# the parameters. You also want to record the row numbers of the observations as you go.
# Case 024 is influential on the intercept (sort of) and on education (sort of); there is no real
# problem for percent women. In the scatterplots, we also identify case 002.

# Influence: the influence plot
xInfs = modelCaseAnalysis(mod1, Type = 'INFLUENCEPLOT')
# What's going on here? What's the circle size? What are the axes?
# The size of the circle indicates the Cook's D value.
# The X axis is the hat value.
# The Y axis is the studentized residual.
# BEFORE YOU CLICK THE POINTS, guess which circles represent the three 'problematic' points we've
# identified in our previous analyses.
xInfs
d[xInfs$Rownames, ]
# From our Cook's D plot, we already saw that general managers and physicians had high influence.
# Notice: sewing machine operators have high leverage, but that point is not a regression outlier.
# While we might want to investigate this case further, it seems less potentially problematic.

#### So..... now what?? ####

# Through our analyses, we found that physicians and general managers are ill-fit by our model, have
# extreme values on the outcome variable (income), and have a large influence on our parameter
# estimates. Furthermore, including them inflates the standard errors of our parameter estimates.
# We also identified one case that has high leverage but is well-fit by our model.

#### KEY QUESTIONS ####

# 1. Can we explain this result by some kind of error in coding? Can we correct the error(s)?
# We certainly can't in this case, since we're not the researchers, so continue.

# 2. What is the impact of removing the cases?
d2 = dfRemoveCases(d, c(which(d$name == 'general.managers'), which(d$name == 'physicians')))
mod2 = lm(income ~ education + women, data = d2)
modelSummary(mod1) # including the problem observations
modelSummary(mod2) # excluding the problem observations
# What changes? Conclusions? Precision?
# In this case, our conclusions do not change, but our precision improves. The decision about whether
# to remove these observations therefore carries less weight in this example. If we did choose to
# remove them, we would have to report this in our Results section, and we'd have to provide some
# reason for doing so (see below).

# Another, much less common option is "bringing your outliers to the fence," which means changing the
# value of the observation to the value at the residual cutoff, so that it is no longer an outlier.
# This allows you to maintain power while addressing extreme values. You'd want to have a good policy
# for doing this, though; a rough sketch follows.
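# A minimal sketch of the "fence" idea. The data frame d3 and the fence value are our own
# illustrative choices (the lab doesn't prescribe a specific cutoff); in practice you would derive
# the fence from the residual cutoff discussed above.
d3 = d
income.fence = 15000                      # hypothetical ceiling for income
d3$income = pmin(d3$income, income.fence) # cases above the fence are pulled back to it
mod3 = lm(income ~ education + women, data = d3)
modelSummary(mod3)                        # compare with mod1 (all cases) and mod2 (cases removed)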
########################################################################################
#### CONSIDERING OPEN SCIENCE STANDARDS ################################################
########################################################################################

# This is a new section as of Fall 2016. Edited 2017.

# Recently, some research practices have been called into question in light of the replication
# crisis. Outlier analysis is one area strongly affected by this development in the field. Basically,
# you'll want to think about and answer a lot of questions regarding outliers BEFORE you even start
# collecting data, and you'll have to have good reasons for the decisions you make.

# Here are a couple of questions you might ask before you even start:

# What would we consider too extreme an observation on a predictor variable?
# You should try to answer this question in terms of practical raw values as opposed to hat values
# (e.g., we might decide an IQ score below 70 is too low for inclusion in our study).

# Can we identify some other variable that might contribute to outliers?
# One example of this is recording a subject's duration on a questionnaire and setting cutoffs
# for what 'acceptable' times are based on pilot testing (e.g., throwing out anyone who completes
# it in under 5 minutes).

# What is the process we'll use to identify outliers, and what will we do with them when we find them?
# It's very difficult to accuse you of p-hacking if you can point to your pre-registration and show
# that all of the actions you took to deal with outliers are wholly consistent with what you said you
# were going to do. (A small sketch of pre-registered exclusion rules appears at the end of this file.)

# Once you have collected your data and identified outliers, there are a few more questions you
# might ask about these observations to help justify a decision to exclude them:

# Was there human error in entering this value?
# In certain cases, you can go back to the source and see whether this observation is simply due
# to a data entry error. We love situations like this (because they're pretty easy to fix).

# Was anything about that participant's experimental session weird?
# Maybe the power went out, they got a phone call, or they fell asleep. If this is a relevant concern,
# someone should keep a notebook and jot down observations of strange occurrences while running
# a given participant.

# Does it make sense that their value is extreme?
# If we ran the Prestige study in America in 2017, we would have an even more skewed income
# distribution: people in top positions are paid astronomically more than working-class people. We
# might be able to tell a compelling story about why these extreme observations have so much
# influence, and our (theory-based) argument could give us a reason to exclude them. It can look
# quite suspicious, though.

# Based on the claims we're trying to make with our research, can we exclude these observations?
# Removing extreme observations can, in certain cases, affect the external validity of our research.
# For example, if we exclude top earners when we analyze the prestige data, we can't claim that
# our conclusions apply to those professions. However, we will have more precise parameter estimates
# for the large majority of occupations we've included in our sample. We have to acknowledge these
# shortcomings by being precise with our conclusions.
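# A minimal sketch of pre-registered exclusion rules applied in code. The data frame and the
# variable names (IQ, duration.mins) are hypothetical, as are the cutoffs; the point is simply that
# the rules are written down before data collection and then applied mechanically afterwards.
d.raw = data.frame(IQ = c(102, 68, 115, 94), duration.mins = c(12, 9, 3, 14))
d.clean = d.raw[d.raw$IQ >= 70 & d.raw$duration.mins >= 5, ]
nrow(d.raw) - nrow(d.clean) # report how many cases the exclusion rules removed (here, 2)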