####################################################
##  Week 8 Lab: Assumptions and transformations   ##
##  Friday, October 20th, 2017                    ##
####################################################

############################################################
#### First Steps: Examine the data & check for outliers ####
############################################################

library(lmSupport)
library(car)
library(psych)

# We are going to return to the Prestige dataset. Remind yourself what's going on here.
d = dfReadDat('Lab8_DemoData.dat', SubID = "SubID")

# Remember we had a value that was clearly a data entry error,
# so let's remove it.

# Univariate exploration
# Note: This is NOT how we assess normality.

# Bivariate exploration
# Do the correlations match your intuitions?

# Fit our linear model predicting income from gender composition and education.

#### Case Analysis: Residuals ####
# Are there cases that don't fit the model? Which?

#### Case Analysis: Influence & Leverage ####
# Are there cases that exert a lot of influence on our model predictions? Which?
# Are they the same cases that did not fit the model (from the last analysis)?

#### Case Analysis: Removing problematic points ####
# Let's remove these data points before we begin testing our model
# assumptions. It's possible that a couple of extreme outliers can
# cause assumptions to be violated, and that by simply eliminating them
# the rest of the data conform.

#################################################
#### Second Step: Checking Model Assumptions ####
#################################################

# The GLM makes five primary assumptions about the structure of the data:
#   (1) no measurement error in the predictors;
#   (2) independence of residuals;
#   (3) normality of residuals;
#   (4) constant variance (homoscedasticity);
#   (5) linearity of relationships.

# Violations of these assumptions can cause:
#   1. Inefficient standard errors (low power, Type II errors)
#   2. Inaccurate standard errors (incorrect statistical tests, Type I errors)
#   3. Inaccurate parameter estimates

# Assumptions (3)-(5) can be evaluated using modelAssumptions().
# Assumption (1) is always violated to some extent, but essentially demands
# that you use the most reliable measures at your disposal.
# Assumption (2) can be addressed by a variety of approaches (e.g., repeated
# measures analysis).

###########################################
#### Assumption 3: Normality of errors ####
###########################################

# Remember that "normality" in this case refers to the model residuals, NOT the
# distribution of any single variable in the model.

# A quantile-comparison (Q-Q) plot of the studentized residuals readily displays
# how well the studentized residuals compare to a theoretical t-distribution.
# modelAssumptions() will give you this and more.

# The function will also spit out some statistical tests of these assumptions.
# THEY ARE NO SUBSTITUTE FOR YOUR OWN VISUAL INSPECTION OF THE RESIDUALS!

# In the case of our model predicting income from education and percentage women,
# modelAssumptions() tells us that our data seem to violate all the assumptions
# of the GLM. But this is not necessarily diagnostic on its own.
# So, should we be concerned at this point?

#########################################
#### Assumption 4: Constant variance ####
#########################################

# The constant variance assumption implies that, for any given predicted value
# of the dependent variable, the residuals will have the same variance.

# One easy way to assess the viability of this assumption is to plot the predicted
# values against the observed values and assess visually whether the residuals
# have more scatter at some predicted values than at others.
# modelAssumptions() will generate that figure, as well as a plot of the predicted
# values versus the absolute value of the residuals (a "spread-level plot").

# So... are we concerned now?
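# A minimal base-R sketch of what the case-analysis statistics and the plots that
# modelAssumptions() automates are computing. The data below are simulated
# stand-ins, NOT the Prestige dataset, and the coefficients are made up for
# illustration.

```r
set.seed(8)
n      <- 100
women  <- runif(n, 0, 100)
educ   <- rnorm(n, 11, 3)
income <- exp(8 + 0.03 * educ - 0.005 * women + rnorm(n, 0, 0.4))  # right-skewed DV
income[1] <- income[1] + 60000                                     # plant one gross outlier
m <- lm(income ~ women + educ)

# Case analysis
r    <- rstudent(m)        # studentized residuals: which cases does the model misfit?
h    <- hatvalues(m)       # leverage: unusual combinations of predictor values
cook <- cooks.distance(m)  # influence: misfit and leverage combined
which.max(abs(r))          # the planted case should stand out
which.max(cook)

# Assumption 3: normality of residuals (Q-Q plot)
qqnorm(r); qqline(r)       # points should hug the line

# Assumption 4: constant variance
plot(fitted(m), r); abline(h = 0, lty = 2)  # look for fanning
plot(fitted(m), abs(r))                     # rough spread-level plot
```

# The same logic applies to the real model: rank cases by |studentized residual|
# and Cook's distance, then inspect the residual plots before trusting any test.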
#############################
## Assumption 5: Linearity ##
#############################

# If the relationship is linear, then the slope of the best-fitting line should be
# approximately the same no matter what sub-sample of the data you consider.
# Imagine sorting your dataframe on one of your predictor variables, then sliding
# a window down the dataframe and fitting a little model within each window. If the
# relationship is linear, the solutions for the sub-models should be very similar,
# deviating from each other only randomly. On the other hand, if the relationship of
# X and Y is non-linear, then the estimate of the slope will differ systematically
# at different points along X. modelAssumptions() will plot such lines, overlaid on
# the residual scores, in addition to the best linear fit.

# All that detail aside: if the green line and the red line don't look systematically
# different, then the relationship between that X and Y is relatively linear.
# Otherwise, there may be a non-linear relationship.

############################
#### So, what do we do? ####
############################

# We have discovered that our data violate the constant variance and normality of
# errors assumptions. We will attempt to correct some of these issues by applying
# transformations. But here is an interesting point: we detected our outliers in a
# model with the raw scores. The outliers may be different in a model based on
# transformed data.

# To be thorough, it might be worth testing whether adding these points back
# corrects any of the violations. Spoiler alert for the sake of our lab time:
# it doesn't.

# Let's consider why the assumptions were violated.
# One place to start is by eyeballing the univariate distributions and noting
# the direction of the skew. A log transformation spreads out the lower values
# and condenses the higher values---in other words, it imposes a leftward skew.
# If the data are skewed right (positively), as income typically is, this can
# counteract the skew.
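# A quick simulated demonstration of the point above: logging a right-skewed
# variable pulls its long upper tail in. The skewness function here is a
# hand-rolled stand-in (psych provides its own), and the data are fake.

```r
set.seed(8)
x <- exp(rnorm(1000, 0, 0.8))   # lognormal: strongly right-skewed

skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)        # large and positive
skewness(log(x))   # near zero: the log spread the low end, condensed the high end
```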
# Raising the variable to a power between 0 and 1 (where raising to 1/2 is the
# square root, 1/3 is the cube root, etc.) imposes a similar transformation, but
# may be a little less intuitive.
# Powers greater than 1 skew in the opposite direction.

# But before we start playing with these transformations on these data,
# let's take a moment to clearly see what transformations do.

################################################
#### Quick visualization of transformations ####
################################################

# To give a quick sense of what transforming variables looks like, we can start
# by using R to quickly create some fake data.

# If we were to plot these two variables against each other,
# what would it look like?

# Now let's see what some transformations of x do.
# We can try several different powers:
# -2
# -.5
# .2
# 1.5
# 5
# log

# Transformations of y:
# -2
# -.5
# .2
# 1.5
# 5
# log

# Now that we have a better understanding of what transforming data can do,
# let's return to our dataset and use modelBoxCox() to help us figure out what
# transformation to use.
varPlot(d$income)
varPlot(d$women)

# modelBoxCox() finds a power to raise your dependent variable to that is most
# likely to correct model assumption problems.

# How do we interpret the coefficients of this model?
# So, did this fix our model assumptions?

# We could try transforming percent women as well.

# Remember that violations of linearity, equal variance, and normality are not
# death knells for a model. Normality is the least concerning. Violations of
# normality and equal variance inflate standard errors, and violating linearity
# may bias the parameter estimates, but all three can often be handled through
# transformations.

# Our next step might be to consider models of the data that make different
# assumptions about the distribution of residuals. It may be that the GLM is not
# the right tool for the job.
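# A hedged sketch of what a Box-Cox step is doing under the hood, using
# MASS::boxcox() (MASS ships with R) rather than lmSupport's modelBoxCox().
# It profiles the model's log-likelihood across candidate power transformations
# of the DV; the lambda at the peak is the suggested power. The data are
# simulated so that the true DV is lognormal, so the peak should land near
# lambda = 0, which corresponds to the log transformation.

```r
library(MASS)

set.seed(8)
n      <- 200
women  <- runif(n, 0, 100)
educ   <- rnorm(n, 11, 3)
income <- exp(8 + 0.03 * educ - 0.005 * women + rnorm(n, 0, 0.4))

bc   <- boxcox(income ~ women + educ, lambda = seq(-2, 2, 0.1), plotit = FALSE)
best <- bc$x[which.max(bc$y)]   # for lognormal data this lands near 0 (i.e., log)
best
```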
# That said, remember that the GLM is quite robust to violations of normality and
# constant variance. Linearity is a little more troublesome, but even so, small
# violations are not worth panicking and running to find a more exotic method.
# Take a measured approach, and seek advice from those with relevant expertise if
# you feel your data violate the GLM assumptions in an irreparable way. We will
# explore some options next semester.

# But for now, let us be satisfied with the model with log(income)...
# ...after we conduct a case analysis on it! All the threats of influential data
# points still exist in models with transformed variables. We'll want to check
# whether there are any points worth removing.

# What did we learn?

# Remove these problem cases.

# Did our parameter estimates change when we removed these outliers?
# Standard errors? F statistics?

##################
#### Graphing ####
##################

library(ggplot2)  # Publication-quality graphing

# Create our set of values on our predictors for which we want predictions
d$log2income <- log2(d$income)
Yhat <- seq(min(d$women), max(d$women), length = 100)
Yhat <- data.frame(women = Yhat, education = mean(d$education))
dIncome <- modelPredictions(mod1, Yhat)

scatterplot <- ggplot() +
  geom_point(data = d, aes(x = women, y = log2income), color = "black") +
  geom_smooth(aes(ymin = CILo, ymax = CIHi, x = women, y = Predicted),
              data = dIncome, stat = "identity", color = "red") +
  theme_bw(base_size = 14) +
  # x-axis label and tick mark location. Use scale_x_continuous() because x is
  # specified as a continuous variable in default aes settings
  scale_x_continuous("Percent women", breaks = seq(0, 100, by = 10)) +
  scale_y_continuous("Log2 Income", breaks = seq(8, 15, by = 1)) +
  # y-axis label and tick mark location.
  # Use scale_y_continuous() because y is specified as a continuous variable in
  # default aes settings
  labs(title = 'Effect of % Women on Income')
scatterplot

#### WITH A CURVED REGRESSION LINE ####

# Back-transform the predictions (and the CI bounds, so the ribbon is on the
# same scale) onto the raw income scale
dIncome$Predicted <- 2^dIncome$Predicted
dIncome$CILo      <- 2^dIncome$CILo
dIncome$CIHi      <- 2^dIncome$CIHi

scatterplot <- ggplot() +
  geom_point(data = d, aes(x = women, y = income), color = "black") +
  geom_smooth(aes(ymin = CILo, ymax = CIHi, x = women, y = Predicted),
              data = dIncome, stat = "identity", color = "red") +
  theme_bw(base_size = 14) +
  # x-axis label and tick mark location. Use scale_x_continuous() because x is
  # specified as a continuous variable in default aes settings
  scale_x_continuous("Percent women", breaks = seq(0, 100, by = 10)) +
  # y-axis label and tick mark location. Use scale_y_continuous() because y is
  # specified as a continuous variable in default aes settings
  scale_y_continuous("Income", breaks = seq(500, 26000, by = 1500)) +
  labs(title = 'Effect of % Women on Income')
scatterplot

varDescribe(d$income)
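# A tiny arithmetic check of the back-transformation idea used in the curved
# plot above: predictions and interval endpoints made on the log2 scale return
# to the raw income scale via 2^x, and the resulting raw-scale interval is
# asymmetric around the prediction (wider above than below). The numbers here
# are made up for illustration.

```r
pred_log2 <- 13.2             # hypothetical predicted log2(income)
ci_log2   <- c(12.9, 13.5)    # hypothetical symmetric CI on the log2 scale

pred_raw <- 2^pred_log2       # point prediction in dollars
ci_raw   <- 2^ci_log2         # CI endpoints in dollars

# The raw-scale interval is asymmetric: the upper gap exceeds the lower gap.
(ci_raw[2] - pred_raw) > (pred_raw - ci_raw[1])   # TRUE
```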