################################################
####   Week 5 Lab: 2+ Predictor Models     ####
####   Friday, October 6th, 2017           ####
################################################

######################################################################################
#### The Prestige Dataset                                  ###########################
######################################################################################

# Today we're going to use a dataset (Prestige). This dataset comes included with an
# R package, so it doesn't need to be downloaded and read in with dfReadDat.

# How can we find out more about the Prestige dataset?
# First, we need to load one of two packages: car and effects
library(car) # we'll use this one since we are going to use this package anyway

?Prestige # Description of the Prestige dataset

# Copy the Prestige dataset into an object called 'd' to prevent unnecessary typing.
d = Prestige

# Run some commands to get a sense of the data

# Use the help pane to get some more information about these variables
# What does each row represent?

# You can list all the rows:

# Create a correlation table listing all of the pairwise correlations between the 4 variables
# we care about: 'income', 'education', 'women', and 'prestige'
cor(d[, c("income", "education", "women", "prestige")], use = "pairwise.complete.obs")

# What does use = "pairwise.complete.obs" do?
?cor
# The alternative, use = "complete.obs", is listwise deletion; the default ("everything")
# simply returns NA whenever any values are missing.

# What do you notice? Which variables are highly correlated, which are not? Do the patterns make sense?
# Note that the data come from 1971, so (hopefully) things have changed a bit since then.

# What if we also want p values for these correlations?
library(psych)
?corr.test
corr.test(d[, c("income", "education", "women", "prestige")])

# Let's save it as an object so we can easily call it later:

# Why can't we just use 'd' instead of doing this subsetting?

# We can also look at these correlations visually, by looking at all the bivariate scatterplots
spm(d[, c("income", "education", "women", "prestige")]) # this command lives in the car package

# What do the plots along the diagonal represent? What about the red and green lines? How can we find out?
?spm

#### Quick Review ####

# We are interested in whether occupations with more years of education offer higher average income.
# Fit a model that tests this hypothesis

# Evaluate the model.
# Is education significantly related to income?
# What direction is the effect?
# How large is the effect?

##################################################################################
#### Models with multiple predictors #############################################
##################################################################################

# Suppose that, before you collected the data, your PI suggested you fit a model
# with percentage of women as a covariate.
# In general, why might it make sense to add a covariate of this kind?
# One reason is to gain power by reducing error variance; another is to answer a slightly
# different question. What would that question be here?
# This might be done in response to a reviewer who says something like...

# Fit a model predicting income from education and the percentage of women in a given profession.

# How do we see that adding percent women gives us more power to test the effect of education?
# Compare partial eta-squared in the models with and without the covariate.

# Notice also how our CI bands narrow after we add the covariate to the model:
library(effects)

# CI bands narrow because we added a good covariate (women):
# we are removing some of the previously unexplained variance in the DV (income).
# *Note: remember you can use the back and forward arrows in the plots pane to toggle between graphs.
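# For reference, one possible way to fit and compare the models discussed above (a sketch;
# it assumes the lmSupport package, which provides modelSummary() and modelEffectSizes(), is installed):
library(lmSupport)

# Quick review model: does education predict income?
m1 = lm(income ~ education, data = d)
modelSummary(m1)
modelEffectSizes(m1)   # partial eta-squared for education

# Add percentage of women as a covariate
m2 = lm(income ~ education + women, data = d)
modelSummary(m2)
modelEffectSizes(m2)   # compare the partial eta-squared for education with m1

# Effect plots: note how the CI band around the education effect narrows in m2
plot(effect("education", m1))
plot(effect("education", m2))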
# Did our parameter estimate for education change? Is that surprising?

################################################################
#### 3 predictor models                      ###################
################################################################

# Why might we add a third predictor? (four main situations)
# 1.
# 2.
# 3.
# 4.

# New question: Does education explain variance in income above and beyond prestige?
# Maybe jobs that require more education are also more prestigious, and it's prestige
# that's really related to income.
# Is there an effect of education on income, even when we control for prestige?

# Fit a 3rd model predicting income from education, perceived prestige, and percentage of women
m3 = lm(income ~ education + prestige + women, data = d)
modelSummary(m3)

# What is the interpretation of the intercept now?

# This isn't particularly meaningful. One way to make the parameter estimate for the intercept
# more meaningful is to mean-center our predictor variables and refit the model.
d$educationC = d$education - mean(d$education)
d$womenC = d$women - mean(d$women)
d$prestigeC = d$prestige - mean(d$prestige)

m3C = lm(income ~ educationC + prestigeC + womenC, data = d)
modelSummary(m3C)

# What is the interpretation of the intercept now?

# Compare the model summary output between m3 and m3C. What changed? What hasn't changed?
# Why is the standard error for the intercept so much larger for the model with uncentered
# predictors as compared to the model with centered predictors?
# Do any of the effect sizes change between the two models?

# What is the interpretation of the coefficient for education, and what does the coefficient tell us?

# There is a dramatic change in the coefficient for education. It has gone from being quite large
# ($944, from the 3-parameter model) to not so large ($177, from the 4-parameter model). Why?

# We can find a partial answer to this conundrum by revisiting the correlation
# between education and prestige.

# Education and prestige have an extremely strong relationship (r = .85). In
# fact, if we predict prestige from education, we will find that every extra
# year of education is associated with a 5.4 unit increase in occupational prestige,
# and that education accounts for 72% of the variance in occupational prestige.
m.pres = lm(prestige ~ education, data = d)
modelSummary(m.pres)
modelEffectSizes(m.pres)

# Because of the strong relationship between prestige and education, it's likely
# that a lot of the variance in education that predicts income will overlap with
# prestige. In fact, the dramatic change in the coefficient for education
# when prestige is added to the model tells us precisely that: the
# variance in education that predicts income overlaps with the variance in
# prestige that predicts income.

# We can directly test for multicollinearity of all predictors in our model using this function:
?vif
vif(m3)

# vif stands for...?
# If you have a vif value >= ___ for your focal predictor(s), then multicollinearity is a problem.
# If it is less than ___ but still close (around ___ for education and prestige in this example),
# then there is some multicollinearity affecting your model, but not necessarily critically so.
# We don't really care about this much unless it spells trouble for our focal variables.

# What is the interpretation of the coefficient for women?

# The education coefficients vary dramatically between model 2 and model 3. But what if
# our main predictor had been prestige, instead of education, in model 2? Would the prestige
# coefficients similarly change when adding the third predictor? Would the magnitude of the
# effect size for prestige change?

# Fit a model (m2b) predicting income from percent women and prestige.
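# One possible way to fit and inspect m2b (a sketch, using the same lmSupport helpers as above):
m2b = lm(income ~ women + prestige, data = d)
modelSummary(m2b)
modelEffectSizes(m2b)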
# Compare this model to model m3C.

# Note also how the SE for prestige doubles when education is added to the model
# Effect sizes?

# What is the model comparison interpretation of model R-squared for model 3C?

######################################################################################
#### Graphing                                              ###########################
######################################################################################
# * may skip in class to be able to do exercise

# Create a plot showing the effect of education on income when controlling for percentage
# of women and prestige
# What values should we choose for percent women and prestige?

# Quick and dirty look at the data

# Publication-quality graphing

# Create our set of values on our predictors for which we want predictions

# Basic plot elements
plot1 =
plot1

# Add raw data
plot1 = plot1 +
plot1

# Add regression line
plot1 = plot1 +
plot1

# Make it pretty
plot1 = plot1 +
plot1

# If you were asked to draw a best-fitting regression line on the graph based on the raw data,
# this probably isn't what it would look like. Why does it appear to fit the data poorly?
pY2 =
plot1 = plot1 +
plot1
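# One possible completion of the graphing skeleton above (a sketch; it assumes ggplot2 is
# installed, and the prediction-data names pX and pY, as well as the reading of pY2 as the
# simple bivariate regression line, are illustrative choices rather than the only option):
library(ggplot2)

# Quick and dirty look at the data
plot(d$education, d$income)

# Predictions from m3: vary education across its range, hold women and prestige at their means
pX = data.frame(education = seq(min(d$education), max(d$education), length.out = 100),
                women = mean(d$women),
                prestige = mean(d$prestige))
pY = cbind(pX, predict(m3, newdata = pX, interval = "confidence"))

# Basic plot elements
plot1 = ggplot(data = d, aes(x = education, y = income))
plot1

# Add raw data
plot1 = plot1 + geom_point()
plot1

# Add regression line (with its confidence band), controlling for women and prestige
plot1 = plot1 +
  geom_smooth(data = pY, aes(y = fit, ymin = lwr, ymax = upr), stat = "identity")
plot1

# Make it pretty
plot1 = plot1 +
  labs(x = "Education (years)", y = "Average income (1971 dollars)") +
  theme_classic()
plot1

# For comparison, overlay the simple bivariate regression line (income ~ education alone);
# this is roughly the line you would draw from the raw points, and it differs from the
# adjusted line above because that line holds women and prestige constant at their means
pY2 = cbind(pX["education"], predict(m1, newdata = pX["education"], interval = "confidence"))
plot1 = plot1 +
  geom_smooth(data = pY2, aes(y = fit, ymin = lwr, ymax = upr), stat = "identity", color = "red")
plot1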