################################################
####   Week 5 Lab: 2+ Predictor Models     ####
####   Friday, October 6th, 2017           ####
################################################

######################################################################################
#### The Prestige Dataset                                  ###########################
######################################################################################

# Today we're going to use a dataset (Prestige). This dataset comes included with an
# R package, so it doesn't need to be downloaded and read in with dfReadDat.

# How can we find out more about the Prestige dataset?
# First, we need to load one of two packages: car and effects
library(car) # we'll use this one since we are going to use this package anyway

?Prestige # Description of the Prestige dataset

# Copy the Prestige dataset into an object called 'd' to prevent unnecessary typing.
d = Prestige

# Run some commands to get a sense of the data

# Use the help pane to get some more information about these variables
# What does each row represent?

# You can list all the rows:

# Create a correlation table listing all of the pairwise correlations between the 4 variables
# we care about: 'income', 'education', 'women', and 'prestige'
cor(d[, c("income", "education", "women", "prestige")], use = "pairwise.complete.obs")

# What does use = "pairwise.complete.obs" do?
?cor
# The alternative, use = "complete.obs", is listwise deletion; the default ("everything")
# simply returns NA whenever any values are missing.

# What do you notice? Which variables are highly correlated, which are not? Do the patterns make sense?
# Note that the data come from 1971, so (hopefully) things have changed a bit since then.

# What if we also want p values for these correlations?
library(psych)
?corr.test
corr.test(d[, c("income", "education", "women", "prestige")])

# Let's save it as an object so we can easily call it later:

# Why can't we just use 'd' instead of doing this subsetting?

# We can also look at these correlations visually, by looking at all the bivariate scatterplots
spm(d[, c("income", "education", "women", "prestige")]) # this command lives in the car package

# What do the plots along the diagonal represent? What about the red and green lines? How can we find out?
?spm

#### Quick Review ####

# We are interested in whether occupations with more years of education offer higher average income.
# Fit a model that tests this hypothesis

# Evaluate the model.
# Is education significantly related to income?
# What direction is the effect?
# How large is the effect?

##################################################################################
#### Models with multiple predictors #############################################
##################################################################################

# Suppose that, before you collected the data, your PI suggested you fit a model
# with percentage of women as a covariate.
# In general, why might it make sense to add a covariate of this kind?
# One reason is to gain power by reducing error variance; another is to answer a slightly
# different question. What would that question be here?
# This might be done in response to a reviewer who says something like...

# Fit a model predicting income from education and the percentage of women in a given profession.

# How do we see that adding percent women gives us more power to test the effect of education?
# Compare partial eta-squared in the models with and without the covariate.

# Notice also how our CI bands narrow after we add the covariate to the model:
library(effects)

# CI bands narrow because we added a good covariate (women):
# we are removing some of the previously unexplained variance in the DV (income).
# *Note: remember you can use the back and forward arrows in the plots pane to toggle between graphs.
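# For reference, one possible way to fit and compare the models discussed above (a sketch;
# it assumes the lmSupport package, which provides modelSummary() and modelEffectSizes(), is installed):
library(lmSupport)

# Quick review model: does education predict income?
m1 = lm(income ~ education, data = d)
modelSummary(m1)
modelEffectSizes(m1)   # partial eta-squared for education

# Add percentage of women as a covariate
m2 = lm(income ~ education + women, data = d)
modelSummary(m2)
modelEffectSizes(m2)   # compare the partial eta-squared for education with m1

# Effect plots: note how the CI band around the education effect narrows in m2
plot(effect("education", m1))
plot(effect("education", m2))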
# Did our parameter estimate for education change? Is that surprising?

################################################################
#### 3 predictor models                      ###################
################################################################

# Why might we add a third predictor? (four main situations)
# 1.
# 2.
# 3.
# 4.

# New question: Does education explain variance in income above and beyond prestige?
# Maybe jobs that require more education are also more prestigious, and it's prestige
# that's really related to income.
# Is there an effect of education on income, even when we control for prestige?

# Fit a 3rd model predicting income from education, perceived prestige, and percentage of women
m3 = lm(income ~ education + prestige + women, data = d)
modelSummary(m3)

# What is the interpretation of the intercept now?

# This isn't particularly meaningful. One way to make the parameter estimate for the intercept
# more meaningful is to mean-center our predictor variables and refit the model.
d$educationC = d$education - mean(d$education)
d$womenC = d$women - mean(d$women)
d$prestigeC = d$prestige - mean(d$prestige)

m3C = lm(income ~ educationC + prestigeC + womenC, data = d)
modelSummary(m3C)

# What is the interpretation of the intercept now?

# Compare the model summary output between m3 and m3C. What changed? What hasn't changed?
# Why is the standard error for the intercept so much larger for the model with uncentered
# predictors as compared to the model with centered predictors?
# Do any of the effect sizes change between the two models?

# What is the interpretation of the coefficient for education, and what does the coefficient tell us?

# There is a dramatic change in the coefficient for education. It has gone from being quite large
# ($944, from the 3-parameter model) to not so large ($177, from the 4-parameter model). Why?

# We can find a partial answer to this conundrum by revisiting the correlation
# between education and prestige.

# Education and prestige have an extremely strong relationship (r = .85). In
# fact, if we predict prestige from education, we will find that every extra
# year of education is associated with a 5.4 unit increase in occupational prestige,
# and that education accounts for 72% of the variance in occupational prestige.
m.pres = lm(prestige ~ education, data = d)
modelSummary(m.pres)
modelEffectSizes(m.pres)

# Because of the strong relationship between prestige and education, it's likely
# that a lot of the variance in education that predicts income will overlap with
# prestige. In fact, the dramatic change in the coefficient for education
# when prestige is added to the model tells us precisely that: the
# variance in education that predicts income overlaps with the variance in
# prestige that predicts income.

# We can directly test for multicollinearity of all predictors in our model using this function:
?vif
vif(m3)

# vif stands for...?
# If you have a vif value >= ___ for your focal predictor(s), then multicollinearity is a problem.
# If it is less than ___ but still close (around ___ for education and prestige in this example),
# then there is some multicollinearity affecting your model, but not necessarily critically so.
# We don't really care about this much unless it spells trouble for our focal variables.

# What is the interpretation of the coefficient for women?

# The education coefficients vary dramatically between model 2 and model 3. But what if
# our main predictor had been prestige, instead of education, in model 2? Would the prestige
# coefficients similarly change when adding the third predictor? Would the magnitude of the
# effect size for prestige change?

# Fit a model (m2b) predicting income from percent women and prestige.
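# One possible way to fit and inspect m2b (a sketch, using the same lmSupport helpers as above):
m2b = lm(income ~ women + prestige, data = d)
modelSummary(m2b)
modelEffectSizes(m2b)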
# Compare this model to model m3C.

# Note also how the SE for prestige doubles when education is added to the model
# Effect sizes?

# What is the model comparison interpretation of model R-squared for model 3C?

######################################################################################
#### Graphing                                              ###########################
######################################################################################
# * may skip in class to be able to do exercise

# Create a plot showing the effect of education on income when controlling for percentage
# of women and prestige
# What values should we choose for percent women and prestige?

# Quick and dirty look at the data

# Publication-quality graphing

# Create our set of values on our predictors for which we want predictions

# Basic plot elements
plot1 =
plot1

# Add raw data
plot1 = plot1 +
plot1

# Add regression line
plot1 = plot1 +
plot1

# Make it pretty
plot1 = plot1 +
plot1

# If you were asked to draw a best-fitting regression line on the graph based on the raw data,
# this probably isn't what it would look like. Why does it appear to fit the data poorly?
pY2 =
plot1 = plot1 +
plot1
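# One possible completion of the graphing skeleton above (a sketch; it assumes ggplot2 is
# installed, and the prediction-data names pX and pY, as well as the reading of pY2 as the
# simple bivariate regression line, are illustrative choices rather than the only option):
library(ggplot2)

# Quick and dirty look at the data
plot(d$education, d$income)

# Predictions from m3: vary education across its range, hold women and prestige at their means
pX = data.frame(education = seq(min(d$education), max(d$education), length.out = 100),
                women = mean(d$women),
                prestige = mean(d$prestige))
pY = cbind(pX, predict(m3, newdata = pX, interval = "confidence"))

# Basic plot elements
plot1 = ggplot(data = d, aes(x = education, y = income))
plot1

# Add raw data
plot1 = plot1 + geom_point()
plot1

# Add regression line (with its confidence band), controlling for women and prestige
plot1 = plot1 +
  geom_smooth(data = pY, aes(y = fit, ymin = lwr, ymax = upr), stat = "identity")
plot1

# Make it pretty
plot1 = plot1 +
  labs(x = "Education (years)", y = "Average income (1971 dollars)") +
  theme_classic()
plot1

# For comparison, overlay the simple bivariate regression line (income ~ education alone);
# this is roughly the line you would draw from the raw points, and it differs from the
# adjusted line above because that line holds women and prestige constant at their means
pY2 = cbind(pX["education"], predict(m1, newdata = pX["education"], interval = "confidence"))
plot1 = plot1 +
  geom_smooth(data = pY2, aes(y = fit, ymin = lwr, ymax = upr), stat = "identity", color = "red")
plot1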