####################################################
##  Week 8 Lab: Assumptions and transformations   ##
##  Friday, October 20th, 2017                    ##
####################################################

############################################################
#### First Steps: Examine the data & check for outliers ####
############################################################

library(lmSupport)
library(car)
library(psych)

# We are going to return to the Prestige dataset. Remind yourself what's going on here.
d = dfReadDat('Lab8_DemoData.dat', SubID = "SubID")

# Remember we had a value that was clearly a data entry error,
# so let's remove it.

# Univariate exploration
# Note: This is NOT how we assess normality.

# Bivariate exploration
# Do the correlations match your intuitions?

# Fit our linear model predicting income from gender composition and education.

#### Case Analysis: Residuals ####
# Are there cases that don't fit the model? Which?

#### Case Analysis: Influence & Leverage ####
# Are there cases that exert a lot of influence on our model predictions? Which?
# Are they the same cases that did not fit the model (from the last analysis)?

#### Case Analysis: Removing problematic points ####
# Let's remove these data points before we begin testing our model
# assumptions. It's possible that a couple of extreme outliers can
# cause assumptions to be violated, and that by simply eliminating them
# the rest of the data conform.

#################################################
#### Second Step: Checking Model Assumptions ####
#################################################

# The GLM makes five primary assumptions about the structure of the data:
#   (1) no measurement error in the predictors;
#   (2) independence of residuals;
#   (3) normality of residuals;
#   (4) constant variance (homoscedasticity);
#   (5) linearity of relationships.

# Violations of these assumptions can cause:
#   1. Inefficient standard errors (low power, Type II errors)
#   2. Inaccurate standard errors (incorrect statistical tests, Type I errors)
#   3. Inaccurate parameter estimates

# Assumptions (3)-(5) can be evaluated using modelAssumptions().
# Assumption (1) is always violated to some extent, but essentially demands
# that you use the most reliable measures at your disposal.
# Assumption (2) can be addressed by a variety of approaches (e.g., repeated
# measures analysis).

###########################################
#### Assumption 3: Normality of errors ####
###########################################

# Remember that "normality" in this case refers to the model residuals, NOT the
# distribution of any single variable in the model.

# A quantile-comparison (Q-Q) plot of the studentized residuals readily displays
# how well the studentized residuals compare to a theoretical t-distribution.
# modelAssumptions() will give you this and more.

# The function will also spit out some statistical tests of these assumptions.
# THEY ARE NO SUBSTITUTE FOR YOUR OWN VISUAL INSPECTION OF THE RESIDUALS!

# In the case of our model predicting income from education and percentage women,
# modelAssumptions() tells us that our data seem to violate all the assumptions
# of the GLM. But this is not necessarily diagnostic on its own.
# So, should we be concerned at this point?

#########################################
#### Assumption 4: Constant variance ####
#########################################

# The constant variance assumption implies that, for any given predicted value
# of the dependent variable, the residuals will have the same variance.

# One easy way to assess the viability of this assumption is to plot the predicted
# values against the observed values and assess visually whether the residuals
# have more scatter at some predicted values than at others.
# modelAssumptions() will generate that figure, as well as a plot of the predicted
# values versus the absolute value of the residuals (a "spread-level plot").

# So... are we concerned now?
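# A minimal base-R sketch of what the case-analysis statistics and the plots that
# modelAssumptions() automates are computing. The data below are simulated
# stand-ins, NOT the Prestige dataset, and the coefficients are made up for
# illustration.

```r
set.seed(8)
n      <- 100
women  <- runif(n, 0, 100)
educ   <- rnorm(n, 11, 3)
income <- exp(8 + 0.03 * educ - 0.005 * women + rnorm(n, 0, 0.4))  # right-skewed DV
income[1] <- income[1] + 60000                                     # plant one gross outlier
m <- lm(income ~ women + educ)

# Case analysis
r    <- rstudent(m)        # studentized residuals: which cases does the model misfit?
h    <- hatvalues(m)       # leverage: unusual combinations of predictor values
cook <- cooks.distance(m)  # influence: misfit and leverage combined
which.max(abs(r))          # the planted case should stand out
which.max(cook)

# Assumption 3: normality of residuals (Q-Q plot)
qqnorm(r); qqline(r)       # points should hug the line

# Assumption 4: constant variance
plot(fitted(m), r); abline(h = 0, lty = 2)  # look for fanning
plot(fitted(m), abs(r))                     # rough spread-level plot
```

# The same logic applies to the real model: rank cases by |studentized residual|
# and Cook's distance, then inspect the residual plots before trusting any test.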
#############################
## Assumption 5: Linearity ##
#############################

# If the relationship is linear, then the slope of the best-fitting line should be
# approximately the same no matter what sub-sample of the data you consider.
# Imagine sorting your dataframe on one of your predictor variables, then sliding
# a window down the dataframe and fitting a little model within each window. If the
# relationship is linear, the solutions for the sub-models should be very similar,
# deviating from each other only randomly. On the other hand, if the relationship of
# X and Y is non-linear, then the estimate of the slope will differ systematically
# at different points along X. modelAssumptions() will plot such lines, overlaid on
# the residual scores, in addition to the best linear fit.

# All that detail aside: if the green line and the red line don't look systematically
# different, then the relationship between that X and Y is relatively linear.
# Otherwise, there may be a non-linear relationship.

############################
#### So, what do we do? ####
############################

# We have discovered that our data violate the constant variance and normality of
# errors assumptions. We will attempt to correct some of these issues by applying
# transformations. But here is an interesting point: we detected our outliers in a
# model with the raw scores. The outliers may be different in a model based on
# transformed data.

# To be thorough, it might be worth testing whether adding these points back
# corrects any of the violations. Spoiler alert for the sake of our lab time:
# it doesn't.

# Let's consider why the assumptions were violated.
# One place to start is by eyeballing the univariate distributions and noting
# the direction of the skew. A log transformation spreads out the lower values
# and condenses the higher values---in other words, it imposes a leftward skew.
# If the data are skewed right (positively), as income typically is, this can
# counteract the skew.
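# A quick simulated demonstration of the point above: logging a right-skewed
# variable pulls its long upper tail in. The skewness function here is a
# hand-rolled stand-in (psych provides its own), and the data are fake.

```r
set.seed(8)
x <- exp(rnorm(1000, 0, 0.8))   # lognormal: strongly right-skewed

skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skewness(x)        # large and positive
skewness(log(x))   # near zero: the log spread the low end, condensed the high end
```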
# Raising the variable to a power between 0 and 1 (where raising to 1/2 is the
# square root, 1/3 is the cube root, etc.) imposes a similar transformation, but
# may be a little less intuitive.
# Powers greater than 1 skew in the opposite direction.

# But before we start playing with these transformations on these data,
# let's take a moment to clearly see what transformations do.

################################################
#### Quick visualization of transformations ####
################################################

# To give a quick sense of what transforming variables looks like, we can start
# by using R to quickly create some fake data.

# If we were to plot these two variables against each other,
# what would it look like?

# Now let's see what some transformations of x do.
# We can try several different powers:
# -2
# -.5
# .2
# 1.5
# 5
# log

# Transformations of y:
# -2
# -.5
# .2
# 1.5
# 5
# log

# Now that we have a better understanding of what transforming data can do,
# let's return to our dataset and use modelBoxCox() to help us figure out what
# transformation to use.
varPlot(d$income)
varPlot(d$women)

# modelBoxCox() finds a power to raise your dependent variable to that is most
# likely to correct model assumption problems.

# How do we interpret the coefficients of this model?
# So, did this fix our model assumptions?

# We could try transforming percent women as well.

# Remember that violations of linearity, equal variance, and normality are not
# death knells for a model. Normality is the least concerning. Violations of
# normality and equal variance inflate standard errors, and violating linearity
# may bias the parameter estimates, but all three can often be handled through
# transformations.

# Our next step might be to consider models of the data that make different
# assumptions about the distribution of residuals. It may be that the GLM is not
# the right tool for the job.
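# A hedged sketch of what a Box-Cox step is doing under the hood, using
# MASS::boxcox() (MASS ships with R) rather than lmSupport's modelBoxCox().
# It profiles the model's log-likelihood across candidate power transformations
# of the DV; the lambda at the peak is the suggested power. The data are
# simulated so that the true DV is lognormal, so the peak should land near
# lambda = 0, which corresponds to the log transformation.

```r
library(MASS)

set.seed(8)
n      <- 200
women  <- runif(n, 0, 100)
educ   <- rnorm(n, 11, 3)
income <- exp(8 + 0.03 * educ - 0.005 * women + rnorm(n, 0, 0.4))

bc   <- boxcox(income ~ women + educ, lambda = seq(-2, 2, 0.1), plotit = FALSE)
best <- bc$x[which.max(bc$y)]   # for lognormal data this lands near 0 (i.e., log)
best
```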
# That said, remember that the GLM is quite robust to violations of normality and
# constant variance. Linearity is a little more troublesome, but even so, small
# violations are not worth panicking and running to find a more exotic method.
# Take a measured approach, and seek advice from those with relevant expertise if
# you feel your data violate the GLM assumptions in an irreparable way. We will
# explore some options next semester.

# But for now, let us be satisfied with the model with log(income)...
# ...after we conduct a case analysis on it! All the threats of influential data
# points still exist in models with transformed variables. We'll want to check
# whether there are any points worth removing.

# What did we learn?

# Remove these problem cases.

# Did our parameter estimates change when we removed these outliers?
# Standard errors? F statistics?

##################
#### Graphing ####
##################

library(ggplot2)  # Publication-quality graphing

# Create our set of values on our predictors for which we want predictions
d$log2income <- log2(d$income)
Yhat <- seq(min(d$women), max(d$women), length = 100)
Yhat <- data.frame(women = Yhat, education = mean(d$education))
dIncome <- modelPredictions(mod1, Yhat)

scatterplot <- ggplot() +
  geom_point(data = d, aes(x = women, y = log2income), color = "black") +
  geom_smooth(aes(ymin = CILo, ymax = CIHi, x = women, y = Predicted),
              data = dIncome, stat = "identity", color = "red") +
  theme_bw(base_size = 14) +
  # x-axis label and tick mark location. Use scale_x_continuous() because x is
  # specified as a continuous variable in default aes settings
  scale_x_continuous("Percent women", breaks = seq(0, 100, by = 10)) +
  scale_y_continuous("Log2 Income", breaks = seq(8, 15, by = 1)) +
  # y-axis label and tick mark location.
  # Use scale_y_continuous() because y is specified as a continuous variable in
  # default aes settings
  labs(title = 'Effect of % Women on Income')
scatterplot

#### WITH A CURVED REGRESSION LINE ####

# Back-transform the predictions (and the CI bounds, so the ribbon is on the
# same scale) onto the raw income scale
dIncome$Predicted <- 2^dIncome$Predicted
dIncome$CILo      <- 2^dIncome$CILo
dIncome$CIHi      <- 2^dIncome$CIHi

scatterplot <- ggplot() +
  geom_point(data = d, aes(x = women, y = income), color = "black") +
  geom_smooth(aes(ymin = CILo, ymax = CIHi, x = women, y = Predicted),
              data = dIncome, stat = "identity", color = "red") +
  theme_bw(base_size = 14) +
  # x-axis label and tick mark location. Use scale_x_continuous() because x is
  # specified as a continuous variable in default aes settings
  scale_x_continuous("Percent women", breaks = seq(0, 100, by = 10)) +
  # y-axis label and tick mark location. Use scale_y_continuous() because y is
  # specified as a continuous variable in default aes settings
  scale_y_continuous("Income", breaks = seq(500, 26000, by = 1500)) +
  labs(title = 'Effect of % Women on Income')
scatterplot

varDescribe(d$income)
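# A tiny arithmetic check of the back-transformation idea used in the curved
# plot above: predictions and interval endpoints made on the log2 scale return
# to the raw income scale via 2^x, and the resulting raw-scale interval is
# asymmetric around the prediction (wider above than below). The numbers here
# are made up for illustration.

```r
pred_log2 <- 13.2             # hypothetical predicted log2(income)
ci_log2   <- c(12.9, 13.5)    # hypothetical symmetric CI on the log2 scale

pred_raw <- 2^pred_log2       # point prediction in dollars
ci_raw   <- 2^ci_log2         # CI endpoints in dollars

# The raw-scale interval is asymmetric: the upper gap exceeds the lower gap.
(ci_raw[2] - pred_raw) > (pred_raw - ci_raw[1])   # TRUE
```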