########################################################
####  Lab 11: Categorical variables with >2 levels  ####
####  Week 1: 17 Nov 17                              ####
########################################################

# OVERVIEW OF TODAY #
# 1. Categorical predictor with 3 levels
# 2. Graph the model with 3-level predictor
# 3. Categorical predictor with 4 levels

# For our first example, we're going to use data that should be familiar: they're
# based on the 1971 Canadian occupational prestige study. This time, we're going to
# focus on a variable we haven't looked at before: "type." The codebook for our focal
# variables appears below.

library(lmSupport)
library(car)

d = Prestige

# +-----------+--------------------------+--------------------+
# | Name      | Description              | Values             |
# +-----------+--------------------------+--------------------+
# | education | Mean education in years  | 6.38--15.97        |
# | income    | Mean income              | 611--25879         |
# | type      | 3-level IV               | bc=blue collar     |
# |           |                          | prof=professional  |
# |           |                          | wc=white collar    |
# +-----------+--------------------------+--------------------+

# Get a sense of what's going on
str(d)
head(d)
summary(d)
varDescribe(d)

# Is there initial evidence that income differs by type?
varDescribeBy(d$income, d$type) # yup.

# Analyzing these data will proceed in several steps:

# +---------------------------------+
# | Step 1: State your hypothesis.  |
# +---------------------------------+

# When a variable has only two levels, the hypothesis is clear: the mean of one
# level will be different from the mean of the other. When there are >2, there are many
# possible hypotheses that could be tested: is 1>2&3? 2>1&3? 1>2>3? ...etc. Your
# experiment will be aimed at testing one of these.

# THAT IS TO SAY: your hypothesis is about a PATTERN of means.

# For these data, one reasonable hypothesis is:
# Professional income > White Collar income = Blue Collar income.
# So Professionals make more money than White and Blue Collar
# workers, while White and Blue Collar workers do not differ in their income.
# Let's say this is the prediction we want to test.

# +-----------------------------------------------------+
# | Step 2: Translate your hypothesis into a contrast.  |
# +-----------------------------------------------------+

# How should we code this contrast? What is our contrast of interest?
# Remember: bc==>Blue Collar, prof==>Professional, wc==>White Collar.

# When we use varRecode, our initial variable has to be a number and not a factor.
# We can easily do this with the following command:
d$typeN = as.numeric(d$type)

# ...but which value is which?
View(d)
# 1 = bc, 2 = prof, 3 = wc (you'll notice this is alphabetical)

# Make your focal contrast
d$type_c1 = varRecode(d$typeN, c(1,2,3), c(-.333,.667,-.333))
# Is it centered? yes
# Is it unit-weighted? yes

# +-------------------------------------------------+
# | Step 3: generate a set of orthogonal contrasts. |
# +-------------------------------------------------+

# The purpose of these contrasts is NOT to test other hypotheses. Rather, they are
# selected to account for the remaining between-groups variance present in the outcome.

# How many of these additional contrasts are we going to need in this example?
# 1 (m - 1 total contrasts: 1 focal and 1 orthogonal)

# Q: What are the 2 characteristics that our contrasts should have?
# A: Orthogonal and centered

# What should this contrast be?
d$type_c2 = varRecode(d$typeN, c(1,2,3), c(.5,0,-.5))

# How do we know this is orthogonal to the first contrast?
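# (Optional sketch: a tiny helper for checking that a pair of contrast codes is
# centered and orthogonal. checkContrasts is our own name, not an lmSupport function;
# the tolerance is only needed because -.333 is a rounded -1/3. The quick by-hand
# check just below does the same thing.)
checkContrasts = function(c1, c2, tol = .01) {
  c(c1_centered = abs(sum(c1)) < tol,       # do the first code's weights sum to ~0?
    c2_centered = abs(sum(c2)) < tol,       # do the second code's weights sum to ~0?
    orthogonal  = abs(sum(c1 * c2)) < tol)  # do the cross-products sum to ~0?
}
checkContrasts(c(-.333, .667, -.333), c(.5, 0, -.5)) # all TRUE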
temp = c(-.333,.667,-.333)*c(.5,0,-.5)
sum(temp) # This should be zero

# +------------------------------------------+
# | Step 4: Fit the model with all contrasts |
# +------------------------------------------+

mA = lm(income ~ type_c1 + type_c2, data=d)
modelSummary(mA)
# What's going on here? We don't really know yet...

# +-------------------+
# | Step 5: Interpret |
# +-------------------+

# Specifically, is the contrast of interest significant?
# It is!

# NOTE! Having a significant focal contrast does not necessarily mean that your data perfectly reflect
# the pattern of means you hypothesized. Rather, it simply says the data are consistent with the hypothesis.
# It's entirely possible, and indeed likely if you have many groups, that there are other things
# going on, some of which you'll get a sense of by looking at the significance of the other contrasts.
# This is when pairwise comparisons can become useful.

# How do we interpret the parameter estimate?
# On average, income in professional jobs is 5346 dollars higher than the average income
# of white collar and blue collar jobs.
# Why can we make this inference? We unit-weighted our contrasts!

#### +--- Quick covariate example ---+ ####
# What if we control for education level in the above analysis?
# Will we reach the same conclusion? How do you find out?

# Add education as a predictor to the model
mA2 = lm(income ~ type_c1 + type_c2 + education, data=d)
modelSummary(mA2)
# Check the results for c_1 and c_2

# If you reach different conclusions after controlling for education, what does that mean?
# The results have changed when controlling for education, such that the predicted pattern of means
# no longer accounts for significant variance in income. The difference in means we saw previously
# between white/blue collar work and professional work can be completely accounted for by education level.
# This makes sense since these variables are related.
varDescribeBy(d$education, d$type)

###################################################################
#### PAIRWISE COMPARISONS (pt 1) ##################################
###################################################################

# What if we didn't have any hypothesized pattern of means, but rather wanted to test our
# prediction using pairwise comparisons?

# What approach do we use to test multiple hypotheses using non-orthogonal contrasts? Dummy coding!

# We can set the reference group and create dummy codes in varContrasts:
?varContrasts

# We have to give this command a factor-type variable
contrasts(d$type) <- varContrasts(d$type,         # The factor
                                  Type = "DUMMY", # type of contrast
                                  RefLevel = 1)   # Specify the reference group

# What does the first contrast represent?
# Comparison between professional and blue collar.
# What does the second contrast represent?
# Comparison between white collar and blue collar.

# If we change the reference group,
# our contrasts represent different pairwise comparisons
contrasts(d$type) <- varContrasts(d$type, "Dummy", RefLevel = 2)
# Now what are we testing?
# Whether white and blue collar are significantly different from professional

# Let's refit our model:
mProf <- lm(income ~ type, data = d)
modelSummary(mProf)
# Conclusion? Professional jobs have higher income than both white collar and blue collar jobs.
varDescribeBy(d$income, d$type)

# As you can see, the coefficient for the first dummy code is literally the mean difference
# between blue collar occupations and professional occupations.
# The coefficient for the second dummy code is literally the
# mean difference in income between white collar and professional occupations.

# We can recode type to get the remaining pairwise difference:
# Blue collar as the reference group:
contrasts(d$type) <- varContrasts(d$type, "Dummy", RefLevel = 1)
mBC <- lm(income ~ type, data = d)
modelSummary(mBC)
# Conclusion? White collar jobs do not have significantly higher income than blue collar jobs.

# In this situation, what method would we use to protect our p values?
# We have three groups, so we are making 3 pairwise comparisons.
# Fisher's LSD!
# What does this involve?
# Run an omnibus test first; if the overall model is significant, no further adjustment is necessary.
# We can use the same model we did above. We'll use the Anova command John explained in class.
Anova(mProf, type=3)
Anova(mBC, type=3) # notice the result is the same no matter the reference group
# Do we satisfy the condition of Fisher LSD?
# Yes, so we do not need to adjust p-values.

# +-------------------------+
# |   Graphing Interlude    |
# +-------------------------+

# We'll use a factor as our predictor and compute a new model for the graph
d$Class = varRecode(d$type, c('bc','prof','wc'), c("Blue Collar","Professional","White Collar"))
mG = lm(income ~ Class, data=d)
modelSummary(mG)
# When it's a factor with >2 levels, R automatically uses dummy-coding.

# Are the model estimates the same as from our contrast+residual model?
# Let's check to be sure!
unique(round(modelPredictions(mA), 2))
unique(round(modelPredictions(mG), 2))
# "unique" deletes any duplicates present in a given list. Compare this output to one without "unique".
# Yes, it's generating the same model predictions

pX = data.frame(Class = c("Blue Collar", "Professional", "White Collar"))
pY = modelPredictions(mG, pX)

library(ggplot2)
#library(cowplot)

plot1 = ggplot(d, aes(x=Class, y=income, fill=Class)) +
  geom_bar(data=pY, aes(y=Predicted), stat = "identity") +
  geom_errorbar(data=pY, aes(y=Predicted, ymax=CIHi, ymin=CILo), width=.25) +
  geom_point(position = position_jitter(w=.05, h=0), alpha=.2, size=.5) + # alpha tells it how transparent to make the dots
  theme(legend.position = 'none')
plot1

plot1 = plot1 + # let's add bars to show us significance
  annotate("segment", x=1.1, xend=1.9, y=15000, yend=15000) +
  annotate("segment", x=2.1, xend=2.9, y=15000, yend=15000) +
  annotate("text", x=1.5, y=16000, label="***") +
  annotate("text", x=2.5, y=16000, label="***")
plot1

# +-------------------------+
# |   More than 3 levels    |
# +-------------------------+

# The h dataset is taken from a national survey of high school seniors.
# Two hundred observations were randomly sampled from the High School and Beyond survey.

# We'll read in the data from the web!
h = read.csv("http://stats.idre.ucla.edu/stat/data/hsb2.csv")
some(h)
# What's going on here?
# obs: 200
# vars: 11
# -----------------------------------------------------------------------------
# variable name   type      about the variable
# -----------------------------------------------------------------------------
# id              scale     student id
# female          nominal   (0/1)
# race            nominal   ethnicity (1=hispanic 2=asian 3=black 4=white)
# ses             ordinal   (1=low 2=middle 3=high)
# schtyp          nominal   type of school (1=public 2=private)
# prog            nominal   type of program (1=general 2=academic 3=vocational)
# read            scale     standardized reading score
# write           scale     standardized writing score
# math            scale     standardized math score
# science         scale     standardized science score
# socst           scale     standardized social studies score

# We predict racial disparities for reading scores, such that historically
# disadvantaged racial groups will perform worse than other groups.
# If you look at the codebook above, our race variable is categorical
# and codes students as Hispanic, Asian, Black, and White.
# The hypothesis, based on previous findings (e.g., Ferguson, 2002,
# http://files.eric.ed.gov/fulltext/ED474390.pdf),
# is that White and Asian students will have higher reading scores than
# Black and Hispanic students.

# Get an initial look at the data
varDescribeBy(h$read, h$race)

# Let's translate our hypothesis into our focal contrast:
# -.5, .5, -.5, .5

# What are the orthogonal contrasts?
# It helps to think about it intuitively. That is, what are some other ways
# the groups could relate to each other besides the relationship we've
# described above?
# Hispanic and Black individuals could differ, and White and Asian individuals could differ
# or, in contrast form...
# .5, 0, -.5, 0
# 0, -.5, 0, .5

# How do we make sure these work?
sum(c(-.5,.5,-.5,.5)*c(.5, 0, -.5, 0))
sum(c(-.5,.5,-.5,.5)*c(0, -.5, 0, .5))
sum(c(0, -.5, 0, .5)*c(.5, 0, -.5, 0))

h$race_c1 = varRecode(h$race, c(1,2,3,4), c(-.5,.5,-.5,.5))
h$race_c2 = varRecode(h$race, c(1,2,3,4), c(.5, 0, -.5, 0))
h$race_c3 = varRecode(h$race, c(1,2,3,4), c(0, -.5, 0, .5))

# run your model!
modh = lm(read ~ race_c1 + race_c2 + race_c3, data=h)
modelSummary(modh, t=F)

# Conclusion?
# The data are consistent with the hypothesized pattern of means

###################################################################
#### PAIRWISE COMPARISONS (pt 2) ##################################
###################################################################

# Let's say we want to test our prediction using pairwise comparisons instead.
h$raceF = as.factor(h$race)
contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 1)
# remember, 1 = hispanic, 2 = asian, 3 = black, 4 = white
mr1 = lm(read ~ raceF, data=h)
modelSummary(mr1)
# White students perform significantly better than Hispanic students. No other sig diffs

contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 2)
mr2 = lm(read ~ raceF, data=h)
modelSummary(mr2)
# No significant differences between Asian students and the other groups

contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 3)
mr3 = lm(read ~ raceF, data=h)
modelSummary(mr3)
# Our last remaining pairwise test is the last line of the output.
# White students perform significantly better than Black students

# How many pairwise comparisons have we done?
# 6
# What approach must we use to adjust our p values?
# Holm-Bonferroni!

# p.adjust() is the command we'll rely on. It expects a vector (list) of p-values and
# makes the adjustment you tell it to (many other adjustments are available besides
# Holm-Bonferroni).
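# (A quick toy illustration of p.adjust before we use the real p-values below.
# These six values are made up purely to show what the Holm adjustment does;
# they are not from our models.)
exampleP = c(.001, .004, .012, .025, .030, .200) # made-up p-values
p.adjust(exampleP, method = "holm")              # Holm-adjusted versions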
# We can obtain our p-values directly from the models we fit above.

# Make a variable to hold the p-values that we're obtaining
ps <- numeric(length = 6)
ps <- c(0,0,0,0,0,0) # this also works
ps

# Get the p-values from our models
# The coefficients attribute from our summary stores the p-values for each of our statistical tests
modelSummary(mr1)$coefficients
modelSummary(mr2)$coefficients
modelSummary(mr3)$coefficients

# We'll use this feature to fill in our vector
ps[1:3] <- modelSummary(mr1)$coefficients[2:4, 4]
# In plain English: make the first three values in the list the same as those found in rows 2-4 of
# column 4 in the output from the first model.
ps[4:5] <- modelSummary(mr2)$coefficients[3:4, 4]
# In plain English: make the fourth and fifth values in the list the same as those found in rows 3 and 4
# of column 4 in the output from the second model.
ps[6] <- modelSummary(mr3)$coefficients[4, 4]
ps

# Pass the p-values to p.adjust. It will order them for you automatically, but to make
# the comparison easier let's sort them in advance.
ps = sort(ps)
?p.adjust
p.adjust(ps, "holm")

# More viewable
round(p.adjust(ps, "holm"), digits = 5)
# Compare to the old values:
round(ps, digits=5)

# These are our new p-values
# Do any of our conclusions change?
# Nope, things that were significant before are still significant now.

# You'll notice this approach led to more nuanced results in terms of our focal hypothesis.
# This can be a good way to get more specific results, but the trade-off is statistical power.
# Even more wise is trying to design your studies such that you don't need to do many of these
# kinds of comparisons.

# +-------------------------+
# |      MORE PRACTICE      |
# if time allows; otherwise do it at home!
# +-------------------------+

# Using the dataset Lab11_DemoData.dat, test the hypothesis that people in the
# "middle class" have more positive social attitudes than either the upper or
# lower "working class". Further suppose that your hypothesis does NOT predict
# a difference between the levels of working class.
d = dfReadDat("Lab11_DemoData.dat", SubID=NULL)

# +------------+--------------------------------------------+-----------------+
# | Name       | Description                                | Values          |
# +------------+--------------------------------------------+-----------------+
# | SubID      | SubID                                      | n=264           |
# | numpos     | Number of positive responses on 7 Q survey | 0 - 7           |
# | class      | 3-level IV                                 | 1=middle        |
# |            |                                            | 2=upper working |
# |            |                                            | 3=lower working |
# +------------+--------------------------------------------+-----------------+

# +---------------------------------------------+
# | Translate your hypothesis into a contrast.  |
# +---------------------------------------------+
d$class_c1 = varRecode(d$class, c(1,2,3), c(2,-1,-1))

# +-------------------------------------------------+
# | Step 3: generate a set of orthogonal contrasts. |
# +-------------------------------------------------+
# How many of these additional contrasts are we going to need?
# 1
d$class_c2 = varRecode(d$class, c(1,2,3), c(0,1,-1))

# How do we know this is orthogonal to the first contrast?
temp = c(2,-1,-1)*c(0,1,-1) # From step 2 and 3
sum(temp) # They sum up to zero!

# +------------------------------------------+
# | Step 4: Fit the model with all contrasts |
# +------------------------------------------+
mA = lm(numpos ~ class_c1 + class_c2, data=d)
summary(mA)

# Is the contrast of interest significant?
# Yes.
# Is the orthogonal contrast non-significant?
# Does it meet the criteria proposed by Forscher and Brauer?
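# (An optional sketch, using base R's anova(), of how much of the between-group
# variability the focal contrast captures; this is our own illustration rather than
# the lab's required method. It uses sequential (Type I) sums of squares, so with
# unequal group sizes the split is approximate.)
ssTab = anova(mA)                      # sequential sums of squares for each term
ssFocal = ssTab["class_c1", "Sum Sq"]  # SS captured by the focal contrast
ssOther = ssTab["class_c2", "Sum Sq"]  # SS captured by the residual contrast
ssFocal / (ssFocal + ssOther)          # share of between-group SS due to the focal contrast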
meClass = modelEffectSizes(mA)
# Size of the focal contrast's effect relative to each contrast
# (the second value is the focal-to-residual ratio)
meClass[2,4]/meClass[2:3,4]

# Write your results in APA format.
# The contrast of interest (2, -1, -1) was statistically significant, t(1053) = 3.75, p < .001, partial η2 = .01,
# whereas the contrast testing the residual between-group variance (0, 1, -1) was not, t(1053) = 0.16, p = .87, partial η2 < .0001.
# The pattern is consistent with our hypothesized pattern that people in the "middle class" have more positive
# social attitudes than either the upper or lower "working class".
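# (Optional sketch: pulling the t and p values for the write-up straight from the
# fitted model with base R's summary(). The row and column names assume the default
# lm() output and the contrast names used above.)
coefTab = summary(mA)$coefficients
coefTab["class_c1", c("t value", "Pr(>|t|)")] # focal contrast
coefTab["class_c2", c("t value", "Pr(>|t|)")] # residual between-group contrast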