########################################################
####  Lab 11: Categorical variables with >2 levels  ####
####  Week 11: 17 Nov 17                            ####
########################################################

# OVERVIEW OF TODAY
#
# 1. Categorical predictor with 3 levels
# 2. Graph the model with 3-level predictor
# 3. Categorical predictor with 4 levels

# For our first example, we're going to use data that should be familiar: they're
# based on the 1971 Canadian occupational prestige study. This time, we're going to
# focus on a variable we haven't looked at before: "type." The codebook for our
# focal variables appears below.

library(lmSupport)
library(car)

d = Prestige

# +-----------+-------------------------+-------------------+
# | Name      | Description             | Values            |
# +-----------+-------------------------+-------------------+
# | education | Mean education in years | 6.38--15.97       |
# | income    | Mean income             | 611--25879        |
# | type      | 3-level IV              | bc=blue collar    |
# |           |                         | prof=professional |
# |           |                         | wc=white collar   |
# +-----------+-------------------------+-------------------+

# Get a sense of what's going on.
# Is there initial evidence that income differs by type?

# Analyzing these data will proceed in several steps:

# +--------------------------------+
# | Step 1: State your hypothesis. |
# +--------------------------------+

# When a variable has only two levels, the hypothesis is clear: the mean of one
# level will be different from the mean of the other. When there are >2 levels,
# there are many possible hypotheses that could be tested: is 1>2&3? 2>1&3?
# 1>2>3? ...etc. Your experiment will be aimed at testing one of these.

# THAT IS TO SAY: your hypothesis is about a PATTERN of means.

# For these data, one reasonable hypothesis is:
#   Professional income > White Collar income = Blue Collar income.
# That is, Professionals make more money than White and Blue Collar workers,
# while White and Blue Collar workers do not differ in their income.
# Let's say this is the prediction we want to test.

# +----------------------------------------------------+
# | Step 2: Translate your hypothesis into a contrast. |
# +----------------------------------------------------+

# How should we code this contrast? What is our contrast of interest?
# Remember: bc==>Blue Collar, prof==>Professional, wc==>White Collar.

# When we use varRecode, our initial variable has to be a number and not a
# factor. We can easily convert it with the following command:
d$typeN = as.numeric(d$type)

# ...but which value is which? (Check levels(d$type): the numeric codes follow
# the factor's level order, so 1=bc, 2=prof, 3=wc.)

# Make your focal contrast.
# Is it centered?
# Is it unit-weighted?

# +-------------------------------------------------+
# | Step 3: Generate a set of orthogonal contrasts. |
# +-------------------------------------------------+

# The purpose of these contrasts is NOT to test other hypotheses. Rather, they
# are selected to account for the remaining between-groups variance in the
# outcome.

# How many of these additional contrasts are we going to need in this example?
# What are the 2 characteristics that our contrasts should have?

# What should this contrast be?
# How do we know it is orthogonal to the first contrast?

# +-------------------------------------------+
# | Step 4: Fit the model with all contrasts. |
# +-------------------------------------------+
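# One way to fill in Steps 2-4 (a sketch; the names c_1, c_2, and mA are the
# ones this lab refers to later). The focal contrast compares prof to the
# average of bc and wc; the filler contrast compares wc to bc. Both are centered
# (the weights sum to 0) and unit-weighted (the two compared means differ by 1
# unit of the coefficient).
d$c_1 = varRecode(d$typeN, c(1, 2, 3), c(-1/3, 2/3, -1/3))  # 1=bc, 2=prof, 3=wc
d$c_2 = varRecode(d$typeN, c(1, 2, 3), c(-1/2, 0, 1/2))
# Orthogonality check: the products of the weights sum to 0:
# (-1/3)(-1/2) + (2/3)(0) + (-1/3)(1/2) = 1/6 + 0 - 1/6 = 0
mA = lm(income ~ c_1 + c_2, data = d)
modelSummary(mA)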
# What's going on here?

# +--------------------+
# | Step 5: Interpret. |
# +--------------------+

# Specifically, is the contrast of interest significant?

# NOTE! Having a significant focal contrast does not necessarily mean that your
# data perfectly reflect the pattern of means you hypothesized. Rather, it simply
# says the data are consistent with the hypothesis. It's entirely possible, and
# indeed likely if you have many groups, that other things are going on, some of
# which you'll get a sense of by looking at the significance of the other
# contrasts. This is when pairwise comparisons can become useful.

# How do we interpret the parameter estimate?
# Why can we make this inference?

#### +--- Quick covariate example ---+ ####

# What if we control for education level in the above analysis?
# Will we reach the same conclusion? How do we find out?

# Add education as a predictor to the model.
# Check the results for c_1 and c_2.
# If you reach different conclusions after controlling for education, what does
# that mean?

###################################################################
#### PAIRWISE COMPARISONS (pt 1) ##################################
###################################################################

# What if we didn't have any hypothesized pattern of means, but rather wanted to
# test our prediction using pairwise comparisons?
# What approach do we use to test multiple hypotheses using non-orthogonal
# contrasts?

# We can set the reference group and create dummy codes in varContrasts:
?varContrasts
# We have to give this command a factor-type variable.

# What does the first contrast represent?
# What does the second contrast represent?

# If we change the reference group, our contrasts represent different pairwise
# comparisons. Now what are we testing?

# Let's refit our model:
# Conclusion?

# As you can see, the coefficient for the first dummy code is literally the mean
# difference in income between blue collar occupations and professional
# occupations. The coefficient for the second dummy code is literally the mean
# difference in income between white collar and professional occupations.

# We can recode type to get the remaining pairwise difference.
# Blue collar as the reference group:
# Conclusion?

# In this situation, what method would we use to protect our p-values?
# What does this involve?
# We can use the same models we did above. We'll use the Anova command John
# explained in class.
# Do we satisfy the condition of Fisher's LSD?

# +--------------------+
# | Graphing Interlude |
# +--------------------+

# We'll use a factor as our predictor and compute a new model for the graph.
d$Class = varRecode(d$type, c('bc', 'prof', 'wc'),
                    c("Blue Collar", "Professional", "White Collar"))

# When the predictor is a factor with >2 levels, R automatically uses dummy
# coding. Fit the factor model for the graph:
mG = lm(income ~ Class, data = d)

# Are the model estimates the same as from our contrast+residual model?
# Let's check to be sure!
unique(round(modelPredictions(mA), 2))
unique(round(modelPredictions(mG), 2))
# unique() drops any duplicated entries;
# compare this output to the same calls without unique().
# Yes, it's generating the same model predictions.

pX = data.frame(Class = c("Blue Collar", "Professional", "White Collar"))

library(ggplot2)
#library(cowplot)  # for after you do the main plot
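# A sketch of the base plot, assuming modelPredictions() returns the new data
# (pX) with Predicted, CILo, and CIHi columns appended; pY is just an assumed
# name for that data frame. Bar heights are the predicted mean incomes; error
# bars span the confidence limits.
pY = modelPredictions(mG, pX)
plot1 = ggplot(pY, aes(x = Class, y = Predicted)) +
  geom_bar(stat = "identity", fill = "grey70", width = .6) +
  geom_errorbar(aes(ymin = CILo, ymax = CIHi), width = .25) +
  labs(x = "Occupation type", y = "Income")
plot1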
plot1 = plot1 +
  # let's add bars to show us significance
  annotate("segment", x = 1.1, xend = 1.9, y = 15000, yend = 15000) +
  annotate("segment", x = 2.1, xend = 2.9, y = 15000, yend = 15000) +
  annotate("text", x = 1.5, y = 16000, label = "***") +
  annotate("text", x = 2.5, y = 16000, label = "***")
plot1

# +--------------------+
# | More than 3 levels |
# +--------------------+

# The h dataset is taken from a national survey of high school seniors.
# Two hundred observations were randomly sampled from the High School and Beyond
# survey. We'll read in the data from the web!
h = read.csv("http://stats.idre.ucla.edu/stat/data/hsb2.csv")
some(h)
# What's going on here?

# obs: 200
# vars: 11
# -----------------------------------------------------------------------------
# variable name   type      about the variable
# -----------------------------------------------------------------------------
# id              scale     student id
# female          nominal   (0/1)
# race            nominal   ethnicity (1=hispanic 2=asian 3=black 4=white)
# ses             ordinal   (1=low 2=middle 3=high)
# schtyp          nominal   type of school (1=public 2=private)
# prog            nominal   type of program (1=general 2=academic 3=vocational)
# read            scale     standardized reading score
# write           scale     standardized writing score
# math            scale     standardized math score
# science         scale     standardized science score
# socst           scale     standardized social studies score
# -----------------------------------------------------------------------------

# We predict racial disparities in reading scores, such that historically
# disadvantaged racial groups will perform worse than other groups.
# If you look at the codebook above, our race variable is categorical and codes
# students as Hispanic, Asian, Black, and White.
# The hypothesis, based on previous findings (e.g., Ferguson, 2002,
# http://files.eric.ed.gov/fulltext/ED474390.pdf),
# is that White and Asian students will have higher reading scores than Black
# and Hispanic students.

# Get an initial look at the data.

# Let's translate our hypothesis into an orthogonal contrast:
# What are the orthogonal contrasts?
# It helps to think about it intuitively. That is, what are some other ways the
# groups could relate to each other besides the relationship we've described
# above? (Hint: the focal contrast pits White+Asian against Black+Hispanic; the
# two fillers can then compare the groups within each of those pairs.)
# How do we make sure these work?

# Run your model!
# Conclusion?

###################################################################
#### PAIRWISE COMPARISONS (pt 2) ##################################
###################################################################

# Let's say we want to test our prediction using pairwise comparisons instead.
# Remember: 1=hispanic, 2=asian, 3=black, 4=white.
# How many pairwise comparisons have we done?
# What approach must we use to adjust our p-values?

# p.adjust() is the command we'll rely on. It expects a vector of p-values and
# makes the adjustment you tell it to (many adjustments are available). We can
# obtain our p-values directly from the models we fit above.

# Make a variable to hold the p-values that we're obtaining:
ps <- numeric(length = 6)
ps <- c(0, 0, 0, 0, 0, 0)  # this also works
ps

# Get the p-values from our models.
# The coefficients attribute from our modelSummary stores the p-values for each
# of our statistical tests. We'll use this feature to fill in our vector.
# In plain English: each entry of ps will hold the p-value for one pairwise
# comparison.
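# A hedged sketch of filling in ps -- the model name (m1) and coefficient name
# ("raceD1") below are placeholders, not objects defined in this lab. The
# modelSummary() coefficients table, like base summary(), keeps the p-values in
# a "Pr(>|t|)" column (check the table's dimnames if your version differs), so
# each entry can be pulled out by row and column name:
# ps[1] = modelSummary(m1)$coefficients["raceD1", "Pr(>|t|)"]
# ...and so on for ps[2] through ps[6], one entry per pairwise comparison.
# (With 4 groups, three dummy-coded models -- one per reference group -- cover
# all six pairs.)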
# Pass the p-values to p.adjust. p.adjust handles the ordering internally, but
# sorting in advance makes the before/after comparison easier:
ps = sort(ps)
?p.adjust

# One reasonable choice here is Holm's sequential procedure (pick the method
# your answer above calls for):
psAdj = p.adjust(ps, method = "holm")

# More viewable:
round(psAdj, 3)
# Compare to the old values:
round(ps, 3)
# These are our new p-values.
# Do any of our conclusions change?

# You'll notice this approach led to more nuanced results with respect to our
# focal hypothesis. Pairwise comparisons can be a good way to get more specific
# answers, but the trade-off is statistical power. Wiser still is designing your
# studies so that you don't need to make many of these kinds of comparisons.

# +-------------------------+
# |      MORE PRACTICE      |
# +-------------------------+
# If time allows; otherwise do it at home!

# Using the dataset Lab11_DemoData.dat, test the hypothesis that people in the
# "middle class" have more positive social attitudes than either the upper or
# lower "working class." Further suppose that your hypothesis does NOT predict a
# difference between the two levels of working class.

d = dfReadDat("Lab11_DemoData.dat", SubID = NULL)

# +--------+--------------------------------------------+-----------------+
# | Name   | Description                                | Values          |
# +--------+--------------------------------------------+-----------------+
# | SubID  | SubID                                      | n=264           |
# | numpos | Number of positive responses (7-item survey)| 0 - 7          |
# | class  | 3-level IV                                 | 1=middle        |
# |        |                                            | 2=upper working |
# |        |                                            | 3=lower working |
# +--------+--------------------------------------------+-----------------+

# +--------------------------------------------+
# | Translate your hypothesis into a contrast. |
# +--------------------------------------------+

# +-------------------------------------------------+
# | Step 3: Generate a set of orthogonal contrasts. |
# +-------------------------------------------------+
# How many of these additional contrasts are we going to need?

# +-------------------------------------------+
# | Step 4: Fit the model with all contrasts. |
# +-------------------------------------------+
# Is the contrast of interest significant?

# Write your results in APA format. One possible coding sketch follows.
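# One possible answer (a sketch; try it yourself before peeking!). This follows
# the same recipe as the Prestige example above; mP is just an assumed model
# name. Recall: class is coded 1=middle, 2=upper working, 3=lower working.
d$c_1 = varRecode(d$class, c(1, 2, 3), c(2/3, -1/3, -1/3))  # middle vs. both working classes (focal)
d$c_2 = varRecode(d$class, c(1, 2, 3), c(0, -1/2, 1/2))     # lower vs. upper working (orthogonal filler)
mP = lm(numpos ~ c_1 + c_2, data = d)
modelSummary(mP)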