########################################################
####  Lab 11: Categorical variables with >2 levels  ####
####  Week 1: 17 Nov 17                              ####
########################################################

# OVERVIEW OF TODAY #
# 1. Categorical predictor with 3 levels
# 2. Graph the model with 3-level predictor
# 3. Categorical predictor with 4 levels

# For our first example, we're going to use data that should be familiar: they're
# based on the 1971 Canadian occupational prestige study. This time, we're going to
# focus on a variable we haven't looked at before: "type." The codebook for our focal
# variables appears below.

library(lmSupport)
library(car)

d = Prestige

# +-----------+--------------------------+--------------------+
# | Name      | Description              | Values             |
# +-----------+--------------------------+--------------------+
# | education | Mean education in years  | 6.38--15.97        |
# | income    | Mean income              | 611--25879         |
# | type      | 3-level IV               | bc=blue collar     |
# |           |                          | prof=professional  |
# |           |                          | wc=white collar    |
# +-----------+--------------------------+--------------------+

# Get a sense of what's going on
str(d)
head(d)
summary(d)
varDescribe(d)

# Is there initial evidence that income differs by type?
varDescribeBy(d$income, d$type) # yup.

# Analyzing these data will proceed in several steps:

# +---------------------------------+
# | Step 1: State your hypothesis.  |
# +---------------------------------+

# When a variable has only two levels, the hypothesis is clear: the mean of one
# level will be different from the mean of the other. When there are >2, there are many
# possible hypotheses that could be tested: is 1>2&3? 2>1&3? 1>2>3? ...etc. Your
# experiment will be aimed at testing one of these.

# THAT IS TO SAY: your hypothesis is about a PATTERN of means.

# For these data, one reasonable hypothesis is:
# Professional income > White Collar income = Blue Collar income.
# So Professionals make more money than White and Blue Collar
# workers, while White and Blue Collar workers do not differ in their income.
# Let's say this is the prediction we want to test.

# +-----------------------------------------------------+
# | Step 2: Translate your hypothesis into a contrast.  |
# +-----------------------------------------------------+

# How should we code this contrast? What is our contrast of interest?
# Remember: bc==>Blue Collar, prof==>Professional, wc==>White Collar.

# When we use varRecode, our initial variable has to be a number and not a factor.
# We can easily do this with the following command:
d$typeN = as.numeric(d$type)

# ...but which value is which?
View(d)
# 1 = bc, 2 = prof, 3 = wc (you'll notice this is alphabetical)

# Make your focal contrast
d$type_c1 = varRecode(d$typeN, c(1,2,3), c(-.333,.667,-.333))
# Is it centered? yes
# Is it unit-weighted? yes

# +-------------------------------------------------+
# | Step 3: generate a set of orthogonal contrasts. |
# +-------------------------------------------------+

# The purpose of these contrasts is NOT to test other hypotheses. Rather, they are
# selected to account for the remaining between-groups variance present in the outcome.

# How many of these additional contrasts are we going to need in this example?
# 1 (m - 1 total contrasts: 1 focal and 1 orthogonal)

# Q: What are the 2 characteristics that our contrasts should have?
# A: Orthogonal and centered

# What should this contrast be?
d$type_c2 = varRecode(d$typeN, c(1,2,3), c(.5,0,-.5))

# How do we know this is orthogonal to the first contrast?
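# (Optional sketch: a tiny helper for checking that a pair of contrast codes is
# centered and orthogonal. checkContrasts is our own name, not an lmSupport function;
# the tolerance is only needed because -.333 is a rounded -1/3. The quick by-hand
# check just below does the same thing.)
checkContrasts = function(c1, c2, tol = .01) {
  c(c1_centered = abs(sum(c1)) < tol,       # do the first code's weights sum to ~0?
    c2_centered = abs(sum(c2)) < tol,       # do the second code's weights sum to ~0?
    orthogonal  = abs(sum(c1 * c2)) < tol)  # do the cross-products sum to ~0?
}
checkContrasts(c(-.333, .667, -.333), c(.5, 0, -.5)) # all TRUE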
temp = c(-.333,.667,-.333)*c(.5,0,-.5)
sum(temp) # This should be zero

# +------------------------------------------+
# | Step 4: Fit the model with all contrasts |
# +------------------------------------------+

mA = lm(income ~ type_c1 + type_c2, data=d)
modelSummary(mA)
# What's going on here? We don't really know yet...

# +-------------------+
# | Step 5: Interpret |
# +-------------------+

# Specifically, is the contrast of interest significant?
# It is!

# NOTE! Having a significant focal contrast does not necessarily mean that your data perfectly reflect
# the pattern of means you hypothesized. Rather, it simply says the data are consistent with the hypothesis.
# It's entirely possible, and indeed likely if you have many groups, that there are other things
# going on, some of which you'll get a sense of by looking at the significance of the other contrasts.
# This is when pairwise comparisons can become useful.

# How do we interpret the parameter estimate?
# On average, income in professional jobs is 5346 dollars higher than the average income
# of white collar and blue collar jobs.
# Why can we make this inference? We unit-weighted our contrasts!

#### +--- Quick covariate example ---+ ####
# What if we control for education level in the above analysis?
# Will we reach the same conclusion? How do you find out?

# Add education as a predictor to the model
mA2 = lm(income ~ type_c1 + type_c2 + education, data=d)
modelSummary(mA2)
# Check the results for c_1 and c_2

# If you reach different conclusions after controlling for education, what does that mean?
# The results have changed when controlling for education, such that the predicted pattern of means
# no longer accounts for significant variance in income. The difference in means we saw previously
# between white/blue collar work and professional work can be completely accounted for by education level.
# This makes sense since these variables are related.
varDescribeBy(d$education, d$type)

###################################################################
#### PAIRWISE COMPARISONS (pt 1) ##################################
###################################################################

# What if we didn't have any hypothesized pattern of means, but rather wanted to test our
# prediction using pairwise comparisons?

# What approach do we use to test multiple hypotheses using non-orthogonal contrasts? Dummy coding!

# We can set the reference group and create dummy codes in varContrasts:
?varContrasts

# We have to give this command a factor-type variable
contrasts(d$type) <- varContrasts(d$type,         # The factor
                                  Type = "DUMMY", # type of contrast
                                  RefLevel = 1)   # Specify the reference group

# What does the first contrast represent?
# Comparison between professional and blue collar.
# What does the second contrast represent?
# Comparison between white collar and blue collar.

# If we change the reference group,
# our contrasts represent different pairwise comparisons
contrasts(d$type) <- varContrasts(d$type, "Dummy", RefLevel = 2)
# Now what are we testing?
# Whether white and blue collar are significantly different from professional

# Let's refit our model:
mProf <- lm(income ~ type, data = d)
modelSummary(mProf)
# Conclusion? Professional jobs have higher income than both white collar and blue collar jobs.
varDescribeBy(d$income, d$type)

# As you can see, the coefficient for the first dummy code is literally the mean difference
# between blue collar occupations and professional occupations.
# The coefficient for the second dummy code is literally the
# mean difference in income between white collar and professional occupations.

# We can recode type to get the remaining pairwise difference:
# Blue collar as the reference group:
contrasts(d$type) <- varContrasts(d$type, "Dummy", RefLevel = 1)
mBC <- lm(income ~ type, data = d)
modelSummary(mBC)
# Conclusion? White collar jobs do not have significantly higher income than blue collar jobs.

# In this situation, what method would we use to protect our p values?
# We have three groups, so we are making 3 pairwise comparisons.
# Fisher's LSD!
# What does this involve?
# Run an omnibus test first; if the overall model is significant, no further adjustment is necessary.
# We can use the same model we did above. We'll use the Anova command John explained in class.
Anova(mProf, type=3)
Anova(mBC, type=3) # notice the result is the same no matter the reference group
# Do we satisfy the condition of Fisher LSD?
# Yes, so we do not need to adjust p-values.

# +-------------------------+
# |   Graphing Interlude    |
# +-------------------------+

# We'll use a factor as our predictor and compute a new model for the graph
d$Class = varRecode(d$type, c('bc','prof','wc'), c("Blue Collar","Professional","White Collar"))
mG = lm(income ~ Class, data=d)
modelSummary(mG)
# When it's a factor with >2 levels, R automatically uses dummy-coding.

# Are the model estimates the same as from our contrast+residual model?
# Let's check to be sure!
unique(round(modelPredictions(mA), 2))
unique(round(modelPredictions(mG), 2))
# "unique" deletes any duplicates present in a given list. Compare this output to one without "unique".
# Yes, it's generating the same model predictions

pX = data.frame(Class = c("Blue Collar", "Professional", "White Collar"))
pY = modelPredictions(mG, pX)

library(ggplot2)
#library(cowplot)

plot1 = ggplot(d, aes(x=Class, y=income, fill=Class)) +
  geom_bar(data=pY, aes(y=Predicted), stat = "identity") +
  geom_errorbar(data=pY, aes(y=Predicted, ymax=CIHi, ymin=CILo), width=.25) +
  geom_point(position = position_jitter(w=.05, h=0), alpha=.2, size=.5) + # alpha tells it how transparent to make the dots
  theme(legend.position = 'none')
plot1

plot1 = plot1 + # let's add bars to show us significance
  annotate("segment", x=1.1, xend=1.9, y=15000, yend=15000) +
  annotate("segment", x=2.1, xend=2.9, y=15000, yend=15000) +
  annotate("text", x=1.5, y=16000, label="***") +
  annotate("text", x=2.5, y=16000, label="***")
plot1

# +-------------------------+
# |   More than 3 levels    |
# +-------------------------+

# The h dataset is taken from a national survey of high school seniors.
# Two hundred observations were randomly sampled from the High School and Beyond survey.

# We'll read in the data from the web!
h = read.csv("http://stats.idre.ucla.edu/stat/data/hsb2.csv")
some(h)
# What's going on here?
# obs: 200
# vars: 11
# -----------------------------------------------------------------------------
# variable name   type      about the variable
# -----------------------------------------------------------------------------
# id              scale     student id
# female          nominal   (0/1)
# race            nominal   ethnicity (1=hispanic 2=asian 3=black 4=white)
# ses             ordinal   (1=low 2=middle 3=high)
# schtyp          nominal   type of school (1=public 2=private)
# prog            nominal   type of program (1=general 2=academic 3=vocational)
# read            scale     standardized reading score
# write           scale     standardized writing score
# math            scale     standardized math score
# science         scale     standardized science score
# socst           scale     standardized social studies score

# We predict racial disparities for reading scores, such that historically
# disadvantaged racial groups will perform worse than other groups.
# If you look at the codebook above, our race variable is categorical
# and codes students as Hispanic, Asian, Black, and White.
# The hypothesis, based on previous findings (e.g., Ferguson, 2002,
# http://files.eric.ed.gov/fulltext/ED474390.pdf),
# is that White and Asian students will have higher reading scores than
# Black and Hispanic students.

# Get an initial look at the data
varDescribeBy(h$read, h$race)

# Let's translate our hypothesis into our focal contrast:
# -.5, .5, -.5, .5

# What are the orthogonal contrasts?
# It helps to think about it intuitively. That is, what are some other ways
# the groups could relate to each other besides the relationship we've
# described above?
# Hispanic and Black individuals could differ, and White and Asian individuals could differ
# or, in contrast form...
# .5, 0, -.5, 0
# 0, -.5, 0, .5

# How do we make sure these work?
sum(c(-.5,.5,-.5,.5)*c(.5, 0, -.5, 0))
sum(c(-.5,.5,-.5,.5)*c(0, -.5, 0, .5))
sum(c(0, -.5, 0, .5)*c(.5, 0, -.5, 0))

h$race_c1 = varRecode(h$race, c(1,2,3,4), c(-.5,.5,-.5,.5))
h$race_c2 = varRecode(h$race, c(1,2,3,4), c(.5, 0, -.5, 0))
h$race_c3 = varRecode(h$race, c(1,2,3,4), c(0, -.5, 0, .5))

# run your model!
modh = lm(read ~ race_c1 + race_c2 + race_c3, data=h)
modelSummary(modh, t=F)

# Conclusion?
# The data are consistent with the hypothesized pattern of means

###################################################################
#### PAIRWISE COMPARISONS (pt 2) ##################################
###################################################################

# Let's say we want to test our prediction using pairwise comparisons instead.
h$raceF = as.factor(h$race)
contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 1)
# remember, 1 = hispanic, 2 = asian, 3 = black, 4 = white
mr1 = lm(read ~ raceF, data=h)
modelSummary(mr1)
# White students perform significantly better than Hispanic students. No other sig diffs

contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 2)
mr2 = lm(read ~ raceF, data=h)
modelSummary(mr2)
# No significant differences between Asian students and the other groups

contrasts(h$raceF) = varContrasts(h$raceF, Type = 'Dummy', RefLevel = 3)
mr3 = lm(read ~ raceF, data=h)
modelSummary(mr3)
# Our last remaining pairwise test is the last line of the output.
# White students perform significantly better than Black students

# How many pairwise comparisons have we done?
# 6
# What approach must we use to adjust our p values?
# Holm-Bonferroni!

# p.adjust() is the command we'll rely on. It expects a vector (list) of p-values and
# makes the adjustment you tell it to (many other adjustments are available besides
# Holm-Bonferroni).
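# (A quick toy illustration of p.adjust before we use the real p-values below.
# These six values are made up purely to show what the Holm adjustment does;
# they are not from our models.)
exampleP = c(.001, .004, .012, .025, .030, .200) # made-up p-values
p.adjust(exampleP, method = "holm")              # Holm-adjusted versions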
# We can obtain our p-values directly from the models we fit above.

# Make a variable to hold the p-values that we're obtaining
ps <- numeric(length = 6)
ps <- c(0,0,0,0,0,0) # this also works
ps

# Get the p-values from our models
# The coefficients attribute from our summary stores the p-values for each of our statistical tests
modelSummary(mr1)$coefficients
modelSummary(mr2)$coefficients
modelSummary(mr3)$coefficients

# We'll use this feature to fill in our vector
ps[1:3] <- modelSummary(mr1)$coefficients[2:4, 4]
# In plain English: make the first three values in the list the same as those found in rows 2-4 of
# column 4 in the output from the first model.
ps[4:5] <- modelSummary(mr2)$coefficients[3:4, 4]
# In plain English: make the fourth and fifth values in the list the same as those found in rows 3 and 4
# of column 4 in the output from the second model.
ps[6] <- modelSummary(mr3)$coefficients[4, 4]
ps

# Pass the p-values to p.adjust. It will order them for you automatically, but to make
# the comparison easier let's sort them in advance.
ps = sort(ps)
?p.adjust
p.adjust(ps, "holm")

# More viewable
round(p.adjust(ps, "holm"), digits = 5)
# Compare to the old values:
round(ps, digits=5)

# These are our new p-values
# Do any of our conclusions change?
# Nope, things that were significant before are still significant now.

# You'll notice this approach led to more nuanced results in terms of our focal hypothesis.
# This can be a good way to get more specific results, but the trade-off is statistical power.
# Even more wise is trying to design your studies such that you don't need to do many of these
# kinds of comparisons.

# +-------------------------+
# |      MORE PRACTICE      |
# if time allows; otherwise do it at home!
# +-------------------------+

# Using the dataset Lab11_DemoData.dat, test the hypothesis that people in the
# "middle class" have more positive social attitudes than either the upper or
# lower "working class". Further suppose that your hypothesis does NOT predict
# a difference between the levels of working class.
d = dfReadDat("Lab11_DemoData.dat", SubID=NULL)

# +------------+--------------------------------------------+-----------------+
# | Name       | Description                                | Values          |
# +------------+--------------------------------------------+-----------------+
# | SubID      | SubID                                      | n=264           |
# | numpos     | Number of positive responses on 7 Q survey | 0 - 7           |
# | class      | 3-level IV                                 | 1=middle        |
# |            |                                            | 2=upper working |
# |            |                                            | 3=lower working |
# +------------+--------------------------------------------+-----------------+

# +---------------------------------------------+
# | Translate your hypothesis into a contrast.  |
# +---------------------------------------------+
d$class_c1 = varRecode(d$class, c(1,2,3), c(2,-1,-1))

# +-------------------------------------------------+
# | Step 3: generate a set of orthogonal contrasts. |
# +-------------------------------------------------+
# How many of these additional contrasts are we going to need?
# 1
d$class_c2 = varRecode(d$class, c(1,2,3), c(0,1,-1))

# How do we know this is orthogonal to the first contrast?
temp = c(2,-1,-1)*c(0,1,-1) # From step 2 and 3
sum(temp) # They sum up to zero!

# +------------------------------------------+
# | Step 4: Fit the model with all contrasts |
# +------------------------------------------+
mA = lm(numpos ~ class_c1 + class_c2, data=d)
summary(mA)

# Is the contrast of interest significant?
# Yes.
# Is the orthogonal contrast non-significant?
# Does it meet the criteria proposed by Forscher and Brauer?
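# (An optional sketch, using base R's anova(), of how much of the between-group
# variability the focal contrast captures; this is our own illustration rather than
# the lab's required method. It uses sequential (Type I) sums of squares, so with
# unequal group sizes the split is approximate.)
ssTab = anova(mA)                      # sequential sums of squares for each term
ssFocal = ssTab["class_c1", "Sum Sq"]  # SS captured by the focal contrast
ssOther = ssTab["class_c2", "Sum Sq"]  # SS captured by the residual contrast
ssFocal / (ssFocal + ssOther)          # share of between-group SS due to the focal contrast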
meClass = modelEffectSizes(mA)
# Size of the focal contrast's effect relative to each contrast
# (the second value is the focal-to-residual ratio)
meClass[2,4]/meClass[2:3,4]

# Write your results in APA format.
# The contrast of interest (2, -1, -1) was statistically significant, t(1053) = 3.75, p < .001, partial η2 = .01,
# whereas the contrast testing the residual between-group variance (0, 1, -1) was not, t(1053) = 0.16, p = .87, partial η2 < .0001.
# The pattern is consistent with our hypothesized pattern that people in the "middle class" have more positive
# social attitudes than either the upper or lower "working class".
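# (Optional sketch: pulling the t and p values for the write-up straight from the
# fitted model with base R's summary(). The row and column names assume the default
# lm() output and the contrast names used above.)
coefTab = summary(mA)$coefficients
coefTab["class_c1", c("t value", "Pr(>|t|)")] # focal contrast
coefTab["class_c2", c("t value", "Pr(>|t|)")] # residual between-group contrast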