########################################################
####  Lab 11: Categorical variables with >2 levels  ####
####  Week 11: 17 Nov 17                            ####
########################################################

# OVERVIEW OF TODAY
#
# 1. Categorical predictor with 3 levels
# 2. Graph the model with 3-level predictor
# 3. Categorical predictor with 4 levels

# For our first example, we're going to use data that should be familiar: they're
# based on the 1971 Canadian occupational prestige study. This time, we're going to
# focus on a variable we haven't looked at before: "type." The codebook for our
# focal variables appears below.

library(lmSupport)
library(car)

d = Prestige

# +-----------+-------------------------+-------------------+
# | Name      | Description             | Values            |
# +-----------+-------------------------+-------------------+
# | education | Mean education in years | 6.38--15.97       |
# | income    | Mean income             | 611--25879        |
# | type      | 3-level IV              | bc=blue collar    |
# |           |                         | prof=professional |
# |           |                         | wc=white collar   |
# +-----------+-------------------------+-------------------+

# Get a sense of what's going on.
# Is there initial evidence that income differs by type?

# Analyzing these data will proceed in several steps:

# +--------------------------------+
# | Step 1: State your hypothesis. |
# +--------------------------------+

# When a variable has only two levels, the hypothesis is clear: the mean of one
# level will be different from the mean of the other. When there are >2 levels,
# there are many possible hypotheses that could be tested: is 1>2&3? 2>1&3?
# 1>2>3? ...etc. Your experiment will be aimed at testing one of these.

# THAT IS TO SAY: your hypothesis is about a PATTERN of means.

# For these data, one reasonable hypothesis is:
#   Professional income > White Collar income = Blue Collar income.
# That is, Professionals make more money than White and Blue Collar workers,
# while White and Blue Collar workers do not differ in their income.
# Let's say this is the prediction we want to test.

# +----------------------------------------------------+
# | Step 2: Translate your hypothesis into a contrast. |
# +----------------------------------------------------+

# How should we code this contrast? What is our contrast of interest?
# Remember: bc==>Blue Collar, prof==>Professional, wc==>White Collar.

# When we use varRecode, our initial variable has to be a number and not a
# factor. We can easily convert it with the following command:
d$typeN = as.numeric(d$type)

# ...but which value is which? (Check levels(d$type): the numeric codes follow
# the factor's level order, so 1=bc, 2=prof, 3=wc.)

# Make your focal contrast.
# Is it centered?
# Is it unit-weighted?

# +-------------------------------------------------+
# | Step 3: Generate a set of orthogonal contrasts. |
# +-------------------------------------------------+

# The purpose of these contrasts is NOT to test other hypotheses. Rather, they
# are selected to account for the remaining between-groups variance in the
# outcome.

# How many of these additional contrasts are we going to need in this example?
# What are the 2 characteristics that our contrasts should have?

# What should this contrast be?
# How do we know it is orthogonal to the first contrast?

# +-------------------------------------------+
# | Step 4: Fit the model with all contrasts. |
# +-------------------------------------------+
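# One way to fill in Steps 2-4 (a sketch; the names c_1, c_2, and mA are the
# ones this lab refers to later). The focal contrast compares prof to the
# average of bc and wc; the filler contrast compares wc to bc. Both are centered
# (the weights sum to 0) and unit-weighted (the two compared means differ by 1
# unit of the coefficient).
d$c_1 = varRecode(d$typeN, c(1, 2, 3), c(-1/3, 2/3, -1/3))  # 1=bc, 2=prof, 3=wc
d$c_2 = varRecode(d$typeN, c(1, 2, 3), c(-1/2, 0, 1/2))
# Orthogonality check: the products of the weights sum to 0:
# (-1/3)(-1/2) + (2/3)(0) + (-1/3)(1/2) = 1/6 + 0 - 1/6 = 0
mA = lm(income ~ c_1 + c_2, data = d)
modelSummary(mA)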
# What's going on here?

# +--------------------+
# | Step 5: Interpret. |
# +--------------------+

# Specifically, is the contrast of interest significant?

# NOTE! Having a significant focal contrast does not necessarily mean that your
# data perfectly reflect the pattern of means you hypothesized. Rather, it simply
# says the data are consistent with the hypothesis. It's entirely possible, and
# indeed likely if you have many groups, that other things are going on, some of
# which you'll get a sense of by looking at the significance of the other
# contrasts. This is when pairwise comparisons can become useful.

# How do we interpret the parameter estimate?
# Why can we make this inference?

#### +--- Quick covariate example ---+ ####

# What if we control for education level in the above analysis?
# Will we reach the same conclusion? How do we find out?

# Add education as a predictor to the model.
# Check the results for c_1 and c_2.
# If you reach different conclusions after controlling for education, what does
# that mean?

###################################################################
#### PAIRWISE COMPARISONS (pt 1) ##################################
###################################################################

# What if we didn't have any hypothesized pattern of means, but rather wanted to
# test our prediction using pairwise comparisons?
# What approach do we use to test multiple hypotheses using non-orthogonal
# contrasts?

# We can set the reference group and create dummy codes in varContrasts:
?varContrasts
# We have to give this command a factor-type variable.

# What does the first contrast represent?
# What does the second contrast represent?

# If we change the reference group, our contrasts represent different pairwise
# comparisons. Now what are we testing?

# Let's refit our model:
# Conclusion?

# As you can see, the coefficient for the first dummy code is literally the mean
# difference in income between blue collar occupations and professional
# occupations. The coefficient for the second dummy code is literally the mean
# difference in income between white collar and professional occupations.

# We can recode type to get the remaining pairwise difference.
# Blue collar as the reference group:
# Conclusion?

# In this situation, what method would we use to protect our p-values?
# What does this involve?
# We can use the same models we did above. We'll use the Anova command John
# explained in class.
# Do we satisfy the condition of Fisher's LSD?

# +--------------------+
# | Graphing Interlude |
# +--------------------+

# We'll use a factor as our predictor and compute a new model for the graph.
d$Class = varRecode(d$type, c('bc', 'prof', 'wc'),
                    c("Blue Collar", "Professional", "White Collar"))

# When the predictor is a factor with >2 levels, R automatically uses dummy
# coding. Fit the factor model for the graph:
mG = lm(income ~ Class, data = d)

# Are the model estimates the same as from our contrast+residual model?
# Let's check to be sure!
unique(round(modelPredictions(mA), 2))
unique(round(modelPredictions(mG), 2))
# unique() drops any duplicated entries;
# compare this output to the same calls without unique().
# Yes, it's generating the same model predictions.

pX = data.frame(Class = c("Blue Collar", "Professional", "White Collar"))

library(ggplot2)
#library(cowplot)  # for after you do the main plot
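# A sketch of the base plot, assuming modelPredictions() returns the new data
# (pX) with Predicted, CILo, and CIHi columns appended; pY is just an assumed
# name for that data frame. Bar heights are the predicted mean incomes; error
# bars span the confidence limits.
pY = modelPredictions(mG, pX)
plot1 = ggplot(pY, aes(x = Class, y = Predicted)) +
  geom_bar(stat = "identity", fill = "grey70", width = .6) +
  geom_errorbar(aes(ymin = CILo, ymax = CIHi), width = .25) +
  labs(x = "Occupation type", y = "Income")
plot1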
plot1 = plot1 +
  # let's add bars to show us significance
  annotate("segment", x = 1.1, xend = 1.9, y = 15000, yend = 15000) +
  annotate("segment", x = 2.1, xend = 2.9, y = 15000, yend = 15000) +
  annotate("text", x = 1.5, y = 16000, label = "***") +
  annotate("text", x = 2.5, y = 16000, label = "***")
plot1

# +--------------------+
# | More than 3 levels |
# +--------------------+

# The h dataset is taken from a national survey of high school seniors.
# Two hundred observations were randomly sampled from the High School and Beyond
# survey. We'll read in the data from the web!
h = read.csv("http://stats.idre.ucla.edu/stat/data/hsb2.csv")
some(h)
# What's going on here?

# obs: 200
# vars: 11
# -----------------------------------------------------------------------------
# variable name   type      about the variable
# -----------------------------------------------------------------------------
# id              scale     student id
# female          nominal   (0/1)
# race            nominal   ethnicity (1=hispanic 2=asian 3=black 4=white)
# ses             ordinal   (1=low 2=middle 3=high)
# schtyp          nominal   type of school (1=public 2=private)
# prog            nominal   type of program (1=general 2=academic 3=vocational)
# read            scale     standardized reading score
# write           scale     standardized writing score
# math            scale     standardized math score
# science         scale     standardized science score
# socst           scale     standardized social studies score
# -----------------------------------------------------------------------------

# We predict racial disparities in reading scores, such that historically
# disadvantaged racial groups will perform worse than other groups.
# If you look at the codebook above, our race variable is categorical and codes
# students as Hispanic, Asian, Black, and White.
# The hypothesis, based on previous findings (e.g., Ferguson, 2002,
# http://files.eric.ed.gov/fulltext/ED474390.pdf),
# is that White and Asian students will have higher reading scores than Black
# and Hispanic students.

# Get an initial look at the data.

# Let's translate our hypothesis into an orthogonal contrast:
# What are the orthogonal contrasts?
# It helps to think about it intuitively. That is, what are some other ways the
# groups could relate to each other besides the relationship we've described
# above? (Hint: the focal contrast pits White+Asian against Black+Hispanic; the
# two fillers can then compare the groups within each of those pairs.)
# How do we make sure these work?

# Run your model!
# Conclusion?

###################################################################
#### PAIRWISE COMPARISONS (pt 2) ##################################
###################################################################

# Let's say we want to test our prediction using pairwise comparisons instead.
# Remember: 1=hispanic, 2=asian, 3=black, 4=white.
# How many pairwise comparisons have we done?
# What approach must we use to adjust our p-values?

# p.adjust() is the command we'll rely on. It expects a vector of p-values and
# makes the adjustment you tell it to (many adjustments are available). We can
# obtain our p-values directly from the models we fit above.

# Make a variable to hold the p-values that we're obtaining:
ps <- numeric(length = 6)
ps <- c(0, 0, 0, 0, 0, 0)  # this also works
ps

# Get the p-values from our models.
# The coefficients attribute from our modelSummary stores the p-values for each
# of our statistical tests. We'll use this feature to fill in our vector.
# In plain English: each entry of ps will hold the p-value for one pairwise
# comparison.
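# A hedged sketch of filling in ps -- the model name (m1) and coefficient name
# ("raceD1") below are placeholders, not objects defined in this lab. The
# modelSummary() coefficients table, like base summary(), keeps the p-values in
# a "Pr(>|t|)" column (check the table's dimnames if your version differs), so
# each entry can be pulled out by row and column name:
# ps[1] = modelSummary(m1)$coefficients["raceD1", "Pr(>|t|)"]
# ...and so on for ps[2] through ps[6], one entry per pairwise comparison.
# (With 4 groups, three dummy-coded models -- one per reference group -- cover
# all six pairs.)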
# Pass the p-values to p.adjust. p.adjust handles the ordering internally, but
# sorting in advance makes the before/after comparison easier:
ps = sort(ps)
?p.adjust

# One reasonable choice here is Holm's sequential procedure (pick the method
# your answer above calls for):
psAdj = p.adjust(ps, method = "holm")

# More viewable:
round(psAdj, 3)
# Compare to the old values:
round(ps, 3)
# These are our new p-values.
# Do any of our conclusions change?

# You'll notice this approach led to more nuanced results with respect to our
# focal hypothesis. Pairwise comparisons can be a good way to get more specific
# answers, but the trade-off is statistical power. Wiser still is designing your
# studies so that you don't need to make many of these kinds of comparisons.

# +-------------------------+
# |      MORE PRACTICE      |
# +-------------------------+
# If time allows; otherwise do it at home!

# Using the dataset Lab11_DemoData.dat, test the hypothesis that people in the
# "middle class" have more positive social attitudes than either the upper or
# lower "working class." Further suppose that your hypothesis does NOT predict a
# difference between the two levels of working class.

d = dfReadDat("Lab11_DemoData.dat", SubID = NULL)

# +--------+--------------------------------------------+-----------------+
# | Name   | Description                                | Values          |
# +--------+--------------------------------------------+-----------------+
# | SubID  | SubID                                      | n=264           |
# | numpos | Number of positive responses (7-item survey)| 0 - 7          |
# | class  | 3-level IV                                 | 1=middle        |
# |        |                                            | 2=upper working |
# |        |                                            | 3=lower working |
# +--------+--------------------------------------------+-----------------+

# +--------------------------------------------+
# | Translate your hypothesis into a contrast. |
# +--------------------------------------------+

# +-------------------------------------------------+
# | Step 3: Generate a set of orthogonal contrasts. |
# +-------------------------------------------------+
# How many of these additional contrasts are we going to need?

# +-------------------------------------------+
# | Step 4: Fit the model with all contrasts. |
# +-------------------------------------------+
# Is the contrast of interest significant?

# Write your results in APA format. One possible coding sketch follows.
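# One possible answer (a sketch; try it yourself before peeking!). This follows
# the same recipe as the Prestige example above; mP is just an assumed model
# name. Recall: class is coded 1=middle, 2=upper working, 3=lower working.
d$c_1 = varRecode(d$class, c(1, 2, 3), c(2/3, -1/3, -1/3))  # middle vs. both working classes (focal)
d$c_2 = varRecode(d$class, c(1, 2, 3), c(0, -1/2, 1/2))     # lower vs. upper working (orthogonal filler)
mP = lm(numpos ~ c_1 + c_2, data = d)
modelSummary(mP)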