############################
#### Lab 6 Exercise Key ####
############################

#### PART 1 ####

# 1.
# 2-3. The a priori criterion is good, but one participant just barely violated that time cutoff.
# I'd want to know where the rest of the participants were in duration.

# 2.
# 3-4. This doesn't speak to whether they thought those completion times were unreasonable.

# 3.
# 1-2. The proper reaction to this is probably sympathy for the stressed-out grad student.

# 4.
# 5. This is EXTREMELY concerning.

# 5.
# 4-5. You should see this as being functionally pretty much the same as number 4. In general,
# data-driven approaches to removing outliers are bad news.

# 6.
# 1-2. If their claim is backed up by what they write in the paper, this is probably OK. It should
# have some kind of external validity, though.
# They could ask the authors to report how the results would change if the outliers remained
# in the model, and ask to see any pre-registration documentation.

#### PART 2 ####

# 1.
d = read.csv("Lab6Ex_skivegdata.csv")
varDescribe(d)
str(d)

# 2.
m1 = lm(fitness ~ skiing + veggies, data=d)
modelSummary(m1, t=F)
# Both vegetable consumption, F(1,50) = 18.2, p < .001, and skiing, F(1,50) = 51.346, p < .001,
# predict health at the end of the winter, together explaining 56.6% of the variance in fitness
# scores, R2 = .5658.

# 3.
hats = modelCaseAnalysis(m1, Type="HATVALUES")
hats
d[hats$Rownames,]
# 22 and 33 have extreme skiing scores; 49 has an extreme veggies score.

# 4.
resids = modelCaseAnalysis(m1, Type="RESIDUALS")
resids
d[resids$Rownames,]
# 14 has very high fitness despite never skiing and averaging 0 servings of vegetables.
# 33 has lower fitness than would be expected for someone who has skied so much.
# 49 has lower fitness than would be expected for someone who consumes so many vegetables.

# 5.
cooks = modelCaseAnalysis(m1, Type="COOKSD")
cooks
d[cooks$Rownames,]
# Nope, they're the same as the regression outliers.

# 6.
inf = modelCaseAnalysis(m1, Type="INFLUENCEPLOT")
inf
d[inf$Rownames,]

# 7.
# Participant 22, like 33, skis much more than the average participant (84 times).
# However, this person also has a very high fitness score, in line with what we would
# predict for this amount of skiing. Thus, despite the high skiing score, this participant
# fits in with the trend of the data.

# 8.
# I would probably tell them not to remove any of the participants, because the effects are
# highly significant even with them included.

# 9.
# Could have asked avid cross-country skiers not to sign up for the study.
# Could have asked vegetarians not to sign up for the study.

# 10.
d = dfRemoveCases(d, c(14,33,49))
m1b = lm(fitness ~ skiing + veggies, data=d)
modelSummary(m1b, t=F)
modelSummary(m1, t=F)
modelEffectSizes(m1b)
# Many of the individual statistics have changed, as have the effect sizes, but the direction and
# interpretation of the effects are the same as before: we simply have more precision in our
# testing now than we did before.
# * Remember, we are using an EXTREMELY liberal criterion to remove 3 out of 53 participants. In
# your own data, you would only want to do this if you had a VERY good reason.
plot(d$skiing, d$fitness)
plot(d$veggies, d$fitness)
cor(d$skiing, d$veggies, use="pairwise.complete.obs")

# 11.
# We were interested in seeing whether cross-country skiing activity and vegetable consumption
# reliably predicted health assessed at the end of the winter. We removed three participants
# with extreme influence on the data, two of whom were avid cross-country skiers who skied much
# more than any of the other participants. We observed the predicted relationships with skiing
# and vegetables, such that both skiing activity, F(1,47) = 288.4, p < .001, and vegetable
# consumption, F(1,47) = 150.65, p < .001, predicted better health.
# While we observed the same relationships before removing the outliers, the decision to remove
# them improved our R squared from .5658 to .8507, increasing the variance explained by 28.5
# percentage points.

# 12.
m2 = lm(fitness ~ veggies, data=d)
modelSummary(m2, t=F)
pX = data.frame(veggies = seq(min(d$veggies), max(d$veggies), length=50), skiing=mean(d$skiing))
pY1 = modelPredictions(m1b, pX)  # covariate model refit on the cleaned data
pY2 = modelPredictions(m2, pX)
plot1 = ggplot(aes(x = veggies, y = fitness), data=d)
plot1 = plot1 + geom_point(color='grey40', position=position_jitter(w=.3, h=.3)) +
  geom_smooth(aes(y = Predicted, ymin = CILo, ymax = CIHi), data=pY1, stat='identity',
              color='limegreen') +
  geom_smooth(aes(y = Predicted, ymin = CILo, ymax = CIHi), data=pY2, stat='identity',
              color='darkgreen')
plot1 = plot1 + theme_bw() + labs(x = 'Daily vegetable servings', y = 'End-of-winter fitness')
plot1
# The veggies slope is slightly steeper without the covariate included. If skiing and veggies
# were more strongly correlated, the covariate-adjusted line would be even flatter.
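# Aside (not part of the graded key): a minimal base-R sketch of the statistics that the
# modelCaseAnalysis() calls above wrap: hat values (leverage), studentized residuals
# (regression outliers), and Cook's D (influence). The data here are simulated, and the
# cutoffs shown are common rules of thumb, not the only defensible choices.

```r
set.seed(1)
n = 50
d2 = data.frame(skiing = rpois(n, 10), veggies = rpois(n, 3))
d2$fitness = 2 + 0.5*d2$skiing + 1.5*d2$veggies + rnorm(n, sd = 2)
d2$fitness[14] = 60  # plant one extreme case for illustration

m = lm(fitness ~ skiing + veggies, data = d2)
k = length(coef(m)) - 1  # number of predictors

hv = hatvalues(m)       # leverage: distance from the centroid of the predictors
sr = rstudent(m)        # studentized residuals: misfit of each case
cd = cooks.distance(m)  # influence: leverage and misfit combined

which(hv > 3 * (k + 1) / n)  # high-leverage cases (3 x the average hat value)
which(abs(sr) > 2)           # candidate regression outliers
which(cd > 4 / (n - k - 1))  # influential cases
```

# Running a sketch like this alongside modelCaseAnalysis() is a good way to check that you
# understand which statistic each Type argument is plotting.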