############################
#### Lab 6 Exercise Key ####
############################

#### PART 1 ####

# 1.
# 2-3. The a priori criterion is good, but one participant just barely violated that time cutoff.
# I'd want to know where the rest of the participants were in duration.

# 2.
# 3-4. This doesn't speak to whether they thought those completion times were unreasonable.

# 3.
# 1-2. The proper reaction to this is probably sympathy for the stressed-out grad student.

# 4.
# 5. This is EXTREMELY concerning.

# 5.
# 4-5. You should see this as being functionally pretty much the same as number 4. In general,
# data-driven approaches to removing outliers are bad news.

# 6.
# 1-2. If their claim is backed up by what they write in the paper, this is probably OK. It should
# have some kind of external validity, though.
# They could ask the authors to report how the results would change if the outliers remained
# in the model, and ask to see any pre-registration documentation.

#### PART 2 ####

# 1.
d = read.csv("Lab6Ex_skivegdata.csv")
varDescribe(d)
str(d)

# 2.
m1 = lm(fitness ~ skiing + veggies, data=d)
modelSummary(m1, t=F)
# Both vegetable consumption, F(1,50) = 18.2, p < .001, and skiing, F(1,50) = 51.346, p < .001,
# predict health at the end of the winter, together explaining 56.6% of the variance in fitness
# scores, R2 = .5658.

# 3.
hats = modelCaseAnalysis(m1, Type="HATVALUES")
hats
d[hats$Rownames,]
# 22 and 33 have extreme skiing scores; 49 has an extreme veggies score.

# 4.
resids = modelCaseAnalysis(m1, Type="RESIDUALS")
resids
d[resids$Rownames,]
# 14 has very high fitness despite never skiing and averaging 0 servings of vegetables.
# 33 has lower fitness than would be expected for someone who has skied so much.
# 49 has lower fitness than would be expected for someone who consumes so many vegetables.

# 5.
cooks = modelCaseAnalysis(m1, Type="COOKSD")
cooks
d[cooks$Rownames,]
# Nope, they're the same as the regression outliers.

# 6.
inf = modelCaseAnalysis(m1, Type="INFLUENCEPLOT")
inf
d[inf$Rownames,]

# 7.
# Participant 22, like 33, skis much more than the average participant (84 times).
# However, this person also has a very high fitness score, in line with what we would
# predict for this amount of skiing. Thus, despite the high skiing score, this participant
# fits in with the trend of the data.

# 8.
# I would probably tell them not to remove any of the participants, because the effects are
# highly significant even with them included.

# 9.
# Could have asked avid cross-country skiers not to sign up for the study.
# Could have asked vegetarians not to sign up for the study.

# 10.
d = dfRemoveCases(d, c(14,33,49))
m1b = lm(fitness ~ skiing + veggies, data=d)
modelSummary(m1b, t=F)
modelSummary(m1, t=F)
modelEffectSizes(m1b)
# Many of the individual statistics have changed, as have the effect sizes, but the direction and
# interpretation of the effects are the same as before: we simply have more precision in our
# testing now than we did before.
# * Remember, we are using an EXTREMELY liberal criterion to remove 3 out of 53 participants. In
# your own data, you would only want to do this if you had a VERY good reason.
plot(d$skiing, d$fitness)
plot(d$veggies, d$fitness)
cor(d$skiing, d$veggies, use="pairwise.complete.obs")

# 11.
# We were interested in seeing whether cross-country skiing activity and vegetable consumption
# reliably predicted health assessed at the end of the winter. We removed three participants
# with extreme influence on the data, two of whom were avid cross-country skiers who skied much
# more than any of the other participants. We observed the predicted relationships with skiing
# and vegetables, such that both skiing activity, F(1,47) = 288.4, p < .001, and vegetable
# consumption, F(1,47) = 150.65, p < .001, predicted better health.
# While we observed the same relationships before removing the outliers, the decision to remove
# them improved our R squared from .5658 to .8507, increasing the variance explained by 28.5
# percentage points.

# 12.
m2 = lm(fitness ~ veggies, data=d)
modelSummary(m2, t=F)
pX = data.frame(veggies = seq(min(d$veggies), max(d$veggies), length=50), skiing=mean(d$skiing))
pY1 = modelPredictions(m1b, pX)  # covariate model refit on the cleaned data
pY2 = modelPredictions(m2, pX)
plot1 = ggplot(aes(x = veggies, y = fitness), data=d)
plot1 = plot1 + geom_point(color='grey40', position=position_jitter(w=.3, h=.3)) +
  geom_smooth(aes(y = Predicted, ymin = CILo, ymax = CIHi), data=pY1, stat='identity',
              color='limegreen') +
  geom_smooth(aes(y = Predicted, ymin = CILo, ymax = CIHi), data=pY2, stat='identity',
              color='darkgreen')
plot1 = plot1 + theme_bw() + labs(x = 'Daily vegetable servings', y = 'End-of-winter fitness')
plot1
# The veggies slope is slightly steeper without the covariate included. If skiing and veggies
# were more strongly correlated, the covariate-adjusted line would be even flatter.
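# Aside (not part of the graded key): a minimal base-R sketch of the statistics that the
# modelCaseAnalysis() calls above wrap: hat values (leverage), studentized residuals
# (regression outliers), and Cook's D (influence). The data here are simulated, and the
# cutoffs shown are common rules of thumb, not the only defensible choices.

```r
set.seed(1)
n = 50
d2 = data.frame(skiing = rpois(n, 10), veggies = rpois(n, 3))
d2$fitness = 2 + 0.5*d2$skiing + 1.5*d2$veggies + rnorm(n, sd = 2)
d2$fitness[14] = 60  # plant one extreme case for illustration

m = lm(fitness ~ skiing + veggies, data = d2)
k = length(coef(m)) - 1  # number of predictors

hv = hatvalues(m)       # leverage: distance from the centroid of the predictors
sr = rstudent(m)        # studentized residuals: misfit of each case
cd = cooks.distance(m)  # influence: leverage and misfit combined

which(hv > 3 * (k + 1) / n)  # high-leverage cases (3 x the average hat value)
which(abs(sr) > 2)           # candidate regression outliers
which(cd > 4 / (n - k - 1))  # influential cases
```

# Running a sketch like this alongside modelCaseAnalysis() is a good way to check that you
# understand which statistic each Type argument is plotting.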