Questions

13.2 Homework 1

  • I think I’m still a little confused about the idea of bias - if you know how biased a model is, is bias really a problem? Take John’s example: would you rather have a watch that gave you variable time estimates but was spot-on on average, or a watch that was consistently one minute early? If you know the watch is one minute early, it seems like you can easily add some sort of procedure (e.g., mentally adding a minute to the time in my head) that negates the bias at no cost to variability
  • Is how the textbook uses the term “inference” the same as how the Westfall article uses the term “explanation”? To me the two terms felt synonymous, but I was not sure if that should be the correct takeaway
  • I am also confused about what is meant by the term prediction when we use it in class. Is prediction using any sort of model on a novel test dataset? Sometimes it felt like what might be meant by prediction is only classification models that are guessing the outcome (e.g., “yes” or “no”) for a novel test set. If GLMs are being used on testing data (after training), isn’t that prediction too?
  • I’d like to have a clearer picture of what you’d expect for the mid-term and final projects, but I imagine you’ll cover that later
  • It’d also be good if we could solve a systematic problem in each assignment using the skills that we acquire throughout the semester. This would reinforce the knowledge and improve our coding skills.
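
The watch example above can be checked with a quick simulation: if the bias is known, subtracting it removes the systematic error without touching the variance. A minimal sketch in base R (the watch means and standard deviations are made up):

```r
set.seed(1)
n <- 10000

watch_a <- rnorm(n, mean = 0, sd = 2)     # unbiased but noisy
watch_b <- rnorm(n, mean = -1, sd = 0.1)  # one minute early, very consistent

# Knowing the bias, "mentally add a minute": the shift removes the
# systematic error but leaves the variance exactly as it was
watch_b_corrected <- watch_b + 1

c(mse_a = mean(watch_a^2),
  mse_b = mean(watch_b^2),
  mse_b_corrected = mean(watch_b_corrected^2))
```

The corrected watch ends up with a far lower mean squared error than the unbiased-but-noisy one, which is exactly the point of the question.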

For later weeks:

  • Underlying mechanism of the set.seed() function
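
Briefly: set.seed() initializes R’s pseudo-random number generator at a fixed state (a Mersenne Twister by default), so the same seed reproduces the same “random” sequence; the internals of the generator can wait for later. A minimal demonstration:

```r
set.seed(42)
x1 <- rnorm(3)

set.seed(42)  # resetting the seed restarts the generator from the same state
x2 <- rnorm(3)

identical(x1, x2)  # TRUE: the "random" draws are exactly reproducible
```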

  • I am confused about how the KNN classifier iterates. During training, every case has a category label. Is each case’s label predicted from its neighbors one by one? How is accuracy calculated on the training data set? When we move to the test data set, do the test cases use the cases in the training set as their neighbors?

  • I would like to understand more about the K-Nearest-Neighbors approach and any assumptions/requirements it has as a model. Do you need to specify K, or will the model automatically find the optimal “K” for the training dataset?
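
To make these KNN questions concrete, here is a hand-rolled sketch in base R (the toy data and k = 5 are made up). Each test case finds its k nearest training cases and takes a majority vote of their labels; “training accuracy” is computed the same way, by predicting each training case from the training set (optimistically, since each case is then its own nearest neighbor). K is not estimated by the model; you choose it, typically by cross-validation.

```r
set.seed(1)

# Toy training data: two classes separated along one predictor
train_x <- matrix(c(rnorm(20, mean = 0), rnorm(20, mean = 3)), ncol = 1)
train_y <- rep(c("a", "b"), each = 20)
test_x  <- matrix(c(-0.5, 3.5), ncol = 1)

# For each new row: distances to ALL training rows, then a majority
# vote among the k nearest training labels
knn_predict <- function(newdata, train_x, train_y, k) {
  apply(newdata, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))
    votes <- train_y[order(d)[1:k]]
    names(which.max(table(votes)))
  })
}

knn_predict(test_x, train_x, train_y, k = 5)  # test cases use TRAINING neighbors

# "Training accuracy": predict each training case from the training set
# (each case is its own nearest neighbor, so this is optimistic)
train_pred <- knn_predict(train_x, train_x, train_y, k = 5)
mean(train_pred == train_y)
```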


We also got some positive feedback/comments. Several people thanked us for including this section, and several people said everything has been clear so far.

  • I really like that we started with the big-picture discussion of the general background, such as the reproducibility issue and the trade-off between bias and variance in data analysis. I think one of the biggest advantages of this course is that we don’t just learn practical, hands-on coding skills; we also come away with a better understanding of good practices in science and the pros and cons of each model under different application scenarios.

  • I think it is really necessary to read all the readings before coming to lecture, and I’ll keep doing that for future lectures. Otherwise I don’t think you can get as much from the lecture: the discussion focuses on the big picture, so you’ll lose track and won’t know how to connect the ideas to concrete, real problems.

13.3 Homework 2

13.3.1 Conceptual

  • I am still confused about the choice of dummy codes in interaction models. Is there any reason to use contrast codes in interaction models other than obtaining meaningful coefficients? Is there any rule for choosing between the different kinds of codes?
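
One way to see the difference is on a toy two-by-two design (all numbers made up): with treatment (dummy) codes, a lower-order coefficient is a simple effect at the reference level of the other factor, while with sum (deviation) codes it is a main effect averaged over the other factor; that is the usual reason to prefer contrast codes when interactions are in the model.

```r
set.seed(2)

# Balanced 2x2 design with a true interaction (made-up effects)
d <- expand.grid(a = factor(c("lo", "hi"), levels = c("lo", "hi")),
                 b = factor(c("lo", "hi"), levels = c("lo", "hi")))
d <- d[rep(1:4, each = 25), ]
d$y <- rnorm(100) + ifelse(d$a == "hi" & d$b == "hi", 2, 0)

# Treatment (dummy) codes: the 'a' coefficient is the simple effect of a
# when b is at its reference level ("lo")
m_dummy <- lm(y ~ a * b, data = d,
              contrasts = list(a = contr.treatment, b = contr.treatment))

# Sum (deviation) codes: the 'a' coefficient reflects the main effect of a,
# averaged over the levels of b
m_sum <- lm(y ~ a * b, data = d,
            contrasts = list(a = contr.sum, b = contr.sum))

coef(m_dummy)
coef(m_sum)
```

Comparing the two coefficient vectors on the same data makes it clear that the codes change the *meaning* of the lower-order terms, not the model's fit.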

  • I realized I have so little understanding of and experience with KNN vs. GLM that it limited my ability to play around with different models. For example, I didn’t know a meaningful way to graph things. I thought I understood it, because I understand what the graphs in section 2.4 show, but I wasn’t sure what (if anything) to plot when looking at many dimensions.

  • I’m not sure what the strata argument in initial_split() is actually doing
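
For what it’s worth, strata makes initial_split() sample separately within each level of the named variable, so the training and test sets keep roughly the same outcome proportions; this matters most with imbalanced outcomes. A base-R sketch of the same idea (the 70/30 split and class sizes are made up):

```r
set.seed(3)

# Imbalanced outcome: 90% "no", 10% "yes"
y <- factor(rep(c("no", "yes"), times = c(900, 100)))

# Stratified 70/30 split: sample WITHIN each level of y
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, size = 0.7 * length(i))))

prop.table(table(y))              # overall:   90% no / 10% yes
prop.table(table(y[train_idx]))   # training:  90% no / 10% yes
prop.table(table(y[-train_idx]))  # test:      90% no / 10% yes
```

An unstratified split of a rare outcome can, by bad luck, leave very few “yes” cases in the test set; stratifying guards against that.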

  • The main material that I find unclear is how to apply the ideas we learn about in class to the models we build in R. I think it would be helpful to start with a new dataset/question and work through it together in class, so we can go step by step through a sort of “general” processing pipeline

  • I found the math in the classification chapter to be tough to follow …

  • Could we go over again why we need to normalize X when doing KNN …?
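
A quick illustration of why normalization matters for KNN (the people and numbers are made up): Euclidean distance is dominated by whichever variable has the largest numeric scale, so an income difference in dollars can swamp an age difference in years unless the columns are standardized first.

```r
people <- rbind(
  p1 = c(age = 25, income = 50000),
  p2 = c(age = 60, income = 50100),  # 35 years older, income nearly identical
  p3 = c(age = 26, income = 90000)   # nearly same age, very different income
)

dist(people)
# Unscaled, income dominates: p2 is p1's nearest neighbor even though the
# 35-year age gap is enormous relative to typical ages

dist(scale(people))
# After standardizing each column, age and income differences contribute
# on comparable scales
```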

  • In class we were not able to go over the section comparing where R^2, MAE, or RMSE is most descriptive. I read over the notes but still do not understand this section.
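
For the metrics comparison, it may help to compute all three by hand on made-up predictions: RMSE squares the errors, so a single large error dominates it; MAE weights all errors linearly and is more robust to outliers; and R^2 rescales the squared error against a predict-the-mean baseline (and can even go negative).

```r
obs  <- c(3, 5, 2, 8, 7)
pred <- c(2.5, 5.5, 2, 9, 30)  # one badly wrong prediction

rmse <- sqrt(mean((obs - pred)^2))
mae  <- mean(abs(obs - pred))
r2   <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(rmse = rmse, mae = mae, r2 = r2)
# The single outlier inflates RMSE far more than MAE, and here it drags
# R^2 below zero: this model does worse than just predicting mean(obs)
```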

13.3.2 Homework expectations and structure

  • I found it quite difficult, and it took me a long time to figure out how to explore the data and get the different models to run. For a first homework assignment, the instructions were also a bit open-ended, making it really hard to gauge the expectations for the assignment. It feels equally likely that I did far too little or far too much on the assignment

  • In the future, I think it would be good to split the data exploration and the model-building across two weeks. The first week could cement dplyr skills, and the second week could see students building models in class with John. Building the models was fun.

    • I had a lot of difficulty deciding where to start, like what steps to take first in terms of cleaning up the data, making dummy codes, normalizing data, selecting features, etc.

13.3.3 The Homework

  • Data exploration issues
    • Taking care of NAs (I don’t usually have NAs in my own research data). In 610 and 710 we had a few lectures on how to get around the missing-value problem. Do the same ideas apply to fitting a predictive model?

    • Where should data exploration end? With so many variables, it can just go on and on

    • Data exploration strategies – I struggled with this quite a bit. Especially with this many variables. Any tips/online resources where I can look for ideas on initial data exploration?

    • But data preprocessing can be very troublesome and time-consuming, because it is very hard to decide what to do

    • It wasn’t a particularly challenging homework, but it was way too long. Data exploration takes a long time; I had no time to consider any real transformations or handling of extreme values, etc.

    • I found it quite difficult and it took me a long time to figure out how to explore the data and get the different models to run.

    • I liked how realistic it was. At the same time, I didn’t love how open-ended it was; I feel like I could have spent a full week playing around with different models endlessly, and it was sort of hard for me to decide when to stop. I guess that is also realistic?
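
On the NA question above: the ideas from 610/710 (complete cases, imputation) do carry over, with one predictive-modeling wrinkle: imputation values should be learned from the training data and then applied unchanged to the test data, to avoid leakage. A base-R sketch with a made-up variable and a simple median imputation:

```r
set.seed(4)
x <- rnorm(100)
x[sample(100, 10)] <- NA  # poke some holes in the data

train <- x[1:70]
test  <- x[71:100]

# Learn the imputation value on the TRAINING data only...
impute_val <- median(train, na.rm = TRUE)

# ...then apply that same value to both sets (no peeking at the test set)
train[is.na(train)] <- impute_val
test[is.na(test)]   <- impute_val

anyNA(c(train, test))  # FALSE
```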

  • R and data wrangling
    • The syntax of R is not nearly as intuitive as other languages, so that really slowed me down. I probably spent something like 25+ hours on this, as I wanted to learn more about the syntax of functions and the inner workings of tibbles and dplyr in general…

    • I think it would be beneficial for future classes to have a more rigorous introduction to working with dplyr in the first homework. …

    • I also think it would be great to encourage students to build functions to transform data. I found my code was unwieldy without them. I should have used them more. I think the notebook approach to programming discourages their use, but everything is just so much easier when a series of transformations are nested in one function you made.

    • I spent a lot of time fighting with programming logistics, at the cost of time spent thinking about the concepts. I think it would help future students to spend one week learning the tidyverse and another week practicing modeling, instead of combining the two. At least for me, getting comfortable with all of the new functions took up all my time, to the point where I did not care about the models; I just wanted something that could run and be turned in

    • I found the workings of the functions that make dummy codes confusing: they often generated variable names that were unintuitive or did not play well with dplyr and caret. This ended up being a simple fix through a janitor function, but perhaps it’s something to look out for? I also found it difficult to create interactions between subsets of variables without subsetting tibbles. The model.matrix() documentation is poor, so it’s kind of difficult to figure out how it works
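
On wrapping transformations in functions: one preprocessing function keeps a notebook manageable and guarantees that identical steps are applied to training and test data. A base-R sketch (the cleaning steps and column names are made up; janitor::clean_names() is the more thorough real-world option for the name cleanup):

```r
# One function that holds the whole preprocessing pipeline, so the exact
# same steps can be applied to training and test data
preprocess <- function(df) {
  # Clean awkward auto-generated names (lowercase, underscores only)
  nm <- tolower(gsub("[^A-Za-z0-9]+", "_", names(df)))
  names(df) <- sub("_+$", "", nm)
  # Standardize every numeric column
  num <- vapply(df, is.numeric, logical(1))
  df[num] <- lapply(df[num], function(col) (col - mean(col)) / sd(col))
  df
}

raw <- data.frame(`Sepal Length`    = c(5.1, 4.9, 6.2),
                  `Species (label)` = c("a", "b", "a"),
                  check.names = FALSE)
preprocess(raw)  # names become sepal_length, species_label
```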