##############################
#### GENERAL LINEAR MODEL ####
#### PSYCH 610            ####
#### Week 1: Intro to R   ####
##############################

# 2015: Tammi Kral
# 2016: Adrienne Wood
# 2017: Mitchell Campbell

####################################################################################
##### 1. Basics - R is a language that you use to communicate with the computer ####
####################################################################################

# Pound sign ('hashtag' for you young people) is used for commenting.
# All text on a line after the pound sign is a 'comment' or 'translation' that your computer will not read.
# Just as when you study a foreign language (say Greek), you may write down some notes for yourself in your
# mother tongue, which won't be understood by the native Greek speakers.

# If you put 4 pound signs before and after a line of text (like in our section titles), RStudio will
# add it to the dropdown menu at the bottom of the script window (HANDY!).

# We write our code in a script that is saved as a .R file (like this one). This allows you to rerun the same
# set of commands and analyses without typing out the same code (if you've used SPSS, it's like writing something
# in Syntax and saving the syntax file).

# When you are using RStudio, when you 'run' your code, it appears, along with any
# output or error messages, in the Console (below).

#######################################################################################################
##### 2. Arithmetic - R can be used as a fancy calculator, but here you use 'enter' instead of '=' ####
#######################################################################################################

# Use spaces to clarify

# Inequalities

# Common rules for R language (make a note for these):
# 1.
# 2.
# 3.

############################
##### 3. Writing R code ####
############################

# R works by storing information in variable names.
# To store something in a variable, use the assignment operator, '='

# You may also see some scripts that use '<-' for assignment. For our purposes these are equivalent, but
# we will use '=' throughout the course. (Mitch will probably accidentally use '<-' on occasion out of habit)

# Check the workspace (on the upper right in RStudio).

# You can think in this way: every time you hit 'enter' after typing something, you're asking R
# 'what is ...(the something that you just typed)'.

# If we want our object's value to be non-numeric, we can make it a string using
# single or double quotes (note that I can write over my original value for variable a)

# We will use single quotes in this course, though double quotes aren't technically wrong.

# If your object is currently categorized as a string but you want to treat it numerically,
# you can convert it (as long as it consists of digits)

# This will come up sometimes when you import data files.
# Any "as.[something]" command basically tells R to convert this variable/object to type [something],
# not whatever it currently is.

# We can store different types of data as a variable, in addition to single numbers. For instance, vectors:
# of numbers

# or strings

# or other vectors

# INDEXING:
# We can then call a subset of one of our vectors using brackets. These brackets
# will come in handy later.

#############################################
#### 4. Setting your 'Working Directory' ####
#############################################

# Unless you give it a full path, R looks for files in your current 'Working Directory.'
# By default, when you start RStudio, you will be in your computer's 'home directory.'
# Most of the files that you create, work with, and care about will be contained in folders
# further down your directory tree than 'home.' So, run the command
# below and make a note of what your computer's home directory is.
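
# A minimal sketch of that command, assuming a fresh RStudio session: getwd()
# ('get working directory') takes no arguments and returns the path of the
# directory R is currently working in. The name 'startDir' is our own, used
# only so that we can print the result.
startDir = getwd()
print(startDir)
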
# Set working directory from Graphical User Interface (GUI, pronounced 'gooey')
# 'GUI' is computer jargon for point-and-click menus and buttons
# Session -> Set Working Directory -> Choose Directory... (ctrl+shift+H on Mac)
# OR
#
# In the lower right corner of your screen, select the Files tab and navigate to the
# directory you want to work from (i.e. read data from / save scripts and results to).
# Then click the 'More' button, and choose 'Set as working directory'.

# Both methods generate a command of the form setwd('path/to/working_directory')
# and run it in the console to change your working directory. If you save your files for this class
# in the same folder, it might be useful to include this line at the beginning of your scripts to make
# this easier.

# Another fun trick: If you drag and drop your .R file from the directory (i.e., folder) you want to
# be working in, RStudio will automatically set your working directory to that folder.

# The console also tells you what directory you're working in. It's the gray text directly to the right
# of the word 'Console'

# If you already know the path to your file, and don't need to look it up in the GUI,
# you can just write the command yourself.

#######################################################
##### 5. Download data from course website, Week 1 ####
#######################################################

# The file is Lab1_Data.dat
# Move data file to your 'working directory'

##########################
#### 6. Packages in R ####
##########################

# R is an open-source programming language, which means anyone can write functions in R. These
# functions are put together into packages, which you can download and use in your own R scripts.
# (You can also write your own functions, which you may dabble with some in this course and beyond.)

# Things related to packages are contained in the 'Tools' menu of RStudio.
# As with the working directory, though, most of these tasks can also be done with R commands
# (albeit in a somewhat less user-friendly way).

# If you have not already, run this command to get the packages
# that you will need for this course.
listOfPackagesToInstall = c('lmSupport', 'car', 'foreign', 'MASS', 'psych', 'effects', 'mediation')

# quick in-lab review: What did we just do in that command?

# These packages contain most of the commands we need to use in this course. You can load a package like this:

# What does "loading a package" do?

# Helpfully, once a package is loaded, RStudio will autocomplete when you type a command, and even provide the arguments
# you need to give the command for it to run (we'll explain more later). Select a command from the drop-down menu and hit 'enter.'

# If one of your .R files uses commands from a given package, include library([packagename]) at the beginning of your script
# to save yourself time in the future and prevent confusion if and when the file is run by your TAs, advisor, etc. Try
# not to get in the habit of loading a bunch of libraries you won't actually use.

# If you don't know what package a given command is in, Google it! Try 'R command (command name) package'

#### GOOGLING INTERLUDE ####
# Learning how to Google things effectively is one of the skills you should gain in this class. There are an insane
# number of R resources on the web, from StackOverflow to R documentation to a variety of random websites. We'll cover
# this a good deal in this class.

#################################
##### 7. Reading data into R ####
#################################

# Data come in many different formats. Most of these formats are basically spreadsheets,
# with a special character that separates each column of the spreadsheet. This special
# character is called a 'delimiter'.
# Two of the most common formats are 'tab-delimited' data
# (which usually end with '.dat', as in 'data.dat') and 'comma-delimited' (which usually
# end with '.csv', as in 'data.csv').
# We will cover reading data into R in different formats (e.g., .xlsx) later in the semester.

# Tab-delimited data files are read with the command dfReadDat([filename]).

# 'd' should now magically appear in your Environment pane

# NOTE: we often refer to dataframes with the variable name 'd'. This is arbitrary. Whenever you see 'd'
# in a line of code (for instance, 'data=d'), remember that our dataframe could also be called 'Larry'
# and it would work:

# It will be extremely important for you to name things in R in a clear, logical way (not just for you but
# also for others who will see your scripts, including your TAs, advisors, people who request your data files, etc.).
# So even though you can name anything whatever you want, it would be very unwise to pick names arbitrarily.
# There will be a number of naming conventions we will ask you to follow in this course.
# What you do after this class is totally up to you, but your life will be made easier by
# using consistent naming schemes that are not only clear to you, but also to other individuals.

####################################################################
##### 8. Help!!! What if I don't remember how to use a command? ####
####################################################################

# There is a help file for every function we'll be using.
# To access these help files, use this command:

# This will pop open a help file in the 'Help' tab of your bottom right pane. This describes the function,
# provides a list of arguments with descriptions, tells you what kind of output you'll get, provides examples,
# and also tells you who wrote the command (does this name look familiar?).

# You will also notice in the top left of this window, next to the name of the command, R tells you the package
# this command belongs to.
# This R documentation page is the same one that will often come up when you do a Google
# search, so you can always look for the package in that location.

####################################
##### 9. Working with your data ####
####################################

# First we can get some descriptive statistics from the data

# Data are typically stored in R using things called 'data frames'. As already mentioned, data frames are basically
# represented as spreadsheets in R: they have rows and columns, and both the rows and the columns
# can be labeled. By convention, each row in a data frame is usually a participant, while each
# column is typically a variable.

# Like anything else, if you enter the name of a data frame in the console, you'll see its contents

# You can also click the name of the data frame in the Global Environment window (to the right) or enter the following:

# This provides a nice spreadsheet view, but can flood the screen if there are too many data.
# Fortunately, it is easy to summarize and get snapshots of your data in R.

# What if we wanted the first ten rows instead of the first 6?

# What if you want to look at only one variable?

# You may want to 'describe by' the levels of a second variable. Simple!

# Can we guess from this which value of d$Gender is male and which is female?

# NOTE: We will often use gender as a dichotomous variable in this class. We are aware that this does not
# reflect scientific understanding of gender (which is non-binary). This more reflects it being a simple
# way to roughly halve a population and its historical presence as a dichotomous variable in a whole host
# of prior research. We'll make an effort to use different dichotomous variables more frequently to stop
# reinforcing the gender binary.

#####################################
##### 10. Creating new variables ####
#####################################

# We can create new variables from our current variables.
# When you do something (e.g., subtract 5) to a variable like d$Weight,
# R will perform the subtraction on the Weight in every row of our dataframe.

# However, transformations like this are temporary and don't affect our variable (d$Weight)
# unless we use the 'assign' symbol to create a new variable. Now this new variable is in
# our dataframe.

# How do we mean-center a variable?

# Now you see there is a new variable in the data frame

# How can we check our work?

# Create a mean-centered Height score also.

# Create standardized Height variable: zScore = (x-mean(all x's)) / SD(all x's)
# Standardized score = (score - mean) / standard deviation

# Can you compute a standardized score for Weight? Consult with your neighbor and see if you get the same result.

# Now let's create another variable for 'BodyBuild', simply defined as the mean of standardized weight and height
# Before you run this make sure that you have created the standardized Weight score!
# If you run the command WITHOUT creating d$WeightZ first you will NOT get an error - but can you see how the output is wrong?

# This command is saying what?
# The last part of the command 'na.rm=TRUE' tells R that you want to ignore values that are missing, i.e. 'NA'
# It is VERY important to label missing values in your data files as 'NA' rather than setting them to zero
# If missing values are coded as zeros, you may not realize it, and if they are averaged, for example,
# they will skew the means in which they are included

# NOTE: the practice in this class will be to give variables capitalized names. Any suffixes (e.g., 'C') should also be capitalized.

########################################################
##### 11. Mean-Center Dichotomous Variable 'Gender' ####
########################################################

# View values of d$Gender.

# R recognizes this as an integer variable, but we recognize it as a categorical variable.
# There are different ways to deal with categorical data in R.
# For now we are going to go
# along with R and treat it as numerical data. We want to change Gender so that all the 0s
# become -.5 and all the 1s become .5.
# This is called 'recoding'

# This is saying what?

# Let's look at the new variable d$GenderC

# Why can't we just center as we did the continuous variables?

###############################
##### 12. Bracket Notation ####
###############################

# Bracket notation is a flexible way to obtain different cross-sections and subsets of a data frame or output.
# Items before the comma refer to specific ROWS of d. Items after the comma refer to specific COLUMNS.
# So d[,'GenderC'] calls the contents of all the rows (there is nothing before the comma) in the column called 'GenderC'

# Why would we want to use bracket notation? Bracket notation can be very helpful for viewing specific columns of the data frame d

# Now we want to check whether we centered Gender successfully.
# Bracket notation can be used to look at multiple variables at once.
# View d$Gender and d$GenderC side by side

# When we put nothing before the comma, R assumes that we want to see all of the rows.

# Now you try using brackets to view all the columns for rows 1 and 2.

# What is the gender (centered) of participant 13?

##########################
##### 13. Subset Data ####
##########################

# What if we want to look at just a subset of our data, for example within each gender separately?
# We will need to create a new data frame with only those participants. We can do this using bracket notation.
# Here we are selecting all the rows of our data frame 'd' where the Gender variable is set to 1

# You can include more than one rule in your subsetting command. Say you want men who are above 66 inches in height:

# or participants who are below 62 inches in height and/or below 150 pounds

# These commands use the bracket notation described in the previous section.
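
# A minimal sketch of the three subsetting rules above, run on a made-up toy data
# frame rather than the course data. ASSUMPTIONS: 'dToy' and its values are invented
# for illustration, and we arbitrarily treat Gender == 1 as 'men' here; the coding
# in the real data file may differ.
dToy = data.frame(
  Gender = c(0, 1, 1, 0, 1),
  Height = c(60, 70, 65, 64, 68),      # inches
  Weight = c(120, 180, 160, 145, 170)  # pounds
)

# One rule: all rows where Gender is 1 (rule before the comma, all columns after it)
dMen = dToy[dToy$Gender == 1, ]

# Two rules that must BOTH be true: '&' means 'and'
dTallMen = dToy[dToy$Gender == 1 & dToy$Height > 66, ]

# Two rules where EITHER may be true: '|' means 'or'
dSmall = dToy[dToy$Height < 62 | dToy$Weight < 150, ]

# Count the rows that survived each rule
nrow(dMen)
nrow(dTallMen)
nrow(dSmall)
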
# What is the significance of the comma, and why does it occur AFTER the "rule(s)?" Discuss this with your neighbor.
# If you get an error with this command it is likely you have a missing or misplaced comma.

# Now with your neighbor, create a new dataframe called dLarge that includes all the people taller than
# the mean height and/or heavier than the mean weight.

# NOTE: the practice we will use in this class is to name dataframes this way (i.e., dCapitalizedAddition)

###############################
##### 14. Basic plot types ####
###############################

# Histograms

# Basic scatter plot (x-axis data, y-axis data)

# Label graph title (main), x-axis (xlab), and y-axis (ylab)

# Include regression line to describe the relationship between height and weight

# You can save this plot by clicking on the 'Export' menu above the plot, and selecting a file type, location and name

###################################################
#### 15. Removing objects from the environment ####
###################################################

# You're probably going to have the experience where you are either working on multiple R files one right after
# the other or opening and using multiple different dataframes within a single script.
# If you try to use a dataframe that isn't loaded, you'll get an error.
# However, the much bigger problem occurs when you accidentally start mixing together dataframes. For example, if you're
# working from dataframe 'd2' but previously used 'd1', you might accidentally call a d1 variable and get your dataframes
# confusingly mucked up (especially if you're copy-pasting). For these reasons, we suggest 2 best practices:
# 1.
# 2.

# note: even if you're really good about doing this, you should still name your data frames different things within a given script.
# For example, you could specify them by their download date (e.g., "d905" and "d906")

# That is all for the first lesson in lab!
# This was a lot of information,
# and we do not expect that you have really internalized much of it yet.
# Rather, we expect most of you will struggle with R at first.
# That's fine; you aren't alone! Ask lots of questions, keep trying things,
# and feel free to email your TAs. And do drop by our office hours.
# See the syllabus for our contact info.

#########################################
##### In Class Assignment: Begin HW1 ####
#########################################

# You can find the homework on the course Wiki.
# Feel free to work in groups, talk to those around you, and ask your TAs questions.
# HOWEVER, try to work through the questions INDEPENDENTLY, as this is for your own learning experience!!
# The purpose of this first assignment is not to stress you out, but to get your feet wet.
# Don't forget to do the assigned readings; there are questions on the HW about them.
# Also, please remember that in future weeks, the homework will not generally be available until 5 pm on Friday.
# This week is an exception.