1 Introduction to R
1.1 Transitioning to R
Moving to R from other statistical software generally requires a fundamental shift in the way we think about and interact with data. Aside from this shift in thinking, there is also a substantial amount of code to learn, which can be both frustrating and intimidating. The primary goal for this chapter is to make this shift less intimidating and the learning curve less steep.
We will focus primarily on three components: data processing (munging/manipulating/wrangling), data visualization, and reproducible workflows. We will be oriented around the philosophy of tidy data and, as such, primarily rely on tools within the tidyverse for manipulating and visualizing data. The tidyverse is a suite of packages developed by RStudio, generally led by Hadley Wickham, which are all optimized for tidy data. The focus of this course is on working with R, as opposed to the specifics of any given analysis. Statistical models will be used for illustrative examples throughout the chapter, but high-level statistical knowledge (e.g., multilevel modeling or structural equation modeling) is not a prerequisite.
1.2 Installing R
The installation process is straight forward. Download R and RStudio (Windows, Linux and Mac OS X), run the files and follow the instructions to install them (install R first and RStudio second). Once both installed, open RStudio, go to File > New File > R Markdown to create a new RMarkdown. There are many other types of files but we will mainly be using RMarkdown in this book. I will give you few minutes to complete these steps. If you get any errors or warnings, please look them up. Looking up errors is 90% of coding especially in the early stages of learning, the percentage starts decreasing as you become more fluent in the coding language. In looking up errors, I mostly find the solution in Stackoverflow.
All done? Perfect! Let’s talk about the R Markdown interface. It is divided into 4 sections: (1) upper left: the editor where you write your code, (2) upper right: the environment history where you will see you variables, and other details about your data and history of your edits, (3) lower left: the console where you see the execution of your code, and (4) lower right: the window that shows your files (databases and code files), the plots you generated, the packages you are using, a Help tab if needed (we will discuss later how to use help), and a Viewer tab. The editor section is the one you will use the most.
1.3 Programing in R
OK, let’s get started! First you need to import your data into R using the import function from the rio package, so make sure you install rio and call the library rio. Wait! What does all this mean? Great question. Before we get to how you import data, let’s clarify some key concepts in order to speak the same language.
Functions are a simple If Then
code. When you set your alarm, you are basically using the function: If DD/MM/YYYY HH:MM:SS = X Then Play Song Y
. Packages are primarily a collections of R functions. In addition to functions, packages can also contain data and some other details, all compiled in a well-defined format, created to add specific functionality. There are 10,000+ user contributed packages and growing. Library is the directory where the packages are installed. Does that mean installing R is nothing without its libraries and packages? Not really, there are a set of standard (or base) packages which are considered part of the R source code and automatically available as part of your R installation. Think of R as a scientific calculator, the base functions as the +, - operations, the base packages as the mean or standard deviation calculations, the additional packages as installed programs to conduct specific regressions like fixed effects or a Poisson model. One other important about libraries is the difference between installing and loading libraries. Think of a library as book. Installing it is like buying a book and putting it on your shelf. Each time you want to reach the book you still need to get the book off the shelf, that’s what calling a library is.
Let’s install then call our first library
rio. This is often the library used for data import and export.
install.packages("rio") #Installing rio. You only need to do this step once
library(rio) #Calling rio. You need to do this each time you create a new R Markdown file.
In writing code, you would probably have your first code chunk that calls all the libraries you will be using and install the ones that are not installed yet. I realize we did not talk about code chunks. It is basically the part of editor section that contains your code. All code chunks start and end with ``` – three backticks or graves. But you can also just insert it clicking on the Insert button you see in the middle of your editor.
Finally, we will now import data. There are many ways you can this, but I personally prefer importing from the same folder what my R Markdown is saved. This helps me avoid any confusion with working directories. I usually never run into problems using this method. Basically, for every project create a folder, put all your data in that folder, and save your .rmd file in it (.rmd is the extension of an R Markdown file). If you this you should be able to see your datasets if you click on Files on the lower right section of the interface. Using this method, all you need to do when importing or reading your data files is to write the data name.
Here, I am importing an spss dataset (.sav) called ‘sesame13’. I then store the data imported in an object called sesame13. Notice that I also set the class to be "tbl_df"
which returns a tibble (i.e., tibble::tibble()
) instead of a standard data frame, which mostly just has nicer printing.
The next thing we want to do is take a look at the data but without loading all of it. We use the function head
to look at the first 6 lines. You can also use tail
to look at the bottom 6 lines.
## # A tibble: 6 x 20
## id site sex age viewing setting treatmen prebody prelet preform
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 1 66 1 1 1 16 23 12
## 2 2 1 2 67 3 1 1 30 26 9
## 3 3 1 1 56 3 1 0 22 14 9
## 4 4 1 1 49 1 1 0 23 11 10
## 5 5 1 1 69 4 1 0 32 47 15
## 6 6 1 2 54 3 1 0 29 26 10
## # … with 10 more variables: prenumb <dbl>, prerelat <dbl>, preclass <dbl>,
## # postbody <dbl>, postlet <dbl>, postform <dbl>, postnumb <dbl>,
## # postrel <dbl>, postclas <dbl>, ppvt <dbl>
1.4 Data cleaning and processing
We will now move to cleaning our data starting with variable names. Often variable names have capital letters or spaces which are not ideal for our limited cognitive ability so it is easier to make all variables lower capital and no spaces. That’s what clean_names()
does. This function is part of the janitor
library so we will need to call it.