Starting up

Before we start, you’ll need to open a new script in RStudio. Scripts aren’t necessary (it’s possible to just run R through the command line), but they are very helpful for remembering what you’ve learned or done.

One nice feature of RStudio is that it allows you to organize your scripts and files by “projects”, which is a useful way to maintain your scripts and not lose any of your hard work. For this mini-course, we’ll be using projects to organize our stuff.


Each script takes up minimal space on your computer. For the sake of project replication, it is much better to write new scripts for different aspects of a project than to overwrite a single script. This way when you come back to the project a month later, all you have to do is open up the script and run it (instead of figuring out what you were trying to do all over again).

To get started, select File > New File > R Script. We’ll name it now, with a name that is meaningful and will be easily understood by Future You. This is not as easy as it might seem! See Hadley’s style guide for some excellent and accessible style guidelines. I’ll name the file 1-basics.R, since it’s the first day of the mini course, and it’ll be covering some basics. I already have my R Mini Course project open, so this new script is automatically saved to that project.

Keeping code readable with comments

In R, you can (and should!) write comments in your scripts, to remind yourself why (not how) you’re doing what you’re doing. Comments are mainly to help Future You remember what is what in your script. Anything written after the # and before the next line break will be commented out and not evaluated by R. Comments can go on their own lines or at the end of a short line - whichever is most readable.

example_vector <- c(1, 2, 3)   # Make a vector containing the values (1, 2, 3)

Don’t be alarmed if you don’t know what that code snippet means - we’ll come back to it a little later in this lesson.

Comments are also used to help separate parts of a longer script into smaller sections. RStudio automatically turns comments that end in four or more dashes, equals signs, or pound signs into a navigation tool.

## Load data ==============



## Plot data ==============

As we go through the mini course, feel free to take notes right in your R scripts by making use of #! This way you’ll have everything in one place.

Simple operations in R

R as a calculator

A great starting point for thinking about R is that it’s like a calculator. When you type something into the prompt (the line starting with > below in the R Console), R will evaluate it and print out the answer. With basic math, this means that R functions as a calculator. Try it out on your own!

6 + 7 
## [1] 13
((2)^4 / 12) * 3
## [1] 4

Below is a table showing some of the basic mathematical operations that you can do in R:

Operation    Means
+            add
-            subtract
*            multiply
/            divide
^            raise to a power (2^4 is 2 raised to the 4th power)

Logical Operators

In addition to basic math, R can also evaluate the truth value of an operation. These expressions return a “logical value” (either TRUE or FALSE). TRUE and FALSE are also known as “boolean” values.

Some logical operations are shown below:

Operation    Means
==           exactly equal to
>            greater than
<            less than
>=           greater than or equal to
<=           less than or equal to

We can try it out:

1 < 3
## [1] TRUE

When evaluating whether two values are equal, we need to use a double equal sign:

1 + 1 == 4
## [1] FALSE
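
The remaining comparison operators from the table work the same way. A small sketch, with numbers chosen just for illustration:

```r
# Each comparison returns a single logical value
5 >= 5       # TRUE: 5 is equal to 5
2 <= 1       # FALSE: 2 is greater than 1
3 * 2 == 6   # TRUE: the arithmetic is evaluated before the comparison
```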

Variables

In R, you can store a value in a variable. This allows you to access the value later without having to remember it. To assign a value to a variable, we use the assignment operator <-.

Types of variables

There are different types of variables.

Numbers

x <- 42
y <- 10

Once variables have been assigned a value, you can use the variables in expressions in place of the underlying values:

x + y
## [1] 52
x > y
## [1] TRUE

Characters

Variables don’t have to be numerical values! We can also assign characters to a variable, or strings (a sequence of characters). Make sure to put your character strings in "" or '', otherwise R will try to evaluate it as a variable.

a <- "Hello"
b <- "world!"

We can now use these variables in the same way, and ask R to evaluate some expressions about them.

a == b
## [1] FALSE
a < b
## [1] TRUE

R compares strings alphabetically, not by their length (the exact ordering depends on your computer’s locale settings). Since "Hello" comes before "world!" in the alphabet, a < b is TRUE. Length doesn’t matter, so a longer string can still come before a shorter one [1]:

"abc" > "z"
## [1] FALSE

Boolean

This is a programming word meaning TRUE or FALSE (binary). Above, R produced boolean values as the outputs of some operations (e.g. a < b). Boolean values can also be stored as variables.

The usefulness of storing boolean values in a variable will become apparent once we’re working with vectors (lists of values).
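
As a small sketch of what that looks like (the variable name is_bigger is made up for this example), you can store the result of a comparison and use it again later:

```r
# Store the result of a comparison in a variable
is_bigger <- 42 > 10
is_bigger    # prints TRUE

# Logical values can be flipped with !
!is_bigger   # prints FALSE
```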

Working with variables

R gives you the value of a variable if you put the variable name into the prompt:

a
## [1] "Hello"

You can reassign a variable name to a new value at any time.

a <- 5
a
## [1] 5

Vectors

A vector is just a list of values.

Types of vectors

Just like we saw with variables, vectors can also be numbers, characters, or boolean values, BUT they all have to be of the same type. Note that since strings are just a sequence of characters, this means that strings and characters count as the same type. We can create a vector by combining a list of values. To do this, we use the combine function c(). We’ll go more into detail about functions below.

Numbers

number_vector <- c(1, 2, 19, 237, 571)

If you’re creating a vector that is a sequence of consecutive whole numbers, you can use the : operator:

sequence_vector <- 1:7   # the whole numbers from 1 to 7

Characters

You can also make a vector out of characters and strings:

character_vector <- c("z", "y", "x")
string_vector <- c("i", "scream", "for", "ice", "cream")

I use character vectors a lot. Let’s say you have a data set tracking what flavors of ice cream you’ve eaten in the past week. Chocolate is obviously your favorite, so you eat that flavor most of the days. We could represent this data by numbers (1 = “chocolate”, 2 = “pistachio”, 3 = “vanilla”), but this would be a bad idea for three reasons.

  1. It is hard to remember which number stands for which flavor. It’s always better to use labels that are easy to interpret.

  2. Using numbers implies that there’s an order between the flavors. But this is silly - discrete flavors don’t come in orders. This becomes especially important in regression analysis, where R will treat number-coded variables as continuous numeric values.

  3. Using numbers implies that they’re numbers. But "chocolate" + "pistachio" != "vanilla".

my_dessert <- c("chocolate", "vanilla", "chocolate", "chocolate", "chocolate", "pistachio", "vanilla")
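
One payoff of readable labels is that a quick tally is easy to interpret. The sketch below uses the base R function table(), which isn’t otherwise covered in this lesson, to count how often each flavor appears:

```r
my_dessert <- c("chocolate", "vanilla", "chocolate", "chocolate",
                "chocolate", "pistachio", "vanilla")

# Count how many times each flavor occurs
flavor_counts <- table(my_dessert)
flavor_counts
# chocolate appears 4 times, pistachio once, and vanilla twice
```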

Boolean

Boolean values are a useful way of comparing two sets of data. Let’s say you have two vectors (a and b), and you need to know when a is bigger than b. The logical operation a > b will give you the vector of answers, which you can store in its own variable (here, c) for later reference.

a <- c(1, 4, 63, 3, 76, 12)
b <- c(2, 4, 3, 11, 434, 5)
a > b
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE
c <- a > b

Mixing vector types

What happens if you try to mix types?

c(571, "ice cream", FALSE)
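
If you run the line above, R does not throw an error. Instead it quietly converts every value to the most flexible type present, which here is character. A short sketch of this behavior (the variable name mixed is just for illustration):

```r
mixed <- c(571, "ice cream", FALSE)
mixed          # all three values are now character strings
class(mixed)   # "character"

# Without any strings, logical values are converted to numbers instead
c(571, FALSE)  # FALSE becomes 0
```

This is why a single stray string in a column of numbers can turn the whole vector into text.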

Vector math

In R, you can do arithmetic on vectors. If we want to add 2 to our numeric vector from above, that’s easy:

number_vector + 2
## [1]   3   4  21 239 573

Beyond operations with a single number, we can also do math with two vectors. Let’s use a and b from above. In R, vector math is member-wise, which means that operations are performed member-by-member. The operation below takes the first member of a and divides it by the first member of b, and so on.

a / b
## [1]  0.5000000  1.0000000 21.0000000  0.2727273  0.1751152  2.4000000

Vector math is something I end up doing a lot. If I have a data set that contains speakers’ ages at the time of the interview, and the year the interview was conducted, I can easily figure out their year of birth by doing year of interview - age at interview.
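
Here is a minimal sketch of that calculation. The ages and interview years are invented for illustration:

```r
# Hypothetical data: one entry per speaker
age_at_interview  <- c(34, 61, 19, 45)
year_of_interview <- c(2012, 2012, 2014, 2015)

# Member-wise subtraction gives an (approximate) year of birth for each speaker
year_of_birth <- year_of_interview - age_at_interview
year_of_birth
## [1] 1978 1951 1995 1970
```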

The recycling rule

If two vectors have different lengths, the shorter one will be recycled until the lengths match. It’s pretty rare to intentionally do vector math on different-length vectors in your own data (usually this only occurs if one of your vectors is missing data), so R prints out a warning message when the longer vector’s length is not a multiple of the shorter one’s.

long_vector <- c(1, 2, 3, 5, 8, 11, 17, 28, 45, 73)
short_vector <- c(2, 4, 6)

long_vector * short_vector
## Warning in long_vector * short_vector: longer object length is not a
## multiple of shorter object length
##  [1]   2   8  18  10  32  66  34 112 270 146
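
When the longer vector’s length is an exact multiple of the shorter one’s, R recycles without any warning. A small sketch with made-up vectors:

```r
six_values <- c(1, 2, 3, 4, 5, 6)
two_values <- c(10, 100)

# two_values is recycled three times: 10, 100, 10, 100, 10, 100
six_values * two_values
## [1]  10 200  30 400  50 600
```

This is also what happens in number_vector + 2: the single value 2 is recycled for every element.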

Not only is vector math member-wise, but evaluations of logical expressions can also be member-wise.

In addition to the logical operators given above, there are additional operators that are very useful on the vector-level:

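Operation    Means
&            and (member-wise)
&&           and (for single values; before R 4.3.0, used only the first element of a vector)
|            or (member-wise)
||           or (for single values; before R 4.3.0, used only the first element of a vector)
!x           not x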
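
As a sketch of how the member-wise operators behave, here they are applied to the vectors a and b from earlier (redefined so the example runs on its own):

```r
a <- c(1, 4, 63, 3, 76, 12)
b <- c(2, 4, 3, 11, 434, 5)

# TRUE where a is bigger than b AND a is above 10
(a > b) & (a > 10)
## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

# TRUE where a is bigger than b OR the two values are equal
(a > b) | (a == b)
## [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE

# ! flips every value
!(a > b)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```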

Using indices to manipulate vectors

Vectors are a list of values. The first item in the list has an index of 1, the second has an index of 2, and so on. We can use these indices to work with vectors!

Accessing a value

string_vector
## [1] "i"      "scream" "for"    "ice"    "cream"

Remember that string_vector has 5 items in it. To check what the 5th item is, we just call that index by using []:

string_vector[5]
## [1] "cream"

Many programming languages start their indices with 0, so that the first item is actually called by string_vector[0]. R starts with 1, which I think is more intuitive. This isn’t important now if you’re just learning R, but may become important in the future if you ever switch between a 0-indexing language like Python and a 1-indexing language like R.
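
To see how R itself treats index 0, here is a small sketch (the vector is redefined so the example runs on its own). Rather than returning the first item or raising an error, R returns an empty vector:

```r
string_vector <- c("i", "scream", "for", "ice", "cream")

string_vector[1]          # the first item: "i"
string_vector[0]          # an empty character vector, not the first item
length(string_vector[0])  # 0
```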

Changing a value

Just like we can overwrite variable names, we can also overwrite values in a vector by reassigning a different value to that index.

string_vector[1] <- "you"
string_vector
## [1] "you"    "scream" "for"    "ice"    "cream"

Adding a value

If we try to access the 6th element in string_vector, R will tell us that it doesn’t exist by returning NA (“not available”):

string_vector[6]
## [1] NA

But we can add a 6th element to an already existing vector just by assigning it a value:

string_vector[6] <- "YARRR!"
string_vector
## [1] "you"    "scream" "for"    "ice"    "cream"  "YARRR!"
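
You can also use length() to add to the end of a vector without counting its items yourself. And if you skip ahead, R fills the gap with NA. A sketch with a made-up vector:

```r
counts <- c(10, 20, 30)

# Append one value directly after the current last element
counts[length(counts) + 1] <- 40
counts
## [1] 10 20 30 40

# Assigning past the end leaves NA in the skipped positions
counts[7] <- 70
counts
## [1] 10 20 30 40 NA NA 70
```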

Functions

Functions are an important building block for any programming language. For now, we’ll stick to functions that already exist. For example, instead of using the + operator to add numbers, we can use the function sum(). In R, a function is followed immediately by parentheses. The parentheses mean that the operation sum is being run over the arguments inside the parentheses. Inside the parentheses are all the arguments of the function, separated by ,:

sum(1, 3)
## [1] 4

Some functions have multiple arguments. sum() is a function that can take any number of arguments.

sum(1, 3, 5, 7, 11)
## [1] 27

Some functions have arguments that are usually given by name, as with the rep() function, where times specifies how many times a value is repeated:

rep("R is so fun!", times = 5)
## [1] "R is so fun!" "R is so fun!" "R is so fun!" "R is so fun!"
## [5] "R is so fun!"
rep(c(1, 2, 3, 4), times = 2)
## [1] 1 2 3 4 1 2 3 4

The argument times is written out because it’s an optional argument. We can give the rep function a different argument instead. If we want each value of a vector to be repeated a few times, instead of the whole vector to be repeated, we can use each:

rep(c(1, 2, 3, 4), each = 2)
## [1] 1 1 2 2 3 3 4 4
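
The two arguments can also be combined, and times can be given by position as the second argument. A brief sketch:

```r
# Positional form: the second argument is treated as times
rep(c(1, 2, 3, 4), 2)
## [1] 1 2 3 4 1 2 3 4

# each is applied first, then the result is repeated times times
rep(c(1, 2), each = 2, times = 3)
## [1] 1 1 2 2 1 1 2 2 1 1 2 2
```

Writing the argument name out is still the safer habit, since it makes clear which kind of repetition you mean.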

Functions and vectors

Functions can also be run over vectors. We can use sum() to discover the sum of our number_vector from before. We can also easily do some summary statistics to discover the mean and standard deviation:

sum(number_vector)
## [1] 830
mean(number_vector) ## the mean of number_vector
## [1] 166
sd(number_vector)   ## standard deviation of number_vector
## [1] 247.3843
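
A few other base R summary functions work the same way. This sketch redefines number_vector so it runs on its own:

```r
number_vector <- c(1, 2, 19, 237, 571)

length(number_vector)   # how many values: 5
min(number_vector)      # smallest value: 1
max(number_vector)      # largest value: 571
median(number_vector)   # middle value: 19
```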

Data frames

The next step up in the hierarchy of data structures is the data frame. This is basically a matrix, with rows and columns of values. One way to think about data frames is that they’re kind of like a table. We’ll come back to this tomorrow in a little more detail.

Most of my work in R involves manipulating and plotting data that exists in data frame structure. For this tutorial, we’ll use a simple data frame that is already provided by R to learn about data frames. warpbreaks is a record of how many warp breaks there were for each loom. Each loom is represented on its own line, and the type of wool and the tension of the loom are also recorded.

warpbreaks
##    breaks wool tension
## 1      26    A       L
## 2      30    A       L
## 3      54    A       L
## 4      25    A       L
## 5      70    A       L
## 6      52    A       L
## 7      51    A       L
## 8      26    A       L
## 9      67    A       L
## 10     18    A       M
## 11     21    A       M
## 12     29    A       M
## 13     17    A       M
## 14     12    A       M
## 15     18    A       M
## 16     35    A       M
## 17     30    A       M
## 18     36    A       M
## 19     36    A       H
## 20     21    A       H
## 21     24    A       H
## 22     18    A       H
## 23     10    A       H
## 24     43    A       H
## 25     28    A       H
## 26     15    A       H
## 27     26    A       H
## 28     27    B       L
## 29     14    B       L
## 30     29    B       L
## 31     19    B       L
## 32     29    B       L
## 33     31    B       L
## 34     41    B       L
## 35     20    B       L
## 36     44    B       L
## 37     42    B       M
## 38     26    B       M
## 39     19    B       M
## 40     16    B       M
## 41     39    B       M
## 42     28    B       M
## 43     21    B       M
## 44     39    B       M
## 45     29    B       M
## 46     20    B       H
## 47     21    B       H
## 48     24    B       H
## 49     17    B       H
## 50     13    B       H
## 51     15    B       H
## 52     15    B       H
## 53     16    B       H
## 54     28    B       H

This data frame is getting too big to easily see. Typically, all we need to see in order to know what’s going on with the data frame is the first few rows. We can see this using the function head, which shows the first 6 observations of its argument. For a vector like my_dessert, this would produce the first 6 values, and for a data frame like warpbreaks it would produce the first 6 rows:

head(my_dessert)
## [1] "chocolate" "vanilla"   "chocolate" "chocolate" "chocolate" "pistachio"
head(warpbreaks)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L
## 4     25    A       L
## 5     70    A       L
## 6     52    A       L
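
head() also accepts an optional argument n to change how many items it shows, and its counterpart tail() shows the last items instead. A short sketch:

```r
# First 3 rows instead of the default 6
head(warpbreaks, n = 3)
##   breaks wool tension
## 1     26    A       L
## 2     30    A       L
## 3     54    A       L

# Last 2 rows
tail(warpbreaks, n = 2)
##    breaks wool tension
## 53     16    B       H
## 54     28    B       H
```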

Getting a first look at the data

We’ve already used functions on vectors - and on data frames as well, with the use of head() to see the first six rows of a data frame. Functions are a great way to get a first look at your data. Some things I use a lot are:

Function               Does
head(data_frame)       returns the first 6 rows
data_frame[1:10, ]     returns the first 10 rows (or whatever numbers)
names(data_frame)      returns the names of the columns
nrow(data_frame)       provides the number of rows (= observations)
summary(data_frame)    provides a summary of the data

Try it out on warpbreaks!

summary(warpbreaks)
##      breaks      wool   tension
##  Min.   :10.00   A:27   L:18   
##  1st Qu.:18.25   B:27   M:18   
##  Median :26.00          H:18   
##  Mean   :28.15                 
##  3rd Qu.:34.00                 
##  Max.   :70.00
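
The other functions in the table work the same way. A quick sketch on warpbreaks:

```r
names(warpbreaks)    # the column names
## [1] "breaks"  "wool"    "tension"

nrow(warpbreaks)     # the number of rows (observations)
## [1] 54

warpbreaks[1:3, ]    # rows 1 through 3, all columns
```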

Indices and data frames

Values in a data frame are indexed just like they are in vectors - only now, you have to specify the value’s row number and column number. To do this, we call data_frame[ROW, COLUMN]. We can also call just the row, by leaving the column argument blank (but including the comma that indicates there is a second argument): data_frame[ROW, ]. Or we can call just the column, by leaving the row argument blank: data_frame[, COLUMN].

warpbreaks[5, 2]
## [1] A
## Levels: A B
warpbreaks[5, ]
##   breaks wool tension
## 5     70    A       L

Note that when we ask R for the value that is in warpbreaks[5, 2], it returns the value A, and then tells us that the ‘levels’ (all the variants of that column) are A and B.

One nice thing about data frames is that they typically have named columns. If we want to call a specific column, we don’t have to remember its place in the data frame - we can make use of the $ operator to call that column. The syntax for this is data_frame$col_name. Here, I’m asking R for the first 6 values of the column named “breaks”.

head(warpbreaks$breaks)
## [1] 26 30 54 25 70 52
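
Columns can also be picked out by name inside the square brackets, and $ can be combined with an index. A small sketch, using row 5 as in the example above:

```r
warpbreaks[5, "breaks"]    # row 5 of the column named "breaks"
## [1] 70

warpbreaks$breaks[5]       # the same value: 5th element of the breaks column
## [1] 70

warpbreaks[1:3, c("breaks", "tension")]   # rows 1 to 3 of two named columns
```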

This is important to know, because R doesn’t automatically look inside data frames to find something. If you just ask R to give you the first 6 values of “breaks” without specifying which data frame it’s in, you’ll get an error. R will tell you that it can’t find the object “breaks”.

head(breaks)

There are actually several different ways to get R to look inside a data frame for the column you’re interested in.

Option      Example
$           data_frame$col_name
with()      with(data_frame, expression)
attach()    attach(data_frame)
data =      some_function(..., data = data_frame)

$ is usually fine for me, but sometimes it can be tiresome to keep typing it out, especially if you’re looking at multiple columns at a time. One really common reason to look at multiple columns at a time is when you want to plot them. In base R, the plot() function will plot the first argument as the x-axis and the second argument as the y-axis.

plot(warpbreaks$tension, warpbreaks$breaks)

We can also use with() to achieve this. Note the syntax of with: it’s a little complicated. For this function, the first argument is the data frame, and the second argument is the thing you actually want to do. In this case, the second argument is another function which can call the column names as its arguments since it’s nested within the with(warpbreaks) that tells R to look inside the warpbreaks data frame.

with(warpbreaks, plot(tension, breaks))
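
with() is not limited to plotting. As a sketch, any expression that refers to column names can go in the second argument:

```r
# Mean number of breaks, looking up breaks inside warpbreaks
with(warpbreaks, mean(breaks))

# The same calculation written with $
mean(warpbreaks$breaks)

# Expressions can mix several columns: mean breaks for wool type A only
with(warpbreaks, mean(breaks[wool == "A"]))
```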

Another option is to attach() a data frame. This is useful to do if you’re going to be working a lot with a single data frame. This function attaches the data frame to R’s search path, so that R can look inside that data frame for the objects you’re calling. There are a couple of caveats:

  • If your column name is the same as a different object in R’s search path (like another data frame), you may call the wrong thing accidentally.
  • This happens most often when people forget to detach() data frames after they’re done using them

Google’s R style sheet suggests avoiding attach() completely, primarily because the risk of R finding the wrong object is so high:

“The possibilities for creating errors when using attach are numerous. Avoid it.”

attach(warpbreaks)
plot(tension, breaks)
detach(warpbreaks)


The last option is to use data = data_frame as an argument in your function, to tell your function where to look for your objects. This is only available for some functions, so it’s not a universal solution. The following are equivalent:

lm(warpbreaks$breaks ~ warpbreaks$tension)
## 
## Call:
## lm(formula = warpbreaks$breaks ~ warpbreaks$tension)
## 
## Coefficients:
##         (Intercept)  warpbreaks$tensionM  warpbreaks$tensionH  
##               36.39               -10.00               -14.72
lm(breaks ~ tension, data=warpbreaks)
## 
## Call:
## lm(formula = breaks ~ tension, data = warpbreaks)
## 
## Coefficients:
## (Intercept)     tensionM     tensionH  
##       36.39       -10.00       -14.72
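
Other base R functions that take a formula accept data = in the same way. As a sketch, aggregate() (not otherwise covered in this lesson) computes the mean number of breaks at each tension level:

```r
# Mean breaks per tension level, with columns looked up in warpbreaks
tension_means <- aggregate(breaks ~ tension, data = warpbreaks, FUN = mean)
tension_means
##   tension   breaks
## 1       L 36.38889
## 2       M 26.38889
## 3       H 21.66667
```

These means line up with the lm() output: the intercept is the mean for tension L, and the M and H coefficients are the differences from it.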

  1. Thanks to Daniel Ezra Johnson for pointing this out!