At this point, I’m going to move ahead to plots, before coming back to other ways to manipulate data frames. The reason is simple: plotting my data is usually my first pass at seeing what’s going on. I usually manipulate my data frames for the purpose of plotting them in a particular way, and seeing the data first gives me better ideas of how to manipulate the data frame in order to plot it more informatively.
For the `hs_sorted` data, one question we might ask is how socioeconomic status affects test scores. A boxplot will start to answer this question informatively. In base R, we use the function `boxplot()` to do this. In this function, we see a new character, `~`. We read this as “by”. The function below can be read as “make a plot of how the test score value varies by socioeconomic status. Use `hs_sorted` as the data”.
boxplot(value ~ ses, data = hs_sorted)
This is a nice first pass at the data. We learn that test scores go up for higher SES scores. Other plot commands in base R include:
Function | Does |
---|---|
`boxplot(y ~ x, data = data_frame)` | boxplot |
`plot(y ~ x, data = data_frame)` | scatter plot |
`hist(data_frame$column)` | histogram |
`plot(density(data_frame$column))` | density plot (`density()` computes the estimate, `plot()` draws it) |
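To see the base R commands from the table side by side, here is a minimal sketch. It uses R's built-in `mtcars` dataset as a stand-in (an assumption for illustration; it is not the workshop's `hs_sorted` data), and the names `cars` and `mpg_density` are made up for the example.

```r
# Toy data: R's built-in mtcars dataset stands in for a generic data frame.
cars <- mtcars
cars$cyl <- as.factor(cars$cyl)   # treat cylinder count as a category

boxplot(mpg ~ cyl, data = cars)   # mpg "by" number of cylinders
plot(mpg ~ wt, data = cars)       # scatter plot of mpg against weight
hist(cars$mpg)                    # histogram of mpg

mpg_density <- density(cars$mpg)  # density() only computes the estimate...
plot(mpg_density)                 # ...so pass it to plot() to draw it
```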
For a detailed look at graphics in base R, check out Joe’s R Study Group.
ggplot
I find R’s base plots actually much harder to deal with than `ggplot` (which stands for “grammar of graphics plot”), since base plots are hard to control - so my first step in looking at data is actually to plot it using `ggplot`. Today, I’ll introduce `ggplot`, and tomorrow we’ll go into more detail.
`ggplot` is a package that gives you a lot of freedom over your plots. Its defaults are set to make beautiful plots, which - let’s face it - makes it easier to convince people that your data is the real deal. My go-to for help with `ggplot` is Winston Chang’s Cookbook for R. Again, Joe’s R Study Group is also a great tutorial, and the paper on `ggplot` by Hadley Wickham (the author of `ggplot`) provides some more in-depth info on the theory behind `ggplot` as a data visualization tool. I especially recommend reading Section 7 (just one page long). Here’s a teaser:
How can we build on top of the grammar to help data analysts build compelling, revealing graphics? What would a poetry of graphics look like?
Who doesn’t want their graphics to be compelling and revealing - poetic in their informativeness!? I’m sold. `ggplot` works a little differently than plotting in base R, because it’s not just a single line of code.
With `ggplot`, you’re creating a plot out of layers. The main types of layers that you’ll add are:
Data and Aesthetics. The data is obviously an important part of the plot. Along with the dataset, we also need to know how to map the data onto visual aesthetics. Aesthetics assign a variable to x and y, and also designate other ways to represent the data visually (variables by color or shape, for example). In `ggplot`, this is shortened to `aes()`.
Geometrics, or `geom` objects. This tells `ggplot` what kind of geometrics to put on the graph. Is it points, as in a scatter plot (`geom_point()`)? Is it a box plot (`geom_boxplot()`)?
Statistics, or `stat` layers. These layers perform summary statistics on the data. An example of a `stat` layer is adding a smoothed line to a scatterplot, with `stat_smooth()`.
Scales and themes are layers that you’ll add to fine-tune the plot - give it the color scheme that you want, or the background shade that you want.
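The layer types above can be sketched in one short plot. This is only an illustration under stated assumptions: it uses the built-in `mtcars` dataset rather than the workshop data, and the variable name `p` is made up for the example.

```r
library(ggplot2)

# Data and aesthetics: which data frame, and which columns map to x and y.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Geom layer: draw each row as a point.
  geom_point() +
  # Stat layer: overlay a fitted straight line (method = "lm").
  stat_smooth(method = "lm", formula = y ~ x) +
  # Theme layer: white background instead of the default grey.
  theme_bw()

print(p)
```

Storing the plot in a variable and printing it is optional; typing the `ggplot()` call directly at the console draws the same plot.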
We’ll start with the `ggplot` equivalent of the boxplot that we made above in base graphics. First we need to install and load `ggplot2`.
install.packages("ggplot2")
library(ggplot2)
##
## Attaching package: 'ggplot2'
##
## The following object is masked _by_ '.GlobalEnv':
##
## movies
head(hs_sorted)
## id female race ses schtyp prog variable value
## 99 1 1 1 1 1 3 read 34
## 299 1 1 1 1 1 3 write 44
## 499 1 1 1 1 1 3 math 40
## 699 1 1 1 1 1 3 science 39
## 899 1 1 1 1 1 3 socst 41
## 139 2 1 1 2 1 3 read 39
The basic way to create a plot is to put the data and aesthetics in the `ggplot()` function, and then add the layers of geom and stat that you want. The first line takes the data frame as the first argument, and then the aesthetic mappings as the second argument.
ggplot(data_frame, aes(x = x_var, y = y_var))
After adding the data, we can make that data materialize with whatever geometric layer we want. We achieve this by adding a `+` to the end of the first line, so that R knows the plot isn’t finished yet. Most style guides will suggest adding each layer on its own line, to keep the code maximally readable.
ggplot(hs_sorted, aes(x = ses, y = value)) +
geom_boxplot()
Something’s wrong here … Why is `ggplot` only giving us one overall boxplot instead of a separate boxplot for each `ses`? The answer has to do with how the data is coded. If we take a look at the `summary` of `hs_sorted`, we can see that R thinks that `ses` is a measurement on a scale. It tells us that 2.055 is the mean socioeconomic status of this dataset. But this is silly: we know that students are given a discrete score of either 1, 2, or 3. In other words, even though `ses` is represented as a numeric variable, it’s actually a factor variable. It increases in discrete steps.
summary(hs_sorted)
## id female race ses
## Min. : 1.00 Min. :0.000 Min. :1.00 Min. :1.000
## 1st Qu.: 50.75 1st Qu.:0.000 1st Qu.:3.00 1st Qu.:2.000
## Median :100.50 Median :1.000 Median :4.00 Median :2.000
## Mean :100.50 Mean :0.545 Mean :3.43 Mean :2.055
## 3rd Qu.:150.25 3rd Qu.:1.000 3rd Qu.:4.00 3rd Qu.:3.000
## Max. :200.00 Max. :1.000 Max. :4.00 Max. :3.000
## schtyp prog variable value
## Min. :1.00 Min. :1.000 read :200 Min. :26.00
## 1st Qu.:1.00 1st Qu.:2.000 write :200 1st Qu.:45.00
## Median :1.00 Median :2.000 math :200 Median :52.50
## Mean :1.16 Mean :2.025 science:200 Mean :52.38
## 3rd Qu.:1.00 3rd Qu.:2.250 socst :200 3rd Qu.:61.00
## Max. :2.00 Max. :3.000 Max. :76.00
Factors are actually a little complicated. Underlyingly, they are integer codes that are displayed with character labels. This mostly has implications for statistical analysis, which we won’t be getting into in this mini course. But it’s good to know that when you have, e.g., gender represented as a factor (which was R’s default representation for character columns in data frames before R 4.0.0), it is underlyingly a series of integer codes: 1 and 2 for a two-level factor.
For the purpose of plotting the data in an informative way, all we need to do is realize that the `ses` variable looks continuous to `ggplot`, but that we intend it to be a factor. Easy! We can coerce variables into other data types by using the following:
Function | Does |
---|---|
`as.factor()` | coerce to factor |
`as.character()` | coerce to character string |
`as.numeric()` | coerce to numbers |
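A small sketch of how factors are stored and how the coercion functions behave may help here. The vectors below (`gender`, `ses_num`, `scores`) are invented for illustration and are not part of the workshop data.

```r
# A factor stores integer codes plus a set of character labels (levels).
gender <- factor(c("male", "female", "female", "male"))
levels(gender)       # the labels, sorted alphabetically by default
as.integer(gender)   # the underlying integer codes (they start at 1)

# Round trip through the three coercion functions.
ses_num  <- c(1, 3, 2, 2)
ses_fac  <- as.factor(ses_num)      # numeric -> factor
ses_chr  <- as.character(ses_fac)   # factor -> character labels
ses_back <- as.numeric(ses_chr)     # character -> numeric

# Caution: as.numeric() applied directly to a factor returns the codes,
# not the labels. Go through as.character() to recover the values.
scores <- factor(c("10", "30", "20"))
as.numeric(scores)                  # the codes
as.numeric(as.character(scores))    # the original values
```

The last two lines are the usual gotcha: if a numeric column has been turned into a factor, convert it to character before converting it back to numbers.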
For this data, we need to coerce `ses` to a factor. We can do this either outside of the `ggplot` function or within it. The following are equivalent:
## change the data frame and then plot it (good for if you want to continue working with ses as a factor)
hs_sorted$ses <- as.factor(hs_sorted$ses)
ggplot(hs_sorted, aes(x = ses, y = value)) +
geom_boxplot()
## change the variable just when you plot it
ggplot(hs_sorted, aes(x = as.factor(ses), y = value)) +
geom_boxplot()
Both of these options produce the graph we’re looking for:
ggplot
Okay great, so we made the same graph as before, and it looks a little bit nicer. But we’re just getting started! One of the biggest benefits of using `ggplot` as a first pass at your data is the flexible aesthetics. With `ggplot`, we can assign color or shape based on a variable - which is something that takes an extra step to do in base R.[1]
Let’s see whether gender has an additional effect on test scores. Note that the gender variable gives us the same problem as the socioeconomic status variable, so we have to also coerce it to a factor first.
ggplot(hs_sorted, aes(x = ses, y = value, color = as.factor(female))) +
geom_boxplot()
Beautiful! What else might be informative? Maybe there’s a difference in subject score by socioeconomic status, or by gender? Note that here I’m also changing the `color` aesthetic mapping to a `fill`, because I want the boxes to be filled with color instead of just changing the color of the lines.
ggplot(hs_sorted, aes(x = variable, y = value, fill = ses)) +
geom_boxplot()
ggplot(hs_sorted, aes(x = variable, y = value, fill = as.factor(female))) +
geom_boxplot()
Below is a list of some of the aesthetic options, geoms, stat layers, and fine-tuning arguments that I use regularly. This is not a full list! See the `ggplot` documentation for a current list of the layers that `ggplot` supports.
Some of the layers need specified arguments (see, e.g., `geom_bar()`, which needs `stat = "identity"` when the bar heights are already values in your data rather than counts of rows). R is great at letting you know when you need to specify an argument, so there’s no risk in trying a layer. If you haven’t given `ggplot` all the info it needs, it’ll usually tell you what to add.
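As a hedged illustration of a layer that needs an extra argument, here is a bar plot built from pre-computed values. The data frame `score_means` and its numbers are invented for the example; they are not results from the workshop data.

```r
library(ggplot2)

# Hypothetical summary table: one made-up mean score per SES group.
score_means <- data.frame(
  ses = factor(c("Lower", "Middle", "Upper"),
               levels = c("Lower", "Middle", "Upper")),
  mean_score = c(48, 52, 56)
)

# stat = "identity" tells geom_bar() to use mean_score as the bar height
# instead of counting the rows in each group (its default behaviour).
p_bar <- ggplot(score_means, aes(x = ses, y = mean_score)) +
  geom_bar(stat = "identity")

print(p_bar)
```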
Argument | Does |
---|---|
`x =` | assigns variable to x. Not optional. |
`y =` | assigns variable to y |
`color =` | assigns color of lines or points. Note that `colour =` also works |
`fill =` | assigns the fill of 2-dimensional shapes |
`size =` | assigns size (of points or line width) |
`shape =` | assigns shape |
`alpha =` | assigns transparency |
`linetype =` | assigns line type |
`label =` | assigns text (for use with `geom_text()`) |
Geom | Makes |
---|---|
`geom_point()` | scatterplot |
`geom_bar(stat = "identity")` | barplot |
`geom_boxplot()` | boxplot |
`geom_violin()` | violin plot |
`geom_line()` | line plot (connects points left to right) |
`geom_path()` | line plot (connects points in the order they appear in the data frame) |
`geom_area()` | area plot (filled in from line to x axis) |
`geom_text()` | scatterplot of text |
`geom_blank()` | draws nothing; can be useful for forcing common scales across plots |
`geom_contour()` | contour plot |
`geom_histogram()` | histogram |
`geom_errorbar()` | error bars |
`geom_jitter()` | points - jittered (to reduce overlap) |
Stat | Does |
---|---|
`stat_smooth()` | smoothed line |
`stat_density()` | density plot |
`stat_summary()` | summarize y value for each x |
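To make the idea of a stat layer more concrete, here is a minimal sketch of `stat_summary()` drawing one summary value per x group on top of the raw points. It again uses the built-in `mtcars` data as a stand-in, and it assumes ggplot2 3.3.0 or later, where the argument is named `fun` (older versions call it `fun.y`).

```r
library(ggplot2)

p_sum <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  # Raw data, jittered sideways only so that overlapping points are visible.
  geom_jitter(width = 0.1, height = 0, alpha = 0.5) +
  # Stat layer: compute the mean of y within each x group, drawn as a large red point.
  stat_summary(fun = mean, geom = "point", size = 4, color = "red")

print(p_sum)
```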
Scale | Does |
---|---|
`scale_alpha()` | set alpha range |
`scale_size_area()` | scale area instead of radius for size aesthetic (called `scale_area()` in older versions of ggplot2) |
`scale_color_gradient()` | smooth gradient between two colors |
`scale_color_grey()` | grey scale |
`scale_*_manual()`, e.g. `scale_color_manual()` | set alpha, color, fill, linetype, shape, size manually |
`scale_linetype()` | set linetype scale |
`scale_size()` | set size range |
`scale_x_*()`, e.g. `scale_x_reverse()`, `scale_x_log10()` | adjust x: reverse it, put in log space, ... |
Function | Does |
---|---|
`labs()` | add title, x axis, y axis, legend titles |
`xlim()` | change limits of x axis |
`ylim()` | change limits of y axis |
`theme_bw()` | white background, grey gridlines |
`theme_grey()` | grey background, white gridlines |
`theme()` | set theme elements (text size, linetype, etc.) |
`hs_sorted` data
## as.factor(ses) keeps the x axis discrete even if ses is still stored as a number
ggplot(hs_sorted, aes(x = as.factor(ses), y = value, fill = as.factor(female))) +
  geom_boxplot() +
  labs(x = "\nSocioeconomic Status", y = "Test Scores\n", fill = "Gender", title = "Test scores by Gender and SES\n") +
  scale_fill_brewer(breaks = c("0", "1"),
                    labels = c("Male", "Female")) +
  scale_x_discrete(breaks = c("1", "2", "3"),
                   labels = c("Lower", "Middle", "Upper")) +
  theme_bw()
[1] Thanks to Daniel Ezra Johnson for pointing this out!