Let’s move on to some linguistic data. This data comes from sociolinguistic interviews conducted in Philadelphia, cleaned up and made available by Joe Fruehwald. The data package we’ll be downloading also includes data on his own vowels, which you can play around with on your own! To download it, we first have to install some packages in R. Once devtools is installed and loaded, we can download the data.
install.packages("devtools", repos = "http://cran.us.r-project.org")
library(devtools)
install_github("jofrhwld/grammarOfVariationData")
library(grammarOfVariationData)
##
## Attaching package: 'grammarOfVariationData'
##
## The following object is masked _by_ '.GlobalEnv':
##
## ing
To see what data frames come in this data set:
data(package = "grammarOfVariationData")
This automatically opens up another tab in the editor window, which gives a list of the data sets in grammarOfVariationData. We’ll be looking at the ing data, a data set of ING vs. IN’ tokens across different styles and grammatical statuses. Asking R for the number of rows with nrow() shows us that ing contains 1139 observations – too many to look at all at once. To get a sense of what the data looks like, we’ll check it out with head(). Here I’m adding the optional argument 5 to get just the first 5 rows.
nrow(ing)
## [1] 1139
head(ing, 5)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
We can see in this data set that each token is its own row. DepVar shows whether the token was “ing” or “in”, Style is the style of speech the speaker was using, GramStatus shows the grammatical category, Following.Seg the following segment, and so on. One downside of only looking at the first few rows of a data frame is that we may not see all the variants within a single column. To get a sense of the whole data frame, let’s see what summary() puts out.
summary(ing)
## Token DepVar Style GramStatus Following.Seg
## something: 92 In :577 careful :467 adjective : 68 0 :201
## going : 67 Ing:562 narrative:324 during : 9 apical :318
## doing : 57 soapbox :133 gerund :113 labial :161
## saying : 49 response : 89 noun : 66 palatal: 42
## getting : 37 tangent : 88 participle :309 velar : 37
## talking : 32 group : 23 progressive:464 vowel :380
## (Other) :805 (Other) : 15 thing :110
## Sex Age Ethnicity Prop prop
## f:546 Min. :2.000 Irish :224 Min. :0.5000 Min. :0.5066
## m:593 1st Qu.:3.000 Italian:540 1st Qu.:0.5000 1st Qu.:0.5066
## Median :6.000 other :279 Median :0.5126 Median :0.5066
## Mean :4.684 polish : 96 Mean :0.5066 Mean :0.5066
## 3rd Qu.:6.000 3rd Qu.:0.5126 3rd Qu.:0.5066
## Max. :6.000 Max. :0.5126 Max. :0.5066
##
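By the way, summary() lumps the less frequent values of a factor into an “(Other)” row, as you can see in the Token and Style columns. If you want the full breakdown of a single column, table() and levels() are handy checks – a quick sketch (output omitted):
table(ing$DepVar)        # counts of every value of DepVar ("In" vs. "Ing")
levels(ing$GramStatus)   # all the levels of the GramStatus factor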
Let’s see if men and women produce different proportions of the standard variant -ing.
ggplot(ing, aes(x = Sex, fill = DepVar)) +
geom_bar()
This is a nice-looking graph, but it’s not the most informative one we could make. We can tell that we have more tokens from male speakers, but it’s hard to compare the percentage of each variant when we’re looking at stacked bars like this.
A clearer graph would plot the proportion of IN by gender, and maybe even provide the total token count per bar. But we don’t have the proportion of IN by gender. No problem - we’ll just have to make it ourselves.
So let’s think about the task of calculating the proportion IN / (IN + ING) for each gender. We’ll do it the tedious way first, so that it’s most clear what the functions are doing - and then we’ll learn a sexy function family that makes our job way easier.
It’s pretty easy to create a column that is the overall proportion of IN / (IN + ING). We just need to take the number of “In” tokens and divide it by the total number of observations. There are actually a few ways to get these numbers. I’ll walk through one way – if you can find your own, that’s great!
The key piece of code is ing$DepVar == "In". This compares every element of the vector ing$DepVar to "In", and produces a vector of boolean values as the output. Here’s the great thing about boolean values: they are underlyingly 0’s and 1’s, where 0 == FALSE and 1 == TRUE. This means we can find the proportion of TRUE values by asking for the mean of that vector.
head(ing$DepVar == "In")
## [1] TRUE TRUE FALSE TRUE FALSE FALSE
mean(ing$DepVar == "In")
## [1] 0.5065847
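If the booleans-as-numbers trick still feels abstract, here’s a tiny made-up example showing it on its own:
x <- c(TRUE, TRUE, FALSE)  # a toy logical vector, made up for illustration
sum(x)                     # 2 -- TRUE counts as 1, FALSE as 0
mean(x)                    # 0.6666667 -- the proportion of TRUEs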
ing$prop <- mean(ing$DepVar == "In")
head(ing)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
So great, we found the overall proportion of “In”. But to find something more informative, we want to find the proportion of “In” for each gender. We can do this by making two different subsets (one for male, one for female), applying the proportion formula over each one, and then combining them back together in a whole data frame.
female <- subset(ing, Sex == "f") ## create a data frame of female speakers
female$Prop <- mean(female$DepVar == "In") ## find proportion of "In"
male <- subset(ing, Sex == "m") ## create a data frame of male speakers
male$Prop <- mean(male$DepVar == "In") ## find proportion of "In"
ing <- rbind(female, male) ## put them back together by binding the rows with rbind()
head(ing)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
ggplot(ing, aes(x = Sex, y = Prop)) +
geom_bar(stat = "identity")
We use stat = "identity" to tell geom_bar() to use the values in the data, instead of the counts, to set the bar heights. But we still have a problem – the y scale is messed up. Here, the bars are still adding up the proportion value once for every observation. So for females, there are 546 observations, each carrying the value 0.5 – giving us the value 273. For males, there are 593 observations * 51.26% “In”, giving us a value of about 304. This isn’t what we’re looking for (but it is a great example of how, if something seems wrong, it probably is!). The plot we ended up with here is even worse than the plot above with just the raw counts – because now it’s even less informative!
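To see where those bar heights come from, here’s a quick sanity check using the group sizes from the summary above (a sketch; output shown in comments):
table(ing$Sex)    # f: 546, m: 593
546 * 0.5000000   # 273  -- the height of the female bar
593 * 0.5126476   # ~304 -- the height of the male bar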
The hangup for geom_bar() is that it’s being asked to plot a single value multiple times. What we need is a small data frame that acts as a summary, instead of a column that repeats the summary multiple times. No problem – all we need to do is create a smaller data frame when we subset. We’ll use the function data.frame() to do this, which constructs a column for each argument you pass to it.
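If data.frame() is new to you, here’s a toy example (the values are made up) – each named argument becomes a column:
data.frame(Animal = c("cat", "dog"), Legs = c(4, 4))  # builds a 2-row, 2-column data frame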
female <- subset(ing, Sex == "f") ## create a data frame of female speakers
f.prop <- mean(female$DepVar == "In") ## find proportion of "In", save it as a variable
f.sum <- data.frame(Gender = "f", Prop = f.prop) ## create summary data frame
male <- subset(ing, Sex == "m") ## create a data frame of male speakers
m.prop <- mean(male$DepVar == "In") ## find proportion of "In"
m.sum <- data.frame(Gender = "m", Prop = m.prop) ## create summary data frame
## put the two summary dfs together into one summary df by binding the rows with rbind()
sum.ing <- rbind(f.sum, m.sum)
head(sum.ing)
## Gender Prop
## 1 f 0.5000000
## 2 m 0.5126476
ggplot(sum.ing, aes(x = Gender, y = Prop)) +
geom_bar(stat="identity")
Turns out, men and women in this data set use “In” at roughly the same proportion.
But this takes a lot of work: splitting the data frame, applying a function, summarising the results, and combining everything back together before we can plot informative graphs. Luckily, there’s a function family that exists to make this process easier and more intuitive: dplyr, by our pal Hadley Wickham.
I’m calling dplyr a family of functions because it comprises a set of its own functions and uses its own syntax to accomplish data manipulation goals. First let’s install and load dplyr.
install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
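Those messages just mean that dplyr’s versions of these functions now take precedence, so plain filter() refers to dplyr’s filter(). If you ever need one of the originals, you can call it with its package prefix – a quick sketch:
stats::filter(1:10, rep(1/3, 3))   # stats' moving-average filter, not dplyr's
dplyr::filter(ing, Sex == "f")     # dplyr's row-subsetting filter, same as plain filter()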
The %>% operator
To begin with, we see that dplyr makes use of a new operator, %>%. This is called a forward pipe, or “pipe”, and it works by taking the object on the left and piping it through the function on the right. An easy way to see this in action is to try it out with my go-to function head():
ing %>% head()
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
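The pipe just hands the object on its left to the function on its right as that function’s first argument, so these two lines are equivalent (a minimal sketch):
head(ing, 3)      # the usual way of calling head()
ing %>% head(3)   # the same call, written with the pipe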
The usefulness of %>% becomes more apparent when we’re dealing with data frame manipulation rather than base R functions. Let’s see what the dplyr code for summarising the proportion of “In” looks like.
ing %>%
group_by(Sex) %>%
summarise(Prop = mean(DepVar == "In"))
## Source: local data frame [2 x 2]
##
## Sex Prop
## 1 f 0.5000000
## 2 m 0.5126476
Here, what we’ve done is:
1. Taken the ing data frame,
2. Split it up by gender, and
3. Calculated the proportion of “In” tokens.
dplyr automatically combines the results into a single data frame. We can assign this data frame to its own variable and plot it:
new.ing <- ing %>%
group_by(Sex) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(new.ing, aes(x = Sex, y = Prop)) +
geom_bar(stat = "identity")
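As a side note, ggplot() also takes a data frame as its first argument, so you can skip the intermediate variable and pipe the summary straight into the plot – a sketch of the same graph:
ing %>%
  group_by(Sex) %>%
  summarise(Prop = mean(DepVar == "In")) %>%
  ggplot(aes(x = Sex, y = Prop)) +
  geom_bar(stat = "identity")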
At this point, I’m going to take a step back and provide a list of some of the basic functions in dplyr.
Function | Does |
---|---|
filter() | Select just the rows you’re interested in |
group_by() | Splits the data according to the arguments. If there are multiple arguments, it will split first by the first argument, then by the second. |
select() | Select just the columns you’re interested in. |
mutate() | Add new columns |
summarise() | Summarise the data |
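To make that table concrete, here’s a sketch chaining several of these verbs together on the ing data (output omitted; all the column names are ones we saw in head()):
ing %>%
  filter(Style == "careful") %>%       # keep only the careful-speech tokens
  select(Token, DepVar, Sex) %>%       # keep just these three columns
  mutate(is_in = DepVar == "In") %>%   # add a logical column marking "In" tokens
  group_by(Sex) %>%                    # split by gender
  summarise(Prop = mean(is_in))        # proportion of "In" per gender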
So the rates of “In” by gender aren’t very informative. But what about grammatical category? (Spoiler alert – we expect grammatical category to have a large effect on rates of “In”!)
Thanks to dplyr, it’s easy to ask any of these questions we might have!
gram_ing <- ing %>%
group_by(GramStatus) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(gram_ing, aes(x = GramStatus, y = Prop)) +
geom_bar(stat = "identity")
What if men and women use different proportions for each category?
gram_gend_ing <- ing %>%
group_by(GramStatus, Sex) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(gram_gend_ing, aes(x = GramStatus, y = Prop, fill = Sex)) +
geom_bar(stat = "identity", position = "dodge")
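If you also want the token count behind each bar (handy for the challenges below), summarise() can compute several summaries at once, and n() counts the rows in each group. Here’s a sketch (output omitted) adding a count column, which I’m calling N:
ing %>%
  group_by(GramStatus, Sex) %>%
  summarise(Prop = mean(DepVar == "In"),   # proportion of "In" in the group
            N = n())                       # number of tokens in the group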
This is primarily a dplyr challenge, with elements of ggplot added in. This documentation of geom_errorbar and this Stack Overflow question on calculating standard error will be helpful. Hint: answers with more upvotes are generally better.
This is entirely a ggplot challenge. Hint: first figure out how to plot a scatter plot of categorical data, then add a statistics layer that will smooth between the categorical data. When you build the plot, geom_smooth will report which smoothing method it picked:
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.