Let’s move on to some linguistic data. This data comes from sociolinguistic interviews conducted in Philadelphia, cleaned up and made available by Joe Fruehwald. The data package we’ll be downloading also includes data on his own vowels, which you can play around with on your own! To download it, we first have to install some packages in R. Once devtools is installed and loaded, we can download the data.
install.packages("devtools", repos = "http://cran.us.r-project.org")
library(devtools)
install_github("jofrhwld/grammarOfVariationData")
library(grammarOfVariationData)
##
## Attaching package: 'grammarOfVariationData'
##
## The following object is masked _by_ '.GlobalEnv':
##
## ing
To see what data frames come in this data set:
data(package = "grammarOfVariationData")
This automatically opens up another tab in the editor window, which gives a list of the data sets in grammarOfVariationData. We’ll be looking at the ing data, a data set of ING vs. IN’ tokens across different styles and grammatical statuses. Asking R for the number of rows with nrow() shows us that ing contains 1139 observations – too many to look at all at once. To get a sense of what the data looks like, we’ll check it out with head(). Here I’m adding the optional argument 5 to get just the first 5 rows.
nrow(ing)
## [1] 1139
head(ing, 5)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
We can see in this data set that each token is its own row. DepVar shows whether the token was “ing” or “in”, Style is the style of speech the speaker was using, GramStatus shows the grammatical category, Following.Seg the following segment, and so on. One downside of only looking at the first few rows of a data frame is that we may not see all the variants within a single column. To get a sense of the whole data frame, let’s see what summary() puts out.
summary(ing)
## Token DepVar Style GramStatus Following.Seg
## something: 92 In :577 careful :467 adjective : 68 0 :201
## going : 67 Ing:562 narrative:324 during : 9 apical :318
## doing : 57 soapbox :133 gerund :113 labial :161
## saying : 49 response : 89 noun : 66 palatal: 42
## getting : 37 tangent : 88 participle :309 velar : 37
## talking : 32 group : 23 progressive:464 vowel :380
## (Other) :805 (Other) : 15 thing :110
## Sex Age Ethnicity Prop prop
## f:546 Min. :2.000 Irish :224 Min. :0.5000 Min. :0.5066
## m:593 1st Qu.:3.000 Italian:540 1st Qu.:0.5000 1st Qu.:0.5066
## Median :6.000 other :279 Median :0.5126 Median :0.5066
## Mean :4.684 polish : 96 Mean :0.5066 Mean :0.5066
## 3rd Qu.:6.000 3rd Qu.:0.5126 3rd Qu.:0.5066
## Max. :6.000 Max. :0.5126 Max. :0.5066
##
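By the way, summary() lumps the less frequent values of a factor into an “(Other)” row, as you can see in the Token and Style columns. If you want the full breakdown of a single column, table() and levels() are handy checks – a quick sketch (output omitted):
table(ing$DepVar)        # counts of every value of DepVar ("In" vs. "Ing")
levels(ing$GramStatus)   # all the levels of the GramStatus factor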
Let’s see if men and women produce different proportions of the standard variant -ing.
ggplot(ing, aes(x = Sex, fill = DepVar)) +
geom_bar()
This is a nice-looking graph, but it’s not the most informative one we could make. We can tell that we have more tokens from male speakers, but it’s hard to compare the percentage of each variant when we’re looking at stacked bars like this.
A clearer graph would plot the proportion of IN by gender, and maybe even provide the total token count per bar. But we don’t have the proportion of IN by gender. No problem - we’ll just have to make it ourselves.
So let’s think about the task of calculating the proportion IN / (IN + ING) for each gender. We’ll do it the tedious way first, so that it’s most clear what the functions are doing - and then we’ll learn a sexy function family that makes our job way easier.
It’s pretty easy to create a column that is the overall proportion of IN / (IN + ING). We just need to take the number of “In” tokens and divide it by the total number of observations. There are actually a few ways to get these numbers. I’ll walk through one way – if you can find your own, that’s great!
The key piece of code is ing$DepVar == "In". This compares every element of the vector ing$DepVar to "In", and produces a vector of boolean values as the output. Here’s the great thing about boolean values: they are underlyingly 0’s and 1’s, where 0 == FALSE and 1 == TRUE. This means we can find the proportion of TRUE values by asking for the mean of that vector.
head(ing$DepVar == "In")
## [1] TRUE TRUE FALSE TRUE FALSE FALSE
mean(ing$DepVar == "In")
## [1] 0.5065847
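If the booleans-as-numbers trick still feels abstract, here’s a tiny made-up example showing it on its own:
x <- c(TRUE, TRUE, FALSE)  # a toy logical vector, made up for illustration
sum(x)                     # 2 -- TRUE counts as 1, FALSE as 0
mean(x)                    # 0.6666667 -- the proportion of TRUEs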
ing$prop <- mean(ing$DepVar == "In")
head(ing)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
So great, we found the overall proportion of “In”. But to find something more informative, we want to find the proportion of “In” for each gender. We can do this by making two different subsets (one for male, one for female), applying the proportion formula over each one, and then combining them back together in a whole data frame.
female <- subset(ing, Sex == "f") ## create a data frame of female speakers
female$Prop <- mean(female$DepVar == "In") ## find proportion of "In"
male <- subset(ing, Sex == "m") ## create a data frame of male speakers
male$Prop <- mean(male$DepVar == "In") ## find proportion of "In"
ing <- rbind(female, male) ## put them back together by binding the rows with rbind()
head(ing)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
ggplot(ing, aes(x = Sex, y = Prop)) +
geom_bar(stat = "identity")
We use stat = "identity" to tell geom_bar() to use the values in the data, instead of the counts, to set the bar heights. But we still have a problem – the y scale is messed up. Here, the bars are still adding up the proportion value once for every observation. So for females, there are 546 observations, each carrying the value 0.5 – giving us the value 273. For males, there are 593 observations * 51.26% “In”, giving us a value of about 304. This isn’t what we’re looking for (but it is a great example of how, if something seems wrong, it probably is!). The plot we ended up with here is even worse than the plot above with just the raw counts – because now it’s even less informative!
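To see where those bar heights come from, here’s a quick sanity check using the group sizes from the summary above (a sketch; output shown in comments):
table(ing$Sex)    # f: 546, m: 593
546 * 0.5000000   # 273  -- the height of the female bar
593 * 0.5126476   # ~304 -- the height of the male bar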
The hangup for geom_bar() is that it’s being asked to plot a single value multiple times. What we need is a small data frame that acts as a summary, instead of a column that repeats the summary multiple times. No problem – all we need to do is create a smaller data frame when we subset. We’ll use the function data.frame() to do this, which constructs a column for each argument you pass to it.
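If data.frame() is new to you, here’s a toy example (the values are made up) – each named argument becomes a column:
data.frame(Animal = c("cat", "dog"), Legs = c(4, 4))  # builds a 2-row, 2-column data frame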
female <- subset(ing, Sex == "f") ## create a data frame of female speakers
f.prop <- mean(female$DepVar == "In") ## find proportion of "In", save it as a variable
f.sum <- data.frame(Gender = "f", Prop = f.prop) ## create summary data frame
male <- subset(ing, Sex == "m") ## create a data frame of male speakers
m.prop <- mean(male$DepVar == "In") ## find proportion of "In"
m.sum <- data.frame(Gender = "m", Prop = m.prop) ## create summary data frame
## put the two summary dfs together into one summary df by binding the rows with rbind()
sum.ing <- rbind(f.sum, m.sum)
head(sum.ing)
## Gender Prop
## 1 f 0.5000000
## 2 m 0.5126476
ggplot(sum.ing, aes(x = Gender, y = Prop)) +
geom_bar(stat="identity")
Turns out, men and women in this data set use “In” at roughly the same proportion.
But this takes a lot of work: splitting the data frame, applying a function, summarising the results, and combining everything back together before we can plot informative graphs. Luckily, there’s a function family that exists to make this process easier and more intuitive: dplyr, by our pal Hadley Wickham.
I’m calling dplyr a family of functions because it comprises a set of its own functions and uses its own syntax to accomplish data manipulation goals. First let’s install and load dplyr.
install.packages("dplyr")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
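Those messages just mean that dplyr’s versions of these functions now take precedence, so plain filter() refers to dplyr’s filter(). If you ever need one of the originals, you can call it with its package prefix – a quick sketch:
stats::filter(1:10, rep(1/3, 3))   # stats' moving-average filter, not dplyr's
dplyr::filter(ing, Sex == "f")     # dplyr's row-subsetting filter, same as plain filter()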
The %>% operator
To begin with, we see that dplyr makes use of a new operator, %>%. This is called a forward pipe, or “pipe”, and it works by taking the object on the left and piping it through the function on the right. An easy way to see this in action is to try it out with my go-to function head():
ing %>% head()
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
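The pipe just hands the object on its left to the function on its right as that function’s first argument, so these two lines are equivalent (a minimal sketch):
head(ing, 3)      # the usual way of calling head()
ing %>% head(3)   # the same call, written with the pipe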
The usefulness of %>% becomes more apparent when we’re dealing with data frame manipulation rather than base R functions. Let’s see what the dplyr code for summarising the proportion of “In” looks like.
ing %>%
group_by(Sex) %>%
summarise(Prop = mean(DepVar == "In"))
## Source: local data frame [2 x 2]
##
## Sex Prop
## 1 f 0.5000000
## 2 m 0.5126476
Here, what we’ve done is:
1. Taken the ing data frame,
2. Split it up by gender, and
3. Calculated the proportion of “In” tokens.
dplyr automatically combines the results into a single data frame. We can assign this data frame to its own variable and plot it:
new.ing <- ing %>%
group_by(Sex) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(new.ing, aes(x = Sex, y = Prop)) +
geom_bar(stat = "identity")
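As a side note, ggplot() also takes a data frame as its first argument, so you can skip the intermediate variable and pipe the summary straight into the plot – a sketch of the same graph:
ing %>%
  group_by(Sex) %>%
  summarise(Prop = mean(DepVar == "In")) %>%
  ggplot(aes(x = Sex, y = Prop)) +
  geom_bar(stat = "identity")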
At this point, I’m going to take a step back and provide a list of some of the basic functions in dplyr.
Function | Does |
---|---|
filter() | Select just the rows you’re interested in |
group_by() | Splits the data according to the arguments. If there are multiple arguments, it will split first by the first argument, then by the second. |
select() | Select just the columns you’re interested in. |
mutate() | Add new columns |
summarise() | Summarise the data |
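To make that table concrete, here’s a sketch chaining several of these verbs together on the ing data (output omitted; all the column names are ones we saw in head()):
ing %>%
  filter(Style == "careful") %>%       # keep only the careful-speech tokens
  select(Token, DepVar, Sex) %>%       # keep just these three columns
  mutate(is_in = DepVar == "In") %>%   # add a logical column marking "In" tokens
  group_by(Sex) %>%                    # split by gender
  summarise(Prop = mean(is_in))        # proportion of "In" per gender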
So the rates of “In” by gender aren’t very informative. But what about grammatical category? (Spoiler alert – we expect grammatical category to have a large effect on rates of “In”!)
Thanks to dplyr, it’s easy to ask any of these questions we might have!
gram_ing <- ing %>%
group_by(GramStatus) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(gram_ing, aes(x = GramStatus, y = Prop)) +
geom_bar(stat = "identity")
What if men and women use different proportions for each category?
gram_gend_ing <- ing %>%
group_by(GramStatus, Sex) %>%
summarise(Prop = mean(DepVar == "In"))
ggplot(gram_gend_ing, aes(x = GramStatus, y = Prop, fill = Sex)) +
geom_bar(stat = "identity", position = "dodge")
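If you also want the token count behind each bar (handy for the challenges below), summarise() can compute several summaries at once, and n() counts the rows in each group. Here’s a sketch (output omitted) adding a count column, which I’m calling N:
ing %>%
  group_by(GramStatus, Sex) %>%
  summarise(Prop = mean(DepVar == "In"),   # proportion of "In" in the group
            N = n())                       # number of tokens in the group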
This is primarily a dplyr challenge, with elements of ggplot added in. This documentation of geom_errorbar and this Stack Overflow question on calculating standard error will be helpful. Hint: answers with more upvotes are generally better.
This is entirely a ggplot challenge. Hint: first figure out how to plot a scatter plot of categorical data, then add a statistics layer that will smooth between the categorical data. When you build the plot, geom_smooth will report which smoothing method it picked:
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.