This is how I went about answering the challenge questions from yesterday. If you got the same answer in a different way, that’s totally fine.
Create a plot to see whether style has an effect on the proportion of “In”. Do men and women behave differently in different speech styles?
To do this, I’ll first re-load the ing data from yesterday, and the packages that I’ll need to manipulate and plot the data.
library(devtools)
library(grammarOfVariationData)
library(dplyr)
library(ggplot2)
head(ing)
## Token DepVar Style GramStatus Following.Seg Sex Age Ethnicity
## 54 going In careful progressive vowel f 2 Irish
## 55 giving In careful progressive vowel f 2 Irish
## 56 upcoming Ing careful adjective vowel f 2 Irish
## 57 going In careful progressive vowel f 2 Irish
## 58 fighting Ing careful participle apical f 2 Irish
## 59 going Ing narrative progressive vowel f 2 Irish
## Prop prop
## 54 0.5 0.5065847
## 55 0.5 0.5065847
## 56 0.5 0.5065847
## 57 0.5 0.5065847
## 58 0.5 0.5065847
## 59 0.5 0.5065847
First, I’ll create a summary data frame using dplyr. Here, I’m taking the ing data frame, splitting it up according to style and gender, and calculating the proportion of In tokens in each style for each gender.
style_gend_ing <- ing %>%
group_by(Style, Sex) %>%
summarise(Prop = mean(DepVar == "In"))
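To see what this chain does on a small scale, here is a minimal sketch on a made-up data frame (the `toy` data and its values are hypothetical, not taken from ing). It relies on the fact that `DepVar == "In"` is a TRUE/FALSE vector, so its mean is the proportion of TRUE values.

```r
library(dplyr)

# Hypothetical toy data, not the real `ing` data set
toy <- data.frame(
  Style  = c("careful", "careful", "careful", "narrative", "narrative", "narrative"),
  Sex    = c("f", "f", "m", "f", "m", "m"),
  DepVar = c("In", "Ing", "In", "In", "Ing", "Ing")
)

toy_summary <- toy %>%
  group_by(Style, Sex) %>%                # one group per Style/Sex combination
  summarise(Prop = mean(DepVar == "In"))  # share of "In" tokens within each group
toy_summary
```

Each row of the result is one Style/Sex combination, which is the shape the plotting code expects.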
Next, I’ll plot it. Yesterday we didn’t assign the outputs of our ggplot() functions to a variable, but in general practice, I prefer to do so. This allows me to call up graph objects later without having to re-run the code. I’ll save this plot as style_gend_plot, since it plots style and gender.
As you can see, I’ve added some totally unnecessary cosmetic arguments to my plot. Within geom_bar, I set the opacity to 70% with alpha for no reason at all, and added a black border around the bars with color = "black" because I like it better. Note that if you’re saving your graphs as objects in your workspace, you have to call them in order to see them in the plotting window.
style_gend_plot <- ggplot(style_gend_ing, aes(x = Style, y = Prop*100, fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender") + ## the legend comes from fill, so label fill
  theme_bw()
style_gend_plot ## have to call the plot separately
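As a minimal sketch of the same idea on made-up data (the `toy` data frame is hypothetical): assigning a plot to a name only stores it, and nothing is drawn until the object is printed. Typing the object's name at the console prints it implicitly; inside a function or a loop, an explicit print() is needed.

```r
library(ggplot2)

# Hypothetical toy data, for illustration only
toy <- data.frame(Style = c("careful", "narrative"), Prop = c(0.4, 0.7))

# Assigning the plot draws nothing; it only stores the plot object
toy_plot <- ggplot(toy, aes(x = Style, y = Prop * 100)) +
  geom_col() +
  theme_bw()

# Draw it explicitly; print() is required inside functions and loops
print(toy_plot)
```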
One point that this plot brings up about geom_bar() defaults is that if there is only one fill category for a given value on the x-axis (e.g., only females talking about language, as we see here), then ggplot will plot that bar twice as wide, which looks terrible.
A solution to this problem is to manually add an “m” row to the summary data frame, with a proportion value of 0. This way ggplot will plot a bar with height 0, instead of the weirdly proportioned single bar that we see above.
style_gend_ing <- ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = mean(DepVar == "In")) %>%
  ungroup()
## add in row for males talking about language; bind_rows() keeps Prop numeric
style_gend_ing <- bind_rows(style_gend_ing,
                            data.frame(Style = "language", Sex = "m", Prop = 0))
style_gend_plot <- ggplot(style_gend_ing, aes(x = Style, y = 100*(as.numeric(Prop)), fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender",
       title = "Percent 'IN' by Speech Style and Gender") +
  theme_bw()
style_gend_plot
This seems to have fixed the problem. Of course, it’s worth remembering that a bar of height 0 is actually a little misleading here, since it implies that males used “In” 0% of the time when they were talking about language. In reality there is no value there at all (an NA rather than a 0), because we don’t have any data where males are talking about language. ggplot’s default is to make this gap apparent by extending the width of the remaining bar. I don’t like that option because I think it looks bad, but creating a dummy row isn’t necessarily ideal either, since it is misleading.
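One way to keep the bar widths equal without inventing a row is sketched below. It assumes a reasonably recent version of ggplot2, where position_dodge() accepts a preserve argument (older releases do not have it), and it uses a made-up `toy` summary rather than the real data.

```r
library(ggplot2)

# Hypothetical summary with no row for "m" in the "language" style
toy <- data.frame(
  Style = c("careful", "careful", "language"),
  Sex   = c("f", "m", "f"),
  Prop  = c(0.5, 0.6, 0.3)
)

single_plot <- ggplot(toy, aes(x = Style, y = Prop * 100, fill = Sex)) +
  # preserve = "single" keeps every bar the width of a single dodged bar
  geom_col(position = position_dodge(preserve = "single")) +
  labs(x = "Speech Style", y = "Percent 'In'", fill = "Gender") +
  theme_bw()
single_plot
```

With this approach the missing combination simply shows up as an empty slot, so no 0% value is implied.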
Some solutions might include:
I prefer 2, since I think it’s most informative, but there’s always a trade-off between informativity and unnecessarily cluttering the graph. This takes another step in the dplyr chain, to add the number of tokens for each category. We can accomplish this with length(), which we saw in base R, or with n(), which is a dplyr-specific function for finding the number of rows (here, tokens) in each group.
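style_gend_ing <- ing %>%
  group_by(Style, Sex) %>%
  summarise(Prop = mean(DepVar == "In"),
            Count = n()) %>% ## add in token count
  ungroup()
## add in row for males talking about language; bind_rows() keeps Prop and Count numeric
style_gend_ing <- bind_rows(style_gend_ing,
                            data.frame(Style = "language", Sex = "m", Prop = 0, Count = 0))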
style_gend_plot <- ggplot(style_gend_ing, aes(x = Style,
                                              y = 100*(as.numeric(Prop)),
                                              fill = Sex,
                                              ## define label for geom_text
                                              label = paste("n=", Count, sep = ""))) +
  geom_bar(stat = "identity", position = "dodge", alpha=.7, color = "black") +
  ## dodge to match columns, vertical adjust to be above them.
  geom_text(position = position_dodge(width = .9), vjust = -0.5, size = 3) +
  labs(x = "Speech Style", y = "Percent `In`", fill = "Gender",
       title = "Percent 'IN' by Speech Style and Gender") +
  theme_bw()
style_gend_plot
## ymax not defined: adjusting position using y instead
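The two ways of counting tokens mentioned above can be compared side by side. This is a minimal sketch on a hypothetical `toy` data frame; the column names Count_n and Count_length are made up for the comparison.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(
  Style  = c("careful", "careful", "careful", "narrative", "narrative"),
  DepVar = c("In", "Ing", "In", "Ing", "Ing")
)

toy_counts <- toy %>%
  group_by(Style) %>%
  summarise(Count_n      = n(),            # dplyr helper: rows in the current group
            Count_length = length(DepVar)) # base R: length of the group's column
toy_counts
```

Both columns come out the same; n() just saves you from having to name a column.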
Okay so that was supposed to be the easy challenge - but of course the problem with making nice graphs is that there’s endless room for tinkering!
Add error bars to the grammatical category by gender plot.
This documentation of geom_errorbar and this Stack Overflow question on calculating standard error will be helpful. Hint: answers with more upvotes are generally better.
This problem required a little bit of online searching for solutions. Just adding geom_errorbar() to your original ggplot call will give you an error message, which means you have to check the documentation of geom_errorbar() to find out what’s missing. You can see that geom_errorbar() requires x, ymax, and ymin to be specified. Furthermore, we can see in the example plot that ymax is computed from standard error values that are already in the data frame.
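As a minimal sketch of what geom_errorbar() needs, here is a made-up example (the `toy` data frame and its Mean and SE columns are hypothetical). The x aesthetic is inherited from the ggplot() call, and ymin and ymax are supplied in the layer's own aes().

```r
library(ggplot2)

# Hypothetical means and standard errors, for illustration only
toy <- data.frame(
  Group = c("a", "b"),
  Mean  = c(10, 14),
  SE    = c(1.5, 2)
)

toy_err_plot <- ggplot(toy, aes(x = Group, y = Mean)) +
  geom_col() +
  # ymin and ymax are computed inline from columns of the data frame
  geom_errorbar(aes(ymin = Mean - SE, ymax = Mean + SE), width = 0.2) +
  theme_bw()
toy_err_plot
```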
Okay - so this means we need to add the standard error into our summary data frame. No problem, we’ll just ask Google how to find the standard error of the mean in R. That will take us to the Stack Overflow thread about this problem, where we learn that there is no se() function in base R, but that we can easily make our own function. (This brings up an excellent point about Stack Overflow, by the way: it’s not always the first answer that’s the best answer for you!)
So I’ll copy that function that Ian Fellows supplied, and use it to calculate the standard error in my summary data frame.
std <- function(x) sd(x)/sqrt(length(x)) ## define the function std()
gram_gend_ing2 <- ing %>%
  group_by(GramStatus, Sex) %>%
  summarise(Prop = mean(DepVar == "In"),
            SE = std(DepVar == "In")) %>% ## use std() to calculate standard error
  mutate(ymax = Prop + SE, ## add ymax and ymin into data frame
         ymin = Prop - SE)
This code adds another step to the dplyr chain, with mutate. One of the amazing features of dplyr is that because of chaining, you can calculate a new variable in your summary data frame with summarise (as I did above with Prop and SE), and then turn around and immediately use that new variable to calculate another variable (here, calculating ymax and ymin using Prop and SE).
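To check what std() returns for TRUE/FALSE data like `DepVar == "In"`, here is a small sketch. The token counts (4 “In” out of 52) are an assumption chosen for illustration, not figures reported in this post. Because sd() divides by n - 1, the helper should agree with the closed form sqrt(p * (1 - p) / (n - 1)) for 0/1 data.

```r
std <- function(x) sd(x) / sqrt(length(x))  # same helper as defined above

# Hypothetical tokens: 4 "In" out of 52, coded TRUE/FALSE
x <- c(rep(TRUE, 4), rep(FALSE, 48))

p <- mean(x)    # proportion of TRUE values
n <- length(x)  # number of tokens

se_helper  <- std(x)
se_formula <- sqrt(p * (1 - p) / (n - 1))  # closed form for 0/1 data
c(helper = se_helper, formula = se_formula)
```

The two values agree, which is a quick sanity check that the copied helper does what we want for proportions.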
head(gram_gend_ing2)
## Source: local data frame [6 x 6]
## Groups: GramStatus
##
## GramStatus Sex Prop SE ymax ymin
## 1 adjective f 0.07692308 0.03731317 0.1142363 0.03960990
## 2 adjective m 0.18750000 0.10077822 0.2882782 0.08672178
## 3 during f 0.71428571 0.18442778 0.8987135 0.52985794
## 4 during m 1.00000000 0.00000000 1.0000000 1.00000000
## 5 gerund f 0.27500000 0.07149951 0.3464995 0.20350049
## 6 gerund m 0.23287671 0.04981147 0.2826882 0.18306524
Looks great, now we just have to plot it! I prefer percentages to proportions, because I think they’re easier to read, so I just need to multiply my proportions by 100 to get the percent. I could have changed this in the underlying data frame (preferable, if I were to continue working on ing), but it’s also possible just to adjust it in the ggplot call:
gram_ing_plot <- ggplot(gram_gend_ing2, aes(x = GramStatus, y = Prop*100, fill = Sex)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(aes(ymax = ymax*100, ymin = ymin*100),
                position=position_dodge(width=.9), width=.5) +
  labs(x = "\nGrammatical Category", y = "Percent 'In'", fill = "Gender",
       title = "Rates of 'In' by Gender and Grammatical Category \n") +
  scale_fill_brewer(breaks = c("f", "m"),
                    labels = c("Female", "Male"),
                    palette = "Paired") +
  theme_bw()
gram_ing_plot
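Another way to get percentages on the axis, sketched here on made-up data, is to leave the values as proportions and only change how the axis labels are printed. This assumes the scales package is installed (it comes along with ggplot2); the `toy` data frame is hypothetical.

```r
library(ggplot2)

# Hypothetical proportions, for illustration only
toy <- data.frame(
  GramStatus = c("adjective", "gerund"),
  Prop       = c(0.1, 0.25)
)

pct_plot <- ggplot(toy, aes(x = GramStatus, y = Prop)) +
  geom_col() +
  # keep the data as proportions; only the axis labels are shown as percentages
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Grammatical Category", y = "Percent 'In'") +
  theme_bw()
pct_plot
```

If you go this route with error bars, ymin and ymax stay on the proportion scale too, so nothing needs to be multiplied by 100.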
Plot the rate of “In” over time (using Age as a proxy for DOB)
age_ing <- ing %>%
group_by(Age) %>%
summarise(Prop = mean(DepVar == "In"))
## code "In" as 1 and "Ing" as 0 so the smoother estimates the rate of "In"
age_ing_plot <- ggplot(ing, aes(x = Age, y = as.numeric(DepVar == "In"), color = Sex, group = Sex)) +
  #geom_bar(stat = "identity", position = "dodge") +
  geom_point(position=position_jitter(width=.1, height=.02), alpha=.05) +
  stat_smooth()
age_ing_plot
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
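The per-age summary computed above could also be plotted directly, instead of smoothing over the individual tokens. Here is a minimal sketch of that alternative on made-up data; the `toy` data frame is hypothetical, and it assumes that Age is a numeric age-group code.

```r
library(dplyr)
library(ggplot2)

# Hypothetical tokens; Age is assumed to be a numeric age-group code
toy <- data.frame(
  Age    = c(1, 1, 2, 2, 3, 3, 3, 3),
  DepVar = c("In", "In", "In", "Ing", "Ing", "Ing", "In", "Ing")
)

toy_age <- toy %>%
  group_by(Age) %>%
  summarise(Prop = mean(DepVar == "In"))  # rate of "In" per age group

toy_age_plot <- ggplot(toy_age, aes(x = Age, y = Prop * 100)) +
  geom_line() +
  geom_point() +
  labs(x = "Age group", y = "Percent 'In'") +
  theme_bw()
toy_age_plot
```

This shows one point per age group, so it hides how many tokens sit behind each value, whereas the jittered points above keep that visible.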