Main issues for today

Some homework revision

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(babynames)

Kelly visualisation:

read_tsv("./Swedish-Kelly_M3_CEFR.tsv") %>%
  arrange(desc(`Raw freq`)) %>%
  filter(!is.na(`Raw freq`)) %>%
  filter(WPM != 1000000) %>%
  mutate(Rank=1:length(ID)) %>%
  ggplot(aes(x=log(Rank), y=log(`Raw freq`))) + geom_line()
Parsed with column specification:
cols(
  ID = col_integer(),
  `Raw freq` = col_integer(),
  WPM = col_double(),
  `CEFR levels` = col_character(),
  Source = col_character(),
  Grammar = col_character(),
  `Swedish items for translation` = col_character(),
  `Word classes` = col_character(),
  Examples = col_character()
)

Another version of the same thing in less “idiomatic” tidyverse style (the old-fashioned way). This one plots the raw values, without sorting or taking logs:

kelly <- read_tsv("Swedish-Kelly_M3_CEFR.tsv")
kelly <- filter(kelly, !(is.na(`Raw freq`) | WPM == 1000000))
kelly$rank <- 1:nrow(kelly)
kelly %>% ggplot(aes(rank, `Raw freq`)) + geom_line()

Now plot this again, taking the log values of rank and frequency:

read_tsv("Swedish-Kelly_M3_CEFR.tsv") %>% 
  filter(!(is.na(`Raw freq`) | WPM == 1000000)) %>% 
  mutate(rank=1:nrow(.)) %>% 
  ggplot(aes(log10(rank), log10(`Raw freq`))) + geom_line()

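What’s that “blip”? How can we fix it? The rank is assigned in file order, so wherever the file is not sorted by frequency the curve jumps. Sorting first fixes it: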

# Sort it by Raw freq, descending
kelly %>% arrange(desc(`Raw freq`))
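
To see why the order of the steps matters, here is a minimal sketch on made-up data. The tibble `toy` and its numbers are invented for illustration and are not taken from the Kelly list. If the rank is assigned before sorting it only records file order; assigned after sorting it follows frequency.

```r
library(dplyr)

# Hypothetical word frequencies, deliberately out of order
toy <- tibble(word = c("och", "katt", "att", "hus"),
              freq = c(50, 10, 200, 5))

# Rank assigned in file order: rank 1 is not the most frequent word
unsorted <- toy %>% mutate(rank = row_number())

# Sort first, then rank: frequency now falls as rank rises
sorted <- toy %>% arrange(desc(freq)) %>% mutate(rank = row_number())

sorted
```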

And then redo everything with the fix incorporated

read_tsv("Swedish-Kelly_M3_CEFR.tsv") %>%
  arrange(desc(`Raw freq`)) %>%
  filter(!(is.na(`Raw freq`) | WPM == 1000000)) %>% 
  mutate(rank=1:nrow(.)) %>% 
  ggplot(aes(log10(rank), log10(`Raw freq`))) + geom_line()

Wide and long data

Wide data:

Long data:

head(babynames)

Wide vs. long

Wide format

data <- tibble(row=c("A", "B"), x=1:2, y=3:4, z=5:6)
data

Long format

data %>% gather("column", "value", c("x", "y", "z"))
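
As a rough check that nothing is lost in the conversion, the sketch below gathers the same small table and then spreads it back again. The object names `wide`, `long` and `wide_again` are my own, chosen to avoid overwriting `data`.

```r
library(tidyr)
library(tibble)

wide <- tibble(row = c("A", "B"), x = 1:2, y = 3:4, z = 5:6)

# Wide to long: 2 rows with 3 value columns become 6 rows
long <- gather(wide, key = "column", value = "value", x, y, z)

# Long back to wide: one column per distinct entry in `column`
wide_again <- spread(long, key = column, value = value)

long
wide_again
```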

Loading data directly from Excel format

(this is relatively new, I didn’t know about it earlier)

These Excel files are from Statistics Sweden (Statistiska centralbyrån, SCB).

Look at the spreadsheet and the read_excel documentation:

  • named sheets (we need to select a particular sheet)
  • blank lines at beginning (skip them)
  • column types (can you see what they are?)
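
If you want to experiment with these options before touching the SCB files, here is a small sketch. It assumes the example workbook `datasets.xlsx` that ships with readxl, which has nothing to do with the baby name data. It lists the sheet names, reads one sheet by name, and compares guessed column types with columns forced to text. The `skip` argument is not shown because the example sheet has no blank lines at the top.

```r
library(readxl)

# Example workbook bundled with readxl (an assumption: NOT the SCB file)
path <- readxl_example("datasets.xlsx")

# Names of the sheets in the workbook
sheets <- excel_sheets(path)
sheets

# Pick a sheet by name and let read_excel guess the column types (the default)
iris_guessed <- read_excel(path, sheet = "iris")
sapply(iris_guessed, class)

# col_types = "text" forces every column to be read as character
iris_text <- read_excel(path, sheet = "iris", col_types = "text")
sapply(iris_text, class)
```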

library(readxl)
girls <- read_excel("be0001namntab11_2017.xlsx", sheet = "Flickor", skip = 4)
boys <- read_excel("be0001namntab12_2017.xlsx", sheet = "Pojkar", skip = 4)
head(girls)

Check the column titles:

names(girls)
 [1] "Namn" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[18] "2014" "2015" "2016" "2017"
names(boys)
 [1] "Namn" "1998" "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009" "2010" "2011" "2012" "2013"
[18] "2014" "2015" "2016" "2017"

We want to check that the column names of girls and boys are the same. You can just do one of the following:

names(girls) == names(boys) # expect a long vector of TRUEs
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE

You can also do:

all(names(girls) == names(boys)) # returns TRUE if all the values in the vector are TRUE
[1] TRUE

But a more elegant way to do it is to incorporate tests. The testthat package introduces a bunch of expect_ functions that make your script crash (informatively!) if the expectation is violated.

library(testthat)

Attaching package: ‘testthat’

The following object is masked from ‘package:dplyr’:

    matches

The following object is masked from ‘package:purrr’:

    is_null
expect_equal(names(girls), names(boys))

The test does nothing if it passes. You can incorporate tests into your scripts to make sure nothing unexpected is happening after e.g. you update data.

a <- 1:5
b <- 1:4
expect_equal(a, b)
Error: `a` not equal to `b`.
Lengths differ: 5 is not 4
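
If you would rather not depend on testthat, base R's stopifnot gives a rougher version of the same idea. This is only a sketch of an alternative, not what the script above does, and the two vectors are made up to stand in for names(girls) and names(boys).

```r
# Hypothetical column names standing in for names(girls) and names(boys)
girl_cols <- c("Namn", "1998", "1999")
boy_cols  <- c("Namn", "1998", "1999")

# Silent when the condition holds, stops with an error when it does not
stopifnot(identical(girl_cols, boy_cols))

# Catch the error from a failing check so that we can look at the message
msg <- tryCatch(
  {
    stopifnot(identical(girl_cols, c("Namn", "1998")))
    "check passed"
  },
  error = function(e) conditionMessage(e)
)
msg
```

The error message from stopifnot is less detailed than the one from expect_equal, which is one reason to prefer testthat in longer scripts.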

Assuming things work as expected you can add a column to specify male or female name, and then bind your tables together into a single table.

library(tidyverse)
girls <- girls %>% mutate(sex = "F")
boys <- boys %>% mutate(sex = "M")
data <- rbind(girls, boys)
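
The same label-then-stack pattern on two tiny made-up tables, in case you want to see what happens without the full data. The tables `toy_girls` and `toy_boys` are hypothetical. The sketch uses bind_rows, dplyr's counterpart to rbind, which matches columns by name.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables
toy_girls <- tibble(Namn = c("Anna", "Maja"), `2017` = c(12, 30))
toy_boys  <- tibble(Namn = c("Rune", "Marc"), `2017` = c(4, 9))

# Label each table, then stack them
toy_all <- bind_rows(
  toy_girls %>% mutate(sex = "F"),
  toy_boys %>% mutate(sex = "M")
)

# How many rows carry each label
count(toy_all, sex)
```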

Now take a look:

head(data)

Convert Swedish baby names from wide to long

Back to Swedish baby names. In order to work with this table we need to convert it from wide to long format: there should be a single year column whose values are the old year column headers.

Gathering

This is very important!

The following function gathers all the year columns into two columns: one with the year (the old column header) and one with the value of the cell.

We use as.character because this refers to the column headers, which count as text. If we used numerals what would it mean?

long.data <- gather(data, year, count, as.character(1998:2017)) 
head(long.data)
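
On the question about numerals: a bare number in a column selection is read as a column position, so 1998:2017 would ask for columns number 1998 to 2017, which do not exist. The sketch below shows both ways of selecting on a made-up two-year table (`toy` is hypothetical).

```r
library(tidyr)
library(tibble)

# Hypothetical two-year excerpt in the same shape as the SCB tables
toy <- tibble(Namn = c("Anna", "Maja"), `1998` = c(10, 20), `1999` = c(30, 40))

# Select the year columns by name (character strings)
by_name <- gather(toy, year, count, as.character(1998:1999))

# Select the same columns by position (numerals): columns 2 and 3
by_position <- gather(toy, year, count, 2:3)

# Compare the two results, and check the type of the new year column
all.equal(by_name, by_position)
class(by_name$year)
```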

Note that the year and count columns are shown as character (<chr>) rather than a numeric type. You won’t be able to graph them until you fix this, which you can do with mutate:

long.data <- gather(data, year, count, as.character(1998:2017)) %>% 
  mutate(year=as.numeric(year), count=as.numeric(count))
NAs introduced by coercion
head(long.data)

Now that ggplot recognises the numbers as numeric rather than as character strings, it can plot them:

long.data %>% filter(Namn=="Michael") %>% ggplot(aes(x=year, y=count)) + geom_line()

long.data %>% filter(Namn=="Linnéa") %>% ggplot(aes(x=year, y=count)) + geom_line()

long.data %>% filter(Namn=="Linnéa" | Namn=="Anna" | Namn=="Robert") %>% ggplot(aes(x=year, y=count, linetype=Namn)) + geom_line()

Fixing (“coercing”) character types

Look again at head(data). All the numbers have been imported as characters. Can you guess why?

This is a problem:

values <- c("1", "7", "8?", "-", "not applicable")
values
[1] "1"              "7"              "8?"             "-"              "not applicable"
# You can't do mathematical operations with the character representations of numbers
# values + 1

Use as.numeric to coerce the type of an object to numeric. Anything that can’t be coerced turns into NA (not available)

numeric.values <- as.numeric(values)
NAs introduced by coercion
numeric.values # note NA for "not available"; the warning message
[1]  1  7 NA NA NA
numeric.values + 1
[1]  2  8 NA NA NA

You can do this the other way around too, with as.character (like we did in the gather example above)

There are other as.XXX functions for every other type of object, but you’re less likely to need these.
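
Before you replace or drop the NAs it can help to see which entries failed to convert. A small sketch using the same example vector; the name `failed` is my own.

```r
values <- c("1", "7", "8?", "-", "not applicable")

# suppressWarnings hides the "NAs introduced by coercion" warning
numeric.values <- suppressWarnings(as.numeric(values))

# Entries that were present as text but became NA as numbers
failed <- values[is.na(numeric.values) & !is.na(values)]
failed

# How many values were lost in the conversion
sum(is.na(numeric.values))
```

Running something like this on a column of real data is a quick way to find the odd entries that made the whole column come in as character.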

summarise() and group_by()

summarise() reduces all the rows to one row:

babynames %>% summarise(mean_n=mean(n), median_n=median(n))

group_by splits the rows into groups according to the group_by term/s; summarise then works on each group and returns one row per group (group_by doesn’t make much sense without a summarise or similar after it).

# group_by sex
babynames %>% filter(name=="Michael") %>% group_by(sex) %>% summarise(first_seen=min(year), last_seen=max(year))
our_names = c("Anna", "Bror-Magnus", "Lena", "Linnéa", "Maja", "Marc", "Mervi", "Rima", "Robert", "Rune", "Michael")
babynames %>% filter(name %in% our_names) %>% group_by(name) %>% summarise(total=sum(n))
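
The difference between the two is easiest to see on a table small enough to check by hand. The sketch below uses a made-up tibble (`toy`, with invented counts) in the same shape as babynames.

```r
library(dplyr)

# Hypothetical counts in the same shape as babynames
toy <- tibble(
  name = c("Anna", "Anna", "Robert", "Robert", "Robert"),
  year = c(1990, 1991, 1990, 1991, 1992),
  n    = c(10, 20, 5, 15, 25)
)

# No grouping: five rows collapse into one
overall <- toy %>% summarise(total = sum(n))

# Grouped by name: one row per name
per_name <- toy %>%
  group_by(name) %>%
  summarise(total = sum(n), first_seen = min(year))

overall
per_name
```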

Exercise: what are the mean and median number of male and female names in the data?

You can group by multiple things at once to get every combination

babynames %>% 
  filter(name %in% c("Michael", "Magnus", "Anna", "Maja")) %>% 
  group_by(name, sex) %>% # every combination of name and sex
  summarise(total=sum(n)) 

Spread example

Here’s a chance to use spread. Let’s say we want to look at the ratio of male to female versions of each of these names:

babynames %>% 
  filter(name %in% c("Michael", "Magnus", "Anna", "Maja")) %>% 
  group_by(name, sex) %>% # every combination of name and sex
  summarise(total=sum(n)) %>% 
  spread(sex, total)

We could convert this into a manliness rating for names:

babynames %>% 
  filter(name %in% c("Michael", "Magnus", "Anna", "Maja")) %>% 
  group_by(name, sex) %>% # every combination of name and sex
  summarise(total=sum(n)) %>% 
  spread(sex, total) %>% 
  mutate(manliness=M/(M+F))

Oops, have to change the NAs to 0, because anything + NA is NA

babynames %>% 
  filter(name %in% c("Michael", "Magnus", "Anna", "Maja")) %>% 
  group_by(name, sex) %>% # every combination of name and sex
  summarise(count=sum(n)) %>%  
  spread(sex, count) -> data
# writing an intermediate variable is a clunky way to do it, but I'm not too proud
data$F[is.na(data$F)] <- 0
data$M[is.na(data$M)] <- 0
data %>% 
  mutate(manliness=M/(M+F)) %>% 
  arrange(desc(manliness))
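
A possible way to avoid the intermediate variable is the fill argument of spread, which sets the value used for missing combinations. This is a sketch of an alternative on made-up totals (`toy` is hypothetical), not a replacement for the indexing approach, which is worth knowing in its own right.

```r
library(dplyr)
library(tidyr)

# Hypothetical totals: Maja has no M row, Magnus has no F row
toy <- tibble(
  name  = c("Anna", "Anna", "Maja", "Magnus"),
  sex   = c("F", "M", "F", "M"),
  total = c(90, 10, 50, 40)
)

rated <- toy %>%
  spread(sex, total, fill = 0) %>%     # missing name/sex combinations become 0
  mutate(manliness = M / (M + F)) %>%
  arrange(desc(manliness))

rated
```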

An aside on indexing

values <- c(3,6,17, NA, NA, 5)
values
[1]  3  6 17 NA NA  5
is.na(values)
[1] FALSE FALSE FALSE  TRUE  TRUE FALSE

Indexes in square brackets

values[2]
[1] 6
values[2:4]
[1]  6 17 NA

Assign to indexed vectors

values[1:2] <- -99
values
[1] -99 -99  17  NA  NA   5
values[is.na(values)] <- 0
values
[1] -99 -99  17   0   0   5

Other kinds of logical conditions are also possible

values[values < 0 ] <- NA
values
[1] NA NA 17  0  0  5

Histogram

babynames %>% filter(year==2000) %>% ggplot(aes(x=log10(n))) + geom_histogram()

stringr

Simple character manipulations, see documentation: https://stringr.tidyverse.org/articles/stringr.html

library(stringr) # already attached by library(tidyverse) in this version (see the startup message), but loading it again does no harm
# str_sub(x, start, stop)
our_names = c("Anna", "Bror-Magnus", "Lena", "Linnéa", "Maja", "Marc", "Mervi", "Rima", "Robert", "Rune")
str_sub(our_names, 1, 2)
# positive and negative indices
str_sub(our_names, 1, 1)
str_sub(our_names, -3, -1) %>% str_to_upper()

We can do lots of interesting things with this by adding the output to a new column using mutate.
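
A minimal sketch of that pattern, on a made-up table of names. The tibble `names_tbl` and the column names `first_two`, `last_two` and `n_chars` are hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical table with one name per row
names_tbl <- tibble(name = c("Anna", "Lena", "Marc", "Rune"))

names_tbl <- names_tbl %>%
  mutate(
    first_two = str_sub(name, 1, 2),    # first two characters
    last_two  = str_sub(name, -2, -1),  # last two characters
    n_chars   = str_length(name)        # number of characters in the name
  )

names_tbl
```

Once the pieces are columns, they can be used in filter, group_by and aes like any other variable.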

Exercises:

Try one of:

  • add a column called final_a with TRUE or FALSE values for whether the name has a final a
  • add a column for first_letter
  • explore this variable (e.g. interaction with sex, change over time)

