Main issues for today

Some homework revision

library(tidyverse)
── Attaching packages ─────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 2.2.1     ✔ purrr   0.2.5
✔ tibble  1.4.2     ✔ dplyr   0.7.5
✔ tidyr   0.8.1     ✔ stringr 1.3.1
✔ readr   1.1.1     ✔ forcats 0.3.0
── Conflicts ────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
library(babynames)

Kelly visualisation:

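read_tsv("./Swedish-Kelly_M3_CEFR.tsv") %>%
  arrange(desc(`Raw freq`)) %>%      # most frequent word first
  filter(!is.na(`Raw freq`)) %>%     # drop rows with no frequency
  filter(WPM != 1000000) %>%         # drop rows whose WPM is exactly one million
  mutate(Rank = row_number()) %>%    # rank 1 = most frequent remaining word
  ggplot(aes(x = log(Rank), y = log(`Raw freq`))) + geom_line()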
Parsed with column specification:
cols(
  ID = col_integer(),
  `Raw freq` = col_integer(),
  WPM = col_double(),
  `CEFR levels` = col_character(),
  Source = col_character(),
  Grammar = col_character(),
  `Swedish items for translation` = col_character(),
  `Word classes` = col_character(),
  Examples = col_character()
)
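
For anyone without the TSV to hand, the same steps can be tried on a small made-up table. This is only a sketch: the `toy` tibble and all its values are invented for illustration, and only the column names `ID`, `Raw freq` and `WPM` come from the Kelly file.

```r
library(dplyr)
library(tibble)

# Invented data; only the column names mirror the Kelly list
toy <- tibble(
  ID = 1:5,
  `Raw freq` = c(50L, NA, 400L, 7L, 120L),
  WPM = c(12.5, 1000000, 100, 1.75, 30)
)

ranked <- toy %>%
  arrange(desc(`Raw freq`)) %>%      # most frequent first (NA sorts last)
  filter(!is.na(`Raw freq`)) %>%     # drop rows with no frequency
  filter(WPM != 1000000) %>%         # drop rows whose WPM is exactly one million
  mutate(Rank = 1:length(ID))        # number the remaining rows 1..n

ranked
```

Printing `ranked` shows the four surviving rows in descending frequency order with `Rank` running from 1 to 4, which is the shape of data the plot above is drawn from.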

Another version of the same thing in a less “idiomatic” tidyverse style (the old-fashioned way), here without sorting or logs:

kelly <- read_tsv("Swedish-Kelly_M3_CEFR.tsv")
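# Drop rows with no frequency or with a WPM of exactly one million
kelly <- filter(kelly, !(is.na(`Raw freq`) | WPM == 1000000))
# Number the rows in file order (no sorting here)
kelly$rank <- seq_len(nrow(kelly))
kelly %>% ggplot(aes(rank, `Raw freq`)) + geom_line()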
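
The single `filter()` call here combines, with `|`, the two conditions that the piped version applies in separate calls. A quick check on invented data (the `toy` tibble is hypothetical, not taken from the Kelly file) suggests that the two forms keep the same rows, including when values are missing:

```r
library(dplyr)
library(tibble)

# Invented rows covering each combination of missing and excluded values
toy <- tibble(
  `Raw freq` = c(50L, NA, 400L, 7L, NA),
  WPM = c(12.5, 3, 1000000, NA, 1000000)
)

# Piped style: two separate filters
two_steps <- toy %>%
  filter(!is.na(`Raw freq`)) %>%
  filter(WPM != 1000000)

# Old-fashioned style: one combined condition
one_step <- filter(toy, !(is.na(`Raw freq`) | WPM == 1000000))

all.equal(two_steps, one_step)
```

Note that `filter()` drops rows where the condition evaluates to `NA`, so a row with a missing `WPM` is removed by both forms.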

Now plot this again, taking the log (base 10) of both rank and frequency:

read_tsv("Swedish-Kelly_M3_CEFR.tsv") %>% 
  filter(!(is.na(`Raw freq`) | WPM == 1000000)) %>% 
  mutate(rank=1:nrow(.)) %>% 
  ggplot(aes(log10(rank), log10(`Raw freq`))) + geom_line()

What’s that “blip”? How can we fix it?

# Ranks were assigned in file order; sort by Raw freq (descending) so that rank follows frequency
kelly %>% arrange(desc(`Raw freq`))
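
One way to see why sorting helps: if ranks are handed out in file order, a row whose frequency is higher than the row before it makes the line jump upwards. A minimal sketch on invented numbers (the `toy` tibble is hypothetical, not taken from the Kelly file):

```r
library(dplyr)
library(tibble)

# Invented frequencies with one row out of order (the 300 after the 90)
toy <- tibble(`Raw freq` = c(500L, 200L, 90L, 300L, 40L))

# Rank assigned in file order versus after sorting
unsorted <- toy %>% mutate(rank = 1:nrow(.))
sorted <- toy %>% arrange(desc(`Raw freq`)) %>% mutate(rank = 1:nrow(.))

# A rank-frequency line should never go up as rank increases
any(diff(unsorted$`Raw freq`) > 0)   # TRUE: the out-of-order row causes a jump
any(diff(sorted$`Raw freq`) > 0)     # FALSE: frequency only falls with rank
```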

And then redo everything with the fix incorporated:

read_tsv("Swedish-Kelly_M3_CEFR.tsv") %>%
  arrange(desc(`Raw freq`)) %>%
  filter(!(is.na(`Raw freq`) | WPM == 1000000)) %>% 
  mutate(rank=1:nrow(.)) %>% 
  ggplot(aes(log10(rank), log10(`Raw freq`))) + geom_line()

Wide and long data

Wide data: