---
title: 'Lecture 6: Data exploration'
author: "Michael Dunn, Dept. of Linguistics and Philology, Uppsala University"
date: "Lecture 6, 2018-05-02"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
library(tidyverse)
library(babynames)
library(stringr)
```

## Revision

Humans tend to prefer "wide format" data for reading and writing.

![*Wide versus long*](./wide-vs-long.png)

The key to the ggplot (Grammar of Graphics) approach is "long format". In the long format it is easy to map dimensions of your data (the various things you know about an observation) to dimensions of your plot.

![*Mapping dimensions*](./aesthetic.png)

The `gather` function converts wide data to long data. Use it when several columns hold values that should be plotted on a single dimension.

```{r}
data <- tibble(
  speaker=c("Speaker 1", "Speaker 2", "Speaker 3"), 
  Q1=c(1,3,1), 
  Q2=c(2,2,2), 
  Q3=c(1,2,2))
data
```

Gather: 
![](./gather.png)

```{r}
# data %>% gather(key = "key", value = "value",c("The", "Columns", "To", "Gather"))
data %>% gather("question", "score", c("Q1", "Q2", "Q3"))
```
```{r}
data %>% 
  gather("question", "score", c("Q1", "Q2", "Q3")) %>% 
  ggplot(aes(x=question, y=score, colour=speaker)) + geom_jitter(height=0, width=0.1) 
```
Note: `geom_jitter` is like `geom_point`, except that it adds a small amount of random variation to each point's position, which helps to avoid overplotting.
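
In tidyr 1.0.0 and later, `gather` is superseded by `pivot_longer` (and `spread` by `pivot_wider`); the older functions still work. As a minimal sketch, assuming a recent tidyr is installed, the same reshaping can be written as below. The names `long_gather` and `long_pivot` are introduced here only for the comparison.

```{r}
library(tidyverse)

# The same small survey table as above
data <- tibble(
  speaker=c("Speaker 1", "Speaker 2", "Speaker 3"), 
  Q1=c(1,3,1), 
  Q2=c(2,2,2), 
  Q3=c(1,2,2))

# Older interface: key column name, value column name, columns to gather
long_gather <- data %>% gather("question", "score", c("Q1", "Q2", "Q3"))

# Newer interface: columns to pivot, then explicit names for the two new columns
long_pivot <- data %>% 
  pivot_longer(c("Q1", "Q2", "Q3"), names_to="question", values_to="score")

long_pivot
```

Both results contain the same nine rows. Only the row order differs: `gather` stacks the table one column at a time, while `pivot_longer` works through it one speaker at a time.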

Use the `spread` function when you *need* to have two observations in one row, for instance if you need to make a compound score:

```{r}
# The number of male and female babies each year
babynames %>% group_by(year, sex) %>% summarise(total.n=sum(n))
```
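What if we wanted *proportion* female? Spreading `sex` into columns puts the female and male totals for each year in the same row, which is what the calculation needs: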

```{r}
babynames %>% 
  group_by(year, sex) %>% 
  summarise(total.n=sum(n)) %>%
  spread(sex, total.n)
```

![*Spread*](./spread.png)
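
Once the two totals sit in one row, the proportion is a single `mutate` step. The sketch below uses a small made-up table (`totals`, with invented counts) in the same shape as the summarised babynames data, so the arithmetic is easy to check by eye. The column name `prop.female` is an assumption chosen for illustration.

```{r}
library(tidyverse)

# Made-up counts for illustration only, shaped like the year/sex summary above
totals <- tibble(
  year=c(2000, 2000, 2001, 2001),
  sex=c("F", "M", "F", "M"),
  total.n=c(60, 40, 45, 55))

prop_female <- totals %>% 
  spread(sex, total.n) %>%            # one row per year, with columns F and M
  mutate(prop.female=F / (F + M))     # inside mutate, F is the column, not FALSE

prop_female
```

With the real data, the same `mutate` line could be appended to the `spread` pipeline above.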

## GREP

- See slideshow

Grep review

1. Are there any four letter names starting "Eri" apart from "Erik" and "Eric"?
2. How many spelling variants of Mary/Maria can you find in a single search? How about Christine/Kristen etc.?
3. Can you think of a way (using what we've learned) to get the *second* character of a string?

## Question 1

Are there any four letter names starting "Eri" apart from "Erik" and "Eric"?

```{r}
# ^ and $ anchor both ends, so a match is exactly "Eri" plus one letter that is not c or k
babynames %>% pull(name) %>% unique() %>% str_subset("^Eri[^ck]$")
# babynames %>% filter(str_detect(name, "^Eri[^ck]$"))
```
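
To see what each part of the pattern contributes, it can help to try it on a handful of strings where the answer is known in advance. The vector `candidates` below is made up for this purpose and is not taken from babynames.

```{r}
library(stringr)

# A made-up test vector, chosen so that each part of the pattern matters
candidates <- c("Eric", "Erik", "Erin", "Eris", "Erinn", "Erica")

# Anchored at both ends: "Eri" followed by exactly one letter other than c or k
four_letters <- str_subset(candidates, "^Eri[^ck]$")
four_letters

# Without the final $, longer names are allowed through as well
no_end_anchor <- str_subset(candidates, "^Eri[^ck]")
no_end_anchor
```

The negated class `[^ck]` removes "Eric" and "Erik" (and also "Erica", whose fourth letter is c), and the `$` is what restricts the result to four-letter names.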

## Question 2

How many spelling variants of Mary/Maria can you find in a single search?

```{r}
all_names <- babynames %>% pull(name) %>% unique() 
all_names %>% str_subset("Mar+(y|ie|ye|i)$")
```

```{r}
all_names %>% str_subset("Ma[aeiouh]?r+[aiey]{1,2}$")
```

How about the same for Christine etc.? The pattern should allow:

- K at the start instead of Ch
- C without h
- a at the end instead of e

```{r}
all_names %>% str_subset("[CK]h?ristin[ae]$")
```
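
The three requirements map onto three pieces of the pattern: `[CK]` for the first letter, `h?` for the optional h, and `[ae]$` for the final vowel. A quick check on a made-up vector (`variants`, not drawn from babynames) shows which spellings the pattern accepts and which it misses.

```{r}
library(stringr)

# Made-up spellings for illustration
variants <- c("Christine", "Kristina", "Cristina", "Khristine", "Christian", "Kristen")

matched <- str_subset(variants, "[CK]h?ristin[ae]$")
matched
```

"Christian" and "Kristen" are not matched, because the pattern requires the name to end in `ristina` or `ristine`. Catching them would need a broader pattern.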

```{r}
all_names %>% str_subset("Jean(pierre|claude|micha?el)")
```

## Question 3

Can you think of a way (using what we've learned) to get the *second* character of a string?

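The non-regex way is best here:

```{r}
str_sub("Eric", 2, 2)
```

A regular expression can do it too, by skipping the first character and capturing the one that follows:

```{r}
babynames %>% mutate(second_letter=str_match(name, "^.(.)")[, 2])
```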
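
As a small check, the positional approach and a capture-group regex can be compared on a few strings. This is a sketch using a made-up vector `examples`. `str_match` returns a matrix whose first column is the whole match and whose second column is the first capture group.

```{r}
library(stringr)

# Made-up examples for illustration
examples <- c("Eric", "Emma", "Bob")

# By position: characters 2 to 2
by_position <- str_sub(examples, 2, 2)

# By regex: skip one character from the start, capture the next one
by_regex <- str_match(examples, "^.(.)")[, 2]

by_position
by_regex
```

Both give the same answer, but `str_sub` says what is meant more directly.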

## Facets

Multiple repeat plots — a way of adding one more dimension

There are two facet functions: 

- `facet_wrap` when you want to repeat a plot according to a single dimension
- `facet_grid` when you want to repeat a plot according to two dimensions

```{r}
babynames %>% 
  mutate(final_vowel=str_extract(name, "[aeiouy]$")) %>%
  filter(!is.na(final_vowel)) %>% 
  group_by(year, sex, final_vowel) %>% 
  summarise(total=sum(n)) %>% 
  ggplot(aes(x=year, y=log10(total))) + geom_line(aes(colour=sex)) + facet_wrap(~ final_vowel)
```

More or less the same thing again, but using `str_sub` instead of a regular expression:

```{r}
babynames %>% 
  mutate(final_letter=str_sub(name, -1, -1)) %>%
  group_by(year, sex, final_letter) %>% 
  summarise(total=sum(n)) %>% 
  ggplot(aes(x=year, y=log10(total), colour=sex)) + geom_line() + facet_wrap(~ final_letter)
```

## If-then-else and facet_grid example

* Take the consonants, categorise them by *place* and *manner* of articulation, and use **facet_grid**
* `map_int`, `map_dbl`, **`map_chr`** (part of purrr; typed equivalents of `sapply`)
* For legends, see http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/

```{r}
get_manner <- function(C) {
  if (C %in% c("B","D","G")){
    return("stop, voiced")
  } else if (C %in% c("P", "T", "K")){
    return("stop, voiceless")
  } else if (C %in% c("M", "N")){
    return("nasal")
  } else if (C %in% c("F", "S", "H")){
    return("fricative, voiceless")
  } else if (C %in% c("V", "Z")){
    return("fricative, voiced")
  } else {return(NA_character_)}  # character NA, so map_chr always receives a string type
}

get_place <- function(C) {
  if (C %in% c("B", "P", "F", "V", "M")){
    return("bilabial")
  } else if (C %in% c("D", "T", "S", "Z", "N")){
    return("alveolar")
  } else if (C %in% c("G", "K", "H")){
    return("velar etc.")
  } else {return(NA_character_)}  # character NA, so map_chr always receives a string type
}

# Keep names that start with one of the classified consonants,
# total them by year, sex and consonant, then plot one panel per manner/place pair
babynames %>% 
  mutate(initial_C=factor(str_extract(name, "^[BDGPTKMNFSVZH]"))) %>%
  filter(!is.na(initial_C)) %>%
  group_by(year, sex, initial_C) %>%
  summarise(total=sum(n)) %>%
  mutate(
    manner=factor(map_chr(initial_C, get_manner),
                  levels=c("stop, voiced", "stop, voiceless", "fricative, voiced", "fricative, voiceless", "nasal")), 
    place=factor(map_chr(initial_C, get_place), 
                 levels=c("bilabial", "alveolar", "velar etc."))) %>%
  ggplot(aes(x=year, y=total, colour=sex)) + 
  geom_line() + 
  facet_grid(manner ~ place) + 
  geom_text(x=1900, y=200000, colour="black", aes(label=initial_C)) + 
  labs(title="Popularity of selected initial consonants in given names")
```
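
The `if`/`else` functions above handle one consonant at a time, which is why they are wrapped in `map_chr`. A vectorised alternative, sketched here with `dplyr::case_when`, classifies a whole vector in one call. The name `get_manner_vec` is hypothetical and introduced only for this illustration; it is not part of the lecture code.

```{r}
library(tidyverse)

# Hypothetical vectorised counterpart of get_manner (same categories as above)
get_manner_vec <- function(C) {
  case_when(
    C %in% c("B", "D", "G") ~ "stop, voiced",
    C %in% c("P", "T", "K") ~ "stop, voiceless",
    C %in% c("M", "N")      ~ "nasal",
    C %in% c("F", "S", "H") ~ "fricative, voiceless",
    C %in% c("V", "Z")      ~ "fricative, voiced",
    TRUE                    ~ NA_character_)   # anything else is unclassified
}

manners <- get_manner_vec(c("B", "T", "M", "S", "Z", "A"))
manners
```

Inside `mutate`, such a function could be applied to the column directly, without `map_chr`.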

## Mapping functions

Produce a vector from a function

- map_int
- map_dbl
- **map_chr**

Take the names and put those that end in a vowel into upper case (and the rest into lower case):

```{r}
vfinal2cap <- function(word){
  if (str_detect(word, "[aeiouy]$")){
    return(str_to_upper(word))
  } else {return(str_to_lower(word))}
}
vfinal2cap("Emma")
vfinal2cap("Bob")
all_names <- babynames %>% pull(name) %>% unique()
map_chr(all_names, vfinal2cap) %>% head(24)
```
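
The three variants differ only in the type of vector they return: `map_int` gives integers, `map_dbl` gives doubles and `map_chr` gives strings. Each expects the function to return a single value of that type per element. A minimal sketch on a made-up vector `words` follows (the variable names are for illustration only).

```{r}
library(tidyverse)

# Made-up input for illustration
words <- c("Emma", "Bob", "Christine")

# Integer result: number of characters in each word
n_chars <- map_int(words, str_length)

# Double result: share of lower-case vowels in each word
vowel_share <- map_dbl(words, ~ str_count(.x, "[aeiouy]") / str_length(.x))

# Character result: first letter of each word
initials <- map_chr(words, ~ str_sub(.x, 1, 1))

n_chars
vowel_share
initials
```

Functions built around `if`, such as `vfinal2cap`, work on one value at a time, so applying them through `map_chr` is one way to run them over a whole vector.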






