---
title: 'Lecture 6: Data exploration'
author: "Michael Dunn, Dept. of Linguistics and Philology, Uppsala University"
date: "Lecture 6, 2018-05-02"
output:
  html_document:
    df_print: paged
---

```{r setup, include=FALSE}
library(tidyverse)
library(babynames)
library(stringr)
```

## Revision

Humans tend to prefer "wide format" data for reading and writing.

![*Wide versus long*](./wide-vs-long.png)

The key to the ggplot (Grammar of Graphics) approach is "long format". In the long format it is easy to map dimensions of your data (the various things you know about an observation) to dimensions of your plot.

![*Mapping dimensions*](./aesthetic.png)

The `gather` function converts wide data to long data. Use it when several columns hold values that should be plotted on a single dimension.

```{r}
data <- tibble(
  speaker=c("Speaker 1", "Speaker 2", "Speaker 3"), 
  Q1=c(1,3,1), 
  Q2=c(2,2,2), 
  Q3=c(1,2,2))
data
```

Gather: 
![](./gather.png)

```{r}
# data %>% gather(key = "key", value = "value",c("The", "Columns", "To", "Gather"))
data %>% gather("question", "score", c("Q1", "Q2", "Q3"))
```

```{r}
data %>% 
  gather("question", "score", c("Q1", "Q2", "Q3")) %>% 
  ggplot(aes(x=question, y=score, colour=speaker)) + geom_jitter(height=0, width=0.1) 
```
(note: `geom_jitter` is like `geom_point` except that it adds a bit of random variation to each value; useful to avoid overplotting)
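
In tidyr 1.0.0 and later, `pivot_longer` is the recommended successor to `gather`. As a sketch that is not part of the original lecture, the same reshaping might look like this. The table is rebuilt under the hypothetical name `survey` so that the chunk runs on its own:

```{r}
library(tibble)
library(tidyr)

# Same toy survey data as above, under a separate (hypothetical) name
survey <- tibble(
  speaker=c("Speaker 1", "Speaker 2", "Speaker 3"), 
  Q1=c(1,3,1), 
  Q2=c(2,2,2), 
  Q3=c(1,2,2))

# names_to and values_to play the roles of gather's key and value arguments
survey_long <- survey %>% 
  pivot_longer(c("Q1", "Q2", "Q3"), names_to="question", values_to="score")
survey_long
```

The rows come out in a different order from `gather` (all questions for one speaker first, rather than all speakers for one question), which makes no difference to the plot.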

Use the `spread` function when you *need* to have two observations in one row, for instance if you need to make a compound score:

```{r}
# The number of male and female babies each year
babynames %>% group_by(year, sex) %>% summarise(total.n=sum(n))
```
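
What if we wanted the *proportion* of female babies? Spreading puts the `F` and `M` totals side by side in one row per year, so that they can be combined in a single calculation: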

```{r}
babynames %>% 
  group_by(year, sex) %>% 
  summarise(total.n=sum(n)) %>%
  spread(sex, total.n)
```

![*Spread*](./spread.png)
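
Once the two totals share a row, the proportion is a single `mutate` away. A minimal sketch, using made-up counts in place of the real `babynames` totals (the names `totals` and `prop.female` are chosen here for illustration):

```{r}
library(tibble)
library(dplyr)
library(tidyr)

# Made-up yearly totals, standing in for the summarised babynames counts
totals <- tibble(
  year=c(1900, 1900, 1901, 1901), 
  sex=c("F", "M", "F", "M"), 
  total.n=c(300, 100, 250, 250))

# After spread() there is one row per year, with columns F and M
prop_female <- totals %>% 
  spread(sex, total.n) %>% 
  mutate(prop.female=F / (F + M))
prop_female
```

Inside `mutate`, `F` refers to the column of that name rather than to R's shorthand for `FALSE`.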

## GREP

- See slideshow

Grep review

1. Are there any four letter names starting "Eri" apart from "Erik" and "Eric"?
2. How many spelling variants of Mary/Maria can you find in a single search? How about Christine/Kristen etc.?
3. Can you think of a way (using what we've learned) to get the *second* character of a string?

## Question 1

Are there any four letter names starting "Eri" apart from "Erik" and "Eric"?

```{r}
# ^ and $ anchor the pattern so that only names of exactly four letters match
babynames %>% pull(name) %>% unique() %>% str_subset("^Eri[^ck]$")
# babynames %>% filter(str_detect(name, "^Eri[^ck]$"))
```
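
To see what each part of such a pattern does, it can help to try it on a few names where the answer is obvious. A small sketch, with a hand-picked vector (`toy_names`, not drawn from `babynames`):

```{r}
library(stringr)

# Hand-picked test names (not drawn from babynames)
toy_names <- c("Eric", "Erik", "Erin", "Eris", "Erica", "Erinn")

# ^Eri    the name starts with "Eri"
# [^ck]   exactly one more character, anything except c or k
# $       and then the name ends
four_letter_eri <- str_subset(toy_names, "^Eri[^ck]$")
four_letter_eri
```

Only "Erin" and "Eris" survive: "Eric" and "Erik" are excluded by the character class, while "Erica" and "Erinn" are too long for the `$` anchor.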

## Question 2

How many spelling variants of Mary/Maria can you find in a single search?

```{r}
all_names <- babynames %>% pull(name) %>% unique() 
all_names %>% str_subset("Mar+(y|ie|ye|i)$")
```

```{r}
all_names %>% str_subset("Ma[aeiouh]?r+[aiey]{1,2}$")
```

How about the same for Christine etc.? Variants to cover:

- with K at the start instead of Ch, 
- C without h, 
- with a at the end

```{r}
all_names %>% str_subset("[CK]h?ristin[ae]$")
```

```{r}
all_names %>% str_subset("Jean(pierre|claude|micha?el)")
```
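
The Christine pattern can be read piece by piece. The sketch below applies it to a few hand-picked spellings (an illustrative vector, not taken from `babynames`) to show which ones it accepts:

```{r}
library(stringr)

# Hand-picked spellings for illustration
toy_names <- c("Christine", "Kristina", "Cristina", "Christin", "Kristen")

# [CK]     one of C or K
# h?       an optional h
# ristin   these letters, literally
# [ae]$    a or e as the final character
variants <- str_subset(toy_names, "[CK]h?ristin[ae]$")
variants
```

"Christin" is rejected because nothing follows the "n", and "Kristen" because it has "risten" rather than "ristin". Catching those would need a looser pattern.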

## Question 3

Can you think of a way (using what we've learned) to get the *second* character of a string?

The non-regex way is best here:

```{r}

str_sub("Eric", 2, 2)
```


```{r}
# Regex alternative: skip the first character and capture the one after it
babynames %>% mutate(second_letter=str_match(name, "^.(.)")[, 2])
```
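
As a quick check that the positional and the regex approach agree, here is a sketch on a small hand-made vector (the names `by_position` and `by_regex` are illustrative):

```{r}
library(stringr)

toy_names <- c("Eric", "Emma", "Bo")

# Positional: the substring from character 2 to character 2
by_position <- str_sub(toy_names, 2, 2)

# Regex: skip one character and capture the next;
# column 2 of the matrix returned by str_match() holds the captured group
by_regex <- str_match(toy_names, "^.(.)")[, 2]

by_position
by_regex
```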

## Facets

Multiple repeat plots — a way of adding one more dimension

There are two facet functions: 

- `facet_wrap` when you want to repeat a plot according to a single dimension
- `facet_grid` when you want to repeat a plot according to two dimensions
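
The difference between the two is easiest to see on a tiny data set. A minimal sketch with made-up values (the tibble `toy` and its columns are invented for illustration):

```{r}
library(tibble)
library(ggplot2)

# Made-up measurements with two grouping variables
toy <- tibble(
  x=rep(1:3, times=4), 
  y=c(1, 2, 3, 2, 2, 2, 3, 2, 1, 1, 3, 1), 
  vowel=rep(c("a", "a", "i", "i"), each=3), 
  stress=rep(c("stressed", "unstressed", "stressed", "unstressed"), each=3))

base_plot <- ggplot(toy, aes(x=x, y=y, colour=stress)) + geom_line()

# One faceting variable: one panel per vowel, wrapped into rows as needed
p_wrap <- base_plot + facet_wrap(~ vowel)

# Two faceting variables: rows by stress, columns by vowel
p_grid <- base_plot + facet_grid(stress ~ vowel)

p_wrap
p_grid
```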

```{r}
babynames %>% 
  mutate(final_vowel=str_extract(name, "[aeiouy]$")) %>%
  filter(!is.na(final_vowel)) %>% 
  group_by(year, sex, final_vowel) %>% 
  summarise(total=sum(n)) %>% 
  ggplot(aes(x=year, y=log10(total))) + geom_line(aes(colour=sex)) + facet_wrap(~ final_vowel)
```

More or less the same thing again, but using `str_sub` instead of a regular expression (this time keeping every final letter, not only the vowels):

```{r}
babynames %>% 
  mutate(final_letter=str_sub(name, -1, -1)) %>%
  group_by(year, sex, final_letter) %>% 
  summarise(total=sum(n)) %>% 
  ggplot(aes(x=year, y=log10(total), colour=sex)) + geom_line() + facet_wrap(~ final_letter)
```

## If-then-else and facet_grid example

* Take the initial consonants, categorise them by *place* and *manner* of articulation, and use **facet_grid** to lay the panels out along those two dimensions
* `map_int`, `map_dbl`, **`map_chr`** (purrr's type-specific equivalents of `sapply`)
* Legends: see http://www.cookbook-r.com/Graphs/Legends_(ggplot2)/

```{r}
get_manner <- function(C) {
  if (C %in% c("B","D","G")){
    return("stop, voiced")
  } else if (C %in% c("P", "T", "K")){
    return("stop, voiceless")
  } else if (C %in% c("M", "N")){
    return("nasal")
  } else if (C %in% c("F", "S", "H")){
    return("fricative, voiceless")
  } else if (C %in% c("V", "Z")){
    return("fricative, voiced")
  } else {return(NA_character_)}
}

get_place <- function(C) {
  if (C %in% c("B", "P", "F", "V", "M")){
    return("bilabial")
  } else if (C %in% c("D", "T", "S", "Z", "N")){
    return("alveolar")
  } else if (C %in% c("G", "K", "H")){
    return("velar etc.")
  } else {return(NA_character_)}
}

babynames %>% 
  mutate(initial_C=factor(str_extract(name, "^[BDGPTKMNFSVZH]"))) %>%
  filter(!is.na(initial_C)) %>%
  group_by(year, sex, initial_C) %>%
  summarise(total=sum(n)) %>%
  mutate(
    manner=factor(map_chr(initial_C, get_manner),
                  levels=c("stop, voiced", "stop, voiceless", "fricative, voiced", "fricative, voiceless", "nasal")), 
    place=factor(map_chr(initial_C, get_place), 
                 levels=c("bilabial", "alveolar", "velar etc."))) %>%
  ggplot(aes(x=year, y=total, colour=sex)) + 
  geom_line() + 
  facet_grid(manner ~ place) + 
  geom_text(x=1900, y=200000, colour="black", aes(label=initial_C)) + 
  labs(title="Popularity of selected initial consonants in given names")
```
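
The reason for `map_chr` is that `if` expects a single `TRUE` or `FALSE`, so a function written with `if`/`else` has to be applied to one element at a time. The sketch below compares `map_chr` with base R's `sapply` on a deliberately simple classifier (`get_voicing` is a hypothetical helper written for this illustration):

```{r}
library(purrr)

# Hypothetical scalar classifier, in the same if/else style as above
get_voicing <- function(C) {
  if (C %in% c("B", "D", "G", "V", "Z")) {
    return("voiced")
  } else {
    return("voiceless")
  }
}

consonants <- c("B", "P", "Z", "S")

# Both calls apply get_voicing() to each element in turn
with_map <- map_chr(consonants, get_voicing)
with_sapply <- sapply(consonants, get_voicing)

with_map
with_sapply
```

The values are the same. The visible difference is that `sapply` names its result after the input strings, while `map_chr` returns a plain character vector and raises an error if the function returns anything other than a single string.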

## Mapping functions

The purrr `map_*` functions apply a function to each element of a vector and return a vector of a fixed type:

- map_int
- map_dbl
- **map_chr**
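
A short sketch of the three variants on a two-element vector. The calculations are chosen only to give one result of each type, and the variable names are illustrative:

```{r}
library(purrr)
library(stringr)

toy_names <- c("Emma", "Bob")

# map_int: the function must return a single integer for each element
name_lengths <- map_int(toy_names, str_length)

# map_dbl: a single number for each element (here the share of lower-case vowels)
vowel_share <- map_dbl(toy_names, ~ str_count(.x, "[aeiou]") / str_length(.x))

# map_chr: a single string for each element
shouted <- map_chr(toy_names, str_to_upper)

name_lengths
vowel_share
shouted
```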

Take the names and put the vowel-final ones in upper case (and all the others in lower case):

```{r}
vfinal2cap <- function(word){
  if (str_detect(word, "[aeiouy]$")){
    return(str_to_upper(word))
  } else {return(str_to_lower(word))}
}
vfinal2cap("Emma")
vfinal2cap("Bob")
babynames %>% pull(name) %>% unique() -> all_names
map_chr(all_names, vfinal2cap) %>% head(24)
```
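
For a simple two-way choice like this one, a vectorised alternative is also available. The following sketch swaps in `dplyr::if_else` in place of the `if`/`else` function and `map_chr` (it is not the approach used in the lecture) and runs on a hand-made vector:

```{r}
library(dplyr)
library(stringr)

toy_names <- c("Emma", "Bob", "Amy", "Eric")

# if_else() tests every element at once, so no map_chr() call is needed
recased <- if_else(str_detect(toy_names, "[aeiouy]$"), 
                   str_to_upper(toy_names), 
                   str_to_lower(toy_names))
recased
```

The `map_chr` route remains the more general one, since the mapped function can contain any amount of branching logic.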






