Plan:

**Revision** Consolodate visualisation skills by going over homework solutions

**New** Explore ways to summarise numerical data

- Characterising a sample
- range
- mean
- median

- Variance and standard deviation
- sd

- Distributions and histograms
- rnorm
- runif

Integer division: `%/%`

- This gives you the whole number part of a division

`7 %/% 2`

`[1] 3`

You can do this with a series of numerals

`1:10 %/% 3`

` [1] 0 0 1 1 1 2 2 2 3 3`

Modulo `%%`

:

- This gives you the remainder after youâ€™ve done division.

`17 %/% 3`

`[1] 5`

`17 %% 3`

`[1] 2`

This is often used to tell if a number is even or odd:

```
#numbers <- 1:10
numbers <- c(4,45,2,46,96,75,32)
numbers %% 2
```

`[1] 0 1 0 0 0 1 0`

`numbers %% 2 == 1`

`[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE`

Selecting the odd numbers

`numbers[numbers %% 2 == 1]`

`[1] 45 75`

Selecting the even numbers

`numbers[numbers %% 2 == 0]`

`[1] 4 2 46 96 32`

**NOTE** that this only works for integers (whole numbers); R often works with `numeric`

values, which can also be decimals.

- cf.Â the
`is.integer()`

and`as.integer()`

functions

Weâ€™ll look at this again later, but hereâ€™s a reminder about the `runif()`

function to generate random numbers

`runif(10) # by default from 0 to 1`

` [1] 0.3561 0.8718 0.1580 0.7354 0.9886 0.6004 0.7171 0.9770 0.4326 0.7649`

`runif(10, 0, 100) # or specify min max`

` [1] 82.735 8.145 84.839 8.281 8.890 9.847 54.866 13.696 51.634 77.782`

`as.integer(runif(10,0,100))`

` [1] 1 42 62 31 35 84 84 52 90 31`

- Make a function to calculate the mean
- needs
`sum`

and`length`

- needs

`values <- runif(30, 17, 987)`

`sum(values) / length(values)`

`[1] 490.3`

```
# my_func <- function(arg1) {do something with arg1}
custom.mean.func <- function(v) { sum(v)/length(v)}
```

`custom.mean.func(values)`

`[1] 490.3`

`custom.mean.func(1:10)`

`[1] 5.5`

`mean(values) ## the built-in function`

`[1] 490.3`

- Although there is actually a
`mean()`

function that does exactly what youâ€™d expect - Handling NAs

```
values[3] <- NA
values
```

```
[1] 295.28 21.33 NA 243.24 720.07 963.24 494.86 432.98 841.42 801.84 946.57 748.44 92.99 392.27 960.72 423.76 409.10 299.33 839.03 48.92 282.76
[22] 128.44 315.35 496.87 773.54 101.12 652.96 193.01 492.19 412.95
```

`mean(values)`

`[1] NA`

`mean(values, na.rm=TRUE)`

`[1] 476.7`

```
custom.mean.func <- function(v, na.rm=FALSE) {
sum(v, na.rm=na.rm)/length(v[!is.na(v)])}
custom.mean.func(values, na.rm=TRUE)
```

`[1] 476.7`

The middle value of a sorted series; if there are an even number of values then take the value halfway between the middle two.

```
values <- runif(21, 1, 55)
mean(values)
```

`[1] 26.62`

`median(values)`

`[1] 23.72`

```
# sort the values
sort(values)
```

` [1] 2.002 4.365 7.692 11.380 13.674 15.092 16.171 17.316 21.775 21.977 23.724 25.052 27.305 31.939 36.334 41.505 43.295 44.519 45.651 53.712 54.597`

```
# the index for the midpoint is integer-half plus one
(length(values) %/% 2) + 1
```

`[1] 11`

```
# use this number to index the sorted vector
sort(values)[11]
```

`[1] 23.72`

```
# put it all together
sort(values)[(length(values) %/% 2) + 1]
```

`[1] 23.72`

```
odd.median <- function(v) {sort(v)[(length(v) %/% 2) + 1]}
# some tests
values <- runif(101, 0, 10)
odd.median(values)
```

`[1] 4.927`

`median(values)`

`[1] 4.927`

```
values <- runif(100, 0, 10)
odd.median(values)
```

`[1] 5.355`

`median(values)`

`[1] 5.3`

```
even.median <- function(v){
v.1 = sort(v)[length(v) %/% 2]
v.2 = sort(v)[(length(v) %/% 2) + 1]
(v.1 + v.2) / 2
}
even.median(values)
```

`[1] 5.3`

`median(values)`

`[1] 5.3`

```
universal.median <- function(v){
if (length(v) %% 2 == 1) { ## ODD
sort(v)[(length(v) %/% 2) + 1]
} else { ## EVEN
v.1 = sort(v)[length(v) %/% 2]
v.2 = sort(v)[(length(v) %/% 2) + 1]
(v.1 + v.2) / 2
}
}
```

Letâ€™s run some tests

```
library(testthat)
for (i in 10:20){
values <- runif(i, 0, 100)
#print(universal.median(values) == median(values))
expect_equal(universal.median(values), median(values))
}
```

- Make your own function to calculate the median
`sort`

function`%%`

and`%/%`

(to carry out calculations over the indexes of the vector)

- Compare the output of your function to the built-in
`median`

function using the`testthat`

package - When is mean useful and when is median useful

```
income <- runif(100, 1000, 1500)
mean(income)
```

`[1] 1264`

`median(income)`

`[1] 1263`

But now someone really wealthy joins

```
income.2 <- c(income, 1000000)
mean(income.2)
```

`[1] 11153`

`median(income.2)`

`[1] 1264`

Knowing the range of a set of numbers can be useful

- min
- max
- range

Test these all out

```
values <- runif(10, 0, 100)
min(values)
```

`[1] 8.117`

`max(values)`

`[1] 87.95`

`range(values)`

`[1] 8.117 87.952`

Problem: doesnâ€™t handle outliers well Solution: look at the variance

The following shows how much each of these values differs from the mean value

`values`

` [1] 58.553 66.401 53.200 78.226 39.967 49.443 10.270 87.952 8.117 63.363`

`values - mean(values)`

` [1] 7.004 14.851 1.651 26.677 -11.583 -2.106 -41.279 36.402 -43.432 11.814`

Problem if we want the average difference from the mean, since some are positive and some are negative:

`sum(values - mean(values))`

`[1] 0`

Solution is to square the values (in addition this has the generally desirable side effect of emphasising the larger values over the smaller values)

`values - mean(values)`

` [1] 7.004 14.851 1.651 26.677 -11.583 -2.106 -41.279 36.402 -43.432 11.814`

`(values - mean(values))^2`

` [1] 49.056 220.567 2.726 711.661 134.156 4.437 1703.935 1325.135 1886.363 139.574`

```
# The average square of the difference from the mean
sum((values - mean(values))^2) / length(a.c)
```

`Error: object 'a.c' not found`

The average square of the difference from the mean is called the *population variance*

**This assumes that we are working with population, not a sample.** The population is everything, the sample is a random subset.

Summary

- \(\mu\) is the population mean (can be difficult or impossible to know)
- We could look at how much each value differs from the mean:

The average of this would be a nice statement of how far off the mean the values tend to be: \[\frac{\Sigma(x_i - \mu)}{N}\] But thereâ€™s a problem, because these add up to zero

i.e.Â zero (sometimes this will give an extremely small number instead of zero â€” the precision of a computer is **less** than your human precision since the computer has to convert all these values to binary)

The solution: square it \[\sigma^2 = \frac{\Sigma(x_i - \mu)^2}{N}\]

- Guaranteed positive values
- Big differences from the mean are more heavily weighted than small differences
- Note \(\sigma^2\) is the symbol for
*population variance*

- var

What if we donâ€™t know the true mean? If we have a sample of the members of the population rather than the whole population.

- We use the sample mean rather than true mean in the equation
- We replace N by N - 1 (mathematical proof is hard, but basicially: the sample mean is always closer to the middle of the sample than the true mean, so using the population variance formula on a sample
*underestimates*the value of the variance. Dividing by N-1 rather than by N*increases*the estimate of the variance slightly to compensate for this)

\[s^2 = \frac{\Sigma(x_i - \bar{x})^2}{N - 1}\]

- The symbol for
*sample variance*is \(s^2\) - The r function
`var()`

can be used to calculate the sample variance

- sd

Variance is kind of abstract

The solution is simply to take the square root. This is called the *standard deviation*

Usually we use the *sample standard deviation*: \[s = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{N - 1}}\]

There is also a *population standard deviation*, for when you are working with *all* the values that exist: \[\sigma = \sqrt{\frac{\Sigma(x_i - \mu)^2}{N}}\] - The r function `sd()`

calculates the *sample standard deviation* (see `help(sd)`

) - Just for practice, make your own function to calculate sample standard deviation and compare it to the built-in function (using `testthat`

package) - Make a function to calculate population standard devation and compare the values to sample standard deviation

Galton Board

In a normal distribution: - 68.2% of values within one standard deviation - 95.4% within two standard deviations - 99.7% within 3 - 99.99% within 4

Normal distributions can be described by two parameters, mean and standard deviation. The `rnorm()`

function generates vectors of random numbers accordingn to a normal distribution.

- Read the help for the
`rnorm()`

function. Generate some vectors of normally distributed random numbers given different means and standard deviations. - Graph them and work out their means and standard deviations from the vectors

- rnorm(N, mean, sd)

Hereâ€™s how you can visualise these:

A small sample will be jagged, but the larger the sample the smoother it gets:

There are many kinds of distributions.

- runif(N, min, max)

In the *Uniform distribution* there is an equal chance of getting any value between the minimum and the maximum

```
library(babynames)
library(stringr)
```

Annaâ€™s name plot compares four dimensions (year, popularity, name, sex)

```
babynames%>%
filter(name=="Taylor"|name=="Zachary"|name=="Isaac")%>%
ggplot(aes(x=year, y=n, colour=name, linetype=sex)) + geom_line()
```

Plotting mean name length of boys and girls (adapted from Bror-Magnus)

```
babynames %>%
group_by(year, sex) %>%
summarise(mean.length=sum(n*nchar(name))/sum(n)) %>%
ggplot(aes(x=year, y=mean.length, colour=sex)) + geom_line()
```

Majaâ€™s investigation of Chantal variants

- testing the regular expression:

`babynames$name %>% unique() %>% str_subset("[S|C+]c?+h?+a+n{1,}+t+[a{1,}e{1,}]+l{1,}+e?") `

```
[1] "Chantal" "Chantel" "Chantell" "Chantelle" "Shantel" "Chantele" "Shantell" "Shantelle" "Chantale" "Chantella" "Shantele" "Shantella"
[13] "Shantal" "Shantale" "Chantalle" "Schantel" "Shantala" "Chanteal" "Chantall" "Shantela" "Shantall" "Shanteal" "Shantalle"
```

- plotting:

Linneaâ€™s Game of Thrones plot. Note the use of `annotate`

to put the label onto an independent layer:

`filter(babynames, name=="Daenerys" | name=="Cersei" | name=="Arya" | name=="Sansa" | name=="Margaery" | name=="Ygritte" | name=="Brienne" | name=="Robb" | name=="Theon" | name=="Sandor" | name=="Tormund" | name=="Gendry" | name=="Jorah" | name=="Joffrey")%>% filter(year%in%2000:2015) %>% ggplot(aes(x=year, y=log(n), colour=name, linetype=sex)) + geom_line() + annotate("text", hjust=0, label="GoT release", x=2011, y=5.2) + geom_vline(xintercept=2011)`