Plan:
Revision Consolodate visualisation skills by going over homework solutions
New Explore ways to summarise numerical data
Integer division: %/%
7 %/% 2
[1] 3
You can do this with a series of numerals
1:10 %/% 3
[1] 0 0 1 1 1 2 2 2 3 3
Modulo %%
:
17 %/% 3
[1] 5
17 %% 3
[1] 2
This is often used to tell if a number is even or odd:
#numbers <- 1:10
numbers <- c(4,45,2,46,96,75,32)
numbers %% 2
[1] 0 1 0 0 0 1 0
numbers %% 2 == 1
[1] FALSE TRUE FALSE FALSE FALSE TRUE FALSE
Selecting the odd numbers
numbers[numbers %% 2 == 1]
[1] 45 75
Selecting the even numbers
numbers[numbers %% 2 == 0]
[1] 4 2 46 96 32
NOTE that this only works for integers (whole numbers); R often works with numeric
values, which can also be decimals.
is.integer()
and as.integer()
functionsWe’ll look at this again later, but here’s a reminder about the runif()
function to generate random numbers
runif(10) # by default from 0 to 1
[1] 0.3561 0.8718 0.1580 0.7354 0.9886 0.6004 0.7171 0.9770 0.4326 0.7649
runif(10, 0, 100) # or specify min max
[1] 82.735 8.145 84.839 8.281 8.890 9.847 54.866 13.696 51.634 77.782
as.integer(runif(10,0,100))
[1] 1 42 62 31 35 84 84 52 90 31
sum
and length
values <- runif(30, 17, 987)
sum(values) / length(values)
[1] 490.3
# my_func <- function(arg1) {do something with arg1}
custom.mean.func <- function(v) { sum(v)/length(v)}
custom.mean.func(values)
[1] 490.3
custom.mean.func(1:10)
[1] 5.5
mean(values) ## the built-in function
[1] 490.3
mean()
function that does exactly what you’d expectvalues[3] <- NA
values
[1] 295.28 21.33 NA 243.24 720.07 963.24 494.86 432.98 841.42 801.84 946.57 748.44 92.99 392.27 960.72 423.76 409.10 299.33 839.03 48.92 282.76
[22] 128.44 315.35 496.87 773.54 101.12 652.96 193.01 492.19 412.95
mean(values)
[1] NA
mean(values, na.rm=TRUE)
[1] 476.7
custom.mean.func <- function(v, na.rm=FALSE) {
sum(v, na.rm=na.rm)/length(v[!is.na(v)])}
custom.mean.func(values, na.rm=TRUE)
[1] 476.7
The middle value of a sorted series; if there are an even number of values then take the value halfway between the middle two.
values <- runif(21, 1, 55)
mean(values)
[1] 26.62
median(values)
[1] 23.72
# sort the values
sort(values)
[1] 2.002 4.365 7.692 11.380 13.674 15.092 16.171 17.316 21.775 21.977 23.724 25.052 27.305 31.939 36.334 41.505 43.295 44.519 45.651 53.712 54.597
# the index for the midpoint is integer-half plus one
(length(values) %/% 2) + 1
[1] 11
# use this number to index the sorted vector
sort(values)[11]
[1] 23.72
# put it all together
sort(values)[(length(values) %/% 2) + 1]
[1] 23.72
odd.median <- function(v) {sort(v)[(length(v) %/% 2) + 1]}
# some tests
values <- runif(101, 0, 10)
odd.median(values)
[1] 4.927
median(values)
[1] 4.927
values <- runif(100, 0, 10)
odd.median(values)
[1] 5.355
median(values)
[1] 5.3
even.median <- function(v){
v.1 = sort(v)[length(v) %/% 2]
v.2 = sort(v)[(length(v) %/% 2) + 1]
(v.1 + v.2) / 2
}
even.median(values)
[1] 5.3
median(values)
[1] 5.3
universal.median <- function(v){
if (length(v) %% 2 == 1) { ## ODD
sort(v)[(length(v) %/% 2) + 1]
} else { ## EVEN
v.1 = sort(v)[length(v) %/% 2]
v.2 = sort(v)[(length(v) %/% 2) + 1]
(v.1 + v.2) / 2
}
}
Let’s run some tests
library(testthat)
for (i in 10:20){
values <- runif(i, 0, 100)
#print(universal.median(values) == median(values))
expect_equal(universal.median(values), median(values))
}
sort
function%%
and %/%
(to carry out calculations over the indexes of the vector)median
function using the testthat
packageincome <- runif(100, 1000, 1500)
mean(income)
[1] 1264
median(income)
[1] 1263
But now someone really wealthy joins
income.2 <- c(income, 1000000)
mean(income.2)
[1] 11153
median(income.2)
[1] 1264
Knowing the range of a set of numbers can be useful
Test these all out
values <- runif(10, 0, 100)
min(values)
[1] 8.117
max(values)
[1] 87.95
range(values)
[1] 8.117 87.952
Problem: doesn’t handle outliers well Solution: look at the variance
The following shows how much each of these values differs from the mean value
values
[1] 58.553 66.401 53.200 78.226 39.967 49.443 10.270 87.952 8.117 63.363
values - mean(values)
[1] 7.004 14.851 1.651 26.677 -11.583 -2.106 -41.279 36.402 -43.432 11.814
Problem if we want the average difference from the mean, since some are positive and some are negative:
sum(values - mean(values))
[1] 0
Solution is to square the values (in addition this has the generally desirable side effect of emphasising the larger values over the smaller values)
values - mean(values)
[1] 7.004 14.851 1.651 26.677 -11.583 -2.106 -41.279 36.402 -43.432 11.814
(values - mean(values))^2
[1] 49.056 220.567 2.726 711.661 134.156 4.437 1703.935 1325.135 1886.363 139.574
# The average square of the difference from the mean
sum((values - mean(values))^2) / length(a.c)
Error: object 'a.c' not found
The average square of the difference from the mean is called the population variance
This assumes that we are working with population, not a sample. The population is everything, the sample is a random subset.
Summary
The average of this would be a nice statement of how far off the mean the values tend to be: \[\frac{\Sigma(x_i - \mu)}{N}\] But there’s a problem, because these add up to zero
i.e. zero (sometimes this will give an extremely small number instead of zero — the precision of a computer is less than your human precision since the computer has to convert all these values to binary)
The solution: square it \[\sigma^2 = \frac{\Sigma(x_i - \mu)^2}{N}\]
What if we don’t know the true mean? If we have a sample of the members of the population rather than the whole population.
\[s^2 = \frac{\Sigma(x_i - \bar{x})^2}{N - 1}\]
var()
can be used to calculate the sample varianceVariance is kind of abstract
The solution is simply to take the square root. This is called the standard deviation
Usually we use the sample standard deviation: \[s = \sqrt{\frac{\Sigma(x_i - \bar{x})^2}{N - 1}}\]
There is also a population standard deviation, for when you are working with all the values that exist: \[\sigma = \sqrt{\frac{\Sigma(x_i - \mu)^2}{N}}\] - The r function sd()
calculates the sample standard deviation (see help(sd)
) - Just for practice, make your own function to calculate sample standard deviation and compare it to the built-in function (using testthat
package) - Make a function to calculate population standard devation and compare the values to sample standard deviation
Galton Board
In a normal distribution: - 68.2% of values within one standard deviation - 95.4% within two standard deviations - 99.7% within 3 - 99.99% within 4
Standard deviations
Normal distributions can be described by two parameters, mean and standard deviation. The rnorm()
function generates vectors of random numbers accordingn to a normal distribution.
rnorm()
function. Generate some vectors of normally distributed random numbers given different means and standard deviations.Here’s how you can visualise these:
A small sample will be jagged, but the larger the sample the smoother it gets:
There are many kinds of distributions.
In the Uniform distribution there is an equal chance of getting any value between the minimum and the maximum
library(babynames)
library(stringr)
Anna’s name plot compares four dimensions (year, popularity, name, sex)
babynames%>%
filter(name=="Taylor"|name=="Zachary"|name=="Isaac")%>%
ggplot(aes(x=year, y=n, colour=name, linetype=sex)) + geom_line()
Plotting mean name length of boys and girls (adapted from Bror-Magnus)
babynames %>%
group_by(year, sex) %>%
summarise(mean.length=sum(n*nchar(name))/sum(n)) %>%
ggplot(aes(x=year, y=mean.length, colour=sex)) + geom_line()
Maja’s investigation of Chantal variants
babynames$name %>% unique() %>% str_subset("[S|C+]c?+h?+a+n{1,}+t+[a{1,}e{1,}]+l{1,}+e?")
[1] "Chantal" "Chantel" "Chantell" "Chantelle" "Shantel" "Chantele" "Shantell" "Shantelle" "Chantale" "Chantella" "Shantele" "Shantella"
[13] "Shantal" "Shantale" "Chantalle" "Schantel" "Shantala" "Chanteal" "Chantall" "Shantela" "Shantall" "Shanteal" "Shantalle"
Linnea’s Game of Thrones plot. Note the use of annotate
to put the label onto an independent layer:
filter(babynames, name=="Daenerys" | name=="Cersei" | name=="Arya" | name=="Sansa" | name=="Margaery" | name=="Ygritte" | name=="Brienne" | name=="Robb" | name=="Theon" | name=="Sandor" | name=="Tormund" | name=="Gendry" | name=="Jorah" | name=="Joffrey")%>% filter(year%in%2000:2015) %>% ggplot(aes(x=year, y=log(n), colour=name, linetype=sex)) + geom_line() + annotate("text", hjust=0, label="GoT release", x=2011, y=5.2) + geom_vline(xintercept=2011)