Lecture 1, 2018-09-24

VisStat HT18 #1

Course goals

What I want to teach you to do

  • Communicate with data
    • a data science approach rather than a strictly statistical one
  • Employ immediately useful and generally applicable techniques
  • Learn what you need to know in order to learn what you need
  • Take a responsible approach to dealing with data
  • Work top down rather than bottom up

Why I think this is what you need to know

  • Own your own tools
    • Free
    • Auditable
    • Flexible (the question drives the method, not vice versa)
  • Communicate with real statisticians (or recognise the start of the path towards becoming one)

  • Visualising data may be more informative than looking at the numbers
    • Where the picture says ‘no’ and the numbers say ‘yes’, the picture wins!
    • (But maybe not the other way around)
    • Anscombe’s quartet

These graphs have the same summary statistics

  • x mean, y mean, x and y standard deviations, and Pearson’s correlation (Matejka & Fitzmaurice 2017); see the sketch below
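As a quick illustration (not taken from the lecture itself), the following sketch uses R’s built-in anscombe data set to check the claim: the four x/y pairs share near-identical summary statistics but look completely different when plotted.

# Summary statistics for each of the four Anscombe pairs
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x = sd(x), sd_y = sd(y),
    r = cor(x, y))
})

# The same four pairs as scatterplots: one clearly linear, one curved,
# and two dominated by single outlying points
par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}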

A ‘responsible’ approach

  • No magic
    • not more complex than you can understand
    • people who have studied statistics for 20 years exist for a reason
  • Informative
    • this should help us discover scientific truths
  • Reproducible
    • part of the cumulative scientific endeavour (a starting place for other people)
    • possible to properly test

Good enough practices in scientific computing

  1. Data management: saving both raw and intermediate forms, documenting all steps, creating tidy data amenable to analysis.
  2. Software: writing, organizing, and sharing scripts and programs used in an analysis.
  3. Collaboration: making it easy for existing and new collaborators to understand and contribute to a project.

Good enough practices in scientific computing (continued)

  4. Project organization: organizing the digital artifacts of a project to ease discovery and understanding.
  5. Tracking changes: recording how various components of your project change over time.
  6. Manuscripts: writing manuscripts in a way that leaves an audit trail and minimizes manual merging of conflicts.

I firmly believe that R is the best way to achieve these goals

Why R?

  • R: a free, open source statistics language
  • Very widely used in linguistics, psychology, other sciences (an industry standard)
  • Empowering! (own your own tools)
  • Reproducible (scripting rather than point-and-click)

Syllabus

Feedback is welcome

  • This is a new course, and I will be learning what works and what doesn’t (educationally speaking) as I go along.
  • I am also happy to adjust the content to better reflect the interests and needs of the participants.

1 Introduction to RStudio

RStudio is a supercharged calculator, and also the centre of your quantitative world. This week we will focus on basic skills: Running RStudio, getting help, recording your work with markdown notebooks, writing and running scripts, reading and writing data files, and some exploration of basic data types.
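A minimal sketch of the week-1 basics described above (mydata.csv is just a placeholder name, not a course file):

# Basic data types
x <- 42                # numeric
greeting <- "hello"    # character (text)
done <- TRUE           # logical
heights <- c(172, 181, 169)   # a numeric vector

# Reading and writing data files
df <- read.csv("mydata.csv")   # read a comma-separated file into a data frame
str(df)                        # inspect the structure of what we just read
write.csv(df, "mydata-copy.csv", row.names = FALSE)

# Getting help on any function
?read.csv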

2 Good enough practices in scientific computing

The software carpentry analogy: we are like domestic carpenters, not professional cabinetmakers. We can build a functional birdhouse, or a bookcase, but we do not aspire to make a fancy inlaid dresser. This week we will learn how to develop a reproducible, self-documenting workflow. We will also introduce using git (via RStudio) and GitHub for managing data and analysis.

3 Data structures and plotting

Basic R programming with data, and an introduction to the tidyverse, a family of R tools that over the last few years have revolutionised how R is used.
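A first taste of what tidyverse code looks like (a sketch, not the class example), using the mpg data set that ships with ggplot2:

library(tidyverse)   # loads ggplot2, dplyr, tidyr, readr, and friends

# A small pipeline: take the mpg data, keep only compact cars,
# and plot engine size against highway fuel economy
mpg %>%
  filter(class == "compact") %>%
  ggplot(aes(x = displ, y = hwy)) +
  geom_point()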

4 Data wrangling

A lot of our analytic work is what I call ‘data wrangling’: taking raw data and turning it into something that can be analysed. Often, once the data is in the right form, the analysis itself becomes easy. We will learn how to subset data and perform other dataframe manipulations using dplyr.
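For example (a sketch using the starwars data frame that ships with dplyr, not the course data), the main dplyr verbs chain together like this:

library(dplyr)

starwars %>%
  filter(!is.na(height), !is.na(mass)) %>%   # subset rows: drop missing values
  select(name, species, height, mass) %>%    # keep only the columns we need
  mutate(bmi = mass / (height / 100)^2) %>%  # create a new column
  group_by(species) %>%                      # then summarise per group
  summarise(mean_bmi = mean(bmi), n = n()) %>%
  arrange(desc(n))                           # sort by group size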

5 Data visualisation and exploration

Introduction to basic plot types, and the joys of faceted data.
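As a sketch of what faceting buys us (again using ggplot2’s built-in mpg data rather than the class data):

library(ggplot2)

# One small panel per vehicle class, all sharing the same axes
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)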

6 Same or different?

At this point in the course we should be comfortable with working with data: reading, writing, transforming, and visualising. Now we will look at some statistical tests for telling whether values and distributions of values are the same or different. After going through some basic tests, we will focus on the process of discovering for yourself what you need to know in more complex cases.
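A sketch of the kind of basic test we will start from, using R’s built-in sleep data (extra hours of sleep under two drugs); the actual tests covered will depend on the data we work with:

# Are the two groups the same or different?
t.test(extra ~ group, data = sleep)       # Welch two-sample t-test
wilcox.test(extra ~ group, data = sleep)  # non-parametric alternative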

7 Working with similarity measures

Similarity (the flip side of difference) is an important concept in humanities computing. We will learn how to produce useful similarity measures, and how to analyse and visualise them using techniques such as Principal Components Analysis and Multidimensional Scaling.
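A sketch of both techniques on the familiar iris measurements (placeholder data, not the course material):

# Principal Components Analysis on the four numeric columns of iris
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)                             # variance explained by each component
plot(pca$x[, 1:2], col = iris$Species)   # the data in the first two components

# Classical Multidimensional Scaling, starting from a distance matrix
d <- dist(iris[, 1:4])          # pairwise Euclidean distances
mds <- cmdscale(d, k = 2)       # place the observations in two dimensions
plot(mds, col = iris$Species)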

8 Data in the world

Data about humans and human behaviour often has a geographic aspect. We will learn how to plot data on geographic maps and learn some commonly used geostatistical tests.
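One common way to draw such maps is ggplot2 together with the maps package; this is a sketch of that approach, and the exact tools used in class may differ:

library(ggplot2)

# map_data() turns outlines from the 'maps' package into a plain data frame
world <- map_data("world")

# A blank world map; observations with longitude/latitude columns
# can then be layered on top with geom_point()
ggplot(world, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey90", colour = "grey40") +
  coord_quickmap()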

9 What’s what?

An introduction to inferring the underlying order in your data through statistical classification.
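A sketch of the unsupervised flavour of this, clustering the iris measurements while pretending not to know the species labels (the methods actually covered may differ):

set.seed(1)   # k-means starts from random cluster centres
km <- kmeans(scale(iris[, 1:4]), centers = 3)

# How well do the discovered clusters line up with the true species?
table(cluster = km$cluster, species = iris$Species)

# A hierarchical alternative, drawn as a dendrogram
hc <- hclust(dist(scale(iris[, 1:4])))
plot(hc, labels = FALSE)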

10 Putting it all together

Techniques for collaboration. Producing publication-quality graphics. Archiving and publishing research analyses online (using e.g. FigShare). How to report your analysis in a thesis or paper. Other workflows (leaving RStudio for the text editor and command line).
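For the publication-graphics part, a minimal sketch with ggplot2 (the file names are placeholders):

library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# Save the same figure at an explicit physical size, as vector and as bitmap
ggsave("figure1.pdf", plot = p, width = 16, height = 10, units = "cm")
ggsave("figure1.png", plot = p, width = 16, height = 10, units = "cm", dpi = 300)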

Homework

R and RStudio

RStudio tour

  • The console (quick-and-dirty testing)
  • Scripts (how you’ll usually work)
  • R Notebooks (a kind of markdown; this is how we’ll submit class exercises)
    • Other markdown formats (html, pdf, word, slides, etc)

R demonstration

↓ This is what I type in the R console

2+2
[1] 4

↑ This is what is displayed after I press return.

But I will usually use R notebooks for anything complicated so you can follow my working more conveniently.

See 1.notebook.Rmd, 1.notebook.html
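(1.notebook.Rmd itself is not reproduced here; the following is only a minimal sketch of what an R Notebook source file looks like: a YAML header, ordinary markdown text, and runnable R chunks. The title is a placeholder.)

---
title: "Exercise 1"
output: html_notebook
---

Ordinary explanatory text, written in markdown.

```{r}
2 + 2
```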

Running RStudio

Panel layout