Introduction to Data Analysis with R

Dr Nicola Rennie

Welcome!

Who am I?

Lecturer in Health Data Science within the Centre for Health Informatics, Computing, and Statistics (CHICAS).


Academic background in statistics, with experience in data science consultancy and training.


Using R for over 10 years, and author of multiple R packages.

CHICAS logo

Workshop outline

  • 09:30 - 09:45: Welcome and set up
  • 09:45 - 10:15: Introduction to R and RStudio
  • 10:15 - 10:30: Performing operations in R
  • 10:30 - 11:00: Loading data into R
  • 11:00 - 11:15: BREAK
  • 11:15 - 11:45: Plotting single variables
  • 11:45 - 12:15: Reading help files in R

LUNCH

  • 13:15 - 13:45: Plotting multiple variables
  • 13:45 - 14:15: Computing summary statistics
  • 14:15 - 14:30: BREAK
  • 14:30 - 15:00: Summary tables
  • 15:00 - 15:45: Statistical tests
  • 15:45 - 16:00: Questions and discussion

What to expect during this workshop

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions throughout!


I hope you end up with more questions than answers after this workshop!


Stranger Things questions gif

Source: giphy.com

Workshop resources

Course website: nrennie.rbind.io/training-intro-to-r

Screenshot of course website

What is R?

R is an:

  • open source

  • programming language

  • commonly used for statistical analysis

  • that is widely used in many fields, including psychology and bioinformatics.

R logo

Why R?

  • It’s free and open source - better for reproducibility!

  • Writing code allows you to repeat an analysis more easily.

  • Offers a wide variety of statistical functions.

  • Handles large datasets more efficiently than software like Excel.

What is RStudio?

  • An Integrated Development Environment (IDE) for R

  • A more user-friendly way to write R code

  • Has some nice features that make writing code easier

RStudio logo

Installing R

  • Open AppsAnywhere (if required, sign in with your university username and password).
  • Locate RStudio from the list and launch it.

Appsanywhere screenshot showing RStudio

Installing R

Installing R option 2

Posit website screenshot

  • Click on the Download and Install R button on the left, and install R.

  • Click on the Download RStudio Desktop for … button on the right, and install RStudio. If the option is available, install only for your user but otherwise use the default settings.

  • Open up RStudio from your Start menu.

Performing operations in R

Where do we write code in RStudio?

RStudio screenshot

Basic operations in R

At its most basic level, R is essentially a big calculator:

# Addition
2 + 2

# Subtraction
7 - 4

# Division
5 / 8

# Multiplication
4.2 * 1.3

Comments

# Addition
2 + 2


Tip

Any line that begins with a # is called a comment.

R doesn’t run these lines, but they’re very useful for writing notes to yourself (and other people who read your code!) to explain what your code is doing.

Assigning outputs

We often want to save the output from our code somewhere, so we can use it later:


x <- 2 + 2
x + 3

What are functions?

  • A chunk of code that takes one or more inputs, called arguments,
  • processes them, and
  • returns an output or performs an action.

What are functions?

  • log is the name of the function
  • 2 is the value of the argument x
  • Note the round brackets ()
log(x = 2)

This also works:

log(2)
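As a small illustrative sketch (not taken from the workshop materials), the calls below show how arguments can be matched either by position or by name. The base argument of log() is part of base R; the object names on the left of each <- are made up for this example.

```r
# Natural logarithm of 8: the argument x is matched by position
natural_log <- log(8)

# Logarithm of 8 in base 2, naming both arguments explicitly
log_base_2 <- log(x = 8, base = 2)

# Named arguments can be given in any order
same_result <- log(base = 2, x = 8)

print(natural_log)
print(log_base_2)
print(same_result)
```

Naming arguments is optional, but it often makes code easier to read when a function takes more than one input.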

You can also write your own functions, but we won’t cover this today!

What are packages?

  • A collection of functions, data, documentation, and resources bundled together.

  • An easy way to share code with other people (and use other people’s code!)

  • Some packages come with R when you install it.

  • You can download other packages from different places such as CRAN or GitHub.
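The sketch below shows, under some assumptions, what installing and loading a package typically looks like. The install line is commented out because it needs an internet connection, and {readr} is used only as an example package name. The {stats} package is used for the loading step because it ships with R.

```r
# Installing a package downloads it from CRAN. This only needs doing once per
# computer and requires an internet connection, so it is commented out here.
# install.packages("readr")

# Loading a package makes its functions available in the current R session.
# The stats package comes with R, so no installation is needed.
library(stats)

# A function can also be called using the package::function() syntax
middle_value <- stats::median(c(3, 1, 2))
print(middle_value)
```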

Live Demo!

  • Creating vectors and performing operations on them.

  • Using functions in R.

  • Installing and loading packages.
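A minimal sketch of the first two demo topics, using made-up height values rather than any workshop data:

```r
# Create a numeric vector with c() (short for "combine")
heights_cm <- c(150, 162, 171, 180)

# Arithmetic is applied to every element of the vector
heights_m <- heights_cm / 100

# Functions can take a whole vector as their input
mean_height <- mean(heights_cm)
n_values <- length(heights_cm)

print(heights_m)
print(mean_height)
print(n_values)
```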

Exercise 1

  1. Create a vector called age (like the one below). Multiply each element by 5.
age <- c(15, 25, 32, 87, 12)
  2. What does the sqrt() function do? Apply the sqrt() function to each element of age.

  3. Install the ggplot2 package.

  4. Load the ggplot2 package into R.


Loading data into R

Different types of data

  • CSV
  • Excel
  • SPSS, SAS, and Stata files (read with the {haven} package)

For today, we’ll assume you have the data stored somewhere on your computer. But there are R tools to connect to other remote data sources!

Excel logo

Where do we put files?

  • Let’s make our own lives easier by keeping our files organised!

  • Keeping files organised also makes it easier for R to find them (e.g. our data files).

  • An easy way to do this is using an R project

    • Use a new R project for each analysis project
    • Double click on the project file to open RStudio with your files in the right place.

R projects

R Projects are a special type of file with a .Rproj extension that makes it easier for you to keep all of the data, code, and images for a project in one place.

Open up RStudio, then click File –> New Project –> New Directory –> New Project.

  • Type in the name that you want to call your new folder e.g. Intro to R Workshop. Then use Browse to select where on your computer you want to make the folder. IMPORTANT: remember where this is!

  • Finally, click Create Project. Your new folder will be created and opened in RStudio - sometimes it can take a couple of minutes.

Loading data into R

There are two approaches to loading data into R:

  • Using the point-and-click Import Dataset tool.

  • Writing code.

Using Import Dataset

Screenshot of RStudio Import Dataset window

Copy and paste the code in the bottom right!

Using code to import data

Different packages for different types of files:

  • CSV files: base R or the {readr} package

  • Excel files: {readxl} package (or other packages)

  • SPSS/SAS/Stata files: {haven} package (haven.tidyverse.org)
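As a rough sketch of the code-based approach, the example below writes a tiny made-up CSV file to a temporary location and reads it back in with base R. The column names and values are invented for illustration. The commented-out alternatives use real functions from {readr}, {readxl}, and {haven}, but the file names in them are hypothetical.

```r
# Write a small example CSV to a temporary file so that this sketch is
# self-contained (in practice the file would already be in your project folder)
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("id,age,group", "1,34,A", "2,51,B", "3,47,A"),
  csv_path
)

# Read the CSV file using base R
patients <- read.csv(csv_path)
print(patients)

# Alternatives from other packages (these need installing first):
# patients <- readr::read_csv(csv_path)
# patients <- readxl::read_excel("my_file.xlsx")  # hypothetical Excel file
# patients <- haven::read_sav("my_file.sav")      # hypothetical SPSS file
```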

Live Demo!

  • Read in CSV data using base R or the {readr} package.

  • Read in Excel data using the {readxl} package.

  • Inspect the data visually.

  • Summarise the data.
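The inspection steps might look something like the sketch below. It uses mtcars, a data set that comes with R, as a stand-in for your own data, so the numbers it prints are not from the workshop data.

```r
# mtcars is a data set that comes with R; it stands in for your own data here
head(mtcars)     # first six rows
dim(mtcars)      # number of rows, then number of columns
str(mtcars)      # column names and types
summary(mtcars)  # summary statistics for every column

# The number of rows and columns can also be stored separately
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)

# In RStudio, View(mtcars) opens the data in a spreadsheet-style viewer
```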

Exercise 2

  1. Make an R project file (File –> New Project). Download either the CSV or Excel file of the Hypoxia data from the course website (or bring your own data!) and save it into your R project folder. Load the data into R.

Link: nrennie.rbind.io/training-intro-to-r/exercises.html

  2. Inspect the data using View().

  3. How many rows and columns are in the data?

  4. Create a summary of the data.


Plotting single variables

Why do we visualise data?

Because summary statistics aren’t enough…

Statistic   Dataset A     Dataset B
mean_x      54.2632732    54.2658818
mean_y      47.8322528    47.8314957
sd_x        16.7651420    16.7688527
sd_y        26.9354035    26.9386081
cor_xy      -0.0644719    -0.0686092
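The same point can be illustrated with a different, well-known example: Anscombe's quartet, which ships with R as the anscombe data set. This is a swapped-in example rather than the data behind the table above. The sketch computes near-identical summary statistics for four data sets and then plots them to show how different they look.

```r
# anscombe comes with R and contains four x/y pairs: x1 and y1, ..., x4 and y4
x_means <- colMeans(anscombe[, c("x1", "x2", "x3", "x4")])
y_means <- colMeans(anscombe[, c("y1", "y2", "y3", "y4")])
correlations <- c(
  cor(anscombe$x1, anscombe$y1),
  cor(anscombe$x2, anscombe$y2),
  cor(anscombe$x3, anscombe$y3),
  cor(anscombe$x4, anscombe$y4)
)

# The summary statistics are almost identical across the four data sets
print(round(x_means, 2))
print(round(y_means, 2))
print(round(correlations, 2))

# Plotting shows that the four data sets have very different shapes
old_par <- par(mfrow = c(2, 2))
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x2, anscombe$y2)
plot(anscombe$x3, anscombe$y3)
plot(anscombe$x4, anscombe$y4)
par(old_par)
```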

Plotting different types of variables

Discrete

Categories e.g. types of fruit, hospitals, eye colour, …

  • Bar charts

  • Waffle charts / pictograms

Continuous

Numeric values e.g. height, weight, rain in millimetres, …

  • Box plots

  • Histograms

  • Density plots

Plotting in R

R comes packaged with its own plotting functions (often called base R graphics).

x <- c(1, 2, 3, 4, 5)
y <- c(4, 7, 3, 1, 3)
plot(x, y)


  • Base R graphics are great for taking a quick look at your data.

  • But they can be quite hard to customise, and there’s a more limited set of plot types available.

  • So we’re going to use another package…

ggplot2

  • Most popular visualisation package in R.

  • Lots of extension packages.

  • Build plots in layers.

  • Fairly intuitive way of making plots.

  • Part of the tidyverse!
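To give a flavour of what building a plot in layers means, here is a minimal sketch. It assumes {ggplot2} is already installed and uses the built-in mtcars data rather than the workshop data. The axis labels and title are made up for illustration.

```r
library(ggplot2)

# Layer 1: the data, and which column is mapped to the x-axis
# Layer 2: the type of plot (here, a histogram with 10 bins)
# Layer 3: the axis labels and title
mpg_histogram <- ggplot(data = mtcars, mapping = aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(
    x = "Miles per gallon",
    y = "Number of cars",
    title = "Fuel efficiency of cars in mtcars"
  )

print(mpg_histogram)

# To save the plot as an image (file name is hypothetical):
# ggsave("mpg_histogram.png", plot = mpg_histogram)
```

Each + adds another layer, so a plot can be built up and adjusted one piece at a time.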

ggplot2 logo

The {tidyverse}

  • Collection of R packages for data manipulation, exploration, and visualisation.

  • Functions are named (reasonably) well and use a consistent syntax.

  • Emphasises the idea of tidy data, where data is structured to make analysis more efficient.

We’ll see a few of these packages in action today!

The pipe operator

The {tidyverse} packages encourage use of the pipe operator: %>%.

The pipe operator takes the object on the left hand side of the pipe, and places it as the first argument to the function on the right hand side.

This means that:

View(data)


can be written as:

data %>% View()

The pipe operator

  • You might also see the pipe operator written as |>. There are some small differences between %>% and |>, but none that will affect anything we do today.

  • Use the Ctrl + Shift + M keyboard shortcut in RStudio to add a pipe.

  • Using pipes can save you from saving multiple copies of an object when doing different operations to your data - you’ll see some examples in a second!
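A small sketch of how a pipe changes the shape of code. It uses the native |> pipe (available in R 4.1 and later) so that no packages need loading; swapping in %>% after loading a {tidyverse} package should behave the same way for this example.

```r
values <- c(1, 4, 9, 16)

# Without a pipe: functions are nested and read from the inside out
nested_result <- round(mean(sqrt(values)), 1)

# With a pipe: the same steps read from top to bottom
piped_result <- values |>
  sqrt() |>
  mean() |>
  round(1)

print(nested_result)
print(piped_result)
```

Both versions give the same answer, and neither needs intermediate objects to hold the result of each step.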

Live Demo!

  • Plot a histogram of Age.

  • Create a subset of data.

  • Plot a histogram of Age in the Trust1 organisation.

  • Save plot as an image.

Exercise 3

  1. Create a histogram of the age of all patients in the study. What does the bins argument in geom_histogram() do?

  2. Create a histogram of the age of all female patients in the study. Hint: remember that 0 = male and 1 = female in the Female column.

  3. Create a bar chart of the number of people who had each surgery type (`Type Surg`). Hint: geom_bar().

  4. Bonus: Edit the axis labels using the labs() function. Can you also add a title?


Reading help files in R

Help for functions

Most R functions come with some documentation (help files) and examples. To read the help files for the exp() function:

?exp()

or

?exp

or

help("exp")

Help for packages

To find help for a package:

help("ggplot2")

Vignettes are also sometimes available:

vignette(package = "ggplot2")

To read a specific vignette:

vignette("ggplot2-specs", "ggplot2")

Live Demo!

  • Reading the help files in R.

  • Running examples from help files.
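Two base R functions that can help here are example(), which runs the code from the Examples section of a help file, and args(), which prints the arguments a function accepts. The sketch below is illustrative only; mean() and log() are simply convenient functions to try it on.

```r
# Run the code from the Examples section of the help file for mean()
example("mean")

# Check which arguments a function accepts without opening its help file
args(log)
```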

Other places to find help

Go to the Resources tab on the course website: nrennie.rbind.io/training-intro-to-r/resources.html

Screenshot of course website

Exercise 4

  1. The ggplot2 package is often used for plotting in R. What does the geom_count() function do?

  2. What is the difference between geom_bar() and geom_col() ?

  3. Does this code do what you expect? Can you fix it?

ggplot(
  data = hypoxia,
  mapping = aes(x = `Type Surg`, fill = "blue")
) +
  geom_bar()

Plotting multiple variables

Variable combinations

  • Two continuous variables: Scatter plot, …

  • Two discrete variables: Bar chart, …

  • One continuous + one discrete: Multiple box plots, using colour, …
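The three combinations could be sketched as follows, assuming {ggplot2} is installed and using the built-in mtcars data in place of the workshop data. The choice of columns is arbitrary.

```r
library(ggplot2)

# Two continuous variables: scatter plot
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# One continuous + one discrete variable: one box plot per category.
# cyl is stored as a number, so factor() is used to treat it as categories.
box_plot <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Two discrete variables: bar chart, with bars coloured by a second variable
bar_chart <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar()

print(scatter_plot)
print(box_plot)
print(bar_chart)
```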

Live Demo!

  • Do older patients stay longer?

  • Does it vary by organisation?

  • Changing the order of categories.

Exercise 5

  1. Create a scatter plot of Age (on the x-axis) and BMI (on the y-axis).

  2. Create a boxplot of Age for each surgery type. Hint: make `Type Surg` a factor().

  3. Create a bar chart of the number of people who had each surgery type. Colour the bars based on whether people had diabetes. Hint: should Diabetes be a numeric or a factor?

  4. Bonus: Edit your bar chart to put the bars next to each other instead of stacked on top. Hint: look at the position argument in geom_bar().


Computing summary statistics

What are summary statistics?


Summary statistics

Brief numerical descriptions of a dataset that provide an overview of its main features.

Common summary statistics

Measures of central tendency:

  • mean
  • median
  • mode

Measures of dispersion:

  • variance
  • standard deviation
  • range

…and more!
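As a quick sketch, most of these statistics have a matching base R function. The vector below is made up and includes a missing value (NA) to show the effect of the na.rm argument. Note that base R has no built-in function for the statistical mode, so counting values with table() is used here as one possible workaround.

```r
x <- c(2, 4, 4, 5, NA, 10)

# With a missing value present, the result is also missing
mean_with_na <- mean(x)

# na.rm = TRUE removes missing values before calculating
mean_x <- mean(x, na.rm = TRUE)
median_x <- median(x, na.rm = TRUE)
variance_x <- var(x, na.rm = TRUE)
sd_x <- sd(x, na.rm = TRUE)
range_x <- range(x, na.rm = TRUE)  # smallest and largest value

# One way to find the most common value (returned as text)
mode_x <- names(which.max(table(x)))

print(c(mean = mean_x, median = median_x, variance = variance_x, sd = sd_x))
print(range_x)
print(mode_x)
```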

Live Demo!

  • Calculating the mean and standard deviation in base R.

  • Calculating the mean and standard deviation using the {tidyverse}.

  • Calculating the mean and standard deviation of different categories.
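For summaries by category, one base R option is aggregate(), sketched below with the built-in mtcars data. This is an assumption about approach rather than the method used in the demo: a {tidyverse} version would more likely use group_by() and summarise() from {dplyr}.

```r
# Mean and standard deviation of mpg for each number of cylinders.
# The formula mpg ~ cyl reads as "mpg, grouped by cyl".
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
group_sds <- aggregate(mpg ~ cyl, data = mtcars, FUN = sd)

print(group_means)
print(group_sds)
```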

Exercise 6

  1. For each of the columns `Duration of Surg` and `AHI`, calculate the following summary statistics: mean and standard deviation.

If there are any missing values, calculate the mean of the non-missing values. Hint: Look at the na.rm argument for ?mean.

  2. Repeat the calculations, but group the summary statistics by Surgery Type (`Type Surg`).

  3. Bonus: also calculate the median, minimum, and maximum.


Summary tables

Table 1

In papers, it’s very common to include a summary table (often called Table 1) that provides summaries of:

  • Demographic information
  • Clinical variables
  • Study interventions or outcomes

You could calculate all of the summary statistics as you did in the previous exercise and then copy and paste them into a Word document…

Or R could do it for you!

The {gtsummary} package

  • {gtsummary} helps create well-formatted summary tables, including descriptive statistics and results of statistical tests.

  • Integrates well with {tidyverse} packages.

  • Can save tables directly to Word format.

gtsummary logo

Live Demo!

  • Creating summary tables.

  • Creating grouped summary tables.

  • Exporting to a Word document.

Exercise 7

  1. Create a descriptive table of patient characteristics, which includes the following variables: age, gender, race, smoking.

  2. Are Female and Smoking represented in the table in a way that makes sense? Change the Female and Smoking columns to a factor.

  3. Group the table by Smoking.

  4. Bonus: Change the labels for Smoking to Smoking and No smoking instead of 1 and 0.


Statistical tests

T-tests: different types of tests

Different test variations:

  • One sample t-test: compare the mean of one group to a fixed value
    • e.g. Is the average length of stay equal to 50 days?
  • Two independent samples t-test: compare the means of two groups
    • e.g. Is the average length of stay different for men and women?
  • Paired samples t-test: compare the means of two paired sets of measurements
    • e.g. Is blood pressure the same before and after a new treatment?
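All three variations can be run with the base R function t.test(). The sketch below uses sleep, a data set that comes with R and records the extra hours of sleep for 10 people under each of two drugs. It is a stand-in for the workshop data, and the comparison value of 0 is chosen only for illustration.

```r
# Split the sleep data into the measurements for each drug
drug_1 <- sleep$extra[sleep$group == 1]
drug_2 <- sleep$extra[sleep$group == 2]

# One sample t-test: is the mean for drug 1 equal to 0?
one_sample <- t.test(drug_1, mu = 0)

# Two independent samples t-test: do the two groups have different means?
# This treats the groups as unrelated. By default R uses Welch's version,
# which does not assume equal variances.
two_sample <- t.test(drug_1, drug_2)

# Paired samples t-test: the same 10 people took both drugs
paired <- t.test(drug_1, drug_2, paired = TRUE)

print(one_sample)
print(two_sample)
print(paired)
```

The same numbers give quite different p-values depending on whether the measurements are treated as independent or paired, so it matters which variation matches the study design.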

T-tests: different types of hypothesis

  • Null hypothesis: usually the boring option - there is no difference!

  • Alternative hypothesis:

    • Not equal to
    • Greater than
    • Less than
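In t.test(), the alternative hypothesis is chosen with the alternative argument. A short sketch, again using the built-in sleep data as a stand-in:

```r
drug_1 <- sleep$extra[sleep$group == 1]
drug_2 <- sleep$extra[sleep$group == 2]

# Not equal to (the default): the two means are different
two_sided <- t.test(drug_1, drug_2, alternative = "two.sided")

# Less than: the mean for drug 1 is less than the mean for drug 2
less <- t.test(drug_1, drug_2, alternative = "less")

# Greater than: the mean for drug 1 is greater than the mean for drug 2
greater <- t.test(drug_1, drug_2, alternative = "greater")

print(c(
  two_sided = two_sided$p.value,
  less = less$p.value,
  greater = greater$p.value
))
```

The alternative hypothesis should be chosen before looking at the data, not picked afterwards to get a smaller p-value.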

T-tests: what are we assuming?

Assumptions:

  • Observations are continuous

  • Observations are independent

  • Variance of two groups is equal (we can test for this and do a different type of test if needed)

  • Data should be approximately normally distributed
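Some of these assumptions can be checked in R. The sketch below uses var.test() to compare two variances and shapiro.test() as one possible check of normality. The Shapiro-Wilk test is an addition here rather than something named in the workshop, and plotting a histogram is a common alternative. The built-in sleep data is again used as a stand-in.

```r
drug_1 <- sleep$extra[sleep$group == 1]
drug_2 <- sleep$extra[sleep$group == 2]

# F test: are the variances of the two groups equal?
variance_test <- var.test(drug_1, drug_2)

# Shapiro-Wilk test: is each group plausibly normally distributed?
normality_1 <- shapiro.test(drug_1)
normality_2 <- shapiro.test(drug_2)

# t-test assuming equal variances
equal_variance <- t.test(drug_1, drug_2, var.equal = TRUE)

# t-test that does not assume equal variances (Welch's test, the default)
welch <- t.test(drug_1, drug_2)

print(variance_test)
print(normality_1)
print(normality_2)
print(equal_variance)
print(welch)
```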

T-tests: how does it work?

  • Compute a specific summary statistic (the t-statistic)

  • Compare to some known distribution

  • Get a probability of seeing observations at least as extreme as those in your sample, assuming that the null hypothesis is true.
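For a one sample t-test, these three steps can be written out by hand. With sample mean $\bar{x}$, sample standard deviation $s$, sample size $n$, and hypothesised mean $\mu_0$, the test statistic is $t = (\bar{x} - \mu_0) / (s / \sqrt{n})$, which is compared to a t-distribution with $n - 1$ degrees of freedom. The sketch below uses the built-in sleep data and a made-up null value of 0, then checks the result against t.test().

```r
drug_1 <- sleep$extra[sleep$group == 1]
mu_0 <- 0  # value of the mean under the null hypothesis (chosen for illustration)
n <- length(drug_1)

# Step 1: compute the summary statistic (the t-statistic)
t_statistic <- (mean(drug_1) - mu_0) / (sd(drug_1) / sqrt(n))

# Steps 2 and 3: compare to a t-distribution with n - 1 degrees of freedom
# to get the probability of a value at least this extreme in either direction
p_value <- 2 * pt(-abs(t_statistic), df = n - 1)

# The built-in function gives the same answer
built_in <- t.test(drug_1, mu = mu_0)

print(c(t = t_statistic, p = p_value))
print(built_in)
```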

P-values are not binary!

Continuous p-values diagram

Source: theoreticalecology.wordpress.com

See also: www.ncbi.nlm.nih.gov/pmc/articles/PMC6532382

Live Demo!

  • Creating multiple subsets of data.

  • Comparing means of groups.

  • Comparing variances of groups.

Exercise 8

  1. Perform a t-test to test whether the age of patients is significantly different for males and females. Assume the variances of the two groups are equal.

  2. Test if the variances of the two groups are actually equal.


Feedback



Feedback form: forms.gle/L7AmGwnj2ZD2WvHx7