Lecturer in Health Data Science within Centre for Health Informatics, Computing, and Statistics.
Academic background in statistics, with experience in data science consultancy and training.
Using R for over 10 years, and author of multiple R packages.
Combines slides, live coding examples, and exercises for you to participate in.
Ask questions throughout!
I hope you end up with more questions than answers after this workshop!
Course website: nrennie.rbind.io/training-intro-to-r
R is an:
open source
programming language
commonly used for statistical analysis
that is widely used in many fields, including psychology and bioinformatics.
It’s free and open source - better for reproducibility!
Writing code allows you to repeat an analysis more easily.
Offers a wide variety of statistical functions.
Handles large datasets more efficiently than software like Excel.
An Integrated Development Environment (IDE) for R
A more user-friendly way to write R code
Has some nice features that make writing code easier
Click on the Download and Install R button on the left, and install R.
Click on the Download RStudio Desktop for … button on the right, and install RStudio. If the option is available, install only for your user but otherwise use the default settings.
Open up RStudio from your Start menu.
At a most basic level, R is essentially a big calculator:
We often want to save the output from our code somewhere, so we can use it later:
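A minimal sketch of both ideas, using made-up numbers: arithmetic typed straight into the console, then the assignment operator <- to store a result under a name. The name total is only an illustration.

```r
# R as a calculator
2 + 3
10 / 4

# Save a result under a name so it can be reused later
total <- 2 + 3
total * 10
```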
log is the name of the function
2 is the value of the argument x
()
You can also write your own functions, but we won’t cover this today!
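As a small illustration of calling a function, here is log() with its argument named and unnamed. The variable names log_two and log_eight are hypothetical, and the base argument is shown only as an example of a second argument.

```r
# Natural logarithm of 2, naming the argument explicitly
log_two <- log(x = 2)

# The argument name can be dropped when arguments are given in order
log(2)

# A second argument changes the base of the logarithm
log_eight <- log(x = 8, base = 2)
```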
A collection of functions, data, documentation, and resources bundled together.
An easy way to share code with other people (and use other people’s code!)
Some packages come with R when you install it.
You can download other packages from different places such as CRAN or GitHub.
Source: roelverbelen.netlify.app
Creating vectors and performing operations on them.
Using functions in R.
Installing and loading packages.
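The three topics above might look something like this in practice. The heights vector is made up, and {tools} is used for the loading step only because it ships with R; in the workshop you would load a package you have installed, such as ggplot2.

```r
# Create a numeric vector with c() (values are made up)
heights <- c(150, 162, 171, 180)

# Arithmetic is applied to every element of the vector
heights_m <- heights / 100

# Functions also work on whole vectors
mean_height <- mean(heights)

# Install a package from CRAN (only needed once, so commented out here)
# install.packages("ggplot2")

# Load an installed package at the start of each session
library(tools)
```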
Create a vector called age (like the one below). Multiply each element by 5.
What does the sqrt() function do? Apply the sqrt() function to each element of age.
Install the ggplot2 package.
Load the ggplot2 package into R.
For today, we’ll assume you have the data stored somewhere on your computer. But there are R tools to connect to other remote data sources!
Let’s make our own lives easier by keeping our files organised!
Organised files also make it easier for R to find our files (e.g. data).
An easy way to do this is using an R project
R Projects are a special type of file with a .Rproj extension that makes it easier for you to keep all of the data, code, and images for a project in one place.
Open up RStudio, then click File –> New Project –> New Directory –> New Project.
Type in the name that you want to call your new folder e.g. Intro to R Workshop. Then use Browse to select where on your computer you want to make the folder. IMPORTANT: remember where this is!
Finally, click Create Project. Your new folder will be created and opened in RStudio - sometimes it can take a couple of minutes.
There are two approaches to loading data into R:
Using point and click Import Dataset.
Writing code.
Copy and paste the code in the bottom right!
Different packages for different types of files:
CSV files: base R or the {readr} package
Excel files: {readxl} package (or other packages)
SPSS/SAS/Stata files: {haven} package (haven.tidyverse.org)
Read in CSV data using base R or the {readr} package.
Read in Excel data using the {readxl} package.
Inspect the data visually.
Summarise the data.
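A minimal sketch of these steps using base R. To keep it self-contained, it first writes a tiny made-up CSV to a temporary file; the column names and values are invented, and in the workshop you would point read.csv() at your own file instead.

```r
# Write a tiny made-up CSV to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,BMI", "34,22.5", "51,27.1", "47,30.2"), csv_path)

# Read CSV data using base R
patients <- read.csv(csv_path)

# Inspect the data
head(patients)   # first few rows
dim(patients)    # number of rows and columns
# View(patients) # opens a spreadsheet-style viewer in RStudio

# Summarise every column
summary(patients)

# Equivalents from other packages, if they are installed:
# patients <- readr::read_csv(csv_path)
# patients <- readxl::read_excel("path/to/file.xlsx")
```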
Link: nrennie.rbind.io/training-intro-to-r/exercises.html
Inspect the data using View().
How many rows and columns are in the data?
Create a summary of the data.
Because summary statistics aren’t enough…
| Dataset | A | B |
| --- | --- | --- |
| mean_x | 54.2632732 | 54.2658818 |
| mean_y | 47.8322528 | 47.8314957 |
| sd_x | 16.7651420 | 16.7688527 |
| sd_y | 26.9354035 | 26.9386081 |
| cor_xy | -0.0644719 | -0.0686092 |
Discrete
Categories e.g. types of fruit, hospitals, eye colour, …
Bar charts
Waffle charts / pictograms
Continuous
Numeric values e.g. height, weight, rain in millimeters, …
Box plots
Histograms
Density plots
…
R comes packaged with its own plotting functions (often called base R graphics).
Base R graphics are great for taking a quick look at your data.
But they can be quite hard to customise, and there’s a more limited set of plot types available.
So we’re going to use another package…
Most popular visualisation package in R.
Lots of extension packages.
Build plots in layers.
Fairly intuitive way of making plots.
Part of the tidyverse!
Collection of R packages for data manipulation, exploration, and visualisation.
Functions are named (reasonably) well and use a consistent syntax.
Emphasizes the idea of tidy data, where data is structured to make analysis more efficient.
We’ll see a few of these packages in action today!
Source: education.rstudio.com
The {tidyverse} packages encourage use of the pipe operator: %>%.
The pipe operator takes the object on the left hand side of the pipe, and places it as the first argument to the function on the right hand side.
You might also see the pipe operator written as |>. There are some small differences between %>% and |>, but none that will affect anything we do today.
Use the Ctrl + Shift + M keyboard shortcut in RStudio to add a pipe.
Using pipes can save you from saving multiple copies of an object when doing different operations to your data - you’ll see some examples in a second!
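To illustrate what the pipe does, here is the same calculation written with nested function calls and with a pipe. This sketch uses the native |> pipe (available from R 4.1) so that it runs without any packages; the ages vector is made up.

```r
ages <- c(34, 51, 47, 62)

# Without a pipe: functions are nested and read inside-out
nested <- round(sqrt(mean(ages)), 1)

# With a pipe: each step reads left to right, and the result of one
# step becomes the first argument of the next
piped <- ages |>
  mean() |>
  sqrt() |>
  round(1)
```

Both versions give the same answer, and neither needs intermediate objects to hold the mean or the square root.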
Plot a histogram of Age.
Create a subset of data.
Plot a histogram of Age in the Trust1 organisation.
Save plot as an image.
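A rough sketch of these steps, assuming {ggplot2} and {dplyr} are installed. The data frame below is made up to stand in for the workshop dataset, and the Organisation column name is an assumption.

```r
library(ggplot2)
library(dplyr)

# Made-up data standing in for the workshop dataset
patients <- data.frame(
  Age = c(34, 51, 47, 62, 29, 58, 45, 70),
  Organisation = c("Trust1", "Trust2", "Trust1", "Trust1",
                   "Trust2", "Trust1", "Trust2", "Trust1")
)

# Histogram of Age for everyone
ggplot(data = patients, mapping = aes(x = Age)) +
  geom_histogram(bins = 5)

# Subset to one organisation, then plot the subset
trust1 <- patients %>%
  filter(Organisation == "Trust1")

age_plot <- ggplot(data = trust1, mapping = aes(x = Age)) +
  geom_histogram(bins = 5)

# Save the plot as an image (written to a temporary file here)
plot_path <- tempfile(fileext = ".png")
ggsave(filename = plot_path, plot = age_plot, width = 6, height = 4)
```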
Create a histogram of the age of all patients in the study. What does the bins argument in geom_histogram() do?
Create a histogram of the age of all female patients in the study. Hint: remember that 0 = male and 1 = female in the Female column.
Create a bar chart of the number of people who had each surgery type (`Type Surg`). Hint: geom_bar().
Bonus: Edit the axis labels using the labs() function. Can you also add a title?
Most R functions come with some documentation (help files) and examples. There are several equivalent ways to read the help files for the exp() function, for example ?exp or help(exp).
To find help for a package:
Reading the help files in R.
Running examples from help files.
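A short sketch of reading help files and running their examples. The choice of mean() for the examples and {stats} for the package index is arbitrary; any function or installed package would do.

```r
# exp() raises e to a power; its help file explains the details
exp_one <- exp(1)

# Open the help file for exp()
?exp
help("exp")

# Run the examples listed at the bottom of a help file
example("mean")

# Open the index of help pages for a whole package
help(package = "stats")
```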
Go to the Resources tab on the course website: nrennie.rbind.io/training-intro-to-r/resources.html
The ggplot2 package is often used for plotting in R. What does the geom_count() function do?
What is the difference between geom_bar() and geom_col()?
Does this code do what you expect? Can you fix it?
Two continuous variables: Scatter plot, …
Two discrete variables: Bar chart, …
One continuous + one discrete: Multiple box plots, using colour, …
Do older patients stay longer?
Does it vary by organisation?
Changing the order of categories.
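One way these questions could be explored, assuming {ggplot2} is installed. The data is made up, and the column names LOS (length of stay in days) and Organisation are assumptions for illustration.

```r
library(ggplot2)

# Made-up data; LOS and Organisation are assumed column names
patients <- data.frame(
  Age = c(34, 51, 47, 62, 29, 58, 45, 70),
  LOS = c(2, 4, 3, 6, 1, 5, 3, 8),
  Organisation = c("Trust2", "Trust1", "Trust3", "Trust1",
                   "Trust2", "Trust3", "Trust2", "Trust1")
)

# Two continuous variables: scatter plot, coloured by a discrete variable
ggplot(data = patients, mapping = aes(x = Age, y = LOS, colour = Organisation)) +
  geom_point()

# Change the order in which categories are drawn by setting factor levels
patients$Organisation <- factor(
  patients$Organisation,
  levels = c("Trust3", "Trust2", "Trust1")
)

# One continuous + one discrete variable: one box plot per category
los_plot <- ggplot(data = patients, mapping = aes(x = Organisation, y = LOS)) +
  geom_boxplot()
los_plot
```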
Create a scatter plot of Age (on the x-axis) and BMI (on the y-axis).
Create a boxplot of Age for each surgery type. Hint: make `Type Surg` a factor().
Create a bar chart of the number of people who had each surgery type. Colour the bars based on whether people had diabetes. Hint: should Diabetes be a numeric or a factor?
Bonus: Edit your bar chart to put the bars next to each other instead of stacked on top. Hint: look at the position argument in geom_bar().
Summary statistics
Brief numerical descriptions of a dataset that provide an overview of its main features.
Measures of central tendency:
Measures of dispersion:
…and more!
Calculating the mean and standard deviation in base R.
Calculating the mean and standard deviation using the {tidyverse}.
Calculating the mean and standard deviation of different categories.
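The three approaches above could be sketched as follows, assuming {dplyr} is installed. The data frame is made up (including one missing value), and the column names are assumptions.

```r
library(dplyr)

# Made-up data with one missing value
patients <- data.frame(
  Organisation = c("Trust1", "Trust1", "Trust2", "Trust2"),
  Age = c(30, 50, 40, NA)
)

# Base R: na.rm = TRUE drops missing values before calculating
mean_age <- mean(patients$Age, na.rm = TRUE)
sd_age <- sd(patients$Age, na.rm = TRUE)

# {tidyverse}: summarise() collapses the data to one row of statistics
patients %>%
  summarise(
    mean_age = mean(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE)
  )

# Adding group_by() gives one row per category
by_org <- patients %>%
  group_by(Organisation) %>%
  summarise(
    mean_age = mean(Age, na.rm = TRUE),
    sd_age = sd(Age, na.rm = TRUE)
  )
```

Without na.rm = TRUE, mean() and sd() return NA whenever the column contains a missing value.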
For `Duration of Surg` and `AHI`, calculate the following summary statistics: mean and standard deviation. If there are any missing values, calculate the mean of the non-missing values. Hint: Look at the na.rm argument in ?mean.
Repeat the calculations, but group the summary statistics by Surgery Type (`Type Surg`).
Bonus: also calculate the median, minimum, and maximum.
In papers, it’s very common to include a summary table (often called Table 1) that provides summaries of:
You could calculate all of the summary statistics as you did in the previous exercise and then copy and paste them into a Word document…
Or R could do it for you!
{gtsummary} helps create well-formatted summary tables, including descriptive statistics and results of statistical tests.
integrates well with {tidyverse} packages
save tables directly to Word format
Creating summary tables.
Creating grouped summary tables.
Exporting to a Word document.
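A possible sketch of these three steps, assuming both {gtsummary} and {flextable} are installed. The data is made up, and exporting through as_flex_table() and flextable::save_as_docx() is one of several routes to a Word file, not necessarily the one used in the workshop.

```r
library(gtsummary)

# Made-up data; column names are assumptions for illustration
patients <- data.frame(
  Age = c(34, 51, 47, 62, 29, 58, 45, 70, 38, 66, 53, 41),
  Female = factor(rep(c(0, 1), times = 6)),
  Smoking = factor(rep(c(0, 0, 1), times = 4))
)

# Summary table with one section per variable
tbl_summary(patients)

# Grouped summary table: one column per level of Smoking
smoking_table <- tbl_summary(patients, by = Smoking)

# Export to Word via {flextable} (written to a temporary file here)
docx_path <- tempfile(fileext = ".docx")
smoking_table |>
  as_flex_table() |>
  flextable::save_as_docx(path = docx_path)
```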
Create a descriptive table of patient characteristics, which includes the following variables: age, gender, race, smoking.
Are Female and Smoking represented in the table in a way that makes sense? Change the Female and Smoking columns to a factor.
Group the table by Smoking.
Bonus: Change the labels for Smoking to Smoking and No smoking instead of 1 and 0.
Different test variations:
Null hypothesis: usually the boring option - there is no difference!
Alternative hypothesis:
Assumptions:
Observations are continuous
Observations are independent
Variance of two groups is equal (we can test for this and do a different type of test if needed)
Data should be approximately normally distributed
Compute a specific summary statistic
Compare to some known distribution
Get a probability of seeing observations at least as extreme as those in your sample, assuming that the null hypothesis is true.
Source: theoreticalecology.wordpress.com
Creating multiple subsets of data.
Comparing means of groups.
Comparing variances of groups.
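A minimal sketch of these steps in base R, using made-up data. The BMI and Diabetes columns are assumptions chosen for illustration; t.test() with var.equal = TRUE runs the equal-variance two-sample t-test, and var.test() runs an F test of that equal-variance assumption.

```r
# Made-up data: a continuous outcome measured in two groups
patients <- data.frame(
  BMI = c(24.1, 27.3, 22.8, 30.5, 26.0, 28.4, 23.9, 29.7),
  Diabetes = c(0, 1, 0, 1, 0, 1, 0, 1)
)

# Create one subset per group
no_diabetes <- subset(patients, Diabetes == 0)
diabetes <- subset(patients, Diabetes == 1)

# Compare means: two-sample t-test assuming equal variances
t_result <- t.test(no_diabetes$BMI, diabetes$BMI, var.equal = TRUE)
t_result$p.value

# Compare variances: F test of the equal-variance assumption
f_result <- var.test(no_diabetes$BMI, diabetes$BMI)
f_result$p.value
```

If the F test suggests the variances differ, setting var.equal = FALSE in t.test() runs Welch's t-test instead.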
Perform a t-test to test whether the age of patients is significantly different for males and females. Assume the variances of the two groups are equal.
Test if the variances of the two groups are actually equal.
Comments
Tip: any line that begins with a # is called a comment. R doesn’t run these lines, but they’re very useful for writing notes to yourself (and other people who read your code!) to explain what your code is doing.
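For instance, in this made-up snippet (radius and area are hypothetical names), R ignores everything from the # to the end of the line:

```r
# This whole line is a comment, so R ignores it
radius <- 3

area <- pi * radius^2 # a comment can also follow code on the same line
```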