Examples
Data
Length of Stay Model Data
The LOS
dataset is a simulated hospital length-of-stay dataset, available through the {NHSRdatasets} R package. The dataset is licensed under creativecommons.org/publicdomain/zero/1.0.
Download CSV: LOS.csv
Download Excel: LOS.xlsx
If you have downloaded the data, and stored it in your R project folder in a folder called data
, then you can read the data in:
Or if you’re using the Excel version:
Examples
This section includes code for the examples shown. These may differ slightly from the examples shown in the live demonstration.
Example 1: Performing operations in R
See example
Make a vector using the c()
function and assign it to a variable called x
:
x <- c("January", "February", "March")
x
[1] "January" "February" "March"
There are three types of data, and we can’t mix and match them in a vector:
x <- c(23, "Yes", TRUE, 44)
x
[1] "23" "Yes" "TRUE" "44"
Note: you don’t get an error, but the vector probably isn’t what you expected it to be. Hint: ""
means character i.e. a word.
Functions take an input(s) and produce an output. Write the function name, followed by round brackets, with the input inside the brackets. For example, calculate the exponential of 5:
exp(5)
[1] 148.4132
Install packages (only need to do this once):
install.packages("readxl")
Load the package (each time you open R and want to use the package):
Example 2: Loading data into R
See example
Reading in a CSV file using base R:
LOS_data <- read.csv("data/LOS.csv")
Reading in a CSV file using the {readr} package:
The difference between the two is mainly the way that column names are processed.
Reading in an Excel file using the {readxl} package:
Look at the data:
View(LOS_data)
Number of rows and columns
Example 3: Plotting single variables
See example
Histogram using the geom_histogram()
function:
library(ggplot2)
ggplot(data = LOS_data, mapping = aes(x = Age)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You don’t need to explicitly write the arguments:
ggplot(LOS_data, aes(x = Age)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Change the binwidth
:
ggplot(data = LOS_data, mapping = aes(x = Age)) +
geom_histogram(binwidth = 5)
We want to subset the data to include only rows where the Organisation
is equal to Trust1
. We use the filter()
function from the {dplyr} package to make a new data set:
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
LOS_trust1 <- LOS_data |>
filter(Organisation == "Trust1")
Then use the same code to plot a histogram of age for only Trust 1 patient, but change the data
we pass in:
ggplot(data = LOS_trust1, mapping = aes(x = Age)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Example 4: Reading help files in R
Example 5: Plotting multiple variables
See example
Do older patients stay longer?
ggplot(
data = LOS_data,
mapping = aes(x = Age, y = LOS)
) +
geom_point()
Does it vary by organisation? Let’s use colours to find out. R orders categories alphabetically unless you tell it otherwise, this is why Trust10
is before Trust2
.
ggplot(
data = LOS_data,
mapping = aes(x = Age, y = LOS, colour = Organisation)
) +
geom_point()
For colours based on numeric variables, R adds a continuous colour scale:
ggplot(
data = LOS_data,
mapping = aes(x = Age, y = LOS, colour = Age)
) +
geom_point()
Sometimes we need to change the type of a variable for plotting e.g. from a numeric to a character.
ggplot(
data = LOS_data,
mapping = aes(x = Age, y = LOS, colour = Death)
) +
geom_point()
The Death
column is encoded as 0
and 1
which R interprets as numbers, but these are actually categories (No
and Yes
). We can change it to a character or factor
(ordered category).
ggplot(
data = LOS_data,
mapping = aes(x = Age, y = LOS, colour = factor(Death))
) +
geom_point()
Example 6: Computing summary statistics
See example
What is the mean LOS
?
In base R:
mean(LOS_data$LOS)
[1] 4.936667
Calculate the average LOS
for each Organisation
.
# A tibble: 10 × 2
Organisation LOS_mean
<chr> <dbl>
1 Trust1 5.07
2 Trust10 4.3
3 Trust2 4.23
4 Trust3 5.07
5 Trust4 4.87
6 Trust5 6.1
7 Trust6 4.9
8 Trust7 5.1
9 Trust8 4.7
10 Trust9 5.03
Also calculate the mean age for each Organisation
:
# A tibble: 10 × 3
Organisation LOS_mean Age_mean
<chr> <dbl> <dbl>
1 Trust1 5.07 55.4
2 Trust10 4.3 51.0
3 Trust2 4.23 51.2
4 Trust3 5.07 47.9
5 Trust4 4.87 49.5
6 Trust5 6.1 45.7
7 Trust6 4.9 48.6
8 Trust7 5.1 53.7
9 Trust8 4.7 51.4
10 Trust9 5.03 52.3
What about the standard deviation?
LOS_data |>
group_by(Organisation) |>
summarise(
LOS_mean = mean(LOS),
Age_mean = mean(Age),
LOS_sd = sd(LOS),
Age_sd = sd(Age)
)
# A tibble: 10 × 5
Organisation LOS_mean Age_mean LOS_sd Age_sd
<chr> <dbl> <dbl> <dbl> <dbl>
1 Trust1 5.07 55.4 3.52 28.2
2 Trust10 4.3 51.0 3.35 29.5
3 Trust2 4.23 51.2 3.51 27.3
4 Trust3 5.07 47.9 3.98 28.4
5 Trust4 4.87 49.5 3.42 26.4
6 Trust5 6.1 45.7 4.30 28.3
7 Trust6 4.9 48.6 3.34 27.8
8 Trust7 5.1 53.7 3.82 29.2
9 Trust8 4.7 51.4 3.71 27.8
10 Trust9 5.03 52.3 3.32 28.7
Example 7: Summary tables
See example
Drop the ID
column, then make a summary table of the rest of the variables:
Warning: package 'gtsummary' was built under R version 4.4.1
tbl1_data <- LOS_data |>
select(-ID)
tbl1_data |>
tbl_summary()
Characteristic | N = 3001 |
---|---|
Organisation | |
Trust1 | 30 (10%) |
Trust10 | 30 (10%) |
Trust2 | 30 (10%) |
Trust3 | 30 (10%) |
Trust4 | 30 (10%) |
Trust5 | 30 (10%) |
Trust6 | 30 (10%) |
Trust7 | 30 (10%) |
Trust8 | 30 (10%) |
Trust9 | 30 (10%) |
Age | 54 (24, 76) |
LOS | 4.0 (2.0, 7.0) |
Death | 53 (18%) |
1 n (%); Median (Q1, Q3) |
Group by Organisation
:
tbl1 <- tbl1_data |>
tbl_summary(by = Organisation)
tbl1
Characteristic |
Trust1 N = 301 |
Trust10 N = 301 |
Trust2 N = 301 |
Trust3 N = 301 |
Trust4 N = 301 |
Trust5 N = 301 |
Trust6 N = 301 |
Trust7 N = 301 |
Trust8 N = 301 |
Trust9 N = 301 |
---|---|---|---|---|---|---|---|---|---|---|
Age | 62 (30, 80) | 49 (26, 77) | 57 (19, 74) | 52 (21, 73) | 48 (23, 75) | 39 (18, 72) | 49 (25, 74) | 58 (26, 78) | 51 (26, 79) | 54 (26, 74) |
LOS | 5.0 (2.0, 7.0) | 3.0 (2.0, 5.0) | 3.0 (2.0, 6.0) | 4.5 (2.0, 7.0) | 4.0 (2.0, 8.0) | 4.5 (2.0, 11.0) | 4.0 (2.0, 7.0) | 4.0 (2.0, 7.0) | 3.0 (2.0, 7.0) | 4.0 (2.0, 8.0) |
Death | 7 (23%) | 4 (13%) | 5 (17%) | 6 (20%) | 4 (13%) | 7 (23%) | 4 (13%) | 8 (27%) | 5 (17%) | 3 (10%) |
1 Median (Q1, Q3); n (%) |
Export to a Word document:
Example 8: Statistical tests
See example
Is the LOS
significantly different for patients under the age of 50, compare to those who are 50 or older?
Are the means different?
t.test(LOS_younger$LOS, LOS_older$LOS)
Welch Two Sample t-test
data: LOS_younger$LOS and LOS_older$LOS
t = -8.0643, df = 296.03, p-value = 1.844e-14
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.767660 -2.289483
sample estimates:
mean of x mean of y
3.321429 6.350000
Are the variances different?
var.test(LOS_younger$LOS, LOS_older$LOS)
F test to compare two variances
data: LOS_younger$LOS and LOS_older$LOS
F = 0.64874, num df = 139, denom df = 159, p-value = 0.009204
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.4704584 0.8977698
sample estimates:
ratio of variances
0.6487431