Stop making spaghetti (code)

Tips for writing better R code

Nicola Rennie, Lancaster University
useR, July 2024

nrennie

nrennie.rbind.io

About Me

Academic background in statistics

Experience in data science consultancy

Lecturer in Health Data Science in Lancaster Medical School.

Research interests: healthcare data, reproducible research, data visualisation, R pedagogy…

What’s this talk about?

A talk about things that would have made my life so much easier if I’d known them five years ago.

How does a typical R programming journey begin?

Often as part of a statistics or data analysis course.
Or learning on the job.
Not often taught by computer scientists.

And then…

You’re reasonably comfortable writing R code to do some analysis

But…

Your scripts are getting longer and longer
There are R scripts everywhere
You don’t want other people to see the code you’ve written because it’s all a bit of a mess!

A.K.A. spaghetti code!

How did we get into this mess?

The jump from learning R in class to using R in research projects is big.
Supervisors might have varying levels of experience.
Hand-me-down code can reinforce bad habits.
It’s not something that’s covered in a lot of textbooks*.

*but there are some excellent ones out there!

Why do we care?

Writing code that is readable and understandable is something that future you will be grateful for.
Writing code that is readable and understandable is something that other people will be grateful for.
- Sharing code prevents duplicating work.
- It makes work easier to replicate.
- Some journals may require analysis code to be shared.

How do we fix it?

By sharing useful tips we know
By sharing useful resources we find
By reviewing other people’s code
By having other people review our code

You don’t have to fix everything at once!

Structuring and styling your scripts

One file at a time

Adding comments

Add comments using a # in R (on a separate line)
Comments don’t need to explain what your code does.
Comments should explain why you did it.

starwars |> 
  summarise(
    mean_height = mean(height, na.rm = TRUE), # calc height mean 
    sd_height = sd(height, na.rm = TRUE) # calc height sd
  )

# calculate summary statistics and 
# remove NA values as missing `height`
# values are also missing `mass`
starwars |> 
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    sd_height = sd(height, na.rm = TRUE)
  )

Sections and subsections

You can add sections and subsections to code:

# Load data ---------------------------------------

## Geospatial files -------------------------------

## Population files -------------------------------

Code style

This code runs without errors but…

starwars |> filter(height>100) |>select(eye_color, mass)|> group_by(eye_color) |>summarise(mean_mass =mean(mass, na.rm = T))

this is the same code:

starwars |> 
  filter(height > 100) |> 
  select(eye_color, mass) |> 
  group_by(eye_color) |> 
  summarise(mean_mass = mean(mass, na.rm = TRUE))

Linting

Linting - analysing source code for:

stylistic issues e.g. x<-3 vs x <- 3
common errors e.g. mean(x, na.rm = T, na.rm = F)
missing packages
…

In R, linting is performed by the {lintr} package.

{lintr}

Run lintr::lint("file.R") to lint a single file.
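
As a rough sketch of what linting a file looks like, the example below writes a small, deliberately messy script to a temporary location and lints it. It assumes {lintr} version 3.0 or later is installed. The script contents and the choice of two linters are made up for illustration; by default lint() runs a larger set of checks.

```r
# Sketch only: assumes the {lintr} package (version 3.0 or later) is installed.
# The file contents below are made up for illustration.
script <- file.path(tempdir(), "file.R")
writeLines(
  c(
    "x<-3",
    "y <- mean(c(x, NA), na.rm = T)"
  ),
  script
)

# Check the file with two specific linters so the output is easy to read
lints <- lintr::lint(
  script,
  linters = list(
    lintr::infix_spaces_linter(),
    lintr::T_and_F_symbol_linter()
  )
)

# One row per issue, with the line number and a message
as.data.frame(lints)[, c("line_number", "linter", "message")]
```

Each row points at the line to fix. Nothing in the file is changed.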

Keyboard shortcuts

Use keyboard shortcuts to lint the current file (or package).

Styling

{lintr} tells you what’s wrong, but doesn’t fix it.

The {styler} R package will style your code for you.

Keyboard shortcuts

Add a keyboard shortcut for styler::style_active_file()!

Note: {styler} doesn’t fix all issues found by {lintr}.
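
To see what {styler} does without touching any files, a minimal sketch (assuming {styler} is installed) is to pass some messy code to style_text(). The two lines of messy code are made up for illustration.

```r
# Sketch only: assumes the {styler} package is installed.
messy <- c(
  "x<-3",
  "y<-mean(c(x,NA),na.rm=TRUE)"
)

# style_text() returns the restyled code without touching any files
tidy <- styler::style_text(messy)
print(tidy)

# To restyle a script in place instead (this overwrites the file):
# styler::style_file("R/02_data_wrangling.R")
```

Because style_file() overwrites the script, it is safest to use it on code that is already under version control or backed up.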

Structuring and styling your projects

One directory at a time

Breaking up a single file

Imagine a directory structure like this:

project
│   Rscript.R

that contains all of the code for your analysis.

This is fine but:

it’s not great if Rscript.R is 4,000 lines long.
sections and subsections are great, but sometimes they aren’t enough.
it’s not a very descriptive name.
it’s a script that probably does lots of different things.

Multiple files

Okay names

project
│   data wrangling.R
│   load data.R
│   modelling.R
│   packages.R
│   plots.R
│   plots2.R

Better names

project
│   00_packages.R
│   01_load_data.R
│   02_data_wrangling.R
│   03_exploratory_plots.R
│   04_modelling.R
│   05_final_plots_tables.R

Multiple files

Naming files

Prefix with numbers to give them an order (add leading zeros).
Give them sensible, descriptive names.
Avoid spaces (computers prefer - or _).

Note: similar rules apply for variable and function names.
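
Why the leading zeros matter: file browsers and list.files() sort names as text, not as numbers. A small sketch using hypothetical file names:

```r
# Hypothetical script names, numbered without leading zeros
unpadded <- c("1_load_data.R", "2_data_wrangling.R", "10_final_plots.R")

# Text sorting compares one character at a time, so "10" lands before "2"
# (method = "radix" sorts in the C locale, so the result does not depend on
# your system's language settings)
sort(unpadded, method = "radix")

# Zero-padding the prefix keeps alphabetical and numerical order in step
padded <- sprintf(
  "%02d_%s",
  c(1L, 2L, 10L),
  c("load_data.R", "data_wrangling.R", "final_plots.R")
)
sort(padded, method = "radix")
```

With padding, the sorted order matches the intended run order. Without it, the last script jumps ahead of script 2.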

We’ll come back to avoiding analysis_final.R and analysis_final_final.R later!

Multiple folders

Often, you don’t just have R code for a project…

project
│   00_packages.R
│   01_load_data.R
│   02_data_wrangling.R
│   03_exploratory_plots.R
│   04_modelling.R
│   05_final_plots_tables.R
│   data.csv
│   residuals.png
│   outcome_by_age.png
│   outcome_by_occupation.png

Multiple folders

… so don’t just organise your R code!

project
│   project.Rproj
│   README.md
└───data
│   │   data.csv
└───plots
│   │   residuals.png
│   │   outcome_by_age.png
│   │   outcome_by_occupation.png
└───R
│   │   00_packages.R
│   │   01_load_data.R
│   │   02_data_wrangling.R
│   │   03_exploratory_plots.R
│   │   04_modelling.R
│   │   05_final_plots_tables.R
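
A layout like the one above could be set up from R itself. This is only a sketch: it builds the folders inside a temporary directory so nothing in a real working directory is touched, and it does not create the .Rproj file (RStudio does that when you create a new project).

```r
# Sketch: build the skeleton inside a temporary directory.
# Swap `project` for your own path when doing this for real.
project <- file.path(tempdir(), "project")

for (folder in c("data", "plots", "R")) {
  dir.create(file.path(project, folder), recursive = TRUE, showWarnings = FALSE)
}
file.create(file.path(project, "README.md"))

# Build paths with file.path() rather than pasting strings together by hand
plot_path <- file.path(project, "plots", "residuals.png")

list.files(project)
```

Scripts in the R folder can then read from data and write to plots using paths built the same way.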

R script dependencies

project
└───R
│   │   00_packages.R
│   │   01_load_data.R
│   │   02_data_wrangling.R
│   │   03_exploratory_plots.R
│   │   04_modelling.R
│   │   05_final_plots_tables.R

Script 01 depends on 00
Script 02 depends on 01 (and 00)
Script 03 depends on 02 (and 01 and 00)
Script 04 depends on 02 (and 01 and 00, but not 03)
…
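
One simple way to respect these dependencies is a "run everything" script that sources each file in order. The sketch below is hypothetical: the three tiny scripts it writes are stand-ins for the real files in the R folder, and it runs in a temporary directory.

```r
# Sketch of a hypothetical "run everything" script. The three tiny scripts
# written here stand in for the real files in the R folder.
script_dir <- file.path(tempdir(), "R")
dir.create(script_dir, showWarnings = FALSE)
writeLines("n <- 10", file.path(script_dir, "00_packages.R"))
writeLines("dat <- seq_len(n)", file.path(script_dir, "01_load_data.R"))
writeLines("total <- sum(dat)", file.path(script_dir, "02_data_wrangling.R"))

# The numeric prefixes mean alphabetical order is also the run order
scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))

# Run each script in turn; later scripts can use objects made by earlier ones
for (script in scripts) {
  source(script)
}

total # 55
```

This always re-runs every script, even when nothing has changed, and it cannot skip a script such as 03 when only 04 is needed.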

Documentation

Write this stuff down (in a README.md file)!

A better solution…

{targets}: a pipeline tool for statistics and data science in R.

watches the dependencies of your workflow
skips steps whose code, data, and upstream dependencies have not changed
unlike the source("script.R") approach, it also tracks changes to data
visualise the dependencies using tar_visnetwork()
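
A rough sketch of what a {targets} pipeline file might look like for this project. It assumes {targets} is installed and that the file is saved as _targets.R in the project root. The target names, the cleaning step, and the column names `outcome` and `age` are hypothetical.

```r
# _targets.R (sketch only; assumes the {targets} package is installed)
# Target names and the columns `outcome` and `age` are hypothetical.
library(targets)

list(
  # Track the data file itself, so editing data.csv invalidates later steps
  tar_target(data_file, "data/data.csv", format = "file"),
  tar_target(raw_data, read.csv(data_file)),
  tar_target(clean_data, na.omit(raw_data)),
  tar_target(model, lm(outcome ~ age, data = clean_data))
)
```

With a file like this in place, targets::tar_make() runs only the steps that are out of date, and targets::tar_visnetwork() draws the dependency graph.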

Useful links

Slides: nrennie.rbind.io/talks/user-spaghetti-code
The Turing Way: www.turing.ac.uk/research/research-projects/turing-way
Data Management in Large-Scale Education Research: datamgmtinedresearch.com
Building reproducible analytical pipelines with R: raps-with-r.dev
Happy Git with R: happygitwithr.com

Keep spaghetti in a pasta bowl, not your R scripts!

nicola-rennie
