Stop making spaghetti (code)

Tips for writing better R code

Nicola Rennie, Lancaster University
useR, July 2024

About Me


Academic background in statistics

Experience in data science consultancy

Lecturer in Health Data Science at Lancaster Medical School

Research interests: healthcare data, reproducible research, data visualisation, R pedagogy…

What’s this talk about?




A talk about things that would have made my life so much easier if I’d known them five years ago.

How does a typical R programming journey begin?

  • Often as part of a statistics or data analysis course.

  • Or learning on the job.

  • Not often taught by computer scientists.

And then…

  • You’re reasonably comfortable writing R code to do some analysis

But…

  • Your scripts are getting longer and longer
  • There are R scripts everywhere
  • You don’t want other people to see the code you’ve written because it’s all a bit of a mess!

A.K.A. spaghetti code!

How did we get into this mess?

  • The jump from learning R in class to using R in research projects is big.

  • Supervisors might have varying levels of experience.

  • Hand-me-down code can reinforce bad habits.

  • It’s not something that’s covered in a lot of textbooks*.



*but there are some excellent ones out there!

Why do we care?

  • Writing code that is readable and understandable is something that future you will be grateful for.

  • Writing code that is readable and understandable is something that other people will be grateful for.

    • Sharing code prevents duplicating work.
    • It makes work easier to replicate.
    • Some journals may require analysis code to be shared.

How do we fix it?

  • By sharing useful tips we know

  • By sharing useful resources we find

  • By reviewing other people’s code

  • By having other people review our code

You don’t have to fix everything at once!



Structuring and styling your scripts

One file at a time

Adding comments

  • Add comments using a # in R (on a separate line)

  • Comments don’t need to explain what your code does.

  • Comments should explain why you did it.

Comments that only restate the code:

starwars |>
  summarise(
    mean_height = mean(height, na.rm = TRUE), # calc height mean
    sd_height = sd(height, na.rm = TRUE) # calc height sd
  )

Comments that explain why:

# calculate summary statistics and
# remove NA values as missing `height`
# values are also missing `mass`
starwars |>
  summarise(
    mean_height = mean(height, na.rm = TRUE),
    sd_height = sd(height, na.rm = TRUE)
  )

Sections and subsections

You can add sections and subsections to code:

# Load data ---------------------------------------

## Geospatial files -------------------------------

## Population files -------------------------------

Code style

This code runs without errors but…

starwars |> filter(height>100) |>select(eye_color, mass)|> group_by(eye_color) |>summarise(mean_mass =mean(mass, na.rm = T))


this is the same code:

starwars |> 
  filter(height > 100) |> 
  select(eye_color, mass) |> 
  group_by(eye_color) |> 
  summarise(mean_mass = mean(mass, na.rm = TRUE))

Linting

Linting - analysing source code for:

  • stylistic issues e.g. x<-3 vs x <- 3
  • common errors e.g. mean(x, na.rm = T, na.rm = F)
  • missing packages

In R, linting is performed by the {lintr} package.

{lintr}

Run lintr::lint("file.R") to check a single file.
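As a rough sketch of what linting looks like in practice, the example below writes two deliberately messy lines to a temporary file and lints it. It assumes {lintr} is installed, and the file contents are made up for illustration.

```r
# Assumes {lintr} is installed: install.packages("lintr")
library(lintr)

# Hypothetical messy script, written to a temporary file for illustration
messy_file <- tempfile(fileext = ".R")
writeLines(
  c(
    "x<-3",
    "y = mean(c(x, NA), na.rm = T)"
  ),
  messy_file
)

# lint() returns a list of lints, one per issue found
lints <- lint(messy_file)
print(lints)

# Convert to a data frame to see line numbers and messages side by side
lint_table <- as.data.frame(lints)
lint_table[, c("line_number", "message")]
```

Which issues are reported depends on the linters that are switched on; the defaults follow the tidyverse style guide.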

Keyboard shortcuts

Use keyboard shortcuts to lint the current file (or package).

Styling

{lintr} tells you what’s wrong, but doesn’t fix it.

The {styler} R package will style your code for you.

Keyboard shortcuts

Add a keyboard shortcut for styler::style_active_file()!

Note: {styler} doesn’t fix all issues found by {lintr}.
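A minimal sketch of {styler} outside the editor, assuming the package is installed. The messy lines are invented for illustration; style_text() styles code supplied as text, which makes it easy to experiment before touching real files.

```r
# Assumes {styler} is installed: install.packages("styler")
library(styler)

# Hypothetical messy code, supplied as a character vector
messy_code <- c("x<-3", "y<-mean(c(x,NA),na.rm=TRUE)")

# style_text() returns the restyled code without changing any files
styled_code <- style_text(messy_code)
styled_code
```

To restyle a script in place, styler::style_file("file.R") does the same job on a file, so it is worth committing or backing up first.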



Structuring and styling your projects

One directory at a time

Breaking up a single file

Imagine a directory structure like this:

project
│   Rscript.R

that contains all of the code for your analysis.

This is fine but:

  • it’s not great if Rscript.R is 4,000 lines long.
  • sections and subsections are great, but sometimes they aren’t enough.
  • it’s not a very descriptive name.
  • it’s a script that probably does lots of different things.


Multiple files

Okay names

project
│   data wrangling.R
│   load data.R
│   modelling.R
│   packages.R
│   plots.R
│   plots2.R

Better names

project
│   00_packages.R
│   01_load_data.R
│   02_data_wrangling.R
│   03_exploratory_plots.R
│   04_modelling.R
│   05_final_plots_tables.R

Multiple files

Naming files

  • Prefix with numbers to give them an order (add leading zeros).
  • Give them sensible, descriptive names.
  • Avoid spaces (computers prefer - or _).

Note: similar rules apply for variable and function names.
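The naming rules above can be sketched in a few lines of base R. The step names here are hypothetical; the point is the zero-padded prefix and the lack of spaces.

```r
# Hypothetical analysis steps, in the order they should run
steps <- c("packages", "load data", "data wrangling", "modelling")

# Replace spaces with underscores so the names are computer-friendly
clean_steps <- gsub(" ", "_", steps, fixed = TRUE)

# Prefix with a zero-padded number (starting at 00) and add the extension
file_names <- sprintf("%02d_%s.R", seq_along(clean_steps) - 1L, clean_steps)
file_names

# Without leading zeros, alphabetical order stops matching the run order
unpadded_order <- sort(c("2_plots.R", "10_tables.R"))
unpadded_order
```

Here the unpadded names sort with 10_tables.R before 2_plots.R, which is exactly the problem the leading zeros avoid.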


We’ll come back to avoiding analysis_final.R and analysis_final_final.R later!

Multiple folders

Often, you don’t just have R code for a project…

project
│   00_packages.R
│   01_load_data.R
│   02_data_wrangling.R
│   03_exploratory_plots.R
│   04_modelling.R
│   05_final_plots_tables.R
│   data.csv
│   residuals.png
│   outcome_by_age.png
│   outcome_by_occupation.png

Multiple folders

… so don’t just organise your R code!

project
│   project.Rproj
│   README.md
└───data
│   │   data.csv
└───plots
│   │   residuals.png
│   │   outcome_by_age.png
│   │   outcome_by_occupation.png
└───R
│   │   00_packages.R
│   │   01_load_data.R
│   │   02_data_wrangling.R
│   │   03_exploratory_plots.R
│   │   04_modelling.R
│   │   05_final_plots_tables.R
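One possible way to set up folders like these from R, sketched with base functions only. A temporary location is used here so the example is safe to run; in a real project the folders would sit next to the .Rproj file.

```r
# Use a temporary location here; in practice this would be your project folder
project_dir <- file.path(tempdir(), "project")

# Create the data, plots, and R subfolders (and the project folder itself)
for (folder in c("data", "plots", "R")) {
  dir.create(
    file.path(project_dir, folder),
    recursive = TRUE,
    showWarnings = FALSE
  )
}

# Build paths with file.path() rather than pasting strings together
data_path <- file.path(project_dir, "data", "data.csv")
plot_path <- file.path(project_dir, "plots", "residuals.png")

list.files(project_dir)
```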

R script dependencies

project
└───R
│   │   00_packages.R
│   │   01_load_data.R
│   │   02_data_wrangling.R
│   │   03_exploratory_plots.R
│   │   04_modelling.R
│   │   05_final_plots_tables.R

  • Script 01 depends on 00
  • Script 02 depends on 01 (and 00)
  • Script 03 depends on 02 (and 01 and 00)
  • Script 04 depends on 02 (and 01 and 00, but not 03)
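A simple way to respect these dependencies is to run the scripts in name order. The sketch below uses tiny made-up stand-ins for three of the scripts, written to a temporary folder, to show the idea.

```r
# Hypothetical stand-ins for the numbered scripts, saved in a temporary folder
script_dir <- tempfile("R_scripts_")
dir.create(script_dir)
writeLines(
  "raw_values <- c(4, 8, NA)",
  file.path(script_dir, "01_load_data.R")
)
writeLines(
  "clean_values <- raw_values[!is.na(raw_values)]",
  file.path(script_dir, "02_data_wrangling.R")
)
writeLines(
  "mean_value <- mean(clean_values)",
  file.path(script_dir, "04_modelling.R")
)

# The numeric prefixes mean alphabetical order is also the run order
scripts <- sort(list.files(script_dir, pattern = "\\.R$", full.names = TRUE))
for (script in scripts) {
  source(script)
}

mean_value
```

This reruns every script every time, even when nothing upstream has changed, and it relies on you keeping the numbering honest.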

Documentation

Write this stuff down (in a README.md file)!

A better solution…

{targets}: a pipeline tool for statistics and data science in R.

  • watches the dependencies of your workflow
  • skips steps whose code, data, and upstream dependencies have not changed
  • unlike the source("script.R") approach, it also manages changes to data
  • visualise the dependencies using tar_visnetwork()
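A minimal sketch of what a _targets.R file might look like for this kind of project, assuming {targets} is installed. The helper functions and the outcome and age column names are hypothetical, not taken from a real analysis.

```r
# _targets.R (hypothetical sketch; assumes {targets} is installed)
library(targets)

# Hypothetical helper functions; in a real project these might live in R/
load_data <- function(path) read.csv(path)
wrangle_data <- function(raw) raw[complete.cases(raw), ]
fit_model <- function(clean) lm(outcome ~ age, data = clean)

# Each target names a step; dependencies are detected from the code itself
list(
  tar_target(data_file, "data/data.csv", format = "file"),
  tar_target(raw_data, load_data(data_file)),
  tar_target(clean_data, wrangle_data(raw_data)),
  tar_target(model, fit_model(clean_data))
)
```

With a file like this in the project root, targets::tar_make() runs the pipeline and targets::tar_visnetwork() draws the dependency graph. Declaring data_file with format = "file" is what lets changes to the data file trigger a rerun.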



Sharing your projects with others

It’s not just you who runs your code

Sharing code with other people

Imagine:

  • You’ve written some code, and you want me to review it so you email a file called code.R.
  • I add some comments or changes, and email your code back in a file called code_Nicola_comments.R.
  • You apply changes and add some more code, and ask for a review the next week. You email me a file called code_v2.R.
  • I add some comments or changes, and email your code back in a file called code_v2_Nicola_comments.R.
  • and so on…

Sharing code with other people

  • Git: a free, open source version control tool that you install on your laptop.

  • GitHub / GitLab / Bitbucket: a place to host online Git repositories, which allows people in different locations to work together on the code.



Sharing code with other people

GitHub* allows you to:

  • Keep track of different versions of files, and their history, without keeping multiple copies.
  • Keep track of changes, even after they are accepted.
  • Ask someone to review your code, and they can add comments, ask questions, or suggest changes.
  • Make issues, and set deadlines (i.e. use it as a project management tool).
  • Add automatic code checks e.g. linting.
  • Use it on your own, or with other people.

*or other online repository hosting services

Sharing code with the world

  • GitHub repositories can be private or public:

    • Develop in private (or public)
    • Make public when submitting to a journal
  • Add a license file to explain how other people can use your code

  • Use .gitignore and GitHub secrets to make sure you never accidentally upload data or sensitive information like passwords.
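A small sketch of both ideas in base R. The ignore rules and the MY_API_KEY name are made up for illustration, and a temporary folder stands in for the project.

```r
# Temporary folder standing in for a project (illustration only)
project_dir <- tempfile("project_")
dir.create(project_dir)

# Hypothetical ignore rules: the data folder, local secrets, R session files
ignore_rules <- c("data/", ".Renviron", ".Rhistory", ".RData")
gitignore_path <- file.path(project_dir, ".gitignore")
writeLines(ignore_rules, gitignore_path)

# Read secrets from environment variables instead of typing them into scripts.
# MY_API_KEY is a made-up name; Sys.getenv() returns "" when it is not set.
api_key <- Sys.getenv("MY_API_KEY")
if (!nzchar(api_key)) {
  message("MY_API_KEY is not set, so no key was loaded.")
}
```

Because the script only ever refers to the variable name, the value itself never appears in the repository history.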

Sharing code with the world

Software versions

If someone else is running your code, it’s best to make sure they have the same version of software as you:

  • R
  • R packages

Multiple solutions exist:

  • {renv}
  • {groundhog}
  • Nix
  • Docker …
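Even without any of these tools, base R can record which versions were used. The sketch below saves that information to a text file (a temporary one here), with the typical {renv} calls shown as comments on the assumption that {renv} is installed.

```r
# Record the R version and loaded packages in a plain-text file
info_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), info_file)

# Versions can also be queried individually
R.version.string
as.character(packageVersion("stats"))

# With {renv} (assumed installed), the usual workflow is roughly:
# renv::init()      # set up a project-specific library
# renv::snapshot()  # record package versions in renv.lock
# renv::restore()   # reinstall the recorded versions on another machine
```

A saved sessionInfo() does not reinstall anything for the next person, but it does tell them what to aim for.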

Sharing code with the world


You might not be able to make everything perfect but…



At least write down what you did!

Keep spaghetti in a pasta bowl, not your R scripts!
