starwars |>
  summarise(
    mean_height = mean(height, na.rm = TRUE), # calc height mean
    sd_height = sd(height, na.rm = TRUE) # calc height sd
  )
Stop making spaghetti (code)
Tips for writing better R code
Nicola Rennie, Lancaster University
useR, July 2024
Academic background in statistics
Experience in data science consultancy
Lecturer in Health Data Science in Lancaster Medical School.
Research interests: healthcare data, reproducible research, data visualisation, R pedagogy…
A talk about things that would have made my life so much easier if I’d known them five years ago.
Often as part of a statistics or data analysis course.
Or learning on the job.
Not often taught by computer scientists.
But…
A.K.A spaghetti code!
The jump from learning R in class to using R in research projects is big.
Supervisors might have varying levels of experience.
Hand-me-down code can reinforce bad habits.
It’s not something that’s covered in a lot of textbooks*.
*but there are some excellent ones out there!
Writing code that is readable and understandable is something that future you will be grateful for.
Writing code that is readable and understandable is something that other people will be grateful for.
By sharing useful tips we know
By sharing useful resources we find
By reviewing other people’s code
By having other people review our code
You don’t have to fix everything at once!
One file at a time
Add comments using a # in R (on a separate line).
Comments don’t need to explain what your code does.
Comments should explain why you did it.
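As a small illustration of the difference, the sketch below contrasts a comment that restates the code with one that records the reasoning. The `heights` vector and the stated reason for dropping missing values are made up for this example.

```r
# Hypothetical data: heights in cm, with one missing value
heights <- c(172, 167, NA, 202, 150)

# Less helpful: repeats what the code already says
# calc height mean
mean_height <- mean(heights, na.rm = TRUE)

# More helpful: records why the choice was made
# Some heights were never recorded (assumed reason, for illustration),
# so missing values are dropped rather than imputed.
sd_height <- sd(heights, na.rm = TRUE)
```

The second comment is still useful after the code changes, because it documents a decision rather than a function call.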
This code runs without errors but…
Linting - analysing source code for:
x<-3 vs x <- 3
mean(x, na.rm = T, na.rm = F)
In R, linting is performed by the {lintr} package.
Run lintr::lint("file.R").
Keyboard shortcuts
Use keyboard shortcuts to lint the current file (or package).
{lintr} tells you what’s wrong, but doesn’t fix it.
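A minimal sketch of linting a snippet directly, assuming {lintr} (version 3 or later) is installed. It passes code through the `text` argument of `lint()` instead of a file, and restricts the check to a single linter, `infix_spaces_linter()`, chosen here purely for illustration.

```r
library(lintr)

# Lint a string of code rather than a file on disk.
# Only the spacing-around-operators check is run in this sketch.
lints <- lint(text = "x<-3", linters = infix_spaces_linter())

# Printing the result shows the line, column, and message for each issue
print(lints)
```

Without the `linters` argument, `lint()` applies its default set of linters, which is usually what you want for a whole file.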
The {styler} R package will style your code for you.
Keyboard shortcuts
Add a keyboard shortcut for styler::style_active_file()!
Note: {styler} doesn’t fix all issues found by {lintr}.
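To see what {styler} does without touching a file, the sketch below styles a single string with `style_text()`. It assumes {styler} is installed and uses its default (tidyverse) style.

```r
library(styler)

# Restyle a code string; the result is a character vector of styled lines
styled <- style_text("x<-3")

print(styled)
```

`style_active_file()` applies the same kind of rewrite to the file currently open in the editor, so it is worth committing or saving a copy first if you want to inspect the changes.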
One directory at a time
Imagine a directory structure like this:
that contains all of the code for your analysis.
This is fine but:
Rscript.R is 4,000 lines long.
Naming files
- or _).
Note: similar rules apply for variable and function names.
We’ll come back to avoiding analysis_final.R and analysis_final_final.R later!
Often, you don’t just have R code for a project…
… so don’t just organise your R code!
01 depends on 00
02 depends on 01 (and 00)
03 depends on 02 (and 01 and 00)
04 depends on 02 (and 01 and 00, but not 03)
Documentation
Write this stuff down (in a README.md file)!
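One way to make the running order explicit, alongside the README, is a short script that sources the numbered files in sequence. This is only a sketch: the file names (`00_load.R`, `01_clean.R`, `02_summarise.R`) and their contents are hypothetical, and they are created in a temporary directory so the example is self-contained.

```r
# Create three hypothetical numbered scripts in a temporary project folder
project_dir <- file.path(tempdir(), "analysis_demo")
dir.create(project_dir, showWarnings = FALSE)
writeLines("raw <- c(1, 2, NA, 4)", file.path(project_dir, "00_load.R"))
writeLines("clean <- raw[!is.na(raw)]", file.path(project_dir, "01_clean.R"))
writeLines("result <- mean(clean)", file.path(project_dir, "02_summarise.R"))

# Find the numbered scripts and sort them so 00 runs before 01, and so on
scripts <- sort(list.files(
  project_dir,
  pattern = "^[0-9]{2}_.*\\.R$",
  full.names = TRUE
))

# Run each script in a shared environment so later scripts see earlier results
env <- new.env()
for (script in scripts) {
  source(script, local = env)
}

env$result
```

Two-digit prefixes keep the alphabetical order and the dependency order the same, which is what lets `sort()` do the work here.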
{targets}: a pipeline tool for statistics and data science in R.
Unlike the source("script.R") approach, it also manages changes to data.
tar_visnetwork()
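For orientation, a {targets} pipeline is defined in a file called `_targets.R` that ends with a list of targets. The sketch below is a hypothetical two-step pipeline, assuming {targets} is installed; the target names and the toy data are invented for illustration.

```r
# Contents of a hypothetical _targets.R file
library(targets)

pipeline <- list(
  # Step 1: the data (a toy data frame standing in for a real import step)
  tar_target(raw_data, data.frame(height = c(172, 167, NA, 202))),
  # Step 2: depends on raw_data, so it reruns only when raw_data changes
  tar_target(mean_height, mean(raw_data$height, na.rm = TRUE))
)

# _targets.R must end by returning the list of targets
pipeline
```

With this file in the project root, `tar_make()` runs whichever steps are out of date, and `tar_visnetwork()` draws the dependency graph (the latter needs the {visNetwork} package).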
It’s not just you who runs your code
Imagine:
code.R
code_Nicola_comments.R
code_v2.R
code_v2_Nicola_comments.R
Git: a free, open source version control tool that you install on your laptop.
GitHub / GitLab / BitBucket: a place to host online Git repositories, which allows people in different locations to work together on the code.
GitHub* allows you to:
*or other online repository hosting services
GitHub repositories can be private or public:
Add a license file to explain how other people can use your code
Use .gitignore and GitHub secrets to make sure you never accidentally upload data or sensitive information like passwords.
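A rough sketch of both ideas in R: writing a `.gitignore` file that excludes data and local credential files, and reading a password from an environment variable instead of typing it into a script. The ignore rules, the folder name, and the variable name `MY_DB_PASSWORD` are all hypothetical, and the file is written to a temporary directory so nothing in a real project is changed.

```r
# Temporary folder standing in for a project root
project_dir <- file.path(tempdir(), "gitignore_demo")
dir.create(project_dir, showWarnings = FALSE)

# Hypothetical ignore rules: adjust to match your own project layout
ignore_rules <- c(
  "data/",      # keep raw data out of the repository
  ".Renviron",  # local file that may hold passwords or API keys
  ".Rhistory"
)
writeLines(ignore_rules, file.path(project_dir, ".gitignore"))

# Read the secret from the environment instead of hard-coding it.
# MY_DB_PASSWORD is a made-up name for this example.
db_password <- Sys.getenv("MY_DB_PASSWORD", unset = NA)
if (is.na(db_password)) {
  message("MY_DB_PASSWORD is not set; define it locally or as a repository secret.")
}
```

Because the script only ever refers to the variable name, the same code can run locally and in an automated workflow that supplies the value from a stored secret.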
Software versions
If someone else is running your code, it’s best to make sure they have the same version of software as you:
Multiple solutions exist:
You might not be able to make everything perfect but…
At least write down what you did!
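As one low-effort way of writing down what you did, base R can save the R version and loaded package versions to a text file. This is a sketch of a simple option, not necessarily one of the solutions referred to above; the output file name is hypothetical and a temporary directory is used here.

```r
# Hypothetical output location; in a real project this might sit next to the README
out_file <- file.path(tempdir(), "session_info.txt")

# sessionInfo() reports the R version, platform, and attached package versions;
# capture.output() turns the printed report into lines of text
writeLines(capture.output(sessionInfo()), out_file)

# Show the first few lines of what was recorded
head(readLines(out_file))
```

Running this at the end of an analysis gives collaborators a record of the software versions that produced the results.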
The Turing Way: www.turing.ac.uk/research/research-projects/turing-way
Data Management in Large-Scale Education Research: datamgmtinedresearch.com
Building reproducible analytical pipelines with R: raps-with-r.dev
Happy Git with R: happygitwithr.com