R / Medicine 2023
Lecturer in Health Data Science at Lancaster University.
Academic background in statistics, and experience in data science consultancy.
Blog about R and data science at nrennie.rbind.io/blog.
5,272 sound recordings of heartbeats.
1,568 different patients.
4 recording locations.
Patient information such as height, weight, age, …, and whether or not they had been diagnosed with a heart murmur.
Each recording is around 10 seconds long.
Collected at 4,000 Hz.
~40,000 observations per time series.
Not all time series are of the same length!
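As a quick check of the numbers above, the sketch below reproduces the observation count from the sampling rate and nominal duration, and shows what unequal lengths look like in practice. The two recordings are simulated stand-ins, not the real heartbeat data.

```r
# Figures quoted on the slide
sample_rate_hz <- 4000
duration_s <- 10
n_obs <- sample_rate_hz * duration_s
n_obs

# Simulated stand-ins for two recordings of different duration (assumption)
set.seed(1)
recordings <- list(
  rec_a = rnorm(sample_rate_hz * 10),
  rec_b = rnorm(sample_rate_hz * 8)
)

# Number of observations and duration in seconds for each recording
lengths(recordings)
lengths(recordings) / sample_rate_hz
```

Because the lengths differ, the recordings cannot simply be stacked into a matrix with one column per time point.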
Aim: predict which recordings belong to patients with heart murmurs.
(Euclidean) distance approaches
Dynamic time warping
Shapelet-based methods
Kernel-based methods
Feature-based approaches
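To make the contrast between the first two approaches in this list concrete, here is a minimal base R sketch of dynamic time warping next to Euclidean distance. The function `dtw_distance()` and the toy series are illustrative assumptions, not code from the talk, and the double loop would be slow on series of around 40,000 points.

```r
# Dynamic time warping distance via dynamic programming (base R only).
# Hypothetical helper written for illustration.
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1] holds the cheapest alignment of x[1:i] with y[1:j]
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      d <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- d + min(
        cost[i, j + 1], # advance in x only
        cost[i + 1, j], # advance in y only
        cost[i, j]      # advance in both series
      )
    }
  }
  cost[n + 1, m + 1]
}

# Same shape, with y delayed by one time step
x <- c(0, 1, 2, 1, 0, 0)
y <- c(0, 0, 1, 2, 1, 0)

euclidean <- sqrt(sum((x - y)^2))
warped <- dtw_distance(x, y)
c(euclidean = euclidean, dtw = warped)

# Unlike Euclidean distance, DTW also accepts series of different lengths
dtw_distance(c(0, 1, 2, 1, 0), y)
```

Euclidean distance penalises the shift point by point, while warping absorbs it by letting one time point match several.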
Calculate some features of the time series.
Use the features as input to classification algorithms instead of the raw time series data.
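A minimal base R sketch of the idea: each series, whatever its length, is reduced to one row of summary features. The simulated series and the three features (mean, standard deviation, lag-1 autocorrelation) are illustrative assumptions, not the features used in the talk.

```r
set.seed(42)

# Simulated stand-ins for recordings of unequal length (assumption)
recordings <- list(
  rec_1 = sin(seq(0, 20 * pi, length.out = 4000)) + rnorm(4000, sd = 0.1),
  rec_2 = rnorm(3200),
  rec_3 = cumsum(rnorm(3600))
)

# Hypothetical helper: one row of features per series
summarise_series <- function(x) {
  data.frame(
    mean = mean(x),
    sd = sd(x),
    acf_lag1 = acf(x, lag.max = 1, plot = FALSE)$acf[2]
  )
}

# Bind the rows into a rectangular table, one row per recording
feature_table <- do.call(rbind, lapply(recordings, summarise_series))
feature_table
```

The resulting table has the same number of columns for every recording, so it can be passed to a standard classifier even though the raw series differ in length.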
Some time series features will tell us useful things…
… some won’t.
In R, we can use {tsfeatures} to calculate features.
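A short sketch of a `tsfeatures()` call, assuming the {tsfeatures} package is installed. The simulated series and the choice of `"acf_features"` and `"entropy"` are illustrative assumptions; the talk does not say here which features were computed.

```r
library(tsfeatures)

set.seed(42)

# Simulated stand-ins for two recordings of unequal length (assumption)
recordings <- list(
  rec_1 = ts(sin(seq(0, 20 * pi, length.out = 4000)) + rnorm(4000, sd = 0.1)),
  rec_2 = ts(rnorm(3200))
)

# One row per series, one column per feature
murmur_features <- tsfeatures(
  recordings,
  features = c("acf_features", "entropy")
)
murmur_features
```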
A collection of R packages for statistical modelling and machine learning.
Follows the {tidyverse} principles.
install.packages("tidymodels")
Binary classification \(\rightarrow\) logistic regression.
Many variables \(\rightarrow\) Lasso logistic regression.
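For reference, the usual form of the lasso-penalised logistic regression objective is sketched below. The notation (\(\pi_i\), \(\lambda\), \(\beta\)) is introduced here for illustration and does not appear in the slides. The penalty \(\lambda \sum_j |\beta_j|\) shrinks some coefficients exactly to zero, which is why it suits problems with many candidate features.

\[
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \left\{ -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \right] + \lambda \sum_{j=1}^{p} |\beta_j| \right\},
\qquad
\pi_i = \frac{1}{1 + e^{-(\beta_0 + x_i^\top \beta)}}
\]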
Split the data:
Make a recipe:
Specify model:
Tune hyperparameter:
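The four steps above can be sketched with {tidymodels} as follows. This is a minimal sketch, not the code from the talk: the data are simulated, and the column names, split proportion, number of folds, and grid size are all assumptions. It also assumes the {glmnet} package is installed.

```r
library(tidymodels)

set.seed(2023)

# Simulated stand-in for the feature table (assumption): one row per
# recording, hypothetical feature columns, and a binary outcome
n <- 300
features <- tibble(
  entropy = rnorm(n),
  x_acf1 = rnorm(n),
  trend = rnorm(n),
  height = rnorm(n, 140, 20),
  weight = rnorm(n, 40, 10)
)
features$murmur <- factor(
  ifelse(features$entropy + rnorm(n) > 0.8, "Present", "Absent"),
  levels = c("Present", "Absent")
)

# Split the data, stratifying on the outcome
murmur_split <- initial_split(features, prop = 0.8, strata = murmur)
murmur_train <- training(murmur_split)
murmur_folds <- vfold_cv(murmur_train, v = 5, strata = murmur)

# Make a recipe: drop constant columns, then centre and scale
murmur_recipe <- recipe(murmur ~ ., data = murmur_train) |>
  step_zv(all_predictors()) |>
  step_normalize(all_numeric_predictors())

# Specify model: mixture = 1 gives the lasso penalty
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

murmur_wf <- workflow() |>
  add_recipe(murmur_recipe) |>
  add_model(lasso_spec)

# Tune hyperparameter: cross-validate over a grid of penalty values
penalty_grid <- grid_regular(penalty(), levels = 20)
lasso_tuned <- tune_grid(
  murmur_wf,
  resamples = murmur_folds,
  grid = penalty_grid,
  metrics = metric_set(accuracy, roc_auc)
)

# Refit with the best penalty and evaluate once on the held-out test set
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
final_fit <- murmur_wf |>
  finalize_workflow(best_penalty) |>
  last_fit(murmur_split)
collect_metrics(final_fit)
```

Keeping the recipe and model in one workflow means the preprocessing is re-estimated within each resample, which avoids leaking information from the assessment folds.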
Specify model:
Tune hyperparameter:
Accuracy: 0.81
ROC AUC: 0.65
Accuracy: 0.80
ROC AUC: 0.71
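The two metrics reported above measure different things: accuracy depends on a classification threshold, while ROC AUC depends only on how well the predicted probabilities rank the cases. A minimal base R sketch on a toy example follows. The five observations and the 0.5 threshold are assumptions for illustration, not results from the talk.

```r
# Toy predictions (assumption)
truth <- c("Present", "Present", "Absent", "Absent", "Absent")
prob_present <- c(0.9, 0.4, 0.6, 0.2, 0.1)

# Accuracy: share of correct class predictions at a 0.5 threshold
pred <- ifelse(prob_present >= 0.5, "Present", "Absent")
accuracy <- mean(pred == truth)

# ROC AUC via the rank (Mann-Whitney) formulation: the probability that
# a randomly chosen "Present" case is scored above a random "Absent" case
is_present <- truth == "Present"
n_pos <- sum(is_present)
n_neg <- sum(!is_present)
rank_sum <- sum(rank(prob_present)[is_present])
auc <- (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(accuracy = accuracy, roc_auc = auc)
```

In a {tidymodels} workflow the same quantities are available through `yardstick::accuracy()` and `yardstick::roc_auc()`.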
as an additional diagnostic tool
longer term monitoring
looking at different categories of heart murmurs
Bias: systematic error due to incorrect assumptions.
Under-represented groups are most susceptible to the impact of bias.
Ensure sample reflects the population.
Evaluate performance across different groups, not just as a whole.
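A minimal base R sketch of evaluating performance by group rather than only overall. The groups, labels, and predictions are made-up toy values, chosen so that a reasonable overall accuracy hides poor performance in the smaller group.

```r
# Toy predictions for two hypothetical patient groups (assumption)
results <- data.frame(
  group = c(rep("A", 8), rep("B", 4)),
  truth = c("Present", "Absent", "Absent", "Absent",
            "Present", "Absent", "Absent", "Absent",
            "Present", "Present", "Absent", "Present"),
  pred = c("Present", "Absent", "Absent", "Absent",
           "Absent", "Absent", "Absent", "Absent",
           "Absent", "Absent", "Absent", "Absent")
)
results$correct <- results$truth == results$pred

# Accuracy over the whole sample
overall <- mean(results$correct)

# Accuracy within each group
by_group <- tapply(results$correct, results$group, mean)

overall
by_group
```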
Are false positives as bad as false negatives?
(No.)
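One way to act on this is to weight the two error types differently when choosing a classification threshold. The sketch below uses base R. The toy predictions, the helper `count_errors()`, and the 5:1 cost ratio are hypothetical, chosen purely for illustration.

```r
# Toy predictions (assumption)
truth <- c("Present", "Present", "Present",
           "Absent", "Absent", "Absent", "Absent", "Absent")
prob_present <- c(0.8, 0.45, 0.3, 0.6, 0.4, 0.2, 0.1, 0.05)

# Hypothetical helper: count each type of error at a given threshold
count_errors <- function(truth, prob, threshold) {
  pred_present <- prob >= threshold
  is_present <- truth == "Present"
  c(
    false_positive = sum(pred_present & !is_present),
    false_negative = sum(!pred_present & is_present)
  )
}

# Hypothetical costs: a missed murmur counts five times a false alarm
cost_weights <- c(false_positive = 1, false_negative = 5)

errors_050 <- count_errors(truth, prob_present, threshold = 0.5)
errors_025 <- count_errors(truth, prob_present, threshold = 0.25)

cost_050 <- sum(errors_050 * cost_weights)
cost_025 <- sum(errors_025 * cost_weights)

rbind(
  threshold_0.50 = c(errors_050, cost = cost_050),
  threshold_0.25 = c(errors_025, cost = cost_025)
)
```

Under these assumed costs, lowering the threshold trades one extra false positive for two fewer false negatives and reduces the total cost.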