Using {tidymodels} to detect heart murmurs

R / Medicine 2023

Nicola Rennie

About Me

Lecturer in Health Data Science at Lancaster University.


Academic background in statistics, and experience in data science consultancy.


Blog about R and data science at nrennie.rbind.io/blog.

Photo of speaker wearing red jacket

Data

Data

  • CirCor DigiScope Phonocardiogram Dataset.

  • 5,272 sound recordings of heartbeats.

  • 1,568 different patients.

  • 4 recording locations.

  • Patient information such as height, weight, age, …, and whether or not they had been diagnosed with a heart murmur.

Time series plot of aortic valve sound recording for subject 13918.

Data

  • Each recording is around 10 seconds long.

  • Collected at 4,000 Hz.

  • ~40,000 observations per time series.

  • Not all time series are of the same length!
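As a quick check of the arithmetic above, a minimal sketch in base R. The three durations are hypothetical values chosen only to show why series lengths differ.

```r
# Values from the slide: sampled at 4,000 Hz, recordings of around 10 seconds.
sample_rate_hz <- 4000
duration_secs <- 10

# Observations per series = sampling rate x duration.
n_obs <- sample_rate_hz * duration_secs
n_obs

# Hypothetical durations (not from the dataset): slightly different recording
# times give series of different lengths.
durations <- c(8.5, 10, 12.2)
lengths <- round(sample_rate_hz * durations)
lengths
```

Any method that compares series point by point has to deal with these unequal lengths first.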

Time series plot of aortic valve sound recording for subject 13918.

Aim: predict which time series of recordings belong to those with heart murmurs.

Time series analysis

Classifying time series

  • (Euclidean) distance approaches

  • Dynamic time warping

  • Shapelet-based methods

  • Kernel-based methods

  • Feature-based approaches
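To make one of these options concrete, here is a minimal sketch of dynamic time warping in base R. It is a textbook dynamic-programming version written for illustration, not the implementation from any particular package, and `dtw_distance()` is a hypothetical helper name. Unlike a point-by-point Euclidean distance, it can compare series of different lengths.

```r
# Minimal DTW distance between two numeric vectors (illustrative only).
dtw_distance <- function(x, y) {
  n <- length(x)
  m <- length(y)
  # cost[i + 1, j + 1] is the cheapest cumulative cost of aligning
  # x[1:i] with y[1:j]; the extra first row and column are the boundary.
  cost <- matrix(Inf, nrow = n + 1, ncol = m + 1)
  cost[1, 1] <- 0
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      step_cost <- abs(x[i] - y[j])
      cost[i + 1, j + 1] <- step_cost + min(
        cost[i, j + 1], # y[j] is also matched to the previous point of x
        cost[i + 1, j], # x[i] is also matched to the previous point of y
        cost[i, j]      # both series move forward one point
      )
    }
  }
  cost[n + 1, m + 1]
}

a <- c(0, 1, 2, 1, 0)
b <- c(0, 0, 1, 2, 1, 0) # same shape as `a`, with the first point repeated
dtw_distance(a, b)
```

The nested loop is quadratic in the series length, which becomes expensive for series of around 40,000 points each.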

Time series plot plotted in different colours above and below zero

Classifying time series features

  • Calculate some features of the time series.

  • Use the features as input to classification algorithms instead of the raw time series data.
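A minimal sketch of the idea in base R, using simulated series as a stand-in for the recordings (the data and the object names `series_list` and `features` are hypothetical). Each series, whatever its length, is reduced to one row of features.

```r
set.seed(2023)

# Hypothetical stand-in data: three series of different lengths.
series_list <- list(
  a = rnorm(100, mean = 0, sd = 1),
  b = rnorm(150, mean = 0, sd = 3),
  c = rnorm(120, mean = 2, sd = 1)
)

# One row per series, one column per feature.
features <- data.frame(
  id = names(series_list),
  mean = vapply(series_list, mean, numeric(1)),
  sd = vapply(series_list, sd, numeric(1)),
  row.names = NULL
)
features
```

The resulting table has a fixed number of columns, so it can be passed to a standard classifier even though the original series have different lengths.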

Scatter plot of mean against standard deviation of time series

Classifying time series features

Some time series features will tell us useful things…




… some won’t.

Calculating time series features

In R, we can use {tsfeatures} to calculate features.

library(tsfeatures)
ts_fts <- tsfeatures(
  ts_data,
  features = c(
    "acf_features", "outlierinclude_mdrmd",
    "arch_stat", "max_level_shift",
    "max_var_shift", "entropy",
    "pacf_features", "firstmin_ac",
    "std1st_der", "stability",
    "firstzero_ac", "hurst",
    "lumpiness", "motiftwo_entro3"
  )
)
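To give a feel for what one of these features measures, here is a sketch in base R of two autocorrelation summaries: the lag-1 autocorrelation and the sum of the first ten squared autocorrelations. This is an illustration on a simulated series, assuming these quantities are similar in spirit to some of what `"acf_features"` returns. It is not the {tsfeatures} code.

```r
set.seed(1)

# Hypothetical series: a noisy sine wave standing in for a recording.
x <- sin(seq(0, 20 * pi, length.out = 1000)) + rnorm(1000, sd = 0.2)

# Autocorrelations at lags 0 to 10; the first element is lag 0.
acf_vals <- acf(x, lag.max = 10, plot = FALSE)$acf

acf1 <- acf_vals[2]             # lag-1 autocorrelation
acf10 <- sum(acf_vals[2:11]^2)  # sum of squared autocorrelations, lags 1 to 10

c(acf1 = acf1, acf10 = acf10)
```

A strongly periodic signal gives a lag-1 autocorrelation close to 1, while pure noise gives a value close to 0.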

Calculating time series features

Box plot showing ACF distributions for four locations, coloured by those with and without heart murmurs.

Fitting models with {tidymodels}

What is {tidymodels}?

  • A collection of R packages for statistical modelling and machine learning.

  • Follows the {tidyverse} principles.

  • install.packages("tidymodels")

tidymodels R package hex sticker logo

Choosing a model

flowchart of process of choosing a model

  • Binary classification \(\rightarrow\) logistic regression.

  • Many variables \(\rightarrow\) Lasso logistic regression.
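For reference, a sketch of what the lasso adds to logistic regression, written in the form used in the {glmnet} documentation. The symbols are notation introduced for this note, not taken from the slides: \(y_i \in \{0, 1\}\) is the outcome, \(x_i\) the vector of \(p\) features for observation \(i\), and \(\lambda \ge 0\) the penalty. The fitted coefficients minimise

\[
-\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i\,(\beta_0 + x_i^\top \beta) - \log\big(1 + e^{\beta_0 + x_i^\top \beta}\big) \Big] + \lambda \sum_{j=1}^{p} |\beta_j|
\]

The absolute-value penalty can shrink some coefficients to exactly zero, so the model also acts as a variable selector when there are many candidate features.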

Fitting models

Split the data:

library(tidymodels)

murmurs_split <- initial_split(all_model_data, strata = murmur)
murmurs_train <- training(murmurs_split)
murmurs_test <- testing(murmurs_split)
murmurs_folds <- vfold_cv(murmurs_train, strata = murmur)
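What stratifying does, conceptually: rows are sampled within each outcome class, so the training and testing sets keep roughly the same class balance. A minimal base R sketch on hypothetical toy data follows. It is not how {rsample} is implemented; the 75% proportion is chosen to mirror what I understand to be the `initial_split()` default.

```r
set.seed(42)

# Hypothetical imbalanced outcome: 80 "Absent" and 20 "Present".
toy <- data.frame(
  id = 1:100,
  murmur = factor(rep(c("Absent", "Present"), times = c(80, 20)))
)

# Sample 75% of the row indices within each class.
train_idx <- unlist(lapply(
  split(seq_len(nrow(toy)), toy$murmur),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))

toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# Both sets keep the 80/20 class balance.
prop.table(table(toy_train$murmur))
prop.table(table(toy_test$murmur))
```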

Make a recipe:

murmurs_recipe <- recipe(murmur ~ ., data = murmurs_train) |> 
  step_normalize(all_numeric_predictors())
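A sketch of what normalising means here, in base R with hypothetical numbers: each numeric predictor is centred and scaled using the mean and standard deviation estimated from the training data, and the same values are then reused for new data.

```r
# Hypothetical predictor values (for example, heights in cm).
train_x <- c(150, 160, 170, 180, 190)
test_x <- c(165, 200)

# Estimate the centre and scale from the training data only.
mu <- mean(train_x)
sigma <- sd(train_x)

# Apply the same transformation to both sets.
train_z <- (train_x - mu) / sigma
test_z <- (test_x - mu) / sigma

round(train_z, 2)
round(test_z, 2)
```

Putting predictors on a common scale matters for penalised models such as the lasso, where the penalty treats all coefficients alike.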

Create a workflow:

wf <- workflow() |> 
  add_recipe(murmurs_recipe) 

Fitting models

Lasso logistic regression

Specify model:

tune_spec <- logistic_reg(penalty = tune(), mixture = 1) |> set_engine("glmnet")

Tune the hyperparameter:

lasso_grid <- tune_grid(wf |> add_model(tune_spec),
                        resamples = murmurs_folds,
                        grid = grid_regular(penalty(), levels = 50))
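For intuition about what is being searched, a base R sketch of a regular grid of 50 penalty values. It assumes the default range of `penalty()` is 10^-10 to 10^0 on a log10 scale, which is my understanding of the {dials} default, so treat the exact end points as an assumption.

```r
# 50 candidate penalties, evenly spaced on the log10 scale (assumed range).
penalty_values <- 10^seq(-10, 0, length.out = 50)

range(penalty_values)
head(penalty_values, 3)
```

Each candidate value is fitted and scored on every resample in `murmurs_folds`, so the cost grows with both the grid size and the number of folds.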

Choose the best value:

highest_roc_auc <- lasso_grid |> select_best(metric = "roc_auc")

Evaluate the final model:

final_lasso <- finalize_workflow(wf |> add_model(tune_spec), highest_roc_auc)
last_fit(final_lasso, murmurs_split) |> collect_metrics()

Fitting models

Random forest

Specify model:

rf_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) |>
  set_mode("classification") |> set_engine("ranger")

Tune the hyperparameters:

rf_grid <- tune_grid(
  wf |> add_model(rf_spec),
  resamples = murmurs_folds,
  grid = grid_regular(mtry(range = c(5, 25)), min_n(range = c(1, 25)), levels = 5))

Choose the best value:

highest_roc_auc <- rf_grid |> select_best(metric = "roc_auc")

Evaluate the final model:

final_rf <- finalize_workflow(wf |> add_model(rf_spec), highest_roc_auc)
last_fit(final_rf, murmurs_split) |> collect_metrics()

Initial results

Lasso logistic regression

Accuracy: 0.81
ROC AUC: 0.65

confusion matrix of lasso regression results

Random forest

Accuracy: 0.80
ROC AUC: 0.71

confusion matrix of random forest results
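Why accuracy and ROC AUC can tell different stories: a base R sketch on hypothetical predictions (the numbers are invented for illustration and are not the talk's results). Accuracy depends on a single threshold, while ROC AUC measures how well the predicted probabilities rank murmur cases above non-murmur cases.

```r
# Hypothetical truth (1 = murmur present) and predicted probabilities.
truth <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0)
prob <- c(0.6, 0.3, 0.4, 0.2, 0.1, 0.35, 0.05, 0.15, 0.25, 0.45)

# Accuracy at a 0.5 threshold.
pred <- as.integer(prob >= 0.5)
accuracy <- mean(pred == truth)

# Accuracy of always predicting "no murmur" (the majority class).
baseline_accuracy <- mean(truth == 0)

# ROC AUC via the rank (Mann-Whitney) formulation: the proportion of
# (murmur, no murmur) pairs in which the murmur case has the higher probability.
n_pos <- sum(truth == 1)
n_neg <- sum(truth == 0)
auc <- (sum(rank(prob)[truth == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(accuracy = accuracy, baseline = baseline_accuracy, auc = auc)
```

With imbalanced classes, a model that rarely predicts a murmur can still score a high accuracy, so it helps to compare accuracy against the majority-class baseline.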

Using ML in healthcare and medicine

Potential clinical use cases

  • as an additional diagnostic tool

    • can pick up sub-audible sounds
    • cost-effective, first-line screening
  • longer term monitoring

  • looking at different categories of heart murmurs

Bias

  • Bias: systematic error due to incorrect assumptions.

  • Under-represented groups are most susceptible to the impact of bias.

  • Ensure sample reflects the population.

  • Evaluate performance across different groups, not just as a whole.

Two groups of people one showing 50% accuracy with the other 100% giving an average of 75%
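The same point as a base R sketch, with hypothetical predictions for two groups of four people each: an overall accuracy of 75% hides the fact that the model is no better than a coin flip for one group.

```r
# Hypothetical results: group label, true class, predicted class.
results <- data.frame(
  group = rep(c("A", "B"), each = 4),
  truth = c(1, 0, 1, 0, 1, 0, 1, 0),
  pred = c(1, 0, 0, 1, 1, 0, 1, 0)
)

correct <- results$pred == results$truth

# Accuracy over everyone, then separately within each group.
overall <- mean(correct)
by_group <- tapply(correct, results$group, mean)

overall
by_group
```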

Model evaluation

Are false positives as bad as false negatives?


(No.)

Grid of 100 users with one highlighted
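One practical consequence, sketched in base R with hypothetical predictions: the classification threshold trades one type of error for the other, so it can be chosen to reflect which mistake is more costly. `error_counts()` is a helper defined here for illustration.

```r
# Hypothetical truth (1 = murmur present) and predicted probabilities.
truth <- c(1, 1, 1, 0, 0, 0, 0, 0, 0, 0)
prob <- c(0.9, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.15, 0.1, 0.05)

# Count each type of error at a given threshold.
error_counts <- function(threshold) {
  pred <- as.integer(prob >= threshold)
  c(
    false_negatives = sum(pred == 0 & truth == 1), # missed murmurs
    false_positives = sum(pred == 1 & truth == 0)  # false alarms
  )
}

error_counts(0.5)
error_counts(0.25)
```

In this toy example, lowering the threshold from 0.5 to 0.25 removes the missed murmurs at the price of more false alarms. Which trade-off is acceptable is a clinical question rather than a purely statistical one.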

Questions?