Introduction to machine learning with {tidymodels} in R

RSS International Conference 2024

Nicola Rennie

Welcome!

What to expect during this workshop

The workshop will run for 80 minutes.

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions throughout!

What to expect during this workshop


I hope you end up with more questions than answers after this workshop!


Stranger Things questions gif

Source: giphy.com

Workshop resources

Requirements

  • Access to R on your laptop or via Posit Cloud.

  • The following packages installed:

    • tidymodels
    • glmnet
    • ranger
    • openintro (for data)
    • dplyr, tidyr, ggplot2, forcats (optional)

Data

We’ll use data from the {openintro} R package:

library(openintro) 
View(smoking)
View(resume)

Getting started with {tidymodels}

What is {tidymodels}?

  • A collection of R packages for statistical modelling and machine learning.

  • Follows the {tidyverse} principles.

  • install.packages("tidymodels")

tidymodels R package hex sticker logo

What is {tidymodels}?

There are some core {tidymodels} packages…


… and plenty of extensions!

Before we start fitting models…

Types of machine learning

Supervised learning: requires labelled input data

  • Classification

  • Regression-based models

Unsupervised learning: does not require labelled input data

  • Clustering

  • Association rules

Other types of machine learning include semi-supervised learning and reinforcement learning.

Training and testing data

Training and testing diagram
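
A minimal sketch of creating the split with {rsample} (loaded as part of {tidymodels}); the proportion and seed are illustrative assumptions.

library(tidymodels)
library(openintro)

# Split the smoking data: 75% for training, 25% for testing
set.seed(123)
smoking_split <- initial_split(smoking, prop = 0.75)
smoking_train <- training(smoking_split)
smoking_test <- testing(smoking_split)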

Hyperparameter tuning

We can’t always learn every parameter from the data: hyperparameters are set before fitting and tuned instead, typically using resampling such as cross-validation.
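
A minimal sketch of creating cross-validation folds for tuning with {rsample}; the number of folds is an illustrative assumption.

# Create 5 cross-validation folds from the training data
set.seed(123)
smoking_folds <- vfold_cv(smoking_train, v = 5)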

Workflows and recipes

recipes package hex sticker

Recipe

A series of preprocessing steps performed on data before you fit a model.


Workflow

An object that can combine your pre-processing, modelling, and post-processing steps. E.g. combine a recipe with a model.
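
A minimal sketch of a recipe combined with a model in a workflow. The outcome (smoke) and predictors are assumptions for illustration; the live demo may use different variables and steps.

# Recipe: declare the model formula and pre-processing steps
# (assumes smoke is a factor; convert with factor() if needed)
smoking_recipe <- recipe(smoke ~ age + gender, data = smoking_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Workflow: bundle the recipe with a model specification
smoking_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(logistic_reg())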

Pre-processing in {tidymodels}

Live demo!



See examples/example_01.R for full code.

Exercise 1

Open exercises/exercise_01.R for prompts.

  • Load the resume data from {openintro}. Do you need to do any pre-processing?

  • Perform the initial split (choose your own proportion!).

  • Create some cross-validation folds.

  • Build a recipe and workflow. The outcome is received_callback.

05:00

See exercise_solutions/exercise_solutions_01.R on GitHub for full code.

LASSO regression

Linear and logistic regression models

Let’s go back a little bit first…

Linear regression

lm(y ~ x, data = model_data)

Linear regression plot

Logistic regression

glm(y ~ x, family = "binomial", data = model_data)

Logistic regression plot

LASSO regression

Standard regression: minimise the difference between predicted and observed values (e.g. the sum of squared errors)


Least Absolute Shrinkage and Selection Operator (LASSO): minimise (difference between predicted and observed values + \(\lambda\) \(\times\) sum of the absolute values of the coefficients)
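
For a linear model, this objective can be written (as a sketch, with \(\beta_j\) denoting the coefficients) as:

\[ \min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]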


See also: ridge regression

Hyperparameters for LASSO regression

\(\lambda\) (penalty) takes a value between 0 and \(\infty\).

  • Higher value: more coefficients are pushed towards zero

  • Lower value: closer to a standard regression model (\(\lambda = 0\) recovers standard regression).
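
A minimal sketch of specifying and tuning a LASSO logistic regression with {parsnip}, {glmnet}, and {tune}. The grid size and the recipe/folds from the earlier sketches are assumptions; see examples/example_02.R for the code used in the workshop.

# Model specification: mixture = 1 gives LASSO, penalty is tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(lasso_spec)

# Try a grid of penalty values across the cross-validation folds
lasso_grid <- grid_regular(penalty(), levels = 20)
lasso_tuned <- tune_grid(lasso_workflow,
                         resamples = smoking_folds,
                         grid = lasso_grid)

# Select the best penalty and fit the final model on the training data
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
lasso_fit <- finalize_workflow(lasso_workflow, best_penalty) |>
  fit(data = smoking_train)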

Model evaluation

(Binary) Classification Metrics

  • Accuracy: proportion of the data that are predicted correctly.

  • ROC AUC: area under the ROC (receiver operating characteristic) curve.

  • Kappa: similar to accuracy but normalised by the accuracy expected by chance alone.

See yardstick.tidymodels.org/articles/metric-types.html.
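
A minimal sketch of computing these metrics with {yardstick} on the test set, assuming the fitted LASSO workflow and smoking data from the earlier sketches; column and level names may differ for your own outcome.

# Add predicted classes and probabilities to the test data
smoking_preds <- augment(lasso_fit, new_data = smoking_test)

# Class-based metrics: accuracy and kappa
class_metrics <- metric_set(accuracy, kap)
class_metrics(smoking_preds, truth = smoke, estimate = .pred_class)

# ROC AUC uses the predicted probability of the event class
# (assumed here to be the "Yes" level, stored in .pred_Yes)
roc_auc(smoking_preds, truth = smoke, .pred_Yes, event_level = "second")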

LASSO logistic regression in {tidymodels}

Live demo!



See examples/example_02.R for full code.

Exercise 2

Open exercises/exercise_02.R for prompts. You can also use examples/example_02.R as a starting point.

  • Specify the model using logistic_reg().

  • Tune the hyperparameter.

  • Choose the best value and fit the final model.

  • Evaluate the model performance.

10:00

See exercise_solutions/exercise_solutions_02.R for full code.

Random Forests

Decision trees

A tree-like model of decisions and their possible consequences.

Decision tree about walking to work and the weather
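
A minimal sketch of fitting a single classification tree with {parsnip}; the default "rpart" engine is assumed here and is not in the workshop package list.

# A single decision tree predicting smoking status
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

tree_fit <- fit(tree_spec, smoke ~ age + gender, data = smoking_train)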

What are Random Forests?

  • An ensemble method that combines many decision trees.

  • Can be used for classification or regression problems.

  • For classification tasks, the output of the random forest is the class selected by most trees.

Hyperparameters for random forests

trees: number of trees in the ensemble.


mtry: number of predictors that will be randomly sampled at each split when creating the tree models.


min_n: minimum number of data points in a node that are required for the node to be split further.
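
A minimal sketch of a random forest specification with {parsnip} and {ranger}, tuning mtry and min_n while fixing the number of trees. The values and the recipe/folds from earlier sketches are illustrative assumptions; see examples/example_03.R for the code used in the workshop.

# Random forest with 500 trees; tune mtry and min_n
rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(rf_spec)

# Tune across the cross-validation folds with a space-filling grid
set.seed(123)
rf_tuned <- tune_grid(rf_workflow,
                      resamples = smoking_folds,
                      grid = 10)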

Random Forests in {tidymodels}

Live demo!



See examples/example_03.R for full code.

Exercise 3

Open exercises/exercise_03.R for prompts. You can also use examples/example_03.R as a starting point.

  • Specify a random forest model using rand_forest().

  • Tune the hyperparameters using the cross-validation folds.

  • Fit the final model and evaluate it.

10:00

See exercise_solutions/exercise_solutions_03.R for full code.

Additional Information

Additional resources

Workshop resources