Introduction to machine learning with {tidymodels} in R

RSS International Conference 2024

Nicola Rennie

Welcome!

What to expect during this workshop

The workshop will run for 80 minutes.

  • Combines slides, live coding examples, and exercises for you to participate in.

  • Ask questions throughout!

What to expect during this workshop


I hope you end up with more questions than answers after this workshop!


Stranger Things questions gif

Source: giphy.com

Workshop resources

Requirements

  • Access to R on your laptop or via Posit Cloud.

  • The following packages installed:

    • tidymodels
    • glmnet
    • ranger
    • openintro (for data)
    • dplyr, tidyr, ggplot2, forcats (optional)

Data

We’ll use data from the {openintro} R package:

library(openintro) 
View(smoking)
View(resume)

Getting started with {tidymodels}

What is {tidymodels}?

  • A collection of R packages for statistical modelling and machine learning.

  • Follows the {tidyverse} principles.

  • install.packages("tidymodels")

tidymodels R package hex sticker logo

What is {tidymodels}?

There are some core {tidymodels} packages…


… and plenty of extensions!

Before we start fitting models…

Types of machine learning

Supervised learning: requires labelled input data

  • Classification

  • Regression-based models

Unsupervised learning: does not require labelled input data

  • Clustering

  • Association rules

Other types of machine learning include semi-supervised learning and reinforcement learning.

Training and testing data

Training and testing diagram
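
A minimal sketch of creating the split with {rsample} (loaded as part of {tidymodels}); the proportion and seed are illustrative assumptions.

library(tidymodels)
library(openintro)

# Split the smoking data: 75% for training, 25% for testing
set.seed(123)
smoking_split <- initial_split(smoking, prop = 0.75)
smoking_train <- training(smoking_split)
smoking_test <- testing(smoking_split)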

Hyperparameter tuning

We can’t always learn every parameter from the data: hyperparameters are set before fitting and tuned instead, typically using resampling such as cross-validation.
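
A minimal sketch of creating cross-validation folds for tuning with {rsample}; the number of folds is an illustrative assumption.

# Create 5 cross-validation folds from the training data
set.seed(123)
smoking_folds <- vfold_cv(smoking_train, v = 5)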

Workflows and recipes

recipes package hex sticker

Recipe

A series of preprocessing steps performed on data before you fit a model.


Workflow

An object that can combine your pre-processing, modelling, and post-processing steps. E.g. combine a recipe with a model.
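
A minimal sketch of a recipe combined with a model in a workflow. The outcome (smoke) and predictors are assumptions for illustration; the live demo may use different variables and steps.

# Recipe: declare the model formula and pre-processing steps
# (assumes smoke is a factor; convert with factor() if needed)
smoking_recipe <- recipe(smoke ~ age + gender, data = smoking_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Workflow: bundle the recipe with a model specification
smoking_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(logistic_reg())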

Pre-processing in {tidymodels}

Live demo!



See examples/example_01.R for full code.

Exercise 1

Open exercises/exercise_01.R for prompts.

  • Load the resume data from {openintro}. Do you need to do any pre-processing?

  • Perform the initial split (choose your own proportion!).

  • Create some cross-validation folds.

  • Build a recipe and workflow. The outcome is received_callback.

05:00

See exercise_solutions/exercise_solutions_01.R on GitHub for full code.

LASSO regression

Linear and logistic regression models

Let’s go back a little bit first…

Linear regression

lm(y ~ x, data = model_data)

Linear regression plot

Logistic regression

glm(y ~ x, family = "binomial", data = model_data)

Logistic regression plot

LASSO regression

Standard regression: minimise the difference between predicted and observed values (e.g. the sum of squared errors)


Least Absolute Shrinkage and Selection Operator (LASSO): minimise (difference between predicted and observed values + \(\lambda\) \(\times\) sum of the absolute values of the coefficients)
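
For a linear model, this objective can be written (as a sketch, with \(\beta_j\) denoting the coefficients) as:

\[ \min_{\beta} \; \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \]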


See also: ridge regression

Hyperparameters for LASSO regression

\(\lambda\) (penalty) takes a value between 0 and \(\infty\).

  • Higher value: more coefficients are pushed towards zero

  • Lower value: closer to a standard regression model (\(\lambda = 0\) recovers standard regression).
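
A minimal sketch of specifying and tuning a LASSO logistic regression with {parsnip}, {glmnet}, and {tune}. The grid size and the recipe/folds from the earlier sketches are assumptions; see examples/example_02.R for the code used in the workshop.

# Model specification: mixture = 1 gives LASSO, penalty is tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(lasso_spec)

# Try a grid of penalty values across the cross-validation folds
lasso_grid <- grid_regular(penalty(), levels = 20)
lasso_tuned <- tune_grid(lasso_workflow,
                         resamples = smoking_folds,
                         grid = lasso_grid)

# Select the best penalty and fit the final model on the training data
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
lasso_fit <- finalize_workflow(lasso_workflow, best_penalty) |>
  fit(data = smoking_train)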

Model evaluation

(Binary) Classification Metrics

  • Accuracy: proportion of the data that are predicted correctly.

  • ROC AUC: area under the ROC (receiver operating characteristic) curve.

  • Kappa: similar to accuracy but normalised by the accuracy expected by chance alone.

See yardstick.tidymodels.org/articles/metric-types.html.
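
A minimal sketch of computing these metrics with {yardstick} on the test set, assuming the fitted LASSO workflow and smoking data from the earlier sketches; column and level names may differ for your own outcome.

# Add predicted classes and probabilities to the test data
smoking_preds <- augment(lasso_fit, new_data = smoking_test)

# Class-based metrics: accuracy and kappa
class_metrics <- metric_set(accuracy, kap)
class_metrics(smoking_preds, truth = smoke, estimate = .pred_class)

# ROC AUC uses the predicted probability of the event class
# (assumed here to be the "Yes" level, stored in .pred_Yes)
roc_auc(smoking_preds, truth = smoke, .pred_Yes, event_level = "second")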

LASSO logistic regression in {tidymodels}

Live demo!



See examples/example_02.R for full code.

Exercise 2

Open exercises/exercise_02.R for prompts. You can also use examples/example_02.R as a starting point.

  • Specify the model using logistic_reg().

  • Tune the hyperparameter.

  • Choose the best value and fit the final model.

  • Evaluate the model performance.

10:00

See exercise_solutions/exercise_solutions_02.R for full code.

Random Forests

Decision trees

A tree-like model of decisions and their possible consequences.

Decision tree about walking to work and the weather
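
A minimal sketch of fitting a single classification tree with {parsnip}; the default "rpart" engine is assumed here and is not in the workshop package list.

# A single decision tree predicting smoking status
tree_spec <- decision_tree() |>
  set_engine("rpart") |>
  set_mode("classification")

tree_fit <- fit(tree_spec, smoke ~ age + gender, data = smoking_train)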

What are Random Forests?

  • An ensemble method that combines many decision trees.

  • Can be used for classification or regression problems.

  • For classification tasks, the output of the random forest is the class selected by most trees.

Hyperparameters for random forests

trees: number of trees in the ensemble.


mtry: number of predictors that will be randomly sampled at each split when creating the tree models.


min_n: minimum number of data points in a node that are required for the node to be split further.
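
A minimal sketch of a random forest specification with {parsnip} and {ranger}, tuning mtry and min_n while fixing the number of trees. The values and the recipe/folds from earlier sketches are illustrative assumptions; see examples/example_03.R for the code used in the workshop.

# Random forest with 500 trees; tune mtry and min_n
rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_workflow <- workflow() |>
  add_recipe(smoking_recipe) |>
  add_model(rf_spec)

# Tune across the cross-validation folds with a space-filling grid
set.seed(123)
rf_tuned <- tune_grid(rf_workflow,
                      resamples = smoking_folds,
                      grid = 10)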

Random Forests in {tidymodels}

Live demo!



See examples/example_03.R for full code.

Exercise 3

Open exercises/exercise_03.R for prompts. You can also use examples/example_03.R as a starting point.

  • Specify a random forest model using rand_forest().

  • Tune the hyperparameters using the cross-validation folds.

  • Fit the final model and evaluate it.

10:00

See exercise_solutions/exercise_solutions_03.R for full code.

Additional Information

Additional resources

Workshop resources