RSS International Conference 2024
The workshop will run for 80 minutes.
Combines slides, live coding examples, and exercises for you to participate in.
Ask questions throughout!
I hope you end up with more questions than answers after this workshop!
Access to R on your laptop or via Posit Cloud.
Installed the following packages:
We’ll use data from the {openintro} R package:
A collection of R packages for statistical modelling and machine learning.
Follows the {tidyverse} principles.
install.packages("tidymodels")
There are some core {tidymodels} packages…
… and plenty of extensions!
Supervised learning: requires labelled input data
Classification
Regression-based models
…
Unsupervised learning: does not require labelled input data
Clustering
Association rules
…
Other types of machine learning include semi-supervised learning and reinforcement learning.
We can’t always learn every parameter from the data.
Recipe
A series of preprocessing steps performed on data before you fit a model.
Workflow
An object that can combine your pre-processing, modelling, and post-processing steps. E.g. combine a recipe with a model.
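As a rough illustration of how a recipe and a workflow fit together, here is a minimal sketch. It assumes the `two_class_dat` data from {modeldata} as a stand-in (it is not the workshop data), and the object names (`toy_split`, `toy_recipe`, `toy_wf`) are made up for this example.

```r
library(tidymodels)

# Stand-in data (an assumption, not the workshop data):
# two numeric predictors (A, B) and a two-level factor outcome (Class)
data(two_class_dat, package = "modeldata")

set.seed(2024)

# Initial split into training and testing data, then cross-validation folds
toy_split <- initial_split(two_class_dat, prop = 0.8, strata = Class)
toy_train <- training(toy_split)
toy_folds <- vfold_cv(toy_train, v = 5, strata = Class)

# Recipe: pre-processing steps, defined on the training data
toy_recipe <- recipe(Class ~ ., data = toy_train) |>
  step_normalize(all_numeric_predictors())

# Workflow: bundle the recipe with a model specification
toy_wf <- workflow() |>
  add_recipe(toy_recipe) |>
  add_model(logistic_reg())

# Fitting the workflow applies the recipe and then fits the model
toy_fit <- fit(toy_wf, data = toy_train)
```

Keeping the recipe inside the workflow means the same pre-processing is re-estimated within each resample, rather than once on the full data.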
See examples/example_01.R for full code.
Open exercises/exercise_01.R for prompts.
Load the resume data from {openintro}. Do you need to do any pre-processing?
Perform the initial split (choose your own proportion!).
Create some cross-validation folds.
Build a recipe and workflow. The outcome is received_callback.
Time: 5 minutes.
See exercise_solutions/exercise_solutions_01.R on GitHub for full code.
Let’s go back a little bit first…
Standard regression: minimise distance between predicted and observed values
Least Absolute Shrinkage and Selection Operator (LASSO): minimise (distance between predicted and observed values + \(\lambda\) \(\times\) sum of the absolute values of the coefficients)
See also: ridge regression
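Written out for a linear model, the LASSO criterion might look like the following. This is a sketch under assumed notation (\(y_i\) for the observed outcome, \(x_{ij}\) for predictor \(j\) of observation \(i\), \(\beta_j\) for its coefficient); software implementations often rescale the first term, and for classification the squared-error term is replaced by a negative log-likelihood.

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\]

The penalty uses the absolute values of the coefficients, which is what allows some of them to be shrunk to exactly zero. Ridge regression has the same form with \(\beta_j^2\) in place of \(|\beta_j|\).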
\(\lambda\) (penalty) takes a value between 0 and \(\infty\).
Higher value: more coefficients are pushed towards zero
Lower value: closer to standard regression models. (\(\lambda = 0\) ~ standard regression model)
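Because \(\lambda\) is not learned when the model is fitted, it is usually chosen by trying several values on resampled data. The sketch below shows one way this could look in {tidymodels}. It assumes the {glmnet} package is installed, uses `two_class_dat` from {modeldata} as stand-in data, and the object names are hypothetical.

```r
library(tidymodels)

# Stand-in data (an assumption, not the workshop data)
data(two_class_dat, package = "modeldata")

set.seed(2024)
toy_folds <- vfold_cv(two_class_dat, v = 5, strata = Class)

# mixture = 1 requests the LASSO penalty; penalty = tune() marks lambda
# as a hyperparameter to be chosen by tuning
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

lasso_wf <- workflow() |>
  add_recipe(
    recipe(Class ~ ., data = two_class_dat) |>
      step_normalize(all_numeric_predictors())
  ) |>
  add_model(lasso_spec)

# Ten candidate penalty values, evenly spaced on the log10 scale
penalty_grid <- grid_regular(penalty(), levels = 10)

# Fit and assess every candidate on every fold
lasso_tuned <- tune_grid(
  lasso_wf,
  resamples = toy_folds,
  grid = penalty_grid,
  metrics = metric_set(accuracy, roc_auc)
)

# Pick the candidate with the best ROC AUC and plug it into the workflow
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
final_lasso_wf <- finalize_workflow(lasso_wf, best_penalty)
```

Setting `mixture = 0` instead would give ridge regression with the same tuning code.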
(Binary) Classification Metrics
Accuracy: proportion of the data that are predicted correctly.
ROC AUC: area under the ROC (receiver operating characteristic) curve.
Kappa: similar to accuracy but normalised by the accuracy expected by chance alone.
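To make these three metrics concrete, here is a small sketch using {yardstick} on ten made-up predictions. The data frame `toy_preds` and its values are invented purely for illustration.

```r
library(yardstick)

# Made-up example: 4 true "yes" cases and 6 true "no" cases,
# with a predicted probability of "yes" for each one
toy_preds <- data.frame(
  truth = factor(rep(c("yes", "no"), times = c(4, 6)), levels = c("yes", "no")),
  .pred_yes = c(0.9, 0.8, 0.6, 0.4, 0.1, 0.2, 0.3, 0.35, 0.45, 0.7)
)

# Turn probabilities into predicted classes with a 0.5 threshold
toy_preds$.pred_class <- factor(
  ifelse(toy_preds$.pred_yes > 0.5, "yes", "no"),
  levels = c("yes", "no")
)

# Accuracy: 8 of the 10 predicted classes match the truth
accuracy(toy_preds, truth = truth, estimate = .pred_class)

# Kappa: observed accuracy (0.8) adjusted for the agreement expected
# by chance (0.4 * 0.4 + 0.6 * 0.6 = 0.52)
kap(toy_preds, truth = truth, estimate = .pred_class)

# ROC AUC: uses the predicted probabilities rather than the classes
roc_auc(toy_preds, truth, .pred_yes)
```

Accuracy and kappa depend on the chosen threshold, whereas ROC AUC summarises performance across all thresholds.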
Source: Martin Thoma (Wikipedia)
See examples/example_02.R for full code.
Open exercises/exercise_02.R for prompts. You can also use examples/example_02.R as a starting point.
Specify the model using logistic_reg().
Tune the hyperparameter.
Choose the best value and fit the final model.
Evaluate the model performance.
Time: 10 minutes.
See exercise_solutions/exercise_solutions_02.R for full code.
A tree-like model of decisions and their possible consequences.
An ensemble method
Combines many decision trees.
Can be used for classification or regression problems.
For classification tasks, the output of the random forest is the class selected by most trees.
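The majority-vote idea can be illustrated in a few lines of base R. The votes below are invented for a hypothetical forest of five trees; real random forest software handles this step internally.

```r
# Hypothetical class predictions from five individual trees
tree_votes <- c("callback", "no callback", "callback", "callback", "no callback")

# Count the votes for each class
vote_counts <- table(tree_votes)

# The forest predicts the class with the most votes
forest_prediction <- names(vote_counts)[which.max(vote_counts)]
forest_prediction
```

Here three of the five trees vote for "callback", so that is the forest's prediction.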
Source: Tse Ki Chun (Wikimedia)
trees: number of trees in the ensemble.
mtry: number of predictors that will be randomly sampled at each split when creating the tree models.
min_n: minimum number of data points in a node that are required for the node to be split further.
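The sketch below shows how these three hyperparameters might be set or tuned with `rand_forest()`. It assumes the {ranger} package is installed, uses `two_class_dat` from {modeldata} as stand-in data, and the grid values are arbitrary choices for illustration rather than recommendations.

```r
library(tidymodels)

# Stand-in data (an assumption, not the workshop data)
data(two_class_dat, package = "modeldata")

set.seed(2024)
toy_split <- initial_split(two_class_dat, prop = 0.8, strata = Class)
toy_folds <- vfold_cv(training(toy_split), v = 5, strata = Class)

# Fix the number of trees; mark mtry and min_n for tuning
rf_spec <- rand_forest(trees = 200, mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_wf <- workflow() |>
  add_formula(Class ~ .) |>
  add_model(rf_spec)

# Candidate values: the stand-in data has only two predictors,
# so mtry can only be 1 or 2
rf_grid <- expand.grid(mtry = 1:2, min_n = c(5L, 20L))

# Assess each combination using the cross-validation folds
rf_tuned <- tune_grid(rf_wf, resamples = toy_folds, grid = rf_grid)

# Choose the best combination, refit on the training data,
# and evaluate once on the held-out test data
best_rf <- select_best(rf_tuned, metric = "roc_auc")
rf_final <- last_fit(finalize_workflow(rf_wf, best_rf), split = toy_split)
collect_metrics(rf_final)
```

With more predictors, a wider range of `mtry` values becomes available, and the grid can be generated automatically rather than written by hand.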
See examples/example_03.R for full code.
Open exercises/exercise_03.R for prompts. You can also use examples/example_03.R as a starting point.
Specify a random forest model using rand_forest()
Tune the hyperparameters using the cross-validation folds.
Fit the final model and evaluate it.
Time: 10 minutes.
See exercise_solutions/exercise_solutions_03.R for full code.