Lecturer in Health Data Science within the Centre for Health Informatics, Computing, and Statistics.
Academic background in statistics, with experience in data science consultancy and training.
Using R for over 10 years, and author of multiple R packages.
This workshop combines slides, live coding examples, and exercises for you to participate in.
Ask questions throughout!
I hope you end up with more questions than answers after this workshop!
Course website: nrennie.rbind.io/training-intro-to-tidymodels
Have you done machine learning in R before?
No
Yes, with {tidymodels}
Yes, with {caret}
Yes, with something else
A collection of R packages for statistical modelling and machine learning.
Follows the {tidyverse} principles.
install.packages("tidymodels")
There are some core {tidymodels} packages…
… and plenty of extensions!
Learning from data
Mostly used to make predictions or classifications.
Supervised learning: requires labelled input data
Classification
Regression-based models
…
Unsupervised learning: does not require labelled input data
Clustering
Association rules
…
Other types of machine learning include semi-supervised learning and reinforcement learning.
We can’t always learn every parameter from the data — some values (hyperparameters) must be set in advance or tuned.
Recipe
A series of preprocessing steps performed on data before you fit a model.
Workflow
An object that combines your pre-processing, modelling, and post-processing steps, e.g. a recipe with a model.
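As a minimal sketch of these two ideas (the data frame and column names here are placeholders, not from the course data):

```r
library(tidymodels)

# A recipe: preprocessing steps defined before model fitting
# (outcome ~ . and the training data name are assumptions)
exercises_recipe <- recipe(outcome ~ ., data = exercises_train) |>
  step_normalize(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

# A workflow: bundle the recipe with a model specification
exercises_workflow <- workflow() |>
  add_recipe(exercises_recipe) |>
  add_model(logistic_reg() |> set_engine("glm"))
```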
Load the {tidyverse} and {tidymodels} packages
Read in the exercises.csv data
View and explore the data
Perform the initial split (choose your own proportion!)
Create some cross-validation folds
Build a recipe and workflow
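One possible sketch of these steps (the file path, split proportion, and number of folds are all choices you make yourself):

```r
library(tidyverse)
library(tidymodels)

# Read in the data (path is an assumption)
exercises <- read_csv("exercises.csv")

# Initial split: 75% training, 25% testing (choose your own proportion!)
set.seed(123)
exercises_split <- initial_split(exercises, prop = 0.75)
exercises_train <- training(exercises_split)
exercises_test <- testing(exercises_split)

# Cross-validation folds from the training data
exercises_folds <- vfold_cv(exercises_train, v = 5)
```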
Let’s go back a little bit first…
How do you choose which explanatory variables to include?
Using background knowledge
p-values and correlations between variables
Stepwise procedures (forward/backward/bi-directional)
Something else
Standard regression: minimise distance between predicted and observed values
Least Absolute Shrinkage and Selection Operator (LASSO): minimise (distance between predicted and observed values + \(\lambda\) \(\times\) sum of the absolute values of the coefficients)
See also: ridge regression
\(\lambda\) (penalty) takes a value between 0 and \(\infty\).
Higher value: more coefficients are pushed towards zero
Lower value: closer to standard regression models. (\(\lambda = 0\) ~ standard regression model)
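In {tidymodels}, a LASSO logistic regression can be specified via {parsnip} with `mixture = 1` (the pure L1 penalty) and the "glmnet" engine — a sketch:

```r
library(tidymodels)

# mixture = 1 gives the LASSO (L1) penalty;
# penalty = tune() marks lambda to be chosen by cross-validation
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")
```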
(Binary) Classification Metrics
Accuracy: proportion of the data that are predicted correctly.
ROC AUC: area under the ROC (receiver operating characteristic) curve.
Kappa: similar to accuracy but normalised by the accuracy expected by chance alone.
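These metrics live in {yardstick} and can be bundled into a single metric set — a sketch:

```r
library(tidymodels)

# Bundle the three metrics; accuracy and kap use hard class
# predictions, roc_auc uses predicted class probabilities
class_metrics <- metric_set(accuracy, roc_auc, kap)
```

A metric set like this can then be passed to tuning and evaluation functions via their `metrics` argument.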
Source: Martin Thoma (Wikipedia)
Specify the model using logistic_reg().
Tune the hyperparameter.
Choose the best value and fit the final model.
Evaluate the model performance.
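These steps might be sketched as follows, assuming a workflow `lasso_workflow` built from the LASSO specification and a recipe, plus the folds and split created earlier (all object names are placeholders):

```r
# Tune the penalty over a regular grid, using the CV folds
lasso_results <- tune_grid(
  lasso_workflow,
  resamples = exercises_folds,
  grid = grid_regular(penalty(), levels = 30)
)

# Choose the best penalty by ROC AUC and finalise the workflow
best_penalty <- select_best(lasso_results, metric = "roc_auc")
final_lasso <- finalize_workflow(lasso_workflow, best_penalty)

# Fit on the training data, evaluate on the test set
lasso_fit <- last_fit(final_lasso, exercises_split)
collect_metrics(lasso_fit)
```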
What was the ROC AUC of your LASSO model?
Less than 70%
70-80%
80-90%
90-100%
A tree-like model of decisions and their possible consequences.
An ensemble method
Combines many decision trees.
Can be used for classification or regression problems.
For classification tasks, the output of the random forest is the class selected by most trees.
Source: Tse Ki Chun (Wikimedia)
trees: number of trees in the ensemble.
mtry: number of predictors that will be randomly sampled at each split when creating the tree models.
min_n: minimum number of data points in a node that are required for the node to be split further.
Specify a random forest model using rand_forest()
Tune the hyperparameters using the cross-validation folds.
Fit the final model and evaluate it.
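A sketch of the same pattern for a random forest, assuming the recipe, folds, and split from earlier exercises (object names are placeholders):

```r
library(tidymodels)

# Random forest with tunable hyperparameters ("ranger" engine)
rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

rf_workflow <- workflow() |>
  add_recipe(exercises_recipe) |>
  add_model(rf_spec)

# Tune over the CV folds, then finalise, fit, and evaluate
rf_results <- tune_grid(rf_workflow, resamples = exercises_folds, grid = 10)
rf_final <- finalize_workflow(
  rf_workflow, select_best(rf_results, metric = "roc_auc")
)
rf_fit <- last_fit(rf_final, exercises_split)
collect_metrics(rf_fit)
```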
Support Vector Machines (SVMs) draw a decision boundary that best separates two groups.
There are different types of kernel functions, including:
Linear
svm_linear()
Polynomial
svm_poly()
Radial Basis Functions
svm_rbf()
Cost
Higher value: emphasises fitting the data
Lower value: prioritises avoiding overfitting
Gamma (shape and smoothness of decision boundary)
Higher value: more flexible boundaries
Lower value: simpler boundaries
There may be other hyperparameters, depending on the choice of kernel.
Specify a support vector machine using svm_rbf() (or one of the other svm_* functions if you’re feeling confident!)
Tune the cost() hyperparameter using the cross-validation folds.
Fit the final model and evaluate it.
Look at some other evaluation metrics.
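A sketch of one route through these steps, again assuming the recipe, folds, and split from earlier (object and column names, including the outcome column, are placeholders):

```r
library(tidymodels)

# RBF-kernel SVM; rbf_sigma plays the gamma-like role
# ("kernlab" engine)
svm_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

svm_workflow <- workflow() |>
  add_recipe(exercises_recipe) |>
  add_model(svm_spec)

svm_results <- tune_grid(svm_workflow, resamples = exercises_folds, grid = 10)
svm_final <- finalize_workflow(
  svm_workflow, select_best(svm_results, metric = "roc_auc")
)
svm_fit <- last_fit(svm_final, exercises_split)

# Look beyond accuracy/ROC AUC, e.g. via a confusion matrix
svm_fit |>
  collect_predictions() |>
  conf_mat(truth = outcome, estimate = .pred_class)
```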
Which of the three models performed best on the exercises data?
LASSO
Random Forest
SVM
There are many machine learning topics and {tidymodels} functions we haven’t covered today. As well as learning about the technical details of models and how to write code, it’s important to learn about:
Ethics
Bias and discrimination
Explainability and validation
…
Course website: nrennie.rbind.io/training-intro-to-tidymodels
Tidymodels documentation: www.tidymodels.org
Blog by Julia Silge: juliasilge.com/blog
Tidy Modelling with R: www.tmwr.org
Efficient Machine Learning with R: emlwr.org
An Introduction to Statistical Learning: www.statlearning.com