Making messy data: creating more realistic, synthetic data for teaching and testing
By Nicola Rennie in Seminar
Abstract
Equipping students in statistics and data science with the data wrangling skills needed to handle real-world data is a crucial aspect of their education. Real data, unlike the clean, structured examples often used in teaching, can include a variety of challenges such as typographical errors, missing values encoded in unconventional ways, or unexpected spaces in text. These issues, and others stemming from human error or software incompatibilities, are common in real-world datasets, and students need to learn how to address them in order to develop the practical skills required for professional data analysis. Similarly, when developing methodology and the software that implements it, realistic test data is necessary to ensure robustness.
In this session, I’ll present the ‘messy’ R package, which adds controlled levels of messiness to existing, clean datasets. The package lets you retain the structure and simplicity of familiar example datasets while providing students with a realistic, manageable data cleaning experience. I’ll also demonstrate some ways in which the package can be used, and discuss the future direction of its development.
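To make the idea of "controlled messiness" concrete, here is a minimal, language-neutral sketch in Python. It is not the 'messy' package and does not reflect its API: the function names (`make_messy`, `_corrupt`), the `messiness` and `seed` parameters, the list of missing-value codes, and the toy rows are all assumptions made for illustration. It only mimics the three kinds of problem named in the abstract (stray spaces, inconsistent case, unconventional missing-value codes) while leaving the shape of the dataset unchanged.

```python
import random

# Hypothetical missing-value encodings, chosen only for this illustration.
MISSING_CODES = ["", "N/A", "-999"]


def _corrupt(value, rng):
    """Apply one randomly chosen kind of messiness to a string value."""
    kind = rng.choice(["whitespace", "case", "missing"])
    if kind == "whitespace":
        # Add a stray space before or after the value.
        return value + " " if rng.random() < 0.5 else " " + value
    if kind == "case":
        # Flip the case of every letter.
        return value.swapcase()
    # Replace the value with an unconventional missing-value code.
    return rng.choice(MISSING_CODES)


def make_messy(rows, messiness=0.1, seed=None):
    """Return a copy of `rows` (a list of dicts) in which each string
    value is corrupted with probability `messiness`.

    Non-string values, column names, and the number of rows are left
    untouched, so the structure of the clean dataset is retained.
    """
    rng = random.Random(seed)
    messy_rows = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if isinstance(value, str) and rng.random() < messiness:
                value = _corrupt(value, rng)
            new_row[key] = value
        messy_rows.append(new_row)
    return messy_rows


# Toy rows, made up for this example.
clean = [
    {"species": "Adelie", "island": "Torgersen", "mass": 3750},
    {"species": "Gentoo", "island": "Biscoe", "mass": 5000},
]
messy = make_messy(clean, messiness=0.5, seed=42)
```

Fixing the seed makes the corrupted copy reproducible, which matters when the same messy dataset has to be handed to a whole class or reused in a test suite. A single `messiness` probability keeps the amount of cleaning work adjustable without changing the dataset itself.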
Registration: rss.org.uk/training-events/events/events-2025/local-groups/making-messy-data-creating-more-realistic,-synthet