Introducing the {messy} package
The {messy} R package takes a clean dataset and randomly adds mess to create data more similar to what you’d find in the real world. This is an easy way for educators to create datasets that give students the opportunity to practice their data wrangling skills without having to change all of their examples.
December 3, 2024
When teaching with R, instructors often use nice, clean datasets, but these aren’t very realistic, and aren’t what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and weird spaces - to name just a few issues. At the same time, it’s quite rare to teach a module solely on data wrangling, and genuinely real datasets aren’t always ideal for teaching: they might not fit the assumptions of the model you’re trying to teach, or they might just be too messy. That’s where the {messy} package comes in!
What does {messy} do?
The {messy} R package takes a clean dataset and randomly adds in these features of real datasets, giving students the opportunity to practice their data cleaning and wrangling skills without educators having to change all of their examples.
Installing {messy}
As of early December 2024, {messy} is officially available on CRAN. You can install {messy} using install.packages("messy").
You can also install the development version from GitHub (the nrennie/messy repository).
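A minimal sketch of both installation routes, assuming you use the {remotes} package for the GitHub install (that choice is an assumption; the repository path nrennie/messy is the one linked at the end of this post):

```r
# Released version from CRAN
install.packages("messy")

# Development version from GitHub.
# Assumption: {remotes} is used as the installer; install it first if needed.
# install.packages("remotes")
remotes::install_github("nrennie/messy")
```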
A few examples
Before we jump into showcasing some examples, I want to thank the people who have contributed to {messy} already:
- Jack Davison: added functions for creating messy date/time data.
- Athanasia Monika Mowinckel: added functions for creating messy column names and even messier character strings.
- Philip Leftwich: added functions for randomly duplicating and reordering rows of data.
Using the messy() function
Let’s start with the first 10 rows of the ToothGrowth data as an example of a small, clean dataset. It has three columns: len, supp, and dose.
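In code, that subset might be created as follows. The object name tooth is just a placeholder chosen for this sketch; ToothGrowth itself ships with base R, so nothing needs to be installed.

```r
# ToothGrowth is a built-in dataset with columns len, supp, and dose
tooth <- ToothGrowth[1:10, ]
tooth
#>     len supp dose
#> 1   4.2   VC  0.5
#> 2  11.5   VC  0.5
#> 3   7.3   VC  0.5
#> 4   5.8   VC  0.5
#> 5   6.4   VC  0.5
#> 6  10.0   VC  0.5
#> 7  11.2   VC  0.5
#> 8  11.2   VC  0.5
#> 9   5.2   VC  0.5
#> 10  7.0   VC  0.5
```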
The easiest way to use the {messy} package is through the messy() function. Simply pass in the data frame that you want to make messier.
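A minimal sketch of that call. Because the mess is added at random, set.seed() is used here so that everyone in a class sees the same result; the seed value and the object names tooth and messy_tooth are arbitrary choices for illustration. No output is shown, since it depends on the seed and the package version.

```r
library(messy)

# Arbitrary seed so the randomly added mess is reproducible
set.seed(1234)

# Start from the small, clean example dataset
tooth <- ToothGrowth[1:10, ]

# Apply the default mix of messiness to the whole data frame
messy_tooth <- messy(tooth)
messy_tooth
```

Comparing messy_tooth with tooth side by side is a quick way to see what students will need to clean up.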
Chaining together multiple functions
You can also vary the amount of messiness, and pick and choose which functions are applied to which columns, by chaining together multiple functions.
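As a sketch of what such a chain might look like, the example below applies different functions to different columns. The argument names (cols, missing, messiness) and the specific values are assumptions based on the package documentation rather than a definitive recipe, so check the help page of each function (for example ?make_missing) before relying on them. The native pipe |> needs R 4.1 or later.

```r
library(messy)

# Arbitrary seed so the randomly added mess is reproducible
set.seed(1234)

messy_tooth <- ToothGrowth[1:10, ] |>
  # Encode some missing values in supp as a single space
  make_missing(cols = "supp", missing = " ") |>
  # Use either NA or 999 as the missing-value code in the numeric columns
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  # Add stray whitespace to supp, with a higher messiness than usual
  add_whitespace(cols = "supp", messiness = 0.5) |>
  # Add special characters to supp at the default messiness
  add_special_chars(cols = "supp")

messy_tooth
```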
Tip: If you’re adding messy_colnames() to a chain (and you specify only some columns in other functions), make sure messy_colnames() comes at the end. Otherwise, the column names you try to select may no longer exist!
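To illustrate the ordering in the tip, here is a short sketch in which the column-specific step runs first and messy_colnames() runs last. It assumes messy_colnames() can be called with its default arguments; the cols argument of make_missing() is likewise taken from the package documentation.

```r
library(messy)

# Arbitrary seed so the randomly added mess is reproducible
set.seed(1234)

messy_tooth <- ToothGrowth[1:10, ] |>
  # Select columns by name while the original names still exist
  make_missing(cols = "supp") |>
  # Mess up the column names last, once no later step refers to them
  messy_colnames()

names(messy_tooth)
```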
See the package documentation for more examples, and descriptions of all available functions.
Some other thoughts about {messy}
Though this is an R package, it’s also useful for teaching programming in other languages. You can create a messy dataset, save it as a CSV (or Excel) file, and then use it for data wrangling practice in any language. It also has uses beyond teaching: for example, testing functions or R packages to make sure that they work as expected on imperfect data, or give useful and appropriate errors when they don’t.
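For the CSV route, a minimal sketch using base R’s write.csv(). The file name and the use of a temporary directory are arbitrary choices for this example; in practice you would write to wherever your teaching materials live.

```r
library(messy)

# Arbitrary seed so every student gets the same messy file
set.seed(1234)

messy_tooth <- messy(ToothGrowth[1:10, ])

# Hypothetical output location: a file in the session's temporary directory
csv_path <- file.path(tempdir(), "messy_toothgrowth.csv")

# row.names = FALSE avoids adding an extra, unnamed index column
write.csv(messy_tooth, csv_path, row.names = FALSE)
```

The resulting file can then be read by whichever language or tool the class uses.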
I’m incredibly grateful for the excellent response to {messy} so far, and especially to those people who have contributed suggestions, issues, and pull requests. If you have another suggestion of how to make messy data that you think would be useful to add, please open a GitHub issue.
The source code for the messy R package can be found on GitHub at github.com/nrennie/messy.
For attribution, please cite this work as:
Introducing the {messy} package.
Nicola Rennie. December 3, 2024.
nrennie.rbind.io/blog/introducing-messy-r-package
BibLaTeX Citation
@online{rennie2024, author = {Nicola Rennie}, title = {Introducing the {messy} package}, date = {2024-12-03}, url = {https://nrennie.rbind.io/blog/introducing-messy-r-package} }
Licence: creativecommons.org/licenses/by/4.0