Introducing the messy package
The messy R package takes a clean dataset and randomly adds mess to create data more similar to what you’d find in the real world. This is an easy way for educators to create datasets that give students the opportunity to practice their data wrangling skills without having to change all of their examples.
When teaching examples using R, instructors often use nice datasets, but these aren’t very realistic, and aren’t what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and weird spaces, to name just a few issues. At the same time, it’s quite rare to teach a module solely on data wrangling, and genuinely real datasets aren’t always ideal for teaching: they might not fit the assumptions of the model you’re trying to teach, or they might just be too messy. That’s where the messy package comes in!
What does messy do?
The messy R package takes a clean dataset and randomly adds in these features of real datasets (typos, oddly encoded missing values, stray whitespace), giving students the opportunity to practice their data cleaning and wrangling skills without educators having to change all of their examples.
Installing messy
As of early December 2024, messy is officially available on CRAN. You can install messy using install.packages("messy").
You can also install the development version from GitHub:
pak::pkg_install("nrennie/messy")
You can then load it in the normal way:
library(messy)
A few examples
Before we jump into showcasing some examples, I want to thank the people who have contributed to messy already:
- Jack Davison: added functions for creating messy date/time data.
- Athanasia Monika Mowinckel: added functions for creating messy column names and even messier character strings.
- Philip Leftwich: added functions for randomly duplicating and reordering rows of data.
Using the messy() function
Let’s start with the first 10 rows of the ToothGrowth data as an example of a small, clean dataset:
ToothGrowth[1:10, ]
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
The easiest way to use the messy package is through the messy() function. Simply pass in the data frame that you want to make messier:
set.seed(1234)
messy(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 <NA> <NA>
3 7.3 VC 0.5
4 5.8 (VC 0.5
5 6.4 VC <NA>
6 10 VC 0.5
7 11.2 <NA> 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7 VC 0.5
Chaining together multiple functions
You can also vary the amount of messiness, and pick and choose which functions are applied to which columns, by chaining together multiple functions:
set.seed(1234)
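ToothGrowth[1:10, ] |>
  # replace some supp values with a single space
  make_missing(cols = "supp", missing = " ") |>
  # replace some len and dose values with NA or 999
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  # add stray whitespace to supp, with messiness set to 0.5
  add_whitespace(cols = "supp", messiness = 0.5) |>
  # add special characters to supp
  add_special_chars(cols = "supp") |>
  # mess up the column names last
  messy_colnames()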
!l_e)n S^UPP d^o)se
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 *VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 0.5
8 11.2 V#C NA
9 5.2 !VC 0.5
10 7.0 VC* 0.5
Tip: If you’re adding messy_colnames() to a chain (and you specify only some columns in other functions), make sure messy_colnames() comes at the end. Otherwise, the column names you try to select may no longer exist!
See the package documentation for more examples, and descriptions of all available functions.
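To give a flavour of the kind of exercise this sets up for students, here’s a minimal base R sketch of cleaning a data frame like the one above. The data frame is typed in by hand to imitate that output (it isn’t generated by messy), the name messy_df is just for illustration, and the cleaning rules assume that only letters matter in the names and in supp, and that 999 always means missing.

```r
# Hand-typed stand-in for messy output (illustrative values only)
messy_df <- data.frame(
  a = c(4.2, 11.5, 7.3, 5.8),
  b = c("VC", " ", "*VC ", "V#C"),
  c = c(0.5, NA, 999, 0.5),
  stringsAsFactors = FALSE
)
names(messy_df) <- c("!l_e)n", "S^UPP", "d^o)se")

# 1. Column names: keep letters only, then convert to lower case
names(messy_df) <- tolower(gsub("[^A-Za-z]", "", names(messy_df)))

# 2. supp: drop whitespace and special characters,
#    then treat empty strings as missing
messy_df$supp <- gsub("[^A-Za-z]", "", messy_df$supp)
messy_df$supp[messy_df$supp == ""] <- NA

# 3. dose: recode the 999 sentinel as NA
messy_df$dose[messy_df$dose %in% 999] <- NA

messy_df
```

A real exercise would probably need more care than this (for example, deciding whether a numeric column has been turned into text), but the same three steps of fixing names, fixing strings, and recoding missing values cover most of what the chained functions above introduce.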
Some other thoughts about messy
Though this is an R package, it’s also useful for teaching programming in other languages. You can create a messy dataset, save it as a CSV (or Excel) file, and then use it for any data wrangling practice, regardless of language. It also has uses beyond teaching: for example, testing functions or R packages to make sure that they work as expected, or give useful and appropriate errors when they don’t.
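As a rough sketch of the save-to-CSV workflow, the snippet below makes the first 10 rows of ToothGrowth messy and writes them to a temporary file. The file name and the object names (messy_teeth, csv_path, read_back) are placeholders I’ve picked for illustration; everything other than messy() is base R.

```r
library(messy)

set.seed(1234)
messy_teeth <- messy(ToothGrowth[1:10, ])

# Write to a temporary location; swap in your own path as needed
csv_path <- file.path(tempdir(), "messy_toothgrowth.csv")
write.csv(messy_teeth, csv_path, row.names = FALSE)

# Read it back to check that the file round-trips
read_back <- read.csv(csv_path)
nrow(read_back)
```

The resulting file can then be handed to students working in any language or tool that reads CSV files.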
I’m incredibly grateful for the excellent response to messy so far, and especially to those people who have contributed suggestions, issues, and pull requests. If you have another suggestion of how to make messy data that you think would be useful to add, please open a GitHub issue.
The source code for the messy R package can be found on GitHub at github.com/nrennie/messy.
Citation
@online{rennie2024,
author = {Rennie, Nicola},
title = {Introducing the `Messy` Package},
date = {2024-12-03},
url = {https://nrennie.rbind.io/blog/messy-r-package/},
langid = {en}
}