
When teaching with R, instructors often use clean datasets, but these aren’t very realistic and aren’t what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and stray whitespace. The {messy} R package takes a clean dataset and randomly adds these problems, giving students the opportunity to practice their data cleaning and wrangling skills without you having to change all of your examples.

Installation

Install from GitHub using:

remotes::install_github("nrennie/messy")

Usage

messy()

set.seed(1234)
messy(ToothGrowth[1:10,])
    len supp dose
1   4.2   VC  0.5
2  11.5 <NA> <NA>
3  7.3    VC  0.5
4   5.8  (VC  0.5
5   6.4   VC <NA>
6    10   VC  0.5
7  11.2 <NA>  0.5
8  11.2   VC  0.5
9  5.2    VC  0.5
10    7   VC 0.5 

Increase how messy the data is:

set.seed(1234)
messy(ToothGrowth[1:10,], messiness = 0.7)
     len  supp dose
1   <NA>  <NA> <NA>
2  11.5   <NA> <NA>
3   <NA>  <NA> <NA>
4   5.8   <NA> <NA>
5   <NA> .v*c  <NA>
6   <NA>  <NA> <NA>
7   <NA>  <NA> <NA>
8   <NA>  <NA> 0.5 
9   <NA>  v@c  <NA>
10  <NA>  <NA> <NA>
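
Before cleaning, it can help to measure how messy a data frame actually is. The following is a minimal base R sketch, not part of {messy}; the data frame messy_df is a hand-built stand-in for the kind of output shown above, so the example runs without the package installed.

```r
# Hypothetical stand-in for messy() output: every column has become
# character, with some NAs, stray whitespace and special characters.
messy_df <- data.frame(
  len  = c("4.2", NA, "7.3 ", "5.8"),
  supp = c("VC", NA, "VC", "(VC"),
  dose = c("0.5", NA, "0.5", "0.5 "),
  stringsAsFactors = FALSE
)

# Column types: numeric columns that picked up whitespace show as character.
col_classes <- sapply(messy_df, class)

# Number of missing values per column.
na_counts <- colSums(is.na(messy_df))

# Number of values per column with leading or trailing whitespace.
ws_counts <- sapply(
  messy_df,
  function(x) sum(grepl("^[[:space:]]|[[:space:]]$", x))
)

print(col_classes)
print(na_counts)
print(ws_counts)
```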

add_whitespace()

Randomly adds whitespace to the ends of some values, which means that numeric columns may be converted to character columns:

set.seed(1234)
add_whitespace(ToothGrowth[1:10,])
     len supp dose
1    4.2   VC  0.5
2   11.5   VC  0.5
3    7.3   VC  0.5
4    5.8   VC 0.5 
5    6.4   VC  0.5
6     10   VC  0.5
7  11.2    VC  0.5
8   11.2   VC  0.5
9    5.2   VC  0.5
10     7   VC 0.5 

Apply to only some columns:

set.seed(1234)
add_whitespace(ToothGrowth[1:10,], cols = "supp")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2  VC   0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
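
One way a student might undo this kind of mess is sketched below in base R. It is an illustration rather than anything provided by {messy}; messy_df is a hand-built stand-in for add_whitespace() output, and the choice of which columns to convert back to numeric is an assumption about the original data.

```r
# Hypothetical stand-in for add_whitespace() output: the stray spaces
# have turned the numeric columns into character columns.
messy_df <- data.frame(
  len  = c("4.2", "11.5", "11.2 ", "7"),
  supp = c("VC", "VC", "VC ", "VC"),
  dose = c("0.5", "0.5 ", "0.5", "0.5 "),
  stringsAsFactors = FALSE
)

# Trim leading and trailing whitespace from every column.
clean_df <- as.data.frame(lapply(messy_df, trimws), stringsAsFactors = FALSE)

# Convert the columns that should be numeric back to numeric.
clean_df$len  <- as.numeric(clean_df$len)
clean_df$dose <- as.numeric(clean_df$dose)

str(clean_df)
```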

change_case()

Randomly switches values in character or factor columns to upper case, to lower case, or leaves them unchanged:

set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5)
    len supp dose
1   4.2   vc  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   vc  0.5
8  11.2   vc  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5

By default, the case of the entire string is changed. Alternatively, you can change the case of each letter individually:

set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5, case_type = "letter")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   vC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   Vc  0.5
8  11.2   Vc  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
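
Either kind of case change can be undone by normalising to a single case. A minimal base R sketch follows; the vector supp is made up for illustration, and the factor levels "OJ" and "VC" are those of the supp column in ToothGrowth.

```r
# Hypothetical values after change_case(), covering both the whole-string
# and the per-letter variants.
supp <- c("vc", "VC", "vC", "Vc", "oj", "OJ")

# Normalise to upper case, then restore the original factor levels.
supp_clean <- factor(toupper(supp), levels = c("OJ", "VC"))

table(supp_clean)
```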

add_special_chars()

Randomly adds special characters to character strings:

set.seed(1234)
add_special_chars(ToothGrowth[1:10,])
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8  (VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2   VC  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
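
A possible clean-up step for this kind of mess is to strip every character that is not a letter or digit. The base R sketch below assumes the valid values contain only letters and digits, which holds for "VC" and "OJ" but would not hold for values that legitimately contain punctuation. The vector supp is made up for illustration.

```r
# Hypothetical values after add_special_chars().
supp <- c("VC", "(VC", "V#C", "!VC", "VC*")

# Remove anything that is not alphanumeric.
# Assumption: legitimate values never contain punctuation.
supp_clean <- gsub("[^[:alnum:]]", "", supp)

print(supp_clean)
```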

make_missing()

Randomly makes some values missing by replacing them with NA:

set.seed(1234)
make_missing(ToothGrowth[1:10,])
    len supp dose
1   4.2   VC  0.5
2  11.5   VC   NA
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7    NA   VC  0.5
8  11.2   VC   NA
9   5.2   VC  0.5
10  7.0   VC  0.5

Add a different missing value representation for some columns:

set.seed(1234)
make_missing(ToothGrowth[1:10,], cols = "supp", missing = "999")
    len supp dose
1   4.2   VC  0.5
2  11.5   VC  0.5
3   7.3   VC  0.5
4   5.8   VC  0.5
5   6.4   VC  0.5
6  10.0   VC  0.5
7  11.2  999  0.5
8  11.2   VC  0.5
9   5.2   VC  0.5
10  7.0   VC  0.5
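
When cleaning, a sentinel such as "999" has to be turned back into a real NA. Two base R approaches are sketched below on made-up data: replacing the sentinel in an existing vector, and declaring it at import time through the na.strings argument of read.csv(). Note that na.strings applies to every column, so it is only safe when the sentinel cannot be a genuine value.

```r
# Hypothetical column where "999" encodes a missing value.
supp <- c("VC", "VC", "999", "VC")

# Option 1: replace the sentinel in an existing vector.
supp[supp == "999"] <- NA

# Option 2: declare the sentinel when reading the data.
# Caution: this treats "999" as missing in every column.
csv_text <- "len,supp\n4.2,VC\n11.2,999\n"
df <- read.csv(text = csv_text, na.strings = c("NA", "999"))

print(supp)
print(df)
```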

messy_colnames()

Creates messy column names:

set.seed(1234)
messy_colnames(ToothGrowth[1:10,])
   )len s(upp  dose
1   4.2     VC  0.5
2  11.5     VC  0.5
3   7.3     VC  0.5
4   5.8     VC  0.5
5   6.4     VC  0.5
6  10.0     VC  0.5
7  11.2     VC  0.5
8  11.2     VC  0.5
9   5.2     VC  0.5
10  7.0     VC  0.5
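
Messy column names like these can be repaired with a regular expression. The base R sketch below assumes the true names consist of letters only, which is the case for len, supp and dose; names that legitimately contain digits or underscores would need a gentler pattern. The messy names are typed in by hand as a stand-in for messy_colnames() output.

```r
df <- data.frame(a = c(4.2, 11.5), b = c("VC", "VC"), c = c(0.5, 0.5))

# Hypothetical messy names: special characters, changed case, trailing space.
names(df) <- c("!l_e)n", "S^UPP", "d^o)se ")

# Drop everything that is not a letter, then lower-case the result.
# Assumption: the original names contain letters only.
names(df) <- tolower(gsub("[^[:alpha:]]", "", names(df)))

print(names(df))
```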

Combining functions

You can pipe together multiple functions to create custom messy transformations:

set.seed(1234)
ToothGrowth[1:10,] |> 
  make_missing(cols = "supp", missing = " ") |> 
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |> 
  add_whitespace(cols = "supp", messiness = 0.5) |> 
  add_special_chars(cols = "supp")
    len supp dose
1   4.2   VC  0.5
2  11.5  VC    NA
3   7.3   VC  0.5
4   5.8 *VC   0.5
5   6.4  VC   0.5
6  10.0   VC  0.5
7  11.2       0.5
8  11.2  V#C   NA
9   5.2  !VC  0.5
10  7.0 VC*   0.5

If you’re adding messy_colnames() to a chain (and you specify only some columns in other functions), make sure messy_colnames() comes at the end:

set.seed(1234)
ToothGrowth[1:10,] |> 
  make_missing(cols = "supp", missing = " ") |> 
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |> 
  add_whitespace(cols = "supp", messiness = 0.5) |> 
  add_special_chars(cols = "supp") |> 
  messy_colnames()
   !l_e)n  S^UPP d^o)se 
1      4.2    VC     0.5
2     11.5   VC       NA
3      7.3    VC     0.5
4      5.8  *VC      0.5
5      6.4   VC      0.5
6     10.0    VC     0.5
7     11.2           0.5
8     11.2   V#C      NA
9      5.2   !VC     0.5
10     7.0  VC*      0.5

Otherwise, the column names you try to select may no longer exist!
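
The reason for this ordering can be seen without {messy} at all. In the base R sketch below, the column names are overwritten by hand as a stand-in for messy_colnames() output, after which a lookup by the original name fails.

```r
df <- ToothGrowth[1:10, ]

# Hypothetical stand-in for what messy_colnames() might produce.
names(df) <- c("!l_e)n", "S^UPP", "d^o)se ")

# The original name is gone, so any later step using cols = "supp"
# would have nothing to select.
has_supp <- "supp" %in% names(df)

# Selecting the column by its old name raises an error.
result <- tryCatch(df[, "supp"], error = function(e) e)

print(has_supp)
print(inherits(result, "error"))
```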