Skip to contents

This function takes an arbitrary number of 'joining' columns and any number of additional column names and splits a dataframe in two such that a user could then re-join using merge() or dplyr::left_join(). The user may find it appropriate to go on and apply messy() to each new dataframe independently to impede rejoining.

Usage

unjoin(data, by, cols, distinct = "none", names = c("left", "right"))

Arguments

data

input dataframe

by

a vector of column names which will be present in both outputs, to rejoin the dataframes

cols

specific columns to be present in the 'right' dataframe. implicitly, all other columns not in 'cols' will be present in the 'left' dataframe.

distinct

Apply dplyr::distinct() to "both" dataframes, the "left" or "right" dataframes, or "none" of the dataframes. This may be useful if one table is a 'lookup' or metadata table that has its values repeated many times in data.

names

The names of the output list. If NULL the list will be unnamed.

Value

A list of two dataframes

Details

Real data is often found across multiple datasets. For example, in environmental monitoring, measurements at a monitoring station may need to be bound with metadata about the station such as geographic coordinates, or even meteorological data from an external source, to produce desired outputs. In clinical research it may be necessary to combine the results of a clinical trial with relevant patient information, such as weight or sex. This function undoes existing joins to present learners with an authentic problem to solve; joining two independent datasets to achieve some goal.

See also

Other data deconstructors: unrbind()

Author

Jack Davison

Examples

dummy <-
  dplyr::tibble(
    patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
    test = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
    result = c("++", "+", "-", "--", "+", "-", "+", "++", "-"),
    sex = c("M", "M", "M", "M", "M", "M", "F", "F", "F"),
    age = c(50, 50, 50, 25, 25, 25, 30, 30, 30)
  )

unjoin(
  dummy,
  by = "patient_id",
  cols = c("sex", "age"),
  distinct = "right",
  names = c("tests", "patient_info")
)
#> $tests
#> # A tibble: 9 × 3
#>   patient_id  test result
#>        <dbl> <dbl> <chr> 
#> 1          1     1 ++    
#> 2          1     2 +     
#> 3          1     3 -     
#> 4          2     1 --    
#> 5          2     2 +     
#> 6          2     3 -     
#> 7          3     1 +     
#> 8          3     2 ++    
#> 9          3     3 -     
#> 
#> $patient_info
#> # A tibble: 3 × 3
#>   patient_id sex     age
#>        <dbl> <chr> <dbl>
#> 1          1 M        50
#> 2          2 M        25
#> 3          3 F        30
#>