
Splits a dataframe into two, such that it could be reassembled with a mutating join
Source:R/unjoin.R
unjoin.Rd
This function takes an arbitrary number of 'joining' columns and any number
of additional column names and splits a dataframe in two such that a user
could then re-join using merge()
or dplyr::left_join()
. The user may find
it appropriate to go on and apply messy()
to each new dataframe
independently to impede rejoining.
Usage
unjoin(data, by, cols, distinct = "none", names = c("left", "right"))
Arguments
- data
input dataframe
- by
a vector of column names which will be present in both outputs, to rejoin the dataframes
- cols
specific columns to be present in the 'right' dataframe. implicitly, all other columns not in 'cols' will be present in the 'left' dataframe.
- distinct
Apply
dplyr::distinct()
to"both"
dataframes, the"left"
or"right"
dataframes, or"none"
of the dataframes. This may be useful if one table is a 'lookup' or metadata table that has its values repeated many times indata
.- names
The names of the output list. If
NULL
the list will be unnamed.
Details
Real data is often found across multiple datasets. For example, in environmental monitoring, measurements at a monitoring station may need to be bound with metadata about the station such as geographic coordinates, or even meteorological data from an external source, to produce desired outputs. In clinical research it may be necessary to combine the results of a clinical trial with relevant patient information, such as weight or sex. This function undoes existing joins to present learners with an authentic problem to solve; joining two independent datasets to achieve some goal.
See also
Other data deconstructors:
unrbind()
Examples
dummy <-
dplyr::tibble(
patient_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
test = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
result = c("++", "+", "-", "--", "+", "-", "+", "++", "-"),
sex = c("M", "M", "M", "M", "M", "M", "F", "F", "F"),
age = c(50, 50, 50, 25, 25, 25, 30, 30, 30)
)
unjoin(
dummy,
by = "patient_id",
cols = c("sex", "age"),
distinct = "right",
names = c("tests", "patient_info")
)
#> $tests
#> # A tibble: 9 × 3
#> patient_id test result
#> <dbl> <dbl> <chr>
#> 1 1 1 ++
#> 2 1 2 +
#> 3 1 3 -
#> 4 2 1 --
#> 5 2 2 +
#> 6 2 3 -
#> 7 3 1 +
#> 8 3 2 ++
#> 9 3 3 -
#>
#> $patient_info
#> # A tibble: 3 × 3
#> patient_id sex age
#> <dbl> <chr> <dbl>
#> 1 1 M 50
#> 2 2 M 25
#> 3 3 F 30
#>