Clean and Possibly Convert Numbers in character Representation
Source: R/utils-chr_to_num.R
chr_to_num.Rd
chr_to_num() cleans strings representing numeric values. It is tailored for
use with the ML in HCT dataset; use with other data may not go as expected.
Usage
chr_to_num(
x,
std = TRUE,
warn = TRUE,
convert = TRUE,
na = na_patterns,
replace = data.frame(pattern = c("MRD BY NGS ?", "^NEGATIVE$",
"^(?:WE[EA]KLY )?POSITIVE$", ".*\\bIN CR\\b.*"), replacement = c("", "0", ">0",
"0")),
per_action = c("drop", "divide", "ignore"),
multiple_decimals = c("use_first", "use_last", "ignore"),
donor_host = c("use_donor", "use_host", "ignore")
)
Arguments
- x
A character vector
- std
Whether to standardize the vector before cleaning and converting
- convert
Whether to actually convert to numeric
- na
Regular expressions to convert to NA
- replace
A data.frame of regular expressions and strings to replace them; regular expressions should be in a column named pattern, and replacements should be in a column named replacement. Each row is passed to stringr::str_replace().
- per_action
How to treat %/percent/per million/etc. labels. "drop" simply removes the labels, "divide" divides the value by the appropriate denominator, and "ignore" does nothing.
- multiple_decimals
How to handle multiple decimals within a number
- donor_host
Which value to use when values for both a donor and a host are given
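The row-by-row behavior of the replace argument can be illustrated with a minimal sketch. It is not the package's implementation: apply_replacements() is a hypothetical helper, and base R's sub() with perl = TRUE stands in for stringr::str_replace() so that the example has no dependencies. The patterns are the defaults shown under Usage; the input strings are made up for illustration.

```r
# Default replacement table, copied from the Usage section
replace <- data.frame(
  pattern = c("MRD BY NGS ?", "^NEGATIVE$",
              "^(?:WE[EA]KLY )?POSITIVE$", ".*\\bIN CR\\b.*"),
  replacement = c("", "0", ">0", "0"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply each row of the table in order.
# sub() replaces only the first match, like stringr::str_replace().
apply_replacements <- function(x, replace) {
  for (i in seq_len(nrow(replace))) {
    x <- sub(replace$pattern[i], replace$replacement[i], x, perl = TRUE)
  }
  x
}

# Illustrative inputs (not taken from the dataset)
x <- c("MRD BY NGS 0.01", "NEGATIVE", "WEAKLY POSITIVE",
       "POSITIVE", "PATIENT IN CR NOW", "12")
apply_replacements(x, replace)
```

Because rows are applied in order, an earlier replacement can change what a later pattern sees, so the order of rows in a custom table may matter.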
Details
The function first converts strings matching na_patterns to missing values.
It then simplifies any numeric representations it finds, including 10^x and
various common typos observed in the data (unneeded decimals, commas, and zeros).
It also converts Excel datetimes of the form 1/x/1900 hh:mm back to
decimal representation and converts fractions less than 1 to
decimals. Lastly, it handles a few idiosyncratic text strings, including
conversion of POSITIVE and NEGATIVE values, as well as WE(E|A)KLY POSITIVE
and IN CR (complete remission). Optionally, it will extract values labelled as
donor or host (patient), which is specific to the chimerism dataset. It also
removes the leading text MRD BY NGS. By default, the function emits a warning
when potential numeric values cannot be converted to numeric.
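Two of the conversions described above can be sketched in base R. This is an illustration under stated assumptions, not the function's actual code: both helper names are hypothetical. The datetime helper assumes the day of January is the integer part of the original number and the time of day is its fractional part, as in Excel's 1900 date system. Values that do not match are returned as NA here for simplicity.

```r
# Hypothetical helper: "1/x/1900 hh:mm" -> x + (hh:mm as a fraction of a day)
excel_datetime_to_decimal <- function(x) {
  pattern <- "^1/(\\d{1,2})/1900 (\\d{1,2}):(\\d{2})$"
  out <- rep(NA_real_, length(x))
  hit <- grepl(pattern, x, perl = TRUE)
  day <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  hour <- as.numeric(sub(pattern, "\\2", x[hit], perl = TRUE))
  minute <- as.numeric(sub(pattern, "\\3", x[hit], perl = TRUE))
  out[hit] <- day + (hour * 60 + minute) / 1440  # 1440 minutes per day
  out
}

# Hypothetical helper: convert "a/b" to a decimal only when a < b
fraction_to_decimal <- function(x) {
  pattern <- "^(\\d+)/(\\d+)$"
  out <- rep(NA_real_, length(x))
  hit <- grepl(pattern, x, perl = TRUE)
  num <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  den <- as.numeric(sub(pattern, "\\2", x[hit], perl = TRUE))
  out[hit] <- ifelse(num < den, num / den, NA_real_)
  out
}

excel_datetime_to_decimal(c("1/3/1900 12:00", "1/2/1900 6:00", "0.5"))
fraction_to_decimal(c("1/4", "3/2", "7"))
```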