chr_to_num() cleans strings representing numeric values. It is tailored for use with the ML in HCT dataset; applying it to other data may give unexpected results.

Usage

chr_to_num(
  x,
  std = TRUE,
  warn = TRUE,
  convert = TRUE,
  na = na_patterns,
  replace = data.frame(pattern = c("MRD BY NGS ?", "^NEGATIVE$",
    "^(?:WE[EA]KLY )?POSITIVE$", ".*\\bIN CR\\b.*"), replacement = c("", "0", ">0",
    "0")),
  per_action = c("drop", "divide", "ignore"),
  multiple_decimals = c("use_first", "use_last", "ignore"),
  donor_host = c("use_donor", "use_host", "ignore")
)

Arguments

x

A character vector

std

Whether to standardize the vector before cleaning and converting

warn

Whether to emit a warning when potential numeric values cannot be converted to numeric

convert

Whether to actually convert to numeric

na

Regular expressions to convert to NA

replace

A data.frame of regular expressions and the strings to replace them with; regular expressions should be in a column named pattern, and replacements in a column named replacement. Each row is passed to stringr::str_replace().

per_action

How to treat %/percent/per million/etc labels. drop simply removes the labels, divide divides the value by the appropriate denominator, and ignore does nothing.

multiple_decimals

How to handle multiple decimals within a number

donor_host

Which value to use when values for both a donor and a host are given
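
As a rough illustration of the replace argument, the sketch below applies the default pattern/replacement table from Usage to a few made-up strings. It is not the package's implementation: it assumes the rows are applied in order, and it uses base R's sub(perl = TRUE) as a stand-in for stringr::str_replace() so that it runs without extra packages. The input values and the name replacements are hypothetical.

```r
# Default pattern/replacement table, copied from the Usage section.
replacements <- data.frame(
  pattern = c("MRD BY NGS ?", "^NEGATIVE$",
              "^(?:WE[EA]KLY )?POSITIVE$", ".*\\bIN CR\\b.*"),
  replacement = c("", "0", ">0", "0"),
  stringsAsFactors = FALSE
)

# Made-up example inputs (assumption: not taken from the real dataset).
x <- c("MRD BY NGS 0.01", "NEGATIVE", "WEAKLY POSITIVE", "PATIENT IN CR", "5")

# Apply each row in turn; sub() replaces only the first match in each
# string, which mirrors what stringr::str_replace() does.
cleaned <- x
for (i in seq_len(nrow(replacements))) {
  cleaned <- sub(replacements$pattern[i], replacements$replacement[i],
                 cleaned, perl = TRUE)
}

cleaned
# "0.01" "0" ">0" "0" "5"
```

A custom table with the same two column names can be built the same way and passed as replace.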

Value

A numeric or character vector, depending on the value of convert

Details

The function first converts strings matching the na patterns (na_patterns by default) to missing values. It then simplifies any numeric representations it finds, including 10^x notation and common typos observed in the data (unneeded decimals, commas, and zeros). It also converts Excel datetimes of the form 1/x/1900 hh:mm back to their decimal representation and converts fractions less than 1 to decimals. Next, it handles a few idiosyncratic text strings: it removes the leading text MRD BY NGS and, by default, converts NEGATIVE and IN CR (complete remission) to 0 and POSITIVE or WE(E|A)KLY POSITIVE to >0. Optionally, it extracts values labelled as donor or host (patient), which is specific to the chimerism dataset. By default, the function emits a warning when potential numeric values cannot be converted to numeric.
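
The Excel datetime step can be sketched as follows. This is an assumption about the arithmetic rather than the package's code: it treats the day x in 1/x/1900 as the integer part of the original number and the hh:mm time as the fraction of a 24-hour day. The helper excel_to_decimal() is hypothetical and returns NA for strings that do not match the pattern.

```r
# Hypothetical helper, not part of the package: recover a decimal value
# from an Excel-style "1/x/1900 hh:mm" string.
excel_to_decimal <- function(x) {
  pattern <- "^1/(\\d+)/1900 (\\d{1,2}):(\\d{2})$"
  out <- rep(NA_real_, length(x))
  hit <- grepl(pattern, x, perl = TRUE)

  # Pull out the day, hour, and minute fields of the matching strings.
  day    <- as.numeric(sub(pattern, "\\1", x[hit], perl = TRUE))
  hour   <- as.numeric(sub(pattern, "\\2", x[hit], perl = TRUE))
  minute <- as.numeric(sub(pattern, "\\3", x[hit], perl = TRUE))

  # Day is the integer part; the time of day is the fractional part.
  out[hit] <- day + (hour + minute / 60) / 24
  out
}

excel_to_decimal(c("1/2/1900 12:00", "1/0/1900 6:00", "0.75"))
# 2.50 0.25 NA
```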