prep_linelist() converts a linelist to an incidence curve, replaces anomalous counts with their expected values, and decomposes the result into trend, seasonality, and remainder components. It also filters out reporting errors and truncates the data at the last fully observed incidence date (as defined by pct_reported; see that argument for details). All calculations are performed on the log scale, but the result is returned on the input scale (assumed linear).

prep_linelist(
  .data,
  .collection_date = "collection_date",
  .report_date = "report_date",
  start_date = "2020-03-12",
  trend = "30 days",
  period = "7 days",
  delay_period = "14 days",
  pct_reported = 0.9,
  cutoff = 0.05,
  plot_anomalies = FALSE
)
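
As a rough illustration of the first step in the description (linelist to incidence curve, with work done on the log scale and results returned on the input scale), the following base R sketch may help. It is not the function's implementation: the toy dates are made up, and the use of log1p()/expm1() to cope with zero counts is an assumption, since the documentation does not say how zeros are handled.

```r
# Hypothetical toy linelist: one row per case, keyed by collection date
start <- as.Date("2020-03-12")
collection_date <- start + c(0, 0, 0, 1, 1, 3, 3, 3, 3)

# Build a gap-free daily sequence so days with no cases appear as zeros
all_days <- seq(min(collection_date), max(collection_date), by = "day")

# Count cases per day, including the empty days
counts <- as.integer(table(factor(
  as.character(collection_date),
  levels = as.character(all_days)
)))

incidence <- data.frame(collection_date = all_days, observed = counts)

# Assumption: log1p() keeps zero counts finite on the log scale
log_observed <- log1p(incidence$observed)

# Back-transform to the input (linear) scale
back_transformed <- expm1(log_observed)
```

Here incidence has one row per calendar day, and back_transformed recovers the original counts up to floating-point error.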

Arguments

.data

A data frame containing one incident observation per row

.collection_date

<tidy-select> A Date column to use as the collection date of the observed case

.report_date

<tidy-select> A Date column to use as the report date of the observed case

start_date

The start date of the epidemic; defaults to "2020-03-12", which is the beginning of the contiguous part of Shelby County's observed cases (at least one case observed per day since that date).

trend

The length of time to use in trend decomposition; can be a time-based definition (e.g. "1 month") or an integer number of days. If NULL or "auto", trend is set automatically using the tunable heuristics in the timetk package.

period

The length of time to use in seasonal decomposition; can be a time-based definition (e.g. "1 week") or an integer number of days. If NULL or "auto", period is set automatically using the tunable heuristics in the timetk package.

delay_period

The length of time to use in calculating reporting delay; can be a time-based definition (e.g. "2 weeks") or an integer number of days. If NULL, delay_period is set to "14 days".

pct_reported

The proportion of total cases (between 0 and 1) that must be reported before a collection date is considered fully observed. Setting this to 1 is not recommended, as reporting delays typically contain very large outliers that will skew the results. The default is 0.9, which strikes a balance between sensitivity and robustness in Shelby County data.
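
One way to picture the effect of pct_reported is as a quantile of the reporting delay. The base R sketch below is an interpretation under stated assumptions, not the function's actual code: it assumes the delay is report date minus collection date, that the cutoff is the pct_reported quantile of those delays, and that the last fully observed date is the most recent report date minus that cutoff. The toy data are made up, and delay_period is ignored here.

```r
# Hypothetical toy linelist with one very late report (10-day delay)
start <- as.Date("2020-06-01")
linelist <- data.frame(
  collection_date = start + c(0, 0, 1, 2, 3, 4, 5, 6, 7, 8),
  report_date     = start + c(10, 2, 2, 4, 6, 6, 6, 8, 10, 9)
)

# Reporting delay in days for each case
delay <- as.numeric(linelist$report_date - linelist$collection_date)

# Assumption: the cutoff is the pct_reported quantile of the delays.
# type = 1 returns an observed delay rather than an interpolated one.
delay_cut <- unname(quantile(delay, probs = 0.9, type = 1))

# With probs = 1 the single outlier would set the cutoff instead
delay_cut_all <- unname(quantile(delay, probs = 1, type = 1))

# Assumption: dates at least delay_cut days before the newest report
# are treated as fully observed
last_complete <- max(linelist$report_date) - delay_cut
truncated <- linelist[linelist$collection_date <= last_complete, ]
```

In this toy data the 0.9 quantile gives a 3-day cutoff, whereas a value of 1 would be driven entirely by the single 10-day delay and would discard almost all of the data.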

cutoff

The cutoff value for anomaly detection; controls both the maximum percentage of data points that may be considered anomalies and the critical value for the Generalized Extreme Studentized Deviate (GESD) test used to detect the anomalies. Can be interpreted as the desired maximum probability that an individual data point is labeled an anomaly.
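
To show how a single cutoff can play both roles, here is a minimal base R sketch of the textbook GESD test (Rosner's form, using the mean and standard deviation). The function gesd_sketch() and its argument names are hypothetical, and the anomaly detection behind prep_linelist() may differ, for example by using robust statistics or by working on decomposition remainders rather than raw values.

```r
# Textbook GESD sketch; hypothetical helper, not the package's code.
# `cutoff` is used both as the significance level and as the maximum
# fraction of points that may be flagged (an assumption based on the
# argument description).
gesd_sketch <- function(x, cutoff = 0.05) {
  n <- length(x)
  r <- max(1, floor(n * cutoff))     # maximum number of anomalies tested
  idx <- seq_len(n)                  # positions still under consideration
  removed <- integer(r)              # candidate positions, most extreme first
  exceeds <- logical(r)              # did the statistic beat its critical value?
  for (i in seq_len(r)) {
    dev <- abs(x[idx] - mean(x[idx]))
    j <- which.max(dev)
    stat <- dev[j] / sd(x[idx])      # test statistic R_i
    p <- 1 - cutoff / (2 * (n - i + 1))
    t_crit <- qt(p, df = n - i - 1)
    crit <- (n - i) * t_crit /
      sqrt((n - i - 1 + t_crit^2) * (n - i + 1))  # critical value lambda_i
    removed[i] <- idx[j]
    exceeds[i] <- stat > crit
    idx <- idx[-j]
  }
  # The number of anomalies is the largest i whose statistic exceeded lambda_i
  k <- if (any(exceeds)) max(which(exceeds)) else 0L
  removed[seq_len(k)]
}

# Made-up series: a bounded wave with one injected spike at position 41
x <- c(sin(1:40), 25)
flagged <- gesd_sketch(x, cutoff = 0.05)
```

With 41 points and cutoff = 0.05, at most two points are tested, and only the injected spike is expected to be flagged.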

plot_anomalies

Should anomalies be plotted for visual inspection? If TRUE, the plot will be on the log scale.

Value

A tibble with a date column (named the same as the column specified by .collection_date) and observed, season, trend, and remainder columns. All numeric columns have outliers replaced.