
Setup PreprocessorParameters

Usage

setup_Preprocessor(
  complete_cases = FALSE,
  remove_features_thres = NULL,
  remove_cases_thres = NULL,
  missingness = FALSE,
  impute = FALSE,
  impute_type = c("missRanger", "micePMM", "meanMode"),
  impute_missRanger_params = list(pmm.k = 3, maxiter = 10, num.trees = 500),
  impute_discrete = "get_mode",
  impute_continuous = "mean",
  integer2factor = FALSE,
  integer2numeric = FALSE,
  logical2factor = FALSE,
  logical2numeric = FALSE,
  numeric2factor = FALSE,
  numeric2factor_levels = NULL,
  numeric_cut_n = 0,
  numeric_cut_labels = FALSE,
  numeric_quant_n = 0,
  numeric_quant_NAonly = FALSE,
  unique_len2factor = 0,
  character2factor = FALSE,
  factorNA2missing = FALSE,
  factorNA2missing_level = "missing",
  factor2integer = FALSE,
  factor2integer_startat0 = TRUE,
  scale = FALSE,
  center = scale,
  scale_centers = NULL,
  scale_coefficients = NULL,
  remove_constants = FALSE,
  remove_constants_skip_missing = TRUE,
  remove_features = NULL,
  remove_duplicates = FALSE,
  one_hot = FALSE,
  one_hot_levels = NULL,
  add_date_features = FALSE,
  date_features = c("weekday", "month", "year"),
  add_holidays = FALSE,
  exclude = NULL
)

Arguments

complete_cases

Logical: If TRUE, only retain complete cases (no missing data).

remove_features_thres

Float (0, 1): Remove features with missing values in at least this fraction of cases.

remove_cases_thres

Float (0, 1): Remove cases in which at least this fraction of features is missing.
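The two thresholds above compare a per-feature and a per-case missingness fraction against a cutoff. A minimal base R sketch of those fractions follows. It is not the package's implementation: the toy data and the 0.5 cutoffs are assumptions for illustration, and both fractions are computed on the original data, whereas the order in which the package applies the two steps is not stated here.

```r
# Toy data: column b is missing in 3 of 4 cases; row 3 is missing 2 of 3 features
x <- data.frame(
  a = c(1, 2, NA, 4),
  b = c(NA, NA, NA, 1),
  c = c("u", "v", "w", "x")
)

feature_na_frac <- colMeans(is.na(x)) # fraction of cases missing, per feature
case_na_frac <- rowMeans(is.na(x))    # fraction of features missing, per case

# Hypothetical cutoffs: drop anything missing in >= 50% of cases / features
features_thres <- 0.5
cases_thres <- 0.5
x_reduced <- x[
  case_na_frac < cases_thres,
  feature_na_frac < features_thres,
  drop = FALSE
]
```

Here feature b and case 3 meet the cutoff and are dropped.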

missingness

Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.
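As a rough illustration of what missingness indicator columns look like, the base R sketch below adds one logical column per feature that contains missing values. The `_missing` suffix is an assumption made for this example; the naming the package actually uses may differ.

```r
x <- data.frame(
  a = c(1, NA, 3),
  b = c("u", "v", "w"),
  c = c(NA, NA, 2)
)

# Features that contain at least one missing value
has_na <- names(x)[colSums(is.na(x)) > 0]

# One logical indicator column per such feature (suffix is hypothetical)
for (nm in has_na) {
  x[[paste0(nm, "_missing")]] <- is.na(x[[nm]])
}
```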

impute

Logical: If TRUE, impute missing values. See impute_type, impute_discrete, and impute_continuous.

impute_type

Character: Package or method to use for imputation: "missRanger", "micePMM", or "meanMode".

impute_missRanger_params

Named list with elements "pmm.k", "maxiter", and "num.trees", which are passed to missRanger::missRanger. Setting pmm.k greater than 0 enables predictive mean matching; set pmm.k = 0 to disable it. Defaults: pmm.k = 3, maxiter = 10, num.trees = 500. Reduce num.trees for faster imputation, especially with large datasets.

impute_discrete

Character: Name of function that returns single value: How to impute discrete variables for impute_type = "meanMode".

impute_continuous

Character: Name of function that returns single value: How to impute continuous variables for impute_type = "meanMode".
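The idea behind "meanMode" imputation, where a single summary value fills the gaps in each column, can be sketched in base R as follows. `mode_sketch` is a hypothetical stand-in written for this example; the package's own `get_mode` may behave differently (for instance, in how it breaks ties).

```r
# Hypothetical mode function: most frequent non-missing value, as character
mode_sketch <- function(v) {
  counts <- table(v) # table() ignores NA by default
  names(counts)[which.max(counts)]
}

x <- data.frame(
  age = c(30, NA, 50),
  group = factor(c("a", "a", NA))
)

for (nm in names(x)) {
  miss <- is.na(x[[nm]])
  if (!any(miss)) next
  if (is.numeric(x[[nm]])) {
    # Continuous: fill with the mean of the observed values
    x[[nm]][miss] <- mean(x[[nm]], na.rm = TRUE)
  } else {
    # Discrete: fill with the most frequent observed level
    x[[nm]][miss] <- mode_sketch(x[[nm]])
  }
}
```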

integer2factor

Logical: If TRUE, convert all integers to factors. This includes bit64::integer64 columns.

integer2numeric

Logical: If TRUE, convert all integers to numeric (will only work if integer2factor = FALSE). This includes bit64::integer64 columns.

logical2factor

Logical: If TRUE, convert all logical variables to factors.

logical2numeric

Logical: If TRUE, convert all logical variables to numeric.

numeric2factor

Logical: If TRUE, convert all numeric variables to factors.

numeric2factor_levels

Character vector: Optional. Passed to the levels argument of factor() if numeric2factor = TRUE. For advanced or specific use cases: you need to know the unique values of the numeric vector(s), and all numeric variables must share the same unique values.

numeric_cut_n

Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number.

numeric_cut_labels

Logical: The labels argument of base::cut.

numeric_quant_n

Integer: If > 0, convert all numeric variables to factors by binning using base::cut, with breaks at this number of quantiles produced by stats::quantile.

numeric_quant_NAonly

Logical: If TRUE, only bin numeric variables with missing values.
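The difference between equal-width binning (numeric_cut_n) and quantile binning (numeric_quant_n) is easiest to see on skewed data. The sketch below calls base::cut and stats::quantile directly. It is an illustration under assumed inputs, not the package's code; in particular, `include.lowest = TRUE` is a choice made here so that the minimum value falls into the first bin.

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Equal-width bins: breaks = 4 splits the range of v into 4 intervals.
# labels = FALSE returns integer bin codes instead of interval labels.
equal_width <- cut(v, breaks = 4, labels = FALSE)

# Quantile bins: break points are taken from stats::quantile
n_quant <- 4
breaks_q <- quantile(v, probs = seq(0, 1, length.out = n_quant + 1))
quantile_bins <- cut(v, breaks = breaks_q, include.lowest = TRUE, labels = FALSE)

table(equal_width)
table(quantile_bins)
```

With one large outlier, the equal-width bins put nearly all values in the first interval, while the quantile bins stay roughly balanced.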

unique_len2factor

Integer (>=2): Convert all variables with less than or equal to this number of unique values to factors. For example, if binary variables are encoded with 1, 2, you could use unique_len2factor = 2 to convert them to factors.
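The rule described above can be sketched in base R by counting unique values per column. This is an illustration only; `max_unique` plays the role of unique_len2factor, and how the package treats missing values when counting is not covered here.

```r
x <- data.frame(
  sex = c(1, 2, 2, 1, 2),
  score = c(0.3, 1.7, 2.2, 3.1, 4.8)
)

max_unique <- 2 # stands in for unique_len2factor

# Number of unique values in each column
n_unique <- vapply(x, function(v) length(unique(v)), integer(1))

# Convert columns at or below the cutoff to factors
to_factor <- names(n_unique)[n_unique <= max_unique]
x[to_factor] <- lapply(x[to_factor], factor)
```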

character2factor

Logical: If TRUE, convert all character variables to factors.

factorNA2missing

Logical: If TRUE, assign NA values in factors to the level factorNA2missing_level. In many cases this is the preferred way to handle missing data in categorical variables. Note that since this step is performed before imputation, you can use this option to handle missing data in categorical variables and impute numeric variables in the same preprocess call.

factorNA2missing_level

Character: Name of level if factorNA2missing = TRUE.
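Turning NA into an explicit factor level can be sketched in base R as shown below. This illustrates the idea rather than the package's own code.

```r
f <- factor(c("a", NA, "b", "a"))
na_level <- "missing" # stands in for factorNA2missing_level

# Add the new level first, then assign it to the NA entries
levels(f) <- c(levels(f), na_level)
f[is.na(f)] <- na_level
```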

factor2integer

Logical: If TRUE, convert all factors to integers.

factor2integer_startat0

Logical: If TRUE, start integer coding at 0.
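For reference, base R integer codes for a factor start at 1, so a 0-based coding amounts to subtracting one. A small sketch with an assumed example factor:

```r
f <- factor(
  c("low", "high", "mid", "low"),
  levels = c("low", "mid", "high")
)

codes_from_1 <- as.integer(f)      # base R coding, starts at 1
codes_from_0 <- as.integer(f) - 1L # 0-based coding
```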

scale

Logical: If TRUE, scale columns of x.

center

Logical: If TRUE, center columns of x. Note that by default it is the same as scale.

scale_centers

Named vector: Centering values for each feature.

scale_coefficients

Named vector: Scaling values for each feature.
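One plausible use of explicit centering and scaling values is to apply statistics learned on training data to new data. The base R sketch below does this with base::scale. The train/test split and variable names are assumptions for illustration, not the package's workflow.

```r
train <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
test <- data.frame(a = c(2, 4), b = c(20, 50))

# Named vectors of per-feature statistics from the training data
centers <- colMeans(train)
coefficients <- vapply(train, sd, numeric(1))

# Apply the training statistics to the test data (returns a matrix)
test_scaled <- scale(test, center = centers, scale = coefficients)
```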

remove_constants

Logical: If TRUE, remove constant columns.

remove_constants_skip_missing

Logical: If TRUE, skip missing values before checking whether a feature is constant.
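Whether missing values are skipped changes which columns count as constant. The sketch below uses a hypothetical helper, `is_constant`, written for this example; it is not a function from the package.

```r
x <- data.frame(
  a = c(1, 1, NA, 1),
  b = c(1, 2, 3, 4),
  c = c("k", "k", "k", "k")
)

# Hypothetical helper: TRUE if a vector has at most one unique value
is_constant <- function(v, skip_missing = TRUE) {
  if (skip_missing) v <- v[!is.na(v)]
  length(unique(v)) <= 1
}

const_skip <- vapply(x, is_constant, logical(1), skip_missing = TRUE)
const_keep <- vapply(x, is_constant, logical(1), skip_missing = FALSE)
```

Column a is constant only when its NA is skipped; otherwise NA counts as a second distinct value.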

remove_features

Character vector: Features to remove.

remove_duplicates

Logical: If TRUE, remove duplicate cases.

one_hot

Logical: If TRUE, convert all factors using one-hot encoding.

one_hot_levels

List: Named list of the form "feature_name" = "levels". Used when applying one-hot encoding to validation or test data using Preprocessor.
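Fixing the levels in advance matters because test data may not contain every level seen in training. The base R sketch below shows the effect. The feature name, the levels, and the `color_` column prefix are assumptions for illustration, and the encoding code is not the package's own.

```r
# Levels recorded from training data, in "feature_name" = levels form
train_levels <- list(color = c("red", "green", "blue"))

# Test data in which the level "blue" never occurs
test <- data.frame(color = c("green", "red"))

# Re-create the factor with the training levels, then build indicator columns
f <- factor(test$color, levels = train_levels$color)
one_hot <- vapply(
  levels(f),
  function(l) as.integer(f == l),
  integer(length(f))
)
colnames(one_hot) <- paste0("color_", levels(f))
```

The result keeps an all-zero `color_blue` column, so the test matrix has the same columns as the training matrix.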

add_date_features

Logical: If TRUE, extract date features from date columns.

date_features

Character vector: Features to extract from dates.
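The default date features (weekday, month, year) can be derived from a Date vector in base R as sketched below. This is an illustration only: the output column names and types are assumptions, and weekday and month names depend on the session locale.

```r
d <- as.Date(c("2024-01-15", "2023-07-04"))

date_feats <- data.frame(
  weekday = weekdays(d), # locale-dependent names
  month = months(d),     # locale-dependent names
  year = as.integer(format(d, "%Y"))
)
```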

add_holidays

Logical: If TRUE, extract holidays from date columns.

exclude

Integer vector: Exclude these columns from preprocessing.

Value

PreprocessorParameters object.

Author

EDG