Setup PreprocessorParameters — setup

Setup PreprocessorParameters

Usage

setup_Preprocessor(
  complete_cases = FALSE,
  remove_features_thres = NULL,
  remove_cases_thres = NULL,
  missingness = FALSE,
  impute = FALSE,
  impute_type = c("missRanger", "micePMM", "meanMode"),
  impute_missRanger_params = list(pmm.k = 3, maxiter = 10, num.trees = 500),
  impute_discrete = "get_mode",
  impute_continuous = "mean",
  integer2factor = FALSE,
  integer2numeric = FALSE,
  logical2factor = FALSE,
  logical2numeric = FALSE,
  numeric2factor = FALSE,
  numeric2factor_levels = NULL,
  numeric_cut_n = 0,
  numeric_cut_labels = FALSE,
  numeric_quant_n = 0,
  numeric_quant_NAonly = FALSE,
  unique_len2factor = 0,
  character2factor = FALSE,
  factorNA2missing = FALSE,
  factorNA2missing_level = "missing",
  factor2integer = FALSE,
  factor2integer_startat0 = TRUE,
  scale = FALSE,
  center = scale,
  scale_centers = NULL,
  scale_coefficients = NULL,
  remove_constants = FALSE,
  remove_constants_skip_missing = TRUE,
  remove_features = NULL,
  remove_duplicates = FALSE,
  one_hot = FALSE,
  one_hot_levels = NULL,
  add_date_features = FALSE,
  date_features = c("weekday", "month", "year"),
  add_holidays = FALSE,
  exclude = NULL
)
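
A minimal sketch of a call, using only argument names from the signature above. It assumes the rtemis package is installed and exports setup_Preprocessor() as shown. The values chosen here are illustrative, not recommendations.

```r
# Assumption: rtemis is installed and setup_Preprocessor() has the
# signature shown under Usage. The values below are illustrative only.
preproc_args <- list(
  remove_features_thres = 0.5, # drop features missing in >= 50% of cases
  impute = TRUE,               # impute the remaining missing values
  impute_type = "meanMode",    # uses impute_discrete / impute_continuous
  character2factor = TRUE,     # convert character columns to factors
  scale = TRUE                 # center defaults to the value of scale
)

# Only call the constructor if the package is available.
if (requireNamespace("rtemis", quietly = TRUE)) {
  params <- do.call(rtemis::setup_Preprocessor, preproc_args)
  print(params)
}
```

Arguments that are not listed keep the defaults shown above.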

Arguments

complete_cases: Logical: If TRUE, only retain complete cases (no missing data).
remove_features_thres: Float (0, 1): Remove features with missing values in at least this fraction of cases.
remove_cases_thres: Float (0, 1): Remove cases with missing values in at least this fraction of features.
missingness: Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.
impute: Logical: If TRUE, impute missing values. See impute_type, impute_discrete, and impute_continuous.
impute_type: Character: Imputation method to use: "missRanger", "micePMM", or "meanMode".
impute_missRanger_params: Named list with elements "pmm.k", "maxiter", and "num.trees", which are passed to missRanger::missRanger. pmm.k greater than 0 results in predictive mean matching. Defaults: pmm.k = 3, maxiter = 10, num.trees = 500. Reduce num.trees for faster imputation, especially in large datasets. Set pmm.k = 0 to disable predictive mean matching.
impute_discrete: Character: Name of function that returns single value: How to impute discrete variables for impute_type = "meanMode".
impute_continuous: Character: Name of function that returns single value: How to impute continuous variables for impute_type = "meanMode".
integer2factor: Logical: If TRUE, convert all integers to factors. This includes bit64::integer64 columns.
integer2numeric: Logical: If TRUE, convert all integers to numeric (will only work if integer2factor = FALSE). This includes bit64::integer64 columns.
logical2factor: Logical: If TRUE, convert all logical variables to factors.
logical2numeric: Logical: If TRUE, convert all logical variables to numeric.
numeric2factor: Logical: If TRUE, convert all numeric variables to factors.
numeric2factor_levels: Character vector: Optional. Passed to the levels argument of factor() if numeric2factor = TRUE. For advanced or specific use cases: you need to know the unique values of the numeric vector(s), and all numeric variables must share the same unique values.
numeric_cut_n: Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number.
numeric_cut_labels: Logical: The labels argument of base::cut.
numeric_quant_n: Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number of quantiles, produced using stats::quantile.
numeric_quant_NAonly: Logical: If TRUE, only bin numeric variables with missing values.
unique_len2factor: Integer (>=2): Convert all variables with less than or equal to this number of unique values to factors. For example, if binary variables are encoded with 1, 2, you could use unique_len2factor = 2 to convert them to factors.
character2factor: Logical: If TRUE, convert all character variables to factors.
factorNA2missing: Logical: If TRUE, assign NA values in factors to the level factorNA2missing_level. In many cases this is the preferred way to handle missing data in categorical variables. Because this step is performed before imputation, you can use this option to handle missing data in categorical variables and impute numeric variables in the same preprocess call.
factorNA2missing_level: Character: Name of level if factorNA2missing = TRUE.
factor2integer: Logical: If TRUE, convert all factors to integers.
factor2integer_startat0: Logical: If TRUE, start integer coding at 0.
scale: Logical: If TRUE, scale columns of x.
center: Logical: If TRUE, center columns of x. Note that by default it is the same as scale.
scale_centers: Named vector: Centering values for each feature.
scale_coefficients: Named vector: Scaling values for each feature.
remove_constants: Logical: If TRUE, remove constant columns.
remove_constants_skip_missing: Logical: If TRUE, skip missing values before checking whether a feature is constant.
remove_features: Character vector: Features to remove.
remove_duplicates: Logical: If TRUE, remove duplicate cases.
one_hot: Logical: If TRUE, convert all factors using one-hot encoding.
one_hot_levels: List: Named list of the form "feature_name" = "levels". Used when applying one-hot encoding to validation or test data using Preprocessor.
add_date_features: Logical: If TRUE, extract date features from date columns.
date_features: Character vector: Features to extract from dates.
add_holidays: Logical: If TRUE, extract holidays from date columns.
exclude: Integer vector: Exclude these columns from preprocessing.
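
Several of the arguments above refer to base R operations (base::cut, stats::quantile, one-hot encoding, scaling with stored values). The sketch below illustrates those operations with base R alone. It is not the package's implementation. The variable names and the choice of quantile probabilities are assumptions made for illustration.

```r
# Base R sketch only; this is not the rtemis implementation.

# --- Binning (cf. numeric_cut_n, numeric_cut_labels, numeric_quant_n) ---
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# Equal-width binning into 4 intervals; labels = FALSE returns integer codes.
bins_equal_width <- cut(x, breaks = 4, labels = FALSE)

# Quantile-based binning: breaks at the 0%, 25%, 50%, 75%, 100% quantiles.
# (Which probabilities the package uses is an assumption here.)
q_breaks <- quantile(x, probs = seq(0, 1, length.out = 5))
bins_quantile <- cut(x, breaks = q_breaks, include.lowest = TRUE, labels = FALSE)

# --- One-hot encoding (cf. one_hot) ---
color <- factor(c("red", "green", "blue", "red"))
# Dropping the intercept (- 1) gives one indicator column per factor level.
one_hot <- model.matrix(~ color - 1)

# --- Scaling new data with stored values (cf. scale_centers, scale_coefficients) ---
train <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
centers <- colMeans(train)        # named vector of per-feature means
scale_coefs <- sapply(train, sd)  # named vector of per-feature standard deviations

# Apply the training-set centers and coefficients to new data.
new_data <- data.frame(a = c(2, 4), b = c(20, 40))
new_scaled <- scale(new_data, center = centers, scale = scale_coefs)
```

Reusing centers and coefficients computed on training data, as in the last step, keeps validation and test data on the same scale as the training data, which is the use case scale_centers and scale_coefficients appear to serve.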

Value

PreprocessorParameters object.

Author

EDG