Setup PreprocessorParameters
Usage
setup_Preprocessor(
complete_cases = FALSE,
remove_features_thres = NULL,
remove_cases_thres = NULL,
missingness = FALSE,
impute = FALSE,
impute_type = c("missRanger", "micePMM", "meanMode"),
impute_missRanger_params = list(pmm.k = 3, maxiter = 10, num.trees = 500),
impute_discrete = "get_mode",
impute_continuous = "mean",
integer2factor = FALSE,
integer2numeric = FALSE,
logical2factor = FALSE,
logical2numeric = FALSE,
numeric2factor = FALSE,
numeric2factor_levels = NULL,
numeric_cut_n = 0,
numeric_cut_labels = FALSE,
numeric_quant_n = 0,
numeric_quant_NAonly = FALSE,
unique_len2factor = 0,
character2factor = FALSE,
factorNA2missing = FALSE,
factorNA2missing_level = "missing",
factor2integer = FALSE,
factor2integer_startat0 = TRUE,
scale = FALSE,
center = scale,
scale_centers = NULL,
scale_coefficients = NULL,
remove_constants = FALSE,
remove_constants_skip_missing = TRUE,
remove_features = NULL,
remove_duplicates = FALSE,
one_hot = FALSE,
one_hot_levels = NULL,
add_date_features = FALSE,
date_features = c("weekday", "month", "year"),
add_holidays = FALSE,
exclude = NULL
)
Arguments
- complete_cases
Logical: If TRUE, only retain complete cases (no missing data).
- remove_features_thres
Float (0, 1): Remove features with missing values in at least this fraction of cases.
- remove_cases_thres
Float (0, 1): Remove cases in which at least this fraction of features is missing.
- missingness
Logical: If TRUE, generate new boolean columns for each feature with missing values, indicating which cases were missing data.
- impute
Logical: If TRUE, impute missing cases. See impute_discrete and impute_continuous.
- impute_type
Character: Imputation method to use: "missRanger", "micePMM", or "meanMode".
- impute_missRanger_params
Named list with elements "pmm.k", "maxiter", and "num.trees", which are passed to missRanger::missRanger. pmm.k greater than 0 results in predictive mean matching. Defaults: pmm.k = 3, maxiter = 10, num.trees = 500. Reduce num.trees for faster imputation, especially in large datasets. Set pmm.k = 0 to disable predictive mean matching.
- impute_discrete
Character: Name of a function that returns a single value, used to impute discrete variables when impute_type = "meanMode".
- impute_continuous
Character: Name of a function that returns a single value, used to impute continuous variables when impute_type = "meanMode".
- integer2factor
Logical: If TRUE, convert all integers to factors. This includes bit64::integer64 columns.
- integer2numeric
Logical: If TRUE, convert all integers to numeric (will only work if integer2factor = FALSE). This includes bit64::integer64 columns.
- logical2factor
Logical: If TRUE, convert all logical variables to factors.
- logical2numeric
Logical: If TRUE, convert all logical variables to numeric.
- numeric2factor
Logical: If TRUE, convert all numeric variables to factors.
- numeric2factor_levels
Character vector: Optional. Passed to the levels argument of factor() if numeric2factor = TRUE. For advanced or specific use cases: you need to know the unique values of the numeric vector(s), and all numeric variables must share the same unique values.
- numeric_cut_n
Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number.
- numeric_cut_labels
Logical: The labels argument of base::cut.
- numeric_quant_n
Integer: If > 0, convert all numeric variables to factors by binning using base::cut with breaks equal to this number of quantiles, produced using stats::quantile.
- numeric_quant_NAonly
Logical: If TRUE, only bin numeric variables with missing values.
- unique_len2factor
Integer (>=2): Convert all variables with less than or equal to this number of unique values to factors. For example, if binary variables are encoded with 1, 2, you could use unique_len2factor = 2 to convert them to factors.
- character2factor
Logical: If TRUE, convert all character variables to factors.
- factorNA2missing
Logical: If TRUE, assign NA values in factors to the level factorNA2missing_level. In many cases this is the preferred way to handle missing data in categorical variables. Note that since this step is performed before imputation, you can use this option to handle missing data in categorical variables and impute numeric variables in the same preprocess call.
- factorNA2missing_level
Character: Name of level if factorNA2missing = TRUE.
- factor2integer
Logical: If TRUE, convert all factors to integers.
- factor2integer_startat0
Logical: If TRUE, start integer coding at 0.
- scale
Logical: If TRUE, scale columns of x.
- center
Logical: If TRUE, center columns of x. Note that by default it is the same as scale.
- scale_centers
Named vector: Centering values for each feature.
- scale_coefficients
Named vector: Scaling values for each feature.
- remove_constants
Logical: If TRUE, remove constant columns.
- remove_constants_skip_missing
Logical: If TRUE, skip missing values before checking whether a feature is constant.
- remove_features
Character vector: Features to remove.
- remove_duplicates
Logical: If TRUE, remove duplicate cases.
- one_hot
Logical: If TRUE, convert all factors using one-hot encoding.
- one_hot_levels
List: Named list of the form "feature_name" = "levels". Used when applying one-hot encoding to validation or test data using Preprocessor.
- add_date_features
Logical: If TRUE, extract date features from date columns.
- date_features
Character vector: Features to extract from dates.
- add_holidays
Logical: If TRUE, extract holidays from date columns.
- exclude
Integer vector: Exclude these columns from preprocessing.
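
Example

Several of the binning and factor options above are easier to picture with concrete values. The following is a minimal base R sketch of what numeric_cut_n, numeric_quant_n, factorNA2missing, and factor2integer_startat0 are described as doing. It is an illustration under assumptions, not the package's implementation: all variable names are hypothetical, and the exact arguments the package passes to base::cut and stats::quantile (for example include.lowest) may differ.

```r
# Base R only. This mimics the documented behaviour; it does not call the package.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# numeric_cut_n = 4 (assumed equivalent): four equal-width intervals over range(x).
# labels = FALSE returns integer bin codes instead of interval labels.
cut_bins <- cut(x, breaks = 4, labels = FALSE)

# numeric_quant_n = 4 (assumed equivalent): breaks placed at the quartiles of x.
# include.lowest = TRUE keeps the minimum value inside the first bin.
q <- stats::quantile(x, probs = seq(0, 1, length.out = 5))
quant_bins <- cut(x, breaks = q, include.lowest = TRUE, labels = FALSE)

# factorNA2missing = TRUE with factorNA2missing_level = "missing" (assumed equivalent):
# NA values become an explicit factor level.
f <- factor(c("a", NA, "b", "a"))
f_chr <- as.character(f)
f_chr[is.na(f_chr)] <- "missing"
f_missing <- factor(f_chr, levels = c(levels(f), "missing"))

# factor2integer = TRUE with factor2integer_startat0 = TRUE (assumed equivalent):
# factor codes start at 1 in R, so subtract 1 to start at 0.
codes0 <- as.integer(f_missing) - 1L

print(cut_bins)
print(quant_bins)
print(levels(f_missing))
print(codes0)
```

With a skewed vector like x, equal-width binning places almost every value in the first bin, while quantile binning spreads values roughly evenly across bins. That difference is the main reason to choose numeric_quant_n over numeric_cut_n.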