Preprocess, tune, train, and test supervised learning models with a single function using nested resampling
Usage
train(
x,
dat_validation = NULL,
dat_test = NULL,
algorithm = NULL,
preprocessor_config = NULL,
hyperparameters = NULL,
tuner_config = NULL,
outer_resampling_config = NULL,
weights = NULL,
question = NULL,
outdir = NULL,
parallel_type = c("future", "mirai", "none"),
future_plan = getOption("future.plan", "multicore"),
n_workers = max(future::availableCores() - 3L, 1L),
verbosity = 1L
)

Arguments
- x
data.frame or similar: Training set data.
- dat_validation
data.frame or similar: Validation set data.
- dat_test
data.frame or similar: Test set data.
- algorithm
Character: Algorithm to use. Can be left NULL if hyperparameters is defined.
- preprocessor_config
PreprocessorConfig object or NULL: Setup using setup_Preprocessor.
- hyperparameters
Hyperparameters object: Setup using one of the setup_* functions.
- tuner_config
TunerConfig object: Setup using setup_GridSearch.
- outer_resampling_config
ResamplerConfig object or NULL: Setup using setup_Resampler. This defines the outer resampling method, i.e. the splitting into training and test sets for the purpose of assessing model performance. If NULL, no outer resampling is performed, in which case you might want to use a dat_test dataset to assess model performance on a single test set.
- weights
Optional vector of case weights.
- question
Optional character string defining the question that the model is trying to answer.
- outdir
Character, optional: String defining the output directory.
- parallel_type
Character: "future", "mirai", or "none". Default is "future".
- future_plan
Character: Future plan to use for parallel processing.
- n_workers
Integer: Total number of workers to use for parallel processing. Parallelization may happen at three different levels, from innermost to outermost:
1. Algorithm training (e.g. a parallelized learner like LightGBM)
2. Tuning (inner resampling, where multiple resamples can be processed in parallel)
3. Outer resampling (where multiple outer resamples can be processed in parallel)
The train() function will assign the workers to the innermost available parallelization level. It is best to leave a few cores free for the OS and other processes, especially on shared systems or when working with large datasets, since parallelization increases memory usage.
- verbosity
Integer: Verbosity level.
Note: If hyperparameters is not defined, default hyperparameters are used. Avoid relying on this; instead, use the appropriate setup_* function with the hyperparameters argument.
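A minimal sketch of a typical call, assuming the package is loaded. The dataset name, the specific hyperparameter setup function (setup_LightGBM), and its availability are illustrative assumptions; consult the package's setup_* documentation for the actual functions.

```r
# Illustrative sketch only: setup_LightGBM() is a hypothetical setup_* call.
mod <- train(
  x = dat_training,                            # training set data.frame
  hyperparameters = setup_LightGBM(),          # assumed setup_* function
  tuner_config = setup_GridSearch(),           # inner resampling for tuning
  outer_resampling_config = setup_Resampler(), # outer resampling for assessment
  parallel_type = "future",
  n_workers = 4L,
  verbosity = 1L
)
```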
Value
Object of class Regression (Supervised), RegressionRes (SupervisedRes), Classification (Supervised), or ClassificationRes (SupervisedRes).
Details
Important: For binary classification, the outcome should be a factor where the 2nd level corresponds to the positive class.
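The level ordering required above can be set and checked with base R; the class labels here are illustrative:

```r
# Binary classification outcome: the positive class must be the 2nd level.
# Here "disease" is the positive class, so it is listed second in levels.
y <- factor(c("control", "disease", "control", "disease"),
            levels = c("control", "disease"))
levels(y)[2]  # the positive class, "disease"
```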
Note on resampling: You should never use an outer resampling method with replacement if you will also be using an inner resampling (for tuning). The duplicated cases from the outer resampling may appear both in the training and test sets of the inner resamples, leading to underestimated test error.
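A minimal base-R illustration of why this happens: an outer resample drawn with replacement (e.g. a bootstrap) duplicates cases, and an inner split for tuning can then place copies of the same case on both sides of the split. The indices below are hand-picked for clarity rather than drawn randomly.

```r
# Outer resample of 10 cases, drawn with replacement:
# case 7 appears twice (hand-picked for illustration).
outer_train <- c(1, 3, 5, 7, 8, 9, 2, 7, 10, 4)

# Inner resample for tuning: first 7 positions train, last 3 validate.
inner_train <- outer_train[1:7]   # contains one copy of case 7
inner_valid <- outer_train[8:10]  # contains the other copy of case 7

# The same original case sits in both inner sets -> leakage,
# so the inner (tuning) error is optimistically biased.
intersect(inner_train, inner_valid)  # includes case 7
```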