library(rtemis)
6 Resample
Resampling refers to a collection of techniques for selecting cases from a sample. It is central to many machine learning algorithms and pipelines. The two core uses of resampling are:
- Model selection (a.k.a. tuning) - Finding a combination of hyperparameters that works well
- Model assessment - Assessing how well a model performs
By convention, we use the terms training and validation sets when referring to model selection, and training and testing sets when referring to model assessment. The terminology is unfortunately not intuitive and has led to confusion. Some people reverse the terms, but we use training, validation, and testing as they are used in the Elements of Statistical Learning (p. 222, second edition, 12th printing).
6.1 Concepts: Model Selection and Assessment
- Model selection (a.k.a. hyperparameter tuning)
Resamples of the training set are drawn. For each resample, a combination of hyperparameters is used to train a model, and the mean validation-set error across resamples is calculated. The combination of hyperparameters with the minimum average loss across validation sets is selected to train a model on the full training sample.
- Model assessment
The full sample is split into multiple training - testing sets. A model is trained on each training set and its performance is assessed on the corresponding test set. Model performance is averaged across all test sets.
Nested resampling, or nested crossvalidation, is the procedure in which the two are combined, so that hyperparameter tuning (resampling of the training set) is performed within each of multiple training resamples, and performance is assessed on each corresponding test set. train() performs automatic nested resampling and is one of the core supervised learning functions in rtemis.
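To make the procedure concrete, below is a minimal sketch of nested resampling written directly with resample() and setup_Resampler() (introduced in the next section) and base R, rather than through train(). The data frame dat, the use of polynomial degree as the hyperparameter, and mean squared error as the loss are illustrative assumptions, not part of the rtemis API; indexing a Resampler with [[ to get the training indices of a resample follows the usage shown later in this chapter.
# Illustrative sketch of nested resampling, not the train() API.
# Assumptions: data.frame dat with outcome y, polynomial degree as the
# hyperparameter, mean squared error (MSE) as the loss.
set.seed(2026)
dat <- data.frame(x1 = rnorm(200))
dat$y <- dat$x1 + .5 * dat$x1^2 + rnorm(200, sd = .3)
degrees <- 1:3
outer <- resample(dat$y, parameters = setup_Resampler(n = 5L, type = "KFold"))
outer_mse <- sapply(seq(5L), function(i) {
  dat_train <- dat[outer[[i]], ]  # outer training set
  dat_test <- dat[-outer[[i]], ]  # outer test set
  # Inner resampling of the training set: hyperparameter tuning
  inner <- resample(dat_train$y, parameters = setup_Resampler(n = 5L, type = "KFold"))
  mean_val_mse <- sapply(degrees, function(d) {
    mean(sapply(seq(5L), function(j) {
      fit <- lm(y ~ poly(x1, d), data = dat_train[inner[[j]], ])
      held_out <- dat_train[-inner[[j]], ]
      mean((held_out$y - predict(fit, held_out))^2)
    }))
  })
  best_d <- degrees[which.min(mean_val_mse)]
  # Refit on the full training set with the selected hyperparameter,
  # then assess on the corresponding test set
  fit <- lm(y ~ poly(x1, best_d), data = dat_train)
  mean((dat_test$y - predict(fit, dat_test))^2)
})
mean(outer_mse) # cross-validated estimate of test error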
6.2 The resample() function
The resample() function is responsible for all resampling in rtemis. It returns a Resampler object:
x <- rnorm(1000)
res <- resample(x)
class(res)
[1] "rtemis::Resampler" "S7_object"
It contains two fields: resamples and parameters. The resamples field is a list of integer indices of the training cases for each resample. The parameters field is an object of class ResamplerParameters that contains the parameters used for resampling.
res
KFold Resampler
resamples:
Fold_1: <int> 2, 4, 5, 6...
Fold_2: <int> 1, 2, 3, 4...
Fold_3: <int> 1, 2, 3, 4...
Fold_4: <int> 1, 2, 3, 4...
Fold_5: <int> 1, 2, 3, 4...
Fold_6: <int> 1, 3, 4, 5...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 5...
Fold_10: <int> 1, 2, 3, 4...
parameters:
KFold ResamplerParameters
n: <int> 10
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
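The two fields can be inspected directly. A quick sketch, assuming the Resampler's S7 properties are read with @:
# Inspect the Resampler fields (assuming S7 `@` property access)
length(res@resamples)      # number of resamples, 10 by default
res@resamples[["Fold_1"]]  # integer indices of training cases in fold 1
res@parameters             # the ResamplerParameters object used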
resample() supports 5 types of resampling:
- k-fold crossvalidation (Stratified)
You split the cases into k sets (folds). Each set is used once as the validation or testing set. This means each case is left out exactly once and there is no overlap between different validation/test sets. In rtemis, the folds are also stratified on the outcome by default, unless otherwise specified. Stratification tries to maintain the full sample’s distribution in both the training and left-out sets. This is crucial for non-normally distributed continuous outcomes or imbalanced datasets. 10 is a common value for k, called 10-fold. Note that the size of the training and left-out sets depends on the sample size.
res_10fold <- resample(
  x,
  parameters = setup_Resampler(n = 10L, type = "KFold")
)
res_10fold
KFold Resampler
resamples:
Fold_1: <int> 1, 2, 3, 4...
Fold_2: <int> 1, 2, 3, 4...
Fold_3: <int> 1, 2, 3, 4...
Fold_4: <int> 2, 3, 4, 5...
Fold_5: <int> 1, 4, 5, 6...
Fold_6: <int> 1, 2, 3, 4...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 6...
Fold_10: <int> 1, 2, 3, 4...
parameters:
KFold ResamplerParameters
n: <int> 10
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
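Because the folds partition the sample, each case should be absent from the training indices of exactly one fold. A quick check, using [[ to extract the training indices of a resample as in the example later in this chapter:
# Verify that every case is left out exactly once across the 10 folds
left_out <- lapply(seq(10L), function(i) setdiff(seq_along(x), res_10fold[[i]]))
length(unlist(left_out))        # expected: 1000, every case is left out once
anyDuplicated(unlist(left_out)) # expected: 0, no case is left out twice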
- Stratified subsampling
Draws n stratified samples from the data given a certain probability (train_p) that each case belongs to the training set. Since you are randomly sampling from the full sample each time, there will be overlap in the test-set cases, but you control the training-to-testing ratio and the number of resamples independently, unlike in k-fold resampling.
res_25ss <- resample(
  x,
  parameters = setup_Resampler(n = 25L, type = "StratSub")
)
res_25ss
StratSub Resampler
resamples:
Showing first 12 of 25 items.
Subsample_1: <int> 1, 2, 4, 5...
Subsample_2: <int> 1, 2, 4, 6...
Subsample_3: <int> 2, 4, 5, 6...
Subsample_4: <int> 1, 2, 3, 4...
Subsample_5: <int> 2, 6, 7, 8...
Subsample_6: <int> 1, 2, 4, 6...
Subsample_7: <int> 1, 2, 6, 7...
Subsample_8: <int> 1, 3, 4, 5...
Subsample_9: <int> 1, 7, 8, 10...
Subsample_10: <int> 3, 5, 6, 7...
Subsample_11: <int> 1, 3, 4, 5...
Subsample_12: <int> 1, 2, 5, 6...
... 13 more items not shown.
parameters:
StratSub ResamplerParameters
n: <int> 25
train_p: <nmr> 0.75
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
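With the default train_p of 0.75 shown above, each training resample should contain roughly 75% of the 1000 cases. A quick check:
# Proportion of the full sample in each training resample
sapply(seq(25L), function(i) length(res_25ss[[i]])) / length(x) # expected: ~0.75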
- Bootstrap
The bootstrap: random sampling with replacement. Since cases are replicated, you should not use the bootstrap as the outer resampler if you will also have inner resampling for tuning, because the same case may end up in both training and validation sets.
res_100boot <- resample(
  x,
  parameters = setup_Resampler(n = 100L, type = "Bootstrap")
)
res_100boot
Bootstrap Resampler
resamples:
Showing first 12 of 100 items.
Bootsrap_1: <int> 1, 2, 4, 5...
Bootsrap_2: <int> 1, 1, 2, 2...
Bootsrap_3: <int> 2, 4, 4, 5...
Bootsrap_4: <int> 2, 2, 3, 3...
Bootsrap_5: <int> 5, 5, 6, 8...
Bootsrap_6: <int> 2, 3, 3, 4...
Bootsrap_7: <int> 2, 2, 5, 5...
Bootsrap_8: <int> 4, 5, 5, 6...
Bootsrap_9: <int> 2, 6, 6, 10...
Bootsrap_10: <int> 1, 5, 6, 7...
Bootsrap_11: <int> 3, 5, 6, 9...
Bootsrap_12: <int> 1, 3, 4, 6...
... 88 more items not shown.
parameters:
Bootstrap ResamplerParameters
n: <int> 100
id_strat: <NUL> NULL
seed: <NUL> NULL
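Because bootstrap samples are drawn with replacement, cases are replicated within each resample, which is why the bootstrap should not be resampled further. A quick check:
# Replication of cases within the first bootstrap resample
sum(duplicated(res_100boot[[1]])) # expected: > 0, some cases appear more than once
length(unique(res_100boot[[1]]))  # expected: < 1000, fewer unique cases than length(x)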
- Stratified Bootstrap
This is stratified subsampling with random replication of cases to match the length of the original sample. As with the bootstrap, do not use it if you will be further resampling each resample.
res_100sboot <- resample(
  x,
  parameters = setup_Resampler(n = 100L, type = "StratBoot")
)
res_100sboot
StratBoot Resampler
resamples:
Showing first 12 of 100 items.
StratBoot_1: <int> 1, 2, 5, 5...
StratBoot_2: <int> 1, 3, 3, 4...
StratBoot_3: <int> 1, 2, 4, 4...
StratBoot_4: <int> 1, 2, 2, 5...
StratBoot_5: <int> 1, 4, 5, 6...
StratBoot_6: <int> 2, 2, 3, 4...
StratBoot_7: <int> 1, 4, 6, 7...
StratBoot_8: <int> 1, 3, 4, 5...
StratBoot_9: <int> 1, 2, 3, 4...
StratBoot_10: <int> 1, 3, 4, 5...
StratBoot_11: <int> 1, 1, 2, 2...
StratBoot_12: <int> 1, 2, 3, 4...
... 88 more items not shown.
parameters:
StratBoot ResamplerParameters
n: <int> 100
stratify_var: <NUL> NULL
train_p: <nmr> 0.75
strat_n_bins: <int> 4
target_length: <NUL> NULL
id_strat: <NUL> NULL
seed: <NUL> NULL
- Leave-One-Out-Crossvalidation (LOOCV)
This is k-fold crossvalidation where \(k = N\), where \(N\) is the number of data points/cases in the whole sample. It is included for experimentation and completeness, but it is not recommended for either model selection or model assessment over the other resampling methods.
res_loocv <- resample(
  x,
  parameters = setup_Resampler(type = "LOOCV")
)
res_loocv
LOOCV Resampler
resamples:
Showing first 12 of 1000 items.
Fold_1: <int> 2, 3, 4, 5...
Fold_2: <int> 1, 3, 4, 5...
Fold_3: <int> 1, 2, 4, 5...
Fold_4: <int> 1, 2, 3, 5...
Fold_5: <int> 1, 2, 3, 4...
Fold_6: <int> 1, 2, 3, 4...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 4...
Fold_10: <int> 1, 2, 3, 4...
Fold_11: <int> 1, 2, 3, 4...
Fold_12: <int> 1, 2, 3, 4...
... 988 more items not shown.
parameters:
LOOCV ResamplerParameters
n: <int> 1000
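Since k = N here, there are 1000 resamples, each training on all but one case. A quick check:
# Each LOOCV training resample should contain N - 1 = 999 cases
length(res_loocv[[1]]) # expected: 999
all(sapply(seq(1000L), function(i) length(res_loocv[[i]])) == 999L) # expected: TRUE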
6.3 Setting up a Resampler
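A Resampler is configured with setup_Resampler(), which returns a ResamplerParameters object passed to resample() via its parameters argument. A minimal sketch follows; n, type, and train_p are used elsewhere in this chapter, while passing seed this way is an assumption based on the seed field shown in the parameter printouts above.
params <- setup_Resampler(
  n = 25L,
  type = "StratSub",
  train_p = .8,
  seed = 2026L # assumed argument, mirroring the seed field printed above
)
res_custom <- resample(x, parameters = params)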
6.4 Example: Stratified vs. random sampling in a binomial distribution
Assume x is the outcome of interest, where events occur with a probability of .05 - a common scenario in many fields.
set.seed(2020)
x <- rbinom(500, 1, .05)
draw_dist(x, type = "hist")
freq <- table(x)
prob <- freq[2] / sum(freq)
# Random sampling with `sample()`
res_nonstrat <- lapply(seq(10), function(i) sample(seq(x), .75 * length(x)))
# Stratified subsampling with `resample()`
res_strat <- resample(
  x,
  parameters = setup_Resampler(n = 10L, type = "StratSub", train_p = .75)
)
prob_nonstrat <- sapply(seq(10), function(i) {
  freq <- table(x[res_nonstrat[[i]]])
  freq[2] / sum(freq)
})
prob_strat <- sapply(seq(10), function(i) {
  freq <- table(x[res_strat[[i]]])
  freq[2] / sum(freq)
})
prob_nonstrat
1 1 1 1 1 1 1
0.06933333 0.05066667 0.05066667 0.05866667 0.06133333 0.06400000 0.05333333
1 1 1
0.06133333 0.05333333 0.05600000
sd(prob_nonstrat)
[1] 0.006164815
prob_strat
1 1 1 1 1 1 1
0.05614973 0.05614973 0.05614973 0.05614973 0.05614973 0.05614973 0.05614973
1 1 1
0.05614973 0.05614973 0.05614973
sd(prob_strat)
[1] 0
As expected, random sampling resulted in a slightly different event probability in each resample, whereas stratified subsampling maintained a constant probability across resamples.
6.5 Resampling pipeline
flowchart TB
    A("setup_Resampler()") --> B[ResamplerParameters]
    B --> C("resample()")
    C --> D[Resampler]