library(rtemis)
6 Resample
Resampling refers to a collection of techniques for selecting cases from a sample. It is central to many machine learning algorithms and pipelines. The two core uses of resampling are:
- Model selection (a.k.a. tuning) - Finding a combination of hyperparameters that works well
- Model assessment - Assessing how well a model performs
By convention, we use the terms training and validation sets when referring to model selection, and training and testing sets when referring to model assessment. The terminology is unfortunately not intuitive and has led to confusion. Some people reverse the terms, but we use training, validation, and testing as they are used in the Elements of Statistical Learning (p. 222, second edition, 12th printing).
6.1 Concepts: Model Selection and Assessment
- Model selection (a.k.a. hyperparameter tuning)
Resamples of the training set are drawn. For each resample, a combination of hyperparameters is used to train a model, and the mean validation-set error across resamples is calculated. The combination of hyperparameters with the minimum average loss across validation sets is selected to train a model on the full training sample.
- Model assessment
The full sample is split into multiple training - testing sets. A model is trained on each training set and its performance is assessed on the corresponding test set. Model performance is averaged across all test sets.
Nested resampling, or nested crossvalidation, is the procedure in which the two are combined, so that hyperparameter tuning (resampling of the training set) is performed within each of multiple training resamples, and performance is assessed on each corresponding test set. train() performs automatic nested resampling and is one of the core supervised learning functions in rtemis.
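To make the procedure concrete, below is a minimal sketch of nested resampling written directly with resample() and setup_Resampler() (introduced in the next section) and base R, rather than through train(). The data frame dat, the use of polynomial degree as the hyperparameter, and mean squared error as the loss are illustrative assumptions, not part of the rtemis API; indexing a Resampler with [[ to get the training indices of a resample follows the usage shown later in this chapter.
# Illustrative sketch of nested resampling, not the train() API.
# Assumptions: data.frame dat with outcome y, polynomial degree as the
# hyperparameter, mean squared error (MSE) as the loss.
set.seed(2026)
dat <- data.frame(x1 = rnorm(200))
dat$y <- dat$x1 + .5 * dat$x1^2 + rnorm(200, sd = .3)
degrees <- 1:3
outer <- resample(dat$y, parameters = setup_Resampler(n = 5L, type = "KFold"))
outer_mse <- sapply(seq(5L), function(i) {
  dat_train <- dat[outer[[i]], ]  # outer training set
  dat_test <- dat[-outer[[i]], ]  # outer test set
  # Inner resampling of the training set: hyperparameter tuning
  inner <- resample(dat_train$y, parameters = setup_Resampler(n = 5L, type = "KFold"))
  mean_val_mse <- sapply(degrees, function(d) {
    mean(sapply(seq(5L), function(j) {
      fit <- lm(y ~ poly(x1, d), data = dat_train[inner[[j]], ])
      held_out <- dat_train[-inner[[j]], ]
      mean((held_out$y - predict(fit, held_out))^2)
    }))
  })
  best_d <- degrees[which.min(mean_val_mse)]
  # Refit on the full training set with the selected hyperparameter,
  # then assess on the corresponding test set
  fit <- lm(y ~ poly(x1, best_d), data = dat_train)
  mean((dat_test$y - predict(fit, dat_test))^2)
})
mean(outer_mse) # cross-validated estimate of test error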
6.2 The resample() function
The resample() function is responsible for all resampling in rtemis. It returns a Resampler object:
x <- rnorm(1000)
res <- resample(x)
class(res)
[1] "rtemis::Resampler" "S7_object"
It contains two fields: resamples and parameters. The resamples field is a list of integer indices of the training cases for each resample. The parameters field is an object of class ResamplerParameters that contains the parameters used for resampling.
res
KFold Resampler
resamples:
Fold_1: <int> 2, 4, 5, 6...
Fold_2: <int> 1, 2, 3, 4...
Fold_3: <int> 1, 2, 3, 4...
Fold_4: <int> 1, 2, 3, 4...
Fold_5: <int> 1, 2, 3, 4...
Fold_6: <int> 1, 3, 4, 5...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 5...
Fold_10: <int> 1, 2, 3, 4...
parameters:
KFold ResamplerParameters
n: <int> 10
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
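The two fields can be inspected directly. A quick sketch, assuming the Resampler's S7 properties are read with @:
# Inspect the Resampler fields (assuming S7 `@` property access)
length(res@resamples)      # number of resamples, 10 by default
res@resamples[["Fold_1"]]  # integer indices of training cases in fold 1
res@parameters             # the ResamplerParameters object used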
resample() supports 5 types of resampling:
- k-fold crossvalidation (Stratified)
You split the cases into k sets (folds). Each set is used once as the validation or testing set. This means each case is left out exactly once and there is no overlap between different validation/test sets. In rtemis, the folds are also stratified on the outcome by default, unless otherwise specified. Stratification tries to maintain the full sample’s distribution in both the training and left-out sets. This is crucial for non-normally distributed continuous outcomes or imbalanced datasets. 10 is a common value for k, called 10-fold. Note that the size of the training and left-out sets depends on the sample size.
res_10fold <- resample(
  x,
  parameters = setup_Resampler(n = 10L, type = "KFold")
)
res_10fold
KFold Resampler
resamples:
Fold_1: <int> 1, 2, 3, 4...
Fold_2: <int> 1, 2, 3, 4...
Fold_3: <int> 1, 2, 3, 4...
Fold_4: <int> 2, 3, 4, 5...
Fold_5: <int> 1, 4, 5, 6...
Fold_6: <int> 1, 2, 3, 4...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 6...
Fold_10: <int> 1, 2, 3, 4...
parameters:
KFold ResamplerParameters
n: <int> 10
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
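Because the folds partition the sample, each case should be absent from the training indices of exactly one fold. A quick check, using [[ to extract the training indices of a resample as in the example later in this chapter:
# Verify that every case is left out exactly once across the 10 folds
left_out <- lapply(seq(10L), function(i) setdiff(seq_along(x), res_10fold[[i]]))
length(unlist(left_out))        # expected: 1000, every case is left out once
anyDuplicated(unlist(left_out)) # expected: 0, no case is left out twice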
- Stratified subsampling
Draws n stratified samples from the data given a certain probability (train_p) that each case belongs to the training set. Since you are randomly sampling from the full sample each time, there will be overlap in the test-set cases, but you control the training-to-testing ratio and the number of resamples independently, unlike in k-fold resampling.
res_25ss <- resample(
  x,
  parameters = setup_Resampler(n = 25L, type = "StratSub")
)
res_25ss
StratSub Resampler
resamples:
Showing first 12 of 25 items.
Subsample_1: <int> 1, 2, 4, 5...
Subsample_2: <int> 1, 2, 4, 6...
Subsample_3: <int> 2, 4, 5, 6...
Subsample_4: <int> 1, 2, 3, 4...
Subsample_5: <int> 2, 6, 7, 8...
Subsample_6: <int> 1, 2, 4, 6...
Subsample_7: <int> 1, 2, 6, 7...
Subsample_8: <int> 1, 3, 4, 5...
Subsample_9: <int> 1, 7, 8, 10...
Subsample_10: <int> 3, 5, 6, 7...
Subsample_11: <int> 1, 3, 4, 5...
Subsample_12: <int> 1, 2, 5, 6...
... 13 more items not shown.
parameters:
StratSub ResamplerParameters
n: <int> 25
train_p: <nmr> 0.75
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
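With the default train_p of 0.75 shown above, each training resample should contain roughly 75% of the 1000 cases. A quick check:
# Proportion of the full sample in each training resample
sapply(seq(25L), function(i) length(res_25ss[[i]])) / length(x) # expected: ~0.75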
- Bootstrap
The bootstrap: random sampling with replacement. Since cases are replicated, you should not use the bootstrap as the outer resampler if you will also have inner resampling for tuning, because the same case may end up in both training and validation sets.
res_100boot <- resample(
  x,
  parameters = setup_Resampler(n = 100L, type = "Bootstrap")
)
res_100boot
Bootstrap Resampler
resamples:
Showing first 12 of 100 items.
Bootsrap_1: <int> 1, 2, 4, 5...
Bootsrap_2: <int> 1, 1, 2, 2...
Bootsrap_3: <int> 2, 4, 4, 5...
Bootsrap_4: <int> 2, 2, 3, 3...
Bootsrap_5: <int> 5, 5, 6, 8...
Bootsrap_6: <int> 2, 3, 3, 4...
Bootsrap_7: <int> 2, 2, 5, 5...
Bootsrap_8: <int> 4, 5, 5, 6...
Bootsrap_9: <int> 2, 6, 6, 10...
Bootsrap_10: <int> 1, 5, 6, 7...
Bootsrap_11: <int> 3, 5, 6, 9...
Bootsrap_12: <int> 1, 3, 4, 6...
... 88 more items not shown.
parameters:
Bootstrap ResamplerParameters
n: <int> 100
id_strat: <NUL> NULL
seed: <NUL> NULL
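Because bootstrap samples are drawn with replacement, cases are replicated within each resample, which is why the bootstrap should not be resampled further. A quick check:
# Replication of cases within the first bootstrap resample
sum(duplicated(res_100boot[[1]])) # expected: > 0, some cases appear more than once
length(unique(res_100boot[[1]]))  # expected: < 1000, fewer unique cases than length(x)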
- Stratified Bootstrap
This is stratified subsampling with random replication of cases to match the length of the original sample. As with the bootstrap, do not use it if you will be further resampling each resample.
res_100sboot <- resample(
  x,
  parameters = setup_Resampler(n = 100L, type = "StratBoot")
)
res_100sboot
StratBoot Resampler
resamples:
Showing first 12 of 100 items.
StratBoot_1: <int> 1, 2, 5, 5...
StratBoot_2: <int> 1, 3, 3, 4...
StratBoot_3: <int> 1, 2, 4, 4...
StratBoot_4: <int> 1, 2, 2, 5...
StratBoot_5: <int> 1, 4, 5, 6...
StratBoot_6: <int> 2, 2, 3, 4...
StratBoot_7: <int> 1, 4, 6, 7...
StratBoot_8: <int> 1, 3, 4, 5...
StratBoot_9: <int> 1, 2, 3, 4...
StratBoot_10: <int> 1, 3, 4, 5...
StratBoot_11: <int> 1, 1, 2, 2...
StratBoot_12: <int> 1, 2, 3, 4...
... 88 more items not shown.
parameters:
StratBoot ResamplerParameters
n: <int> 100
stratify_var: <NUL> NULL
train_p: <nmr> 0.75
strat_n_bins: <int> 4
target_length: <NUL> NULL
id_strat: <NUL> NULL
seed: <NUL> NULL
- Leave-One-Out-Crossvalidation (LOOCV)
This is k-fold crossvalidation where \(k = N\), where \(N\) is the number of data points/cases in the whole sample. It is included for experimentation and completeness, but it is not recommended for either model selection or model assessment over the other resampling methods.
res_loocv <- resample(
  x,
  parameters = setup_Resampler(type = "LOOCV")
)
res_loocv
LOOCV Resampler
resamples:
Showing first 12 of 1000 items.
Fold_1: <int> 2, 3, 4, 5...
Fold_2: <int> 1, 3, 4, 5...
Fold_3: <int> 1, 2, 4, 5...
Fold_4: <int> 1, 2, 3, 5...
Fold_5: <int> 1, 2, 3, 4...
Fold_6: <int> 1, 2, 3, 4...
Fold_7: <int> 1, 2, 3, 4...
Fold_8: <int> 1, 2, 3, 4...
Fold_9: <int> 1, 2, 3, 4...
Fold_10: <int> 1, 2, 3, 4...
Fold_11: <int> 1, 2, 3, 4...
Fold_12: <int> 1, 2, 3, 4...
... 988 more items not shown.
parameters:
LOOCV ResamplerParameters
n: <int> 1000
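Since k = N here, there are 1000 resamples, each training on all but one case. A quick check:
# Each LOOCV training resample should contain N - 1 = 999 cases
length(res_loocv[[1]]) # expected: 999
all(sapply(seq(1000L), function(i) length(res_loocv[[i]])) == 999L) # expected: TRUE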
6.3 Setting up a Resampler
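A Resampler is configured with setup_Resampler(), which returns a ResamplerParameters object passed to resample() via its parameters argument. A minimal sketch follows; n, type, and train_p are used elsewhere in this chapter, while passing seed this way is an assumption based on the seed field shown in the parameter printouts above.
params <- setup_Resampler(
  n = 25L,
  type = "StratSub",
  train_p = .8,
  seed = 2026L # assumed argument, mirroring the seed field printed above
)
res_custom <- resample(x, parameters = params)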
6.4 Example: Stratified vs. random sampling in a binomial distribution
Assume x is the outcome of interest, where events occur with a probability of .05 - a common scenario in many fields.
set.seed(2020)
x <- rbinom(500, 1, .05)
draw_dist(x, type = "hist")
freq <- table(x)
prob <- freq[2] / sum(freq)
# Random sampling with `sample()`
res_nonstrat <- lapply(seq(10), function(i) sample(seq(x), .75 * length(x)))
# Stratified subsampling with `resample()`
res_strat <- resample(
  x,
  parameters = setup_Resampler(n = 10L, type = "StratSub", train_p = .75)
)
prob_nonstrat <- sapply(seq(10), function(i) {
  freq <- table(x[res_nonstrat[[i]]])
  freq[2] / sum(freq)
})
prob_strat <- sapply(seq(10), function(i) {
  freq <- table(x[res_strat[[i]]])
  freq[2] / sum(freq)
})
prob_nonstrat
1 1 1 1 1 1 1
0.06933333 0.05066667 0.05066667 0.05866667 0.06133333 0.06400000 0.05333333
1 1 1
0.06133333 0.05333333 0.05600000
sd(prob_nonstrat)
[1] 0.006164815
prob_strat
1 1 1 1 1 1 1
0.05614973 0.05614973 0.05614973 0.05614973 0.05614973 0.05614973 0.05614973
1 1 1
0.05614973 0.05614973 0.05614973
sd(prob_strat)
[1] 0
As expected, random sampling resulted in a slightly different event probability in each resample, whereas stratified subsampling maintained a constant probability across resamples.
6.5 Resampling pipeline
flowchart TB
    A("setup_Resampler()") --> B[ResamplerParameters]
    B --> C("resample()")
    C --> D[Resampler]