```r
library(rtemis)
```

```
.:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
```

```r
library(data.table)
```

For this example, we shall use the `BreastCancer` dataset from the **mlbench** package:

```r
data(BreastCancer, package = "mlbench")
```

In rtemis, the last column is the outcome variable.

`train()` supports `data.frame`, `data.table`, or `tibble` inputs. We optionally convert the dataset to a `data.table`:
```r
dat <- as.data.table(BreastCancer)
dat
```

```
          Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
      <char>        <ord>     <ord>      <ord>         <ord>        <ord>
  1: 1000025            5         1          1             1            2
  2: 1002945            5         4          4             5            7
  3: 1015425            3         1          1             1            2
  4: 1016277            6         8          8             1            3
  5: 1017023            4         1          1             3            2
 ---
695:  776715            3         1          1             1            3
696:  841769            2         1          1             1            2
697:  888820            5        10         10             3            7
698:  897471            4         8          6             4            3
699:  897471            4         8          8             5            4
     Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
          <fctr>      <fctr>          <fctr>  <fctr>    <fctr>
  1:           1           3               1       1    benign
  2:          10           3               2       1    benign
  3:           2           3               1       1    benign
  4:           4           3               7       1    benign
  5:           1           3               1       1    benign
 ---
695:           2           1               1       1    benign
696:           1           1               1       1    benign
697:           3           8              10       2 malignant
698:           4          10               6       1 malignant
699:           5          10               4       1 malignant
```
Also optionally, we clean the dataset; in this case, we replace the periods in column names with underscores:

```r
dt_set_clean_all(dat)
dat
```

Note that `dt_*` functions operate on `data.table` objects, and `dt_set_*` functions modify their input in place.

`Class` is already the last column; otherwise, we could use `set_outcome()` to move it.
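If the outcome were not already last, a call along these lines should move it (a hedged sketch: `set_outcome()` is mentioned above, but its exact signature is an assumption here):

```r
# Assumed usage: pass the data and the outcome column name;
# set_outcome() is assumed to return the data with that column last.
dat <- set_outcome(dat, "Class")
```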
For classification, the outcome variable must be a factor. For binary classification, the second factor level is considered the positive case.
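We can verify the level order with base R, and reorder it with data.table if needed; in `BreastCancer`, "malignant" is already the second level and therefore the positive case:

```r
levels(dat$Class)
#> [1] "benign"    "malignant"

# If the positive class were not second, reorder the levels explicitly:
dat[, Class := factor(Class, levels = c("benign", "malignant"))]
```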
The first column, “Id”, is not a predictor, so we remove it:
```r
dat[, Id := NULL]
check_data(dat)
```

```
  dat: A data.table with 699 rows and 10 columns.

  Data types
  * 0 numeric features
  * 0 integer features
  * 10 factors, of which 5 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 236 duplicate cases
  * 1 feature includes 'NA' values; 16 'NA' values total
    * 1 factor

  Recommendations
  * Consider removing the duplicate cases.
  * Consider using algorithms that can handle missingness or imputing missing values.
```
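If we wanted to act on the first recommendation, plain data.table is enough: `unique()` drops duplicate rows. We skip this step here so the row counts below still match:

```r
# Optional: remove the duplicate cases flagged by check_data().
dat_dedup <- unique(dat)
nrow(dat_dedup)
```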
We use `resample()` to create a single stratified subsample, which will split the data into training and test sets:

```r
res <- resample(dat, setup_Resampler(1L, "StratSub"))
res
```

```
<rt StratSub Resampler>
  resamples:
    Subsample_1: <int> 1, 2, 3, 5...
  config:
    <rt StratSub ResamplerConfig>
      n: <int> 1
      train_p: <nmr> 0.75
      stratify_var: <NUL> NULL
      strat_n_bins: <int> 2
      id_strat: <NUL> NULL
      seed: <NUL> NULL
```
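The config above shows `seed: NULL`; for a reproducible split, pass a seed to `setup_Resampler()` (a hedged sketch, assuming the `seed` argument corresponds to the `seed` field shown in the config):

```r
# Assumption: setup_Resampler() accepts a seed matching the config field above.
res_seeded <- resample(dat, setup_Resampler(1L, "StratSub", seed = 2024L))
```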
We use the resulting indices to extract the training and test sets:

```r
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
```

```
523 x 10
```

```r
size(dat_test)
```

```
176 x 10
```
Using LightRF as an example, we train a random forest model:
```r
mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
```

```
<rt Classification>
LightRF (LightGBM Random Forest)

<rt Training Classification Metrics>
             Predicted
  Reference  malignant  benign
  malignant        162      18
  benign            10     333

                     Overall
  Sensitivity          0.900
  Specificity          0.971
  Balanced_Accuracy    0.935
  PPV                  0.942
  NPV                  0.949
  F1                   0.920
  Accuracy             0.946
  AUC                  0.985
  Brier_Score          0.072

  Positive Class   malignant

<rt Test Classification Metrics>
             Predicted
  Reference  malignant  benign
  malignant         55       6
  benign             3     112

                     Overall
  Sensitivity          0.902
  Specificity          0.974
  Balanced_Accuracy    0.938
  PPV                  0.948
  NPV                  0.949
  F1                   0.924
  Accuracy             0.949
  AUC                  0.991
  Brier_Score          0.072

  Positive Class   malignant
```
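As a sanity check, the test metrics above follow directly from the confusion matrix; for example, in base R:

```r
# Test-set confusion matrix values from the output above.
tp <- 55; fn <- 6   # malignant (positive) cases: correctly vs. incorrectly classified
fp <- 3;  tn <- 112 # benign (negative) cases: incorrectly vs. correctly classified

tp / (tp + fn)                         # Sensitivity: 0.902
tn / (tn + fp)                         # Specificity: 0.974
(tp / (tp + fn) + tn / (tn + fp)) / 2  # Balanced accuracy: 0.938
```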
`describe()` generates a plain-language summary of the model:

```r
describe(mod_lightrf)
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```

```r
plot_true_pred(mod_lightrf)
```

```r
plot_roc(mod_lightrf)
```

`present()` combines `describe()` and `plot()` or `plot_roc()` (the default):

```r
present(mod_lightrf)
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```

`type` defaults to `"ROC"`, but can be set to `"confusion"` to show the training and test confusion matrices side by side:

```r
present(mod_lightrf, type = "confusion")
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```
`plot_varimp()` plots variable importance:

```r
plot_varimp(mod_lightrf)
```

To get predictions on new data, we'll use the `dat_test` we created. Remember that if the dataset includes the outcome variable, it must be removed before predicting. You can either delete the column or use indexing to exclude it. rtemis includes a convenience function, `features()`, which excludes the last column of data.frames, data.tables, or tibbles:
```r
head(features(dat_test))
```

```
   Cl_thickness Cell_size Cell_shape Marg_adhesion Epith_c_size Bare_nuclei
          <ord>     <ord>      <ord>         <ord>        <ord>      <fctr>
1:            6         8          8             1            3           4
2:            1         1          1             1            1           1
3:            5         3          3             3            2           3
4:            8         7          5            10            7           9
5:            4         1          1             1            2           1
6:            1         1          1             1            2           1
   Bl_cromatin Normal_nucleoli Mitoses
        <fctr>          <fctr>  <fctr>
1:           3               7       1
2:           3               1       1
3:           4               4       1
4:           5               5       4
5:           3               1       1
6:           3               1       1
```
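Equivalently, since `dat_test` is a `data.table`, you can exclude the outcome column yourself by indexing:

```r
# Deselect the outcome by name; equivalent to features(dat_test)
# here, because Class is the last column.
head(dat_test[, !"Class"])
```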
In binary classification, the output of `predict()` is a vector of probabilities for the positive class:
```r
pred <- predict(mod_lightrf, features(dat_test))
head(pred)
```

```
[1] 0.6461656 0.1560078 0.3754721 0.7414489 0.1512322 0.1510922
```
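If you need class labels rather than probabilities, one option is to threshold the positive-class probabilities yourself, e.g. at 0.5 (a plain-R sketch, not an rtemis API):

```r
# "malignant" is the positive class, i.e. the second factor level.
pred_class <- factor(
  ifelse(pred >= 0.5, "malignant", "benign"),
  levels = c("benign", "malignant")
)
table(Predicted = pred_class, Reference = dat_test$Class)
```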
To train on multiple resamples, we use the `outer_resampling_config` argument:
```r
resmod_lightrf <- train(
  dat_training,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
```

```
<rt Resampled Classification Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.

<rt Resampled Classification Training Metrics>
Showing mean (sd) across resamples.
        Sensitivity: 0.895 (0.009)
        Specificity: 0.971 (4.6e-03)
  Balanced_Accuracy: 0.933 (0.005)
                PPV: 0.942 (0.009)
                NPV: 0.946 (4.2e-03)
                 F1: 0.918 (0.007)
           Accuracy: 0.945 (4.4e-03)
                AUC: 0.984 (2.6e-03)
        Brier_Score: 0.076 (2.2e-03)

<rt Resampled Classification Test Metrics>
Showing mean (sd) across resamples.
        Sensitivity: 0.878 (0.086)
        Specificity: 0.971 (0.041)
  Balanced_Accuracy: 0.924 (0.053)
                PPV: 0.944 (0.077)
                NPV: 0.939 (0.043)
                 F1: 0.908 (0.068)
           Accuracy: 0.939 (0.046)
                AUC: 0.984 (0.026)
        Brier_Score: 0.079 (0.016)
```
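For reference, the Brier score reported above is the mean squared difference between the predicted positive-class probability and the 0/1 outcome. Computed by hand for the earlier single train/test split, it matches the `Brier_Score` of 0.072 reported for `mod_lightrf`:

```r
# Brier score: mean squared error of predicted probabilities
# against the 0/1 outcome (1 = "malignant", the positive class).
y <- as.integer(dat_test$Class == "malignant")
mean((pred - y)^2)
```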
Now, `train()` produced a `ClassificationRes` object:
```r
class(resmod_lightrf)
```

```
[1] "rtemis::ClassificationRes" "rtemis::SupervisedRes"
[3] "S7_object"
```
```r
describe(resmod_lightrf)
```

```
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds.
```
The `plot()` method for `ClassificationRes` objects plots boxplots of the training and test set metrics:

```r
plot_true_pred(resmod_lightrf)
```

The `present()` method for `ClassificationRes` objects combines the `describe()` and `plot()` methods:

```r
present(resmod_lightrf)
```

```
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds.
```