11  Classification

11.1 Setup

11.1.1 Packages

library(rtemis)
  .:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
library(data.table)

11.1.2 Data

For this example, we shall use the BreastCancer dataset from the mlbench package:

data(BreastCancer, package = "mlbench")

In rtemis, the last column of the dataset is treated as the outcome variable.

We optionally convert the dataset to a data.table:

train() supports data.frame, data.table, or tibble inputs.

dat <- as.data.table(BreastCancer)
dat
          Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
      <char>        <ord>     <ord>      <ord>         <ord>        <ord>
  1: 1000025            5         1          1             1            2
  2: 1002945            5         4          4             5            7
  3: 1015425            3         1          1             1            2
  4: 1016277            6         8          8             1            3
  5: 1017023            4         1          1             3            2
 ---                                                                     
695:  776715            3         1          1             1            3
696:  841769            2         1          1             1            2
697:  888820            5        10         10             3            7
698:  897471            4         8          6             4            3
699:  897471            4         8          8             5            4
     Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
          <fctr>      <fctr>          <fctr>  <fctr>    <fctr>
  1:           1           3               1       1    benign
  2:          10           3               2       1    benign
  3:           2           3               1       1    benign
  4:           4           3               7       1    benign
  5:           1           3               1       1    benign
 ---                                                          
695:           2           1               1       1    benign
696:           1           1               1       1    benign
697:           3           8              10       2 malignant
698:           4          10               6       1 malignant
699:           5          10               4       1 malignant

Also optionally, we clean the dataset; here, dt_set_clean_all() replaces the periods in column names with underscores:

dt_set_clean_all(dat)
dat

dt_* functions operate on data.table objects. dt_set_* functions modify their input in-place.
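
For comparison, a similar in-place rename can be done directly with data.table's setnames() (a minimal sketch; dt_set_clean_all() may apply additional cleaning beyond this):

# Replace literal periods with underscores in all column names, in-place:
setnames(dat, gsub(".", "_", names(dat), fixed = TRUE))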

Class is already the last column; otherwise, we could use set_outcome() to move it.

For classification, the outcome variable must be a factor. For binary classification, the second factor level is considered the positive case.
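
We can check this on our data: in BreastCancer, the second level of Class is "malignant", which is why it is reported as the positive class in the model output below (the reordering line is illustrative only; we keep the default levels):

# The second factor level is treated as the positive class:
levels(dat$Class)
# To designate a different positive class, reorder the levels, e.g.:
# dat[, Class := factor(Class, levels = c("malignant", "benign"))]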

The first column, “Id”, is not a predictor, so we remove it:

dat[, Id := NULL]

11.2 Check data

check_data(dat)
  dat: A data.table with 699 rows and 10 columns.

  Data types
  * 0 numeric features
  * 0 integer features
  * 10 factors, of which 5 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 236 duplicate cases
  * 1 feature includes 'NA' values; 16 'NA' values total
    * 1 factor

  Recommendations
  * Consider removing the duplicate cases.
  * Consider using algorithms that can handle missingness or imputing missing values. 
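
For instance, the duplicates can be counted directly, and data.table's unique() would drop them; we keep them here so the row counts in the rest of the chapter match the output above:

sum(duplicated(dat))   # number of duplicate rows
# dat <- unique(dat)   # would drop them; not done here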

11.3 Train a single model

11.3.1 Resample

res <- resample(dat, setup_Resampler(1L, "StratSub"))
2025-10-18 19:01:55 Input contains more than one column; stratifying on last. [resample]
2025-10-18 19:01:55 Using max n bins possible = 2 [strat_sub]
2025-10-18 19:01:55 Updated strat_n_bins from 4 to 2 in ResamplerConfig object. [resample]
res
<rt StratSub Resampler>
  resamples: 
             Subsample_1: <int> 1, 2, 3, 5...
     config:  
             <rt StratSub ResamplerConfig>
                          n: <int> 1
                    train_p: <nmr> 0.75
               stratify_var: <NUL> NULL
               strat_n_bins: <int> 2
                   id_strat: <NUL> NULL
                       seed: <NUL> NULL
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
523 x 10 
size(dat_test)
176 x 10 
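
The split follows train_p = 0.75 in the config above: 523 of 699 cases (about 75%) go to training. Since the resampling is stratified on the outcome, the class balance should be similar in both sets, which we can check:

# Compare class proportions across the training and test sets:
prop.table(table(dat_training$Class))
prop.table(table(dat_test$Class))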

11.3.2 Train model

As an example, we use the LightRF algorithm to train a random forest model:

mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
2025-10-18 19:01:55  [train]
2025-10-18 19:01:55 Training set: 523 cases x 9 features. [summarize_supervised]
2025-10-18 19:01:55     Test set: 176 cases x 9 features. [summarize_supervised]
2025-10-18 19:01:55 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:55 Training LightRF Classification... [train]
2025-10-18 19:01:55 Checking data is ready for training... [check_supervised]

<rt Classification>
LightRF (LightGBM Random Forest)

  <rt Training Classification Metrics>
                     Predicted
          Reference  malignant  benign  
          malignant        162      18
             benign         10     333

                     Overall  
        Sensitivity  0.900  
        Specificity  0.971  
  Balanced_Accuracy  0.935  
                PPV  0.942  
                NPV  0.949  
                 F1  0.920  
           Accuracy  0.946  
                AUC  0.985  
        Brier_Score  0.072  

     Positive Class malignant

  <rt Test Classification Metrics>
                     Predicted
          Reference  malignant  benign  
          malignant         55       6
             benign          3     112

                     Overall  
        Sensitivity  0.902  
        Specificity  0.974  
  Balanced_Accuracy  0.938  
                PPV  0.948  
                NPV  0.949  
                 F1  0.924  
           Accuracy  0.949  
                AUC  0.991  
        Brier_Score  0.072  

     Positive Class malignant

2025-10-18 19:01:56 Done in 0.64 seconds. [train]
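
As a sanity check, the headline test metrics can be recomputed from the test confusion matrix above:

sens <- 55 / (55 + 6)    # Sensitivity: correctly identified malignant cases
spec <- 112 / (112 + 3)  # Specificity: correctly identified benign cases
(sens + spec) / 2        # Balanced accuracy: 0.938, matching the output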

11.3.3 Describe model

describe(mod_lightrf)
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.

11.3.4 Plot Confusion Matrix

plot_true_pred(mod_lightrf)

11.3.5 Plot ROC Curve

plot_roc(mod_lightrf)

11.3.6 Present model

present() combines describe() with either plot() or plot_roc() (the default):

present(mod_lightrf)
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.

The type argument defaults to "ROC" but can be set to "confusion" to show the training and test confusion matrices side by side:

present(mod_lightrf, type = "confusion")
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.

11.3.7 Plot Variable Importance

plot_varimp(mod_lightrf)

11.3.8 Predict on new data

For this example, we’ll use the dat_test we created earlier. Remember that if the dataset includes the outcome variable, it must be removed before predicting: you can either delete the column or use indexing to exclude it. rtemis includes the convenience function features(), which excludes the last column of a data.frame, data.table, or tibble:

head(features(dat_test))
   Cl_thickness Cell_size Cell_shape Marg_adhesion Epith_c_size Bare_nuclei
          <ord>     <ord>      <ord>         <ord>        <ord>      <fctr>
1:            6         8          8             1            3           4
2:            1         1          1             1            1           1
3:            5         3          3             3            2           3
4:            8         7          5            10            7           9
5:            4         1          1             1            2           1
6:            1         1          1             1            2           1
   Bl_cromatin Normal_nucleoli Mitoses
        <fctr>          <fctr>  <fctr>
1:           3               7       1
2:           3               1       1
3:           4               4       1
4:           5               5       4
5:           3               1       1
6:           3               1       1
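
For this dataset, the same result can be obtained by excluding the outcome column by name with data.table syntax:

# Drop the outcome column directly, equivalent to features() here:
head(dat_test[, !"Class"])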

In binary classification, the output of predict() is a vector of probabilities for the positive class:

pred <- predict(mod_lightrf, features(dat_test))
head(pred)
[1] 0.6461656 0.1560078 0.3754721 0.7414489 0.1512322 0.1510922
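
To get class labels instead, the probabilities can be thresholded; a simple sketch using the conventional 0.5 cutoff (the threshold can of course be tuned):

# Convert positive-class probabilities to predicted labels at 0.5:
pred_class <- factor(
  ifelse(pred >= 0.5, "malignant", "benign"),
  levels = levels(dat_test$Class)
)
# Cross-tabulate predictions against the known test outcomes:
table(Predicted = pred_class, Reference = dat_test$Class)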

11.4 Train on multiple training/test resamples

To train on multiple resamples, we use the outer_resampling_config argument:

resmod_lightrf <- train(
  dat_training,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
2025-10-18 19:01:57  [train]
2025-10-18 19:01:57 Training set: 523 cases x 9 features. [summarize_supervised]
2025-10-18 19:01:57 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:57 <> Training LightRF Classification using 10-fold crossvalidation... [train]
2025-10-18 19:01:57 Input contains more than one column; stratifying on last. [resample]
2025-10-18 19:01:57 Using max n bins possible = 2. [kfold]
2025-10-18 19:02:01 </> Outer resampling done. [train]
<rt Resampled Classification Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.

  <rt Resampled Classification Training Metrics>
    Showing mean (sd) across resamples.
          Sensitivity: 0.895 (0.009)
          Specificity: 0.971 (4.6e-03)
    Balanced_Accuracy: 0.933 (0.005)
                  PPV: 0.942 (0.009)
                  NPV: 0.946 (4.2e-03)
                   F1: 0.918 (0.007)
             Accuracy: 0.945 (4.4e-03)
                  AUC: 0.984 (2.6e-03)
          Brier_Score: 0.076 (2.2e-03)

  <rt Resampled Classification Test Metrics>
    Showing mean (sd) across resamples.
          Sensitivity: 0.878 (0.086)
          Specificity: 0.971 (0.041)
    Balanced_Accuracy: 0.924 (0.053)
                  PPV: 0.944 (0.077)
                  NPV: 0.939 (0.043)
                   F1: 0.908 (0.068)
             Accuracy: 0.939 (0.046)
                  AUC: 0.984 (0.026)
          Brier_Score: 0.079 (0.016)

2025-10-18 19:02:01 Done in 3.83 seconds. [train]

Because we used outer resampling, train() now returns a ClassificationRes object:

class(resmod_lightrf)
[1] "rtemis::ClassificationRes" "rtemis::SupervisedRes"    
[3] "S7_object"                

11.4.1 Describe

describe(resmod_lightrf)
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds. 

11.4.2 Plot

The plot() method for ClassificationRes objects plots boxplots of the training and test set metrics:

plot(resmod_lightrf)

11.4.3 Present

The present() method for ClassificationRes objects combines the describe() and plot() methods:

present(resmod_lightrf)
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds. 