10  Regression

10.1 Setup

10.1.1 Packages

library(rtemis)
  .:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
library(data.table)

10.1.2 Data

As an example, we will use the penguins dataset from the palmerpenguins package.
For regression, we will predict body_mass_g from the other features.

data(penguins, package = "palmerpenguins")
dat <- penguins

In rtemis, the last column of the dataset is treated as the outcome variable.

We optionally convert the dataset to a data.table and inspect it:

dat <- as.data.table(dat)
str(dat)
Classes 'data.table' and 'data.frame':  344 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Finally, we use set_outcome to move “body_mass_g” to the last column, thereby making it the outcome variable:

dat <- set_outcome(dat, "body_mass_g")
dat
       species    island bill_length_mm bill_depth_mm flipper_length_mm    sex
        <fctr>    <fctr>          <num>         <num>             <int> <fctr>
  1:    Adelie Torgersen           39.1          18.7               181   male
  2:    Adelie Torgersen           39.5          17.4               186 female
  3:    Adelie Torgersen           40.3          18.0               195 female
  4:    Adelie Torgersen             NA            NA                NA   <NA>
  5:    Adelie Torgersen           36.7          19.3               193 female
 ---                                                                          
340: Chinstrap     Dream           55.8          19.8               207   male
341: Chinstrap     Dream           43.5          18.1               202 female
342: Chinstrap     Dream           49.6          18.2               193   male
343: Chinstrap     Dream           50.8          19.0               210   male
344: Chinstrap     Dream           50.2          18.7               198 female
      year body_mass_g
     <int>       <int>
  1:  2007        3750
  2:  2007        3800
  3:  2007        3250
  4:  2007          NA
  5:  2007        3450
 ---                  
340:  2009        4000
341:  2009        3400
342:  2009        3775
343:  2009        4100
344:  2009        3775
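
For reference, the same reordering can be done with plain data.table by moving "body_mass_g" to the end of the column order. This is just an equivalent sketch; set_outcome is the idiomatic rtemis way:

# Equivalent data.table approach: place the outcome name last.
setcolorder(dat, c(setdiff(names(dat), "body_mass_g"), "body_mass_g"))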

10.2 Check data

check_data(dat)
  dat: A data.table with 344 rows and 8 columns.

  Data types
  * 2 numeric features
  * 3 integer features
  * 3 factors, of which 0 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 5 features include 'NA' values; 19 'NA' values total
    * 1 factor; 2 integer; 2 numeric 
    * 2 missing values in the last column

  Recommendations
  * Consider using algorithms that can handle missingness or imputing missing values. 
  * Filter cases with missing values in the last column if using dataset for supervised learning.
 

There are 2 missing values in our chosen outcome, body_mass_g. As recommended, we must filter out these rows before training a model.

dat <- dat[!is.na(body_mass_g)]
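
Before re-running check_data, we can also count the remaining missing values per column ourselves; this is plain data.table, not an rtemis function:

dat[, lapply(.SD, function(x) sum(is.na(x)))]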

Let’s verify the last column has no missing values now:

check_data(dat)
  dat: A data.table with 342 rows and 8 columns.

  Data types
  * 2 numeric features
  * 3 integer features
  * 3 factors, of which 0 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 1 feature includes 'NA' values; 9 'NA' values total
    * 1 factor

  Recommendations
  * Consider using algorithms that can handle missingness or imputing missing values. 

10.3 Train a single model on training and test sets

10.3.1 Resample

res <- resample(dat, setup_Resampler(1L, "StratSub"))
2025-10-18 19:01:46 Input contains more than one column; stratifying on last. [resample]
res
<rt StratSub Resampler>
  resamples: 
             Subsample_1: <int> 1, 2, 3, 4...
     config:  
             <rt StratSub ResamplerConfig>
                          n: <int> 1
                    train_p: <nmr> 0.75
               stratify_var: <NUL> NULL
               strat_n_bins: <int> 4
                   id_strat: <NUL> NULL
                       seed: <NUL> NULL
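
Note that seed is NULL in the config above, so the split will differ between runs. Judging from the printed config fields, a seed (and a different train_p) can presumably be passed to setup_Resampler for reproducibility; this is an assumption based on the field names, so check the function's documentation:

# Hypothetical: the seed argument mirrors the config field printed above.
res_seeded <- resample(dat, setup_Resampler(1L, "StratSub", seed = 2024))

Continuing with the original res, we split the data into training and test sets:
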
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
254 x 8 
size(dat_test)
88 x 8 
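
Because StratSub stratifies on the outcome (the last column, per the log message above), the two sets should show similar body_mass_g distributions. A quick sanity check with base R:

summary(dat_training$body_mass_g)
summary(dat_test$body_mass_g)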

As an example, we’ll train a LightRF model:

mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
2025-10-18 19:01:46  [train]
2025-10-18 19:01:46 Training set: 254 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:46     Test set: 88 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:46 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:46 Training LightRF Regression... [train]
2025-10-18 19:01:46 Checking data is ready for training... [check_supervised]

<rt Regression>
LightRF (LightGBM Random Forest)

  <rt Training Regression Metrics>
     MAE: 304.63
     MSE: 153082.68
    RMSE: 391.26
     Rsq: 0.76

  <rt Test Regression Metrics>
     MAE: 323.89
     MSE: 173442.84
    RMSE: 416.46
     Rsq: 0.75

2025-10-18 19:01:47 Done in 0.71 seconds. [train]
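
As a quick sanity check on the printed metrics, RMSE is simply the square root of MSE:

sqrt(173442.84)  # test MSE from the printout
[1] 416.4647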

10.3.2 Describe model

describe(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.

10.3.3 Plot model

plot_true_pred(mod_lightrf)

10.3.4 Present model

The present() method for Supervised objects combines the describe() and plot() methods:

present(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.

10.3.5 Plot Variable Importance

plot_varimp(mod_lightrf)

10.4 Train on multiple training/test resamples

train() can just as easily train on multiple resamples, in which case it outputs an object of class RegressionRes for regression. All you need to do is specify the outer resampling configuration using the outer_resampling_config argument.

resmod_lightrf <- train(
  dat_training,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
2025-10-18 19:01:48  [train]
2025-10-18 19:01:48 Training set: 254 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:48 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:48 <> Training LightRF Regression using 10-fold crossvalidation... [train]
2025-10-18 19:01:48 Input contains more than one column; stratifying on last. [resample]
2025-10-18 19:01:53 </> Outer resampling done. [train]
<rt Resampled Regression Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.

  <rt Resampled Regression Training Metrics>
    Showing mean (sd) across resamples.
     MAE: 305.568 (4.190)
     MSE: 155042.602 (4586.526)
    RMSE: 393.716 (5.830)
     Rsq: 0.753 (5e-03)

  <rt Resampled Regression Test Metrics>
    Showing mean (sd) across resamples.
     MAE: 320.186 (42.985)
     MSE: 171046.692 (56975.828)
    RMSE: 408.560 (67.701)
     Rsq: 0.731 (0.052)

2025-10-18 19:01:53 Done in 4.94 seconds. [train]

Now, train() produced a RegressionRes object:

class(resmod_lightrf)
[1] "rtemis::RegressionRes" "rtemis::SupervisedRes" "S7_object"            

10.4.1 Describe

describe(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds. 

10.4.2 Plot

plot_true_pred(resmod_lightrf)

10.4.3 Present

The present() method for RegressionRes objects combines the describe() and plot() methods:

present(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds. 