10  Regression

10.1 Setup

10.1.1 Packages

library(rtemis)
  .:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
library(data.table)

10.1.2 Data

As an example, we will use the penguins dataset from the palmerpenguins package.
For regression, we will predict body_mass_g from the other features.

data(penguins, package = "palmerpenguins")
dat <- penguins

In rtemis, the last column of the dataset is treated as the outcome variable.

We optionally convert the dataset to a data.table and inspect it:

dat <- as.data.table(dat)
str(dat)
Classes 'data.table' and 'data.frame':  344 obs. of  8 variables:
 $ species          : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ island           : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ bill_length_mm   : num  39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
 $ bill_depth_mm    : num  18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
 $ flipper_length_mm: int  181 186 195 NA 193 190 181 195 193 190 ...
 $ body_mass_g      : int  3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
 $ sex              : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
 $ year             : int  2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
 - attr(*, ".internal.selfref")=<externalptr> 

Finally, we use set_outcome to move “body_mass_g” to the last column, thereby making it the outcome variable:

dat <- set_outcome(dat, "body_mass_g")
dat
       species    island bill_length_mm bill_depth_mm flipper_length_mm    sex
        <fctr>    <fctr>          <num>         <num>             <int> <fctr>
  1:    Adelie Torgersen           39.1          18.7               181   male
  2:    Adelie Torgersen           39.5          17.4               186 female
  3:    Adelie Torgersen           40.3          18.0               195 female
  4:    Adelie Torgersen             NA            NA                NA   <NA>
  5:    Adelie Torgersen           36.7          19.3               193 female
 ---                                                                          
340: Chinstrap     Dream           55.8          19.8               207   male
341: Chinstrap     Dream           43.5          18.1               202 female
342: Chinstrap     Dream           49.6          18.2               193   male
343: Chinstrap     Dream           50.8          19.0               210   male
344: Chinstrap     Dream           50.2          18.7               198 female
      year body_mass_g
     <int>       <int>
  1:  2007        3750
  2:  2007        3800
  3:  2007        3250
  4:  2007          NA
  5:  2007        3450
 ---                  
340:  2009        4000
341:  2009        3400
342:  2009        3775
343:  2009        4100
344:  2009        3775
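
For reference, the same reordering can be done with plain data.table by moving "body_mass_g" to the end of the column order. This is just an equivalent sketch; set_outcome is the idiomatic rtemis way:

# Equivalent data.table approach: place the outcome name last.
setcolorder(dat, c(setdiff(names(dat), "body_mass_g"), "body_mass_g"))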

10.2 Check data

check_data(dat)
  dat: A data.table with 344 rows and 8 columns.

  Data types
  * 2 numeric features
  * 3 integer features
  * 3 factors, of which 0 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 5 features include 'NA' values; 19 'NA' values total
    * 1 factor; 2 integer; 2 numeric 
    * 2 missing values in the last column

  Recommendations
  * Consider using algorithms that can handle missingness or imputing missing values. 
  * Filter cases with missing values in the last column if using dataset for supervised learning.
 

There are 2 missing values in our chosen outcome, body_mass_g. As recommended, we must filter out these rows before training a model.

dat <- dat[!is.na(body_mass_g)]
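
Before re-running check_data, we can also count the remaining missing values per column ourselves; this is plain data.table, not an rtemis function:

dat[, lapply(.SD, function(x) sum(is.na(x)))]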

Let’s verify the last column has no missing values now:

check_data(dat)
  dat: A data.table with 342 rows and 8 columns.

  Data types
  * 2 numeric features
  * 3 integer features
  * 3 factors, of which 0 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 1 feature includes 'NA' values; 9 'NA' values total
    * 1 factor

  Recommendations
  * Consider using algorithms that can handle missingness or imputing missing values. 

10.3 Train a single model on training and test sets

10.3.1 Resample

res <- resample(dat, setup_Resampler(1L, "StratSub"))
2025-10-18 19:01:46 Input contains more than one column; stratifying on last. [resample]
res
<rt StratSub Resampler>
  resamples: 
             Subsample_1: <int> 1, 2, 3, 4...
     config:  
             <rt StratSub ResamplerConfig>
                          n: <int> 1
                    train_p: <nmr> 0.75
               stratify_var: <NUL> NULL
               strat_n_bins: <int> 4
                   id_strat: <NUL> NULL
                       seed: <NUL> NULL
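
Note that seed is NULL in the config above, so the split will differ between runs. Judging from the printed config fields, a seed (and a different train_p) can presumably be passed to setup_Resampler for reproducibility; this is an assumption based on the field names, so check the function's documentation:

# Hypothetical: the seed argument mirrors the config field printed above.
res_seeded <- resample(dat, setup_Resampler(1L, "StratSub", seed = 2024))

Continuing with the original res, we split the data into training and test sets:
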
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
254 x 8 
size(dat_test)
88 x 8 
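
Because StratSub stratifies on the outcome (the last column, per the log message above), the two sets should show similar body_mass_g distributions. A quick sanity check with base R:

summary(dat_training$body_mass_g)
summary(dat_test$body_mass_g)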

As an example, we’ll train a LightRF model:

mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
2025-10-18 19:01:46  [train]
2025-10-18 19:01:46 Training set: 254 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:46     Test set: 88 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:46 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:46 Training LightRF Regression... [train]
2025-10-18 19:01:46 Checking data is ready for training... [check_supervised]

<rt Regression>
LightRF (LightGBM Random Forest)

  <rt Training Regression Metrics>
     MAE: 304.63
     MSE: 153082.68
    RMSE: 391.26
     Rsq: 0.76

  <rt Test Regression Metrics>
     MAE: 323.89
     MSE: 173442.84
    RMSE: 416.46
     Rsq: 0.75

2025-10-18 19:01:47 Done in 0.71 seconds. [train]
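
As a quick sanity check on the printed metrics, RMSE is simply the square root of MSE:

sqrt(173442.84)  # test MSE from the printout
[1] 416.4647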

10.3.2 Describe model

describe(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.

10.3.3 Plot model

plot_true_pred(mod_lightrf)

10.3.4 Present model

The present() method for Supervised objects combines the describe() and plot() methods:

present(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.

10.3.5 Plot Variable Importance

plot_varimp(mod_lightrf)

10.4 Train on multiple training/test resamples

train() can just as easily train on multiple resamples, in which case it outputs an object of class RegressionRes for regression. All you need to do is specify the outer resampling configuration using the outer_resampling_config argument.

resmod_lightrf <- train(
  dat_training,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
2025-10-18 19:01:48  [train]
2025-10-18 19:01:48 Training set: 254 cases x 7 features. [summarize_supervised]
2025-10-18 19:01:48 // Max workers: 7 => Algorithm: 7; Tuning: 1; Outer Resampling: 1 [get_n_workers]
2025-10-18 19:01:48 <> Training LightRF Regression using 10-fold crossvalidation... [train]
2025-10-18 19:01:48 Input contains more than one column; stratifying on last. [resample]
2025-10-18 19:01:53 </> Outer resampling done. [train]
<rt Resampled Regression Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.

  <rt Resampled Regression Training Metrics>
    Showing mean (sd) across resamples.
     MAE: 305.568 (4.190)
     MSE: 155042.602 (4586.526)
    RMSE: 393.716 (5.830)
     Rsq: 0.753 (5e-03)

  <rt Resampled Regression Test Metrics>
    Showing mean (sd) across resamples.
     MAE: 320.186 (42.985)
     MSE: 171046.692 (56975.828)
    RMSE: 408.560 (67.701)
     Rsq: 0.731 (0.052)

2025-10-18 19:01:53 Done in 4.94 seconds. [train]

Now, train() produced a RegressionRes object:

class(resmod_lightrf)
[1] "rtemis::RegressionRes" "rtemis::SupervisedRes" "S7_object"            

10.4.1 Describe

describe(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds. 

10.4.2 Plot

plot_true_pred(resmod_lightrf)

10.4.3 Present

The present() method for RegressionRes objects combines the describe() and plot() methods:

present(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds. 