library(rtemis)
.:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
library(data.table)

As an example, we will use the penguins dataset from the palmerpenguins package.
For regression, we will predict body_mass_g from the other features.
data(penguins, package = "palmerpenguins")
dat <- penguins

In rtemis, the last column is the outcome variable.
We optionally convert the dataset to a data.table and inspect it:
dat <- as.data.table(dat)
str(dat)
Classes 'data.table' and 'data.frame': 344 obs. of 8 variables:
$ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
$ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
$ bill_length_mm : num 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
$ bill_depth_mm : num 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
$ flipper_length_mm: int 181 186 195 NA 193 190 181 195 193 190 ...
$ body_mass_g : int 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
$ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
$ year : int 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
- attr(*, ".internal.selfref")=<externalptr>
Finally, we use set_outcome to move “body_mass_g” to the last column, thereby making it the outcome variable:
dat <- set_outcome(dat, "body_mass_g")
dat
        species    island bill_length_mm bill_depth_mm flipper_length_mm    sex
<fctr> <fctr> <num> <num> <int> <fctr>
1: Adelie Torgersen 39.1 18.7 181 male
2: Adelie Torgersen 39.5 17.4 186 female
3: Adelie Torgersen 40.3 18.0 195 female
4: Adelie Torgersen NA NA NA <NA>
5: Adelie Torgersen 36.7 19.3 193 female
---
340: Chinstrap Dream 55.8 19.8 207 male
341: Chinstrap Dream 43.5 18.1 202 female
342: Chinstrap Dream 49.6 18.2 193 male
343: Chinstrap Dream 50.8 19.0 210 male
344: Chinstrap Dream 50.2 18.7 198 female
year body_mass_g
<int> <int>
1: 2007 3750
2: 2007 3800
3: 2007 3250
4: 2007 NA
5: 2007 3450
---
340: 2009 4000
341: 2009 3400
342: 2009 3775
343: 2009 4100
344: 2009 3775
check_data(dat)
dat: A data.table with 344 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 5 features include 'NA' values; 19 'NA' values total
* 1 factor; 2 integer; 2 numeric
* 2 missing values in the last column
Recommendations
* Consider using algorithms that can handle missingness or imputing missing values.
* Filter cases with missing values in the last column if using dataset for supervised learning.
There are 2 missing values in our chosen outcome, body_mass_g. As suggested, we must filter out these rows before training a model.
dat <- dat[!is.na(body_mass_g)]

Let’s verify the last column has no missing values now:
check_data(dat)
dat: A data.table with 342 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 1 feature includes 'NA' values; 9 'NA' values total
* 1 factor
Recommendations
* Consider using algorithms that can handle missingness or imputing missing values.
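The remaining 9 missing values are in a single factor feature (sex). LightGBM-based learners like LightRF can typically handle missing feature values natively, so we leave them in place. If you preferred to impute instead, a minimal sketch using plain data.table (a hypothetical helper, not an rtemis API; median for numeric columns, most frequent level for factors) could look like this:

# Hypothetical helper: simple median/mode imputation with data.table.
# Not needed here, since LightRF tolerates missing feature values.
impute_simple <- function(DT) {
  for (col in names(DT)) {
    idx <- which(is.na(DT[[col]]))
    if (length(idx) == 0L) next
    if (is.numeric(DT[[col]])) {
      # Numeric/integer columns: fill with the column median.
      val <- median(DT[[col]], na.rm = TRUE)
      if (is.integer(DT[[col]])) val <- as.integer(round(val))
      set(DT, i = idx, j = col, value = val)
    } else if (is.factor(DT[[col]])) {
      # Factors: fill with the most frequent level.
      set(DT, i = idx, j = col, value = names(which.max(table(DT[[col]]))))
    }
  }
  DT[]
}
# set() modifies by reference, so work on a copy:
# dat_imputed <- impute_simple(copy(dat))

Here, we proceed without imputing.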
Next, we create a single stratified subsample to split the data into training and test cases:

res <- resample(dat, setup_Resampler(1L, "StratSub"))
res
<rt StratSub Resampler>
resamples:
Subsample_1: <int> 1, 2, 3, 4...
config:
<rt StratSub ResamplerConfig>
n: <int> 1
train_p: <nmr> 0.75
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
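The config above shows the defaults: a 75% training split (train_p: 0.75) and no fixed seed. Assuming setup_Resampler accepts the fields shown in the printed config as arguments (check ?setup_Resampler to confirm), a reproducible 80/20 split might look like:

# Hypothetical variation: 80/20 stratified split with a fixed seed.
# train_p and seed are assumed to be accepted directly by setup_Resampler.
res80 <- resample(dat, setup_Resampler(1L, "StratSub", train_p = 0.8, seed = 2024L))

Continuing with the default 75% split, we extract the training and test sets: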
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
254 x 8
size(dat_test)
88 x 8
As an example, we’ll train a LightRF model:
mod_lightrf <- train(
dat_training,
dat_test = dat_test,
algorithm = "LightRF"
)
<rt Regression>
LightRF (LightGBM Random Forest)
<rt Training Regression Metrics>
MAE: 304.63
MSE: 153082.68
RMSE: 391.26
Rsq: 0.76
<rt Test Regression Metrics>
MAE: 323.89
MSE: 173442.84
RMSE: 416.46
Rsq: 0.75
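These metrics follow the usual definitions; for instance, RMSE is simply the square root of MSE:

# Sanity check: RMSE = sqrt(MSE).
sqrt(173442.84)  # ~416.46, matching the test RMSE above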
describe(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.
plot_true_pred(mod_lightrf)

The present() method for Supervised objects combines the describe() and plot() functions:
present(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.76 in the training set and 0.75 in the test.
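Beyond the built-in summaries, you will often want raw predictions for new data. A minimal sketch, assuming the standard predict() generic is implemented for rtemis supervised models (common for R model objects, but verify in the rtemis documentation):

# Assumed usage: predict() returns predicted outcome values for new data.
predicted_mass <- predict(mod_lightrf, dat_test)
head(predicted_mass)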
We can also plot variable importance:

plot_varimp(mod_lightrf)

train() can just as easily train on multiple resamples, which produces an object of class RegressionRes for regression. All you need to do is specify the outer resampling configuration using the outer_resampling_config argument:
resmod_lightrf <- train(
dat_training,
algorithm = "LightRF",
outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
<rt Resampled Regression Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.
<rt Resampled Regression Training Metrics>
Showing mean (sd) across resamples.
MAE: 305.568 (4.190)
MSE: 155042.602 (4586.526)
RMSE: 393.716 (5.830)
Rsq: 0.753 (5e-03)
<rt Resampled Regression Test Metrics>
Showing mean (sd) across resamples.
MAE: 320.186 (42.985)
MSE: 171046.692 (56975.828)
RMSE: 408.560 (67.701)
Rsq: 0.731 (0.052)
Now, train() produced a RegressionRes object:
class(resmod_lightrf)
[1] "rtemis::RegressionRes" "rtemis::SupervisedRes" "S7_object"
describe(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds.
plot_true_pred(resmod_lightrf)

The present() method for RegressionRes objects combines the describe() and plot() methods:
present(resmod_lightrf)
LightGBM Random Forest was used for regression. Mean R-squared was 0.75 on the training set and 0.73 on the test set across 10 independent folds.