library(rtemis)
.:rtemis 0.99.94 🌊 aarch64-apple-darwin20
library(data.table)
As an example, we will use the penguins dataset from the palmerpenguins package. For regression, we will predict body_mass_g from the other features.
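If the palmerpenguins package is not already installed, it is available from CRAN:

install.packages("palmerpenguins")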
data(penguins, package = "palmerpenguins")
dat <- penguins
In rtemis, the last column is the outcome variable. We optionally convert the dataset to a data.table:
dat <- as.data.table(dat)
Finally, we set body_mass_g as the outcome by moving it to the last column:
dat <- set_outcome(dat, "body_mass_g")
dat
species island bill_length_mm bill_depth_mm flipper_length_mm sex
<fctr> <fctr> <num> <num> <int> <fctr>
1: Adelie Torgersen 39.1 18.7 181 male
2: Adelie Torgersen 39.5 17.4 186 female
3: Adelie Torgersen 40.3 18.0 195 female
4: Adelie Torgersen NA NA NA <NA>
5: Adelie Torgersen 36.7 19.3 193 female
---
340: Chinstrap Dream 55.8 19.8 207 male
341: Chinstrap Dream 43.5 18.1 202 female
342: Chinstrap Dream 49.6 18.2 193 male
343: Chinstrap Dream 50.8 19.0 210 male
344: Chinstrap Dream 50.2 18.7 198 female
year body_mass_g
<int> <int>
1: 2007 3750
2: 2007 3800
3: 2007 3250
4: 2007 NA
5: 2007 3450
---
340: 2009 4000
341: 2009 3400
342: 2009 3775
343: 2009 4100
344: 2009 3775
We can inspect the dataset with check_data(), which summarizes data types and flags potential issues:
check_data(dat)
dat: A data.table with 344 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 5 features include 'NA' values; 19 'NA' values total
* 1 factor; 2 integer; 2 numeric
* 2 missing values in the last column
Recommendations
* Consider imputing missing values or using algorithms that can handle missingness.
* Filter cases with missing values in the last column if using dataset for supervised learning.
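The first recommendation mentions imputation. A minimal sketch of median/mode imputation using plain data.table follows; this is generic R code, not an rtemis API, and dat_imp is an illustrative copy. The outcome column itself should not be imputed, which is why its missing cases are filtered out below instead.

dat_imp <- copy(dat)
for (j in setdiff(names(dat_imp), "body_mass_g")) {
  idx <- which(is.na(dat_imp[[j]]))
  if (length(idx) == 0L) next
  v <- dat_imp[[j]]
  # median for numeric columns, most frequent level otherwise
  fill <- if (is.numeric(v)) median(v, na.rm = TRUE) else names(which.max(table(v)))
  if (is.integer(v)) fill <- as.integer(round(fill))
  set(dat_imp, i = idx, j = j, value = fill)
}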
There are 2 missing values in our chosen outcome, body_mass_g. As recommended, we must filter out these rows before training a model.
dat <- dat[!is.na(body_mass_g)]
Let’s verify the last column has no missing values now:
check_data(dat)
dat: A data.table with 342 rows and 8 columns.
Data types
* 2 numeric features
* 3 integer features
* 3 factors, of which 0 are ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 1 feature includes 'NA' values; 9 'NA' values total
* 1 factor
Recommendations
* Consider imputing missing values or using algorithms that can handle missingness.
Next, we split the data into training and test sets using a single stratified subsample:
res <- resample(dat, setup_Resampler(1L, "StratSub"))
res
StratSub Resampler
resamples:
Subsample_1: <int> 1, 2, 3, 4...
parameters:
StratSub ResamplerParameters
n: <int> 1
train_p: <nmr> 0.75
stratify_var: <NUL> NULL
strat_n_bins: <int> 4
id_strat: <NUL> NULL
seed: <NUL> NULL
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
254 x 8
size(dat_test)
88 x 8
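As a quick sanity check, the training fraction should be close to the train_p = 0.75 shown in the resampler parameters above:

# proportion of cases assigned to training: 254 / 342 ~ 0.74
nrow(dat_training) / nrow(dat)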
As an example, we’ll train a LightRF model:
mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
Input data summary:
Training set: 254 cases x 7 features.
Test set: 88 cases x 7 features.
.:Regression Model
LightRF (LightGBM Random Forest)
Training Regression Metrics
MAE: 295.27
MSE: 141846.43
RMSE: 376.63
Rsq: 0.78
Test Regression Metrics
MAE: 327.64
MSE: 199950.02
RMSE: 447.16
Rsq: 0.70
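These metrics are related in the usual way: RMSE is the square root of MSE, and Rsq is 1 minus MSE divided by the variance of the observed outcome (up to the exact variance scaling used). A quick arithmetic check against the test metrics above:

# the square root of the test MSE reproduces the reported test RMSE
sqrt(199950.02)  # 447.16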
describe(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.78 in the training set and 0.70 in the test.
plot(mod_lightrf)
The present() method for Supervised objects combines the describe() and plot() functions:
present(mod_lightrf)
LightGBM Random Forest was used for regression.
R-squared was 0.78 in the training set and 0.70 in the test.
Variable importance can be plotted with plot_varimp():
plot_varimp(mod_lightrf)
train() can just as easily train on multiple resamples, in which case it outputs an object of class RegressionRes for regression. All you need to do is specify the outer resampling parameters using the outer_resampling argument.
resmod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF",
  outer_resampling = setup_Resampler(n_resamples = 10L, type = "KFold")
)
Input data summary:
Training set: 254 cases x 7 features.
Test set: 88 cases x 7 features.
.:Resampled Regression Model
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.
Resampled Regression Training Metrics
Showing mean (sd) across resamples.
MAE: 296.891 (3.776)
MSE: 144150.410 (3923.628)
RMSE: 379.640 (5.168)
Rsq: 0.772 (0.008)
Resampled Regression Test Metrics
Showing mean (sd) across resamples.
MAE: 308.825 (30.375)
MSE: 157438.441 (28496.890)
RMSE: 395.409 (34.801)
Rsq: 0.746 (0.055)
Now, train() produced a RegressionRes object:
class(resmod_lightrf)
[1] "rtemis::RegressionRes" "rtemis::SupervisedRes" "S7_object"
describe(resmod_lightrf)
LightGBM Random Forest was used for regression. R-squared was 0.77 on the training set and 0.75 on the test set across 10 independent folds.
plot(resmod_lightrf)
The present() method for RegressionRes objects combines the describe() and plot() methods:
present(resmod_lightrf)
LightGBM Random Forest was used for regression. R-squared was 0.77 on the training set and 0.75 on the test set across 10 independent folds.
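The same pattern works with other outer resamplers. For example, here is a sketch using stratified subsampling instead of k-fold cross-validation, reusing only the setup_Resampler() arguments demonstrated above; resmod_ss is an illustrative name, and n_resamples = 25L is an assumed, plausible value:

resmod_ss <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF",
  outer_resampling = setup_Resampler(n_resamples = 25L, type = "StratSub")
)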