```r
library(rtemis)
```

```
.:rtemis 0.99.1000 🌊 aarch64-apple-darwin20
```

```r
library(data.table)
```

For this example, we shall use the `BreastCancer` dataset from the **mlbench** package:

```r
data(BreastCancer, package = "mlbench")
```

In rtemis, the last column is the outcome variable.

`train()` supports `data.frame`, `data.table`, or `tibble` inputs. We optionally convert the dataset to a `data.table`:
```r
dat <- as.data.table(BreastCancer)
dat
```

```
          Id Cl.thickness Cell.size Cell.shape Marg.adhesion Epith.c.size
      <char>        <ord>     <ord>      <ord>         <ord>        <ord>
  1: 1000025            5         1          1             1            2
  2: 1002945            5         4          4             5            7
  3: 1015425            3         1          1             1            2
  4: 1016277            6         8          8             1            3
  5: 1017023            4         1          1             3            2
 ---
695:  776715            3         1          1             1            3
696:  841769            2         1          1             1            2
697:  888820            5        10         10             3            7
698:  897471            4         8          6             4            3
699:  897471            4         8          8             5            4
     Bare.nuclei Bl.cromatin Normal.nucleoli Mitoses     Class
          <fctr>      <fctr>          <fctr>  <fctr>    <fctr>
  1:           1           3               1       1    benign
  2:          10           3               2       1    benign
  3:           2           3               1       1    benign
  4:           4           3               7       1    benign
  5:           1           3               1       1    benign
 ---
695:           2           1               1       1    benign
696:           1           1               1       1    benign
697:           3           8              10       2 malignant
698:           4          10               6       1 malignant
699:           5          10               4       1 malignant
```
Also optionally, we clean the dataset; in this case, we replace the periods in column names with underscores:

```r
dt_set_clean_all(dat)
dat
```

Note that `dt_*` functions operate on `data.table` objects, and `dt_set_*` functions modify their input in place.

`Class` is already the last column; otherwise, we could use `set_outcome()` to move it.
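If the outcome were not already last, a call along these lines should move it (a hedged sketch: `set_outcome()` is mentioned above, but its exact signature is an assumption here):

```r
# Assumed usage: pass the data and the outcome column name;
# set_outcome() is assumed to return the data with that column last.
dat <- set_outcome(dat, "Class")
```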
For classification, the outcome variable must be a factor. For binary classification, the second factor level is considered the positive case.
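We can verify the level order with base R, and reorder it with data.table if needed; in `BreastCancer`, "malignant" is already the second level and therefore the positive case:

```r
levels(dat$Class)
#> [1] "benign"    "malignant"

# If the positive class were not second, reorder the levels explicitly:
dat[, Class := factor(Class, levels = c("benign", "malignant"))]
```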
The first column, “Id”, is not a predictor, so we remove it:
```r
dat[, Id := NULL]
check_data(dat)
```

```
  dat: A data.table with 699 rows and 10 columns.

  Data types
  * 0 numeric features
  * 0 integer features
  * 10 factors, of which 5 are ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 236 duplicate cases
  * 1 feature includes 'NA' values; 16 'NA' values total
    * 1 factor

  Recommendations
  * Consider removing the duplicate cases.
  * Consider using algorithms that can handle missingness or imputing missing values.
```
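If we wanted to act on the first recommendation, plain data.table is enough: `unique()` drops duplicate rows. We skip this step here so the row counts below still match:

```r
# Optional: remove the duplicate cases flagged by check_data().
dat_dedup <- unique(dat)
nrow(dat_dedup)
```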
We use `resample()` to create a single stratified subsample, which will split the data into training and test sets:

```r
res <- resample(dat, setup_Resampler(1L, "StratSub"))
res
```

```
<rt StratSub Resampler>
  resamples:
    Subsample_1: <int> 1, 2, 3, 5...
  config:
    <rt StratSub ResamplerConfig>
      n: <int> 1
      train_p: <nmr> 0.75
      stratify_var: <NUL> NULL
      strat_n_bins: <int> 2
      id_strat: <NUL> NULL
      seed: <NUL> NULL
```
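The config above shows `seed: NULL`; for a reproducible split, pass a seed to `setup_Resampler()` (a hedged sketch, assuming the `seed` argument corresponds to the `seed` field shown in the config):

```r
# Assumption: setup_Resampler() accepts a seed matching the config field above.
res_seeded <- resample(dat, setup_Resampler(1L, "StratSub", seed = 2024L))
```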
We use the resulting indices to extract the training and test sets:

```r
dat_training <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
size(dat_training)
```

```
523 x 10
```

```r
size(dat_test)
```

```
176 x 10
```
Using LightRF as an example, we train a random forest model:
```r
mod_lightrf <- train(
  dat_training,
  dat_test = dat_test,
  algorithm = "LightRF"
)
```

```
<rt Classification>
LightRF (LightGBM Random Forest)

<rt Training Classification Metrics>
             Predicted
  Reference  malignant  benign
  malignant        162      18
  benign            10     333

                     Overall
  Sensitivity          0.900
  Specificity          0.971
  Balanced_Accuracy    0.935
  PPV                  0.942
  NPV                  0.949
  F1                   0.920
  Accuracy             0.946
  AUC                  0.985
  Brier_Score          0.072

  Positive Class   malignant

<rt Test Classification Metrics>
             Predicted
  Reference  malignant  benign
  malignant         55       6
  benign             3     112

                     Overall
  Sensitivity          0.902
  Specificity          0.974
  Balanced_Accuracy    0.938
  PPV                  0.948
  NPV                  0.949
  F1                   0.924
  Accuracy             0.949
  AUC                  0.991
  Brier_Score          0.072

  Positive Class   malignant
```
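As a sanity check, the test metrics above follow directly from the confusion matrix; for example, in base R:

```r
# Test-set confusion matrix values from the output above.
tp <- 55; fn <- 6   # malignant (positive) cases: correctly vs. incorrectly classified
fp <- 3;  tn <- 112 # benign (negative) cases: incorrectly vs. correctly classified

tp / (tp + fn)                         # Sensitivity: 0.902
tn / (tn + fp)                         # Specificity: 0.974
(tp / (tp + fn) + tn / (tn + fp)) / 2  # Balanced accuracy: 0.938
```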
`describe()` generates a plain-language summary of the model:

```r
describe(mod_lightrf)
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```

```r
plot_true_pred(mod_lightrf)
```

```r
plot_roc(mod_lightrf)
```

`present()` combines `describe()` and `plot()` or `plot_roc()` (the default):

```r
present(mod_lightrf)
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```

`type` defaults to `"ROC"`, but can be set to `"confusion"` to show the training and test confusion matrices side by side:

```r
present(mod_lightrf, type = "confusion")
```

```
LightGBM Random Forest was used for classification.
Balanced accuracy was 0.94 on the training set and 0.94 in the test set.
```
`plot_varimp()` plots variable importance:

```r
plot_varimp(mod_lightrf)
```

To get predictions on new data, we'll use the `dat_test` we created. Remember that if the dataset includes the outcome variable, it must be removed before predicting. You can either delete the column or use indexing to exclude it. rtemis includes a convenience function, `features()`, which excludes the last column of data.frames, data.tables, or tibbles:
```r
head(features(dat_test))
```

```
   Cl_thickness Cell_size Cell_shape Marg_adhesion Epith_c_size Bare_nuclei
          <ord>     <ord>      <ord>         <ord>        <ord>      <fctr>
1:            6         8          8             1            3           4
2:            1         1          1             1            1           1
3:            5         3          3             3            2           3
4:            8         7          5            10            7           9
5:            4         1          1             1            2           1
6:            1         1          1             1            2           1
   Bl_cromatin Normal_nucleoli Mitoses
        <fctr>          <fctr>  <fctr>
1:           3               7       1
2:           3               1       1
3:           4               4       1
4:           5               5       4
5:           3               1       1
6:           3               1       1
```
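Equivalently, since `dat_test` is a `data.table`, you can exclude the outcome column yourself by indexing:

```r
# Deselect the outcome by name; equivalent to features(dat_test)
# here, because Class is the last column.
head(dat_test[, !"Class"])
```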
In binary classification, the output of `predict()` is a vector of probabilities for the positive class:
```r
pred <- predict(mod_lightrf, features(dat_test))
head(pred)
```

```
[1] 0.6461656 0.1560078 0.3754721 0.7414489 0.1512322 0.1510922
```
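If you need class labels rather than probabilities, one option is to threshold the positive-class probabilities yourself, e.g. at 0.5 (a plain-R sketch, not an rtemis API):

```r
# "malignant" is the positive class, i.e. the second factor level.
pred_class <- factor(
  ifelse(pred >= 0.5, "malignant", "benign"),
  levels = c("benign", "malignant")
)
table(Predicted = pred_class, Reference = dat_test$Class)
```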
To train on multiple resamples, we use the `outer_resampling_config` argument:
```r
resmod_lightrf <- train(
  dat_training,
  algorithm = "LightRF",
  outer_resampling_config = setup_Resampler(n_resamples = 10L, type = "KFold")
)
```

```
<rt Resampled Classification Model>
LightRF (LightGBM Random Forest)
⟳ Tested using 10-fold crossvalidation.

<rt Resampled Classification Training Metrics>
Showing mean (sd) across resamples.
        Sensitivity: 0.895 (0.009)
        Specificity: 0.971 (4.6e-03)
  Balanced_Accuracy: 0.933 (0.005)
                PPV: 0.942 (0.009)
                NPV: 0.946 (4.2e-03)
                 F1: 0.918 (0.007)
           Accuracy: 0.945 (4.4e-03)
                AUC: 0.984 (2.6e-03)
        Brier_Score: 0.076 (2.2e-03)

<rt Resampled Classification Test Metrics>
Showing mean (sd) across resamples.
        Sensitivity: 0.878 (0.086)
        Specificity: 0.971 (0.041)
  Balanced_Accuracy: 0.924 (0.053)
                PPV: 0.944 (0.077)
                NPV: 0.939 (0.043)
                 F1: 0.908 (0.068)
           Accuracy: 0.939 (0.046)
                AUC: 0.984 (0.026)
        Brier_Score: 0.079 (0.016)
```
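For reference, the Brier score reported above is the mean squared difference between the predicted positive-class probability and the 0/1 outcome. Computed by hand for the earlier single train/test split, it matches the `Brier_Score` of 0.072 reported for `mod_lightrf`:

```r
# Brier score: mean squared error of predicted probabilities
# against the 0/1 outcome (1 = "malignant", the positive class).
y <- as.integer(dat_test$Class == "malignant")
mean((pred - y)^2)
```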
Now, `train()` produced a `ClassificationRes` object:
```r
class(resmod_lightrf)
```

```
[1] "rtemis::ClassificationRes" "rtemis::SupervisedRes"
[3] "S7_object"
```
```r
describe(resmod_lightrf)
```

```
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds.
```
The `plot()` method for `ClassificationRes` objects plots boxplots of the training and test set metrics:

```r
plot_true_pred(resmod_lightrf)
```

The `present()` method for `ClassificationRes` objects combines the `describe()` and `plot()` methods:

```r
present(resmod_lightrf)
```

```
LightGBM Random Forest was used for classification. Mean balanced accuracy was 0.93 in the training set and 0.92 in the test set across 10 independent folds.
```