5 Preprocess
Data preprocessing is an important step in data pipelines.
Let’s start with the Sonar dataset and introduce some missing values for this example.
data(Sonar, package = "mlbench")
dat <- Sonar
dat[c(10, 20, 30, 40, 50), 1] <- NA
dat[c(15, 25, 35, 45, 55), 2] <- NA
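To confirm that the assignments put NA values where intended, one can count missing values per column with base R. The sketch below uses a small made-up data frame (`toy`, not part of the original example) so that it runs without mlbench; the same calls should apply to dat.

```r
# Toy stand-in for `dat` (hypothetical data, for illustration only)
toy <- data.frame(
  V1 = seq(0.01, 0.60, length.out = 60),
  V2 = seq(0.60, 0.01, length.out = 60)
)

# Introduce missing values the same way as in the example
toy[c(10, 20, 30, 40, 50), 1] <- NA
toy[c(15, 25, 35, 45, 55), 2] <- NA

# Count NA values per column and locate the affected rows in V1
na_per_column <- colSums(is.na(toy))
na_rows_v1 <- which(is.na(toy$V1))
print(na_per_column)
print(na_rows_v1)
```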
5.1 Check data
To check your data, use the check_data() function:
check_data(dat)
dat: A data.table with 208 rows and 61 columns.
Data types
* 60 numeric features
* 0 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 2 features include 'NA' values; 10 'NA' values total
* 2 numeric
Recommendations
* Consider imputing missing values or using algorithms that can handle missingness.
The output produces a list of useful information about your dataset, followed by recommendations, if any.
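To make the reported items concrete, here is a rough base R approximation of a few of these checks. It is only a sketch: `summarize_checks` is a hypothetical helper, not part of rtemis, and it does not claim to reproduce how check_data() computes its report.

```r
# Hypothetical helper: a rough base R approximation of a few of the checks
summarize_checks <- function(x) {
  stopifnot(is.data.frame(x))
  # A feature is constant if it has at most one distinct non-missing value
  is_constant <- vapply(
    x,
    function(col) length(unique(col[!is.na(col)])) <= 1,
    logical(1)
  )
  na_counts <- colSums(is.na(x))
  list(
    n_rows = nrow(x),
    n_cols = ncol(x),
    n_numeric = sum(vapply(x, is.double, logical(1))),
    n_integer = sum(vapply(
      x,
      function(col) is.integer(col) && !is.factor(col),
      logical(1)
    )),
    n_factor = sum(vapply(x, is.factor, logical(1))),
    n_constant = sum(is_constant),
    n_duplicate_cases = sum(duplicated(x)),
    n_features_with_na = sum(na_counts > 0),
    n_na_total = sum(na_counts)
  )
}

# Toy data (made up for illustration)
toy <- data.frame(
  a = c(0.1, 0.2, NA, 0.4),
  b = c(1L, 1L, 1L, 1L),
  g = factor(c("R", "M", "R", "M"))
)
res <- summarize_checks(toy)
str(res)
```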
5.2 Preprocess
To clean and preprocess the data, use the preprocess() function. In this case, we want to impute missing data. When imputation is enabled, preprocess() by default uses the missRanger package, which iteratively predicts missing values from the available data using random forests.
dat_pre <- preprocess(
  dat,
  parameters = setup_Preprocessor(impute = TRUE)
)
Variables to impute: V1, V2
Variables used to impute: V1, V2, V3, V4, V5, V6, V7, V8, V9, V10, V11, V12, V13, V14, V15, V16, V17, V18, V19, V20, V21, V22, V23, V24, V25, V26, V27, V28, V29, V30, V31, V32, V33, V34, V35, V36, V37, V38, V39, V40, V41, V42, V43, V44, V45, V46, V47, V48, V49, V50, V51, V52, V53, V54, V55, V56, V57, V58, V59, V60, Class
iter 1
|======================================================================| 100%
iter 2
|======================================================================| 100%
iter 3
|======================================================================| 100%
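For orientation, the same kind of imputation can be run by calling missRanger directly. This is a sketch and not necessarily how preprocess() invokes it internally: the argument values mirror the impute_missRanger_params printed with the Preprocessor object (pmm.k = 3, maxiter = 10, num.trees = 500), while the toy data, the seed, and verbose = 0 are assumptions made for this example. The call is guarded so the snippet still runs if missRanger is not installed.

```r
# Toy data with a few missing values (hypothetical, for illustration only)
set.seed(2024)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$x3 <- toy$x1 + rnorm(100, sd = 0.1)
toy$x1[c(10, 20, 30)] <- NA

# Run the imputation only if missRanger is available
if (requireNamespace("missRanger", quietly = TRUE)) {
  toy_imputed <- missRanger::missRanger(
    toy,
    pmm.k = 3,        # predictive mean matching with 3 candidate donors
    num.trees = 500,  # trees per random forest
    maxiter = 10,     # upper bound on the number of iterations
    seed = 2024,
    verbose = 0       # suppress the progress output
  )
  n_missing_after <- sum(is.na(toy_imputed))
} else {
  n_missing_after <- NA_integer_
}
print(n_missing_after)
```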
preprocess() returns a Preprocessor S7 object, which holds the preprocessed data along with additional information about the preprocessing steps taken.
class(dat_pre)
[1] "rtemis::Preprocessor" "S7_object"
Printing the object gives you a look at its structure:
dat_pre
Preprocessor
parameters:
PreprocessorParameters
Showing first 12 of 39 items.
complete_cases: <lgc> FALSE
remove_features_thres: <NUL> NULL
remove_cases_thres: <NUL> NULL
missingness: <lgc> FALSE
impute: <lgc> TRUE
impute_type: <chr> missRanger
impute_missRanger_params:
pmm.k: <nmr> 3.00
maxiter: <nmr> 10.00
num.trees: <nmr> 500.00
impute_discrete: <chr> get_mode
impute_continuous: <chr> mean
integer2factor: <lgc> FALSE
integer2numeric: <lgc> FALSE
logical2factor: <lgc> FALSE
... 27 more items not shown.
preprocessed:
data.frame with 208 rows and 61 columns.
values:
scale_centers: <NUL> NULL
scale_coefficients: <NUL> NULL
one_hot_levels: <NUL> NULL
remove_features: <NUL> NULL
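The printed parameters also list impute_discrete: get_mode and impute_continuous: mean. When rtemis applies these is not shown here; as an assumption, they describe a simpler strategy that fills numeric columns with the column mean and discrete columns with the most frequent value. A base R sketch of that idea follows, with hypothetical helper names (`most_frequent`, `impute_simple`) that are not part of rtemis:

```r
# Hypothetical helper: most frequent non-missing value, returned as a string
most_frequent <- function(x) {
  tab <- table(x)  # table() ignores NA by default
  names(tab)[which.max(tab)]
}

# Hypothetical helper: mean imputation for numeric columns,
# mode imputation for everything else
impute_simple <- function(df) {
  for (nm in names(df)) {
    missing <- is.na(df[[nm]])
    if (!any(missing)) next
    if (is.numeric(df[[nm]])) {
      df[[nm]][missing] <- mean(df[[nm]], na.rm = TRUE)
    } else {
      df[[nm]][missing] <- most_frequent(df[[nm]])
    }
  }
  df
}

# Toy data (made up for illustration)
toy <- data.frame(
  v = c(1, 2, NA, 3),
  g = factor(c("R", "R", NA, "M"))
)
toy_imputed <- impute_simple(toy)
print(toy_imputed)
```

Unlike the random forest approach, this ignores relationships between features, so it is best seen as a baseline.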
The preprocessed data is stored in the preprocessed field of the object. You can access it using dat_pre["preprocessed"] or dat_pre$preprocessed:
head(dat_pre["preprocessed"])
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 0.0200 0.0371 0.0428 0.0207 0.0954 0.0986 0.1539 0.1601 0.3109 0.2111 0.1609
2 0.0453 0.0523 0.0843 0.0689 0.1183 0.2583 0.2156 0.3481 0.3337 0.2872 0.4918
3 0.0262 0.0582 0.1099 0.1083 0.0974 0.2280 0.2431 0.3771 0.5598 0.6194 0.6333
4 0.0100 0.0171 0.0623 0.0205 0.0205 0.0368 0.1098 0.1276 0.0598 0.1264 0.0881
5 0.0762 0.0666 0.0481 0.0394 0.0590 0.0649 0.1209 0.2467 0.3564 0.4459 0.4152
6 0.0286 0.0453 0.0277 0.0174 0.0384 0.0990 0.1201 0.1833 0.2105 0.3039 0.2988
V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22
1 0.1582 0.2238 0.0645 0.0660 0.2273 0.3100 0.2999 0.5078 0.4797 0.5783 0.5071
2 0.6552 0.6919 0.7797 0.7464 0.9444 1.0000 0.8874 0.8024 0.7818 0.5212 0.4052
3 0.7060 0.5544 0.5320 0.6479 0.6931 0.6759 0.7551 0.8929 0.8619 0.7974 0.6737
4 0.1992 0.0184 0.2261 0.1729 0.2131 0.0693 0.2281 0.4060 0.3973 0.2741 0.3690
5 0.3952 0.4256 0.4135 0.4528 0.5326 0.7306 0.6193 0.2032 0.4636 0.4148 0.4292
6 0.4250 0.6343 0.8198 1.0000 0.9988 0.9508 0.9025 0.7234 0.5122 0.2074 0.3985
V23 V24 V25 V26 V27 V28 V29 V30 V31 V32 V33
1 0.4328 0.5550 0.6711 0.6415 0.7104 0.8080 0.6791 0.3857 0.1307 0.2604 0.5121
2 0.3957 0.3914 0.3250 0.3200 0.3271 0.2767 0.4423 0.2028 0.3788 0.2947 0.1984
3 0.4293 0.3648 0.5331 0.2413 0.5070 0.8533 0.6036 0.8514 0.8512 0.5045 0.1862
4 0.5556 0.4846 0.3140 0.5334 0.5256 0.2520 0.2090 0.3559 0.6260 0.7340 0.6120
5 0.5730 0.5399 0.3161 0.2285 0.6995 1.0000 0.7262 0.4724 0.5103 0.5459 0.2881
6 0.5890 0.2872 0.2043 0.5782 0.5389 0.3750 0.3411 0.5067 0.5580 0.4778 0.3299
V34 V35 V36 V37 V38 V39 V40 V41 V42 V43 V44
1 0.7547 0.8537 0.8507 0.6692 0.6097 0.4943 0.2744 0.0510 0.2834 0.2825 0.4256
2 0.2341 0.1306 0.4182 0.3835 0.1057 0.1840 0.1970 0.1674 0.0583 0.1401 0.1628
3 0.2709 0.4232 0.3043 0.6116 0.6756 0.5375 0.4719 0.4647 0.2587 0.2129 0.2222
4 0.3497 0.3953 0.3012 0.5408 0.8814 0.9857 0.9167 0.6121 0.5006 0.3210 0.3202
5 0.0981 0.1951 0.4181 0.4604 0.3217 0.2828 0.2430 0.1979 0.2444 0.1847 0.0841
6 0.2198 0.1407 0.2856 0.3807 0.4158 0.4054 0.3296 0.2707 0.2650 0.0723 0.1238
V45 V46 V47 V48 V49 V50 V51 V52 V53 V54 V55
1 0.2641 0.1386 0.1051 0.1343 0.0383 0.0324 0.0232 0.0027 0.0065 0.0159 0.0072
2 0.0621 0.0203 0.0530 0.0742 0.0409 0.0061 0.0125 0.0084 0.0089 0.0048 0.0094
3 0.2111 0.0176 0.1348 0.0744 0.0130 0.0106 0.0033 0.0232 0.0166 0.0095 0.0180
4 0.4295 0.3654 0.2655 0.1576 0.0681 0.0294 0.0241 0.0121 0.0036 0.0150 0.0085
5 0.0692 0.0528 0.0357 0.0085 0.0230 0.0046 0.0156 0.0031 0.0054 0.0105 0.0110
6 0.1192 0.1089 0.0623 0.0494 0.0264 0.0081 0.0104 0.0045 0.0014 0.0038 0.0013
V56 V57 V58 V59 V60 Class
1 0.0167 0.0180 0.0084 0.0090 0.0032 R
2 0.0191 0.0140 0.0049 0.0052 0.0044 R
3 0.0244 0.0316 0.0164 0.0095 0.0078 R
4 0.0073 0.0050 0.0044 0.0040 0.0117 R
5 0.0015 0.0072 0.0048 0.0107 0.0094 R
6 0.0089 0.0057 0.0027 0.0051 0.0062 R
In this case, we don’t need the rest of the object, so we can simply extract the preprocessed data:
dat_pre <- dat_pre["preprocessed"]
Let’s check the preprocessed data. The missing values should be imputed now:
check_data(dat_pre)
dat_pre: A data.table with 208 rows and 61 columns.
Data types
* 60 numeric features
* 0 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 0 duplicate cases
* 0 missing values
Recommendations
* Everything looks good
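Beyond confirming that no missing values remain, it can be useful to verify that imputation left the originally observed values untouched. The helper below, `imputation_is_consistent`, is a hypothetical base R sketch and not an rtemis function; it assumes both inputs are data frames with the same columns in the same order.

```r
# Hypothetical helper: TRUE if `imputed` has no NA values and agrees with
# `original` in every cell that was observed in `original`
imputation_is_consistent <- function(original, imputed) {
  stopifnot(
    identical(names(original), names(imputed)),
    nrow(original) == nrow(imputed)
  )
  unchanged <- vapply(names(original), function(nm) {
    observed <- !is.na(original[[nm]])
    all(original[[nm]][observed] == imputed[[nm]][observed])
  }, logical(1))
  !anyNA(imputed) && all(unchanged)
}

# Toy data (made up for illustration)
original <- data.frame(
  v = c(0.5, NA, 1.5),
  g = factor(c("R", "M", "R"))
)
good <- original
good$v[2] <- 1.0   # only the missing cell is filled
bad <- good
bad$v[1] <- 9.9    # an observed value was altered

ok_good <- imputation_is_consistent(original, good)
ok_bad <- imputation_is_consistent(original, bad)
print(c(good = ok_good, bad = ok_bad))
```

With the objects from this chapter, the corresponding call would be imputation_is_consistent(dat, dat_pre), after dat_pre has been replaced by the extracted data.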