After reading in a dataset, check it before doing anything else. The check_data() function prints a summary of the dataset, along with recommendations where applicable:
iris: A data.table with 150 rows and 5 columns.
Data types
* 4 numeric features
* 0 integer features
* 1 factor, which is not ordered
* 0 character features
* 0 date features
Issues
* 0 constant features
* 1 duplicate case
* 0 missing values
Recommendations
* Consider removing the duplicate case
It turns out the popular iris dataset contains one duplicate row.
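The counts in the report can be cross-checked with base R. The following is a minimal sketch that does not use check_data() itself. It assumes the built-in iris data frame, and the variable names (is_dup, n_missing, and so on) are illustrative only:

```r
# Base R only; iris ships with R in the datasets package
data(iris)

# Logical vector: TRUE for each row that repeats an earlier row
is_dup <- duplicated(iris)
n_dup <- sum(is_dup)      # number of duplicate cases
dup_idx <- which(is_dup)  # row index of each duplicate

# Show the duplicate together with the row it repeats
iris[duplicated(iris) | duplicated(iris, fromLast = TRUE), ]

# Two other checks from the report: missing values and constant features
n_missing <- sum(is.na(iris))
n_constant <- sum(sapply(iris, function(x) length(unique(x)) == 1))

# Drop the duplicate, keeping the first occurrence
iris_unique <- unique(iris)
nrow(iris_unique)
```

Whether to drop the duplicate depends on the data. Two flowers can legitimately share the same measurements, so the recommendation is something to consider, not a rule.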
Always ask for a data dictionary when you are given a dataset to analyze. One may not always be available, in which case you need to investigate further to understand the data and assign the correct type to each feature.
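As a rough illustration of that investigation, the sketch below inspects the type of each column and then converts the columns that were read in with the wrong type. The data frame dat, its column names, and the meaning given to the 0/1 codes are all hypothetical. In practice the coding has to come from whoever supplied the data:

```r
# Hypothetical data as it might arrive without a data dictionary
dat <- data.frame(
  id = c("001", "002", "003"),
  sex = c(0, 1, 1),
  visit_date = c("2021-03-01", "2021-04-15", "2021-05-20"),
  stringsAsFactors = FALSE
)

# Inspect the type each column was read in as
types_before <- sapply(dat, class)
str(dat)

# sex is stored as numeric but is really categorical.
# The labels below are an assumption made for this example.
dat$sex <- factor(dat$sex, levels = c(0, 1), labels = c("female", "male"))

# visit_date is stored as character but holds ISO 8601 dates
dat$visit_date <- as.Date(dat$visit_date, format = "%Y-%m-%d")

# id stays character: it is a label, not a quantity
types_after <- sapply(dat, class)
```

Checking the types again after conversion confirms that each feature is stored the way it will be analyzed.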