The heart is one of the most important organs in your body, if not the most important one. Taking good care of it helps prevent cardiovascular disease. Today, I’m going to explore the use of scikit-learn’s different scalers on the UCI heart disease dataset in an effort to improve machine learning performance and accuracy. According to the scikit-learn website, scalers can have an effect on data with outliers. Features within a dataset can have different ranges and characteristics, which in turn can degrade the predictive performance of machine learning algorithms.
The UCI Heart Disease Dataset
The published dataset has 14 features, whose meanings and types are listed as follows:
- (age) — Age in years
- (sex) — (Male = 1 or Female=0)
- (cp) — (Chest Pain Type=(1,2,3,4) )
- (trestbps) — (Resting Blood Pressure )
- (chol) — (Serum Cholesterol)
- (fbs) — (Fasting Blood Sugar > 120 mg/dl True =1 , False =0)
- (restecg) — (Resting electrocardiographic results: Normal = 0, ST-T wave abnormality = 1, Probable left ventricular hypertrophy = 2)
- (thalach) — (Max. Heart rate achieved )
- (exang) — (Exercise induced angina)
- (oldpeak) — (ST depression induced by exercise relative to rest)
- (slope) — ( the slope of the peak exercise ST segment)
- (ca) — (Number of major vessels (0–3) colored by fluoroscopy)
- (thal) — (3 = normal; 6 = fixed defect; 7 = reversible defect)
- (num) — (Diagnosis of heart disease; the predicted attribute)
Let’s start by importing the heart data:
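A minimal sketch of the loading step, assuming a local copy of the processed Cleveland data saved as heart.csv (the file name and the absence of a header row are assumptions; the column names follow the 14 attributes listed above):

```python
import pandas as pd

# The 14 attributes listed above, in the order used by the processed
# Cleveland file (assumed to ship without a header row).
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
           'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

# 'heart.csv' is a placeholder path for a local copy of the dataset.
heart = pd.read_csv('heart.csv', names=columns)
print(heart.head())
```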
We use the pandas describe function to get some understanding of the basic features in the data. This allows us to infer and summarize different measures.
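Continuing with the heart DataFrame loaded above:

```python
# count, mean, std, min, quartiles, and max for every numeric column
print(heart.describe())
```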
We can see that the describe function presents the statistical measures for each feature in a manageable format.
In order to select the right scaling or standardizing technique for our data, we need to investigate the distribution of the different features. But before that, it’s important to define what we mean when we scale or standardize the data.
For example, scaling changes the range of the data but not its shape, while standardization changes the values so that the distribution’s mean is close to zero and its standard deviation is close to one.
Feature Scaling Methods
In this post we explore four feature scaling techniques implemented in scikit-learn: StandardScaler, MinMaxScaler, RobustScaler, and Normalizer. We will apply them to the following three features, sex, cp, and fbs, and then plot how each scaler affects the feature distributions. Most values of these features lie within the [0, 1] range, except cp, whose values lie in the [1, 4] range.
In this example we will use different scalers, transformers, and normalizers to bring the data within a pre-defined range and compare their effects on the data.
Standardization (the StandardScaler) is a scaling technique that transforms each feature so that its distribution has a mean of zero and a standard deviation of one. Please note that because it relies on the mean and standard deviation, standardization is sensitive to outliers.
The code to implement this is shown below:
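A minimal sketch with scikit-learn’s StandardScaler, reusing the heart DataFrame and plotting the three features chosen above:

```python
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

features = ['sex', 'cp', 'fbs']

# Remove the mean and scale each feature to unit variance.
standardized = pd.DataFrame(
    StandardScaler().fit_transform(heart[features]), columns=features)

# Compare the distributions before and after standardization.
heart[features].hist(figsize=(9, 3), layout=(1, 3))
standardized.hist(figsize=(9, 3), layout=(1, 3))
plt.show()
```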
The MinMaxScaler technique scales the range of each feature to the [0, 1] range (by default).
The code to implement this is available below:
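A corresponding sketch with the MinMaxScaler:

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale each feature to the default [0, 1] range.
minmax_scaled = pd.DataFrame(
    MinMaxScaler().fit_transform(heart[features]), columns=features)
print(minmax_scaled.describe())
```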
Applying the RobustScaler to the heart dataset scales the data using statistics that are robust to outliers: each feature is centered on its median and scaled by its interquartile range, which reduces the influence of extreme values in features with a wide value range.
The code to implement this is available below:
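A corresponding sketch with the RobustScaler:

```python
from sklearn.preprocessing import RobustScaler

# Center on the median and scale by the interquartile range, so that
# outliers have far less influence than with StandardScaler.
robust_scaled = pd.DataFrame(
    RobustScaler().fit_transform(heart[features]), columns=features)
print(robust_scaled.describe())
```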
According to the scikit-learn documentation, the Normalizer acts row-wise on the data. It does not remove the mean or scale by the standard deviation; instead, it rescales each whole row to unit norm. This can be seen in the second plot generated by applying the Normalizer to the dataset.
The code for this method can be seen below:
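A corresponding sketch with the Normalizer:

```python
from sklearn.preprocessing import Normalizer

# Rescale each row (sample) to unit norm; columns are not treated
# independently, unlike the scalers above.
normalized = pd.DataFrame(
    Normalizer().fit_transform(heart[features]), columns=features)
print(normalized.describe())
```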
Machine Learning Algorithms Performance
Now let’s compare the effect of using a scaler versus not using one on different machine learning algorithms, and select the most accurate model.
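A sketch of the comparison using 10-fold cross-validation; the model abbreviations match the output below. The exact cross-validation settings are assumptions, and the missing values in ca and thal are assumed to have been cleaned up beforehand:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# 'num' is the diagnosis label; everything else is a predictor.
X = heart.drop('num', axis=1)
y = heart['num']

models = [('LR', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
for name, model in models:
    # Accuracy on the raw features.
    raw = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
    # Accuracy with the RobustScaler fitted inside each training fold.
    scaled = cross_val_score(make_pipeline(RobustScaler(), model),
                             X, y, cv=kfold, scoring='accuracy')
    print('%s: %f (%f) -> %f (%f)'
          % (name, raw.mean(), raw.std(), scaled.mean(), scaled.std()))
```

Putting the scaler inside a pipeline ensures it is fitted only on each training fold, which avoids leaking information from the test fold into the scaling statistics.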
ML Algorithms Before Using the Robust Scaler:
LR: 0.690833 (0.067955)
LDA: 0.666333 (0.073295)
KNN: 0.681500 (0.063204)
CART: 0.731333 (0.059758)
NB: 0.670500 (0.085248)
SVM: 0.764167 (0.076939)
ML Algorithms After Using the Robust Scaler:
LR: 0.693833 (0.090419)
LDA: 0.702167 (0.074633)
KNN: 0.698833 (0.066186)
CART: 0.752000 (0.082774)
NB: 0.681500 (0.084217)
SVM: 0.743667 (0.100755)
We can see that most of the models’ scores increased after using the RobustScaler, except for the support vector classification (SVM) model, whose score decreased.
We can deduce the following from the feature scaling methods applied above:
- Scaling is an important technique commonly used when fitting data to machine learning models, since data often consists of many features with different ranges of values or measures. If the data is not scaled, features with high values can dominate and hurt the accuracy and performance of your machine learning algorithm.
- Compared with the other scaling techniques, using the RobustScaler on the heart dataset reduced the influence of outliers and standardized the distribution.
- Descriptive statistics should be used to infer information about the data, such as the range, standard deviation, and mean. This will aid you in optimizing your scaler ranges and help you select the best scaling approach.
References
1. UCI Machine Learning Repository: Heart Disease Data Set: https://archive.ics.uci.edu/ml/datasets/Heart+Disease
2. Compare the effect of different scalers on data with outliers: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py