Feature scaling is a data preprocessing step performed on numerical columns to bring them all onto the same scale. Suppose we have to build a classification model that predicts whether a customer will buy a specific product. Along with other variables, the data contains 'age' (15-80 years) and 'salary' (10,000-150,000). There is a huge difference between the scales of these two features.
Some machine learning algorithms (especially those that depend on distances or weights) are sensitive to the magnitude of features and can treat features with larger magnitudes as more important, which is not always true.
These algorithms include linear and logistic regression, artificial neural networks, anything trained with gradient descent, k-nearest neighbors, PCA, etc. Feature scaling is a must for these algorithms. Tree-based algorithms like decision trees and random forests don't require feature scaling.
Some of the most commonly used feature scaling techniques are given below. All of them can be implemented in Python using the preprocessing module of the scikit-learn library.
1. Standardization
It rescales a feature so that the new values follow a distribution with mean 0 and standard deviation 1: z = (x - mean) / standard deviation. It works best when the original distribution is roughly normal, and it doesn't change the shape of the distribution of the data. It is less affected by outliers than min-max scaling, although extreme values still influence the mean and standard deviation. StandardScaler() is used to implement this.
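Here's a minimal sketch of standardization, assuming a toy NumPy array with the age and salary columns from the example above (the values are made up for illustration):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: [age, salary]; values are hypothetical
X = np.array([[25, 30000],
              [40, 90000],
              [60, 150000]], dtype=float)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column now has mean 0 and std 1
print(X_scaled)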
2. Min-Max Scaling
It rescales a feature so that the new values lie between 0 and 1 (the default range): x_scaled = (x - min) / (max - min). A different range can be defined if required. It doesn't require the underlying distribution to be normal. This technique is sensitive to outliers, because a single extreme value stretches the min-max range; if there are outliers in the data, other techniques should be used. MinMaxScaler() is used to implement this.
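A quick sketch of min-max scaling on the same toy data, including the optional custom range:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 30000],
              [40, 90000],
              [60, 150000]], dtype=float)

scaler = MinMaxScaler()                         # default range (0, 1)
# scaler = MinMaxScaler(feature_range=(-1, 1))  # custom range if needed
X_scaled = scaler.fit_transform(X)              # (x - min) / (max - min) per column
print(X_scaled)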
3. Maximum Absolute Scaling
It is similar to min-max scaling, but it rescales the data between -1 and 1 by dividing each value by the maximum absolute value of the feature. Because it only divides and never shifts the data, zeros stay zeros, which is why it is recommended for sparse data. MaxAbsScaler() is used to implement this.
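A small sketch of maximum absolute scaling; the sparse matrix here is a made-up example just to show that zeros are preserved:

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Mostly-zero (sparse) toy data
X = csr_matrix([[0.0, -500.0],
                [2.0, 0.0],
                [-4.0, 1000.0]])

X_scaled = MaxAbsScaler().fit_transform(X)  # x / max(|x|) per column, values in [-1, 1]
print(X_scaled.toarray())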
4. Robust Scaling
This is preferred if there are outliers in the data. It uses the median and the interquartile range (IQR) instead of the mean, min, and max values, which are sensitive to outliers. RobustScaler() is used to implement this.
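A minimal sketch of robust scaling, with one made-up outlier to show that it barely distorts the other values:

import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy salaries with one extreme outlier
X = np.array([[30000], [40000], [50000], [60000], [1000000]], dtype=float)

X_scaled = RobustScaler().fit_transform(X)  # (x - median) / IQR per column
print(X_scaled)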
There are some other techniques as well, but I have covered only the ones I have used in the past.
What other feature scaling techniques have you used in your machine learning projects?
Curious about a specific AI/ML topic? Let me know in the comments.
Also, please share your feedback and suggestions; that will help me keep going. Even a “like” on my posts tells me they are helpful to you.
See you next Friday!
-Kavita
Quote of the day
“Experience is a hard teacher because she gives the test first, the lesson afterward.” ―Vernon Sanders Law
P.S. Let’s grow our tribe. Know someone who is curious to dive into ML and AI? Share this newsletter with them and invite them to be a part of this exciting learning journey.