Are You Also Making This Mistake With .info() Method In Pandas Library? #40
Hello Everyone,
Welcome to the 40th edition of my newsletter ML & AI Cupcakes!
Recently, I started working on the Telco Customer Churn project. This is an imbalanced classification problem where the objective is to predict churn and identify the patterns that drive it. Predicting and analyzing churn matters to telecommunication companies because it helps them develop retention strategies for their existing customers. Why is that important? Because retaining existing customers is cheaper than acquiring new ones.
In today’s newsletter, I’ll talk about a mistake I made while working on the Telco Customer Churn dataset. The purpose is to make you aware of this pitfall so that you can avoid it in your own data science and machine learning projects.
In the upcoming newsletters, I plan to share more of the mistakes, lessons and insights I come across while working on this project. They should give you practical tips, whether you’re a beginner, preparing for interviews or already working on real-world projects.
Let me know what you feel about this idea!
I downloaded the Telco Customer Churn dataset from Kaggle. The link is here:
As a preliminary check on the health of the dataset, I used the .info() method of the pandas library in Python. It prints a summary of the dataframe: column names, non-null counts, data types, memory usage and so on.
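Here is a minimal sketch of that check. It assumes a tiny made-up dataframe (the name toy and its values are mine, not the real Telco data) and simply shows where the non-null counts and data types come from.

```python
import io

import pandas as pd

# Hypothetical toy data, NOT the real Telco dataset.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "tenure": [1, 34, 2],
    "PaymentMethod": ["Mailed check", None, "Electronic check"],
})

# .info() prints to stdout by default; passing a buffer lets us keep the text.
buffer = io.StringIO()
toy.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)

# The same facts are also available as regular pandas objects.
non_null_counts = toy.count()  # non-null values per column
column_dtypes = toy.dtypes     # data type per column
print(non_null_counts)
print(column_dtypes)
```

In this toy example, only PaymentMethod has a lower non-null count, because it holds a real None. Reading the counts and the data types side by side is the habit this newsletter is about.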
I got the following output and it showed that there were no missing values.
But I was making a mistake here.
I was not paying attention to the data type of each column before concluding that there were no missing values.
I only realized this some time later.
There was a column called ‘TotalCharges’ in the dataframe. This column holds the total amount charged to the customer. From the description, it was clear that it should be a numeric column, but it was stored with the object data type.
So, I used the to_numeric function in pandas to convert this column into numeric form.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
The errors='coerce' setting makes sure that values which can’t be parsed as numbers are replaced with NaN (Not a Number) instead of raising an error.
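To see what this setting does, here is a small sketch on a made-up Series (the name raw and its values are assumptions for illustration, not rows from the dataset).

```python
import pandas as pd

# Hypothetical values: two valid numbers, one blank string, one word.
raw = pd.Series(["29.85", "1889.5", " ", "not available"])

# Anything that cannot be parsed as a number becomes NaN.
converted = pd.to_numeric(raw, errors="coerce")
n_missing = int(converted.isna().sum())

print(converted)
print("Missing after conversion:", n_missing)
```

The two entries that are not numbers turn into NaN, so they finally show up when you count missing values.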
After this conversion, I checked the missing values again.
Surprisingly, this time it was showing 11 missing values in the ‘TotalCharges’ column.
As I explored further, I realized that before the conversion this column contained blank strings (empty or whitespace-only). A blank string is still a valid string value, so .info() counts it as non-null.
So, my learnings from this mistake:
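If you want to spot such values before converting anything, one possible approach is sketched below. It assumes a made-up column with the same name as in the dataset; the values are mine.

```python
import pandas as pd

# Hypothetical toy column with two blank entries: one space, one empty string.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15", ""]})

# Blank strings are not null, so the usual missing-value check reports zero.
nulls_reported = int(toy["TotalCharges"].isna().sum())

# Strip whitespace and compare with the empty string to find them.
blank_mask = toy["TotalCharges"].str.strip().eq("")
n_blank = int(blank_mask.sum())
blank_rows = toy[blank_mask]

print("Nulls reported:", nulls_reported)
print("Blank strings found:", n_blank)
print(blank_rows)
```

Looking at the rows in blank_rows also helps you decide what to do with them, for example whether to drop them or fill them with a sensible value.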
Always read the description of each column in the dataset to know what each column contains and what is the expected data type.
Check the data type of each column during preliminary analysis to make sure they align with the description. If they don’t align, transform them into the required format.
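These two checks can be partly automated. The helper below is only a rough sketch under my own assumptions: the function name numeric_like_text_columns and the 0.5 threshold are hypothetical choices, not part of pandas. It flags non-numeric columns in which most values parse as numbers.

```python
import pandas as pd


def numeric_like_text_columns(df, threshold=0.5):
    """Map each suspicious column to its count of values that fail to parse.

    A column is flagged when it is not numeric yet at least `threshold`
    of its values can be parsed as numbers (hypothetical rule of thumb).
    """
    report = {}
    for col in df.columns:
        values = df[col]
        if pd.api.types.is_numeric_dtype(values):
            continue  # already numeric, nothing to check
        parsed = pd.to_numeric(values, errors="coerce")
        if parsed.notna().mean() >= threshold:
            # Count only values that were present but could not be parsed.
            report[col] = int(parsed.isna().sum() - values.isna().sum())
    return report


# Hypothetical toy data, NOT the real Telco dataset.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Female"],
    "tenure": [1, 34, 2, 45],
    "TotalCharges": ["29.85", " ", "108.15", "1840.75"],
})

suspects = numeric_like_text_columns(toy)
print(suspects)
```

A flagged column is a hint, not a verdict. You still need the column description to confirm that it is supposed to be numeric.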
Let me know if it was helpful to you and if you look forward to more of such insights!
Writing each newsletter takes a lot of research, time and effort. I just want to make sure it reaches as many people as possible and helps them grow in their AI/ML journey.
It would be great if you could share this newsletter with your network.
Also, please share your feedback and suggestions in the comments section. That will help me keep going. Even a “like” on my posts tells me that they are helpful to you.
See you soon!
-Kavita