Common ML assumptions you should know! #6
Assumptions impact almost every aspect of an ML project!
In last week’s newsletter, we talked about what assumptions are and why it is important to list them, both in general and from a business perspective. We used a hiking example to make it easier for beginners to understand. In case you missed it, check it out here.
https://kavitagupta.substack.com/p/dont-forget-to-list-down-assumptions
Today, we’ll talk about the key assumptions across different aspects of an ML project. Let’s discuss them one by one.
Assumptions about Data
Data Sources: Data is collected from authorized and reliable sources.
Data Sufficiency: Available data is sufficient for model training. Limited data can lead to unstable or unreliable results.
Data Quality: Data is complete, consistent, error-free, and reliable. In practice, data may contain missing values, outliers, duplicates, etc. that need to be treated before model training.
Data Distribution: Data follows a certain distribution, most commonly assumed to be a normal distribution.
Data Representativeness: Data represents the real-world scenario accurately and covers all the instances expected for the target class.
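The data-quality assumptions above can be sanity-checked before training. Here is a minimal sketch (the dataset, column names, and check logic are illustrative assumptions, not from this article):

```python
# Hypothetical sketch: basic data-quality checks on a list-of-dicts dataset.

def quality_report(rows):
    """Count missing values and duplicate rows."""
    missing = sum(1 for row in rows for v in row.values() if v is None)
    seen, duplicates = set(), 0
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"missing_values": missing, "duplicate_rows": duplicates}

data = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},   # missing value
    {"age": 34, "income": 52000},     # duplicate row
]
report = quality_report(data)
# report -> {"missing_values": 1, "duplicate_rows": 1}
```

In a real project you would run checks like these (typically with pandas) and decide on imputation or removal before the data reaches the model.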
Assumptions about Features
Feature Relevance: Only relevant and significant features are used for model training.
Feature Importance: Some features are more relevant or important to the target variable than others.
Feature Independence: There is no high correlation among features. Multicollinearity can lead to unstable model coefficients.
Feature Scaling: All the features are on a comparable scale. Certain machine learning algorithms require features to be scaled before model training.
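Both the independence and scaling assumptions above are easy to verify. A minimal sketch (the feature values are made up for illustration):

```python
# Hypothetical sketch: checking pairwise feature correlation and scaling.
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two feature columns."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def min_max_scale(xs):
    """Rescale a feature to the [0, 1] range."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

height_cm = [150, 160, 170, 180]
height_m = [1.5, 1.6, 1.7, 1.8]    # perfectly collinear with height_cm
r = pearson(height_cm, height_m)   # ~1.0, so one of the two should be dropped
scaled = min_max_scale(height_cm)  # all values now lie in [0, 1]
```

A correlation near ±1 flags redundant features; scaling matters especially for distance-based algorithms (e.g., k-NN) and gradient-based training.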
Assumptions about Model Selection
Model Suitability: Assumptions about why a particular machine learning model is chosen over others and why it is suitable to solve the problem at hand.
Model Complexity: Depending on the data and the problem, assumptions are made about the required level of model complexity. This matters because assuming a simpler model (e.g., linear regression) when the problem is complex may lead to underfitting, while using a complex model for a simple problem may lead to overfitting.
Model Interpretability: Assumptions are made based on the level of interpretability needed for the problem. More complex models may offer better performance but less interpretability.
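The complexity trade-off can be demonstrated in a few lines. In this sketch (the toy data and polynomial degrees are illustrative assumptions), a straight line underfits data that was generated from a quadratic relationship, while a degree-2 polynomial fits it almost exactly:

```python
# Hypothetical sketch: a too-simple model underfits a quadratic relationship.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = x ** 2  # true relationship is quadratic

# Degree-1 model (too simple) vs degree-2 model (matches the data).
lin = np.poly1d(np.polyfit(x, y, 1))
quad = np.poly1d(np.polyfit(x, y, 2))

mse_lin = float(np.mean((lin(x) - y) ** 2))    # clearly positive: underfitting
mse_quad = float(np.mean((quad(x) - y) ** 2))  # essentially zero
```

The mirror-image risk is overfitting: a very high-degree polynomial would also drive training error to zero on noisy data, but at the cost of poor performance on unseen points.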
Assumptions about Model
These assumptions are often specific to the selected model. Some of the common assumptions are:
Linearity: There is a linear relationship between input features and the target variable (e.g. linear regression).
Independence of Errors: Many models assume that prediction errors (residuals) are independent of each other.
No Multicollinearity: Linear models often assume that input features are not highly correlated with each other.
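The independence-of-errors assumption above can be probed by looking at the lag-1 autocorrelation of residuals. A minimal sketch (the residual values are made up for illustration):

```python
# Hypothetical sketch: lag-1 autocorrelation of residuals as an
# independence check.

def lag1_autocorr(res):
    """Correlation between consecutive residuals (lag 1)."""
    m = sum(res) / len(res)
    num = sum((a - m) * (b - m) for a, b in zip(res, res[1:]))
    den = sum((r - m) ** 2 for r in res)
    return num / den

noise_like = [0.3, -0.1, 0.2, -0.4, 0.5, 0.0, -0.3, -0.2]  # no trend
trending = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]        # clear trend

# Trending residuals show strongly positive autocorrelation, a sign that
# the independence assumption is violated (common with time-ordered data).
```

Formal tests such as Durbin-Watson build on the same idea.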
Assumptions about Training and Testing
Data Split: The split between training and test data impacts model performance. Training data should be representative of the target population or domain.
Data Independence: Training and test data are independent of each other.
Same Distribution: Both the training and test data follow the same distribution.
Time Series: The order and temporal relationships are maintained in the case of time series data.
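For the time-series case above, the split itself must respect temporal order. A minimal sketch (the 80/20 ratio and the toy records are illustrative assumptions):

```python
# Hypothetical sketch: a chronological split that trains on the past and
# tests on the future, instead of shuffling time-ordered observations.

records = list(range(10))  # stand-in for observations in time order

cut = int(len(records) * 0.8)
train, test = records[:cut], records[cut:]
# Every training point precedes every test point, so no future information
# leaks into training. A random shuffle before splitting would break this.
```

For non-temporal data, by contrast, a shuffled (often stratified) split is the usual way to keep training and test sets independent and identically distributed.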
Assumptions about Evaluation Metrics
Metrics Selection: Assumptions about why specific evaluation metrics (e.g., accuracy, precision, recall, F1-score) are chosen and how they align with business goals.
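Why the metric choice matters is easiest to see on imbalanced data, where accuracy alone can hide poor performance. A minimal sketch (the labels and predictions are made up for illustration):

```python
# Hypothetical sketch: accuracy vs precision/recall/F1 on imbalanced labels.

def metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # only 3 positives out of 10
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # model finds just one of them
acc, prec, rec, f1 = metrics(y_true, y_pred)
# acc = 0.8 looks fine, but rec = 1/3 shows two of three positives were missed.
```

If missing positives is costly to the business (e.g., fraud detection), recall or F1 aligns with the goal far better than accuracy.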
Assumptions about Deployment
Deployment Environment: The deployment environment will be stable; for example, APIs will remain available and hardware resources will be consistent.
Scalability: The model is scalable and its infrastructure can handle increased usage or data volume in a production environment.
External Assumptions
Regulations and Privacy: Data privacy laws and regulations are followed. Ignoring these can lead to legal, financial, reputational, and ethical issues for the firm.
Assumptions about the Future Data
Data Distribution Changes: Assumptions about how the data distribution might change over time and the model's robustness to such changes.
Model Adaptation: Assumptions about the ability of the model to adapt to changes in the data distribution over time.
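A simple way to monitor the distribution-change assumption is to compare a feature's summary statistics between the training window and fresh production data. A minimal sketch (the feature values and the retraining threshold are illustrative assumptions):

```python
# Hypothetical sketch: flagging data drift via a standardized mean shift.
import statistics

def mean_shift(reference, current):
    """Shift of the current mean, in units of the reference std deviation."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    return abs(statistics.fmean(current) - ref_mean) / ref_std

training_ages = [25, 30, 35, 40, 45]      # distribution at training time
production_ages = [55, 60, 65, 70, 75]    # incoming data has drifted upward

shift = mean_shift(training_ages, production_ages)
drifted = shift > 3.0  # large shift: review or retrain the model
```

Production systems typically use richer drift measures (e.g., the population stability index or KS tests), but the monitoring idea is the same.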
Did I miss anything important? Let me know in the comments.
Curious about a specific AI/ML topic? Let me know in the comments.
Also, please share your feedback and suggestions. That will help me keep going.
See you next Friday!
-Kavita
Quote of the day
"Success is not final, failure is not fatal: It is the courage to continue that counts." — Winston Churchill
P.S. Let’s grow our tribe. Know someone who is curious to dive into ML and AI? Share this newsletter with them and invite them to be a part of this exciting learning journey.