Imbalanced Data

Imbalanced data is a situation where the classes, labels, or categories in a dataset are not equally represented: one category has significantly more or fewer samples than the others. It is a common issue in supervised machine learning and deep learning, and it can bias a model toward the majority class. In areas such as healthcare services, that bias undermines the model's reliability and effectiveness.

For example, if you build an email spam detection model and your dataset contains 5% spam emails and 95% non-spam emails, the data is imbalanced toward non-spam. This imbalance can hurt the model's performance, especially in production: a model that always predicts the majority class reaches 95% accuracy on this data while detecting no spam at all.
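
The spam example can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dataset size of 1,000 emails and the "always predict non-spam" baseline are assumptions made for the demonstration, not part of any real model. It compares accuracy with recall on the spam class.

```python
# Hypothetical dataset mirroring the example: 5% spam (1), 95% non-spam (0).
# The size of 1,000 emails is an assumption chosen for easy arithmetic.
labels = [1] * 50 + [0] * 950

# A baseline "model" that ignores its input and always predicts non-spam.
predictions = [0] * len(labels)

# Accuracy: fraction of all emails labeled correctly.
correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

# Recall on the spam class: fraction of actual spam that was caught.
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy:    {accuracy:.2f}")  # accuracy:    0.95
print(f"spam recall: {recall:.2f}")    # spam recall: 0.00
```

The baseline scores 95% accuracy yet catches none of the spam, which is why accuracy alone is a misleading metric on imbalanced data and per-class metrics such as recall are worth checking.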