src/data/question-groups/data-science/content/handling-missing-data.md
After observing that my dataset has missing values, I first figure out how they are represented. Are they NaN, None, empty strings, sentinel values like -999, a combination of two or more, or something else?
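A minimal sketch of that first pass, assuming a pandas DataFrame. The column names and the list of placeholders (-999, empty string, "N/A") are hypothetical and depend on how the data was collected:

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with several "missing" encodings mixed together.
df = pd.DataFrame({
    "age": [34, -999, 27, None],
    "income": ["52000", "", "N/A", "61000"],
})

clean = df.copy()

# Treat the -999 sentinel in age as missing.
clean["age"] = clean["age"].replace(-999, np.nan)

# Treat empty strings and "N/A" in income as missing, then convert to numbers.
clean["income"] = pd.to_numeric(clean["income"].replace(["", "N/A"], np.nan))

# Count missing values per column once everything is encoded the same way.
missing_counts = clean.isna().sum()
print(missing_counts)
```

Listing the placeholders explicitly, rather than coercing everything that fails to parse, makes it easier to notice an encoding I didn't expect.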
Once I make sense of what my missing data looks like, I dig into why these values are missing. The reasons usually fall into three categories:
Missing Completely At Random (MCAR): No pattern, just random gaps. These are usually safe to drop, especially if there aren't many.
Example: In a survey dataset, 10% of income entries are missing due to a technical glitch that affected a random subset of responses. There's no pattern based on age, education, employment status, or anything else.
Missing At Random (MAR): The missingness is related to other observed variables, but not to the missing value itself.
Example: In the same dataset, 10% of income values are missing, mostly among respondents who are students. Here, missingness is related to the occupation variable, not the actual income value. Impute based on related features like occupation, education level, or age. Dropping rows or filling with a global mean/median is not as safe here as it is under MCAR, because the gaps are concentrated in one group and ignoring that can introduce bias.
Missing Not At Random (MNAR): The reason a value is missing is tied to the value itself.
Example: If high earners choose not to share their income, the probability of missingness increases with the income amount. That's tougher to handle, and imputation is risky here. I'll consider flagging missingness with a binary indicator (income_missing) or using models that can account for MNAR, like EM algorithms or data augmentation techniques.
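To get a first hint about which category I'm dealing with, I compare missing rates across an observed variable. A minimal sketch, assuming a pandas DataFrame with hypothetical occupation and income columns and made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical data: income is missing more often for students.
df = pd.DataFrame({
    "occupation": ["student", "student", "student", "engineer",
                   "engineer", "teacher", "teacher", "teacher"],
    "income": [np.nan, np.nan, 12000, 85000, 90000, 48000, np.nan, 51000],
})

# Share of missing income values within each occupation.
missing_rate = df["income"].isna().groupby(df["occupation"]).mean()
print(missing_rate.sort_values(ascending=False))
```

A large gap between groups points away from MCAR and toward MAR. It can't confirm or rule out MNAR, since that depends on the values I can't see.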
Once I know the type of missingness, I choose one of the following:
a. Deletion (if safe):
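A minimal sketch of deletion with pandas. The columns are hypothetical, and the 50% cutoff for dropping a column is a judgment call, not a rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 31, np.nan, 45, 52],
    "income": [40000, np.nan, 58000, 72000, 65000],
    "notes": [np.nan, np.nan, np.nan, "called back", np.nan],
})

# Drop columns that are mostly empty (here: more than 50% missing).
mostly_empty = df.columns[df.isna().mean() > 0.5]
trimmed = df.drop(columns=mostly_empty)

# Drop rows that are still missing any of the remaining fields.
complete = trimmed.dropna()
print(len(df), "rows before,", len(complete), "rows after")
```

I check how many rows survive before committing, since deletion that removes a large share of the data is rarely worth it.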
b. Simple imputation:
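A minimal sketch, assuming one numeric and one categorical column (both hypothetical). I use the median for the numeric column because it is less sensitive to outliers than the mean, and the most frequent value for the categorical one:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [40000, np.nan, 58000, 72000, 1_000_000],
    "education": ["BSc", "MSc", None, "BSc", "BSc"],
})

imputed = df.copy()

# Numeric column: fill with the median of the observed values.
imputed["income"] = imputed["income"].fillna(imputed["income"].median())

# Categorical column: fill with the most frequent observed category.
imputed["education"] = imputed["education"].fillna(imputed["education"].mode()[0])

print(imputed)
```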
c. Advanced imputation:
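One form this can take is regression imputation: predict the missing value from related features. The sketch below uses a plain least-squares fit with NumPy. The column names are hypothetical and the toy data is exactly linear so the result is easy to check. Library options such as scikit-learn's KNNImputer or IterativeImputer follow the same idea with more machinery.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [22, 25, 30, 35, 40, 45, 50],
    "years_experience": [1, 3, 7, 12, 16, 22, 27],
    "income": [30000, 36000, np.nan, 63000, 75000, np.nan, 108000],
})

features = ["age", "years_experience"]
observed = df["income"].notna()

# Fit income ~ intercept + features on the rows where income is observed.
X_obs = np.column_stack([
    np.ones(int(observed.sum())),
    df.loc[observed, features].to_numpy(),
])
y_obs = df.loc[observed, "income"].to_numpy()
coef, *_ = np.linalg.lstsq(X_obs, y_obs, rcond=None)

# Predict income for the rows where it is missing.
X_mis = np.column_stack([
    np.ones(int((~observed).sum())),
    df.loc[~observed, features].to_numpy(),
])
imputed = df.copy()
imputed.loc[~observed, "income"] = X_mis @ coef

print(imputed)
```

Filling in the fitted value makes imputed rows look more certain than they are, so I treat any variance estimates on the filled column with caution.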
d. Use missingness as a feature:
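A minimal sketch of the income_missing indicator mentioned earlier, assuming a pandas DataFrame. The median fill is just one choice for the companion imputation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [40000, np.nan, 58000, np.nan, 65000]})

flagged = df.copy()

# Record where income was missing BEFORE filling, or the signal is lost.
flagged["income_missing"] = flagged["income"].isna().astype(int)

# Fill the gaps so models that cannot handle NaN can still use the column.
flagged["income"] = flagged["income"].fillna(flagged["income"].median())

print(flagged)
```

The model can then learn from the flag itself, which matters most when the fact that a value is missing carries information.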
e. Oversampling or undersampling:
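Resampling doesn't fill in gaps by itself. The scenario assumed in this sketch is that dropping incomplete rows has left the target classes imbalanced, so I rebalance them afterwards. The feature and label columns are hypothetical, and this is plain random resampling with pandas:

```python
import pandas as pd

# Hypothetical data left after deletion: class 1 is now rare.
df = pd.DataFrame({
    "feature": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "label": [0, 0, 0, 0, 0, 0, 1, 1],
})

counts = df["label"].value_counts()
minority = counts.idxmin()
majority_n = int(counts.max())
minority_n = int(counts.min())

# Random oversampling: resample minority rows with replacement
# until the minority class matches the majority class in size.
minority_rows = df[df["label"] == minority]
extra = minority_rows.sample(
    n=majority_n - len(minority_rows), replace=True, random_state=0
)
oversampled = pd.concat([df, extra], ignore_index=True)

# Random undersampling: shrink every class to the minority class size.
undersampled = df.groupby("label").sample(n=minority_n, random_state=0)

print(oversampled["label"].value_counts())
print(undersampled["label"].value_counts())
```

If I resample at all, I do it on the training split only, so duplicated rows don't leak into evaluation.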
Common pitfall: Filling in values without understanding the pattern of missingness. For example, using mean imputation on MNAR data can introduce bias and weaken your model's predictive power.
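A small simulation makes the pitfall concrete. This is a sketch on synthetic data, and the missingness mechanism (probability rising with income rank, up to 80%) is an assumption chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "true" incomes, skewed to the right.
income = rng.lognormal(mean=10.5, sigma=0.5, size=10_000)

# MNAR mechanism: the higher the income, the likelier it is to be missing.
ranks = income.argsort().argsort() / (income.size - 1)  # 0 = lowest, 1 = highest
missing = rng.random(income.size) < 0.8 * ranks

# Mean imputation: fill every gap with the mean of the observed values.
observed_mean = income[~missing].mean()
imputed = np.where(missing, observed_mean, income)

print("true mean:   ", round(income.mean()))
print("imputed mean:", round(imputed.mean()))
print("true std:    ", round(income.std()))
print("imputed std: ", round(imputed.std()))
```

Because the high values are the ones that went missing, the observed mean is too low, and filling with it pulls the whole column down while also shrinking its spread.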