Missing Value and Outlier Detection in Python IN SHORT
“To ensure that a trained model generalises well to the valid range of test inputs, it is important to detect and then handle missing values and outliers.”
Outliers
Outliers are data points that differ significantly from the rest of the dataset. They are often abnormal observations that skew the data distribution.
For example, in a dataset of Fixed Deposit investments where the median investment is around Rs 5 Lacs per FD, a few investments above Rs 1 Crore are far higher than the rest and are likely outliers (though they are genuine data points).
Should we remove outliers?
Not necessarily, since removal leads to data loss. Natural variation can also produce outliers, and an outlier is not automatically an error: an FD of Rs 1 Crore is entirely possible in the above example. So why remove it, if we can handle such extreme data points in some other way? (The hint is to use the capping method.)
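As a sketch of the capping idea (the FD amounts below are made-up values in Rs Lacs), extreme points can be clipped to chosen percentiles instead of being dropped:

```python
import numpy as np

# Hypothetical FD amounts in Rs Lacs; one extreme value of Rs 1 Crore (100 Lacs)
fd_amounts = np.array([4.0, 5.0, 5.5, 6.0, 4.5, 5.2, 100.0])

# Cap (winsorize) at the 5th and 95th percentiles instead of dropping the point
lower, upper = np.percentile(fd_amounts, [5, 95])
capped = np.clip(fd_amounts, lower, upper)
```

No rows are lost: the extreme value is pulled back towards the bulk of the data while every observation stays in the dataset.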
A Few Outlier Detection Techniques
- Standard Deviation
- Z Score
- Inter Quartile Range (IQR) / Box Plot Method
- Percentile
Outlier Detection Using Standard Deviation
For this outlier detection method, the mean and standard deviation of the feature are calculated. If a value lies more than a certain number of standard deviations away from the mean, that data point is identified as an outlier.
The specified number of standard deviations is called the threshold. The default value is 3.
NOTE: This method can fail to detect outliers because the outliers themselves inflate the standard deviation. The more extreme the outlier, the more the standard deviation is affected (the formula uses the mean, and the mean is itself pulled towards outliers).
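A minimal sketch of the standard-deviation rule, on synthetic data with two injected extremes:

```python
import numpy as np

def detect_outliers_std(values, threshold=3):
    """Flag points more than `threshold` standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean, std = values.mean(), values.std()
    return values[np.abs(values - mean) > threshold * std]

# Synthetic data: 1000 roughly normal points plus two injected extremes
rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 1000), [120.0, -40.0])
outliers = detect_outliers_std(data)
```

With `threshold=3` (the usual default) the two injected extremes are flagged; lowering the threshold flags more points.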
Detect Outliers Using the Z-Score
For a normal distribution with mean μ and standard deviation σ, the z-score for a value x in the dataset is given by: z = (x - μ) / σ
From the above equation, we have the following:
- When x = μ, the value of the z-score is 0.
- When x = μ ± 1σ, μ ± 2σ, or μ ± 3σ, the z-score is ±1, ±2, or ±3, respectively.
Notice how this technique is equivalent to the scores based on the standard deviation above.
Under this transformation, all data points that lie below the lower limit, μ - 3σ, map to values less than -3 on the z-score scale. Similarly, all points above the upper limit, μ + 3σ, map to values greater than 3. So [lower_limit, upper_limit] becomes [-3, 3].
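The same rule expressed through z-scores, again on synthetic data with one injected extreme of 120:

```python
import numpy as np

rng = np.random.default_rng(42)
data = np.append(rng.normal(50, 5, 500), 120.0)

# z-score: distance from the mean in units of standard deviation
z = (data - data.mean()) / data.std()

# Points with |z| > 3 lie outside [μ - 3σ, μ + 3σ]
outliers = data[np.abs(z) > 3]
```

This is exactly the standard-deviation rule restated on the [-3, 3] scale.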
Outlier Detection Using Interquartile Range (IQR)
The box plot is a non-parametric method: it displays variation without making any assumptions about the underlying distribution.
The minimum (lower whisker) is the lowest data point excluding outliers, and the maximum (upper whisker) is the largest data point excluding outliers. The median (Q2, 50th percentile) is the middle value of the dataset. The first quartile (Q1, 25th percentile) is also known as the lower quartile, and the third quartile (Q3, 75th percentile) as the upper quartile. Outliers are shown as individual dots or stars.
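A sketch of the IQR rule on a small hypothetical sample (continuing the FD example, in Rs Lacs):

```python
import numpy as np

values = np.array([3.2, 3.8, 4.1, 4.5, 4.7, 5.0, 5.2, 5.5, 6.0, 6.3, 25.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Standard box-plot fences: 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
```

Only the 25.0 falls outside the fences, which matches the dots a box plot would draw.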
Outliers Using Percentile
The interquartile range method drops all points outside the range [q25 - 1.5*IQR, q75 + 1.5*IQR] as outliers. But removing outliers this way may not be the optimal choice when your observations have a wide distribution, and you may be discarding more points than you actually should.
Depending on the domain, you may want to widen the range of permissible values and estimate outliers using percentiles instead. For example, a custom range of [0.5th percentile, 99.5th percentile] accommodates all data points that lie between the 0.5 and 99.5 percentiles of the dataset.
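A sketch of the percentile rule with that custom [0.5, 99.5] range, on a synthetic right-skewed distribution where the IQR fences would be too aggressive:

```python
import numpy as np

rng = np.random.default_rng(7)
values = rng.lognormal(mean=3, sigma=1, size=2000)  # wide, right-skewed data

# Keep everything between the 0.5th and 99.5th percentiles
lower, upper = np.percentile(values, [0.5, 99.5])
outliers = values[(values < lower) | (values > upper)]
```

By construction only about 1% of the points (0.5% in each tail) are flagged, regardless of how wide the distribution is.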
Here’s a summary of Outlier Detection
- If the data or feature of interest is normally distributed, we may use standard deviation and z-score to label points that are farther than three standard deviations away from the mean as outliers.
- If the data is not normally distributed, we can use the interquartile range or percentile methods to detect outliers.
Missing Values
Missing values are a cause for concern in any dataset, since most Machine Learning models cannot handle them directly.
Methods to impute Missing Values: Mean, Median, Mode and KNN Imputation
Impute with
1. Mean / Median Value — For Continuous Data
2. Mode Value — For Categorical Data
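A minimal pandas sketch of both rules, on a small hypothetical table (`amount` and `branch` are made-up columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [4.5, 5.0, np.nan, 6.2, 100.0],   # continuous, skewed
    "branch": ["A", "B", "B", None, "A"],       # categorical
})

# Median for skewed continuous data (the mean would be pulled up by 100.0)
df["amount"] = df["amount"].fillna(df["amount"].median())

# Mode (most frequent category) for categorical data
df["branch"] = df["branch"].fillna(df["branch"].mode()[0])
```

For roughly symmetric continuous data the mean is fine; the median is the safer default when outliers are present, for the same reason discussed above.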
KNN Imputation
- Each sample’s missing values are imputed using the mean value from the n_neighbors nearest neighbours found in the training set. Two samples are close if the features that neither is missing are close. By default, nan_euclidean_distances, a Euclidean distance metric that supports missing values, is used to find the nearest neighbours.
- For imputing missing values in CATEGORICAL features, we first have to encode the categories as numeric values, since KNNImputer works only with numeric variables. This can be done with a mapping of categories to numbers.
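A minimal sketch using scikit-learn’s KNNImputer on a toy numeric matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix: 4 samples, 3 numeric features, two missing entries
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Each missing value becomes the mean of that feature in the 2 nearest
# neighbours (distances computed with nan_euclidean_distances)
imputer = KNNImputer(n_neighbors=2)
X_imputed = imputer.fit_transform(X)
```

For the first row, the two nearest neighbours with a value in the third column are rows two and three, so the NaN is replaced by the mean of 3.0 and 5.0.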
GitHub Code:
If it was helpful, please give a thumbs up. Thank You and Please follow:
Medium: https://medium.com/@sandipanpaul
GitHub: https://github.com/sandipanpaul21