Missing Value and Outlier Detection in Python IN SHORT

Sandipan Paul
5 min read · Jan 8, 2023


“To ensure that the trained model generalises well to the valid range of test inputs, it is important to detect and then handle missing values and outliers.”

Outliers

Outliers are those data points that are significantly different from the rest of the dataset. They are often abnormal observations that skew the data distribution.

For example, in a dataset of Fixed Deposit investments, if I find a few investments above Rs 1 Crore while the median investment is around Rs 5 Lacs per FD, those are likely outliers (but they are actual data points).

Should we remove outliers?

Not necessarily, since removal leads to data loss. Natural variation can also produce outliers, and that is not in itself a problem. An FD of Rs 1 Crore is possible in the above example, so why remove it if we can handle those extreme data points in some other way? (The hint is to use the capping method.)

A Few Outlier Detection Techniques

  1. Standard Deviation
  2. Z Score
  3. Inter Quartile Range (IQR) / Box Plot Method
  4. Percentile

Outlier Detection Using Standard Deviation

For this outlier detection method, the mean and standard deviation of the data are calculated. If a value lies more than a certain number of standard deviations away from the mean, that data point is identified as an outlier.

The specified number of standard deviations is called the threshold. The default value is 3.

NOTE: This method can fail to detect outliers because the outliers themselves inflate the standard deviation. The more extreme the outlier, the more the standard deviation is affected (the mean appears in the formula, and the mean is sensitive to outliers).

Importing the necessary libraries and creating a sample dataset based on Fixed Deposit investments.
From the distribution of Fixed Deposit investments, the inference is that some outliers are present in the data, visible on the extreme right of the distribution. (Hint: we can also detect outliers using plots.)
Outlier detection using standard deviation: first we calculate the lower and upper bounds of the data, then keep only the data points that fall within those bounds. Notably, this filters out the outlier observed earlier in the plot (for example, Sukanya's investment).
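The article's original code cells are not reproduced here; below is a minimal sketch of the setup, assuming a small hypothetical FD dataset (the investor names and amounts are illustrative stand-ins, not the author's actual data):

```python
import numpy as np
import pandas as pd

# Hypothetical Fixed Deposit dataset: typical deposits around Rs 5 Lacs,
# plus one extreme Rs 1 Crore deposit (the "Sukanya" point).
df = pd.DataFrame({
    "investor": ["Amit", "Priya", "Rahul", "Neha", "Vikram",
                 "Anjali", "Rohan", "Kavita", "Suresh", "Sukanya"],
    "fd_amount": [450000, 520000, 480000, 510000, 495000,
                  530000, 470000, 505000, 490000, 10_000_000],
})

# A quick look at the distribution; a histogram (df["fd_amount"].hist())
# would show the long right tail mentioned in the text.
print(df["fd_amount"].describe())
```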
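A sketch of the standard-deviation filter, assuming a larger illustrative sample of FD amounts (with only a handful of points, a single extreme value inflates the standard deviation so much that it can never exceed the 3-sigma threshold, which is exactly the caveat in the NOTE above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# ~100 typical deposits around Rs 5 Lacs, plus one Rs 1 Crore deposit.
fd = pd.Series(np.append(rng.normal(500000, 50000, 100), 10_000_000))

threshold = 3  # default: 3 standard deviations
lower = fd.mean() - threshold * fd.std()
upper = fd.mean() + threshold * fd.std()

outliers = fd[(fd < lower) | (fd > upper)]   # the Rs 1 Crore point
filtered = fd[fd.between(lower, upper)]      # data kept for modelling
print(len(outliers))
```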

Detect Outliers Using the Z-Score

For a normal distribution with mean μ and standard deviation σ, the z-score for a value x in the dataset is given by: z = (x - μ) / σ

From the above equation, we have the following:

  • When x = μ, the value of the z-score is 0.
  • When x = μ ± σ, μ ± 2σ, or μ ± 3σ, the z-score is ±1, ±2, or ±3, respectively.

Notice how this technique is equivalent to the scores based on the standard deviation above.

Under this transformation, all data points that lie below the lower limit, μ - 3σ, now map to points less than -3 on the z-score scale. Similarly, all points that lie above the upper limit, μ + 3σ, map to values above 3. So [lower_limit, upper_limit] becomes [-3, 3].

Outlier detection using the z-score: first we calculate the z-score of each data point, then keep only those points whose z-scores lie between -3 and +3. Notably, this also filters out the outlier observed earlier in the plot.
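The same idea expressed via z-scores, again on illustrative data rather than the article's original dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Illustrative FD amounts with one extreme Rs 1 Crore deposit.
fd = pd.Series(np.append(rng.normal(500000, 50000, 100), 10_000_000))

# z = (x - mu) / sigma; points with |z| > 3 are flagged as outliers.
z = (fd - fd.mean()) / fd.std()
outliers = fd[z.abs() > 3]
filtered = fd[z.abs() <= 3]
print(outliers)
```

This is numerically the same filter as the standard-deviation method: dividing by σ just rescales the cut-offs to ±3.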

Outlier Detection Using Interquartile Range (IQR)

The box plot is a non-parametric method: it displays variation without making any assumptions about the underlying distribution. Outliers are shown as dots.

Box Plot Sample Image. Picture Reference: Byjus

  • The minimum / lower range is the lowest data point excluding any outliers.
  • The maximum / upper range is the largest data point excluding any outliers.
  • The median (Q2, 50th percentile) is the middle value of the dataset.
  • The first quartile (Q1, 25th percentile) is also known as the lower quartile (0.25).
  • The third quartile (Q3, 75th percentile) is also known as the upper quartile (0.75).
  • Outliers are shown as dots or stars.

Box plot of FD investments. The inference is that some outliers are visible around the 400000 range.
IQR / box plot outlier filter. As we can observe, quite a few data points land in the outlier set, because this method only keeps points within [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
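A sketch of the IQR filter, again on illustrative FD amounts rather than the article's original data:

```python
import pandas as pd

# Illustrative FD amounts; the Rs 1 Crore deposit is the extreme point.
fd = pd.Series([450000, 520000, 480000, 510000, 495000,
                530000, 470000, 505000, 490000, 10_000_000])

q1, q3 = fd.quantile(0.25), fd.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = fd[(fd < lower) | (fd > upper)]
filtered = fd[fd.between(lower, upper)]
print(outliers)
```

Because the quartiles ignore how extreme the tail values are, the Rs 1 Crore point cannot inflate the bounds the way it inflated the standard deviation.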

Outliers Using Percentile

The interquartile range method drops all points outside the range [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outliers. But removing outliers this way may not be the best choice when your observations have a wide distribution, and you may discard more points as outliers than you actually should.

Depending on the domain, you may want to widen the range of permissible values and estimate the outliers using percentiles instead. For example, a custom range might accommodate all data points that lie anywhere between the 0.5th and 99.5th percentiles of the dataset.

Outliers using percentiles, based on the 5th–95th percentile range. The inference is that we capped the data between the 5th and the 95th percentiles. The only outlier is Sukanya, whose investment lies far outside this range.
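A sketch of the percentile method on illustrative data; the 5th/95th bounds are a domain-dependent choice, and `clip` implements the capping mentioned earlier:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Illustrative FD amounts with one extreme Rs 1 Crore deposit.
fd = pd.Series(np.append(rng.normal(500000, 50000, 100), 10_000_000))

# Keep everything between the 5th and 95th percentiles.
lower, upper = fd.quantile(0.05), fd.quantile(0.95)
outliers = fd[(fd < lower) | (fd > upper)]
filtered = fd[fd.between(lower, upper)]

# Capping: instead of dropping extreme points, pull them to the bounds.
capped = fd.clip(lower, upper)
```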

Here’s a summary of Outlier Detection

  • If the data or feature of interest is normally distributed, we may use standard deviation and z-score to label points that are farther than three standard deviations away from the mean as outliers.
  • If the data is not normally distributed, we can use the interquartile range or percentile methods to detect outliers.

Missing Values

The absence of values is a cause for concern in datasets and for the Machine Learning models trained on them.

Methods to impute Missing Values: Mean, Median, Mode and KNN Imputation

Impute with

1. Mean / Median Value: for continuous data

2. Mode Value: for categorical data
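A minimal sketch of these simple imputations with pandas, on a hypothetical two-column frame (column names and values are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fd_amount": [450000, np.nan, 480000, 510000, np.nan],    # continuous
    "branch": ["Delhi", "Mumbai", np.nan, "Delhi", "Delhi"],  # categorical
})

# Continuous: the median is robust to outliers; the mean is an alternative.
df["fd_amount"] = df["fd_amount"].fillna(df["fd_amount"].median())

# Categorical: fill with the mode (the most frequent category).
df["branch"] = df["branch"].fillna(df["branch"].mode()[0])
```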

KNN Imputation

- Each sample’s missing values are imputed using the mean value of the n_neighbors nearest neighbours found in the training set. Two samples are close if the features that neither is missing are close. By default, a Euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbours.

- For imputing missing values in CATEGORICAL features, we first have to encode the categorical values as numbers, because KNNImputer works only with numeric variables. We can do this with a mapping from categories to numeric values.

KNN Imputation in Python (for continuous data points)
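A sketch with scikit-learn's KNNImputer on hypothetical numeric columns (the column names and values are illustrative, not the article's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, 30, np.nan, 40, 35],
    "salary": [50000, 60000, 65000, np.nan, 70000],
})

# Each missing cell is filled with the mean of that feature over the
# n_neighbors rows closest under the nan_euclidean distance.
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

Note that KNNImputer is distance-based, so scaling the features first (e.g. with StandardScaler) usually gives more sensible neighbours when columns have very different units.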
Consider another dataset for KNN Imputation in Python (for categorical data points)
KNN Imputation in Python (for categorical data points): first we encode the categorical data points, then apply the imputation as above.
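A sketch of the encode-then-impute workflow for a categorical column, using a hypothetical `risk` column and mapping (all names and values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "fd_amount": [450000, 520000, 480000, 510000, 495000],
    "risk":      ["Low", "High", np.nan, "Low", "Low"],
})

# KNNImputer works only on numeric data, so map categories to numbers first.
mapping = {"Low": 0, "High": 1}
df["risk"] = df["risk"].map(mapping)

imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

# Round the imputed codes and map back to recover the category labels.
inverse = {v: k for k, v in mapping.items()}
imputed["risk"] = imputed["risk"].round().astype(int).map(inverse)
print(imputed)
```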

GitHub Code:

If this was helpful, please give a thumbs up. Thank you, and please follow:

Medium: https://medium.com/@sandipanpaul

GitHub: https://github.com/sandipanpaul21
