Outliers | Detection, Impacts and Remedies

From childhood we strive to excel in every aspect of our lives so that we can stand out from the average population. Sadly, when our lives become mere data points for a data scientist, those outstanding achievements can be treated as ‘outliers’ in a data set. Let’s try to understand why an outlier can be a pain point for a data scientist.

Outliers can drastically change the results of data analysis and statistical modeling. They can have numerous unfavorable impacts on a data set:

  • They increase the error variance and reduce the power of statistical tests
  • They can bias or influence estimates
  • They can also impact the validity of basic assumptions of linear regression

Whenever we come across outliers, the ideal way to tackle them is to find out why they occur. The right method for dealing with them then depends on that cause. Outliers can arise from natural variance in the data or from data entry or processing errors.

Univariate outlier detection technique:

If we are dealing with a single variable and want to know whether the data contain any outliers, we can simply use the interquartile range (IQR) method. The IQR is the difference between the 75th and the 25th percentiles of the data. It identifies outliers by setting limits a fixed multiple of the IQR (commonly 1.5) below the 25th percentile and above the 75th percentile; values outside these limits are flagged as outliers.
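The IQR rule can be sketched in a few lines of Python. This is only an illustration: the function name `iqr_outliers`, the multiplier `k=1.5` and the salary figures are assumptions made for the example, and different quartile conventions can shift the limits slightly.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates quartiles treating the data as the full population;
    # other conventions give slightly different quartiles.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up salaries (in thousands) with one extreme value
salaries = [42, 45, 47, 50, 52, 55, 58, 60, 250]
print(iqr_outliers(salaries))  # -> [250]
```

Here Q1 = 47 and Q3 = 58, so the IQR is 11 and the limits are 30.5 and 74.5; only 250 falls outside them.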

In financial risk modeling, a common practice, described in multiple Basel Committee publications, is to detect outliers using percentiles. For example, we can consider the top 1% and bottom 1% of values as outliers. With a large sample size, however, this method may not give a very conclusive result, because a fixed share of the observations is flagged whether or not those values are truly extreme. Median absolute deviation (MAD) is another popular univariate outlier detection technique.

MAD = median(|x – median(x)|), where x represents a series
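As a rough sketch, the MAD can be computed directly from this definition and then used to score each point. The scoring step is an assumption that goes beyond the formula above: it uses the "modified z-score" convention (0.6745 × deviation / MAD, with a cutoff of 3.5), which is one common choice rather than the only one. The function names and data are made up for illustration.

```python
from statistics import median

def mad(values):
    """Median absolute deviation: median(|x - median(x)|)."""
    med = median(values)
    return median([abs(v - med) for v in values])

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds the threshold (assumed convention)."""
    med = median(values)
    spread = mad(values)
    if spread == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [v for v in values if abs(0.6745 * (v - med) / spread) > threshold]

# Made-up salaries (in thousands) with one extreme value
salaries = [42, 45, 47, 50, 52, 55, 58, 60, 250]
print(mad(salaries))           # -> 6
print(mad_outliers(salaries))  # -> [250]
```

Because both the center and the spread are medians, the single extreme value barely moves them, which is what makes MAD more robust than a mean and standard deviation.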

Multivariate outlier detection technique:

Suppose our data set contains a person who is 25 years old but has 30 years of experience. This is a simple case of data entry error: neither value is unusual on its own, but the combination is impossible. How do we detect this kind of outlier when we are dealing with a large data set with multiple variables? This can be done using multivariate outlier detection techniques. We can also use a grouping variable to assess outliers separately within each group. Popular indices such as Mahalanobis’ distance and Cook’s D are frequently used in these cases.
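To make the age and experience example concrete, here is a minimal sketch of Mahalanobis’ distance for two variables, written in plain Python with the 2×2 covariance matrix inverted by hand. The function name and the data are made up, and the sketch only ranks points by distance; choosing a cutoff (often taken from a chi-square distribution) is left open, since it depends on the sample size and the number of variables.

```python
from statistics import mean

def mahalanobis_sq(xs, ys):
    """Squared Mahalanobis distance of each (x, y) pair from the sample mean."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # d^2 = [dx, dy] * inverse(covariance) * [dx, dy]^T, using the closed-form 2x2 inverse
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in zip(xs, ys)
    ]

# Made-up data: the last person is 25 years old with 30 years of experience
ages = [25, 30, 35, 40, 45, 50, 55, 25]
experience = [2, 7, 12, 17, 22, 27, 32, 30]

distances = mahalanobis_sq(ages, experience)
most_unusual = max(range(len(distances)), key=distances.__getitem__)
print(most_unusual, ages[most_unusual], experience[most_unusual])  # -> 7 25 30
```

An age of 25 and an experience of 30 each fall within the range of the other records, so a univariate check on either column would miss this row. The distance picks it up because it accounts for how the two variables move together.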

Various visualization methods like scatter plots, box plots, histograms etc. can also be used to identify outliers.

How to treat outliers:

  • We can remove outliers that arise from data entry errors, or when the outliers are very few in number.
  • An appropriate transformation, e.g. taking the natural log of a value, can reduce the variation caused by extreme values.
  • We can also create a separate dummy variable to represent the outliers if they arise from a very specific event.
  • Decision tree algorithms can deal with outliers by placing them in separate bins.
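Two of the treatments above, the log transformation and the dummy variable, can be sketched as follows. The income figures and the `cutoff` value are hypothetical; in practice the cutoff would come from one of the detection methods described earlier or from knowledge of the specific event.

```python
import math

# Made-up incomes with one extreme value
incomes = [30_000, 35_000, 40_000, 45_000, 2_000_000]

# Natural log transformation (requires strictly positive values)
log_incomes = [math.log(v) for v in incomes]

# Dummy variable: 1 marks an observation treated as an outlier, 0 otherwise
cutoff = 1_000_000  # hypothetical threshold
is_outlier = [1 if v > cutoff else 0 for v in incomes]

print(max(incomes) / min(incomes))          # spread before the transformation
print(max(log_incomes) / min(log_incomes))  # much smaller spread after it
print(is_outlier)                           # -> [0, 0, 0, 0, 1]
```

The transformation keeps every observation but compresses the extreme value, while the dummy variable lets a model estimate a separate effect for the flagged rows instead of dropping them.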

Hope you find this post helpful. Happy learning 🙂
