Outlier Detection in Machine Learning

Learn how to efficiently detect outliers!

Bhanwar Saini
5 min read · Mar 11, 2021

During a recent project, I was struggling with a clustering problem on data collected from users of a mobile app. The goal was to group the users in terms of their behaviour, possibly with the use of K-means clustering. However, after inspecting the data, it turned out that some users exhibited unusual behaviour: they were outliers.

Many machine learning algorithms perform worse when outliers are not dealt with. To handle them you could, for example, drop them from your data frame, cap the values at some reasonable point, or transform the data. In this article, however, I would like to focus on identifying them and leave possible solutions for another time.

In my case, I took a lot of variables into consideration, so I needed an algorithm that could identify outliers in a high-dimensional space. That is when I came across Isolation Forest, a method which in principle is similar to the well-known and popular Random Forest. In this article, I will focus on the Isolation Forest.

How it works

An Isolation Forest builds a Random Forest in which each Decision Tree is grown randomly. At each node, it picks a feature at random, then picks a random threshold value (between that feature's min and max values) to split the dataset in two. The dataset gradually gets chopped into pieces this way…
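To make the random splitting more concrete, here is a minimal pure-Python sketch of the idea. It is only an illustration, not the library implementation. The function name `isolation_depth`, the toy data, and the `max_depth` cap are assumptions made for this example. The sketch follows a single point down one randomly grown branch and counts how many random splits it takes before that point is alone. It relies on the general observation behind Isolation Forest that unusual points tend to be separated after fewer splits than typical ones.

```python
import random


def isolation_depth(points, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition.

    `points` is a list of equal-length tuples and must contain `target`.
    `max_depth` is a safety cap chosen for this sketch.
    """
    current = list(points)
    depth = 0
    for _ in range(max_depth):
        if len(current) <= 1:
            break
        # Pick a feature (column index) at random.
        feature = rng.randrange(len(target))
        lo = min(p[feature] for p in current)
        hi = max(p[feature] for p in current)
        if lo == hi:
            # This feature cannot separate the remaining points; try another.
            continue
        # Pick a random threshold between the feature's min and max.
        threshold = rng.uniform(lo, hi)
        # Keep only the side of the split that contains the target point.
        if target[feature] < threshold:
            current = [p for p in current if p[feature] < threshold]
        else:
            current = [p for p in current if p[feature] >= threshold]
        depth += 1
    return depth


rng = random.Random(42)

# Toy data (hypothetical): a cluster around the origin plus one far-away point.
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = cluster + [outlier]

# A "typical" point: the cluster member closest to the origin.
inlier = min(cluster, key=lambda p: p[0] ** 2 + p[1] ** 2)

# Average over many random branches, as a forest averages over many trees.
trials = 200
avg_outlier = sum(isolation_depth(data, outlier, rng) for _ in range(trials)) / trials
avg_inlier = sum(isolation_depth(data, inlier, rng) for _ in range(trials)) / trials

print(f"average splits to isolate the outlier: {avg_outlier:.2f}")
print(f"average splits to isolate the inlier:  {avg_inlier:.2f}")
```

On this toy data the far-away point should need noticeably fewer splits on average than the point in the middle of the cluster. In practice you would most likely reach for a ready-made implementation rather than a hand-rolled loop like this one.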