Learn how to efficiently detect outliers!

During a recent project, I was struggling with a clustering problem involving data collected from users of a mobile app. The goal was to group the users according to their behaviour, possibly using K-means clustering. But after inspecting the data, it turned out that some users exhibited unusual behaviour: they were outliers.

Many machine learning algorithms suffer in terms of performance when outliers are not dealt with. To solve this kind of problem you could, for example, drop them from your data frame, cap the values at some reasonable…
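As a minimal sketch of those two options (assuming a pandas data frame with a single, hypothetical numeric column and the common 1.5×IQR rule, neither of which comes from the original project):

import numpy as np
import pandas as pd

# Hypothetical data: session lengths for app users
df = pd.DataFrame({"session_minutes": np.random.exponential(scale=10, size=1000)})

# Inter-quartile-range fences (1.5 is a conventional, but arbitrary, multiplier)
q1, q3 = df["session_minutes"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Option 1: drop the outlying rows
df_dropped = df[df["session_minutes"].between(lower, upper)]

# Option 2: cap (winsorize) the values at the fences
df_capped = df.assign(session_minutes=df["session_minutes"].clip(lower, upper))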


Photo by Austin Distel on Unsplash

Intraday trading is buying and selling stocks within the same day. Countless traders do this type of day trading to earn quick money. Day traders exit their positions before the market closes to avoid uncontrollable risks and negative price gaps between one day's close and the next day's open. Some of the more regularly day-traded financial instruments are stocks, options, currencies, contracts for difference, and a host of futures contracts such as equity index futures, interest rate futures, currency futures, and commodity futures.

Because of the nature of financial leverage and the rapid returns possible, day trading results…


Gradient boosting has become a big part of Kaggle competition winners’ toolkits. It was initially explored in earnest by Jerome Friedman in the paper Greedy Function Approximation: A Gradient Boosting Machine. In this post, we’ll look at gradient boosting and its use in Python with the scikit-learn library. Gradient boosting is a boosting ensemble method. Ensemble machine learning methods are methods in which several predictors are aggregated to produce a final prediction with lower bias and variance than any of the individual predictors. Ensemble methods come in two different flavors: bagging and boosting. Bagging is a method in which…
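As a hedged preview of what the scikit-learn usage looks like (the toy data set and hyperparameters below are illustrative choices, not the article's):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Built-in binary classification data set, used here purely for illustration
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A boosting ensemble of shallow trees, each one fit to the errors of the previous ones
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))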


Photo by Myriam Jessier on Unsplash

Data Science Dojo has added 32 data sets to its repository, which is freely available to data science and AI enthusiasts. The repository covers a diverse range of themes, difficulty levels, sizes, and attributes. The data sets are categorized by difficulty level so that there is something suitable for everyone. They offer the chance to challenge your knowledge and get hands-on practice to boost your skills in areas including, but not limited to, exploratory data analysis, data visualization, data wrangling, and machine learning.

The data sets below have been sorted by increasing difficulty for convenience (Beginner, Intermediate, Advanced). We recommend you…


During this marvelous time with the pandemic, many are finding their careers affected. This includes some of the most talented data scientists with whom I have ever worked. Having shared my personal experience with some close friends to help them find a new job after being laid off, I thought it worth sharing publicly. After all, this affects more than my friends and me: any data scientist who was laid off due to the pandemic. …


Market Psychology Books Can Improve Your Trading Strategies

Photo by Lloyd Blunk on Unsplash

Trading is as much about psychology as it is about developing a solid strategy. Without the mental fortitude to stick to a plan, the best-conceived strategy in the world won’t do you any good. Successful traders not only develop and master a strategy, but they also become familiar with their own psychological traits (such as discipline and patience) and cultivate them, which allows them to implement their strategies more effectively.

A variety of books can help traders take steps towards understanding psychology from an investment perspective.

1. Trading in the Zone

Written by Mark Douglas, this is a must-read for anyone…


Photo by Chris Liverani on Unsplash

Whether you day trade forex, stocks, or futures, don’t get distracted by fundamental analysis. While fundamentals are relevant to long-term investors, day traders will likely find that the in-depth research fundamental analysis requires doesn’t improve their short-term trades. Most successful day traders don’t bother themselves with fundamentals. Here’s why.

Fundamental Analysis Is Irrelevant on Short Time Frames

A company’s balance sheet won’t matter much for a trade that lasts five minutes. A company can have horrible financial statements and yet rally for a long time. A company can be financially strong, with great earnings, and yet within a few days its share price can drop like…


Photo by Annie Spratt on Unsplash

In particular, the non-probabilistic nature of k-means and its use of simple distance from cluster center to assign cluster membership leads to poor performance for many real-world situations.

In this section, we will take a look at Gaussian mixture models (GMMs), which can be viewed as an extension of the ideas behind k-means but can also be a powerful tool for estimation beyond simple clustering.

Here, I will explain Gaussian Mixture Models with the help of code.

Gaussian Mixture Model implementation in Python:

We start with the standard imports:

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
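
As a quick, minimal sketch of the API before the motivation (the synthetic blobs and the choice of four components are assumptions for illustration only):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with four well-separated groups (illustrative)
X, y_true = make_blobs(n_samples=400, centers=4, cluster_std=0.60, random_state=0)

# Fit a four-component Gaussian mixture and get hard labels plus soft probabilities
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)
probs = gmm.predict_proba(X)  # per-point cluster-membership probabilities

plt.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis')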

Motivating GMM: Weaknesses of k-Means

Suppose we take a…


In supervised learning, we know the labels of the data points and their distribution. However, the labels may not always be known. Clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. Clustering can either be semi-parametric or probabilistic.

1. K-Means Clustering:

K-Means Clustering is an iterative algorithm that starts with k randomly chosen points used as the mean values defining the clusters. Each data point belongs to the cluster whose mean value it is closest to. This mean-value coordinate is called the centroid.

Iteratively, the mean value of the data points of each cluster…
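
A minimal sketch of this procedure with scikit-learn (the synthetic blobs and k = 3 are illustrative assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Start from random centroids, then iteratively reassign points and update the means
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.cluster_centers_)  # final centroids
print(kmeans.labels_[:10])      # cluster membership of the first ten points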


Hierarchical Clustering finds groups in the data such that instances within a group are more similar to each other than to instances in different groups. The similarity measure is generally the Euclidean distance between data points, but city-block (Manhattan) and geodesic distances can also be used.

The data is broken down into clusters in a hierarchical fashion: the number of clusters is one at the top of the hierarchy and at its maximum (one cluster per point) at the bottom. The optimum number of clusters is selected from this hierarchy.

There are two main types of hierarchical clustering algorithms:

  • Agglomerative: Bottom-up approach. Start with many small clusters…
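
As a minimal sketch of the bottom-up (agglomerative) approach with scikit-learn (the synthetic data and the choice of three clusters are illustrative assumptions):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic 2-D data (illustrative)
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Ward linkage merges the pair of clusters that least increases within-cluster variance;
# it uses Euclidean distances between points
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])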

Bhanwar Saini

Data science enthusiast
