This blog post demonstrates how we can group wine samples into clusters in order to identify anomalies. Identifying these anomalies provides actionable insight: we can start to investigate why those samples are exceptional.
The overall approach is to take a wine sampling data set (chemical analysis of wine samples) and apply an unsupervised machine learning algorithm called “DBSCAN” to classify the samples into groups based on the spatial density of the data points.
What is DBSCAN?
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It clusters data by separating high-density regions of points from low-density regions. As a result, the algorithm also identifies the noise points (outliers) in a set of data points. It sounds complicated, but it is simple and easy to apply.
DBSCAN is an example of unsupervised learning (a branch of machine learning and hence a subset of artificial intelligence) and belongs to the family of density-based clustering algorithms. Before proceeding further, we need to understand what unsupervised learning is.
Unsupervised (machine) learning algorithms infer patterns from a dataset without reference to known, or labelled, outcomes. The term ‘density-based algorithm’ means that we group data points according to how densely they are packed in a region.
What follows is a technical dive into the approach taken.
How to implement DBSCAN
The two main parameters of the DBSCAN algorithm are ε (epsilon) and minPoint, which are defined as:
ε = Radius of the neighbourhood region
Two points are considered neighbours if the distance between them is at most ε. The value of ε is chosen by us; the distance itself is measured here as the Euclidean distance between the points. For example, for two points X = (x₁, y₁) and Y = (x₂, y₂) on a two-dimensional plane, the distance is
d(X, Y) = √((x₂ − x₁)² + (y₂ − y₁)²)
By picking larger values of ε, more points become density-reachable (fewer outliers found), and by choosing smaller values of ε, fewer points become density-reachable (more outliers found).
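To make the neighbourhood test concrete, here is a minimal sketch in plain Python. The helper names `euclidean_distance` and `are_neighbours` and the sample coordinates are illustrative assumptions, not part of any library. The `<=` comparison treats ε as the maximum allowed distance between neighbours.

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two 2-D points given as (x, y) tuples."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def are_neighbours(p, q, eps):
    """True if q lies within the eps-neighbourhood of p."""
    return euclidean_distance(p, q) <= eps

# Hypothetical points chosen so the distance is easy to check by hand.
X = (1.0, 2.0)
Y = (4.0, 6.0)

print(euclidean_distance(X, Y))        # 5.0 (a 3-4-5 triangle)
print(are_neighbours(X, Y, eps=0.2))   # False
print(are_neighbours(X, Y, eps=5.0))   # True
```

With a small ε the two points are not neighbours; once ε grows to cover the distance between them, they are.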
minPoint = Minimum number of points that must be present within the neighbourhood
We can set minPoint to suit our needs. For example, if a point should have at least 10 points within its neighbourhood to count as a core point, we set minPoint to 10, and so on.
Based on ε and minPoint, each data point is classified as one of three types: a core point, a border point or a noise point (outlier). The figure below illustrates the scenario.
Core point = A data point is a core point if at least ‘minPoint’ points lie within its ε-neighbourhood. For example, if minPoint is five and a data point has only three points in its neighbourhood, we can’t classify it as a core point because it doesn’t satisfy the requirement.
Border point = A data point is a border point if it has fewer than ‘minPoint’ points within its ε-neighbourhood but at least one of them is a core point. For example, if minPoint is five and a data point has only three points in its neighbourhood, one of which is a core point within a distance of ε, it is a border point.
Noise point = Noise points are the outliers, and finding them is our goal in using the DBSCAN algorithm. A data point is a noise point if it is neither a core point nor a border point. Such points can be read as extreme values, unexpected occurrences or behaviour that differs from the regular pattern.
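The three definitions above can be sketched directly in NumPy. This is an illustrative from-scratch version, not scikit-learn's implementation; the function name `classify_points` and the sample coordinates are assumptions made for the example. It counts a point as a member of its own neighbourhood, which is the convention scikit-learn also follows.

```python
import numpy as np

def classify_points(points, eps, min_points):
    """Label each point as 'core', 'border' or 'noise'."""
    points = np.asarray(points, dtype=float)
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # is_neighbour[i, j] is True when j lies within eps of i (i itself included).
    is_neighbour = dists <= eps
    # Core: at least min_points points inside the neighbourhood.
    is_core = is_neighbour.sum(axis=1) >= min_points
    # Border: not core, but at least one core point inside the neighbourhood.
    has_core_neighbour = (is_neighbour & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core",
                    np.where(has_core_neighbour, "border", "noise"))

# Hypothetical data: a tight square of four points, one point on its edge
# and one point far away.
sample = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.25, 0.1], [5.0, 5.0]]
kinds = classify_points(sample, eps=0.2, min_points=4)
print(kinds.tolist())
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The fifth point has only three points in its neighbourhood (itself and two core points), so it is a border point; the last point has no neighbours at all and is noise.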
Implementation of DBSCAN in Python
We can implement the DBSCAN algorithm in Python with scikit-learn, and the procedure is really simple.
Step 1: Import the required libraries
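A plausible set of imports for the steps in this walkthrough is sketched here. The exact list is an assumption; `StandardScaler` is only needed if you decide to normalise several variables.

```python
import numpy as np                                 # numerical arrays
import pandas as pd                                # reading and inspecting the data
import matplotlib.pyplot as plt                    # plotting clusters and outliers
from sklearn.cluster import DBSCAN                 # the clustering algorithm
from sklearn.preprocessing import StandardScaler   # optional feature scaling
```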
Step 2: Read the dataset
The wine quality dataset is available on the UCI Machine Learning Repository website, but it has many bad records, so I cleaned it manually before using it here. You can download it by clicking the link.
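A minimal sketch of reading the data with pandas follows. The in-memory CSV is a made-up stand-in so that the snippet runs on its own; with the real file you would pass its path instead. The helper name `load_wine` is hypothetical, and dropping rows with missing values is just one simple guard against bad records, not the manual cleaning described above.

```python
import io
import pandas as pd

# Made-up rows for illustration only; they are not taken from the real file.
SAMPLE_CSV = """fixed acidity,volatile acidity,quality
7.4,0.70,5
7.8,0.88,5
11.2,0.28,6
"""

def load_wine(source, sep=","):
    """Read a wine CSV from a path or file-like object and drop incomplete rows."""
    df = pd.read_csv(source, sep=sep)
    return df.dropna().reset_index(drop=True)

df = load_wine(io.StringIO(SAMPLE_CSV))
print(df.shape)
print(df.head())
```

The raw UCI files are semicolon-separated, so for those you would call `load_wine(path, sep=";")`.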
Step 3: Get a basic idea of the data
Although we have many variables here, we take only fixed acidity and volatile acidity for the analysis. We could use more variables, but they would need to be normalised first; for simplicity we use only two variables.
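The selection and the optional normalisation could look like the sketch below. The data frame holds made-up rows so that the snippet is self-contained, and using `StandardScaler` is one common choice of normalisation, assumed here rather than taken from the original analysis.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up rows standing in for the wine data frame.
df = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 11.2, 6.9, 8.1],
    "volatile acidity": [0.70, 0.88, 0.28, 0.55, 0.43],
    "alcohol": [9.4, 9.8, 9.8, 10.5, 11.0],
})

# Two-variable case used in this post: take the raw columns as they are.
X = df[["fixed acidity", "volatile acidity"]].to_numpy()

# Multi-variable case: rescale every column to zero mean and unit variance,
# otherwise the variable with the largest range dominates the distance.
X_scaled = StandardScaler().fit_transform(df.to_numpy())

print(X.shape, X_scaled.shape)
```

Note that scaling changes the units of distance, so a value of ε chosen on raw data does not carry over to scaled data.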
Step 4: Define the model
Here we have set the epsilon distance to 0.2 units and the minimum number of samples in a point's neighbourhood to 20. So any data point with at least 20 points within a distance of 0.2 becomes a core point; a point with fewer than that but with a core point in its neighbourhood becomes a border point; a point with neither becomes a noise point (outlier).
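In scikit-learn, ε and minPoint are passed as `eps` and `min_samples`. The sketch below uses the values from this step on a small synthetic data set (a dense grid plus three isolated points), which is an assumption made so the snippet runs without the wine file.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in for the two acidity columns: a dense 10 x 10 grid of
# points plus three isolated points placed far away from it.
grid = np.array([[0.05 * i, 0.05 * j] for i in range(10) for j in range(10)])
far_points = np.array([[5.0, 5.0], [-3.0, 4.0], [8.0, -2.0]])
X = np.vstack([grid, far_points])

# eps is the neighbourhood radius, min_samples the minimum neighbour count.
model = DBSCAN(eps=0.2, min_samples=20)
model.fit(X)

labels = model.labels_                # one label per point, -1 means noise
print(sorted(set(labels.tolist())))   # [-1, 0]
```

The grid forms a single cluster with label 0, and the three far-away points receive the label -1.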
Step 5: Check the count in each cluster
In scikit-learn's output, any point labelled -1 is a noise point, while the labels 0, 1, 2 and so on are the indices of the clusters that the remaining points belong to. The labels do not distinguish core points from border points; the core points are reported separately. From our analysis we can see that DBSCAN identified 117 outliers, and some of them have been printed out.
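A sketch of how the counts can be obtained is given here, again on synthetic stand-in data rather than the wine file. It counts the points per label and then separates core, border and noise points using the `core_sample_indices_` attribute of the fitted model.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic stand-in data: a dense grid plus three isolated points.
grid = np.array([[0.05 * i, 0.05 * j] for i in range(10) for j in range(10)])
far_points = np.array([[5.0, 5.0], [-3.0, 4.0], [8.0, -2.0]])
X = np.vstack([grid, far_points])

model = DBSCAN(eps=0.2, min_samples=20).fit(X)
labels = model.labels_

# Number of points under each label (-1 is noise, 0, 1, ... are clusters).
values, counts = np.unique(labels, return_counts=True)
label_counts = {int(v): int(c) for v, c in zip(values, counts)}
print(label_counts)

# Core points are listed by index; border points are the clustered points
# that are not core points.
is_core = np.zeros(len(X), dtype=bool)
is_core[model.core_sample_indices_] = True
is_noise = labels == -1
is_border = ~is_core & ~is_noise
print("core:", int(is_core.sum()), "border:", int(is_border.sum()),
      "noise:", int(is_noise.sum()))

# Coordinates of the outliers.
print(X[is_noise])
```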
Step 6: Visualize the outliers
From the figure we can see that the points in maroon are the outliers. Now that we have identified the outliers, we can remove them or set them aside for further analysis, depending on our requirement. As our dataset is about wine quality, these 117 samples differ from the normal ones, so the authorities should inspect them more closely.
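A plot of this kind could be produced along the following lines. The helper name `plot_outliers`, the colours for clustered points, the output filename and the synthetic data are all assumptions for illustration; only the maroon colour for outliers follows the description above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN

def plot_outliers(X, labels, x_name="fixed acidity", y_name="volatile acidity"):
    """Scatter plot with clustered points in grey and noise points in maroon."""
    is_noise = labels == -1
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(X[~is_noise, 0], X[~is_noise, 1], s=15, c="lightgrey", label="clustered")
    ax.scatter(X[is_noise, 0], X[is_noise, 1], s=25, c="maroon", label="outlier")
    ax.set_xlabel(x_name)
    ax.set_ylabel(y_name)
    ax.legend()
    return fig, ax

# Synthetic stand-in data: a dense grid plus three isolated points.
grid = np.array([[0.05 * i, 0.05 * j] for i in range(10) for j in range(10)])
far_points = np.array([[5.0, 5.0], [-3.0, 4.0], [8.0, -2.0]])
X = np.vstack([grid, far_points])

labels = DBSCAN(eps=0.2, min_samples=20).fit_predict(X)
fig, ax = plot_outliers(X, labels)
fig.savefig("dbscan_outliers.png", dpi=150)
```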
I hope DBSCAN is now clear. You can download the full code from here.
Advantages of DBSCAN
It can identify outliers in data of any shape, whereas standard k-means and k-median identify clusters well only when they are roughly circular.
It can identify clusters whatever the shape of the data, and it is simple to implement.
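The difference in handling cluster shape can be illustrated on the classic two-moons toy data set, as in the sketch below. The data set, the parameter values and the use of the adjusted Rand index as a score are choices made for this example and are not part of the wine analysis.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clusters that are clearly not round.
X, true_labels = make_moons(n_samples=300, noise=0.05, random_state=0)

dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping,
# values near 0 mean the grouping is no better than chance.
ari_dbscan = adjusted_rand_score(true_labels, dbscan_labels)
ari_kmeans = adjusted_rand_score(true_labels, kmeans_labels)
print(f"DBSCAN ARI:  {ari_dbscan:.2f}")
print(f"k-means ARI: {ari_kmeans:.2f}")
```

DBSCAN follows each moon along its curve, while k-means splits the plane with a straight boundary and so mixes the two moons.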
Disadvantages of DBSCAN
It is sensitive to ε and minPoint: the set of outliers changes with every combination of values.
If the data contains regions of varying density, it is tricky to identify clusters and noise points, because a single ε and minPoint cannot suit every region.
DBSCAN separates dense regions from sparse regions well, but it struggles to separate adjacent clusters of similar density when there is no low-density gap between them.
DBSCAN struggles with high-dimensional data, because distances become less informative as the number of dimensions grows. Hence, we often need the additional task of feature selection before passing the data to DBSCAN.
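The sensitivity to ε, and the difficulty with varying densities, can be seen by sweeping ε over a data set that contains one tight and one loose group. The sketch below uses synthetic data and arbitrary ε values, both assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-in: one tight blob and one much looser blob.
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.1, size=(200, 2)),
    rng.normal(loc=[2.0, 2.0], scale=0.4, size=(200, 2)),
])

noise_counts = {}
for eps in (0.05, 0.1, 0.2, 0.4):
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    noise_counts[eps] = int((labels == -1).sum())
    print(f"eps={eps}: {n_clusters} clusters, {noise_counts[eps]} noise points")
```

With minPoint held fixed, the number of noise points can only fall or stay the same as ε grows, so a value that suits the tight blob tends to flag much of the loose blob as noise.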
Other potential applications of DBSCAN
Anomaly detection in temperature data, sales data and X-ray image cells.
Clustering of data.
Identification of abnormal behaviour in the stock market.