The idea of having newer algorithms come into the picture doesn’t make the older ones ‘completely redundant’. British statistician, George E. P. Box had once quoted that, “All models are wrong, but some are useful”, meaning that no model is exact enough to certify as cent percent accurate. Reverse claims can only lead to the loss of generalization. The most accurate thing to do is to find the most approximate model.
Clustering is an unsupervised learning technique where the aim is to group similar objects together. We are virtually living in a world where our past and present choices have become a dataset that can be clustered to identify patterns in our searches, shopping carts, the books we read, etc such that the machine algorithm is sophisticated enough to recommend the things to us. It is fascinating that the algorithms know much more about us then we ourselves can recognize!
As already discussed in the previous blog, K-means makes use of Euclidean distance as a metric to form the clusters. This leads to a variety of drawbacks as mentioned. Please refer to the blog to read about the K-means algorithm, implementation, and drawbacks: Clustering — Diving deep into K-means algorithm
The real-life data has outliers and is irregular in shape. K-means fails to address these important points and becomes unsuitable for arbitrary shaped, noisy data. In this blog, we are going to learn about an interesting density-based clustering approach — DBSCAN.
DBSCAN is a density-based clustering approach that separates regions with a high density of data points from the regions with a lower density of data points. Its fundamental definition is that the cluster is a contiguous region of dense data points separated from another such region by a region of the low density of data points. Unlike K-means clustering, the number of clusters is determined by the algorithm. Two important concepts are density reachability and density connectivity, which can be understood as follows:
A point is considered to be density reachable to another point if it is situated within a particular distance range from it. It is the criteria for calling two points as neighbors. Similarly, if two points A and B are density reachable (neighbors), also B and C are density reachable (neighbors), then by chaining approach A and C belong to the same cluster. This concept is called density connectivity. By this approach, the algorithm performs cluster propagation.
The key constructs of the DBSCAN algorithm that help it determine the ‘concept of density’ are as follows:
Epsilon ε(measure) : ε is the threshold radius distance which determines the neighborhood of a point. If a point is located at a distance less than or equal to ε from another point, it becomes its neighbor, that is, it becomes density reachable to it.
Choice of ε: The choice of ε is made in a way that the clusters and the outlier data can be segregated perfectly. Too large ε value can cluster the entire data as one cluster and too small value can classify each point as noise. In layman terms, the average distance of each point from its k-nearest neighbors is determined, sorted, and plotted. The point of maximum change (the elbows bend) determines the optimal value of ε.
Min points m (measure): It is a threshold number of points present in the ε distance of a data point that dictates the category of that data point. It is driven by the number of dimensions present.
Choice of Min points:Minimum value of Min points has to be 3. Larger density and dimensionality means larger value should be chosen. The formula to be used while assigning value to Min points is: Min points>= Dimensionality + 1
Core points (data points): A point is a core point if it has at least m number of points within radii of ε distance from it.
Border points (data points): A point that doesn’t qualify as a core point but is a neighbor of a core point.
Noise points (data points): An outlier point that doesn’t fulfill any of the above-given criteria.
The time complexity of the DBSCAN lies between O(n log n) (best case scenario) to O(n²) (worst case), depending upon the indexing structure, ε, and m values chosen.
As a part of the scikit-learn module, below is the code of DBSCAN with some of the hyperparameters set to the default value:
class sklearn.cluster.DBSCAN(eps=0.5, *, min_samples=5, metric='euclidean')
eps is the epsilon value as already explained.
min_samples is the Min points value.
metric is the process by which distance is calculated in the algorithm. By default, it is Euclidean distance, other than that it can be any user-defined distance function or a ‘precomputed’ distance matrix.
There are some advanced hyperparameters which will be best discussed in future projects.
To overcome these, other advanced algorithms have been designed which will be discussed in future blogs.
Stay tuned. Happy learning :)