Clustering belongs to the family of unsupervised learning algorithms. The core idea is to group (partition) unlabelled data points based on their features, so that items in one group have similar properties. Anyone acquainted with even the most basic machine learning concepts will know the idea of features and labels. Unsupervised learning works on features alone, and is sometimes used as a preprocessing step to discover labels. You may at some point have come across a computer program asking you to tell cars from buses, or cats from dogs. You do it with your visual capabilities; the computer does it with a clustering algorithm running underneath, grouping the images based on their features.

Consider that you have a huge dataset with features of your college's students for the past ten years. Features can be marks in different technical and non-technical subjects, annual family income, a grading of soft skills, participation in extracurricular activities, the number of suspensions, whether the person prefers tea or coffee, and other details. In the data-preprocessing step, irrelevant features can be omitted. After applying a clustering algorithm, we can obtain clusters as in the image below:

Now this can work as target segmentation: based on the features, we know which group of students to target for skilling programs, which group to recommend for specific internships, and so on.

This was a simple use-case. Another well-known example is Google News. After enough verification, extraction, and preprocessing of news articles (using techniques such as TF-IDF and named entity recognition), a cluster of related articles appears whenever we search for a particular story or keyword. Yet another use-case is building color palettes, which groups similar shades of color together, a technique often used in image compression.
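The color-palette idea can be sketched in a few lines of scikit-learn: cluster the pixel colors of an image and replace every pixel with the centroid of its cluster, leaving only as many colors as there are clusters. The tiny "image" below is a synthetic array standing in for real pixel data.

```python
import numpy as np
from sklearn.cluster import KMeans

# A synthetic 4x4 RGB "image": a mix of reddish and bluish shades.
image = np.array([[200, 10, 10], [210, 15, 5], [10, 10, 220], [5, 20, 210]] * 4,
                 dtype=float).reshape(4, 4, 3)

pixels = image.reshape(-1, 3)            # flatten to (n_pixels, 3)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pixels)

# Replace each pixel with its cluster centroid: a 2-color palette.
quantized = km.cluster_centers_[km.labels_].reshape(image.shape)
```

On a real photograph the same three lines compress thousands of distinct colors down to `n_clusters` palette entries.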

Clustering algorithms fall broadly into two types:

Flat clustering: Algorithms where the number of clusters is chosen by the data scientist.

Hierarchical clustering: Algorithms with a top-down/bottom-up approach, where the algorithm decides the number of clusters.

Let’s discuss one of the well-known algorithms:

K-means clustering is a centroid-based algorithm that assigns each point to a cluster based on its Euclidean distance to the cluster's centroid. It is an iterative algorithm with two main steps, the assignment step and the update step. Let us have a look at the algorithm:

- K <- Choose the number of clusters.
- Initialization: select K data points as the initial centroids and assign every data point to its nearest centroid (Forgy method), or assign a random cluster to each data point (Random Partition method).
- Update step: recalculate each centroid as the mean of the data points assigned to its cluster. Assignment step: reassign each data point to its nearest centroid by Euclidean distance.
- Iterate until the centroids no longer change, at which point the cluster assignments have stabilized.
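The steps above can be sketched in plain NumPy (a minimal sketch using Forgy initialization; the function name and signature are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Forgy initialization: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points
        # (an empty cluster keeps its old centroid).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # converged: centroids stopped moving
            break
        centroids = new
    return centroids, labels
```

For example, six 2-D points forming two tight groups separate cleanly with `kmeans(X, 2)`.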

Keep changing the value of K, and use the elbow method to find the best-suited value of K.

The total within-cluster sum of squares (WCSS), also called total intra-cluster variation or "inertia", has to be minimized. To calculate the total WCSS, sum the squared Euclidean distances of all data points from their respective cluster centroids. As K increases, clusters get smaller and points sit closer to their centroids, so WCSS always decreases. A plot of K versus total WCSS therefore indicates an appropriate value of K, one where the trade-off with computation is fair. The knee (or elbow) of this curve is taken to suggest an optimal value of K.
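With scikit-learn, the elbow curve is just the `inertia_` attribute recorded over a range of K values (the blob data here is synthetic, generated with four true centers):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data drawn around 4 well-separated centers.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for K = 1..8 and record the total WCSS (inertia) of each fit.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in range(1, 9)}

for k, wcss in inertias.items():
    print(k, round(wcss, 1))
```

Inertia always decreases as K grows; the elbow is where the drop flattens out, which for this data should appear around K = 4.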

K-means is also part of the scikit-learn module; below is its signature with some of the hyperparameters explained:

class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10,
                             max_iter=300, tol=0.0001, random_state=None)

*n_clusters:* The number of clusters you want to generate; the algorithm will generate the same number of centroids.

*init:* The method used to choose the initial centroids. Values can be {'k-means++', 'random', an ndarray, or a callable}.

*n_init:* The number of times the centroids are re-initialized and K-means re-run; the run whose result has the lowest inertia is kept.

*max_iter:* For each centroid initialization, K-means iterates at most max_iter times.

*random_state:* Seed for the random number generator used in centroid initialization; fixing it makes runs reproducible.

*tol:* Tolerance on the change in the within-cluster sum of squared errors between two consecutive iterations, used to declare convergence.
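A minimal usage sketch tying these hyperparameters together (the six 2-D points are synthetic, chosen to form two obvious groups):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

km = KMeans(n_clusters=2,       # two centroids will be generated
            init='k-means++',   # smarter-than-random initial seeding
            n_init=10,          # 10 restarts; the lowest-inertia run is kept
            max_iter=300,       # per-restart iteration cap
            tol=1e-4,           # convergence tolerance
            random_state=42)    # seed for reproducible initialization
labels = km.fit_predict(X)

print(labels)                   # cluster index assigned to each point
print(km.cluster_centers_)      # final centroid coordinates
print(km.inertia_)              # total WCSS of the chosen run
```

Points 0, 1, and 4 land in one cluster and points 2, 3, and 5 in the other, matching their visual grouping.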

Like every algorithm, K-means has its limitations:

- The number of clusters must be decided by the data scientist.
- Centroid initialization is random, and the chances of getting stuck at a local minimum of the cost function are fairly high.
- Different centroid selections will give different clusters.
- K-means doesn't perform well on clusters of varying density and size.
- Outliers can stretch the centroids; in some cases an outlier even gets a centroid to itself.
- High dimensionality is a problem for K-means: the Euclidean distances the algorithm relies on become less and less discriminative as dimensions grow (the curse of dimensionality).
- K-means assumes clusters are roughly spherical and does poorly on clusters with complicated geometric shapes.
- K-means struggles when data points overlap between clusters.

To overcome these, other advanced algorithms have been designed which will be discussed in future blogs.
