If we have a dataset that we suspect can be divided into distinct clusters (for example, customers who we think fall into different types), we can use an iterative algorithm called ‘K-means’ to find and refine these subgroups.
What’s the algorithm?
Choose K, the number of groups that you think your data is split into
Randomly assign each data point to one of these K clusters, and take the mean of each cluster’s points as its starting centroid
Repeat the following two steps until no data point’s cluster assignment changes:
1. Assignment step: Calculate the Euclidean distance between each data point and each of the K centroids. Assign each data point to its closest centroid.
2. Update step: Recalculate the centroids. Take the coordinates of all the data points in each cluster and calculate their mean. These means are the K new centroids.
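The steps above can be sketched in plain Python. This is a minimal sketch, not the course's implementation: the names `k_means`, `euclidean` and `mean_point`, the `seed` and `max_iter` parameters, and the trick of giving every cluster at least one point at the start are all assumptions made for illustration.

```python
import math
import random


def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of coordinate tuples."""
    rng = random.Random(seed)
    # Random initial assignment. The first k labels are 0..k-1 so that no
    # cluster starts empty (an assumption of this sketch, needs len(points) >= k).
    labels = list(range(k)) + [rng.randrange(k) for _ in points[k:]]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: each centroid becomes the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = mean_point(members)
        # Assignment step: move each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so we have converged
            break
        labels = new_labels
    return labels, centroids
```

Calling `k_means(points, 2)` on a list of coordinate tuples returns one label per point plus the final centroids. Because the starting assignment is random, a different `seed` can give a different (and sometimes worse) grouping.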
How do you measure “closeness”?
There are several ways of doing this, but the most common is Euclidean distance: the straight-line distance between two points.
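To make that concrete, here is a small sketch of Euclidean distance and of picking the closest centroid. The function name `euclidean_distance` and the example coordinates are made up for illustration.

```python
import math


def euclidean_distance(p, q):
    """Square root of the summed squared differences, coordinate by coordinate."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# A 3-4-5 right triangle: the points differ by 3 in x and 4 in y.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0

# Picking the closest of two hypothetical centroids for one data point.
point = (2, 2)
centroids = [(0, 0), (5, 5)]
closest = min(centroids, key=lambda c: euclidean_distance(point, c))
print(closest)  # (0, 0)
```

The same function works for any number of dimensions, as long as both points have the same number of coordinates.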
How do you choose your initial K?
In the Udemy course so far, we have not had to choose K, as the test data came already grouped into clusters. One way is the elbow method: you add more and more clusters and look at the percentage of variance explained as a function of the number of clusters. At some point the gain from adding another cluster diminishes sharply (the ‘elbow’ of the curve), and you choose that number as your K. I don’t really know what that means yet; will report back when I get to it.
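As a rough, hedged sketch of what the elbow method might look like in code: run K-means for several values of K and print the share of variance explained each time, then look for where the improvement flattens. Everything here is an assumption for illustration: the function names, the toy data, and the definition of "variance explained" as one minus the within-cluster sum of squares divided by the total sum of squares. It needs Python 3.8 or later for `math.dist`.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means_labels(points, k, seed=0, max_iter=100):
    """Compact K-means (random initial assignment); returns a label per point."""
    rng = random.Random(seed)
    # Make sure every cluster starts with at least one point.
    labels = list(range(k)) + [rng.randrange(k) for _ in points[k:]]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        for c in range(k):  # update step
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean_point(members)
        new_labels = [  # assignment step
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


def sum_of_squares(points):
    """Total squared distance from each point to the mean of the group."""
    centre = mean_point(points)
    return sum(math.dist(p, centre) ** 2 for p in points)


def variance_explained(points, labels):
    """Share of the total spread accounted for by the clustering (0 to 1)."""
    total = sum_of_squares(points)
    within = sum(
        sum_of_squares([p for p, lab in zip(points, labels) if lab == c])
        for c in set(labels)
    )
    return 1 - within / total


# Hypothetical toy data: three loose groups of 2-D points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
        (0, 10), (0, 11), (1, 10)]
for k in range(1, 6):
    labels = k_means_labels(data, k)
    print(k, round(variance_explained(data, labels), 3))
```

With K = 1 the value is always 0, since a single cluster explains none of the spread. A single random start can land on a poor grouping, so in practice you would probably rerun each K with several seeds and keep the best result before looking for the elbow.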
Useful links: