Data Science: What is K means?

If we have a dataset that we suspect can be divided into specific clusters, for example, customers who we think might fall into different types, we can use an iterative algorithm called ‘K means’ to find and refine these subgroups.

What’s the algorithm?

Choose K, the number of groups that you think your data is split into

Randomly assign each data point to one of these K clusters, then calculate each cluster’s centroid (the mean coordinate of its points)

Repeat the following two steps until no data point’s cluster assignment changes:

1. Assignment step: Calculate the Euclidean distance between each data point and each of the K centroids. Assign the data point to the cluster whose centroid is closest.

2. Update step: Recalculate the centroids. For each cluster, take the coordinates of all its data points and calculate the mean coordinate. These means are the new K centroids.
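To make the loop above concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the course. The names `k_means` and `mean_point`, the `seed` and `max_iter` parameters, and the way an empty cluster is handled (re-seeding it with a random data point) are all assumptions. Because the starting point is a random assignment, each pass recalculates the centroids first and then reassigns the points.

```python
import math
import random


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def k_means(points, k, seed=0, max_iter=100):
    """Return (labels, centroids) for a list of equal-length coordinate tuples."""
    rng = random.Random(seed)
    # Random initial assignment of every point to one of the k clusters.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):
        # Update step: each centroid becomes the mean of its cluster's points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random point.
            centroids[c] = mean_point(members) if members else rng.choice(points)
        # Assignment step: move each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no assignment changed, so the algorithm has converged
        labels = new_labels
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random start.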

How do you measure “closeness”?

There are several ways of doing this, but the most common measure of “closeness” is the Euclidean distance: the straight-line distance between two points.
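As a quick illustration, the Euclidean distance is the square root of the sum of the squared differences between the two points’ coordinates. The helper name `euclidean_distance` below is just my own choice for this sketch.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A 3-4-5 right triangle: the distance from (0, 0) to (3, 4) is 5.
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```

On Python 3.8 or later, the built-in `math.dist(a, b)` computes the same quantity.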

How do you choose your initial K?

In the Udemy course, so far, we have not yet had to choose K, as the test data came already grouped into clusters. One way is to use the elbow method: run the algorithm with more and more clusters, plot the percentage of variance explained as a function of the number of clusters, and choose as your K the point where adding another cluster stops giving much improvement (the “elbow” of the curve). I don’t really know what that means yet; I will report back when I get to it.
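Here is a rough sketch of what “percentage of variance explained” could look like in code, under my own assumptions. The explained fraction is taken as 1 minus the within-cluster sum of squares divided by the total sum of squares. The function names are made up for illustration, and the cluster labels for each K are written by hand on a tiny dataset rather than produced by actually running K means.

```python
def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sum_of_squares(points):
    """Sum of squared Euclidean distances from each point to the group's mean."""
    centre = mean_point(points)
    return sum(sum((x - c) ** 2 for x, c in zip(p, centre)) for p in points)


def variance_explained(points, labels):
    """Fraction of total variance explained by a clustering (0.0 to 1.0).

    Assumes the points are not all identical, otherwise the total is zero.
    """
    total = sum_of_squares(points)
    within = 0.0
    for cluster in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == cluster]
        within += sum_of_squares(members)
    return 1 - within / total


# Two obvious groups: one near x = 0 and one near x = 10.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
# Hand-written labels for K = 1..4 (illustrative, not the output of K means).
labelings = {
    1: [0, 0, 0, 0],
    2: [0, 0, 1, 1],
    3: [0, 0, 1, 2],
    4: [0, 1, 2, 3],
}
for k, labels in labelings.items():
    print(k, round(variance_explained(points, labels), 3))
```

On this data the explained fraction jumps from 0 at K = 1 to about 0.96 at K = 2, and then only creeps up towards 1. That sharp bend at K = 2 is the elbow.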

Useful links:

Towards Data Science

Wikipedia

Data Science: Python

From speaking to data scientists, economists and machine learning analysts, it seems like there are two main choices for mainstream programming languages: Python and R, with Julia coming in third. Some interesting discussions here

Python is the fastest-growing of these, as it has such comprehensive machine learning libraries, so I’ve decided to go for Python. There are so many free and paid online courses that, for now, I think working through those will suffice. I think this is a sensible overall plan for someone interested in getting ready for data science:

This is a great Medium post on how to go about creating your own data science learning experience, ranking all kinds of online courses.

1. Some of the courses available:

Udemy

Datacamp

Coursera

2. Doing data science practice questions

Udemy course

Analytics Vidhya

Project Euler

3. Flashcards for myself testing basic statistics and other machine learning concepts

4. Set up a GitHub page with my own code that others can see? Not quite ready for this step yet.