
6/6/2018 K-Means Clustering in Excel

K-Means Clustering in Excel

Posted on October 24, 2015 / January 7, 2016 by bquanttrading


In this post I want to present a very popular clustering algorithm used in machine learning. The k-means
algorithm is an unsupervised algorithm that allocates unlabeled data into a preselected number of K clusters.
A stylized example is presented below to help with the exposition. Let's say we have 256 observations, which
are plotted below.

(https://quantmacro.files.wordpress.com/2015/10/1.jpg)

Our task is to assign each observation to one of three clusters. We can start by randomly generating 3
cluster centroids.

(https://quantmacro.files.wordpress.com/2015/10/2.jpg)

We can then proceed in an iterative fashion by assigning each observation to one of the centroids based on
some distance measure. If an observation is closest to centroid k1, it is assigned to that centroid. The most
common distance measure is the Euclidean distance, which is the length of the straight line that joins two
points. More on this distance measure here: https://en.wikipedia.org/wiki/Euclidean_distance

The formula for a Euclidean distance between point Q =(q1,q2,q3,…qn) and point P=(p1,p2,p3,…pn) is given
as:

(https://quantmacro.files.wordpress.com/2015/10/3.jpg)
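
The formula itself is only available as an image in the original post. For reference, the standard definition of
the Euclidean distance between P and Q, written with the same symbols as above, is:

```latex
d(P, Q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
        = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + \dots + (q_n - p_n)^2}
```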

We can implement this measure in VBA with the function below, which takes two arrays and the length of the
arrays as input and computes the distance.

(https://quantmacro.files.wordpress.com/2015/10/4.jpg)
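
The listing above is only an image, so here is a minimal sketch of what such a function might look like. It is not
the original code: the name EuclideanDistance, the 1-based arrays and the demo procedure are assumptions made
for illustration. The demo uses the first observation and the first centroid (3, 3) from the worked example in
this post.

```vba
Option Explicit

' Euclidean distance between two 1-based arrays p and q of length n.
' (Hypothetical name and signature; the original listing is not reproduced.)
Public Function EuclideanDistance(p As Variant, q As Variant, ByVal n As Long) As Double
    Dim i As Long
    Dim total As Double

    For i = 1 To n
        total = total + (q(i) - p(i)) ^ 2   ' accumulate squared differences
    Next i
    EuclideanDistance = Sqr(total)
End Function

' Demo: distance from the first observation to the first centroid (3, 3).
Public Sub DemoEuclideanDistance()
    Dim obs(1 To 2) As Double, c1(1 To 2) As Double

    obs(1) = -0.24513: obs(2) = 5.740192
    c1(1) = 3: c1(2) = 3
    Debug.Print Round(EuclideanDistance(obs, c1, 2), 2)   ' about 4.25
End Sub
```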

In our example we have two features for each observation. Starting with the first observation, we have
x1 = -0.24513 and x2 = 5.740192.

(https://quantmacro.files.wordpress.com/2015/10/5.jpg)

Using the first observation we can calculate the distance to each of the centroids. For the first centroid we have
sqrt((-.24513-3)^2+(5.740192-3)^2) = 4.25. The distance to the second centroid is
sqrt((-.24513-6)^2+(5.740192-2)^2) = 7.28, and the distance to the third centroid is
sqrt((-.24513-8)^2+(5.740192-5)^2) = 8.28. Since the smallest distance is to the first centroid, we
assign this observation to the first centroid's cluster.

After repeating this process for all 256 observations, every observation is assigned to one of the three clusters. I
plot them below in different colours, along with each centroid.

(https://quantmacro.files.wordpress.com/2015/10/6.jpg)

We can loop through all of our objects and find the closest centroid in VBA using the function below. It
returns an array of the same length as our object array, with each entry being a cluster number between 1 and K.

(https://quantmacro.files.wordpress.com/2015/10/7.jpg)
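
Since the listing is only an image, the assignment step might be sketched roughly as follows. This is an
illustration rather than the original code: the name FindClosestCentroid and the layout (X as an m-by-j matrix,
centroids as a K-by-j matrix, both 1-based) are assumptions. It compares squared distances, which gives the same
nearest centroid as the Euclidean distance while skipping the square root.

```vba
Option Explicit

' X: m-by-j matrix of observations; centroids: K-by-j matrix (both 1-based).
' Returns a 1-based array of length m whose entries are cluster labels 1..K.
' (Hypothetical name and layout; not the original listing.)
Public Function FindClosestCentroid(X As Variant, centroids As Variant) As Variant
    Dim m As Long, j As Long, K As Long
    Dim i As Long, c As Long, f As Long
    Dim dist As Double, bestDist As Double
    Dim idx() As Long

    m = UBound(X, 1): j = UBound(X, 2): K = UBound(centroids, 1)
    ReDim idx(1 To m)

    For i = 1 To m
        For c = 1 To K
            ' Squared Euclidean distance from observation i to centroid c.
            dist = 0
            For f = 1 To j
                dist = dist + (X(i, f) - centroids(c, f)) ^ 2
            Next f
            ' Keep the nearest centroid so far; ties keep the lower label.
            If c = 1 Or dist < bestDist Then
                bestDist = dist
                idx(i) = c
            End If
        Next c
    Next i
    FindClosestCentroid = idx
End Function
```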

At this point we move on to the next step of the iterative process and recalculate the centroid of each cluster.
That is, we group our data by the cluster they have been assigned to, take the average of each feature
within that cluster, and make those averages the new centroid.

In our example we get the results below.

(https://quantmacro.files.wordpress.com/2015/10/8.jpg)

In VBA we can accomplish this with the function below:

(https://quantmacro.files.wordpress.com/2015/10/9.jpg)
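
The update step described above could be sketched like this. Again, this is not the original listing: the
signature is an assumption, and leaving an empty cluster's centroid at zero is just one possible convention.

```vba
Option Explicit

' X: m-by-j observations; idx: labels 1..K for each row; K: number of clusters.
' Returns a K-by-j matrix holding the per-feature mean of each cluster.
' (Hypothetical signature; not the original listing.)
Public Function ComputeCentroids(X As Variant, idx As Variant, ByVal K As Long) As Variant
    Dim m As Long, j As Long
    Dim i As Long, c As Long, f As Long
    Dim sums() As Double, counts() As Long

    m = UBound(X, 1): j = UBound(X, 2)
    ReDim sums(1 To K, 1 To j)
    ReDim counts(1 To K)

    ' One pass over the data: accumulate feature sums and member counts.
    For i = 1 To m
        c = idx(i)
        counts(c) = counts(c) + 1
        For f = 1 To j
            sums(c, f) = sums(c, f) + X(i, f)
        Next f
    Next i

    ' Turn sums into means; an empty cluster is left at zero (an assumption).
    For c = 1 To K
        If counts(c) > 0 Then
            For f = 1 To j
                sums(c, f) = sums(c, f) / counts(c)
            Next f
        End If
    Next c
    ComputeCentroids = sums
End Function
```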

We can again calculate the distance between each object and the new centroids and reassign each object to a
cluster based on Euclidean distance. I have plotted the results below:

(https://quantmacro.files.wordpress.com/2015/10/10.jpg)

Notice that the centroids have moved and the reassignment appears to improve the homogeneity of the clusters.
We can repeat this process until a maximum number of iterations has been completed or until no object
switches cluster.

Below are a few more iterations of the algorithm so you can see the dynamics of the cluster assignment and
centroid means.

(https://quantmacro.files.wordpress.com/2015/10/11.jpg)

As you can see, after only 5 iterations the clusters look sensible. I want to remind you that we only
specified the number of clusters we were looking for, and the algorithm found the clusters for us. This is an
example of an unsupervised machine learning algorithm.

Notice how simple the final loop of the algorithm is. After loading the data and building the above
functions, we simply loop through the calculations a preset number of times with the code below:

(https://quantmacro.files.wordpress.com/2015/10/12.jpg)
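
As a rough, self-contained sketch of that loop (not the original listing), the function below alternates the
assignment and update steps for at most maxIter iterations and also stops early once no observation switches
cluster. The name RunKMeans and its arguments are assumptions, and the two steps are written inline here rather
than as separate functions so the block runs on its own.

```vba
Option Explicit

' X: m-by-j data; centroids: K-by-j starting values, overwritten with the final
' centroids; maxIter: iteration cap. Returns the label (1..K) of each row.
' (Hypothetical name and signature; not the original listing.)
Public Function RunKMeans(X As Variant, centroids As Variant, ByVal maxIter As Long) As Variant
    Dim m As Long, j As Long, K As Long
    Dim i As Long, c As Long, f As Long, iter As Long, best As Long
    Dim d As Double, bestD As Double, changed As Boolean
    Dim idx() As Long, counts() As Long, sums() As Double

    m = UBound(X, 1): j = UBound(X, 2): K = UBound(centroids, 1)
    ReDim idx(1 To m)

    For iter = 1 To maxIter
        changed = False
        ReDim sums(1 To K, 1 To j)      ' ReDim without Preserve resets to zero
        ReDim counts(1 To K)

        ' Assignment step: nearest centroid by squared Euclidean distance.
        For i = 1 To m
            For c = 1 To K
                d = 0
                For f = 1 To j
                    d = d + (X(i, f) - centroids(c, f)) ^ 2
                Next f
                If c = 1 Or d < bestD Then
                    bestD = d
                    best = c
                End If
            Next c
            If best <> idx(i) Then changed = True
            idx(i) = best
            counts(best) = counts(best) + 1
            For f = 1 To j
                sums(best, f) = sums(best, f) + X(i, f)
            Next f
        Next i

        ' Stop early once no observation switches cluster.
        If Not changed Then Exit For

        ' Update step: move each non-empty centroid to its cluster mean.
        For c = 1 To K
            If counts(c) > 0 Then
                For f = 1 To j
                    centroids(c, f) = sums(c, f) / counts(c)
                Next f
            End If
        Next c
    Next iter
    RunKMeans = idx
End Function
```

Keeping an empty cluster's centroid where it was, instead of resetting it, is a design choice made for this
sketch; other conventions are possible.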

The centroids are returned to the range at cell L20. In our example, after 10 loops we have

(https://quantmacro.files.wordpress.com/2015/10/13.jpg)

We can check our work with R’s NbClust package.

(https://quantmacro.files.wordpress.com/2015/10/14.jpg)

For another example of k-means clustering, we can use this approach to cluster countries based on analyst
forecasts of some macro variables. For example, we can compile a list of countries and consider Bloomberg
consensus forecasts for GDP growth, inflation, CB policy rate, unemployment rate, current account balance as
a ratio of GDP, and budget deficit. Let’s say we wish to cluster these countries into four clusters for further
analysis, and we wish to discard conventional groupings used in the markets such as geography, development
status, or other marketing-based conventions (BRIC being an example).

(https://quantmacro.files.wordpress.com/2015/10/15.jpg)

At this point I should highlight what is probably already obvious: we need to standardize the data. When
features have different scales, our distance measure will be dominated by the features with relatively large
values. The clusters will then be driven by those dominant features instead of all features being weighted equally.

Here, for each feature, we can simply subtract the minimum from each observation and divide by the range
(max-min) of its values to standardize. Below is our processed table:

(https://quantmacro.files.wordpress.com/2015/10/16.jpg)
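
The min-max scaling described above, (x - min) / (max - min) applied feature by feature, might be sketched in
VBA as follows. The function name MinMaxScale is an assumption, and so is the decision to map a constant column
(zero range) to 0 rather than raise an error.

```vba
Option Explicit

' Rescales every column of the m-by-j matrix X to [0, 1]
' using (x - min) / (max - min). (Hypothetical helper name.)
Public Function MinMaxScale(X As Variant) As Variant
    Dim m As Long, j As Long, i As Long, f As Long
    Dim lo As Double, hi As Double
    Dim scaled() As Double

    m = UBound(X, 1): j = UBound(X, 2)
    ReDim scaled(1 To m, 1 To j)

    For f = 1 To j
        ' Find the minimum and maximum of feature f.
        lo = X(1, f): hi = X(1, f)
        For i = 2 To m
            If X(i, f) < lo Then lo = X(i, f)
            If X(i, f) > hi Then hi = X(i, f)
        Next i
        For i = 1 To m
            ' A constant column has zero range; it is left at 0 (an assumption).
            If hi > lo Then scaled(i, f) = (X(i, f) - lo) / (hi - lo)
        Next i
    Next f
    MinMaxScale = scaled
End Function
```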

The other important point is that instead of using arbitrary random starting points for each cluster we should
select 4 random objects as the starting centroids. After running the k-means algorithm we end up with the
clusters below.

(https://quantmacro.files.wordpress.com/2015/10/17.jpg)
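
Drawing the starting centroids from the observations themselves, as suggested above, could be sketched as
follows. The function name PickRandomCentroids is an assumption, and the partial Fisher-Yates shuffle is simply
one way to pick K distinct rows. It assumes K is no larger than the number of observations.

```vba
Option Explicit

' Picks K distinct rows of the m-by-j matrix X as starting centroids.
' Assumes K <= m. (Hypothetical helper name.)
Public Function PickRandomCentroids(X As Variant, ByVal K As Long) As Variant
    Dim m As Long, j As Long, i As Long, f As Long, r As Long, tmp As Long
    Dim rowOrder() As Long, centroids() As Double

    m = UBound(X, 1): j = UBound(X, 2)
    ReDim rowOrder(1 To m)
    ReDim centroids(1 To K, 1 To j)
    For i = 1 To m
        rowOrder(i) = i
    Next i

    Randomize   ' seed the generator from the system timer

    ' Partial Fisher-Yates shuffle: the first K entries become a random sample.
    For i = 1 To K
        r = i + Int(Rnd * (m - i + 1))      ' random position in i..m
        tmp = rowOrder(i): rowOrder(i) = rowOrder(r): rowOrder(r) = tmp
        For f = 1 To j
            centroids(i, f) = X(rowOrder(i), f)
        Next f
    Next i
    PickRandomCentroids = centroids
End Function
```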

After we have the clusters the hard work begins. We can check how yields, curve spreads, FX forwards, and
implied vol in FX and rates look within each cluster and how they compare with other clusters. Any outliers can
then be further analysed to check for potential trades.

A final note of caution. The k-means algorithm is sensitive to the starting values we choose for each centroid. Also,
the a priori choice of k determines the final result. It is worthwhile to run the algorithm multiple times to
check how sensitive the results are to the starting values. Many implementations of k-means run the algorithm
multiple times and select the clustering based on some metric such as total dispersion in the data (i.e., the sum of
squared distances between each object and its centroid).
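
The dispersion metric mentioned above is straightforward to compute once a run has finished. A minimal sketch,
assuming the same layout as elsewhere in this post (X as an m-by-j matrix, idx as the array of labels, centroids
as a K-by-j matrix) and a hypothetical function name:

```vba
Option Explicit

' Sum of squared distances between each row of X and its assigned centroid.
' (Hypothetical helper name.)
Public Function TotalDispersion(X As Variant, idx As Variant, centroids As Variant) As Double
    Dim i As Long, f As Long
    Dim total As Double

    For i = 1 To UBound(X, 1)
        For f = 1 To UBound(X, 2)
            total = total + (X(i, f) - centroids(idx(i), f)) ^ 2
        Next f
    Next i
    TotalDispersion = total
End Function
```

Running k-means from several different starting centroids and keeping the run with the smallest value of this
function is one simple way to reduce the sensitivity to starting values.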

In this post I just wanted to introduce the k-means algorithm and show how easy it is to implement. All the
modeling can be done in a day. Building a model for yourself can bring a lot of insight into how the algorithm
works. For serious applications one should stick to better implementations such as those available in R. This
type of analysis can be done with just a few lines of code:

(https://quantmacro.files.wordpress.com/2015/10/18.jpg)

If you’ve been following along and want to see the complete VBA implementation, below I show the
main k-means procedure that calls the user-defined functions mentioned above. I also present the main
worksheet, and I believe you can figure out from the code where the data is read from and where the output is
printed.

The worksheet that calls the macro is below. The user needs to specify where the data is, where the output
should be printed, the maximum number of iterations and the initial centroid values.

(https://quantmacro.files.wordpress.com/2015/10/19.jpg)

The macro button calls the macro below:

(https://quantmacro.files.wordpress.com/2015/10/20.jpg)

Some Useful Resources:

1) The Wikipedia entry is a great place to start: https://en.wikipedia.org/wiki/K-means_clustering

Posted in Excel, Machine Learning, R, VBA

6 thoughts on “K-Means Clustering in Excel”

1. iasmer says: March 15, 2016 at 3:33 pm


Thanks a lot to Bquanttrading. This is ultimly fascinating, to see results even on random data. But what is
purpose to show only VBA code pictures? Just lose time to OCR.

Any case – thanks one more time

bquanttrading says: March 15, 2016 at 8:25 pm
to encourage writing out your own code. its likely there are mistakes. i believe that you should write
your own code so you understand it and catch bugs. thats why i dont upload code or workbooks.

2. Pingback: Hierarchical Clustering
3. peterurbani1 says: September 29, 2016 at 7:54 am

Hi Bquant in the above code for ComputeCentroids – Tempsum is not used and the Sub kmeans() code
crashes at If idx(CC)=ii

Dim tempsum() As Variant

For ii = 1 To k
    For bb = 1 To j
        counter = 0
        For CC = 1 To m
            If idx(CC) = ii Then ??

Is it possible to get the correct code or a copy of the spreadsheet with the same data for validation ? Many
thanks Peter U

4. Jose Rico says: October 12, 2016 at 1:00 am
As I can get the fix matrix, could you give us an orientation?

Dim tempsum() As Variant

Thank you very much

5. g polic says: March 10, 2017 at 1:30 pm
Did a minor change in ComputeCentroids and this should work:

For ii = 1 To K
    For bb = 1 To J
        counter = 0
        For cc = 1 To M
            If idx(cc) = ii Then
                centroids(ii, bb) = centroids(ii, bb) + X(cc, bb)
                counter = counter + 1
            End If
        Next cc
        If counter > 0 Then
            centroids(ii, bb) = centroids(ii, bb) / counter
        Else
            centroids(ii, bb) = 0
        End If
    Next bb
Next ii
ComputeCentroids = centroids
