
K-means clustering

Brief Introduction
K-means clustering is a method that partitions the instances of a dataset into k clusters.
It starts by placing k centroids at random locations and assigning each data point to the
cluster of its closest centroid, where closeness is measured with a distance function.
Many distance functions are available, but we will use Euclidean distance. The algorithm
then moves each centroid to the mean of the points assigned to it. The assignment and update
steps are repeated until the algorithm converges, that is, until the centroids no longer move.
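
The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the experiment: it handles numeric attributes only (credit-g also contains nominal attributes, which would need their own distance definition), it picks the starting centroids from the data points, and the names `euclidean` and `kmeans` as well as the toy `points` are assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on numeric tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster, so centroids stay put
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignments


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever starting centroids are drawn.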
Experimental Design
For this example, we use the credit-g dataset. We set the initialization method to random, k
to 2 (since the class attribute has 2 categories) and the distance function to Euclidean distance. For
evaluation purposes, we exclude the class attribute from the clustering so that the resulting clusters
can later be compared against it. We build the model and note the initial starting points, the sum of
squared errors and the final positions of the centroids. For the evaluation, classes are
assigned to the clusters based on the majority value of the class attribute within each cluster,
and the error is calculated based on this assignment. The results are noted.
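
The evaluation step can be illustrated with a short Python sketch. It counts how often each class occurs in each cluster and then picks the labelling with the fewest errors. As an assumption, the sketch matches every cluster to a different class (a one-to-one labelling), which is consistent with the labelling reported in the Results section; a strict per-cluster majority vote could give two clusters the same label. The function name `classes_to_clusters` and the toy labels are hypothetical, and the sketch assumes there are no more clusters than classes.

```python
from collections import Counter
from itertools import permutations


def classes_to_clusters(cluster_ids, class_labels):
    """Return (mapping, accuracy) for the best one-to-one cluster -> class labelling."""
    # counts[(cluster, label)] = number of instances with that combination
    counts = Counter(zip(cluster_ids, class_labels))
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    best_mapping, best_correct = None, -1
    # Brute force over all labellings; fine for the small k used here.
    for perm in permutations(classes, len(clusters)):
        correct = sum(counts[(c, label)] for c, label in zip(clusters, perm))
        if correct > best_correct:
            best_mapping, best_correct = dict(zip(clusters, perm)), correct
    return best_mapping, best_correct / len(class_labels)


# Toy example (hypothetical data): 7 instances in 2 clusters.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
class_labels = ["good", "good", "bad", "bad", "bad", "good", "bad"]
mapping, accuracy = classes_to_clusters(cluster_ids, class_labels)
print(mapping, accuracy)
```

In the toy example, cluster 0 is labelled good and cluster 1 bad, with 5 of 7 instances counted as correct.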
Results
The initial starting points for the centroids of the first and second clusters are listed below for
each attribute except class:
First Cluster: 'no checking', 36, 'critical/other existing credit', 'new car', 7855, <100, 1<=X<4,
4, 'female div/dep/mar', none, 2, 'real estate', 25, stores, own, 2, skilled, 1, yes, yes
Second Cluster: <0, 24, 'critical/other existing credit', 'used car', 6615, <100, unemployed, 2, 'male single',
none, 4, 'no known property', 75, none, 'for free', 2, 'high qualif/self emp/mgmt', 1, yes, yes

The sum of squared errors was 5365.9976202840735.
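
For reference, the sum of squared errors is conventionally defined as the total squared distance between each instance and the centroid of its cluster. The notation below is an assumption introduced for illustration: C_j is the set of instances in cluster j, \mu_j is its centroid and d is the distance function. How d treats nominal attributes, and whether numeric attributes are normalized first, depends on the tool and is not stated in this report, so the formula should not be used to reproduce the exact figure above.

```latex
\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x_i \in C_j} d(x_i, \mu_j)^2
```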


The final positions of the centroids are shown in the table below (number of instances in parentheses):

Attribute               Full Data (1000)   1st (643)       2nd (357)
checking_status         no checking        no checking     <0
duration                20.903             19.9285         22.6583
credit_history          existing paid      existing paid   existing paid
purpose                 radio/tv           radio/tv        new car
credit_amount           3271.258           2924.7869       3895.2941
savings_status          <100               <100            <100
employment              1<=X<4             1<=X<4          >=7
installment_commitment  2.973              2.9611          2.9944
personal_status         male single        male single     male single
other_parties           none               none            none
residence_since         2.845              2.5599          3.3585
property_magnitude      car                car             no known property
age                     35.546             33.2364         39.7059
other_payment_plans     none               none            none
housing                 own                own             own
existing_credits        1.407              1.3701          1.4734
job                     skilled            skilled         skilled
num_dependents          1.155              1.1011          1.2521
own_telephone           none               none            yes
foreign_worker          yes                yes             yes

The algorithm achieved an accuracy of 61.1%, with 611 correctly clustered instances and
389 incorrectly clustered ones. The confusion matrix is shown below:

        1st   2nd
good    477   223
bad     166   134

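Here, we take the 1st cluster to represent the good category of the class attribute and the 2nd
cluster to represent the bad category. Note that the 2nd cluster actually contains more good
instances (223) than bad ones (134), so this labelling matches each class to a different cluster
rather than following a strict per-cluster majority.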
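
The reported accuracy follows directly from the confusion matrix. The short check below recomputes it from the four counts given above; the variable names are chosen for illustration only.

```python
# Confusion matrix as reported: rows are classes, columns are clusters.
confusion = {
    "good": {"1st": 477, "2nd": 223},
    "bad": {"1st": 166, "2nd": 134},
}
# Labelling used in the report: 1st cluster = good, 2nd cluster = bad.
labelling = {"1st": "good", "2nd": "bad"}

total = sum(sum(row.values()) for row in confusion.values())
# An instance counts as correct when its class matches its cluster's label.
correct = sum(confusion[cls][cluster] for cluster, cls in labelling.items())
accuracy = correct / total
# Column sums give the cluster sizes.
cluster_sizes = {c: sum(row[c] for row in confusion.values()) for c in labelling}

print(total, correct, total - correct, accuracy, cluster_sizes)
```

The counts give 477 + 134 = 611 correct out of 1000 instances (61.1%), and the column sums of 643 and 357 match the cluster sizes in the centroid table.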
