Brief Introduction
K-means clustering is a method that partitions the instances of a dataset into k clusters.
It starts by placing k centroids at random locations and assigning each data point to its
nearest centroid, as measured by a distance function; the points assigned to the same centroid
form a cluster. Many distance functions are available, but we use Euclidean distance. The
algorithm then moves each centroid to the mean of its cluster and reassigns the points. These
two steps are repeated until the algorithm converges, that is, until the centroids no longer move.
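The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used for the experiment: the function name `kmeans`, its parameters, and the choice to draw the starting centroids from the data points are assumptions. It handles numeric attributes only; nominal attributes such as those in credit-g would need to be encoded or given their own distance and centroid rule.

```python
import numpy as np


def assign(points, centroids):
    """Return, for every point, the index of its nearest centroid (Euclidean)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, max_iter=100, seed=0):
    """Illustrative k-means: returns (centroids, labels, sum of squared errors)."""
    rng = np.random.default_rng(seed)
    # Assumption: random initialization picks k distinct data points.
    start = rng.choice(len(points), size=k, replace=False)
    centroids = points[start].astype(float)
    for _ in range(max_iter):
        labels = assign(points, centroids)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # converged: the centroids no longer move
        centroids = new_centroids
    labels = assign(points, centroids)
    sse = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, sse


if __name__ == "__main__":
    # Two well-separated toy groups, purely for illustration.
    toy = np.array([[0.0, 0.0], [0.0, 2.0], [2.0, 0.0],
                    [100.0, 100.0], [100.0, 102.0], [102.0, 100.0]])
    centroids, labels, sse = kmeans(toy, k=2)
    print(centroids, labels, sse)
```

The sum of squared errors returned here is the total squared Euclidean distance from each point to its assigned centroid, which is the quantity k-means reduces at every iteration.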
Experimental Design
For this example, we use the credit-g dataset. We set the initialization method to random, k
to 2 because the class attribute has two categories, and the distance function to Euclidean
distance. For evaluation purposes, we exclude the class attribute from the clustering so that
the resulting clusters can later be compared against it. We build the model and note the
starting points, the sum of squared errors, and the final positions of the centroids. For the
evaluation, classes are assigned to the clusters based on the majority value of the class
attribute within each cluster, and the error is calculated from this assignment. The results
are noted.
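The evaluation step can be sketched as follows. The helper `classes_to_clusters` is hypothetical and rests on one assumption that the text does not state: each cluster is matched to a different class, and the matching with the fewest errors is chosen. This coincides with the per-cluster majority rule whenever the clusters have different majority classes. The sketch also assumes there are no more clusters than classes.

```python
from collections import Counter
from itertools import permutations


def classes_to_clusters(cluster_ids, class_labels):
    """Match each cluster to a distinct class so that the error count is minimal."""
    # (cluster, class) -> number of instances with that combination
    counts = Counter(zip(cluster_ids, class_labels))
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    best_mapping, best_correct = None, -1
    # Try every one-to-one assignment of classes to clusters (fine for small k).
    for perm in permutations(classes, len(clusters)):
        correct = sum(counts[(c, label)] for c, label in zip(clusters, perm))
        if correct > best_correct:
            best_mapping, best_correct = dict(zip(clusters, perm)), correct
    errors = len(cluster_ids) - best_correct
    return best_mapping, errors


if __name__ == "__main__":
    # Toy labels, purely for illustration (not taken from credit-g).
    cluster_ids = [0, 0, 0, 1, 1, 1]
    class_labels = ["good", "good", "bad", "bad", "bad", "good"]
    print(classes_to_clusters(cluster_ids, class_labels))
```

Because the class attribute is left out of the clustering itself, it is used here only after the fact, to label the clusters and count disagreements.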
Results
The starting points for the centroids of the first and second clusters are listed below for
each attribute except class:
First Cluster: 'no checking', 36, 'critical/other existing credit', 'new car', 7855, <100, 1<=X<4,
4, 'female div/dep/mar', none, 2, 'real estate', 25, stores, own, 2, skilled, 1, yes, yes
Second Cluster: <0, 24, 'critical/other existing credit', 'used car', 6615, <100, unemployed, 2,
'male single', none, 4, 'no known property', 75, none, 'for free', 2, 'high qualif/self emp/mgmt',
1, yes, yes
Final cluster centroids (number of instances in parentheses):

Attribute                Full Data      1st            2nd
                         (1000)         (643)          (357)
checking_status          no checking    no checking    <0
duration                 20.903         19.9285        22.6583
credit_history           existing paid  existing paid  existing paid
purpose                  radio/tv       radio/tv       new car
credit_amount            3271.258       2924.7869      3895.2941
savings_status           <100           <100           <100
employment               1<=X<4         1<=X<4         >=7
installment_commitment   2.973          2.9611         2.9944
personal_status          male single    male single    male single
other_parties            none           none           none
residence_since          2.845          2.5599         3.3585
property_magnitude       car            car            no known property
age                      35.546         33.2364        39.7059
other_payment_plans      none           none           none
housing                  own            own            own
existing_credits         1.407          1.3701         1.4734
job                      skilled        skilled        skilled
num_dependents           1.155          1.1011         1.2521
own_telephone            none           none           yes
foreign_worker           yes            yes            yes

The clustering achieved an accuracy of 61.1%, with 611 correctly clustered instances and
389 incorrectly clustered ones. The confusion matrix is shown below:

Class    1st    2nd
good     477    223
bad      166    134

Here, the first cluster is taken to represent the good category of the class attribute and
the second cluster the bad category.
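The reported figures can be checked directly from the confusion matrix. The short script below is only an arithmetic check; the variable names are illustrative, and the counts are copied from the matrix above with rows as classes and columns as clusters.

```python
# Confusion matrix from the results: class -> (count in 1st cluster, count in 2nd cluster)
confusion = {"good": (477, 223), "bad": (166, 134)}

total = sum(sum(row) for row in confusion.values())
# The 1st cluster is read as "good" and the 2nd cluster as "bad".
correct = confusion["good"][0] + confusion["bad"][1]
incorrect = total - correct
accuracy = correct / total
# Column sums give the number of instances in each cluster.
cluster_sizes = tuple(sum(col) for col in zip(*confusion.values()))

print(total, correct, incorrect, cluster_sizes, f"{accuracy:.1%}")
# prints: 1000 611 389 (643, 357) 61.1%
```

The column sums match the cluster sizes of 643 and 357 reported with the final centroids.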