
Cluster Analysis

using Statgraphics

Presented by
Dr. Neil W. Polhemus
Cluster Analysis

“A statistical classification technique in which cases, data, or
objects (events, people, things, etc.) are sub-divided into groups
(clusters) such that the items in a cluster are very similar (but not
identical) to one another and very different from the items in
other clusters. It is a discovery tool that reveals associations,
patterns, relationships, and structures in masses of data.”
from businessdictionary.com
Important Applications
• Market research – partitioning consumers into
market segments.
• Medical imaging – identifying different types of
tissue.
• Bioinformatics – separating sequences into
gene families.
• Social networking – dividing people into
communities.
• Educational data mining – identifying groups of
students with different needs.
• Climatology – identifying weather patterns.
Example
[Figure: scatterplot matrix of the variables LOG(Pop. Density), Rural Population, Female Percentage, Age Dependency Ratio, Life Expectancy (Total), Fertility Rate, LOG(Infant Mortality Rate), Trade, LOG(GDP per Capita), Consumer Price Inflation]

Notation
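n multivariate observations

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1p} \\
x_{21} & x_{22} & \dots & x_{2p} \\
\vdots & \vdots & & \vdots \\
x_{n1} & x_{n2} & \dots & x_{np}
\end{bmatrix}
$$

$x_{ij}$ = value of the jth variable for the ith item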


Data Input Dialog
Analysis Options

Cluster – may cluster either observations or variables.


Method – procedure used to create clusters.
Distance metric – how the distance between points or clusters is measured.
Standardize – whether to standardize the variables before calculating distance.
Number of clusters – target number of clusters to be created.
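
Standardizing matters because distances depend on the scale of each variable. The sketch below is a minimal illustration, assuming z-score standardization with the sample standard deviation (the exact formula used by Statgraphics is not stated here). The function name `standardize` and the data values are hypothetical.

```python
from statistics import mean, stdev


def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1.

    Assumes every column has at least two distinct values, so that
    its standard deviation is not zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, sds)]
        for row in rows
    ]


# Hypothetical data: two variables measured on very different scales.
data = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
z = standardize(data)
for row in z:
    print(row)
```

After standardizing, both columns contribute on a comparable scale to any distance calculated from them.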
Distance Metrics
• Squared Euclidean
$$d(x, y) = \sum_{j=1}^{p} (x_j - y_j)^2$$
• Euclidean
$$d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}$$
• City Block
$$d(x, y) = \sum_{j=1}^{p} |x_j - y_j|$$
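
The three metrics can be sketched in a few lines of Python. This is an illustration only; the function names are hypothetical, and each observation is assumed to be a list of p numeric values.

```python
import math


def squared_euclidean(x, y):
    """Sum of squared differences over the p variables."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(x, y))


def city_block(x, y):
    """Sum of absolute differences over the p variables."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Hypothetical pair of observations with p = 2 variables.
x = [0.0, 0.0]
y = [3.0, 4.0]
print(squared_euclidean(x, y), euclidean(x, y), city_block(x, y))
```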
Clustering Methods in Statgraphics Centurion

• Agglomerative hierarchical methods


– Begin with n clusters.
– Combine the 2 clusters that are closest to each
other.
– Continue until only k clusters remain.

• Method of k-means
– Begin by creating k clusters using selected
observations as seeds.
– Assign other observations to the closest cluster.
– Move points from one cluster to another until no
more changes are indicated.
Single Linkage (Nearest Neighbor)
1. Begin with n clusters, one for each observation.
2. For each pair of clusters, calculate the minimum
distance between any two points located in the
different clusters.
3. Join the pair of clusters that are closest to each
other.
4. Repeat steps 2 to 3 until the number of clusters
has been reduced to that desired.
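
The steps above can be sketched as a short, unoptimized Python function. It is an illustration under stated assumptions, not the Statgraphics implementation: the names `single_linkage` and `squared_euclidean` are hypothetical, the data are made up, and ties are broken by taking the first closest pair found.

```python
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def single_linkage(points, k, dist=squared_euclidean):
    """Nearest-neighbor clustering; returns k lists of point indices."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Step 1: begin with n clusters, one for each observation.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, index of cluster a, index of cluster b)
        # Step 2: minimum point-to-point distance for each pair of clusters.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(
                    dist(points[i], points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Step 3: join the two closest clusters.
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
        # Step 4: the loop repeats until only k clusters remain.
    return clusters


# Hypothetical one-variable data with two obvious groups.
points = [[0.0], [1.0], [1.5], [10.0], [11.0]]
result = single_linkage(points, 2)
print(result)
```

Recomputing every pairwise distance at each stage keeps the sketch close to the listed steps; practical implementations cache and update a distance matrix instead.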
Other Hierarchical Methods
• Complete linkage (farthest neighbor) – distance
between clusters is maximum distance between
members of the 2 clusters.
• Average linkage – distance between clusters is
average distance between all pairs of points.
• Centroid – distance between clusters is distance
between their centroids.
• Median – distance between clusters is distance
between their multivariate medians.
• Ward’s method – distance between clusters is
defined by the increase in the sum of squared deviations
around the cluster means that would result if the
clusters were joined.
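
Several of these between-cluster distances can be written directly from their definitions. The sketch below is illustrative: the function names are hypothetical, the Euclidean metric is assumed for point-to-point distances, and the median method is omitted.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def centroid(cluster):
    """Mean of each variable over the points in a cluster."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]


def complete_linkage(a, b):
    """Maximum distance between members of the 2 clusters."""
    return max(euclidean(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Average distance over all between-cluster pairs of points."""
    return sum(euclidean(p, q) for p in a for q in b) / (len(a) * len(b))


def centroid_distance(a, b):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(a), centroid(b))


def within_ss(cluster):
    """Sum of squared deviations of the points around their centroid."""
    c = centroid(cluster)
    return sum(sum((v - m) ** 2 for v, m in zip(p, c)) for p in cluster)


def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b were joined."""
    return within_ss(a + b) - within_ss(a) - within_ss(b)


# Hypothetical clusters, each holding two points with p = 2 variables.
a = [[0.0, 0.0], [2.0, 0.0]]
b = [[5.0, 0.0], [7.0, 0.0]]
print(complete_linkage(a, b), average_linkage(a, b),
      centroid_distance(a, b), ward_increase(a, b))
```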
Tables and Graphs
Dendrogram
[Figure: dendrogram of countries, Nearest Neighbor Method, Squared Euclidean; Distance axis: 0 to 60]
Ward’s Method
[Figure: dendrogram of countries, Ward's Method, Squared Euclidean; Distance axis: 0 to 1500]
Cluster Scatterplot

[Figure: cluster scatterplot, Ward's Method, Squared Euclidean; axes: LOG(GDP per Capita), Rural Population, LOG(Infant Mortality Rate); legend: clusters 1 to 3 and centroids]
Icicle Plot
Agglomeration Distance Plot

[Figure: agglomeration distance plot, Ward's Method, Squared Euclidean; x-axis: Stage (0 to 150), y-axis: Distance (0 to 1500)]
Agglomeration Schedule
Creating 3 Clusters
[Figure: dendrogram of countries, Ward's Method, Squared Euclidean; Distance axis: 0 to 800]
Save Cluster Numbers
Discriminant Analysis
Discriminant Function Plot
[Figure: plot of discriminant functions; x-axis: Function 1 (-5 to 7), y-axis: Function 2 (-3.5 to 4.5); points coded by CLUSTNUMS 1 to 3]
Discriminant Function Coefficients
Cluster Scatterplot
[Figure: cluster scatterplot, Ward's Method, Squared Euclidean; x-axis: Life Expectancy (Total), y-axis: LOG(Infant Mortality Rate); legend: clusters 1 to 3 and centroids]
Cluster Scatterplot
Method of k-Means
1. k observations are selected to be the initial
seeds.
2. All remaining observations are assigned to the
cluster with the nearest seed.
3. The centroid of each cluster is calculated.
4. Each observation is checked to see if it is
closer to the centroid of another cluster. If so, it
is switched to that cluster.
5. Step 4 is repeated until there are no further
changes.
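
The procedure can be sketched in plain Python as follows. This is a simplified batch variant: all observations are re-checked against the centroids in one pass, and the centroids are then recalculated, which may differ in detail from how Statgraphics moves points. The names `k_means` and `nearest`, the `max_iter` safeguard, and the data are hypothetical.

```python
def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(members):
    return [sum(col) / len(members) for col in zip(*members)]


def nearest(point, centers):
    """Index of the center closest to the given point."""
    return min(range(len(centers)),
               key=lambda c: squared_euclidean(point, centers[c]))


def k_means(points, seed_indices, max_iter=100):
    # Step 1: the selected observations are the initial seeds.
    centers = [list(points[i]) for i in seed_indices]
    # Step 2: assign every observation to the cluster with the nearest seed.
    labels = [nearest(p, centers) for p in points]
    for _ in range(max_iter):
        # Step 3: calculate the centroid of each cluster.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = centroid(members)
        # Step 4: switch observations that are closer to another centroid.
        new_labels = [nearest(p, centers) for p in points]
        # Step 5: stop when there are no further changes.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Hypothetical data with two groups; observations 0 and 3 are the seeds.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
          [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
labels, centers = k_means(points, seed_indices=[0, 3])
print(labels)
print(centers)
```

The final clusters can depend on which observations are chosen as seeds, so it is worth trying more than one set.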
Seeds
• Select USA, China and India as initial seeds.
Cluster Summary
Cluster Scatterplot
USA=#3, China=#1, India=#2
World Map (coming soon)
[Figure: world map colored by CLUSTNUMS; legend: clusters 1 to 3]
References
• Johnson and Wichern (2012) Applied
Multivariate Statistical Analysis, sixth edition.

• Copy of slides and presentation at:
www.statgraphics.com/webinars