Cluster Analysis

Application of Cluster Analysis in
Agricultural Economic Research
Seminar II
Major advisor
Dr. T N Prakash
By
Asrar Ahmed Khan
PAK 4079
Scheme of presentation
Introduction
Some terminologies
Application of cluster Analysis in Agricultural
Economics
Goal and objectives of cluster analysis
Assumptions of cluster analysis
Steps in cluster analysis
Hypothetical example
Case studies (I, II).
Conclusion

Introduction
Cluster analysis is a formal multivariate technique, widely
adopted in the social and biological sciences, that classifies
objects or characteristics into meaningful sets known as
clusters.

Cluster analysis classifies objects (e.g., respondents,
products, or other entities) so that each object is very
similar to others in the cluster with respect to some
predetermined selection criterion.

The resulting clusters of objects should then exhibit high
internal (within-cluster) homogeneity and high external
(between-cluster) heterogeneity.
Some Terminologies
Agglomerative methods: Hierarchical procedures that begin with
each object or observation in a separate cluster.
Centroid: Average or mean value of the objects contained in the
cluster on each variable.
Cluster seeds: Initial centroids or starting points for clusters.
Dendrogram: Graphical representation (tree graph) of the results of
a hierarchical procedure in which each object is arrayed on one
axis, and the other axis portrays the steps in the hierarchical
procedure.
Euclidean distance: A measure of the similarity between two
objects. It is the length of a straight line drawn between the
two objects, so a smaller distance means greater similarity.
Source: Hair et al (2005)
Some of the Applications of Cluster Analysis in
Agricultural Economic Research:
Managers of high performance category and low performance
category
Villages as developed and underdeveloped
Farm family belonging to below poverty line and above
poverty line
Livestock farming and grain farming
Adopters of new technology and non-adopters of new
technology
To analyze the opinion survey about the performance of the
institutions.
Segmenting the market and determining the target market

Source: www.richmond.edu/~pli/psy538/Cluster%20Analysis.html
Goal of cluster analysis
To partition a set of objects into two or more groups
based on the similarity of the objects for a set of
specified characteristics
Objectives of cluster analysis
1. Taxonomy description
2. Data simplification
3. Relationship identification
4. Selection of clustering variables
Assumptions of cluster Analysis
+ Not a statistical inference technique

+ Is an objective methodology for quantifying the structure of a set
of observations.

+ Must be sure the sample represents the population and that
under sampling did not occur.

+ Ensure generalizability to population

+ It is sensitive to outliers

+ Examine data and clusters for multicollinearity.

Steps in cluster analysis:
1. Formulate a problem
2. Select a similarity measure
3. Select a clustering procedure
4. Decide on the number of clusters
5. Interpret and profile clusters
6. Assess the validity of clustering

Source: Green (1988)
1. Formulate a problem

Perhaps the most important part of formulating the
clustering problem is selecting the variables on which the
clustering is based. Inclusion of even one or two irrelevant
variables may distort an otherwise useful clustering solution.

2. Select a similarity measure
How objects are recognized as similar or dissimilar is
fundamental to the process of classification. A similarity
matrix has to be prepared with the help of one of the following tools:
a. Correlation coefficient
b. Distance measures
a. Correlation coefficient: The most popular correlation
coefficient is product moment correlation coefficient
suggested by Karl Pearson

The correlation coefficient is defined as

$$ r_{jk} = \frac{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)}{\sqrt{\sum_{i=1}^{n} (X_{ij} - \bar{X}_j)^2 \, \sum_{i=1}^{n} (X_{ik} - \bar{X}_k)^2}} \quad (1) $$

where $X_{ij}$ is the value of variable $i$ for case $j$, $\bar{X}_j$ is the mean of the values
of the variables for case $j$, and $n$ is the number of variables.
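
As a rough illustration of equation (1), the sketch below computes the product-moment correlation between the response profiles of two cases across the variables. The function name `pearson_r` is illustrative only, the two profiles are cases 14 and 16 from Table 1 of the hypothetical example, and rejecting a constant profile (zero denominator) is an assumption of this sketch.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two profiles of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator of equation (1): sum of cross-products of deviations.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the sums of squares.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cross / math.sqrt(ss_x * ss_y)


# Profiles of cases 14 and 16 on V1..V6 (Table 1).
case_14 = [4, 6, 4, 6, 4, 7]
case_16 = [3, 5, 4, 6, 4, 7]
r = pearson_r(case_14, case_16)
print(round(r, 3))
```

A value close to 1 indicates that the two cases have similar response patterns, even if their levels differ.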
b. Distance measures: The most common approach is to
measure similarity in terms of the distance between pairs
of objects. Objects with smaller distances between them are
more similar to each other than are those at larger
distances. There are several ways to compute the
distance between objects.

Euclidean distance: The Euclidean distance is the square
root of the sum of the squared differences in values for
each variable.

$$ d_{ij} = \sqrt{\sum_{k=1}^{p} (X_{ik} - X_{jk})^2} \quad (2) $$

where $d_{ij}$ is the distance between cases $i$ and $j$, $X_{ik}$ is the value of the
$k$th variable for the $i$th case, and $p$ is the number of variables.
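
The Euclidean distance defined above can be sketched in a few lines. The function name `euclidean_distance` is illustrative, and the three profiles are cases 14, 16 and 20 from Table 1 of the hypothetical example.

```python
import math


def euclidean_distance(case_i, case_j):
    """Square root of the sum of squared differences over all variables."""
    if len(case_i) != len(case_j):
        raise ValueError("cases must be measured on the same variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(case_i, case_j)))


# Values on V1..V6 taken from Table 1.
case_14 = [4, 6, 4, 6, 4, 7]
case_16 = [3, 5, 4, 6, 4, 7]
case_20 = [2, 3, 2, 4, 7, 2]

d_near = euclidean_distance(case_14, case_16)  # differ by 1 on V1 and V2 only
d_far = euclidean_distance(case_14, case_20)
print(round(d_near, 3), round(d_far, 3))
```

Cases 14 and 16 lie much closer together than cases 14 and 20, which is consistent with cases 14 and 16 being the first pair merged in the agglomeration schedule.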

3. Selection of clustering procedure
Clustering procedures can be hierarchical or
non-hierarchical.
Hierarchical clustering methods :
Hierarchical clustering is characterized by the
development of a hierarchy or tree like structure.
Hierarchical methods can be agglomerative or divisive.
Agglomerative clustering starts with each object in a
separate cluster. Clusters are formed by grouping
objects into bigger and bigger clusters.
Divisive clustering starts with all the objects grouped in a
single cluster. Clusters are divided or split until each
object is in a separate cluster.
Different types of hierarchical agglomerative
methods

+ Single linkage (nearest neighbor)
+ Complete linkage (furthest neighbor)
+ Average linkage
+ Ward's method
Figure : Dendrogram
Non-hierarchical clustering methods:
Non-hierarchical clustering methods are frequently referred to as
K-means clustering.

Here the first step is to select a cluster seed as the initial
cluster center, and all the objects within a prespecified
threshold distance are included in the resulting cluster.
Then another cluster seed is chosen, and the
assignment continues until all objects are assigned.

There are several approaches for selecting cluster seeds
and assigning objects:
1. Sequential threshold
2. Parallel threshold
3. Optimization
4. Decide on the number of clusters
Unfortunately, no standard, objective selection
procedure exists for deciding the number of clusters to
be formed.

One relatively simple rule examines the similarity
measure or distance between clusters at each
successive step and stops when the successive values
between steps make a sudden jump.

A second rule applies a statistical test,
such as the point-biserial correlation.
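
The first rule can be sketched as follows, using the coefficients from the agglomeration schedule of the hypothetical example (Table 2). Measuring the "sudden jump" as the largest proportional increase between successive stages is an assumption of this sketch, as is the function name `suggest_n_clusters`; other definitions of the jump are possible.

```python
# Agglomeration coefficients for stages 1..19 (Table 2), 20 cases.
coefficients = [
    1.000, 2.000, 3.500, 5.000, 6.500, 8.167, 10.500, 13.000, 15.583,
    18.500, 23.000, 27.750, 33.100, 41.333, 51.833, 64.500, 79.667,
    172.667, 328.600,
]
n_cases = 20


def suggest_n_clusters(coeffs, n_cases):
    """Return (clusters, stage, jump) for the largest proportional increase."""
    best_stage, best_jump = None, -1.0
    for idx in range(1, len(coeffs)):
        jump = (coeffs[idx] - coeffs[idx - 1]) / coeffs[idx - 1]
        if jump > best_jump:
            best_stage, best_jump = idx + 1, jump  # stages are 1-based
    # Stopping just before the jump stage leaves this many clusters.
    return n_cases - (best_stage - 1), best_stage, best_jump


n_clusters, stage, jump = suggest_n_clusters(coefficients, n_cases)
print(n_clusters, stage, round(jump, 3))
```

For these coefficients the largest proportional jump occurs at stage 18, which points to a three-cluster solution. Early stages can also show large proportional jumps because the coefficients are still small, so the result should be read together with the dendrogram.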
5. Interpret and profile the clusters:
Interpreting and profiling clusters involves examining the
cluster centroids. The centroids represent the mean
values of the objects contained in the cluster on each of the variables.

- They provide a means for assessing the
correspondence of the derived clusters to those
proposed by prior theory or practical experience.

- The cluster profiles provide a route for making
assessments of practical significance.
6. Assess Reliability and Validity

Perform cluster analysis on the same data using different
distance measures. Compare the results across measures
to determine the stability of the solutions.
Use different methods of clustering and compare the
results.
Split the data randomly into halves. Perform clustering
separately on each half. Compare cluster centroids across
the two subsamples.
Delete variables randomly. Perform clustering based on the
reduced set of variables. Compare the results with those
obtained by clustering based on the entire set of variables.
In non-hierarchical clustering, the solution may depend
on the order of cases in the data set. Make multiple runs
using different orders of cases until the solution stabilizes.

Hypothetical example for the application of
Cluster Analysis
Let us consider a clustering of consumers based on the attitudes
towards shopping. Six attitudinal variables were selected. Consumers
were asked to express their degree of agreement with the following
statements on a seven point scale (1= disagree, 7= agree).

The variables are:
V1 = Shopping is fun.
V2 = Shopping is bad for your budget.
V3 = I combine shopping with eating out.
V4 = I try to get the best buys when shopping.
V5 = I don't care about shopping.
V6 = You can save a lot of money by comparing prices.
Data
Table 2: Agglomeration Schedule
Stage   Cluster combined        Coefficients   Stage cluster first appears   Next stage
        Cluster 1   Cluster 2                  Cluster 1     Cluster 2
1       14          16          1.000          0             0               6
2       6           7           2.000          0             0               7
3       2           13          3.500          0             0               15
4       5           11          5.000          0             0               11
5       3           8           6.500          0             0               16
6       10          14          8.167          0             1               9
7       6           12          10.500         2             0               10
8       9           20          13.000         0             0               11
9       4           10          15.583         0             6               12
10      1           6           18.500         0             7               13
11      5           9           23.000         4             8               15
12      4           19          27.750         9             0               17
13      1           17          33.100         10            0               14
14      1           15          41.333         13            0               16
15      2           5           51.833         3             11              18
16      1           3           64.500         14            5               19
17      4           18          79.667         12            0               18
18      2           4           172.667        15            17              19
19      1           2           328.600        16            18              0
Means of variables
Cluster No.   V1      V2      V3      V4      V5      V6
1             5.750   3.625   6.000   3.125   1.750   3.875
2             1.667   3.000   1.833   3.500   5.500   3.333
3             3.500   5.833   3.333   6.000   3.500   6.000
Table 3: Cluster Centroids
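
As a hedged sketch of how the centroids relate to the earlier tables, the code below replays the merges listed in the agglomeration schedule (Table 2) until three clusters remain and then averages the Table 1 data within each cluster. The helper names `clusters_after` and `centroid` are illustrative, and the sketch assumes that the merged cluster keeps the label in the "Cluster 1" column of the schedule.

```python
# Attitudinal data from Table 1: case number -> values on V1..V6.
data = {
    1: [6, 4, 7, 3, 2, 3], 2: [2, 3, 1, 4, 5, 4], 3: [7, 2, 6, 4, 1, 3],
    4: [4, 6, 4, 5, 3, 6], 5: [1, 3, 2, 2, 6, 4], 6: [6, 4, 6, 3, 3, 4],
    7: [5, 3, 6, 3, 3, 4], 8: [7, 3, 7, 4, 1, 4], 9: [2, 4, 3, 3, 6, 3],
    10: [3, 5, 3, 6, 4, 6], 11: [1, 3, 2, 3, 5, 3], 12: [5, 4, 5, 4, 2, 4],
    13: [2, 2, 1, 5, 4, 4], 14: [4, 6, 4, 6, 4, 7], 15: [6, 5, 4, 2, 1, 4],
    16: [3, 5, 4, 6, 4, 7], 17: [4, 4, 7, 2, 2, 5], 18: [3, 7, 2, 6, 4, 3],
    19: [4, 6, 3, 7, 2, 7], 20: [2, 3, 2, 4, 7, 2],
}

# "Cluster combined" columns of Table 2, stages 1..19.
schedule = [
    (14, 16), (6, 7), (2, 13), (5, 11), (3, 8), (10, 14), (6, 12),
    (9, 20), (4, 10), (1, 6), (5, 9), (4, 19), (1, 17), (1, 15),
    (2, 5), (1, 3), (4, 18), (2, 4), (1, 2),
]


def clusters_after(schedule, case_ids, n_clusters):
    """Replay merges until only n_clusters remain; return label -> members."""
    members = {case: [case] for case in case_ids}
    for keep, absorb in schedule:
        if len(members) <= n_clusters:
            break
        members[keep] = members[keep] + members.pop(absorb)
    return members


def centroid(case_ids):
    """Mean of each variable over the given cases."""
    rows = [data[case] for case in case_ids]
    return [round(sum(col) / len(rows), 3) for col in zip(*rows)]


members = clusters_after(schedule, data.keys(), 3)
for label in sorted(members):
    print(label, sorted(members[label]), centroid(members[label]))
```

The clusters labelled 1, 2 and 4 by the schedule correspond to clusters 1, 2 and 3 in Table 3. With Table 1 as transcribed here, the computed means agree with Table 3 except for V5 in cluster 1, which comes out as 1.875 rather than 1.750.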
Figure: Dendrogram using Ward's method (horizontal axis: Rescaled Distance Cluster Combine, 0 to 25; cases listed from top to bottom: 14, 16, 10, 4, 19, 18, 2, 13, 5, 11, 9, 20, 3, 8, 6, 7, 12, 1, 17, 15)
Case study I
Economic Evaluation of Gulbarga Milk Union,
Karnataka State
Patil (1999)
(Application of cluster analysis to evaluate
the performance of the milk union in the region)
In this study cluster analysis of variables was adopted to
analyze the scores obtained from the opinion survey of
three different groups of respondents, in order to
evaluate the working of the union and DCS.

To facilitate the understanding and interpretation of the
results, the clusters were further aggregated based upon
the degree of similarity. The similarity values in the range
of 80 to 100, 60 to 80, 40 to 60, 20 to 40 and 0 to 20
were designated as very high, high, medium, low and
very low aggregate clusters in that order.
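
The banding rule described above can be written as a small lookup. This is only a sketch: the function name `aggregate_label` is hypothetical, and assigning a value that falls exactly on a boundary (20, 40, 60 or 80) to the higher band is an assumption, since the study as summarized here does not say how boundary values were treated.

```python
def aggregate_label(similarity):
    """Map a degree of similarity (0 to 100) to its aggregate cluster label."""
    if not 0 <= similarity <= 100:
        raise ValueError("similarity must lie between 0 and 100")
    # Assumption: a value on a boundary falls into the higher band.
    if similarity >= 80:
        return "very high"
    if similarity >= 60:
        return "high"
    if similarity >= 40:
        return "medium"
    if similarity >= 20:
        return "low"
    return "very low"


# Degrees of similarity reported in the tables of this case study.
for value in (82.95, 67.81, 49.21, 39.76, 18.96):
    print(value, aggregate_label(value))
```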
Aggregation of clusters   Name of the variables                                           Degree of similarity
Very high                 Managerial capacities and manufacturing problems                82.95
High                      General problems and performance indicators of member society   67.81
High                      Marketing problems and benefits extended to member society      64.11
Medium                    Education and experience of officials and financial problems    49.21
Low                       Involvement indicators                                          39.76
Table 4: Aggregation of cluster of variables on the working of the Gulbarga milk
union according to its officials
Source: Patil (1999)
Aggregation of clusters   Name of the variable                                                       Degree of similarity
High                      Supervision of the working of DCS and communication within organization    60.18
Medium                    Performance indicators                                                     52.44
Low                       Benefits extended to members and education and training to policy makers   33.52
Low                       Financial problems                                                         33.10
Very low                  Training required to improve the capabilities                              18.96
Very low                  General problems                                                           13.64
Table 5: Aggregation of cluster of variables on the working
of the DCS according to its policy makers
Aggregation of clusters   Name of the variables                                 Degree of similarity
Medium                    Benefits extended to members and financial problems   58.49
Medium                    Involvement indicators and managerial capacities      48.80
Medium                    Education and experience of officials                 40.93
Low                       Performance standards                                 31.07
Table 6: Aggregation of clusters of variables on the working
of DCS according to its officials
Aggregation of clusters   Name of the variables                                                      Degree of similarity
High                      Supervision of the working of DCS and communication within organization    60.18
Medium                    Performance indicators                                                     52.44
Low                       Benefits extended to members and education and training to policy makers   33.52
Low                       Financial problems                                                         33.10
Very low                  Training required to improve the capabilities                              18.96
Table 7: Aggregation of cluster of variables on the working of
the DCS according to policy makers
Aggregation of clusters   Name of the variables                                 Degree of similarity
Medium                    Benefits extended to members and financial problems   58.49
Medium                    Involvement indicators and managerial capacities      48.80
Medium                    Education and experience of officials                 40.93
Low                       Performance standards                                 31.07
Table 8: Aggregation of cluster of variables on the working
of DCS according to its officials
Aggregation of clusters   Name of the variables                                                                              Degree of similarity
Low                       Services of veterinary doctors and training facilities to member farmers                           31.81
Low                       Political awareness to farmers and procurement and distribution of requirements of farmers by DCS  27.27
Low                       Rebate facility to DCS by union and other facilities to farmers by DCS                             25.41
Low                       Marketing of milk and artificial insemination facility                                             25.30
Low                       Benefits and services extended to members by DCS                                                   24.35
Low                       Constraints in the functioning of DCS                                                              23.90
Low                       Incentives given to farmers                                                                        22.28
Very low                  Cattle feed supply facility                                                                        19.12
Very low                  Linking credit with marketing of milk                                                              18.89
Very low                  Supply of market news                                                                              18.82
Very low                  Awareness of benefits and facilities extended to DCS by the union                                  18.70
Very low                  Training to farmers                                                                                18.17
Very low                  Technical guidance and advice                                                                      17.54
Very low                  Preference for selling the produce through DCS                                                     16.34
Table 9: Aggregation of cluster of variables on the working of the
DCS according to its farmer beneficiaries
Case study II
Consumers Quality Response for selected cut flowers
in Bangalore city: An economic Analysis
Sharma (1997)
(Application of cluster analysis in the segmentation
of consumers of the cut flowers)
Sharma (1997) conducted a study on consumers'
quality preferences for selected cut flowers in
Bangalore city. In this study, a hierarchical procedure
was used to sort the respondents into groups that
are alike for each cut flower. The relative importance
scores given to the four quality attributes, obtained
from conjoint analysis, were used as the variables
for the segmentation of consumers.
Attributes           Rose                    Gerbera                     Gladiolus               Chrysanthemum
Color                White, Red, Pink        Deep red, Pink, Off white   Orange yellow, white    White, Yellow, Magenta
Variety              Hybrid, Local           Hybrid, Local               Hybrid, Local           Hybrid, Local
Floral arrangement   Bouquet, Loose, Bunch   Bouquet, Loose, Bunch       Bouquet, Loose, Bunch   Bouquet, Loose, Bunch
Price                High, Medium, Low       High, Medium, Low           High, Medium, Low       High, Medium, Low
Table 10: Quality Attributes selected for the study
Source: Sharma (1997)
Sl no.                          Segment I     Segment II    Segment III
1       No. of consumers        11 (36.66)    15 (50)       4 (13.33)
        Relative importance
2       Color                   70.51         13.74         38.95
3       Variety                 16.45         22.17         3.14
4       Floral arrangement      3.22          12.04         48.55
5       Price                   9.82          52.21         9.37
6       Gender
        a) Male                 -             6             4
        b) Female               11            9             -
7       Age group
        a) <30 yrs              8             7             2
        b) >30 yrs              3             8             2
8       Education level
        a) Upto PUC             3             4             1
        b) Graduation           4             5             3
        c) Post graduation      4             6             -
9       Income level
        a) Low                  4             11            -
        b) Medium               2             2             2
        c) High                 5             2             2
Table 11: Average relative importance for the quality attributes of rose for
different segments and their socio-economic characteristics
Note: Figures in parentheses represent percentages to total consumers
Sl no.                          Segment I     Segment II    Segment III
1       No. of consumers        14 (46.66)    15 (50)       1 (3.33)
        Relative importance
2       Color                   56.41         24.64         19.55
3       Variety                 16.42         43.78         0.7
5       Price                   21.66         24.12         75.27
6       Gender
        a) Male                 6             4             -
        b) Female               8             11            1
7       Age group
        a) <30 yrs              8             8             1
        b) >30 yrs              6             7             -
8       Education level
        a) Upto PUC             1             4             1
        b) Graduation           10            7             -
        c) Post graduation      3             4             -
9       Income level
        a) Low                  2             1             -
        b) Medium               4             9             1
        c) High                 8             5             -
Table 12: Average relative importance for the quality attributes of Gerbera
for different segments and their socio-economic characteristics
1 No. of consumers
7
(23.33)
5
(16.66)
18
(60)
Relative importance
2 Color 70.94 17.52 30.97
3 Variety 6.66 9.7 40.31
5 Price 18.27 24.2 18.03
6 Gender
Male
Female
5
2
5
-
6
12
7 Gender
a) Male
b) Female
6
1
4
1
12
6
8 Age group
a) <30 yrs
b) >30 yrs
-
6
1
-
4
1
5
9
4
9 Education level
a) Upto PUC
b) Graduation
c) Post graduation
1
5
1
2
2
1
5
4
9
Table 13: Average relative importance for the quality attributes of gladiolus
Sl no.                          Segment I     Segment II    Segment III
1       No. of consumers        9 (30)        20 (66.67)    1 (3.33)
        Relative importance
2       Color                   10.74         53.11         20.05
3       Variety                 13.5          29.63         5.92
5       Price                   66.19         14.24         0.99
6       Gender
        a) Male                 5             3             1
        b) Female               4             17            -
7       Age group
        a) <30 yrs              5             11            -
        b) >30 yrs              4             9             1
8       Education level
        a) Upto PUC             1             4             -
        b) Graduation           7             9             1
        c) Post graduation      1             7             -
9       Income level
        a) Low                  7             3             -
        b) Medium               2             5             1
        c) High                 -             12            -
Table 14: Average relative importance for the quality attributes of chrysanthemum
Conclusion:
Cluster analysis can be a very useful data reduction
technique.
The classification has the effect of reducing the dimensionality
of a data table by reducing the number of rows (cases).
Different inter object measures and different algorithms can
and do affect the results.
The classification will depend upon the particular method
used.
Cluster analysis can be an invaluable tool in identifying latent
patterns by suggesting useful groupings (clusters) of objects
that are not discernible through other multivariate techniques.
Cluster analysis gives meaningful groups that can be
used for planning and policy-making purposes, such as making zones
and classifying market varieties.

Single linkage method:
It is based on the minimum distance, or nearest neighbor, rule.
The first two objects clustered are those that have the smallest
distance between them. The next shortest distance is then identified,
and either a third object is clustered with the first two or a new
two-object cluster is formed. At every stage, the distance
between two clusters is the distance between their two closest
points. Two clusters are merged at any stage by the single
shortest link between them. This process is continued until
all objects are in one cluster.

Complete linkage method:
It is similar to single linkage, except that it is based on the
maximum distance or the furthest neighbor approach.


Average linkage method:
In this method the distance between two clusters is defined as
the average of the distances between all pairs of objects,
where one member of the pair is from each of the clusters.
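
The single, complete and average linkage rules differ only in how they summarize the distances between all pairs of objects drawn from two clusters. The sketch below makes this concrete; the function names and the two small two-dimensional clusters are hypothetical and chosen only for illustration.

```python
import math
from itertools import product


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def linkage_distances(cluster_a, cluster_b):
    """Distance between two clusters under three linkage rules."""
    # All distances with one member of the pair from each cluster.
    pairwise = [euclid(p, q) for p, q in product(cluster_a, cluster_b)]
    return {
        "single": min(pairwise),                     # nearest neighbor
        "complete": max(pairwise),                   # furthest neighbor
        "average": sum(pairwise) / len(pairwise),    # mean over all pairs
    }


# Hypothetical clusters of two objects each.
cluster_a = [(0, 0), (0, 1)]
cluster_b = [(3, 0), (4, 0)]
distances = linkage_distances(cluster_a, cluster_b)
print(distances)
```

The single linkage distance can never exceed the average, and the average can never exceed the complete linkage distance, which is why the three rules may merge clusters in a different order.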

Ward's method:
This method is designed to minimize the variance
within clusters. The objective function is known as the
within-group sum of squares or the error sum of squares.
Ward's method calculates the sum of squared Euclidean
distances from each case in a cluster to the cluster mean on all
variables, and at each stage merges the two clusters whose
combination gives the smallest increase in this sum.
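
A minimal sketch of the quantity Ward's method works with is given below. The function names `ess` and `ward_increase` are illustrative, and the cases are taken from Table 1 of the hypothetical example.

```python
def ess(cluster):
    """Error sum of squares: squared distances of cases to the cluster mean."""
    means = [sum(col) / len(cluster) for col in zip(*cluster)]
    return sum((v - m) ** 2 for row in cluster for v, m in zip(row, means))


def ward_increase(cluster_a, cluster_b):
    """Increase in the error sum of squares caused by merging two clusters."""
    return ess(cluster_a + cluster_b) - ess(cluster_a) - ess(cluster_b)


# Cases from Table 1 (values on V1..V6).
case_14 = [4, 6, 4, 6, 4, 7]
case_16 = [3, 5, 4, 6, 4, 7]
case_6 = [6, 4, 6, 3, 3, 4]
case_7 = [5, 3, 6, 3, 3, 4]
case_2 = [2, 3, 1, 4, 5, 4]
case_13 = [2, 2, 1, 5, 4, 4]

inc_1 = ward_increase([case_14], [case_16])
inc_2 = ward_increase([case_6], [case_7])
inc_3 = ward_increase([case_2], [case_13])
print(inc_1, inc_1 + inc_2, inc_1 + inc_2 + inc_3)
```

The running totals of these increases (1.0, 2.0, 3.5) match the coefficients reported for the first three stages of the agglomeration schedule in Table 2, which suggests that the coefficient column there is the cumulative within-cluster sum of squares.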


Sequential threshold:
It starts by selecting one cluster seed and includes all
objects within a prespecified distance. When all objects
within the distance are included, a second cluster seed is
selected and the process continues.
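
The sequential threshold procedure can be sketched as below. Taking the first still-unassigned object as the next cluster seed is an assumption of this sketch (the description above does not say how seeds are chosen), and the function name and the sample points are hypothetical.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def sequential_threshold(points, threshold):
    """Form clusters one seed at a time; returns lists of point indices."""
    unassigned = list(range(len(points)))
    clusters = []
    while unassigned:
        seed = unassigned[0]  # assumption: next unassigned object is the seed
        # Include every unassigned object within the threshold of the seed.
        members = [i for i in unassigned
                   if euclid(points[i], points[seed]) <= threshold]
        clusters.append(members)
        unassigned = [i for i in unassigned if i not in members]
    return clusters


# Hypothetical two-dimensional objects.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (0, 1)]
clusters = sequential_threshold(points, threshold=2.0)
print(clusters)
```

Because each object is assigned as soon as a seed claims it, the result depends on the order of the objects and on the chosen threshold.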

Parallel threshold:
It selects several cluster seeds simultaneously in the
beginning and assigns objects within the threshold distance
to the nearest seed.

Optimization:
It is similar to the above methods except that it allows for
re-assignment of objects. If, in the course of assigning objects,
an object becomes closer to a cluster other than the
one to which it is currently assigned, an optimization
procedure switches the object to the more similar cluster.
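
The re-assignment idea can be sketched as a loop that recomputes cluster centroids and moves each object to its nearest centroid until nothing changes. Using centroids as the reference points is an assumption in the spirit of K-means; the function names, the sample points and the initial assignment are hypothetical.

```python
import math


def euclid(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(rows):
    """Mean of each coordinate over the given points."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def optimize_assignments(points, labels, max_iter=100):
    """Switch objects to the nearest cluster centroid until stable."""
    labels = list(labels)
    cluster_ids = sorted(set(labels))
    for _ in range(max_iter):
        centers = {}
        for c in cluster_ids:
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # a cluster that has lost all objects is dropped
                centers[c] = centroid(members)
        new_labels = [min(centers, key=lambda c: euclid(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical objects; the last one starts in the wrong cluster.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (1, 0)]
initial = [0, 0, 1, 1, 1]
final = optimize_assignments(points, initial)
print(final)
```

In this example the last object is switched from cluster 1 to cluster 0, after which no further re-assignments occur.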
Case no. V1 V2 V3 V4 V5 V6
1 6 4 7 3 2 3
2 2 3 1 4 5 4
3 7 2 6 4 1 3
4 4 6 4 5 3 6
5 1 3 2 2 6 4
6 6 4 6 3 3 4
7 5 3 6 3 3 4
8 7 3 7 4 1 4
9 2 4 3 3 6 3
10 3 5 3 6 4 6
11 1 3 2 3 5 3
12 5 4 5 4 2 4
13 2 2 1 5 4 4
14 4 6 4 6 4 7
15 6 5 4 2 1 4
16 3 5 4 6 4 7
17 4 4 7 2 2 5
18 3 7 2 6 4 3
19 4 6 3 7 2 7
20 2 3 2 4 7 2
Table 1: Attitudinal data for clustering