Cluster analysis for political scientists

Dalson Figueiredo
Federal University of Pernambuco
Enivaldo Rocha
Federal University of Pernambuco
Mariana Batista
Federal University of Pernambuco
Ranulfo Paranhos
Federal University of Alagoas
José Alexandre
Federal University of Alagoas
March 19, 2014
Abstract
This paper provides an intuitive introduction to cluster analysis. Our
target audience is undergraduate and graduate students in their initial
training stage. Methodologically, we use basic simulation to illustrate
the underlying logic of cluster analysis. In addition, we replicate
data from Coppedge, Alvarez and Maldonado (2008) to classify political
regimes according to Dahl's (1971) polyarchy dimensions: contestation
and inclusiveness. With this paper we hope to diffuse the cluster analysis
technique in political science and help novice scholars not only to
understand but also to employ cluster analysis in their own research designs.
Keywords: cluster analysis, Q analysis, political regimes.
Classification of objects into meaningful sets - clustering - is
an important procedure in all of the social sciences
Richard G. Niemi
1 INTRODUCTION
Classification of objects into meaningful groups is a central task in science
(AHLQUIST and BREUNIG, 2011). Cluster analysis is a statistical technique
designed to classify units into groups. Although cluster analysis is widely
employed in other disciplines, its use in Political Science is limited when
compared to linear regression (LEWIS-BECK and KRUEGER, 2008), factor analysis
and other multivariate statistical techniques (TABACHNICK and FIDELL, 2007).
The principal aim of this paper is to present an intuitive introduction to
cluster analysis for political scientists. Our target audience is undergraduate
and graduate students in their initial training stage. Methodologically,
we use basic simulation to illustrate the underlying logic of cluster analysis. In
addition, we replicate data from Coppedge, Alvarez and Maldonado (2008) to
classify political regimes according to Dahl's polyarchy dimensions: contestation
and inclusiveness. On substantive grounds, we hope to facilitate the
understanding and application of the cluster analysis technique in Political Science.
The remainder of the paper is organized as follows. The next section briefly
reviews part of the literature on cluster analysis. The second section presents the
steps that should be followed to properly apply cluster analysis. The third
section provides an example of a research design using cluster analysis and presents
the main statistics of interest. Finally, we present the conclusions of the article.
2 LITERATURE REVIEW¹
For a long time cluster analysis was restricted to a limited group of
researchers due to its mathematical complexity (ALDENDERFER and
BLASHFIELD, 1984). Advances in computing facilitated the dissemination
of cluster techniques across different areas. Today, statistical packages can
quickly perform distance calculations and therefore facilitate the
use of cluster analysis by non-specialists.
But what is cluster analysis after all? According to Aldenderfer and
Blashfield, cluster analysis is a generic designation for a large group of techniques
that can be used to create a classification. Such procedures result in empirical
clusters or groups of strongly similar objects (ALDENDERFER and
BLASHFIELD, 1984: 7). For Hair et al (2009), cluster analysis gathers individuals
or objects into clusters such that objects in the same cluster are more similar to
each other than to objects in other clusters (HAIR ET AL, 2009: 555). The main
purpose of cluster analysis is to group cases according to their degree of similarity.
The underlying logic of cluster analysis is similar to that of factor analysis. The basic
difference is that, in factor analysis, the researcher seeks to represent
a set of observed variables with a smaller number of factors, while in cluster
analysis she seeks to represent a set of cases with a smaller number of groups
(clusters). Figure 1 illustrates an ideal type of cluster analysis².
¹ For more information on cluster analysis see Zubin (1938), Tryon (1939), Driver and
Kroeber (1932), Peters (1958), Sokal and Sneath (1963), MacRae (1966), Johnson (1967),
Bailey (1975), Everitt (1980) and Aldenderfer and Blashfield (1984). For different applications
to political science data see Ahlquist and Breunig (2011) and Jang and Hitchcock (2012).
² See the appendix for the simulation syntax.
Figure 1 - Cluster analysis examples
Cases are grouped according to their degree of mutual proximity, which the
literature calls distance/similarity. There are different ways of estimating
how far apart or close together observations are. In general, the aim is to
maximize homogeneity within clusters while maximizing heterogeneity between groups.
Our simulated data can be clustered into three different groups: A, B and C. The left
figure illustrates the distribution of two simulated variables (X1 and X2). The
Pearson correlation between them is .980 (p-value < .001; n = 150) considering all
clusters together as a single group. When we compare the clusters, we observe
that the correlation is not statistically significant for any group: (A; r = .019;
p-value = .897; n = 50), (B; r = -.096; p-value = .509; n = 50) and (C; r = .052;
p-value = .719; n = 50). Similarly, the right figure shows the scatter distribution of
X1 and X3. Taking all the observations at once, the Pearson correlation between
them is .776 (p-value < .001; n = 150). Disaggregating by cluster, the correlation
between X1 and X3 is negative for each group: (A; r = -.528; p-value < .001; n =
50), (B; r = -.701; p-value < .001; n = 50) and (C; r = -.501; p-value < .001; n
= 50). Substantively, these simulations show not only the practical importance
of the cluster analysis technique but also that ignoring the clustered
nature of data can lead to wrong conclusions about political phenomena.
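The contrast between pooled and within-cluster correlations can be reproduced with a few lines of code. The sketch below is a Python re-creation of the logic of the appendix syntax, not the authors' original SPSS run: the seed, the function names and the use of Python's standard library are our assumptions, so the exact coefficients will differ from those reported above.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(seed=2014, n_per_cluster=50):
    """Draw x1 and x2 around cluster-specific means, then build x3 from them."""
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    means = {"A": 10.0, "B": 5.0, "C": 1.0}  # cluster means used in the appendix
    data = {}
    for label, mu in means.items():
        x1 = [rng.gauss(mu, 0.5) for _ in range(n_per_cluster)]
        x2 = [rng.gauss(mu, 0.5) for _ in range(n_per_cluster)]
        # Same combination as the appendix: x3 = -.6 * x1 + sqrt(1 - .6^2) * x2
        x3 = [-0.6 * a + math.sqrt(1 - 0.6 ** 2) * b for a, b in zip(x1, x2)]
        data[label] = (x1, x2, x3)
    return data


data = simulate()
# Concatenate the three clusters to obtain the pooled x1, x2 and x3 series.
pooled = [sum((data[c][i] for c in data), []) for i in range(3)]

print("pooled r(x1, x2):", round(pearson(pooled[0], pooled[1]), 3))
print("pooled r(x1, x3):", round(pearson(pooled[0], pooled[2]), 3))
for c in data:
    x1, x2, x3 = data[c]
    print(c, "r(x1, x2):", round(pearson(x1, x2), 3),
          "r(x1, x3):", round(pearson(x1, x3), 3))
```

The pooled coefficients are driven by the distance between cluster means, while the within-cluster coefficients reflect only how the variables were generated inside each group.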
3 PLANNING A CLUSTER ANALYSIS
This section summarizes the requirements that must be met to properly use cluster
analysis. Political scientists should follow five steps:
1. Sample selection
2. Variable selection
3. Similarity measure and cluster method determination
4. Definition of the number of clusters
5. Results validation
Unlike other statistical techniques, sample size in cluster analysis is
not related to statistical inference, since the aim is not to estimate to what extent
the results found in the sample can be extended to the population (HAIR ET
AL, 2009). Instead, the sample size should ensure the representation of small
groups. Moreover, unlike other multivariate techniques, there is no general rule to
specify a minimum sample size (DOLNICAR, 2002). Our recommendation is that
the more variables are included in the analysis, the more cases should be collected.
Since cluster analysis is sensitive to outliers, it is also important to check for
atypical observations. Hair et al (2009) suggest graphical inspection of a profile
diagram. The researcher can also use box plots and scatter plots to identify outliers,
in addition to the standard tests available in different statistical packages.
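As an illustration of the box-plot check, the sketch below flags values lying more than 1.5 interquartile ranges beyond the quartiles, which is the conventional box-plot rule. The function name and the scores are hypothetical, and other packages may compute quartiles slightly differently, so borderline cases can vary.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the values outside the box-plot fences (quartiles -/+ k * IQR)."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # first, second, third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up scores with one clearly atypical observation.
scores = [0.1, 0.3, 0.2, 0.4, 0.25, 0.35, 0.15, 5.0]
print(boxplot_outliers(scores))
```

Flagged cases should be inspected rather than dropped automatically, since an atypical case may be a substantively meaningful small group.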
The second step is to select which variables will be used to estimate the
distance/similarity between cases. The choice of variables is one of the most
important steps, but unfortunately one of the least understood (ALDENDERFER
and BLASHFIELD, 1984). Hair et al (2009) state that only variables that
characterize the objects to be grouped and relate specifically to the goals of
the cluster analysis should be included. Ideally, the research design should select
only theoretically relevant variables to classify cases. The authors warn that
otherwise there is a serious risk of naive empiricism, producing results that are
conceptually empty and do not contribute to knowledge accumulation.
After selecting the variables, the next step is to define the similarity
measure. Pohlmann (2007) defines similarity as a measure of correspondence
between the objects to be grouped. There are different ways to calculate
these measures, and different measures tend to produce
distinct solutions. Important considerations include the measurement level
of the variables and knowledge of the field under study (POHLMANN, 2007). We
recommend that beginners start with the more conventional measures and
incorporate other ones as they learn.
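To make the point about measures concrete, the sketch below computes four common distances for metric variables (Euclidean, squared Euclidean, block and Chebychev) on made-up two-dimensional cases. The function names and the coordinates are hypothetical and chosen only to show that the nearest case can change with the measure.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two cases."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    """Euclidean distance without the square root; penalizes large gaps more."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def block(a, b):
    """Block (city-block) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev(a, b):
    """Chebychev distance: largest absolute difference on any variable."""
    return max(abs(x - y) for x, y in zip(a, b))


# Hypothetical cases measured on two variables.
case_a, case_b, case_c = (0.0, 0.0), (3.0, 4.0), (0.0, 4.5)

for name, dist in [("euclidean", euclidean), ("squared", squared_euclidean),
                   ("block", block), ("chebyshev", chebyshev)]:
    print(name, "a-b:", dist(case_a, case_b), "a-c:", dist(case_a, case_c))
```

With these values, case c is closer to case a under the Euclidean and block distances, whereas case b is closer under the Chebychev distance, so the choice of measure can change which cases end up grouped together.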
The next step is to decide the clustering method (mathematical algorithm).
That is, the researcher must define how distances are calculated and how many
clusters (groups) should be created³. There are three general approaches to
creating clusters: a) hierarchical clustering; b) K-means clustering; and c) two-step
clustering. The hierarchical clustering approach is more suitable for small
samples (n < 250). As the sample size grows, the algorithm
becomes slower and may even crash the computer. In hierarchical clustering,
the clusters are nested (not mutually exclusive). The researcher can choose a
range for the number of clusters or the exact number of groups to be created.
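The nested character of hierarchical solutions can be seen in a minimal agglomerative sketch. The code below uses the nearest neighbor (single linkage) rule with Euclidean distance; it is an illustration under our own assumptions (function name, toy coordinates), not the routine implemented in any statistical package. It recomputes all pairwise distances at every merge, which also hints at why the approach slows down as n grows.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Returns the final clusters (lists of case indices) and the merge history.
    """
    clusters = [[i] for i in range(len(points))]  # every case starts alone
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Nearest neighbor rule: distance between the closest members.
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((list(clusters[i]), list(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges


# Hypothetical cases on two variables.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_linkage(points, 3)[0])
print(single_linkage(points, 2)[0])
```

The two-group solution is obtained by joining two clusters of the three-group solution, which is what nesting means in practice.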
The K-means clustering method is more suitable for larger samples (n >
1,000) since it does not compute the proximity matrix between all cases. As
a similarity measure, the K-means clustering approach uses the Euclidean
distance, and the researcher must specify in advance the number of groups
(clusters) to be formed. The two-step clustering method is ideal for large datasets.
Like hierarchical clustering, K-means clustering can present computational
problems when the sample is too large. In addition, the two-step output presents
more options, including a graph comparing the importance of each variable in
cluster formation.
³ For example, version 20 of SPSS has the following similarity measures for metric
variables: Euclidean distance, squared Euclidean distance, Cosine, Pearson correlation,
Chebychev, Block, Minkowski and Customized. As clustering methods, it is possible to choose
among the following: Between-groups linkage, Within-groups linkage, Nearest neighbor,
Furthest neighbor, Centroid clustering, Median clustering and Ward's method.
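The K-means logic can be sketched as an alternation between assigning each case to its nearest centroid and recomputing the centroids. The code below is a minimal illustration under our own assumptions (random initial centroids drawn from the data, a fixed seed, toy coordinates); statistical packages differ in how they initialize and stop the procedure.

```python
import math
import random


def k_means(points, k, seed=0, max_iter=100):
    """Basic K-means with Euclidean distance; returns labels and centroids."""
    rng = random.Random(seed)          # hypothetical seed for reproducibility
    centroids = rng.sample(points, k)  # k cases serve as starting centroids
    labels = None
    for _ in range(max_iter):
        # Assignment step: each case goes to the nearest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:       # no case changed group: stop
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(coord) / len(members)
                                     for coord in zip(*members))
    return labels, centroids


# Hypothetical cases forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Only distances between cases and the k centroids are computed, never the full case-by-case proximity matrix, which is why the method scales to larger samples.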
After choosing the similarity measure and the agglomeration method, the
researcher must define the number of groups (K) to be formed. This choice
should be theoretically guided. For example, if previous studies suggested the
existence of three groups, one analytic possibility is to replicate that number of
groups in order to evaluate the stability of the solution. In the absence of
theoretical guidance, the researcher can adopt an exploratory approach and run
different analyses varying the number of groups. The different solutions should
be compared with the literature in search of a substantive explanation.
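One way to carry out the exploratory approach is to compare how much within-group variation remains as K increases. The sketch below is deliberately simplified: it works on a single made-up variable and finds the best grouping for each K by exhaustively trying every split of the sorted values, which is feasible only for tiny datasets. The function names and scores are hypothetical.

```python
from itertools import combinations


def within_ss(groups):
    """Total within-group sum of squared deviations from each group mean."""
    total = 0.0
    for g in groups:
        m = sum(g) / len(g)
        total += sum((v - m) ** 2 for v in g)
    return total


def best_partition(values, k):
    """Best split of one variable into k contiguous groups (exhaustive search)."""
    xs = sorted(values)
    best = None
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        groups = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        ss = within_ss(groups)
        if best is None or ss < best[0]:
            best = (ss, groups)
    return best


# Made-up scores with three visible groups.
scores = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
for k in range(1, 5):
    ss, groups = best_partition(scores, k)
    print(k, round(ss, 3), groups)
```

The within-group sum of squares always falls as K grows, so the useful signal is where the decline flattens; that candidate K should then be checked against theory and previous findings.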
Finally, results should be validated in order to ensure practical significance
(HAIR ET AL 2009). To do so, the researcher can partition the original
sample and compare the solutions obtained in both subsamples. Another way
to test the predictive ability of the solution is to compare the groups on a variable
that was not used in the initial solution. For example, when separating
groups according to smoking habits, it is expected that, on average, the physical
strength of non-smokers is higher than that of smokers. Thus, after separating the
groups, the researcher can conduct a battery of physical tests and see whether the
group of non-smokers in fact presents a superior performance. Or, when classifying
political regimes according to their level of democratization, the researcher can
estimate the extent to which income inequality varies among different groups
of countries, assuming that democracies tend to promote a more equal income
distribution than non-democracies.
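The external-variable check can be sketched as a comparison of group means on a variable left out of the clustering. The code below uses a permutation test, which is our choice for illustration and only one of several possible tests; the inequality scores are invented and do not come from any dataset.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)  # hypothetical seed for reproducibility
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign cases to groups at random
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Invented inequality scores for two clusters of countries.
democracies = [0.28, 0.31, 0.30, 0.27, 0.33, 0.29]
non_democracies = [0.45, 0.41, 0.48, 0.39, 0.44, 0.47]

print("mean difference:", round(mean(non_democracies) - mean(democracies), 3))
print("p-value:", permutation_p_value(democracies, non_democracies))
```

If the clusters differ on the external variable in the direction theory predicts, the classification gains credibility; if not, the solution deserves a second look.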
4 EXAMPLE OF RESEARCH DESIGN: CLASSIFYING POLITICAL REGIMES
Methodologically, we replicate data from Coppedge, Alvarez and Maldonado
(2008) to classify political regimes according to the two dimensions of polyarchy
proposed by Dahl (1971): contestation and inclusiveness. Figure 2 illustrates
these dimensions.
Figure 2 - Two theoretical dimensions of democratization
Following Dahl's (1971) original typology we observe four ideal types of
political regimes: polyarchies (high scores in both dimensions), competitive
oligarchies (high contestation, low inclusiveness), inclusive hegemonies (low
contestation, high inclusiveness) and closed hegemonies (low scores in both
dimensions). Before deciding on the similarity measure and agglomeration methods, we
should graphically analyze the observed distribution of countries on Dahl's two
dimensions. Figure 3 displays this information.
Figure 3 - Observed countries distribution
The scores of both inclusiveness and contestation are standardized (mean
= 0; std = 1) and we use the average as the benchmark for comparison. Denmark
(DNMK) is an example of a polyarchy (upper right). Vietnam (VNM)
represents the political regimes that Dahl (1971) called inclusive
hegemonies (lower left). Afghanistan (AFGN) can be classified as a closed hegemony
(lower right). Finally, countries in the upper left are competitive
oligarchies; Ghana (GHNA) is an example of this ideal institutional design. To
classify countries according to Dahl's theoretical dimensions, we chose Euclidean
distance as the similarity measure and ran three different agglomeration methods.
Figure 4 shows the observed results in comparative perspective.
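Before any clustering, the benchmark described above (standardized scores compared with the mean of zero) already yields a simple rule-based classification into Dahl's four ideal types. The sketch below illustrates that rule with invented scores for four fictitious countries; it does not use the Coppedge, Alvarez and Maldonado (2008) data, and the function names are hypothetical.

```python
import statistics


def standardize(values):
    """Convert raw scores to z-scores (mean 0, standard deviation 1)."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]


def dahl_type(contestation_z, inclusiveness_z):
    """Assign an ideal type using the mean (z = 0) as the cut-off."""
    if contestation_z >= 0 and inclusiveness_z >= 0:
        return "polyarchy"
    if contestation_z >= 0:
        return "competitive oligarchy"
    if inclusiveness_z >= 0:
        return "inclusive hegemony"
    return "closed hegemony"


# Invented raw scores for four fictitious countries.
countries = ["P1", "P2", "P3", "P4"]
contestation = standardize([9, 8, 2, 1])
inclusiveness = standardize([9, 2, 8, 1])

regimes = [dahl_type(c, i) for c, i in zip(contestation, inclusiveness)]
for name, regime in zip(countries, regimes):
    print(name, regime)
```

Cluster analysis goes beyond this fixed cut-off by letting the data indicate where group boundaries fall, which is how additional groups can emerge within a quadrant.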
Figure 4 - Agglomeration methods comparison
The dotted lines represent the mean of each dimension. In the two-step
solution, cluster analysis identified a new regime type: super polyarchies.
Technically, all countries located in the upper right quadrant should be classified
as polyarchies following Dahl's original typology. However, it is clear that
super polyarchies score higher on both inclusiveness and contestation. In
the k-means solution we highlight the intermediate hegemonies. These political
regimes are more democratic than closed hegemonies but are systematically
less inclusive than inclusive hegemonies. Finally, the hierarchical solution shows
that if we restrict our definition of intermediate hegemonies, we observe an
unspecified political system that varies between two different ideal types (closed
hegemonies and inclusive hegemonies).
5 CONCLUSION
Since political scientists usually work with typologies, we believe that cluster
analysis is an important tool to classify units into groups. Its main advantage is
that it produces objective and replicable classifications that can advance our knowledge
of political phenomena. This paper provided an intuitive introduction to
cluster analysis. Our target audience is political science undergraduate
and graduate students in their initial training stage. Methodologically, we used
basic simulation to illustrate the underlying logic of cluster analysis. In addition,
we replicated data from Coppedge, Alvarez and Maldonado (2008) to classify
political regimes according to Dahl's (1971) polyarchy dimensions: contestation
and inclusiveness. With this paper we hope to diffuse the cluster analysis technique
in Political Science and help novice scholars not only to understand but also to
employ cluster analysis in their own research designs.
6 APPENDIX
Syntax for X1 and X2 per cluster
* Draw X1 and X2 from normal distributions with cluster-specific means.
IF (cluster = 1) x2=RV.NORMAL(10,.5).
EXECUTE.
IF (cluster = 1) x1=RV.NORMAL(10,.5).
EXECUTE.
IF (cluster = 2) x1=RV.NORMAL(5,.5).
EXECUTE.
IF (cluster = 2) x2=RV.NORMAL(5,.5).
EXECUTE.
IF (cluster = 3) x2=RV.NORMAL(1,.5).
EXECUTE.
IF (cluster = 3) x1=RV.NORMAL(1,.5).
EXECUTE.
Syntax for X3 per cluster
* Build X3 as a weighted combination of X1 and X2 within each cluster.
IF (cluster = 1) x3=x1 * -.6+x2 * SQRT(1-.6 ** 2).
EXECUTE.
IF (cluster = 2) x3=x1 * -.6+x2 * SQRT(1-.6 ** 2).
EXECUTE.
IF (cluster = 3) x3=x1 * -.6+x2 * SQRT(1-.6 ** 2).
EXECUTE.
References
[1] ALDENDERFER, M. S. and BLASHFIELD, R. K. "Cluster Analysis". Sage
University Paper Series: Quantitative Applications in the Social Sciences,
1984.
[2] AHLQUIST, J. S. and BREUNIG, C. (2011). Model-Based Clustering and
Typologies in the Social Sciences. Political Analysis (Winter 2012) 20 (1):
92-112.
[3] BAILEY, K. D. "Cluster Analysis". Sociological Methodology, vol. 6, p.
59-128, 1975.
[4] COPPEDGE, M.; ALVAREZ, A.; MALDONADO, C. "Two Persistent
Dimensions of Democracy: Contestation and Inclusiveness". Journal of
Politics, v. 70, n. 3, p. 145, 2008.
[5] DAHL, R. Poliarquia: Participação e Oposição. São Paulo: Edusp, 1971.
[6] DOLNICAR, S. A review of unquestioned standards in using cluster analysis
for data-driven market segmentation. Faculty of Commerce - Papers. 2002.
[7] EVERITT, B. S. Cluster Analysis. Second Edition, London: Heinemann
Educational Books Ltd, 1980.
[8] HAIR, J. et al (2009). Multivariate data analysis. 17ª Edição. Prentice Hall.
[9] JANG, J. and HITCHCOCK, D. B. (2012). "Model-based Cluster Analysis
of Democracies". Journal of Data Science, 10, 297-319.
[10] JOHNSON, S. "Hierarchical clustering schemes". Psychometrika, 38, p.
241-254, 1967.
[11] LEWIS-BECK, M. and KRUEGER, J. (2008). Is OLS Dead?
The Political Methodologist, vol 15, no 2: 24.
[12] PETERS, W. S. "Cluster Analysis in Urban Demography". Social Forces,
v. 37, n. 1, p. 38-44, 1958.
[13] POHLMANN, M. C. Análise de Conglomerados. In: CORRAR, L. J.;
EDÍLSON, P.; DIAS FILHO, J. M. (Orgs.). Análise Multivariada. São
Paulo: Atlas, 2007.
[14] TABACHNICK, B.; FIDELL, L. (2007). Using multivariate statistics.
Needham Heights: Allyn & Bacon.