
Correspondence Analysis

Correspondence analysis is a descriptive/exploratory
technique designed to analyse simple two-way and
multi-way tables containing some measure of
correspondence between the rows and columns.
The results provide information which is similar in
nature to that produced by Factor Analysis
techniques, and they allow one to explore the structure
of the categorical variables included in the table. The most
common table of this type is the two-way
frequency cross-tabulation table.

Thursday, 13 July 2017 3:01 PM
Correspondence analysis (CA) may be defined as a
special case of Principal Components Analysis (PCA) of
the rows and columns of a table, especially applicable
to a cross-tabulation.

However, CA and PCA are used under different
circumstances. Principal components analysis is used
for tables consisting of continuous measurements,
whereas correspondence analysis is applied to
contingency tables (i.e. cross-tabulations). The primary
goal of correspondence analysis is to transform a table
of numerical information into a graphical display, in
which each row and each column is depicted as a point.
In a typical correspondence analysis, a cross-tabulation
table of frequencies is first standardised, so that the
relative frequencies across all cells sum to 1.0.

One way to state the goal of a typical analysis is to
represent the entries in the table of relative
frequencies in terms of the distances between
individual rows and/or columns in a low-dimensional
space.
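
The standardisation step can be sketched in a few lines of Python. This is an illustration only, not part of the SPSS procedure: the function name `correspondence_matrix` is an assumption, and the frequencies are those of the region by political outlook example used in these slides.

```python
# Frequencies from the region-by-political-outlook correspondence table.
counts = [
    [19, 23, 58, 16, 15],  # Northeast
    [26, 31, 71, 47, 35],  # Midwest
    [18, 27, 75, 46, 70],  # South
    [30, 19, 40, 26, 33],  # West
]

def correspondence_matrix(table):
    """Divide every cell by the grand total so that all cells sum to 1.0."""
    n = sum(sum(row) for row in table)
    return [[cell / n for cell in row] for row in table]

P = correspondence_matrix(counts)
print(round(sum(sum(row) for row in P), 10))  # total of the relative frequencies
```

Every later quantity (profiles, masses, inertia) can be derived from this table of relative frequencies.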

There are several parallels in interpretation between
correspondence analysis and factor analysis.
Correspondence Analysis, with Special Attention to the
Analysis of Panel Data and Event History Data
Peter G. M. van der Heijden and Jan de Leeuw
Sociological Methodology 1989 19 43-87 DOI:
10.2307/270948
We present correspondence analysis as an exploratory method that uses graphical
representations to study the relation between rows and columns of a two-way table with non-
negative entries. We present multiple correspondence analysis (MCA) as ordinary
correspondence analysis of a so-called superindicator matrix. In this matrix, objects (e.g.
persons) are in the rows, and each category of each variable has a separate column. MCA uses
only the bivariate marginal frequencies to derive a representation for the columns. Therefore, it
can handle data sets with many variables with many categories. We give special attention to
panel data and event history data. We show how these types of data can be coded in three-way
superindicator matrices with objects in the rows, categories of the variables in the columns, and
time points in the layers.
Simple Correspondence Analysis: A Bibliographic Review
Eric J. Beh
International Statistical Review (2004), 72(2) 257-284

Over the past few decades correspondence analysis has gained an international reputation as a
powerful statistical tool for the graphical analysis of contingency tables. This popularity stems
from its development and application in many European countries, especially France, and its use
has spread to English speaking nations such as the United States and the United Kingdom. Its
growing popularity amongst statistical practitioners, and more recently those disciplines where
the role of statistics is less dominant, demonstrates the importance of the continuing research
and development of the methodology.

The aim of this paper is to highlight the theoretical, practical and computational issues of simple
correspondence analysis and discuss its relationship with recent advances that can be used to
graphically display the association in two-way categorical data.

Correspondence Analysis Applied to Psychological
Research

L. Doey and J. Kurta

Tutorials in Quantitative Methods for Psychology
2011, Vol. 7(1), p. 5-14.

Correspondence analysis is an exploratory data technique used to analyze
categorical data (Benzecri, 1992). It is used in many areas such as marketing and
ecology. Correspondence analysis has been used less often in psychological research,
although it can be suitably applied. This article discusses the benefits of using
correspondence analysis in psychological research and provides a tutorial on how to
perform correspondence analysis using the Statistical Package for the Social
Sciences (SPSS).
An Introduction to Correspondence Analysis

P.M. Yelland

The Mathematica Journal 2010, Vol. 12, p. 1-23.

Cross tabulations (also known as cross tabs, or contingency tables)
often arise in data analysis, whenever data can be placed into two
distinct sets of categories. In market research, for example, we
might categorize purchases of a range of products made at
selected locations; or in medical testing, we might record adverse
drug reactions according to symptoms and whether the patient
received the standard or placebo treatment.
Correspondence Analysis
Susan C. Weller
Encyclopedia of Biostatistics 2005 DOI: 10.1002/0470011815.b2a13015

Correspondence analysis is a procedure for exploring the relationships
among two or more sets of variables. A key feature of the analysis is the
joint scaling of both row and column variables to provide information on
the interrelationships among variables within a set (either row or column
variables) and in interrelationships between row and column variables.
Correspondence analysis can be used on qualitative or quantitative data. A
final step in the analysis involves rescaling of characteristic vectors into
optimal scores. Normalising these optimal scores allows for assessment of
relative importance of factors. Correspondence analysis can also be used to
find the optimal ordering of variables for a given set of characteristics. If
the technique is extended to three or more sets of variables, it is called
multiple correspondence analysis.
Correspondence analysis is a useful tool to uncover the
relationships among categorical variables

N. Sourial, C. Wolfson, B. Zhu, J. Quail, J. Fletcher, S.
Karunananthan, K. Bandeen-Roche, F. Bland and H.
Bergman

Journal of Clinical Epidemiology 2010 63(6) 638-646
DOI: 10.1016/j.jclinepi.2009.08.008
Correspondence analysis (CA) is a multivariate graphical technique designed to
explore the relationships among categorical variables. Epidemiologists frequently
collect data on multiple categorical variables with the goal of examining associations
among these variables. Nevertheless, CA appears to be an underused technique in
epidemiology. The objective of this article is to present the utility of CA in an
epidemiological context.
Extensions?

Multiple correspondence analysis: one only or several
techniques?

Giovanni Di Franco

Qual Quant 2016 50 1299-1315 DOI: 10.5709/acp-0163-9

The history of multiple correspondence analysis (MCA) is a curious one: in about 80
years, it has been invented and re-invented by different authors independently of
each other. After a brief historical account of MCA, the present article intends
comparing the various techniques based on the multiple correspondence analysis
systems provided by two main schools: the French and the Dutch.
The data summarise individuals' political
affiliation (1, ..., 5) and geographic region (1, ..., 4).

Political affiliation codes:

1 Liberal
2 Tend Lib
3 Moderate
4 Tend Cons
5 Conservative

This document is loosely based on SPSS 10; Correspondence
Analysis Output, Faculty of Social and Behavioural Sciences,
Leiden University, Leiden, Netherlands.
Geographic region codes:

1 Northeast
2 Midwest
3 South
4 West
The raw data file holds one row per individual: 725 individuals,
so 725 rows of data.
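
Collating case-level rows into a frequency table can be sketched as follows. This is a minimal Python illustration, not what SPSS does internally: the seven records are hypothetical stand-ins for the 725 real rows, and `cross_tabulate` is an illustrative name.

```python
from collections import Counter

# Hypothetical raw records: one (region, politics) pair per individual,
# coded 1..4 and 1..5 as in the slides. These are NOT the real 725 rows.
records = [(1, 3), (2, 4), (3, 5), (3, 5), (4, 1), (1, 3), (2, 3)]

def cross_tabulate(pairs, n_rows, n_cols):
    """Collate case-level pairs into an n_rows x n_cols frequency table."""
    tally = Counter(pairs)
    return [[tally[(r, c)] for c in range(1, n_cols + 1)]
            for r in range(1, n_rows + 1)]

table = cross_tabulate(records, n_rows=4, n_cols=5)
for row in table:
    print(row)
```

Applied to the full data set, the same tally would give the 4 by 5 correspondence table that the analysis takes as input.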

Analyze > Dimension Reduction > Correspondence Analysis

Select the row and column variables and define their ranges.

Having defined the ranges, use the buttons at the
side of the screen to set the desired parameters.
Define Row Range: set the row bounds, select Update and
then Continue.

There are 4 regions.
Define Column Range: set the column bounds, select
Update and then Continue.

There are 5 political affiliations.
Finally, use the buttons at the side of the screen to set
the desired parameters.
Select Statistics

Select Plots

Select Continue to return to the main screen.

Finally, use the OK button to run the analysis,
or Paste to preserve the syntax.

Syntax

CORRESPONDENCE
TABLE = region4(1 4) BY politics(1 5)
/DIMENSIONS = 2
/MEASURE = CHISQ
/STANDARDIZE = RCMEAN
/NORMALIZATION = SYMMETRICAL
/PRINT = TABLE RPOINTS CPOINTS RPROFILES CPROFILES RCONF CCONF
/PLOT = NDIM(1,MAX) BIPLOT(20) RPOINTS(20) CPOINTS(20) TRROWS(20)
TRCOLUMNS (20) .

The Correspondence Table is simply the cross-
tabulation of the row and column variables,
including the row and column marginal totals,
serving as input.

Correspondence Table

                              Political Outlook
Region         Liberal  Tend Lib  Moderate  Tend Cons  Conservative  Active Margin
Northeast           19        23        58         16            15            131
Midwest             26        31        71         47            35            210
South               18        27        75         46            70            236
West                30        19        40         26            33            148
Active Margin       93       100       244        135           153            725

The Row Profiles are the cell contents divided by their
corresponding row total (e.g. 19/131 = 0.145 for the first
cell). This table also shows the column masses (column
marginals as a proportion of n, e.g. 93/725 = 0.128). These
are intermediate calculations on the way toward
computing distances between points. Note the column
of 1s.
Row Profiles

                              Political Outlook
Region         Liberal  Tend Lib  Moderate  Tend Cons  Conservative  Active Margin
Northeast         .145      .176      .443       .122          .115          1.000
Midwest           .124      .148      .338       .224          .167          1.000
South             .076      .114      .318       .195          .297          1.000
West              .203      .128      .270       .176          .223          1.000
Mass              .128      .138      .337       .186          .211
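
The arithmetic behind the Row Profiles table can be reproduced with a short Python sketch. The helper names `row_profiles` and `column_masses` are illustrative assumptions, not SPSS functions.

```python
counts = [
    [19, 23, 58, 16, 15],  # Northeast
    [26, 31, 71, 47, 35],  # Midwest
    [18, 27, 75, 46, 70],  # South
    [30, 19, 40, 26, 33],  # West
]

def row_profiles(table):
    """Each cell divided by its row total, so every profile sums to 1."""
    return [[cell / sum(row) for cell in row] for row in table]

def column_masses(table):
    """Column totals as a proportion of the grand total n."""
    n = sum(sum(row) for row in table)
    return [sum(col) / n for col in zip(*table)]

profiles = row_profiles(counts)
masses = column_masses(counts)
print([round(p, 3) for p in profiles[0]])  # Northeast profile
print([round(m, 3) for m in masses])       # column masses
```

The column profiles and row masses follow in the same way with the roles of rows and columns swapped.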
Column Profiles are the cell elements divided by
the column marginals (e.g. 19/93 = 0.204). This
table also shows the row masses (row marginals as
a proportion of n, e.g. 131/725 = 0.181). These are
intermediate calculations on the way toward
computing distances between points. Note the row
of 1s.
Column Profiles

                              Political Outlook
Region         Liberal  Tend Lib  Moderate  Tend Cons  Conservative   Mass
Northeast         .204      .230      .238       .119          .098   .181
Midwest           .280      .310      .291       .348          .229   .290
South             .194      .270      .307       .341          .458   .326
West              .323      .190      .164       .193          .216   .204
Active Margin    1.000     1.000     1.000      1.000         1.000
In the Summary table, we first look at the
chi-square value (41.489 on 12 degrees of freedom)
and see that it is significant, supporting the view
that the two variables are related.

Summary

                                                   Proportion of Inertia       Confidence Singular Value
           Singular                                                            Standard
Dimension  Value     Inertia  Chi Square  Sig.     Accounted for  Cumulative   Deviation   Correlation 2
1          .189      .036                          .627           .627         .035        -.043
2          .124      .015                          .268           .895         .040
3          .078      .006                          .105           1.000
Total                .057     41.489      .000a    1.000          1.000
a. 12 degrees of freedom
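
As a rough check on the Total row, the chi-square statistic, its degrees of freedom and the total inertia can be recomputed from the correspondence table. This is a minimal Python sketch rather than SPSS output; the function name `chi_square_summary` is an illustrative assumption.

```python
counts = [
    [19, 23, 58, 16, 15],  # Northeast
    [26, 31, 71, 47, 35],  # Midwest
    [18, 27, 75, 46, 70],  # South
    [30, 19, 40, 26, 33],  # West
]

def chi_square_summary(table):
    """Return (chi-square, degrees of freedom, total inertia = chi-square / n)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # under independence
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof, statistic / n

chi2, dof, total_inertia = chi_square_summary(counts)
print(round(chi2, 2), dof, round(total_inertia, 3))
```

The result should land close to the reported chi-square of 41.489 on 12 degrees of freedom, with a total inertia of about 0.057.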

SPSS has computed the interpoint distances and
subjected the distance matrix to principal
components analysis, yielding in this case three
dimensions, the maximum for a 4 by 5 table
(min(4, 5) - 1 = 3).


The three dimensions reported are the full solution for this
table (the cumulative proportion reaches 1.000). The eigenvalues
(labelled Inertia) add up to the total inertia, in this case
0.057, which is the chi-square value divided by n
(41.489/725). This small value reflects the fact that the
association between region and political outlook, while
significant, is weak.


The eigenvalues (called inertia here) reflect the relative
importance of each dimension, with the first always being
the most important, the next the second most important, etc.


The singular values are simply the square roots of the
eigenvalues. They are interpreted as the maximum
canonical correlation between the categories of the
variables in the analysis for any given dimension.
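
The link between eigenvalues, singular values and proportions of inertia can be illustrated without SPSS. The sketch below is an assumption-laden illustration: it forms the matrix of standardized residuals from the correspondence table and uses plain power iteration (a stand-in for whatever routine SPSS uses internally) to find the first eigenvalue. All function names are illustrative.

```python
import math

counts = [
    [19, 23, 58, 16, 15],  # Northeast
    [26, 31, 71, 47, 35],  # Midwest
    [18, 27, 75, 46, 70],  # South
    [30, 19, 40, 26, 33],  # West
]

def standardized_residuals(table):
    """S[i][j] = (p_ij - r_i * c_j) / sqrt(r_i * c_j), with p = table / n."""
    n = sum(sum(row) for row in table)
    r = [sum(row) / n for row in table]           # row masses
    c = [sum(col) / n for col in zip(*table)]     # column masses
    return [[(table[i][j] / n - r[i] * c[j]) / math.sqrt(r[i] * c[j])
             for j in range(len(c))] for i in range(len(r))]

def leading_eigenvalue(matrix, iterations=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix."""
    size = len(matrix)
    v = [1.0 / (k + 1) for k in range(size)]      # arbitrary non-zero start
    eigenvalue = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][k] * v[k] for k in range(size)) for i in range(size)]
        eigenvalue = math.sqrt(sum(x * x for x in w))
        v = [x / eigenvalue for x in w]
    return eigenvalue

S = standardized_residuals(counts)
# A = S S^T; its eigenvalues are the inertias of the dimensions.
A = [[sum(a * b for a, b in zip(row_i, row_j)) for row_j in S] for row_i in S]
inertia_1 = leading_eigenvalue(A)
singular_1 = math.sqrt(inertia_1)
total_inertia = sum(A[i][i] for i in range(len(A)))  # trace = sum of eigenvalues
proportion_1 = inertia_1 / total_inertia
print(round(singular_1, 3), round(proportion_1, 3))
```

The first singular value and its share of the total inertia should agree with the first row of the Summary table to within rounding.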

Note that the "Proportion of Inertia" columns are the
dimension eigenvalues divided by the total (table)
inertia. That is, they show the share of the total inertia
carried by each dimension: thus the first dimension
accounts for 62.7% of the total inertia of 0.057, and the
first two dimensions together account for 89.5%.
The standard deviation column refers back to the singular
values and helps the researcher assess the relative
precision of each dimension.

Keyword interpretations

Mass: the marginal proportions of the row variable, used
to weight the point profiles when computing point
distance. This weighting has the effect of compensating
for unequal numbers of cases.

Scores in dimension: scores used as coordinates for
points when plotting the correspondence map. Each point
has a score on each dimension.

Inertia: Variance
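
How mass and inertia fit together can be sketched in Python: the inertia of a row point is its mass times the squared chi-square distance between its profile and the average profile (the column masses). This is an illustrative sketch with an assumed function name, `point_inertias`; the values it produces correspond to the Inertia column of the Overview Row Points table.

```python
counts = [
    [19, 23, 58, 16, 15],  # Northeast
    [26, 31, 71, 47, 35],  # Midwest
    [18, 27, 75, 46, 70],  # South
    [30, 19, 40, 26, 33],  # West
]

def point_inertias(table):
    """Mass-weighted squared chi-square distance of each row profile to the centroid."""
    n = sum(sum(row) for row in table)
    centroid = [sum(col) / n for col in zip(*table)]  # average row profile
    result = []
    for row in table:
        total = sum(row)
        mass = total / n
        profile = [cell / total for cell in row]
        squared_distance = sum((p - c) ** 2 / c for p, c in zip(profile, centroid))
        result.append(mass * squared_distance)
    return result

inertias = point_inertias(counts)
print([round(x, 3) for x in inertias], round(sum(inertias), 3))
```

Summing the point inertias gives the total inertia of the table.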
Contribution of points to dimensions: as factor loadings
are used in conventional factor analysis to ascribe
meaning to dimensions, so "contribution of points to
dimensions" is used to intuit the meaning of
correspondence dimensions.

Contribution of dimensions to points: these are multiple
correlations, which reflect how well the principal
components model is explaining any given point (category).

The Overview Row Points table, for each row point in the
correspondence table, displays the mass, scores in
dimension, inertia, contribution of the point to the inertia
of the dimension, and contribution of the dimension to
the inertia of the point.
Overview Row Points(a)

                   Score in Dimension          Contribution
                                               Of Point to Inertia   Of Dimension to Inertia
                                               of Dimension          of Point
Region        Mass      1       2    Inertia      1       2             1       2     Total
Northeast     .181   -.702    .309    .020      .470    .139          .832    .105    .938
Midwest       .290   -.130    .065    .005      .026    .010          .181    .030    .210
South         .326    .540    .194    .020      .501    .099          .901    .076    .977
West          .204   -.055   -.675    .012      .003    .752          .010    .970    .979
Active Total 1.000                    .057     1.000   1.000
a. Symmetrical normalization
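
The "contribution of point to inertia of dimension" column can be recovered from the masses and scores. The sketch below assumes that, under symmetrical normalization, the mass-weighted sum of squared scores on a dimension equals that dimension's singular value; this assumption is consistent with the printed figures but is not stated in the slides. The inputs are the rounded values from the Overview Row Points and Summary tables.

```python
# (mass, score in dimension 1) for each row point, as printed in the table.
row_points = {
    "Northeast": (0.181, -0.702),
    "Midwest": (0.290, -0.130),
    "South": (0.326, 0.540),
    "West": (0.204, -0.055),
}
singular_value_1 = 0.189  # first singular value from the Summary table

# Assumed relation: contribution = mass * score^2 / singular value.
contributions = {
    name: mass * score ** 2 / singular_value_1
    for name, (mass, score) in row_points.items()
}
for name, value in contributions.items():
    print(name, round(value, 3))
```

Because the inputs are rounded to three decimals, the results match the table only approximately. Northeast and South dominate dimension 1, which is why that dimension is read as contrasting those two regions.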

The Overview Column Points table is similar to the
previous one, except that it describes the column variable
(political outlook rather than region) in the correspondence table.

Overview Column Points(a)

                       Score in Dimension          Contribution
                                                   Of Point to Inertia   Of Dimension to Inertia
                                                   of Dimension          of Point
Political Outlook Mass      1       2    Inertia      1       2             1       2     Total
Liberal           .128   -.491   -.800    .016      .163    .663          .363    .630    .993
Tend Lib          .138   -.351    .124    .003      .090    .017          .921    .075    .995
Moderate          .337   -.252    .334    .009      .113    .303          .448    .512    .960
Tend Cons         .186    .237   -.037    .006      .055    .002          .308    .005    .313
Conservative      .211    .721   -.094    .022      .579    .015          .940    .010    .950
Active Total     1.000                    .057     1.000   1.000
a. Symmetrical normalization

The Confidence Row Points table displays the standard
deviations of the row scores (the values used as
coordinates to plot the correspondence map), which are used
to assess their precision.

Confidence Row Points

Standard Deviation in
Dimension Correlation
Region 1 2 1-2
Northeast .190 .307 .528
Midwest .169 .323 .066
South .122 .206 -.685
West .339 .148 -.026

The Confidence Column Points table displays the standard
deviations of the column scores (the values used as
coordinates to plot the correspondence map), which are used
to assess their precision.

Confidence Column Points

Standard Deviation in
Dimension Correlation
Political Outlook 1 2 1-2
Liberal .387 .221 -.694
Tend Lib .072 .117 .801
Moderate .171 .122 .575
Tend Cons .215 .406 .095
Conservative .127 .302 .304

The plots of transformed categories show how the row
category values and the column category values are
transformed into scores in dimension, with one plot
per dimension.

The x-axis has the category values and the y-axis has the
corresponding dimension scores. Thus the category
"Northeast" in the Overview Row Points table above had a
score in dimension of -0.702, as shown on the plot.


Refer back to Overview Row Points dimension 1


Why join!

Refer back to Overview Row Points dimension 2



Refer back to Overview Column Points dimension 1

Refer back to Overview Column Points dimension 2

The uniplots for the row and column variables. Note that
the origin of the axes is slightly different in the two
plots.


Refer back to Overview Row Points dimensions 1 and 2

Refer back to Overview Column Points dimensions 1 and 2

Certain MDS and kernel projections output
horseshoes that are characteristic of dimensionality
reduction techniques. In general, a latent ordering of
the data gives rise to these patterns when one only has
local information. That is, when only the inter-point
distances for nearby points are known accurately.

Horseshoes In Multidimensional Scaling And Local Kernel Methods
P. Diaconis, S. Goel and S. Holmes
The Annals of Applied Statistics 2008 2(3) 777-807
DOI: 10.1214/08-AOAS165

Finally the biplot correspondence map is obtained.

Note the axes now encompass the most extreme values of
both of the uniplots.

Note that while some generalizations can be made about
the association of categories (South more conservative,
West more liberal), the researcher must keep firmly in
mind that correspondence is not association. That is, the
researcher should not allow the map's display of
inter-category distances to obscure the fact that, for this
example, the total inertia is only 0.057, so the association
in the correspondence table is weak.

Refer back to Overview Row Points dimensions 1 and
2 and Overview Column Points dimensions 1 and 2.
Care must be taken when interpreting the previous
plot. It must be remembered that distances between
columns and rows are not defined.

Symmetrical normalization (via the Model button,
see slide) is a technique used to standardize row and
column data so as to be able to make general
comparisons between the two. Other forms of
normalization allow you to compare row points with
one another, or column points with one another,
but not rows with columns (see Garson, 2012, for further
information on other standardization techniques for
correspondence analysis; see also Doey and Kurta, 2011,
slide).
Input of a collated data matrix, so 5 × 4 rather than
725 × 2.

An SPSS program that will do this operation is
ANACOR, although since we are using data in table
form, this has to be performed using the command
syntax window.

The data editor appears below. If you wish you may name the
columns. These names will then appear in the final plots.
Transpose the matrix for the row names to be employed.

Save the collated data matrix, as xls or sav.

Note that we have only the matrix of interest in this view.

You must employ the syntax,

either via File > Open > Syntax

with the prepared commands in an ASCII file

ANACOR TABLE= ALL (5 , 4)
/DIMENSION = 2
/NORMALIZATION = canonical
/VARIANCES= COLUMNS
/PLOT =NDIM (1 , 2) .

Note the keyword "ALL", since we are providing the table.

Note "5" for the number of rows.

Note "4" for the number of columns.

or via File > New > Syntax

with the commands input into the Syntax Editor.

The solution is, of course, unchanged.

The data have not been made available for the
following examples.

Two more illustrative examples.

In the table, 14 topics are taken from the testbed of
1033 MEDLINE abstracts on biomedicine obtained
from the National Library of Medicine. All the
underlined words in the table denote keywords which
are used as referents to the medical topics. The
parsing rule used for this sample database required
that keywords appear in more than one topic. Of
course, alternative parsing strategies can increase or
decrease the number of indexing keywords (or terms).

Computational Methods for Intelligent Information Access, Michael W. Berry,
Susan T. Dumais and Todd A. Letsche
And apparently misreported by
Algorithms for Binary Factor Analysis, Aleš Keprt

Corresponding to the text in the previous table is the
18 by 14 term-document matrix shown here. The
elements of this matrix are the frequencies in which a
term occurs in a document or medical topic. For
example, in medical topic M2, the second column of
the term-document matrix, culture, discharge, and
patients all occur once.

Can you discern any structure?
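
Since the MEDLINE table itself is not reproduced here, the construction of a term-document matrix can only be sketched on made-up data. In the Python illustration below, the three topic titles, the stop-word list and the function name are all hypothetical; only the parsing rule (keep words that appear in more than one topic) comes from the description above.

```python
# Hypothetical mini-corpus: three made-up topic titles, NOT the MEDLINE data.
topics = {
    "T1": "study of blood pressure in patients",
    "T2": "culture of patients at discharge",
    "T3": "blood culture study",
}
STOP_WORDS = {"of", "in", "at"}  # assumed list of words to ignore

def term_document_matrix(documents):
    """Rows are keywords found in more than one document; columns are documents."""
    tokenised = {name: [w for w in text.split() if w not in STOP_WORDS]
                 for name, text in documents.items()}
    vocabulary = sorted({w for words in tokenised.values() for w in words})
    # Parsing rule from the slide: keep words appearing in more than one topic.
    keywords = [w for w in vocabulary
                if sum(w in words for words in tokenised.values()) > 1]
    matrix = [[tokenised[name].count(w) for name in documents] for w in keywords]
    return keywords, matrix

keywords, matrix = term_document_matrix(topics)
for word, row in zip(keywords, matrix):
    print(word, row)
```

A matrix of this shape, terms in rows and topics in columns, is what the correspondence analysis takes as its input table.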


[Figure: Row Plot. The 14 medical topics (M1 to M14) plotted on Component 1 (x-axis) against Component 2 (y-axis).]

[Figure: Column Plot. The 18 keywords (fast, rats, abnormal, respect, generation, age, rise, patients, culture, pressure, oestrogen, study, depressed, discharged, blood, disease, behavior, close) plotted on Component 1 (x-axis) against Component 2 (y-axis).]

Finally a set you might recognise!

Can you discern any structure?

[Figure: Row Plot. One point per individual, labelled by surname, plotted on Component 1 (x-axis) against Component 2 (y-axis).]

[Figure: Column Plot. The modules (codes PSY3001 to PSY3097) plotted on Component 1 (x-axis) against Component 2 (y-axis).]

It would be interesting to repeat the analysis excluding the (central)
compulsory modules.
SPSS Tips

Now you should go and try for yourself.

Each week our cluster (5.05) is booked for 2 hours
after this session. This will enable you to come and
go as you please.

Obviously other timetabled sessions for this module
take precedence.
