Paris, 2007
7-8 June 2007
Correspondence Analysis
and Related Methods
Michael Greenacre
Universitat Pompeu Fabra
Barcelona
michael@upf.es
www.globalsong.net
www.econ.upf.es/~michael
Correspondence analysis
and Related Methods Part 1
1. What is correspondence analysis (CA)?
2. Why is CA so useful as a method of visualizing
tabular data?
3. How is CA implemented in XLSTAT?
(by CA I mean simple CA, as opposed to
multiple CA, which is discussed in the next
talk)
Jean-Paul Benzécri...
Correspondence analysis:
in which areas of research is it useful?
CA visualizes complex data, primarily data on categorical
measurement scales, facilitating understanding and
interpretation, a neglected aspect of statistical
enquiry (cf. the usual modelling approach)
linguistics, textual analysis: word frequencies
sociology: cross-tabulations and large sets of
categorical data from questionnaires; useful for
qualitative research, visualization of case study data
ecology: species abundance data at several
locations, often with explanatory variables
market research: perceptual mapping of
brands/products, ...
archeology: large sparse data matrices
biology, geology, chemistry, psychology...
A simple example
312 respondents, all readers of a certain newspaper, cross-tabulated
according to their education group and level of reading of the
newspaper
[Figure: CA map of the readership data, showing education groups E1 to E5 and reading classes C1 to C3. First axis: 0.0704 (84.52 %); second axis: 0.0129 (15.48 %).]
[Diagram: a point with mass $m_i$ at distance $d_i$ from the centroid, and its projection onto a subspace.]
inertia = $\sum_i m_i d_i^2$
Profile
A profile is a set of relative frequencies, that is a set of frequencies
expressed relative to their total (often in percentage form).
Each row or each column of a table of frequencies defines a different
profile.
It is these profiles which CA visualises as points in a map.
[Figure: three panels for the readership table (rows E1 to E5, columns C1 to C3): original data, row profiles, column profiles. For example, row E1 has total 14 and row profile 0.36, 0.50, 0.14.]
The average is the point at which the two points are balanced.
weighted average
[Figure: the weighted average of the row profiles, illustrated with profile values of E1, E2 and E5.]
masses

      C1   C2   C3   Total   Mass
E1     5    7    2      14   .045
E2    18   46   20      84   .269
E3    19   29   39      87   .279
E4    12   40   49     101   .324
E5     3    7   16      26   .083
Readership data (row profiles in parentheses)

Education Group          C1           C2            C3            Total   Mass
E1 Some primary           5 (0.357)     7 (0.500)     2 (0.143)     14    0.045
E2 Primary completed     18 (0.214)    46 (0.548)    20 (0.238)     84    0.269
E3 Some secondary        19 (0.218)    29 (0.333)    39 (0.448)     87    0.279
E4 Secondary completed   12 (0.119)    40 (0.396)    49 (0.485)    101    0.324
E5 Some tertiary          3 (0.115)     7 (0.269)    16 (0.615)     26    0.083
Total                    57 (0.183)   129 (0.413)   126 (0.404)    312

C1: glance
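The profiles and masses in the readership table can be reproduced with a few lines of NumPy. This is a minimal sketch; the variable names (`N`, `row_profiles`, `row_masses`, `avg_profile`) are illustrative and not taken from the slides or from XLSTAT.

```python
import numpy as np

# Readership counts: rows E1..E5 (education), columns C1..C3 (reading class)
N = np.array([
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
], dtype=float)

n = N.sum()                               # grand total (312)
row_totals = N.sum(axis=1)
row_profiles = N / row_totals[:, None]    # each row divided by its own total
row_masses = row_totals / n               # relative size of each row
avg_profile = N.sum(axis=0) / n           # column totals divided by n

print(np.round(row_profiles, 3))
print(np.round(row_masses, 3))
print(np.round(avg_profile, 3))

# The average profile is the mass-weighted average of the row profiles
print(np.allclose(row_masses @ row_profiles, avg_profile))
```

The last line checks the balance-point idea: weighting each row profile by its mass gives back the average profile.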
Calculating chi-square

$$\chi^2 = \cdots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} = 26.0$$

(only the terms of the last row, E5, are written out)

Education Group          C1           C2            C3            Total   Mass
..                                                                  14
..                                                                  84
..                                                                  87
..                                                                 101
E5 Some tertiary
  Observed Frequency      3 (0.115)     7 (0.269)    16 (0.615)     26    0.083
  Expected Frequency      4.76         10.74         10.50
Total                    57 (0.183)   129 (0.413)   126 (0.404)    312

For example, expected frequency of (E5,C1): 0.183 x 26 = 4.76
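As a check on the arithmetic, the expected frequencies and the chi-square statistic can be computed for the whole table. This is a sketch with illustrative names; note that the unrounded expected values for E5 are 4.75 and 10.75, while the slide shows 4.76 and 10.74 because it multiplies the rounded average profile by 26.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3
N = np.array([
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
], dtype=float)

n = N.sum()
# Expected frequency = row total * column total / grand total
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()

print(expected[4])      # expected frequencies for row E5
print(round(chi2, 1))   # chi-square statistic for the whole table
```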
Calculating chi-square

$$\chi^2 = \cdots + 26 \left[ \frac{(3/26 - 4.76/26)^2}{4.76/26} + \frac{(7/26 - 10.74/26)^2}{10.74/26} + \frac{(16/26 - 10.50/26)^2}{10.50/26} \right]$$

$$\chi^2 / 312 = \cdots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$$

(Observed and expected frequencies for row E5, Some tertiary, as in the preceding table.)
Calculating inertia

$$\text{Inertia} = \chi^2 / 312 = \cdots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$$

Here 0.083 is the mass (of row E5), and the term in brackets is the squared distance between the profile of E5 and the average profile: a WEIGHTED EUCLIDEAN distance.

Equivalently, after rescaling each coordinate it is an ordinary Euclidean distance:

$$\text{Inertia} = \cdots + 0.083 \left[ \left( \frac{0.115}{\sqrt{0.183}} - \frac{0.183}{\sqrt{0.183}} \right)^2 + \left( \frac{0.269}{\sqrt{0.413}} - \frac{0.413}{\sqrt{0.413}} \right)^2 + \left( \frac{0.615}{\sqrt{0.404}} - \frac{0.404}{\sqrt{0.404}} \right)^2 \right]$$
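The inertia calculation can be sketched numerically as the mass-weighted sum of squared chi-square distances from each row profile to the average profile. The names below are illustrative assumptions; the sketch also checks that the weighted Euclidean form and the rescaled ordinary Euclidean form agree.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3
N = np.array([
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
], dtype=float)

n = N.sum()
r = N.sum(axis=1) / n                       # row masses
c = N.sum(axis=0) / n                       # average row profile
profiles = N / N.sum(axis=1, keepdims=True)

# Squared chi-square distance: Euclidean distance weighted by 1/c_j
d2_weighted = ((profiles - c) ** 2 / c).sum(axis=1)

# Same distance as an ordinary Euclidean distance on rescaled coordinates
d2_rescaled = ((profiles / np.sqrt(c) - c / np.sqrt(c)) ** 2).sum(axis=1)

inertia = (r * d2_weighted).sum()           # sum of m_i * d_i^2

print(np.allclose(d2_weighted, d2_rescaled))
print(round(inertia, 4))
```

The resulting inertia equals the chi-square statistic divided by the grand total of 312.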
[Figure: profiles and vertices, comparing ordinary (Pythagorean) Euclidean distances with chi-square distances.]
What CA does
centres the row and column profiles with respect to their average
profiles, so that the origin represents the average.
re-defines the dimensions of the space in an ordered way: the first
dimension explains the maximum amount of inertia possible in one
dimension; the second adds the maximum amount to the first (hence the first two
explain the maximum amount in two dimensions), and so on until
all the inertia is explained.
decomposes the total inertia along the principal axes into principal
inertias, usually expressed as % of the total.
so if we want a low-dimensional version, we just take the first
(principal) dimensions.
The row and column solutions are closely related: one can be obtained
from the other, using simple scaling factors along each dimension.
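One common way to carry out these steps is a singular value decomposition of the matrix of standardized residuals. The sketch below assumes that formulation (the slides do not say how XLSTAT computes the solution), and all names are illustrative. Applied to the readership data it gives the principal inertias shown on the maps.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3
N = np.array([
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
], dtype=float)

P = N / N.sum()                 # correspondence matrix
r = P.sum(axis=1)               # row masses
c = P.sum(axis=0)               # column masses

# Centre with respect to the average profiles and standardize
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

K = 2                                           # non-trivial dimensions here
principal_inertias = sv[:K] ** 2
percentages = 100 * principal_inertias / (sv ** 2).sum()

# Principal coordinates of rows and columns
F = (U[:, :K] * sv[:K]) / np.sqrt(r)[:, None]
G = (Vt.T[:, :K] * sv[:K]) / np.sqrt(c)[:, None]

# Row solution from the column solution: row profiles times the
# column coordinates rescaled by 1/sv along each dimension
row_profiles = P / r[:, None]
F_from_G = row_profiles @ (G / sv[:K])

print(np.round(principal_inertias, 4), np.round(percentages, 1))
print(np.allclose(F, F_from_G))
```

The final check illustrates the link between the row and column problems: the row coordinates are recovered from the column coordinates with one scaling factor per dimension.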
[Figures: CA maps of the readership data, showing row points E1 to E5 and column points C1 to C3. Principal inertias: 0.07037 (84.5%) on the first axis and 0.01289 (15.5%) on the second. In the last map the points carry their category labels: primary incomplete, primary complete, secondary incomplete, secondary complete, some tertiary; glance, fairly thorough, very thorough.]
(McFie et al.)
Our company wishes to identify the perceptions of itself and its nine major
competitors.
Data are gathered from representatives from 18 companies that represent
their potential client base: each has to say which companies they
associate with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors
and the attributes, and where our company is situated in the overall
scheme.
Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A              3        16        14         13        14          18         6         18
B              1        15         6          8        10          13        14          9
C             13        11         4         13        11           4        10          2
D              9        11         4          9        11           9        11          3
E              6        14        15         17        14          16         8         15
F              3        16        14         15        12          14         7         16
G             18        12        13         16        13           5         4          7
H              2        14         7          6        10           4        14          8
I             10        14        13         12        14          16         4          8
ours           4        15        15         16        14           7         6         15
Reduction of dimensionality
data centred (means)
points weighted (row masses): in case of frequency data, points are weighted by
their row masses, that is the relative frequencies of
each row (i.e. proportional to sample sizes, n)
metric weighted (column weights):
e.g. $w_j = 1/\sigma_j^2$, the inverse of the variance, in PCA
$w_j = 1/c_j$, the inverse of the expected value, in CA
(McFie et al.)
Our company wishes to identify the perceptions of its products and its 9
major competitors (A, B, ..., I).
Data are gathered from representatives from 18 companies that represent
their potential client base: each has to say which products they associate
with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors
and the attributes, and where our company is situated in the overall
scheme.
(McFie et al.)
First note that this is NOT a contingency table, so the chi-square test is not
applicable (a permutation test could test for significance, but then we need
to have original respondent-level data).
This is an interesting example because it can be analyzed as is or it can
be recoded to bring out certain features.
Analyzing it with no recoding means that the size effect (sometimes
called the halo effect) is removed, since we analyze profiles, i.e., the
counts relative to their total counts. In other words, if a product gets
relatively few associations, then it is the highest of these (lower)
associations that are determinant. Hence, in the following extreme case,
a pattern of [ 18 18 18 ] is identical to a pattern of [ 1 1 1 ] !
The masses assigned to the products will be proportional to the number of
associations they get.
If the size effect also needs to be visualized, the data table should
be doubled.
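The extreme case mentioned above is easy to verify: two rows with very different totals but the same shape have identical profiles, and only their masses differ. A small sketch follows; the helper `profile` is a hypothetical name introduced for illustration.

```python
import numpy as np

def profile(counts):
    """Return the counts expressed relative to their total."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

high = [18, 18, 18]
low = [1, 1, 1]

# Identical profiles: the size effect has been removed
print(np.allclose(profile(high), profile(low)))

# The size survives only in the masses, proportional to the row totals
totals = np.array([sum(high), sum(low)], dtype=float)
masses = totals / totals.sum()
print(masses)
```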
Counts (McFie et al.)

Products   PQ   In   PR   En   PL   MI   PS   GP   Total
A           3   16   14   13   14   18    6   18    102
B           1   15    6    8   10   13   14    9     76
C          13   11    4   13   11    4   10    2     68
D           9   11    4    9   11    9   11    3     67
E           6   14   15   17   14   16    8   15    105
F           3   16   14   15   12   14    7   16     97
G          18   12   13   16   13    5    4    7     88
H           2   14    7    6   10    4   14    8     65
I          10   14   13   12   14   16    4    8     91
ours        4   15   15   16   14    7    6   15     92

Row profiles (% of row total)

Products   PQ     In     PR     En     PL     MI     PS     GP     Total
A           2.9   15.7   13.7   12.7   13.7   17.6    5.9   17.6    102
B           1.3   19.7    7.9   10.5   13.2   17.1   18.4   11.8     76
C          19.1   16.2    5.9   19.1   16.2    5.9   14.7    2.9     68
D          13.4   16.4    6.0   13.4   16.4   13.4   16.4    4.5     67
E           5.7   13.3   14.3   16.2   13.3   15.2    7.6   14.3    105
F           3.1   16.5   14.4   15.5   12.4   14.4    7.2   16.5     97
G          20.5   13.6   14.8   18.2   14.8    5.7    4.5    8.0     88
H           3.1   21.5   10.8    9.2   15.4    6.2   21.5   12.3     65
I          11.0   15.4   14.3   13.2   15.4   17.6    4.4    8.8     91
ours        4.3   16.3   16.3   17.4   15.2    7.6    6.5   16.3     92
(McFie et al.)
Doubling involves coding the counts of the numbers (out of 18) that
DON'T associate the product with the attribute in each case.
There are now two columns per attribute: each attribute is represented by
the positive and negative ends of its 0-to-18 scale of counts.
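Doubling as described here can be sketched in a few lines: the negative column of each attribute is 18 minus the positive count, so every row of the doubled table sums to 8 x 18 = 144. The array layout and names below are illustrative assumptions.

```python
import numpy as np

products = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "ours"]
attributes = ["PQ", "In", "PR", "En", "PL", "MI", "PS", "GP"]

# Counts of respondents (out of 18) associating each product with each attribute
counts = np.array([
    [3, 16, 14, 13, 14, 18, 6, 18],
    [1, 15, 6, 8, 10, 13, 14, 9],
    [13, 11, 4, 13, 11, 4, 10, 2],
    [9, 11, 4, 9, 11, 9, 11, 3],
    [6, 14, 15, 17, 14, 16, 8, 15],
    [3, 16, 14, 15, 12, 14, 7, 16],
    [18, 12, 13, 16, 13, 5, 4, 7],
    [2, 14, 7, 6, 10, 4, 14, 8],
    [10, 14, 13, 12, 14, 16, 4, 8],
    [4, 15, 15, 16, 14, 7, 6, 15],
])

n_respondents = 18
doubled = np.empty((counts.shape[0], 2 * counts.shape[1]), dtype=int)
doubled[:, 0::2] = counts                    # positive pole: associations
doubled[:, 1::2] = n_respondents - counts    # negative pole: non-associations

doubled_labels = [name for a in attributes for name in (a, a + "-")]

print(doubled_labels)
print(doubled[0])            # row for product A
print(doubled.sum(axis=1))   # every row totals 144
```

Because all rows now have the same total, every product receives the same mass in the analysis of the doubled table.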
Doubled table:

Prod.  PQ  PQ-  In  In-  PR  PR-  En  En-  PL  PL-  MI  MI-  PS  PS-  GP  GP-  Total
A       3   15  16    2  14    4  13    5  14    4  18    0   6   12  18    0    144
B       1   17  15    3   6   12   8   10  10    8  13    5  14    4   9    9    144
C      13    5  11    7   4   14  13    5  11    7   4   14  10    8   2   16    144
D       9    9  11    7   4   14   9    9  11    7   9    9  11    7   3   15    144
E       6   12  14    4  15    3  17    1  14    4  16    2   8   10  15    3    144
F       3   15  16    2  14    4  15    3  12    6  14    4   7   11  16    2    144
G      18    0  12    6  13    5  16    2  13    5   5   13   4   14   7   11    144
H       2   16  14    4   7   11   6   12  10    8   4   14  14    4   8   10    144
I      10    8  14    4  13    5  12    6  14    4  16    2   4   14   8   10    144
ours    4   14  15    3  15    3  16    2  14    4   7   11   6   12  15    3    144
[Figure: CA map of the products data with the column (attribute) points in standard coordinates. First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%).]
Column points are projections of extreme corner profiles, or vertices (cf. triangle), and have inertia along axes equal to 1 (hence standard coordinates).
Profile points are generally close to the average.

Symmetric map
[Figure: symmetric CA map of the products data. First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%).]
Row points and column points are both displayed in principal coordinates; both have inertias along axes equal to the principal inertias.
Both sets of points occupy similar regions of the map: aesthetically a better graphic.

[Figure: CA map of the doubled products data. First axis: 0.1173 (54.5%); second axis: 0.0682 (31.7%). Annotated regions: "High product quality"; "High product range, modern image, global products".]
Attributes have a positive and a negative pole; the average association is at the origin of the map, e.g., In(novation) has a high average, P(roduct)Q(uality) has a low average.
Fairly similar configuration to the undoubled analysis: there is no strong halo effect.
Inertia contributions in CA
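Correspondence analysis (CA) is a method of data visualization which
represents the true positions of profile points in a map which comes
closest to all the points, closest in the sense of weighted least squares.
[Figure: profile points projected onto a two-dimensional map; 71% of the inertia lies on the first dimension and 12% on the second.]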
The inertia explained in the map applies to all the points: if we say
83% of the inertia is explained in the map, 71% on the first
dimension and 12% on the second, this is a figure calculated for all
row (or column) points together.
Inertia contributions in CA
This type of inertia-explained-by-axes calculation can be made for
individual points.
These more detailed results are aids to interpretation in the form of
numerical diagnostics, called contributions.
Especially when there is not a high percentage of inertia explained by the
map, these contributions will help us to identify points which are
represented inaccurately.
The inertias and their percentages tell us how much of the variance in
the table is explained by the principal axes. The contributions do the
same, but for each point individually, and help us to see:
(a) which points are being explained better than others;
(b) which points are contributing to the solution more than others.
[Diagram: the i-th point $a_i$, with mass $m_i$, lies at distance $d_i$ from the centroid c; $f_{ik}$ is its projection on the k-th principal axis.]

Total inertia: $\sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$
Inertia of the i-th point: $m_i d_i^2 = m_i \sum_k f_{ik}^2$
Contribution of the i-th point to the k-th axis: $m_i f_{ik}^2$

Inertia contributions

            Axis 1           Axis 2           ...   Axis p           Total
Point 1     m_1 f_{11}^2     m_1 f_{12}^2     ...   m_1 f_{1p}^2     m_1 d_1^2
Point 2     :                :                      :                m_2 d_2^2
Point 3     :                :                      :                m_3 d_3^2
:           :                :                      :                :
Point n     :                :                      :                m_n d_n^2
Total       \lambda_1        \lambda_2        ...   \lambda_p
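The decomposition above can be checked numerically on the small readership table. The sketch below assumes the SVD formulation of CA and uses illustrative names (`table`, `ctr`, `cos2`); it builds the table of $m_i f_{ik}^2$ values and verifies that its column sums are the principal inertias and its row sums are the point inertias $m_i d_i^2$.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3
N = np.array([
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
], dtype=float)

P = N / N.sum()
r = P.sum(axis=1)                      # row masses m_i
c = P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

K = 2                                  # all non-trivial dimensions of this table
lam = sv[:K] ** 2                      # principal inertias
F = (U[:, :K] * sv[:K]) / np.sqrt(r)[:, None]   # row principal coordinates f_ik

table = r[:, None] * F ** 2            # m_i * f_ik^2
point_inertia = r * ((P / r[:, None] - c) ** 2 / c).sum(axis=1)   # m_i * d_i^2

ctr = table / lam                      # share of each axis due to each point
cos2 = table / point_inertia[:, None]  # share of each point explained by each axis

print(np.allclose(table.sum(axis=0), lam))
print(np.allclose(table.sum(axis=1), point_inertia))
print(np.round(ctr, 3))
print(np.round(cos2, 3))
```

Dividing the table by its column sums shows which points contribute most to each axis, while dividing by its row sums shows how well each point is explained by each axis.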
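[Table: contributions of each product to the axes]

        Weight (relative)   F1      F2
A       0.100               0.200   0.010
B       0.100               0.006   0.266
C       0.100               0.249   0.031
D       0.100               0.153   0.011
E       0.100               0.113   0.010
F       0.100               0.113   0.004
G       0.100               0.037   0.414
H       0.100               0.074   0.202
I       0.100               0.009   0.044
ours    0.100               0.048   0.010

[Table: proportion of each product's inertia explained by each axis]

        F1      F2
A       0.922   0.027
B       0.033   0.914
C       0.901   0.065
D       0.856   0.035
E       0.827   0.045
F       0.929   0.017
G       0.129   0.839
H       0.320   0.510
I       0.087   0.259
ours    0.389   0.046

Not so well-represented: points with low values on both axes (e.g. I, with 0.087 + 0.259 = 0.346 in two dimensions).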
Eigenvalues and percentages of inertia:

                              F1       F2
Eigenvalue                    0.117    0.068
Rows depend on columns (%)    54.482   31.656
Cumulative %                  54.482   86.139
After:
CARME 2007
Correspondence Analysis &
Related Methods
Erasmus University
Rotterdam
25-27 June 2007
http://www.carme-n.org
Just published by
Chapman & Hall /
CRC Press