Green Acre 2007 Caco L

First XLSTAT Users Conference
Paris, 2007
7-8 June 2007
Correspondence Analysis
and Related Methods
Michael Greenacre
Universitat Pompeu Fabra
Barcelona
michael@upf.es
www.globalsong.net
www.econ.upf.es/~michael
Correspondence analysis
and Related Methods Part 1
1. What is correspondence analysis (CA)?
2. Why is CA so useful as a method of visualizing
tabular data?
3. How is CA implemented in XLSTAT?
(by CA I mean simple CA, as opposed to
multiple CA, which is discussed in the next
talk)
Jean-Paul Benzécri...
creator of Correspondence Analysis
Correspondence analysis:
in which areas of research is it useful?
CA visualizes complex data, primarily data on categorical
measurement scales, facilitating understanding and
interpretation, a neglected aspect of statistical
enquiry (cf. usual modelling approach)
linguistics, textual analysis: word frequencies
sociology: cross-tabulations and large sets of
categorical data from questionnaires; useful for
qualitative research, visualization of case study data
ecology: species abundance data at several
locations, often with explanatory variables
market research: perceptual mapping of
brands/products, ...
archeology: large sparse data matrices
biology, geology, chemistry, psychology...
Correspondence Analysis (CA)

CA is a method of data visualization

It applies in the first instance to a cross-tabulation (contingency table)
but can be applied to many other data types after suitable recoding
The results of CA are in the form of a map of points
The points represent the rows and columns of the table; it is not the
absolute values which are represented (as in principal component
analysis, for example) but their relative values.
The positions of the points in the map tell you something about
similarities between the rows, similarities between the columns and the
association between rows and columns
A simple example
312 respondents, all readers of a certain newspaper, cross-tabulated
according to their education group and level of reading of the
newspaper
      C1   C2   C3
E1     5    7    2
E2    18   46   20
E3    19   29   39
E4    12   40   49
E5     3    7   16
[Figure: CA map of the education groups E1 to E5 and the reading categories C1 to C3. First axis: 0.0704 (84.52 %); second axis: 0.0129 (15.48 %).]
E1: some primary   E2: primary completed   E3: some secondary
E4: secondary completed   E5: some tertiary
C1: glance   C2: fairly thorough   C3: very thorough
Three basic geometric concepts

profile: the coordinates (position) of the point
mass: the weight given to the point
distance: the measure of proximity between points
Four derived geometrical concepts

centroid: the weighted average position
subspace: a space of reduced dimensionality within the space
projection: the closest point in the subspace
inertia: the weighted sum of squared distances to the centroid, inertia = $\sum_i m_i d_i^2$
[Figure: a point with mass $m_i$ at distance $d_i$ from the centroid, and its projection onto a subspace.]
Profile
A profile is a set of relative frequencies, that is a set of frequencies
expressed relative to their total (often in percentage form).
Each row or each column of a table of frequencies defines a different
profile.
It is these profiles which CA visualises as points in a map.
original data
        C1   C2   C3   Total
E1       5    7    2     14
E2      18   46   20     84
E3      19   29   39     87
E4      12   40   49    101
E5       3    7   16     26
Total   57  129  126    312

row profiles
        C1   C2   C3   Total
E1     .36  .50  .14     1
E2     .21  .55  .24     1
E3     .22  .33  .45     1
E4     .12  .40  .49     1
E5     .12  .27  .62     1

column profiles
        C1   C2   C3
E1     .09  .05  .02
E2     .32  .37  .16
E3     .33  .22  .31
E4     .21  .31  .39
E5     .05  .05  .13
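As a minimal sketch (NumPy is an assumption; the slides themselves use XLSTAT), the row and column profiles of the readership table can be computed by dividing each row, or each column, by its total. The variable names are illustrative only.

```python
import numpy as np

# Readership counts from the slides: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

# Row profiles: each row divided by its row total (rows sum to 1).
row_profiles = N / N.sum(axis=1, keepdims=True)

# Column profiles: each column divided by its column total (columns sum to 1).
col_profiles = N / N.sum(axis=0, keepdims=True)

print(np.round(row_profiles, 3))
print(np.round(col_profiles, 3))
```

Rounded to two decimals, the row profiles should agree with the table on the slide; the slide appears to adjust some rounded column-profile values so that each column sums to 1.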
Row profiles viewed in 3-d
Plotting profiles in profile space (triangular coordinates)
[Figure: the row profile E1 = (0.36, 0.50, 0.14) plotted in triangular coordinates.]
Weighted average (centroid)

average
The average is the point at which the two points are balanced.
weighted average
The situation is identical for multidimensional points...

(barycentric or weighted average principle)
[Figure: the row profiles E1 = (0.36, 0.50, 0.14), E2 = (0.21, 0.55, 0.24) and E5 = (0.12, 0.27, 0.62) plotted in triangular coordinates.]
Masses of the profiles

original data and masses
        C1   C2   C3   Total   mass
E1       5    7    2     14    .045
E2      18   46   20     84    .269
E3      19   29   39     87    .279
E4      12   40   49    101    .324
E5       3    7   16     26    .083
Total   57  129  126    312

average row profile: .183 .413 .404
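The masses and the average row profile can be checked with a short sketch (NumPy assumed, names illustrative). It also illustrates the barycentric principle: the average profile is the weighted average (centroid) of the row profiles, with the masses as weights.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
n = N.sum()                                   # grand total, 312

# Row masses: row totals relative to the grand total.
masses = N.sum(axis=1) / n

# Row profiles and their weighted average (the centroid).
row_profiles = N / N.sum(axis=1, keepdims=True)
centroid = masses @ row_profiles

# The centroid coincides with the profile of the column totals.
average_profile = N.sum(axis=0) / n

print(np.round(masses, 3))
print(np.round(centroid, 3))
```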
Readership data
Education Group           C1           C2           C3           Total   Mass
E1 Some primary            5 (0.357)     7 (0.500)     2 (0.143)    14   0.045
E2 Primary completed      18 (0.214)    46 (0.548)    20 (0.238)    84   0.269
E3 Some secondary         19 (0.218)    29 (0.333)    39 (0.448)    87   0.279
E4 Secondary completed    12 (0.119)    40 (0.396)    49 (0.485)   101   0.324
E5 Some tertiary           3 (0.115)     7 (0.269)    16 (0.615)    26   0.083
Total                     57 (0.183)   129 (0.413)   126 (0.404)   312
C1: glance
C2: fairly thorough
C3: very thorough
Calculating chi-square
$\chi^2 = \text{12 similar terms} \ldots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} = 26.0$

Education Group                         C1           C2           C3           Total   Mass
E1 to E4                                ..           ..           ..           14, 84, 87, 101
E5 Some tertiary   Observed Frequency    3 (0.115)     7 (0.269)    16 (0.615)    26   0.083
                   Expected Frequency    4.76         10.74         10.50
Total                                   57 (0.183)   129 (0.413)   126 (0.404)   312

For example, expected frequency of (E5,C1): 0.183 x 26 = 4.76
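The chi-square calculation above can be reproduced with a minimal sketch (NumPy assumed; names are illustrative). Expected frequencies are row total times column total divided by the grand total, which is the same as average profile element times row total.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
n = N.sum()

# Expected frequencies under independence of rows and columns.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n

# Chi-square statistic: sum over all 15 cells of (observed - expected)^2 / expected.
chi2 = ((N - expected) ** 2 / expected).sum()

print(np.round(expected[4], 2))   # expected frequencies for row E5
print(round(chi2, 1))
```

The slide's value 4.76 for (E5,C1) comes from the rounded average 0.183; with unrounded totals the expected frequency is 4.75.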
Calculating chi-square
$\chi^2 = \text{12 similar terms} \ldots + 26 \left[ \frac{(3/26 - 4.76/26)^2}{4.76/26} + \frac{(7/26 - 10.74/26)^2}{10.74/26} + \frac{(16/26 - 10.50/26)^2}{10.50/26} \right]$

$\chi^2 / 312 = \text{12 similar terms} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
Calculating inertia
Inertia = $\chi^2 / 312$ = similar terms for first four rows ... $+ \; 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
0.083 is the mass (of row E5).
The term in brackets is the squared chi-square distance (between the profile of E5 and the average profile).
Inertia = $\sum$ mass $\times$ (chi-square distance)$^2$
The chi-square distance is a WEIGHTED EUCLIDEAN distance.
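A small sketch (NumPy assumed, names illustrative) of the identity stated above: the total inertia, computed as the mass-weighted sum of squared chi-square distances to the average profile, equals the chi-square statistic divided by the grand total.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
n = N.sum()

masses = N.sum(axis=1) / n
row_profiles = N / N.sum(axis=1, keepdims=True)
centroid = N.sum(axis=0) / n                  # average row profile

# Squared chi-square distance of each row profile to the average profile:
# squared differences weighted by the inverse of the average profile.
chi_dist_sq = ((row_profiles - centroid) ** 2 / centroid).sum(axis=1)

# Inertia as the sum of mass x squared chi-square distance.
inertia = (masses * chi_dist_sq).sum()

# The same quantity obtained from the chi-square statistic.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = ((N - expected) ** 2 / expected).sum()

print(round(inertia, 4), round(chi2 / n, 4))
```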
How can we see chi-square distances?

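Inertia = similar terms for first four rows ... $+ \; 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$
0.083 is the mass (of row E5); the term in brackets is the squared chi-square distance (between the profile of E5 and the average profile), a WEIGHTED EUCLIDEAN distance.
Each weighted term can be rewritten as an ordinary squared difference:
$\frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} = \left( \frac{0.115}{\sqrt{0.183}} - \frac{0.183}{\sqrt{0.183}} \right)^2 + \left( \frac{0.269}{\sqrt{0.413}} - \frac{0.413}{\sqrt{0.413}} \right)^2 + \left( \frac{0.615}{\sqrt{0.404}} - \frac{0.404}{\sqrt{0.404}} \right)^2$
So the answer is to divide all profile elements by the square roots of their averages.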
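The stretching argument can be checked numerically. The sketch below (NumPy assumed, names illustrative) divides every profile element by the square root of the corresponding average and confirms that ordinary Euclidean distances between the stretched profiles equal chi-square distances between the original ones.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

row_profiles = N / N.sum(axis=1, keepdims=True)
centroid = N.sum(axis=0) / N.sum()            # average row profile

# Stretch each coordinate axis by 1 / sqrt(average profile element).
stretched = row_profiles / np.sqrt(centroid)
stretched_centroid = centroid / np.sqrt(centroid)

# Ordinary squared Euclidean distances in the stretched space ...
euclid_sq = ((stretched - stretched_centroid) ** 2).sum(axis=1)

# ... versus squared chi-square distances in the original profile space.
chi_dist_sq = ((row_profiles - centroid) ** 2 / centroid).sum(axis=1)

print(np.round(euclid_sq, 4))
print(np.round(chi_dist_sq, 4))
```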
Stretched row profiles viewed in 3-d chi-square space
[Figure: the stretched profile space, showing the vertices and the profiles. Pythagorean (ordinary Euclidean) distances in this space are chi-square distances.]
What CA does
centres the row and column profiles with respect to their average
profiles, so that the origin represents the average.
re-defines the dimensions of the space in an ordered way: first
dimension explains the maximum amount of inertia possible in one
dimension; second adds the maximum amount to first (hence first two
explain the maximum amount in two dimensions), and so on until
all dimensions are explained.
decomposes the total inertia along the principal axes into principal
inertias, usually expressed as % of the total.
so if we want a low-dimensional version, we just take the first
(principal) dimensions
The row and column problem solutions are closely related:
one can be obtained from the other, and there are simple scaling
factors along each dimension relating the two problems.
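The steps above can be sketched with a singular value decomposition. This is one common way to compute simple CA, written here with NumPy as an assumption; it is not claimed to be how XLSTAT implements it. The squared singular values of the centred and chi-square-standardized table are the principal inertias.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)

P = N / N.sum()            # correspondence matrix (relative frequencies)
r = P.sum(axis=1)          # row masses
c = P.sum(axis=0)          # column masses (average row profile)

# Centre with respect to the averages and standardize by sqrt(r_i * c_j).
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD: singular values come out in decreasing order, so the dimensions
# are ordered by the amount of inertia they explain.
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

principal_inertias = sv ** 2
total_inertia = principal_inertias.sum()
percent = 100 * principal_inertias / total_inertia

print(np.round(principal_inertias[:2], 4))
print(np.round(percent[:2], 1))
```

For this 5 x 3 table there are only two non-trivial dimensions, so the first two principal inertias account for all of the inertia and should match the values reported on the maps (about 0.0704 and 0.0129).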
Asymmetric Maps using XLSTAT

[Figure: two asymmetric maps of the readership data (education groups E1 to E5, reading categories C1 to C3). First axis: 0.07037 (84.5%); second axis: 0.01289 (15.5%).]
Symmetric Map using XLSTAT
[Figure: symmetric map of the readership data. Row points: primary incomplete, primary complete, secondary incomplete, secondary complete, some tertiary. Column points: glance, fairly thorough, very thorough. First axis: 0.07037 (84.5%); second axis: 0.01289 (15.5%).]
Asymmetric and symmetric maps
Asymmetric maps represent the rows and columns jointly in
principal and standard coordinates; asymmetric maps are also
biplots.
Because the principal coordinates can be much smaller than
the standard coordinates, especially when $\lambda_k$ is small, the
generally accepted way for the joint map is the symmetric map,
where both rows and columns are in principal coordinates.
Symmetric maps are strictly speaking not biplots, but they
are almost so (see Gabriel, Biometrika, 2002).
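To make the two coordinate systems concrete, here is a hedged sketch (NumPy assumed, names illustrative) that derives standard and principal coordinates from the SVD of the standardized table. Standard coordinates have weighted sum of squares 1 on each axis; principal coordinates have weighted sum of squares equal to the principal inertia. In the row asymmetric map, each row point is the weighted average of the column vertices, with the row profile as weights.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

K = 2                                          # non-trivial dimensions here
row_std = U[:, :K] / np.sqrt(r)[:, None]       # standard coordinates
col_std = Vt.T[:, :K] / np.sqrt(c)[:, None]
row_pri = row_std * sv[:K]                     # principal coordinates
col_pri = col_std * sv[:K]

# Row asymmetric map: rows in principal, columns in standard coordinates.
# Symmetric map: rows and columns both in principal coordinates.
row_profiles = P / r[:, None]
print(np.round(row_pri, 3))
print(np.round(col_std, 3))
```

The signs of the axes returned by an SVD are arbitrary, so a map produced this way may be reflected relative to the XLSTAT output.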
Data set product
(McFie et al.)
Our company wishes to identify the perceptions of itself and its nine major
competitors.
Data are gathered from representatives from 18 companies that represent
their potential client base: each has to say which companies they
associate with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors
and the attributes, and where our company is situated in the overall
scheme.
Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15
Reduction of dimensionality

data centred (with respect to the means)
points weighted (row masses): in case of frequency data, points are weighted by
their row masses, that is the relative frequencies of
each row (i.e. proportional to sample sizes, n)
metric weighted (column weights): $d_{ii'}^2 = \sum_j w_j (y_{ij} - y_{i'j})^2$
e.g. $w_j = 1/s_j^2$, the inverse of the variance, in PCA
$w_j = 1/c_j$, the inverse of the expected value, in CA
Fat Freddy's Cat Dimensional Transmogrifier
with thanks to Jörg Blasius
Data set product
(McFie et al.)
Our company wishes to identify the perceptions of its products and its 9
major competitors (A, B, ..., I).
Data are gathered from representatives from 18 companies that represent
their potential client base: each has to say which products they associate
with which of 8 attributes.
The aim is to gain an idea about the relationships between the competitors
and the attributes, and where our company is situated in the overall
scheme.
Data set product
(McFie et al.)
First note that this is NOT a contingency table, so the chi-square test is not
applicable (a permutation test could test for significance, but then we need
to have original respondent-level data).
This is an interesting example because it can be analyzed as is or it can
be recoded to bring out certain features.
Analyzing it with no recoding means that the size effect (sometimes
called the halo effect) is removed, since we analyze profiles, i.e., the
counts relative to their total counts. In other words, if a product gets
relatively few associations, then it is the highest of these (lower)
associations that are determinant. Hence, in the following extreme case,
a pattern of [ 18 18 18 ] is identical to a pattern of [ 1 1 1 ] !
The masses assigned to the products will be proportional to the number of
associations they get.
If the size effect also needs to be visualized, the data table should
be doubled.
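The remark about the size effect can be illustrated with a tiny sketch (NumPy assumed; the two count patterns are the extreme case quoted above). Because CA analyzes profiles, two rows with proportional counts have identical profiles and differ only in mass.

```python
import numpy as np

# Two hypothetical products with very different numbers of associations.
many = np.array([18.0, 18.0, 18.0])
few = np.array([1.0, 1.0, 1.0])

# Profiles: counts relative to their own total.
profile_many = many / many.sum()
profile_few = few / few.sum()

# Same position in the map, different masses (proportional to the totals).
mass_ratio = many.sum() / few.sum()

print(profile_many, profile_few, mass_ratio)
```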
Data set product
(McFie et al.)

Counts
Products   PQ   In   PR   En   PL   MI   PS   GP   Total
A           3   16   14   13   14   18    6   18     102
B           1   15    6    8   10   13   14    9      76
C          13   11    4   13   11    4   10    2      68
D           9   11    4    9   11    9   11    3      67
E           6   14   15   17   14   16    8   15     105
F           3   16   14   15   12   14    7   16      97
G          18   12   13   16   13    5    4    7      88
H           2   14    7    6   10    4   14    8      65
I          10   14   13   12   14   16    4    8      91
ours        4   15   15   16   14    7    6   15      92

Row profiles (%)
Products    PQ     In     PR     En     PL     MI     PS     GP   Total
A          2.9   15.7   13.7   12.7   13.7   17.6    5.9   17.6     102
B          1.3   19.7    7.9   10.5   13.2   17.1   18.4   11.8      76
C         19.1   16.2    5.9   19.1   16.2    5.9   14.7    2.9      68
D         13.4   16.4    6.0   13.4   16.4   13.4   16.4    4.5      67
E          5.7   13.3   14.3   16.2   13.3   15.2    7.6   14.3     105
F          3.1   16.5   14.4   15.5   12.4   14.4    7.2   16.5      97
G         20.5   13.6   14.8   18.2   14.8    5.7    4.5    8.0      88
H          3.1   21.5   10.8    9.2   15.4    6.2   21.5   12.3      65
I         11.0   15.4   14.3   13.2   15.4   17.6    4.4    8.8      91
ours       4.3   16.3   16.3   17.4   15.2    7.6    6.5   16.3      92
Data set product
(McFie et al.)
Doubling involves coding the counts of the numbers (out of 18) that
DON'T associate the product with the attribute in each case.
There are now two columns per attribute: each attribute is represented by
its positive and negative end of the 0-to-18 scale of counts.
Doubled table:
Prod.  PQ PQ-  In In-  PR PR-  En En-  PL PL-  MI MI-  PS PS-  GP GP-  Total
A       3  15  16   2  14   4  13   5  14   4  18   0   6  12  18   0    144
B       1  17  15   3   6  12   8  10  10   8  13   5  14   4   9   9    144
C      13   5  11   7   4  14  13   5  11   7   4  14  10   8   2  16    144
D       9   9  11   7   4  14   9   9  11   7   9   9  11   7   3  15    144
E       6  12  14   4  15   3  17   1  14   4  16   2   8  10  15   3    144
F       3  15  16   2  14   4  15   3  12   6  14   4   7  11  16   2    144
G      18   0  12   6  13   5  16   2  13   5   5  13   4  14   7  11    144
H       2  16  14   4   7  11   6  12  10   8   4  14  14   4   8  10    144
I      10   8  14   4  13   5  12   6  14   4  16   2   4  14   8  10    144
ours    4  14  15   3  15   3  16   2  14   4   7  11   6  12  15   3    144
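The doubling described above can be sketched in a few lines (NumPy assumed; variable names are illustrative). Each attribute column is paired with its complement, 18 minus the count, so that every row sums to 8 x 18 = 144 and all products receive equal mass.

```python
import numpy as np

# Product data: rows A..I and "ours"; columns PQ, In, PR, En, PL, MI, PS, GP.
counts = np.array([[3, 16, 14, 13, 14, 18, 6, 18],
                   [1, 15, 6, 8, 10, 13, 14, 9],
                   [13, 11, 4, 13, 11, 4, 10, 2],
                   [9, 11, 4, 9, 11, 9, 11, 3],
                   [6, 14, 15, 17, 14, 16, 8, 15],
                   [3, 16, 14, 15, 12, 14, 7, 16],
                   [18, 12, 13, 16, 13, 5, 4, 7],
                   [2, 14, 7, 6, 10, 4, 14, 8],
                   [10, 14, 13, 12, 14, 16, 4, 8],
                   [4, 15, 15, 16, 14, 7, 6, 15]])
n_respondents = 18

# Interleave positive poles (the counts) and negative poles (18 - counts):
# PQ, PQ-, In, In-, ..., GP, GP-.
doubled = np.empty((counts.shape[0], 2 * counts.shape[1]), dtype=int)
doubled[:, 0::2] = counts
doubled[:, 1::2] = n_respondents - counts

row_totals = doubled.sum(axis=1)
masses = row_totals / row_totals.sum()

print(doubled[0])      # row A of the doubled table
print(row_totals)
```

Equal row totals are what bring the size effect back into the analysis: a product with few associations now has large counts at the negative poles instead of simply a smaller mass.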
Row asymmetric map
Row points are projections of row profiles; they have inertias along axes equal to principal inertias (hence principal coordinates).
Column points are projections of extreme corner profiles, or vertices (cf. triangle); they have inertia along axes equal to 1 (hence standard coordinates).
Profile points are generally close to the average.
[Figure: row asymmetric map of the product data (products A to I and ours; attributes ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd). First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%).]
Symmetric map
Row points and column points are both displayed in principal coordinates; both have inertias along axes equal to principal inertias.
Both sets of points occupy similar regions of the map: aesthetically a better graphic.
[Figure: symmetric map of the product data (products A to I and ours; attributes ProdQual, Innovatn, ProdRange, Environm, PriceLevel, ModImage, PriceSens, GlobProd). First axis: 0.0765 (53.1%); second axis: 0.0478 (33.2%).]
Doubled data: symmetric map
Attributes have positive and negative poles; the average association is at the origin of the map, e.g., In(novation) has high average, P(roduct)Q(uality) has low average.
Fairly similar configuration to undoubled analysis: there is no strong halo effect.
[Figure: symmetric map of the doubled product data, showing products A to I and ours together with both poles of each attribute (PQ/PQ-, In/In-, PR/PR-, En/En-, PL/PL-, MI/MI-, PS/PS-, GP/GP-). First axis: 0.1173 (54.5%); second axis: 0.0682 (31.7%). Region labels on the map: high product quality; high product range, modern image, global products; high price sensitive, low environment, product range and price level.]
Inertia contributions in CA
Correspondence analysis (CA) is a method of data visualization which
represents the true positions of profile points in a map which comes
closest to all the points, closest in the sense of weighted least-squares.

The inertia explained in the map applies to all the points: if we say
83% of the inertia is explained in the map, 71% on the first
dimension and 12% on the second, this is a figure calculated for all
row (or column) points together.
Inertia contributions in CA
This type of inertia-explained-by-axes calculation can be made for
individual points.
These more detailed results are aids to interpretation in the form of
numerical diagnostics, called contributions.
Especially when there is not a high percentage of inertia explained by the
map, these contributions will help us to identify points which are
represented inaccurately.
The inertias and their percentages tell us how much of the variance in
the table is explained by the principal axes. The contributions do the
same, but for each point individually, and help us to see:
(a) which points are being explained better than others;
(b) which points are contributing to the solution more than others.
Geometry of inertia contributions
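[Figure: the i-th point $a_i$, with mass $m_i$, lies at distance $d_i$ from the centroid; its projection on the k-th principal axis has coordinate $f_{ik}$.]
Total inertia of the cloud of points = $\sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$
Inertia of i-th point = $m_i d_i^2 = m_i \sum_k f_{ik}^2$
Inertia contribution of i-th point to k-th axis = $m_i f_{ik}^2$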
Inertia contributions
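Point   Axis 1            Axis 2            ...   Axis p            Total
1       $m_1 f_{11}^2$    $m_1 f_{12}^2$    ...   $m_1 f_{1p}^2$    $m_1 d_1^2$
2       $m_2 f_{21}^2$    $m_2 f_{22}^2$    ...   $m_2 f_{2p}^2$    $m_2 d_2^2$
3       $m_3 f_{31}^2$    $m_3 f_{32}^2$    ...   $m_3 f_{3p}^2$    $m_3 d_3^2$
:       :                 :                       :                 :
n       $m_n f_{n1}^2$    $m_n f_{n2}^2$    ...   $m_n f_{np}^2$    $m_n d_n^2$
[Figure: the i-th point $a_i$, with mass $m_i$, at distance $d_i$ from the centroid c; $f_{ik}$ is its projection on the k-th principal axis and $\theta_{ik}$ the angle between the point and the axis.]
$m_i f_{ik}^2 / \lambda_k$ : amount of inertia of axis k explained by point i (absolute contribution, CTR)
$m_i f_{ik}^2 / (m_i d_i^2)$ : amount of inertia of point i explained by axis k (relative contribution, COR)
$m_i f_{ik}^2 / (m_i d_i^2) = f_{ik}^2 / d_i^2$, i.e. the square of $f_{ik} / d_i = \cos(\theta_{ik})$, where $\theta_{ik}$ is the angle between point i and axis k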
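The two kinds of contribution can be computed from the principal coordinates. The sketch below is an illustration on the small readership table (NumPy assumed, names illustrative), not a reproduction of the XLSTAT output for the product data. Each CTR column sums to 1 over the points, and each COR row sums to 1 over all the axes.

```python
import numpy as np

# Readership counts: rows E1..E5, columns C1..C3.
N = np.array([[5, 7, 2],
              [18, 46, 20],
              [19, 29, 39],
              [12, 40, 49],
              [3, 7, 16]], dtype=float)
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

K = 2                                          # all non-trivial axes here
lam = sv[:K] ** 2                              # principal inertias
F = (U[:, :K] / np.sqrt(r)[:, None]) * sv[:K]  # row principal coordinates f_ik

# Table of inertia components m_i * f_ik^2 (points by axes).
components = r[:, None] * F ** 2

# CTR: share of axis k's inertia due to point i.
ctr = components / lam

# COR: share of point i's inertia explained by axis k (squared cosine).
d_sq = (F ** 2).sum(axis=1)                    # squared distance to centroid
cor = F ** 2 / d_sq[:, None]

print(np.round(ctr, 3))
print(np.round(cor, 3))
```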
Contributions to axes and contributions to points
(product data, doubled)

Contributions (rows):
        Weight (relative)     F1      F2
A            0.100          0.200   0.010
B            0.100          0.006   0.266
C            0.100          0.249   0.031
D            0.100          0.153   0.011
E            0.100          0.113   0.010
F            0.100          0.113   0.004
G            0.100          0.037   0.414
H            0.100          0.074   0.202
I            0.100          0.009   0.044
ours         0.100          0.048   0.010

Squared cosines (rows):
          F1      F2
A       0.922   0.027
B       0.033   0.914
C       0.901   0.065
D       0.856   0.035
E       0.827   0.045
F       0.929   0.017
G       0.129   0.839
H       0.320   0.510
I       0.087   0.259
ours    0.389   0.046
Points with low squared cosines on both axes are not so well-represented.

Eigenvalues and percentages of inertia:
                               F1       F2
Eigenvalue                   0.117    0.068
Rows depend on columns (%)  54.482   31.656
Cumulative %                54.482   86.139
After:
Correspondence Analysis in the Social Sciences (Cologne, 1991)
Visualizing Categorical Data (Cologne, 1995)
Large Scale Data Analysis (Cologne, 1999)
Correspondence Analysis and Related Methods (CARME 2003) (Barcelona, 2003)
CARME 2007
Correspondence Analysis &
Related Methods
Erasmus University
Rotterdam
25-27 June 2007
http://www.carme-n.org
Just published by
Chapman & Hall /
CRC Press
