
First XLSTAT Users Conference

Paris, 2007
7-8 June 2007

Correspondence Analysis
and Related Methods
Michael Greenacre
Universitat Pompeu Fabra
Barcelona
michael@upf.es
www.globalsong.net
www.econ.upf.es/~michael

Correspondence analysis
and Related Methods Part 1
1. What is correspondence analysis (CA)?
2. Why is CA so useful as a method of visualizing
tabular data?
3. How is CA implemented in XLSTAT?
(by CA I mean simple CA, as opposed to
multiple CA, which is discussed in the next
talk)

Jean-Paul Benzécri...

creator of Correspondence Analysis

Correspondence analysis:
in which areas of research is it useful?
CA visualizes complex data, primarily data on categorical
measurement scales, facilitating understanding and
interpretation, a neglected aspect of statistical
enquiry (cf. the usual modelling approach).
• linguistics, textual analysis: word frequencies
• sociology: cross-tabulations and large sets of categorical data from questionnaires; useful for qualitative research and visualization of case study data
• ecology: species abundance data at several locations, often with explanatory variables
• market research: perceptual mapping of brands/products, ...
• archeology: large sparse data matrices
• biology, geology, chemistry, psychology...

Correspondence Analysis (CA)

• CA is a method of data visualization.
• It applies in the first instance to a cross-tabulation (contingency table), but can be applied to many other data types after suitable recoding.
• The results of CA are in the form of a map of points.
• The points represent the rows and columns of the table; it is not the absolute values which are represented (as in principal component analysis, for example) but their relative values.
• The positions of the points in the map tell you something about similarities between the rows, similarities between the columns and the association between rows and columns.

A simple example
• 312 respondents, all readers of a certain newspaper, cross-tabulated according to their education group and level of reading of the newspaper

        C1    C2    C3
E1       5     7     2
E2      18    46    20
E3      19    29    39
E4      12    40    49
E5       3     7    16

[Figure: CA map of the education groups E1-E5 and the reading categories C1-C3; first axis 0.0704 (84.52 %), second axis 0.0129 (15.48 %)]

E1: some primary   E2: primary completed   E3: some secondary
E4: secondary completed   E5: some tertiary
C1: glance   C2: fairly thorough   C3: very thorough

Three basic geometric concepts
• profile: the coordinates (position) of the point
• mass: the weight given to the point
• distance: the measure of proximity between points

Four derived geometrical concepts
• centroid: the weighted average position
• subspace: space of reduced dimensionality within the space
• projection: the closest point in the subspace
• inertia: the weighted sum of squared distances to the centroid, $\text{inertia} = \sum_i m_i d_i^2$

Profile
• A profile is a set of relative frequencies, that is, a set of frequencies expressed relative to their total (often in percentage form).
• Each row or each column of a table of frequencies defines a different profile.
• It is these profiles which CA visualises as points in a map.

Original data
        C1    C2    C3   Total
E1       5     7     2      14
E2      18    46    20      84
E3      19    29    39      87
E4      12    40    49     101
E5       3     7    16      26
Total   57   129   126     312

Row profiles
        C1    C2    C3   Total
E1     .36   .50   .14       1
E2     .21   .55   .24       1
E3     .22   .33   .45       1
E4     .12   .40   .49       1
E5     .12   .27   .62       1

Column profiles
        C1    C2    C3
E1     .09   .05   .02
E2     .32   .37   .16
E3     .33   .22   .31
E4     .21   .31   .39
E5     .05   .05   .13
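As a minimal sketch of how the profile tables above can be obtained from the counts, the following plain Python divides each row (or column) by its own total. The function names `row_profiles` and `column_profiles` are illustrative choices, not part of XLSTAT.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

def row_profiles(table):
    """Divide every row by its own total, so each row sums to 1."""
    return [[x / sum(row) for x in row] for row in table]

def column_profiles(table):
    """Divide every column by its own total, so each column sums to 1."""
    col_totals = [sum(col) for col in zip(*table)]
    return [[x / t for x, t in zip(row, col_totals)] for row in table]

rows = row_profiles(counts)
cols = column_profiles(counts)

for label, profile in zip(["E1", "E2", "E3", "E4", "E5"], rows):
    print(label, [round(p, 2) for p in profile])
```

Rounded to two decimals, the row profiles agree with the table on the slide; the column profiles are obtained the same way with the roles of rows and columns exchanged.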

Row profiles viewed in 3-d

Plotting profiles in profile space (triangular coordinates)
[Figure: row profile E1 = (0.36, 0.50, 0.14) plotted in triangular coordinates]

Weighted average (centroid)

[Figure: two balance diagrams, one for the average and one for the weighted average of two points]
The average is the point at which the two points are balanced.
The situation is identical for multidimensional points...

Plotting profiles in profile space (barycentric or weighted average principle)
[Figures: each row profile plotted as a weighted average of the three vertices: E1 = (0.36, 0.50, 0.14), E2 = (0.21, 0.55, 0.24), E5 = (0.12, 0.27, 0.62)]

Masses of the profiles

Original data and masses
        C1    C2    C3   Total   Mass
E1       5     7     2      14   .045
E2      18    46    20      84   .269
E3      19    29    39      87   .279
E4      12    40    49     101   .324
E5       3     7    16      26   .083
Total   57   129   126     312

Average row profile: .183  .413  .404
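The masses and the average row profile can be checked with a short sketch like the one below. It also illustrates the centroid idea from the earlier slide: the average row profile is the weighted average of the row profiles, with the masses as weights. Variable names are illustrative.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

n = sum(sum(row) for row in counts)              # grand total (312)
masses = [sum(row) / n for row in counts]        # row totals relative to n
profiles = [[x / sum(row) for x in row] for row in counts]

# Centroid: weighted average of the row profiles, weights = row masses.
centroid = [
    sum(m * p[j] for m, p in zip(masses, profiles))
    for j in range(len(counts[0]))
]

# Average row profile computed directly from the column totals.
average_profile = [sum(col) / n for col in zip(*counts)]

print([round(m, 3) for m in masses])
print([round(c, 3) for c in centroid])
```

Both routes give the same vector, which is why the origin of a CA map can be read as the average profile.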

Readership data

Education group            C1            C2            C3          Total   Mass
E1 Some primary             5 (0.357)     7 (0.500)     2 (0.143)     14   0.045
E2 Primary completed       18 (0.214)    46 (0.548)    20 (0.238)     84   0.269
E3 Some secondary          19 (0.218)    29 (0.333)    39 (0.448)     87   0.279
E4 Secondary completed     12 (0.119)    40 (0.396)    49 (0.485)    101   0.324
E5 Some tertiary            3 (0.115)     7 (0.269)    16 (0.615)     26   0.083
Total                      57 (0.183)   129 (0.413)   126 (0.404)    312

C1: glance   C2: fairly thorough   C3: very thorough

Calculating chi-square

$$\chi^2 = \text{12 similar terms} \ldots + \frac{(3 - 4.76)^2}{4.76} + \frac{(7 - 10.74)^2}{10.74} + \frac{(16 - 10.50)^2}{10.50} = 26.0$$

Education group            C1            C2            C3          Total   Mass
..                                                                    14
..                                                                    84
..                                                                    87
..                                                                   101
E5 Some tertiary
   Observed frequency       3 (0.115)     7 (0.269)    16 (0.615)     26   0.083
   Expected frequency       4.76         10.74         10.50
Total                      57 (0.183)   129 (0.413)   126 (0.404)    312

For example, the expected frequency of (E5, C1) is 0.183 x 26 = 4.76.
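A small sketch of the full calculation, summing all 15 terms rather than only the three shown for row E5. The helper names `expected_frequency` and `chi_square` are illustrative; the expected frequencies are computed from unrounded proportions, so they can differ from the rounded slide values in the second decimal.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

def expected_frequency(table, i, j):
    """Row total times column total divided by the grand total."""
    n = sum(sum(row) for row in table)
    row_total = sum(table[i])
    col_total = sum(row[j] for row in table)
    return row_total * col_total / n

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = expected_frequency(table, i, j)
            stat += (observed - expected) ** 2 / expected
    return stat

print(round(expected_frequency(counts, 4, 0), 2))  # cell (E5, C1)
print(round(chi_square(counts), 1))
```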

Calculating chi-square

$$\chi^2 = \text{12 similar terms} \ldots + 26 \left[ \frac{(3/26 - 4.76/26)^2}{4.76/26} + \frac{(7/26 - 10.74/26)^2}{10.74/26} + \frac{(16/26 - 10.50/26)^2}{10.50/26} \right]$$

$$\chi^2 / 312 = \text{12 similar terms} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$$

Calculating inertia

$$\text{Inertia} = \chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$$

Here 0.083 is the mass of row E5, and the term in brackets is the squared chi-square distance between the profile of E5 and the average profile.

$$\text{Inertia} = \sum \text{mass} \times (\text{chi-square distance})^2$$

The chi-square distance is a WEIGHTED EUCLIDEAN distance.
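The identity on this slide can be sketched numerically: summing mass times squared chi-square distance over the rows gives the total inertia, which should agree with the sum of the two principal inertias printed on the readership maps (0.0704 + 0.0129). The function name `inertia` is an illustrative choice.

```python
# Readership counts from the slides: rows E1..E5, columns C1..C3.
counts = [
    [5, 7, 2],
    [18, 46, 20],
    [19, 29, 39],
    [12, 40, 49],
    [3, 7, 16],
]

def inertia(table):
    """Sum over rows of mass * squared chi-square distance to the centroid."""
    n = sum(sum(row) for row in table)
    centroid = [sum(col) / n for col in zip(*table)]   # average row profile
    total = 0.0
    for row in table:
        mass = sum(row) / n
        profile = [x / sum(row) for x in row]
        dist2 = sum((p - c) ** 2 / c for p, c in zip(profile, centroid))
        total += mass * dist2
    return total

print(round(inertia(counts), 4))
```

Multiplying the result by the grand total 312 recovers the chi-square statistic of the table.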

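How can we see chi-square distances?

$$\text{Inertia} = \chi^2 / 312 = \text{similar terms for first four rows} \ldots + 0.083 \left[ \frac{(0.115 - 0.183)^2}{0.183} + \frac{(0.269 - 0.413)^2}{0.413} + \frac{(0.615 - 0.404)^2}{0.404} \right]$$

Here 0.083 is the mass of row E5, and the term in brackets is the squared chi-square distance between the profile of E5 and the average profile. This weighted Euclidean distance can be rewritten as an ordinary Euclidean one:

$$\left( \frac{0.115}{\sqrt{0.183}} - \frac{0.183}{\sqrt{0.183}} \right)^2 + \left( \frac{0.269}{\sqrt{0.413}} - \frac{0.413}{\sqrt{0.413}} \right)^2 + \left( \frac{0.615}{\sqrt{0.404}} - \frac{0.404}{\sqrt{0.404}} \right)^2$$

So the answer is to divide all profile elements by the square roots of their averages.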
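A brief sketch of this rescaling: after each profile element is divided by the square root of the corresponding average, the ordinary Euclidean distance between the stretched points equals the chi-square distance between the original profiles. The helper names are illustrative.

```python
import math

# Profile of E5 and the average row profile, from the readership counts.
e5 = [3 / 26, 7 / 26, 16 / 26]
centroid = [57 / 312, 129 / 312, 126 / 312]

def chi_square_distance(p, q, average):
    """Weighted Euclidean distance with weights 1 / average."""
    return math.sqrt(sum((a - b) ** 2 / c for a, b, c in zip(p, q, average)))

def stretch(p, average):
    """Divide each profile element by the square root of its average."""
    return [a / math.sqrt(c) for a, c in zip(p, average)]

def euclidean(p, q):
    """Ordinary (Pythagorean) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

d_chi = chi_square_distance(e5, centroid, centroid)
d_euc = euclidean(stretch(e5, centroid), stretch(centroid, centroid))
print(round(d_chi ** 2, 3), round(d_euc ** 2, 3))
```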

Stretched row profiles viewed in 3-d chi-squared space
[Figure: the stretched profiles and vertices; the Pythagorean (ordinary Euclidean) distances between the stretched profiles are the chi-square distances]

What CA does
• centres the row and column profiles with respect to their average profiles, so that the origin represents the average.
• re-defines the dimensions of the space in an ordered way: the first dimension explains the maximum amount of inertia possible in one dimension; the second adds the maximum amount to the first (hence the first two explain the maximum amount in two dimensions), and so on until all dimensions are explained.
• decomposes the total inertia along the principal axes into principal inertias, usually expressed as % of the total.
• so if we want a low-dimensional version, we just take the first (principal) dimensions.
The row and column problem solutions are closely related: one can be obtained from the other, and there are simple scaling factors along each dimension relating the two problems.
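The steps above can be sketched with a singular value decomposition of the standardized residuals. This is one common way of computing simple CA, assumed here for illustration; the slides do not say how XLSTAT computes its solution, and the function name `simple_ca` and the dictionary keys are hypothetical. The sketch needs NumPy.

```python
import numpy as np

def simple_ca(table):
    """Simple CA of a two-way table via the SVD of standardized residuals."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Centre at the average (r c^T) and apply the chi-square metric.
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                 # number of non-trivial dimensions
    U, sv, V = U[:, :k], sv[:k], Vt[:k].T
    row_std = U / np.sqrt(r)[:, None]    # standard coordinates of rows
    col_std = V / np.sqrt(c)[:, None]    # standard coordinates of columns
    return {
        "principal_inertias": sv ** 2,
        "row_standard": row_std,
        "col_standard": col_std,
        "row_principal": row_std * sv,   # scaled by sqrt(principal inertia)
        "col_principal": col_std * sv,
        "row_masses": r,
        "col_masses": c,
    }

counts = [[5, 7, 2], [18, 46, 20], [19, 29, 39], [12, 40, 49], [3, 7, 16]]
result = simple_ca(counts)
inertias = result["principal_inertias"]
print(np.round(inertias, 4), np.round(100 * inertias / inertias.sum(), 1))
```

For the readership table the two principal inertias should come out close to the values printed on the maps, about 0.0704 and 0.0129. The signs of the coordinates are arbitrary, so a map may appear reflected relative to the one on the slide.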

Asymmetric Maps using XLSTAT
[Figures: two asymmetric maps of the readership data (education groups E1-E5, reading categories C1-C3); first axis 0.07037 (84.5 %), second axis 0.01289 (15.5 %)]

Symmetric Map using XLSTAT
[Figure: symmetric map of the readership data; points labelled primary incomplete, primary complete, secondary incomplete, secondary complete, some tertiary, glance, fairly thorough, very thorough; first axis 0.07037 (84.5 %), second axis 0.01289 (15.5 %)]

Asymmetric and symmetric maps

Asymmetric maps represent the rows and columns jointly in principal and standard coordinates; asymmetric maps are also biplots.
Because the principal coordinates can be much smaller than the standard coordinates, especially when $\lambda_k$ is small, the generally accepted way for the joint map is the symmetric map, where both rows and columns are in principal coordinates.
Symmetric maps are, strictly speaking, not biplots, but they are almost so (see Gabriel, Biometrika, 2002).
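To give a feel for the size difference between the two coordinate systems, here is a small sketch using the principal inertias of the readership example. It relies on the standard relation that a principal coordinate on axis k is the standard coordinate multiplied by the square root of $\lambda_k$; the point (1.0, 1.0) is a hypothetical position chosen only for illustration.

```python
import math

# Principal inertias of the readership example, as printed on the maps.
principal_inertias = [0.0704, 0.0129]

def standard_to_principal(standard, inertias):
    """Shrink standard coordinates, axis by axis, by sqrt(principal inertia)."""
    return [s * math.sqrt(lam) for s, lam in zip(standard, inertias)]

# Hypothetical point with standard coordinates (1.0, 1.0).
point = standard_to_principal([1.0, 1.0], principal_inertias)
print([round(x, 3) for x in point])
```

With inertias this small, the principal coordinates are roughly a quarter and a ninth of the standard ones, which is why profile points bunch near the origin in an asymmetric map.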

Data set product (McFie et al.)

• Our company wishes to identify the perceptions of itself and its nine major competitors.
• Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which companies they associate with which of 8 attributes.
• The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.

Companies  ProdQual  Innovatn  ProdRange  Environm  PriceLevel  ModImage  PriceSens  GlobProd
A                 3        16         14        13          14        18          6        18
B                 1        15          6         8          10        13         14         9
C                13        11          4        13          11         4         10         2
D                 9        11          4         9          11         9         11         3
E                 6        14         15        17          14        16          8        15
F                 3        16         14        15          12        14          7        16
G                18        12         13        16          13         5          4         7
H                 2        14          7         6          10         4         14         8
I                10        14         13        12          14        16          4         8
ours              4        15         15        16          14         7          6        15

Reduction of dimensionality

• data centred (at the means)
• points weighted (row masses): in case of frequency data, points are weighted by their row masses, that is the relative frequencies of each row (i.e. proportional to sample sizes, n)
• metric weighted (column weights):

$$d_{ii'}^2 = \sum_j w_j \, (y_{ij} - y_{i'j})^2$$

e.g. $w_j = 1/\sigma_j^2$, the inverse of the variance in PCA
$w_j = 1/c_j$, the inverse of the expected value in CA
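The weighted distance can be written as a single function that serves both cases; only the column weights change. In the sketch below the PCA data are hypothetical, the population variance is an assumed choice for the PCA weights, and the CA example reuses row E5 of the readership data.

```python
from statistics import pvariance

def weighted_sq_distance(y_i, y_k, weights):
    """Squared distance: sum over columns of w_j * (y_ij - y_kj)^2."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, y_i, y_k))

# PCA-style weights on hypothetical data: inverse of each column's variance.
data = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
pca_weights = [1 / pvariance(col) for col in zip(*data)]
d2_pca = weighted_sq_distance(data[0], data[1], pca_weights)

# CA-style weights: inverse of the average profile (expected values).
centroid = [57 / 312, 129 / 312, 126 / 312]
ca_weights = [1 / c for c in centroid]
e5 = [3 / 26, 7 / 26, 16 / 26]
d2_ca = weighted_sq_distance(e5, centroid, ca_weights)

print(round(d2_pca, 3), round(d2_ca, 3))
```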

Fat Freddy's Cat Dimensional Transmogrifier

with thanks to Jörg Blasius

Data set product (McFie et al.)

• Our company wishes to identify the perceptions of its products and its 9 major competitors (A, B, ..., I).
• Data are gathered from representatives from 18 companies that represent their potential client base: each has to say which products they associate with which of 8 attributes.
• The aim is to gain an idea about the relationships between the competitors and the attributes, and where our company is situated in the overall scheme.

Data set product (McFie et al.)

• First note that this is NOT a contingency table, so the chi-square test is not applicable (a permutation test could test for significance, but then we would need the original respondent-level data).
• This is an interesting example because it can be analyzed as is or it can be recoded to bring out certain features.
• Analyzing it with no recoding means that the size effect (sometimes called the halo effect) is removed, since we analyze profiles, i.e., the counts relative to their total counts. In other words, if a product gets relatively few associations, then it is the highest of these (lower) associations that are determinant. Hence, in the following extreme case, a pattern of [ 18 18 18 ] is identical to a pattern of [ 1 1 1 ]!
• The masses assigned to the products will be proportional to the number of associations they get.
• If the size effect needs to be visualized as well, the data table should be doubled.
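The extreme case mentioned above is easy to check with a tiny sketch: the two patterns have the same profile and differ only in their totals, and hence in the masses they would receive. The function name `profile` is illustrative.

```python
import math

def profile(counts):
    """Counts expressed relative to their total."""
    total = sum(counts)
    return [x / total for x in counts]

high = [18, 18, 18]
low = [1, 1, 1]

same_profile = all(
    math.isclose(a, b) for a, b in zip(profile(high), profile(low))
)
print(same_profile, sum(high), sum(low))
```

The profiles coincide, so in an undoubled analysis the two products would be plotted at the same position, while their totals (54 against 3) would only affect their masses.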

Data set product (McFie et al.)

Counts:
Products/Company   PQ   In   PR   En   PL   MI   PS   GP   Total
A                   3   16   14   13   14   18    6   18     102
B                   1   15    6    8   10   13   14    9      76
C                  13   11    4   13   11    4   10    2      68
D                   9   11    4    9   11    9   11    3      67
E                   6   14   15   17   14   16    8   15     105
F                   3   16   14   15   12   14    7   16      97
G                  18   12   13   16   13    5    4    7      88
H                   2   14    7    6   10    4   14    8      65
I                  10   14   13   12   14   16    4    8      91
ours                4   15   15   16   14    7    6   15      92

Row profiles (percentages of each row total):
Products/Company    PQ     In     PR     En     PL     MI     PS     GP   Total
A                  2.9   15.7   13.7   12.7   13.7   17.6    5.9   17.6     102
B                  1.3   19.7    7.9   10.5   13.2   17.1   18.4   11.8      76
C                 19.1   16.2    5.9   19.1   16.2    5.9   14.7    2.9      68
D                 13.4   16.4    6.0   13.4   16.4   13.4   16.4    4.5      67
E                  5.7   13.3   14.3   16.2   13.3   15.2    7.6   14.3     105
F                  3.1   16.5   14.4   15.5   12.4   14.4    7.2   16.5      97
G                 20.5   13.6   14.8   18.2   14.8    5.7    4.5    8.0      88
H                  3.1   21.5   10.8    9.2   15.4    6.2   21.5   12.3      65
I                 11.0   15.4   14.3   13.2   15.4   17.6    4.4    8.8      91
ours               4.3   16.3   16.3   17.4   15.2    7.6    6.5   16.3      92

Data set product (McFie et al.)

• Doubling involves coding the counts of the numbers (out of 18) that DON'T associate the product with the attribute in each case.
• There are now two columns per attribute: each attribute is represented by its positive and negative end of the 0-to-18 scale of counts.

Doubled table:
Com. Prod.   PQ  PQ-   In  In-   PR  PR-   En  En-   PL  PL-   MI  MI-   PS  PS-   GP  GP-   Total
A             3   15   16    2   14    4   13    5   14    4   18    0    6   12   18    0     144
B             1   17   15    3    6   12    8   10   10    8   13    5   14    4    9    9     144
C            13    5   11    7    4   14   13    5   11    7    4   14   10    8    2   16     144
D             9    9   11    7    4   14    9    9   11    7    9    9   11    7    3   15     144
E             6   12   14    4   15    3   17    1   14    4   16    2    8   10   15    3     144
F             3   15   16    2   14    4   15    3   12    6   14    4    7   11   16    2     144
G            18    0   12    6   13    5   16    2   13    5    5   13    4   14    7   11     144
H             2   16   14    4    7   11    6   12   10    8    4   14   14    4    8   10     144
I            10    8   14    4   13    5   12    6   14    4   16    2    4   14    8   10     144
ours          4   14   15    3   15    3   16    2   14    4    7   11    6   12   15    3     144
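Doubling is mechanical enough to sketch in a few lines: each count x becomes the pair (x, 18 - x), so every row total becomes 8 x 18 = 144. The function name `double_table` is illustrative; the example row is product A from the table above.

```python
def double_table(table, maximum):
    """Replace every count x by the pair (x, maximum - x)."""
    doubled = []
    for row in table:
        new_row = []
        for x in row:
            new_row.extend([x, maximum - x])   # positive pole, negative pole
        doubled.append(new_row)
    return doubled

# Product A: PQ, In, PR, En, PL, MI, PS, GP (counts out of 18 respondents).
product_a = [3, 16, 14, 13, 14, 18, 6, 18]
doubled_a = double_table([product_a], maximum=18)[0]
print(doubled_a, sum(doubled_a))
```

Because all doubled rows share the same total, every product receives the same mass in the doubled analysis.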

Row asymmetric map

• Row points are projections of row profiles and have inertias along axes equal to principal inertias (hence principal coordinates).
• Column points are projections of extreme corner profiles, or vertices (cf. triangle), and have inertia along axes equal to 1 (hence standard coordinates).
• Profile points are generally close to the average.
[Figure: row asymmetric map of the product data, products in principal coordinates and attributes in standard coordinates; first axis 0.0765 (53.1 %), second axis 0.0478 (33.2 %)]

Symmetric map

• Row points and column points are both displayed in principal coordinates; both have inertias along axes equal to principal inertias.
• Both sets of points occupy similar regions of the map: aesthetically a better graphic.
[Figure: symmetric map of the product data; first axis 0.0765 (53.1 %), second axis 0.0478 (33.2 %)]

Doubled data: symmetric map

• Attributes have positive and negative poles; the average association is at the origin of the map, e.g., In(novation) has a high average, P(roduct)Q(uality) has a low average.
• Fairly similar configuration to the undoubled analysis: there is no strong halo effect.
[Figure: symmetric map of the doubled product data; first axis 0.1173 (54.5 %), second axis 0.0682 (31.7 %). Annotations on the map: "High product quality"; "High product range, modern image, global products"; "High price sensitive; low environment, product range and price level"]

Inertia contributions in CA
• Correspondence analysis (CA) is a method of data visualization which represents the true positions of profile points in a map which comes closest to all the points, closest in the sense of weighted least-squares.
[Figure: a cloud of points with two principal axes, explaining 71% and 12% of the inertia]
• The inertia explained in the map applies to all the points: if we say 83% of the inertia is explained in the map, 71% on the first dimension and 12% on the second, this is a figure calculated for all row (or column) points together.

Inertia contributions in CA
• This type of inertia-explained-by-axes calculation can be made for individual points.
• These more detailed results are aids to interpretation in the form of numerical diagnostics, called contributions.
• Especially when there is not a high percentage of inertia explained by the map, these contributions will help us to identify points which are represented inaccurately.
• The inertias and their percentages tell us how much of the variance in the table is explained by the principal axes. The contributions do the same, but for each point individually, and help us to see:
(a) which points are being explained better than others;
(b) which points are contributing to the solution more than others.

Geometry of inertia contributions

[Figure: the i-th point $a_i$ with mass $m_i$ at distance $d_i$ from the centroid; its projection on the k-th principal axis has coordinate $f_{ik}$]

Total inertia of the cloud of points $= \sum_i m_i d_i^2 = \sum_i m_i \sum_k f_{ik}^2 = \sum_k \lambda_k$

Inertia of i-th point $= m_i d_i^2 = m_i \sum_k f_{ik}^2$

Inertia contribution of i-th point to k-th axis $= m_i f_{ik}^2$

Geometry of inertia contributions

$$
\begin{array}{c|cccc|c}
 & \multicolumn{4}{c|}{\text{Axes}} & \\
 & 1 & 2 & \cdots & p & \\
\hline
1 & m_1 f_{11}^2 & m_1 f_{12}^2 & \cdots & m_1 f_{1p}^2 & m_1 d_1^2 \\
2 & m_2 f_{21}^2 & m_2 f_{22}^2 & \cdots & m_2 f_{2p}^2 & m_2 d_2^2 \\
3 & m_3 f_{31}^2 & m_3 f_{32}^2 & \cdots & m_3 f_{3p}^2 & m_3 d_3^2 \\
\vdots & \vdots & \vdots & & \vdots & \vdots \\
n & m_n f_{n1}^2 & m_n f_{n2}^2 & \cdots & m_n f_{np}^2 & m_n d_n^2
\end{array}
$$

Inertia contributions

• $m_i f_{ik}^2 / \lambda_k$: amount of inertia of axis k explained by point i (absolute contribution, CTR)
• $m_i f_{ik}^2 / m_i d_i^2$: amount of inertia of point i explained by axis k (relative contribution, COR)
• $m_i f_{ik}^2 / m_i d_i^2 = f_{ik}^2 / d_i^2$, i.e. the square of $f_{ik} / d_i = \cos(\theta_{ik})$, where $\theta_{ik}$ is the angle between point i and axis k
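The two kinds of contribution can be sketched for the rows of the readership table. The SVD route to the principal coordinates $f_{ik}$ is an assumed, commonly used computation, not something stated on the slides; the function name `contributions` is hypothetical and NumPy is required.

```python
import numpy as np

def contributions(table):
    """Return (CTR, COR) for the rows of a two-way table."""
    N = np.asarray(table, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                       # row masses m_i
    c = P.sum(axis=0)                       # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    k = min(N.shape) - 1                    # non-trivial dimensions
    F = (U[:, :k] * sv[:k]) / np.sqrt(r)[:, None]   # principal coords f_ik
    point_axis = r[:, None] * F ** 2        # m_i f_ik^2
    ctr = point_axis / sv[:k] ** 2          # divide by lambda_k (per axis)
    cor = point_axis / point_axis.sum(axis=1, keepdims=True)  # by m_i d_i^2
    return ctr, cor

counts = [[5, 7, 2], [18, 46, 20], [19, 29, 39], [12, 40, 49], [3, 7, 16]]
ctr, cor = contributions(counts)
print(np.round(ctr, 3))   # each column sums to 1
print(np.round(cor, 3))   # each row sums to 1 (squared cosines)
```

Reading CTR down a column shows which points build an axis; reading COR across a row shows how well a point is displayed on each axis.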

Contributions to axes and contributions to points
(product data, doubled)

Contributions (rows):
        Weight (relative)      F1      F2
A                   0.100   0.200   0.010
B                   0.100   0.006   0.266
C                   0.100   0.249   0.031
D                   0.100   0.153   0.011
E                   0.100   0.113   0.010
F                   0.100   0.113   0.004
G                   0.100   0.037   0.414
H                   0.100   0.074   0.202
I                   0.100   0.009   0.044
ours                0.100   0.048   0.010

Squared cosines (rows):
           F1      F2
A       0.922   0.027
B       0.033   0.914
C       0.901   0.065
D       0.856   0.035
E       0.827   0.045
F       0.929   0.017
G       0.129   0.839
H       0.320   0.510
I       0.087   0.259
ours    0.389   0.046
(slide annotation: "Not so well-represented")

Eigenvalues and percentages of inertia:
                                 F1       F2
Eigenvalue                    0.117    0.068
Rows depend on columns (%)   54.482   31.656
Cumulative %                 54.482   86.139

After:
• Correspondence Analysis in the Social Sciences (Cologne, 1991)
• Visualizing Categorical Data (Cologne, 1995)
• Large Scale Data Analysis (Cologne, 1999)
• Correspondence Analysis and Related Methods (CARME 2003) (Barcelona, 2003)

CARME 2007
Correspondence Analysis & Related Methods
Erasmus University Rotterdam
25-27 June 2007
http://www.carme-n.org

Just published by Chapman & Hall / CRC Press
