
Factor Analysis

• Factor analysis allows us to look at groups of variables that tend to be correlated with each other and identify underlying dimensions that explain these correlations.
• Here, variables are not classified as dependent or
independent. Instead, relationships among sets of
many interrelated variables are examined and
represented in terms of a few underlying factors.
• It is a class of procedures primarily used for data reduction and summarization. For example, store image may be measured by asking respondents to evaluate stores on a series of items on a semantic differential scale. These item evaluations may then be analyzed to determine the factors underlying store image.
Some key definitions:
• Bartlett's test of sphericity: This is a test statistic used to examine
the hypothesis that the variables are uncorrelated in the
population. In other words, the population correlation matrix is an
identity matrix; each variable correlates perfectly with itself but
has no correlation with the other variables.
• Correlation matrix: It is a lower triangle matrix showing the simple
correlations, r, between all possible pairs of variables included in
the analysis.
• Communality: It is the amount of variance a variable shares with
all the other variables being considered.
• Eigenvalue: It represents the total variance explained by each
factor.
• Factor loadings: Simple correlations between the variables and
the factors.
• Factor loading plot: It is a plot of the original variables using the
factor loadings as coordinates.
• Factor matrix: It contains the factor loadings of all the
variables on all the factors extracted.
• Factor scores: Composite scores estimated for each
respondent on the derived factors.
• Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy:
It is an index used to examine the appropriateness of factor
analysis. High values (between 0.5 and 1.0) indicate that
factor analysis is appropriate.
• Percentage of variance: This is the percentage of the total
variance attributed to each factor.
• Residuals: Differences between the observed correlations, as
given in the input correlation matrix, and the reproduced
correlations, as estimated from the factor matrix.
• Scree plot: Plot of the eigenvalues against the no. of factors in
order of extraction.
Model:
As in multiple regression analysis, each variable is expressed as a
linear combination of underlying factors. The covariation among the
variables is described in terms of a small no. of common factors plus
a unique factor for each variable. It is represented as:
Xi = Ai1*F1 + Ai2*F2 + ... + Aim*Fm + Vi*Ui
where;
Xi = ith standardized variable
Aij= standardized multiple regression coefficient of variable i on
common factor j
Fj = jth common factor
Vi = standardized regression coefficient of variable i on unique
factor i
Ui = unique factor for variable i
m= no. of common factors
The unique factors are uncorrelated with each other and with the
common factors.
The common factors can be expressed as linear combinations of the
observed variables.
Fi = Wi1*X1 + ... + Wik*Xk
where;
Fi= estimate of ith factor
Wi= weight or factor score coeff.
k= no. of variables
It is possible to select weights so that the first factor explains the
largest portion of the total variance. Then a second set of weights
can be selected, so that the second factor accounts for most of the
residual variance, subject to being uncorrelated with the first
factor. This same principle could be applied to selecting additional
weights for the additional factors. Thus, the factors can be
estimated so that their factor scores, unlike the values of the
original variables, are not correlated. Furthermore, the first factor
accounts for the highest variance in the data, the second factor
second highest and so on.
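The weight-selection procedure described above is exactly what a principal component extraction does. Below is a minimal illustrative sketch in Python (NumPy only; the data here are hypothetical, generated just to demonstrate the mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical illustrative data: 100 observations, 4 variables
X = rng.normal(size=(100, 4))
X[:, 1] += 0.8 * X[:, 0]  # induce some correlation between variables

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize the variables
R = np.corrcoef(Z, rowvar=False)           # correlation matrix

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]          # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)      # A_ij: correlations of variables with factors
scores = Z @ eigvecs                       # factor scores, mutually uncorrelated
```

The first column of `loadings` accounts for the largest share of total variance, the second for the largest share of the residual variance, and the score columns are uncorrelated with each other, as stated above.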
Conducting factor analysis (steps):
1. Formulate the problem
2. Construct the correlation matrix
3. Determine the method of factor analysis
4. Determine the no. of factors
5. Rotate the factors
6. Interpret the factors
7. Calculate factor scores, or select surrogate variables
8. Determine the model fit
Formulation of the problem:
• First, the objectives of factor analysis should
be identified.
• The variables to be included in the factor
analysis should be specified based on past
research, theory. (Variables must be measured
appropriately on an interval or ratio scale.)
• An appropriate sample size should be used. There should be
at least four or five times as many observations (sample size)
as there are variables.
• Example : Suppose the researcher wants to determine the
underlying benefits consumers seek from the purchase of a
toothpaste. A sample of 30 respondents was interviewed
using mall-intercept interviewing. The respondents were
asked to indicate their degree of agreement with the
following statements using a 7-point scale (1= strongly
disagree, 7=strongly agree) :

V1: It is important to buy a toothpaste that prevents cavities.
V2: I like a toothpaste that gives shiny teeth.
V3: A toothpaste should strengthen your gums.
V4: I prefer a toothpaste that freshens breath.
V5: Prevention of tooth decay is not an important benefit offered by
a toothpaste.
V6: The most important consideration in buying a toothpaste is
attractive teeth.
The data obtained are :
Respondent no.  V1  V2  V3  V4  V5  V6
1               7   3   6   4   2   4
2               1   3   2   4   5   4
3 6 2 7 4 1 3
4 4 5 4 6 2 5
5 1 2 2 3 6 2
6 6 3 6 4 2 4
7 5 3 6 3 4 3
8 6 4 7 4 1 4
9 3 4 2 3 6 3
10 2 6 2 6 7 6
11 6 4 7 3 2 3
12 2 3 1 4 5 4
13 7 2 6 4 1 3
14 4 6 4 5 3 6
15 1 3 2 2 6 4
16 6 4 6 3 3 4
17 5 3 6 3 3 4
18 7 3 7 4 1 4
19 2 4 3 3 6 3
20 3 5 3 6 4 6
21 1 3 2 3 5 3
22 5 4 5 4 2 4
23 2 2 1 5 4 4
24 4 6 4 6 4 7
25 6 5 4 2 1 4
26 3 5 4 6 4 7
27 4 4 7 2 2 5
28 3 7 2 6 4 3
29 4 6 3 7 2 7
30 2 3 2 4 7 2
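As a check, the correlation matrix used in the next step can be computed directly from the table above. A sketch in Python (NumPy assumed):

```python
import numpy as np

# data from the table above: 30 respondents x 6 variables (V1..V6)
data = np.array([
    [7, 3, 6, 4, 2, 4], [1, 3, 2, 4, 5, 4], [6, 2, 7, 4, 1, 3],
    [4, 5, 4, 6, 2, 5], [1, 2, 2, 3, 6, 2], [6, 3, 6, 4, 2, 4],
    [5, 3, 6, 3, 4, 3], [6, 4, 7, 4, 1, 4], [3, 4, 2, 3, 6, 3],
    [2, 6, 2, 6, 7, 6], [6, 4, 7, 3, 2, 3], [2, 3, 1, 4, 5, 4],
    [7, 2, 6, 4, 1, 3], [4, 6, 4, 5, 3, 6], [1, 3, 2, 2, 6, 4],
    [6, 4, 6, 3, 3, 4], [5, 3, 6, 3, 3, 4], [7, 3, 7, 4, 1, 4],
    [2, 4, 3, 3, 6, 3], [3, 5, 3, 6, 4, 6], [1, 3, 2, 3, 5, 3],
    [5, 4, 5, 4, 2, 4], [2, 2, 1, 5, 4, 4], [4, 6, 4, 6, 4, 7],
    [6, 5, 4, 2, 1, 4], [3, 5, 4, 6, 4, 7], [4, 4, 7, 2, 2, 5],
    [3, 7, 2, 6, 4, 3], [4, 6, 3, 7, 2, 7], [2, 3, 2, 4, 7, 2],
], dtype=float)

R = np.corrcoef(data, rowvar=False)  # 6 x 6 correlation matrix
```

The strong positive correlation between V1 and V3 and the strong negative correlation between V1 and V5 are visible directly in `R`.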
Construction of the correlation matrix:
• For the factor analysis to be appropriate, the variables must
be correlated.
For testing the appropriateness of the factor model:
1. Bartlett's test of sphericity can be used to test the null
hypothesis that the variables are uncorrelated in the
population; i.e. the population correlation matrix is an
identity matrix. The test statistic for sphericity is based on
a chi- square transformation of the determinant of the
correlation matrix.
2. KMO measure of sampling adequacy: This index compares the
magnitudes of the observed correlation coefficients to the
magnitudes of the partial correlation coefficients. Small values
of the KMO statistic indicate that the correlations between pairs
of variables cannot be explained by the other variables, and that
factor analysis may not be appropriate.
• Correlation matrix :
Variable V1 V2 V3 V4 V5 V6
V1 1.00
V2 -0.053 1.00
V3 0.873 -0.155 1.00
V4 -0.086 0.572 -0.248 1.00
V5 -0.858 0.020 -0.778 -0.007 1.00
V6 0.004 0.640 -0.018 0.640 -0.136 1.00

There are relatively high correlations among V1 (prevention of
cavities), V3 (strong gums) and V5 (prevention of tooth decay).
We would expect these variables to correlate with the same set of
factors. Likewise, V2, V4 and V6 correlate with each other.
The null hypothesis, that the population correlation matrix is an
identity matrix, is rejected by Bartlett's test of sphericity. The
approximate chi-square statistic is 111.314 with 15 degrees of
freedom, which is significant at the 0.05 level. The value of the
KMO statistic (0.660) is also large (>0.5).
Thus, factor analysis may be considered an appropriate technique
for analyzing the correlation matrix.
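Both checks can be sketched directly from the correlation matrix above. This is a minimal illustration: the chi-square uses the standard Bartlett transformation of the determinant of R, and the KMO index uses partial correlations obtained from the inverse of R.

```python
import numpy as np

# correlation matrix from the text (symmetrized)
R = np.array([
    [ 1.000, -0.053,  0.873, -0.086, -0.858,  0.004],
    [-0.053,  1.000, -0.155,  0.572,  0.020,  0.640],
    [ 0.873, -0.155,  1.000, -0.248, -0.778, -0.018],
    [-0.086,  0.572, -0.248,  1.000, -0.007,  0.640],
    [-0.858,  0.020, -0.778, -0.007,  1.000, -0.136],
    [ 0.004,  0.640, -0.018,  0.640, -0.136,  1.000],
])
n, p = 30, 6  # sample size and number of variables

# Bartlett's test of sphericity: chi-square transformation of det(R)
chi2 = -((n - 1) - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
df = p * (p - 1) // 2

# KMO: compare observed correlations with partial correlations
inv = np.linalg.inv(R)
d = np.sqrt(np.diag(inv))
partial = -inv / np.outer(d, d)      # partial correlation matrix
off = ~np.eye(p, dtype=bool)         # off-diagonal mask
kmo = (R[off]**2).sum() / ((R[off]**2).sum() + (partial[off]**2).sum())
```

With these rounded inputs the chi-square comes out close to the 111.314 reported in the text, and the KMO close to 0.660.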
Determine the method of factor analysis :
• The approach used to derive the weights or factor score
coefficients differentiates the various methods of factor analysis.
1. In principal components analysis,
• the total variance in the data is considered. The diagonal of the
correlation matrix consists of unities, and full variance is brought
into the factor matrix.
• PCA is recommended when the primary concern is to determine
the minimum no. of factors that will account for maximum
variance in the data for use in subsequent multivariate analysis.
The factors are called principal components.
2. In common factor analysis,
• the factors are estimated based only on common variance.
Communalities are inserted in the diagonal of the correlation
matrix.
• This method is appropriate when the primary concern is to
identify the underlying dimensions and the common variance is
of interest. This method is known as principal axis factoring.
Results of PCA
Communalities
Variable Initial Extraction
V1 1.0 0.926
V2 1.0 0.723
V3 1.0 0.894
V4 1.0 0.739
V5 1.0 0.878
V6 1.0 0.790

Initial Eigenvalues
Factor Eigenvalue % of Variance Cumulative %
1 2.731 45.520 45.520
2 2.218 36.969 82.488
3 0.442 7.360 89.848
4 0.341 5.688 95.536
5 0.183 3.044 98.580
6 0.085 1.420 100.000
Extraction sums of squared loadings
Factor Eigenvalue % of variance Cumulative %
1 2.731 45.520 45.520
2 2.218 36.969 82.488

Factor Matrix
Factor 1 Factor 2
V1 0.928 0.253
V2 -0.301 0.795
V3 0.936 0.131
V4 -0.342 0.789
V5 -0.869 -0.351
V6 -0.177 0.871

Rotation sums of squared loadings
Factor Eigenvalue % of Variance Cumulative %
1 2.688 44.802 44.802
2 2.261 37.687 82.488
Rotated factor matrix
Factor 1 Factor 2
V1 0.962 -0.027
V2 -0.057 0.848
V3 0.934 -0.146
V4 -0.098 0.854
V5 -0.933 -0.084
V6 0.083 0.885

Factor score coefficient matrix
     Factor 1  Factor 2
V1 0.358 0.011
V2 -0.001 0.375
V3 0.345 -0.043
V4 -0.017 0.377
V5 -0.350 -0.059
V6 0.052 0.395
Reproduced correlation matrix

V1 V2 V3 V4 V5 V6
V1 0.926 0.024 -0.029 0.031 0.038 -0.053
V2 -0.078 0.723 0.022 -0.158 0.038 -0.105
V3 0.902 -0.177 0.894 -0.031 0.081 0.033
V4 -0.117 0.730 -0.217 0.739 -0.027 -0.107
V5 -0.895 -0.018 -0.859 0.020 0.878 0.016
V6 0.057 0.746 -0.051 0.748 -0.152 0.790

• The lower left triangle contains the reproduced correlation
matrix; the diagonal, the communalities; and the upper right
triangle, the residuals between the observed correlations and the
reproduced correlations.
Determine the no. of factors :
• A Priori determination :Sometimes, because of prior knowledge, the
researcher knows how many factors to expect and thus can specify the
no. of factors to be extracted.
• Based on Eigenvalues :Here, only factors with eigenvalues greater than
1 are retained i.e. with a variance greater than 1.0. Factors with a
variance less than 1.0 are no better than a single variable, because, due
to standardization, each variable has a variance of 1.0. If the no. of
variables is less than 20, this approach will result in a conservative no. of
factors.
• Based on Scree plot :The shape of the plot is used to determine the no.
of factors. The plot has a distinct break between the steep slope of
factors, with large eigenvalues and a gradual trailing off (scree)
associated with the rest of the factors. The point at which the scree
begins denotes the true no. of factors. Generally, the no. of factors
determined by the scree plot will be one or a few more than that
determined by the eigenvalue criterion.
• Based on percentage of variance :Here, the no. of factors extracted is
determined so that the cumulative percentage of variance extracted by
the factor reaches a satisfactory level which depends upon the
problem. However, it is recommended that the factors extracted should
account for at least 60 % of the variance.
• Based on Split-Half Reliability :The sample is split
in half and factor analysis is performed on each
half. Only factors with high correspondence of
factor loadings across the two subsamples are
retained.
• Based on Significance Tests: It is possible to
determine the statistical significance of the
separate eigenvalues and retain only those
factors that are statistically significant. A
drawback is that with large samples (greater than
200), many factors are likely to be statistically
significant, although practically many of these
account for only a small proportion of the total
variance.
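The eigenvalue and percentage-of-variance criteria can be sketched from the toothpaste correlation matrix given earlier (NumPy assumed):

```python
import numpy as np

# correlation matrix from the text (symmetrized)
R = np.array([
    [ 1.000, -0.053,  0.873, -0.086, -0.858,  0.004],
    [-0.053,  1.000, -0.155,  0.572,  0.020,  0.640],
    [ 0.873, -0.155,  1.000, -0.248, -0.778, -0.018],
    [-0.086,  0.572, -0.248,  1.000, -0.007,  0.640],
    [-0.858,  0.020, -0.778, -0.007,  1.000, -0.136],
    [ 0.004,  0.640, -0.018,  0.640, -0.136,  1.000],
])

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]     # descending order
n_factors = int((eigvals > 1.0).sum())             # eigenvalue-greater-than-1 rule

cum_pct = 100 * np.cumsum(eigvals) / eigvals.sum() # percentage-of-variance rule
```

The eigenvalues sum to the number of variables (each standardized variable contributes a variance of 1.0), and the first two exceed 1.0, matching the two-factor solution discussed below.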
Results :
• We see that the eigenvalue-greater-than-1.0 criterion results
in two factors being extracted.
• Our a priori knowledge tells us that toothpaste is
bought for two major reasons.
• From the Scree plot, a distinct break occurs at three
factors. Finally from the cumulative percentage of
variance accounted for, we see that the first two
factors account for 82.49 % of the variance, and
that the gain achieved in going to three factors is
marginal.
• Split-half reliability also indicates that two factors
are appropriate.
Thus, two factors appear to be reasonable in this
situation.
Rotate factors :
• The factor matrix contains the coefficients used to express the
standardized variables in terms of the factors. These
coefficients, the factor loadings represent the correlations
between the factors and the variables. These coefficients can
be used to interpret the factors.
• In unrotated factor matrix, factors are correlated with many
variables. Therefore, through rotation, the factor matrix is
transformed into a simpler one that is easier to interpret.
• In rotating the factors,
 each factor should have nonzero, or significant, loadings for
only some of the variables.
 Each variable should have nonzero, or significant, loadings
with only a few factors, if possible with only one.
• Rotation does not affect the communalities and percentage of
total variance explained. However, the percentage of variance
accounted for by each factor does change.
Different methods of rotation:
• Orthogonal rotation : axes are maintained at
right angles and the factors are uncorrelated.
Varimax procedure : Orthogonal method of
rotation that minimizes the no. of variables
with high loadings on a factor, thereby
enhancing the interpretability of factors.
• Oblique rotation : axes are not maintained at
right angles and the factors are correlated.
Oblique rotation should be used when factors
in the population are likely to be strongly
correlated.
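A minimal varimax implementation (the standard SVD-based algorithm; a sketch, not a library call) applied to the unrotated factor matrix of the toothpaste example:

```python
import numpy as np

def varimax(L, gamma=1.0, n_iter=50, tol=1e-8):
    """Orthogonally rotate a loading matrix L (variables x factors)."""
    p, k = L.shape
    T = np.eye(k)                 # accumulated rotation matrix
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ T
        u, s, vt = np.linalg.svd(
            L.T @ (Lr**3 - (gamma / p) * Lr @ np.diag((Lr**2).sum(axis=0)))
        )
        T = u @ vt
        d_new = s.sum()
        if d_new < d * (1 + tol): # stop when the criterion stops improving
            break
        d = d_new
    return L @ T

# unrotated factor matrix from the text
L = np.array([
    [ 0.928,  0.253],
    [-0.301,  0.795],
    [ 0.936,  0.131],
    [-0.342,  0.789],
    [-0.869, -0.351],
    [-0.177,  0.871],
])
L_rot = varimax(L)
```

Because the rotation is orthogonal, the communalities (row sums of squared loadings) are unchanged, while each variable ends up loading mainly on one factor, as the text describes.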
Example: Factor 1 is at least somewhat correlated with 5 of the 6
variables (absolute value of factor loading greater than 0.3).
Likewise, factor 2 is at least somewhat correlated with four of
the six variables. Moreover, variables 2, 4 and 5 load at least
somewhat on both factors, so it is difficult to interpret the
factors. Through rotation, the factor matrix is transformed into a
simpler one that is easier to interpret (an X marks a significant
loading):

Before rotation          After rotation
Variables  F1  F2        Variables  F1  F2
1          X             1          X
2          X   X         2              X
3          X             3          X
4          X   X         4              X
5          X   X         5          X
6              X         6              X
Interpret Factors :
• Method 1 : Identify the variables that have
large loadings on the same factor. Interpret in
terms of the variables that load high on it.
• Method 2 : Plot the variables using the
loadings as coordinates. Variables at the end
of an axis are those that have high loadings on
only that factor, and hence describe the factor.
Variables that are not near any of the axes are
related to both the factors. If a factor cannot
be clearly defined in terms of the original
variables, it should be labeled as an undefined
or a general factor.
Result : In the rotated factor matrix, factor 1 has high coefficients for
variables V1 and V3 ,and a negative coeff for V5. Therefore, this factor
may be labeled a health benefit factor. Factor 2 is highly related with
variables 2, 4 and 6. Thus, factor 2 may be labeled a social benefit
factor. This interpretation is confirmed by plotting the factor
loadings (factor loading plot not shown here).
Therefore, consumers appear to seek two major kinds of benefits from
a toothpaste: health benefits and social benefits.
Calculate factor scores :
• If the goal of factor analysis is to reduce the original set of
variables to a smaller set of factors for use in subsequent
multivariate analysis, it is useful to compute factor scores for each
respondent.
• A factor is a linear combination of the original variables:
Fi = Wi1*X1 + ... + Wik*Xk
The factor scores for the ith factor may be estimated as follows:
• The weights, or factor score coefficients, used to combine the
standardized variables are obtained from the factor score
coefficient matrix.
• The standardized variable values would be multiplied by the
corresponding factor score coefficients to obtain the factor scores.
• In principal component analysis, it is possible to compute exact
factor scores and these scores are uncorrelated.
• In common factor analysis, estimates of these scores are obtained
and there is no guarantee that the factors will be uncorrelated
with each other.
• In our example, using the factor score
coefficient matrix, we can compute two factor
scores for each respondent. The standardized
variable values would be multiplied by the
corresponding factor score coefficients to
obtain factor scores.
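The computation is a single matrix product. A sketch using the factor score coefficient matrix from the results above; the standardized respondent profile here is hypothetical, chosen only to illustrate:

```python
import numpy as np

# factor score coefficient matrix W from the text (6 variables x 2 factors)
W = np.array([
    [ 0.358,  0.011],
    [-0.001,  0.375],
    [ 0.345, -0.043],
    [-0.017,  0.377],
    [-0.350, -0.059],
    [ 0.052,  0.395],
])

# hypothetical standardized responses (z-scores) for one respondent:
# above average on V1 and V3 (health items), well below average on V5
z = np.array([1.2, -0.5, 1.0, -0.3, -1.1, 0.2])

f = z @ W   # the respondent's two factor scores
```

For this profile the health-benefit score comes out around 1.18 and the social-benefit score slightly negative, which is what the loadings would lead us to expect.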
Select surrogate variables :
• Sometimes, instead of computing factor scores, the researcher
wishes to select surrogate variables which involves singling out
some of the original variables for use in subsequent analysis.
• Here, result is interpreted in terms of original variables rather than
factor scores.
• From factor matrix, select for each factor the variable with the
highest loading on that factor and use that variable as a surrogate
variable for the associated factor.
• This process works well if one factor loading for a variable is clearly
higher than all other factor loadings. However, it becomes difficult
if two or more variables have similarly high loadings. In such case,
the choice between these variables should be based on theoretical
and measurement considerations.
• Theory may suggest that a variable with a slightly lower loading is
more important than one with a slightly higher loading. Likewise, if
a variable has a slightly lower loading but has been measured
more precisely, it should be selected as a surrogate variable.
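Picking the surrogate by the highest absolute loading can be sketched as follows (rotated factor matrix from the earlier results):

```python
import numpy as np

variables = ["V1", "V2", "V3", "V4", "V5", "V6"]
# rotated factor matrix from the text (6 variables x 2 factors)
L = np.array([
    [ 0.962, -0.027],
    [-0.057,  0.848],
    [ 0.934, -0.146],
    [-0.098,  0.854],
    [-0.933, -0.084],
    [ 0.083,  0.885],
])

# for each factor, the variable with the highest absolute loading
surrogates = [variables[i] for i in np.abs(L).argmax(axis=0)]
```

This mechanical rule picks V1 and V6; as the discussion notes, theoretical or measurement considerations can override it when several loadings are comparably high.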
• Example: Variables 1, 3 and 5 all have high
loadings on factor 1, and all are fairly close in
magnitude, although V1 has relatively the highest
loading and would therefore be a likely
candidate. However, if prior knowledge suggests
that prevention of tooth decay is a very important
benefit, V5 would be selected as the surrogate for
factor 1. The choice of a surrogate for factor
2 is also not straightforward: variables V2, V4 and V6 all
have comparably high loadings on this factor. If
prior knowledge suggests that attractive teeth is
the most important social benefit sought from a
toothpaste, the researcher would select V6.
Determine the model fit :
• A basic assumption underlying factor analysis is that
the observed correlation between variables can be
attributed to common factors. Hence, the correlations
between the variables can be deduced or reproduced
from the estimated correlations between the variables
and the factors. The differences between the observed
correlations (as given in the input correlation matrix)
and the reproduced correlations (as estimated from
the factor matrix) can be examined to determine
model fit. These differences are called Residuals. If
there are many large residuals, the factor model does
not provide a good fit to the data and the model
should be reconsidered.
• In our example, we see that only five residuals are larger than
0.05, indicating an acceptable model fit.
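The residual check can be sketched from the input correlation matrix and the unrotated factor matrix, both as printed earlier in the text:

```python
import numpy as np

# observed correlation matrix from the text (symmetrized)
R = np.array([
    [ 1.000, -0.053,  0.873, -0.086, -0.858,  0.004],
    [-0.053,  1.000, -0.155,  0.572,  0.020,  0.640],
    [ 0.873, -0.155,  1.000, -0.248, -0.778, -0.018],
    [-0.086,  0.572, -0.248,  1.000, -0.007,  0.640],
    [-0.858,  0.020, -0.778, -0.007,  1.000, -0.136],
    [ 0.004,  0.640, -0.018,  0.640, -0.136,  1.000],
])
# unrotated factor matrix from the text
L = np.array([
    [ 0.928,  0.253],
    [-0.301,  0.795],
    [ 0.936,  0.131],
    [-0.342,  0.789],
    [-0.869, -0.351],
    [-0.177,  0.871],
])

R_hat = L @ L.T                      # reproduced correlations
resid = R - R_hat                    # residual matrix
off = ~np.eye(6, dtype=bool)
n_large = int((np.abs(resid[off]) > 0.05).sum()) // 2  # count each pair once
```

With these values, five variable pairs have residuals exceeding 0.05 in absolute value, matching the count quoted above.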
• Example : A manufacturer of fabricating parts
is interested in identifying the determinants of
a successful salesperson. The manufacturer
has on file the information shown in the
following table. He is wondering whether he
could reduce these seven variables to two or
three factors, for a meaningful appreciation of
the problem.
Salesperson  Height (x1)  Weight (x2)  Education (x3)  Age (x4)  No. of children (x5)  Size of household (x6)  IQ (x7)
1 67 155 12 27 0 2 102
2 69 175 11 35 3 6 92
3 71 170 14 32 1 3 111
4 70 160 16 25 0 1 115
5 72 180 12 36 2 4 108
6 69 170 11 41 3 5 90
7 74 195 13 30 1 2 113
8 68 160 16 32 1 3 118
9 70 175 12 45 4 6 121
10 71 180 13 24 0 2 92
11 66 145 10 39 2 4 100
12 75 210 16 26 0 1 109
13 70 160 12 31 0 3 102
14 71 175 13 43 3 5 112
• Solution : Intuition might suggest the
presence of three primary factors : A maturity
factor revealed in age/children/size of
household, physical size as shown by height
and weight, and intelligence or training as
revealed by education and IQ.
The salespeople data have been analyzed with the SAS
program. This program accepts data in the original units,
automatically transforming them into standard scores. The
three factors derived from the salespeople data by a
principal component analysis (SAS program) are presented
below:
Variable             I         II        III       Communality
Height               0.59038   0.72170  -0.30331   0.96140
Weight               0.45256   0.75932  -0.44273   0.97738
Education            0.80252   0.18513   0.42631   0.86006
Age                 -0.86689   0.41116   0.18733   0.95564
No. of children     -0.84930   0.49247   0.05883   0.96730
Size of household   -0.92582   0.30007  -0.01953   0.94756
IQ                   0.28761   0.46696   0.80524   0.94918
Sum of squares       3.61007   1.85136   1.15709
Variance summarised  0.51572   0.26448   0.16530   0.94550
• Factor loadings: The equations are:
F1 =  0.59038x1 + 0.45256x2 + 0.80252x3 - 0.86689x4
    - 0.84930x5 - 0.92582x6 + 0.28761x7
F2 =  0.72170x1 + 0.75932x2 + 0.18513x3 + 0.41116x4
    + 0.49247x5 + 0.30007x6 + 0.46696x7
F3 = -0.30331x1 - 0.44273x2 + 0.42631x3 + 0.18733x4
    + 0.05883x5 - 0.01953x6 + 0.80524x7
In all three equations, education (x3) and IQ (x7) have
positive loadings, indicating that they are variables of
importance in determining the success of a salesperson.
• Variance summarised: Factor analysis employs the
criterion of maximum reduction of variance, i.e. the
variance found in the initial set of variables. Each
factor contributes to this reduction. Here factor 1
accounts for 51.6 % of the total variance, factor 2
for 26.4 % and factor 3 for 16.5 %. Together the three
factors explain 95 % of the variance.
• Communality : Here, communality is over 85 %
for every variable. Thus the three factors seem to
capture the underlying dimensions involved in
these variables. In the ideal solution the factors
derived will explain 100 percent of the variance in
each of the original variables.
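The whole pipeline can be sketched on the salespeople data (standardize, correlate, eigendecompose; NumPy only; the SAS run quoted above reported about 95 % of the variance for three factors):

```python
import numpy as np

# salespeople data from the table: height, weight, education, age,
# no. of children, size of household, IQ
X = np.array([
    [67, 155, 12, 27, 0, 2, 102],
    [69, 175, 11, 35, 3, 6,  92],
    [71, 170, 14, 32, 1, 3, 111],
    [70, 160, 16, 25, 0, 1, 115],
    [72, 180, 12, 36, 2, 4, 108],
    [69, 170, 11, 41, 3, 5,  90],
    [74, 195, 13, 30, 1, 2, 113],
    [68, 160, 16, 32, 1, 3, 118],
    [70, 175, 12, 45, 4, 6, 121],
    [71, 180, 13, 24, 0, 2,  92],
    [66, 145, 10, 39, 2, 4, 100],
    [75, 210, 16, 26, 0, 1, 109],
    [70, 160, 12, 31, 0, 3, 102],
    [71, 175, 13, 43, 3, 5, 112],
], dtype=float)

R = np.corrcoef(X, rowvar=False)                  # 7 x 7 correlation matrix
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]    # descending
explained = eigvals / eigvals.sum()               # proportion of variance per factor
top3 = explained[:3].sum()                        # share captured by three factors
```

The three leading factors capture the great bulk of the variance in the seven original variables, consistent with the maturity / physical size / intelligence interpretation suggested above.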
