
Multivariate Analysis

Business Research Methods

Compiled by Dr. Sunil Bhardwaj


(From various online and published resources)

Multiple Regression
Q-1 What is Multiple Regression?
Ans :

Multiple regression is used to account for (predict) the variance in an interval dependent,
based on linear combinations of interval, dichotomous, or dummy independent variables.
Multiple regression can establish that a set of independent variables explains a proportion
of the variance in a dependent variable at a significant level (through a significance test
of R2), and can establish the relative predictive importance of the independent variables
(by comparing beta weights). Power terms can be added as independent variables to
explore curvilinear effects. Cross-product terms can be added as independent variables to
explore interaction effects. One can test the significance of difference of two R2's to
determine if adding an independent variable to the model helps significantly. Using
hierarchical regression, one can see how much variance in the dependent can be explained
by one or a set of new independent variables, over and above that explained by an earlier
set. Of course, the estimates (b coefficients and constant) can be used to construct a
prediction equation and generate predicted scores on a variable for further analysis.

The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's
are the regression coefficients, representing the amount the dependent variable y changes
when the corresponding independent changes 1 unit. The c is the constant, where the
regression line intercepts the y axis, representing the amount the dependent y will be
when all the independent variables are 0. The standardized versions of the b coefficients
are the beta weights, and the ratio of the beta coefficients is the ratio of the relative
predictive power of the independent variables. Associated with multiple regression is R2,
the squared multiple correlation, which is the percent of variance in the dependent variable
explained collectively by all of the independent variables.
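
To make the prediction equation concrete, the following is a minimal sketch that estimates the b coefficients and the constant c by ordinary least squares and then generates a predicted score. The data are made up, and the helper names (solve_linear_system, fit_ols, predict) and the use of the normal equations are illustrative assumptions, not something taken from the sources of these notes.

```python
def solve_linear_system(a, rhs):
    """Solve a * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    # Work on an augmented copy so the inputs are left untouched.
    m = [[float(v) for v in a[i]] + [float(rhs[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return (b, c) for the equation y = b1x1 + ... + bnxn + c."""
    # A trailing column of 1s carries the constant c.
    design = [list(r) + [1.0] for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve_linear_system(xtx, xty)
    return coef[:-1], coef[-1]


def predict(b, c, x):
    """Apply the prediction equation to one case."""
    return sum(bi * xi for bi, xi in zip(b, x)) + c


# Hypothetical data generated exactly from y = 2*x1 + 3*x2 + 1.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [2 * x1 + 3 * x2 + 1 for x1, x2 in rows]

b, c = fit_ols(rows, y)
print("b coefficients:", b)
print("constant:", c)
print("predicted y for x1=6, x2=2:", predict(b, c, (6, 2)))
```

Because the made-up data follow the equation exactly, the estimates should recover the generating coefficients up to rounding error; with real data the fit would leave residuals.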

Multiple regression shares all the assumptions of correlation: linearity of relationships,
the same level of relationship throughout the range of the independent variable
("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose
range is not truncated. In addition, it is important that the model being tested is correctly
specified. The exclusion of important causal variables or the inclusion of extraneous
variables can change markedly the beta weights and hence the interpretation of the
importance of the independent variables.

Q-2 What is R-square ?
Ans:

R2, also called the squared multiple correlation or the coefficient of multiple
determination, is the percent of the variance in the dependent explained uniquely or jointly
by the independents. R-squared can also be interpreted as the proportionate reduction in
error in estimating the dependent when knowing the independents. That is, R2 compares the
squared errors made when using the regression model to guess the value of the dependent
with the total squared errors made when using only the dependent's mean as the basis for
estimating all cases. Mathematically, R2 = 1 - (SSE/SST), where SSE = error sum of
squares = SUM((Yi - EstYi)^2), where Yi is the actual value of Y for the ith case
and EstYi is the regression prediction for the ith case; and where SST = total sum of
squares = SUM((Yi - MeanY)^2). The "residual sum of squares" in SPSS/SAS
output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a
percent of total error and will be 0 when regression error is as large as it would be if you
simply guessed the mean for all cases of Y. Put another way, the regression sum of
squares/total sum of squares = R-square, where the regression sum of squares = total sum
of squares - residual sum of squares.
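
The SSE/SST definition can be checked by hand on a few cases. The sketch below uses made-up actual and predicted values; the function name r_squared is an assumption chosen for illustration.

```python
def r_squared(actual, predicted):
    """R2 = 1 - SSE/SST, following the definitions in the text."""
    mean_y = sum(actual) / len(actual)
    # SSE: squared errors of the regression predictions.
    sse = sum((yi - est) ** 2 for yi, est in zip(actual, predicted))
    # SST: squared errors when the mean is guessed for every case.
    sst = sum((yi - mean_y) ** 2 for yi in actual)
    return 1 - sse / sst


# Hypothetical actual values and regression predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [3.5, 4.5, 7.5, 8.5]

r2_model = r_squared(actual, predicted)

# Guessing the mean for every case makes SSE equal to SST, so R2 is 0.
mean_guess = [sum(actual) / len(actual)] * len(actual)
r2_mean_only = r_squared(actual, mean_guess)

print("R2 for the model:", r2_model)
print("R2 when guessing the mean:", r2_mean_only)
```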

Q-3 What is Adjusted R-square and how is it calculated?


Ans:

Adjusted R-Square is an adjustment for the fact that when one has a large number of
independents, it is possible that R2 will become artificially high simply because some
independents' chance variations "explain" small parts of the variance of the dependent. At
the extreme, when there are as many independents as cases in the sample, R2 will always
be 1.0. The adjustment to the formula arbitrarily lowers R2 as p, the number of
independents, increases. Some authors conceive of adjusted R2 as the percent of variance
"explained in a replication, after subtracting out the contribution of chance." When used
for the case of a few independents, R2 and adjusted R2 will be close. When there are a
great many independents, adjusted R2 may be noticeably lower. The greater the number
of independents, the more the researcher is expected to report the adjusted coefficient.
Always use adjusted R2 when comparing models with different numbers of independents.

Adjusted R2 = 1 - ( (1 - R2)(N - 1) / (N - k - 1) ),
where N is sample size and k is the number of terms in the model not counting the
constant (i.e., the number of independents).
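
As a worked illustration of the adjustment, the sketch below applies the formula to a hypothetical R2 of .50 from a sample of 21 cases, first with 4 independents and then with 10. The function name and the numbers are assumptions made for the example.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R2 = 1 - (1 - R2)(N - 1) / (N - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need more cases than independents plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Same hypothetical R2 and sample size, different numbers of independents.
few_independents = adjusted_r_squared(0.50, n=21, k=4)
many_independents = adjusted_r_squared(0.50, n=21, k=10)

print("adjusted R2 with k=4: ", few_independents)
print("adjusted R2 with k=10:", many_independents)
```

The same R2 is penalized more heavily as k grows, which is why the adjusted value is the one to compare across models with different numbers of independents.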

Q-4 What is Multicollinearity and how is it measured?

Multicollinearity is the intercorrelation of independent variables. An R2 near 1 when an
independent is regressed on the other independents violates the assumption of no perfect
collinearity, while high R2's increase the standard error of the beta coefficients and make
assessment of the unique role of each independent difficult or
impossible. While simple correlations tell something about multicollinearity, the
preferred method of assessing multicollinearity is to regress each independent on all the
other independent variables in the equation. Inspection of the correlation matrix reveals
only bivariate multicollinearity, with the typical criterion being bivariate correlations >
.90. To assess multivariate multicollinearity, one uses tolerance or VIF, which build in
the regressing of each independent on all the others. Even when multicollinearity is
present, note that estimates of the importance of other variables in the equation (variables
which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal. Some types
are necessary to the research purpose.

Tolerance is 1 - R2 for the regression of that independent variable on all the other
independents, ignoring the dependent. There will be as many tolerance coefficients as
there are independents. The higher the intercorrelation of the independents, the more the
tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem
with multicollinearity is indicated.

When tolerance is close to 0 there is high multicollinearity of that variable with other
independents and the b and beta coefficients will be unstable. The more the
multicollinearity, the lower the tolerance and the larger the standard error of the
regression coefficients. Tolerance is part of the denominator in the formula for calculating
the confidence limits on the b (partial regression) coefficient.

Variance-inflation factor (VIF). VIF is simply the reciprocal of tolerance. Therefore, when
VIF is high there is high multicollinearity and instability of the b and beta coefficients.
VIF and tolerance are found in the SPSS and SAS output section on collinearity statistics.
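
Tolerance and VIF can be illustrated in the simplest setting of two independents, where the R2 from regressing one independent on the other equals their squared Pearson correlation. The sketch below relies on that special case; the data and function name are made up, and with three or more independents each tolerance would instead require a full regression of that independent on all the others.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical independent variables measured on five cases.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

r = pearson_r(x1, x2)
tolerance = 1 - r ** 2      # 1 - R2 of one independent regressed on the other
vif = 1 / tolerance         # VIF is the reciprocal of tolerance

print("correlation:", r)
print("tolerance:", tolerance)
print("VIF:", vif)
print("below the .20 tolerance rule of thumb:", tolerance < 0.20)
```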

Condition indices and variance proportions.

Condition indices are used to flag excessive collinearity in the data. A condition index
over 30 suggests serious collinearity problems and an index over 15 indicates possible
collinearity problems. If a factor (component) has a high condition index, one looks in the
variance proportions column. Criteria for "sizable proportion" vary among researchers
but the most common criterion is if two or more variables have a variance proportion of .50
or higher on a factor with a high condition index. If this is the case, these variables have
high linear dependence and multicollinearity is a problem, with the effect that small data
changes or arithmetic errors may translate into very large changes or errors in the
regression analysis. Note that it is possible for the rule of thumb for condition indices (no
index over 30) to indicate multicollinearity, even when the rules of thumb for tolerance >
.20 or VIF < 4 suggest no multicollinearity. Computationally, a "singular value" is the
square root of an eigenvalue, and "condition indices" are the ratio of the largest singular
value to each other singular value. In SPSS, select Analyze, Regression, Linear; click
Statistics; check Collinearity diagnostics to get condition indices.

Q-5 What is homoscedasticity ?

Homoscedasticity: The researcher should test to assure that the residuals are dispersed
randomly throughout the range of the estimated dependent. Put another way, the
variance of residual error should be constant for all values of the independent(s). If not,
separate models may be required for the different ranges. Also, when the
homoscedasticity assumption is violated, "conventionally computed confidence
intervals and conventional t-tests for OLS estimators can no longer be justified."
However, moderate violations of homoscedasticity have only minor impact on
regression estimates.

Nonconstant error variance can be observed by requesting a simple residual plot (a plot
of residuals on the Y axis against predicted values on the X axis). A homoscedastic
model will display a cloud of dots, whereas lack of homoscedasticity will be
characterized by a pattern such as a funnel shape, indicating greater error as the
dependent increases. Nonconstant error variance can indicate the need to respecify the
model to include omitted independent variables.

Lack of homoscedasticity may mean (1) there is an interaction effect between a
measured independent variable and an unmeasured independent variable not in the
model; or (2) that some independent variables are skewed while others are not.

One method of dealing with heteroscedasticity is to select the weighted least squares
regression option. This causes cases with smaller residuals to be weighted more in
calculating the b coefficients. Square root, log, and reciprocal transformations of the
dependent may also reduce or eliminate lack of homoscedasticity.

Sources & Suggested Readings:

http://www2.chass.ncsu.edu/garson/pa765/regress.htm

www.cs.uu.nl/docs/vakken/arm/SPSS/spss4.pdf

Kahane, Leo H. (2001). Regression basics. Thousand Oaks, CA: Sage Publications.

Menard, Scott (1995). Applied logistic regression analysis. Thousand Oaks, CA: Sage
Publications. Series: Quantitative Applications in the Social Sciences, No. 106.

Miles, Jeremy and Mark Shevlin (2001). Applying regression and correlation. Thousand
Oaks, CA: Sage Publications. Introductory text built around model-building.

Discriminant Analysis
Q-1 What is Discriminant Analysis?
Ans:

Discriminant function analysis, a.k.a. discriminant analysis or DA, is used to classify
cases into the values of a categorical dependent, usually a dichotomy. If discriminant
function analysis is effective for a set of data, the classification table of correct and
incorrect estimates will yield a high percentage correct. Discriminant function analysis is
found in SPSS under Analyze, Classify, Discriminant. One gets DA or MDA from
this same menu selection, depending on whether the specified grouping variable has two
categories or more than two.

Multiple discriminant analysis (MDA) is an extension of discriminant analysis and a
cousin of multivariate analysis of variance (MANOVA), sharing many of the same
assumptions and tests. MDA is used to classify a categorical dependent which has more
than two categories, using as predictors a number of interval or dummy independent
variables. MDA is sometimes also called discriminant factor analysis or canonical
discriminant analysis.

There are several purposes for DA and/or MDA:

• To classify cases into groups using a discriminant prediction equation.
• To test theory by observing whether cases are classified as predicted.
• To investigate differences between or among groups.
• To determine the most parsimonious way to distinguish among groups.
• To assess the relative importance of the independent variables in classifying the
dependent variable.
• To infer the meaning of MDA dimensions which distinguish groups, based on
discriminant loadings.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the
discriminant model as a whole is significant, and (2) if the F test shows significance, then
the individual independent variables are assessed to see which differ significantly in
mean by group and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and
homoscedastic relationships, and untruncated interval or near interval data. Like multiple
regression, it also assumes proper model specification (inclusion of all important
independents and exclusion of extraneous variables). DA also assumes the dependent
variable is a true dichotomy since data which are forced into dichotomous coding are
truncated, attenuating correlation.

DA is an earlier alternative to logistic regression, which is now frequently used in place
of DA as it usually involves fewer violations of assumptions (independent variables
needn't be normally distributed, linearly related, or have equal within-group variances), is
robust, handles categorical as well as continuous variables, and has coefficients which
many find easier to interpret. Logistic regression is preferred when data are not normal in
distribution or group sizes are very unequal. See also the separate topic on multiple
discriminant function analysis (MDA) for dependents with more than two categories.

Few Definitions and Concepts

Discriminating variables: These are the independent variables, also called
predictors.

The criterion variable. This is the dependent variable, also called the grouping
variable in SPSS. It is the object of classification efforts.

Discriminant function: A discriminant function, also called a canonical root, is a
latent variable which is created as a linear combination of discriminating
(independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are
discriminant coefficients, the x's are discriminating variables, and c is a constant.

This is analogous to multiple regression, but the b's are discriminant coefficients
which maximize the distance between the means of the criterion (dependent) variable.
Note that the foregoing assumes the discriminant function is estimated using ordinary
least-squares, the traditional method, but there is also a version involving maximum
likelihood estimation.

Number of discriminant functions. There is one discriminant function for 2-group
discriminant analysis, but for higher order DA, the number of functions (each with its
own cut-off value) is the lesser of (g - 1), where g is the number of categories in the
grouping variable, or p, the number of discriminating (independent) variables. Each
discriminant function is orthogonal to the others. A dimension is simply one of the
discriminant functions when there are more than one, in multiple discriminant
analysis.

The eigenvalue, also called the characteristic root of each discriminant function,
reflects the ratio of importance of the dimensions which classify cases of the
dependent variable. There is one eigenvalue for each discriminant function. For two-
group DA, there is one discriminant function and one eigenvalue, which accounts for
100% of the explained variance. If there is more than one discriminant function, the
first will be the largest and most important, the second next most important in
explanatory power, and so on. The eigenvalues assess relative importance because
they reflect the percents of variance explained in the dependent variable, cumulating
to 100% for all functions. That is, the ratio of the eigenvalues indicates the relative
discriminating power of the discriminant functions. If the ratio of two eigenvalues is
1.4, for instance, then the first discriminant function accounts for 40% more
between-group variance in the dependent categories than does the second discriminant
function. Eigenvalues are part of the default output in SPSS (Analyze, Classify,
Discriminant).

The relative percentage of a discriminant function equals a function's eigenvalue
divided by the sum of all eigenvalues of all discriminant functions in the model. Thus
it is the percent of discriminating power for the model associated with a given
discriminant function. Relative % is used to tell how many functions are important.
One may find that only the first two or so eigenvalues are of importance.
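
The relative percentage is a simple share-of-total calculation. The sketch below applies it to three hypothetical eigenvalues; the values and the function name are assumptions for illustration only.

```python
def relative_percentages(eigenvalues):
    """Each eigenvalue as a percent of the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [100 * ev / total for ev in eigenvalues]


# Hypothetical eigenvalues for a model with three discriminant functions.
eigenvalues = [1.4, 0.5, 0.1]
relative = relative_percentages(eigenvalues)

for number, (ev, pct) in enumerate(zip(eigenvalues, relative), start=1):
    print(f"function {number}: eigenvalue {ev}, relative % {pct:.1f}")
```

In this made-up case the first two functions carry nearly all of the discriminating power, which is the kind of pattern the relative % is meant to reveal.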

The canonical correlation, R, is a measure of the association between the groups
formed by the dependent and the given discriminant function. When R is zero, there
is no relation between the groups and the function. When the canonical correlation is
large, there is a high correlation between the discriminant functions and the groups.
Note that relative % and R do not have to be correlated. R is used to tell how much
each function is useful in determining group differences. An R of 1.0 indicates that all
of the variability in the discriminant scores can be accounted for by that dimension.
Note that for two-group DA, the canonical correlation is equivalent to the Pearsonian
correlation of the discriminant scores with the grouping variable.

The discriminant score, also called the DA score, is the value resulting from
applying a discriminant function formula to the data for a given case. The Z score is
the discriminant score for standardized data. To get discriminant scores in SPSS,
select Analyze, Classify, Discriminant; click the Save button; check "Discriminant
scores". One can also view the discriminant scores by clicking the Classify button and
checking "Casewise results."

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the
case is classed as 0, or if above it is classed as 1. When group sizes are equal, the
cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal,
the cutoff is the weighted mean.

Unstandardized discriminant coefficients are used in the formula for making the
classifications in DA, much as b coefficients are used in regression in making
predictions. The constant plus the sum of products of the unstandardized coefficients
with the observations yields the discriminant scores. That is, discriminant coefficients
are the regression-like b coefficients in the discriminant function, in the form

L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the
discriminant function, the b's are discriminant coefficients, the x's are discriminating
variables, and c is a constant. The discriminant function coefficients are partial
coefficients, reflecting the unique contribution of each variable to the classification of
the criterion variable. The standardized discriminant coefficients, like beta weights in
regression, are used to assess the relative classifying importance of the independent
variables.
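
The scoring and cutoff rules can be sketched for two-group DA with equal group sizes. The coefficients, constant, and cases below are hypothetical rather than estimated from data, and the classification rule assumes the group coded 1 has the higher centroid.

```python
def discriminant_score(b, c, x):
    """L = b1x1 + b2x2 + ... + bnxn + c for one case."""
    return sum(bi * xi for bi, xi in zip(b, x)) + c


def centroid(scores):
    """Mean discriminant score of a group."""
    return sum(scores) / len(scores)


def classify(score, cutoff):
    """Class 0 if the score is at or below the cutoff, otherwise class 1."""
    return 0 if score <= cutoff else 1


# Hypothetical unstandardized coefficients and constant.
b = [0.8, -0.5]
c = -1.0

# Hypothetical cases (x1, x2) for two groups of equal size.
group0 = [(1.0, 2.0), (2.0, 3.0), (1.5, 2.5)]
group1 = [(4.0, 1.0), (5.0, 2.0), (4.5, 1.5)]

scores0 = [discriminant_score(b, c, x) for x in group0]
scores1 = [discriminant_score(b, c, x) for x in group1]

# With equal group sizes the cutoff is the mean of the two centroids.
cutoff = (centroid(scores0) + centroid(scores1)) / 2

new_cases = [(3.5, 1.0), (2.0, 2.0)]
predicted = [classify(discriminant_score(b, c, x), cutoff) for x in new_cases]

print("cutoff:", cutoff)
print("predicted classes:", predicted)
```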

Standardized discriminant coefficients, also termed the standardized canonical
discriminant function coefficients, are used to compare the relative importance of the
independent variables, much as beta weights are used in regression. Note that
importance is assessed relative to the model being analyzed. Addition or deletion of
variables in the model can change discriminant coefficients markedly.

As with regression, since these are partial coefficients, only the unique explanation of
each independent is being compared, not considering any shared explanation. Also, if
there are more than two groups of the dependent, the standardized discriminant
coefficients do not tell the researcher between which groups the variable is most or
least discriminating. For this purpose, group centroids and factor structure are
examined.

Q-2 What is Wilks' Lambda?

Wilks' lambda is used to test the significance of the discriminant function as a
whole. In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of
Function(s)" and a row labeled "1 through n" (where n is the number of discriminant
functions). The "Sig." level for this row is the significance level of the discriminant
function as a whole. A significant lambda means one can reject the null hypothesis
that the two groups have the same mean discriminant function scores. Wilks' lambda
is part of the default output in SPSS (Analyze, Classify, Discriminant). In SPSS, this
use of Wilks' lambda is in the "Wilks' lambda" table of the output section on
"Summary of Canonical Discriminant Functions."

ANOVA table for discriminant scores is another overall test of the DA model. It is
an F test, where a "Sig." p value < .05 means the model differentiates discriminant
scores between the groups significantly better than chance (than a model with just the
constant). It is obtained in SPSS by asking for Analyze, Compare Means, One-Way
ANOVA, using discriminant scores from DA (which SPSS will label Dis1_1 or
similar) as dependent.

Wilks' lambda also can be used to test which independents contribute significantly to
the discriminant function. The smaller the lambda for an independent variable, the
more that variable contributes to the discriminant function. Lambda varies from 0 to
1, with 0 meaning group means differ (thus the more the variable differentiates the
groups), and 1 meaning all group means are the same. The F test of Wilks' lambda
shows which variables' contributions are significant. Wilks' lambda is sometimes
called the U statistic. In SPSS, this use of Wilks' lambda is in the "Tests of equality of
group means" table in DA output.

Q-3 What is Confusion or classification Matrix ?

Ans:

The classification table, also called a classification matrix, or a confusion,
assignment, or prediction matrix or table, is used to assess the performance of DA.
This is simply a table in which the rows are the observed categories of the dependent
and the columns are the predicted categories of the dependents. When prediction is
perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is
the percentage of correct classifications. This percentage is called the hit ratio.
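
The hit ratio can be read straight off the classification table, as in this minimal sketch. The counts are made up, and the function name hit_ratio is an assumption for illustration.

```python
def hit_ratio(table):
    """Percent of cases on the diagonal of a classification table.

    Rows are observed categories, columns are predicted categories.
    """
    total = sum(sum(row) for row in table)
    correct = sum(table[i][i] for i in range(len(table)))
    return 100 * correct / total


# Hypothetical two-group classification table.
#          predicted 0  predicted 1
table = [
    [40, 10],   # observed 0
    [5, 45],    # observed 1
]

print("hit ratio (%):", hit_ratio(table))
```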

Expected hit ratio. Note that the hit ratio must be compared not to zero but to the
percent that would have been correctly classified by chance alone. For two-group
discriminant analysis with a 50-50 split in the dependent variable, the expected
percent is 50%. For unequally split 2-way groups of different sizes, the expected
percent is computed in the "Prior Probabilities for Groups" table in SPSS, by
multiplying the prior probabilities times the group size, summing for all groups, and
dividing the sum by N.
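
The expected hit ratio calculation described above can be sketched as follows. The group sizes are hypothetical, and the sketch assumes the prior probabilities are taken from the group sizes unless other priors are supplied.

```python
def expected_hit_ratio(group_sizes, priors=None):
    """Percent expected to be classified correctly by chance alone.

    Multiplies each prior probability by its group size, sums over the
    groups, and divides by N, as described in the text.
    """
    n = sum(group_sizes)
    if priors is None:
        # Assumption: priors are computed from the observed group sizes.
        priors = [size / n for size in group_sizes]
    return 100 * sum(p * size for p, size in zip(priors, group_sizes)) / n


equal_split = expected_hit_ratio([50, 50])
unequal_split = expected_hit_ratio([70, 30])

print("expected % for a 50-50 split:", equal_split)
print("expected % for a 70-30 split:", unequal_split)
```

An observed hit ratio is then judged against this chance baseline rather than against zero.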

Sources & Suggested Readings:

http://faculty.chass.ncsu.edu/garson/PA765/discrim2.htm

Huberty, Carl J. (1994). Applied discriminant analysis. NY: Wiley-Interscience.
(Wiley Series in Probability and Statistics).

Klecka, William R. (1980). Discriminant analysis. Quantitative Applications in the
Social Sciences Series, No. 19. Thousand Oaks, CA: Sage Publications.

Lachenbruch, P. A. (1975). Discriminant analysis. NY: Hafner.

Cluster Analysis
Q-1 What is Cluster Analysis ?

Ans:
Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to
identify homogeneous subgroups of cases in a population. That is, cluster analysis
seeks to identify a set of groups which both minimize within-group variation and
maximize between-group variation. Other techniques, such as latent class analysis
and Q-mode factor analysis, also perform clustering and are discussed separately.

SPSS offers three general approaches to cluster analysis. Hierarchical clustering
allows users to select a definition of distance, then select a linking method of forming
clusters, then determine how many clusters best suit the data. In k-means clustering
the researcher specifies the number of clusters in advance, then calculates how to
assign cases to the K clusters. K-means clustering is much less computer-intensive
and is therefore sometimes preferred when datasets are very large (e.g., > 1,000).
Finally, two-step clustering creates pre-clusters, then it clusters the pre-clusters.

Key Concepts and Terms


Cluster formation is the selection of the procedure for determining how clusters are
created, and how the calculations are done. In agglomerative hierarchical clustering
every case is initially considered a cluster, then the two cases with the lowest distance (or
highest similarity) are combined into a cluster. The case with the lowest distance to either
of the first two is considered next. If that third case is closer to a fourth case than it is to
either of the first two, the third and fourth cases become the second two-case cluster; if
not, the third case is added to the first cluster. The process is repeated, adding cases to
existing clusters, creating new clusters, or combining clusters to get to the desired final
number of clusters. There is also divisive clustering, which works in the opposite
direction, starting with all cases in one large cluster. Hierarchical cluster analysis,
discussed below, can use either agglomerative or divisive clustering strategies.

Similarity and Distance

Distance. The first step in cluster analysis is establishment of the similarity or distance
matrix. This matrix is a table in which both the rows and columns are the units of analysis
and the cell entries are a measure of similarity or distance for any pair of cases.

Euclidean distance is the most common distance measure. A given pair of cases is plotted
on two variables, which form the x and y axes. The Euclidean distance is the square root
of the sum of the squared x difference and the squared y difference. (Recall high
school geometry: this is the formula for the length of the hypotenuse of a right triangle.)
It is also common to use the square of Euclidean distance, which places greater weight on
cases that are far apart. When two or more variables are used to define distance, the one
with the larger magnitude will dominate, so to avoid this it is common to first standardize
all variables.
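
The effect of standardizing before computing distances can be seen in a small sketch. The three cases below are hypothetical (an income-like variable and a years-like variable), and the helper names are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Square root of the sum of squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def standardize(rows):
    """Convert each variable (column) to z scores using the sample SD."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
        for col, m in zip(columns, means)
    ]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for row in rows
    ]


# Hypothetical cases: (income, years). Income has a far larger magnitude.
cases = [(30000.0, 2.0), (31000.0, 9.0), (60000.0, 3.0)]

raw_01 = euclidean(cases[0], cases[1])
raw_02 = euclidean(cases[0], cases[2])

z = standardize(cases)
z_01 = euclidean(z[0], z[1])
z_02 = euclidean(z[0], z[2])

print("raw distances:         ", raw_01, raw_02)
print("standardized distances:", z_01, z_02)
```

On the raw scale, income decides which cases look close. After standardization the second variable contributes on an equal footing, and in this made-up example the ordering of the two distances reverses.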

There are a variety of different measures of inter-observation distances and inter-cluster
distances to use as criteria when merging nearest clusters into broader groups or when
considering the relation of a point to a cluster. SPSS supports these interval distance
measures: Euclidean distance, squared Euclidean distance, Chebychev, block,
Minkowski, or customized; for count data, chi-square or phi-square. For binary data, it
supports Euclidean distance, squared Euclidean distance, size difference, pattern
difference, variance, shape, or Lance and Williams.

Similarity. Distance measures how far apart two observations are. Cases which are alike
share a low distance. Similarity measures how alike two cases are. Cases which are alike
share a high similarity. SPSS supports a large number of similarity measures for interval
data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple
matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2,
Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda,
Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or
dispersion).

Absolute values. Since for Pearson correlation, high negative as well as high positive
values indicate similarity, the researcher normally selects absolute values. This can be
done by checking the absolute value checkbox in the Transform Measures area of the
Method subdialog (invoked by pressing the Method button) of the main Cluster dialog.

Summary. In SPSS, similarity/distance measures are selected in the Measure area of the
Method subdialog obtained by pressing the Method button in the Classify dialog. There
are three measure pulldown menus, for interval, binary, and count data respectively. The
proximity matrix table in the output shows the actual distances or similarities computed
for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Cluster,
Hierarchical clustering; Statistics button; check proximity matrix.

Method. Under the Method button in the SPSS Classify dialog, the pull-down Method
selection determines how cases or clusters are combined at each step. Different methods
will result in different cluster patterns. SPSS offers these method choices:

Nearest neighbor. In this single linkage method, the distance between two clusters is the
distance between their closest neighboring points.

Furthest neighbor. In this complete linkage method, the distance between two clusters is
the distance between their two furthest member points.

UPGMA (unweighted pair-group method using averages). The distance between two
clusters is the average distance between all inter-cluster pairs. UPGMA is generally
preferred over nearest or furthest neighbor methods since it is based on information about
all inter-cluster pairs, not just the nearest or furthest ones, and it is the default method
in SPSS. SPSS labels this "between-groups linkage."
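
The nearest neighbor, furthest neighbor, and UPGMA definitions differ only in how the inter-cluster pair distances are summarized. The sketch below computes all three for two small hypothetical clusters; the function names are assumptions for illustration.

```python
import math
from itertools import product


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pair_distances(cluster_1, cluster_2):
    """Distances for every pair with one point taken from each cluster."""
    return [euclidean(a, b) for a, b in product(cluster_1, cluster_2)]


def single_linkage(cluster_1, cluster_2):
    """Nearest neighbor: the smallest inter-cluster pair distance."""
    return min(pair_distances(cluster_1, cluster_2))


def complete_linkage(cluster_1, cluster_2):
    """Furthest neighbor: the largest inter-cluster pair distance."""
    return max(pair_distances(cluster_1, cluster_2))


def average_linkage(cluster_1, cluster_2):
    """UPGMA: the mean of all inter-cluster pair distances."""
    distances = pair_distances(cluster_1, cluster_2)
    return sum(distances) / len(distances)


# Two hypothetical clusters of two cases each, measured on two variables.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

nearest = single_linkage(cluster_a, cluster_b)
furthest = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)

print("nearest neighbor: ", nearest)
print("furthest neighbor:", furthest)
print("UPGMA:            ", average)
```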

Average linkage within groups is the mean distance between all possible inter- or
intra-cluster pairs. The average distance between all pairs in the resulting cluster is made
to be as small as possible. This method is therefore appropriate when the research purpose is
homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method calculates the sum of squared Euclidean distances from each case in a
cluster to the mean of all variables. The cluster to be merged is the one which will
increase the sum the least. This is an ANOVA-type approach and preferred by some
researchers for this reason.
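
Ward's criterion can be sketched as the increase in the within-cluster sum of squares that a candidate merge would cause. The clusters and function names below are hypothetical, and the sketch evaluates single candidate merges rather than running a full clustering.

```python
def within_ss(cluster):
    """Sum of squared Euclidean distances from each case to the cluster mean."""
    columns = list(zip(*cluster))
    means = [sum(col) / len(col) for col in columns]
    return sum(
        sum((v - m) ** 2 for v, m in zip(case, means))
        for case in cluster
    )


def ward_increase(cluster_1, cluster_2):
    """Increase in the total within-cluster sum of squares if the two merge."""
    merged = cluster_1 + cluster_2
    return within_ss(merged) - within_ss(cluster_1) - within_ss(cluster_2)


# Three hypothetical clusters measured on two variables.
a = [(0.0, 0.0), (0.0, 2.0)]
b = [(1.0, 1.0), (1.0, 3.0)]
c = [(10.0, 10.0), (10.0, 12.0)]

cost_ab = ward_increase(a, b)
cost_ac = ward_increase(a, c)

print("increase if a and b merge:", cost_ab)
print("increase if a and c merge:", cost_ac)
```

The pair with the smaller increase (here a and b) is the one that would be merged at this step.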

Centroid method. The cluster to be merged is the one with the smallest sum of Euclidean
distances between cluster means for all variables.

Median method. Clusters are weighted equally regardless of group size when computing
centroids of two clusters being combined. This method also uses Euclidean distance as
the proximity measure.

Correlation of items can be used as a similarity measure. One transposes the normal data
table in which columns are variables and rows are cases. By using columns as cases and
rows as variables instead, the correlation is between cases and these correlations may
constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0
indicates no match between any pair of cases. There are multiple matched attributes and
the similarity score is the number of matches divided by the number of attributes being
matched. Note that it is usual in binary matching to have several attributes because there
is a risk that when the number of attributes is small, they may be orthogonal to
(uncorrelated with) one another, and clustering will be indeterminate.
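
The binary matching score can be sketched as the share of attributes on which two cases agree. This sketch assumes that agreement on 0 counts as a match just as agreement on 1 does (the simple matching coefficient); the cases and function name are made up.

```python
def matching_similarity(case_a, case_b):
    """Number of matching attributes divided by the number of attributes."""
    if len(case_a) != len(case_b):
        raise ValueError("cases must have the same number of attributes")
    # Assumption: both 1-1 and 0-0 agreements count as matches.
    matches = sum(1 for x, y in zip(case_a, case_b) if x == y)
    return matches / len(case_a)


# Two hypothetical cases scored on six binary attributes.
case_1 = [1, 0, 1, 1, 0, 1]
case_2 = [1, 1, 1, 0, 0, 1]

similarity = matching_similarity(case_1, case_2)
print("similarity:", similarity)
```

Other binary measures listed earlier, such as Jaccard, differ mainly in whether joint absences are counted.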

Summary measures assess how the clusters differ from one another.

Means and variances. A table of means and variances of the clusters with respect to the
original variables shows how the clusters differ on the original variables. SPSS does not
make this available in the Cluster dialog, but one can click the Save button, which will
save the cluster number for each case (or numbers if multiple solutions are requested).
Then in Analyze, Compare Means, Means the researcher can use the cluster number as
the grouping variable to compare differences of means on any other continuous variable
in the dataset.

Linkage tables show the relation of the cases to the clusters.

Cluster membership table. This shows cases as rows, where columns are alternative
numbers of clusters in the solution (as specified in the "Range of Solution" option in the
Cluster membership group in SPSS, under the Statistics button). Cell entries show the
number of the cluster to which the case belongs. From this table, one can see which cases
are in which groups, depending on the number of clusters in the solution.

Agglomeration Schedule. Agglomeration schedule is a choice under the Statistics button
for Hierarchical Cluster in the SPSS Cluster dialog. In this table, the rows are stages of
clustering, numbered from 1 to (n - 1). The (n - 1)th stage includes all the cases in one
cluster. There are two "Cluster Combined" columns, giving the case or cluster numbers
for combination at each stage. In agglomerative clustering using a distance measure like
Euclidean distance, stage 1 combines the two cases which have lowest proximity
(distance) score. The cluster number goes by the lower of the cases or clusters combined,
where cases are initially numbered 1 to n. For instance, at Stage 1, cases 3 and 18 might
be combined, resulting in a cluster labeled 3. Later cluster 3 and case 2 might be
combined, resulting in a cluster labeled 2. The researcher looks at the "Coefficients"
column of the agglomeration schedule and notes when the proximity coefficient jumps up
and is not a small increment from the one before (or when the coefficient reaches some
theoretically important level). Note that for distance measures, low is good, meaning the
cases are alike; for similarity measures, high coefficients mean cases are alike. After the
stopping stage is determined in this manner, the researcher can work backward to
determine how many clusters there are and which cases belong to which clusters (but it is
easier just to get this information from the cluster membership table). Note, though, that
SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to
4 clusters) requested by the researcher in the Cluster Membership group of the Statistics
button in the Hierarchical Clustering dialog. When there are relatively few cases, icicle
plots or dendrograms provide the same linkage information in an easier format.
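
To make the schedule concrete, here is a minimal Python sketch that builds one for five made-up cases and then looks for the largest jump in the coefficient. It is not the SPSS implementation: single linkage (nearest neighbor) and Euclidean distance are assumed for simplicity, and the function name and data are hypothetical. The labeling rule follows the text, with a merged cluster keeping the lower of the two numbers.

```python
import math


def agglomeration_schedule(points):
    """Single-linkage agglomeration; returns (stage, cluster_a, cluster_b, coefficient) rows."""
    # Clusters are keyed by the lowest case number they contain (cases numbered from 1).
    clusters = {i + 1: [p] for i, p in enumerate(points)}
    schedule = []
    stage = 0
    while len(clusters) > 1:
        best = None
        labels = sorted(clusters)
        for i, a in enumerate(labels):
            for b in labels[i + 1:]:
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(p, q) for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a].extend(clusters.pop(b))  # merged cluster keeps the lower label
        stage += 1
        schedule.append((stage, a, b, d))
    return schedule


# Hypothetical cases measured on two variables.
cases = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (5.0, 6.5), (9.0, 9.0)]
schedule = agglomeration_schedule(cases)
for row in schedule:
    print(row)

# Find the largest jump between consecutive coefficients and stop just before it.
coefficients = [row[3] for row in schedule]
jumps = [later - earlier for earlier, later in zip(coefficients, coefficients[1:])]
stop_after_stage = jumps.index(max(jumps)) + 1
clusters_remaining = len(cases) - stop_after_stage
print(stop_after_stage, clusters_remaining)
```

With these values the coefficient rises sharply between stages 2 and 3, so stopping after stage 2 leaves three clusters. Taking the single largest jump is only one way to read the column; the text also allows stopping at a theoretically important level.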

Linkage plots show similar information in graphic form.

Icicle plots are usually horizontal, showing cases as rows and number of clusters in the
solution as columns. If there are few cases, vertical icicle plots may be plotted, with cases as
columns. Reading from the last column right to left (horizontal icicle plots) or last row
bottom to top (vertical icicle plots), the researcher can see how agglomeration proceeded.
The last/bottom row will show all the cases in separate one-case clusters. This is the (n -
1) solution. The next-to-last/bottom column/row will show the (n-2) solution, with two
cases combined into one cluster. Subsequent columns/rows show further clustering steps.
Row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) will show all cases in a
single cluster. This is a visual way of representing information on the agglomeration
schedule, but without the proximity coefficient information.

Dendrograms, also called tree diagrams, show the relative size of the proximity
coefficients at which cases were combined. The bigger the distance coefficient or the
smaller the similarity coefficient, the more clustering involved combining unlike entities,
which may be undesirable. Trees are usually depicted horizontally, not vertically, with
each row representing a case on the Y axis, while the X axis is a rescaled version of the
proximity coefficients. Cases with low distance/high similarity are close together. Cases
showing low distance are close, with a line linking them a short distance from the left of
the dendrogram, indicating that they are agglomerated into a cluster at a low distance
coefficient, indicating alikeness. When, on the other hand, the linking line is to the right
of the dendrogram, the linkage occurs at a high distance coefficient, indicating the
cases/clusters were agglomerated even though much less alike. If a similarity measure is
used rather than a distance measure, the rescaling of the X axis still produces a diagram
with linkages involving high alikeness to the left and low alikeness to the right. In SPSS,
select Analyze, Classify, Hierarchical Cluster; click the Plots button, check the
Dendrogram checkbox.

What is Hierarchical Cluster Analysis?

Hierarchical clustering is appropriate for smaller samples (typically < 250). To
accomplish hierarchical clustering, the researcher must specify how similarity or distance
is defined, how clusters are aggregated (or divided), and how many clusters are needed.
Hierarchical clustering generates all possible clusters of sizes 1...K, but is used only for
relatively small samples. In hierarchical clustering, the clusters are nested rather than
being mutually exclusive, as is the usual case. That is, in hierarchical clustering, larger
clusters created at later stages may contain smaller clusters created at earlier stages of
agglomeration.

One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to
inspect results for different numbers of clusters. The optimum number of clusters
depends on the research purpose. Identifying "typical" types may call for few clusters and
identifying "exceptional" types may call for many clusters. After using hierarchical
clustering to determine the desired number of clusters, the researcher may wish then to
analyze the entire dataset with k-means clustering (aka, the Quick Cluster procedure:
Analyze, Classify, K-Means Cluster Analysis), specifying that number of clusters.

Forward clustering, also called agglomerative clustering: Small clusters are formed by
using a high similarity index cut-off (ex., .9). Then this cut-off is relaxed to establish
broader and broader clusters in stages until all cases are in a single cluster at some low
similarity index cut-off. The merging of clusters is visualized using a tree format.

Backward clustering, also called divisive clustering, is the same idea, but starting with a
low cut-off and working toward a high cut-off. Forward and backward methods need not
generate the same results.

Clustering variables. In the Hierarchical Cluster dialog, in the Cluster group, the
researcher may select Variables rather than the usual Cases, in order to cluster variables.

SPSS calls hierarchical clustering the "Cluster procedure." In SPSS, select Analyze,
Classify, Hierarchical Cluster; select variables; select Cases in the Cluster group; click
Statistics, select Proximity Matrix; select Range of Solutions in the Cluster Membership
group, specify the number of clusters (typically 3 to 6); Continue; OK.

What is K-means Cluster Analysis?

K-means cluster analysis. K-means cluster analysis uses Euclidean distance. The
researcher must specify in advance the desired number of clusters, K. Initial cluster
centers are chosen in a first pass of the data, then each additional iteration groups
observations based on nearest Euclidean distance to the mean of the cluster. Cluster
centers change at each pass. The process continues until cluster means do not shift more
than a given cut-off value or the iteration limit is reached.
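
The iteration described above can be sketched in plain Python as follows. This is a simplified illustration, not the SPSS Quick Cluster code: the function name, the data, and the choice to pass the initial centers in by hand are assumptions, and the cut-off is an absolute shift in a center rather than any SPSS-specific rule.

```python
import math


def k_means(cases, initial_centers, max_iter=10, cutoff=0.0):
    """Assign cases to the nearest center, update centers, repeat until stable."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iter):
        # Group each case with the center at the smallest Euclidean distance.
        groups = [[] for _ in centers]
        for case in cases:
            nearest = min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))
            groups[nearest].append(case)
        # New center = mean of the members on every variable (unchanged if empty).
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        shift = max(math.dist(old, new) for old, new in zip(centers, new_centers))
        centers = new_centers
        if shift <= cutoff:
            break
    # Final classification against the last set of centers.
    labels = [
        min(range(len(centers)), key=lambda j: math.dist(case, centers[j]))
        for case in cases
    ]
    return centers, labels


# Hypothetical cases on two variables, with K = 2.
cases = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = k_means(cases, initial_centers=[cases[0], cases[3]])
print(centers)
print(labels)
```

Passing max_iter=0 skips the updating loop and classifies cases against the initial centers only.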

Cluster centers are the average value on all clustering variables of each cluster's
members. The "Initial cluster centers," in spite of its title, gives the average value of each
variable for each cluster for the k well-spaced cases which SPSS selects for initialization
purposes when no initial file is supplied. The "Final cluster centers" table in SPSS output
gives the same thing for the last iteration step. The "Iteration history" table shows the
change in cluster centers when the usual iterative approach is taken. When the change
drops below a specified cutoff, the iterative process stops and cases are assigned to
clusters according to which cluster center they are nearest.

Large datasets are possible with K-means clustering, unlike hierarchical clustering,
because K-means clustering does not require prior computation of a proximity matrix of
the distance/similarity of every case with every other case.

Method. The default method is "Iterate and classify," under which an iterative process is
used to update cluster centers, then cases are classified based on the updated centers.
However, SPSS supports a "Classify only" method, under which cases are immediately
classified based on the initial cluster centers, which are not updated.

Agglomerative K-means clustering. Normally in K-means clustering, a given case may be
assigned to a cluster, then reassigned to a different cluster as the algorithm unfolds.
However, in agglomerative K-means clustering, the solution is constrained to force a
given case to remain in its initial cluster.

SPSS: Analyze, Classify, K-Means Cluster Analysis; enter variables in the Variables: area;
optionally, enter a variable in the "Label cases by:" area; enter "Number of clusters:";
choose Method: (Iterate and classify, or just Classify). Unlike hierarchical clustering,
there is no option for "Range of solutions"; instead you must re-run K-means clustering,
asking for a different number of clusters.

Iterate button. Optionally, you may press the Iterate button and set the number of
iterations and the convergence criterion. The default maximum number of iterations in
SPSS is 10. For the convergence criterion, by default, iterations terminate if the largest
change in any cluster center is less than 2% of the minimum distance between initial
centers (or if the maximum number of iterations has been reached). To override this
default, enter a positive number less than or equal to 1 in the convergence box. There is
also a "Use running means" checkbox which, if checked, will cause the cluster centers to
be updated after each case is classified, not the default, which is after the entire set of
cases is classified.

Save button: Optionally, you may press the Save button to save the final cluster number
of each case as an added column in your dataset (labeled QCL_1), and/or you may save
the Euclidean distance between each case and its cluster center (labeled QCL_2) by
checking "Distance from cluster center."

Options button: Optionally, you may press the Options button to select statistics or
missing values options. There are three statistics options: "Initial cluster centers" (gives
the initial variable means for each cluster); ANOVA table (ANOVA F-tests for each
variable, but as the F tests are only descriptive, the resulting probabilities are for
exploratory purposes only; nonetheless, non-significant variables might be dropped as not
contributing to the differentiation of clusters); and "Cluster information for each case"
(gives each case's final cluster assignment and the Euclidean distance between the case
and the cluster center; also gives the Euclidean distance between final cluster centers).

Getting different clusters. Sometimes the researcher wishes to experiment to get different
clusters, as when the "Number of cases in each cluster" table shows highly imbalanced
clusters and/or clusters with very few members. Different results may occur by setting
different initial cluster centers from file (see above), by changing the number of clusters
requested, or even by presenting the data file in different case order.

What is Two-Step Cluster Analysis?

Two-step cluster analysis groups cases into pre-clusters which are treated as single cases.
Standard hierarchical clustering is then applied to the pre-clusters in the second step. This
is the method used when one or more of the variables are categorical (not interval or
dichotomous). Also, since it is a method requiring neither a proximity table like
hierarchical classification nor an iterative process like K-means clustering, but rather is a
one-pass-through-the-dataset method, it is recommended for very large datasets.

Cluster feature tree. The preclustering stage employs a CFtree with nodes leading to leaf
nodes. Cases start at the root node and are channeled toward nodes and eventually leaf
nodes which match them most closely. If there is no adequate match, the case is used to start
its own leaf node. It can happen that the CFtree fills up and cannot accept new leaf entries
in a node, in which case it is split using the most-distant pair in the node as seeds. If this
recursive process grows the CFtree beyond maximum size, the threshold distance is
increased and the tree is rebuilt, allowing new cases to be input. The process continues
until all the data are read. Click the Advanced button in the Options button dialog to set
threshold distances, maximum levels, and maximum branches per leaf node manually.

Proximity. When one or more of the variables are categorical, log-likelihood is the
distance measure used, with cases categorized under the cluster which is associated with
the largest log-likelihood. If variables are all continuous, Euclidean distance is used, with
cases categorized under the cluster which is associated with the smallest Euclidean
distance.

Number of clusters. By default SPSS determines the number of clusters using the change
in BIC (the Schwarz Bayesian Criterion): when BIC change is small, it stops and selects
as many clusters as thus far created. It is also possible to have this done based on changes
in AIC (the Akaike Information Criterion), or simply to tell SPSS how many clusters
are wanted. The researcher can also ask for a range of solutions, such as 3-5 clusters. The
"Autoclustering statistics" table in SPSS output gives, for example, BIC and BIC change
for all solutions.

SPSS. Choose Analyze, Classify, Two-Step Cluster; select your categorical and
continuous variables; if desired, click Plots and select the plots wanted; Click Output and
select the statistics wanted (descriptive statistics, cluster frequencies, AIC or BIC);
Continue

Sources and Suggested Reading:

http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm

www.cs.uu.nl/docs/vakken/arm/SPSS/spss8.pdf

Suggested Readings:

Anil K. Jain, Richard C. Dubes, Algorithms for Clustering Data, 2004

Leonard Kaufman, Peter J. Rousseeuw, Finding Groups In Data: An Introduction
To Cluster Analysis, 2005

Factor Analysis
Q-1 What is Factor Analysis?

• Factor analysis is a correlational technique to determine meaningful clusters of
shared variance.
• Factor Analysis should be driven by a researcher who has a deep and genuine
interest in relevant theory in order to get optimal value from choosing the right
type of factor analysis and interpreting the factor loadings.
• Factor analysis begins with a large number of variables and then tries to
reduce the interrelationships amongst the variables to a small number of clusters or
factors.
• Factor analysis finds relationships or natural connections where variables are
maximally correlated with one another and minimally correlated with other
variables, and then groups the variables accordingly.
• After this process has been repeated many times, a pattern of relationships or
factors emerges that captures the essence of all of the data.
• Summary: Factor analysis refers to a collection of statistical methods for reducing
correlational data into a smaller number of dimensions or factors.

Key Concepts and Terms

Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a
relatively large set of variables. The researcher's à priori assumption is that any indicator
may be associated with any factor. This is the most common form of factor analysis.
There is no prior theory and one uses factor loadings to intuit the factor structure of the
data.

Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the
loadings of measured (indicator) variables on them conform to what is expected on the
basis of pre-established theory. Indicator variables are selected on the basis of prior
theory and factor analysis is used to see if they load as predicted on the expected number
of factors. The researcher's à priori assumption is that each factor (the number and labels
of which may be specified à priori) is associated with a specified subset of indicator
variables. A minimum requirement of confirmatory factor analysis is that one
hypothesize beforehand the number of factors in the model, but usually also the
researcher will posit expectations about which variables will load on which factors (Kim
and Mueller, 1978b: 55). The researcher seeks to determine, for instance, if measures
created to represent a latent variable really belong together.

Factor loadings: The factor loadings, also called component loadings in PCA, are the
correlation coefficients between the variables (rows) and factors (columns). Analogous to
Pearson's r, the squared factor loading is the percent of variance in that variable explained
by the factor. To get the percent of variance in all the variables accounted for by each
factor, sum the squared factor loadings for that factor (column) and divide by
the number of variables. (Note the number of variables equals the sum of their variances
as the variance of a standardized variable is 1.) This is the same as dividing the factor's
eigenvalue by the number of variables.
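
The arithmetic in this paragraph can be checked with a short Python sketch. The loading matrix below is invented for illustration; only the calculation itself comes from the text.

```python
# Hypothetical loading matrix: rows are variables, columns are factors.
loadings = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.2, 0.9],
    [0.1, 0.6],
]
n_variables = len(loadings)
n_factors = len(loadings[0])

# Sum of squared loadings down each column (one total per factor).
# For a principal components solution this total is the factor's eigenvalue.
sum_sq_by_factor = [sum(row[j] ** 2 for row in loadings) for j in range(n_factors)]

# Dividing by the number of variables gives each factor's percent of total variance.
pct_variance = [100 * total / n_variables for total in sum_sq_by_factor]

print(sum_sq_by_factor)
print(pct_variance)
```

Here the column totals are 1.18 and 1.22, so the two factors account for roughly 29.5% and 30.5% of the variance in the four variables.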

Communality, h2, is the squared multiple correlation for the variable as dependent using
the factors as predictors. The communality measures the percent of variance in a given
variable explained by all the factors jointly and may be interpreted as the reliability of the
indicator. When an indicator variable has a low communality, the factor model is not
working well for that indicator and possibly it should be removed from the model.
However, communalities must be interpreted in relation to the interpretability of the
factors. A communality of .75 seems high but is meaningless unless the factor on which
the variable is loaded is interpretable, though it usually will be. A communality of .25
seems low but may be meaningful if the item is contributing to a well-defined factor.
That is, what is critical is not the communality coefficient per se, but rather the extent to
which the item plays a role in the interpretation of the factor, though often this role is
greater when communality is high.
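
For orthogonal factors, the communality of a variable works out to the sum of its squared loadings across the factors (a row total rather than a column total). The sketch below uses invented item names and loadings, and the 0.40 screening level is purely an assumption for the example, not a rule from the text.

```python
# Hypothetical loadings of four items on two orthogonal factors.
loadings = {
    "item_a": [0.8, 0.1],
    "item_b": [0.7, 0.2],
    "item_c": [0.2, 0.9],
    "item_d": [0.1, 0.6],
}

# Communality h2: sum of squared loadings across factors for each item.
communalities = {
    name: sum(value ** 2 for value in row) for name, row in loadings.items()
}

# Assumed screening level, used only to flag items for a closer look.
low_items = [name for name, h2 in communalities.items() if h2 < 0.40]

print(communalities)
print(low_items)
```

As the paragraph above cautions, a flagged item is a prompt to examine its role in the factor, not an automatic reason to drop it.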

Eigenvalues: Also called characteristic roots. The eigenvalue for a given factor
measures the variance in all the variables which is accounted for by that factor. The ratio
of eigenvalues is the ratio of explanatory importance of the factors with respect to the
variables. If a factor has a low eigenvalue, then it is contributing little to the explanation
of variances in the variables and may be ignored as redundant with more important
factors.

Thus, eigenvalues measure the amount of variation in the total sample accounted for by
each factor. Note that the eigenvalue is not the percent of variance explained but rather a
measure of amount of variance in relation to total variance (since variables are
standardized to have means of 0 and variances of 1, total variance is equal to the number
of variables). SPSS will output a corresponding column titled '% of variance'. A factor's
eigenvalue may be computed as the sum of its squared factor loadings for all the
variables.

Q-2 What are the criteria for determining the number of factors, roughly in the
order of frequency of use in social science (see Dunteman, 1989: 22-3)?

Kaiser criterion: A common rule of thumb for dropping the least important factors from
the analysis. The Kaiser rule is to drop all components with eigenvalues under 1.0. Kaiser
criterion is the default in SPSS and most computer programs.
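
A minimal sketch of the Kaiser rule in Python, assuming NumPy is available. The correlation matrix is made up so that the eigenvalues are easy to verify by hand (1.6, 1.5, 0.5 and 0.4).

```python
import numpy as np

# Hypothetical correlation matrix for four standardized variables.
R = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

# Eigenvalues of the symmetric matrix, sorted from largest to smallest.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]

# Kaiser rule: drop components with eigenvalues under 1.0, keep the rest.
n_retained = int(np.sum(eigenvalues >= 1.0))

print(eigenvalues)
print(n_retained)
```

The eigenvalues sum to 4, the number of variables, and two components survive the rule.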

Scree plot: The Cattell scree test plots the components as the X axis and the
corresponding eigenvalues as the Y axis. As one moves to the right, toward later
components, the eigenvalues drop. When the drop ceases and the curve makes an elbow
toward less steep decline, Cattell's scree test says to drop all further components after the
one starting the elbow. This rule is sometimes criticised for being amenable to researcher-
controlled "fudging." That is, as picking the "elbow" can be subjective because the curve
has multiple elbows or is a smooth curve, the researcher may be tempted to set the cut-off
at the number of factors desired by his or her research agenda. Even when "fudging" is
not a consideration, the scree criterion tends to result in more factors than the Kaiser
criterion.

Variance explained criteria: Some researchers simply use the rule of keeping enough
factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal
emphasizes parsimony (explaining variance with as few factors as possible), the criterion
could be as low as 50%.
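
The variance explained rule amounts to a running total of eigenvalues. In the Python sketch below the eigenvalues are invented (six variables, so they sum to 6) and the function name is hypothetical.

```python
def factors_needed(eigenvalues, target):
    """Smallest number of factors whose cumulative share of variance reaches target."""
    total = sum(eigenvalues)
    running = 0.0
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += value
        if running / total >= target:
            return count
    return len(eigenvalues)


# Hypothetical eigenvalues for six standardized variables.
eigenvalues = [2.7, 1.5, 0.9, 0.5, 0.3, 0.1]

for target in (0.90, 0.80, 0.50):
    print(target, factors_needed(eigenvalues, target))
```

With these values the cumulative shares are 45%, 70%, 85% and about 93%, so the 90%, 80% and 50% criteria keep four, three and two factors respectively.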

Q-3 What are the different rotation methods used in factor analysis?

Ans:

No rotation is the default, but it is a good idea to select a rotation method, usually
varimax. The original, unrotated principal components solution maximizes the sum of
squared factor loadings, efficiently creating a set of factors which explain as much of the
variance in the original variables as possible. The amount explained is reflected in the
sum of the eigenvalues of all factors. However, unrotated solutions are hard to interpret
because variables tend to load on multiple factors.

Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance
of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix,
which has the effect of differentiating the original variables by extracted factor. Each
factor will tend to have either large or small loadings of any particular variable. A
varimax solution yields results which make it as easy as possible to identify each variable
with a single factor. This is the most common rotation option.
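
The effect of varimax can be seen in a small numerical sketch. The code below uses a commonly circulated SVD-based iteration for the varimax criterion, written with NumPy; it is an illustration under that assumption and is not claimed to match the SPSS routine (which, for example, may apply Kaiser normalization first). The function names and the loading matrix are hypothetical.

```python
import numpy as np


def varimax(loadings, max_iter=100, tol=1e-6):
    """Orthogonally rotate a loading matrix toward the varimax criterion."""
    L = np.asarray(loadings, dtype=float)
    n_vars, n_factors = L.shape
    rotation = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        rotated = L @ rotation
        # Target matrix derived from the varimax criterion.
        target = rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / n_vars
        u, s, vt = np.linalg.svd(L.T @ target)
        rotation = u @ vt  # closest orthogonal matrix to L.T @ target
        if s.sum() < previous * (1 + tol):
            break
        previous = s.sum()
    return L @ rotation, rotation


def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings in each column."""
    return float(np.var(np.asarray(loadings) ** 2, axis=0).sum())


# Hypothetical example: a clean two-factor pattern blurred by a 30 degree rotation,
# standing in for an unrotated solution where variables load on both factors.
simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.0, 0.9], [0.0, 0.6]])
angle = np.pi / 6
mix = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
unrotated = simple @ mix

rotated, rotation = varimax(unrotated)
print(np.round(unrotated, 3))
print(np.round(rotated, 3))
print(varimax_criterion(unrotated), varimax_criterion(rotated))
```

Because the rotation is orthogonal, each variable's communality is unchanged; only the way the variance is split between the factors differs, with each variable ending up tied mainly to one factor. Columns may come back in a different order or with flipped signs.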

Quartimax rotation is an orthogonal alternative which minimizes the number of factors
needed to explain each variable. This type of rotation often generates a general factor on
which most variables are loaded to a high or medium degree. Such a factor structure is
usually not helpful to the research purpose.

Q-4 How many cases are required to do factor analysis?

There is no scientific answer to this question, and methodologists differ. Alternative
arbitrary "rules of thumb," in descending order of popularity, include those below. These
are not mutually exclusive: Bryant and Yarnold, for instance, endorse both STV and the
Rule of 200.

Rule of 10. There should be at least 10 cases for each item in the instrument being used.

STV ratio. The subjects-to-variables ratio should be no lower than 5 (Bryant and
Yarnold, 1995).

Rule of 100: The number of subjects should be the larger of 5 times the number of
variables, or 100. Even more subjects are needed when communalities are low and/or few
variables load on each factor. (Hatcher, 1994)

Rule of 150: Hutcheson and Sofroniou (1999) recommend at least 150 - 300 cases, more
toward the 150 end when there are a few highly correlated variables, as would be the case
when collapsing highly multicollinear variables.
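
The rules of thumb above are easy to compare side by side for a given study. The helper below is a hypothetical sketch; it assumes each item in the instrument is one variable in the analysis and uses the lower bound of the Hutcheson and Sofroniou range.

```python
def minimum_cases(n_variables):
    """Minimum sample size implied by each rule of thumb listed above."""
    return {
        "rule_of_10": 10 * n_variables,            # 10 cases per item
        "stv_ratio": 5 * n_variables,              # subjects-to-variables ratio of 5
        "rule_of_100": max(5 * n_variables, 100),  # larger of 5 per variable or 100
        "rule_of_150": 150,                        # lower end of the 150 - 300 range
    }


# Example: an instrument with 12 items.
requirements = minimum_cases(12)
print(requirements)
```

For 12 items the rules ask for 120, 60, 100 and 150 cases respectively, which shows how far apart they can be.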

Q-5 What is "sampling adequacy" and what is it used for?

Measured by the Kaiser-Meyer-Olkin (KMO) statistic, sampling adequacy predicts if
data are likely to factor well, based on correlation and partial correlation. In the old days
of manual factor analysis, this was extremely useful. KMO can still be used, however, to
assess which variables to drop from the model because they are too multicollinear.

There is a KMO statistic for each individual variable, as well as a KMO overall statistic
computed over all pairs of variables. KMO varies from 0 to 1.0 and KMO overall should be
.60 or higher to proceed
with factor analysis. If it is not, drop the indicator variables with the lowest individual
KMO statistic values, until KMO overall rises above .60.

To compute KMO overall, the numerator is the sum of squared correlations of all
variables in the analysis (except the 1.0 self-correlations of variables with themselves, of
course). The denominator is this same sum plus the sum of squared partial correlations of
each variable i with each variable j, controlling for others in the analysis. The concept is
that the partial correlations should not be very large if one is to expect distinct factors to
emerge from factor analysis.
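
The computation just described can be sketched with NumPy. The function name is hypothetical, and the partial correlations are taken from the inverse of the correlation matrix, which is a standard identity for partial correlations controlling for all other variables. The example matrix is invented so the result can be checked by hand.

```python
import numpy as np


def kmo(corr):
    """Return (overall KMO, per-variable KMO) for a correlation matrix."""
    R = np.asarray(corr, dtype=float)
    inv = np.linalg.inv(R)
    # Partial correlation of i and j, controlling for all other variables.
    scale = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / scale
    # Exclude the diagonal (self-correlations) from every sum.
    off_diagonal = ~np.eye(R.shape[0], dtype=bool)
    r2 = np.where(off_diagonal, R ** 2, 0.0)
    p2 = np.where(off_diagonal, partial ** 2, 0.0)
    overall = r2.sum() / (r2.sum() + p2.sum())
    per_variable = r2.sum(axis=0) / (r2.sum(axis=0) + p2.sum(axis=0))
    return overall, per_variable


# Hypothetical matrix: three variables that all correlate .5 with one another.
R = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])
overall, per_variable = kmo(R)
print(round(overall, 4), np.round(per_variable, 4))
```

In this example every partial correlation is 1/3, so the overall and individual values all equal .25 / (.25 + .1111), or about .69.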

In SPSS, KMO is found under Analyze - Statistics - Data Reduction - Factor - Variables
(input variables) - Descriptives - Correlation Matrix - check KMO and Bartlett's test of
sphericity and also check Anti-image - Continue - OK. The KMO output is KMO overall.
The diagonal elements on the Anti-image correlation matrix are the KMO individual
statistics for each variable.

Sources and Suggested Reading:

http://faculty.chass.ncsu.edu/garson/PA765/factspss.htm
www.sussex.ac.uk/Users/andyf/factor.pdf
www.cs.uu.nl/docs/vakken/arm/SPSS/spss7.pdf

Bruce Thompson, Exploratory and Confirmatory Factor Analysis: Understanding
Concepts and Applications, 2004
