
(From various online and published resources)

1

Multiple Regression

Q-1 What is Multiple Regression?

Ans:

Multiple regression is used to account for (predict) the variance in an interval dependent variable,

based on linear combinations of interval, dichotomous, or dummy independent variables.

Multiple regression can establish that a set of independent variables explains a proportion

of the variance in a dependent variable at a significant level (through a significance test

of R2), and can establish the relative predictive importance of the independent variables

(by comparing beta weights). Power terms can be added as independent variables to

explore curvilinear effects. Cross-product terms can be added as independent variables to

explore interaction effects. One can test the significance of difference of two R2's to

determine if adding an independent variable to the model helps significantly. Using

hierarchical regression, one can see how much variance in the dependent can be explained

by one or a set of new independent variables, over and above that explained by an earlier

set. Of course, the estimates (b coefficients and constant) can be used to construct a

prediction equation and generate predicted scores on a variable for further analysis.

The multiple regression equation takes the form y = b1x1 + b2x2 + ... + bnxn + c. The b's

are the regression coefficients, representing the amount the dependent variable y changes

when the corresponding independent changes 1 unit. The c is the constant, where the

regression line intercepts the y axis, representing the amount the dependent y will be

when all the independent variables are 0. The standardized versions of the b coefficients

are the beta weights, and the ratio of the beta coefficients is the ratio of the relative

predictive power of the independent variables. Associated with multiple regression is R2,

the coefficient of multiple determination, which is the percent of variance in the dependent variable explained

collectively by all of the independent variables.
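
To make the equation concrete, here is a minimal worked sketch in Python using the statsmodels library (an assumption; the notes themselves refer to SPSS/SAS). The variables x1, x2, and y are hypothetical, invented only to show where the b coefficients, the constant, the beta weights, and R2 appear.

```python
# Minimal multiple-regression sketch (hypothetical data; the statsmodels package is assumed).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)
y = 2.0 * x1 - 1.5 * x2 + rng.normal(size=100)    # invented "true" model for illustration

X = sm.add_constant(np.column_stack([x1, x2]))    # adds the intercept (the constant c)
model = sm.OLS(y, X).fit()

print(model.params)        # the constant c and the b coefficients
print(model.rsquared)      # R2: variance explained collectively by the independents

# Beta weights: the b coefficients re-estimated on standardized (mean 0, sd 1) variables.
z = lambda v: (v - v.mean()) / v.std()
betas = sm.OLS(z(y), sm.add_constant(np.column_stack([z(x1), z(x2)]))).fit()
print(betas.params[1:])    # standardized (beta) weights; their ratio compares predictive power
```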

Multiple regression shares the assumptions of correlation: linearity of relationships, the same level of relationship throughout the range of the independent variable

("homoscedasticity"), interval or near-interval data, absence of outliers, and data whose

range is not truncated. In addition, it is important that the model being tested is correctly

specified. The exclusion of important causal variables or the inclusion of extraneous

variables can change markedly the beta weights and hence the interpretation of the

importance of the independent variables.


Q-2 What is R-square ?

Ans:

R2, also called the coefficient of multiple determination (the square of the multiple correlation R), is the

percent of the variance in the dependent explained uniquely or jointly by the

independents. R-squared can also be interpreted as the proportionate reduction in error in

estimating the dependent when knowing the independents. That is, R2 compares the errors made when using the regression model to predict the dependent with the errors made when using only the dependent's mean as the basis for

estimating all cases. Mathematically, R2 = 1 - (SSE/SST), where SSE = error sum of squares = SUM((Yi - EstYi)^2), with Yi the actual value of Y for the ith case and EstYi the regression prediction for the ith case; and where SST = total sum of squares = SUM((Yi - MeanY)^2). The "residual sum of squares" in SPSS/SAS

output is SSE and reflects regression error. Thus R-square is 1 minus regression error as a

percent of total error and will be 0 when regression error is as large as it would be if you

simply guessed the mean for all cases of Y. Put another way, the regression sum of

squares/total sum of squares = R-square, where the regression sum of squares = total sum

of squares - residual sum of squares.
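
As a quick check of the SSE/SST arithmetic above, the following sketch (hypothetical numbers; NumPy assumed) shows that 1 - SSE/SST and SSR/SST give the same R-square.

```python
# R-squared computed directly from SSE and SST as defined above (hypothetical values).
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])        # actual values Yi
y_hat = np.array([3.2, 4.8, 7.1, 9.3, 10.6])    # regression predictions EstYi

sse = np.sum((y - y_hat) ** 2)                  # error (residual) sum of squares
sst = np.sum((y - y.mean()) ** 2)               # total sum of squares
ssr = sst - sse                                 # regression sum of squares

print(1 - sse / sst, ssr / sst)                 # the two expressions give the same R-squared
```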

Q-3 What is Adjusted R-square?

Ans:

Adjusted R-Square is an adjustment for the fact that when one has a large number of

independents, it is possible that R2 will become artificially high simply because some

independents' chance variations "explain" small parts of the variance of the dependent. At

the extreme, when there are as many independents as cases in the sample, R2 will always

be 1.0. The adjustment to the formula arbitrarily lowers R2 as p, the number of

independents, increases. Some authors conceive of adjusted R2 as the percent of variance

"explained in a replication, after subtracting out the contribution of chance." When used

for the case of a few independents, R2 and adjusted R2 will be close. When there are a

great many independents, adjusted R2 may be noticeably lower. The greater the number

of independents, the more the researcher is expected to report the adjusted coefficient.

Always use adjusted R2 when comparing models with different numbers of independents.

Adjusted R2 = 1 - ( (1 - R2)(N - 1) / (N - k - 1) ),

where N is the sample size and k is the number of terms in the model not counting the constant (i.e., the number of independents).
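
A small sketch of the adjustment formula with illustrative values (R2 = .60, N = 50, and k = 5 are arbitrary choices, not from any real analysis):

```python
# Adjusted R-squared from the formula above (R2, N, and k are illustrative values).
r2 = 0.60   # ordinary R-squared
N = 50      # sample size
k = 5       # number of independents (terms excluding the constant)

adj_r2 = 1 - (1 - r2) * (N - 1) / (N - k - 1)
print(adj_r2)   # about 0.55, somewhat lower than R-squared, as expected
```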


Multicollinearity is the intercorrelation of independent variables. R2's near 1 violate the

assumption of no perfect collinearity, while high R2's increase the standard error of the

beta coefficients and make assessment of the unique role of each independent difficult or

impossible. While simple correlations tell something about multicollinearity, the

preferred method of assessing multicollinearity is to regress each independent on all the

other independent variables in the equation. Inspection of the correlation matrix reveals

only bivariate multicollinearity, with the typical criterion being bivariate correlations >

.90. To assess multivariate multicollinearity, one uses tolerance or VIF, which build in

the regressing of each independent on all the others. Even when multicollinearity is

present, note that estimates of the importance of other variables in the equation (variables

which are not collinear with others) are not affected.

Types of multicollinearity. The type of multicollinearity matters a great deal. Some types are necessary to the research purpose (for example, power terms and cross-product interaction terms are intentionally correlated with the component variables from which they are built).

Tolerance is 1 - R2 for the regression of that independent variable on all the other

independents, ignoring the dependent. There will be as many tolerance coefficients as

there are independents. The higher the intercorrelation of the independents, the more the

tolerance will approach zero. As a rule of thumb, if tolerance is less than .20, a problem

with multicollinearity is indicated.

When tolerance is close to 0 there is high multicollinearity of that variable with other

independents and the b and beta coefficients will be unstable. The more the multicollinearity, the lower the tolerance and the larger the standard error of the regression

coefficients. Tolerance is part of the denominator in the formula for calculating the

confidence limits on the b (partial regression) coefficient.

Variance inflation factor (VIF). VIF is simply the

reciprocal of tolerance. Therefore, when VIF is high there is high multicollinearity and

instability of the b and beta coefficients. VIF and tolerance are found in the SPSS and

SAS output section on collinearity statistics.
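
A sketch of the tolerance/VIF computation in Python with statsmodels (assumed here in place of the SPSS/SAS collinearity output); x2 is deliberately built to be collinear with x1, and all data are hypothetical:

```python
# Tolerance and VIF for each independent (hypothetical data; statsmodels assumed).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)    # deliberately collinear with x1
x3 = rng.normal(size=200)

X = sm.add_constant(np.column_stack([x1, x2, x3]))
for j in range(1, X.shape[1]):                # skip the constant column
    vif = variance_inflation_factor(X, j)
    print(f"x{j}: VIF = {vif:.1f}, tolerance = {1.0 / vif:.3f}")
# By the rule of thumb above, tolerance below .20 flags a multicollinearity problem.
```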

Condition indices are used to flag excessive collinearity in the data. A condition index

over 30 suggests serious collinearity problems and an index over 15 indicates possible

collinearity problems. If a factor (component) has a high condition index, one looks in the

variance proportions column. Criteria for "sizable proportion" vary among researchers

but the most common criterion is if two or more variables have a variance partition of .50

or higher on a factor with a high condition index. If this is the case, these variables have

high linear dependence and multicollinearity is a problem, with the effect that small data

changes or arithmetic errors may translate into very large changes or errors in the

regression analysis. Note that it is possible for the rule of thumb for condition indices (no

index over 30) to indicate multicollinearity, even when the rules of thumb for tolerance >

.20 or VIF < 4 suggest no multicollinearity. Computationally, a "singular value" is the

square root of an eigenvalue, and "condition indices" are the ratio of the largest singular


value to each other singular value. In SPSS or SAS, select Analyze, Regression, Linear; click Statistics; check Collinearity diagnostics to get condition indices.
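
The singular-value computation described above can be sketched directly with NumPy; the design matrix below is hypothetical, and the unit-length column scaling follows the usual approach to condition indices, which may differ in detail from a particular package's output:

```python
# Condition indices from the singular values of the column-scaled design matrix (hypothetical X).
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)     # a near-collinear pair
X = np.column_stack([np.ones(100), x1, x2])      # include the constant column

Xs = X / np.linalg.norm(X, axis=0)               # scale each column to unit length
s = np.linalg.svd(Xs, compute_uv=False)          # singular values (square roots of eigenvalues)
print(np.round(s.max() / s, 1))                  # condition indices; values over 30 are worrying
```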

Homoscedasticity: The researcher should test to assure that the residuals are dispersed

randomly throughout the range of the estimated dependent. Put another way, the

variance of residual error should be constant for all values of the independent(s). If not,

separate models may be required for the different ranges. Also, when the

homoscedasticity assumption is violated "conventionally computed confidence

intervals and conventional t-tests for OLS estimators can no longer be justified"

However, moderate violations of homoscedasticity have only minor impact on

regression estimates.

Nonconstant error variance can be observed by requesting a simple residual plot (a plot

of residuals on the Y axis against predicted values on the X axis). A homoscedastic

model will display a cloud of dots, whereas lack of homoscedasticity will be

characterized by a pattern such as a funnel shape, indicating greater error as the

dependent increases. Nonconstant error variance can indicate the need to respecify the

model to include omitted independent variables.

Lack of homoscedasticity may also signal (1) an interaction effect between a measured independent variable and an unmeasured independent variable not in the

model; or (2) that some independent variables are skewed while others are not.

One method of dealing with heteroscedasticity is to select the weighted least squares

regression option. This causes cases with smaller residuals to be weighted more in

calculating the b coefficients. Square root, log, and reciprocal transformations of the

dependent may also reduce or eliminate lack of homoscedasticity.
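
A sketch of the residual plot and a weighted least squares refit in Python (statsmodels and matplotlib assumed); the data are hypothetical and the 1/x^2 weighting is purely illustrative, not a general recommendation:

```python
# Inspecting and treating heteroscedasticity (hypothetical data; statsmodels and matplotlib assumed).
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=200)
y = 2 + 3 * x + rng.normal(scale=x, size=200)    # error variance grows with x (a funnel shape)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()

# Simple residual plot: residuals (Y axis) against predicted values (X axis).
plt.scatter(ols.fittedvalues, ols.resid)
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.show()

# One remedy: weighted least squares; the 1/x**2 weighting here is illustrative only.
wls = sm.WLS(y, X, weights=1.0 / x ** 2).fit()
print(ols.bse, wls.bse)    # compare the standard errors of the coefficients
```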

http://www2.chass.ncsu.edu/garson/pa765/regress.htm

www.cs.uu.nl/docs/vakken/arm/SPSS/spss4.pdf

Kahane, Leo H. (2001). Regression basics. Thousand Oaks, CA: Sage Publications.

Menard, Scott (1995). Applied logistic regression analysis. Thousand Oaks, CA: Sage

Publications. Series: Quantitative Applications in the Social Sciences, No. 106.

Miles, Jeremy and Mark Shevlin (2001). Applying regression and correlation. Thousand

Oaks, CA: Sage Publications. Introductory text built around model-building.


Discriminant Analysis

Q-1 What is Discriminant Analysis?

Ans:

Discriminant analysis (DA) is used to classify cases into the values of a categorical dependent, usually a dichotomy. If discriminant

function analysis is effective for a set of data, the classification table of correct and

incorrect estimates will yield a high percentage correct. Discriminant function analysis is

found in SPSS/SAS under Analyze, Classify, Discriminant. One gets DA or MDA from

this same menu selection, depending on whether the specified grouping variable has two

or more categories.

Multiple discriminant analysis (MDA) is a cousin of multivariate analysis of variance (MANOVA), sharing many of the same

assumptions and tests. MDA is used to classify a categorical dependent which has more

than two categories, using as predictors a number of interval or dummy independent

variables. MDA is sometimes also called discriminant factor analysis or canonical

discriminant analysis. The purposes of discriminant analysis include:

• To test theory by observing whether cases are classified as predicted.

• To investigate differences between or among groups.

• To determine the most parsimonious way to distinguish among groups.

• To assess the relative importance of the independent variables in classifying the

dependent variable.

• To infer the meaning of MDA dimensions which distinguish groups, based on

discriminant loadings.

Discriminant analysis has two steps: (1) an F test (Wilks' lambda) is used to test if the

discriminant model as a whole is significant, and (2) if the F test shows significance, then

the individual independent variables are assessed to see which differ significantly in

mean by group and these are used to classify the dependent variable.

Discriminant analysis shares all the usual assumptions of correlation, requiring linear and

homoscedastic relationships, and untruncated interval or near interval data. Like multiple

regression, it also assumes proper model specification (inclusion of all important

independents and exclusion of extraneous variables). DA also assumes the dependent

variable is a true dichotomy since data which are forced into dichotomous coding are

truncated, attenuating correlation.

Logistic regression is often used in place of DA as it usually involves fewer violations of assumptions (independent variables

needn't be normally distributed, linearly related, or have equal within-group variances), is


robust, handles categorical as well as continuous variables, and has coefficients which

many find easier to interpret. Logistic regression is preferred when data are not normal in

distribution or group sizes are very unequal. See also the separate topic on multiple

discriminant function analysis (MDA) for dependents with more than two categories.

The discriminating variables are the independent variables, also called predictors.

The criterion variable. This is the dependent variable, also called the grouping

variable in SPSS. It is the object of classification efforts.

The discriminant function is a latent variable which is created as a linear combination of discriminating

(independent) variables, such that L = b1x1 + b2x2 + ... + bnxn + c, where the b's are

discriminant coefficients, the x's are discriminating variables, and c is a constant.

This is analogous to multiple regression, but the b's are discriminant coefficients

which maximize the distance between the means of the criterion (dependent) variable.

Note that the foregoing assumes the discriminant function is estimated using ordinary

least-squares, the traditional method, but there is also a version involving maximum

likelihood estimation.
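
A sketch of a two-group discriminant analysis using scikit-learn's LinearDiscriminantAnalysis as a stand-in for the SPSS procedure described in these notes (the data and group labels are hypothetical):

```python
# Two-group discriminant analysis sketch with scikit-learn (hypothetical data and groups).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),     # group 0
               rng.normal(1.5, 1.0, (50, 2))])    # group 1
group = np.array([0] * 50 + [1] * 50)             # the grouping (criterion) variable

lda = LinearDiscriminantAnalysis().fit(X, group)

print(lda.scalings_.ravel())     # coefficients of the single discriminant function (2 groups - 1)
print(lda.transform(X)[:5])      # discriminant scores for the first five cases
print(lda.score(X, group))       # proportion of cases classified correctly
```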

There is one discriminant function for two-group discriminant analysis, but for higher order DA, the number of functions (each with its

own cut-off value) is the lesser of (g - 1), where g is the number of categories in the

grouping variable, or p, the number of discriminating (independent) variables. Each

discriminant function is orthogonal to the others. A dimension is simply one of the

discriminant functions when there are more than one, in multiple discriminant

analysis.

The eigenvalue, also called the characteristic root of each discriminant function,

reflects the ratio of importance of the dimensions which classify cases of the

dependent variable. There is one eigenvalue for each discriminant function. For two-

group DA, there is one discriminant function and one eigenvalue, which accounts for

100% of the explained variance. If there is more than one discriminant function, the

first will be the largest and most important, the second next most important in

explanatory power, and so on. The eigenvalues assess relative importance because

they reflect the percents of variance explained in the dependent variable, cumulating

to 100% for all functions. That is, the ratio of the eigenvalues indicates the relative

discriminating power of the discriminant functions. If the ratio of two eigenvalues is

1.4, for instance, then the first discriminant function accounts for 40% more between-


group variance in the dependent categories than does the second discriminant

function. Eigenvalues are part of the default output in SPSS (Analyze, Classify,

Discriminant).

The relative percentage of a discriminant function equals its eigenvalue divided by the sum of all eigenvalues of all discriminant functions in the model. Thus

it is the percent of discriminating power for the model associated with a given

discriminant function. Relative % is used to tell how many functions are important.

One may find that only the first two or so eigenvalues are of importance.

The canonical correlation, R, is a measure of the association between the groups formed by the dependent and the given discriminant function. When R is zero, there

is no relation between the groups and the function. When the canonical correlation is

large, there is a high correlation between the discriminant functions and the groups.

Note that relative % and R do not have to be correlated. R is used to tell how much

each function is useful in determining group differences. An R of 1.0 indicates that all

of the variability in the discriminant scores can be accounted for by that dimension.

Note that for two-group DA, the canonical correlation is equivalent to the Pearsonian

correlation of the discriminant scores with the grouping variable.

The discriminant score, also called the DA score, is the value resulting from

applying a discriminant function formula to the data for a given case. The Z score is

the discriminant score for standardized data. To get discriminant scores in SPSS,

select Analyze, Classify, Discriminant; click the Save button; check "Discriminant

scores". One can also view the discriminant scores by clicking the Classify button and

checking "Casewise results."

Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the

case is classed as 0, or if above it is classed as 1. When group sizes are equal, the

cutoff is the mean of the two centroids (for two-group DA). If the groups are unequal,

the cutoff is the weighted mean.

Unstandardized discriminant coefficients are used in the formula for making the

classifications in DA, much as b coefficients are used in regression in making

predictions. The constant plus the sum of products of the unstandardized coefficients

with the observations yields the discriminant scores. That is, discriminant coefficients

are the regression-like b coefficients in the discriminant function, in the form

L = b1x1 + b2x2 + ... + bnxn + c, where L is the latent variable formed by the

discriminant function, the b's are discriminant coefficients, the x's are discriminating

variables, and c is a constant. The discriminant function coefficients are partial

coefficients, reflecting the unique contribution of each variable to the classification of

the criterion variable. The standardized discriminant coefficients, like beta weights in

regression, are used to assess the relative classifying importance of the independent

variables.


Standardized discriminant coefficients, also termed the standardized canonical

discriminant function coefficients, are used to compare the relative importance of the

independent variables, much as beta weights are used in regression. Note that

importance is assessed relative to the model being analyzed. Addition or deletion of

variables in the model can change discriminant coefficients markedly.

As with regression, since these are partial coefficients, only the unique explanation of

each independent is being compared, not considering any shared explanation. Also, if

there are more than two groups of the dependent, the standardized discriminant

coefficients do not tell the researcher between which groups the variable is most or

least discriminating. For this purpose, group centroids and factor structure are

examined.

Wilks' lambda is used to test the significance of the discriminant function as a whole. In SPSS, the "Wilks' Lambda" table will have a column labeled "Test of

Function(s)" and a row labeled "1 through n" (where n is the number of discriminant

functions). The "Sig." level for this row is the significance level of the discriminant

function as a whole. A significant lambda means one can reject the null hypothesis

that the two groups have the same mean discriminant function scores. Wilks's lambda

is part of the default output in SPSS (Analyze, Classify, Discriminant). In SPSS, this

use of Wilks' lambda is in the "Wilks' lambda" table of the output section on

"Summary of Canonical Discriminant Functions."

ANOVA table for discriminant scores is another overall test of the DA model. It is

an F test, where a "Sig." p value < .05 means the model differentiates discriminant

scores between the groups significantly better than chance (than a model with just the

constant). It is obtained in SPSS by asking for Analyze, Compare Means, One-Way

ANOVA, using discriminant scores from DA (which SPSS will label Dis1_1 or

similar) as dependent.

Wilks' lambda also can be used to test which independents contribute significantly to

the discriminant function. The smaller the lambda for an independent variable, the

more that variable contributes to the discriminant function. Lambda varies from 0 to

1, with 0 meaning group means differ (thus the more the variable differentiates the

groups), and 1 meaning all group means are the same. The F test of Wilks's lambda

shows which variables' contributions are significant. Wilks's lambda is sometimes

called the U statistic. In SPSS, this use of Wilks' lambda is in the "Tests of equality of

group means" table in DA output.

Q-2 What is the classification table in Discriminant Analysis?

Ans:


The classification table, also called a classification, confusion, assignment, or prediction matrix, is used to assess the performance of DA.

This is simply a table in which the rows are the observed categories of the dependent

and the columns are the predicted categories of the dependents. When prediction is

perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is

the percentage of correct classifications. This percentage is called the hit ratio.

Expected hit ratio. Note that the hit ratio must be compared not to zero but to the

percent that would have been correctly classified by chance alone. For two-group

discriminant analysis with a 50-50 split in the dependent variable, the expected

percent is 50%. For unequally split 2-way groups of different sizes, the expected

percent is computed in the "Prior Probabilities for Groups" table in SPSS, by

multiplying the prior probabilities times the group size, summing for all groups, and

dividing the sum by N.
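
The hit ratio and the chance-expected hit ratio can be sketched as follows (hypothetical observed and predicted group labels; scikit-learn is assumed for building the classification table):

```python
# Classification table, hit ratio, and chance-expected hit ratio (hypothetical labels).
import numpy as np
from sklearn.metrics import confusion_matrix

observed = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 0])

table = confusion_matrix(observed, predicted)    # rows = observed, columns = predicted
hit_ratio = np.trace(table) / table.sum()        # percent of cases on the diagonal

# Chance expectation: each group's prior probability times its size, summed, divided by N.
n = len(observed)
priors = np.bincount(observed) / n
expected = np.sum(priors * np.bincount(observed)) / n

print(table)
print(hit_ratio, expected)    # the hit ratio should beat the chance-expected value
```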

http://faculty.chass.ncsu.edu/garson/PA765/discrim2.htm



Cluster Analysis

Q-1 What is Cluster Analysis ?

Ans:

Cluster analysis, also called segmentation analysis or taxonomy analysis, seeks to

identify homogeneous subgroups of cases in a population. That is, cluster analysis

seeks to identify a set of groups which both minimize within-group variation and

maximize between-group variation. Other techniques, such as latent class analysis

and Q-mode factor analysis, also perform clustering and are discussed separately.

Hierarchical clustering allows users to select a definition of distance, then select a linking method of forming

clusters, then determine how many clusters best suit the data. In k-means clustering

the researcher specifies the number of clusters in advance, then calculates how to

assign cases to the K clusters. K-means clustering is much less computer-intensive

and is therefore sometimes preferred when datasets are very large (ex., > 1,000).

Finally, two-step clustering creates pre-clusters, then it clusters the pre-clusters.

Cluster formation is the selection of the procedure for determining how clusters are

created, and how the calculations are done. In agglomerative hierarchical clustering

every case is initially considered a cluster, then the two cases with the lowest distance (or

highest similarity) are combined into a cluster. The case with the lowest distance to either

of the first two is considered next. If that third case is closer to a fourth case than it is to

either of the first two, the third and fourth cases become the second two-case cluster; if

not, the third case is added to the first cluster. The process is repeated, adding cases to

existing clusters, creating new clusters, or combining clusters to get to the desired final

number of clusters. There is also divisive clustering, which works in the opposite

direction, starting with all cases in one large cluster. Hierarchical cluster analysis,

discussed below, can use either agglomerative or divisive clustering strategies.

Distance. The first step in cluster analysis is establishment of the similarity or distance

matrix. This matrix is a table in which both the rows and columns are the units of analysis

and the cell entries are a measure of similarity or distance for any pair of cases.

Euclidean distance is the most common distance measure. A given pair of cases is plotted

on two variables, which form the x and y axes. The Euclidean distance is the square root


of the sum of the square of the x difference plus the square of the y difference. (Recall high school geometry: this is the formula for the length of the hypotenuse of a right triangle.) It is common to use squared Euclidean distance, which avoids the square-root step and gives proportionately greater weight to cases that are farther apart. When

two or more variables are used to define distance, the one with the larger magnitude will

dominate, so to avoid this it is common to first standardize all variables.
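
A small sketch of (squared) Euclidean distance after standardization, using SciPy (the three cases and two variables are hypothetical):

```python
# Euclidean and squared Euclidean distance between cases, after standardizing the variables
# so the larger-scaled variable does not dominate (hypothetical values; SciPy assumed).
import numpy as np
from scipy.spatial.distance import euclidean, sqeuclidean

data = np.array([[1.0, 200.0],
                 [2.0, 250.0],
                 [8.0, 900.0]])                        # rows are cases, columns are variables

z = (data - data.mean(axis=0)) / data.std(axis=0)      # standardize each variable

print(euclidean(z[0], z[1]), sqeuclidean(z[0], z[1]))  # cases 1 and 2 are close
print(euclidean(z[0], z[2]))                           # case 3 is far from case 1
```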

Types of distance measures. SPSS offers a choice of distances to use as criteria when merging nearest clusters into broader groups or when

considering the relation of a point to a cluster. SPSS supports these interval distance

measures: Euclidean distance, squared Euclidean distance, Chebychev, block,

Minkowski, or customized; for count data, chi-square or phi-square. For binary data, it

supports Euclidean distance, squared Euclidean distance, size difference, pattern

difference, variance, shape, or Lance and Williams.

Similarity. Distance measures how far apart two observations are. Cases which are alike

share a low distance. Similarity measures how alike two cases are. Cases which are alike

share a high similarity. SPSS supports a large number of similarity measures for interval

data (Pearson correlation or cosine) and for binary data (Russell and Rao, simple

matching, Jaccard, Dice, Rogers and Tanimoto, Sokal and Sneath 1, Sokal and Sneath 2,

Sokal and Sneath 3, Kulczynski 1, Kulczynski 2, Sokal and Sneath 4, Hamann, Lambda,

Anderberg's D, Yule's Y, Yule's Q, Ochiai, Sokal and Sneath 5, phi 4-point correlation, or

dispersion).

Absolute values. Since for Pearson correlation, high negative as well as high positive

values indicate similarity, the researcher normally selects absolute values. This can be

done by checking the absolute value checkbox in the Transform Measures area of the

Methods subdialog (invoked by pressing the Methods button) of the main Cluster dialog.

Summary. In SPSS, similarity/distance measures are selected in the Measure area of the

Method subdialog obtained by pressing the Method button in the Classify dialog. There

are three measure pulldown menus, for interval, binary, and count data respectively. The

proximity matrix table in the output shows the actual distances or similarities computed

for any pair of cases. In SPSS, proximity matrices are selected under Analyze, Cluster,

Hierarchical clustering; Statistics button; check proximity matrix.

Method. Under the Method button in the SPSS Classify dialog, the pull-down Method

selection determines how cases or clusters are combined at each step. Different methods

will result in different cluster patterns. SPSS offers these method choices:

Nearest neighbor. In this single linkage method, the distance between two clusters is the

distance between their closest neighboring points

Furthest neighbor. In this complete linkage method, the distance between two clusters is

the distance between their two furthest member points.


UPGMA (unweighted pair-group method using averages). The distance between two

clusters is the average distance between all inter-cluster pairs. UPGMA is generally

preferred over nearest or furthest neighbor methods since it is based on information about

all inter-cluster pairs, not just the nearest or furthest ones, and is the default method in

SPSS. SPSS labels this "between-groups linkage."

Average linkage within groups is the mean distance between all possible inter- or intra-

cluster pairs. The average distance between all pairs in the resulting cluster is made to be

as small as possible. This method is therefore appropriate when the research purpose is

homogeneity within clusters. SPSS labels this "within-groups linkage."

Ward's method calculates the sum of squared Euclidean distances from each case in a

cluster to the mean of all variables. The cluster to be merged is the one which will

increase the sum the least. This is an ANOVA-type approach and preferred by some

researchers for this reason.

Centroid method. The cluster to be merged is the one with the smallest sum of Euclidean

distances between cluster means for all variables.

Median method. Clusters are weighted equally regardless of group size when computing

centroids of two clusters being combined. This method also uses Euclidean distance as

the proximity measure.

Correlation of items can be used as a similarity measure. One transposes the normal data

table in which columns are variables and rows are cases. By using columns as cases and

rows as variables instead, the correlation is between cases and these correlations may

constitute the cells of the similarity matrix.

Binary matching is another type of similarity measure, where 1 indicates a match and 0

indicates no match between any pair of cases. There are multiple matched attributes and

the similarity score is the number of matches divided by the number of attributes being

matched. Note that it is usual in binary matching to have several attributes because there

is a risk that when the number of attributes is small, they may be orthogonal to (uncorrelated with) one another, and clustering will be indeterminate.

Summary measures assess how the clusters differ from one another.

Means and variances. A table of means and variances of the clusters with respect to the

original variables shows how the clusters differ on the original variables. SPSS does not

make this available in the Cluster dialog, but one can click the Save button, which will

save the cluster number for each case (or numbers if multiple solutions are requested).

Then in Analyze, Compare Means, Means the researcher can use the cluster number as

the grouping variable to compare differences of means on any other continuous variable

in the dataset.


Cluster membership table. This shows cases as rows, where columns are alternative

numbers of clusters in the solution (as specified in the "Range of Solution" option in the

Cluster membership group in SPSS, under the Statistics button). Cell entries show the

number of the cluster to which the case belongs. From this table, one can see which cases

are in which groups, depending on the number of clusters in the solution.

The agglomeration schedule is requested under the Statistics button for Hierarchical Cluster in the SPSS Cluster dialog. In this table, the rows are stages of

clustering, numbered from 1 to (n - 1). The (n - 1)th stage includes all the cases in one

cluster. There are two "Cluster Combined" columns, giving the case or cluster numbers

for combination at each stage. In agglomerative clustering using a distance measure like

Euclidean distance, stage 1 combines the two cases which have lowest proximity

(distance) score. The cluster number goes by the lower of the cases or clusters combined,

where cases are initially numbered 1 to n. For instance, at Stage 1, cases 3 and 18 might

be combined, resulting in a cluster labeled 3. Later cluster 3 and case 2 might be

combined, resulting in a cluster labeled 2. The researcher looks at the "Coefficients"

column of the agglomerative schedule and notes when the proximity coefficient jumps up

and is not a small increment from the one before (or when the coefficient reaches some

theoretically important level). Note that for distance measures, low is good, meaning the

cases are alike; for similarity measures, high coefficients mean cases are alike. After the

stopping stage is determined in this manner, the researcher can work backward to

determine how many clusters there are and which cases belong to which clusters (but it is

easier just to get this information from the cluster membership table). Note, though, that

SPSS will not stop on this basis but instead will compute the range of solutions (ex., 2 to

4 clusters) requested by the researcher in the Cluster Membership group of the Statistics

button in the Hierarchical Clustering dialog. When there are relatively few cases, icicle plots or dendrograms provide the same linkage information in an easier format.

Icicle plots are usually horizontal, showing cases as rows and number of clusters in the

solution as columns. If there are few cases, vertical icicle plots may be plotted, with cases as

columns. Reading from the last column right to left (horizontal icicle plots) or last row

bottom to top (vertical icicle plots), the researcher can see how agglomeration proceeded.

The last column (horizontal icicle plots) or bottom row (vertical icicle plots) shows all the cases as separate one-case clusters. The next column/row shows the first merge, with two cases combined into one cluster. Subsequent columns/rows show further clustering steps.

Row 1 (vertical icicle plots) or column 1 (horizontal icicle plots) will show all cases in a

single cluster. This is a visual way of representing information on the agglomeration

schedule, but without the proximity coefficient information.

Dendrograms, also called tree diagrams, show the relative size of the proximity

coefficients at which cases were combined. The bigger the distance coefficient or the

smaller the similarity coefficient, the more clustering involved combining unlike entities,

which may be undesirable. Trees are usually depicted horizontally, not vertically, with

each row representing a case on the Y axis, while the X axis is a rescaled version of the


proximity coefficients. Cases with low distance/high similarity are close together. Cases

showing low distance are close, with a line linking them a short distance from the left of

the dendrogram, indicating that they are agglomerated into a cluster at a low distance

coefficient, indicating alikeness. When, on the other hand, the linking line is to the right

of the dendrogram, the linkage occurs at a high distance coefficient, indicating the

cases/clusters were agglomerated even though much less alike. If a similarity measure is

used rather than a distance measure, the rescaling of the X axis still produces a diagram

with linkages involving high alikeness to the left and low alikeness to the right. In SPSS,

select Analyze, Classify, Hierarchical Cluster; click the Plots button, check the

Dendrogram checkbox.
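
A sketch of agglomerative clustering, the agglomeration schedule, a dendrogram, and a cut into clusters, using SciPy as a stand-in for the SPSS Cluster procedure (hypothetical data; "average" linkage corresponds to the UPGMA / between-groups method described above):

```python
# Agglomerative hierarchical clustering with SciPy: schedule, dendrogram, and cluster cut.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (10, 2)),
               rng.normal(5, 1, (10, 2))])       # two well-separated groups of cases

# "average" linkage is the UPGMA / between-groups method; distances are Euclidean by default.
Z = linkage(X, method="average")
print(Z[:5])     # the agglomeration schedule: which clusters merged, and at what distance

dendrogram(Z)    # tree diagram of the merge distances
plt.show()

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)    # cluster membership for each case
```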

To accomplish hierarchical clustering, the researcher must specify how similarity or distance

is defined, how clusters are aggregated (or divided), and how many clusters are needed.

Hierarchical clustering generates all possible clusters of sizes 1...K, but is used only for

relatively small samples. In hierarchical clustering, the clusters are nested rather than

being mutually exclusive, as is the usual case. That is, in hierarchical clustering, larger

clusters created at later stages may contain smaller clusters created at earlier stages of

agglomeration.

One may wish to use the hierarchical cluster procedure on a sample of cases (ex., 200) to

inspect results for different numbers of clusters. The optimum number of clusters

depends on the research purpose. Identifying "typical" types may call for few clusters and

identifying "exceptional" types may call for many clusters. After using hierarchical

clustering to determine the desired number of clusters, the researcher may wish then to

analyze the entire dataset with k-means clustering (aka, the Quick Cluster procedure:

Analyze, Cluster, K-Means Cluster Analysis), specifying that number of clusters.

Forward clustering, also called agglomerative clustering: Small clusters are formed by

using a high similarity index cut-off (ex., .9). Then this cut-off is relaxed to establish

broader and broader clusters in stages until all cases are in a single cluster at some low

similarity index cut-off. The merging of clusters is visualized using a tree format.

Backward clustering, also called divisive clustering, is the same idea, but starting with a

low cut-off and working toward a high cut-off. Forward and backward methods need not

generate the same results.

Clustering variables. In the Hierarchical Cluster dialog, in the Cluster group, the

researcher may select Variables rather than the usual Cases, in order to cluster variables.

SPSS calls hierarchical clustering the "Cluster procedure." In SPSS, select Analyze,

Classify, Hierarchical Cluster; select variables; select Cases in the Cluster group; click

Statistics, select Proximity Matrix; select Range of Solutions in the Cluster Membership

group, specify the number of clusters (typically 3 to 6); Continue; OK.


Q-2 What is K-means Cluster Analysis?

K-means cluster analysis. K-means cluster analysis uses Euclidean distance. The

researcher must specify in advance the desired number of clusters, K. Initial cluster

centers are chosen in a first pass of the data, then each additional iteration groups

observations based on nearest Euclidean distance to the mean of the cluster. Cluster

centers change at each pass. The process continues until cluster means do not shift more

than a given cut-off value or the iteration limit is reached.
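
A sketch of K-means clustering with scikit-learn as a stand-in for the SPSS Quick Cluster procedure (hypothetical data):

```python
# K-means sketch (hypothetical data; scikit-learn assumed in place of SPSS Quick Cluster).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (100, 3)),
               rng.normal(4, 1, (100, 3))])

km = KMeans(n_clusters=2, max_iter=10, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)    # final cluster centers: means on each clustering variable
print(km.labels_[:10])        # cluster membership for the first ten cases
print(km.n_iter_)             # iterations used before the centers stopped shifting
```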

Cluster centers are the average value on all clustering variables of each cluster's

members. The "Initial cluster centers," in spite of its title, gives the average value of each

variable for each cluster for the k well-spaced cases which SPSS selects for initialization

purposes when no initial file is supplied. The "Final cluster centers" table in SPSS output

gives the same thing for the last iteration step. The "Iteration history" table shows the

change in cluster centers when the usual iterative approach is taken. When the change

drops below a specified cutoff, the iterative process stops and cases are assigned to

clusters according to which cluster center they are nearest.

Large datasets are possible with K-means clustering, unlike hierarchical clustering,

because K-means clustering does not require prior computation of a proximity matrix of

the distance/similarity of every case with every other case.

Method. The default method is "Iterate and classify," under which an iterative process is

used to update cluster centers, then cases are classified based on the updated centers.

However, SPSS supports a "Classify only" method, under which cases are immediately

classified based on the initial cluster centers, which are not updated.

In K-means clustering a case may initially be assigned to a cluster, then reassigned to a different cluster as the algorithm unfolds.

However, in agglomerative K-means clustering, the solution is constrained to force a

given case to remain in its initial cluster.

SPSS: Analyze, Cluster, K-Means Cluster Analysis; enter variables in the Variables: area;

optionally, enter a variable in the "Label cases by:" area; enter "Number of clusters:";

choose Method ("Iterate and classify" or "Classify only"). Unlike hierarchical clustering,

there is no option for "Range of solutions"; instead you must re-run K-means clustering,

asking for a different number of clusters.

Iterate button. Optionally, you may press the Iterate button and set the number of

iterations and the convergence criterion. The default maximum number of iterations in

SPSS is 10. For the convergence criterion, by default, iterations terminate if the largest

change in any cluster center is less than 2% of the minimum distance between initial

centers (or if the maximum number of iterations has been reached). To override this

default, enter a positive number less than or equal to 1 in the convergence box. There is


also a "Use running means" checkbox which, if checked, will cause the clulster centers to

be updated after each case is classified, not the default, which is after the entire set of

cases is classified.

Save button: Optionally, you may press the Save button to save the final cluster number

of each case as an added column in your dataset (labeled QCL_1), and/or you may save

the Euclidean distance between each case and its cluster center (labeled QCL_2) by

checking "Distance from cluster center."

Options button: Optionally, you may press the Options button to select statistics or

missing values options. There are three statistics options: "Initial cluster centers" (gives

the initial variable means for each cluster); ANOVA table (ANOVA F tests for each variable, but as the F tests are only descriptive, the resulting probabilities are for

exploratory purposes only; nonetheless, non-significant variables might be dropped as not

contributing to the differentiation of clusters); and "Cluster information for each case"

(gives each case's final cluster assignment and the Euclidean distance between the case

and the cluster center; also gives the Euclidean distance between final cluster centers).

Getting different clusters. Sometimes the researcher wishes to experiment to get different

clusters, as when the "Number of cases in each cluster" table shows highly imbalanced

clusters and/or clusters with very few members. Different results may occur by setting

different initial cluster centers from file (see above), by changing the number of clusters

requested, or even by presenting the data file in different case order.

Two-step cluster analysis groups cases into pre-clusters which are treated as single cases.

Standard hierarchical clustering is then applied to the pre-clusters in the second step. This

is the method used when one or more of the variables are categorical (not interval or

dichotomous). Also, since it is a method requiring neither a proximity table like

hierarchical classification nor an iterative process like K-means clustering, but rather is a

one-pass-through-the-dataset method, it is recommended for very large datasets.

Cluster feature tree. The preclustering stage employs a CFtree with nodes leading to leaf

nodes. Cases start at the root node and are channeled toward nodes and eventually leaf

nodes which match it most closely. If there is no adequate match, the case is used to start

its own leaf node. It can happen that the CFtree fills up and cannot accept new leaf entries

in a node, in which case it is split using the most-distant pair in the node as seeds. If this

recursive process grows the CFtree beyond maximum size, the threshold distance is

increased and the tree is rebuilt, allowing new cases to be input. The process continues

until all the data are read. Click the Advanced button in the Options button dialog to set

threshold distances, maximum levels, and maximum branches per leaf node manually.

Proximity. When one or more of the variables are categorical, log-likelihood is the

distance measure used, with cases categorized under the cluster which is associated with

the largest log-likelihood. If variables are all continuous, Euclidean distance is used, with


cases categorized under the cluster which is associated with the smallest Euclidean

distance.

Number of clusters. By default SPSS determines the number of clusters using the change

in BIC (the Schwarz Bayesian Criterion): when the BIC change is small, it stops and selects

as many clusters as thus far created. It is also possible to have this done based on changes

in AIC (the Akaike Information Criterion), or simply to tell SPSS how many clusters

are wanted. The researcher can also ask for a range of solutions, such as 3-5 clusters. The

"Autoclustering statistics" table in SPSS output gives, for example, BIC and BIC change

for all solutions.
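
The SPSS two-step algorithm itself is not reproduced here, but the BIC-based choice of the number of clusters can be illustrated with an analogous (not equivalent) approach, a Gaussian mixture model in scikit-learn fitted to hypothetical data:

```python
# Choosing the number of clusters by BIC with a Gaussian mixture model (hypothetical data).
# This is only an analogue of the BIC criterion described above, not the SPSS TwoStep algorithm.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (100, 2)),
               rng.normal(5, 1, (100, 2)),
               rng.normal(10, 1, (100, 2))])

for k in range(1, 7):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gm.bic(X), 1))    # choose the k at which BIC stops improving appreciably
```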

SPSS. Choose Analyze, Classify, Two-Step Cluster; select your categorical and

continuous variables; if desired, click Plots and select the plots wanted; Click Output and

select the statistics wanted (descriptive statistics, cluster frequencies, AIC or BIC);

Continue

http://faculty.chass.ncsu.edu/garson/PA765/cluster.htm

www.cs.uu.nl/docs/vakken/arm/SPSS/spss8.pdf

Suggested Readings:

To Cluster Analysis, 2005


Factor Analysis

Q-1 What is Factor Analysis?

Ans:

• Factor analysis identifies a small number of underlying dimensions (factors) that account for the variables' shared variance.

• Factor Analysis should be driven by a researcher who has a deep and genuine

interest in relevant theory in order to get optimal value from choosing the right

type of factor analysis and interpreting the factor loadings.

• Factor analysis begins with a large number of variables and then tries to reduce the interrelationships amongst the variables to a smaller number of clusters or

factors.

• Factor analysis finds relationships or natural connections where variables are

maximally correlated with one another and minimally correlated with other

variables, and then groups the variables accordingly.

• After this process has been done many times, a pattern of relationships or factors that captures the essence of all of the data emerges.

• Summary: Factor analysis refers to a collection of statistical methods for reducing

correlational data into a smaller number of dimensions or factors.

Exploratory factor analysis (EFA) seeks to uncover the underlying structure of a relatively large set of variables. The researcher's à priori assumption is that any indicator

may be associated with any factor. This is the most common form of factor analysis.

There is no prior theory and one uses factor loadings to intuit the factor structure of the

data.

Confirmatory factor analysis (CFA) seeks to determine if the number of factors and the

loadings of measured (indicator) variables on them conform to what is expected on the

basis of pre-established theory. Indicator variables are selected on the basis of prior

theory and factor analysis is used to see if they load as predicted on the expected number

of factors. The researcher's à priori assumption is that each factor (the number and labels

of which may be specified à priori) is associated with a specified subset of indicator

variables. A minimum requirement of confirmatory factor analysis is that one

hypothesize beforehand the number of factors in the model, but usually also the

researcher will posit expectations about which variables will load on which factors (Kim

and Mueller, 1978b: 55). The researcher seeks to determine, for instance, if measures

created to represent a latent variable really belong together.

Factor loadings: The factor loadings, also called component loadings in PCA, are the

correlation coefficients between the variables (rows) and factors (columns). Analogous to


Pearson's r, the squared factor loading is the percent of variance in that variable explained

by the factor. To get the percent of variance in all the variables accounted for by each

factor, add the sum of the squared factor loadings for that factor (column) and divide by

the number of variables. (Note the number of variables equals the sum of their variances

as the variance of a standardized variable is 1.) This is the same as dividing the factor's

eigenvalue by the number of variables.

Communality, h2, is the squared multiple correlation for the variable as dependent using

the factors as predictors. The communality measures the percent of variance in a given

variable explained by all the factors jointly and may be interpreted as the reliability of the

indicator. When an indicator variable has a low communality, the factor model is not

working well for that indicator and possibly it should be removed from the model.

However, communalities must be interpreted in relation to the interpretability of the

factors. A communality of .75 seems high but is meaningless unless the factor on which

the variable is loaded is interpretable, though it usually will be. A communality of .25

seems low but may be meaningful if the item is contributing to a well-defined factor.

That is, what is critical is not the communality coefficient per se, but rather the extent to

which the item plays a role in the interpretation of the factor, though often this role is

greater when communality is high.

Eigenvalues: Also called characteristic roots. The eigenvalue for a given factor

measures the variance in all the variables which is accounted for by that factor. The ratio

of eigenvalues is the ratio of explanatory importance of the factors with respect to the

variables. If a factor has a low eigenvalue, then it is contributing little to the explanation

of variances in the variables and may be ignored as redundant with more important

factors.

Thus, eigenvalues measure the amount of variation in the total sample accounted for by

each factor. Note that the eigenvalue is not the percent of variance explained but rather a

measure of amount of variance in relation to total variance (since variables are

standardized to have means of 0 and variances of 1, total variance is equal to the number

of variables). SPSS will output a corresponding column titled '% of variance'. A factor's

eigenvalue may be computed as the sum of its squared factor loadings for all the

variables.
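
The relationships above (loadings, eigenvalues, percent of variance, and communalities) can be sketched with NumPy for a principal-components extraction from a correlation matrix (the data are hypothetical):

```python
# Loadings, eigenvalues, percent of variance, and communalities for a principal-components
# extraction from a correlation matrix (hypothetical data; NumPy only).
import numpy as np

rng = np.random.default_rng(8)
f = rng.normal(size=300)    # one underlying factor
X = np.column_stack([f + rng.normal(scale=0.5, size=300) for _ in range(4)]
                    + [rng.normal(size=300) for _ in range(2)])

R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                   # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs * np.sqrt(eigvals)               # variables (rows) by factors (columns)
print(eigvals)                                      # one eigenvalue per factor
print(eigvals / len(eigvals))                       # proportion of variance per factor
print((loadings[:, :2] ** 2).sum(axis=1))           # communalities under a two-factor model
```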

Q-2 What are the criteria for determining the number of factors, roughly in the

order of frequency of use in social science (see Dunteman, 1989: 22-3)?

Ans:

Kaiser criterion: A common rule of thumb for dropping the least important factors from

the analysis. The Kaiser rule is to drop all components with eigenvalues under 1.0. Kaiser

criterion is the default in SPSS and most computer programs.

Scree plot: The Cattell scree test plots the components as the X axis and the

corresponding eigenvalues as the Y axis. As one moves to the right, toward later

components, the eigenvalues drop. When the drop ceases and the curve makes an elbow

toward less steep decline, Cattell's scree test says to drop all further components after the


one starting the elbow. This rule is sometimes criticised for being amenable to researcher-

controlled "fudging." That is, as picking the "elbow" can be subjective because the curve

has multiple elbows or is a smooth curve, the researcher may be tempted to set the cut-off

at the number of factors desired by his or her research agenda. Even when "fudging" is

not a consideration, the scree criterion tends to result in more factors than the Kaiser

criterion.

Variance explained criteria: Some researchers simply use the rule of keeping enough

factors to account for 90% (sometimes 80%) of the variation. Where the researcher's goal

emphasizes parsimony (explaining variance with as few factors as possible), the criterion

could be as low as 50%.
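
A short sketch applying these retention criteria to an illustrative vector of eigenvalues (matplotlib assumed for the scree plot):

```python
# Applying the retention criteria above to an illustrative vector of eigenvalues.
import numpy as np
import matplotlib.pyplot as plt

eigvals = np.array([3.1, 1.4, 0.8, 0.4, 0.2, 0.1])

print((eigvals > 1.0).sum())                  # Kaiser criterion: keep factors with eigenvalue > 1
print(np.cumsum(eigvals) / eigvals.sum())     # cumulative proportion of variance explained

plt.plot(range(1, len(eigvals) + 1), eigvals, marker="o")   # scree plot: look for the elbow
plt.xlabel("Component")
plt.ylabel("Eigenvalue")
plt.show()
```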

Q-3 What are the different rotation methods used in factor analysis?

Ans:

No rotation is the default, but it is a good idea to select a rotation method, usually

varimax. The original, unrotated principal components solution maximizes the sum of

squared factor loadings, efficiently creating a set of factors which explain as much of the

variance in the original variables as possible. The amount explained is reflected in the

sum of the eigenvalues of all factors. However, unrotated solutions are hard to interpret

because variables tend to load on multiple factors.

Varimax rotation is an orthogonal rotation of the factor axes to maximize the variance

of the squared loadings of a factor (column) on all the variables (rows) in a factor matrix,

which has the effect of differentiating the original variables by extracted factor. Each

factor will tend to have either large or small loadings of any particular variable. A

varimax solution yields results which make it as easy as possible to identify each variable

with a single factor. This is the most common rotation option.
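
A compact varimax implementation in NumPy for illustration (the loading matrix A is hypothetical, and a statistics package's varimax may differ in normalization details):

```python
# A compact varimax rotation of a (variables x factors) loading matrix (NumPy only;
# the loading matrix A is hypothetical and package implementations may differ in detail).
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    p, k = loadings.shape
    R = np.eye(k)
    var = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag(np.sum(L ** 2, axis=0)))
        )
        R = u @ vt
        new_var = np.sum(s)
        if new_var < var * (1 + tol):    # stop when the criterion no longer improves
            break
        var = new_var
    return loadings @ R

A = np.array([[0.71, 0.40],
              [0.67, 0.45],
              [0.65, -0.45],
              [0.70, -0.42]])
print(np.round(varimax(A), 2))   # rotated loadings show a simpler structure (sign/order may vary)
```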

Quartimax rotation is an orthogonal alternative which minimizes the number of factors needed to explain each variable. This type of rotation often generates a general factor on

which most variables are loaded to a high or medium degree. Such a factor structure is

usually not helpful to the research purpose.

arbitrary "rules of thumb," in descending order of popularity, include those below. These

are not mutually exclusive: Bryant and Yarnold, for instance, endorse both STV and the

Rule of 200.

Rule of 10. There should be at least 10 cases for each item in the instrument being used.

STV ratio. The subjects-to-variables ratio should be no lower than 5 (Bryant and

Yarnold, 1995)


Rule of 100: The number of subjects should be the larger of 5 times the number of

variables, or 100. Even more subjects are needed when communalities are low and/or few

variables load on each factor. (Hatcher, 1994)

Rule of 150: Hutcheson and Sofroniou (1999) recommend at least 150 - 300 cases, more

toward the 150 end when there are a few highly correlated variables, as would be the case

when collapsing highly multicollinear variables.

The Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy indicates whether the data are likely to factor well, based on correlation and partial correlation. In the old days

of manual factor analysis, this was extremely useful. KMO can still be used, however, to

assess which variables to drop from the model because they are too multicollinear.

There is a KMO statistic for each individual variable, and their sum is the KMO overall

statistic. KMO varies from 0 to 1.0 and KMO overall should be .60 or higher to proceed

with factor analysis. If it is not, drop the indicator variables with the lowest individual

KMO statistic values, until KMO overall rises above .60.

To compute KMO overall, the numerator is the sum of squared correlations of all

variables in the analysis (except the 1.0 self-correlations of variables with themselves, of

course). The denominator is this same sum plus the sum of squared partial correlations of

each variable i with each variable j, controlling for others in the analysis. The concept is

that the partial correlations should not be very large if one is to expect distinct factors to

emerge from factor analysis.
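
A sketch of the KMO overall computation just described, using NumPy (hypothetical single-factor data; the partial correlations are obtained from the inverse of the correlation matrix):

```python
# KMO overall from squared correlations and squared partial correlations, as described above
# (hypothetical single-factor data; NumPy only).
import numpy as np

rng = np.random.default_rng(9)
f = rng.normal(size=500)
X = np.column_stack([f + rng.normal(scale=0.6, size=500) for _ in range(5)])

R = np.corrcoef(X, rowvar=False)
P = np.linalg.inv(R)                                  # precision (inverse correlation) matrix
d = np.sqrt(np.outer(np.diag(P), np.diag(P)))
partial = -P / d                                      # partial correlations, controlling for the rest

off = ~np.eye(R.shape[0], dtype=bool)                 # ignore the 1.0 self-correlations
kmo = np.sum(R[off] ** 2) / (np.sum(R[off] ** 2) + np.sum(partial[off] ** 2))
print(round(kmo, 3))                                  # compare against the .60 rule of thumb above
```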

In SPSS, KMO is found under Analyze - Statistics - Data Reduction - Factor - Variables

(input variables) - Descriptives - Correlation Matrix - check KMO and Bartlett's test of

sphericity and also check Anti-image - Continue - OK. The KMO output is KMO overall.

The diagonal elements on the Anti-image correlation matrix are the KMO individual

statistics for each variable.

http://faculty.chass.ncsu.edu/garson/PA765/factspss.htm

www.sussex.ac.uk/Users/andyf/factor.pdf

www.cs.uu.nl/docs/vakken/arm/SPSS/spss7.pdf

Concepts and Applications, 2004

