
Before analysing your data with a one-way ANOVA, part of the process involves checking that the data you want to analyse can actually be analysed with a one-way ANOVA.
A one-way ANOVA is an omnibus test statistic and cannot tell you which specific groups were significantly different from each other; it only tells you that at least two groups were different.
K-means clustering is a method to quickly cluster large data sets, which would typically take a while to compute with the otherwise preferred hierarchical cluster analysis.
Hierarchical cluster analysis takes time to calculate, but it generates a series of models with cluster solutions from 1 (all cases in one cluster) to n (each case is an individual cluster). Hierarchical cluster analysis also works with variables as opposed to cases; it can cluster variables together in a manner somewhat similar to factor analysis. In addition, hierarchical cluster analysis can handle nominal, ordinal, and scale data; however, it is not recommended to mix different levels of measurement.
Two-step cluster analysis is more of a tool than a single analysis. It identifies the groupings by running a pre-clustering first and then by hierarchical methods. Because it uses a quick cluster algorithm upfront, it can handle large data sets that would take a long time to compute with hierarchical cluster methods. In this respect, it combines the best of both approaches. Two-step clustering can also handle scale and ordinal data in the same model. Two-step cluster analysis also automatically selects the number of clusters, a task normally left to the researcher in the other two methods.
The output includes the proximity matrix (the distances calculated in the first step of the analysis) and the predicted cluster membership of the cases in our observations.
The dendrogram graphically shows how the clusters are merged and allows us to identify the appropriate number of clusters.
There are three large blocks of distance measures, for interval (scale), count (ordinal), and binary (nominal) data:
1 For scale data, the most common is the Squared Euclidean Distance. It is based on the Euclidean distance between two observations (the Euclidean distance is the square root of the sum of the squared differences on dimensions x and y).
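A minimal Python sketch of the two measures on a pair of made-up observations:

import numpy as np

x_a = np.array([2.0, 5.0])   # observation A on dimensions x and y
x_b = np.array([4.0, 1.0])   # observation B on dimensions x and y

squared_euclidean = np.sum((x_a - x_b) ** 2)   # sum of squared differences: 20.0
euclidean = np.sqrt(squared_euclidean)         # square root of that sum: about 4.47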

Between-groups linkage (the distance between clusters is the average distance of all pairs of data points, one from each cluster). Between-groups linkage works with both cluster types.
a nearest neighbor (single linkage: the distance between clusters is the smallest distance between two data points). Single linkage works best with long chains of clusters.
b furthest neighbor (complete linkage: the distance is the largest distance between two data points). Complete linkage works best with dense blobs of clusters.
c Ward's method (the distance is the distance of all clusters to the grand average of the sample).
The usual recommendation is to use single linkage first. Although single linkage tends to create chains of clusters, it helps in identifying outliers. After excluding these outliers, we can move on to Ward's method. Ward's method uses the F value (as in an ANOVA) to maximize the significance of differences between clusters, which gives it the highest statistical power of all methods. The downside is that it is prone to outliers and creates small clusters.
A last consideration is standardization. If the variables have different scales and means, we might want to standardize, either to Z scores or just by centering the scale.
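A minimal Python sketch of both options (hypothetical column names):

import pandas as pd

df = pd.DataFrame({"income": [20000, 55000, 90000], "age": [25, 40, 62]})

z_scores = (df - df.mean()) / df.std()   # full standardization to Z scores
centered = df - df.mean()                # only centre the scale, keep the original spread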
a. Cross-sectional: data are collected from a cross-section (snapshot) of the population.
b. Longitudinal: data are collected at many different moments in time for a small sample of subjects.
c. Panel: data are collected at different moments in time for a large sample of subjects.
Slope dummy: if the coefficient for TIME*January is positive, say, then the seasonal swing for January gets larger with time.
If the female (intercept) dummy is negative but non-significant, then males and females do not show a differential; however, if a slope dummy is included and its coefficient is significant and negative, this shows that there is a differential in slope.
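A minimal sketch of this idea in Python (statsmodels), with simulated data and hypothetical variable names: 'female' plays the role of the intercept dummy and 'female:time' of the slope dummy:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"time": np.arange(n), "female": rng.integers(0, 2, n)})
df["y"] = 1.0 + 0.05 * df["time"] - 0.02 * df["female"] * df["time"] + rng.normal(0, 1, n)

# 'female' is the intercept dummy; 'female:time' is the slope dummy (interaction)
model = smf.ols("y ~ time + female + female:time", data=df).fit()
print(model.params)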

SOCIAL DESIRABILITY BIAS: social desirability bias refers to the phenomenon in which respondents in a survey do not answer entirely truthfully because they are influenced by the way they think one should answer, which in turn depends on the attitudes and characteristics of their peers or reference persons.
An outlier is an observation that lies far away from the central tendency of the distribution. Hence, we tend to classify an observation as an outlier if its value is very far from the mean or the median. Typical rules of thumb are +/- 2.5 or +/- 3.5 standard deviations away from the mean. We can assess this very easily with the help of a box plot.
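A minimal Python sketch of this rule of thumb on made-up data:

import numpy as np

rng = np.random.default_rng(1)
x = np.append(rng.normal(50, 5, 100), 95.0)   # 100 ordinary values plus one extreme value

z = (x - x.mean()) / x.std()                  # distance from the mean in standard deviations
outliers = x[np.abs(z) > 2.5]                 # use 3.5 for the stricter rule of thumb
print(outliers)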
Secondary data benefits: Cheap, easy to get hold of, exposed to strict quality checks, so reliability is high
Dependency models imply that we are imposing a causal direction between two variables. For instance, if the assumptions of OLS hold, then we say that a one-unit change in X LEADS to a change in the dependent variable. In an inter-dependency model we simply make statements based on correlations. For instance, a positive correlation between two variables says that they go together, but not which one may have an influence on the other.
An anchored scale means that we attach descriptions to the different categories in the scale.
The purpose is to make it clearer for the respondent what we mean by the question, and to improve comparability, i.e. to try to impose a meaning on the responses that is common to all respondents.
Omitted Variable Bias (OVB) is the bias in the estimators of the regression coefficients caused by omitting relevant variables.
Omitted relevant variables are variables correlated both with the dependent variable and with the independent variables included in the regression.
Incorrect omission of relevant variables leads to biased estimates of the parameters that are included; it also means that SSE will increase and R2 will decrease, making the model a worse fit. Incorrect inclusion of irrelevant variables (not correlated with Y) only produces inefficient estimates.
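A minimal simulation sketch in Python (statsmodels, made-up coefficients) of the bias from omitting a relevant variable that is correlated with an included regressor:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)             # x2 is correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
restricted = sm.OLS(y, sm.add_constant(x1)).fit()          # x2 incorrectly omitted

print(full.params[1])         # close to the true value 2.0
print(restricted.params[1])   # biased: roughly 2.0 + 3.0 * 0.8 = 4.4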
HOW TO CHECK ENDOGENEITY:
Create a new dependent variable which consists of the residuals from the original regression. We then regress the residuals on the original explanatory variables. If there is no correlation between the error term and the explanatory variables, then none of the coefficients in this regression should be significant.

QUESTIONS AND ANSWERS:
R2 AND ADJUSTED R2
Yes, the residuals are normal because their observed cumulative probabilities fall on the 45-degree line in the P-P plot. (If the points lie above or below the 45-degree line, the residuals are non-normal and the OLS estimators are not normally distributed in small samples.)

What is the net effect of adding a new variable?
R2 never decreases when a new X variable is added to the model, even if the new variable is not an important predictor variable. This can be a disadvantage when comparing models.
1) We lose one degree of freedom when a new X variable is added.
2) Did the new X variable add enough explanatory power to offset the loss of one degree of freedom?
IF YOU EXCLUDE A SIGNIFICANT VARIABLE (P-VALUE < ALPHA) FROM A MODEL:
1 GOODNESS OF FIT DECREASES.
2 IF THAT VARIABLE IS POSITIVELY CORRELATED WITH THE OTHER VARIABLES (THAT IS, THEY INCREASE/DECREASE TOGETHER), THEN EXCLUDING IT WILL MAKE THE REMAINING VARIABLES (I.E. THE OTHER COVARIATES) CAPTURE ITS EFFECT TOO. THUS THEIR COEFFICIENTS WILL INCREASE, AND SO MAY THEIR SIGNIFICANCE.
Adjusted R2 can be used to compare models that include a different number of independent variables. It is adjusted by (takes into account) the degrees of freedom.
Adjusted R2 provides a better comparison between multiple regression models with:
a) different numbers of independent variables;
b) it penalizes excessive use of unimportant independent variables;
c) it is always smaller than (or equal to) R2.
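A minimal Python sketch of the adjustment, using the standard formula adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), with illustrative numbers:

def adjusted_r2(r2, n, k):
    """n = number of observations, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(adjusted_r2(0.70, n=100, k=3))    # about 0.69
print(adjusted_r2(0.70, n=100, k=20))   # about 0.62: penalized for the extra regressors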
MULTICOLLINEARITY
MULTICOLLINEARITY: if none of the variables can be written as a linear combination of the others, then there is no multicollinearity. Signs of multicollinearity:
- Unexpected sign and lack of significance of one or more estimated parameters
- High correlation between two or more explanatory variables
- VIF larger than 10
Strategy: eliminate one of the collinear variables from the regression and re-estimate; build a synthetic variable; or retain all the collinear variables if only prediction is of interest. (1 point for a suggestion of strategy.)
PROBLEMS WITH MULTICOLLINEARITY:
a It will be difficult to separately identify the influence of the variables that are highly correlated (redundant information).
b As a result, standard errors will be large and t-statistics small (unstable coefficients, high variance of the coefficients).
c Coefficient signs may not match prior expectations.
COMPLETE MODEL: all variables are considered.
RESTRICTED MODEL: one (or more) variables are constrained to have a zero coefficient.
PREDICTIONS:
Predictions are more reliable for values of the covariates included in the observed data range; i.e., for x such that min(X) < x < max(X).
REGRESSION ASSUMPTIONS
A The constant is the intercept of the line that interpolates the data in the XY space.
B The slope parameter is the slope of that line (the tangent of the angle of the line with the horizontal axis).
The assumptions on which the linear model is based are the following:
1. The error terms e_i are random variables with mean 0: the expected value of Y is a linear function of the independent variables.
2. Cov(e_i, x_j) = 0 for all i, j: the values x_j and the error terms e_i are not correlated.
3. Cov(e_i, e_j) = 0 for all i ≠ j: the random error terms e_i are not correlated with one another.
4. Var(e_i) = σ² for all i: the variance of the error terms is constant (homoscedasticity).
5. STRONG ASSUMPTION: the e_i are normally distributed.
The properties of the OLS estimators deriving from these assumptions are:
- Linearity: they are a linear function of the y_i.
- Unbiasedness: E(b_j) = β_j.
- Efficiency: smallest variance (in the class of linear unbiased estimators); that is, the OLS estimators are the most efficient.
Gauss-Markov theorem:
The OLS estimators b_j are the best linear unbiased estimators (BLUE) for β_j.
For linearity and unbiasedness only assumptions 1 and 2 are required; for efficiency, assumptions 3 and 4 must also be valid.
If assumption 4 is violated (heteroscedasticity), the OLS estimators remain unbiased but they are no longer the estimators with the smallest variance (they are not the most efficient).
If assumption 5 is also added, then the OLS estimators are normally distributed. By the central limit theorem, even if the errors are not perfectly normal but the sample size is large, the OLS estimators are approximately normal.

Cov(e_i, e_j) = 0 means that the random error terms are not correlated with one another.
We can also say that there is a lack of serial (or spatial) correlation among the errors.
If there is correlation among the error terms, the standard errors of the estimates will be biased and, consequently, the OLS estimators are not efficient (BLUE): they do not have the smallest variance in the class of linear unbiased estimators.
CONFIDENCE INTERVAL
A confidence interval around a parameter is an interval estimate of a population parameter.
As an alternative to estimating the parameter by a single value, we can build an interval that is likely to include the parameter itself. Thus, confidence intervals are used to indicate the reliability of an estimate, and how likely the interval is to contain the parameter is determined by the confidence level (e.g. 95%, 99%, etc.).
The formula to compute the CI for a parameter is: b_j ± t_(α/2) * SE(b_j), where t has (n-k-1) degrees of freedom. If the CI does not include 0, the parameter is significantly different from zero.
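A minimal Python sketch of such an interval, with made-up values for the coefficient, its standard error, and the sample size:

from scipy import stats

b_j, se_bj = 1.8, 0.4         # estimated coefficient and its standard error (made up)
n, k, alpha = 50, 3, 0.05     # sample size, number of regressors, significance level

t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
ci = (b_j - t_crit * se_bj, b_j + t_crit * se_bj)
print(ci)   # roughly (1.0, 2.6): it does not include 0, so the coefficient is significant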
VIF
VIF measures how much the variance of the estimated regression coefficients is inflated compared to when the independent variables are not linearly related.
The higher the linear relationship between a given covariate and the others, the higher the VIF and the more serious the multicollinearity.
VIF_j = 1/(1 - R²_j), where R²_j is the R² of the regression of X_j on the other covariates.
VIF is not appropriate for dummy variables.
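A minimal Python sketch (simulated data, hypothetical column names) that computes each VIF_j by regressing X_j on the other covariates:

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame({"x1": rng.normal(size=200)})
X["x2"] = 0.9 * X["x1"] + 0.1 * rng.normal(size=200)   # nearly collinear with x1
X["x3"] = rng.normal(size=200)

for col in X.columns:
    others = sm.add_constant(X.drop(columns=col))
    r2_j = sm.OLS(X[col], others).fit().rsquared
    print(col, 1 / (1 - r2_j))   # x1 and x2 have large VIFs; x3 is close to 1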
ANOVA

H0: μ1 = μ2 = μ3 = μ4. H1: there is at least one pair (i, j), with i ≠ j, for which μi ≠ μj.


It is a dependency model. The main aim of one-way ANOVA is to evaluate the difference among the means of three or more groups (the sampled groups are thought of as samples coming from different populations).
The variables involved in the analysis are: one qualitative independent variable (X: employees' location), which gives rise to 3 or more groups (3 or more treatment levels or classifications), THAT IS 1 COVARIATE AND K LEVELS; and one quantitative dependent variable (Y: grades).
The variability of the data is a key factor in testing the equality of means. Not only the differences in the average level of the outcome across groups are important for finding significant differences among groups, but also the variation within groups (SSW): the higher this variability, the less likely it is to find significant differences between groups (AS SSW INCREASES, VARIATION BETWEEN GROUPS TENDS TO EQUALISE, and confidence intervals for group averages will overlap!).
Communities=GROUPS/CLASSIFICATIONS/LEVELS
The F-statistic is used to test the hypothesis that all the mean ages in different communities
are equal (H0) vs. the hypothesis that for at least two communities mean ages are different
(H1).
The assumptions behind the ANOVA are:
A Populations are normally distributed.
B Populations have equal variances.
C Samples are randomly and independently drawn from the different populations.
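A minimal Python sketch of a one-way ANOVA on three illustrative groups, using scipy:

from scipy import stats

group_a = [23, 25, 22, 27, 26]   # illustrative outcome values for three groups
group_b = [30, 31, 29, 33, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)   # a small p-value: at least two group means differ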
FACTOR ANALYSIS
The KMO MSA for the data set as a whole is the ratio of the sum of all squared correlations in the sample to the sum of all squared correlations plus the sum of all squared partial correlations.
DESCRIPTIVE ANALYSIS: MISSING VALUES
WHAT TO DO:
1 Omit the case from all analyses (listwise deletion). But we lose all information about that case!
2 Omit the case from just those (parts of) analyses that use the variable(s) for which values are missing (pairwise deletion).
3 A good option, but not always technically possible, is to impute (estimate) missing values. Techniques include:
a mean substitution
b estimation by regression
Imputation provides a full set of data, but results may be invalid if our estimates are wrong. Results of analyses with and without imputation should be compared.
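A minimal pandas sketch of the three options on a made-up data set:

import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "income": [30000, 45000, np.nan, 52000]})

listwise = df.dropna()                          # 1) omit the case from all analyses
age_mean_pairwise = df["age"].dropna().mean()   # 2) use only the cases available for this variable
imputed = df.fillna(df.mean())                  # 3a) mean substitution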

The communality of a variable is the proportion of that variable's variance that is in common with the other variables and can therefore be explained by the factors. It can be defined as the sum of the squared factor loadings for the variable.
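A minimal Python sketch with an illustrative loading matrix (4 variables, 2 factors):

import numpy as np

loadings = np.array([[0.80, 0.10],    # rows = variables, columns = factors
                     [0.75, 0.20],
                     [0.15, 0.85],
                     [0.05, 0.70]])

communalities = (loadings ** 2).sum(axis=1)
print(communalities)   # first variable: 0.80**2 + 0.10**2 = 0.65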
2 PFA AND PCA:
The initial communalities under principal axis factoring (PAF) aim to describe the share of common variance between each variable and all the remaining ones, and thus they are numbers between 0 and 1 (PAF estimates the initial communality of each variable by means of the R-square of the regression of that variable on all the other variables). This happens because this method is not interested in explaining the whole variance in the data, but only the common variance between the variables. Apart from that, data summarization is the primary concern: you want to find latent dimensions in the data, and the researcher has no prior knowledge about the amount of common variance.
Instead, the initial communalities under PCA are all equal to one, because now the analysis aims to reduce the data dimension with the goal of explaining as much as possible of the whole variability in the data, not only the variability in common between the variables. In addition, prior knowledge suggests that unique variance is a relatively small proportion of the total variance.
Number of factors to keep, chosen by:
a An initial eigenvalue bigger than one, which indicates that the factor accounts for more variability than a single variable.
b A scree plot, which plots the eigenvalues against the factor number. The optimal number of factors is the one that is followed by an almost flat line. The scree plot is a graphical representation of the eigenvalues associated with the factors.
As we know, we should choose as many factors as there are factors having eigenvalues above 1, so in this case we choose x factors.
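A minimal Python sketch of the eigenvalue-above-one rule, on simulated data built around a single latent dimension:

import numpy as np

rng = np.random.default_rng(3)
latent = rng.normal(size=(300, 1))                       # one common latent dimension
data = np.hstack([latent + 0.5 * rng.normal(size=(300, 1)) for _ in range(5)])

eigenvalues = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]
n_factors = int((eigenvalues > 1).sum())
print(eigenvalues.round(2), n_factors)   # one dominant eigenvalue, so keep one factor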
The aim of FA is to reduce the dimension of the data to a smaller set of variables, keeping as large as possible their capacity to explain most of the total variance in the initial data. In this case, the result is very good, because (above 60-70)% of the total variance is explained by two factors, compared to the ten initial variables.

Indeed, rotation is a mathematical transformation that doesn't alter the factors or the quantity of variability that they explain. Rather, by distributing the variance that they explain more evenly among the variables, rotation makes interpretation of the factors more straightforward. Different kinds of rotation are possible. For instance, varimax rotation gives totally uncorrelated factors, while oblique rotation gives correlated factors.
Direct Oblimin is a method for oblique rotation. This is not an orthogonal rotation of the factors, therefore the factors are no longer uncorrelated. If the factor correlation is low, I would be confident enough in using the two factors as independent variables in a regression model.
CLUSTER ANALYSIS:
The K-means cluster analysis assumes that the number of clusters is known a priori. The algorithm starts by partitioning the data into the given number of clusters and then keeps reassigning the observations to clusters until some criterion is met. The hierarchical cluster analysis (in hierarchical cluster analysis, we must run the analysis first, before we select the number of clusters) doesn't start with a given number of clusters, but rather at each step agglomerates (or divides) two observations that are close enough (or too distant). Closeness (or distance) is defined according to a proximity (or dissimilarity) matrix, which is computed using a pre-defined distance measure between each pair of observations. Finally, the Two-Step cluster analysis can be used when we have a large data set, with many possible clustering variables, and we have no idea about the number of clusters in the population. The two-step procedure starts by forming little pre-clusters to reduce the size of the data matrix and then continues by performing a hierarchical clustering on the pre-clusters. I could possibly use either a hierarchical cluster analysis or a Two-Step cluster analysis, since I have no idea about the number of groups in the population.
The agglomeration schedule informs us about how clusters are being joined together, starting with as many clusters as countries, and ending up with just one cluster to which all countries belong. Clusters close together get joined first and clusters far apart (large coefficient) get joined last.
If a variable is binary, one cannot use Euclidean distances for the proximity matrix. One solution would be to simply perform the analysis excluding the binary variable. The second issue is that, apart from the binary variable, the remaining three are not on the same scale. The solution would be to standardize the variables.
DISTANCE defines similarity or dissimilarity between cases. In cluster analysis, similarity is evaluated by means of a distance measure, informing on the proximity of cases to one another on the basis of the selected variables.
A Squared Euclidean distance (the default in SPSS): the squared Euclidean distance between two units is given by the sum of the squared differences of the values taken by the variables on the two units.

B Euclidean distance: if it is important to interpret the distance between cases and clusters (rare). The Euclidean distance between two cases is sqrt((X_A - X_B)² + (Y_A - Y_B)²).
C City block distance: if we want to bring outliers into clusters (it dampens the effect of outliers). The city block distance between two units is the sum of the absolute values of the differences between the values taken by the variables.
A proximity matrix, P, is an m by m matrix containing all the pairwise dissimilarities or similarities between the objects being considered. The higher the value, the more dissimilar the two cases are; the lower the value, the more similar they are.
A proximity matrix provides a mathematical accounting of the similarity or dissimilarity between cases. We usually index similarity in terms of distance, which refers to the variation, often standardized, between two cases in some k-dimensional space. For example, if we have two variables that we are using to cluster, we could use either a Euclidean or squared Euclidean distance to see how far apart cases are on the X and Y axes defined by the variables.
Hierarchical Cluster Procedures (ANALYSIS)
a Agglomerative methods. Start by forming a cluster for each case. Then join the two most similar clusters. Repeat until only one cluster remains. Note that the procedure works in such a way that results at an earlier stage are always nested within results at a later stage, creating a tree-like structure.
b Divisive methods. Start with all cases forming a unique cluster and divide it into two clusters, then three, then four, and so forth until each cluster is a single-member cluster.
Hierarchical cluster analysis reports only cluster membership, not the characteristics of the cases in each cluster.
CLUSTERING ALGORITHMS/METHODS: measure similarity between clusters.
Clustering algorithms are:
1) Single linkage (nearest neighbor method): defines the distance between two clusters as the shortest distance from any case in one cluster to any case in the other cluster.
2) Complete linkage (farthest neighbor method): defines the distance between two clusters as the greatest distance from any case in one cluster to any case in the other cluster.
3) (Between-groups)/Average linkage (the default in SPSS): defines the distance between two clusters as the average distance of all individuals in one cluster with all individuals in the other cluster.
4) Centroid method: defines the distance between two clusters as the distance between the centroids of the two clusters. Note that every time a cluster is formed, the new centroid is estimated.
5) Ward's method: at each stage, joins the pair of clusters whose merger results in the smallest increase in the overall sum of the within-cluster squared distances (it minimizes within-group variance). Tends to create a relatively large number of small clusters.


If all variables are metric: A) use Ward's method if expecting clusters of similar size; B) use between-groups average linkage (the SPSS default) if some clusters might be small.
In practice, select Squared Euclidean distance for metric data unless you have a good reason to choose a different measure.
In practice, standardize unless the variables are measured on the same scale.
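A minimal Python sketch (simulated data; the choice of 3 clusters is illustrative) of hierarchical clustering with Ward's method on standardized data, using scipy:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.stats import zscore

rng = np.random.default_rng(7)
data = np.vstack([rng.normal(loc, 0.5, size=(20, 2)) for loc in (0, 3, 6)])   # three blobs

z = zscore(data)                                 # standardize if scales differ
merge_tree = linkage(z, method="ward")           # the agglomeration schedule
labels = fcluster(merge_tree, t=3, criterion="maxclust")   # predicted cluster membership
# dendrogram(merge_tree) would draw the tree of merges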
