Check that the data you want to analyse can actually be analysed using a one-way ANOVA.
A one-way ANOVA is an omnibus test and cannot tell you which specific groups were
significantly different from each other; it only tells you that at least two groups were different.
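As a sketch (toy data, hypothetical group values), the omnibus F-test can be run with SciPy; a significant result still requires a post-hoc test (e.g. Tukey's HSD) to locate which pairs of groups differ:

```python
# Toy example: one-way ANOVA is an omnibus test; a significant F only
# says that at least two group means differ, not which pair.
from scipy import stats

group_a = [23, 25, 21, 22, 24]
group_b = [30, 31, 29, 32, 28]
group_c = [24, 22, 23, 25, 21]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)
# A significant p-value here would still require a post-hoc test
# (e.g. Tukey's HSD) to identify WHICH groups differ.
```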
K-means clustering is a method for quickly clustering large data sets, which would typically
take a long time to compute with the otherwise preferred hierarchical cluster analysis.
Hierarchical cluster analysis takes time to calculate, but it generates a series of models with cluster
solutions from 1 (all cases in one cluster) to n (each case is an individual cluster). Hierarchical
clustering also works with variables as well as cases; it can cluster variables together in a manner
somewhat similar to factor analysis. In addition, hierarchical cluster analysis can handle nominal,
ordinal, and scale data; however, it is not recommended to mix different levels of measurement.
Two-step cluster analysis is more of a tool than a single analysis. It identifies the groupings by running
a pre-clustering first and then hierarchical methods. Because it uses a quick cluster algorithm upfront, it
can handle large data sets that would take a long time to compute with hierarchical cluster methods. In
this respect, it combines the best of both approaches. Two-step clustering can also handle scale and
ordinal data in the same model, and it automatically selects the number of
clusters, a task normally assigned to the researcher in the two other methods.
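A rough sketch of the two-step idea on assumed toy data (SPSS's actual algorithm differs in detail): pre-cluster quickly with k-means, then run a hierarchical method on the resulting centres.

```python
# Sketch: approximate the two-step idea by pre-clustering with a fast
# k-means pass, then hierarchically clustering the sub-cluster centres.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),
               rng.normal(5, 1, (200, 2)),
               rng.normal(10, 1, (200, 2))])   # three toy blobs

# Step 1: quick pre-clustering into many small sub-clusters.
pre = KMeans(n_clusters=30, n_init=10, random_state=0).fit(X)

# Step 2: hierarchical clustering of the 30 centres into final clusters.
final = AgglomerativeClustering(n_clusters=3).fit(pre.cluster_centers_)

# Map each case to its final cluster via its pre-cluster.
labels = final.labels_[pre.labels_]
print(len(set(labels)))
```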
The output includes the proximity matrix (the distances calculated in the first step of the analysis) and the
predicted cluster membership of the cases in our observations.
The dendrogram graphically shows how the clusters are merged and allows us to identify the
appropriate number of clusters.
Three large blocks of distance measures exist, for interval (scale), count (ordinal), and binary (nominal) data:
1 For scale data, the most common is the squared Euclidean distance. It is based on the Euclidean
distance between two observations (the square root of the sum of the squared differences on dimensions
x and y); the squared version simply omits the square root.
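A minimal numeric illustration with two toy observations:

```python
# Euclidean vs squared Euclidean distance between two observations
# measured on two scale variables (toy values).
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(9 + 16) = 5.0
squared_euclidean = np.sum((a - b) ** 2)    # 9 + 16 = 25.0
print(euclidean, squared_euclidean)
```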
Between-groups linkage (the distance between clusters is the average distance of all pairs of data points
across the two clusters). Between-groups linkage works with both cluster types.
a Nearest neighbour (single linkage: the distance between clusters is the smallest distance between two
data points). Single linkage works best with long chains of clusters.
b Furthest neighbour (complete linkage: the distance is the largest distance between two data
points). Complete linkage works best with dense blobs of clusters.
c Ward's method (the distance is the distance of all clusters to the grand average of the
sample). Ward's method works best when clusters of similar size are expected.
The usual recommendation is to use single linkage first. Although single linkage tends to create chains
of clusters, it helps in identifying outliers. After excluding these outliers, we can move on to Ward's
method. Ward's method uses the F value (as in an ANOVA) to maximize the significance of
differences between clusters, which gives it the highest statistical power of all the methods. The downside is
that it is prone to outliers and tends to create small clusters.
A last consideration is standardization. If the variables have different scales and means, we might want to
standardize, either to z-scores or just by centering the scale.
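A minimal sketch of the two options on a toy variable:

```python
# Centering removes the mean only; z-scoring also rescales to SD 1.
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])

centered = x - x.mean()                 # centering only: mean becomes 0
z_scores = (x - x.mean()) / x.std()     # z-scores: mean 0, SD 1
print(centered.mean(), z_scores.std())
```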
a. Cross-sectional: Data are collected from a cross-section (snapshot) of the population
b. Longitudinal: Data are collected at many different moments in time for a small sample of subjects
c. Panel: Data are collected at different moments in time for a large sample of subjects
Slope dummy: if the coefficient for TIME*January is positive, say, then it means that the seasonal swing
for January gets larger with time.
If the female dummy is negative but non-significant, then males and females do not show a
differential; however, if a slope dummy is included and its coefficient is significant and negative, then
this shows that there is a differential in slope between the two groups.
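A hedged simulation of the TIME*January slope dummy (all coefficient values below are made up for illustration), recovering the interaction coefficient with ordinary least squares:

```python
# Simulate monthly data where the January effect grows over time, then
# recover the coefficients with OLS via numpy's least squares.
import numpy as np

rng = np.random.default_rng(1)
time = np.arange(120)                       # 10 years of monthly data
january = (time % 12 == 0).astype(float)    # January dummy

# True (assumed) model: y = 2 + 0.1*TIME + 3*January + 0.05*TIME*January + noise
y = (2.0 + 0.1 * time + 3.0 * january + 0.05 * time * january
     + rng.normal(0, 0.1, 120))

X = np.column_stack([np.ones(120), time, january, time * january])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 2))
# A positive coefficient on TIME*January (beta[3]) means the January
# seasonal swing gets larger with time.
```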
SOCIAL DESIRABILITY BIAS: Social desirability bias refers to the phenomenon in which respondents in
a survey do not answer entirely truthfully because they are influenced by the way they think one should
answer, which in turn depends on the attitudes and characteristics of their peers or reference persons.
An outlier is an observation that lies far away from the central tendency of the distribution. Hence, we
tend to classify an observation as an outlier if its value is very far from the mean or the median. Typical rules of
thumb are +/-2.5 or +/-3.5 S.D. away from the mean. We can assess this very easily with the help
of a box plot.
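A minimal sketch of the +/-2.5 S.D. rule on toy data:

```python
# Flag observations more than 2.5 sample SDs from the mean (toy data;
# the value 30.0 is planted as an obvious outlier).
import numpy as np

x = np.concatenate([10.0 + np.linspace(-1, 1, 19), [30.0]])
z = (x - x.mean()) / x.std(ddof=1)
outliers = np.abs(z) > 2.5
print(x[outliers])
```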
Secondary data benefits: Cheap, easy to get hold of, exposed to strict quality checks, so reliability is high
Dependency models imply that we are imposing a causal direction between two variables. For instance,
if the assumptions of OLS hold, then we say that a one-unit change in X LEADS to a change in the
dependent variable. In an inter-dependency model we simply make statements based on correlations. For
instance, a positive correlation between two variables says that they go together, but not which one may
have an influence on the other.
An anchored scale means that we attach descriptions to the different categories in the scale.
The purpose is to make it clearer for the respondent what we mean by the question, and to improve
comparability, i.e. to try to impose a meaning on the responses that is common to all respondents.
Omitted Variable Bias (OVB) is the bias of the estimators of regression coefficients caused by omitting
relevant variables
Omitted relevant variables are variables correlated both with the dependent variable and with the
independent variables included in the regression.
Incorrect omission of relevant variables leads to biased estimates of the parameters that are included;
incorrect inclusion of irrelevant variables (not correlated with Y) only produces inefficient
estimates: the standard errors are inflated and the adjusted R2 will fall, making the model fit look worse.
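A small simulation (with assumed toy coefficients) illustrating the omitted-variable bias described above:

```python
# Omitting a variable z that is correlated with both x and y biases
# the estimated coefficient on x.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)          # x correlated with z
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

X_full = np.column_stack([np.ones(n), x, z])    # correct specification
X_short = np.column_stack([np.ones(n), x])      # z omitted

b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)
print(b_full[1], b_short[1])
# b_full[1] is close to the true value 1.0; b_short[1] is pushed
# upward because the omitted z loads onto x.
```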
HOW TO CHECK ENDOGENEITY:
Create a new dependent variable consisting of the residuals from the original regression, then
regress these residuals on the original explanatory variables. If there is no correlation between the error term
and the explanatory variables, none of the coefficients in this regression should be significant.
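A numerical sketch of this check on simulated data; note that OLS residuals are orthogonal to the included regressors by construction, so these coefficients come out essentially zero:

```python
# Regress the OLS residuals on the original regressors; the fitted
# coefficients are (near) zero by the orthogonality of OLS residuals.
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

gamma, *_ = np.linalg.lstsq(X, residuals, rcond=None)
print(np.round(gamma, 10))
```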
QUESTIONS AND ANSWERS:
R2 AND ADJUSTED R2
Yes, the residuals are normal because their observed cumulative probabilities fall
on the 45-degree line in the P-P plot. (Points above or below the 45-degree line indicate
non-normality, and the OLS estimators are then not BLUE.)
Cov(ei, ej) = 0 means that the random error terms are not correlated with one another.
We can also say that there is a lack of serial (or spatial) correlation among the errors.
If there is correlation among the error terms, the standard errors of the estimates will be biased and,
consequently, the OLS estimators are not BLUE, i.e. not efficient (they do not have the smallest variance
in the class of linear unbiased estimators).
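Serial correlation among errors is often screened with the Durbin-Watson statistic; a minimal sketch on toy error series follows (values near 2 indicate no serial correlation, values near 0 strong positive correlation):

```python
# Durbin-Watson statistic: sum of squared successive differences of the
# residuals divided by their sum of squares.
import numpy as np

def durbin_watson(e):
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(4)
white = rng.normal(size=1000)     # uncorrelated errors: DW near 2
trended = np.cumsum(white)        # strongly autocorrelated: DW near 0
print(durbin_watson(white), durbin_watson(trended))
```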
CONFIDENCE INTERVAL
Communality: the share of a variable's variance explained by the
factors. It can be defined as the sum of squared factor loadings for that variable.
2 PFA AND PCA:
The initial communalities under principal axis factoring (PAF) aim to describe the share of common
variance between each variable and all the remaining ones, so they are numbers between 0 and 1
(PAF estimates this variance for each variable by means of the R-square of the regression of the
variable on all the other variables). This happens because the method is not interested in explaining
the whole variance in the data, but only the common variance between the variables. Apart from that,
data summarization is the primary concern: you want to find latent dimensions in the data, and the
researcher has no prior knowledge about the amount of common variance.
The initial communalities under PCA, instead, are all equal to one, because this analysis aims to
reduce the data dimension with the goal of explaining as much as possible of the whole variability in
the data, not only the part the variables have in common; in addition, prior knowledge suggests that
unique variance is a relatively small proportion of the variance.
No. of factors to keep is chosen by:
a An initial eigenvalue bigger than one, which indicates that the factor accounts for more variability
than a single variable.
b A scree plot, which plots the eigenvalues against the factor number; the scree plot is a graphical
representation of the eigenvalues associated with the factors, and the optimal number of factors is
the one that is followed by an almost flat line.
As we know, we should choose as many factors as there are factors with eigenvalues above 1, so in
this case we choose x factors.
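A sketch of the eigenvalue-above-one (Kaiser) rule on simulated data with two latent factors:

```python
# Keep the factors whose eigenvalue of the correlation matrix exceeds 1;
# the toy data are built from two latent factors plus noise.
import numpy as np

rng = np.random.default_rng(5)
n = 500
f1 = rng.normal(size=n)     # latent factor 1
f2 = rng.normal(size=n)     # latent factor 2
X = np.column_stack([f1 + 0.3 * rng.normal(size=n) for _ in range(3)] +
                    [f2 + 0.3 * rng.normal(size=n) for _ in range(3)])

corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(corr))[::-1]
n_factors = int(np.sum(eigenvalues > 1))
print(n_factors)
```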
The aim of FA is to reduce the dimension of the data to a smaller set of variables while keeping as
much as possible of their capacity to explain the total variance in the initial data. In this case the
result is very good, because above 60-70% of the total variance is explained by two factors, compared
with the ten initial variables.
c City block distance: used if we want to bring outliers into clusters (it dampens the effect of
outliers). The city block distance between two units is the sum of the absolute values of the
differences between the values taken by the variables: d(A, B) = |xA - xB| + |yA - yB|.
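A minimal numeric illustration with two toy observations:

```python
# City block (Manhattan) distance: sum of absolute differences.
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

city_block = np.sum(np.abs(a - b))  # |1-4| + |2-6| = 7.0
print(city_block)
```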
A proximity matrix, P, is an m-by-m matrix containing all the pairwise dissimilarities or similarities
between the objects being considered. The higher the value, the more dissimilar two cases are; the
lower the value, the more similar they are.
A proximity matrix provides a mathematical accounting of the similarity or dissimilarity between
cases. We usually index similarity in terms of distance, which refers to the variation, often
standardized, between two cases in some k-dimensional space. For example, if we have two variables
that we are using to cluster, we could use either a Euclidean or squared Euclidean distance to see how
far apart cases are on the X and Y axes defined by the variables.
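A sketch using SciPy's pairwise-distance helpers (toy values):

```python
# Build an m-by-m proximity matrix of pairwise squared Euclidean
# distances for four toy cases measured on two variables.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 2.0],
              [5.0, 5.0]])

P = squareform(pdist(X, metric="sqeuclidean"))
print(P.shape)      # m-by-m matrix of pairwise dissimilarities
print(P[0, 3])      # (5-0)^2 + (5-0)^2 = 50.0
```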
Hierarchical cluster procedures (analysis):
a Agglomerative methods. Start by forming a cluster for each case, then join the two most similar
clusters; repeat until only one cluster remains. Note that the procedure works in such a way that
results at an earlier stage are always nested within results at a later stage, creating a similarity to a
tree.
b Divisive methods. Start with all cases forming a single cluster and divide it into two clusters, then
three, then four, and so forth until each cluster is a single-member cluster.
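The agglomerative procedure can be sketched with SciPy's linkage function, which records every merge step (toy data):

```python
# Agglomerative clustering: linkage() records one merge per step;
# cutting the resulting tree yields a chosen number of clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.5, (10, 2)),
               rng.normal(5, 0.5, (10, 2))])   # two toy blobs

Z = linkage(X, method="ward")                  # n-1 merge steps
labels = fcluster(Z, t=2, criterion="maxclust")
print(Z.shape, len(set(labels)))
```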
If all variables are metric: A) use Ward's method if expecting clusters of similar size; B) use
between-groups average linkage (the SPSS default) if some clusters might be small.
In practice, select squared Euclidean distance for metric data unless you have a good reason to
choose a different measure.
In practice, standardize unless the variables are measured on the same scale.