Types of data

There are four types of data that may be gathered in social research, each one adding more
to the next. Thus ordinal data is also nominal, and so on. A useful acronym to help
remember this is NOIR (French for 'black').

[Figure: the four data types, listed as Ratio, Interval, Ordinal, Nominal]

Nominal
The name 'Nominal' comes from the Latin nomen, meaning 'name' and nominal data are
items which are differentiated by a simple naming system.
The only thing a nominal scale does is to say that items being measured have something in
common, although this may not be described.
Nominal items may have numbers assigned to them. This may appear ordinal but is not --
these are used to simplify capture and referencing.
Nominal items are usually categorical, in that they belong to a definable category, such as
'employees'.
Example
The number pinned on a sports person.
A set of countries.
Ordinal
Items on an ordinal scale are set into some kind of order by their position on the scale. This
may indicate, for example, temporal position or superiority.
The order of items is often defined by assigning numbers to them to show their relative
position. Letters or other sequential symbols may also be used as appropriate.
Ordinal items are usually categorical, in that they belong to a definable category, such as
'1956 marathon runners'.
You cannot do arithmetic with ordinal numbers -- they show sequence only.
Example
The first, third and fifth person in a race.
Pay bands in an organization, as denoted by A, B, C and D.
Interval
Interval data (also sometimes called integer) is measured along a scale in which adjacent
positions are equidistant from one another. This allows the distances between pairs of values
to be compared: the gap between 2 and 4 is the same as the gap between 6 and 8.
This is often used in psychological experiments that measure attributes along an arbitrary
scale between two extremes.
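Interval data can be added and subtracted, but cannot meaningfully be multiplied or divided,
because the zero point of the scale is arbitrary. For example, 20 degrees Fahrenheit is not
'twice as hot' as 10 degrees Fahrenheit.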
Example
My level of happiness, rated from 1 to 10.
Temperature, in degrees Fahrenheit.
Ratio
In a ratio scale, numbers can be compared as multiples of one another. Thus one person
can be twice as tall as another person. Important also, the number zero has meaning.
Thus the difference between a person of 35 and a person of 38 is the same as the difference
between people who are 12 and 15. A person can also have an age of zero.
Ratio data can be multiplied and divided because not only is the difference between 1 and 2
the same as between 3 and 4, but also that 4 is twice as much as 2.
Interval and ratio data measure quantities and hence are quantitative. Because they can
be measured on a scale, they are also called scale data.
Example
A person's weight
The number of pizzas I can eat before fainting
Parametric vs. Non-parametric
Interval and ratio data are parametric, and are used with parametric tools in which
distributions are predictable (and often Normal).
Nominal and ordinal data are non-parametric, and do not assume any particular
distribution. They are used with non-parametric tools such as the Histogram.
Continuous and Discrete
Continuous measures are measured along a continuous scale which can be divided into
fractions, such as temperature. Continuous variables allow for infinitely fine sub-division,
which means if you can measure sufficiently accurately, you can compare two items and
determine the difference.
Discrete variables are measured across a set of fixed values, such as age in years (not
microseconds). These are commonly used on arbitrary scales, such as scoring your level of
happiness, although such scales can also be continuous.
Sampling terminology
Sampling has its own set of terms it uses. Here is a brief description of these.
Population
A population is the total group of people whom you are researching and about whom
you want to draw conclusions.
It is common for variables in the population to be denoted by Greek letters and for those in
the sample to be shown by Latin letters. For example, the standard deviation of the population
is often shown with σ (sigma), whilst that of a sample is 's'. Sometimes, as an alternative, capital
letters are used for the population.
Sample frame
The list of people from whom you draw your sample, such as a phone book or 'people
shopping in town today', may well be less than the entire population and is called a sample
frame. This must be representative of the population otherwise bias will be introduced.
Sample frames are usually much larger than the sample. They are used because of
convenience and the difficulty of accessing people outside this frame (for example those
without a telephone).

[Figure: Population, Sample frame, Available units, Sample]

Sample
When the population is large or generally inaccessible (such as the population of
Birmingham) then the approach used is to measure a subset or sample.
Unit
A unit is the thing being studied. Usually in social research this is people. There may also
be additional selection criteria used to choose the units to study, such as 'people who have
been police officers for at least five years.'
Sample size
In order to be representative of the population, the sample must be large enough. There
are calculations to help you determine this. The required sample size depends on the
homogeneity of the population, as well as its total size.
Generalizing
After sampling you then generalize in order to make conclusions about the rest of the
population.
Validity
Validity is about truth and accuracy. A valid sample is representative of the population and
will allow you to generalize to valid conclusions. This aligns with external validity.
A valid sample is both big enough and is selected without bias so it is representative of the
population.
Bias
Bias, a distortion of results, is the bugbear of all research and it can be introduced by
taking a sample that does not truly represent the population and hence is not valid.
Assignment
Having drawn the sample, its members may be assigned to different groups.
A common grouping is an experimental group, which receives the treatment under study, and
a control group, which gives a standard against which experimental results can be compared.
To sustain internal validity, this is usually random assignment. Non-random assignment is
sometimes acceptable, for example where two school classes are selected as coherent groups and
one is chosen as the control.
Sampling fraction
When a sample of n people is selected from a population of N, the sampling
fraction is calculated as n/N. This may be expressed as a number (eg. 0.10) or a
percentage (eg. 10%).
Sampling distribution
If the sample is described as a histogram (a bar chart showing numbers in different
measurement ranges) it will have a particular shape. Multiple samples should have similar
shapes, although random variation means each may be slightly different. The larger the
sample size, the more similar sample distributions will be.
Sampling error
This is the standard error for the sample distribution and measures the variation across
different samples. It is based on the standard deviation of the sample and the gap between
this and the standard deviation of the population. Larger sample sizes will lead to a smaller
sampling error.
An estimate calculation for a single sample is:
s_m = s_x / sqrt(N)
Where:
s_x is the standard deviation of the sample
N is the sample size
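The estimate above can be sketched in a few lines of Python. This is a minimal illustration only: the function name standard_error and the scores are made up for the example, and the sample standard deviation is assumed to use the usual n - 1 divisor.

```python
import math
import statistics

def standard_error(sample):
    """Estimate the standard error of the mean as s_x / sqrt(N)."""
    s_x = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    n = len(sample)                 # N, the sample size
    return s_x / math.sqrt(n)

# Made-up scores, purely for illustration.
scores = [4, 7, 6, 5, 8, 6, 7, 5]
print(round(standard_error(scores), 3))
```

Because N sits under a square root, quadrupling the sample size only halves the estimated error.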
Systematic error
A systematic error is one caused by human error during the design or implementation of
the experiment.
Strata
Strata (singular: stratum) are sub-groups within a population or sample frame. These can
be random groups, but often are natural groupings, such as men and women or age-range
groups. Stratification helps reduce error. See stratified random sampling for usage.
Oversampling
Oversampling occurs when you study the same person twice. For example if you selected
people by their telephone number and someone had two phone numbers, then you could
end up calling them twice. This can cause bias.
Two error types
In an experiment, we seek to demonstrate that a primary hypothesis is true or false. This
leads to two classic types of error. Careful design can significantly reduce the chance of
these errors occurring.
Type 1 error
The Type 1 error (often written 'Type I error') occurs when it is concluded that something is
true when it is actually false. In other words the experiment falsely appears to be
'successful'.
This generally means the primary hypothesis, H1, is believed true, while it is actually false.
This conclusion is usually reached by finding that the null hypothesis, H0, is probably false,
within an acceptable tolerance.
Type 1 errors often occur due to carelessness or bias on the part of the researcher. When
their hypothesis is 'proven' they may well be loath to challenge their findings. As such,
type 1 errors can be more common than type 2 errors.
It can be very frustrating when you desperately believe something is true but you are
unable to conclusively prove this to be so. It is sad that some researchers feel driven to
fake data in order to draw such false conclusions, particularly when professional reputation
and research grants may hang in the balance.
The probability of making a Type 1 error is often known as 'alpha' (α), or 'a' or 'p' (when it
is difficult to produce a Greek letter). For statistical significance to be claimed, this often
has to be less than 5%, or 0.05. For high significance it may be further required to be less
than 0.01.
Type 1 errors are also known as 'errors of the first kind'.
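A small simulation can make the 5% threshold concrete. The sketch below is an assumption-laden illustration rather than a method from this text: it repeatedly samples from a population where H0 really is true (mean 0, known standard deviation 1), applies a two-sided z-test at the 0.05 level (critical value 1.96), and counts how often H0 is wrongly rejected. The function name and parameters are hypothetical.

```python
import random
import statistics

def false_positive_rate(trials=2000, n=30, z_crit=1.96, seed=1):
    """Fraction of experiments that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # H0 is true by construction: the population mean is exactly 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # z statistic for the sample mean, with known sigma = 1.
        z = statistics.fmean(sample) * n ** 0.5
        if abs(z) > z_crit:
            rejections += 1  # a Type 1 error
    return rejections / trials

rate = false_positive_rate()
print(rate)
```

The printed rate should land close to 0.05, which is what choosing alpha = 0.05 means: roughly one experiment in twenty will appear 'successful' by chance alone.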
Type 2 error
The Type 2 error (often written 'Type II error') occurs when it is concluded that something
is false while it is actually true. In other words the experiment falsely appears to be
'unsuccessful'.
This generally means the primary hypothesis, H1, is believed false, while it is actually true.
This conclusion is usually reached by finding that the null hypothesis, H0, is probably true,
within an acceptable tolerance.
Type 2 errors can occur when there are mistakes in experimental design, sampling or
analysis that cloak actual relationships, for example when the sample is too small or where
variation in contextual variables hides the actual relationship.
Being found to have made a type 1 error can lead to accusations of cheating, which can be
professionally very damaging. Because of this, type 2 errors can be made by researchers
who are paranoid about avoiding type 1 errors and are consequently over-cautious in their
conclusions.
The probability of making a Type 2 error is known as 'beta' (β, in contrast to the 'alpha' of
Type 1). Cohen (1992) suggests that a maximum acceptable probability of a Type 2 error
should be 0.2 (20%).
Type 2 errors are sometimes called 'errors of the second kind'.
Results Matrix
The table below shows four possibilities in the results of experiments.
Type 1 and Type 2 errors are as described above.
When a significant change is correctly found then the effect can be measured to identify
how important this is.
When no change is correctly found, the power indicates how likely this is.


Assessed result | Real result: No change (H0 true, H1 false) | Real result: Significant change (H0 false, H1 true)
No change (H0 true, H1 false) | Measure: Power, 1-β | Type 2 error, β
Significant change (H0 false, H1 true) | Type 1 error, α | Measure: Effect

Choosing a sampling method
There are many methods of sampling when doing research. This guide can help you choose
which method to use. Simple random sampling is the ideal, but researchers seldom have
the luxury of time or money to access the whole population, so many compromises often
have to be made.
Probability methods
This is the best overall group of methods to use as you can subsequently use the most
powerful statistical analyses on the results.

Method | Best when
Simple random sampling | Whole population is available.
Stratified sampling (random within target groups) | There are specific sub-groups to investigate (eg. demographic groupings).
Systematic sampling (every nth person) | A stream of representative people is available (eg. in the street).
Cluster sampling (all in limited groups) | Population groups are separated and access to all is difficult, eg. in many distant cities.

Quota methods
For a particular analysis and valid results, you can determine the number of people you
need to sample.
In particular when you are studying a number of groups and when sub-groups are small,
then you will need equivalent numbers to enable equivalent analysis and conclusions.

Method | Best when
Quota sampling (get only as many as you need) | You have access to a wide population, including sub-groups
Proportionate quota sampling (in proportion to population sub-groups) | You know the population distribution across groups, and when normal sampling may not give enough in minority groups
Non-proportionate quota sampling (minimum number from each sub-group) | There is likely to be a wide variation in the studied characteristic within minority groups

Selective methods
Sometimes your study leads you to target particular groups.

Method | Best when
Purposive sampling (based on intent) | You are studying particular groups
Expert sampling (seeking 'experts') | You want expert opinion
Snowball sampling (ask for recommendations) | You seek similar subjects (eg. young drinkers)
Modal instance sampling (focus on 'typical' people) | When sought 'typical' opinion may get lost in a wider study, and when you are able to identify the 'typical' group
Diversity sampling (deliberately seeking variation) | You are specifically seeking differences, eg. to identify sub-groups or potential conflicts

Convenience methods
Good sampling is time-consuming and expensive. Not all experimenters have the time or
funds to use more accurate methods. There is a price, of course, in the potential limited
validity of results.

Method | Best when
Snowball sampling (ask for recommendations) | You are ethically and socially able to ask and seek similar subjects.
Convenience sampling (use who's available) | You cannot proactively seek out subjects.
Judgment sampling (guess a good-enough sample) | You are expert and there is no other choice.



Probability sampling
Probability sampling uses random selection to create the sample.
This is not always easy and care must be taken to ensure the probability of something
appearing in the sample is the same probability as it also appearing in the population. This
will never be exactly the same, but any variation should be due to statistical sampling
error rather than a sample that is too small or where bias occurs in the selection process.
Methods of probability sampling include:
- Simple random sampling: Pure sample of population.
- Stratified sampling: Sample from separate groups.
Use it when there are smaller sub-groups that are to be investigated.
Use it when you want to achieve greater statistical significance in a smaller sample.
Use it to reduce standard error.
In a company there are more men than women, but it is required to have each group
equally represented. Two strata are thus created, of men and women, with an equal
number in each.
- Systematic sampling: Use every nth person.
- Cluster sampling: Focus on a few groups.
Note that statistical analysis is generally based on the assumption of random samples. If
the samples are not randomly chosen then statistical analysis may be invalid and give false
results.
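The four probability methods listed above can be sketched with Python's random module. This is an illustrative sketch under stated assumptions: the function names, the made-up sample frame of 70 men and 30 women, and the made-up city clusters are all hypothetical, and stratified_sample draws an equal number from each stratum as in the company example.

```python
import random

def simple_random_sample(frame, n, rng):
    """Every unit in the frame has the same chance of selection."""
    return rng.sample(frame, n)

def systematic_sample(frame, step, rng):
    """Take every nth unit, starting from a random offset."""
    start = rng.randrange(step)
    return frame[start::step]

def stratified_sample(frame, stratum_of, n_per_stratum, rng):
    """Sample the same number of units at random from each stratum."""
    strata = {}
    for unit in frame:
        strata.setdefault(stratum_of(unit), []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Pick a few clusters at random and study everyone in them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

rng = random.Random(0)
# Made-up frame: 70 men and 30 women, labelled by sex and an index.
frame = [("M", i) for i in range(70)] + [("F", i) for i in range(30)]
simple = simple_random_sample(frame, 10, rng)
systematic = systematic_sample(frame, 10, rng)
stratified = stratified_sample(frame, lambda unit: unit[0], 10, rng)
clustered = cluster_sample({"a": [1, 2], "b": [3], "c": [4, 5, 6]}, 2, rng)
print(len(simple), len(systematic), len(stratified), len(clustered))
```

Note that the stratified sample contains ten men and ten women even though the frame is 70% male, which is the point of creating the two strata.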

Non-probability sampling
Although the ideal way of sampling is by random selection of targets, as in probability
sampling, the reality of research often means that this is not always possible. The opposite
of probability sampling is non-probability sampling, and simply means sampling without
using random selection methods.
The methods of non-probability sampling include:
- Convenience sampling: Use who's available.
- Purposive sampling: Selection based on purpose.
Purposive sampling starts with a purpose in mind and the sample is thus selected to
include people of interest and exclude those who do not suit the purpose.
Eg: This method is popular with newspapers and magazines which want to make a
particular point. This is also true for marketing researchers who are seeking support
for their product. They typically start with people in the street, first approaching only
'likely suspects' and then starting with questions that reject people who do not suit.

- Expert sampling: Selecting 'experts' for opinion or study.
- Quota sampling: Keep going until the sample size is reached.
Eg: A researcher in the high street wants 100 opinions about a new style of cheese.
She sets up a stall and canvasses passers-by until she has got 100 people to taste
the cheese and complete the questionnaire.
- Proportionate quota sampling: Balance across groups by population proportion.
- Non-proportionate quota sampling: Study a minimum number in each sub-group.
- Snowball sampling: Get sampled people to nominate others.
Eg: A researcher is studying environmental engineers but can only find five. She
asks these engineers if they know any more. They give her several further referrals,
who in turn provide additional contacts. In this way, she manages to contact
sufficient engineers.
- Judgment sampling: Selecting what seems like a good enough sample.
Eg: A TV researcher wants a quick sample of opinions about a political
announcement. They stop what seems like a reasonable cross-section of people in
the street to get their views.

Choosing a test
Here's a table to help you choose the analysis to use, based on the data you are analyzing:

Data type?
  Frequency / count: How many variables?
    1: Chi-square goodness of fit
    2: Chi-square test of association
  Scores: Objective of the study?
    Correlation between independent variables: Parametric data?
      Y: Pearson correlation
      N: Spearman correlation
    Understanding differences between groups: How many independent variables?
      1: Independent (not repeated) measures?
        Y: How many groups?
          2: Parametric data?
            Y: Independent-measures t-test
            N: Mann-Whitney test
          >2: Parametric data?
            Y: One-way, independent-measures ANOVA
            N: Kruskal-Wallis test
        N: How many conditions?
          2: Parametric data?
            Y: Matched-pair t-test
            N: Wilcoxon test
          >2: Parametric data?
            Y: One-way, repeated measures ANOVA
            N: Friedman's test
      >1: ANOVA

Some notes:
- Frequency/count data is often nominal.
- You vary independent variables to see how they compare with each other and how
dependent variables change as a result.
- Independent measures are applied to independent people or groups (vs. repeated
measures, which are applied to the same people or group).
- Parametric data has a Normal distribution and has homogeneous variances.
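
The decision table above can also be read as a chain of questions. The sketch below encodes it as a plain function; the name choose_test, its parameter names and the string labels ("frequency", "scores", "correlation", "differences") are hypothetical choices made for this illustration, and the test names are copied from the table.

```python
def choose_test(data_type, *, variables=1, objective="differences",
                independent_variables=1, independent_measures=True,
                groups=2, parametric=True):
    """Walk the decision table and return the name of the suggested test."""
    if data_type == "frequency":
        # The table only covers one or two variables for count data.
        return ("Chi-square goodness of fit" if variables == 1
                else "Chi-square test of association")
    if data_type != "scores":
        raise ValueError("data_type must be 'frequency' or 'scores'")
    if objective == "correlation":
        return "Pearson correlation" if parametric else "Spearman correlation"
    # Otherwise the objective is understanding differences between groups.
    if independent_variables > 1:
        return "ANOVA"
    if independent_measures:
        if groups == 2:
            return ("Independent-measures t-test" if parametric
                    else "Mann-Whitney test")
        return ("One-way, independent-measures ANOVA" if parametric
                else "Kruskal-Wallis test")
    # Repeated measures: 'groups' here counts conditions.
    if groups == 2:
        return "Matched-pair t-test" if parametric else "Wilcoxon test"
    return ("One-way, repeated measures ANOVA" if parametric
            else "Friedman's test")

print(choose_test("scores", objective="correlation", parametric=False))
print(choose_test("scores", independent_measures=False, groups=3))
```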

Parametric vs. non-parametric tests


There are two types of test data and consequently different types of analysis. As the table
below shows, parametric data has an underlying normal distribution which allows for more
conclusions to be drawn as the shape can be mathematically described. Anything else is
non-parametric.

 | Parametric | Non-parametric
Assumed distribution | Normal | Any
Assumed variance | Homogeneous | Any
Typical data | Ratio or Interval | Ordinal or Nominal
Data set relationships | Independent | Any
Usual central measure | Mean | Median
Benefits | Can draw more conclusions | Simplicity; Less affected by outliers
Tests
Choosing | Choosing parametric test | Choosing a non-parametric test
Correlation test | Pearson | Spearman
Independent measures, 2 groups | Independent-measures t-test | Mann-Whitney test
Independent measures, >2 groups | One-way, independent-measures ANOVA | Kruskal-Wallis test
Repeated measures, 2 conditions | Matched-pair t-test | Wilcoxon test
Repeated measures, >2 conditions | One-way, repeated measures ANOVA | Friedman's test

There are a number of basic principles of statistics that need to be understood when doing
social research. Here they are:
- Central Limit Theorem: Distribution of sample means is normal
- Correlation: Relationship between variables.
- Covariance: Common movement between variables.
- Degrees of Freedom: N-1, of course.
- Experimental Effect: Importance of result.
- Frequency Distributions: Histograms and measures.
- Experimental Power: Ability of test to avoid type 2 error.
- Standard Error: Spread of sample means.
- Sum of the Squares, SS: Basic measure of spread.
- Variance: Common measure of spread.
- Measuring Centering: Mean, median and mode.
- Measuring Spread: Range and standard deviation.
- Z-score: A simple deviation measure.

Measurement error
Things vary, and few more so than people. Variation is the bane of the experimenter who
seeks to identify clear correlation.
Random error
Random error is that which causes random and uncontrollable effects in measured results
across a sample, for example where rainy weather may depress some people.
The effect of random error is to cause additional spread in the measurement distribution,
causing an increase in the standard deviation of the measurement. The average should not
be affected, which is good news if this is being quoted in results.
The stability of the average is due to the effect of regression to the mean, whereby random
effects make a high score as likely as a low score, so in a random sample they eventually
cancel one another out.
True score
The true score is that which is sought. It is not the same as the observed score as this
includes the random error, as follows:
Observed score = True score + random error
When the random error is small, then the observed score will be close to the true score and
thus be a fair representation. If, however, the random error is large, the observed score
will be nothing like the true score and has no value.
The effect of random error is that repeated measurements will give a result across a range
of measures, often with the true score in the middle. This is one reason why means are
used (to cause regression to the mean).
Another effect is that if a test score is near a boundary it may incorrectly cross the
boundary. For example, if a school exam result is close to the A/B grade boundary, then the grade
given may not be a reflection of the actual ability of the student.
Assuming an observed score is the true score is a dangerous trap, particularly if you have
no real idea of how big the random error may be.
Systematic error
In addition to natural error, additional variation from the true score may be introduced
when there is some error caused by problems in the measurement system, such as when
bad weather affects everyone in the study or when poor questions result in answers which
do not reflect true opinions.
There are many ways of allowing or introducing systematic error and elimination of this is a
critical part of experimental design, as well as assessment of the context environment at
the time of the experiment.
The effect of systematic error is often to shift the mean of the measurement distribution,
which can be particularly pernicious if this is to be quoted in results.
Measurement error
Measurement error is the real variation from the true score, and includes both random
error and systematic error.
Observed score = True score + random error + systematic error
Measurement error can be reduced by measures such as:
- Testing questions in a range of settings.
- Asking respondents afterwards whether they felt inappropriately encouraged at any
time.
- Carefully training the research associates who are helping implementation of your
experiment.
- Double-entry of data (type it in twice).
- Double-checking formulae in spreadsheets.
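The different effects of the two error types in 'Observed score = True score + random error + systematic error' can be illustrated with a short simulation. The numbers are invented for the sketch: a true score of 50, random error drawn from a Normal distribution with standard deviation 5, and a constant systematic error of +3.

```python
import random
import statistics

rng = random.Random(42)
true_score = 50.0
n = 5000

# Random error only: spread widens, but the mean stays near the true score.
random_only = [true_score + rng.gauss(0, 5) for _ in range(n)]

# Random plus systematic error: the whole distribution shifts by the bias.
systematic_error = 3.0
with_bias = [true_score + rng.gauss(0, 5) + systematic_error for _ in range(n)]

print(round(statistics.fmean(random_only), 2), round(statistics.stdev(random_only), 2))
print(round(statistics.fmean(with_bias), 2), round(statistics.stdev(with_bias), 2))
```

In both runs the spread is about the same, but only the second mean moves away from the true score, which is why systematic error is the more damaging of the two when an average is quoted.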
Residual variance
When measuring variance in analysis of data, for example using the F-ratio, the model
variance is the variance that can be explained by the experiment, and is thus 'good'
variance. Residual variance is that which cannot be explained by the model being used and
is hence undesirable.
A test statistic may thus, for example, be based on the ratio of the model variance to the
residual variance. The F-ratio is calculated as MS_M / MS_R, where MS is the mean square.

Covariance
Deviation of a variable in a sample is its value minus the sample mean (x-bar).
dev(x) = x - x-bar
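Covariance is a measure of how much the deviations of a pair of variables match.
cov(x,y) = SUM( (x - x-bar)*(y - y-bar) ) / (n - 1)
where n is the number of paired observations.
A large positive number for covariance indicates that the variables tend to deviate in the
same direction. A large negative number indicates that they deviate in opposite directions,
and a value near zero indicates weak matching.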
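As a worked illustration, covariance can be computed directly from the deviations. In this sketch the function name and data are made up, and the summed products are divided by n - 1, the usual convention for a sample, which is an assumption about the intended definition.

```python
def covariance(xs, ys):
    """Sample covariance: summed products of deviations, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

xs = [1, 2, 3, 4, 5]
print(covariance(xs, [2, 4, 6, 8, 10]))   # 5.0: deviations move together
print(covariance(xs, [10, 8, 6, 4, 2]))   # -5.0: deviations move in opposite directions
```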
Correlation
Correlation of two variables is a measure of the degree to which they vary together.
More accurately, correlation is the covariation of standardized variables.
In positive correlation, as one variable increases, so also does the other.
In negative correlation, as one variable increases, the other variable decreases.
Pearson correlation

A correlation coefficient is a calculated number that indicates the degree of correlation
between two variables:
- Perfect positive correlation usually is calculated as a value of 1 (or 100%).
- Perfect negative correlation usually is calculated as a value of -1.
- A value of zero shows no correlation at all.
- Pearson devised a very common way of measuring correlation, often called
the Pearson Product-Moment Correlation. It is used when both variables are at
least at interval level and data is parametric.
- It is calculated by dividing the covariance of the two variables by the product of their
standard deviations.
- r = SUM((x_i - xbar)*(y_i - ybar)) / ((n - 1) * s_x * s_y)

Where x and y are the variables, x_i is a single value of x, xbar is the mean of
all x's, n is the number of paired observations, and s_x is the standard deviation of all x's.
Pearson is a parametric statistic and assumes:
1. A normal distribution.
2. Interval or ratio data.
3. A linear relationship between X and Y
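The Pearson calculation described above (covariance divided by the product of the standard deviations) can be sketched as follows. The function name and the data are illustrative only, and the standard deviations are assumed to use the n - 1 divisor.

```python
import math

def pearson_r(xs, ys):
    """Pearson r = SUM((x_i - xbar)(y_i - ybar)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (n - 1 divisor).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return products / ((n - 1) * s_x * s_y)

xs = [1, 2, 3, 4, 5]
print(round(pearson_r(xs, [2, 4, 6, 8, 10]), 3))  # perfect positive correlation
print(round(pearson_r(xs, [2, 1, 4, 3, 5]), 3))   # strong but imperfect
```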
Spearman correlation
Description
The Spearman Rank Correlation Coefficient is a form of the Pearson coefficient with the
data converted to rankings (ie. when variables are ordinal). It can be used when there
is non-parametric data and hence Pearson cannot be used.
The raw scores are converted to ranks and the differences (d_i) between the ranks of each
observation on the two variables are calculated. The Spearman coefficient is denoted with
the Greek letter rho (ρ).
ρ = 1 - (6 * SUM(d_i^2)) / (n * (n^2 - 1))
The Spearman Coefficient can be used to measure ordinal data (ie. in rank order),
not interval (as Pearson). It effectively works by first ranking the data then applying
Pearson's calculation to the rank numbers.
This coefficient is also called Spearman's rho (after the Greek letter used).
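A minimal sketch of the rank-difference calculation follows. The helper names are hypothetical, and the simple ranking used here assumes there are no tied scores; tied data would need averaged ranks, which this sketch does not attempt.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(xs, ys):
    """rho = 1 - (6 * SUM(d_i^2)) / (n * (n^2 - 1)), with d_i the rank difference."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Made-up scores: only their order matters, not their size.
xs = [10, 20, 30, 40, 50]
ys = [2, 1, 4, 3, 5]
print(spearman_rho(xs, ys))
```

With these values the rank differences are -1, 1, -1, 1 and 0, so the sum of squares is 4 and rho works out as 1 - 24/120 = 0.8.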
Likert Scale
Description
The Likert Scale is an ordered, one-dimensional scale from which respondents choose one
option that best aligns with their view.
There are typically between four and seven options. Five is very common (see arguments
about this below).
All options usually have labels, although sometimes only a few are offered and the others
are implied.
A common form is an assertion, with which the person may agree or disagree to varying
degrees.
In scoring, numbers are usually assigned to each option (such as 1 to 5).
Example

5-point traditional Likert scale:

Statement | Strongly agree | Tend to agree | Neither agree nor disagree | Tend to disagree | Strongly disagree
I like going to Chinese restaurants | [ ] | [ ] | [ ] | [ ] | [ ]

5-point Likert-type scale, not all labeled:

Statement | Good | | Neutral | | Bad
When I think about Chinese restaurants I feel | [ ] | [ ] | [ ] | [ ] | [ ]

6-point Likert-type scale:

Statement | Never | Infrequently | Infrequently | Sometimes | Frequently | Always
I feel happy when entering a Chinese Restaurant | O | O | O | O | O | O

Question selection
Questions may be selected by a mathematical process, as follows:
1. Generate a lot of questions -- more than you need.
2. Get a group of judges to score the questionnaire.
3. Sum the scores for all items.
4. Calculate the intercorrelations between all pairs of items.
5. Reject questions that have a low correlation with the sum of the scores.
6. For each item, calculate the t-value for the top quarter and bottom quarter of the
judges and reject questions with lower t-values (higher t-values show questions with
higher discrimination).
Discussion
The Likert scale is named after its originator, Rensis Likert.
A benefit is that questions used are usually easy to understand and so lead to consistent
answers. A disadvantage is that only a few options are offered, with which respondents
may not fully agree.
As with any other measurement, the options should be a carefully selected set of questions
or statements that act together to give a useful and coherent picture.
A problem can occur where people may become influenced by the way they have answered
previous questions. For example if they have agreed several times in a row, they may
continue to agree. They may also deliberately break the pattern, disagreeing with a
statement with which they might otherwise have agreed. This patterning can be broken up
by asking reversal questions, where the sense of the question is reversed - thus in the
example above, a reversal might be 'I do not like going to Chinese restaurants'. Sometimes
the 'do not' is emphasized, to ensure people notice it, although this can cause bias and
hence needs great care.
There is much debate about how many choices should be offered. An odd number of
choices allows people to sit on the fence. An even number forces people to make a choice,
whether this reflects their true position or not.
Some people do not like taking extreme choices as this may make them appear as if they
are totally sure when they realize that there are always valid opposing views to many
questions. They may also prefer to be thought of as moderate rather than extremist. They
thus are much less likely to choose the extreme options. This is a good argument to offer
seven choices rather than five. It is also possible to note people who do not make extreme
choices and 'stretch' their scores, although this can be a somewhat questionable activity.
[For these reasons, I have a personal preference for six options].
There is also debate as to what is a true Likert scale and what is a 'Likert-type' scale.
Likert's original scale (in his PhD thesis) was bipolar, with five points running from one
extreme to another, through a neutral central position, ranging from 'Strongly Agree' to
'Strongly Disagree'.
The Likert scale is also called the summative scale, as the result of a questionnaire is often
achieved by summing numerical assignments to the responses given.
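Summative scoring with reversal questions can be sketched as below. The numeric assignment (5 for 'Strongly agree' down to 1 for 'Strongly disagree') and the function name are assumptions made for illustration; reversed items are flipped with 6 - value so that a high total always points the same way.

```python
# Assumed numeric assignment for a 5-point scale.
SCALE = {
    "Strongly agree": 5,
    "Tend to agree": 4,
    "Neither agree nor disagree": 3,
    "Tend to disagree": 2,
    "Strongly disagree": 1,
}

def score_respondent(answers, reversed_items=()):
    """Sum the numeric values of the chosen options, flipping reversal questions."""
    total = 0
    for item, label in answers.items():
        value = SCALE[label]
        if item in reversed_items:
            value = 6 - value  # on a 1-5 scale, 5 becomes 1, 4 becomes 2, and so on
        total += value
    return total

answers = {
    "I like going to Chinese restaurants": "Strongly agree",
    "I do not like going to Chinese restaurants": "Tend to disagree",
}
total = score_respondent(
    answers, reversed_items={"I do not like going to Chinese restaurants"}
)
print(total)  # 5 + (6 - 2) = 9
```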
Guttman scale
Description
A Guttman scale presents a number of items to which the person is requested to agree or
not agree. This is typically done in a 'Yes/No' dichotomous format. It is also possible to use
a Likert scale, although this is less commonly used.
Questions in a Guttman scale gradually increase in specificity. The intent of the scale is that
the person will agree with all statements up to a point and then will stop agreeing.
The scale may be used to determine how extreme a view is, with successive statements
showing increasingly extremist positions.
If needed, the escalation can be concealed by using intermediate questions.
Example

Place a check-mark against all statements with which you agree
I like eating out [ ]
I like going to restaurants [ ]
I like going to themed restaurants [ ]
I like going to Chinese restaurants [ ]
I like going to Beijing-style Chinese restaurants [ ]

Concealed example (hardening attitude towards crime), using Likert scale:

Statement | Strongly agree | Tend to agree | Neither agree nor disagree | Tend to disagree | Strongly disagree
Criminals should be punished | [ ] | [ ] | [ ] | [ ] | [ ]
Litter is a problem in the street | [ ] | [ ] | [ ] | [ ] | [ ]
Sentences for many crimes should be longer | [ ] | [ ] | [ ] | [ ] | [ ]
Streets in this town are not well lit | [ ] | [ ] | [ ] | [ ] | [ ]
More criminals deserve the death penalty | [ ] | [ ] | [ ] | [ ] | [ ]

Question selection
1. Generate a list of possible statements.
2. Get a set of judges to score the statements with a Yes or No, depending on whether
they agree or disagree with them.
3. Draw up a table with the respondent in rows and statements in columns, showing
whether they answered Yes or No.
4. Sort the columns so the statement with the most Yes's is on the left.
5. Sort the rows so the respondent with the most Yes's is at the top.
6. Select a set of questions that has the fewest 'holes' (No's between Yes's).
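The sorting and hole-counting steps above can be sketched in a few lines of Python. This is
only an illustration: the judge names, statement labels and answers are invented, and a 'hole'
is taken to mean any No that sits to the left of a respondent's last Yes once the table has
been sorted.

```python
# Hypothetical judge answers: 1 = Yes, 0 = No.
# Rows are respondents, columns are statements (all names are invented).
statements = ["S1", "S2", "S3", "S4", "S5"]
answers = {
    "judge_1": [1, 1, 1, 0, 0],
    "judge_2": [1, 1, 0, 0, 0],
    "judge_3": [1, 1, 1, 1, 0],
    "judge_4": [1, 0, 1, 0, 0],  # one 'hole': a No between two Yes's
}


def sort_table(statements, answers):
    """Steps 4 and 5: most-agreed statement on the left, most-agreeing respondent on top."""
    n_cols = len(statements)
    yes_per_col = [sum(row[c] for row in answers.values()) for c in range(n_cols)]
    col_order = sorted(range(n_cols), key=lambda c: -yes_per_col[c])
    sorted_statements = [statements[c] for c in col_order]
    rows = [(name, [row[c] for c in col_order]) for name, row in answers.items()]
    rows.sort(key=lambda item: -sum(item[1]))
    return sorted_statements, rows


def count_holes(row):
    """Step 6: count No's that sit to the left of the last Yes in a sorted row."""
    if 1 not in row:
        return 0
    last_yes = len(row) - 1 - row[::-1].index(1)
    return row[:last_yes].count(0)


sorted_statements, sorted_rows = sort_table(statements, answers)
total_holes = sum(count_holes(row) for _, row in sorted_rows)
print(sorted_statements)
for name, row in sorted_rows:
    print(name, row, "holes:", count_holes(row))
print("total holes:", total_holes)
```

Candidate sets of statements can then be compared by their total hole count, keeping the set
with the fewest.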
Discussion
The Guttman scale was first described by Louis Guttman in 1944. It allows progressive
investigation in the nature of interview probing, such that you can find out to what
degree respondents agree with a concept or principle. The group of questions seeks to
investigate just one factor or trait.
There is a danger with this that respondents feel committed by earlier questions and seek
to sustain consistency and thus agree with more than they really believe. They may also
fear being drawn into an extreme position and hence hold back. This can be mitigated by
using the concealed form, interleaving the questions with random numbers of other
questions (that may or may not be needed in the survey).
Guttman scaling is also known as cumulative scaling or scalogram analysis.
Thurstone scale
Description
A Thurstone scale has a number of statements to which the respondent is asked to agree or
disagree.
There are three types of scale that Thurstone described:
- Equal-appearing intervals method
- Successive intervals method
- Paired comparisons method
Example

Agree Disagree
I like going to Chinese restaurants [ ] [ ]
Chinese restaurants provide good value for money [ ] [ ]
There are one or more Chinese restaurants near where I live [ ] [ ]
I only go to restaurants with others (never alone) [ ] [ ]

Question selection
Equal-appearing intervals
1. Generate a large set of possible statements.
2. Get a set of judges to rate the statements in terms of how much they agree with
them, from 1 (agree least) to 11 (agree most).
3. For each statement, plot a histogram of the numbers against which the different
judges scored it.
4. For each statement, identify the median score, the first quartile (Q1, the score below
which 25% of ratings fall) and the third quartile (Q3, the score below which 75% fall).
The difference between Q3 and Q1 is the interquartile range.
5. Sort the list by median value (This is the 'common' score in terms of agreement).
6. Select a set of statements that are at equal positions across the range of medians.
Choose the one with the lowest interquartile range for each position.
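As a rough sketch of steps 4 to 6, the following Python computes the median and interquartile
range for each statement and keeps the least-disputed statement at each scale position. The
ratings are invented, the function names are hypothetical, and two simplifications are
assumed: quartiles use the 'inclusive' convention of statistics.quantiles (other conventions
give slightly different values), and a 'position' is simply the median rounded to a whole
number.

```python
from statistics import median, quantiles

# Hypothetical judge ratings (1 = agree least, 11 = agree most) for five statements.
ratings = {
    "A": [1, 2, 2, 3, 2],
    "B": [1, 1, 2, 4, 5],
    "C": [5, 6, 6, 7, 6],
    "D": [9, 10, 11, 10, 10],
    "E": [4, 6, 8, 10, 11],
}


def summarise(scores):
    """Step 4: median and interquartile range (Q3 - Q1) of one statement's ratings."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {"median": median(scores), "iqr": q3 - q1}


def pick_statements(ratings):
    """Steps 5 and 6: for each scale position keep the statement with the lowest IQR."""
    best = {}
    for statement, scores in ratings.items():
        summary = summarise(scores)
        position = round(summary["median"])
        if position not in best or summary["iqr"] < best[position][1]:
            best[position] = (statement, summary["iqr"])
    return {position: name for position, (name, _) in sorted(best.items())}


chosen = pick_statements(ratings)
print(chosen)
```

Statements A and B share the same median here, but A is kept because the judges disagreed
less about it.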
Successive intervals

Paired comparisons
In this method, the judges select between every possible pair of potential statements. As
the number of comparisons increases with the square of the number of statements, this is
only practical when there is a limited number of statements.
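To see how quickly the work grows, the number of distinct pairs for n statements is
n(n - 1)/2. A minimal Python sketch (the function name and statement labels are made up for
illustration):

```python
from itertools import combinations


def comparison_pairs(statements):
    """Return every unordered pair of statements a judge would have to compare."""
    return list(combinations(statements, 2))


few = comparison_pairs(["S1", "S2", "S3", "S4"])
many = comparison_pairs([f"S{i}" for i in range(1, 21)])
print(len(few), "pairs for 4 statements")
print(len(many), "pairs for 20 statements")
```

Four statements need 6 comparisons per judge, while twenty statements already need 190.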
Discussion
Judges are used beforehand to understand variation -- if the judges cannot agree, then the
question as posed is also likely to result in varied responses from target people.
One of the biggest problems with Thurstone scaling is to find sufficient judges who have a
good enough understanding of the concept being assessed.
With a set of questions with which you can agree or not, it is useful to have some questions
with which the respondent will easily agree, some with which they will easily disagree and
some which they have to think about, and where some people are more likely to make one
choice rather than another. This should then give a realistic and varying distribution across
all questions, rather than bias being caused by questions that are likely to give all of one
type of answer.
Thurstone scaling is also called Equal-Appearing Interval Scaling.

One-tail and two-tail tests

Description
When a set of measures is made, they typically appear as a distribution, with more
'average' measures in the middle and fewer at the extremes, as in the standard normal 'bell'
curve.
Experimental assessment typically keeps the central majority and snips off both extreme
'tails' as less likely, with the two tails together typically forming the acceptable 5% error.
This is a two-tailed test.
Some tests ask questions about 'more' or 'less' and, rather than snipping off both outer
tails, draw a single line and select the people below or above that line. This is a
one-tailed test.
Example
A two-tailed experiment seeks to understand the intelligence as measured by
'IQ' of a group of people and finds that 95% of the people have an IQ between
113 and 145. The other 5% are above or below these figures. In a Normal
distribution, this would be around 2.5% below 113 and 2.5% above 145.
A one-tailed experiment starts with a 'genius' IQ rating of 150 and seeks to understand
whether training can increase the number of geniuses in a group. The measure thus slices
off the top section of the group, both before and after the treatment.
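The practical effect of the choice can be shown with the cut-off points on a standard Normal
curve. The sketch below is illustrative only (the function name is an assumption, not a
standard API): it puts the whole 5% in one tail, or splits it as 2.5% in each of two tails.

```python
from statistics import NormalDist


def critical_z(alpha=0.05, tails=2):
    """z value that cuts off alpha in one tail, or alpha/2 in each of two tails."""
    if tails not in (1, 2):
        raise ValueError("tails must be 1 or 2")
    return NormalDist().inv_cdf(1 - alpha / tails)


two_tailed = critical_z(0.05, tails=2)
one_tailed = critical_z(0.05, tails=1)
print(f"two-tailed cut-offs: +/-{two_tailed:.3f}")
print(f"one-tailed cut-off:  {one_tailed:.3f}")
```

The one-tailed line sits closer to the centre (about 1.645 standard deviations) than the
two-tailed lines (about 1.96), because all of the 5% is placed on one side.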

PARAMETRIC TESTS
t-test

Description
The t-test (or Student's t-test) gives an indication of the separateness of two sets of
measurements, and is thus used to check whether two sets of measures are essentially
different (and usually that an experimental effect has been demonstrated). The typical way
of doing this is with the null hypothesis that means of the two sets of measures are equal.
The t-test assumes:
- A normal distribution (parametric data)
- Underlying variances are equal (if not, use Welch's test)
It is used when there is random assignment and only two sets of measurement to compare.
There are two main types of t-test:
- Independent-measures t-test: when samples are not matched.
- Matched-pair t-test: when samples appear in pairs (e.g. before-and-after).
A single-sample t-test compares a sample against a known figure, for example where
measures of a manufactured item are compared against the required standard.
Calculation
The value of t may be calculated using packages such as SPSS. The actual calculation for
two groups is:
t = experimental effect / variability
  = (difference between group means) /
    (standard error of the difference between group means)
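A minimal Python sketch of this calculation for two independent groups follows. It assumes
equal underlying variances, as stated above, and so uses a pooled variance to estimate the
standard error; the data and the function name are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """t = difference between means / standard error of that difference (pooled variance)."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se_diff = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se_diff, df


# Hypothetical scores for a treatment group and a control group.
treatment = [5, 6, 7, 8, 9]
control = [3, 4, 5, 6, 7]
t_value, t_df = independent_t(treatment, control)
print(f"t = {t_value:.2f} with {t_df} degrees of freedom")
```

The resulting t is then compared with a t-table at the given degrees of freedom, or passed to
a statistics package to obtain a p-value.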
Matched-pair t-test
Description
The t-test gives an indication of how separate two sets of measurements are, allowing you
to determine whether something has changed and there are two distributions, or whether
there is effectively only one distribution.
The matched-pair t-test (or paired t-test or paired samples t-test or dependent t-test) is
used when the data from the two groups can be presented in pairs, for example where the
same people are being measured in before-and-after comparison or when the group is
given two different tests at different times (eg. pleasantness of two different types of
chocolate).
In design notation, this could be:

R O X O
or
R X O X O
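For paired data the calculation works on the difference within each pair: t is the mean
difference divided by its standard error. The sketch below is an illustration with invented
before-and-after scores and a hypothetical function name.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """t for matched pairs: mean of the pairwise differences / standard error of that mean."""
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, second)]
    se = stdev(diffs) / sqrt(len(diffs))
    return mean(diffs) / se, len(diffs) - 1


# Hypothetical scores for the same five people before and after a treatment.
before = [10, 12, 9, 11, 13]
after = [12, 13, 11, 12, 16]
paired_value, paired_df = paired_t(before, after)
print(f"t = {paired_value:.2f} with {paired_df} degrees of freedom")
```

Because each person acts as their own control, only the spread of the differences matters,
not the spread between people.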

Z-test
Description
The Z-test compares sample and population means to determine if there is a significant
difference.
It requires a simple random sample from a population with a Normal distribution and
where the mean and standard deviation are known.
Calculation
The z measure is calculated as:
z = (x̄ - μ) / SE
where x̄ is the sample mean to be standardized, μ (mu) is the population mean
and SE is the standard error of the mean.
SE = σ / SQRT(n)
where σ (sigma) is the population standard deviation and n is the sample size.
The z value is then looked up in a z-table. A negative z value means it is below the
population mean (the sign is ignored in the lookup table).
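The calculation can be sketched in Python as below. Instead of a printed z-table, the
standard Normal distribution from the statistics module is used for the lookup. The sample
values, the population figures (mean 100, standard deviation 15) and the function name are
assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def z_test(sample, population_mean, population_sd):
    """Return z = (sample mean - population mean) / SE and the two-tailed probability."""
    se = population_sd / sqrt(len(sample))
    z = (mean(sample) - population_mean) / se
    p_two_tailed = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Hypothetical sample of nine scores from a population with mean 100 and SD 15.
sample = [100, 105, 110, 115, 120, 110, 110, 105, 115]
z_value, p_value = z_test(sample, population_mean=100, population_sd=15)
print(f"z = {z_value:.2f}, two-tailed p = {p_value:.4f}")
```

Taking the absolute value of z before the lookup mirrors the way the sign is ignored when
reading a z-table.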
F-ratio
Description
The F-ratio is a test statistic for multiple independent variables. It is used
in ANOVA calculations and calculated as:
F-ratio = MS_M / MS_R

... where MS = SS / df
SS = Sum of the Squares
df = degrees of freedom
Subscripted M means 'Model' and indicates the expected systematic variance. This is often
measured as between-measures variation, and the subscript B is consequently often used
here.
Subscripted R means 'Residual' and indicates the random, unsystematic variance. This is
measured as within-measures variance, and the subscript W is consequently often used.
F can also be calculated with the Pearson correlation coefficient, r:
F = (r^2 / (1 - r^2)) * (n - 2)
...where n is the sample size.
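Both routes to F can be sketched in Python. The first function follows the MS = SS / df
definitions for a simple design with one independent variable and independent groups, using
between-group variation as the 'Model' part and within-group variation as the 'Residual'
part. The second applies the correlation form. Group data and function names are invented
for illustration.

```python
from statistics import mean


def f_ratio(groups):
    """F = MS_M / MS_R for independent groups (one independent variable)."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Model (between-group) sum of squares: group means around the grand mean.
    ss_model = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual (within-group) sum of squares: scores around their own group mean.
    ss_resid = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_model = len(groups) - 1
    df_resid = len(all_values) - len(groups)
    return (ss_model / df_model) / (ss_resid / df_resid), df_model, df_resid


def f_from_r(r, n):
    """F from a Pearson correlation r over n observations."""
    return r ** 2 / (1 - r ** 2) * (n - 2)


f_value, df_m, df_r = f_ratio([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_m}, {df_r}) = {f_value:.2f}")
print(f"F from r = 0.5, n = 14: {f_from_r(0.5, 14):.2f}")
```

A large F means the systematic variation between groups is large relative to the random
variation within them.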
ANOVA
t-test problems
A significant problem with the t-test is that we typically accept significance at 95%
confidence (alpha=0.05) for each test. Across multiple tests the chance of at least one
false positive accumulates and hence reduces the validity of the results.
ANalysis Of VAriance (ANOVA) overcomes these problems by using a single test to detect
significant differences between the treatments as a whole.
ANOVA assumes parametric data.
F-ratio
Like the t-test, ANOVA produces a test statistic that compares the means of variables,
testing them for equality (or, hopefully, not). This is the F-ratio, which compares the
amount of systematic variance in the data (SS_M) to the amount of unsystematic variance
(SS_R).
A limitation is that the F-ratio only says that there is a difference in means, but does
not say which ones differ or which are the same. This may be addressed with additional
post-hoc tests.
Bonferroni correction
In multiple tests, you could go back to the t-test problem of deteriorating alpha (the
probability of type 1 error). This is addressed with the Bonferroni correction, where alpha is
divided by the number of tests.
Thus if you have set alpha=0.05, then with five post-hoc tests, you revise it to 0.01 and
require the p-value of each test to be less than this.
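The adjustment is simple enough to sketch directly. In the Python below the p-values are
invented and the function names are hypothetical; each test is judged against alpha divided
by the number of tests.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def significant(p_values, alpha=0.05):
    """Flag which tests remain significant at the corrected threshold."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]


# Hypothetical p-values from five post-hoc tests.
p_values = [0.003, 0.02, 0.04, 0.008, 0.30]
flags = significant(p_values)
print("corrected alpha:", bonferroni_alpha(0.05, len(p_values)))
print(flags)
```

Two of the tests (p = 0.02 and p = 0.04) would have passed at the uncorrected 0.05 level but
fail once the threshold is tightened to 0.01.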
Test types
Types of ANOVA have 'X-way' (or 'X-factor') in the title. This indicates the number
of independent variables that were manipulated in the study. Thus:
- 'One way' means one independent variable.
- 'Two way' means two independent variables.
- etc.
The second part of the title tells how the independent variables are measured:
- 'Independent' means different subjects take part in different conditions.
- 'Repeated measures' means the same people take part in all treatments.
- 'Mixed' means at least one independent variable will be measured using different
subjects, and at least one independent variable will be measured using the same
subjects.
Choosing a non-parametric test
Choosing the test
Use the table below to choose the test. See below for further details.

How many separate samples?
  1: How many scores for each subject?
       1: How many measurement categories?
            2:  Binomial test
            2+: Chi-square test for goodness of fit
       2: Can difference scores be ranked?
            Y: Wilcoxon test
            N: Sign test
  2: Matched samples? (N = independent)
       Y: Can difference scores be ranked?
            Y: Wilcoxon test
            N: Sign test
       N: Can scores be ranked with few tied values? (independent samples only)
            Y: Median test
            Y: Mann-Whitney test
            N: Chi-square test for independence
  >2: Can scores be ranked with few tied values? (independent samples only)
       Y: Median test
       N: Chi-square test for independence
       N: Kruskal-Wallis test


Discussion
Non-parametric tests do not assume an underlying Normal (bell-shaped) distribution.
There are two general situations when non-parametric tests are used:
1. Data is nominal or ordinal (where means and variance cannot be calculated).
2. The data does not satisfy other assumptions underlying parametric tests.
Chi-square test
Description
The chi-square (χ²) test measures the alignment between two sets of frequency measures.
These must be categorical counts and not percentages or ratio measures (for these, use
another correlation test).
Note that the frequency numbers should be reasonably large, generally at least 5 (although
an occasional lower figure may be possible, as long as it is not part of a pattern of low
figures).
Goodness of fit
A common use is to assess whether a measured/observed set of measures follows an
expected pattern.
The expected frequency may be determined from prior knowledge (such as a previous
year's exam results) or by calculation of an average from the given data.
The null hypothesis, H0, is that the two sets of measures are not significantly different.
Independence
The chi-square test can be used in the reverse manner to goodness of fit. If the two sets of
measures are compared, then just as you can show they align, you can also determine if
they do not align.
The null hypothesis here is that the two sets of measures are similar.

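In both goodness-of-fit and independence assessments, the calculated value is compared with
the 0.05, 0.01 or 0.001 figures in the Chi Square table: a value above the table figure means
the null hypothesis can be rejected at that level. The 0.95 or 0.99 figures at the other end
of the table mark values so small that the two sets of measures agree more closely than
chance alone would normally allow (this is why the table has two ends to it).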
Calculation
Chi-square, χ² = SUM( (observed - expected)^2 / expected )
χ² = SUM( (f_o - f_e)^2 / f_e )
...where f_o is the observed frequency and f_e is the expected frequency.
Note that the expected values may need to be scaled to be comparable to the observed
values. A simple test is that the total frequency/count should be the same for observed and
expected values.
In a table, the expected frequency, if not known, may be estimated as:
f_e = (row total) x (column total) / n
...where n is the total of all rows (or columns).
The result is used with a Chi Square table to determine whether the comparison shows
significance.
In a table, the degrees of freedom are:
df = (R - 1) * (C - 1)
...where R is the number of rows and C is the number of columns.
Example
Goodness of fit
English test grade distributions have changed from last year, with more grade B's and
fewer grade D's. Is this significant?
The table below shows the calculation. First, the expected values are created by scaling last
year's results to be equivalent to this year. Then the test statistic is calculated as SUM((O -
E)^2/E).

English test results
                      Grade A  Grade B  Grade C  Grade D  Grade E    Sum
This year, O               23       32       20       15       10    100
Last year                  25       20       15       25       10     95
Scaled last year, E        26       21       16       26       11    100
(O - E)                  -3.3     10.9      4.2    -11.3     -0.5
(O - E)^2                11.0    119.8     17.7    128.0      0.3
(O - E)^2/E               0.4      5.7      1.1      4.9      0.0   12.1

Chi-square is found to be 12.1 and the degrees of freedom are (5-1) = 4 (there are five
possible grades). Looking this up in the Chi Square table shows the probability is between
5% (9.49) and 1% (13.28), so H0 can be rejected at the 5% level and a significant change
can be claimed.
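The grade example can be reproduced with a short Python sketch. The counts are those from
the table; the function name is hypothetical. Last year's figures are scaled so that their
total matches this year's before the statistic is summed.

```python
def goodness_of_fit(observed, reference):
    """Chi-square for observed counts against reference counts scaled to the same total."""
    scale = sum(observed) / sum(reference)
    expected = [count * scale for count in reference]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi_square, len(observed) - 1


this_year = [23, 32, 20, 15, 10]  # grades A to E, from the table
last_year = [25, 20, 15, 25, 10]
gof_chi, gof_df = goodness_of_fit(this_year, last_year)
print(f"chi-square = {gof_chi:.1f}, df = {gof_df}")
```

The unrounded expected values are used throughout, which is why the table's (O - E) row does
not exactly match the rounded figures shown in its E row.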

Independence
A year group in school chooses between drama and history as below. Is there any
difference between boys' and girls' choices?

Observed

         Chose drama  Chose history  Total
Boys              43             55     98
Girls             52             54    106
Total             95            109    204


Expected = (row tot * col tot)/overall tot

         Chose drama  Chose history  Total
Boys            45.6           52.4     98
Girls           49.4           56.6    106
Total             95            109    204


(observed - expected)^2/expected

         Chose drama  Chose history  Total
Boys             0.2            0.1
Girls            0.1            0.1
Total                                 0.55

Chi-square is 0.55. There are (2-1)*(2-1) = 1 degree of freedom. Checking the Chi Square
table shows 0.55 is between 0.004 (the 0.95 figure) and 3.84 (the 0.05 figure), so the null
hypothesis cannot be rejected and no significant difference between boys' and girls' choices
can be claimed.
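The same calculation can be sketched in Python for any table of counts, using the
(row total) x (column total) / n rule for expected values and (R - 1) * (C - 1) for the
degrees of freedom. The counts are those of the drama/history example; the function names
are hypothetical.

```python
def expected_table(observed):
    """Expected count for each cell: row total * column total / overall total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_table(observed):
    """Chi-square statistic and degrees of freedom for a table of observed counts."""
    expected = expected_table(observed)
    chi_square = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi_square, df


choices = [[43, 55],   # boys:  drama, history
           [52, 54]]   # girls: drama, history
table_chi, table_df = chi_square_table(choices)
print([[round(e, 1) for e in row] for row in expected_table(choices)])
print(f"chi-square = {table_chi:.2f}, df = {table_df}")
```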
Reporting
Chi-square is reported in the following form:
χ²(3, N = 125) = 10.2, p = .012
Where:
3 - the degrees of freedom
125 - subjects in the sample
10.2 - the χ² test statistic
.012 - the probability of a result at least this extreme if the null hypothesis is true

Discussion
This test compares observed data with what we would expect to get (if the null hypothesis
of no difference were true). It is based on the principle that if the two variables are not
related (for example gender is not related to deafness) then measures applied to each
variable will give similar results (for example about the same proportion of men and women
being found to use a hearing aid), with any variation between the results being purely
caused by chance. If the experimental measures are significantly different, then some
relationship can be claimed.
Percentages do not work directly because the statistic depends on the actual counts: the
same percentages give a much smaller chi-square for a small sample than for a large one. If
only percentages are available, convert them back into counts using the sample size before
applying the test.
The measurement is unusual in that it has a square on numerator and a non-square on the
denominator. Squaring removes negatives and exaggerates outliers. This increases the
effect that chi-square has in analyzing the difference between two data sets.
Note that the test only reports whether two sets of figures are similar. It says nothing
about the nature of the similarity.
A chi-gram is a bar-chart plot of a set of chi-square calculations and can visually show how
chi-square varies across a set of related measurements.
Where variables are dichotomous (i.e. can have only one of two values) and the samples are
matched, McNemar's test is a similar test that is customized for this circumstance.
Note that this test is called the 'Chi-square' test, not 'Chi-squared'.
The Chi-square test is non-parametric.
Phi (φ) Correlation
Description
Phi (φ) correlation is used to assess correlation between two variables where they are in a 2
x 2 table (ie. both variables are dichotomous).
Phi is calculated by first calculating chi-square, then using the following calculation:
φ = SQRT(χ² / N)
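As a worked illustration, the Python below applies this calculation to the chi-square of
0.55 found for the 204 pupils in the drama/history example. The function name is
hypothetical.

```python
from math import sqrt


def phi_from_chi_square(chi_square, n):
    """Phi for a 2 x 2 table: square root of chi-square divided by the sample size."""
    return sqrt(chi_square / n)


phi = phi_from_chi_square(0.55, 204)
print(f"phi = {phi:.3f}")
```

A value this close to 0 indicates very little association between gender and subject choice
in that example.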
Discussion
Chi-square says that there is a significant relationship between variables, but it does not
say just how significant and important this is. Phi correlation is a post-test to give this
additional information.
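Phi varies between -1 and 1. Close to 0 it shows little association between variables. Close
to 1, it indicates a strong positive association. Close to -1 it shows a strong negative
correlation. Note that the square-root calculation above gives only the size of phi (0 to 1);
the direction of the association has to be read from the table itself.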
Remember that Phi is only of use in 2x2 tables. Where tables are larger, use Cramer's V.
