
Introductory Course in SPSS
by Prof. Liberato Camilleri

L.Camilleri

1. Methods of Sampling

Sampling Theory

Sampling theory is a study of relationships existing between a population and a
sample drawn from it.
Sampling theory is useful in estimating unknown quantities such as the population
mean and variance from knowledge of corresponding sample quantities.
Sampling theory is also useful in determining whether the observed differences
between two samples are due to chance variation or whether they are really
significant.
Statistical inference is a study of the inferences made about a population by using
samples drawn from it.

Random Sampling
A sample of n items is said to be chosen by random sampling from a population if:
Every member of the population has the same chance of being included in the sample;
The members of the sample are chosen independently of each other (the choice of one
member is not influenced by the other chosen members).
Proper random sampling requires that we have a list of all N items in the population, so
that we can assign each item one of the numbers from 1 to N. Such a list is called a
sampling frame. The use of sampling frames makes it easy to draw random samples
with the aid of computers or random number tables. Unfortunately, there are many
situations in which it is not possible to construct a sampling frame. For instance, if we
want to use a sample to estimate the mean height of the trees in a forest, it would be
impossible to number the trees, choose random numbers and then locate and measure the
corresponding trees. In these situations, the elements of a random sample must be chosen
haphazardly. That is, we must not select or reject any element of a population because
of its seeming typicalness or lack of it, nor must we favour or ignore any part of a
population because of its accessibility or lack of it.
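
The use of a sampling frame can be sketched in a few lines of Python. This is only an illustration: the frame of 500 labelled units, the function name simple_random_sample and the seed are all hypothetical, and the draw is made without replacement, which is the usual practice for a finite frame.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct items from a sampling frame, each equally likely."""
    rng = random.Random(seed)
    # Number the items 1..N, then pick n distinct random numbers in that range.
    numbers = rng.sample(range(1, len(frame) + 1), n)
    return [frame[k - 1] for k in numbers]

# Hypothetical sampling frame listing N = 500 units.
frame = [f"unit-{k}" for k in range(1, 501)]
sample = simple_random_sample(frame, n=20, seed=1)
print(sample[:5])
```

Fixing the seed only makes the draw repeatable; leaving it as None gives a fresh sample on every run.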
Systematic Sampling
In some instances, the most practical way of sampling is to select, say, every 20th name
on a list, every 12th house on one side of a street, every 50th component coming off an
assembly line, and so on. This is called systematic sampling, and an element of
randomness can be introduced into this kind of sampling by using random numbers to
pick the unit with which to start. Although a systematic sample may not be a random
sample in accordance with the definition, in some instances it actually provides an
improvement over a random sample inasmuch as the sample is spread more evenly over
the entire population. The real danger in systematic sampling lies in the possible
presence of hidden periodicities. For instance, if we inspect every 40th piece made by a
particular machine, the results would be very misleading if, because of a regularly
recurring failure, every 10th piece produced by the machine has blemishes.
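
The danger of hidden periodicities can be illustrated with a small sketch. The production run of 400 numbered pieces, the function names and the seed are hypothetical; the only facts taken from the text are that every 40th piece is inspected and that every 10th piece is blemished.

```python
import random

def systematic_sample(items, k, seed=None):
    """Take every k-th item, starting at a random position among the first k."""
    start = random.Random(seed).randrange(k)  # random start, 0 .. k-1
    return items[start::k]

def is_blemished(piece):
    # Assumed failure pattern: every 10th piece produced has blemishes.
    return piece % 10 == 0

pieces = list(range(1, 401))                  # hypothetical run of 400 pieces
inspected = systematic_sample(pieces, k=40, seed=0)
blemished = [p for p in inspected if is_blemished(p)]

print(len(inspected), len(blemished))
```

Because 40 is a multiple of 10, all inspected pieces share the same position in the failure cycle. The sample therefore shows either no blemished pieces or only blemished pieces, although 10% of the production is blemished.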

Stratified Sampling
If we have information about the composition of a population and this is of relevance to
our investigation, we may be able to improve on random sampling by stratification.
This is a procedure that consists of dividing the population into a number of
non-overlapping sub-populations, or strata, and then taking a random sample from each
stratum. In stratified sampling the strata are often sampled in proportion to their size,
which means that the sizes of the samples from different strata are proportional to the
sizes of the strata. Stratification is not restricted to a single variable of classification.
Populations are often stratified according to several characteristics. For instance, in a
survey to determine the opinion of the Maltese people towards the European Union, one
might stratify the sample not only with respect to the 13 districts, but also with respect to
people's sex and age. This is called cross stratification and is widely used in opinion
sampling and market research because it increases the reliability (precision) of estimates.
In stratified sampling, the cost of taking random samples from the individual strata is
often so high that interviewers are simply given quotas to be filled from the different
strata with few restrictions. This is called quota sampling and is very convenient;
however, the resulting samples do not normally have the essential features of random
samples. In the absence of any controls, interviewers tend to select individuals who are
most readily available - persons who work in the same building or perhaps reside in the
same area. Quota samples are essentially judgement samples, and inferences based on
such samples generally do not lend themselves to any sort of formal evaluation.
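
Proportional allocation can be sketched as follows. The three strata, their sizes and the function name proportional_stratified_sample are made up for illustration; each stratum contributes a random sample whose size is proportional to the size of the stratum.

```python
import random

def proportional_stratified_sample(strata, n, seed=None):
    """Sample each stratum at random, in proportion to its size."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Stratum sample size, rounded to the nearest whole number.
        size = round(n * len(members) / total)
        sample[name] = rng.sample(members, size)
    return sample

# Hypothetical population of 1000 split into strata of 600, 300 and 100.
strata = {"A": list(range(600)), "B": list(range(300)), "C": list(range(100))}
sample = proportional_stratified_sample(strata, n=50, seed=2)
sizes = {name: len(chosen) for name, chosen in sample.items()}
print(sizes)  # {'A': 30, 'B': 15, 'C': 5}
```

When the proportions do not give whole numbers, the rounded stratum sizes may add up to slightly more or less than n.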

Cluster Sampling
Cluster sampling is particularly useful when the population members are widely scattered
geographically. In cluster sampling, the total population is divided into a number of
small subdivisions (clusters) and some of these clusters are randomly selected for
inclusion in the overall sample. If the clusters are geographic subdivisions, this kind of
sampling is also called area sampling. This kind of sampling is effective when random
sampling is impossible because suitable lists are not available and the cost of contacting
people scattered over a wide area is very high. It is easier and cheaper to interview
people living close together in clusters rather than selecting them at random over a wide
area. Although estimates based on cluster samples are usually not as reliable as estimates
based on random samples, they are often more feasible with regard to cost.

Multistage Sampling
Most large-scale surveys combine different types of sampling. For instance, if a
government wants to study the attitude of teachers towards certain federal programs,
statisticians might first stratify the country by towns; they might then use cluster
sampling, subdividing each stratum into a number of subdivisions (schools) and finally
they might use random sampling or systematic sampling to select a sample of teachers
within each cluster.
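
The last two stages of such a design (selecting clusters at random, then sampling within each selected cluster) might be sketched like this. The 12 schools with 30 teachers each, and the function name two_stage_sample, are hypothetical.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: sample members within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        name: rng.sample(clusters[name], min(n_per_cluster, len(clusters[name])))
        for name in chosen
    }

# Hypothetical stratum: 12 schools (clusters), each with 30 teachers.
schools = {
    f"school-{s}": [f"teacher-{s}-{t}" for t in range(1, 31)]
    for s in range(1, 13)
}
sample = two_stage_sample(schools, n_clusters=4, n_per_cluster=5, seed=3)
for school, teachers in sample.items():
    print(school, teachers)
```

Only the selected schools need to be visited, which is what makes cluster designs cheaper than a random sample spread over every school.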


2. Determining Sample Size

The sample size calculator can be used to estimate the confidence interval given the sample
size, confidence level and population size. If the population size is unknown leave the
value blank.
Example
Find the confidence interval when conducting a study on a sample of size 350 respondents
selected from a very large population, assuming a 95% confidence level.
Confidence interval = 5.24%

Example
Find the confidence interval when conducting a study on a sample of size 211 respondents
selected from a population of size 1008 individuals, assuming a 95% confidence level.
Confidence interval = 6%
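
The text does not say how the calculator works. A minimal sketch, assuming it uses the usual formula for a proportion with the worst case p = 0.5, rounded z-values (1.96 for 95%, 2.58 for 99%) and a finite population correction, is given below. The function name margin_of_error and the table Z are illustrative. Note that the "confidence interval" reported by the calculator is the half-width (margin of error) of the interval.

```python
import math

Z = {95: 1.96, 99: 2.58}  # assumed rounded z-values for each confidence level

def margin_of_error(n, confidence=95, population=None, p=0.5):
    """Half-width of the confidence interval for a proportion, in percent."""
    moe = Z[confidence] * math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction for a known population size.
        moe *= math.sqrt((population - n) / (population - 1))
    return 100 * moe

print(round(margin_of_error(350), 2))                   # 5.24
print(round(margin_of_error(211, population=1008), 2))  # 6.0
```

Under these assumptions the sketch reproduces both results quoted in the examples.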


The sample size calculator can be used to estimate the sample size given the confidence
interval, confidence level and population size. If the population size is unknown leave the
value blank.

Example
Find the sample size that should be selected from a very large population if the requested
confidence interval is 3%. Assume a 95% confidence level.
Required sample size = 1067

Example
Find the sample size that should be selected from a population of size 1800 individuals if
the requested confidence interval is 5%. Assume a 99% confidence level.
Required sample size = 486
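
The reverse calculation can be sketched under the same assumptions: worst case p = 0.5, rounded z-values of 1.96 and 2.58, a finite population correction and rounding to the nearest whole number. These choices are not stated in the text; they are the ones that reproduce the two results above. The function name required_sample_size is illustrative.

```python
Z = {95: 1.96, 99: 2.58}  # assumed rounded z-values for each confidence level

def required_sample_size(margin_pct, confidence=95, population=None, p=0.5):
    """Sample size needed for a given margin of error (in percent)."""
    e = margin_pct / 100
    n = Z[confidence] ** 2 * p * (1 - p) / e ** 2   # very large population
    if population is not None:
        n = n / (1 + (n - 1) / population)          # finite population correction
    return round(n)

print(required_sample_size(3))                                  # 1067
print(required_sample_size(5, confidence=99, population=1800))  # 486
```

Using the more precise z-value 2.576 for 99% gives 485 rather than 486 in the second example, so the calculator's exact rounding remains an assumption.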


2. Overview of Data Analysis in SPSS


2.1 Data Entry

In SPSS, variables are defined in the Variable View window. These variables are
generated by specifying a name for each variable. Factors are declared by specifying a
label and a value for each level of the categorical variables. As an illustration we provide
the following case study.
In a study two groups of respondents were picked at random. The experimental group
suffered from cardiac problems and the control group was not known to suffer from heart
problems. All the members in the two groups were known to make daily use of a
treadmill. These 22 respondents were asked to fill in a questionnaire specifying their age,
gender, weight, and mean duration of daily treadmill use, measured in minutes. They
were also asked to indicate whether or not they had cardiac problems.
What is the mean duration of your daily use of a treadmill? ________ (minutes)
Do you have cardiac problems? ______
Gender: _______
What is your weight? _____ (kg)
What is your age? _____ (years)

Gender and cardiac condition are two factors (qualitative variables) each having two
levels (categories). These levels have to be labelled and enumerated. For the cardiac
condition, the value 1 represents an unhealthy respondent with heart problems and the
value 2 represents a healthy respondent. For the gender, the value 1 represents a male
respondent and the value 2 represents a female. No levels have to be specified for the
variables age, weight and duration of daily treadmill use because they are covariates.


In SPSS the data is entered in the Data View window. Data files are presented in a
rectangular arrangement where the rows represent the respondents and the columns
represent the variables. A row contains the information elicited from a particular
respondent for all the variables and a column contains the information for a particular
variable elicited from all the respondents.

A further task was to generate another factor by classifying the respondents' ages into
three age categories. This could be done explicitly by SPSS using the Recode option.
This option recodes any age value into an appropriate age category and then saves it in
the generated factor Age groups.
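
The logic of such a recode can be sketched outside SPSS. The text does not give the boundaries of the three age categories, so the cut points 40 and 60, the labels and the sample ages below are purely hypothetical.

```python
def recode_age(age, cutpoints=(40, 60)):
    """Return the age category (1, 2 or 3) for an age in years."""
    low, high = cutpoints  # hypothetical boundaries, not taken from the study
    if age < low:
        return 1
    if age < high:
        return 2
    return 3

AGE_GROUP_LABELS = {1: "under 40", 2: "40 to 59", 3: "60 and over"}

ages = [34, 52, 61, 40, 59]                  # made-up ages
age_groups = [recode_age(a) for a in ages]
print(age_groups)                            # [1, 2, 3, 2, 2]
print([AGE_GROUP_LABELS[g] for g in age_groups])
```

As with any factor, each level of the generated variable has a numeric value and a label.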


2.2 Graphical Presentations

A histogram is an important graphical presentation which shows the distribution of values
of a covariate (quantitative variable). These values are first divided into groups of equally
spaced intervals and then the frequency (count) of cases in each interval is plotted as a
bar. A histogram can be created by choosing Graphs and Histogram from the menus.
A normal curve with the same mean and variance as the data can be superimposed onto the
histogram. It can be used to assess the symmetry of the distribution.
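
The counting step behind a histogram can be sketched as follows. The durations, the starting value and the interval width are made up for illustration; the function name histogram_counts is hypothetical.

```python
def histogram_counts(values, start, width, n_bins):
    """Count the values falling in each of n_bins equally spaced intervals."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the interval containing v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical treadmill durations in minutes.
durations = [8, 9, 11, 12, 12, 13, 13, 13, 14, 15, 17]
counts = histogram_counts(durations, start=8, width=2, n_bins=5)

for k, count in enumerate(counts):
    lower = 8 + 2 * k
    print(f"{lower:2d} to {lower + 2:2d}: " + "#" * count)
```

Each interval includes its lower limit and excludes its upper limit, so a value of 12 is counted in the interval 12 to 14. Changing the width or the number of intervals changes the shape of the plot.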

The distribution (histogram) of the mean duration of daily treadmill use can be generated
by moving the covariate duration into the variable list and selecting Display Normal
Curve. It is evident that a larger proportion of respondents are using the treadmill in the
range 12 to 14 minutes daily. The distribution of the mean duration of daily treadmill use
is fairly normal. It is possible, in SPSS, to modify the minimum, maximum and increment
of the scale values. To make these modifications, activate the chart editor by double
clicking on the graph, highlight the values on the axis and then select Scale from the
Properties tab.


It is also possible to modify the number of bars in the histogram and change the style and
colour of the inside of the histogram. To conduct this alteration, activate the chart editor
by double clicking on the graph, highlight the histogram and then select Fill and Border
from the Properties tab.

The number of intervals or the interval widths is modified by selecting the Histogram
Option from the Properties tab.

It is possible to generate two separate histograms of the mean duration of daily treadmill
use for the healthy and sick groups. This is done by moving the factor cardiac into
the Panel by rows box. It is evident that healthy respondents use the treadmill for a longer
period of time compared to sick respondents.


Pie charts are used to analyze factors (qualitative variables). In a pie chart the different
levels of a factor are represented by the sectors of a circle. The size of each slice is
proportional to the size of its respective category. For example, a pie chart showing the
percentage of respondents in the experimental and control groups can be created by
selecting pie charts from the graphs menu. Slices can either represent frequencies or
percentage of cases. To define the slices drag the categorical variable Cardiac to slice
by. Pie chart properties can be modified by clicking the right button. The counts or
percentages can be displayed on the pie charts by selecting data labels.
[Figure: pie chart of "Do you have cardiac problems?" with slices Yes (45.45%) and No (54.55%)]

Using this property window it is possible to separate slices by selecting explode chart.

[Figure: exploded pie chart of "Do you have cardiac problems?" with slices Yes (45.45%) and No (54.55%)]

From this property window it is possible to change a pie chart to a bar chart, line chart or
area chart. In a bar chart the frequency or percentage value of each factor level is
represented by a vertical bar. Larger values are represented by longer bars. In a line chart
each frequency or percentage value is represented by a point. These points are connected
by straight lines. An area chart is a line chart with the space below the line filled in. Bar,
line and area charts can also be created by selecting bar, line and area from the graphs
menu. All graphs show a higher proportion of respondents in the control group.
[Figure: two charts of the percentage of respondents by cardiac condition: Yes 45.45%, No 54.55%]

One can also create pie, bar and area charts to display the proportion of respondents in
the control and experimental groups separately for males and females. These charts can
be generated by moving Cardiac in the category axis and Gender in the column panel.

[Figure: pie chart and bar chart of "Do you have cardiac problems?" (Yes, No), panelled by Gender (male, female); vertical axis: Percent]

All graphs show that a higher proportion of males than females are in the experimental
group (respondents with cardiac problems), and that a higher proportion of females than
males are in the control group (respondents with no cardiac problems).

[Figure: chart of Percent by Cardiac (Yes, No), panelled by Gender (male, female)]

Another way of representing categories is with clustered charts. Clustered area charts can
be generated by selecting area and stacked from the chart menu. In these graphs the
areas for all the factor levels have the same baseline. In the following clustered area chart
the category axis is defined by Cardiac and area is defined by Gender.

[Figure: clustered area chart of Percent by "Do you have cardiac problems?" (Yes, No), with areas defined by Gender (male, female)]

Clustered bar charts can be generated by selecting bar and clustered from the chart
menu. In the following clustered bar chart the category axis is defined by Cardiac and
the clusters are defined by Gender. Error bars can be displayed on bar charts from the
option menu. The bars display the 95% confidence interval and help the analyst visualize
distributions and dispersion by indicating the variability of the measure being displayed.
[Figure: clustered bar chart of Percent by "Do you have cardiac problems?" (Yes, No), with clusters defined by Gender (male, female)]

A bivariate scatter plot is used to analyze two covariates simultaneously and is plotted
along two axes. This graphical presentation of data points reveals important relationships
between the covariates. It can also reveal outliers and unusual combinations of data
points. Points that do not fit a relationship well stand out in the plot. The procedure is to
click on Graphs and Scatter/Dot and select Simple Scatter. The axes of the scatter
plot are defined by moving duration to the y-axis and the respondent's age to the
x-axis. The line of best fit can be obtained by double clicking on the graph to activate the
chart editor. Select Add fit line at Total to produce the regression line.
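
The regression line added by the chart editor is, by assumption here, the ordinary least squares line. A minimal sketch of that calculation follows; the ages and durations are made-up values chosen to lie exactly on a line, not the data of the case study, and the function name fit_line is illustrative.

```python
def fit_line(xs, ys):
    """Least squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

ages = [20, 30, 40, 50, 60]        # hypothetical ages in years
durations = [18, 16, 14, 12, 10]   # hypothetical durations in minutes
slope, intercept = fit_line(ages, durations)
print(slope, intercept)
```

For these made-up values the slope is -0.2 and the intercept is 22, so the fitted duration falls by 2 minutes for every additional 10 years of age.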


It is evident from the plot that young respondents are more likely to use the treadmill for
a longer daily duration compared to elderly ones.

The data points can be clustered either by cardiac condition or by gender. These two
graphs can be produced by moving the factors cardiac and gender, in turn, into Set
Markers by. Separate fit lines for the two clusters can be obtained by double clicking on
the graph to activate the chart editor and then selecting Add fit line at Subgroups to
produce separate regression lines.

The daily treadmill use is on average longer for healthy respondents compared to sick
ones. This difference becomes more conspicuous with an increase in the respondents'
age. The reduction of the mean daily treadmill use as the respondents get older applies to
both experimental and control groups.


The second scatter plot does not demonstrate any gender bias with regard to the daily
treadmill use. Duration of treadmill use decreases with age for both male and female
respondents.

It is also possible to produce scatter plots for three covariates simultaneously; however, it
is very difficult to visualize the relationships between all three covariates unless the
scatter plot is rotated about an axis. This is carried out by clicking on 3D scatter and
then defining the axes by moving duration, age and weight to the y, x and z axes
respectively. In a 3D space we get a plane of best fit rather than a line.


Simple box plots, sometimes called box-and-whisker plots, characterize the distribution
and dispersion of a covariate, displaying its median and quartiles across the levels of a
factor. The median is the 50th percentile and the interquartile range ranges from the 25th
to the 75th percentile. Whiskers at the ends of the box show the distance from the end of
the box to the largest and smallest observed values that are less than 1.5 box lengths from
either end of the box. Data points that fall outside this range are labelled as outliers or
extreme values and their position is identified. Box plots are created by choosing
Graphs and Boxplot from the menus. In a simple box plot the selected variable must
be a covariate and the category axis must be defined by a factor.
This simple box plot demonstrates the distribution of respondents' weights for both the
experimental and control groups. The median weights for the two groups are respectively
91.5kg and 73kg. This implies that half the respondents in the experimental group weigh
more than 91.5kg and half the respondents in the control group weigh less than 73kg. An
interesting observation is that the lower quartile (25th percentile) for the experimental
group and the upper quartile (75th percentile) of the control group are almost equal. This
implies that 75% of the respondents in the experimental group weigh more than 75% of
the respondents in the control group.


It is possible to generate two separate box plots showing the distribution of respondents'
weights for both males and females. This is done by moving the factor gender into
the Panel by rows box. An interesting observation is that male respondents weigh more than
females for both the experimental and control groups. It is also evident that sick male
respondents weigh significantly more than healthy ones but this is not so evident for
females. Three data points are marked as outliers because they lie between 1.5 box
lengths and 3 box lengths from the end of the box. Any data point which lies beyond 3
box lengths is marked with an asterisk.
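
The rule for marking points can be sketched as follows. The weights and the quartiles 70 and 80 are made up for illustration, and the quartiles are passed in rather than computed because statistical packages differ in how they calculate them. The function name classify_points is hypothetical.

```python
def classify_points(values, q1, q3):
    """Label each value as inside, outlier (1.5 to 3 box lengths) or extreme."""
    box = q3 - q1  # box length (interquartile range)
    result = {"inside": [], "outlier": [], "extreme": []}
    for v in values:
        # Distance from the nearest end of the box (0 if v lies within the box).
        distance = max(q1 - v, v - q3, 0)
        if distance > 3 * box:
            result["extreme"].append(v)
        elif distance > 1.5 * box:
            result["outlier"].append(v)
        else:
            result["inside"].append(v)
    return result

weights = [60, 68, 70, 72, 75, 78, 80, 100, 125]   # hypothetical weights in kg
result = classify_points(weights, q1=70, q3=80)
print(result["outlier"], result["extreme"])        # [100] [125]
```

With a box length of 10 kg, the value 100 lies 20 kg above the box (between 1.5 and 3 box lengths) and is an outlier, while 125 lies 45 kg above the box (more than 3 box lengths) and is an extreme value.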

It is possible to combine the two plots in a single clustered box plot. Clustered box plots
display the distribution of a covariate across two factors. In the subsequent clustered box
plot the selected variable is the respondents' weight whereas the category axis and the
clusters are respectively defined by cardiac condition and gender. The plot exhibits the
same contrasts displayed in the preceding plot.


2.3 Analyzing multiple responses

In the case study presented the 22 respondents were further asked to indicate the type of
food that they prefer eating given four possible food categories. These food options were
pasta, fish, meat and vegetables and the respondents were allowed to select more than one
option. The four food options have to be defined explicitly by four categorical variables
because each cell can allow only one data entry. The first categorical variable indicates
whether the respondents prefer pasta or not. For instance, the second respondent prefers
pasta and fish whereas the fourth respondent prefers pasta, meat and vegetables.

Multiple responses are analyzed through a multiple response frequency procedure. This
produces frequency tables for multiple response sets.

To generate a single combined set of these four food categories choose Multiple
Responses from the menus. The new set of food categories is defined by moving pasta,
fish, meat and vegetables into the new set, which is labelled Preferred Food. The levels of
this factor are defined by entering the range of categories from 1 to 4.
Crosstabs are very useful when analyzing associations between factors. It is also possible
to get cross-tabulations of any number of factors by choosing Multiple Responses and
Crosstabs from the menus. To examine the association between the respondents' health
and preferred food, one needs to specify which of these two factors is defined by the
crosstab rows and which by the columns. In this example we define the levels of preferred
food by the crosstab rows and the health categories by the crosstab columns. It is evident
from the crosstab that respondents with cardiac problems are more likely to eat vegetables
and fish whereas healthy respondents are more likely to eat pasta and meat.

                              Do you have cardiac problems?
Preferred Food                Yes       No        Total
Pasta           Count           4        9           13
Fish            Count           8        7           15
Meat            Count           3        9           12
Vegetables      Count           9        2           11
Total           Count          10       12           22

An alternative method is to stack the entries of these four categorical variables pasta, fish,
meat and vegetables to explicitly generate this new factor Preferred Food. Stack also
the entries of the factor Cardiac four times to generate a new expanded factor such that
both factors have 88 entries. To obtain a crosstab select Descriptive Statistics and
Crosstabs from the menus. Since the numbers of respondents in the two health categories
are unequal, it is advisable to produce column percentages so that the preferred food
categories can be compared fairly between the two health groups. A clustered bar graph can
also be produced to display these associations graphically.

                              Do you have cardiac problems?
Preferred Food                Yes       No        Total
Pasta           Count           4        9           13
                Percentage   16.7%    33.3%       25.5%
Fish            Count           8        7           15
                Percentage   33.3%    25.9%       29.4%
Meat            Count           3        9           12
                Percentage   12.5%    33.3%       23.5%
Vegetables      Count           9        2           11
                Percentage   37.5%     7.4%       21.6%
Total           Count          24       27           51
                Percentage  100.0%   100.0%      100.0%

For each preferred food category the bar lengths vary considerably between the two
health-groups demonstrating graphically the association described above.
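
The column percentages in the crosstab can be checked with a short sketch. The counts are those of the crosstab (Yes, No); the variable names are illustrative.

```python
# Counts of preferred food by cardiac condition: (Yes, No).
counts = {
    "Pasta": (4, 9),
    "Fish": (8, 7),
    "Meat": (3, 9),
    "Vegetables": (9, 2),
}

# Column totals count responses, not respondents, since several foods may be chosen.
col_totals = [sum(row[j] for row in counts.values()) for j in range(2)]

col_pct = {
    food: tuple(round(100 * c / t, 1) for c, t in zip(row, col_totals))
    for food, row in counts.items()
}

print(col_totals)  # [24, 27]
for food, pct in col_pct.items():
    print(food, pct)
```

Each count is divided by the total number of responses in its own column, so the two health groups can be compared even though they differ in size.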

2.4 Methods for describing data sets

Numerical descriptive measures computed from a sample are very useful for making
inferences about the corresponding measures of the population. A number of numerical
methods are available to describe quantitative data sets. Each of these methods measures
one of the following four data characteristics.
1. Measures of central tendency (location)

Central tendency is the tendency of the data to cluster about a certain numerical value.
The most popular measure of central tendency is the sample mean. The sample mean $\bar{x}$
is simply the average of the n observations $x_i$:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The median is another measure of central tendency. This is the middle observation when
all the observations are arranged in ascending order.
The third measure of central tendency is the mode. This is the observation in the sample
which occurs most frequently.
2. Measures of dispersion (variability)

Dispersion is the extent to which the given data is different from the mean. The sample
standard deviation, s, is the most popular measure of dispersion. It is the square root of
the sample variance given by

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$


The range is another measure of dispersion and it is the difference between the largest
and the smallest observations. This is a rather plain, insensitive measure of dispersion and
is hardly ever used.
3. Measures of relative standing

Measures of relative standing describe the placement of an observation relative to the rest of the
data. One measure of the relative standing of an observation is its percentile ranking.
The observations are ranked from smallest to largest and the pth percentile is the number
such that p% of the observations fall below this value. The lower quartile, median and
upper quartile are respectively the 25th, 50th and 75th percentiles. The interquartile range
is the distance between the lower and upper quartiles. Percentile rankings are of practical
value only for large data sets.
4. Measures of the distribution of the data set

The skewness characterizes the degree of asymmetry of a distribution around its mean.
Negative skewness indicates a distribution which is skewed to the left. Positive skewness
indicates a distribution which is skewed to the right.
Many naturally occurring continuous variables, such as people's heights and examination
marks have a normal distribution which is symmetric. This is the most widely used
distribution in Statistics. The kurtosis characterizes the relative peakedness or flatness of
a distribution compared with the normal distribution. The skewness and kurtosis of the
normal distribution are both zero. Negative kurtosis indicates a relatively flat distribution
compared to the normal distribution, whereas positive kurtosis indicates a relatively
peaked distribution.
The Frequencies procedure of SPSS provides the most important summary statistics.
Some of these statistics require that the data follow a normal distribution (or at least that
the shape of the variable's histogram be symmetric). In particular, the mean, standard
deviation, variance and skewness should be used with caution unless the distribution is
fairly symmetric and has no extreme outliers. A descriptive statistic is called robust if the
calculations are insensitive to violations of the assumption of normality. This category
includes the median, mode, minimum and maximum values, range and quartiles. It is
necessary to use graphics such as histograms with normal curve to determine whether the
variables summarized have approximately a normal distribution.
 9  12  15  21  24  26  31  31  38  39
43  44  45  47  47  49  52  52  54  56
56  56  57  58  58  63  64  64  65  67
68  68  70  73  73  74  77  79  80  82
84  84  86  87  88  88  90  93  95  96

The above table shows the marks obtained by 50 students in a Mathematics examination.
The sample was chosen randomly from a large school population.

From the menus select Descriptive Statistics and Frequencies and move the vector of raw
marks into the variables list. From the dialogue box select the statistics mean, median,
mode, quartiles, standard deviation, variance, range, skewness and kurtosis to
measure central tendency, variability, symmetry and peakedness of the distribution of
marks.
marks
Mean                    59.56
Median                  60.50
Mode                    56
Std. Deviation          23.037
Variance                530.700
Skewness                -.404
Kurtosis                -.574
Range                   87
Percentiles     25      44.75
                50      60.50
                75      79.25

The three measures of central tendency indicate that the average mark is approximately
60. In a perfectly symmetric distribution the mean, median and mode should be equal.
The fact that these three measures differ from each other indicates that the distribution is
skewed. The bigger the difference between these three measures the less symmetric the
distribution will be. The marks range from 9 to 96 explaining why the standard deviation
is large. If the marks were clustered closer to the mean one would expect a smaller
standard deviation. Both the skewness and the kurtosis have a negative value indicating
that the distribution of marks is skewed to the left and is flatter than the normal
distribution. This can be verified by plotting a histogram and displaying the normal
curve.


The lower and upper quartiles are respectively 44.75 and 79.25. This implies that 25% of
the students got a mark less than 45 and another 25% of the sample got a mark higher
than 79.
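
Most of the statistics in the output can be reproduced with the Python standard library, as a check on the calculations. This is a sketch and not the SPSS procedure itself; skewness and kurtosis are left out because the standard library does not provide them.

```python
import statistics

# The 50 examination marks from the table above.
marks = [
     9, 12, 15, 21, 24, 26, 31, 31, 38, 39,
    43, 44, 45, 47, 47, 49, 52, 52, 54, 56,
    56, 56, 57, 58, 58, 63, 64, 64, 65, 67,
    68, 68, 70, 73, 73, 74, 77, 79, 80, 82,
    84, 84, 86, 87, 88, 88, 90, 93, 95, 96,
]

mean = statistics.mean(marks)                   # 59.56
median = statistics.median(marks)               # 60.5
mode = statistics.mode(marks)                   # 56
variance = statistics.variance(marks)           # sample variance, divisor n - 1
std_dev = statistics.stdev(marks)               # square root of the variance
data_range = max(marks) - min(marks)            # 87
q1, q2, q3 = statistics.quantiles(marks, n=4)   # 44.75, 60.5, 79.25

print(mean, median, mode, data_range)
print(round(variance, 3), round(std_dev, 3))
print(q1, q2, q3)
```

The default method of statistics.quantiles places the pth percentile at position (n + 1)p/100 in the ordered data, which for these marks gives the same quartiles as the SPSS output.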

2.5 Types of Reliability

There are two classes of reliability testing and each estimates reliability in a different
way. These include:

Inter-rater, intra-rater and test-retest reliability
Internal consistency reliability

Inter-rater, intra-rater and test-retest reliability


Inter-rater reliability, inter-rater agreement, or concordance is the degree of agreement
among raters and assesses the homogeneity in the raters' evaluations of the same item. In
other words it assesses the consistency with which different raters produce similar evaluations
in judging the same abilities or characteristics in the same target person or object.

Intra-rater reliability is the degree of agreement among repeated administrations of a test
performed by a single rater. Test-retest reliability is a form of intra-rater reliability and
assesses the homogeneity or agreement in the rater's evaluations when the same test is
administered on two different occasions.
There are several tests to assess inter-rater, intra-rater and test-retest reliability; however,
the choice of test depends on the evaluation scale used by the raters rather than on the type
of reliability. If the evaluation scale is nominal (true, false or good, bad or present, absent)
then the Kappa test is recommended. If the evaluation scale is ordinal (poor, moderate, good,
excellent or never, rarely, sometimes, often, always) then the Gamma and Kendall's tau-b and
tau-c tests are all appropriate and yield similar results. If the evaluations have an interval
or metric scale then the absolute agreement intraclass correlation is recommended to assess
inter-rater, intra-rater or test-retest reliability. The following examples illustrate the
procedure.

Example 1
Two doctors A and B assess 30 patients independently on 4 mental disorders (anxiety,
psychotic, personality and obsessive compulsive disorder).
Patient   Doctor A               Doctor B
1         Anxiety disorder       Psychotic disorder
2         Psychotic disorder     Psychotic disorder
3         Personality disorder   Personality disorder
4         Obsessive disorder     Obsessive disorder
5         Psychotic disorder     Psychotic disorder
6         Obsessive disorder     Obsessive disorder
7         Personality disorder   Personality disorder
8         Anxiety disorder       Anxiety disorder
9         Obsessive disorder     Psychotic disorder
10        Psychotic disorder     Psychotic disorder
11        Psychotic disorder     Psychotic disorder
12        Personality disorder   Personality disorder
13        Obsessive disorder     Obsessive disorder
14        Anxiety disorder       Anxiety disorder
15        Psychotic disorder     Psychotic disorder
16        Personality disorder   Personality disorder
17        Obsessive disorder     Obsessive disorder
18        Psychotic disorder     Psychotic disorder
19        Personality disorder   Personality disorder
20        Obsessive disorder     Obsessive disorder
21        Anxiety disorder       Obsessive disorder
22        Anxiety disorder       Anxiety disorder
23        Psychotic disorder     Psychotic disorder
24        Obsessive disorder     Obsessive disorder
25        Psychotic disorder     Psychotic disorder
26        Anxiety disorder       Anxiety disorder
27        Personality disorder   Personality disorder
28        Personality disorder   Personality disorder
29        Obsessive disorder     Personality disorder
30        Psychotic disorder     Psychotic disorder

To assess inter-rater reliability the Kappa test will be used because the doctors' evaluations
have a nominal scale. To get the output, click on Analyze, Descriptive Statistics and
Crosstabs. Move the variables A and B into the row and column slots, click on Statistics
and select Kappa. Click on Continue and OK to get the output.

                                          Doctor B
Doctor A                Anxiety     Psychotic   Personality   Obsessive
                        disorder    disorder    disorder      disorder
Anxiety disorder           4           1           0             1
Psychotic disorder         0           9           0             0
Personality disorder       0           0           7             0
Obsessive disorder         0           1           1             6

Symmetric Measures
                                  Value    Std. Error    Approx. T    P-value
Measure of Agreement    Kappa     .820     0.083         7.715        0.000
Number of valid cases             30

The crosstab shows a high percentage agreement (26/30 x 100% = 86.7%), indicating
strong inter-rater reliability. Kappa values greater than 0.75 indicate excellent agreement
beyond chance; values in the range 0.4 to 0.75 indicate fair to good agreement; and values
below 0.4 indicate poor agreement. The p-value (approximately 0) is less than the 0.05
criterion, so the Kappa value (0.820) is significantly different from 0. Since it also exceeds
0.75, it indicates excellent inter-rater reliability.
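The Kappa value reported by SPSS can be checked by hand. The sketch below is an illustration, not part of the SPSS procedure: it assumes the usual definition of Cohen's Kappa, (observed agreement - chance agreement) / (1 - chance agreement), and the function name cohen_kappa and the diagnosis abbreviations are our own.

```python
from collections import Counter

# Diagnoses of the 30 patients, in patient order (abbreviations are ours):
# Anx = Anxiety, Psy = Psychotic, Per = Personality, Obs = Obsessive disorder
doctor_a = ("Anx Psy Per Obs Psy Obs Per Anx Obs Psy "
            "Psy Per Obs Anx Psy Per Obs Psy Per Obs "
            "Anx Anx Psy Obs Psy Anx Per Per Obs Psy").split()
doctor_b = ("Psy Psy Per Obs Psy Obs Per Anx Psy Psy "
            "Psy Per Obs Anx Psy Per Obs Psy Per Obs "
            "Obs Anx Psy Obs Psy Anx Per Per Per Psy").split()


def cohen_kappa(rater1, rater2):
    """Return (observed agreement, chance agreement, kappa)."""
    n = len(rater1)
    # Proportion of cases on which the two raters give the same category
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's category totals
    totals1, totals2 = Counter(rater1), Counter(rater2)
    p_chance = sum(totals1[c] * totals2[c] for c in totals1) / n ** 2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, p_chance, kappa


p_observed, p_chance, kappa = cohen_kappa(doctor_a, doctor_b)
print(f"agreement = {p_observed:.3f}, chance = {p_chance:.3f}, kappa = {kappa:.3f}")
```

With the 30 diagnoses listed for Example 1, the observed agreement is 26/30 and the Kappa value rounds to 0.820, in line with the SPSS output.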
Example 2
A teacher was asked to assess a child on 20 socio-emotional behaviour difficulties using a
5-point Likert (ordinal) scale, ranging from Strongly disagree to Strongly agree. The
teacher assessed the child on two separate occasions, one week apart.
Socio-emotional behaviour difficulties              Before              After
Often complains of headaches, stomach-aches         Strongly agree      Agree
Many worries, often seems worried                   Agree               Strongly agree
Often unhappy, downhearted or tearful               Strongly agree      Agree
Nervous or clingy in new situations                 Neutral             Agree
Many fears, easily scared                           Agree               Neutral
Often has temper tantrums or hot tempers            Disagree            Strongly disagree
Generally disobedient                               Strongly disagree   Strongly disagree
Often fights with other children or bullies them    Strongly disagree   Disagree
Often lies or cheats                                Disagree            Strongly disagree
Steals from home, school or elsewhere               Strongly disagree   Disagree
Restless, overactive, cannot stay still for long    Neutral             Agree
Constantly fidgeting or squirming                   Strongly disagree   Disagree
Easily distracted, concentration wanders            Disagree            Disagree
Acts hastily without thinking                       Disagree            Strongly disagree
Never finishes a task, poor attention span          Strongly disagree   Strongly disagree
Rather solitary, tends to play alone                Agree               Strongly agree
Has no friends                                      Disagree            Disagree
Generally disliked by other children                Strongly disagree   Disagree
Picked on or bullied by other children              Agree               Neutral
Gets on better with adults than children his age    Agree               Neutral

To assess test-retest (intra-rater) reliability, the Kendall tau-b, Kendall tau-c and Gamma
tests are used because the teacher's evaluations are on an ordinal scale. To get the output,
click on Analyze, Descriptive Statistics and Crosstabs. Move the variables Before and
After into the row and column slots, click on Statistics and select Kendall's tau-b,
Kendall's tau-c and Gamma. Click on Continue and OK to get the output.
These three measures are based on counts of concordant and discordant pairs of cases. A
pair is concordant if one case has larger values than the other case on both variables. A
pair is discordant if one case has the larger value on the first variable but the smaller
value on the second. When the two cases have the same value on one or both variables, the
pair is tied. If the number of concordant pairs is similar to the number of discordant pairs,
the Kendall tau-b, tau-c and Gamma values will be close to 0 and their respective p-values
will exceed the 0.05 criterion, indicating poor test-retest reliability. If the number of
concordant pairs is considerably larger than the number of discordant pairs, the Kendall
tau-b, tau-c and Gamma values will be close to 1 and their respective p-values will be less
than the 0.05 criterion, indicating satisfactory test-retest reliability.
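The pair counting described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the SPSS implementation: the ratings are coded 1 (Strongly disagree) to 5 (Strongly agree), which is our own coding, the function name ordinal_agreement is hypothetical, and the formulas are the standard textbook definitions of Gamma, tau-b and tau-c.

```python
from itertools import combinations
from math import sqrt

# Before and After ratings of the 20 items, coded (our assumption) as
# 1 = Strongly disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly agree
before = [5, 4, 5, 3, 4, 2, 1, 1, 2, 1, 3, 1, 2, 2, 1, 4, 2, 1, 4, 4]
after = [4, 5, 4, 4, 3, 1, 1, 2, 1, 2, 4, 2, 2, 1, 1, 5, 2, 2, 3, 3]


def ordinal_agreement(x, y):
    """Return (concordant, discordant, tau_b, tau_c, gamma) for paired ratings."""
    conc = disc = tied_x = tied_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            conc += 1          # same ordering on both variables
        elif direction < 0:
            disc += 1          # ordering reversed on the second variable
        tied_x += x1 == x2     # pairs tied on the first variable
        tied_y += y1 == y2     # pairs tied on the second variable
    n = len(x)
    n_pairs = n * (n - 1) // 2
    m = min(len(set(x)), len(set(y)))  # smaller number of categories used
    gamma = (conc - disc) / (conc + disc)
    tau_b = (conc - disc) / sqrt((n_pairs - tied_x) * (n_pairs - tied_y))
    tau_c = 2 * m * (conc - disc) / (n ** 2 * (m - 1))
    return conc, disc, tau_b, tau_c, gamma


conc, disc, tau_b, tau_c, gamma = ordinal_agreement(before, after)
print(conc, disc, round(tau_b, 3), round(tau_c, 3), round(gamma, 3))
```

For the Example 2 ratings this gives 113 concordant and 22 discordant pairs, so all three coefficients are well above 0.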
                                     Second evaluation
First evaluation     Strongly   Disagree   Neutral   Agree   Strongly
                     disagree                                agree
Strongly disagree    2          4          0         0       0
Disagree             3          2          0         0       0
Neutral              0          0          0         2       0
Agree                0          0          3         0       2
Strongly agree       0          0          0         2       0

Symmetric Measures
                                       Value   Std. Error   Approx. T   P-value
Ordinal by Ordinal   Kendall's tau-b   0.591   0.067        7.696       0.000
                     Kendall's tau-c   0.569   0.074        7.696       0.000
                     Gamma             0.674   0.081        7.696       0.000

The heaviest concentration of responses occurs near the principal diagonal. There is no
item on which the teacher agreed in one evaluation and disagreed in the other, as evidenced
by the zero counts near the lower left and upper right corners. All three tests indicate good
test-retest reliability since the p-values are all less than the 0.05 criterion.

Example 3
Two examiners were asked to correct the scripts of twenty students, with marks ranging
from 0 to 100. To assess inter-rater reliability, the intraclass correlation is used to
measure both consistency and absolute agreement.


Examiner A   Examiner B   Examiner A   Examiner B
67           36           83           56
49           48           53           28
91           84           98           76
84           71           90           60
97           52           69           42
49           19           82           50
38           29           45           33
59           27           40           35
65           38           61           34
79           46           76           41

To get the output, click on Analyze, Scale and Reliability Analysis. Move the variables
A and B into the items slot, click on Statistics, select Intraclass correlation coefficient
and choose Consistency for type. Click on Continue and OK to get the output. Repeat
the whole procedure but choose Absolute Agreement for type.
Intraclass Correlation Coefficient measuring consistency
                   Intraclass     95% Confidence Interval
                   Correlation    Lower Bound   Upper Bound
Single Measures    0.793          0.549         0.913
Average Measures   0.885          0.709         0.954

Intraclass Correlation Coefficient measuring absolute agreement
                   Intraclass     95% Confidence Interval
                   Correlation    Lower Bound   Upper Bound
Single Measures    0.435          -0.087        0.794
Average Measures   0.606          -0.192        0.885

Although the examiners assign quite different marks, they follow similar patterns in their
scoring. Both examiners give higher marks for good performances and lower marks for poor
performances; however, they differ in the precise mark assigned to a particular
performance. Examiner B marks more stringently than Examiner A. The average measure
intraclass correlation for consistency (0.885) is larger than the average measure intraclass
correlation for absolute agreement (0.606), which implies that the marks provided by the
two examiners are correlated but differ in level.
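The difference between consistency and absolute agreement can be seen by computing both types directly. The sketch below is an illustration only: it assumes the standard two-way ANOVA formulas for the intraclass correlation (between-students, between-examiners and residual mean squares), and the function name icc_two_way is our own. Confidence intervals are not computed.

```python
# Marks of the 20 students (first block of ten, then second block of ten)
examiner_a = [67, 49, 91, 84, 97, 49, 38, 59, 65, 79,
              83, 53, 98, 90, 69, 82, 45, 40, 61, 76]
examiner_b = [36, 48, 84, 71, 52, 19, 29, 27, 38, 46,
              56, 28, 76, 60, 42, 50, 33, 35, 34, 41]


def icc_two_way(*raters):
    """Point estimates of the two-way intraclass correlations."""
    k, n = len(raters), len(raters[0])
    rows = list(zip(*raters))                      # one row per student
    grand = sum(sum(r) for r in rows) / (n * k)
    ss_rows = k * sum((sum(r) / k - grand) ** 2 for r in rows)
    ss_cols = n * sum((sum(c) / n - grand) ** 2 for c in raters)
    ss_total = sum((v - grand) ** 2 for r in rows for v in r)
    ms_rows = ss_rows / (n - 1)                    # between students
    ms_cols = ss_cols / (k - 1)                    # between examiners
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    diff = ms_rows - ms_err
    return {
        # Consistency ignores the difference in examiner means
        "consistency_single": diff / (ms_rows + (k - 1) * ms_err),
        "consistency_average": diff / ms_rows,
        # Absolute agreement penalises the difference in examiner means
        "agreement_single": diff / (ms_rows + (k - 1) * ms_err
                                    + k * (ms_cols - ms_err) / n),
        "agreement_average": diff / (ms_rows + (ms_cols - ms_err) / n),
    }


icc = icc_two_way(examiner_a, examiner_b)
for name, value in icc.items():
    print(f"{name}: {value:.3f}")
```

The only difference between the two types is the extra term involving the between-examiners mean square, which is large here because Examiner B's marks are on average much lower than Examiner A's.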

Internal Consistency Reliability


In internal consistency reliability estimation, a single measurement instrument is
administered to a group of respondents on one occasion to estimate reliability. Basically,
we judge the reliability of the instrument by estimating how well the items that reflect
the same construct yield similar results. There are two main measures of internal
consistency reliability: Cronbach's Alpha and the Guttman Split-Half coefficient.


Cronbach's Alpha is equal to the average measure intraclass correlation for consistency. In
Split-Half Reliability the items, which are assumed to measure the same construct, are
divided into two sets. The Guttman split-half coefficient is computed using the formula
for Cronbach's Alpha, treating the two set totals as two items. For both measures, values
greater than 0.9 indicate excellent reliability; values between 0.8 and 0.9 imply good
reliability; values between 0.7 and 0.8 indicate acceptable reliability; values between 0.5
and 0.7 indicate questionable reliability; and values less than 0.5 imply unacceptable
reliability.
Consider the following example as an illustration of the procedure. An observer was asked
to assess 24 children on 6 items related to prosocial behaviour. Not true corresponds to a
score of 0, Somewhat true to a score of 1 and Certainly true to a score of 2. Variables
P1, P2, P3, P4, P5 and P6 contain the scores recorded by the observer for the 6 items. The
task is to measure internal consistency reliability using the two methods described above.
Prosocial Behaviour (each item rated Not True, Somewhat True or Certainly True)
Considerate of other people's feelings
Shares readily with other children
Helpful if someone is hurt
Upset of feeling ill
Kind to younger children
Often volunteers to help others

P1             P2             P3             P4             P5             P6
Not true       Somewhat true  Not true       Somewhat true  Not true       Not true
Somewhat true  Somewhat true  Not true       Somewhat true  Not true       Not true
Certainly true Certainly true Somewhat true  Certainly true Certainly true Certainly true
Certainly true Somewhat true  Somewhat true  Certainly true Certainly true Somewhat true
Somewhat true  Not true       Somewhat true  Not true       Not true       Somewhat true
Not true       Not true       Somewhat true  Somewhat true  Somewhat true  Not true
Somewhat true  Somewhat true  Somewhat true  Somewhat true  Somewhat true  Not true
Not true       Not true       Not true       Somewhat true  Not true       Somewhat true
Somewhat true  Not true       Not true       Not true       Somewhat true  Not true
Not true       Somewhat true  Somewhat true  Somewhat true  Not true       Not true
Certainly true Somewhat true  Certainly true Certainly true Certainly true Somewhat true
Somewhat true  Not true       Somewhat true  Somewhat true  Somewhat true  Not true
Not true       Somewhat true  Somewhat true  Not true       Not true       Somewhat true
Somewhat true  Not true       Not true       Not true       Not true       Not true
Not true       Not true       Not true       Not true       Not true       Somewhat true
Somewhat true  Not true       Somewhat true  Somewhat true  Somewhat true  Somewhat true
Certainly true Certainly true Certainly true Somewhat true  Certainly true Certainly true
Not true       Somewhat true  Not true       Somewhat true  Somewhat true  Not true
Somewhat true  Somewhat true  Not true       Not true       Somewhat true  Not true
Not true       Not true       Somewhat true  Not true       Not true       Somewhat true
Certainly true Certainly true Certainly true Somewhat true  Certainly true Certainly true
Somewhat true  Certainly true Certainly true Certainly true Somewhat true  Certainly true
Not true       Not true       Not true       Somewhat true  Somewhat true  Somewhat true
Somewhat true  Certainly true Somewhat true  Certainly true Certainly true Somewhat true

To compute a split-half reliability measure between items 1, 2 and 3 (set 1) and items 4, 5
and 6 (set 2), click on Analyze, Scale and Reliability Analysis. Move P1, P2, P3, P4,
P5 and P6 into the items list and select Split-half for the model option. Click on Statistics
and select Inter-item correlations. A similar procedure is used to compute Cronbach's
Alpha, but select Alpha for the model option.

Inter-Item Correlation Matrix
       P1a     P2a     P3a     P4a     P5a     P6a
P1a    1.000   0.525   0.568   0.451   0.752   0.465
P2a    0.525   1.000   0.539   0.590   0.586   0.511
P3a    0.568   0.539   1.000   0.469   0.558   0.634
P4a    0.451   0.590   0.469   1.000   0.665   0.370
P5a    0.752   0.586   0.558   0.665   1.000   0.462
P6a    0.465   0.511   0.634   0.370   0.462   1.000

Reliability Statistics
Guttman Split-Half Coefficient   0.902
Cronbach's Alpha                 0.877

All inter-item correlations are positive, indicating that the variables (prosocial behaviours)
are positively related. Moreover, both the Cronbach's Alpha (0.877) and the Guttman
Split-Half (0.902) coefficients exceed the 0.7 criterion, indicating good internal
consistency (reliability).
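Both coefficients can be reproduced from the item scores. The sketch below is an illustration under stated assumptions: it uses the usual variance formula for Cronbach's Alpha, k/(k-1) x (1 - sum of item variances / variance of the total score), and obtains the Guttman split-half coefficient by applying the same formula to the two set totals. The function name cronbach_alpha is our own.

```python
from statistics import variance

# Scores of the 24 children: 0 = Not true, 1 = Somewhat true, 2 = Certainly true
p1 = [0, 1, 2, 2, 1, 0, 1, 0, 1, 0, 2, 1, 0, 1, 0, 1, 2, 0, 1, 0, 2, 1, 0, 1]
p2 = [1, 1, 2, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 2, 1, 1, 0, 2, 2, 0, 2]
p3 = [0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 2, 1, 1, 0, 0, 1, 2, 0, 0, 1, 2, 2, 0, 1]
p4 = [1, 1, 2, 2, 0, 1, 1, 1, 0, 1, 2, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 2, 1, 2]
p5 = [0, 0, 2, 2, 0, 1, 1, 0, 1, 0, 2, 1, 0, 0, 0, 1, 2, 1, 1, 0, 2, 1, 1, 2]
p6 = [0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 2, 0, 0, 1, 2, 2, 1, 1]


def cronbach_alpha(items):
    """Cronbach's Alpha for a list of item score columns."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # total score per child
    item_variance = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))


alpha = cronbach_alpha([p1, p2, p3, p4, p5, p6])

# Split-half: set 1 = items 1 to 3, set 2 = items 4 to 6; the two set totals
# are treated as two items in the Alpha formula
set1 = [a + b + c for a, b, c in zip(p1, p2, p3)]
set2 = [a + b + c for a, b, c in zip(p4, p5, p6)]
split_half = cronbach_alpha([set1, set2])

print(f"Cronbach's Alpha = {alpha:.3f}, Guttman split-half = {split_half:.3f}")
```

A different split of the six items into two sets would generally give a different split-half value, whereas Cronbach's Alpha does not depend on any particular split.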
