
EDUC9762
Study Guide Week 8
Testing for differences among multiple groups
So far, when examining differences on a continuous variable between different groups
(categories), we have looked at:
differences between the scores within a group and a single reference value: the
one-sample t-test;
differences in the mean scores between two separate groups: the independent
samples t-test; and
differences between the scores of matched individuals (e.g. twins) or the same
person across two occasions (e.g. pre- and post-tests): the dependent or paired
samples t-test.
These tests are very useful and indeed are used very commonly.
However, there are many situations in which we have multiple groups (three or more)
and we want to know whether there are differences between any of these groups on
a continuous variable that interests us.
We could do this by using multiple t-tests but, as you will see below, this is poor
practice and can lead to errors in making inferences.
In your data file, you have variables that have multiple groups, e.g. ASDHHER (home
educational resources), ASDGSRSC (reading self-concept), and ASDHEDUP
(parents' education level). We will focus on ASDHHER for now and we will ignore
cases with missing values.
The variable HER has three levels, i.e. there are three categories of HER, namely
low, medium and high levels of access to educational resources at home, but there are
also likely to be missing values and we need to check the missingness of this
variable before we use it for other analyses. The results of this check are shown in
Table 1.
Table 1: Index of home educational resources

                     Frequency   Per cent   Valid per cent   Cumulative per cent
Valid     High             546       11.6             11.9                  11.9
          Medium          3975       84.6             87.0                  98.9
          Low               50        1.1              1.1                 100.0
          Total           4571       97.2            100.0
Missing   Omitted          130        2.8
Total                     4701      100.0

For this variable, we have 2.8% missing data, and that is generally regarded as not
being a threat to the validity of conclusions we might draw. This might not be the case
for your country, so you will need to check this.
So, we have three distinct groups based on HER. If we want to compare the means for
these groups to see if there are differences in reading achievement between them, we
could do a series of t-tests. If we use a t-test for independent samples, we could
compare low- with medium-, medium- with high-, and low- with high-HER. For three
groups, we would need to use three t-tests. If you have g groups, you will need to do
g(g-1)/2 tests and this soon becomes many tests.
There is a problem with this method. Recall from previous discussions of statistical
inference, when we conduct a statistical test, we are using our sample data to make
inferences about the population from which the sample is drawn. Because samples are
a random selection from the population, we expect them to represent the population
but we know that there are variations between the samples we select. This leads to the
possibility that we will draw a sample that is not truly representative. In inferential
testing, we know about this problem and we agree before we conduct any tests that we
are prepared to accept a 5% chance that we will make such an error: a Type I error,
in which we reject a null hypothesis (that there is no difference between two groups)
that is true. That is, we claim there is a difference when there is not. That is OK,
because we can expect that other researchers will repeat our study with other samples
from that population and that eventually, our theory will be thoroughly tested.
But, if we keep making comparisons using a single sample, the likelihood of making a
Type I error increases. It is like throwing dice. If you throw a die once, you have the
same chance of getting a six as any other number (1/6). But if you throw the die
twice, you increase the chance of getting at least one 6 and the more often you throw
it, the more likely you are to eventually get a 6. If you do three t-tests, the probability
of not making a Type I error on any of them is .95³, or 0.857. Thus, the probability of
making at least one Type I error over three tests is .143, or 14.3%. This is much
higher than we would want. If you are planning to do three t-tests, you could set the
p-value at .05/3, or .017, for each test. The problem with this method is that it is very
demanding and you would rarely reject a null hypothesis, even if it is false (a Type II
error), unless you had very large samples.
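The arithmetic above can be checked with a short sketch (written in Python here rather than SPSS, purely for illustration; the group count and alpha level are the values used in the text):

```python
# Sketch: familywise Type I error rate over multiple t-tests,
# and the Bonferroni-adjusted per-test alpha.

def pairwise_tests(g):
    """Number of pairwise comparisons among g groups: g(g-1)/2."""
    return g * (g - 1) // 2

def familywise_alpha(alpha, k):
    """Probability of at least one Type I error over k independent tests."""
    return 1 - (1 - alpha) ** k

g = 3                    # three HER groups
k = pairwise_tests(g)    # 3 pairwise t-tests

print(k)                                     # 3
print(round(familywise_alpha(0.05, k), 3))   # 0.143, i.e. 14.3%
print(round(0.05 / k, 3))                    # 0.017, Bonferroni-adjusted alpha
```

Note how quickly the familywise rate grows: with five groups there are ten pairwise tests and the chance of at least one Type I error rises to about 40%.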
This dilemma is resolved by using a one-way analysis of variance or a one-way
ANOVA.
ANOVA
The name of this procedure gives us a clue about how it works; it analyses sources of
variance in the continuous outcome variable (reading achievement). We can
understand how it works by thinking about the variation in reading achievement
between groups based on their access to educational resources at home. We might
expect that students with high access to educational resources should do rather well,
but there will be variation in achievement within that group. Similarly, we might
expect that students with access to a medium level of HER will do moderately well,
but not as well as those with more HER. Finally, we might predict that students with
very limited access to HER would do relatively poorly but we would also expect that
there will be considerable variation within this group. Of course, we expect
exceptions to this pattern and we know that within each group, there will be variation
in reading ability. The question that interests us is:
Is the variation in reading ability between the three groups substantial compared
with the variation that we observe within the groups?
Most students in Lithuania have moderate access to HER, but about 11% have very
good access and about 1% have very limited access (see Table 1).
We should explore the variation that occurs within the sample and the sub-groups.
This is exploratory data analysis. We are just having a quick look without doing any
serious tests.
If we think about variation that occurs between individuals, we can first think about
the entire sample, without separating students into their HER categories, and we can
find the mean reading score and the variance or standard deviation. Then we can do
the same investigation for each sub-group separately. The mean for overall reading
score is 540.846 (sd=56.280; N=4701). (Note the discrepancy between the mean for
the total sample and the mean shown for the total in Table 2. This is because there are
130 cases with missing data on HER, and these are not included in the total shown in
Table 2). The table was generated using the Case Summaries... command from the
Analyze, Reports menu option.
I moved reading achievement (ASRREA01) into the Variables box and ASDHHER
into the Grouping Variable(s) box. I unchecked the Display cases check box.
Using the Statistics button, I moved Number of cases, Mean, and Standard Deviation
into the Cell Statistics box then clicked Continue and OK.
Table 2: Descriptive statistics for reading achievement for HER groups

Index of Home Educational
Resources (HER)               N        Mean    Std. Deviation
High                        546     577.668            51.312
Medium                     3975     536.926            54.782
Low                          50     499.867            58.731
Total                      4571     541.387            56.161

In the discussion above, I guessed that high-HER students would have the highest
reading score and that low-HER students would have the lowest. While it is useful to
begin an analysis with an expectation of the outcome, we need to be open to
surprising results. In this case, the means do appear to be quite different. The standard
deviations of the groups are a bit different, but I am not going to be concerned about
that at this stage. When we do an ANOVA, it is assumed that the variances within
each of the groups are similar. If they are very different, our analysis might be invalid.
In this case, I am comfortable that they are sufficiently similar not to be a concern.
The information in Table 2 is not a basis for drawing a firm conclusion about the
differences between the groups. We would like to know if the differences are
statistically significant.
We turn our attention to how students vary. We calculate variance by finding the
difference between an individual's score and the mean. We square this, sum the
squared values for all individuals before finally dividing by the number of cases. This
gives us the variance of the entire sample.
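The calculation just described can be written out directly. The scores below are made up purely for illustration:

```python
# Sketch: variance of a sample, computed exactly as described above.
scores = [495, 510, 520, 540, 560]   # hypothetical reading scores

mean = sum(scores) / len(scores)
squared_deviations = [(x - mean) ** 2 for x in scores]
variance = sum(squared_deviations) / len(scores)   # divide by N
sd = variance ** 0.5

print(round(variance, 1))   # 520.0
print(round(sd, 1))         # 22.8
```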
But we can calculate the variance in another way. We can think about the variance as
being based on two components: the difference between the individual's score and
their group's mean, and the difference between their group's mean and the overall
mean. That is, we are saying that the variance can be split into two components, a
within-group component and a between-group component.
In order to see how this partitioning of the variance into two components works in
practice, we will examine the differences between the three sub-groups identified in
the HER variable and we will do this using an ANOVA.
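Before turning to SPSS, the partitioning itself can be illustrated with a toy example (the three groups and their scores below are invented; the point is only that the two components sum to the total):

```python
# Sketch: partitioning the total sum of squares into
# between-groups and within-groups components.
groups = {                      # hypothetical scores for three groups
    "low":    [480, 500, 490],
    "medium": [520, 540, 530],
    "high":   [560, 580, 570],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Total: squared deviations of every score from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_scores)

# Within: squared deviations of each score from its own group mean.
ss_within = sum(
    (x - sum(g) / len(g)) ** 2
    for g in groups.values() for x in g
)

# Between: squared deviations of each group mean from the grand mean,
# weighted by group size.
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2
    for g in groups.values()
)

# The two components reconstruct the total exactly.
assert abs(ss_total - (ss_between + ss_within)) < 1e-9
print(ss_between, ss_within, ss_total)
```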
Moving beyond an exploratory analysis
I have been a bit lax in developing this example through exploratory analysis because
I did not specify a null hypothesis and an alternative hypothesis. As in previous
comparisons, the null hypothesis states:
H0: there are no differences in reading achievement between the groups based on
HER.
The alternative hypothesis is:
Ha: there is a difference in reading achievement between at least one pair of
groups.
Notice the wording of the null and alternative hypotheses. We are not saying anything
about which groups might be different from which other groups. The alternative
hypothesis simply states that there is at least one difference between groups. There
might be more than one difference. If we reject H0, this test does not tell us which
groups are different, just that there is a statistical difference in there somewhere. Later
we will find ways to identify which groups are different.
In order to test the null hypothesis, we run an ANOVA by selecting the Analyze,
Compare Means, One-way ANOVA option from the SPSS menu. This produces a
familiar dialogue box (see Figure 1).

Figure 1: One-way ANOVA dialogue box with the reading achievement variable
selected and the HER variable as the grouping or factor variable
When this analysis is run, we get the following output table. In this table, the
variances as we are accustomed to seeing them are not calculated. Instead part of the
variance calculation is done (sums of squares and mean squares), first for the
differences between the groups and then for the differences between individual scores
and their group means, i.e. a within-group component (see Table 3).
Table 3: Result of partitioning the variance of reading achievement scores into a
between-groups component and a within-groups component on HER

                   Sum of Squares      df    Mean Square         F    Sig.
Between Groups         884026.626       2     442013.313   149.232    .000
Within Groups        13530048.711    4568       2961.920
Total                14414075.336    4570

If there was not much difference between the groups, i.e. the groups were fairly
similar, the variance between the groups would be low compared to the variance
within the groups. In this case, we can see that the variation between groups, indicated
by the mean square is relatively high at 442013 whereas the variation within the
groups is rather smaller at 2962. So, we observe that we have much more variation
between groups than within them; the ratio of between-group to within-group
variance is large (149 times as much variance between groups as within groups), and we
find that this ratio is statistically significant (p<0.001). So we conclude that the
difference in reading achievement between at least two of the HER groups is
significant.
SPSS and other statistical programs compare these indicators of variance (the mean
squares) by finding their ratio. In this case, the ratio is 149.232, i.e. the variation
between groups is 149 times the variation within groups. This suggests very strongly
that there are probably differences in reading proficiency between the various groups
based on their HER status.
We can reject the H0 that there are no differences because there is a very low
probability (p<0.001) that we would see an F statistic this large (149.232), i.e. this
much variation between groups in our sample, if there were no differences in the
population.
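As a check on the arithmetic, the mean squares and the F ratio in Table 3 can be reconstructed from the sums of squares and degrees of freedom (an illustrative sketch using the values printed in the table):

```python
# Sketch: reconstructing the mean squares and F ratio from Table 3.
ss_between, df_between = 884026.626, 2        # df = g - 1 = 3 - 1
ss_within,  df_within  = 13530048.711, 4568   # df = N - g = 4571 - 3

ms_between = ss_between / df_between
ms_within  = ss_within / df_within
f_ratio = ms_between / ms_within

print(round(ms_between, 3))   # 442013.313
print(round(ms_within, 2))    # 2961.92
print(round(f_ratio, 3))      # 149.232
```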
In reporting the results of this analysis, we might say something like:
In order to investigate possible differences in reading achievement between
students based on their access to educational resources at home (HER), a one-way
analysis of variance was undertaken. We find evidence that there is a difference in
reading achievement based on HER status (F=149.232, p<0.001).
Which groups are different?
If we reject H0, we can begin a search to locate which groups are different from which
others. Before we do that, we should look again at Table 2. There do seem to be
substantial differences between all groups, but there is quite a bit of variation within
each group and the low-HER group is a relatively small sample, so it is hard to guess
which differences might be statistically significant.
In order to test for statistical significance, we need to supplement the ANOVA with a
post-hoc test. This simply means that after we have found that there seems to be a
difference, the post-hoc test helps us to locate the source of the significance we found
in the ANOVA.
We need to re-run the ANOVA, but this time we will click on the Post-hoc button
(see Figure 1). This leads to the dialogue box shown in Figure 2. There are many post-
hoc tests available. I have checked Tukey's b for no other reason than it is the one I
normally use if the variances of the groups are reasonably similar. Feel free to try
others to see if they make any difference. Field (2009, p. 378) has a more detailed
discussion and he selects several of these tests.

Figure 2: Post-hoc comparisons dialogue box for a one-way ANOVA
In addition to the ANOVA table that we saw as Table 3, which led us to reject the null
hypothesis, we get Table 4, which enables us to identify where any group
difference(s) lie.
Table 4: Results of a post-hoc comparison of reading achievement between three HER
groups

Index of Home Educational              Subset for alpha = 0.05
Resources (HER)              N           1           2           3
Low                         50     499.867
Medium                    3975                 536.926
High                       546                             577.668

This table tells us that there are significant differences between all three pairs of
groups. That is, the differences between the high-HER and medium-HER groups, the
medium- and the low-HER groups and the high- and low-HER groups are all
statistically significant.
Reading
Field, A. (2009), Chapter 10, pp. 347-394
Note that Field, like most other authors, goes into much more detail than I have done
about partitioning of variance. We will look at this in class.

Activity
Duplicate the above analyses using HER and reading achievement for your country.
In your data set, you have some other categorical variables such as ASDGSRSC and
ASDHEDUP.
Investigate whether there are differences in reading achievement scores for the
subgroups recorded in these categorical variables.
These investigations are amenable to analysis using one-way ANOVA.

Reference
Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.


EDUC9762
Study Guide Week 7

Review of a test for difference
In the past few weeks, we have examined sampling distributions. We made two
important findings. First, we found that when you take a sample from a population,
the sample mean is a good (unbiased) estimate of the population mean. Second, we
found that, if we repeatedly take samples of size n from a population and calculate the
mean for each sample, the standard deviation of the sample means is given by:

σM = σ / √n          Eq 1
This is very useful. It tells us that the standard deviation of a set of sample means is
equal to the standard deviation of the original variable in the population divided by
the square root of the sample size. If the variability of the original variable in the
population is large, the variability of the mean across many samples will be relatively
large. It also tells us that as we take larger samples, the variability of the sample
means will be smaller, i.e. they will be better estimates of the population mean. Both
of these findings should make
intuitive sense to you.
But we can say more than simply making a literal interpretation of equation 1. We
made a third important finding. Even if the distribution of the original variable is not
normal, the distribution of the sample means tends to be normal, especially if the
sample size is large. This is important because, knowing the distribution is normal
tells us that we can use the properties of that distribution to estimate how reliable our
estimate is. Review the diagram of the normal distribution (Gay, Mills, & Airasian,
2012, Figure 12.1, p. 327). Notice that 95% of the area under the normal distribution
is bounded by the values -1.96 and +1.96 standard deviations.
In class, we undertook a test to see if the average reading score in a country was equal
to the international average of 500. (See the Class Notes summary for Week 6). For
Latvia, we found that the mean reading score was 492.73 with a standard error of the
mean of 1.32. (The standard deviation of reading scores in Latvia was 90 and the
sample size was 4627. We could use equation 1 to estimate the standard error of the
mean or the standard deviation of means if we took repeated samples of n=4627). If
we apply the properties of the normal distribution to the mean reading score from
Latvia, we can say that our estimate of the population mean is 492.73, that its standard
error is 1.32, and that we can be 95% confident that the true population mean lies
between 492.73 - 1.96*1.32, or 490.14, and 492.73 + 1.96*1.32, or 495.32. We can make
the very useful statement that we are 95% confident that the true mean reading score
in the Latvian 15-year-old population is between 490.14 and 495.32 units.
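The interval arithmetic can be reproduced with a few lines (the mean and standard error are the Latvian values quoted above):

```python
# Sketch: 95% confidence interval for the Latvian mean reading score.
mean, se = 492.73, 1.32
z = 1.96          # 95% of the normal distribution lies within ±1.96 SD

lower = mean - z * se
upper = mean + z * se
print(round(lower, 2), round(upper, 2))   # 490.14 495.32
```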
This finding implies that there is a 5% chance that the true mean reading score lies
outside this range. We were interested to know whether the mean for Latvia was the
same as or different from the international mean. We should expect that Latvia's score
is not the same as the international mean, but we tested that proposition by doing a
difference test using a one-sample t-test in SPSS. (See the Class Notes summary for
Week 6. Remember that we stated a belief, then set up a null hypothesis and an
alternative hypothesis).
The t-test is used to find out how likely it is to observe a difference between means,
given the variability in means. If sample means are very variable, it would not
surprise us to find substantial differences between two samples, but if means do not
vary much then large differences between samples are unexpected. The t statistic is
calculated using:

t = (M − μ0) / SE(M)          Eq 2

where M is the sample mean, μ0 is the hypothesised population mean, and SE(M) is
the standard error of the mean.
In our case, we want to test how likely it would be to observe a mean of 492.73 in a
sample of N=4627 if the real population mean is 500 (μ0), knowing that the standard
error of the mean (i.e. the standard deviation of means in samples of n=4627) is 1.32.
Using the above equation, we get a t value of -5.51. The calculation tells us that the
mean we observed is 5.51 standard error units away from the international mean of
500. For large sample sizes, the t statistic approximately follows a normal distribution.
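Substituting the Latvian values into Eq 2 (a sketch using the figures quoted above):

```python
# Sketch: the one-sample t statistic for Latvia, as in Eq 2.
mean, mu0, se = 492.73, 500, 1.32

t = (mean - mu0) / se
print(round(t, 2))   # -5.51
```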
Having run the t-test in SPSS, we get the results of our test in the following tables:
Table 1: One sample statistics

                                   N      Mean   Std. Deviation   Std. Error Mean
Plausible value in reading      4627    492.73            89.69              1.32

Table 2: One sample test

                                                                    95% Confidence interval
                                t     df   Sig. (2-tailed)   Mean Difference     Lower     Upper
Plausible value in reading  -5.51   4626            0.000            -7.265    -9.850    -4.680

The first table (Table 1) tells us the sample statistics, and we have already seen these.
The second table (Table 2) shows us the results of the t-test. We can see the mean
difference and we had already calculated that. The output also shows us the
confidence interval for that difference. Above, we calculated the confidence interval
for the sample mean. In this table, we see that range converted into differences from
500. Given the expected variation between samples, we could expect to see
differences in a range from 9.85 units below 500 to 4.68 units below. This tells us that
we should be quite confident that the mean reading score we observe in the Latvian
sample reflects a population mean that is below the international mean 500. But how
sure can we be? The column labelled Sig. (significance) tells us this. In other
statistical programs, this column would be labelled Probability or p. It is the
probability that we might observe a difference of 7.265 from the international mean of
500 if the true mean in Latvia is 500, that is, if the null hypothesis (H0) is true. You
can see that the probability of seeing a difference of this magnitude is extremely
small. The output is shown to three decimal places, so we are sure that, whatever the
probability is, it is less than 0.001 or less than one chance in 1000. This is so unlikely
that we must conclude that our null hypothesis is not true and we must reject it. Using
the idea of significance, we would say that we have a significant result. That is, the
difference that we have observed is significantly different from what we would expect
if the null hypothesis were true.
If we were to report the above results in a paper, we might say something like:
In order to test the proposition that the mean reading score in the 15-year-old
population in Latvia is no different from the international mean reading score, a
one-sample t-test was undertaken. The mean difference was found to be -7.265
(se=1.32) score units (t=-5.51, df=4626, p<0.001). The null hypothesis is therefore
rejected and it is concluded that the reading achievement of Latvian students is
lower than the international average.
Activity
Run a test for your country to see whether the mean reading performance is equal to
or different from the international average.
Write this as a research question, write a null hypothesis and write an alternative
hypothesis.
When you have run the analysis, keep a copy of the two tables generated in the
output.
What is the mean for your country? How different is that from 500?
What is the sample size?
What is the standard error of the mean?
How many standard deviations is the mean of your sample away from 500? (Hint:
What is the t statistic in your analysis?)
What is the probability that you would observe the mean value in your sample if
the true mean of the population is 500?
Based on your sample, what is the likelihood that the true mean in the population
is 500?
Write a short paragraph answering the research question suggested above,
summarising and interpreting the result of your analysis in your answer.
Confidence intervals
When we estimate a parameter for a population using data from a sample, there is
always some uncertainty in our estimate because we know that samples taken from a
population are not exact miniatures of the population: there are always some
differences between samples taken from a population, and we refer to this as
sampling error.
However, even if our estimates are not exact, we need to know how closely they
reflect the true parameter of the population. Because we know the distribution of
sampling statistics, we can estimate a likely range. We just did that in relation to the
mean reading achievement scores of 15-year-old students from Latvia, and I hope you
have calculated similar statistics for your country. For Latvia, we were able to say we
are 95% confident that the true mean reading score in the Latvian 15-year-old
population is between 490.14 and 495.32 units.
Our ability to make these estimates depends on knowing the distribution of the
statistic of interest. We estimated the population mean using the sample mean. Then,
knowing the standard deviation of the variable, we used equation 1 to calculate the
standard deviation of the sampling distribution of the mean. Then, knowing some
properties of the normal distribution, in particular that 95% of the area under the
normal curve is bounded by the range 1.96 standard deviations above and below the
mean, enables the 95% confidence interval to be calculated. Other common
confidence intervals are the 90% interval, bounded by 1.645 standard deviations
above and below the mean, and the 99% interval, bounded by 2.58 standard
deviations above and below the mean.
The 95% confidence interval is the one most commonly computed.
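The multipliers quoted above come from the inverse of the standard normal distribution, and can be checked with Python's standard library (illustrative only; SPSS computes these internally):

```python
# Sketch: normal-distribution multipliers for common confidence intervals.
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, sd 1

# For a central interval of coverage c, the multiplier is the
# (1 + c)/2 quantile, because the remaining (1 - c) is split evenly
# between the two tails.
for coverage in (0.90, 0.95, 0.99):
    multiplier = z.inv_cdf((1 + coverage) / 2)
    print(coverage, round(multiplier, 3))   # 1.645, 1.96, 2.576
```

The exact values (1.645, 1.960 and 2.576) round to the figures used in the text.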
Statistical and practical significance
We say that a finding is statistically significant if we have evidence from our sample
that leads us to reject the null hypothesis. We found in the case of Latvia that the
likelihood of getting a sample mean of 492.7 if the true population mean in Latvia
is 500 is extremely small (p<0.001). We interpret this to mean that, given the
evidence of our sample, the true population mean is most unlikely to be 500. We can
say that we have a statistically significant finding.
This raises the question: how unlikely does an outcome have to be before we say it is
too improbable?
In most statistical analyses in the social sciences, a probability of 0.05 is taken as the
threshold. If the p value (Sig. in SPSS) reported for the analysis is <0.05, i.e. if there
is less than 1 chance in 20 that we would observe the outcome in our sample if our
null hypothesis were true, we reject the null hypothesis in favour of the alternative.
There is nothing magical about p<0.05; it is a convention. Before we decide on an
appropriate probability as our threshold for making decisions about whether to reject
or not reject the null hypothesis (H0), we should think about the consequences of
making an error in that decision. Recall from our last class discussion the decision
table we constructed (see Table 3).
Table 3: Decision table for statistical analyses

                                        Population
                             H0 is true                 H0 is false
Based on     Reject H0       Type I error:              Good decision
our sample                   reject a true H0
             Do not          Good decision              Type II error:
             reject H0                                  fail to reject a false H0

We will consider Type II errors a little later. For now, we focus on the Type I error:
the error we make if we reject the null hypothesis when we should not.
Let us illustrate this problem by considering an example like reading scores in a
country. Suppose that the true mean reading score in a country is 500 but that, by
chance, we have selected a sample in which the reading score is quite low. We would
probably get a test result like the one we found for Latvia, but in this case, the
statistically significant difference is a result of sampling variation. We would reject
the null hypothesis, but in doing so, we would be making a Type I error. Of course,
we would not know this because we would not know the true population mean.
This is one reason why replication studies are undertaken. If another researcher
repeated the study, but drew a different sample from the same population, we would
be interested to know whether they arrived at the same conclusion as the first
researcher. The likelihood of both researchers independently making a Type I error is
very low: 0.05 × 0.05 = 0.0025, i.e. 1 in 400, or 0.25%.
By setting the p value at 0.05, we are saying that we are prepared, in 5% of the tests
that we undertake, to make such an error. We could decide that this level of error is
too high and we could set the p value to 1% (0.01). This is done in some research, but
a consequence of this is that we would increase the likelihood of making a Type II
error. So statistics, like politics, is the art of compromise.

Activity
A teacher lobby group approaches the education ministry and advocates a 20%
reduction in class sizes, arguing that in smaller classes, students will get much more
individualised attention and that this will raise educational achievement.
A statistical consultant is asked to design an experiment in which class sizes are
reduced in some schools, but not in comparison schools, and achievement scores are
measured after several terms of instruction.
What is your null hypothesis for this study?
What are the consequences of reducing class sizes?
What are the consequences of reducing class sizes if your null hypothesis is true?
What p value would you establish before reaching a conclusion about the
influence of class size on achievement?

In summary, we can say that a finding is statistically significant if we reject the null
hypothesis, and we do this if the outcome we observe is unlikely to be due to
sampling variation and therefore is likely to represent a real effect or relationship in
the population.
Usually we use the 0.05 level of significance to make a judgment about whether the
observation is due to sampling variation or reflects a real feature of the population
from which the sample was taken.
Two aspects of the sample influence the probability of the effect we observe. First, if
there is a very strong feature of the population, then this is very likely to be observed
in any sample that we might select. Second, if the feature in the population is small,
but we take a very large sample, i.e. a sample that is likely to be fully representative
of the population, then we are likely to get a significant result: a result that is not
attributable to sampling variation.
In the preceding discussion, we have solved the problem of statistical significance.
However, a new problem arises for us. If it is possible to find that a small difference is
statistically significant if we use very large samples, how do we know if a small
statistically significant effect is practically important?
In order to answer this question, we turn to the issue of effect size.
Effect size
Before we look at the notion of effect size, undertake the following activity.
Activity
The SES variable that you have in your data set, ESCS, has been created by the
managers of the PISA study to have an international average of 0.
What is the average ESCS score in your country?
Undertake an investigation of whether the ESCS of your country is equal to the
international average.
Pose a research question.
State a null hypothesis.
State an alternative hypothesis.
Conduct a test.
Do you have a statistically significant finding?
Write a brief paragraph in which you answer the research question by
summarising and reporting the results of the statistical analysis.
Now, look back over your analyses.
How big is the difference in ESCS between your country and the international
average?
How big was the difference in reading achievement between your country and the
international average?
Can you comment on the relative magnitudes of these two differences?

We have a problem when we are comparing variables that are measured on different
scales. Reading achievement is measured on a scale with a mean of 500 and a
standard deviation of 100. The ESCS variable is measured on a scale where the
international mean is 0 and the standard deviation is 1.
When I completed the above activity for Latvia, I found the difference between the
international reading average and Latvia's was -7.27, but the corresponding difference
in the ESCS was +0.164. However, I know that these two variables are not
comparable because they are measured on different scales. Nonetheless, I want to get
an idea about relative magnitudes of these differences. This is done using an effect
size. I am aware of about ten different measures of effect size, but we will work with
the simplest of these, Cohen's d (Cohen, 1992).
Cohen's d is calculated by finding the difference between means and dividing that
difference by the standard deviation of the variable. I compared the reading scores of
males and females in Latvia and found that the mean female score is 512.3 while for
males it is 471.7, a difference of 40.6. The standard deviation of scores is about 90, so
the effect size measured by Cohen's d is 0.45.
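The calculation is easily reproduced (a sketch using the values quoted in the text):

```python
# Sketch: Cohen's d for the female-male reading difference in Latvia.
mean_female, mean_male = 512.3, 471.7
sd = 90   # approximate standard deviation of reading scores

d = (mean_female - mean_male) / sd
print(round(d, 2))   # 0.45
```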
Cohen (1992, p. 157) suggested that effect sizes of 0.2, 0.5 and 0.8 can be regarded as small, medium and large respectively. Thus, it seems we have a medium effect size or a medium difference between the reading abilities of males and females. Incidentally, I compared their mathematics scores. The mean male score is 4.9 units higher than the mean female score, an effect size of 0.05, a very small effect. A policy maker in Latvia might recommend an intervention to enhance the reading ability of boys, but would probably not recommend a corresponding intervention for mathematics instruction for girls.
Tests of differences
Research questions about differences are very common.
Is there a difference between Group A and Group B on Variable X?
Is there a difference between Variable X on Occasion 1 and Occasion 2?
Here we will preview some tests of difference. All the tests we will consider require
the test variable to be a continuous interval or ratio variable and this should be
normally distributed.
One-sample t-test
We have used this test to see whether, in our country sample, reading achievement is
equal to the international mean. This test is a one-sample t-test. It is a t-test because it
calculates the t statistic and provides the significance (p value) of that statistic. It is a
one-sample test because we are comparing the performance of one group with a
specified value.
You have seen an example of the research questions that can be answered using this
test.
Dependent or paired sample t-test
This test is used when we have two samples that are related in some way and we want
to test for a difference in a variable. A common situation in educational research
occurs when we test a group of students, provide an instructional intervention, then
re-test the same group. Because we are interested in change between occasions within
individuals, i.e. we want to compare each student's pre-test score with their post-test
score, the participants are paired between the two tests. The question we are asking is
whether the difference within individuals between the two test occasions is
significant.
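The arithmetic of the paired test is worth seeing once: it is a one-sample t-test applied to the within-person difference scores. A Python sketch with invented pre- and post-test scores (the data here are made up for illustration):

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """t statistic for a paired test: mean difference over its standard error."""
    diffs = [b - a for a, b in zip(pre, post)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

pre = [10, 12, 9, 14]    # invented pre-test scores
post = [12, 15, 10, 17]  # invented post-test scores for the same students
print(round(paired_t(pre, post), 2))  # about 4.7
```

In practice, SPSS (Analyze, Compare Means, Paired-Samples T Test) reports this statistic together with its degrees of freedom and p value.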

Another less common application of this type of test would be a comparison of twins
or even of siblings. Here two different individuals are paired because of their twin or
sibling status. The term dependent is also used because the observations that we make
are not statistically independent; we compare the same person on two occasions or we
compare two related individuals on a common task.
Independent samples t-test
An independent samples t-test is used when we are comparing two groups but we do
not attempt to match individuals between the groups. We are simply testing whether
there is a difference between the mean of one group and the mean of the other. A very
common application of this test is the comparison of boys and girls on a particular
variable.
Reading
Field (2009, pp. 326-342) has an extensive discussion of paired sample t-tests and
independent sample t-tests.
Analysis of variance
An extension of the t-test occurs when we have more than two groups. In our country data sets, we have the variable IMMIG (immigrant status), and this includes three groups, namely native-born, first-generation and immigrant students.
In order to see if there are differences between these groups we could do a series of
independent sample t-tests comparing native-born with first-generation students,
native-born with immigrant students, and first-generation with immigrant students.
This would be very bad practice. The reason is that each time we undertake a test,
there is a chance that we will make a Type I error. The more tests we do, the greater
the likelihood that we will make such an error.
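The inflation of the Type I error rate is easy to quantify. If each test is run at α = 0.05 and, as a simplifying assumption, the tests are treated as independent, the probability of at least one false positive across k tests is 1 − (1 − α)^k:

```python
def familywise_error(alpha, k):
    """Probability of at least one Type I error across k independent tests."""
    return 1 - (1 - alpha) ** k

# Three pairwise t-tests at alpha = 0.05
print(round(familywise_error(0.05, 3), 3))  # 0.143, well above the nominal 0.05
```

With the three pairwise comparisons described above, the familywise error rate is already about 14%, nearly three times the nominal level.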
The solution is to use a method that is able to manage multiple comparisons without
compromising the Type I error rate and that technique is known as Analysis of
Variance or ANOVA. There are several types of analysis of variance. We will focus
on one-way ANOVA. It is one-way because we will compare one variable, e.g.
reading achievement, across several groups.
Reading
Field (2009, pp. 347-394) devotes a chapter to this method.
References
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2012). Educational research.
Competencies for analysis and applications (10th ed.). Boston: Pearson.



EDUC9762 Week 13 Study Guide
Non-parametric statistics
This week we step back to address some old questions about differences and relationships,
but we do so using a new set of methods. You will recall that in our discussion of inferential
statistics, we made the assumption that our variables were normally distributed. This meant that we could parameterise the distributions: we could describe the distribution of a variable by specifying just two parameters, the mean and the standard deviation. This is
enough information to reproduce the distribution that we would get if we plotted a histogram
of all the data.
We know, however, that not all variables behave in this way, and we need to do analyses on
those variables despite their disagreeable distributions. Methods of analysis that do not
depend on the assumption that the variables are normally distributed are called non-
parametric methods. In almost all cases, instead of depending on the mathematical
properties of the normal distribution, we use the rank order of cases on a variable.
I prefer to use parametric tests when I can because they are usually more powerful than non-
parametric ones. We have discussed the power of statistical tests previously. The power of a
test is the probability of finding an effect if it is present in the population. Because non-
parametric tests use ranks instead of observed values, some information is lost.
Consider the following data (see Table 1). In the top row, we have the original values for 12 cases and in the bottom row, their ranks from lowest to highest. Notice that the difference in the value of the 10th and 11th ranked cases is 9 units while the difference between the 11th and 12th ranked cases is 23 units. When we use ranks, we effectively collapse differences between values and therefore we lose information and hence we lose statistical power.
Table 1 Ranking data values
Value 16 30 30 30 34 39 42 45 45 58 67 90
Rank 1 3 3 3 5 6 7 8.5 8.5 10 11 12

As an aside, notice that where scores are tied, ranks are averaged, so we have no cases ranked 2nd or 4th, but three cases ranked 3rd and two cases ranked 8.5.
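The ranking rule in Table 1, including the averaging of tied ranks, can be reproduced with a short Python sketch (SPSS applies the same rule internally when it ranks data):

```python
def average_ranks(values):
    """Assign ranks from lowest to highest, averaging the ranks of tied values."""
    ranks = [0.0] * len(values)
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    pos = 0
    while pos < len(ordered):
        # find the run of tied values starting at this position
        end = pos
        while end + 1 < len(ordered) and values[ordered[end + 1]] == values[ordered[pos]]:
            end += 1
        # 0-based positions pos..end share the average of 1-based ranks pos+1..end+1
        avg = (pos + 1 + end + 1) / 2
        for i in range(pos, end + 1):
            ranks[ordered[i]] = avg
        pos = end + 1
    return ranks

data = [16, 30, 30, 30, 34, 39, 42, 45, 45, 58, 67, 90]
print(average_ranks(data))
# [1.0, 3.0, 3.0, 3.0, 5.0, 6.0, 7.0, 8.5, 8.5, 10.0, 11.0, 12.0]
```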
This week we will consider just a few research questions that we can answer using data that
are non-normal. The research questions we have considered to date include difference
questions and relationship questions.
Checking for normality
One of the early steps we take before running any analysis is to check the properties of each
variable. We look for missing data and we check the distributions of continuous variables.
We have examined socioeconomic status and for this construct, we have used the continuous
variable ESCS and the categorical variable HSECATEG. There is another socioeconomic
status variable in the data set, HISEI. This index is based on ratings of the status of parental
occupation. It takes values between 16 and 90.
Basic descriptive statistics (see Table 2) suggest that the variable is not skewed nor does it show much kurtosis.

Table 2 Descriptive statistics for HISEI

The problem with occupationally based indices is that some occupations are very common
(e.g. teachers and nurses) while others are relatively rare (e.g. judges). This means that the
distribution can be rather patchy (see Figure 1).

Figure 1 Frequency histogram for HISEI for Latvia
The distribution shown in Figure 1 would make me rather nervous about doing any parametric
tests as it appears that this variable does not comply with the normal distribution. I could get
a QQ-plot to see if I could convince myself that it might be safe to proceed with a parametric
test (see Figure 2). The points approximately follow the expected normal line with some
deviations above and below the line.


Figure 2 QQ plot to check the normality of HISEI
I decided to undertake a third test for normality and requested a Kolmogorov-Smirnov test
(Analyze, Descriptive statistics, Explore). I did this test twice. On the first occasion, I used all
4627 cases, but I know that in large samples, small differences (departures from normality)
are shown as being highly significant (p << 0.05), so I repeated the test using 400 cases. The result is shown in Table 3. The test is significant (p < 0.001) so I have to conclude that the variable is not normally distributed.
Table 3 K-S test to check for normality of HISEI

As much as I would like to treat this variable as being normally distributed, it appears that it
is not and that if I want to do any inferential statistics, I will need to use non-parametric
methods.
If I use parametric methods and they are not justified, I will report findings that are
potentially misleading.
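For intuition about what the Kolmogorov-Smirnov test measures: its D statistic is the largest vertical gap between the empirical cumulative distribution of the data and the cumulative distribution of a normal with the same mean and standard deviation. A Python sketch of the statistic only (this does not compute the p value that SPSS reports):

```python
from statistics import NormalDist, mean, stdev

def ks_statistic(data):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    x = sorted(data)
    n = len(x)
    fitted = NormalDist(mean(x), stdev(x))
    d = 0.0
    for i, value in enumerate(x, start=1):
        f = fitted.cdf(value)
        # the empirical CDF jumps from (i-1)/n to i/n at each data point
        d = max(d, abs(i / n - f), abs((i - 1) / n - f))
    return d

print(round(ks_statistic([-1, 0, 1]), 4))  # 0.1747 for this tiny symmetric sample
```

A large D relative to what chance would produce for the sample size leads to rejecting the hypothesis of normality, which is the judgement SPSS makes for us.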

Exercise 1
Undertake the above analysis on your own country's data set.
Contrast the findings for your country with those shown for Latvia.

Comparing ranks for independent samples
When we posed research questions about differences between groups, we used a t-test
(parametric) when the data were normally distributed. We have already determined that
mathematics achievement is normally distributed, but I want you to pretend that it is not
quite normal enough for your liking and that you need to do a non-parametric test. I want you to do this, because I want you to compare the results that you found from the t-test with the results of its non-parametric equivalent, the Mann-Whitney test.
The research question is similar. Is there a difference in the mathematics achievement of
males and females in Latvia? For this question, we can write the null and alternative hypotheses and set the level of significance (I'll stay with 0.05 for α).
If the data were not normally distributed, it would not make sense to talk about the mean.
Instead, we would focus on the median value, so if we present descriptive statistics for our
variable, we should report medians rather than means.
However, in non-parametric statistics, we work with ranks. If H0 is true, the mathematics achievement of males and females will be about the same. That is, among the top performing students, we should find about equal numbers of males and females and similarly we should find about equal proportions in the lower performing groups. If we rank all students in order, we would expect the male and female groups to have similar numbers of low and high ranks.
The Mann-Whitney test compares the ranks of the two groups to see whether their shares of
high and low ranks are different.
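Behind SPSS's output, the U statistic is derived from the rank sums. A sketch of the calculation, under the simplifying assumption of no tied values (ties would require averaged ranks and a variance correction):

```python
def mann_whitney_u(group_a, group_b):
    """Mann-Whitney U from rank sums; smaller U means greater group separation."""
    combined = sorted(group_a + group_b)
    # rank of each value (1-based); assumes no tied values
    rank = {value: i + 1 for i, value in enumerate(combined)}
    n_a, n_b = len(group_a), len(group_b)
    rank_sum_a = sum(rank[v] for v in group_a)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b)

# Complete separation of the groups gives U = 0;
# heavy overlap gives U near n_a * n_b / 2
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # 0.0
```

SPSS then converts U to a z statistic (for large samples) to obtain the p value.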
Running a Mann-Whitney test
I want you to run this test the 'old' way, as the output is more informative than that of the 'new' way.
Select the Analyze, Non-parametric tests, Legacy dialogs, independent samples option (see
Figure 3).

Figure 3 Legacy dialogue box for a Mann-Whitney test of difference for mathematics
achievement
The command that is generated from this dialogue produces two tables. The first is a
summary of the ranks for males and females (see Table 4).
Table 4 Summary of ranks on mathematics achievement for males and females


If males and females were equally distributed at the various score levels, their mean ranks would be about the same, about 2314, half way along the list of ranks of the 4627 students. However, we notice that females have a lower average rank than males.

As always, we cannot be sure whether this difference in average ranks is significant, so we
look for a test of significance. This is found in the second table (see Table 5).
Table 5 Results of the significance test of differences in rank order on mathematics
achievement between males and females

We see in that table the Mann-Whitney statistic and its level of significance (p=0.185). This is greater than our chosen significance level (α=0.05), so we have no grounds for rejecting H0. Notice also that SPSS provides the results of similar tests, the Wilcoxon rank sum (sum of ranks) test and the Kolmogorov-Smirnov Z test (this is different from the Kolmogorov-Smirnov test for normality). Field (2009, pp. 540-548) briefly describes these tests.
Exercise 2
Compare this result with the t-test we conducted several weeks ago.
Is the result different? If so, why?
If you were to repeat this test for reading achievement, what would you expect to find?
Why?
Try it.

I wrote above that we would do this test the 'old' way. I prefer this because it shows you the
results for the test quite directly. However, in the most recent versions of SPSS, a new
dialogue is available for testing this type of hypothesis. You select this using the Analyze,
Non-parametric tests, Independent samples menu options. This presents a different dialogue
box (see Figure 4).

Figure 4 Dialogue box anticipating a Mann-Whitney test for difference between two
groups

Notice that we need to select the 'compare medians' option and we need to select the 'Fields' tab in order to enter the variables. This is similar to other dialogues. You need to enter the test variable (PV1Math) and the grouping variable (ST3Q01). On this occasion, you do not have to specify the number of groups as SPSS works that out from the values in the variable. Finally, you must select the 'Settings' tab to specify the tests that you want. For now, select the Mann-Whitney test, but we will return to this dialogue later.
Now click Run and examine the output. (Notice that SPSS uses 'Run' instead of 'OK' in this new dialogue.) The output is shown in Figure 5.

Figure 5 The results of the Mann-Whitney test using the new dialogue in SPSS
The results are displayed in a user-friendly way. The null hypothesis being tested is shown, the test is described, the p value is shown, and its interpretation (to accept or reject H0) is given. What you do not see is any of the underlying calculations. You do not see how many
cases were included nor the mean ranks or the sums of ranks. Perhaps you do not need this,
but I like to look at this information because it helps me to track exactly what is being
analysed.
Exercise 3
Run the above analysis using your own data.
Do you prefer the old or new dialogue?

Associations between non-normal variables
In Week 11 we looked at associations between categorical variables, for which we used
cross-tabulations, and between continuous normally distributed variables, for which we used
the Correlation command and we requested the Pearson product-moment correlation
coefficient (r).
This would be a good time to review what we did then.
When you have a variable that is not normal, you need to decide whether it can be categorised into groups (e.g. low, medium and high SES) or whether it is better to leave it as a continuous variable and use non-parametric statistics.
Now, I want to focus on describing relationships between continuous variables, but ones that
are not normally distributed. We have been looking at HISEI and we are confident that this is
not normal. I will use 'Sense of belonging at school' as the other variable (Belong), although I think it is not too far from normal. While we are doing this, we might consider some other attitudinal variables, such as teacher-student relations.
The correlation coefficient that we use for non-normal data is the Spearman rank order correlation coefficient (ρ, rho).
We can compare the results that we get if we assume normality (and use Pearson's r) with the correlation we would get if we do not assume normality (and use Spearman's ρ).
Select the Analyze, Correlate, Bivariate menu option to get the dialogue shown in Figure 6. Note that both Pearson's r and Spearman's ρ are being requested.


Figure 6 Dialogue for requesting correlation coefficients, both Pearson's r and Spearman's ρ
Two tables are generated. In the first, we see Pearson's r and in the second Spearman's ρ.
Compare the values for these two correlation coefficients. The first is based on the scores on
each variable and the second on the ranks of those scores.
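Spearman's ρ is simply Pearson's r calculated on the ranks rather than on the raw scores, which you can verify with a short Python sketch (the ranking step here assumes no tied values; SPSS averages tied ranks):

```python
def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simple_ranks(values):
    """1-based ranks from lowest to highest; assumes no tied values."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(simple_ranks(x), simple_ranks(y))

# Any strictly increasing relationship, however curved, gives rho = 1
print(spearman_rho([1, 2, 3, 4], [2, 9, 30, 100]))  # 1.0
```

This also illustrates why ρ and r can differ: ρ responds to the ordering of cases only, while r responds to the actual spacing of the scores.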
Exercise 4
Run the above analysis using your own data.
How close are the two correlation coefficients (r and ρ)?
Is one systematically higher or lower than the other?
Can you explain why you might see these differences?

Reading
Field devotes a chapter (15) to non-parametric statistics.
Field (2009, Chapter 15, especially pp. 539-568).
See also the table 'Commonly used parametric and non-parametric tests' taken from Gay et al. (2012, p. 369).
References
Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
Gay, L. R., Mills, G. E., & Airasian, P. W. (2012). Educational research. Competencies for
analysis and applications (10th ed.). Boston: Pearson.



EDUC9762 Week 12 Study Guide
Simple Linear Regression
Last week we introduced measures of association and correlation so that we could answer
research questions about associations between variables.
We will extend that discussion to consider regression relationships, noting that regression
seeks to describe the association between continuous variables in a way that enables
prediction and explanation.
Regression is a powerful technique and is used very commonly in statistical analysis and
modelling. Indeed, this method marks our introduction to the idea of creating and testing
statistical models as a method for understanding and theorising about relationships.
A model is a way of representing reality. Sometimes a model is a scaled-down version of a
large construction, or a scaled-up version of something small like an atom, but models are
frequently simplifications of the real situation. The reality is simplified so that we can focus
on a few important features of the construction. We know that human behaviour is variable
and we use statistical models to see through the variability and to focus on suspected
underlying regularities. That is exactly what regression models do.
A typical research question involving a regression relationship between variables might be: 'To what extent does socioeconomic status predict (or explain) mathematics achievement at school?'
We will restrict our discussion to simple linear regression involving only two variables: an outcome (or criterion, or dependent) variable and an explanatory (predictor, or independent) variable. However, regression models can be quite complex; they can include many predictor
or independent variables, they can use discrete outcome variables, and they can involve
predictors at multiple levels of measurement (student and school characteristics). The ideas of
regression have also been combined with some other methods, notably factor analysis, and
this has led to structural equation modelling. These advanced techniques lie beyond the
scope of this topic, but a sound understanding of simple linear regression will provide a
foundation for your later exploration of these more complex approaches.
Year 9 Algebra revisited
In high school, I recall doing some simple experiments. Here are the results of one to investigate Hooke's Law: the extension of a spring is proportional to the force (of gravity) applied to the spring (Figure 1).
We should notice a few things about the graph. First, it is a straight line. We can say that the relationship between the extension of a spring (shown on the vertical axis) and the mass suspended by the spring (horizontal axis) is linear. Second, the line has a particular slope that represents the influence of increasing the mass on the extension of the spring. This is expressed as a ratio of the change in extension to the change in mass. In this case, the extension of the spring changes from 17 to 67 mm, a change of 50 mm, for a change from 0 to 100 g in mass. This ratio is, therefore, 50/100 or 0.5 and this is the slope of the line. Third, the slope is positive; an increase in mass is associated with an increase in extension. Fourth, the line cuts the y-axis at 17. This point is called the intercept. This is the displacement of the pointer when there is no mass on the spring. Fifth, this relationship is a deterministic one. The points are not scattered at all; they all lie along the line. This is clearly not educational data and, even for a physics experiment, the data must have been collected by a meticulous researcher. In the social sciences, we expect to see variability in the data with points scattered about a line.


Figure 1: A graph of the data from an experiment to investigate Hooke's Law
A key point to notice about the line is that we can characterise it by specifying just two
values, the slope and the intercept. If they are given, you can draw the line without any
additional information.
In educational research, where our data have much more variability, if we think there is a linear relationship between variables, we can find out what the line is; that is, we can find the slope of the line and its intercept. If we do this, we are creating a model to describe an aspect of the data: we are saying, in effect, that there is an underlying linear relationship but it is obscured by the variability. When we estimate the two parameters for the line (intercept and slope), there will be some uncertainty about them because of the variability in the data.
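The least-squares estimates that SPSS produces have a simple closed form in the two-variable case: the slope is the covariance of x and y divided by the variance of x, and the intercept places the line through the point of means. A sketch that recovers the slope (0.5) and intercept (17) of the Hooke's Law line in Figure 1:

```python
def least_squares(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return intercept, slope

mass = [0, 20, 40, 60, 80, 100]       # grams
extension = [17, 27, 37, 47, 57, 67]  # millimetres, read from the line in Figure 1
print(least_squares(mass, extension))  # (17.0, 0.5)
```

With noise-free data the fit is exact; with social science data, the same formulas give the best-fitting line through the scatter.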
Last week we looked at a more typical social science relationship, one between ESCS and mathematics achievement, shown again in Figure 2.

Figure 2: Scatterplot of Mathematics against SES index (ESCS) scores
When I look at Figure 2, I am reasonably confident that there is an underlying linear
relationship between these two variables. I think that some of the variation in students' mathematics scores can be explained by variation in their socioeconomic status. To test this idea, I will run
a regression in SPSS. I will regress PV1Math on ESCS.
Regression in SPSS
To run a regression model in SPSS, select the Analyze, Regression, Linear option. This
opens the following dialogue box (Figure 3).

Figure 3: Dialogue box in which dependent and independent variables are specified
When we run this command, SPSS generates four tables. We will ignore the first, which lists
the variables that we have entered, and look at the other three. The Model Summary table
(Figure 4) gives us information about the correlation (r) between the dependent and
independent variables. It also tells us the value of r², which is an indication of how much of the variance of the dependent variable is explained by the independent variable. In this case, 11.6% of the variation in mathematics scores is explained by differences in students' socioeconomic status.

Figure 4: Model summary of the regression of mathematics achievement on ESCS
In regression models, SPSS also does an analysis of variance (Figure 5). We have seen this
before when we compared the means of three or more groups. The analysis of variance in this
case tells us whether the amount of variance predicted by our linear model is substantial compared with the variance that remains unexplained. In this case, although the model explains only a modest share of the total variance (11.6%), the ratio of the predicted to the residual mean square (the F ratio) is very large (over 600) and this value is statistically significant. In other words, we are informed that the regression model is a sound one.

Figure 5: Analysis of variance for the regression of mathematics achievement on ESCS
We are most interested in the coefficients that define the regression line. They are shown in
the Coefficients table (Figure 6). The intercept is shown as the 'Constant' in the coefficients table. The slope of the line is shown first as an unstandardized coefficient with its standard error. The slope of 40.262 indicates that if a student's ESCS score is one unit more than another's, the difference in their predicted mathematics scores is 40.262 points. Notice that this is a positive number, so as ESCS increases, the predicted mathematics score increases.
For the slope, there is also a standardized coefficient. Compare this with the correlation
coefficient that you found last week when you examined these variables.

Figure 6: The coefficients (intercept and slope) for the regression of mathematics
achievement on ESCS
For both coefficients, there is a corresponding t statistic and a p-value. In both instances,
there is an implied null hypothesis that they are no different from zero. In both cases, we can
reject these and conclude that the intercept and slope are both statistically different from zero
and that there is a relationship between ESCS and mathematics achievement. Moreover, we
can use this relationship to predict a student's mathematics score given their ESCS score and
we can say that 11.6% of the variance in mathematics achievement is explained by ESCS.
Exercise 1
Undertake the above analysis on your own country's data set.
Contrast the findings for your country with those shown for Latvia. We will make a list of
countries with the regression slope for mathematics achievement on ESCS. What will this
table tell us?
What do you conclude about equity in your country?
Comparing correlation with regression
Both correlation and regression are methods of telling us about associations between
variables. A correlation coefficient tells us whether the association is positive or negative and
how weak or strong the relationship is.

Regression provides similar information to what correlation tells us, and in its output includes
information on correlations between variables, but it goes further and enables us to make
predictions about the score on one variable if we know the score on the other.
Both correlation and regression evaluate a possible relationship between variables, and both
assume that there is an underlying linear relationship. Correlation tells us how closely data
points are clustered about the line. Regression tells us the intercept and slope of the line. If
the correlation is weak and the points are widely dispersed, the coefficients for the line
(intercept and slope) will have high standard errors and may not be significantly different
from zero.
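One consequence of this link is worth verifying numerically: in simple linear regression, the standardized slope coefficient equals Pearson's r exactly, because standardizing both variables removes the difference in their scales. A sketch with invented data:

```python
def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def standardized_slope(x, y):
    """Unstandardized OLS slope rescaled by sd(x)/sd(y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope * (sxx / syy) ** 0.5

x = [1, 2, 3, 4, 5]  # invented predictor scores
y = [2, 1, 4, 3, 5]  # invented outcome scores
print(pearson_r(x, y), standardized_slope(x, y))  # both 0.8
```

This is why the standardized coefficient in the SPSS Coefficients table matches the correlation you found last week.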
Reading
Field devotes a substantial chapter to regression, reflecting its importance.
Field (2009, Chapter 7, especially pp. 197-208).
Reference
Field, A. P. (2009). Discovering statistics using SPSS (3rd ed.). London: Sage.
