
Significance Testing


Significance testing is the process of using statistics to determine how well data fit a particular pattern. Patterns can be very simple, such as "rates of obedience in the authority-present condition will be higher than rates of obedience in the authority-absent condition," or they can be more complex. The key to understanding significance testing is to understand that the strength of each pattern can be measured by a test statistic. What is a test statistic? First, let's look at what a statistic is. A statistic is a number representing some aspect of a group of numbers. For example, a mean is a statistic measuring central tendency in a group of numbers. The mean of the numbers 2, 7, and 9 is (2 + 7 + 9)/3 = 6. Another common statistic is the standard deviation, which describes how spread out a group of numbers is. The numbers 2, 7, and 9 have a standard deviation of 3.6, while the numbers 4, 7, and 8 have a standard deviation of 2.1. These values reflect the fact that 2, 7, and 9 are more spread out than 4, 7, and 8. Both the mean and the standard deviation are statistics because they are numbers describing properties of groups of numbers.

Test Statistics

A test statistic is a number that represents the strength of a pattern in a group of numbers. It is a statistic because it reflects some property of a group of numbers, but unlike most statistics, it is sensitive to a particular pattern. There are many test statistics, and each is sensitive to a different pattern, but in general, test statistics get farther from zero when their pattern is present. Test statistics are very useful for two reasons: (1) simplification: they enable researchers to express the strength of a pattern with a single number; and (2) falsifiability: if a hypothesis is falsifiable, then it is possible to test the hypothesis in a way that either supports or discredits it, and a test statistic makes it possible to decide whether or not a pattern is present.
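The mean and standard deviation examples above can be checked in a few lines of Python. This is a sketch using the standard library's statistics module; it assumes the handout is using the sample (n − 1) standard deviation, which is what reproduces the values in the text:

```python
import statistics

group_a = [2, 7, 9]
group_b = [4, 7, 8]

# The mean is a statistic describing central tendency
mean_a = statistics.mean(group_a)   # (2 + 7 + 9) / 3 = 6

# The sample standard deviation is a statistic describing spread
sd_a = statistics.stdev(group_a)    # about 3.6
sd_b = statistics.stdev(group_b)    # about 2.1

print(mean_a, round(sd_a, 1), round(sd_b, 1))
```

The larger standard deviation for 2, 7, 9 confirms that those numbers are more spread out than 4, 7, 8.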
To understand how this is done, we must first discuss the concept of statistical significance.

Statistical Significance

Let's say that you are a detective and you get reports of a crooked gambling establishment. People are betting on a coin toss in which they gain one dollar for every heads and lose one dollar for every tails. The allegation is that the coin toss is somehow rigged to come up tails more than heads. This allegation becomes your experimental hypothesis: the pattern you expect to find. If your hypothesis is supported, you will charge the establishment with a crime. The first decision you need to make is: how confident do I need to be before I conclude that the coin toss is rigged? If you observe a thousand coin tosses, find 538 tails, and compute that 538-or-more tails would occur less than 1 in 100 times with a fair coin, would you feel confident charging the organization with a crime? What if you find 549 tails, which would occur less than 1 in 1000 times with a fair coin? The probability that you establish (e.g., 1 in 100, 1 in 1000) to make your decision of whether your hypothesis is supported or not supported is called the alpha level. In most psychological research, where the consequences of being wrong are not severe, the alpha level is set at either 0.05 or 0.01, which corresponds to either a 1 in 20 or a 1 in 100 chance of being wrong: saying the coin is rigged when really it is fair and the string of tails was just bad luck. If you are conducting research where the consequences of being wrong are severe, such as saying a drug is safe when it might be deadly, you would set your alpha level lower, to perhaps 0.0001. So, your first step is to set the alpha level of your test. Step two is to find a test statistic. You need a number that will reflect the strength of the pattern you suspect. In this case, the pattern you are observing is the number of tails, and you can
by Bill Altermatt, last updated Jan 21, 2008


use that number as your test statistic. High values will indicate a stronger pattern. Let's say you observe 1000 coin tosses and 549 of those are tails. Your test statistic in this case is 549. Step three is where the magic happens: computing the probability of your test statistic. More specifically, you want to know the probability of your test statistic, 549 tails out of 1000 coin tosses, occurring with a fair coin. If that probability is sufficiently low (below your alpha level), you will reject the idea that the toss is fair and conclude that it is rigged. But wait, do you want to know the probability of getting exactly 549 tails? It's probably pretty unusual to flip a coin 1000 times and get any one number, even 500. No, you want a sense of how unusual 549 is; how much it differs from the expected number of 500 tails for a fair coin. To answer this question, you could do 1000 coin tosses with a coin you know is fair and record the number of tails, then repeat that procedure ten million times, each time writing down the number of tails. Sometimes, you would get more than 549 tails. Most of the time, you would get around 500 tails. If you arranged the numbers of tails from smallest to largest and counted how often each result occurred, you would have a good sense of how unusual your obtained result of 549 is. If you drew a line at 548 tails and counted the number of results above 548, then divided that number by ten million, you would know the percentage of times that your result of 549 tails, or one more extreme than yours, occurs with a fair coin just by chance. That percentage would be the answer to your question: the probability of obtaining your test statistic, or one more extreme, just by chance. This number is called statistical significance. It is often written as the italicized lower-case letter p, or p-value, for probability value. If it is below your alpha level, you would conclude that the coin toss is rigged.
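The repeat-the-experiment idea above can be sketched directly in Python. This is a scaled-down illustration, not the handout's actual procedure: it uses a few thousand repetitions instead of ten million, and the seed and repetition count are arbitrary choices:

```python
import random

random.seed(1)

def tails_in_fair_tosses(n=1000):
    """Simulate n fair coin tosses and return how many came up tails."""
    return sum(random.random() < 0.5 for _ in range(n))

observed = 549     # the test statistic from the example above
reps = 4000        # a stand-in for the handout's ten million repetitions

# Build the distribution of tails counts you would see with a FAIR coin
null_distribution = [tails_in_fair_tosses() for _ in range(reps)]

# p-value: the proportion of fair-coin results at or beyond the observed value
p_value = sum(t >= observed for t in null_distribution) / reps
print(p_value)     # a small number, on the order of 0.001
```

With only a few thousand repetitions the estimate is coarse, which is exactly why the handout imagines ten million: more repetitions give a finer-grained picture of how rare 549 tails really is.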
The example given in the previous paragraph presents the logic behind all significance testing. To test your hypothesis that the coin toss is rigged, you begin by assuming a fair coin and seeing how rarely your results would be obtained under that assumption. The hypothesis you want to test is called the experimental hypothesis, while the assumption of no effect is called the null hypothesis. When you generate a large number of test statistics under the null hypothesis (as you did by repeating your coin toss experiment with a fair coin ten million times), the resulting distribution of test statistics is called the null distribution. By computing the percentage of the null distribution that is equal to or more extreme than your test statistic, you find the p-value. If the p-value is less than the alpha level, you reject the null hypothesis and accept your experimental hypothesis. One part of the process above seems terribly unwieldy: running millions of replications of a test under the null hypothesis to create the null distribution. In practice, researchers rely on mathematical approximations of this process. There is a formula for determining the exact probability of a 2-outcome event (heads or tails) with known probability (a fair coin comes up heads 50% of the time) and a given number of trials (1000 coin flips). It is called the binomial function. Each test statistic has a similar function that statistics programs use to estimate the probability of obtaining a given test statistic. What does statistical significance tell you? Does it tell you the probability that a result will be repeated? No. Does it tell you the probability that the null hypothesis is true? No.¹ Does it tell you that a result is important? No. Statistical significance only tells you that a result is unlikely given the assumption of no effect (the null hypothesis). Regrettably, many people use the p-value as an indication of importance.
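The binomial function mentioned above can be evaluated exactly with the standard library, replacing millions of simulated replications with one formula. A sketch for the running example, the chance of 549 or more tails in 1000 fair tosses:

```python
from math import comb

def binomial_tail(k, n=1000, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more tails."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Probability of 549 or more tails in 1000 tosses of a fair coin
p_value = binomial_tail(549)
print(p_value)   # about 0.001
```

This exact tail probability is what the ten-million-replication thought experiment approximates, and it is the kind of calculation a statistics program performs behind the scenes.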
To correct this problem, researchers have recommended additional ways to evaluate test statistics, such as effect size.

¹ Statistical significance gives you the probability of a test statistic (T), given the null hypothesis (H0). Written as a conditional probability, it is p(T|H0). In contrast, the probability that the null hypothesis is true is p(H0). These can be very different.


Effect Size

Whereas statistical significance (the p-value) tells you how unlikely a test statistic is, effect size reflects the strength of the pattern that the test statistic is designed to detect. As mentioned in the reading on the scientific literature in psychology, meta-analyses use effect size to combine the results of many studies because all test statistics (t, r, F, etc.) can be converted to effect size. Several statistics have been developed to express effect size, but three of the most popular are r, η² (eta squared), and d.

r. r is on the same scale as the Pearson correlation coefficient r, ranging from -1 to +1 with 0 indicating no effect. It is most useful if the variables you are comparing are both continuous, that is, capable of being expressed by numbers along a continuum. For example, exposure to media violence exists along a continuum from none to a great deal, and degree of aggression also exists along a continuum from mild to severe. Thus, you could express the effect of exposure to media violence on aggression using an r statistic. Research on the effects of media violence on aggression finds effect sizes in the r = 0.2 to 0.3 range (Anderson et al., 2003).

η². Eta squared is analogous to r² but is used for t-tests and ANOVA. Like r², it reflects the percentage of variability in the dependent variable that can be explained by the independent variable.

d. d is generally used to describe the strength of the difference between two groups. It represents the number of standard deviations that separate the means of two groups. A d of 1.0 indicates that two means differ by one standard deviation, and a d of 0.5 indicates that two means differ by half a standard deviation. For most people, those units don't mean much, so Cohen (1988) suggests thinking about d in terms of percentiles. Consider two groups, experimental and control. With a d of 0, the mean of the experimental group exactly overlaps with the mean of the control group.
At d = 0.8, the mean of the experimental group would fall at the 79th percentile of the control group, meaning that 79% of the scores of the control group are below the mean of the experimental group. At d = 1.7, the mean of the experimental group is at the 95.5th percentile. Cohen (1988) offers the cutoff values in the table below as general guidelines for evaluating the strength of effect sizes.

Table 1. Cohen's (1988) Recommended Interpretations of Effect Size

        Small    Medium    Large
r       .10      .24       .37
d       .2       .5        .8
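Cohen's percentile interpretation of d can be reproduced with the standard normal CDF. The sketch below assumes, as that interpretation itself does, normally distributed scores with equal spread in both groups:

```python
import math

def percentile_for_d(d):
    """Percentile of the control distribution at which the experimental
    group's mean falls, assuming normal scores with equal spread."""
    # Standard normal CDF via the error function
    return 100 * 0.5 * (1 + math.erf(d / math.sqrt(2)))

print(round(percentile_for_d(0.0), 1))   # 50.0: the two means overlap
print(round(percentile_for_d(0.8), 1))   # 78.8: roughly the 79th percentile
print(round(percentile_for_d(1.7), 1))   # 95.5th percentile
```

The function is just the normal cumulative distribution evaluated at d, since the experimental mean sits d standard deviations above the control mean.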



Four Test Statistics

Psychology is mostly concerned with four test statistics, which are known by their symbols or letters: χ² (Chi Square, pronounced "kai square"), r, t, and F.

Chi Square (χ²)

The Chi Square test statistic measures the degree to which one categorical variable is distributed disproportionately across one or more other categorical variables. A categorical variable is a nominal-scale variable, a variable that can take on values in distinct categories (such as Egyptian or Lebanese) that cannot be expressed as points along a continuum (as would be the case with a variable such as income or temperature). When we talk about it being distributed across another variable, we mean that we are looking at how often particular combinations of two categorical variables occur. For example, let's say we're studying whether support for gun control varies by political party. We call 100 registered Democrats and 100 registered Republicans and ask each person whether they think there should be more restrictions on gun use (for gun control) or fewer restrictions (against gun control). We obtain the data in Table 2.

Table 2. Attitudes toward Gun Control by Political Party

                          Political Party
Gun Control               Democratic    Republican
For gun control           70            20
Against gun control       30            80
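A quick way to see where a χ² value for the 2×2 gun-control table comes from is to compute it directly. A sketch; the yates option reflects an assumption on my part that the handout's reported value used the continuity correction, so both versions are shown:

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Pearson chi-square for a 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    row = [a + b, c + d]
    col = [a + c, b + d]
    # Expected count for each cell: (row total * column total) / grand total
    expected = [row[i] * col[j] / n for i in (0, 1) for j in (0, 1)]
    chi2 = 0.0
    for o, e in zip(observed, expected):
        diff = abs(o - e)
        if yates:
            diff = max(diff - 0.5, 0)   # Yates' continuity correction
        chi2 += diff * diff / e
    return chi2

# Table 2: For gun control (Dem 70, Rep 20); Against (Dem 30, Rep 80)
print(round(chi_square_2x2(70, 20, 30, 80), 1))              # 50.5 uncorrected
print(round(chi_square_2x2(70, 20, 30, 80, yates=True), 1))  # 48.5 corrected
```

Either way the statistic is very large for 1 degree of freedom, so the disproportion between the two parties is highly significant.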

When we talk about one variable being distributed disproportionately across another variable, we mean that the distribution of gun-control opinion is not in the same proportion for Democrats (70:30) as it is for Republicans (20:80). To make sense of nominal-scale data, it is often useful to convert them to percentages. For the data in Table 2, we see that gun control is supported by 70% of our Democratic sample but only 20% of our Republican sample. Our next question is whether the difference between 70% and 20% is a significant difference; that is, whether the difference between 70% and 20% is so large that it is unlikely to occur by chance. Given the data above, we would obtain a χ² value of about 48 with 1 degree of freedom, which would be significant at p < .001. Thus, we could conclude that Democrats were significantly more likely to be for gun control than Republicans. In general, Chi Square is the test statistic you would use if you are comparing two percentages and testing whether they are different.

Correlation (r)

Correlation is a statistical procedure used to measure the degree of linear relation between two variables. A linear relation is one that can be well described by a straight line. Figure 1 shows a linear relation between temperature and aggression. In these hypothetical data, participants are placed into rooms of different temperatures and their aggression is measured. Each data point refers to a different person, and each data point represents two pieces of information about that person: the temperature of their room and their level of aggression. Correlation requires that both of your variables are on an underlying continuum: interval- or ratio-scale. Correlation is reported as a lower-case italicized r. r ranges from -1 to +1, with scores farther from zero indicating a stronger relation. The sign of r (whether it is positive or negative) indicates the direction of the relation between the two variables.


Positive correlations indicate that high scores on one variable (temperature) tend to be found with high scores on the other variable (aggression), and low scores on one variable tend to be found with low scores on the other variable. In the data presented in Figure 1, the correlation is +0.91, a very strong correlation because it is close to +1.

[Figure 1. Correlation of r = +0.91: temperature (x-axis) versus aggression (y-axis)]

[Figure 2. Correlation of r = +0.85]



Figure 2 displays a correlation of r = +0.85. Compared to Figure 1, the data in Figure 2 are not as concentrated around a straight line; they are more spread out. As data points become less and less well described by a straight line, the correlation decreases. A negative correlation indicates that high scores on one variable tend to be found with low scores on the other variable. Figure 3 shows an example of a negative correlation: the relation between temperature and the sales of hot drinks. As the temperature increases, people buy fewer hot drinks: high values on one variable are found with lower values on the other.

[Figure 3. A negative correlation: as temperature (x-axis) rises, sales of hot drinks (y-axis) fall]
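The correlations shown in these figures can be computed from raw scores with the usual deviation-score formula. The data below are made up for illustration and are not the data behind the figures:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation of x and y, scaled by the
    spread of each variable, so the result always falls in [-1, +1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical room temperatures and aggression scores (illustrative only)
temperature = [40, 50, 60, 70, 80, 90, 100]
aggression  = [10, 12, 18, 17, 25, 24, 30]

r = pearson_r(temperature, aggression)
print(round(r, 2))   # a strong positive correlation, close to +1
```

A perfectly decreasing pattern would instead give r = -1, the strongest possible negative correlation.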

One way to convert the correlation coefficient r into a more usable form is to square it, creating r². r² is the percentage of one variable that can be explained, accounted for, or predicted from a linear relation with the other variable. If the correlation between temperature and aggression is r = +.85, then r² = 0.72, meaning that temperature has accounted for 72% of aggression. This means that 28% is unaccounted for, due to measurement error and alternative causes of aggression. Accounting for 72% of any human behavior would be a monumental achievement given the myriad causes of behavior. Many psychological correlations are around r = +0.3, which means that they account for only about 9% of behavior. Although it may not seem like much, being able to predict 9% of behavior can be an enormous advantage when you are dealing with large numbers of people, such as customers on eBay or visitors to Disneyworld.

Independent t-test

An independent t-test is used to compare the means of two separate (independent) groups of people. It is used when your independent variable is nominal-scale and has two levels (such as experimental and control) and when your dependent variable is on a continuum and is interval- or ratio-scale (such as anxiety). When the two groups have means


that are exactly equal, t is zero. As the means of the two groups diverge, t increases. The formula for t begins by subtracting the mean of one group from the mean of the other group. Thus, if the first group has a higher mean, t will be positive, but if the second group has a higher mean, t will be negative. t gets larger as the magnitude of the difference between means increases, and gets smaller as the variability in scores within each group increases. This means that when scores are very spread out within each group, t will be smaller and you will be less likely to obtain a statistically significant result. If you were comparing the running times of two groups of Olympic athletes, one wearing a special shoe and one wearing a regular shoe, there would be very little variability within each group; each runner would probably be within a few hundredths of a second of the others. Under those circumstances, you would be more likely to find a statistically significant difference between the groups because the variability within groups would be small. Contrast that situation with a comparison of two groups of first-graders, one wearing the special shoe and one not. The running times for the first-graders may differ by several minutes, spreading out the scores so much that it would not be possible to see the effects of the running shoe.

ANOVA

ANOVA stands for ANalysis Of VAriance. It is not a test statistic per se, but rather a statistical procedure. Variance consists of the differences between scores; greater differences among scores mean greater variance. Analysis refers to a cutting into pieces of the variance. In this case, our goal is to divide variance into two major pieces: the variance within each group and the variance between the groups. Let's say you survey ten members each of three fraternities and ask them how satisfied they are with college. The differences among the responses within each fraternity comprise the within-group variance.
The differences between the means of the three groups comprise the between-groups variance. The test statistic for ANOVA is an F, and it is the ratio of between-groups variance to within-group variance. As between-groups variance increases and within-group variance decreases, F gets larger and is more likely to indicate a significant difference among the means of the groups. ANOVA requires that your independent variable is nominal-scale, but unlike the independent t-test, ANOVA can handle more than two groups. Like the t-test, the dependent variable for ANOVA must be continuous and on an interval or ratio scale. One-way ANOVA refers to ANOVA involving a single independent variable, such as which fraternity a participant is in. Factorial ANOVA involves more than one independent variable, such as a study investigating both the effects of being Greek or independent and the effects of gender. In that case, the experimental design would require a two-way ANOVA because there are two independent variables: Greek status and gender.

Summary of test procedures

The appropriate test is determined by the kind of data you have and the pattern you want to look for, as outlined in Table 3.



Table 3. Type of Data Required for Each Test

Test    IV                        DV         Example
χ²      Nominal                   Nominal    Do a higher percentage of Republicans than Democrats favor the death penalty?
r       Interval                  Interval   As temperature increases, are violent crimes more likely?
t       Nominal (2 categories)    Interval   Are males more aggressive than females?
ANOVA   Nominal (2+ categories)   Interval   Do four fraternities differ in their average GPA?
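The F ratio from the ANOVA section can be assembled exactly as described: between-groups variance over within-group variance. A sketch with three made-up "fraternity satisfaction" groups (the numbers are hypothetical, chosen only to make the arithmetic clean):

```python
def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n

    # Between-groups: how far each group's mean sits from the grand mean
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: spread of scores around their own group's mean
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

    ms_between = ss_between / (k - 1)   # df between = k - 1
    ms_within = ss_within / (n - k)     # df within = n - k
    return ms_between / ms_within

# Hypothetical college-satisfaction ratings for three fraternities
groups = [[4, 5, 6], [5, 6, 7], [8, 9, 10]]
print(round(one_way_anova_f(groups), 2))   # 13.0
```

Here the third group's mean sits well above the others while scores within each group barely vary, so between-groups variance dwarfs within-group variance and F is large.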

Type I and Type II Errors

Type I and Type II errors refer to two possible ways that the conclusions from statistical significance tests can be mistaken. Perhaps the easiest way to explain this is to begin with the chart in Table 4, which shows the relationship between the true state of nature and your conclusions. Recall that the null hypothesis is the assumption of no effect. In Table 4, we see that there are two ways to be correct: to reject the null when there is an effect, and not to reject the null when there is no effect. There are also two ways to make mistakes, and these are the Type I and Type II errors. A Type I error occurs when we reject the null hypothesis and claim there is an effect even though the true state of nature is that there is no effect. Thus, a Type I error occurs when p falls below .05 just by chance, not because there is a real effect. Here we see another definition of statistical significance (the p-value): the risk of making a Type I error. A Type II error occurs when we find that p is above .05 and conclude that there is no significant relationship when, in fact, there is a real relationship and we just didn't detect it.

Table 4. Explaining Type I and Type II Errors

                                              Your conclusions
True state of nature                          Do not reject null              Reject null
Null is true (there really is no effect)      Correct: say there is no        Type I Error: false positive
                                              effect when there is none       (p-value = prob. of a Type I error)
Null is false (there really is an effect)     Type II Error: false negative   Correct: say there is an effect
                                                                              when there is one
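The table's claim that the p-value is the risk of a Type I error can be checked by simulation: test a true null hypothesis many times and count how often it is wrongly rejected. A sketch using a two-tailed z-test on fair-coin tosses; the seed and counts are arbitrary choices:

```python
import random
from statistics import NormalDist

random.seed(2)

def p_value_for_coin(n_tosses=100):
    """Two-tailed z-test of 'this coin is fair' on n fair tosses.
    The null hypothesis is TRUE here, so any rejection is a Type I error."""
    tails = sum(random.random() < 0.5 for _ in range(n_tosses))
    z = (tails - n_tosses * 0.5) / (n_tosses * 0.25) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

experiments = 2000
false_positives = sum(p_value_for_coin() < 0.05 for _ in range(experiments))
print(false_positives / experiments)   # close to alpha = .05
```

The false-positive rate hovers near the alpha level, which is exactly what "p-value = probability of a Type I error" predicts.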

"Accepting" the Null Hypothesis?

The null hypothesis is a prediction of no effect. If your analysis yields p > .05 and you have set .05 as your alpha level, you cannot reject the null hypothesis. Why don't we say "accept the null hypothesis" rather than the more cumbersome "do not reject"? The reason is that to accept the null hypothesis would be to state as empirical fact that there is no effect, when all you really know is that you do not have sufficient evidence of an effect. It may be that an effect exists but your measures were not sensitive enough to pick it up, or you did not have enough participants. For example, a researcher studying gender differences in preferences for romantic films collects data from 10 men and 10 women and finds that the mean preference


rating for romantic films is higher for women than for men, but this difference is not significant at p = .08. The researcher should not conclude that there is no gender difference (accept the null hypothesis), but rather should conclude that no gender difference was found.

(Statistical) Power

In the context of statistics, power refers to the probability of achieving a specific p-value, given a specified effect size, sample size, and variability in the dependent variable. Thus, power is a probability, a number between 0 and 1. Power increases as effect size and sample size increase and as variability in the dependent variable decreases. For example, consider a study that compares the means of two groups of 10 people each. Assume that the mean of one group is 4 and the mean of the other group is 4.5, and that the variability of each group is a standard deviation of 1.0. With a p-value cutoff of .05, the power of that study is only .201. That is, there is only a 20.1% probability that the study will achieve a p-value less than .05. That information might make you pessimistic about actually conducting the study, because even if everything goes smoothly, you will only find p < .05 20.1% of the time. If the number of people in each group is increased to 100, power in the study just mentioned jumps to 0.942. This shows that power is influenced by sample size and tells you why researchers often try to obtain as many participants as possible. If we keep sample size in each group at 10 and increase our estimate of the difference between the means to 1.0 (by predicting that the means will be 4.0 for one group and 5.0 for the other, for example), power jumps to 0.609. This shows that larger effects are easier to detect than smaller effects. A common application of statistical power is in computing the necessary sample size to have power at a certain level.
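The power figures in this section can be reproduced with a normal-approximation formula for a two-group comparison. That this approximation (rather than an exact t-based calculation) is what the handout used is an inference from the numbers, so treat the sketch as illustrative:

```python
from math import sqrt
from statistics import NormalDist

phi = NormalDist().cdf   # standard normal CDF

def approx_power(mean_diff, sd, n_per_group, alpha=0.05):
    """Two-tailed power for comparing two group means, normal approximation."""
    d = mean_diff / sd                             # effect size (Cohen's d)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = .05
    shift = d * sqrt(n_per_group / 2)              # expected z under the effect
    # Probability the test statistic lands beyond either critical value
    return phi(shift - z_crit) + phi(-shift - z_crit)

print(round(approx_power(0.5, 1.0, 10), 3))    # about 0.201
print(round(approx_power(0.5, 1.0, 100), 3))   # about 0.942
print(round(approx_power(1.0, 1.0, 10), 3))    # about 0.609
```

The three calls mirror the three scenarios above: a small effect with small samples, the same effect with large samples, and a large effect with small samples.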
For example, if you would like to have an 80% probability of obtaining p < .05, how many people would you need in your sample? To answer that question, you need reasonably accurate estimates of effect size and variability, such as from an earlier study. With that information, you could learn that to have 80% power you would need only 20 participants, in which case you should do the study, or you could learn that you would need 2,000 participants, in which case you might give up.

Summary: Definitions of Statistical Significance

In this reading, we have encountered at least five ways to define statistical significance. All of them are equivalent and are just different ways of saying the same thing. Below is a summary of these definitions. Statistical significance is the probability of...

1. ...obtaining a test statistic as large or larger than the one you obtained, just by chance (as the result of a completely random process)
2. ...any observed difference between groups (or correlation between variables) being due to a random organization of the data
3. ...making a Type I (false positive) error
4. ...concluding that an effect exists, when in fact no effect exists
5. ...rejecting the null hypothesis (and accepting the experimental hypothesis), when the null hypothesis should not be rejected

References

Anderson, C. A., Berkowitz, L., Donnerstein, E., Huesmann, L. R., Johnson, J. D., Linz, D., Malamuth, N. M., & Wartella, E. (2003). The influence of media violence on youth. Psychological Science in the Public Interest, 4(3), 81-110.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.

