
Radio National: Ockham's Razor

9 October 2011

Significant does not equal important: why we need the new statistics
Professor Geoff Cumming from the Statistical Cognition Laboratory at La Trobe University in Melbourne looks at how we interpret statistics.

Robyn Williams: A few weeks ago in this program we kicked off a row about dope, cannabis actually. Does it trigger psychosis or even schizophrenia in heavy users? And are the figures for schizophrenia going up, as those for cannabis users certainly are? Well, you'd think that's a straightforward question, but no, we found there's disagreement about statistics and their significance. The same can be heard in the Health Report nearly every week. Yes, avoid coffee, or fructose, or cholesterol, or red meat, because the stats are worrying. And then Norman Swan talks to ten professors about how the significance of the figures may be in doubt. So, what do we do? Give up? Well no, instead listen to Professor Geoff Cumming of La Trobe University in Melbourne, who has a book on The New Statistics.

Geoff Cumming: Suppose you read in a newspaper that 'Support for the Prime Minister is 43% in a poll with an error margin of 2%.' Most people probably understand that: 43% is our best estimate of support in the whole population and, most likely, it's within 2% of the true value. Reporting something like 43 plus or minus 2 is a good way to answer many of science's questions. A chemist reports a melting point as 17 plus or minus 0.2 degrees, or a geologist estimates the age of the earth as 4.5 plus or minus 0.1 billion years. These are examples of estimation, a statistical strategy widely used in the natural sciences, and in engineering and other applied fields.

The 43 plus or minus 2 defines a range, from 41 to 45%, which, most likely, includes the true value. This is the 95% confidence interval. We can be 95% confident that this interval, calculated from the poll results, includes the true value of support for the Prime Minister. Estimation is highly informative, it tells us what we want to know, and it's so simple you can report it in a newspaper.
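To make the arithmetic behind that interval concrete, here is a minimal Python sketch (an illustration, not part of the broadcast). The sample size of 2,400 respondents is an assumption, chosen so the margin of error comes out near the 2% quoted above.

```python
import math

def proportion_ci(p_hat, n, z=1.96):
    """95% confidence interval for a proportion (normal approximation)."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Hypothetical poll: 43% support among an assumed 2,400 respondents.
p_hat, n = 0.43, 2400
low, high = proportion_ci(p_hat, n)
print(f"Estimate {p_hat:.0%}, 95% CI [{low:.1%}, {high:.1%}]")
# Margin of error comes out near 2 percentage points: '43 plus or minus 2'.
```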

The astonishing thing is that psychology and many other disciplines rarely use it, and instead use a different strategy - statistical significance testing - which was introduced in the 1920s by Sir Ronald Fisher, that brilliant and argumentative statistician and geneticist. Statistical significance testing has always been controversial, but even so it became widespread in psychology from the 1950s, and since then across the social sciences. It's also widely used in medicine and many biological sciences.

Consider a psychologist who's investigating a new therapy for anxiety. She randomly assigns anxious clients to the therapy group or a control group. You might think the most informative result would be an estimate of the benefit of therapy - the average improvement, as a number of points on the anxiety scale, together with a plus-or-minus amount, the confidence interval around that average. But psychology typically uses significance testing rather than estimation. Introductory statistics books often introduce significance testing as a step-by-step recipe:

Step 1. Assume the new therapy has zero effect. You don't believe this and you fervently hope it's not true, but you assume it.

Step 2. You use that assumption to calculate a strange thing called a 'p value', which is the probability that, if the therapy really has zero effect, the experiment would have given a difference as large as you observed, or even larger.

Step 3. If the p value is small, in particular less than the hallowed criterion of .05 (that's 1 chance in 20), you are permitted to reject your initial assumption - which you never believed anyway - and declare that the therapy has a 'significant' effect.

If that's confusing, you're in good company. Significance testing relies on weird backward logic. No wonder countless students every year are bamboozled by their introduction to statistics! Why this strange ritual, they ask, and what does a p value actually mean? Why don't we focus on how large an improvement the therapy gives, and whether people actually find it helpful? These are excellent questions, and estimation gives the best answers.

For half a century distinguished scholars have published damning critiques of significance testing, and explained how it hampers research progress. There's also extensive evidence that students, researchers, and even statistics teachers often don't understand significance testing correctly. Strangely, the critiques of significance testing have hardly prompted any defences by its supporters. Instead, psychology and other disciplines have simply continued with the significance testing ritual, which is now deeply entrenched. It's used in more than 90% of published research in psychology, and taught in every introductory textbook. Why do researchers continue to rely on significance testing? I suspect one reason is that declaring a result 'significant' strongly suggests certainty, even truth, and that the effect is large and important, even though statistical significance doesn't imply any of that.

Another problem relates to replication. Replication is central in science - usually, we won't take any result seriously until it has been replicated at least a couple of times. An advantage of estimation is that a confidence interval tells us what's likely to happen on replication. If we ran another poll, the same size, but asking a different sample of people, we'd most likely get a result within the 43 plus or minus 2 confidence interval given by our first poll. Not so for significance testing! p values are usually calculated to 2 or even 3 decimal places, and decisions about significance are based on the precise value. However, a significance test applied to a replication experiment is likely to give a very different p value. Significance testing gives almost no information about what's likely to happen on replication! Few researchers appreciate this problem with p values, which totally undermines any belief, or desperate hope, that significance is a reliable guide to truth. For a simulation of how p values jump around wildly with replication, go to YouTube and search for Dance of the p values. There's colour and movement, and even weird dance music.
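In the spirit of that 'dance of the p values' demonstration, here is a minimal simulation sketch (my illustration, not code from the program; the effect size, group size, and number of replications are all assumed). It reruns the identical two-group experiment many times and reports how much the p value jumps around.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Assumed design: therapy vs control, 32 clients per group, and a true
# improvement of half a standard deviation (a medium-sized effect).
n, true_effect, replications = 32, 0.5, 25

p_values = []
for _ in range(replications):
    control = rng.normal(0.0, 1.0, n)          # control group scores
    therapy = rng.normal(true_effect, 1.0, n)  # therapy group scores
    _, p = stats.ttest_ind(therapy, control)   # two-sample t test
    p_values.append(p)

# The identical experiment, repeated, typically yields wildly different p values.
print(f"smallest p = {min(p_values):.3f}, largest p = {max(p_values):.3f}")
print(f"replications with p < .05: {sum(p < .05 for p in p_values)} of {replications}")
```

With these assumed numbers the statistical power is only about one half, so roughly half the replications cross the .05 threshold and the rest do not, even though every one of them studied exactly the same true effect.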

I refer to estimation and related techniques as 'the new statistics', not because the techniques are new, but because, for most researchers who currently rely on significance testing, using the techniques would be very new and would require big changes in attitude. But switching to estimation could, I'm convinced, give great improvements to research.

Another reason researchers hesitate to use estimation may be that, often, confidence intervals are long, perhaps embarrassingly so. It's discouraging to report a reduction in anxiety of 7.5 plus or minus 7 units on the anxiety scale, or even 7.5 plus or minus 15, but even such long confidence intervals are simply reporting accurately the uncertainty in the data. Confidence intervals in psychology, medicine, and many other disciplines are often very long because people vary so much, and it's not practical, or even possible, to use sufficiently large samples to get short confidence intervals.

So, what do we do? An excellent approach is to combine results from multiple studies to get better estimates, and meta-analysis does exactly that. Meta-analysis integrates evidence over studies to give an overall estimate of the size of the effect of interest, and a confidence interval to tell us how precise that estimate is and how consistent the results from the different studies are. Meta-analysis is based on estimation, and requires as input the mean and confidence interval from each of the separate studies. It's a key part of the new statistics, and can draw clear conclusions from a messy and controversial research literature. For example, meta-analysis supports a very confident conclusion that phonics is essential for beginning readers. It's also now routinely used to combine evidence from different trials of a drug, to guide decisions about whether the drug is effective and should be approved for use.
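As a rough illustration of that combining step, here is a minimal fixed-effect, inverse-variance sketch; the three study results are invented purely for the example, and each study's standard error is assumed to be recoverable from its reported 95% confidence interval (half-width divided by 1.96).

```python
import math

def fixed_effect_meta(estimates, standard_errors, z=1.96):
    """Pool study estimates by inverse-variance weighting (fixed-effect model)."""
    weights = [1.0 / se**2 for se in standard_errors]
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, (pooled - z * pooled_se, pooled + z * pooled_se)

# Invented studies: mean improvement on an anxiety scale, with its standard error.
# (A standard error can be recovered from a reported 95% CI as half-width / 1.96.)
estimates = [7.5, 4.0, 6.2]
standard_errors = [3.5, 2.0, 2.8]

pooled, (low, high) = fixed_effect_meta(estimates, standard_errors)
print(f"Pooled estimate {pooled:.1f}, 95% CI [{low:.1f}, {high:.1f}]")
# The pooled interval is shorter than any single study's interval:
# combining evidence gives a more precise estimate.
```

Real meta-analyses more often use a random-effects model to allow for genuine variation between studies; the fixed-effect version above is just the simplest way to show the idea.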
Of course, meta-analysis can only give an accurate result if all relevant studies are included. If only selected studies are available it's likely to give a biased result. Here, alas, is where significance testing does further damage, because significance has often influenced which studies are published and are, therefore, easily available for inclusion in a meta-analysis. Few journal editors have been willing to allocate precious journal space to report results found to be non-significant! Imagine carrying out a dozen replications of an experiment. The results would, of course, vary somewhat, and the p values would vary enormously. Typically, the experiments that happened to obtain somewhat larger effects would be significant and the others would not. If only the significant results were published, a meta-analysis would probably mainly include the larger results, and so would give an overall estimate that's too large - not through any fault of meta-analysis, but because only a biased selection of research was available. Significance testing is not only irrelevant for carrying out meta-analysis, which relies on estimation, but, through its influence on journal publication, it throws a major spanner in the works of meta-analysis.

Here's an example of the pernicious influence of significance testing. Back in the 1970s, reviews of gender differences identified substantial differences between boys and girls - in verbal ability, mathematics ability, and aggressiveness, for example. Then scholars began to realise that the published literature on gender abilities was biased. If researchers were studying, for example, memory, they would carefully check whether the boys and girls differed on memory, even if gender was not their primary focus. If the difference was not significant they might not even mention it in their published article, but if it was significant the difference would be reported, with averages and other statistical details. Statistical significance thus biased the published literature towards evidence of difference. Happily, researchers have become more careful to report evidence of small or negligible differences, and recent reviews have identified many fewer differences between boys and girls, and many abilities on which their performance is very similar.

My conclusion is that significance testing gives only a seductive illusion of certainty and is actually extremely unreliable. It also distorts the published research record. All round, it's a terrible idea. Statistical reformers have long advocated a switch from significance testing to estimation and other better techniques. Recently the sixth edition of the Publication Manual of the American Psychological Association was released. The Manual is used by more than 1,000 journals across numerous disciplines, and by millions of students and researchers around the world, so its advice is highly influential. The new edition states unequivocally that interpretation of results should, wherever possible, be based on estimation. This crucial advice is new, and I hope it gives a great boost to statistical reform.

Statisticians sometimes argue that reformers' claims are exaggerated, because estimation and significance testing are based on the same statistical models and so you can translate between the two. That's true, statistically, but my research group has published evidence that, psychologically, presenting results in a significance-testing or an estimation format can make an enormous difference to the way they are understood. In many cases, estimation is not only more informative, but leads to more accurate interpretation. Such evidence comes from our research on statistical cognition, which is the study of how people understand - or misunderstand - various ways of presenting results. I believe we need much more statistical cognition research, so that we understand better how people think about statistical ideas and presentations. This research could also suggest how to help researchers understand that the apparent certainty offered by statistical significance is a mirage, and that there are better ways to analyse data.

I have a book about all this. The title is Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis. I hope it will help students avoid being bamboozled by significance testing, and help researchers as well as students use estimation and meta-analysis. There's further information at www.thenewstatistics.com.

I'd like to close with another example, heart-rending yet hopeful. It concerns SIDS, or cot death, the tragic death while sleeping of apparently healthy infants. Back in the early 1980s my wife and I followed what was then the best advice and put our young kids down to sleep face down on a sheepskin. A recent review examined the evidence published in various years about the influence of sleeping position on the risk of SIDS. If meta-analysis had been available, applying it to the evidence available in 1970 would have given a reasonably clear conclusion that back sleeping is safer. The evidence strengthened during the 1970s and '80s, but some parenting books still recommended front sleeping as late as 1988. The review estimated that, if meta-analysis had been available and used, and the resulting recommendation for back sleeping had been made in 1970, as many as 50,000 infant deaths in the developed world could have been prevented. Who says statistical techniques don't make a difference? I'm pleased to say that our young grandchildren are resolutely laid down to sleep on their backs.

Robyn Williams: Snug and safe, 50,000 lives saved by stats in that single example alone.
Professor Geoff Cumming at La Trobe University in Melbourne. Understanding The New Statistics is the name of the book.

Next week we go to Bendigo for a talk on physics, the flavour of the month, what with Brian Schmidt's Nobel. I'm Robyn Williams.

Publications
Title: Understanding The New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis
Author: Geoff Cumming
Publisher: Routledge, New York
