5 STATISTICAL SINS
INTRODUCTION
Most published scientific results are false.
That's the thesis argued, convincingly, by John Ioannidis in his landmark 2005 paper.
Here's one reason for this.
Imagine that you're living in an alternate past. The year is 1850. It's the height of the California gold rush. Except, in this timeline, you're living in a more scientifically literate society, so researchers set out to prove that there are ways to more efficiently strike it rich. An agenda I can get behind.
These researchers try whatever they can think of, testing, to name a few, the divine power of prayer, sacrificing groundhogs, and dowsing. None of these pan out (heh).
Then, one otherwise uneventful Tuesday, a paper is published, "Human
echolocation of gold veins: evidence of an adaptive mechanism for aural mineral detection." The paper describes an experiment where researchers split
1000 gold-hungry migrants into two groups: a traditional control and an echolocation group. The echolocation group was instructed to walk around clacking their tongues in pursuit of the shiny yellow stuff.
Here's the crazy part: the echolocation people found a statistically significant amount of excess gold. The authors speculate that humans have a heretofore unknown sensory organ that picks up mineral densities. This, they say,
was useful in the ancestral environment for tasks like finding salt, copper,
and primo soon-to-be farmland.
Everyone reads the paper. It's flawless, compelling. I'd call it a slam dunk, but basketball won't be invented for another 40 years. Soon, only fools search for gold without clacking their tongues madly, harnessing the power of human echolocation.
I.
Except, you know, Homo sapiens (unfortunately) has no particular knack for echomining. So what went wrong? Why is this paper false, even though the researchers executed the scientific process the "right" way?
Here's one possible answer: those researchers weren't the only ones who tested echolocation. There was a lot of excitement, and gold to be had! Anything and everything was tried.
For the researchers who found nothing but a null result, no paper was published. "Of course humans can't detect gold with tongue clacking. This is not publication-worthy science."
But, in a universe where enough people test something, someone is
bound to obtain a significant result. The laws of probability dictate it.
If 20 teams tested human echolocation, the odds of at least one team reporting a false positive are not 5%, like you might expect with a p-value threshold of .05, but roughly 64%.
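The arithmetic behind that number is worth a quick sanity check. A minimal sketch (the function name is mine, for illustration):

```python
# Chance that at least one of n independent teams gets a false positive,
# given a per-test significance threshold alpha. This is 1 minus the
# chance that every single team correctly finds nothing.
def family_false_positive_rate(n_teams, alpha=0.05):
    return 1 - (1 - alpha) ** n_teams

print(round(family_false_positive_rate(20) * 100))  # roughly 64 percent
```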
II.
This concept is called the "file drawer problem," because studies that find
expected, business-as-usual results are relegated to the metaphorical file
drawer, away from the light of the fair sun.
This is a real problem, with real consequences. The antidepressant reboxetine, for instance, was approved for use after clinical trials found it effective.
One problem: the stuff doesn't work. It was publication bias on the part of manufacturer Pfizer. Pfizer ran many studies, publishing only the positive findings.
Something similar, but less nefarious, happens with the study of extrasensory perception, you know, psychics, mind control, that kind of thing. When a study finds that people can't read minds (duh), the paper doesn't get written up or published.
But when such a study finds a positive result, everyone wants to read it.
There is a whole field like this, parapsychology, and many of the papers it encompasses are at least as fanciful as my echolocation example.
Like J.B. Hasted's paper, "Paranormal Metal Bending," where he reports that participants could bend metal with their minds. It contains a photo of a glass globe filled with paranormally scrunched paper clips. Notably, there is a hole in the globe, but Hasted explains, "We have found it necessary that a small orifice be left in the glass globes in which wires are bent."
Right.
Amusingly, academic finance has the opposite problem. If you do discover an anomaly in, say, the stock market, you don't publish it. You arbitrage
the opportunity away, making millions in the process.
III.
This is one of the reasons why most published research is false. In the
coming pages, I'll detail 5 more common mistakes, and what you can do to
avoid them.
And I say that this is a red flag, because the paper was later retracted. Of
it, one critique said that there is "no theoretical or empirical justification for
the use of differential equations drawn from fluid dynamics, a subfield of
physics, to describe changes in human emotions over time."
Whoops again.
Stereotype threat. One more, I can't resist. How I have been wrong! Let
me count the ways.
The basic idea is that priming negative stereotypes can hinder performance. So if your son is in ballet, and about to compete, and you bring up
something about how strange it is for a boy to be doing ballet, well, that
might mess him up.
Actually, the usual example is even weaker than that. Say you're a girl
and about to take a math test, and you've heard the "girls are bad at math"
meme. If, as part of the exam, you're required to indicate your gender, you'll
perform worse.
On reflection, this is not the most believable thing in the world. It sounds
like, "Well, I guess that could be true, but it seems sorta forced."
And probably it's not true, or at least not very true: having people indicate their gender or whatever before an exam has a very small effect, if any.
The paper that torpedoed it for me is "An Examination of Stereotype Threat Effects on Girls' Mathematics Performance," which found substantial publication bias, the same problem our echolocating gold miners had. Studies that could find no evidence of a stereotype threat tended not to be published.
So this theory is not true, or at least not true enough to matter.
We can do better.
IV.
Being wrong sucks.
That's why I'm writing this.
It sucks not in a general for-the-good-of-all-men way, but for selfish
reasons:
You don't want to be wrong about how the world works, because it will
be embarrassing if you get into an argument.
You don't want to be wrong because if you are mistaken and you act on
that mistaken belief when pursuing a goal, probably it's not going to work
out.
And, finally, if you assume a wrong thing to be true and build off of
that knowledge, when you discover your mistake, you'll also have to throw
out all the beliefs connected to that belief.
If that's not enough (it should be), consider that the main reason we know the name Johannes Kepler today is his fanatical devotion to creating planetary models consistent with observational data.
He really didn't want to be wrong.
We could all benefit from being a little more like Kepler. In the upcoming
sections, I'll cover 5 common beginner stats mistakes and how to avoid them.
1. NOT PLOTTING THE DATA
How to Fix It
The fix here is trivial. Learn about different types of visualizations and
then experiment with them. Try typical graphs, dotplots, heatmaps, whatever.
Learn how to create these with your software system of choice.
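You don't even need plotting software for a first look. Here's a toy stdlib-only sketch (the helper and the data are invented for illustration):

```python
from collections import Counter

# A quick-and-dirty text histogram: one line per distinct value,
# one '#' per occurrence. Crude, but it shows the shape of the data.
def text_histogram(values):
    counts = Counter(values)
    return "\n".join(f"{v}: {'#' * counts[v]}" for v in sorted(counts))

ratings = [3, 1, 2, 2, 3, 3, 3, 4, 2, 3, 5, 3, 4, 2]  # made-up data
print(text_histogram(ratings))
```

Even something this primitive beats squinting at a table of raw numbers.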
This is my favorite book on the subject, by the way.
I saw a video a while back where one guy won a Kaggle competition because he opened the data set in Excel and colored the numbers according to their severity. This gave him an intuitive understanding of the data, which allowed him to build a better model and win.
Plot your data.
2. OVERFITTING
Overfitting is a subset of that all-too-human pastime: seeing patterns where none exist.
Here's an example.
When Nazi Germany bombed London during World War II, the British came up with a lot of theories about when and where the Germans were bombing, trying to figure out their schedule. On Sundays, one part of London. Other days, another.
They were pretty confident that they'd figured out at least part of it.
Only problem: later, more rigorous analysis revealed that there were no
such patterns. No schedules, nothing. The distribution of bombs was random.
This is also the way that baseball players are announced: "Next up to bat, Ricky Example. Mr. Example is distinguished by the highest number of home runs on Sunday afternoons. Ladies and gentlemen, let's see what he's got for us."
See, with overfitting, you can describe anyone as the best at something, as long as it's specific enough. (This is one reason you should never believe those lists of best cities, best colleges, or best jobs. With the right metrics and weightings, which are often mostly arbitrary, anything can be #1.)
More generally, overfitting is when you fit a model too tightly to your data. Like, once when working on a model for predicting book ratings, I ran some automated feature selection algorithms and got rules out like, "If a book is published in December with more than 40 citations and between 1 and 5 reviews on Amazon, the user will give this book 5 stars."
That sort of thing is not going to generalize to new data; it's an irrelevant characteristic of the input data set.
It's all in the name, sort of. Overfitting occurs when you choose a model that's too specific to your training data set and doesn't generalize to other data, which is why you want the model in the first place. You want transfer.
Imagine that you coach people on mountain climbing and one day you're
hired by a client, Bob Overfit. He wants to learn to climb mountains, so you
both drive out to Eagle Mountain, and you show him the ropes, literally and
metaphorically.
A few weeks pass, and you get a call from Bob. Dude is irate. He tried to
go out to Mount Falcon with his pals, but kept stumbling and getting lost.
You were supposed to train him to climb mountains!
So you ask him a few questions, attempting to debug the problem. Finally, you ask him, "Bob, what exactly was your strategy for climbing Mount
Falcon?"
"Well, I closed my eyes and followed the path that I'd memorized for
climbing Eagle Mountain."
That's overfitting. Memorizing to climb one specific mountain isn't going
to help you climb all mountains, just like building a model too tightly around
one data set won't give you good performance on all data sets.
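Bob's strategy translates directly into code. Here's a toy sketch of a model that does nothing but memorize its training set (the even/odd task is invented for illustration):

```python
# The task: label a number as odd (1) or even (0).
def true_label(x):
    return x % 2

train = list(range(0, 100, 2))   # training data happens to be all even
test = list(range(1, 100, 2))    # new data is all odd

# Bob's "model": memorize every training example, guess 0 for anything else.
memorized = {x: true_label(x) for x in train}

def memorizing_model(x):
    return memorized.get(x, 0)

train_acc = sum(memorizing_model(x) == true_label(x) for x in train) / len(train)
test_acc = sum(memorizing_model(x) == true_label(x) for x in test) / len(test)
print(train_acc, test_acc)  # 1.0 on the training set, 0.0 on new data
```

Perfect on the mountain it memorized, hopeless everywhere else.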
To ensure that this didn't happen, I'd train with only one part of the data, and then I'd use the rest to benchmark my performance, being sure not to do that too often.
This ended up working well.
However, this is a pretty painful solution if you have a small data set (say
10,000 samples or less), because leaving out any data gives you less to build
an accurate model with.
To get around this, you can cross-validate your model.
In k-fold cross-validation, for instance, a data set is split into k subsets. One of those subsets is held out as a test set, while the model is trained on the rest. This is then repeated, rotating which subset is held out. By averaging the results, the hope is that you'll get a more accurate idea of how a model will perform in the wild.
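A bare-bones version of that split, assuming the data fits in a list (the helper is mine, not a library function):

```python
# Split data into k folds round-robin; yield (train, test) pairs,
# rotating which fold plays the role of the test set.
def k_fold_splits(data, k):
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, 5):
    # Every split uses all the data, with no overlap between train and test.
    assert sorted(train + test) == data
```

In practice you'd fit the model on each train portion, score it on the matching test portion, and average the k scores.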
The best thing you can do, though, really, is just avoid building models
that don't make any sense. If you've "discovered" that babies born on the
first Sunday of every month are more likely to be criminals, but you have
no compelling reason why, you've probably done something wrong.
This applies more broadly, too. When you hear about a new scientific study in the news that says something like, "People who wash their hands after eating lose more weight," be skeptical.
Use your brain. Don't overfit.
3. NOT UNDERSTANDING YOUR TOOLS
I once had a question about whether Student's t-test was the right test for me to be using. I spent some time on Google and even asked a few people. This proved to be wholly unnecessary. Sitting down for 15 minutes and grokking the mathematical definition was enough to convince me that it was the right tool and that I should continue.
Do the same thing. Sit down, figure it out, and move on with confidence.
If you're worried about understanding, use the Feynman method, which I'll
cover in a future email.
4. HYPOTHESIS MINING
In the introduction, I mentioned the mysterious case of gold echolocation.
The culprit was that many people were testing something, but only those that
had fluke positive results were reported.
Data mining works by the same mechanism but, instead of many people testing one thing, one person tests many things.
This is accurately summed up by xkcd's "Significant" comic, the one where testing 20 colors of jelly beans turns up a "significant" link for green.
See, the issue is when you mix up exploratory analysis and actual hypothesis testing. In general, your process should be along the lines of:
1. Hm, I wonder if x is true. Here's how I could test it.
2. Collect relevant data.
3. Test it.
Don't test a ton of things on one data set and then only report that which
is statistically significant. This isn't how it works. Just by chance alone, you'll
end up reporting false things.
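You can watch this happen in simulation. The sketch below runs 100 "experiments" in which the true effect is always zero (the setup and numbers are mine, a crude z-test on two equal groups):

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# One null experiment: compare two groups drawn from the SAME
# distribution, so any "significant" difference is a false positive.
def null_experiment(n=50):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    mean_diff = sum(a) / n - sum(b) / n
    se = (2 / n) ** 0.5                  # std. error of the difference (sigma = 1)
    return abs(mean_diff / se) > 1.96    # "significant" at p < .05

hits = sum(null_experiment() for _ in range(100))
print(hits, "of 100 true-null tests came out 'significant'")
```

Expect somewhere around 5 hits; that's what a .05 threshold buys you per test, and it's why a Bonferroni-style correction divides the threshold by the number of tests you run.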
5. UNREPRESENTATIVE SAMPLES
The point of statistics is inference. You want to take a sample of, say, 35
people, and be able to learn something about everyone.
This means you need to take a representative sample. If you don't, your
results will be worthless (or even harmful).
My favorite example of this is Saturday Night Live's "Weekend Update," one of those fusions of news and humor that is quickly becoming too common.
Anyway, one night they had this to say: "A recent poll showed that 58% of Americans now favor the legalization of marijuana. Of course, this was 58% of the people who were at home in the middle of the day to answer a telephone poll."
The implication: those who answered the poll weren't a random sample
and, thus, can't reflect the opinions of Americans as a whole.
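A quick simulation makes the distortion concrete. All the numbers below are invented for illustration:

```python
import random

random.seed(1)

# A made-up population: 30% of people are home during the day, and that
# group supports legalization at a higher rate than everyone else.
population = (
    [{"home": True, "supports": random.random() < 0.58} for _ in range(3000)]
    + [{"home": False, "supports": random.random() < 0.40} for _ in range(7000)]
)

def support_rate(people):
    return sum(p["supports"] for p in people) / len(people)

true_rate = support_rate(population)
daytime_poll = [p for p in population if p["home"]]  # who a daytime call reaches

# The poll overstates support because it only reaches one subgroup.
print(round(true_rate, 2), round(support_rate(daytime_poll), 2))
```

The daytime poll reports roughly 58% support while the population as a whole sits around 45%; no amount of extra phone calls to the same subgroup fixes that.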
This problem is not limited to Gallup polls.
Psychological science reeks of it. So much so that we have an acronym: WEIRD, because participants overwhelmingly come from Western, educated, industrialized, rich, and democratic societies.
Indeed, it's even worse, because the sample is overwhelmingly college kids who are doing it for extra credit.
A similar problem plagues businesses. I had a chance to hear a presentation by one of FlightCar's founders, and he mentioned the problem with surveys: only the people who love you and the people who hate you leave them. If you want responses from people in the middle, you'll have to do something else.
So, okay. Those are unrepresentative samples. If you fuck up your sample, your results will be useless. Garbage in, garbage out.
So how do you prevent this?
Think it through in advance. Ask yourself, "If my sampling was wrong, what will have been the problem?" (This is sometimes called a pre-mortem.)
A pretty typical problem with calling people is that there are certain subsets who just won't answer. So you have to come up with some way of getting
in touch with this subset. Keep calling, letters, email, knocking on doors. I
don't know. But you'll have to figure it out.
PUTTING IT ALL TOGETHER
Here's the executive summary, then.
Not plotting the data. The human visual system rocks. Use it. Learn
how to chart data. Read this book, absorb it, love it, and visualize.
Overfitting. Overfitting occurs when a model is trained too closely to the training set, such that its performance doesn't generalize to new data. It's like learning to climb mountains by memorizing the route up one mountain. To prevent this, test your model against a holdout set or use cross-validation.
Not understanding your tools. If you use a statistical model without
understanding how it functions, you will fuck it up. If you're confused, sit
down and figure it out. Use the Feynman method, which I'll cover in a future
email.
Data mining instead of hypothesis testing. If you run many experiments on one data set, you need to correct with something like a Bonferroni
correction. Pick one hypothesis and test it. Don't go exploring.
Taking an unrepresentative sample. If you want to say something about all Americans, you need to take a sample representing all Americans, not people taking a psychology class on your campus or similar. Fix this by thinking long and hard about potential issues with your sampling mechanism.
That's all. Go forth and produce error-free statistics! Find gold.