

P-hacking and backtest overfitting


The reproducibility crisis in science
Recent public reports have underscored a crisis of reproducibility in
numerous fields of science. Here are just a few recent cases that have
attracted widespread publicity:

- In 2012, Amgen researchers reported that they were able to reproduce fewer than 10 of 53 cancer studies.
- In 2013, in the wake of numerous recent instances of highly touted pharmaceutical products failing or disappointing when fielded, researchers in the field began promoting the All Trials movement, which would require participating firms and researchers to post the results of all trials, successful or not.
- In March 2014, physicists announced with great fanfare that they had detected evidence of primordial gravitational waves from the “inflation” epoch shortly after the big bang. However, other researchers subsequently questioned this conclusion, arguing that the twisting patterns in the data could be explained more easily by dust in the Milky Way.
- In 2015, in a study by the Reproducibility Project, only 39 of 100 psychology studies could be replicated, even after taking extensive steps such as consulting with the original authors.
- Also in 2015, a study by the U.S. Federal Reserve was able to reproduce only 29 of 67 economics studies.
- In an updated 2018 study by the Reproducibility Project, only 14 out of 28 classic and contemporary psychology experimental studies were successfully replicated.
- In 2018, the Reproducibility Project was able to replicate only five of ten key studies in cancer research, with three inconclusive and two negative; eight more studies are in the works but incomplete.

P-hacking
Part of the problem is due to abuse of the p-test, a widely used significance test introduced by the British statistician Ronald
Fisher in the 1920s. It assesses whether the results of an experiment are more extreme than what one would expect given the null
hypothesis. The smaller this p-value is, argued Fisher, the greater the likelihood that the null hypothesis is false. The p-test, used
alone, has significant drawbacks. To begin with, the typically used threshold of p = 0.05 is not a particularly compelling result. In any
event, it is highly questionable to reject a result if its p-value is 0.051, yet accept it as significant if its p-value is 0.049.
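
As a concrete illustration of what a p-value measures, here is a minimal hypothetical Python sketch (the data and the choice of test are ours, purely for illustration): it runs a one-sample t-test on data simulated under the null hypothesis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Data generated under the null hypothesis: the true mean really is zero.
sample = rng.normal(loc=0.0, scale=1.0, size=100)

# One-sample t-test of H0: mean == 0.  The p-value is the probability, under
# H0, of observing a result at least this extreme.
t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}, 'significant' at 0.05: {p_value < 0.05}")
```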

The prevalence of the classic p = 0.05 value has led to the egregious practice that Uri Simonsohn of the University of
Pennsylvania has termed p-hacking: proposing numerous varied hypotheses until a researcher finds one that meets the 0.05
level. Note that this is a classic multiple testing fallacy of statistics: perform enough tests and one is bound to pass any specific
level of statistical significance. Such suspicions are justified given the results of a study by Jelte Wicherts of the University of
Amsterdam, who found that researchers whose results were close to the p = 0.05 level of significance were less willing to share
their original data than were others whose results had stronger significance levels (see also this summary from Psychology Today).
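
The multiple-testing fallacy is easy to verify by simulation. In the hypothetical Python sketch below (the counts and the choice of test are our own, for illustration only), every "hypothesis" is pure noise, yet roughly 5% of them clear the p = 0.05 bar anyway; those are precisely the false positives a p-hacker harvests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_hypotheses = 1_000   # hypothetical number of variations a determined researcher tries
n_obs = 100

false_positives = 0
for _ in range(n_hypotheses):
    # Every dataset is pure noise, so the null hypothesis is true by construction.
    data = rng.normal(loc=0.0, scale=1.0, size=n_obs)
    _, p_value = stats.ttest_1samp(data, popmean=0.0)
    if p_value < 0.05:
        false_positives += 1

# With 1,000 truly null hypotheses, about 50 "significant" results are expected.
print(f"{false_positives} of {n_hypotheses} null hypotheses passed p < 0.05")
```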

For additional details and discussion, see this Math Scholar blog.

A statistical crisis in finance


Before researchers and practitioners in the mathematical finance field start feeling smug at these reports, it must be emphasized
that finance has its own problems in this area, which the present authors argue are even more severe than those in psychology,
sociology, economics, biomedicine and physics.

In 2014, one of the present authors and co-authors called attention to the pervasiveness of backtest overfitting in finance.
Indeed, backtest overfitting is now thought to be a principal reason why investment funds and strategies that look good on paper
often fail in practice — the impressive performance that was evident in backtest studies is not reproduced, to put it mildly, when
the fund or strategy is fielded in practice.

Some of us also demonstrated, in a 2017 JOIM paper, how easy it is to design a stock fund, based on backtests, that achieves
virtually any desired goal (e.g., steady growth of 1% per month, month after month, for ten or more years). However,
when such designs are presented with new data (as if one were actually fielded), they prove to be very brittle, often failing
catastrophically, or, at the least, utterly failing to achieve their stated goal. The reason is, once again, backtest overfitting.

There is no indication that the situation has improved since these early studies. If anything, the situation has worsened with the
ever-increasing prevalence of computer-based designs for investment funds and strategies. This is because it is an increasingly
simple matter to explore thousands, millions or even billions of alternative component weightings or parameter settings for an
investment fund or strategy, and select only the highest-scoring combination to be fielded. As we have shown (see here for
instance), such computer explorations (typically never disclosed to editors, readers or customers) render the resulting study or
financial product hopelessly overfit and, in many cases, subject to catastrophic failure.
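
To make that concrete, here is a minimal, purely hypothetical sketch (the trial counts, return model and helper function are our own choices, not taken from the papers cited above): many candidate "strategies" with zero true edge are backtested on the same data, the best in-sample performer is selected, and its apparent edge collapses on fresh out-of-sample data.

```python
import numpy as np

rng = np.random.default_rng(7)
n_trials = 5_000                  # hypothetical number of parameter combinations explored
n_days_in = n_days_out = 1_250    # roughly five years in-sample and out-of-sample

def annualized_sharpe(daily_returns):
    # Simple annualized Sharpe ratio estimate (252 trading days per year).
    return np.sqrt(252) * daily_returns.mean() / daily_returns.std(ddof=1)

# Every candidate "strategy" is pure noise: zero true edge by construction.
in_sample = rng.normal(0.0, 0.01, size=(n_trials, n_days_in))
out_of_sample = rng.normal(0.0, 0.01, size=(n_trials, n_days_out))

in_sharpes = np.apply_along_axis(annualized_sharpe, 1, in_sample)
best = int(in_sharpes.argmax())   # the combination that would be selected and fielded

print(f"Best in-sample Sharpe    : {in_sharpes[best]:.2f}")
print(f"Its out-of-sample Sharpe : {annualized_sharpe(out_of_sample[best]):.2f}")
```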

As one of the present authors and a colleague have found, most investment strategies uncovered by practitioners and
academics are false.

In a 2018 Forbes interview, two of the present authors, interviewed by Brett Steenbarger, discussed this growing crisis:

Imagine that a pharmaceutical company develops 1000 drugs and tests these on 1000 groups of volunteer
patients. When a few dozen of the tests prove “significant” at the .05 level of chance, those medications are
marketed as proven remedies. Believing the “scientific tests”, patients flock to the new wonder drugs, only to
find that their conditions become worse as the medications don’t deliver the expected benefit. Some
consumers become quite ill and several die.

Clearly, there would be a public outcry over such deceptive practice. Indeed, that is precisely the reason we
have a regulatory agency and laws to help ensure that medications have been properly tested before they are
offered to the public. … [But] no such protections are offered to financial consumers, leaving them vulnerable
to unproven investment strategies. … These false positives are particularly misleading, as they are promoted
by researchers with seemingly impeccable research backgrounds—and who do not employ the scientific tools
needed to detect such false findings.

P-hacking and backtest overfitting


Is there a connection between p-hacking and backtest overfitting? You bet there is.

After all, both p-hacking and backtest overfitting are instances of the multiple testing fallacy: if one performs enough tests or
explores enough hypotheses on a single dataset, one is certain to find a test or hypothesis that is successful to any pre-specified
level of statistical significance.

This is perhaps best illustrated by the following graph. As we argued in an earlier Mathematical Investor blog, every investor
should be familiar with it. If one understands this plot well, it will prevent a lot of misery. If one ignores it, the result may be
substantial financial loss.

Suppose that you are an analyst for a large investment firm, and you have an idea for a new strategy. You run backtests on a
moderately large number of parameter combinations, and find that a few of these parameterized strategies achieve high Sharpe
ratios, some of them above 3 (which is typically taken as a rather promising result). You show these results to your boss, who
performs independent tests that confirm your result. The investment committee then approves the deployment of what appears
to be the most promising strategy, and the strategy receives a $100 million allocation. Unfortunately a 20% loss follows shortly
after. The strategy never fully recovers, and it is eventually dismissed, along with its author. What happened?

The y-axis of the plot displays the distribution of the maximum Sharpe ratio (max{SR}) for a given number of trials
(x-axis). A lighter color indicates a higher probability of obtaining that result, and the dashed line indicates the expected value. For
example, after only 1,000 backtests, the expected maximum Sharpe ratio (E[max{SR}]) is 3.26, even though the true Sharpe
ratio of the strategy is zero. How is this possible?

The reason, of course, is backtest overfitting, or, in other words, selection bias under multiple testing.
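
That selection effect is easy to reproduce with a minimal Monte Carlo sketch, under the simplifying assumption (ours, chosen to match the plot) that each trial's estimated Sharpe ratio is an independent standard normal draw when the true Sharpe ratio is zero; at 1,000 trials the simulated expectation comes out close to the 3.26 quoted above.

```python
import numpy as np

rng = np.random.default_rng(123)
n_experiments = 2_000   # Monte Carlo repetitions per trial count

# Assumption (ours): under the null, each backtest's estimated Sharpe ratio
# is an independent standard normal draw, as in the article's plot.
for n_trials in (2, 10, 100, 1_000):
    estimated_sharpes = rng.standard_normal((n_experiments, n_trials))
    expected_max = estimated_sharpes.max(axis=1).mean()
    print(f"trials = {n_trials:>5}:  E[max SR] approx {expected_max:.2f}")
```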

By the way, this plot is an experimental verification of the “False Strategy” theorem, first proven in this 2014 NAMS paper. This
theorem essentially states that, unless max{SR} (the maximum Sharpe ratio) is much greater than E[max{SR}] (the expected
value of the maximum Sharpe ratio), the discovered strategy is likely to be a false positive. Moreover, the theorem is notable for
providing a closed-form estimate of the rising hurdle that the researcher must beat as he or she conducts more backtests. The
plot confirms that this estimated hurdle (the dashed line) is quite precise over a wide range of trial counts (in the plot, between 2 and
1,000,000).
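
For readers who want the hurdle itself, the snippet below evaluates the closed-form approximation as we read it from the cited paper (an expected maximum over independent Sharpe-ratio estimates, here assumed to have unit variance under the null); at N = 1,000 it returns roughly the 3.26 value shown in the plot.

```python
import numpy as np
from scipy.stats import norm

EULER_MASCHERONI = 0.5772156649

def expected_max_sharpe(n_trials, sharpe_std=1.0):
    """Approximate E[max{SR}] over n_trials backtests with true Sharpe ratio zero.

    Our reading of the closed-form hurdle in the False Strategy theorem;
    sharpe_std is the standard deviation of the Sharpe ratio estimates.
    """
    g = EULER_MASCHERONI
    return sharpe_std * ((1 - g) * norm.ppf(1 - 1 / n_trials)
                         + g * norm.ppf(1 - 1 / (n_trials * np.e)))

for n in (10, 100, 1_000, 1_000_000):
    print(f"N = {n:>9,}:  hurdle approx {expected_max_sharpe(n):.2f}")
```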

Ending the crisis


Fortunately, there are some specific tools that can be used to prevent the deleterious effects of backtest overfitting and related
multiple testing errors. For example, this 2014 JPM paper provides some solid ways to detect and prevent false discoveries.
Also, this 2017 JCF paper provides a theoretical framework to calculate the probability of backtest overfitting. Many of these
techniques and others are discussed in the recently published book Advances in Financial Machine Learning.
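
One tool in that family is the deflated Sharpe ratio, which restates an observed Sharpe ratio as the probability that it exceeds the hurdle expected from multiple trials alone. The sketch below follows our reading of the probabilistic Sharpe ratio formula in the works cited above; the variable names and test inputs are ours and purely illustrative.

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns, sr_benchmark):
    """Probability that the true Sharpe ratio exceeds sr_benchmark.

    A sketch of the probabilistic Sharpe ratio as we read it from the cited
    papers; 'returns' are per-period (non-annualized) observations.
    """
    sr_hat = returns.mean() / returns.std(ddof=1)
    n = len(returns)
    g3 = skew(returns)                     # sample skewness of the returns
    g4 = kurtosis(returns, fisher=False)   # non-excess kurtosis
    denom = np.sqrt(1.0 - g3 * sr_hat + (g4 - 1.0) / 4.0 * sr_hat ** 2)
    return norm.cdf((sr_hat - sr_benchmark) * np.sqrt(n - 1) / denom)

# Hypothetical example: five years of daily returns with a small positive drift.
rng = np.random.default_rng(1)
daily_returns = rng.normal(0.0005, 0.01, size=1_250)

# Deflating: compare against a zero benchmark and against an illustrative
# multiple-testing hurdle expressed in the same per-period units.
print(f"PSR vs SR* = 0.00: {probabilistic_sharpe_ratio(daily_returns, 0.00):.2f}")
print(f"PSR vs SR* = 0.10: {probabilistic_sharpe_ratio(daily_returns, 0.10):.2f}")
```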

In the Forbes article (mentioned above) on the growing crisis in modern finance, one of us argues that additional government
regulation and oversight may be required, in particular to address the extent to which retail investors are not aware of the
potential statistical pitfalls in the design of investment products.

But in the end, the only long-term solution is education — all researchers and investment practitioners in finance need to be
rigorously trained in modern statistics and how best to use these tools. Special attention should be paid to showing how
statistical tests can mislead when used naively. Note that this education needs to be done not only for students and others
entering the work force, but also for those who are already practitioners in the field. This will not be easy but must be done.

The American Statistical Association, in a 2019 statement, after recommending that p-values not be used except in rigorously
analyzed contexts, concludes:

Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good
study design and conduct, a variety of numerical and graphical summaries of data, understanding of the
phenomenon under study, interpretation of results in context, complete reporting and proper logical and
quantitative understanding of what data summaries mean. No single index should substitute for scientific
reasoning.

No royal road
Such considerations bring to mind a historical anecdote about the great Greek mathematician Euclid. According to an ancient
account, when Pharaoh Ptolemy I of Egypt grew frustrated at the degree of effort required to master geometry, he asked his
tutor Euclid whether there was some easier path. Euclid is said to have replied, “There is no royal road to geometry.”

The same is true for mathematical finance: there is no “royal road” to reliable, reproducible, statistically rigorous financial
analysis, particularly in an era of big data. Those researchers who learn how to deal effectively with this data, producing
statistically robust results, will lead the future. Those who do not will be left behind.

April 23rd, 2019 | Category: Essays
