P-hacking
Part of the problem is due to abuse of the p-test, a widely used significance test introduced by the British statistician Ronald
Fisher in the 1920s. It assesses whether the results of an experiment are more extreme than what one would expect under the null
hypothesis. The smaller this p-value is, argued Fisher, the greater the likelihood that the null hypothesis is false. The p-test, used
alone, has significant drawbacks. To begin with, the typically used level of p = 0.05 is not a particularly compelling result. In any
event, it is highly questionable to reject a result because its p-value is 0.051, yet accept a result as significant because its p-value is 0.049.
The prevalence of the classic p = 0.05 threshold has led to the egregious practice that Uri Simonsohn of the University of
Pennsylvania has termed p-hacking: testing numerous varied hypotheses until a researcher finds one that meets the 0.05
level. Note that this is a classic instance of the multiple testing fallacy of statistics: perform enough tests and one is bound to pass any specified
level of statistical significance. Such suspicions are justified given the results of a study by Jelte Wicherts of the University of
Amsterdam, who found that researchers whose results were close to the p = 0.05 level of significance were less willing to share
their original data than were others whose results had stronger significance levels (see also this summary from Psychology Today).
For additional details and discussion, see this Math Scholar blog.
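The multiple testing fallacy is easy to reproduce numerically. The sketch below (an illustration with assumed parameters: 200 hypotheses, two groups of 30 observations each, a standard two-sample t-test) tests many hypotheses that are all truly null; by chance alone, roughly 5% of them clear the p < 0.05 bar:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

n_hypotheses = 200
false_positives = 0
for _ in range(n_hypotheses):
    # Both groups are drawn from the SAME distribution: no real effect exists.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    if ttest_ind(group_a, group_b).pvalue < 0.05:
        false_positives += 1

print(f"{false_positives} of {n_hypotheses} truly null hypotheses "
      f"were 'significant' at p < 0.05")
```

A researcher who runs such a scan and reports only the "significant" findings, without disclosing the other tests, is p-hacking.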
In 2014, one of the present authors and co-authors called attention to the pervasiveness of backtest overfitting in finance.
Indeed, backtest overfitting is now thought to be a principal reason why investment funds and strategies that look good on paper
often fail in practice: the impressive performance evident in backtest studies is, to put it mildly, not reproduced when
the fund or strategy is fielded in practice.
Some of us also demonstrated, in a 2017 JOIM paper, how easy it is to design a stock fund, based on backtests, that achieves
virtually any desired goal (e.g., steady growth of 1% per month, month after month, for ten or more years). However,
when such designs are presented with new data (as if the fund or strategy were actually fielded), they prove to be very brittle, often failing
catastrophically or, at the least, utterly failing to achieve their stated goal. The reason is, once again, backtest overfitting.
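The phenomenon can be illustrated in a few lines of code. The sketch below (synthetic data and hypothetical parameter choices, not any author's actual methodology) backtests thousands of random portfolio weightings on pure noise, keeps the in-sample winner, and then applies the winning weights to fresh data, where the apparent edge evaporates:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "market": 20 assets of pure Gaussian noise, four years of
# daily returns. By construction, no asset has any real edge.
n_assets, in_days, out_days = 20, 252, 756
returns = rng.normal(0.0, 0.01, size=(n_assets, in_days + out_days))
in_sample, out_of_sample = returns[:, :in_days], returns[:, in_days:]

def sharpe(r):
    """Annualized Sharpe ratio of a daily return series."""
    return r.mean() / r.std(ddof=1) * np.sqrt(252)

# "Backtest" many random long-only weightings; keep the in-sample winner.
best_w, best_sr = None, -np.inf
for _ in range(10_000):
    w = rng.dirichlet(np.ones(n_assets))
    sr = sharpe(w @ in_sample)
    if sr > best_sr:
        best_w, best_sr = w, sr

print("best in-sample Sharpe ratio:", round(best_sr, 2))
print("same weights, out of sample:", round(sharpe(best_w @ out_of_sample), 2))
```

The in-sample winner looks attractive only because it was selected from thousands of candidates; its out-of-sample Sharpe ratio hovers near the true value of zero.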
There is no indication that the situation has improved since these early studies. If anything, the situation has worsened with the
ever-increasing prevalence of computer-based designs for investment funds and strategies. This is because it is an increasingly
simple matter to explore thousands, millions or even billions of alternative component weightings or parameter settings for an
investment fund or strategy, and select only the highest-scoring combination to be fielded. As we have shown (see here for
instance), such computer explorations (typically never disclosed to editors, readers or customers) render the resulting study or
financial product hopelessly overfit and, in many cases, subject to catastrophic failure.
As one of the present authors and a colleague have found, most investment strategies uncovered by practitioners and
academics are false.
In a 2018 Forbes interview, two of the present authors, interviewed by Brett Steenbarger, discussed this growing crisis:
Imagine that a pharmaceutical company develops 1000 drugs and tests these on 1000 groups of volunteer
patients. When a few dozen of the tests prove “significant” at the .05 level of chance, those medications are
marketed as proven remedies. Believing the “scientific tests”, patients flock to the new wonder drugs, only to
find that their conditions become worse as the medications don’t deliver the expected benefit. Some
consumers become quite ill and several die.
Clearly, there would be a public outcry over such deceptive practice. Indeed, that is precisely the reason we
have a regulatory agency and laws to help ensure that medications have been properly tested before they are
offered to the public. … [But] no such protections are offered to financial consumers, leaving them vulnerable
to unproven investment strategies. … These false positives are particularly misleading, as they are promoted
by researchers with seemingly impeccable research backgrounds—and who do not employ the scientific tools
needed to detect such false findings.
After all, both p-hacking and backtest overfitting are instances of the multiple testing fallacy: if one performs enough tests or
explores enough hypotheses on a single dataset, one is certain to find a test or hypothesis that is successful to any pre-specified
level of statistical significance.
https://mathinvestor.org/2019/04/p-hacking-and-backtest-overfitting/ 2/3
5/1/2019 P-hacking and backtest overfitting « Mathematical Investor
The y-axis of the accompanying plot displays the distribution of the maximum Sharpe ratio (max{SR}) for a given number of trials
(x-axis). A lighter color indicates a higher probability of obtaining that result, and the dashed line indicates the expected value. For
example, after only 1,000 backtests, the expected maximum Sharpe ratio (E[max{SR}]) is 3.26, even though the true Sharpe
ratio of the strategy is zero. How is this possible?
The reason, of course, is backtest overfitting, or, in other words, selection bias under multiple testing.
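A back-of-the-envelope Monte Carlo check reproduces this selection-bias effect. The sketch below uses assumed parameters (i.i.d. Gaussian daily returns, 252 observations per backtest, so each annualized Sharpe estimate is approximately standard normal); for 1,000 trials it lands close to the 3.26 figure quoted above:

```python
import numpy as np

rng = np.random.default_rng(7)

def expected_max_sharpe(n_trials, n_days=252, n_sims=200):
    """Monte Carlo estimate of E[max{SR}] across n_trials independent
    backtests whose true Sharpe ratio is zero."""
    maxima = np.empty(n_sims)
    for i in range(n_sims):
        # n_trials strategies, one year of daily returns each, no edge.
        r = rng.normal(0.0, 0.01, size=(n_trials, n_days))
        sr = r.mean(axis=1) / r.std(axis=1, ddof=1) * np.sqrt(n_days)
        maxima[i] = sr.max()
    return maxima.mean()

for n in (10, 100, 1000):
    print(f"{n:>5} trials: E[max SR] ~ {expected_max_sharpe(n):.2f}")
```

The expected maximum grows with the number of trials even though every strategy is worthless, which is exactly why an undisclosed trial count makes a reported Sharpe ratio uninterpretable.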
By the way, this plot is an experimental verification of the "False Strategy" theorem, first proven in this 2014 NAMS paper. This
theorem essentially states that, unless max{SR} (the maximum Sharpe ratio) is much greater than E[max{SR}] (the expected
value of the maximum Sharpe ratio), the discovered strategy is likely to be a false positive. Moreover, the theorem is notable for
providing a closed-form estimate of the rising hurdle that the researcher must beat as he or she conducts more backtests. The
plot confirms that this estimated hurdle (the dashed line) is quite precise over a wide range of trial counts (in the plot, between 2 and
1,000,000).
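The closed-form hurdle can be evaluated directly. The sketch below uses the approximation E[max{SR}] ~ (1 - gamma) * Z^-1(1 - 1/N) + gamma * Z^-1(1 - 1/(N*e)), where Z^-1 is the standard normal quantile function and gamma is the Euler-Mascheroni constant, assuming N independent trials with zero true Sharpe ratio and unit variance (the function name is ours, for illustration):

```python
from math import e
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def false_strategy_hurdle(n_trials):
    """Approximate E[max{SR}] over n_trials independent backtests with
    true Sharpe ratio zero, per the closed-form estimate described above."""
    z = NormalDist().inv_cdf  # standard normal quantile function
    return ((1 - EULER_GAMMA) * z(1 - 1 / n_trials)
            + EULER_GAMMA * z(1 - 1 / (n_trials * e)))

for n in (10, 100, 1000, 1_000_000):
    print(f"{n:>9} trials -> hurdle ~ {false_strategy_hurdle(n):.2f}")
```

For 1,000 trials this evaluates to roughly the 3.26 hurdle cited earlier, and the hurdle keeps rising, slowly, as the trial count grows.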
In the Forbes article (mentioned above) on the growing crisis in modern finance, one of us argues that additional government
regulation and oversight may be required, in particular to address the extent to which retail investors are not aware of the
potential statistical pitfalls in the design of investment products.
But in the end, the only long-term solution is education: all researchers and investment practitioners in finance need to be
rigorously trained in modern statistics and in how best to use these tools. Special attention should be paid to showing how
statistical tests can mislead when used naively. Note that this education is needed not only for students and others
entering the workforce, but also for those who are already practitioners in the field. This will not be easy, but it must be done.
The American Statistical Association, in a 2019 statement, after recommending that p-values not be used except in rigorously
analyzed contexts, concludes:
Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good
study design and conduct, a variety of numerical and graphical summaries of data, understanding of the
phenomenon under study, interpretation of results in context, complete reporting and proper logical and
quantitative understanding of what data summaries mean. No single index should substitute for scientific
reasoning.
No royal road
Such considerations bring to mind a historical anecdote about the great Greek mathematician Euclid. According to an ancient
account, when Pharaoh Ptolemy I of Egypt grew frustrated at the degree of effort required to master geometry, he asked his
tutor Euclid whether there was some easier path. Euclid is said to have replied, "There is no royal road to geometry."
The same is true for mathematical finance: there is no “royal road” to reliable, reproducible, statistically rigorous financial
analysis, particularly in an era of big data. Those researchers who learn how to deal effectively with this data, producing
statistically robust results, will lead the future. Those who do not will be left behind.