Methodology · 17 May 2026 · 9 min read

Why most backtests overstate edge — and what to do about it

The predictable reasons that backtested performance doesn't translate to live trading, and the statistical tools that help separate real edge from noise.

The typical retail backtesting workflow goes like this: you get an idea, you code it up, you run it against five years of data, you get a Sharpe ratio of 1.8, and you think you've found something. Then you go live and the performance is nothing like the backtest. The strategy loses money for three months and you abandon it.

This is not bad luck. It is the predictable outcome of a methodology that almost guarantees overfitting. Understanding why requires understanding what a backtest actually measures — and what it doesn't.

A backtest is not a prediction

A backtest tells you how a strategy would have performed on the specific historical data you tested it on, with the specific parameters you chose. That's all. It says nothing about the future unless two things hold: the conditions that produced those returns are likely to repeat, and the parameters you chose were not selected because they happened to fit this particular dataset.

Both of those conditions are almost never satisfied in practice.

The degrees of freedom problem

Suppose you have five years of daily data — about 1,260 observations. Suppose your strategy has three parameters, each with ten candidate values. That's 1,000 combinations. But here's the thing: the amount of independent information in 1,260 daily returns is much less than it seems. Consecutive days are correlated. Regime periods cluster. The effective sample size, after accounting for serial correlation, might be closer to 300–400 independent observations.
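To make the effective sample size idea concrete, here is a minimal sketch using one common approximation: divide the raw count by 1 + 2 × (sum of autocorrelations). The function names, the 20-lag cutoff, and the synthetic AR(1) series with persistence 0.5 are all assumptions chosen for illustration. Real daily returns are usually far less autocorrelated in levels than this toy series, so treat the output as a demonstration of the mechanics rather than an estimate for any market.

```python
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[i] - mean) * (x[i + lag] - mean) for i in range(n - lag))
    return cov / var

def effective_sample_size(x, max_lag=20):
    """Approximate n_eff = n / (1 + 2 * sum of autocorrelations up to max_lag).

    max_lag=20 is an arbitrary cutoff picked for this sketch.
    """
    n = len(x)
    rho_sum = sum(autocorrelation(x, k) for k in range(1, max_lag + 1))
    # Clamp so the function never reports more observations than it was given.
    inflation = max(1.0 + 2.0 * rho_sum, 1.0)
    return n / inflation

# Synthetic example: 1,260 AR(1) values standing in for serially correlated data.
random.seed(0)
phi, prev, series = 0.5, 0.0, []
for _ in range(1260):
    prev = phi * prev + random.gauss(0.0, 0.01)
    series.append(prev)

n_eff = effective_sample_size(series)
print(f"raw n = {len(series)}, effective n = {n_eff:.0f}")
```

For an AR(1) process the theoretical ratio is (1 − φ) / (1 + φ), so a persistence of 0.5 leaves roughly a third of the raw sample as independent information.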

You're fitting a model with many degrees of freedom to a dataset with few effective observations. The model will find patterns. Most of them will be noise.

The number of trials you've run isn't just the number of combinations in your most recent optimisation sweep. Every time you've looked at historical data, formed an impression, and modified your strategy in response, you've consumed degrees of freedom. Backtesting is cheaper than trading, so people do far more of it than they account for.

Why the Sharpe ratio alone can't save you

The naive response to overfitting is “just require a high Sharpe ratio.” This doesn't work, for a simple reason: the Sharpe ratio of the best result from N trials increases with N even when all strategies are pure noise. Test 100 random strategies on the same historical data and the best of them will show an impressive Sharpe ratio. That figure is a statistical artifact of the selection process, not evidence of edge.

E[max Sharpe from N noise strategies] ≈ √(2 × ln N)

Here the maximum is measured in standard errors of the Sharpe estimate, and √(2 × ln N) is an upper-bound approximation to it. For N = 100, that's approximately 3.0 standard errors. For N = 1,000, it's approximately 3.7. If your strategy has a “great” Sharpe ratio after extensive optimisation, the question isn't whether the Sharpe is high. It's whether it's high enough to distinguish real edge from the expected maximum of noise.
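The selection effect is easy to check numerically. The sketch below compares √(2 × ln N) with a small Monte Carlo estimate of the expected maximum of N standard normal draws, which stand in for the standardised Sharpe estimates of N noise strategies. The function names are hypothetical, and the conversion to an annualised Sharpe assumes independent returns and a true Sharpe of zero, so that the standard error of the annualised estimate is roughly 1/√years.

```python
import math
import random

def sqrt_2_ln_n(n_trials):
    """Approximation from the text, in standard-error units."""
    return math.sqrt(2.0 * math.log(n_trials))

def simulated_expected_max(n_trials, n_reps=500, seed=0):
    """Monte Carlo mean of the largest of n_trials standard normal draws."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n_trials))
    return total / n_reps

def noise_sharpe_annualised(n_trials, years):
    """Convert the approximation to an annualised Sharpe.

    Assumes iid returns and zero true Sharpe, so SE is about 1 / sqrt(years).
    """
    return sqrt_2_ln_n(n_trials) / math.sqrt(years)

for n in (10, 100, 1000):
    print(
        f"N={n:5d}  sqrt(2 ln N)={sqrt_2_ln_n(n):.2f}  "
        f"simulated={simulated_expected_max(n):.2f}  "
        f"annualised over 5y={noise_sharpe_annualised(n, 5):.2f}"
    )
```

The simulated values sit below the closed-form approximation, which is why it is best read as a ceiling on what pure selection is expected to produce.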

What actually helps

The Deflated Sharpe Ratio (DSR) adjusts for this explicitly. It estimates the probability that your observed Sharpe ratio reflects genuine edge rather than selection from many trials. To compute it properly, you need to count your trials honestly, including the informal ones.
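As a rough illustration of how the adjustment works, the sketch below follows the commonly published form of the DSR: compute the Sharpe ratio you would expect from the best of N noise trials, then ask how likely it is that the true Sharpe exceeds that benchmark. It is not Kestrel Signal's implementation. The function names are hypothetical, Sharpe ratios are per period (daily here, not annualised), and the example assumes normal returns and a cross-trial Sharpe standard deviation of 1/√n_obs.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()

def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew=0.0, kurt=3.0):
    """Probability that the true per-period Sharpe exceeds sr_benchmark."""
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return _NORMAL.cdf(z)

def expected_max_sharpe(n_trials, sr_std):
    """Expected best per-period Sharpe among n_trials noise strategies.

    sr_std is the standard deviation of Sharpe estimates across trials.
    """
    if n_trials < 2:
        raise ValueError("n_trials must be at least 2")
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sr, n_obs, n_trials, sr_std, skew=0.0, kurt=3.0):
    """PSR measured against the expected maximum of noise, not against zero."""
    benchmark = expected_max_sharpe(n_trials, sr_std)
    return probabilistic_sharpe(sr, benchmark, n_obs, skew, kurt)

# Example: annualised Sharpe of 1.8 over 1,260 days, after 1,000 trials.
n_obs, n_trials = 1260, 1000
daily_sr = 1.8 / math.sqrt(252)
sr_std = 1.0 / math.sqrt(n_obs)  # assumption: spread of Sharpe under pure noise

psr = probabilistic_sharpe(daily_sr, 0.0, n_obs)
dsr = deflated_sharpe(daily_sr, n_obs, n_trials, sr_std)
print(f"PSR vs zero = {psr:.4f}, DSR after {n_trials} trials = {dsr:.4f}")
```

Under these assumptions the same backtest looks close to certain when measured against zero and noticeably less convincing once the trial count is taken into account. The result is only as honest as the value passed for n_trials.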

Walk-forward analysis provides a less biased performance estimate by separating the data you fit on from the data you test on. It doesn't eliminate overfitting, but it makes it harder to hide.
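The mechanics of walk-forward analysis are simple enough to sketch. The generator below is a hypothetical helper that produces rolling train/test index windows; the window sizes in the example (two years to fit, six months to test) are assumptions chosen for illustration. Parameters are refit on each training window and evaluated only on the test window that follows it.

```python
def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index ranges that roll forward through the data.

    Each test window starts where its training window ends, and the whole
    pair advances by test_size so test windows never overlap.
    """
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

# Example: five years of daily data, fit on 504 days, test on the next 126.
splits = list(walk_forward_splits(1260, 504, 126))
for train, test in splits:
    print(f"fit on [{train.start}, {train.stop})  test on [{test.start}, {test.stop})")
```

Stitching the test windows together gives one out-of-sample track record. Re-running the whole procedure after looking at that record still counts as another trial.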

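Combinatorial Purged Cross-Validation (CPCV) provides the most rigorous test: it generates all possible in-sample (IS) and out-of-sample (OOS) splits from a fixed set of data blocks and asks how often the configuration that looked best in-sample disappoints out-of-sample. The result is summarised as the Probability of Backtest Overfitting (PBO), usually defined as the fraction of splits in which the in-sample winner ranks below the median out-of-sample. A PBO above 0.5 means more combinations fail than pass, so the result is probably noise regardless of what the in-sample Sharpe says.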
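One common way to estimate PBO is sketched below. Note the swap: this uses the simpler combinatorially symmetric scheme (half the blocks in-sample, half out-of-sample) and omits the purging and embargo steps that CPCV adds, so it illustrates the counting logic rather than a full CPCV implementation. The function name and the toy per-block performance numbers are hypothetical.

```python
from itertools import combinations

def probability_of_backtest_overfitting(block_perf):
    """Fraction of IS/OOS splits where the in-sample winner ranks below
    the out-of-sample median.

    block_perf[c][b] is the performance of configuration c in data block b.
    No purging or embargo is applied in this sketch.
    """
    n_configs = len(block_perf)
    n_blocks = len(block_perf[0])
    failures = 0
    n_splits = 0
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        oos_blocks = [b for b in range(n_blocks) if b not in is_blocks]
        is_score = [sum(row[b] for b in is_blocks) for row in block_perf]
        oos_score = [sum(row[b] for b in oos_blocks) for row in block_perf]
        # Pick the configuration an optimiser would have chosen in-sample.
        winner = max(range(n_configs), key=is_score.__getitem__)
        # Count how many configurations the winner beats out-of-sample.
        beaten = sum(1 for s in oos_score if s < oos_score[winner])
        if beaten < (n_configs - 1) / 2:
            failures += 1
        n_splits += 1
    return failures / n_splits

# Toy data: three configurations, four blocks.
persistent = [[1, 1, 1, 1], [0, 0, 0, 0], [-1, -1, -1, -1]]
regime_fit = [[5, 3, -4, -6], [-5, -3, 4, 6], [0, 0, 0, 0]]

print(probability_of_backtest_overfitting(persistent))
print(probability_of_backtest_overfitting(regime_fit))
```

The first matrix describes a configuration that wins in every block, so the in-sample winner never disappoints. The second describes configurations that each fit one half of the history, which is the pattern a high PBO is meant to expose.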

None of these tools tell you whether a strategy will work in the future. They tell you whether it's indistinguishable from noise on past data. That's a lower bar, and it's the right place to start filtering.

The right mental model

Think of a backtest not as a prediction but as a hypothesis test with a very limited statistical budget. Every look at the data, every parameter tweak, every strategy variant costs something from that budget. The question is whether anything is left when you've finished — enough evidence to distinguish your result from what you'd expect from random search on the same dataset.

Kestrel Signal is built around this framing. Every backtest result shows the DSR, the Probabilistic Sharpe Ratio (PSR), and sample size flags alongside the Sharpe ratio, because a Sharpe ratio without statistical context is noise dressed up as signal.