A checklist for evaluating whether a backtest result is worth trusting
Seven criteria, from result hash to intuitive plausibility, for distinguishing credible backtests from sophisticated noise.
Not all backtests are equally believable. A backtest with a Sharpe of 3.2 can be less credible than one with a Sharpe of 0.8, depending on how it was constructed. Here is a practical checklist for evaluating whether a backtest result is worth taking seriously.
1. The result hash is present
A deterministic backtest platform produces identical results from identical inputs, every time. The result hash is a cryptographic fingerprint of the inputs (strategy definition, market data, engine version). If the same hash can be reproduced independently, the result is auditable. If the platform can't produce a hash, the result isn't auditable.
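As a rough illustration of what such a fingerprint could look like, the sketch below hashes the three inputs named above with SHA-256. The function name `result_hash`, the canonical-JSON serialisation, the length prefixing, and the sample strategy and version strings are all assumptions made for the example; a real platform may define its hash differently.

```python
import hashlib
import json


def result_hash(strategy_definition: dict, market_data: bytes, engine_version: str) -> str:
    """Return a SHA-256 fingerprint of the three backtest inputs (hypothetical scheme)."""
    # Canonical JSON (sorted keys, fixed separators) so that key order in the
    # strategy definition does not change the hash.
    strategy_bytes = json.dumps(
        strategy_definition, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")

    digest = hashlib.sha256()
    for part in (strategy_bytes, market_data, engine_version.encode("utf-8")):
        # Length-prefix each part so that moving bytes between parts
        # cannot produce the same byte stream.
        digest.update(len(part).to_bytes(8, "big"))
        digest.update(part)
    return digest.hexdigest()


# Illustrative inputs only: the strategy, data, and version numbers are made up.
strategy = {"name": "sma_cross", "fast": 10, "slow": 50}
data = b"2024-01-02,100.0\n2024-01-03,101.5\n"

h1 = result_hash(strategy, data, "1.4.2")
h2 = result_hash({"slow": 50, "fast": 10, "name": "sma_cross"}, data, "1.4.2")
h3 = result_hash(strategy, data, "1.4.3")
```

Here `h1` and `h2` match because only the key order differs, while `h3` differs because the engine version changed. Anyone holding the same inputs can recompute the hash and compare.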
2. DSR is above 0.75, with an honest trial count
The Deflated Sharpe Ratio accounts for the number of strategy variations tested and the non-normality of returns. A DSR above 0.75 (“some” tier) means there is moderate statistical evidence of edge after adjusting for multiple testing.
The DSR is only meaningful if the trial count is honest. Reporting N=1 when you actually tested 200 variants inflates the DSR. The trial count should include every strategy that touched the same dataset, not just the one being reported.
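To make the effect of the trial count concrete, here is a minimal sketch of the DSR calculation in the form usually attributed to Bailey and López de Prado: a Probabilistic Sharpe Ratio evaluated against the Sharpe you would expect from the best of N skill-less trials. The function names and the example inputs are assumptions, and `sr_variance` (the variance of Sharpe estimates across the trials) has to be estimated from your own research log.

```python
import math
from statistics import NormalDist

_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew, kurt):
    """Probability that the true per-period Sharpe exceeds sr_benchmark.

    sr and sr_benchmark are per-period (not annualised) values;
    kurt is raw kurtosis (3.0 for a normal distribution).
    """
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / denom
    return _NORMAL.cdf(z)


def expected_max_sharpe(n_trials, sr_variance):
    """Expected best Sharpe among n_trials strategies with no true edge."""
    if n_trials <= 1:
        return 0.0  # a single trial needs no deflation
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sr_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe(sr, n_obs, skew, kurt, n_trials, sr_variance):
    """DSR: the PSR measured against the expected maximum Sharpe under the null."""
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    return probabilistic_sharpe(sr, sr0, n_obs, skew, kurt)


# Illustrative inputs: daily Sharpe 0.1, two years of data, mild fat tails.
common = dict(sr=0.1, n_obs=504, skew=-0.3, kurt=5.0, sr_variance=0.0025)
dsr_claimed = deflated_sharpe(n_trials=1, **common)    # reported N=1
dsr_honest = deflated_sharpe(n_trials=200, **common)   # actual N=200
```

With these made-up inputs, the same track record clears the 0.75 bar comfortably when N is reported as 1 and falls below 0.5 when all 200 trials are counted.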
3. At least 252 bars and 30 trades
With fewer than 252 daily observations (about one trading year), the standard error of the annualised Sharpe estimator is too large to draw reliable conclusions. With fewer than 30 trades, the skewness and kurtosis inputs to the Probabilistic Sharpe Ratio (PSR) formula, on which the DSR is built, are themselves noisy, making the DSR estimate unreliable.
These aren't arbitrary thresholds. They reflect the minimum sample sizes at which the statistical machinery underlying DSR and PSR becomes reliable.
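A quick calculation shows how wide the uncertainty is at one year of data. The sketch below uses the textbook approximation for the standard error of a Sharpe ratio under i.i.d. normal returns, which is a simplifying assumption (fat tails and autocorrelation generally make things worse). The function name and the example figures are illustrative.

```python
import math


def annualised_sharpe_se(annual_sharpe, n_obs, periods_per_year=252):
    """Approximate standard error of an annualised Sharpe ratio.

    Assumes i.i.d. normal returns: SE(per-period SR) ~ sqrt((1 + SR^2 / 2) / n_obs).
    """
    sr = annual_sharpe / math.sqrt(periods_per_year)      # per-period Sharpe
    se_per_period = math.sqrt((1.0 + 0.5 * sr ** 2) / n_obs)
    return se_per_period * math.sqrt(periods_per_year)    # back to annual units


se_1y = annualised_sharpe_se(1.0, 252)    # one year of daily bars
se_4y = annualised_sharpe_se(1.0, 1008)   # four years of daily bars
```

For an annualised Sharpe of 1.0, one year of daily bars gives a standard error of about 1.0, so a 95% interval runs from roughly −1 to 3. Four years of data cuts the standard error to about 0.5.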
4. Walk-forward efficiency is above 0.5
Walk-Forward Efficiency (WFE = OOS Sharpe / IS Sharpe) measures whether the strategy generalises across time periods. A WFE above 0.5 across multiple windows suggests the parameters are reasonably robust. Below 0.5, most of the in-sample performance fails to carry over to out-of-sample periods, which is a hallmark of overfitting.
WFE should be consistent across windows, not just positive on average. If half the windows show WFE of 1.2 and half show −0.3, the strategy is regime-dependent rather than structurally sound.
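The sketch below works through that split-window case: it computes WFE per window and then reports the spread alongside the average. The function names, the summary fields, and the choice to reject non-positive in-sample Sharpes are assumptions made for the example.

```python
from statistics import mean


def walk_forward_efficiency(is_sharpes, oos_sharpes):
    """Per-window WFE = out-of-sample Sharpe / in-sample Sharpe."""
    if len(is_sharpes) != len(oos_sharpes):
        raise ValueError("need one IS and one OOS Sharpe per window")
    wfe = []
    for is_sr, oos_sr in zip(is_sharpes, oos_sharpes):
        if is_sr <= 0:
            # The ratio has no sensible interpretation in this case.
            raise ValueError("in-sample Sharpe must be positive")
        wfe.append(oos_sr / is_sr)
    return wfe


def summarise_wfe(wfe, threshold=0.5):
    """Report the average together with measures of consistency."""
    return {
        "mean": mean(wfe),
        "min": min(wfe),
        "share_above_threshold": sum(w > threshold for w in wfe) / len(wfe),
    }


# Four windows: half with WFE 1.2, half with WFE -0.3.
is_sharpes = [1.0, 1.0, 1.0, 1.0]
oos_sharpes = [1.2, -0.3, 1.2, -0.3]
wfe = walk_forward_efficiency(is_sharpes, oos_sharpes)
summary = summarise_wfe(wfe)
```

The mean WFE here is 0.45, which looks close to passing, but only half the windows clear 0.5 and the worst window is negative. Looking at the minimum and the share of passing windows exposes what the average hides.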
5. Parameter sensitivity is plateau-shaped, not spike-shaped
Run the strategy with parameters ±20% from optimal. If the Sharpe degrades by more than half, the “optimal” parameter values are a spike — they work on this specific dataset but are unlikely to be the right values going forward.
Robust strategies show a plateau: performance is relatively stable across a range of parameter values. The exact optimum doesn't matter much, which is evidence that the result is driven by structure rather than fit.
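One way to automate this check for a single parameter is sketched below. `sharpe_fn` stands in for whatever runs your backtest and returns a Sharpe for a given parameter value. It, the grid size, and the two toy response surfaces are hypothetical, chosen only to show a plateau and a spike side by side.

```python
import math


def sensitivity_check(sharpe_fn, optimal, pct=0.20, steps=4, max_drop=0.5):
    """Evaluate sharpe_fn on an even grid spanning optimal +/- pct.

    Flags a plateau if the worst Sharpe on the grid keeps at least
    (1 - max_drop) of the Sharpe at the optimum. Assumes that Sharpe is positive.
    """
    base = sharpe_fn(optimal)
    grid = [optimal * (1.0 + pct * (2 * i / steps - 1)) for i in range(steps + 1)]
    worst = min(sharpe_fn(p) for p in grid)
    return {"base": base, "worst": worst, "is_plateau": worst >= base * (1.0 - max_drop)}


# Toy response surfaces around an optimal lookback of 50 (made up for illustration).
def plateau_surface(p):
    return 1.0 - 0.0005 * (p - 50) ** 2       # degrades gently


def spike_surface(p):
    return math.exp(-((p - 50) / 3.0) ** 2)   # collapses away from 50


plateau_result = sensitivity_check(plateau_surface, optimal=50)
spike_result = sensitivity_check(spike_surface, optimal=50)
```

The plateau surface keeps almost all of its Sharpe at ±20%, while the spike surface loses nearly everything. With several parameters, the same idea applies to each one in turn or to a joint grid.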
6. Transaction costs are realistic
Many retail backtests use zero or token transaction costs. For a strategy that makes 100 round trips per year, each turning over the full position, a realistic round-trip cost of 0.1% (bid-ask spread plus commission) adds up to a 10% annual drag. For a strategy with a gross Sharpe of 1.0 and 10% annualised volatility, the gross excess return is also about 10%, so that drag can eliminate the entire edge.
If a strategy is profitable only at unrealistically low transaction costs, it should be classified as not viable, not “promising with optimization.”
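The arithmetic behind the cost drag is simple enough to sketch. The example below assumes every round trip turns over the full capital, that costs reduce return without changing volatility, and that annualised volatility is 10%. These are simplifications chosen for the illustration, and the function names are hypothetical.

```python
def annual_cost_drag(trades_per_year, round_trip_cost):
    """Annual return lost to costs, assuming each round trip turns over full capital."""
    return trades_per_year * round_trip_cost


def net_sharpe(gross_sharpe, annual_vol, trades_per_year, round_trip_cost):
    """Sharpe after costs, assuming costs leave volatility unchanged."""
    gross_return = gross_sharpe * annual_vol
    net_return = gross_return - annual_cost_drag(trades_per_year, round_trip_cost)
    return net_return / annual_vol


def breakeven_cost(gross_sharpe, annual_vol, trades_per_year):
    """Round-trip cost at which the net return falls to zero."""
    return gross_sharpe * annual_vol / trades_per_year


drag = annual_cost_drag(100, 0.001)                 # 100 round trips at 0.1%
net = net_sharpe(1.0, 0.10, 100, 0.001)             # gross Sharpe 1.0, 10% vol
limit = breakeven_cost(1.0, 0.10, 100)
```

Under these assumptions the drag is 10% a year, the net Sharpe is zero, and the break-even round-trip cost is exactly the 0.1% being charged. Comparing the break-even cost with what your broker and market actually charge is a quick viability test.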
7. The strategy makes intuitive sense
This is the least quantitative criterion but arguably the most important filter. A strategy that has no plausible mechanism for generating edge — no story about why the market would offer this return persistently — is almost certainly a data artefact, regardless of what the metrics say.
The statistical tools above can eliminate bad strategies. They cannot guarantee that a strategy with good statistics has genuine edge. The mechanism matters. If you can't explain in one sentence why the strategy should work, you probably don't understand it well enough to trade it.