Methodology · 17 May 2026 · 6 min read

DSR in Practice: How to Count Your Trials Honestly

A practical guide to applying the Deflated Sharpe Ratio by counting trials honestly, including correlated trials and pre-registered budgets.

The Deflated Sharpe Ratio adjusts your in-sample Sharpe for the number of independent trials you ran before arriving at it. The hard part is not the formula — it is counting trials honestly. Most practitioners undercount by an order of magnitude, then wonder why their "statistically significant" strategy dies in production.

This post is about what counts as a trial, why your git history is not enough, and how to build a trial budget that survives contact with reality.

What the DSR actually corrects for

The DSR, due to Bailey and López de Prado, is the probability that a strategy's true Sharpe ratio is positive once you account for how many strategies you tested, the variance of their Sharpes, and the higher moments of the winning strategy's return distribution. The intuition: if you tried 1,000 random parameter combinations, the best one will look great even on pure noise. The DSR asks how great it would have to look to convince you it is not noise.

DSR = Z[ (SR - SR₀) · √(T - 1) / √(1 - γ₃·SR + (γ₄ - 1)/4 · SR²) ]

Here SR is the observed Sharpe at the same frequency as the returns (not annualized), T is the number of return observations, γ₃ and γ₄ are the skewness and kurtosis of returns (γ₄ = 3 for normally distributed returns), and SR₀ is the expected maximum Sharpe under the null across N independent trials. Z[·] is the standard normal CDF. The SR₀ term is where trial counting enters:

SR₀ ≈ √V[SR] · ( (1 - γ)·Φ⁻¹(1 - 1/N) + γ·Φ⁻¹(1 - 1/(N·e)) )

V[SR] is the variance of Sharpe ratios across your trials, γ is the Euler-Mascheroni constant (≈0.5772), Φ⁻¹ is the inverse of the standard normal CDF, and N is the trial count. Notice that SR₀ grows roughly as √log(N). Going from 10 trials to 10,000 raises the hurdle only about 2.5-fold, but undercounting N by 100x is still enough to flip a rejection into a false acceptance.
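The two formulas above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names and the input values in the usage note are made up for the example, and SR is assumed to be per-period rather than annualized.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
STD_NORMAL = NormalDist()         # standard normal: mean 0, sigma 1


def expected_max_sharpe(var_sr: float, n_trials: int) -> float:
    """SR0: expected maximum Sharpe under the null across n_trials independent trials."""
    if n_trials < 2:
        # inv_cdf(0) is undefined, so the approximation needs at least 2 trials
        raise ValueError("n_trials must be at least 2")
    z1 = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    z2 = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sqrt(var_sr) * ((1.0 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe(sr: float, n_obs: int, skew: float, kurt: float,
                    var_sr: float, n_trials: int) -> float:
    """DSR as a probability in [0, 1]; kurt is raw kurtosis (3 for a normal)."""
    sr0 = expected_max_sharpe(var_sr, n_trials)
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr * sr)
    return STD_NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)


# Hypothetical inputs: daily SR of 0.10, five years of daily returns,
# normal-looking returns, cross-trial Sharpe standard deviation of 0.03.
dsr_claimed = deflated_sharpe(0.10, 1250, 0.0, 3.0, 0.03 ** 2, n_trials=10)
dsr_honest = deflated_sharpe(0.10, 1250, 0.0, 3.0, 0.03 ** 2, n_trials=1000)
```

With these made-up inputs, the same Sharpe clears a 0.95 threshold when N is reported as 10 and falls well short of it when N is 1,000, which is the 100x undercounting scenario described above.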

What counts as a trial

A trial is any configuration of your strategy that you evaluated and could have chosen to deploy. This is broader than most people admit. Every parameter grid point is a trial. Every alternative entry rule you sketched and ran once is a trial. Every universe filter, every stop-loss variant, every regime overlay you tried and discarded — all trials.

Trials are not limited to runs you saved. If you opened a notebook, ran a backtest, eyeballed the equity curve, and closed the tab, that was a trial. Your brain stored the result and used it to inform the next attempt. The selection bias is real whether or not the artifact is on disk.

If you read a paper, replicated its strategy, and it worked on your data, you have not run one clean trial. You have inherited the trials the original authors ran, plus yours. Public strategies have a selection history before you ever touched them. Treating a published Sharpe as N=1 is one of the most common DSR mistakes.

Building a trial budget before you start

The honest fix is to budget N before you begin research, not reconstruct it afterward. Write down the parameter ranges you intend to sweep, the alternative formulations you will compare, and the selection criterion. Multiply it out. That product is your N — and you commit to it.

A concrete example: you plan to test a momentum strategy across 5 lookbacks, 4 holding periods, 3 universe filters, and 2 volatility scalings. That is 5·4·3·2 = 120 trials before you have written any code. If you then decide to add an on/off regime filter mid-research, you have not done 121 trials. You have done 240, because the filter can be switched on for every prior combination you might revisit.
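The arithmetic in this example can be written down as a budget before any backtest runs. The sketch below is one way to do it; the grid keys and the specific values (lookback lengths, universe names) are placeholders invented for illustration, and only the counts per dimension come from the example above.

```python
from itertools import product
from math import prod

# Hypothetical pre-registered grid: 5 lookbacks, 4 holding periods,
# 3 universe filters, 2 volatility scalings.
budget = {
    "lookback_days": [20, 60, 120, 180, 250],
    "holding_days": [1, 5, 10, 20],
    "universe": ["large_cap", "mid_cap", "all"],
    "vol_scaling": [False, True],
}


def count_trials(grid: dict) -> int:
    """Trial budget N: the product of the number of options per dimension."""
    return prod(len(values) for values in grid.values())


def enumerate_trials(grid: dict) -> list:
    """Every configuration in the grid, as a list of parameter dicts."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in product(*grid.values())]


n_planned = count_trials(budget)  # 5 * 4 * 3 * 2

# Adding an on/off regime filter mid-research doubles the grid.
extended = {**budget, "regime_filter": [False, True]}
n_extended = count_trials(extended)

print(n_planned, n_extended)
```

Enumerating the grid up front also gives you a fixed list to iterate over, which makes it harder to quietly add a configuration that was never in the budget.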

In Kestrel Signal, every backtest execution is logged with its parameter hash, and the trial counter is exposed in the DSR computation. You cannot accidentally forget a run, but you can still lie to yourself about which runs were "exploratory" versus "real." Don't.
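For readers without such tooling, the idea of logging each run by parameter hash can be approximated in plain Python. The sketch below is hypothetical and is not Kestrel Signal's API: `param_hash` and `TrialLog` are names invented here, and it assumes that re-running an identical configuration on the same data does not count as a new trial.

```python
import hashlib
import json


def param_hash(params: dict) -> str:
    """Stable short hash of a parameter set, independent of key order."""
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


class TrialLog:
    """Append-only record of backtest executions (illustrative only)."""

    def __init__(self) -> None:
        self._runs = []  # (hash, sharpe) for every execution, in order

    def record(self, params: dict, sharpe: float) -> str:
        h = param_hash(params)
        self._runs.append((h, sharpe))
        return h

    @property
    def n_executions(self) -> int:
        """Every run, including repeats of the same configuration."""
        return len(self._runs)

    @property
    def n_trials(self) -> int:
        """Distinct configurations evaluated: the N that feeds the DSR."""
        return len({h for h, _ in self._runs})


log = TrialLog()
log.record({"lookback_days": 60, "holding_days": 5}, sharpe=0.07)
log.record({"holding_days": 5, "lookback_days": 60}, sharpe=0.07)  # same config
log.record({"lookback_days": 120, "holding_days": 5}, sharpe=0.04)
```

Nothing in the log distinguishes "exploratory" from "real" runs, and that is deliberate: every recorded configuration counts.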

Correlated trials and effective N

Not all trials are independent. A 20-day and 21-day moving average crossover produce nearly identical equity curves; counting them as two trials overstates your search. The correction is to use an effective trial count based on the correlation structure of returns across trials.

N_eff = N / (1 + (N - 1)·ρ̄)

where ρ̄ is the average pairwise return correlation across your trials. With ρ̄ = 0.9 and N = 100, N_eff ≈ 1.1: a hundred near-duplicate trials count as barely more than one. This is why fine parameter grids are less harmful than they appear, but also why "I only tested 12 strategies" is misleading when those 12 are wildly different ideas with ρ̄ near zero. Twelve uncorrelated trials punish you more than a hundred correlated ones.

The asymmetry is uncomfortable: dense sweeps of one idea inflate N_eff modestly, while diverse exploration across many ideas inflates N_eff nearly linearly. Researchers who pride themselves on "trying lots of different things" are precisely those who need the largest DSR haircuts.
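The N_eff formula is simple enough to check by hand, but a small helper makes the asymmetry concrete. This is a sketch under the equal-correlation approximation used in the formula; the function names are invented for illustration, and the Pearson helper assumes every trial's return series has the same length and is not constant.

```python
from itertools import combinations
from math import sqrt


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_pairwise_corr(returns: list) -> float:
    """Average correlation over all distinct pairs of trial return series."""
    pairs = list(combinations(returns, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


def effective_trials(n: int, rho_bar: float) -> float:
    """N_eff = N / (1 + (N - 1) * rho_bar)."""
    denom = 1.0 + (n - 1) * rho_bar
    if denom <= 0.0:
        raise ValueError("rho_bar is too negative for this approximation")
    return n / denom


dense_sweep = effective_trials(100, 0.9)   # one idea, fine parameter grid
diverse_ideas = effective_trials(12, 0.0)  # twelve unrelated ideas

print(round(dense_sweep, 2), diverse_ideas)
```

Evaluating the formula at ρ̄ = 0.9 and N = 100 gives roughly 1.1 effective trials, while twelve uncorrelated ideas keep all twelve.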

What honest practice looks like

Pre-register your trial budget. Log every backtest, including the ones you abandoned at minute three. Compute ρ̄ from your trial population and report N_eff alongside N. When you import an external strategy, attach a prior trial count reflecting its provenance — for a published equity factor, N in the hundreds is defensible; for an anonymous Reddit post, treat it as adversarial selection and assume N is large enough that no observed Sharpe survives.

The DSR does not make bad research good. It makes honest research interpretable. Count your trials as if a hostile reviewer were going to audit your notebooks — because the market is that reviewer, and it does not grade on a curve.