Bootstrap Methods for Strategy Evaluation

Bootstrap methods resample your strategy's returns or trades to construct empirical distributions of performance statistics, replacing fragile parametric assumptions with brute-force computation. They matter because the sampling distributions of Sharpe, drawdown, and CAGR are generally unknown in closed form, skewed, and fat-tailed, so analytical confidence intervals built on normality tend to overstate how much you actually know.

The core procedure is simple. Given an observed sample of N returns, draw N returns with replacement to form a synthetic track record, compute the statistic of interest, and repeat B times. The empirical quantiles of those B values approximate the sampling distribution of the statistic under the assumption that your observed returns are representative of the true return-generating process.

The computation

For a return series r = (r_1, ..., r_N) and statistic θ (e.g., Sharpe ratio), the basic IID bootstrap is:

For b = 1..B: r*_b = sample_with_replacement(r, size=N); θ*_b = stat(r*_b)
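The loop above can be sketched in Python with NumPy. This is a minimal illustration, not a reference implementation: the function names (annualized_sharpe, iid_bootstrap), the 252-day annualization, and the synthetic return series are all assumptions made for the example. The last lines take the 2.5% and 97.5% quantiles of the replicates to form a percentile interval.

```python
import numpy as np

def annualized_sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio of a 1-D array of per-period excess returns."""
    returns = np.asarray(returns, dtype=float)
    sd = returns.std(ddof=1)
    return np.sqrt(periods_per_year) * returns.mean() / sd if sd > 0 else np.nan

def iid_bootstrap(returns, stat, n_boot=1000, seed=0):
    """Return n_boot replicates of `stat` under IID resampling with replacement."""
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    n = returns.size
    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # N draws with replacement
        reps[b] = stat(returns[idx])       # statistic on the synthetic track record
    return reps

# Synthetic daily returns, for illustration only (not real strategy data).
rng = np.random.default_rng(42)
daily = rng.normal(loc=0.0005, scale=0.01, size=750)

reps = iid_bootstrap(daily, annualized_sharpe, n_boot=1000)
lo, hi = np.quantile(reps, [0.025, 0.975])   # percentile interval
print(f"Sharpe {annualized_sharpe(daily):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Any statistic that maps a return array to a number can be passed as stat, so the same loop serves for Sortino or CAGR.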

The 95% confidence interval for θ is then the interval [Q_0.025(θ*), Q_0.975(θ*)] across the B replicates. For autocorrelated returns — which is most strategy returns — replace IID resampling with the stationary bootstrap (Politis & Romano, 1994), which resamples blocks of random geometric length with mean L:

L_optimal ≈ N^(1/3) × (2 × ρ̂_1² / (1 - ρ̂_1²))^(1/3)

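where ρ̂_1 is the first-order autocorrelation of returns. Treat this as a rough rule of thumb: it tends to zero as ρ̂_1 approaches zero, so floor L at 1, which recovers the IID bootstrap. For trade-level bootstraps, resample individual trades (or trade clusters) rather than daily returns, which preserves the structure of position holding periods.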
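A stationary bootstrap resample can be sketched as an index generator: at each step, start a new block at a random position with probability 1/L, otherwise continue to the next observation, wrapping around at the end of the series. The helper names below are hypothetical, the floor of 1 on the block length is an added safeguard, and the AR(1) series is synthetic. The block-length function simply evaluates the rule of thumb quoted above.

```python
import numpy as np

def rule_of_thumb_block_length(returns):
    """Mean block length from the lag-1 autocorrelation rule above, floored at 1."""
    r = np.asarray(returns, dtype=float)
    rho = np.corrcoef(r[:-1], r[1:])[0, 1]            # lag-1 autocorrelation
    length = r.size ** (1 / 3) * (2 * rho**2 / (1 - rho**2)) ** (1 / 3)
    return max(1.0, float(length))                     # assumption: never below 1

def stationary_bootstrap_indices(n, mean_block, rng):
    """One resample of n indices: geometric block lengths, circular wrap-around."""
    p = 1.0 / mean_block                               # chance of starting a new block
    new_block = rng.random(n) < p
    starts = rng.integers(0, n, size=n)
    idx = np.empty(n, dtype=np.int64)
    idx[0] = starts[0]
    for t in range(1, n):
        idx[t] = starts[t] if new_block[t] else (idx[t - 1] + 1) % n
    return idx

# Synthetic AR(1) returns with positive autocorrelation, for illustration only.
rng = np.random.default_rng(7)
n, phi = 750, 0.3
eps = rng.normal(0.0, 0.01, size=n)
r = np.empty(n)
r[0] = eps[0]
for t in range(1, n):
    r[t] = phi * r[t - 1] + eps[t]

L = rule_of_thumb_block_length(r)
idx = stationary_bootstrap_indices(n, L, rng)
resampled = r[idx]                                     # one synthetic track record
```

Calling stationary_bootstrap_indices B times and applying the statistic to r[idx] each time yields the replicates. For trade-level work the same indices can be applied to an array of trades instead of daily returns.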

How to interpret it

A bootstrap 95% CI on Sharpe ratio that crosses zero means you cannot reject the null of zero edge at the conventional 5% level. With 3 years of daily returns (N ≈ 750), the standard error of an annualized Sharpe under roughly IID returns is about sqrt(1/3) ≈ 0.58, so expect 95% CIs of roughly ±1.1 around the point estimate: a backtest showing Sharpe 1.0 typically has a 95% CI of about [-0.1, 2.1]. By the same arithmetic, the lower bound clears zero only when the point estimate exceeds roughly 1.1, so at the sample sizes most retail backtests provide, many strategies are statistically indistinguishable from noise.

For maximum drawdown, bootstrap distributions are heavily right-skewed: the observed max DD is almost always optimistic. A reasonable heuristic is that the 95th percentile bootstrap DD is 1.5x to 2.5x the observed DD, and this is the number you should plan capital around. CAGR bootstrap CIs narrow only with the square root of the backtest length, so short backtests produce CAGR estimates that are nearly useless without bootstrap context.
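As a sketch of how the drawdown percentile might be computed, the snippet below bootstraps maximum drawdown on a synthetic return series. The max_drawdown helper and the data are assumptions for illustration, and IID resampling is used only for brevity; for autocorrelated returns, block resampling would be the appropriate substitute.

```python
import numpy as np

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (positive fraction)."""
    equity = np.cumprod(1.0 + np.asarray(returns, dtype=float))
    equity = np.concatenate(([1.0], equity))       # include starting capital
    peaks = np.maximum.accumulate(equity)          # running high-water mark
    return float(np.max(1.0 - equity / peaks))

# Synthetic daily returns, for illustration only.
rng = np.random.default_rng(3)
daily = rng.normal(0.0005, 0.01, size=750)

observed = max_drawdown(daily)
n = daily.size
dd_reps = np.array(
    [max_drawdown(daily[rng.integers(0, n, size=n)]) for _ in range(1000)]
)
dd_95 = np.quantile(dd_reps, 0.95)                 # drawdown to plan capital around
print(f"observed max DD {observed:.1%}, bootstrap 95th percentile {dd_95:.1%}")
```

Comparing dd_95 with observed shows how far the planning figure sits from the single realized path.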

Run B = 1000 for exploratory work, B = 10000 for reported figures. Beyond 10000, additional precision rarely justifies the compute cost.

What it does not capture

Bootstrap resamples from the observed sample. It cannot tell you anything about regimes not present in your data. A bootstrap of 2015-2019 SPX returns will not produce a 2008 or a March 2020, no matter how many replicates you draw. The empirical distribution is bounded by what happened.

Bootstrap confidence intervals address sampling variability, not model risk, not regime risk, not selection bias. A strategy chosen from 1000 candidates by Sharpe ranking can show an impressive bootstrap CI that sits well above zero and still have almost zero out-of-sample edge, because the resampled data is the same data that won the search. Bootstrap after the search, on held-out data, or use a multiple-testing-aware procedure like the deflated Sharpe ratio.
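The selection effect is easy to reproduce in simulation. The sketch below generates 1000 hypothetical strategies with zero true edge (pure Gaussian noise, an assumption chosen to make the point cleanly), picks the best by in-sample Sharpe, and bootstraps the winner on the same data that selected it. A fresh noise series stands in for held-out data.

```python
import numpy as np

rng = np.random.default_rng(11)
n_days, n_candidates = 750, 1000

def sharpe(r, axis=-1):
    """Annualized Sharpe along the given axis, assuming 252 periods per year."""
    return np.sqrt(252) * r.mean(axis=axis) / r.std(axis=axis, ddof=1)

# 1000 hypothetical strategies with zero true edge: pure noise returns.
candidates = rng.normal(0.0, 0.01, size=(n_candidates, n_days))

best = int(np.argmax(sharpe(candidates)))      # the "search": pick the top Sharpe
in_sample = candidates[best]

# Bootstrap the winner on the data that selected it (IID resampling).
idx = rng.integers(0, n_days, size=(2000, n_days))
reps = sharpe(in_sample[idx])                  # one Sharpe per resampled row
lo, hi = np.quantile(reps, [0.025, 0.975])

# Held-out stand-in: the same zero-edge process on fresh data.
out_of_sample = float(sharpe(rng.normal(0.0, 0.01, size=n_days)))

print(f"in-sample Sharpe {float(sharpe(in_sample)):.2f}, CI [{lo:.2f}, {hi:.2f}]")
print(f"out-of-sample Sharpe {out_of_sample:.2f}")
```

The in-sample interval is centered on the inflated winner, which is why the bootstrap alone cannot detect that the edge came from the search.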

It also assumes the resampling preserves the dependence structure that matters. IID bootstrap on positively autocorrelated returns produces CIs that are too narrow, typically by 20-40% for momentum strategies. Block bootstrap fixes serial correlation but not cross-sectional dependence in multi-asset portfolios; for that you need a paired or panel bootstrap, which resamples the same dates across all assets at once.
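A small simulation illustrates why the dates must be resampled jointly across assets. The two-asset panel below is synthetic, with an assumed true correlation of 0.6. Paired resampling draws one set of row indices for both columns, while the deliberately wrong variant draws separate indices per asset. Rows are drawn IID here for brevity; in practice the row indices would come from a block scheme.

```python
import numpy as np

rng = np.random.default_rng(5)
n_days, n_boot = 750, 500

# Two hypothetical, correlated asset return series (synthetic).
cov = 1e-4 * np.array([[1.0, 0.6], [0.6, 1.0]])
panel = rng.multivariate_normal([0.0, 0.0], cov, size=n_days)   # shape (n_days, 2)

paired = np.empty(n_boot)
independent = np.empty(n_boot)
for b in range(n_boot):
    rows = rng.integers(0, n_days, size=n_days)
    sample = panel[rows]                        # same dates for every asset
    paired[b] = np.corrcoef(sample[:, 0], sample[:, 1])[0, 1]

    other = rng.integers(0, n_days, size=n_days)
    # Wrong on purpose: each asset gets its own dates, breaking the pairing.
    independent[b] = np.corrcoef(panel[rows, 0], panel[other, 1])[0, 1]

print(f"paired mean corr {paired.mean():.2f}, independent {independent.mean():.2f}")
```

The paired replicates keep the cross-asset correlation of the data, while the independent ones wash it out toward zero, which would misstate the risk of any portfolio statistic built on them.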

Finally, bootstrap cannot rescue a strategy from look-ahead bias, survivorship bias, or transaction cost optimism. Garbage in, garbage out — with rigorous-looking confidence intervals.

In Kestrel Signal

Kestrel Signal reports stationary bootstrap confidence intervals for Sharpe, Sortino, CAGR, and maximum drawdown alongside every backtest, with block length chosen automatically from the residual autocorrelation. Bootstrap distributions are visualized as histograms rather than scalar CIs, so you can see skew and tail behavior directly. When a strategy is selected from a parameter sweep, bootstrap statistics are computed on the held-out walk-forward segments only — never on the in-sample data that produced the selection.