Why You Need More Data Than You Think
Backtest row counts mislead; the statistical sample size that governs strategy validation is far smaller than most researchers assume.
A backtest covering ten years of daily data feels substantial until you count what you actually have: roughly 2,500 daily observations, 120 monthly returns, and maybe two or three regime shifts. That is not a large sample. It is a small sample that happens to be expensive to obtain, and most of the inferential mistakes in systematic trading trace back to confusing the size of a CSV file with the size of a statistical sample.
Sample size is not row count
The number of rows in your dataset is not the number of independent observations relevant to your strategy. A mean-reversion signal that holds positions for five days has an effective sample size closer to N/5 than to N. A momentum strategy with 60-day lookbacks and monthly rebalances generates perhaps a dozen independent decisions per year, regardless of whether the underlying price series is sampled by the minute or by the day.
The relevant question is how many independent realizations of your strategy's decision process exist in your data. This is closer to the number of non-overlapping trades than to the number of bars. Once you frame it this way, ten years of daily data for a weekly-rebalanced strategy gives you roughly 520 decisions — a sample size at which most statistical estimates remain badly imprecise.
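The counting argument can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, the 390-minute session length is an assumption (a 6.5-hour trading day), and treating non-overlapping holding periods as independent is itself an optimistic simplification.

```python
def non_overlapping_decisions(n_bars: int, bars_per_decision: int) -> int:
    """Count the holding periods that fit end to end in the history."""
    if bars_per_decision <= 0:
        raise ValueError("bars_per_decision must be positive")
    return n_bars // bars_per_decision


MINUTES_PER_SESSION = 390  # assumption: one 6.5-hour trading session per day

daily_bars = 2500                               # about ten years of daily data
minute_bars = daily_bars * MINUTES_PER_SESSION  # the same history in minute bars

# A five-day holding period, measured in each bar size.
daily_count = non_overlapping_decisions(daily_bars, 5)
minute_count = non_overlapping_decisions(minute_bars, 5 * MINUTES_PER_SESSION)

print(daily_count, minute_count)  # 500 500
```

Sampling the same history 390 times more finely multiplies the row count but leaves the decision count unchanged.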
The standard error of the Sharpe ratio
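The estimated Sharpe ratio has its own sampling distribution, and its standard error is larger than most practitioners assume. For a strategy with true annualized Sharpe S estimated over T years of returns, assuming roughly normal returns, the standard error is approximately:
$$\mathrm{SE}(\hat{S}) \approx \sqrt{\frac{1 + S^2/2}{T}}$$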
For a true Sharpe of 1.0 estimated over 3 years, the standard error is about 0.71. The 95% confidence interval on your estimate spans roughly -0.4 to +2.4. You cannot distinguish a genuinely strong strategy from a coin flip with three years of data, and this is before accounting for fat tails, autocorrelation, or selection bias from the strategies you discarded along the way.
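The arithmetic above can be checked with a short sketch. It assumes the common normal-returns approximation SE = sqrt((1 + S^2/2) / T), with S annualized and T in years, which reproduces the 0.71 figure. The function names are illustrative, and the interval is a plain normal-theory interval that ignores fat tails, autocorrelation, and selection bias.

```python
import math


def sharpe_standard_error(sharpe: float, years: float) -> float:
    """Approximate standard error of an annualized Sharpe estimate.

    Assumption: roughly normal, independent returns.
    """
    if years <= 0:
        raise ValueError("years must be positive")
    return math.sqrt((1.0 + 0.5 * sharpe ** 2) / years)


def sharpe_confidence_interval(sharpe: float, years: float, z: float = 1.96):
    """Two-sided normal-theory interval; z = 1.96 gives about 95% coverage."""
    se = sharpe_standard_error(sharpe, years)
    return sharpe - z * se, sharpe + z * se


se = sharpe_standard_error(1.0, 3.0)
low, high = sharpe_confidence_interval(1.0, 3.0)
print(f"SE = {se:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
# SE = 0.71, 95% CI = (-0.39, 2.39)
```

Rerunning with `years=10.0` shrinks the standard error to about 0.39, which is still wide enough that a point estimate of 1.0 is compatible with a true Sharpe well below 0.5.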
Regimes are the actual unit of observation
For strategies whose performance depends on macro conditions — and that is most strategies — the relevant sample size is not trades or years but regimes. A trend-following system that thrives in dispersive markets and dies in mean-reverting ones has seen perhaps four or five distinct regime environments in the post-2000 data most retail backtesters use. That is the sample you are working with, no matter how granular your bars.
This is why a backtest covering 2010–2019 is methodologically thin even when it spans a decade. It contains a single sustained low-volatility bull regime with brief interruptions. A strategy fit on that period has been validated against one realization of one regime, and its forward behavior under any other regime is unknown. The 2020 volatility shock and the 2022 rate cycle were genuinely informative precisely because they added two new environments to a sample of four or five, expanding it by roughly 40 to 50%.
What this implies for backtest design
The practical implication is that data requirements scale with the timescale of your strategy's edge. A high-frequency strategy operating on microstructure may have its statistical case made within months; a quarterly factor rotation needs decades. The mismatch — running a slow strategy and validating it on fast data — is the silent killer of most systematic research.
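To make the scaling concrete, here is a rough sketch that converts a target number of non-overlapping holding periods into years of history. The 252-trading-day year, the function name, and the 100-period target are illustrative assumptions, not figures from any particular dataset.

```python
TRADING_DAYS_PER_YEAR = 252  # assumption: typical equity-market calendar


def years_of_history_needed(n_periods: int, holding_days: float,
                            days_per_year: int = TRADING_DAYS_PER_YEAR) -> float:
    """Years required to observe n_periods non-overlapping holding periods."""
    if n_periods <= 0 or holding_days <= 0:
        raise ValueError("n_periods and holding_days must be positive")
    return n_periods * holding_days / days_per_year


# Hypothetical holding periods, in trading days.
for label, holding_days in [("daily", 1), ("weekly", 5),
                            ("monthly", 21), ("quarterly", 63)]:
    years = years_of_history_needed(100, holding_days)
    print(f"{label:>9}: {years:5.1f} years for 100 holding periods")
```

Under these assumptions a one-day holding period reaches 100 observations in under five months, while a quarterly one needs 25 years.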
Three rules of thumb that survive contact with reality:
First, require at least 30 non-overlapping holding periods before treating any performance metric as informative, and prefer 100. Second, ensure your data spans at least two qualitatively different regimes — a rising-rate period and a falling-rate period, a high-volatility era and a low-volatility one. Third, treat any strategy with a Sharpe confidence interval that includes zero as unvalidated, regardless of point estimate.
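The three rules can be written down as a simple checklist. This is a minimal sketch, not a validation framework: the `BacktestSummary` fields and function name are hypothetical, the regime count is a manual judgment supplied by the researcher, and the Sharpe interval uses the normal-returns approximation SE = sqrt((1 + S^2/2) / T).

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class BacktestSummary:
    holding_periods: int   # non-overlapping holding periods in the backtest
    regimes_covered: int   # researcher's own count of distinct regimes
    sharpe: float          # annualized Sharpe point estimate
    years: float           # length of the backtest in years


def validation_failures(b: BacktestSummary, min_periods: int = 30,
                        min_regimes: int = 2, z: float = 1.96) -> List[str]:
    """Return the rules of thumb that the backtest fails (empty if none)."""
    failures = []
    if b.holding_periods < min_periods:
        failures.append("fewer than the minimum non-overlapping holding periods")
    if b.regimes_covered < min_regimes:
        failures.append("fewer than two qualitatively different regimes")
    # Assumption: normal-returns approximation for the Sharpe standard error.
    se = math.sqrt((1.0 + 0.5 * b.sharpe ** 2) / b.years)
    if b.sharpe - z * se <= 0.0:
        failures.append("Sharpe confidence interval includes zero")
    return failures


short_test = BacktestSummary(holding_periods=156, regimes_covered=1,
                             sharpe=1.0, years=3.0)
long_test = BacktestSummary(holding_periods=520, regimes_covered=3,
                            sharpe=1.0, years=10.0)
print(validation_failures(short_test))  # fails the regime and Sharpe checks
print(validation_failures(long_test))   # []
```

The three-year example clears the holding-period threshold comfortably and still fails the other two rules, which is the typical failure pattern for a weekly strategy tested on a short window.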
The asymmetry that matters
More data cannot make a bad strategy good, but less data can make a bad strategy look good. The asymmetry runs against the researcher: small samples produce noisy estimates, noisy estimates produce false positives, and false positives are the strategies you actually deploy because they passed your filters. The strategies that genuinely work tend to have boring, stable, well-estimated performance over long histories — which is why they look less exciting than the overfitted alternatives competing for your attention.
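A small simulation illustrates the asymmetry. It is a sketch under stated assumptions: every simulated strategy has zero true edge, daily returns are independent normal draws with 1% volatility, and the researcher keeps only the best of 100 candidates. The function names and parameters are illustrative.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio of a return series (zero risk-free rate assumed)."""
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)
    return mean / stdev * math.sqrt(periods_per_year)


def best_of_n_sharpe(n_strategies: int, n_days: int, rng: random.Random) -> float:
    """Best Sharpe among n_strategies with zero true edge."""
    best = -math.inf
    for _ in range(n_strategies):
        # Zero-mean returns: any positive Sharpe here is pure noise.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        best = max(best, annualized_sharpe(returns))
    return best


rng = random.Random(0)  # fixed seed so the run is repeatable
best_one_year = best_of_n_sharpe(100, 252, rng)
best_ten_years = best_of_n_sharpe(100, 2520, rng)
print(f"best of 100, 1 year:   {best_one_year:.2f}")
print(f"best of 100, 10 years: {best_ten_years:.2f}")
```

With one year of data, the best of 100 worthless strategies will usually show a Sharpe well above 1; with ten years, the same selection process produces a much less impressive winner. The selection step does the damage in both cases, but the short sample gives it far more noise to work with.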
Treat data scarcity as the binding constraint it is. Design strategies whose decision frequency matches the sample you can credibly assemble, and discard any result whose confidence interval cannot survive an honest accounting of how many independent observations underlie it.