
In-Sample vs Out-of-Sample Data

In-sample data is the slice of historical price action used to design, calibrate, and select a strategy. Out-of-sample data is the slice withheld from that process, used exactly once to estimate how the strategy behaves on observations the researcher has never seen. The gap between in-sample and out-of-sample performance is the single most direct measurement of overfitting available to a backtester.

The partition

The split is deterministic and chronological: the full historical record is divided by time, never by random shuffle, because shuffling would mix later observations into the IS set and leak information from the period the OOS set is meant to represent. A typical partition reserves the most recent fraction of data as the out-of-sample (OOS) set and uses the older fraction as the in-sample (IS) set. Parameter optimization, feature selection, and model architecture decisions consume only the IS set.
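A time-ordered partition can be sketched in a few lines of Python. This is a minimal illustration rather than any platform's implementation: the function name chronological_split, the (timestamp, value) row layout, and the 30% OOS default are assumptions made for the example.

```python
from datetime import date, timedelta


def chronological_split(rows, oos_fraction=0.3):
    """Split rows into (in_sample, out_of_sample) by time, never by shuffle.

    rows: iterable of (timestamp, value) pairs; sorted here by timestamp.
    oos_fraction: share of the most recent rows held out (assumed default).
    """
    if not 0.0 < oos_fraction < 1.0:
        raise ValueError("oos_fraction must be strictly between 0 and 1")
    ordered = sorted(rows, key=lambda row: row[0])
    n_oos = round(len(ordered) * oos_fraction)
    cut = len(ordered) - n_oos
    if n_oos == 0 or cut == 0:
        raise ValueError("not enough rows to form both sets")
    # Older rows form the IS set, the most recent rows form the OOS set.
    return ordered[:cut], ordered[cut:]


# Ten synthetic daily observations (illustrative data only).
start = date(2020, 1, 1)
rows = [(start + timedelta(days=i), 100.0 + i) for i in range(10)]
in_sample, out_of_sample = chronological_split(rows, oos_fraction=0.3)
print(len(in_sample), len(out_of_sample))
```

Because the cut is a single index into the sorted record, every IS timestamp precedes every OOS timestamp by construction.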

Degradation = (Metric_IS - Metric_OOS) / |Metric_IS|

The degradation ratio is computed per metric (CAGR, Sortino, profit factor, hit rate) and quantifies what fraction of in-sample performance is lost when the strategy meets unseen data. A complementary statistic is the OOS-to-IS ratio, simply Metric_OOS / Metric_IS, where values near 1.0 indicate stable generalization and values near 0 or negative indicate the IS performance was an artifact of fitting. When Metric_IS is positive the two are related by Degradation = 1 - Metric_OOS / Metric_IS; the OOS-to-IS ratio carries this interpretation only for a positive Metric_IS.
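Both statistics are one-liners. The sketch below uses hypothetical function names and made-up Sortino values; refusing a zero IS metric for degradation and a non-positive IS metric for the ratio are guard choices assumed for this example, not part of the definitions.

```python
def degradation(metric_is, metric_oos):
    """(Metric_IS - Metric_OOS) / |Metric_IS|."""
    if metric_is == 0:
        raise ValueError("degradation is undefined when the IS metric is zero")
    return (metric_is - metric_oos) / abs(metric_is)


def oos_to_is_ratio(metric_is, metric_oos):
    """Metric_OOS / Metric_IS, guarded against a non-positive IS metric."""
    if metric_is <= 0:
        # Assumed guard: a negative denominator flips the sign of the ratio
        # and makes "near 1.0 is good" misleading.
        raise ValueError("ratio is not interpretable for a non-positive IS metric")
    return metric_oos / metric_is


# Illustrative values only: IS Sortino 2.0, OOS Sortino 1.5.
print(degradation(2.0, 1.5))      # 0.25
print(oos_to_is_ratio(2.0, 1.5))  # 0.75
```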

How to interpret the gap

Treat the OOS-to-IS ratio as a generalization score. Ratios above 0.7 on a risk-adjusted return metric like Sortino indicate the edge survives the transition to unseen data. Ratios between 0.4 and 0.7 indicate partial overfit: some real signal exists, but parameter sensitivity is inflating IS numbers. Ratios below 0.4, and especially negative ratios, indicate the IS result was largely noise-fitting.
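These bands translate directly into a lookup. The function name and label strings below are hypothetical, and placing the boundary values 0.4 and 0.7 themselves in the middle band is an assumption, since the text does not say which side they belong to.

```python
def classify_generalization(ratio):
    """Map an OOS-to-IS ratio onto the three bands described above."""
    if ratio > 0.7:
        return "edge survives"
    if ratio >= 0.4:  # assumed: both boundary values count as partial overfit
        return "partial overfit"
    return "largely noise-fitting"  # includes negative ratios


for ratio in (0.9, 0.55, 0.2, -0.3):  # illustrative ratios only
    print(ratio, classify_generalization(ratio))
```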

Read the ratio together with the absolute values: a modest IS result that holds up is more credible than a spectacular one that collapses, even when the latter leaves the higher OOS number. A strategy with IS Sortino of 1.2 and OOS Sortino of 1.0 (ratio 0.83) is more credible than one with IS Sortino of 4.0 and OOS Sortino of 1.5 (ratio 0.38), even though either OOS figure would look usable in isolation. Large IS-to-OOS drops from extreme IS values almost always indicate the optimizer found a corner of parameter space that exploits historical accidents.

The OOS set must be touched exactly once. Every time a researcher inspects OOS results and then revises the strategy, the OOS set has been partially absorbed into the IS process. After three or four such revisions, the OOS estimate is functionally worthless.
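One way to make the single-use rule hard to break by accident is to wrap the held-out data in an object that refuses a second read. The class and exception names here are hypothetical and the sketch is not a description of any particular platform's mechanism.

```python
class HoldoutAlreadyUsedError(RuntimeError):
    """Raised when the OOS set is requested more than once."""


class OneShotHoldout:
    """Hypothetical wrapper that releases the OOS rows a single time."""

    def __init__(self, oos_rows):
        self._rows = list(oos_rows)
        self._used = False

    def read(self):
        if self._used:
            raise HoldoutAlreadyUsedError(
                "OOS set already evaluated; another read would make it IS data"
            )
        self._used = True
        return list(self._rows)


holdout = OneShotHoldout([0.004, -0.002, 0.007])  # illustrative returns
final_returns = holdout.read()  # the single permitted evaluation
try:
    holdout.read()  # a second look is refused
except HoldoutAlreadyUsedError as err:
    print(f"blocked: {err}")
```

A guard like this cannot stop a determined researcher from reloading the raw file, but it turns a silent leak into an explicit decision.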

What this split does not capture

A clean IS/OOS split addresses parameter overfitting on a fixed distribution. It does not address regime change: if the OOS period happens to resemble the IS period (both trending, both low-volatility), generalization within similar regimes proves nothing about behavior in regimes neither set contains. The 2017–2019 equity environment is not informative about March 2020, regardless of how the split was constructed.

Nor does it correct for selection bias across strategies. Running 200 candidate strategies through an IS/OOS pipeline and reporting the best OOS performer is statistically equivalent to overfitting on the OOS set itself. The expected best-of-N OOS result on pure noise is positive and can be substantial; this is the multiple-testing problem, and a single split cannot resolve it.
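The best-of-N effect is easy to reproduce with a small simulation. The sketch below rests on stated assumptions: each candidate is pure noise (i.i.d. Gaussian daily returns with zero mean and 1% volatility), the OOS window is 252 trading days, and the annualized Sharpe ratio stands in for whatever metric is being ranked. The function name and all parameters are illustrative.

```python
import random
import statistics


def best_of_n_noise(n_strategies=200, n_days=252, seed=0):
    """Return (best, average) annualized Sharpe over zero-edge strategies."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    sharpes = []
    for _ in range(n_strategies):
        # Daily returns with no edge at all: mean 0, volatility 1% (assumed).
        daily = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpe = statistics.mean(daily) / statistics.stdev(daily) * (252 ** 0.5)
        sharpes.append(sharpe)
    return max(sharpes), statistics.mean(sharpes)


best, average = best_of_n_noise()
print(f"average Sharpe across candidates: {average:.2f}")
print(f"best Sharpe among candidates:     {best:.2f}")
```

The average sits close to zero, as it should for strategies with no edge, while the maximum is clearly positive. Reporting only the maximum is what turns noise into an apparent result.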

The split also tells you nothing about transaction-cost robustness, capacity, slippage modeling accuracy, or whether the data feed itself contains survivorship bias or look-ahead leakage. A strategy with a clean OOS result built on biased data is still broken.

An OOS result confirms the absence of one specific failure mode — parameter overfitting on the available distribution. It does not certify the strategy. Walk-forward analysis, regime-stratified evaluation, and Deflated Sharpe Ratio adjustments address what a simple split cannot.

Presentation in Kestrel Signal

Kestrel Signal enforces the IS/OOS boundary at the backtest engine level. The OOS window is locked at strategy creation and is invisible to the parameter optimizer; attempted reads during optimization raise a pipeline error rather than silently leaking. Backtest reports display IS and OOS metrics side-by-side with the OOS-to-IS ratio computed per metric, and the platform tracks the number of times each OOS window has been queried to flag strategies whose OOS estimates have lost statistical meaning through repeated inspection.