Backtesting

Testing a predictive model on historical data


Backtesting is a term used in modeling to refer to testing a predictive model on historical data. Backtesting is a type of retrodiction, and a special type of cross-validation applied to previous time period(s). In quantitative finance, backtesting is an important step before deploying algorithmic strategies in live markets.

Financial analysis


In the economic and financial field, backtesting seeks to estimate the performance of a strategy or model if it had been employed during a past period. This requires simulating past conditions with sufficient detail, so one limitation of backtesting is the need for detailed historical data. A second limitation is the inability to model strategies that would themselves have affected historical prices. Finally, backtesting, like other modeling, is limited by potential overfitting: it is often possible to find a strategy that would have worked well in the past but will not work well in the future.[1] Despite these limitations, backtesting provides information not available when models and strategies are tested on synthetic data.

Historically, backtesting was performed only by large institutions and professional money managers, owing to the expense of obtaining and using detailed datasets. More recently, backtesting has come into wider use, and independent web-based backtesting platforms have emerged.[2] Although the technique is widely used, it is prone to weaknesses.[3] Basel financial regulations require large financial institutions to backtest certain risk models.

For a 1-day Value at Risk at 99% confidence backtested over 250 consecutive days, the test is classified as green, orange or red according to the cumulative probability of the observed number of exceptions (green: 0-95%, orange: 95-99.99%, red: 99.99-100%), as set out in the following table:[4]

(Table: backtesting exceptions, 1-day VaR over 250 days, by zone and number of exceptions.)
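The zone boundaries follow from the cumulative binomial distribution of the exception count under a correctly specified model. A minimal sketch, assuming SciPy is available (the function name is illustrative):

```python
from scipy.stats import binom

def traffic_light_zone(exceptions, days=250, var_level=0.99):
    """Basel traffic-light zone for a 1-day VaR backtest.

    Under a correct 99% VaR model, the number of exceptions observed
    over 250 days is Binomial(250, 0.01); the zone is determined by
    the cumulative probability of at most that many exceptions.
    """
    p = binom.cdf(exceptions, days, 1 - var_level)
    if p < 0.95:
        return "green"
    elif p < 0.9999:
        return "orange"
    return "red"

# With these thresholds, 0-4 exceptions are green, 5-9 orange, 10+ red.
print([traffic_light_zone(k) for k in range(12)])
```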

For a 10-day Value at Risk at 99% confidence backtested over 250 consecutive days, the test is classified in the same way (green: 0-95%, orange: 95-99.99%, red: 99.99-100%), as set out in the following table:

(Table: backtesting exceptions, 10-day VaR over 250 days, by zone and number of exceptions.)

Backtesting through cross-validation in finance


Traditional backtesting evaluates a strategy on a single historical path. Although intuitive, this approach is sensitive to regime changes, path dependence, and look-ahead leakage. To address these limitations, practitioners adapt cross-validation (CV) methods to time-ordered financial data. Because financial observations are not independent and identically distributed (IID), randomized CV is inappropriate, motivating the use of specialized temporal CV procedures.[5]

Walk-forward / rolling-window backtesting

Walk-forward analysis divides historical data into sequential training and testing windows. A model is trained on an initial in-sample period, tested on the subsequent period, and the window is rolled forward repeatedly.[5]
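A minimal sketch of such a rolling split (the window lengths are illustrative assumptions):

```python
import numpy as np

def walk_forward_splits(n_obs, train_len, test_len):
    """Yield (train, test) index arrays for a rolling walk-forward backtest.

    The training window always precedes the test window, and the pair
    is rolled forward by one test length at a time.
    """
    start = 0
    while start + train_len + test_len <= n_obs:
        train = np.arange(start, start + train_len)
        test = np.arange(start + train_len, start + train_len + test_len)
        yield train, test
        start += test_len

# Example: 1,000 observations, retrain on 250 days, test on the next 50.
for train_idx, test_idx in walk_forward_splits(1000, 250, 50):
    pass  # fit the model on train_idx, evaluate on test_idx
```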

Advantages

  • Provides a clear historical interpretation, as each testing period mirrors a realistic paper-trading scenario.[5]
  • Avoids look-ahead bias because the training set always predates the testing set; with trailing windows and proper purging, test samples remain fully out-of-sample.[6]
  • Enables robustness assessment across market regimes through periodic reoptimization, adapting to evolving volatility and price dynamics.[7]

Limitations

  • Relies on a single historical path, making results sensitive to sequencing and increasing overfitting risk.[8]
  • May not generalize to alternative market orderings, as reversing observations often yields inconsistent outcomes.[5]
  • Provides limited out-of-sample evaluation because each window uses only a subset of observations.[5]
  • Frequent reoptimization may overfit transient structures, overstating robustness.[7]

Purged cross-validation (with embargoing)

Purged cross-validation adapts k-fold CV to financial series by purging observations whose label-formation overlaps with the test fold and applying an embargo to avoid leakage from serial dependence.[6] Its purpose is not historical accuracy but evaluation across multiple out-of-sample stress scenarios.[5]
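A simplified sketch of purging and embargoing around a single test fold (the label span and embargo fraction are illustrative assumptions; the full procedure is described by López de Prado[6]):

```python
import numpy as np

def purged_train_indices(n_obs, test_idx, label_span, embargo_pct=0.01):
    """Training indices for one fold of purged k-fold cross-validation.

    Drops training observations whose label window [t, t + label_span]
    overlaps the test fold (purging), plus a buffer of observations
    immediately after the fold (embargo), to limit leakage caused by
    serial dependence.
    """
    embargo = int(n_obs * embargo_pct)
    test_start, test_end = test_idx[0], test_idx[-1]
    train = []
    for t in range(n_obs):
        if test_start <= t <= test_end + embargo:
            continue  # inside the test fold or its embargo buffer
        if t < test_start and t + label_span >= test_start:
            continue  # label formed at t would overlap the test fold
        train.append(t)
    return np.array(train)

# Example: 1,000 observations, test fold at indices 400-499,
# labels formed over the following 5 observations.
train_idx = purged_train_indices(1000, np.arange(400, 500), label_span=5)
```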

Advantages

  • Evaluates strategies across many alternative out-of-sample scenarios rather than one historical path.[5]
  • Uses each sample exactly once for testing, achieving maximal out-of-sample usage.
  • Prevents leakage through purging and embargoing.

Limitations

  • The training set does not trail the testing set, requiring careful purging/embargo to prevent leakage.[6]
  • Reduces effective sample size when labels span long periods.[5]
  • Still produces a single forecast per observation, yielding one performance path.

Combinatorial purged cross-validation (CPCV)

Combinatorial purged cross-validation partitions a time series into non-overlapping groups and evaluates combinations of these groups as test sets. Each fold is purged and embargoed, yielding a distribution of performance estimates and reducing selection bias inherent in walk-forward and standard CV methods.[5]
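A sketch of the combinatorial structure (the group counts are illustrative assumptions; purging and embargoing at group boundaries are omitted for brevity):

```python
import numpy as np
from itertools import combinations

def cpcv_splits(n_obs, n_groups=6, n_test_groups=2):
    """Yield (train, test) indices for combinatorial cross-validation.

    The series is cut into n_groups contiguous blocks, and every
    combination of n_test_groups blocks serves once as the test set.
    """
    groups = np.array_split(np.arange(n_obs), n_groups)
    for test_ids in combinations(range(n_groups), n_test_groups):
        test = np.concatenate([groups[g] for g in test_ids])
        train = np.concatenate([groups[g] for g in range(n_groups)
                                if g not in test_ids])
        yield train, test

# C(6, 2) = 15 splits; each observation is tested C(5, 1) = 5 times,
# and the test forecasts recombine into 15 * 2 / 6 = 5 backtest paths.
splits = list(cpcv_splits(1200))
```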

Advantages

  • Produces a distribution of performance statistics rather than a single path, improving inference.[5]
  • Lowers variance in Sharpe ratio estimates by averaging across many nearly uncorrelated paths.
  • Reduces sensitivity to specific windows or local market regimes.
  • Used to compute the Probability of Backtest Overfitting (PBO).[9]

Limitations

  • Computationally intensive due to the number of path combinations.[5]
  • Requires selecting the number and size of groups, which affects variance.
  • More complex to implement and typically relies on custom tooling.

Backtest statistics in quantitative finance


Backtests often produce performance metrics that appear statistically significant even when driven by noise. Because financial returns have a low signal-to-noise ratio, non-normal characteristics, and regime dependence, backtest evaluation requires statistics that adjust for multiple trials, selection bias, and sampling error.[10]

General characteristics

General structural characteristics affecting reliability include:[10]

  • Time range and number of market regimes: the backtest should span multiple market regimes so that the strategy's performance is reasonably robust
  • Average assets under management (AUM): a strategy managing larger AUM must absorb higher liquidity costs while maintaining capacity
  • Capacity constraints and market impact: capacity measures how much capital a strategy can trade before its performance degrades from market impact
  • Leverage: the amount of borrowing the strategy implicitly uses to generate its targeted returns; leverage amplifies both return and risk, and borrowing costs must be justified by excess performance
  • Maximum position size and concentration: whether the strategy occasionally takes oversized bets relative to its typical AUM; strategies that rely on rare, extremely large positions are less stable and more exposed to tail events
  • Ratio of long positions: a market-neutral strategy should be roughly balanced (≈50% long); a persistent tilt suggests exposure to systematic risk (beta) rather than pure alpha
  • Frequency of independent bets: how often the strategy identifies independent opportunities
  • Average holding period: short holding periods imply high trading costs and lower capacity, while long holding periods imply stronger persistence of the underlying signal; this reflects the trade-off between agility and cost efficiency
  • Annualized turnover: how intensively the strategy trades relative to its capital base; high turnover implies high transaction costs and constrained capacity, while low turnover is cost-efficient but may react more slowly
  • Correlation to the asset universe: high correlation indicates the strategy is essentially repackaged beta, while low or negative correlation indicates diversifying or hedging properties; correlation reveals whether returns are true alpha or mere market exposure


Performance

Basic performance measures include the following (a sketch computing several of them appears after the list):[10]

  • Profit and loss (PnL): the change in the value of a position over a period of time
  • Long-side PnL: the portion of PnL generated by long positions
  • Annualized return / CAGR: the geometric average return of an investment over a period of time
  • Hit ratio: the percentage of trades that are profitable
  • Average gain vs. average loss: the average return of profitable trades compared with the average return of loss-making trades
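A minimal sketch computing several of these measures from per-trade returns (the input format and function name are illustrative assumptions):

```python
import numpy as np

def performance_summary(trade_returns, years):
    """Basic backtest performance measures from a series of trade returns."""
    r = np.asarray(trade_returns)
    growth = np.prod(1 + r)                  # cumulative growth factor
    cagr = growth ** (1 / years) - 1         # geometric annualized return
    hit_ratio = np.mean(r > 0)               # share of profitable trades
    avg_gain = r[r > 0].mean() if (r > 0).any() else 0.0
    avg_loss = r[r < 0].mean() if (r < 0).any() else 0.0
    return {"CAGR": cagr, "hit ratio": hit_ratio,
            "avg gain": avg_gain, "avg loss": avg_loss}

print(performance_summary([0.02, -0.01, 0.03, -0.005, 0.01], years=1.0))
```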

Time-weighted rate of return

The time-weighted rate of return (TWRR) is a measure of investment performance that isolates the return generated by the portfolio itself, independent of external cash flows. It divides the performance into subperiods defined by deposits or withdrawals and compounds the returns of those subperiods, ensuring that each interval contributes equally to the final result. Because TWRR removes the effect of investor-driven cash flows, it is commonly used to evaluate asset managers and compare investment strategies. This contrasts with the CAGR, which reflects the growth of an investor’s actual account value and is therefore sensitive to the timing and size of contributions and withdrawals.
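A minimal sketch of the calculation, assuming the portfolio is valued immediately before each external cash flow (the input layout is an illustrative assumption):

```python
def twrr(valuations, cash_flows):
    """Time-weighted rate of return.

    valuations[i] is the portfolio value at the end of subperiod i,
    measured just before cash_flows[i] is deposited or withdrawn.
    Each subperiod return is computed net of the prior cash flow,
    then compounded so that every interval contributes equally.
    """
    r = 1.0
    for i in range(1, len(valuations)):
        start_value = valuations[i - 1] + cash_flows[i - 1]
        r *= valuations[i] / start_value
    return r - 1

# Start at 100, grow to 110, receive a 50 deposit, end at 168:
# TWRR compounds the 10% and 5% subperiod returns into 15.5%.
print(twrr([100, 110, 168], [0, 50, 0]))
```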

Runs and drawdowns

Most investment strategies do not generate returns from an independent and identically distributed (IID) process. Because returns are not IID, they often exhibit sequences of same-direction outcomes, known as runs. For example, +1%, +0.8%, +0.5% form a positive run, while –0.7%, –1.2%, –0.4% form a negative run. Negative runs can significantly amplify downside risk, so averages or standard deviations alone are insufficient to assess a strategy's true risk profile. Instead, one must rely on risk measures that capture the impact of persistent patterns, such as the following (a sketch computing several of them appears after the list):[10]

  • Runs of same-sign returns: sequences of consecutive positive or consecutive negative returns, reflecting the tendency of returns to cluster rather than alternate independently
  • Return concentration (e.g., Herfindahl–Hirschman index): the degree to which a portfolio’s total performance is driven by a small number of large returns
  • Drawdowns: declines in portfolio value from a historical peak to a subsequent trough, used to assess the magnitude of losses during adverse periods
  • Time under water (TuW): the duration a portfolio remains below its previous peak, indicating the length of recovery following a drawdown
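A minimal sketch of these run-based measures from a series of periodic returns (the input format is an illustrative assumption):

```python
import numpy as np

def run_risk_measures(returns):
    """Maximum drawdown, time under water, and return concentration."""
    r = np.asarray(returns)
    wealth = np.cumprod(1 + r)               # cumulative portfolio value
    peak = np.maximum.accumulate(wealth)     # running historical peak
    drawdown = 1 - wealth / peak             # decline from the peak
    # Longest stretch of consecutive periods spent below a prior peak
    tuw, longest = 0, 0
    for below in drawdown > 0:
        tuw = tuw + 1 if below else 0
        longest = max(longest, tuw)
    # Herfindahl-Hirschman-style concentration of the positive returns
    pos = r[r > 0]
    hhi = np.sum((pos / pos.sum()) ** 2) if pos.size else float("nan")
    return drawdown.max(), longest, hhi

max_dd, max_tuw, hhi = run_risk_measures([0.01, -0.02, -0.01, 0.03, 0.01])
```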

Implementation shortfall

Implementation shortfall measures the erosion of performance due to execution frictions, including the following (a per-trade sketch appears after the list):[10]

  • Brokerage fees: the explicit transaction charges imposed by brokers for executing trades
  • Slippage: the difference between the expected transaction price and the actual execution price, typically arising from market impact and short-term price movements
  • Dollar PnL per turnover: the PnL generated per unit of portfolio turnover
  • Return on execution costs: a performance metric comparing the strategy’s returns to the costs incurred from trading (the ratio between return and execution cost), indicating whether the generated alpha sufficiently compensates for execution expenses
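A per-trade sketch in the spirit of Perold's implementation-shortfall decomposition, comparing the decision price with the realized execution (the inputs are illustrative assumptions):

```python
def implementation_shortfall(decision_price, exec_price, quantity, fees):
    """Cost of executing a buy order versus the frictionless paper trade.

    Slippage is the adverse move between the price at decision time and
    the achieved execution price; fees are the explicit brokerage costs.
    """
    slippage = (exec_price - decision_price) * quantity
    return slippage + fees

# Decide to buy 1,000 shares at 50.00, fill at 50.06, pay 10 in fees:
# shortfall = 0.06 * 1000 + 10 = 70 currency units.
print(implementation_shortfall(50.00, 50.06, 1000, 10.0))
```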

Efficiency metrics

  • Sharpe ratio: a risk-adjusted performance measure that evaluates excess returns per unit of total return volatility
  • Information ratio: a strategy’s active returns relative to its tracking error
  • Probabilistic Sharpe Ratio (PSR): a statistical adjustment of the Sharpe ratio that estimates the probability that an observed Sharpe ratio exceeds a given threshold after accounting for estimation error (see the sketch after this list)
  • Deflated Sharpe Ratio (DSR): an adjusted Sharpe ratio that corrects for selection bias under multiple testing and non-normal return distributions[10]
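A sketch of the PSR calculation of Bailey and López de Prado, which widens the Sharpe ratio's standard error to account for short samples, skewness, and fat tails (the function name is an assumption):

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def probabilistic_sharpe_ratio(returns, sr_benchmark=0.0):
    """Probability that the true Sharpe ratio exceeds sr_benchmark.

    Both Sharpe ratios are per period (not annualized); the standard
    error grows with negative skewness and with excess kurtosis.
    """
    r = np.asarray(returns)
    n = len(r)
    sr = r.mean() / r.std(ddof=1)
    g3 = skew(r)                      # sample skewness
    g4 = kurtosis(r, fisher=False)    # sample kurtosis (normal = 3)
    denom = np.sqrt(1 - g3 * sr + (g4 - 1) / 4 * sr ** 2)
    return norm.cdf((sr - sr_benchmark) * np.sqrt(n - 1) / denom)
```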

Overfitting and validation

Backtests are vulnerable to overfitting when many variations are tested. The Probability of Backtest Overfitting (PBO) quantifies this risk, often using CPCV.[9]
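The PBO can be estimated with combinatorially symmetric cross-validation (CSCV). A simplified sketch following that procedure in outline (the block count and the Sharpe-like score are illustrative simplifications):

```python
import numpy as np
from itertools import combinations

def probability_of_backtest_overfitting(returns, n_blocks=8):
    """Estimate PBO from a (T, N) matrix of returns for N strategy trials.

    For every split of the blocks into in-sample/out-of-sample halves,
    the trial that looks best in-sample is ranked out-of-sample; PBO is
    the share of splits where it falls below the out-of-sample median.
    """
    T, N = returns.shape
    blocks = np.array_split(np.arange(T), n_blocks)
    logits = []
    for in_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[i] for i in in_ids])
        oos_rows = np.concatenate([blocks[i] for i in range(n_blocks)
                                   if i not in in_ids])
        sr_is = returns[is_rows].mean(0) / returns[is_rows].std(0)
        sr_oos = returns[oos_rows].mean(0) / returns[oos_rows].std(0)
        best = np.argmax(sr_is)                        # in-sample winner
        rank = np.argsort(np.argsort(sr_oos))[best] + 1
        omega = rank / (N + 1)                         # relative OOS rank
        logits.append(np.log(omega / (1 - omega)))
    return np.mean(np.array(logits) <= 0)
```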

Classification-based metrics

Machine-learning-based strategies are additionally evaluated with standard classification metrics such as accuracy, precision, recall, the F1-score, and negative log-loss.[10]

Attribution

Performance attribution decomposes returns across risk categories (e.g., duration, credit, liquidity, sector).[10]


Limitations and pitfalls


Backtesting is widely used to evaluate historical performance, but it is vulnerable to several sources of error. Because backtests rely on historical data rather than controlled experiments, they cannot establish causality and may reflect patterns that occurred by chance.[11][12]

Common sources of error

  • Survivorship bias
  • Look-ahead bias
  • Data mining
  • Ignoring realistic transaction costs
  • Dependence on outliers
  • Shorting constraints
  • Hidden risks (liquidity, funding)
  • Non-representative sample periods
  • Ignoring drawdowns

These issues reduce reliability even before considering sampling error or the risk of overfitting.[11]

Limits of backtesting as a research tool

Backtesting is frequently misused as an idea-generation tool. A backtest can evaluate a fully specified strategy but cannot explain why it should work or whether the economic rationale will persist. Iteratively modifying models in response to backtest outcomes increases the likelihood of overfitting, producing results that do not generalize out of sample.[12][11]

Practical recommendations

  • Perform data cleaning and feature engineering before any backtest.
  • Record all experiments to approximate the effective number of trials.
  • Favor broad insights over security-specific patterns.
  • Use cross-validation methods (walk-forward, purged CV, CPCV).
  • Use ensembles or bagging to detect unstable models.
  • Use alternative data partitions, simulations, or scenario analysis.
  • Restart research instead of repeatedly tuning a single model.

While none of these practices fully eliminate overfitting, they help identify strategies with a higher likelihood of out-of-sample validity.[11]


Hindcast

(Figure: temporal representation of hindcasting.[13])

In oceanography[14] and meteorology,[15] backtesting is also known as hindcasting: a hindcast is a way of testing a mathematical model; researchers enter known or closely estimated inputs for past events into the model to see how well the output matches the known results.

Hindcasting usually refers to a numerical-model integration of a historical period in which no observations have been assimilated, which distinguishes a hindcast run from a reanalysis. Oceanographic observations of salinity and temperature, as well as observations of surface-wave parameters such as the significant wave height, are much scarcer than meteorological observations, making hindcasting more common in oceanography than in meteorology. Moreover, since surface waves represent a forced system in which the wind is the only generating force, wave hindcasting is often considered adequate for producing a reasonable representation of the wave climate with little need for a full reanalysis. Hydrologists use hindcasting to model stream flows.[16]

An example of hindcasting would be entering climate forcings (events that force change) into a climate model. If the hindcast showed a reasonably accurate climate response, the model would be considered successful.

The ECMWF re-analysis is an example of a combined atmospheric reanalysis coupled with a wave-model integration where no wave parameters were assimilated, making the wave part a hindcast run.


See also


References
