Backtesting
Testing a predictive model on historical data
Backtesting is the testing of a predictive model on historical data. It is a type of retrodiction and a special type of cross-validation applied to previous time periods. In quantitative finance, backtesting is an important step before deploying algorithmic strategies in live markets.
Financial analysis
In the economic and financial field, backtesting seeks to estimate the performance of a strategy or model if it had been employed during a past period. This requires simulating past conditions with sufficient detail, making one limitation of backtesting the need for detailed historical data. A second limitation is the inability to model strategies that would affect historic prices. Finally, backtesting, like other modeling, is limited by potential overfitting. That is, it is often possible to find a strategy that would have worked well in the past, but will not work well in the future.[1] Despite these limitations, backtesting provides information not available when models and strategies are tested on synthetic data.
Historically, backtesting was only performed by large institutions and professional money managers due to the expense of obtaining and using detailed datasets. However, backtesting is increasingly used on a wider basis, and independent web-based backtesting platforms have emerged.[2] Although the technique is widely used, it is prone to weaknesses.[3] Basel financial regulations require large financial institutions to backtest certain risk models.
For a 1-day Value at Risk at 99% confidence backtested over 250 consecutive days, the test is classified as green, orange or red according to the cumulative probability of the observed number of exceptions: green below 95%, orange from 95% to 99.99%, and red above 99.99%.[4]

For a 10-day Value at Risk at 99% confidence backtested over 250 consecutive days, the same zone boundaries (95% and 99.99%) apply.
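This classification can be illustrated by computing the cumulative probability of the observed number of exceptions. The following is a minimal Python sketch, assuming that exceptions of a correctly specified 1-day, 99% VaR model over 250 days follow a Binomial(250, 0.01) distribution; the helper name is illustrative.

```python
from scipy.stats import binom

def traffic_light_zone(exceptions, days=250, var_level=0.99):
    """Classify a VaR backtest by the cumulative probability of observing
    at most `exceptions` exceedances, assuming exceedances of a correctly
    specified model are Binomial(days, 1 - var_level)."""
    cumulative = binom.cdf(exceptions, days, 1.0 - var_level)
    if cumulative < 0.95:
        return "green"
    if cumulative < 0.9999:
        return "orange"
    return "red"

# With these parameters the zones correspond to 0-4 exceptions (green),
# 5-9 exceptions (orange) and 10 or more exceptions (red).
for k in range(12):
    print(k, traffic_light_zone(k))
```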

Backtesting through cross-validation in finance
Traditional backtesting evaluates a strategy on a single historical path. Although intuitive, this approach is sensitive to regime changes, path dependence, and look-ahead leakage. To address these limitations, practitioners adapt cross-validation (CV) methods to time-ordered financial data. Because financial observations are not independent and identically distributed (IID), randomized CV is inappropriate, motivating the use of specialized temporal CV procedures.[5]
Walk-forward / rolling-window backtesting
Walk-forward analysis divides historical data into sequential training and testing windows. A model is trained on an initial in-sample period, tested on the subsequent period, and the window is rolled forward repeatedly.[5]
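The following is a minimal Python sketch of how such rolling windows can be generated; the window sizes and the helper name are illustrative rather than standard.

```python
import numpy as np

def walk_forward_splits(n_obs, train_size, test_size):
    """Yield (train, test) index arrays for a rolling walk-forward backtest:
    each model is fit on `train_size` observations and evaluated on the
    `test_size` observations that immediately follow, then the window rolls
    forward by one test block."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = np.arange(start, start + train_size)
        test = np.arange(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

# Example: 1,000 observations, 252-day training windows, 21-day test windows.
for train, test in walk_forward_splits(1000, train_size=252, test_size=21):
    pass  # fit the model on `train`, record out-of-sample performance on `test`
```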
Advantages
- Provides a clear historical interpretation, as each testing period mirrors a realistic paper-trading scenario.[5]
- Avoids look-ahead bias because the training set always predates the testing set; with trailing windows and proper purging, test samples remain fully out-of-sample.[6]
- Enables robustness assessment across market regimes through periodic reoptimization, adapting to evolving volatility and price dynamics.[7]
Limitations
- Relies on a single historical path, making results sensitive to sequencing and increasing overfitting risk.[8]
- May not generalize to alternative market orderings, as reversing observations often yields inconsistent outcomes.[5]
- Provides limited out-of-sample evaluation because each window uses only a subset of observations.[5]
- Frequent reoptimization may overfit transient structures, overstating robustness.[7]
Purged cross-validation (with embargoing)
Purged cross-validation adapts k-fold CV to financial series by purging observations whose label-formation overlaps with the test fold and applying an embargo to avoid leakage from serial dependence.[6] Its purpose is not historical accuracy but evaluation across multiple out-of-sample stress scenarios.[5]
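A simplified Python sketch of purging and embargoing is shown below; it assumes every label spans a fixed number of bars after its observation, whereas implementations described in the literature track per-observation label end times.

```python
import numpy as np

def purged_kfold_splits(n_obs, n_folds=5, label_horizon=5, embargo=5):
    """Yield (train, test) index arrays for purged k-fold cross-validation.

    Training observations whose labels overlap the test fold are purged, and
    an embargo of `embargo` bars after the test fold is dropped to limit
    leakage from serial dependence.  Each label is assumed to span
    `label_horizon` bars after its observation (a simplification)."""
    folds = np.array_split(np.arange(n_obs), n_folds)
    for test in folds:
        test_start, test_end = test[0], test[-1]
        train = []
        for i in range(n_obs):
            label_end = i + label_horizon
            overlaps_test = not (label_end < test_start or i > test_end)
            in_embargo = test_end < i <= test_end + embargo
            if not overlaps_test and not in_embargo:
                train.append(i)
        yield np.array(train), test
```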
Advantages
- Evaluates strategies across many alternative out-of-sample scenarios rather than one historical path.[5]
- Uses each sample exactly once for testing, achieving maximal out-of-sample usage.
- Prevents leakage through purging and embargoing.
Limitations
- Test folds are not in chronological order, so results lack the direct historical interpretation of a walk-forward backtest.[5]
- Still produces a single backtest path per configuration, so selection bias across multiple trials is only partially addressed.[5]
Combinatorial purged cross-validation (CPCV)
Combinatorial purged cross-validation partitions a time series into non-overlapping groups and evaluates combinations of these groups as test sets. Each fold is purged and embargoed, yielding a distribution of performance estimates and reducing selection bias inherent in walk-forward and standard CV methods.[5]
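A Python sketch of the combinatorial split generation follows; purging and embargoing at the group boundaries are omitted for brevity, and the group counts are illustrative.

```python
from itertools import combinations
import numpy as np

def cpcv_splits(n_obs, n_groups=6, n_test_groups=2):
    """Yield (train, test) index arrays for combinatorial cross-validation:
    the sample is partitioned into `n_groups` contiguous groups and every
    combination of `n_test_groups` groups serves once as the test set.
    Purging and embargoing (omitted here) are applied at each group boundary."""
    groups = np.array_split(np.arange(n_obs), n_groups)
    for test_ids in combinations(range(n_groups), n_test_groups):
        test = np.concatenate([groups[g] for g in test_ids])
        train = np.concatenate([groups[g] for g in range(n_groups)
                                if g not in test_ids])
        yield train, test

# With 6 groups and 2 test groups there are C(6, 2) = 15 splits, whose test
# blocks can be recombined into 2/6 * 15 = 5 distinct backtest paths.
```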
Advantages
- Produces a distribution of performance statistics rather than a single path, improving inference.[5]
- Lowers variance in Sharpe ratio estimates by averaging across many nearly uncorrelated paths.
- Reduces sensitivity to specific windows or local market regimes.
- Used to compute the Probability of backtest overfitting (PBO).[9]
Limitations
- Computationally intensive due to the number of path combinations.[5]
- Requires selecting the number and size of groups, which affects variance.
- More complex to implement and typically relies on custom tooling.
Backtest statistics in quantitative finance
Backtests often produce performance metrics that appear statistically significant even when driven by noise. Because financial returns have low signal-to-noise ratio, non-normal characteristics, and regime dependence, backtest evaluation requires statistics that adjust for multiple trials, selection bias, and sampling error.[10]
General characteristics
General structural characteristics affecting reliability include:[10]
- Time range and number of market regimes: The time range of a backtest must span multiple market regimes to ensure the strategy's performance is reasonably robust.
- Average assets under management (AUM): A strategy managing larger AUM must be able to absorb liquidity costs and maintain capacity.
- Capacity constraints and market impact: Capacity measures how much capital a strategy can trade before its performance degrades from market impact.
- Leverage: How much borrowing the strategy implicitly uses to generate its targeted returns. Leverage amplifies both return and risk, and borrowing costs must be justified by excess performance.
- Maximum position size and concentration: Whether the strategy occasionally takes oversized bets relative to its typical AUM. Strategies that rely on rare, extremely large positions are less stable and more exposed to tail events.
- Ratio of long positions: A market-neutral strategy should be roughly balanced (≈50% long); a persistent tilt suggests exposure to systematic risk (beta) rather than pure alpha.
- Frequency of independent bets: How often the strategy identifies independent opportunities.
- Average holding period: Short holding periods imply higher trading costs and lower capacity, while long holding periods imply stronger persistence of the underlying signal; this reflects the trade-off between agility and cost efficiency.
- Annualized turnover: How intensively the strategy trades relative to its capital base. High turnover implies high transaction costs and tighter capacity constraints, whereas low turnover is more cost-efficient but may react more slowly.
- Correlation to the asset universe: High correlation indicates that the strategy is essentially repackaged beta, while low or negative correlation indicates diversifying or hedging properties; correlation reveals whether returns reflect true alpha or merely market exposure.
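Two of these characteristics, annualized turnover and correlation to the asset universe, can be computed directly from position and return histories. The following is a minimal sketch assuming daily data held in NumPy arrays; the helper names and conventions are illustrative.

```python
import numpy as np

def annualized_turnover(dollar_positions, aum, periods_per_year=252):
    """Average traded dollar value per period (absolute position changes
    summed across assets), annualized and scaled by average AUM."""
    traded = np.abs(np.diff(dollar_positions, axis=0)).sum(axis=1)
    return traded.mean() * periods_per_year / aum

def correlation_to_universe(strategy_returns, universe_returns):
    """Pearson correlation between the strategy's returns and the returns of
    its underlying asset universe (e.g., an equal-weighted benchmark)."""
    return float(np.corrcoef(strategy_returns, universe_returns)[0, 1])
```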
Performance
- Profit and loss: The change in the value of a position over a period of time.
- Long-side PnL: The portion of PnL generated by long positions.
- Annualized return/CAGR: Geometric average return of an investment over a period of time.
- Hit ratio: The percentage of profitable trades.
- Average gain vs. average loss: The average return generated from profitable/loss-making trades.
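A minimal sketch of these measures, assuming per-period simple returns and per-trade PnL figures are available as arrays; the helper names are illustrative.

```python
import numpy as np

def cagr(returns, periods_per_year=252):
    """Annualized geometric average growth rate from per-period simple returns."""
    returns = np.asarray(returns, dtype=float)
    years = len(returns) / periods_per_year
    return np.prod(1.0 + returns) ** (1.0 / years) - 1.0

def hit_ratio(trade_pnls):
    """Fraction of trades that closed with a profit."""
    trade_pnls = np.asarray(trade_pnls, dtype=float)
    return float((trade_pnls > 0).mean())

def average_gain_and_loss(trade_pnls):
    """Mean PnL of winning trades and mean PnL of losing trades."""
    trade_pnls = np.asarray(trade_pnls, dtype=float)
    return trade_pnls[trade_pnls > 0].mean(), trade_pnls[trade_pnls < 0].mean()
```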
Time-weighted rate of return
The time-weighted rate of return (TWRR) is a measure of investment performance that isolates the return generated by the portfolio itself, independent of external cash flows. It divides the performance into subperiods defined by deposits or withdrawals and compounds the returns of those subperiods, ensuring that each interval contributes equally to the final result. Because TWRR removes the effect of investor-driven cash flows, it is commonly used to evaluate asset managers and compare investment strategies. This contrasts with the CAGR, which reflects the growth of an investor’s actual account value and is therefore sensitive to the timing and size of contributions and withdrawals.
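The contrast can be illustrated with a short sketch that compounds subperiod returns between external cash flows, here assumed to occur at the start of each subperiod; the helper is hypothetical.

```python
def time_weighted_return(end_values, cash_flows):
    """TWRR from end-of-subperiod portfolio values and the external cash flow
    occurring at the start of each subperiod (deposits positive, withdrawals
    negative).  Subperiod growth factors are compounded so that each interval
    contributes equally, regardless of its size."""
    growth, start_value = 1.0, 0.0
    for flow, end_value in zip(cash_flows, end_values):
        begin = start_value + flow       # value immediately after the cash flow
        growth *= end_value / begin      # subperiod growth factor
        start_value = end_value
    return growth - 1.0

# Example: $100 grows to $110, a $90 deposit raises the account to $200,
# which then falls to $180.  TWRR = (110/100) * (180/200) - 1 = -1.0%,
# even though the account holds $180 after $190 of total contributions.
print(time_weighted_return([110.0, 180.0], [100.0, 90.0]))
```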
Runs and drawdowns
Most investment strategies do not generate returns from an independent and identically distributed (IID) process; as a result, returns often exhibit sequences of same-direction outcomes, known as runs. For example, the sequence +1%, +0.8%, +0.5% forms a positive run, while –0.7%, –1.2%, –0.4% forms a negative run. Negative runs can significantly amplify downside risk, meaning that averages or standard deviations alone are insufficient to assess a strategy's true risk profile. Instead, one must rely on risk measures that capture the impact of persistent patterns:[10]
- Runs of same-sign returns: sequences of consecutive positive or consecutive negative returns, reflecting the tendency of returns to cluster rather than alternate independently
- Return concentration (e.g., Herfindahl–Hirschman index): the degree to which a portfolio’s total performance is driven by a small number of large returns
- Drawdowns: declines in portfolio value from a historical peak to a subsequent trough, used to assess the magnitude of losses during adverse periods
- Time under water (TuW): the duration a portfolio remains below its previous peak, indicating the length of recovery following a drawdown
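These measures can be sketched as follows, assuming an equity curve and per-period returns as NumPy arrays; the concentration statistic shown is a simple Herfindahl-Hirschman-style measure rather than any particular published normalization.

```python
import numpy as np

def drawdown_series(equity):
    """Fractional decline of an equity curve from its running peak."""
    equity = np.asarray(equity, dtype=float)
    return equity / np.maximum.accumulate(equity) - 1.0

def max_time_under_water(equity):
    """Longest number of consecutive periods spent below a previous peak."""
    longest = current = 0
    for below in drawdown_series(equity) < 0:
        current = current + 1 if below else 0
        longest = max(longest, current)
    return longest

def return_concentration(returns):
    """Herfindahl-Hirschman-style concentration of positive returns: values
    near 1 indicate that a few large gains drive total performance."""
    pos = np.asarray(returns, dtype=float)
    pos = pos[pos > 0]
    weights = pos / pos.sum()
    return float(np.sum(weights ** 2))

def longest_negative_run(returns):
    """Length of the longest streak of consecutive negative returns."""
    longest = current = 0
    for r in returns:
        current = current + 1 if r < 0 else 0
        longest = max(longest, current)
    return longest
```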
Implementation shortfall
Implementation shortfall measures the erosion of performance due to execution frictions:[10]
- Brokerage fees: the explicit transaction charges imposed by brokers for executing trades
- Slippage: the difference between the expected transaction price and the actual execution price, typically arising from market impact and short-term price movements
- Dollar PnL per turnover: the PnL generated per unit of portfolio turnover
- Return on execution costs: the ratio of the strategy's returns to its total execution costs, indicating whether the generated alpha sufficiently compensates for execution expenses
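A minimal sketch of these quantities, assuming decision and execution prices are recorded for each trade; the helper names are illustrative.

```python
import numpy as np

def slippage(decision_prices, execution_prices, sides):
    """Per-trade slippage in price terms (positive = adverse).  `sides` is
    +1 for buys and -1 for sells, so paying up on a buy or selling below
    the decision price both register as positive slippage."""
    decision = np.asarray(decision_prices, dtype=float)
    executed = np.asarray(execution_prices, dtype=float)
    return (executed - decision) * np.asarray(sides, dtype=float)

def dollar_pnl_per_turnover(total_pnl, total_traded_value):
    """PnL earned per dollar of traded value (turnover)."""
    return total_pnl / total_traded_value

def return_on_execution_costs(total_pnl, brokerage_fees, slippage_costs):
    """Ratio of PnL to total execution costs (fees plus slippage)."""
    return total_pnl / (brokerage_fees + slippage_costs)
```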
Efficiency metrics
- Sharpe ratio: a risk-adjusted performance measure that evaluates excess returns per unit of total return volatility
- Information ratio: a strategy’s active returns relative to its tracking error
- Probabilistic Sharpe Ratio (PSR): a statistical adjustment of the Sharpe ratio that estimates the probability that an observed Sharpe ratio exceeds a given threshold after accounting for estimation error
- Deflated Sharpe Ratio (DSR): an adjusted Sharpe ratio that corrects for selection bias under multiple testing and non-normal return distributions[10]
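The following sketch computes an annualized Sharpe ratio and a Probabilistic Sharpe Ratio in the formulation attributed to Bailey and López de Prado; the Deflated Sharpe Ratio, which additionally raises the benchmark to account for the number of trials, is omitted.

```python
import numpy as np
from scipy.stats import norm, skew, kurtosis

def sharpe_ratio(returns, periods_per_year=252, risk_free=0.0):
    """Annualized Sharpe ratio from per-period simple returns."""
    excess = np.asarray(returns, dtype=float) - risk_free
    return float(excess.mean() / excess.std(ddof=1) * np.sqrt(periods_per_year))

def probabilistic_sharpe_ratio(returns, benchmark_sr=0.0):
    """Probability that the true per-period Sharpe ratio exceeds
    `benchmark_sr`, adjusting for sample length, skewness and kurtosis."""
    r = np.asarray(returns, dtype=float)
    n = len(r)
    sr = r.mean() / r.std(ddof=1)           # per-period (non-annualized) Sharpe
    g3 = skew(r)                            # sample skewness
    g4 = kurtosis(r, fisher=False)          # sample kurtosis (not excess)
    denom = np.sqrt(1.0 - g3 * sr + (g4 - 1.0) / 4.0 * sr ** 2)
    return float(norm.cdf((sr - benchmark_sr) * np.sqrt(n - 1.0) / denom))
```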
Overfitting and validation
Backtests are vulnerable to overfitting when many variations are tested. The Probability of Backtest Overfitting (PBO) quantifies this risk, often using CPCV.[9]
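One common estimation procedure, combinatorially symmetric cross-validation, can be sketched as follows: the rows of a matrix of trial returns are split into groups, every half-and-half combination of groups forms an in-sample/out-of-sample pair, and PBO is estimated as the fraction of combinations in which the trial selected in sample falls in the lower half of the out-of-sample ranking. The group count and the Sharpe-ratio selection criterion below are illustrative.

```python
from itertools import combinations
import numpy as np

def probability_of_backtest_overfitting(trial_returns, n_groups=8):
    """Estimate PBO from a (T x N) matrix of per-period returns for N trials.

    The T rows are split into `n_groups` groups; for every combination of
    half the groups (in-sample), the trial with the best in-sample Sharpe
    ratio is selected and its rank is evaluated on the remaining groups
    (out of sample).  PBO is the fraction of combinations in which the
    selected trial ranks in the lower half out of sample."""
    sharpe = lambda x: x.mean(axis=0) / x.std(axis=0, ddof=1)
    groups = np.array_split(np.arange(trial_returns.shape[0]), n_groups)
    logits = []
    for in_ids in combinations(range(n_groups), n_groups // 2):
        is_rows = np.concatenate([groups[g] for g in in_ids])
        oos_rows = np.concatenate([groups[g] for g in range(n_groups)
                                   if g not in in_ids])
        best = int(np.argmax(sharpe(trial_returns[is_rows])))
        oos_perf = sharpe(trial_returns[oos_rows])
        relative_rank = np.sum(oos_perf <= oos_perf[best]) / (len(oos_perf) + 1.0)
        logits.append(np.log(relative_rank / (1.0 - relative_rank)))
    return float(np.mean(np.array(logits) <= 0.0))
```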
Classification-based metrics
Machine-learning-based strategies are additionally evaluated with standard classification metrics such as accuracy, precision, recall, the F1-score, and negative log-loss.[10]
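A minimal sketch using scikit-learn, assuming a binary classification setup with predicted labels and class probabilities available; the function name is illustrative.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, log_loss)

def strategy_classification_metrics(y_true, y_pred, y_prob):
    """Standard classification metrics for a model that predicts, e.g.,
    whether a trading signal will be profitable (binary labels)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "negative_log_loss": -log_loss(y_true, y_prob),
    }
```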
Attribution
Performance attribution decomposes returns across risk categories (e.g., duration, credit, liquidity, sector).[10]
Limitations and pitfalls
Backtesting is widely used to evaluate historical performance, but it is vulnerable to several sources of error. Because backtests rely on historical data rather than controlled experiments, they cannot establish causality and may reflect patterns that occurred by chance.[11][12]
Common sources of error
- Survivorship bias
- Look-ahead bias
- Data mining
- Ignoring realistic transaction costs
- Dependence on outliers
- Shorting constraints
- Hidden risks (liquidity, funding)
- Non-representative sample periods
- Ignoring drawdowns
These issues reduce reliability even before considering sampling error or the risk of overfitting.[11]
Limits of backtesting as a research tool
Backtesting is frequently misused as an idea-generation tool. A backtest can evaluate a fully specified strategy but cannot explain why it should work or whether the economic rationale will persist. Iteratively modifying models in response to backtest outcomes increases the likelihood of overfitting, producing results that do not generalize out of sample.[12][11]
Practical recommendations
- Perform data cleaning and feature engineering before any backtest.
- Record all experiments to approximate the effective number of trials.
- Favor broad insights over security-specific patterns.
- Use cross-validation methods (walk-forward, purged CV, CPCV).
- Use ensembles or bagging to detect unstable models.
- Use alternative data partitions, simulations, or scenario analysis.
- Restart research instead of repeatedly tuning a single model.
While none of these practices fully eliminate overfitting, they help identify strategies with a higher likelihood of out-of-sample validity.[11]
Hindcast
In oceanography[14] and meteorology,[15] backtesting is also known as hindcasting: a hindcast is a way of testing a mathematical model; researchers enter known or closely estimated inputs for past events into the model to see how well the output matches the known results.
Hindcasting usually refers to a numerical-model integration of a historical period where no observations have been assimilated. This distinguishes a hindcast run from a reanalysis. Oceanographic observations of salinity and temperature as well as observations of surface-wave parameters such as the significant wave height are much scarcer than meteorological observations, making hindcasting more common in oceanography than in meteorology. Also, since surface waves represent a forced system where the wind is the only generating force, wave hindcasting is often considered adequate for generating a reasonable representation of the wave climate with little need for a full reanalysis. Hydrologists use hindcasting to model stream flows.[16]
An example of hindcasting would be entering climate forcings (events that force change) into a climate model. If the hindcast showed reasonably-accurate climate response, the model would be considered successful.
The ECMWF re-analysis is an example of a combined atmospheric reanalysis coupled with a wave-model integration where no wave parameters were assimilated, making the wave part a hindcast run.
See also
- Applied research (customer foresight) – Anticipating consumer preferences/wishes with future products and services
- Backcasting – Influencing current reality from desired future state scenario
- Black box model – System where only the inputs and outputs can be viewed, and not its implementation
- Climate – Long-term weather pattern of a region
- Computer simulation – Process of mathematical modelling, performed on a computer
- ECMWF re-analysis – Data set for retrospective weather analysis
- Economic forecast – Process of making predictions about the economy
- Forecasting – Making predictions based on available data
- NCEP re-analysis – Open data set of the Earth's atmosphere
- Numerical weather prediction – Weather prediction using mathematical models of the atmosphere and oceans
- Predictive modelling – Form of modelling that uses statistics to predict outcomes
- Retrodiction – Making a "prediction" about the past
- Statistical arbitrage – Short-term financial trading strategy
- Thought experiment – Hypothetical situation
- Value at risk – Estimated potential loss for an investment under a given set of conditions
- Cross-validation (statistics)
- Walk-forward optimization
- Purged cross-validation
- Probability of backtest overfitting
- Multiple comparisons problem
- Overfitting
- Ensemble learning
References