Walk-forward optimization

Backtest.optimize() picks the best parameter combination over the entire dataset and reports its in-sample performance. This is the classic recipe for fitting noise: the “best Sharpe 2.4” you see almost never holds up live.

Backtest.walk_forward_optimize() runs the same grid search inside rolling windows and reports performance on data the strategy was never tuned on:

                        train          test
window 0:   [─────────────────────][──────]
window 1:           [─────────────────────][──────]
window 2:                   [─────────────────────][──────]
                                                step

For every window pair, QTrade picks the best params on the train slice and evaluates them on the immediately following test slice. The aggregate of the test slices is your honest, out-of-sample (OoS) estimate of how the strategy actually performs.
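The slicing described above can be sketched in a few lines. This is a minimal illustration, not QTrade's actual implementation: the helper name rolling_windows is hypothetical, and it assumes windows are cut by bar index and that a trailing window without a full test slice is dropped.

```python
def rolling_windows(n_bars, train_window, test_window, step=None):
    """Yield (train_start, train_end, test_end) bar indices.

    Slices are half-open: train = [train_start, train_end),
    test = [train_end, test_end).
    """
    # Assumption: step defaults to test_window, as documented for QTrade.
    step = test_window if step is None else step
    if min(train_window, test_window, step) <= 0:
        raise ValueError("window sizes and step must be positive")
    start = 0
    # Assumption: a window is only emitted if its full test slice fits.
    while start + train_window + test_window <= n_bars:
        train_end = start + train_window
        yield start, train_end, train_end + test_window
        start += step


windows = list(rolling_windows(n_bars=400, train_window=200, test_window=50))
for i, (a, b, c) in enumerate(windows):
    print(f"window {i}: train=[{a}, {b}) test=[{b}, {c})")
```

With 400 bars this yields four windows. Each test slice starts exactly where its train slice ends, so no test bar is ever seen during tuning.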

API

result = bt.walk_forward_optimize(
    train_window=200,
    test_window=50,
    maximize='Sharpe Ratio',
    step=50,
    constraint=lambda p: p['n1'] < p['n2'],
    n1=range(5, 30, 5),
    n2=range(20, 60, 10),
)

Required:

  • train_window: bars per training slice.

  • test_window: bars per test slice (immediately follows the train slice).

  • maximize: name of the metric from calculate_stats() to maximize within each train slice. Common choices: Sharpe Ratio, Total Return [%], Calmar Ratio. See the stats glossary for what each means.

  • **params_grid: same syntax as optimize() — keyword arguments whose values are iterables of candidates.

Optional:

  • step: how many bars to advance between consecutive train starts. Defaults to test_window (non-overlapping test windows). Smaller values give overlapping (correlated) test windows; larger values leave gaps.

  • constraint: filter lambda p: bool applied to each parameter dict before evaluating. Use this to skip nonsensical combinations (e.g. fast SMA window > slow SMA window).
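To see how much work a constraint saves, the grid can be expanded by hand. The sketch below assumes the grid is the Cartesian product of the keyword iterables and that the constraint receives one dict per combination, as the description above suggests; expand_grid is a hypothetical helper, not part of QTrade.

```python
from itertools import product


def expand_grid(constraint=None, **params_grid):
    """Return every parameter dict in the grid that passes the constraint."""
    names = list(params_grid)
    combos = []
    for values in product(*(params_grid[name] for name in names)):
        params = dict(zip(names, values))
        # Skip combinations the constraint rejects (e.g. fast >= slow).
        if constraint is None or constraint(params):
            combos.append(params)
    return combos


combos = expand_grid(
    constraint=lambda p: p['n1'] < p['n2'],
    n1=range(5, 30, 5),
    n2=range(20, 60, 10),
)
print(len(combos), "of", 5 * 4, "combinations survive the constraint")
```

For the grid used on this page, 18 of the 20 combinations pass; only n1=20 and n1=25 paired with n2=20 are rejected.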

Reading the result

result is a dict:

result['windows']    # list of per-window dicts
result['summary']    # aggregate OoS metrics

Each window dict contains:

Key                       Value
train_start, train_end    Timestamps of the train slice.
test_start, test_end      Timestamps of the test slice.
best_params               The parameter combination that maximized the maximize metric on the train slice.
train_stats               Full stats dict from the train run with best_params.
test_stats                Full stats dict from the test run.
test_equity               The test slice’s equity_history Series.

The summary:

Key                               Meaning
n_windows                         Total number of train/test pairs evaluated.
mean_oos_return                   Average Total Return [%] across test windows.
hit_rate                          Fraction of test windows with positive return.
min_oos_return, max_oos_return    Returns of the worst and best test windows.
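The summary keys can be reproduced from the per-window returns, which is a useful sanity check. The following is a sketch under assumptions: summarize is a hypothetical helper, the returns are made-up numbers, and it assumes "positive" means strictly greater than zero. QTrade may compute these differently.

```python
from statistics import mean


def summarize(oos_returns):
    """Aggregate per-window test returns (in %) into summary-style metrics."""
    return {
        'n_windows': len(oos_returns),
        'mean_oos_return': mean(oos_returns),
        # Assumption: a window counts as a hit only if its return is > 0.
        'hit_rate': sum(r > 0 for r in oos_returns) / len(oos_returns),
        'min_oos_return': min(oos_returns),
        'max_oos_return': max(oos_returns),
    }


# Made-up Total Return [%] values for five test windows.
returns = [2.0, -1.0, 3.0, 0.5, -2.5]
summary = summarize(returns)
for key, value in summary.items():
    print(f"{key}: {value}")
```

With real output, the input list would be built from w['test_stats']['Total Return [%]'] for each w in result['windows'].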

A typical workflow

from qtrade.backtest import Backtest
from qtrade.utils.stats import calculate_stats

# bt is a Backtest instance you have already set up.

# 1. Cheap in-sample search to confirm the strategy is even worth tuning.
best_params, best_stats, _ = bt.optimize(
    maximize='Sharpe Ratio',
    n1=range(5, 30, 5),
    n2=range(20, 60, 10),
    constraint=lambda p: p['n1'] < p['n2'],
)
print("In-sample Sharpe:", best_stats['Sharpe Ratio'])

# 2. Walk-forward to get the realistic number.
result = bt.walk_forward_optimize(
    train_window=200,
    test_window=50,
    maximize='Sharpe Ratio',
    n1=range(5, 30, 5),
    n2=range(20, 60, 10),
    constraint=lambda p: p['n1'] < p['n2'],
)
print("OoS hit rate:", result['summary']['hit_rate'])
print("OoS mean return:", result['summary']['mean_oos_return'], '%')

# 3. Inspect window-by-window for parameter stability.
for w in result['windows']:
    print(
        f"{w['test_start'].date()} to {w['test_end'].date()}: "
        f"params={w['best_params']} "
        f"oos_return={w['test_stats']['Total Return [%]']:.2f}%"
    )

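If the in-sample and OoS numbers are wildly different, you’ve found your overfit. Compare like with like: set the in-sample Sharpe against the Sharpe in each window’s test_stats, not against the mean return. The OoS number is the one to trust.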
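Step 3 of the workflow is easier to judge with a count than by eye. The sketch below tallies how often each best_params combination wins; param_stability is a hypothetical helper, and the windows list is made-up data shaped like result['windows'].

```python
from collections import Counter


def param_stability(windows):
    """Return the most frequent best_params and the share of windows it won."""
    # Dicts are not hashable, so convert each one to a sorted tuple of items.
    counts = Counter(tuple(sorted(w['best_params'].items())) for w in windows)
    combo, hits = counts.most_common(1)[0]
    return dict(combo), hits / len(windows)


# Made-up stand-in for result['windows'].
windows = [
    {'best_params': {'n1': 10, 'n2': 30}},
    {'best_params': {'n1': 10, 'n2': 30}},
    {'best_params': {'n1': 25, 'n2': 50}},
    {'best_params': {'n1': 10, 'n2': 30}},
]
combo, share = param_stability(windows)
print(f"most frequent params: {combo} ({share:.0%} of windows)")
```

A high share suggests the optimum is stable across regimes; a share close to 1 / n_windows means every window picked something different.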

Choosing the windows

The right train_window and test_window depend on:

  • Strategy “memory”: how many bars does the strategy need to fit a parameter? A 20-bar SMA produces no value for its first 20 bars, so it needs at least ~50 bars to leave the system some room to trade. A 200-bar lookback model needs a much larger train_window.

  • Market regime length: a train window that’s too small only sees one regime; too large straddles multiple regimes (averaging conflicting optima). 6–12 months of daily bars is a typical starting point.

  • Number of windows you can afford: longer windows = fewer windows, but more data per fit; shorter windows = more windows but each is noisier. Aim for at least 5–10 windows so the summary statistics are meaningful.

Rule of thumb: set train_window to 4–10 × test_window.
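Before running anything, it helps to check how many windows a given dataset can supply. This is a back-of-the-envelope sketch: count_windows is a hypothetical helper, and it assumes only windows with a complete test slice are counted and that step defaults to test_window.

```python
def count_windows(n_bars, train_window, test_window, step=None):
    """Number of complete train/test pairs that fit in n_bars."""
    step = test_window if step is None else step
    usable = n_bars - train_window - test_window
    # Not even one full train + test pair fits.
    if usable < 0:
        return 0
    return usable // step + 1


# Roughly two and three years of daily bars, 252-bar train, 63-bar test.
two_years = count_windows(n_bars=504, train_window=252, test_window=63)
three_years = count_windows(n_bars=756, train_window=252, test_window=63)
print("2 years:", two_years, "windows")    # 4
print("3 years:", three_years, "windows")  # 8
```

Under these assumptions, two years of daily data falls short of the 5–10 window target with a one-year train slice, while three years reaches it.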

Common pitfalls

“OoS hit rate is 0.5”

Coin flip. Either the strategy genuinely doesn’t have edge, or your parameter grid doesn’t cover the meaningful space. Try:

  1. Look at which params win in each window — if they’re wildly different across windows, the strategy isn’t stable.

  2. Widen the grid. If the optimum keeps landing at the edge, you’re not exploring far enough.

  3. Try a longer train_window.

“Mean OoS return is positive but min is catastrophic”

A common signature of strategies that work most of the time but blow up periodically (e.g. martingale-like betting, undefended short vol). The mean hides the tail.

Always look at the per-window list, not just the summary.
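One way to make the tail visible is to sort the windows by test return and print the worst few next to the mean. The helper worst_windows and the data below are hypothetical; the dicts only mimic the test_stats key described earlier.

```python
def worst_windows(windows, k=3, key='Total Return [%]'):
    """Return the k windows with the lowest test-slice value of `key`."""
    return sorted(windows, key=lambda w: w['test_stats'][key])[:k]


# Made-up returns: mostly small gains, one large loss.
returns = [4.0, 5.0, 3.0, -18.0, 6.0, 4.0]
windows = [
    {'window': i, 'test_stats': {'Total Return [%]': r}}
    for i, r in enumerate(returns)
]

mean_return = sum(returns) / len(returns)
worst = worst_windows(windows, k=2)
print(f"mean OoS return: {mean_return:.2f}%")
for w in worst:
    print(f"window {w['window']}: {w['test_stats']['Total Return [%]']:.2f}%")
```

Here the mean is positive even though one window lost 18%, which is exactly the pattern the summary alone would hide.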

“Walk-forward Sharpe is much lower than in-sample Sharpe”

Expected and normal. The gap is your overfit. A 2:1 ratio (in-sample 2.0 → OoS 1.0) is fine; 5:1 means you’re fitting noise. Reduce parameter freedom (smaller grid) or strengthen the strategy’s inductive bias (use a constraint, share parameters across assets, etc.).

“Step smaller than test_window”

This produces overlapping test windows. Bars that fall in more than one test window are counted more than once in the summary metrics, and n_windows overstates the number of independent samples. Sometimes you want this (e.g. for adaptive walk-forward that retrains weekly), but understand that the windows aren’t statistically independent.
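The double counting can be made concrete by tallying how many test slices each bar lands in. This sketch assumes index-based slicing with no partial trailing window; oos_bar_coverage is a hypothetical helper, not a QTrade function.

```python
from collections import Counter


def oos_bar_coverage(n_bars, train_window, test_window, step):
    """Count how many test slices each bar index appears in."""
    counts = Counter()
    start = 0
    while start + train_window + test_window <= n_bars:
        test_start = start + train_window
        counts.update(range(test_start, test_start + test_window))
        start += step
    return counts


overlapping = oos_bar_coverage(400, train_window=200, test_window=50, step=25)
disjoint = oos_bar_coverage(400, train_window=200, test_window=50, step=50)
print("max coverage, step=25:", max(overlapping.values()))
print("max coverage, step=50:", max(disjoint.values()))
```

With step equal to half of test_window, most test bars are evaluated twice, so the effective number of independent out-of-sample bars is well below what n_windows × test_window suggests.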

See also