Gym Trading Environment¶

QTrade ships a Gymnasium environment (qtrade.env.TradingEnv) that wraps the same Broker as Backtest. The agent steps through bars one at a time with the same accounting, fill semantics, and SL/TP behavior; the only difference is that step() drives the simulation instead of a Python loop.

For customizing the action / observation / reward, see Customizing Trading Environment.

Initializing¶

import yfinance as yf
from qtrade.env import TradingEnv
from qtrade.core.commission import PercentageCommission

data = yf.download(
    "GC=F",
    start="2022-01-01",
    end="2024-01-01",
    interval="1d",
    multi_level_index=False,
)

# Indicators added to the DataFrame become observable features by default.
data['Rsi'] = data['Close'].pct_change().rolling(14).mean()  # placeholder for ta.rsi
data['Diff'] = data['Close'].diff()
data.dropna(inplace=True)

env = TradingEnv(
    data=data,
    cash=3000,
    commission=PercentageCommission(0.001),
    window_size=10,        # observation lookback (also defines warmup)
    max_steps=400,         # max bars per episode
    random_start=False,    # start at index `window_size`
    trade_on_close=True,   # market orders fill at current bar's close
)

The default ObserverScheme returns a (window_size, n_features) window of every column except OHLCV. So the indicators you add to the DataFrame become the observation features automatically.
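To make the "every column except OHLCV" rule concrete, here is a minimal sketch of the column selection. It is not QTrade's implementation: the helper names are hypothetical, and it assumes the OHLCV columns carry the names yfinance gives them.

```python
# Assumption: "OHLCV" means exactly these column names (as produced by yfinance).
OHLCV_COLUMNS = {"Open", "High", "Low", "Close", "Volume"}


def feature_columns(columns):
    """Return the columns that would be observed, preserving their order."""
    return [c for c in columns if c not in OHLCV_COLUMNS]


def observation_shape(columns, window_size):
    """Shape of one observation window: (window_size, n_features)."""
    return (window_size, len(feature_columns(columns)))


# Columns of the DataFrame built in the example above.
columns = ["Close", "High", "Low", "Open", "Volume", "Rsi", "Diff"]
features = feature_columns(columns)
shape = observation_shape(columns, window_size=10)
print(features, shape)
```

With the two indicator columns from the example and window_size=10, each observation would hold 10 rows of 2 features.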

The step API¶

env.step(action) returns the standard 5-tuple:

obs, reward, terminated, truncated, info = env.step(action)

Field	Meaning
`obs`	Whatever the `ObserverScheme` produces. Default: `np.ndarray` of shape `(window_size, n_features)`.
`reward`	`RewardScheme.get_reward(env)`. Default: log-return of trades closed this step minus commission.
`terminated`	`True` iff `current_step >= len(data) - 1` — the data ran out.
`truncated`	`True` iff `current_step - start_idx >= max_steps` — the episode hit its time limit.
`info`	Dict (see below).

When either terminated or truncated becomes True, the broker automatically calls close_all_positions(), so the episode is settled before step() returns.
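A rollout helper built on that 5-tuple might look like the sketch below. It relies only on the standard Gymnasium reset()/step() contract; run_episode and policy are hypothetical names, not part of QTrade.

```python
def run_episode(env, policy, seed=None):
    """Roll out one episode and return (total_reward, n_steps, last_info).

    `env` is any Gymnasium-style environment; `policy` maps an observation
    to an action.
    """
    obs, info = env.reset(seed=seed)
    total_reward, n_steps = 0.0, 0
    while True:
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        n_steps += 1
        # Either flag ends the episode; the last `info` reflects the settled state.
        if terminated or truncated:
            return total_reward, n_steps, info
```

For a random baseline, something like run_episode(env, lambda obs: env.action_space.sample()) should work, since action_space.sample() is part of the Gymnasium API.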

What’s in `info`?¶

Every step:

Key	Meaning
`equity`	`broker.equity` (cash + unrealized PnL).
`unrealized_pnl`	`broker.unrealized_pnl`.
`cumulative_return`	`broker.cumulative_returns` — multiplicative growth since the episode start.
`position`	Net signed `position.size`.
`total_trades`	Count of closed trades so far this episode.
`trades_profit`	Sum of `profit` across all closed trades.
`avg_trade_duration`	Mean `exit_index − entry_index` in bars (0 if no trades).
`is_success`	`trades_profit > 0`. Useful for SB3’s `EvalCallback` success-rate logging.

If you need additional fields (open trade count, drawdown so far, specific trade properties), subclass TradingEnv and override step to extend the dict.
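As an illustration of one such extra field, the sketch below tracks drawdown from the equity value already present in info. DrawdownTracker and the drawdown key are hypothetical; the idea is that an overridden step would call update(info) on the dict returned by the parent class, and reset() would clear the tracker.

```python
class DrawdownTracker:
    """Tracks peak equity and the current drawdown as a fraction of the peak."""

    def __init__(self):
        self.peak = None

    def reset(self):
        """Forget the peak; call this at the start of every episode."""
        self.peak = None

    def update(self, info):
        """Return a copy of `info` with a hypothetical `drawdown` key added."""
        equity = info["equity"]
        if self.peak is None or equity > self.peak:
            self.peak = equity
        # Guard against a non-positive peak to avoid dividing by zero.
        drawdown = 0.0 if self.peak <= 0 else (self.peak - equity) / self.peak
        return {**info, "drawdown": drawdown}


tracker = DrawdownTracker()
for equity in (3000.0, 3300.0, 2970.0):
    info = tracker.update({"equity": equity})
print(info["drawdown"])
```

Keeping the tracker separate from the environment means the same logic can be reused in a wrapper or a callback without touching the subclass.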

Episodes: terminated vs truncated¶

The Gymnasium convention is:

terminated: the episode reached a natural end (won, lost, or ran out of valid states).
truncated: the episode was cut by an artificial time limit.

In TradingEnv:

terminated fires when there’s no more data (current_step >= len(data) - 1). The episode is “complete.”
truncated fires when max_steps is reached. The episode could have continued but you set a budget.
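The two conditions can be written out as a small function. This is a sketch of the rules as stated in this section, with hypothetical argument names, not the environment's actual code.

```python
def episode_flags(current_step, start_idx, n_bars, max_steps):
    """Return (terminated, truncated) for the current bar index."""
    terminated = current_step >= n_bars - 1             # the data ran out
    truncated = current_step - start_idx >= max_steps   # the step budget is spent
    return terminated, truncated


# 500 bars, episode started at bar 10 with a 400-step budget.
print(episode_flags(current_step=410, start_idx=10, n_bars=500, max_steps=400))
# Same data with a budget too large to ever be reached.
print(episode_flags(current_step=499, start_idx=10, n_bars=500, max_steps=1000))
```

Both flags can be True on the same step when the budget expires exactly on the last bar.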

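Stable-Baselines3 treats the two differently when bootstrapping value estimates: a terminated episode is assumed to have no future value, while a truncated one is still bootstrapped from the value of its final observation. Set max_steps short enough that most episodes truncate (so the agent learns from many short episodes), but not so short that the strategy never has time to play out.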
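Why the distinction matters is easiest to see in a one-step bootstrapped target. The function below is a generic illustration of the convention, not Stable-Baselines3's code; one_step_target is a hypothetical name.

```python
def one_step_target(reward, next_value, terminated, gamma=0.99):
    """TD(0) target: bootstrap from next_value unless the episode terminated.

    A truncated episode is NOT terminated here, so its target still
    includes the discounted value of the next state.
    """
    if terminated:
        return reward
    return reward + gamma * next_value


print(one_step_target(1.0, 10.0, terminated=True))
print(one_step_target(1.0, 10.0, terminated=False, gamma=0.5))
```

If a time-limit cut-off were reported as terminated, the agent would learn that holding a position near the end of the budget is worth nothing beyond the immediate reward.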

`random_start` for episode diversity¶

By default random_start=False: every episode starts at bar index window_size. With random_start=True, each reset() picks a random start in [window_size, len(data) - max_steps), so episodes sample different market regimes.

Requirement: len(data) > window_size + max_steps, or equivalently max_steps < len(data) - window_size. The constructor raises a ValueError otherwise.
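The start-index rule and the length check can be sketched as follows. This is an illustration of the behavior described here, not the constructor's code: sample_start is a hypothetical helper, and it assumes the length check applies whether or not random_start is set.

```python
import random


def sample_start(n_bars, window_size, max_steps, random_start, rng):
    """Pick the first bar index of an episode."""
    if n_bars <= window_size + max_steps:
        raise ValueError(
            f"need len(data) > window_size + max_steps, "
            f"got {n_bars} <= {window_size} + {max_steps}"
        )
    if not random_start:
        return window_size
    # Half-open interval [window_size, n_bars - max_steps).
    return rng.randrange(window_size, n_bars - max_steps)


rng = random.Random(42)
starts = [sample_start(500, 10, 400, True, rng) for _ in range(5)]
print(starts)
```

With 500 bars, window_size=10 and max_steps=400, every sampled start falls in [10, 100), which leaves room for a full 400-step episode.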

A typical training loop¶

from stable_baselines3 import PPO

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)

obs, _ = env.reset(seed=42)
for _ in range(400):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.show_stats()
env.plot()

env.show_stats() and env.plot() work the same as on a Backtest instance — same metrics, same multi-panel Bokeh report — because they both delegate to the underlying Broker.

Rendering¶

env.render('human') opens a live mplfinance candle chart updating every step:

[Figure: Trading Environment Render]

For RGB array output (e.g. SB3’s VecVideoRecorder):

env = TradingEnv(..., render_mode='rgb_array')
frame = env.render()  # → ndarray of shape (h, w, 3)

Heads up: rendering is slow. During training, simply don’t call render() (render_mode='human' is the default, but nothing is drawn unless you call it) and render only during evaluation.

Common pitfalls¶

Sparse reward: the default reward only fires on bars where a trade closes. Long episodes with no closures train poorly. Either use an equity-based reward (see Customizing Trading Environment) or shorten episodes so end-of-episode auto-closure provides regular signal.
Indicators with NaN at start: any column you add to data is observed as-is. NaN in the observation crashes most policies. Drop the warmup rows with data.dropna(inplace=True) after computing indicators.
Action / observation drift: if you change schemes mid-experiment and load an old model, you’ll get shape mismatches. Save the model together with the scheme configuration it was trained with, and restore both when loading.
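For the sparse-reward pitfall, one common dense alternative is the per-step log-return of equity. The sketch below shows only the arithmetic. equity_log_return is a hypothetical helper, and wiring it into a RewardScheme is left out because that interface is covered in Customizing Trading Environment rather than here.

```python
import math


def equity_log_return(prev_equity, equity):
    """Log-return of account equity between two consecutive steps."""
    if prev_equity <= 0 or equity <= 0:
        raise ValueError("equity must be positive to take a log-return")
    return math.log(equity / prev_equity)


# Equity after each of four steps, starting from 3000 in cash.
path = [3000.0, 3030.0, 3015.0, 3100.0]
rewards = [equity_log_return(a, b) for a, b in zip(path, path[1:])]
print(sum(rewards), math.log(path[-1] / path[0]))
```

Log-returns add up across steps, so the episode's total reward equals the log of final equity over initial equity, and the agent receives a signal on every bar instead of only when a trade closes.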