# Gym Trading Environment

QTrade ships a [Gymnasium](https://gymnasium.farama.org/) environment (`qtrade.env.TradingEnv`) that wraps the same `Broker` as `Backtest`. The agent steps through bars one at a time. The accounting, fill semantics, and SL/TP behavior are the same; the only difference is that the run is driven by `step()` instead of a Python loop.

For customizing the action / observation / reward, see [Customizing Trading Environment](customize_environment.md).

## Initializing

```python
import yfinance as yf
from qtrade.env import TradingEnv
from qtrade.core.commission import PercentageCommission

data = yf.download(
    "GC=F", start="2022-01-01", end="2024-01-01",
    interval="1d", multi_level_index=False,
)

# Indicators added to the DataFrame become observable features by default.
data['Rsi'] = data['Close'].pct_change().rolling(14).mean()  # placeholder for ta.rsi
data['Diff'] = data['Close'].diff()
data.dropna(inplace=True)

env = TradingEnv(
    data=data,
    cash=3000,
    commission=PercentageCommission(0.001),
    window_size=10,        # observation lookback (also defines warmup)
    max_steps=400,         # max bars per episode
    random_start=False,    # start at index `window_size`
    trade_on_close=True,   # market orders fill at current bar's close
)
```

The default `ObserverScheme` returns a `(window_size, n_features)` window of every column except OHLCV, so the indicators you add to the DataFrame become the observation features automatically.

## The step API

`env.step(action)` returns the standard 5-tuple:

```python
obs, reward, terminated, truncated, info = env.step(action)
```

| Field | Meaning |
|---|---|
| `obs` | Whatever the `ObserverScheme` produces. Default: `np.ndarray` of shape `(window_size, n_features)`. |
| `reward` | `RewardScheme.get_reward(env)`. Default: log-return of trades closed this step minus commission. |
| `terminated` | `True` iff `current_step >= len(data) - 1` (the data ran out). |
| `truncated` | `True` iff `current_step - start_idx >= max_steps` (the episode hit its time limit). |
| `info` | Dict (see below). |

When either `terminated` or `truncated` becomes `True`, the broker calls `close_all_positions()` automatically, so the episode is "settled" before `step()` returns.

### What's in `info`?

Every step:

| Key | Meaning |
|---|---|
| `equity` | `broker.equity` (cash + unrealized PnL). |
| `unrealized_pnl` | `broker.unrealized_pnl`. |
| `cumulative_return` | `broker.cumulative_returns`: multiplicative growth since the episode start. |
| `position` | Net signed `position.size`. |
| `total_trades` | Count of closed trades so far this episode. |
| `trades_profit` | Sum of `profit` across all closed trades. |
| `avg_trade_duration` | Mean `exit_index - entry_index` in bars (0 if no trades). |
| `is_success` | `trades_profit > 0`. Useful for SB3's `EvalCallback` success-rate logging. |

If you need additional fields (open trade count, drawdown so far, specific trade properties), subclass `TradingEnv` and override `step` to extend the dict.

## Episodes: terminated vs truncated

The Gymnasium convention is:

- **`terminated`**: the episode reached a *natural* end (won, lost, or ran out of valid states).
- **`truncated`**: the episode was *cut* by an artificial time limit.

In `TradingEnv`:

- `terminated` fires when there's no more data (`current_step >= len(data) - 1`). The episode is "complete."
- `truncated` fires when `max_steps` is reached. The episode could have continued, but you set a budget.

`stable-baselines3` treats the two differently when bootstrapping values, so set `max_steps` short enough that most episodes truncate (the agent then learns from many short episodes), but not so short that the strategy never has time to play out.

## `random_start` for episode diversity

By default `random_start=False`: every episode starts at bar index `window_size`. With `random_start=True`, each `reset()` picks a random start in `[window_size, len(data) - max_steps)`, so episodes sample different market regimes.

Required: `len(data) > window_size + max_steps`. The constructor raises a `ValueError` otherwise, so keep `max_steps < len(data) - window_size` to leave room.

## A typical training loop

```python
from stable_baselines3 import PPO

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=200_000)

obs, _ = env.reset(seed=42)
for _ in range(400):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break

env.show_stats()
env.plot()
```

`env.show_stats()` and `env.plot()` work the same as on a `Backtest` instance (same metrics, same multi-panel Bokeh report) because they both delegate to the underlying `Broker`.

## Rendering

`env.render('human')` opens a live mplfinance candle chart that updates every step:

![Trading Environment Render](../_static/render_rgb.gif)

For RGB array output (e.g. SB3's `VecVideoRecorder`):

```python
env = TradingEnv(..., render_mode='rgb_array')
frame = env.render()   # → ndarray of shape (h, w, 3)
```

> Heads up: rendering is slow. For training, leave it off
> (`render_mode='human'` is the default but you don't have to *call*
> `render()`) and only render during evaluation.

## Common pitfalls

- **Sparse reward**: the default reward only fires on bars where a trade closes. Long episodes with no closures train poorly. Either use an equity-based reward (see [Customizing Trading Environment](customize_environment.md)) or shorten episodes so end-of-episode auto-closure provides regular signal.
- **Indicators with NaN at start**: any column you add to `data` is observed as-is. NaN in the observation crashes most policies. Drop the warmup rows with `data.dropna(inplace=True)` after computing indicators.
- **Action / observation drift**: if you change schemes mid-experiment and load an old model, you'll get shape mismatches. Save the model and the scheme configuration together so they can be reloaded as a pair.
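
One way to guard against the action / observation drift pitfall is to write a small JSON sidecar next to the saved model and compare it with the current setup before loading. The sketch below uses only the standard library. The helper names `save_scheme_meta` and `check_scheme_meta`, the file layout, and the scheme labels are hypothetical and not part of QTrade; adapt the recorded fields to whatever your schemes actually depend on.

```python
import json
from pathlib import Path


def save_scheme_meta(path, window_size, feature_columns, action_scheme, reward_scheme):
    """Write a JSON sidecar describing the setup a model was trained with.

    Hypothetical helper: `action_scheme` and `reward_scheme` are free-text
    labels you choose yourself (for example a class name).
    """
    meta = {
        "window_size": window_size,
        "feature_columns": list(feature_columns),
        "action_scheme": action_scheme,
        "reward_scheme": reward_scheme,
    }
    Path(path).write_text(json.dumps(meta, indent=2))
    return meta


def check_scheme_meta(path, window_size, feature_columns):
    """Return a list of human-readable mismatches (empty if the setup matches)."""
    saved = json.loads(Path(path).read_text())
    current_columns = list(feature_columns)
    problems = []
    if saved["window_size"] != window_size:
        problems.append(
            f"window_size: saved {saved['window_size']}, current {window_size}"
        )
    if saved["feature_columns"] != current_columns:
        problems.append(
            f"feature_columns: saved {saved['feature_columns']}, current {current_columns}"
        )
    return problems
```

In practice you might call `save_scheme_meta("ppo_gold.meta.json", 10, ["Rsi", "Diff"], "default", "default")` right after saving the model, and refuse to load the model whenever `check_scheme_meta(...)` returns a non-empty list. The file name here is only an illustration.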