A rollout is a complete episode of agent-environment interaction: an agent takes actions, the environment transitions, and rewards accumulate until the episode ends. Reproducibility means that given the same random seed and the same action sequence, every rollout produces exactly the same observations, transitions, and rewards. This property is foundational for RL engineering — without it, you cannot reliably compare two policy versions, reproduce a bug seen during training, or verify that a reward spike was real and not noise.

The hard part is that real training runs hundreds or thousands of rollouts in parallel. Each worker must be completely isolated from the others: no shared filesystem, no shared process state, no network side-effects leaking across episodes. If any state bleeds between workers, your “reproducible” seed no longer controls the outcome and you lose the guarantee. Tensorlake sandboxes enforce this isolation at the infrastructure level — every rollout gets its own fresh environment, and the seed is the only variable in play.
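The core guarantee can be made concrete with a toy stochastic environment (illustrative only, not the Tensorlake API): seed the episode's RNG up front, and the seed plus the action sequence fully determine the trajectory.

```python
import random

# Toy stochastic environment — illustrative only, not the Tensorlake API.
def rollout(seed, actions):
    rng = random.Random(seed)      # the seed is the only source of randomness
    state, trajectory = 0, []
    for action in actions:
        state += action
        reward = state + rng.random()          # stochastic reward component
        trajectory.append((action, state, reward))
    return trajectory

# Same seed + same action sequence → identical trajectory, every run
assert rollout(42, [1, 0, 1]) == rollout(42, [1, 0, 1])
# A different seed changes the stochastic parts of the trajectory
assert rollout(42, [1, 0, 1]) != rollout(7, [1, 0, 1])
```

This is the property the assertions in verify_reproducibility check, except there the two runs happen in two independent sandboxes rather than two function calls.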

Core concepts

Isolation means each rollout runs in its own compute environment with no shared resources. Two workers seeded with different values must not be able to influence each other’s trajectories through any shared channel — not a shared pip cache, not a shared /tmp, not a shared network state. In production, this matters most when you are running hundreds of rollouts per training step: any shared state becomes a source of variance that your reward signal cannot explain.

Stateful resets mean the environment always starts from a known, controlled baseline when a new rollout begins. A reset that partially inherits state from a previous episode is one of the most common and hardest-to-debug sources of non-reproducibility. Because each sandbox is created fresh per rollout, the reset is total — there is no prior episode state to inherit.

Determinism means the environment’s random number generator is seeded before any interaction begins, and the seed is the sole source of randomness for the entire episode. Given the same seed, the same initial observation, and the same action sequence, the trajectory must be identical byte-for-byte. This lets you replay any episode from training history, compare policy versions on equal footing, and write regression tests against specific trajectories.

How Tensorlake sandboxes provide this

SandboxClient().create_and_connect() starts a fresh, isolated compute environment and returns a box handle. Every sandbox is a separate process tree with its own filesystem and memory. There is no shared state between two sandboxes created from the same client.

The seed is passed into the environment harness as a string literal embedded in the Python script that runs inside the sandbox — not set on the host process. This keeps the host’s random state completely separate from the environment’s, which is important when you dispatch many rollouts from a single host thread pool.

Parallel rollouts map cleanly onto ThreadPoolExecutor: each thread creates its own sandbox, runs its episode, collects its trajectory, and the sandbox is destroyed when the context manager exits. The executor manages concurrency; the sandboxes manage isolation.
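The seed-embedding mechanism is plain str.format templating. A minimal sketch (the template below is illustrative, not the actual harness): values are interpolated as literals, !r quotes strings safely, and literal braces in the generated script are doubled so format treats them as text.

```python
# Illustrative harness template, not the real _GYM_HARNESS.
# {seed} and {env_name!r} are filled on the host; {{ }} renders as literal braces.
TEMPLATE = """
seed = {seed}
env_name = {env_name!r}
result = {{"seed": seed, "env": env_name}}
print(result)
"""

script = TEMPLATE.format(seed=42, env_name="CartPole-v1")
assert "seed = 42" in script                   # seed baked in as a literal
assert "env_name = 'CartPole-v1'" in script    # !r produced a quoted string
assert '{"seed": seed' in script               # doubled braces survived as one pair
```

Because the seed becomes a literal in the script, nothing the host does to its own RNG after formatting can reach the episode.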

Prerequisites

pip install tensorlake gymnasium python-dotenv
Create a .env file in your project root — SandboxClient picks the key up automatically once python-dotenv loads it:
TENSORLAKE_API_KEY="your-api-key-here"

Full example

"""
Reproducible RL Rollouts with Tensorlake Sandboxes
===================================================
Demonstrates three properties:
  1. Isolation   — each rollout runs in its own sandbox
  2. Determinism — same seed → same trajectory, verified by assertion
  3. Parallelism — multiple seeds dispatched concurrently via ThreadPoolExecutor
"""

from dotenv import load_dotenv
load_dotenv()

import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from typing import List, Tuple
from tensorlake.sandbox import SandboxClient


# ─── Data models ──────────────────────────────────────────────────────────────

@dataclass
class RolloutConfig:
    seed: int
    env_name: str = "CartPole-v1"
    max_steps: int = 200


@dataclass
class RolloutResult:
    seed: int
    total_reward: float
    steps: int
    # Each element is (observation, action, reward, terminated)
    trajectory: List[Tuple] = field(default_factory=list)


# ─── Gymnasium harness ────────────────────────────────────────────────────────

# This script runs inside the sandbox. It is a self-contained string so that
# the host's Python environment has no influence on the episode's random state.
_GYM_HARNESS = """
import gymnasium as gym
import json
import sys

seed     = {seed}
env_name = {env_name!r}
max_steps = {max_steps}

env = gym.make(env_name)
obs, _ = env.reset(seed=seed)
# env.reset(seed=) only seeds the observation/transition RNG.
# The action space has its own RNG that must be seeded separately.
env.action_space.seed(seed)

trajectory = []
total_reward = 0.0
steps = 0

for _ in range(max_steps):
    action = env.action_space.sample()
    next_obs, reward, terminated, truncated, _ = env.step(action)
    trajectory.append((obs.tolist(), int(action), float(reward), bool(terminated)))
    total_reward += reward
    steps += 1
    obs = next_obs
    if terminated or truncated:
        break

env.close()

result = {{
    "seed": seed,
    "total_reward": total_reward,
    "steps": steps,
    "trajectory": trajectory,
}}
# The only output is the JSON result — the caller reads stdout
print(json.dumps(result))
"""


# ─── Single rollout ───────────────────────────────────────────────────────────

def run_single_rollout(config: RolloutConfig) -> RolloutResult:
    """
    Run one complete RL episode in a fresh, isolated sandbox.

    A new sandbox is created for every call so there is no shared filesystem
    or process state between concurrent rollouts. The seed is embedded in the
    harness string rather than set on the host, which keeps the host's random
    state fully separate from the environment's.
    """
    harness = _GYM_HARNESS.format(
        seed=config.seed,
        env_name=config.env_name,
        max_steps=config.max_steps,
    )

    with SandboxClient().create_and_connect(memory_mb=2048) as box:
        # Use python3 -m pip to install into the sandbox's managed environment
        box.run("python3", ["-m", "pip", "install", "gymnasium",
                            "--break-system-packages", "-q"])

        execution = box.run("python3", ["-c", harness])
        raw = (execution.stdout or "").strip()

    data = json.loads(raw)
    return RolloutResult(
        seed=data["seed"],
        total_reward=data["total_reward"],
        steps=data["steps"],
        trajectory=data["trajectory"],
    )


# ─── Parallel rollout collection ──────────────────────────────────────────────

def collect_parallel_rollouts(
    seeds: List[int],
    env_name: str = "CartPole-v1",
    max_steps: int = 200,
) -> List[RolloutResult]:
    """
    Dispatch one sandbox per seed, all running concurrently.

    ThreadPoolExecutor manages the concurrency; the sandboxes manage isolation.
    Results are returned in seed order regardless of completion order.
    """
    configs = [RolloutConfig(seed=s, env_name=env_name, max_steps=max_steps) for s in seeds]

    results_by_seed = {}
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        future_to_seed = {pool.submit(run_single_rollout, cfg): cfg.seed for cfg in configs}
        for future in as_completed(future_to_seed):
            seed = future_to_seed[future]
            results_by_seed[seed] = future.result()

    return [results_by_seed[s] for s in seeds]


# ─── Reproducibility check ────────────────────────────────────────────────────

def verify_reproducibility(
    seed: int = 42,
    env_name: str = "CartPole-v1",
    max_steps: int = 200,
) -> None:
    """
    Run the same seed twice in independent sandboxes and assert the trajectories
    are identical. This is the core guarantee: isolation + determinism means the
    seed fully determines the episode.
    """
    print(f"Verifying reproducibility for seed={seed}...")

    config = RolloutConfig(seed=seed, env_name=env_name, max_steps=max_steps)
    result_a = run_single_rollout(config)
    result_b = run_single_rollout(config)

    assert result_a.steps == result_b.steps, (
        f"Step count mismatch: {result_a.steps} vs {result_b.steps}"
    )
    assert result_a.total_reward == result_b.total_reward, (
        f"Reward mismatch: {result_a.total_reward} vs {result_b.total_reward}"
    )
    assert result_a.trajectory == result_b.trajectory, (
        "Trajectory mismatch: observations or actions differed between runs"
    )

    print(
        f"  Passed. seed={seed} → {result_a.steps} steps, "
        f"reward={result_a.total_reward:.1f} (identical across both runs)"
    )


# ─── Main ─────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    # Step 1: Verify that the same seed always produces the same trajectory
    verify_reproducibility(seed=42)
    print(
        "  → Same seed, two independent sandboxes, identical trajectory.\n"
        "    The seed is the only source of variation — no shared state, no host RNG leakage."
    )

    # Step 2: Collect 4 rollouts in parallel, one sandbox per seed
    seeds = [0, 1, 2, 3]
    print(f"\nCollecting {len(seeds)} parallel rollouts...")
    results = collect_parallel_rollouts(seeds)

    # Step 3: Print a summary table
    print(f"\n{'Seed':>6}  {'Steps':>6}  {'Total Reward':>14}")
    print("-" * 32)
    for r in results:
        print(f"{r.seed:>6}  {r.steps:>6}  {r.total_reward:>14.1f}")

    best  = max(results, key=lambda r: r.total_reward)
    worst = min(results, key=lambda r: r.total_reward)
    print(
        f"\n  → CartPole rewards 1.0 per step, so total reward equals episode length.\n"
        f"    Seed {best.seed} balanced the longest ({int(best.total_reward)} steps); "
        f"seed {worst.seed} fell first ({int(worst.total_reward)} steps).\n"
        f"    Different seeds produce different episodes because the initial pole\n"
        f"    angle varies — run again with the same seeds and you get identical numbers."
    )
Expected output:
Verifying reproducibility for seed=42...
  Passed. seed=42 → 30 steps, reward=30.0 (identical across both runs)

Collecting 4 parallel rollouts...
  Seed   Steps    Total Reward
--------------------------------
     0      18            18.0
     1      29            29.0
     2      14            14.0
     3      15            15.0
In CartPole, the reward is 1.0 per step regardless of action — so total reward equals step count. The episode ends when the pole tips past 12 degrees or the cart leaves the track. Different seeds produce different episode lengths because the initial pole angle varies. The reproducibility assertion confirms that seed=42 always produces the exact same 30-step trajectory in two independent sandboxes.
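The dispatch pattern in collect_parallel_rollouts — submit one task per seed, gather with as_completed, then restore seed order — works with any callable. A stdlib-only sketch with a stand-in for run_single_rollout (fake_rollout below is hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_rollout(seed: int) -> dict:
    # Stand-in for run_single_rollout: any function keyed by seed works here.
    return {"seed": seed, "reward": float(seed * 10)}

def collect(seeds):
    results = {}
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        futures = {pool.submit(fake_rollout, s): s for s in seeds}
        for fut in as_completed(futures):    # completion order is arbitrary
            results[futures[fut]] = fut.result()
    return [results[s] for s in seeds]       # seed order restored here

out = collect([3, 1, 2])
assert [r["seed"] for r in out] == [3, 1, 2]
```

The futures-to-seed mapping is what lets results come back in any order without losing track of which sandbox produced which trajectory.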

Tic-tac-toe: policy evaluation

This example extends the CartPole infrastructure to a custom two-player environment and shows where sandboxes are more directly necessary. Policies are defined as code strings — the same pattern used in RL Training with GSPO for LLM-generated completions. A policy that crashes, loops, or behaves unexpectedly only kills its own sandbox; the rest of the evaluation runs unaffected.

The data model is the same as CartPole: TttConfig extends RolloutConfig by replacing env_name with policy_x and policy_o; run_ttt_batch returns the same RolloutResult. The total_reward field becomes the mean return per game from X’s perspective (+1 win, −1 loss, 0 draw).

evaluate_matchup follows the same parallel dispatch pattern as collect_parallel_rollouts, running one sandbox per seed to get a reliable return estimate. This is the policy evaluation step in policy iteration — you would call it after each policy update to measure how much the return improved.
from dotenv import load_dotenv
load_dotenv()

import json
import statistics
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
from tensorlake.sandbox import SandboxClient


# ─── Reuse RolloutResult from the CartPole section ────────────────────────────
# total_reward = mean reward per game, X's perspective (+1 win, -1 loss, 0 draw)
# steps        = total moves across all games in the batch
# trajectory   = list of per-game outcomes

@dataclass
class RolloutResult:
    seed: int
    total_reward: float
    steps: int
    trajectory: List[dict] = field(default_factory=list)


# ─── Tic-tac-toe config ───────────────────────────────────────────────────────
# Extends the RolloutConfig pattern: swap env_name for policy_x / policy_o,
# add n_games (games per sandbox call = one rollout batch).

@dataclass
class TttConfig:
    seed: int
    policy_x: str    # key into POLICIES
    policy_o: str
    n_games: int = 50


# ─── Policies as code strings ─────────────────────────────────────────────────
# Treat these like LLM-generated completions: they run inside the sandbox,
# never in the host process. A buggy policy crashes its sandbox, not the loop.

POLICIES: Dict[str, str] = {
    "random": """
def choose_action(board, player, rng):
    moves = [i for i, v in enumerate(board) if v is None]
    return rng.choice(moves)
""",

    "greedy": """
def choose_action(board, player, rng):
    WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    opponent = "O" if player == "X" else "X"
    moves = [i for i, v in enumerate(board) if v is None]
    # Take the win if available
    for move in moves:
        b = board[:]; b[move] = player
        for a, c, d in WINS:
            if b[a] and b[a] == b[c] == b[d]: return move
    # Block the opponent's win
    for move in moves:
        b = board[:]; b[move] = opponent
        for a, c, d in WINS:
            if b[a] and b[a] == b[c] == b[d]: return move
    return rng.choice(moves)
""",
}


# ─── Harness ──────────────────────────────────────────────────────────────────

# Runs n_games games inside a single sandbox and returns the batch return.
# Both policies execute in separate namespaces so they can't overwrite each
# other's globals — important when policies come from different sources.
_TTT_HARNESS = """
import json, random

WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
ns_x, ns_o = {{}}, {{}}
exec({policy_x!r}, ns_x); exec({policy_o!r}, ns_o)
choose_x = ns_x["choose_action"]; choose_o = ns_o["choose_action"]

def winner(b):
    for a, c, d in WINS:
        if b[a] and b[a] == b[c] == b[d]: return b[a]
    return None

rng   = random.Random({seed})
games = []
for _ in range({n_games}):
    board, moves_played = [None] * 9, 0
    for turn in range(9):
        player = "X" if turn % 2 == 0 else "O"
        action = (choose_x if player == "X" else choose_o)(board[:], player, rng)
        board[action] = player; moves_played += 1
        w = winner(board)
        if w:
            games.append({{"outcome": w + " wins", "reward": 1 if w == "X" else -1, "moves": moves_played}})
            break
    else:
        games.append({{"outcome": "draw", "reward": 0, "moves": moves_played}})

print(json.dumps({{
    "total_reward": sum(g["reward"] for g in games) / len(games),
    "steps":        sum(g["moves"]  for g in games),
    "trajectory":   games,
}}))
"""


# ─── Interactive move oracle ──────────────────────────────────────────────────

# For interactive play the sandbox stays open for the whole game session.
# Each opponent turn sends the current board and gets one action back.
# timeout_secs gives the human up to 5 minutes of total think time.
_MOVE_HARNESS = """
import random
ns = {{}}
exec({policy!r}, ns)
action = ns["choose_action"]({board!r}, {player!r}, random.Random({seed}))
print(action)
"""

WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]


def _winner(board: list):
    for a, c, d in WINS:
        if board[a] and board[a] == board[c] == board[d]:
            return board[a]
    return None


def _display(board: list) -> None:
    row = lambda i: " | ".join(
        str(i * 3 + j) if board[i * 3 + j] is None else board[i * 3 + j]
        for j in range(3)
    )
    print(f" {row(0)}\n---+---+---\n {row(1)}\n---+---+---\n {row(2)}\n")


def play_against(human_side: str = "X", opponent_policy: str = "greedy") -> None:
    """
    Play a game of tic-tac-toe against a policy running in a sandbox.

    The sandbox opens once at the start of the game and stays live until the
    game ends. Each opponent turn is a single box.run() call — the policy code
    never executes in the host process.

    human_side:      "X" (you move first) or "O" (opponent moves first)
    opponent_policy: any key in POLICIES
    """
    assert human_side in ("X", "O"), "human_side must be 'X' or 'O'"
    opponent_side = "O" if human_side == "X" else "X"

    board = [None] * 9
    print(f"\nYou are {human_side}. Opponent: {opponent_policy}.")
    print("Empty squares show their position number (0–8).\n")
    _display(board)

    # Keep one sandbox alive for the whole game — no re-creation per move
    with SandboxClient().create_and_connect(memory_mb=1024, timeout_secs=300) as box:
        for turn in range(9):
            player    = "X" if turn % 2 == 0 else "O"
            available = [i for i, v in enumerate(board) if v is None]

            if player == human_side:
                while True:
                    try:
                        move = int(input(f"Your move ({human_side}), choose from {available}: "))
                        if move in available:
                            break
                        print(f"  Square {move} is taken. Choose from {available}.")
                    except ValueError:
                        print(f"  Enter a number from {available}.")
            else:
                # The seed is the turn number — deterministic but varies per turn
                harness = _MOVE_HARNESS.format(
                    policy=POLICIES[opponent_policy],
                    board=board,
                    player=player,
                    seed=turn,
                )
                ex   = box.run("python3", ["-c", harness])
                move = int((ex.stdout or "").strip())
                print(f"  {opponent_side} ({opponent_policy}) plays {move}")

            board[move] = player
            _display(board)

            w = _winner(board)
            if w:
                print("You win!" if w == human_side else f"{opponent_policy} wins!")
                return

    print("Draw!")


# ─── Single batch rollout ─────────────────────────────────────────────────────

def run_ttt_batch(config: TttConfig) -> RolloutResult:
    """
    Run one batch of n_games in a fresh sandbox and return a RolloutResult.

    Follows the same signature as run_single_rollout from the CartPole section:
    one config in, one RolloutResult out, one sandbox per call.
    """
    harness = _TTT_HARNESS.format(
        policy_x=POLICIES[config.policy_x],
        policy_o=POLICIES[config.policy_o],
        seed=config.seed,
        n_games=config.n_games,
    )
    with SandboxClient().create_and_connect(memory_mb=1024) as box:
        ex   = box.run("python3", ["-c", harness])
        data = json.loads((ex.stdout or "").strip())
    return RolloutResult(
        seed=config.seed,
        total_reward=data["total_reward"],
        steps=data["steps"],
        trajectory=data["trajectory"],
    )


# ─── Policy evaluation ────────────────────────────────────────────────────────

def evaluate_matchup(
    policy_x: str,
    policy_o: str,
    seeds: List[int],
    n_games: int = 50,
) -> Tuple[float, float]:
    """
    Run one batch per seed in parallel; return (mean_return, std_return).

    Follows the same parallel dispatch pattern as collect_parallel_rollouts:
    one sandbox per seed, all running concurrently. More seeds = tighter
    estimate of the true policy return.
    """
    configs = [
        TttConfig(seed=s, policy_x=policy_x, policy_o=policy_o, n_games=n_games)
        for s in seeds
    ]
    returns: List[float] = [0.0] * len(configs)
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        futures = {pool.submit(run_ttt_batch, cfg): i for i, cfg in enumerate(configs)}
        for future in as_completed(futures):
            returns[futures[future]] = future.result().total_reward
    return statistics.mean(returns), statistics.stdev(returns)


# ─── Q-learning ───────────────────────────────────────────────────────────────

# Uses str(s)+","+str(a) as Q-key to avoid f-string braces conflicting
# with .format() when the harness template is rendered on the host.
_QLEARN_HARNESS = """
import json, random

WINS = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def greedy_move(board, rng):
    moves = [i for i, v in enumerate(board) if v is None]
    for move in moves:
        b = board[:]; b[move] = "O"
        for a, c, d in WINS:
            if b[a] and b[a] == b[c] == b[d]: return move
    for move in moves:
        b = board[:]; b[move] = "X"
        for a, c, d in WINS:
            if b[a] and b[a] == b[c] == b[d]: return move
    return rng.choice(moves)

def winner(b):
    for a, c, d in WINS:
        if b[a] and b[a] == b[c] == b[d]: return b[a]
    return None

def skey(b):     return tuple(0 if v is None else 1 if v == "X" else 2 for v in b)
def qkey(s, a):  return str(s) + "," + str(a)
def qv(q, s, a): return q.get(qkey(s, a), 0.0)

q       = json.loads({q_json!r})
rng     = random.Random({seed})
alpha, gamma, epsilon = {alpha}, {gamma}, {epsilon}

ep_rewards = []
for _ in range({n_episodes}):
    board = [None] * 9
    ep_r  = 0.0
    while True:
        moves = [i for i, v in enumerate(board) if v is None]
        if not moves: ep_rewards.append(ep_r); break
        s = skey(board)
        a = rng.choice(moves) if rng.random() < epsilon else max(moves, key=lambda x: qv(q, s, x))
        board[a] = "X"
        w = winner(board)
        if w or not any(v is None for v in board):
            r = 1.0 if w == "X" else -1.0 if w == "O" else 0.0
            q[qkey(s, a)] = qv(q, s, a) + alpha * (r - qv(q, s, a))
            ep_r += r; ep_rewards.append(ep_r); break
        board[greedy_move(board[:], rng)] = "O"
        w = winner(board)
        r = 1.0 if w == "X" else -1.0 if w == "O" else 0.0
        s2     = skey(board)
        moves2 = [i for i, v in enumerate(board) if v is None]
        nq = max((qv(q, s2, x) for x in moves2), default=0.0) if moves2 else 0.0
        q[qkey(s, a)] = qv(q, s, a) + alpha * (r + gamma * nq - qv(q, s, a))
        ep_r += r
        if w or not moves2: ep_rewards.append(ep_r); break

print(json.dumps({{"q_table": q, "mean_reward": sum(ep_rewards)/len(ep_rewards), "n_states": len(q)}}))
"""


@dataclass
class QConfig:
    seed: int
    q_table: dict = field(default_factory=dict)
    epsilon: float = 0.3   # exploration rate — high early, can decay over iterations
    alpha:   float = 0.5   # learning rate
    gamma:   float = 0.9   # discount factor
    n_episodes: int = 300


def run_qlearning_iter(config: QConfig) -> dict:
    """Run one training iteration in a sandbox; return updated Q-table + stats."""
    harness = _QLEARN_HARNESS.format(
        q_json=json.dumps(config.q_table),
        seed=config.seed,
        alpha=config.alpha,
        gamma=config.gamma,
        epsilon=config.epsilon,
        n_episodes=config.n_episodes,
    )
    with SandboxClient().create_and_connect(memory_mb=1024) as box:
        ex = box.run("python3", ["-c", harness])
        return json.loads((ex.stdout or "").strip())


def train_q(n_iter: int = 8, episodes_per_iter: int = 300) -> dict:
    """
    Train a Q-table over n_iter sequential sandbox calls.

    Each call receives the Q-table from the previous iteration and returns
    an updated one. Mean reward moving from negative to positive confirms
    the policy is improving against the greedy opponent.
    """
    q_table: dict = {}
    print(f"{'Iter':>5}  {'Mean reward':>13}  {'Q-states':>10}")
    print("-" * 34)
    for i in range(n_iter):
        result  = run_qlearning_iter(QConfig(seed=i, q_table=q_table, n_episodes=episodes_per_iter))
        q_table = result["q_table"]
        print(f"{i+1:>5}  {result['mean_reward']:>+13.3f}  {result['n_states']:>10}")
    return q_table


def q_policy_code(q_table: dict) -> str:
    """
    Serialize the Q-table into a choose_action string compatible with POLICIES.

    This lets the learned policy plug directly into evaluate_matchup and
    play_against without any changes to those functions.
    """
    q_json = json.dumps(q_table)
    return (
        "import json as _j\n"
        "_Q = _j.loads(" + repr(q_json) + ")\n"
        "def choose_action(board, player, rng):\n"
        "    def skey(b): return tuple(0 if v is None else 1 if v == 'X' else 2 for v in b)\n"
        "    def qkey(s, a): return str(s) + ',' + str(a)\n"
        "    moves = [i for i, v in enumerate(board) if v is None]\n"
        "    return max(moves, key=lambda a: _Q.get(qkey(skey(board), a), 0.0))\n"
    )


# ─── Main ─────────────────────────────────────────────────────────────────────

if __name__ == "__main__":
    matchups = [
        ("random", "random"),
        ("greedy", "random"),
        ("random", "greedy"),
        ("greedy", "greedy"),
    ]
    seeds = [0, 1, 2, 3]   # one sandbox per seed per matchup = 16 sandboxes total

    print("Evaluating all matchups (4 seeds × 50 games each, one sandbox per seed)...")
    print(f"\n{'X policy':>10}  {'O policy':>10}  {'mean return':>13}  {'std':>6}")
    print("-" * 48)
    eval_results = {}
    for x, o in matchups:
        mean, std = evaluate_matchup(x, o, seeds=seeds)
        eval_results[(x, o)] = (mean, std)
        print(f"{x:>10}  {o:>10}  {mean:>+13.3f}  {std:>6.3f}")

    print(
        f"\n  → Mean return is the expected reward per game from X's perspective\n"
        f"    (+1 win, −1 loss, 0 draw), averaged over {len(seeds)} seeds × 50 games.\n"
        f"    greedy-vs-random ({eval_results[('greedy','random')][0]:+.3f}) shows how\n"
        f"    strongly a win/block heuristic dominates pure chance.\n"
        f"    greedy-vs-greedy ({eval_results[('greedy','greedy')][0]:+.3f} ≠ 0) reveals a\n"
        f"    fork vulnerability: X can reach positions that greedy-O cannot\n"
        f"    simultaneously block, which a stronger policy would eliminate.\n"
        f"    Low std (0.04–0.10) confirms 4 seeds × 50 games is enough to\n"
        f"    rank policies reliably — scale up seeds for tighter confidence intervals."
    )

    # ── Train and add the learned policy ─────────────────────────────────────
    print("\nTraining Q-learner vs greedy opponent (8 iterations × 300 episodes)...")
    q_table = train_q(n_iter=8, episodes_per_iter=300)

    # Serialize the Q-table into a choose_action string — same interface as
    # random and greedy, so evaluate_matchup works without any changes.
    POLICIES["q_learned"] = q_policy_code(q_table)

    print("\nEvaluating learned policy against baselines:")
    print(f"\n{'Matchup':>28}  {'mean return':>13}  {'std':>6}")
    print("-" * 54)
    for x, o in [("q_learned", "greedy"), ("greedy", "q_learned"), ("q_learned", "random")]:
        mean, std = evaluate_matchup(x, o, seeds=seeds)
        print(f"{x+' vs '+o:>28}  {mean:>+13.3f}  {std:>6.3f}")

    print(
        "\n  → q_learned was trained as X against greedy O.\n"
        "    It does not know how to play as O — greedy vs q_learned\n"
        "    exposes this: the policy is role-specialized, not general."
    )

    # ── Play a game ───────────────────────────────────────────────────────────
    side = input("\nPlay a game? Choose your side [X/O] (or press Enter to skip): ").strip().upper()
    if side in ("X", "O"):
        available_policies = list(POLICIES.keys())
        opp = input(f"Opponent policy {available_policies} (default: greedy): ").strip().lower()
        if opp not in POLICIES:
            opp = "greedy"
        play_against(human_side=side, opponent_policy=opp)
Expected output — evaluation:
Evaluating all matchups (4 seeds × 50 games each, one sandbox per seed)...

  X policy    O policy    mean return     std
------------------------------------------------
    random      random         +0.255   0.100
    greedy      random         +0.900   0.043
    random      greedy         -0.640   0.069
    greedy      greedy         +0.180   0.059

  → Mean return is the expected reward per game from X's perspective
    (+1 win, −1 loss, 0 draw), averaged over 4 seeds × 50 games.
    greedy-vs-random (+0.900) shows how strongly a win/block heuristic dominates pure chance.
    greedy-vs-greedy (+0.180 ≠ 0) reveals a fork vulnerability: X can reach positions
    that greedy-O cannot simultaneously block, which a stronger policy would eliminate.
    Low std (0.04–0.10) confirms 4 seeds × 50 games is enough to rank policies reliably.
Expected output — Q-learning training:
Training Q-learner vs greedy opponent (8 iterations × 300 episodes)...
 Iter    Mean reward    Q-states
----------------------------------
    1         -0.470         393
    2         -0.113         592
    3         -0.177         823
    4         -0.080         963
    5         +0.117        1051
    6         +0.087        1159
    7         +0.073        1226
    8         +0.053        1297

Evaluating learned policy against baselines:

                     Matchup    mean return     std
------------------------------------------------------
          q_learned vs greedy         +0.927   0.034
          greedy vs q_learned         +0.990   0.008
          q_learned vs random         +0.785   0.051

  → q_learned was trained as X against greedy O.
    It does not know how to play as O — greedy vs q_learned
    exposes this: the policy is role-specialized, not general.
The training loop passes the Q-table from each iteration into the next via JSON. Mean reward moving from −0.47 to +0.05 over 8 iterations shows the policy improving against a greedy opponent. Each iteration is a separate sandbox call — the host owns the Q-table and the loop control; the sandbox owns the episode dynamics. The jump in Q-states from 393 to 1297 reflects the agent exploring new board positions as its policy improves. Early iterations barely escape losing positions; later ones have enough coverage to exploit the greedy opponent’s fork blindspot.

After training, q_policy_code() serializes the Q-table into a choose_action string with the same interface as random and greedy. This lets the learned policy drop into evaluate_matchup and play_against with zero changes to those functions.

After the evaluation the script prompts you to play. Choose X to move first or O to let the opponent open.

Expected output — interactive game as O against greedy:
Play a game? Choose your side [X/O] (or press Enter to skip): O
Opponent policy ['random', 'greedy', 'q_learned'] (default: greedy):

You are O. Opponent: greedy.
Empty squares show their position number (0–8).

 0 | 1 | 2
---+---+---
 3 | 4 | 5
---+---+---
 6 | 7 | 8

  X (greedy) plays 4
 0 | 1 | 2
---+---+---
 3 | X | 5
---+---+---
 6 | 7 | 8

Your move (O), choose from [0, 1, 2, 3, 5, 6, 7, 8]: 0
 O | 1 | 2
---+---+---
 3 | X | 5
---+---+---
 6 | 7 | 8

  X (greedy) plays 8
...
The opponent’s policy code runs inside the sandbox on every turn — the choose_action function never executes in your host process. The sandbox stays open for the whole game session (timeout_secs=300); only the move oracle harness re-runs on each turn.
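The namespace trick the tic-tac-toe harness relies on, exec'ing each policy string into its own dict so neither can clobber the other's globals, can be demonstrated on its own. The two toy policy strings below are illustrative, not the ones from POLICIES:

```python
import random

# Toy policy strings — deliberately define the SAME function name.
POLICY_A = "def choose_action(board, player, rng):\n    return 0\n"
POLICY_B = "def choose_action(board, player, rng):\n    return 8\n"

ns_a, ns_b = {}, {}
exec(POLICY_A, ns_a)   # each string gets a private global namespace,
exec(POLICY_B, ns_b)   # so the second exec cannot overwrite the first

rng = random.Random(0)
assert ns_a["choose_action"]([None] * 9, "X", rng) == 0
assert ns_b["choose_action"]([None] * 9, "O", rng) == 8
```

With a single shared namespace, the second exec would silently replace the first choose_action — exactly the failure mode the per-policy dicts prevent when policies come from different sources.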

Key design callouts

Why the seed is embedded in the harness string

The seed is formatted directly into the Python script that runs inside the sandbox, not set via an environment variable or a host-side call. This means the host process’s random state has no path into the episode. If you set the seed on the host and then passed the environment object into the sandbox, any host-side RNG calls between setup and rollout would shift the environment’s random state relative to what you expected. Embedding it in the harness makes the episode fully self-contained. In gymnasium specifically, env.reset(seed=seed) only seeds the observation and transition RNG — the action space has a separate RNG that must be seeded independently with env.action_space.seed(seed). Forgetting the second call produces non-deterministic trajectories even when everything else is correct.
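The claim that host-side RNG calls have no path into the episode can be checked locally, with a subprocess standing in for the sandbox (a sketch under that assumption, not the Tensorlake API):

```python
import random
import subprocess
import sys

# Minimal harness template; a subprocess plays the role of the sandbox here.
HARNESS = "import random\nrng = random.Random({seed})\nprint(rng.random())"

def run(seed: int) -> str:
    script = HARNESS.format(seed=seed)   # seed embedded as a literal
    result = subprocess.run([sys.executable, "-c", script],
                            capture_output=True, text=True)
    return result.stdout.strip()

first = run(7)
random.seed(999)         # deliberately disturb the host RNG between the runs
random.random()
second = run(7)
assert first == second   # the embedded seed ignores host RNG state entirely
```

Had the seed been set on the host instead, the random.seed(999) call above would have changed what the second episode saw.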

Why each rollout gets its own sandbox

Sharing a sandbox across rollouts would mean sharing filesystem state, installed package versions, and any residual process state from prior episodes. Even if you call env.reset() correctly, state outside the environment object — temporary files, cached computations, mutated globals — can persist and affect the next episode. Creating a fresh sandbox per rollout makes the isolation structural rather than depending on careful cleanup.

How this relates to the GSPO pattern

In RL Training with GSPO, the sandbox is a reward oracle: each model completion is sent to a sandbox that runs a hidden test suite and returns a score. The reproducibility concern is different there — you need each completion to be evaluated fairly, not that the environment is deterministic. But the underlying mechanism is the same: one sandbox per evaluation, no shared state. The reproducibility pattern here is what you would use when the environment itself (not just the evaluator) needs to be deterministic across training runs.

What to build next

RL Training with GSPO

Use sandboxes as a reward oracle to fine-tune a language model on code generation tasks.

Agentic Swarm Intelligence

Dispatch parallel sandboxes across a swarm of worker agents for large-scale rollout collection.

Snapshots

Freeze environment state mid-rollout to create branching experiments without re-running from scratch.