A rollout is a complete episode of agent-environment interaction: an agent takes actions, the environment transitions, and rewards accumulate until the episode ends. Reproducibility means that given the same random seed and the same action sequence, every rollout produces exactly the same observations, transitions, and rewards. This property is foundational for RL engineering — without it, you cannot reliably compare two policy versions, reproduce a bug seen during training, or verify that a reward spike was real and not noise. The hard part is that real training runs hundreds or thousands of rollouts in parallel. Each worker must be completely isolated from the others: no shared filesystem, no shared process state, no network side-effects leaking across episodes. If any state bleeds between workers, your “reproducible” seed no longer controls the outcome and you lose the guarantee. Tensorlake sandboxes enforce this isolation at the infrastructure level — every rollout gets its own fresh environment, and the seed is the only variable in play.
Core concepts
Isolation means each rollout runs in its own compute environment with no shared resources. Two workers seeded with different values must not be able to influence each other’s trajectories through any shared channel — not a shared pip cache, not a shared /tmp, not shared network state. In production, this matters most when you are running hundreds of rollouts per training step: any shared state becomes a source of variance that your reward signal cannot explain.
Stateful resets mean the environment always starts from a known, controlled baseline when a new rollout begins. A reset that partially inherits state from a previous episode is one of the most common and hardest-to-debug sources of non-reproducibility. Because each sandbox is created fresh per rollout, the reset is total — there is no prior episode state to inherit.
Determinism means the environment’s random number generator is seeded before any interaction begins, and the seed is the sole source of randomness for the entire episode. Given the same seed, the same initial observation, and the same action sequence, the trajectory must be identical byte-for-byte. This lets you replay any episode from training history, compare policy versions on equal footing, and write regression tests against specific trajectories.
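As a minimal sketch of this guarantee using gymnasium directly (CartPole-v1 and the hard-coded action list are illustrative choices, not part of the sandbox example below):

```python
import gymnasium as gym

def rollout(seed: int, actions: list[int]):
    # Fresh env per rollout; the seed is the only source of randomness.
    env = gym.make("CartPole-v1")
    obs, _ = env.reset(seed=seed)
    trajectory = [obs.tolist()]
    total = 0.0
    for action in actions:
        obs, reward, terminated, truncated, _ = env.step(action)
        trajectory.append(obs.tolist())
        total += reward
        if terminated or truncated:
            break
    env.close()
    return trajectory, total

actions = [0, 1, 1, 0, 1, 1, 0]
# Same seed + same action sequence -> identical trajectory and return.
assert rollout(42, actions) == rollout(42, actions)
```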
How Tensorlake sandboxes provide this
Sandbox.create() starts a fresh, isolated compute environment and returns a box handle. Every sandbox is a separate process tree with its own filesystem and memory. There is no shared state between two sandboxes created from the same client.
The seed is passed into the environment harness as a string literal embedded in the Python script that runs inside the sandbox — not set on the host process. This keeps the host’s random state completely separate from the environment’s, which is important when you dispatch many rollouts from a single host thread pool.
Parallel rollouts map cleanly onto ThreadPoolExecutor: each thread creates its own sandbox, runs its episode, collects its trajectory, and the sandbox is destroyed when the context manager exits. The executor manages concurrency; the sandboxes manage isolation.
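A hedged sketch of that dispatch loop. Sandbox.create() and the context-manager teardown are described above; the import path, the run_code() method, and the .stdout field on its result are assumptions about the SDK surface, so check the reference for exact signatures:

```python
import json
from concurrent.futures import ThreadPoolExecutor

from tensorlake import Sandbox  # assumed import path

# The seed is formatted into the harness as a literal, so the host's RNG
# state never touches the episode.
HARNESS = """
import gymnasium as gym
import json

seed = {seed}
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=seed)
env.action_space.seed(seed)  # the action space has its own RNG

total, done = 0.0, False
while not done:
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total += reward
    done = terminated or truncated

print(json.dumps({{"seed": seed, "total_reward": total}}))
"""

def run_rollout(seed: int) -> dict:
    # One fresh sandbox per rollout; destroyed when the context manager exits.
    with Sandbox.create() as box:
        result = box.run_code(HARNESS.format(seed=seed))  # assumed method
        return json.loads(result.stdout)                  # assumed result shape

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_rollout, range(8)))  # one seed per rollout
```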
Prerequisites
This example uses python-dotenv to load your Tensorlake API key. Create a .env file in your project root, and the SDK will pick it up automatically:
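A minimal sketch of the .env contents (the variable name TENSORLAKE_API_KEY is an assumption; check the SDK reference for the canonical name):

```
# Assumed variable name; verify against the Tensorlake SDK docs
TENSORLAKE_API_KEY=your-api-key-here
```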
TypeScript SDK starter
The same reproducibility pattern works from Node.js: embed the seed in the harness, run one rollout per sandbox, and compare trajectories across identical seeds. Use Promise.all() with one rollout per seed and aggregate the returned JSON results by seed.
Full example
Tic-tac-toe: policy evaluation
This example extends the CartPole infrastructure to a custom two-player environment and shows where sandboxes are more directly necessary. Policies are defined as code strings — the same pattern used in RL Training with GSPO for LLM-generated completions. A policy that crashes, loops, or behaves unexpectedly only kills its own sandbox; the rest of the evaluation runs unaffected. The data model is the same as CartPole: TttConfig extends RolloutConfig by replacing env_name with policy_x and policy_o; run_ttt_batch returns the same RolloutResult. The total_reward field becomes the mean return per game from X’s perspective (+1 win, −1 loss, 0 draw). evaluate_matchup follows the same parallel dispatch pattern as collect_parallel_rollouts, running one sandbox per seed to get a reliable return estimate. This is the policy evaluation step in policy iteration — you would call it after each policy update to measure how much the return improved.
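A hedged sketch of that data model; only the names called out above (RolloutConfig, TttConfig, RolloutResult, env_name, policy_x, policy_o, total_reward) come from the text, and the remaining fields are assumptions:

```python
from dataclasses import dataclass

@dataclass
class RolloutConfig:
    env_name: str        # e.g. "CartPole-v1" in the earlier example
    seed: int

@dataclass
class TttConfig:
    # Same shape as RolloutConfig, with env_name replaced by two policy strings.
    seed: int
    policy_x: str
    policy_o: str

@dataclass
class RolloutResult:
    seed: int
    total_reward: float  # here: mean return per game from X's perspective
```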
q_policy_code() serializes the Q-table into a choose_action string with the same interface as random and greedy. This lets the learned policy drop into evaluate_matchup and play_against with zero changes to those functions.
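A hedged sketch of what that serialization could look like, assuming a Q-table keyed by (state, action) pairs and a choose_action(state, legal_moves) interface; both conventions are assumptions rather than the example's actual code:

```python
def q_policy_code(q_table: dict) -> str:
    # Embed the learned table as a repr literal so the policy string is
    # self-contained and needs no imports inside the sandbox.
    return f"""
Q = {q_table!r}

def choose_action(state, legal_moves):
    # Greedy over learned values; unseen (state, action) pairs default to 0.0.
    return max(legal_moves, key=lambda a: Q.get((state, a), 0.0))
"""
```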
After the evaluation the script prompts you to play. Choose X to move first or O to let the opponent open.
Expected output — interactive game as O against greedy:
The choose_action function never executes in your host process. The sandbox stays open for the whole game session (timeout_secs=300); only the move oracle harness re-runs on each turn.
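A hypothetical sketch of that session loop; Sandbox.create(timeout_secs=300) comes from the text, while run_code() and all game helpers below are illustrative stand-ins, not the example's real code:

```python
from tensorlake import Sandbox  # assumed import path

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board: str) -> str | None:
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

# Tiny per-turn harness: the policy's choose_action runs in the sandbox only.
MOVE_ORACLE = """
{policy}
board = {board!r}
legal = [i for i, c in enumerate(board) if c == "."]
print(choose_action(board, legal))
"""

def play(policy_code: str) -> None:
    board = "." * 9
    # One sandbox for the whole session; only the harness re-runs each turn.
    with Sandbox.create(timeout_secs=300) as box:
        while winner(board) is None and "." in board:
            harness = MOVE_ORACLE.format(policy=policy_code, board=board)
            move = int(box.run_code(harness).stdout)  # assumed API
            board = board[:move] + "X" + board[move + 1:]
            if winner(board) or "." not in board:
                break
            human = int(input(f"{board}\nYour move (0-8): "))
            board = board[:human] + "O" + board[human + 1:]
    print("Result:", winner(board) or "draw")
```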
Key design callouts
Why the seed is embedded in the harness string
The seed is formatted directly into the Python script that runs inside the sandbox, not set via an environment variable or a host-side call. This means the host process’s random state has no path into the episode. If you set the seed on the host and then passed the environment object into the sandbox, any host-side RNG calls between setup and rollout would shift the environment’s random state relative to what you expected. Embedding it in the harness makes the episode fully self-contained. In gymnasium specifically, env.reset(seed=seed) only seeds the observation and transition RNG — the action space has a separate RNG that must be seeded independently with env.action_space.seed(seed). Forgetting the second call produces non-deterministic trajectories even when everything else is correct.
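In code, with gymnasium (CartPole-v1 is an illustrative choice):

```python
import gymnasium as gym

seed = 42
env = gym.make("CartPole-v1")
obs, info = env.reset(seed=seed)  # seeds the observation/transition RNG only
env.action_space.seed(seed)       # separate RNG: without this call, sampled
                                  # actions diverge between replays
```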
Why each rollout gets its own sandbox
Sharing a sandbox across rollouts would mean sharing filesystem state, installed package versions, and any residual process state from prior episodes. Even if you call env.reset() correctly, state outside the environment object — temporary files, cached computations, mutated globals — can persist and affect the next episode. Creating a fresh sandbox per rollout makes the isolation structural rather than depending on careful cleanup.
How this relates to the GSPO pattern
In RL Training with GSPO, the sandbox is a reward oracle: each model completion is sent to a sandbox that runs a hidden test suite and returns a score. The reproducibility concern is different there — you need each completion to be evaluated fairly, not the environment to be deterministic. But the underlying mechanism is the same: one sandbox per evaluation, no shared state. The reproducibility pattern here is what you would use when the environment itself (not just the evaluator) needs to be deterministic across training runs.
What to build next
RL Training with GSPO
Use sandboxes as a reward oracle to fine-tune a language model on code generation tasks.
Agentic Swarm Intelligence
Dispatch parallel sandboxes across a swarm of worker agents for large-scale rollout collection.
Snapshots
Freeze environment state mid-rollout to create branching experiments without re-running from scratch.