--env flag. Tensorlake plugs in as one of those providers — alongside other sandboxes and local Docker — so the same Harbor commands run on Tensorlake sandboxes without changing your tasks, agents, or evaluators.
This guide focuses on running CLI-agent evaluations against benchmarks like Terminal-Bench. Harbor also supports generating rollouts for RL optimization — we’ll cover those workflows in follow-up guides.
Quick start
Get a Tensorlake API key
Grab one from the Tensorlake Dashboard. You’ll also need an API key for whichever agent provider you want to evaluate (e.g., Anthropic).
Install Harbor with the Tensorlake provider
The harbor[tensorlake] extra installs the TensorLakeEnvironment provider alongside Harbor. Install it with either uv or pip.
Run a Terminal-Bench task
Run a single Terminal-Bench task on Tensorlake with Claude Code as the agent:
Drop --include-task-name to run the full Terminal-Bench 2.1 suite. --ae KEY=VALUE forwards an environment variable from your shell into the sandbox where the agent runs; add more --ae flags for any other secrets the agent needs.
Why Tensorlake for Harbor
Harbor’s value comes from running large fleets of environments in parallel and trusting the results. Tensorlake’s runtime is designed for exactly that workload:
- Per-trial sandboxes: each task starts on a clean machine and is destroyed at the end. No kernel state is shared between trials, which matters for both eval reproducibility and RL reward integrity.
- Full task-environment support: Tensorlake imports a task’s real Docker image and converts it into a sandbox image that boots directly, so every trial runs the exact environment the benchmark defines rather than one approximated by replaying a Dockerfile. That closes the environment gap that otherwise quietly skews results.
- Pre-warmed snapshots: environments with heavy apt/pip installs (PyTorch, CUDA toolchains, full Linux desktops) can be built once, snapshotted, and restored in under a second for every subsequent trial or rollout.
- Independent verification: Harbor’s test script runs inside the sandbox and writes 1.0 or 0.0 to reward.txt. The agent never sees or touches the verifier, so “the agent said it worked” is never confused with “the tests pass.”
- Parallel scale: Tensorlake schedules thousands of sandboxes concurrently, which is what RL rollout generation and full benchmark sweeps need.
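Because the verifier reports a trial’s outcome as 1.0 or 0.0 in reward.txt, aggregating results can be as simple as reading those files back. The sketch below is illustrative only: the helper names are hypothetical, and where Harbor stores each trial’s reward.txt on disk is not specified here, so the paths are passed in by the caller.

```python
from pathlib import Path


def read_reward(reward_file: Path) -> float:
    """Parse one reward file; anything other than 1.0 or 0.0 is treated as an error."""
    text = reward_file.read_text().strip()
    reward = float(text)  # raises ValueError on non-numeric content
    if reward not in (0.0, 1.0):
        raise ValueError(f"unexpected reward {text!r} in {reward_file}")
    return reward


def pass_rate(reward_files: list[Path]) -> float:
    """Fraction of trials whose verifier wrote 1.0 (0.0 for an empty list)."""
    rewards = [read_reward(p) for p in reward_files]
    return sum(rewards) / len(rewards) if rewards else 0.0
```

Rejecting any value other than 1.0 or 0.0 is a deliberate choice for this sketch: a malformed file then surfaces as an error instead of silently shifting the pass rate.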
Anatomy of a Harbor task
Harbor expects each task to be laid out like this, taking gcode-to-text as an example:
- environment/Dockerfile defines the base image and any setup steps.
- instruction.md is the prompt the agent receives.
- solution/ is an oracle reference used to validate the environment itself.
- tests/test.sh runs after the agent finishes and produces reward.txt.
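Before launching a run, it can help to confirm that a task directory contains the pieces listed above. This is a minimal sketch, not part of Harbor: missing_entries and EXPECTED are hypothetical names, and the check covers only the four entries this section names.

```python
from pathlib import Path

# Entries named in the task layout above; True marks a directory.
EXPECTED = {
    "environment/Dockerfile": False,
    "instruction.md": False,
    "solution": True,
    "tests/test.sh": False,
}


def missing_entries(task_dir: Path) -> list[str]:
    """Return the expected relative paths that are absent from task_dir."""
    missing = []
    for rel, is_dir in EXPECTED.items():
        path = task_dir / rel
        present = path.is_dir() if is_dir else path.is_file()
        if not present:
            missing.append(rel)
    return missing
```

An empty list means every expected entry is present; anything else names what to add before the task can run.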
Tune sandbox resources
Each task’s task.toml controls the sandbox Harbor provisions on Tensorlake. Set resources in the [environment] block:
task.toml
| Field | Default | Forwarded to Tensorlake |
|---|---|---|
| cpus | 1 | cpus |
| memory_mb | 2048 | memory_mb |
| storage_mb | 10240 | ephemeral_disk_mb |
| allow_internet | true | allow_internet_access |
Tensorlake requires memory_mb to be between 1024 and 8192 MB per CPU core.
- Large or heavy images: if your environment/Dockerfile pulls in big toolchains (PyTorch, CUDA, full Linux desktops, large datasets), bump cpus and memory_mb so the build and runtime have headroom, and raise storage_mb past the image size plus working-set room. Underprovisioned sandboxes show up as build timeouts or OOMs mid-trial.
- Lock down allow_internet: set allow_internet = false to stop the agent from searching the web for answers. If the verifier needs network access, bake those dependencies into the Dockerfile. Per-host allowlists are coming soon, so you’ll be able to block search engines while leaving package mirrors reachable.
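The per-core memory rule is easy to check before submitting a run. A minimal sketch, assuming both bounds are inclusive (the text does not say); memory_bounds_mb and check_memory are hypothetical helpers, not part of Harbor or Tensorlake.

```python
def memory_bounds_mb(cpus: int) -> tuple[int, int]:
    """Allowed memory range for a CPU count: 1024 to 8192 MB per core."""
    return 1024 * cpus, 8192 * cpus


def check_memory(cpus: int, memory_mb: int) -> None:
    """Raise ValueError if memory_mb falls outside the per-core range."""
    low, high = memory_bounds_mb(cpus)
    # Assumption: both ends of the range are accepted.
    if not low <= memory_mb <= high:
        raise ValueError(
            f"memory_mb={memory_mb} is outside {low}-{high} MB for cpus={cpus}"
        )
```

With the defaults (cpus = 1, memory_mb = 2048) the check passes. Under this rule, raising cpus to 4 without touching memory_mb would fail, since 4 cores need at least 4096 MB.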
Image build & caching
Each trial boots from an image. Harbor uses a prebuilt image when the task declares one, and otherwise builds the task’s Dockerfile:
| Source | How to set it | When to use |
|---|---|---|
| Prebuilt image | docker_image in task.toml | Fastest start — boot directly from an image with the environment already baked in. Terminal-Bench ships these. |
| Dockerfile | environment/Dockerfile | No prebuilt image declared. Built once, cached, and reused across trials. |
Prebuilt image
If a task declares a docker_image in task.toml, Harbor boots directly from that image and skips the Dockerfile entirely:
task.toml
~/.tensorlake/config.toml (organization and project) if it’s present.
Dockerfile
If a task has no docker_image, Harbor builds its environment/Dockerfile once via Tensorlake’s image builder, caches it, and boots every later trial directly from the cached image, with no per-trial apt/pip work. The cache is keyed on the Dockerfile and every file in the build context, so editing a requirements.txt pin or any COPY’d file automatically triggers a rebuild.
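One way to picture a cache keyed on the whole build context is a content hash over every file in it. The sketch below shows that idea only. How Harbor or Tensorlake actually derives the key is not documented here, and context_digest is a hypothetical helper.

```python
import hashlib
from pathlib import Path


def context_digest(context_dir: Path) -> str:
    """Hash every file's relative path and contents, in a stable order."""
    digest = hashlib.sha256()
    files = sorted(p for p in context_dir.rglob("*") if p.is_file())
    for path in files:
        rel = path.relative_to(context_dir).as_posix()
        digest.update(rel.encode())
        digest.update(b"\0")
        # Hash contents separately so file boundaries stay unambiguous.
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

Under a scheme like this, changing a single pinned version in requirements.txt produces a different digest, which is why such an edit would miss the cache and trigger a rebuild.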
RUN/COPY steps on each trial, so a trial is never blocked — it just runs a little slower. The fallback is also available as an explicit escape hatch while you iterate on a Dockerfile:
docker build, so a few Docker conventions need small adjustments:
- COPY does not auto-create parent directories: COPY x /a/b/c fails if /a/b doesn’t exist yet. Add RUN mkdir -p /a/b before the COPY.
- Don’t pin exact apt versions (apt-get install curl=8.5.0-2ubuntu10.6): drop the pin or pick a version that exists in the target distro.
- Use a FROM image that ships the Python you need (e.g. python:3.10-bookworm) rather than relying on a non-native version being fetched at build time.
--force-build. This applies to that run only and doesn’t disturb the cache used by subsequent normal runs.
Sharing images publicly
By default, images Harbor builds or imports are private to your organization, so only your org can boot from them. Add --ek is_public=true to register a freshly built or imported image as public, so any organization with Tensorlake access can reuse it:
docker_image imports. Automatic boot-from-public by another organization is wired through the prebuilt docker_image path — that’s how the published Terminal-Bench images are reused — so if your goal is to publish an environment others boot directly, prefer a docker_image reference.
Publishing public images is gated to an allow list. If your account isn’t on it, the flag is ignored and the image stays private. Reach out to Tensorlake to be added.
is_public only takes effect when the image is newly registered. If an image with the same name already exists — a private copy from an earlier run, or an existing public one — Harbor boots it as-is and won’t republish it. To turn an already-private image public, delete it first (or change the build context so it gets a new name), then rerun with --ek is_public=true.
Ad-hoc native dependencies
If a task just needs a couple of extra apt packages and you don’t want to edit the Dockerfile or maintain a snapshot, use preinstall_packages:
Interactive debugging
When a trial fails and you want to poke around the live environment, attach to the session:
Structured logs
Each trial produces structured artifacts, e.g.:
- The agent’s actions and outputs
- What the verifier checked
- Why the trial passed or failed
What to build next
Snapshots
Build an environment once, snapshot it, and restore in seconds for every trial.
Reproducible RL Environments
Use sandboxes as a deterministic reward oracle for RL training loops.