Harbor is a framework from the creators of Terminal-Bench for evaluating and optimizing agents and language models. With Harbor you can evaluate arbitrary agents (Claude Code, OpenHands, Codex CLI, and others) against curated datasets like Terminal-Bench, SWE-Bench, and Aider Polyglot, build and share your own benchmarks, run thousands of trials in parallel across cloud providers, and generate rollouts for RL optimization. Harbor abstracts the execution backend behind an --env flag. Tensorlake plugs in as one of those providers — alongside other sandboxes and local Docker — so the same Harbor commands run on Tensorlake sandboxes without changing your tasks, agents, or evaluators.
This guide focuses on running CLI-agent evaluations against benchmarks like Terminal-Bench. Harbor also supports generating rollouts for RL optimization — we’ll cover those workflows in follow-up guides.
New to Tensorlake? Sign up at the dashboard — new accounts include free credits, enough to run a full Terminal-Bench sweep before you pay for anything.

Quick start

1. Get a Tensorlake API key

Grab one from the Tensorlake Dashboard. You’ll also need an API key for whichever agent provider you want to evaluate (e.g., Anthropic).

2. Install Harbor with the Tensorlake provider

The harbor[tensorlake] extra installs the TensorLakeEnvironment provider alongside Harbor.
uv pip install "harbor[tensorlake]"

3. Set your environment variables

export TENSORLAKE_API_KEY="tl_..."
export ANTHROPIC_API_KEY="sk-ant-..."   # or another agent provider

4. Run a Terminal-Bench task

Run a single Terminal-Bench task on Tensorlake with Claude Code as the agent:
harbor run --env tensorlake \
  --include-task-name terminal-bench/pytorch-model-cli \
  --dataset terminal-bench/terminal-bench-2-1 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6 \
  --ae "ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY"
Drop --include-task-name to run the full Terminal-Bench 2.1 suite. --ae KEY=VALUE forwards an environment variable from your shell into the sandbox where the agent runs — add more --ae flags for any other secrets the agent needs.

Why Tensorlake for Harbor

Harbor’s value comes from running large fleets of environments in parallel and trusting the results. Tensorlake’s runtime is designed for exactly that workload:
  • Per-trial sandboxes — each task starts on a clean machine and is destroyed at the end. No shared kernel state between trials, which matters for both eval reproducibility and RL reward integrity.
  • Full task-environment support — Tensorlake imports a task’s real Docker image and converts it into a sandbox image that boots directly, so every trial runs the exact environment the benchmark defines rather than one approximated by replaying a Dockerfile. That closes the environment gap that otherwise quietly skews results.
  • Pre-warmed snapshots: environments with heavy apt/pip installs (PyTorch, CUDA toolchains, full Linux desktops) can be built once, snapshotted, and restored in under a second for every subsequent trial or rollout.
  • Independent verification — Harbor’s test script runs inside the sandbox and writes 1.0 or 0.0 to reward.txt. The agent never sees or touches the verifier, so “the agent said it worked” is never confused with “the tests pass.”
  • Parallel scale — Tensorlake schedules thousands of sandboxes concurrently, which is what RL rollout generation and full benchmark sweeps need.

Anatomy of a Harbor task

Harbor expects each task to follow a fixed layout. Take gcode-to-text as an example:
gcode-to-text
├── environment
│   ├── Dockerfile
│   └── text.gcode.gz
├── instruction.md
├── solution
│   └── solve.sh
├── task.toml
└── tests
    ├── test_outputs.py
    └── test.sh
  • environment/Dockerfile defines the base image and any setup steps.
  • instruction.md is the prompt the agent receives.
  • solution/ is an oracle reference used to validate the environment itself.
  • tests/test.sh runs after the agent finishes and produces reward.txt.
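
As a rough illustration of the last point, a tests/test.sh can boil down to running the checks and recording a binary reward. The sketch below is not taken from a real task: write_reward and the REWARD_FILE variable are hypothetical names, and the location Harbor reads reward.txt from is not specified in this guide, so treat the path as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch of a verifier helper. REWARD_FILE is a hypothetical variable; point it
# at wherever your Harbor version expects reward.txt to be written.
REWARD_FILE="${REWARD_FILE:-reward.txt}"

# write_reward CMD [ARGS...]: run the check command, then record 1.0 if it
# exited successfully or 0.0 if it failed.
write_reward() {
  if "$@"; then
    echo "1.0" > "$REWARD_FILE"
  else
    echo "0.0" > "$REWARD_FILE"
  fi
}
```

A test.sh built this way might end with something like write_reward pytest tests/test_outputs.py, assuming pytest is installed in the image and the tests are reachable at that path.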

Tune sandbox resources

Each task’s task.toml controls the sandbox Harbor provisions on Tensorlake. Set resources in the [environment] block:
task.toml
[environment]
cpus = 2
memory_mb = 4096
storage_mb = 20480
allow_internet = true
Field          | Default | Forwarded to Tensorlake
cpus           | 1       | cpus
memory_mb      | 2048    | memory_mb
storage_mb     | 10240   | ephemeral_disk_mb
allow_internet | true    | allow_internet_access
Tensorlake requires memory_mb to be between 1024 and 8192 MB per CPU core.
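
A pre-flight check can catch an out-of-range value before a sweep starts. This is a minimal sketch, assuming cpus is a whole number and both bounds are inclusive; check_memory_per_cpu is a hypothetical helper, not part of Harbor or Tensorlake.

```shell
#!/usr/bin/env bash
# check_memory_per_cpu CPUS MEMORY_MB
# Returns 0 when MEMORY_MB is within 1024-8192 MB per CPU core, 1 otherwise.
# Assumes CPUS is an integer and treats both bounds as inclusive.
check_memory_per_cpu() {
  local cpus="$1" memory_mb="$2"
  local min=$((cpus * 1024))
  local max=$((cpus * 8192))
  if [ "$memory_mb" -lt "$min" ] || [ "$memory_mb" -gt "$max" ]; then
    echo "memory_mb=$memory_mb is outside ${min}-${max} for cpus=$cpus" >&2
    return 1
  fi
  return 0
}
```

For the task.toml above, check_memory_per_cpu 2 4096 succeeds, because 4096 MB across 2 cores is 2048 MB per core.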
A few rules of thumb:
  • Large or heavy images — if your environment/Dockerfile pulls in big toolchains (PyTorch, CUDA, full Linux desktops, large datasets), bump cpus and memory_mb so the build and runtime have headroom, and raise storage_mb past the image size plus working-set room. Underprovisioned sandboxes show up as build timeouts or OOMs mid-trial.
  • Lock down allow_internet — set allow_internet = false to stop the agent from searching the web for answers. If the verifier needs network access, bake those dependencies into the Dockerfile. Per-host allowlists are coming soon, so you’ll be able to block search engines while leaving package mirrors reachable.

Image build & caching

Each trial boots from an image. Harbor uses a prebuilt image when the task declares one, and otherwise builds the task’s Dockerfile:
Source         | How to set it             | When to use
Prebuilt image | docker_image in task.toml | Fastest start: boots directly from an image with the environment already baked in. Terminal-Bench ships these.
Dockerfile     | environment/Dockerfile    | No prebuilt image declared. Built once, cached, and reused across trials.
Either way, the image is built or imported once and then reused: you pay the cost only on the first trial, and every later trial boots directly from the cached image. If a task sets both, the prebuilt image takes precedence over the Dockerfile.
If you reuse one heavy environment across many runs (RL rollouts or repeated evals of the same task), it can be restored in under a second from a pre-warmed snapshot instead of being rebuilt. See Snapshots.

Prebuilt image

If a task declares a docker_image in task.toml, Harbor boots directly from that image and skips the Dockerfile entirely:
task.toml
[environment]
docker_image = "myorg/my-task-env:2025-06"
Harbor looks the image up in Tensorlake by name and boots from it; if it isn’t registered yet, it imports the image once and reuses it on every later trial. The registered name is derived from the reference string you put here, so the first import of a given reference is what every later run boots.
Always publish with an immutable tag, never latest. Because the registered name comes from the reference string (not the image contents), a tag like latest is captured at its first import and then frozen: if you push new content to latest, Harbor keeps booting the old image and never re-pulls. Use an immutable tag or digest (e.g. myorg/my-task-env:2025-06 or ...@sha256:...) so a new build means a new reference, which is what triggers a fresh import. This is the convention the published Terminal-Bench images follow.
To refresh an image that’s already registered, point docker_image at a new immutable tag/digest, or delete the registered Tensorlake image so the next run re-imports it. (--force-build does not re-import; it builds from the Dockerfile instead.)
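
One way to enforce the immutable-reference convention is a small lint over the docker_image values in your tasks. The function below is a sketch with a hypothetical name (is_pinned_ref); it only checks the shape of the reference string, so it cannot tell whether a dated tag was later overwritten in the registry.

```shell
#!/usr/bin/env bash
# is_pinned_ref REF
# Returns 0 when REF carries a digest or an explicit tag other than "latest",
# and 1 when it is untagged or tagged "latest".
is_pinned_ref() {
  local ref="$1"
  case "$ref" in
    *@sha256:*) return 0 ;;  # digest references are content-addressed
  esac
  # Inspect only the last path component so a registry port such as
  # host:5000/img is not mistaken for a tag.
  local last="${ref##*/}"
  case "$last" in
    *:latest) return 1 ;;
    *:*)      return 0 ;;
    *)        return 1 ;;    # no tag at all resolves to "latest"
  esac
}
```

For example, is_pinned_ref "myorg/my-task-env:2025-06" succeeds, while is_pinned_ref "myorg/my-task-env:latest" and is_pinned_ref "myorg/my-task-env" both fail.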
Terminal-Bench 2.1 images are already published. We’ve registered every Terminal-Bench 2.1 task image publicly, so anyone with Tensorlake access boots straight from them — no build, no import. Just run the dataset as usual and each task picks up its published image.
Set your org and project context. Looking an image up by name requires organization and project context. Without it, Harbor can’t find the published image and falls back to importing it fresh — you’ll see a line like this in the logs:
importing docker_image alexgshaw/pypi-server:20251031 directly: Looking up a sandbox image by name requires organization and project context (TENSORLAKE_ORGANIZATION_ID and TENSORLAKE_PROJECT_ID).
Set both before running so the lookup resolves and the boot stays instant:
export TENSORLAKE_ORGANIZATION_ID="..."
export TENSORLAKE_PROJECT_ID="..."
Harbor also reads these from ~/.tensorlake/config.toml (organization and project) if it’s present.
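
If you rely on environment variables rather than the config file, a quick guard before a long sweep avoids discovering a missing variable from the logs. This is a sketch: require_env is a hypothetical helper, and it checks only exported variables, since those are the ones a child process such as harbor would inherit.

```shell
#!/usr/bin/env bash
# require_env NAME [NAME...]
# Prints every variable that is unset or empty in the exported environment
# and returns 1 if any are missing, 0 otherwise.
require_env() {
  local missing=0 name
  for name in "$@"; do
    # printenv sees only exported variables, which is what child processes get.
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be require_env TENSORLAKE_API_KEY TENSORLAKE_ORGANIZATION_ID TENSORLAKE_PROJECT_ID, placed ahead of the harbor run command.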

Dockerfile

If a task has no docker_image, Harbor builds its environment/Dockerfile once via Tensorlake’s image builder, caches it, and boots every later trial directly from the cached image — no per-trial apt/pip work. The cache is keyed on the Dockerfile and every file in the build context, so editing a requirements.txt pin or any COPY’d file automatically triggers a rebuild.
harbor run --env tensorlake \
  --dataset terminal-bench/terminal-bench-2-1 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
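
To get a feel for when a content-keyed cache like this invalidates, you can fingerprint a build context locally. The sketch below is an illustration of the idea only: context_fingerprint is a hypothetical helper, and the hash it prints is not the key Tensorlake computes. It assumes GNU-style find, sort -z, xargs, and sha256sum are available, and a context with at least one file.

```shell
#!/usr/bin/env bash
# context_fingerprint DIR
# Prints a single SHA-256 derived from the relative path and contents of every
# file under DIR. Any edit, rename, or added file changes the result.
context_fingerprint() {
  (
    cd "$1" || exit 1
    # Sort paths bytewise so the result does not depend on directory order.
    find . -type f -print0 \
      | LC_ALL=C sort -z \
      | xargs -0 sha256sum \
      | sha256sum \
      | cut -d' ' -f1
  )
}
```

Running context_fingerprint environment before and after editing a requirements.txt pin yields two different values, which mirrors why such an edit triggers a rebuild.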
If a build ever fails, Harbor automatically falls back to replaying the Dockerfile’s RUN/COPY steps on each trial, so a trial is never blocked — it just runs a little slower. The fallback is also available as an explicit escape hatch while you iterate on a Dockerfile:
harbor run --env tensorlake ... --ek use_oci_image_build=false
Dockerfile requirements. The image builder is stricter than a local docker build, so a few Docker conventions need small adjustments:
  • COPY does not auto-create parent directories. COPY x /a/b/c fails if /a/b doesn’t exist yet. Add RUN mkdir -p /a/b before the COPY.
  • Avoid exact apt version pins that the target distro may not carry (e.g. apt-get install curl=8.5.0-2ubuntu10.6). Drop the pin, or pick a version that exists in the target distro.
  • Use a FROM image that ships the Python you need (e.g. python:3.10-bookworm) rather than relying on a non-native version being fetched at build time.
To force a fresh rebuild even when a valid cached image exists, add --force-build. This applies to that run only and doesn’t disturb the cache used by subsequent normal runs.
harbor run --env tensorlake \
  --force-build \
  --dataset terminal-bench/terminal-bench-2-1 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6

Sharing images publicly

By default, images Harbor builds or imports are private to your organization — only your org can boot from them. Add --ek is_public=true to register a freshly built or imported image as public, so any organization with Tensorlake access can reuse it:
harbor run --env tensorlake \
  --ek is_public=true \
  --dataset terminal-bench/terminal-bench-2-1 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
The flag applies to both Dockerfile-built images and prebuilt docker_image imports. Automatic boot-from-public by another organization is wired through the prebuilt docker_image path — that’s how the published Terminal-Bench images are reused — so if your goal is to publish an environment others boot directly, prefer a docker_image reference.
Publishing public images is gated to an allow list. If your account isn’t on it, the flag is ignored and the image stays private. Reach out to Tensorlake to be added.
is_public only takes effect when the image is newly registered. If an image with the same name already exists — a private copy from an earlier run, or an existing public one — Harbor boots it as-is and won’t republish it. To turn an already-private image public, delete it first (or change the build context so it gets a new name), then rerun with --ek is_public=true.

Ad-hoc native dependencies

If a task just needs a couple of extra apt packages and you don’t want to edit the Dockerfile or maintain a snapshot, use preinstall_packages:
harbor run --env tensorlake \
  --ek 'preinstall_packages=["build-essential","rustc","cargo"]' \
  --dataset terminal-bench/terminal-bench-2-1 \
  --agent claude-code \
  --model anthropic/claude-sonnet-4-6
The packages are installed at the start of each trial. Prefer snapshots when the package set is large or reused across many runs so you pay the install cost once.

Interactive debugging

When a trial fails and you want to poke around the live environment, attach to the session:
harbor env attach <session_id>
Drop directly into the running sandbox to inspect state, rerun tests by hand, and confirm whether the failure was the agent or the environment.

Structured logs

Each trial produces structured artifacts, e.g.:
gcode-to-text__UFALMLv
├── agent/
├── verifier/
├── result.json
└── trial.log
So you can trace:
  • The agent’s actions and outputs
  • What the verifier checked
  • Why the trial passed or failed

What to build next

Snapshots

Build an environment once, snapshot it, and restore in seconds for every trial.

Reproducible RL Environments

Use sandboxes as a deterministic reward oracle for RL training loops.