# Harbor

> Run Harbor evaluations and RL rollouts on Tensorlake Sandboxes — fresh isolation per trial, pre-warmed snapshots for expensive environments, and independent test verification.

[Harbor](https://github.com/harbor-framework/harbor) is a framework from the creators of [Terminal-Bench](https://www.tbench.ai/) for evaluating and optimizing agents and language models. With Harbor you can evaluate arbitrary agents (Claude Code, OpenHands, Codex CLI, and others) against curated datasets like Terminal-Bench, SWE-Bench, and Aider Polyglot, build and share your own benchmarks, run thousands of trials in parallel across cloud providers, and generate rollouts for RL optimization.

Harbor abstracts the execution backend behind an `--env` flag. Tensorlake plugs in as one of those providers — alongside other sandboxes and local Docker — so the same Harbor commands run on Tensorlake sandboxes without changing your tasks, agents, or evaluators.

<Note>
  This guide focuses on running CLI-agent evaluations against benchmarks like Terminal-Bench. Harbor also supports generating rollouts for RL optimization — we'll cover those workflows in follow-up guides.
</Note>

<Tip>
  New to Tensorlake? Sign up at the [dashboard](https://cloud.tensorlake.ai) — new accounts include free credits, enough to run a full Terminal-Bench sweep before you pay for anything.
</Tip>

## Quick start

<Steps>
  <Step title="Get a Tensorlake API key">
    Grab one from the [Tensorlake Dashboard](https://cloud.tensorlake.ai). You'll also need an API key for whichever agent provider you want to evaluate (e.g., Anthropic).
  </Step>

  <Step title="Install Harbor with the Tensorlake provider">
    The `harbor[tensorlake]` extra installs the `TensorLakeEnvironment` provider alongside Harbor.

    <Tabs>
      <Tab title="uv">
        ```bash
        uv pip install "harbor[tensorlake]"
        ```
      </Tab>

      <Tab title="pip">
        ```bash
        pip install "harbor[tensorlake]"
        ```
      </Tab>
    </Tabs>
  </Step>

  <Step title="Set your environment variables">
    ```bash
    export TENSORLAKE_API_KEY="tl_..."
    export ANTHROPIC_API_KEY="sk-ant-..."   # or another agent provider
    ```
  </Step>

  <Step title="Run a Terminal-Bench task">
    Run a single Terminal-Bench task on Tensorlake with Claude Code as the agent:

    ```bash
    harbor run --env tensorlake \
      --include-task-name pytorch-model-cli \
      --dataset terminal-bench@2.0 \
      --agent claude-code \
      --model anthropic/claude-sonnet-4-6 \
      --ae ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY
    ```

    Drop `--include-task-name` to run the full Terminal-Bench 2.0 suite. `--ae KEY=VALUE` forwards an environment variable from your shell into the sandbox where the agent runs — add more `--ae` flags for any other secrets the agent needs.
  </Step>
</Steps>
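
To run several tasks from a script, one option is to assemble the same command once per task. The sketch below is a minimal illustration, not part of Harbor: `build_harbor_command` is a hypothetical helper, it uses only the flags shown in the quick start, and it launches one `harbor run` per task because this guide does not say whether `--include-task-name` can be repeated.

```python
import os
import shlex


def build_harbor_command(task_name: str, model: str = "anthropic/claude-sonnet-4-6") -> list[str]:
    """Build the argv list for one `harbor run` invocation on Tensorlake."""
    # Hypothetical helper: it only assembles flags that appear in the quick start.
    api_key = os.environ.get("ANTHROPIC_API_KEY", "")
    return [
        "harbor", "run",
        "--env", "tensorlake",
        "--include-task-name", task_name,
        "--dataset", "terminal-bench@2.0",
        "--agent", "claude-code",
        "--model", model,
        "--ae", f"ANTHROPIC_API_KEY={api_key}",
    ]


if __name__ == "__main__":
    for task in ["pytorch-model-cli", "gcode-to-text"]:
        argv = build_harbor_command(task)
        # Print with the secret masked; use subprocess.run(argv, check=True) to launch.
        print(shlex.join(argv[:-1]), "ANTHROPIC_API_KEY=***")
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems if a task name or key ever contains special characters.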

## Why Tensorlake for Harbor

Harbor's value comes from running large fleets of environments in parallel and trusting the results. Tensorlake's runtime is designed for exactly that workload:

* **Per-trial sandboxes** — each task starts on a clean machine and is destroyed at the end. No shared kernel state between trials, which matters for both eval reproducibility and RL reward integrity.
* **Pre-warmed snapshots** — environments with heavy `apt`/`pip` installs (PyTorch, CUDA toolchains, full Linux desktops) can be built once, snapshotted, and restored in under a second for every subsequent trial or rollout.
* **Independent verification** — Harbor's test script runs inside the sandbox and writes `1.0` or `0.0` to `reward.txt`. The agent never sees or touches the verifier, so "the agent said it worked" is never confused with "the tests pass."
* **Parallel scale** — Tensorlake schedules thousands of sandboxes concurrently, which is what RL rollout generation and full benchmark sweeps need.
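
Because the verifier's output is a single number in `reward.txt`, aggregating results across trials is straightforward. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the caller supplies the paths to each `reward.txt` (this guide does not say where the file lives inside a trial directory), and any value other than `1.0` or `0.0` is treated as an error.

```python
from pathlib import Path


def read_reward(reward_file: Path) -> float:
    """Parse a reward.txt containing `1.0` or `0.0` written by the verifier."""
    value = float(reward_file.read_text().strip())
    # Assumption: rewards are strictly binary, as described in the list above.
    if value not in (0.0, 1.0):
        raise ValueError(f"unexpected reward {value!r} in {reward_file}")
    return value


def pass_rate(reward_files: list[Path]) -> float:
    """Fraction of trials whose reward is 1.0; 0.0 when no files are given."""
    if not reward_files:
        return 0.0
    return sum(read_reward(p) for p in reward_files) / len(reward_files)
```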

## Anatomy of a Harbor task

Harbor expects each task to follow a fixed layout. Take [gcode-to-text](https://github.com/harbor-framework/terminal-bench-2/tree/main/gcode-to-text) as an example:

```
gcode-to-text
├── environment
│   ├── Dockerfile
│   └── text.gcode.gz
├── instruction.md
├── solution
│   └── solve.sh
├── task.toml
└── tests
    ├── test_outputs.py
    └── test.sh
```

* `environment/Dockerfile` defines the base image and any setup steps.
* `instruction.md` is the prompt the agent receives.
* `solution/` is an oracle reference used to validate the environment itself.
* `tests/test.sh` runs after the agent finishes and produces `reward.txt`.
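
When authoring your own tasks, a quick pre-flight check can catch a missing file before a sandbox is ever provisioned. This is a minimal sketch, not a Harbor feature: `missing_task_entries` is a hypothetical helper, and treating every entry below as required is an assumption drawn from the example layout rather than a documented rule.

```python
from pathlib import Path

# Entries taken from the example layout above; treating all of them as
# required is an assumption of this sketch, not a documented Harbor rule.
EXPECTED_ENTRIES = (
    "environment/Dockerfile",
    "instruction.md",
    "solution/solve.sh",
    "task.toml",
    "tests/test.sh",
)


def missing_task_entries(task_dir: Path) -> list[str]:
    """Return the expected entries that are absent from a task directory."""
    return [rel for rel in EXPECTED_ENTRIES if not (task_dir / rel).is_file()]
```

An empty list means every expected file is present; anything else names the paths to add.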

## Tune sandbox resources

Each task's `task.toml` controls the sandbox Harbor provisions on Tensorlake. Set resources in the `[environment]` block:

```toml task.toml
[environment]
cpus = 2
memory_mb = 4096
storage_mb = 20480
allow_internet = true
```

| Field            | Default | Forwarded to Tensorlake |
| ---------------- | ------- | ----------------------- |
| `cpus`           | `1`     | `cpus`                  |
| `memory_mb`      | `2048`  | `memory_mb`             |
| `storage_mb`     | `10240` | `ephemeral_disk_mb`     |
| `allow_internet` | `true`  | `allow_internet_access` |

<Note>
  Tensorlake requires `memory_mb` to be between 1024 and 8192 MB per CPU core.
</Note>
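
The per-core memory rule is easy to check before submitting a run. The sketch below is illustrative only: `memory_within_limits` is a hypothetical helper, the defaults are copied from the table above, and it assumes both bounds are inclusive and scale linearly with `cpus`.

```python
# Defaults copied from the table above.
DEFAULT_CPUS = 1
DEFAULT_MEMORY_MB = 2048

MIN_MB_PER_CPU = 1024
MAX_MB_PER_CPU = 8192


def memory_within_limits(environment: dict) -> bool:
    """Check an `[environment]` mapping against the per-core memory range."""
    cpus = environment.get("cpus", DEFAULT_CPUS)
    memory_mb = environment.get("memory_mb", DEFAULT_MEMORY_MB)
    # Assumption: both bounds are inclusive and scale linearly with `cpus`.
    return MIN_MB_PER_CPU * cpus <= memory_mb <= MAX_MB_PER_CPU * cpus


print(memory_within_limits({"cpus": 2, "memory_mb": 4096}))  # True: 2048 MB per core
print(memory_within_limits({"cpus": 4, "memory_mb": 2048}))  # False: 512 MB per core
```

On Python 3.11 or later, the mapping can come from `tomllib.load` applied to `task.toml` (opened in binary mode), reading its `environment` key.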

A few rules of thumb:

* **Large or heavy images** — if your `environment/Dockerfile` pulls in big toolchains (PyTorch, CUDA, full Linux desktops, large datasets), bump `cpus` and `memory_mb` so the build and runtime have headroom, and raise `storage_mb` past the image size plus working-set room. Underprovisioned sandboxes show up as build timeouts or OOMs mid-trial.
* **Lock down `allow_internet`** — set `allow_internet = false` to stop the agent from searching the web for answers. If the verifier needs network access, bake those dependencies into the Dockerfile. Per-host allowlists are coming soon, so you'll be able to block search engines while leaving package mirrors reachable.

## Interactive debugging

When a trial fails and you want to poke around the live environment, attach to the session:

```bash
harbor env attach <session_id>
```

This drops you directly into the running sandbox, where you can inspect state, rerun tests by hand, and confirm whether the failure came from the agent or from the environment.

## Structured logs

Each trial produces structured artifacts, e.g.:

```
gcode-to-text__UFALMLv
├── agent/
├── verifier/
├── result.json
└── trial.log
```

These artifacts let you trace:

* The agent's actions and outputs
* What the verifier checked
* Why the trial passed or failed
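
For a first pass over many trial directories, a small script can report which artifacts exist and load `result.json` for inspection. This is a minimal sketch under stated assumptions: `summarize_trial` is a hypothetical helper, the artifact names come from the example directory above, and `result.json` is loaded as generic JSON because its schema is not described in this guide.

```python
import json
from pathlib import Path

# Entry names taken from the example trial directory above.
EXPECTED_ARTIFACTS = ("agent", "verifier", "result.json", "trial.log")


def summarize_trial(trial_dir: Path) -> dict:
    """Report which artifacts exist and load result.json without assuming its schema."""
    present = [name for name in EXPECTED_ARTIFACTS if (trial_dir / name).exists()]
    missing = [name for name in EXPECTED_ARTIFACTS if name not in present]
    result_path = trial_dir / "result.json"
    # Leave the parsed content untouched; callers decide which keys matter.
    result = json.loads(result_path.read_text()) if result_path.is_file() else None
    return {"trial": trial_dir.name, "present": present, "missing": missing, "result": result}
```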

## What to build next

<CardGroup cols={2}>
  <Card title="Snapshots" icon="camera" href="/sandboxes/snapshots">
    Build an environment once, snapshot it, and restore in seconds for every trial.
  </Card>

  <Card title="Reproducible RL Environments" icon="dumbbell" href="/sandboxes/agentic-rl-reproducible-env">
    Use sandboxes as a deterministic reward oracle for RL training loops.
  </Card>
</CardGroup>
