Automate ML research iteration with an LLM agent that reads your training script, proposes targeted code changes, and validates each change by running it in an isolated sandbox. Inspired by Karpathy’s autoresearch (March 2026), the loop runs unattended — each accepted modification becomes the new baseline, and the agent builds on what it has already learned.

How it works

  1. Calibrate: Run the baseline training script in a sandbox to establish a starting validation loss.
  2. Propose: The agent reads the current best script and experiment history, then proposes N candidate modifications (with increasing temperature for diversity).
  3. Race: All N candidates run in parallel TensorLake sandboxes for a fixed step budget.
  4. Evaluate: Parse val_loss from each sandbox’s stdout. The candidate with the lowest loss wins the round.
  5. Hill-climb: Accept the winner only if it beats the current best. Update the baseline script and history.
  6. Repeat: Loop until the iteration budget is exhausted.

Why sandboxes are required

The agent emits complete, self-contained Python scripts. Running untrusted LLM-generated training code in your host process would be unsafe — the model could write arbitrary filesystem operations or import unexpected packages. Each candidate runs in an isolated sandbox with a fixed memory ceiling and is killed automatically when the step budget completes.
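Isolation also shapes the interface: the host never executes candidate code, it only parses the sandbox's stdout. A minimal sketch of that contract (the `val_loss: X.XXXX` line is the format the guidance requires candidates to print):

```python
import re

def parse_val_loss(stdout):
    """Extract the final val_loss line a candidate script must print."""
    m = re.search(r"val_loss:\s*([0-9.]+)", stdout)
    return float(m.group(1)) if m else None

parse_val_loss("step 300 done\nval_loss: 1.8342")    # → 1.8342
parse_val_loss("Traceback (most recent call last)")  # → None, candidate rejected
```

A candidate that crashes or forgets the output line simply parses to `None` and drops out of the ranking; no exception ever crosses the sandbox boundary.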

Prerequisites

pip install tensorlake openai rich python-dotenv
Create a .env file in your project root:
TENSORLAKE_API_KEY="your-api-key-here"
OPENAI_API_KEY="your-openai-key-here"

Full example

Pass --smoke for a fast proof-of-concept run (3 iterations, 2 candidates, 150 training steps, ~5 minutes). The full run uses 8 iterations, 3 candidates, and 300 steps (~20 minutes).
"""
Karpathy Autoresearch Loop + TensorLake Sandboxes
==================================================
Inspired by github.com/karpathy/autoresearch (March 2026, 21k⭐).

The "Karpathy Loop":
  1. Give an AI agent a training script and a plain-English program.md
  2. Agent proposes one targeted code modification per iteration
  3. Run the modified script in isolation for a fixed step budget
  4. If val_loss improves → accept, update the baseline
  5. Repeat overnight → hundreds of validated improvements

TensorLake sandboxes are the right tool here:
  • Modified training code is untrusted (the agent could emit anything)
  • Multiple candidate modifications can race in parallel sandboxes
  • Each sandbox is killed after the time budget — no runaway experiments
  • The host process never imports/executes model weights or agent code

This example:
  State    : current best train.py + full experiment history
  Action   : LLM proposes one self-contained code modification
  Sandbox  : TensorLake runs the modified script for STEPS training steps
  Reward   : Δval_loss = best_val_loss − new_val_loss  (positive = improvement)
  Update   : Greedy hill-climbing — accept if reward > 0
  Parallelism: CANDIDATES sandboxes race each iteration (ThreadPoolExecutor)

Smoke : --smoke  → 3 iters, 2 candidates, 150 steps/run  (~5 min)
Full  : 8 iters, 3 candidates, 300 steps/run              (~20 min)
"""

from dotenv import load_dotenv
load_dotenv()

import re
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed
from dataclasses import dataclass, field
from typing import List, Optional
from openai import OpenAI
from tensorlake.sandbox import SandboxClient
from rich.console import Console
from rich.panel import Panel
from rich.table import Table
from rich.rule import Rule
from rich import box

console = Console()
SMOKE = "--smoke" in sys.argv


# ─── program.md — plain-English guidance for the agent ───────────────────────
# This is what Karpathy calls "the human's job": describe the search space.

PROGRAM_GUIDANCE = """\
You are an ML research agent optimising a character-level MLP language model
trained on a small text corpus using numpy only (no torch/tensorflow).

The training script defines these tunable constants near the top:
  CTX    (context window / n-gram size, int)
  HIDDEN (hidden layer size, int)
  LR     (learning rate, float)
  BATCH  (batch size, int)
  WDECAY (L2 weight decay, float)
  STEPS  (DO NOT CHANGE — fixed budget)

Good things to try:
  • Learning rate: sweep 1e-4 → 0.1
  • Learning rate decay: multiply LR by 0.999 each step (add near opt.step)
  • Hidden size: 32 / 64 / 128 / 256
  • Context window CTX: 2 / 4 / 8
  • Weight decay WDECAY: 0 → 1e-4
  • Activation: replace np.tanh with np.maximum(0, x) (ReLU) or np.clip(x,0,None)
  • Initialization scale: change 0.01 to 0.1 or use He/Xavier init
  • Add a second hidden layer (W3, b3) with size HIDDEN//2
  • Momentum: track velocity vectors, apply SGD+momentum
  • Batch size: 16 / 32 / 64

Constraints:
  • numpy only — do not import torch, tensorflow, sklearn
  • STEPS must stay unchanged
  • Output format: last printed line must be   val_loss: X.XXXX
"""


# ─── Baseline training script ─────────────────────────────────────────────────
# ~130 K-param character-level MLP on a small public-domain text.
# Runs in ~15 s on CPU (150 steps) / ~30 s (300 steps).

BASELINE_SCRIPT = '''\
import subprocess, sys
subprocess.run(["python3","-m","pip","install","numpy","-q","--target","/tmp/pkgs"],
               capture_output=True, check=False)
sys.path.insert(0, "/tmp/pkgs")
import numpy as np
np.random.seed(42)

# ── Corpus (opening of Alice in Wonderland, public domain) ──────────────────
TEXT = (
    "Alice was beginning to get very tired of sitting by her sister on the bank,"
    " and of having nothing to do: once or twice she had peeped into the book her"
    " sister was reading, but it had no pictures or conversations in it, and what"
    " is the use of a book thought Alice without pictures or conversations so she"
    " was considering in her own mind as well as she could for the hot day made"
    " her feel very sleepy and stupid whether the pleasure of making a daisy-chain"
    " would be worth the trouble of getting up and picking the daisies when"
    " suddenly a White Rabbit with pink eyes ran close by her there was nothing so"
    " very remarkable in that nor did Alice think it so very much out of the way"
    " to hear the Rabbit say to itself oh dear oh dear I shall be late when she"
    " thought it over afterwards it occurred to her that she ought to have wondered"
    " at this but at the time it all seemed quite natural but when the Rabbit"
    " actually took a watch out of its waistcoat-pocket and looked at it and then"
    " hurried on Alice started to her feet for it flashed across her mind that she"
    " had never before seen a rabbit with either a waistcoat-pocket or a watch to"
    " take out of it and burning with curiosity she ran across the field after it"
) * 4   # ~4 800 chars

# ── Tokeniser ────────────────────────────────────────────────────────────────
chars = sorted(set(TEXT))
vocab = len(chars)
stoi  = {c: i for i, c in enumerate(chars)}
data  = [stoi[c] for c in TEXT]
split = int(0.9 * len(data))
train_d, val_d = data[:split], data[split:]

# ── Hyperparameters (agent modifies these) ───────────────────────────────────
CTX    = 4       # context window (n-gram)
HIDDEN = 64      # hidden layer size
LR     = 0.05    # learning rate
BATCH  = 32      # mini-batch size
WDECAY = 0.0     # L2 weight decay
STEPS  = STEPS_PLACEHOLDER   # fixed budget — do not change

# ── Parameters ───────────────────────────────────────────────────────────────
W1 = np.random.randn(vocab * CTX, HIDDEN) * 0.01
b1 = np.zeros(HIDDEN)
W2 = np.random.randn(HIDDEN, vocab) * 0.01
b2 = np.zeros(vocab)

def get_batch(d):
    idx = np.random.randint(0, len(d) - CTX, BATCH)
    X   = np.zeros((BATCH, vocab * CTX))
    for i, start in enumerate(idx):
        for j in range(CTX):
            X[i, j * vocab + d[start + j]] = 1.0
    Y = np.array([d[i + CTX] for i in idx])
    return X, Y

def forward(X):
    H      = np.tanh(X @ W1 + b1)
    logits = H @ W2 + b2
    logits -= logits.max(1, keepdims=True)
    probs   = np.exp(logits)
    probs  /= probs.sum(1, keepdims=True)
    return H, probs

def ce_loss(probs, Y):
    return -np.log(probs[np.arange(len(Y)), Y] + 1e-8).mean()

# ── Training loop ─────────────────────────────────────────────────────────────
for step in range(STEPS):
    X, Y    = get_batch(train_d)
    H, probs = forward(X)

    dl = probs.copy(); dl[np.arange(BATCH), Y] -= 1; dl /= BATCH
    dW2 = H.T @ dl;    db2 = dl.sum(0)
    dH  = dl @ W2.T * (1 - H**2)
    dW1 = X.T @ dH;    db1 = dH.sum(0)

    W1 -= LR * (dW1 + WDECAY * W1)
    b1 -= LR * db1
    W2 -= LR * (dW2 + WDECAY * W2)
    b2 -= LR * db2

# ── Evaluate ──────────────────────────────────────────────────────────────────
losses = []
for _ in range(30):
    Xv, Yv = get_batch(val_d)                 # same batch for forward pass and loss
    losses.append(ce_loss(forward(Xv)[1], Yv))
print(f"val_loss: {np.mean(losses):.4f}")
'''


# ─── Data models ──────────────────────────────────────────────────────────────

@dataclass
class Experiment:
    iteration:   int
    candidate:   int
    description: str
    script:      str
    val_loss:    Optional[float] = None
    delta:       Optional[float] = None   # positive = improvement
    accepted:    bool = False
    error:       Optional[str] = None

@dataclass
class ResearchState:
    best_script:   str
    best_val_loss: float = 999.0
    history:       List[Experiment] = field(default_factory=list)

    def history_summary(self) -> str:
        if not self.history:
            return "No experiments yet."
        lines = []
        for e in self.history[-8:]:   # last 8
            status = "✓ ACCEPTED" if e.accepted else ("✗ error" if e.error else "✗ rejected")
            vl = f"{e.val_loss:.4f}" if e.val_loss is not None else "—"
            d  = f"{e.delta:+.4f}" if e.delta is not None else ""
            lines.append(f"  [{status}] iter={e.iteration} val={vl} {d} {e.description}")
        return "\n".join(lines)


# ─── Agent: propose one code modification ────────────────────────────────────

def propose_modification(state: ResearchState, candidate_idx: int) -> tuple[str, str]:
    """Returns (description, modified_script)."""
    client = OpenAI()
    prompt = f"""{PROGRAM_GUIDANCE}

Current best val_loss: {state.best_val_loss:.4f}

Experiment history:
{state.history_summary()}

Current best script:
```python
{state.best_script}
```

Propose modification #{candidate_idx + 1} (make it different from recent attempts).
Return ONLY a JSON object with two keys:
  "description": one sentence describing the change
  "script": the complete modified Python script

No markdown fences around the JSON. Just the raw JSON object."""

    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9 + candidate_idx * 0.1,   # more exploration for later candidates
        response_format={"type": "json_object"},
    )

    import json
    data = json.loads(resp.choices[0].message.content)
    return data["description"], data["script"]


# ─── Sandbox runner ───────────────────────────────────────────────────────────

def run_experiment(exp: Experiment) -> Experiment:
    """Run exp.script in a TensorLake sandbox, parse val_loss from stdout."""
    console.print(f"  [dim]→ sandbox iter={exp.iteration} cand={exp.candidate}: "
                  f"{exp.description[:60]}[/dim]")
    try:
        sb = SandboxClient()
        with sb.create_and_connect(memory_mb=2048) as box:
            ex = box.run("python3", ["-c", exp.script])
            stdout = (ex.stdout or "").strip()
            stderr = (ex.stderr or "").strip()

        m = re.search(r"val_loss:\s*([0-9.]+)", stdout)
        if not m:
            exp.error = (stderr or stdout)[:120] or "no val_loss in output"
            return exp

        exp.val_loss = float(m.group(1))

    except Exception as exc:
        exp.error = str(exc)[:120]

    return exp


# ─── Main autoresearch loop ───────────────────────────────────────────────────

def autoresearch(iterations: int = 8, candidates: int = 3):
    steps = 150 if SMOKE else 300

    console.print(Panel(
        "[bold cyan]Karpathy Autoresearch Loop + TensorLake Sandboxes[/bold cyan]\n\n"
        "[dim]Inspired by github.com/karpathy/autoresearch (March 2026)\n\n"
        "Loop:\n"
        "  1. Agent reads current best script + experiment history\n"
        "  2. Proposes CANDIDATES modifications (different temperatures)\n"
        "  3. All CANDIDATES race in parallel TensorLake sandboxes\n"
        "  4. Best val_loss wins; accepted if it beats the current best\n"
        "  5. Accepted script becomes the new baseline\n\n"
        "Reward  = Δval_loss (positive = improvement)\n"
        "Policy  = GPT-4o prompted with program.md + experiment history\n"
        "Update  = greedy hill-climbing (accept if reward > 0)\n"
        f"Mode    = {'SMOKE (3 iters, 2 candidates, 150 steps)' if SMOKE else f'Full ({iterations} iters, {candidates} candidates, {steps} steps)'}[/dim]",
        border_style="cyan",
    ))

    # ── Calibrate baseline ───────────────────────────────────────────────────
    baseline = BASELINE_SCRIPT.replace("STEPS_PLACEHOLDER", str(steps))
    console.print(Rule("[yellow]Calibrating baseline[/yellow]", style="yellow"))
    console.print("[dim]Running baseline script in sandbox to establish starting val_loss...[/dim]")

    calib = Experiment(0, 0, "baseline", baseline)
    calib = run_experiment(calib)

    if calib.error or calib.val_loss is None:
        console.print(f"[red]Baseline failed: {calib.error}[/red]")
        return

    state = ResearchState(best_script=baseline, best_val_loss=calib.val_loss)
    console.print(f"  Baseline val_loss: [bold yellow]{state.best_val_loss:.4f}[/bold yellow]\n")

    # ── Research iterations ──────────────────────────────────────────────────
    for it in range(1, iterations + 1):
        console.print(Rule(f"[cyan]Iteration {it}/{iterations}[/cyan]", style="cyan"))
        console.print(f"  [dim]Current best: {state.best_val_loss:.4f}  "
                      f"Proposing {candidates} candidates in parallel...[/dim]")

        # Propose candidates (sequentially — they call the LLM)
        proposals = []
        for c in range(candidates):
            desc, script = propose_modification(state, c)
            proposals.append(Experiment(it, c, desc, script))

        # Race all candidates in parallel sandboxes
        with ThreadPoolExecutor(max_workers=candidates) as pool:
            futures = {pool.submit(run_experiment, exp): exp for exp in proposals}
            results = []
            for fut in as_completed(futures):
                results.append(fut.result())

        # Score & rank
        valid = [r for r in results if r.val_loss is not None]
        for r in valid:
            r.delta = state.best_val_loss - r.val_loss   # positive = improvement

        valid.sort(key=lambda r: r.val_loss)

        # Print iteration table
        t = Table(box=box.SIMPLE, show_header=True, header_style="bold white")
        t.add_column("C",          width=3)
        t.add_column("Modification",  width=52)
        t.add_column("val_loss",   width=9,  justify="right")
        t.add_column("Δ",          width=9,  justify="right")
        t.add_column("",           width=3)

        for r in sorted(results, key=lambda r: (r.val_loss or 999)):
            if r.error:
                t.add_row(str(r.candidate), r.description[:50], "—", "—",
                          f"[red]✗[/red]")
                continue
            delta_str = (f"[green]{r.delta:+.4f}[/green]" if r.delta and r.delta > 0
                         else f"[red]{r.delta:+.4f}[/red]" if r.delta
                         else "—")
            t.add_row(str(r.candidate), r.description[:50],
                      f"{r.val_loss:.4f}", delta_str, "")

        console.print(t)

        # Accept best if improved
        if valid and valid[0].delta is not None and valid[0].delta > 0:
            winner = valid[0]
            winner.accepted = True
            state.best_val_loss = winner.val_loss
            state.best_script   = winner.script
            console.print(
                f"  [bold green]✓ Accepted: {winner.description}\n"
                f"    val_loss {calib.val_loss:.4f} → {state.best_val_loss:.4f}  "
                f"(Δ{winner.delta:+.4f})[/bold green]"
            )
        else:
            console.print("  [dim]No improvement this iteration — baseline unchanged.[/dim]")

        state.history.extend(results)

    # ── Final summary ────────────────────────────────────────────────────────
    accepted = [e for e in state.history if e.accepted]
    total_improvement = calib.val_loss - state.best_val_loss

    pct   = total_improvement / calib.val_loss * 100
    color = "bold green" if pct > 5 else "yellow" if pct > 0 else "red"

    console.print(Panel(
        f"[bold green]Autoresearch complete[/bold green]\n\n"
        f"Baseline val_loss  : [yellow]{calib.val_loss:.4f}[/yellow]\n"
        f"Final val_loss     : [bold green]{state.best_val_loss:.4f}[/bold green]\n"
        f"Total improvement  : [{color}]{total_improvement:+.4f}  ({pct:+.1f}%)[/{color}]\n"
        f"Accepted changes   : {len(accepted)} / {len(state.history)}\n\n"
        + ("\n".join(f"  ✓ iter {e.iteration}: {e.description}" for e in accepted)
           if accepted else "  (none)"),
        title="[bold]Research Summary[/bold]",
        border_style="green",
    ))

    # ── Result interpretation ─────────────────────────────────────────────────
    accept_rate = len(accepted) / len(state.history) * 100 if state.history else 0
    console.print(Panel(
        f"[bold]What these numbers mean[/bold]\n\n"
        f"val_loss is cross-entropy on held-out characters (nats).\n"
        f"Lower = the model assigns higher probability to the correct next character.\n\n"
        f"  Baseline {calib.val_loss:.4f} → Final {state.best_val_loss:.4f}  "
        f"([{color}]{pct:+.1f}%[/{color}])\n\n"
        f"Context:\n"
        f"  • A random character predictor on this ~50-char vocabulary scores ln(50) ≈ 3.91\n"
        f"  • The baseline MLP ({calib.val_loss:.2f}) already beats random — it learned\n"
        f"    that 'e', space, and 't' are far more likely than 'Z'\n"
        f"  • Each accepted change is a genuine algorithmic improvement:\n"
        f"    the agent modified real training code and the sandbox verified it\n"
        f"    on held-out data — not on the training set\n\n"
        f"Acceptance rate: {len(accepted)}/{len(state.history)} ({accept_rate:.0f}%)\n"
        f"  • Typical for greedy hill-climbing on a small model: 25–40% is normal\n"
        f"  • Rejected experiments are still informative — they update the agent's\n"
        f"    memory so it avoids the same dead ends next iteration\n\n"
        f"Smoke vs full run:\n"
        f"  • Smoke (3 iters, 2 candidates, 150 steps) is a proof-of-concept\n"
        f"  • Full run (8 iters, 3 candidates, 300 steps) gives the agent\n"
        f"    enough budget to explore LR schedules, second layers, momentum,\n"
        f"    and architecture changes — improvements compound across iterations\n"
        f"  • Karpathy's original loop ran ~700 experiments overnight and found\n"
        f"    11% speed improvement; the same pattern scales here",
        title="[bold cyan]Score interpretation[/bold cyan]",
        border_style="cyan",
    ))

    if accepted:
        console.print(Rule("[green]Final best script[/green]", style="green"))
        console.print(Panel(state.best_script[:1200] + ("..." if len(state.best_script) > 1200 else ""),
                            border_style="green"))


if __name__ == "__main__":
    autoresearch(
        iterations=3 if SMOKE else 8,
        candidates=2 if SMOKE else 3,
    )

What happens step-by-step

  1. Calibration: Baseline script runs in a sandbox. The resulting val_loss becomes the threshold every candidate must beat.
  2. Proposal: GPT-4o receives the current best script, program.md guidance, and the last 8 experiments. It returns N JSON objects, each with a description and a complete modified script.
  3. Parallel race: All N candidates are submitted to a ThreadPoolExecutor. Each thread creates a TensorLake sandbox, runs the script for the fixed step budget, and parses val_loss from stdout.
  4. Selection: Candidates are ranked by val_loss. The winner is accepted only if its loss is strictly lower than the current best (greedy hill-climbing).
  5. State update: The accepted script replaces the baseline. All results — accepted and rejected — are appended to the history so the agent avoids revisiting dead ends.
  6. Next iteration: The agent is prompted again with the updated script and history. Improvements compound across iterations.
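The parallel race in step 3 can be sketched independently of TensorLake. Here `run_candidate` is a stand-in for the sandbox call; with real sandboxes, each worker thread simply blocks on its own isolated run:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def race(candidates, run_candidate):
    """Run every candidate concurrently; return (loss, candidate) pairs sorted best-first."""
    results = []
    with ThreadPoolExecutor(max_workers=len(candidates)) as pool:
        futures = {pool.submit(run_candidate, c): c for c in candidates}
        for fut in as_completed(futures):
            loss = fut.result()
            if loss is not None:             # failed runs drop out of the ranking
                results.append((loss, futures[fut]))
    return sorted(results)

# Stubbed runner: a dict of fake losses; "C" simulates a crashed sandbox.
losses = {"A": 1.91, "B": 1.74, "C": None}
ranked = race(list(losses), losses.get)      # → [(1.74, "B"), (1.91, "A")]
```

Because each run is a blocking network call rather than local CPU work, threads (not processes) are the right concurrency primitive here.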

Key design decisions

Increasing temperature across candidates

Each candidate is proposed with a slightly higher temperature (0.9 + candidate_idx * 0.1). The first candidate is a focused, conservative change. Later candidates are more exploratory. This covers both the safe and speculative ends of the search space in a single iteration.
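As a quick sanity check, the schedule used in `propose_modification` expands like this (rounded here only to sidestep float noise in the printout):

```python
# Temperature per candidate: candidate 0 is conservative, later ones explore more.
CANDIDATES = 3
temps = [round(0.9 + i * 0.1, 2) for i in range(CANDIDATES)]
print(temps)   # [0.9, 1.0, 1.1]
```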

Experiment history as agent memory

The agent receives a rolling window of the last 8 experiments (accepted and rejected), each annotated with val_loss and Δ. This prevents the agent from re-proposing changes that already failed and nudges it toward unexplored directions without any external memory store.
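The memory is nothing more than a formatted tail of the experiment list. A stripped-down version of the summariser, assuming each experiment is a `(description, val_loss, accepted)` tuple rather than the full dataclass:

```python
def history_summary(history, window=8):
    """Render the last `window` experiments as agent-readable lines."""
    if not history:
        return "No experiments yet."
    lines = []
    for desc, val_loss, accepted in history[-window:]:
        status = "ACCEPTED" if accepted else "rejected"
        lines.append(f"[{status}] val={val_loss:.4f}  {desc}")
    return "\n".join(lines)
```

Keeping the window small matters: the full script is already in the prompt, so every extra history line competes with it for the model's attention.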

Fixed STEPS budget enforced in program.md

The STEPS constant is explicitly marked as off-limits in the guidance. Without this constraint, the agent could trivially reduce val_loss by running more training steps — a form of reward hacking that would make comparisons between candidates meaningless.
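Prompt guidance alone is a soft constraint. A cheap hard check — a hypothetical guard, not part of the example above — is to compare the `STEPS` assignment between baseline and candidate before spending any sandbox time:

```python
import re

def steps_unchanged(baseline, candidate):
    """Reject candidates that touch the fixed STEPS budget (reward hacking)."""
    find = lambda src: re.search(r"^STEPS\s*=\s*(\d+)", src, re.M)
    b, c = find(baseline), find(candidate)
    return b is not None and c is not None and b.group(1) == c.group(1)

steps_unchanged("STEPS  = 300", "STEPS  = 300")   # → True
steps_unchanged("STEPS  = 300", "STEPS = 5000")   # → False, budget inflated
```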

Greedy hill-climbing over rollout

The loop uses simple greedy acceptance (accept if Δval_loss > 0) rather than beam search or rollout. For an overnight research loop where each experiment costs real CPU time, greedy hill-climbing maximises the number of validated improvements within the time budget.
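A toy version makes the acceptance rule concrete. This is an illustrative sketch, not the loop above: `propose` stands in for the whole propose-and-race stage and just returns fake losses:

```python
def hill_climb(baseline_loss, rounds, propose):
    """Greedy hill-climbing: each round keeps the best candidate only on strict improvement."""
    best, accepted = baseline_loss, 0
    for it in range(rounds):
        winner = min(propose(best, it))      # best candidate loss this round
        if winner < best:                    # accept only if it beats the baseline
            best, accepted = winner, accepted + 1
    return best, accepted

# Fake proposer: one mild improvement and one regression per round.
best, n = hill_climb(2.0, 3, lambda best, it: [best - 0.01, best + 0.05])
# best ≈ 1.97, n == 3: every round's improvement compounds into the next baseline
```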

What to build next

RL Training with GSPO

Train a model directly with RL using sandboxes as the reward oracle — a complementary approach to autoresearch.

Agentic Swarm Intelligence

Dispatch parallel sandboxes across a swarm of specialized worker agents.