Durable Execution is in Technical Preview. This feature is under active development. Please contact us on Slack if you’d like to ask a question or try it out.
Agents call LLMs, scrape websites, query databases, and invoke external APIs. Any of these can fail — rate limits, timeouts, transient network errors, OOM kills. Without durability, a failure means restarting the entire agent from scratch, repeating every LLM call and API request. Tensorlake checkpoints every @function() call automatically. When a request fails, you replay it and only the failed step re-executes. Everything before it is served from the checkpoint.
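As a minimal sketch of the shape this takes (currency_api and generate_invoice are illustrative stand-ins, not part of the SDK):
from tensorlake.applications import application, function

@function()
def fetch_exchange_rate(currency: str) -> float:
    # Succeeds on the first run and is checkpointed; a replay returns
    # this recorded value instead of calling the API again.
    return currency_api.get_rate(currency)  # stand-in for an external API client

@application()
@function()
def invoice_agent(currency: str) -> str:
    rate = fetch_exchange_rate(currency)  # served from the checkpoint on replay
    return generate_invoice(rate)         # stand-in; only failed or unreached steps re-execute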

Why LLM Calls Must Be Durable

LLM calls are unlike normal API calls. They are non-deterministic — the same prompt can produce a different response on every invocation. This makes re-execution dangerous, not just wasteful. Consider a travel agent that plans a trip. On the first run, the LLM decides on flights to Whistler. The agent books the flights, then crashes while searching for hotels. Without durable execution, the agent restarts from scratch. This time the LLM decides on Japan instead. Now the user has unwanted Whistler flights and a completely different trip plan. Making LLM calls durable solves three problems at once:
  • Consistency — Prior LLM decisions are preserved on replay. The agent resumes searching for Whistler hotels, not re-planning the entire trip.
  • Cost — LLM inference is expensive. Re-executing 14 successful tool-calling iterations because the 15th failed wastes tokens and money.
  • Rate limits — Agentic applications multiply downstream calls by an order of magnitude. Re-executing all of them increases the chance of hitting rate limits again.
On Tensorlake, every @function() call is automatically checkpointed. When a request is replayed, previously successful LLM calls return their recorded outputs — the model is not called again.
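To make the travel example concrete, here is a hedged sketch of how the destination decision gets pinned by a checkpoint (the helper names are illustrative, not part of the SDK):
from tensorlake.applications import application, function

@function()
def choose_destination(preferences: str) -> str:
    # Non-deterministic LLM decision. Because it is checkpointed, a replay
    # returns "Whistler" again instead of re-rolling to "Japan".
    return llm_pick_destination(preferences)  # stand-in for an LLM call

@application()
@function()
def travel_agent(preferences: str) -> dict:
    destination = choose_destination(preferences)  # recorded on the first run
    flights = book_flights(destination)            # stand-in; checkpointed, so booked once
    hotels = search_hotels(destination)            # a crash here replays with the same destination
    return {"flights": flights, "hotels": hotels}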

Durable Tool Calls

The most common agent pattern is a loop: the LLM decides which tool to call, the tool runs, the result feeds back into the LLM. Each iteration is an expensive operation — an LLM inference plus a tool execution. Wrap each tool in its own @function() to make every tool call a checkpoint:
import json

from tensorlake.applications import application, function, Image

llm_image = Image().run("pip install openai")

@function()
def search_web(query: str) -> list[dict]:
    import requests
    response = requests.get("https://api.search.com/v1/search", params={"q": query})
    return response.json()["results"]

@function()
def read_document(url: str) -> str:
    import requests
    return requests.get(url).text

@function(image=llm_image)
def call_llm(messages: list[dict]) -> dict:
    from openai import OpenAI
    response = OpenAI().chat.completions.create(
        model="gpt-4o",
        messages=messages,
        tools=[
            {"type": "function", "function": {"name": "search_web", "parameters": {"type": "object", "properties": {"query": {"type": "string"}}}}},
            {"type": "function", "function": {"name": "read_document", "parameters": {"type": "object", "properties": {"url": {"type": "string"}}}}},
        ]
    )
    # Convert the SDK message object to a plain dict so it can be
    # checkpointed and appended to the conversation history.
    return response.choices[0].message.model_dump(exclude_none=True)

@application()
@function(image=llm_image, timeout=1800)
def research_agent(topic: str) -> str:
    tools = {"search_web": search_web, "read_document": read_document}
    messages = [{"role": "user", "content": f"Research this topic: {topic}"}]

    for _ in range(20):  # max iterations
        response = call_llm(messages)       # checkpointed
        messages.append(response)

        if not response.get("tool_calls"):
            return response["content"]

        for tool_call in response["tool_calls"]:
            fn = tools[tool_call["function"]["name"]]
            args = json.loads(tool_call["function"]["arguments"])  # arguments arrive as a JSON string
            result = fn(**args)  # checkpointed
            messages.append({"role": "tool", "content": str(result), "tool_call_id": tool_call["id"]})

    return messages[-1].get("content", "Max iterations reached")
If the agent crashes on iteration 15, a replay skips the first 14 iterations entirely. The LLM calls, web searches, and document reads from those iterations are all served from checkpoints. The agent resumes from iteration 15 with the full conversation history intact.

Surviving Partial Failures in Fan-Out

When you process a batch of items in parallel using map, each item is an independent function call with its own checkpoint. If 3 out of 1,000 items fail, replay only re-processes those 3.
from tensorlake.applications import application, function

@function(timeout=120)
def process_document(doc_url: str) -> dict:
    """Parse a single document. Each call is independently checkpointed."""
    content = fetch_and_parse(doc_url)   # placeholder helper; a sketch follows below
    extracted = extract_fields(content)  # placeholder helper; a sketch follows below
    return extracted

@function()
def aggregate_results(result: dict, acc: dict) -> dict:
    """Combine results as they arrive; each call folds one mapped output into the accumulator."""
    acc["documents"].append(result)
    return acc

@application()
@function()
def batch_processor(doc_urls: list[str]) -> dict:
    results = process_document.map(doc_urls)
    summary = results.reduce(aggregate_results, {"documents": []})
    return summary
This is the pattern behind durable data ingestion pipelines. Whether you’re processing SEC filings, insurance forms, or research papers, partial failures don’t lose the work already completed.
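The example above leaves fetch_and_parse and extract_fields undefined. One illustrative sketch, assuming HTTP-accessible documents (the extraction logic and field names here are placeholders you would replace with your own parser):
import requests

def fetch_and_parse(doc_url: str) -> str:
    # Download the raw document; a real pipeline would dispatch on content type.
    response = requests.get(doc_url, timeout=30)
    response.raise_for_status()
    return response.text

def extract_fields(content: str) -> dict:
    # Trivial stand-in extraction; swap in your parser or an LLM extraction step.
    return {"length": len(content), "preview": content[:200]}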

Idempotent Side Effects

When a function sends an email, charges a credit card, or writes to an external database, you don’t want that action repeated on replay. Wrap the side effect in its own @function() — since the function’s output is checkpointed, replay skips it entirely.
@function()
def send_notification(user_id: str, message: str) -> str:
    """Send once, skip on replay."""
    response = email_api.send(user_id, message)  # email_api is a stand-in for your provider's client
    return response.message_id

@application()
@function()
def onboarding_agent(user_id: str) -> str:
    profile = build_profile(user_id)  # stand-in for your own profile-building step
    send_notification(user_id, f"Welcome, {profile['name']}!")  # sent once
    return setup_account(profile)  # stand-in for account setup
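Checkpointing guarantees the call is skipped when the request is replayed; it does not deduplicate retries within a single attempt of the function itself. If your provider supports idempotency keys, passing one makes the send safe in that case too. A sketch, where the idempotency_key parameter is a hypothetical provider feature, not part of the Tensorlake SDK:
@function()
def send_notification(user_id: str, message: str) -> str:
    """Send once; the provider dedupes retries, the checkpoint skips replays."""
    # Derive a stable key from the request inputs so any retry reuses it.
    key = f"onboarding-{user_id}"
    response = email_api.send(user_id, message, idempotency_key=key)  # hypothetical parameter
    return response.message_id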

Functions That Must Always Run Fresh

Some function calls should never be replayed from a checkpoint — they need live data every time. Mark them with durable=False:
@function(durable=False)
def get_current_price(ticker: str) -> float:
    """Always fetches the latest price, even on replay."""
    return stock_api.get_price(ticker)  # stock_api is a stand-in for your market-data client

@function()
def get_historical_data(ticker: str) -> list[dict]:
    """Historical data doesn't change, so it's safe to checkpoint."""
    return stock_api.get_history(ticker, days=30)
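As a usage sketch combining the two (the application name and the "close" field are illustrative assumptions):
@application()
@function()
def price_alert(ticker: str) -> str:
    history = get_historical_data(ticker)  # checkpointed; served from the recording on replay
    current = get_current_price(ticker)    # durable=False; re-fetched live on every replay
    # Assumes each history entry carries a "close" field (illustrative).
    average = sum(day["close"] for day in history) / len(history)
    return f"{ticker}: {current:.2f} vs 30-day avg {average:.2f}"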
See Disabling durable execution for the full implications.

When to Use Durable Execution

| Scenario | Benefit |
| --- | --- |
| Agent loops with 10+ tool calls | Crash on call #N resumes from #N, not #1 |
| Batch processing 100s-1,000s of documents | Partial failures only re-process failed items |
| Pipelines with expensive LLM calls | No repeated inference costs on retry |
| Multi-step workflows with external side effects | Emails, payments, and API calls aren’t duplicated |
For the technical details of how checkpointing, fingerprinting, and replay modes work, see Durable Execution.