Agents scale out automatically as they receive requests. When someone calls your agent’s endpoint, Tensorlake spins up containers to handle the work, then scales back down when idle. You don’t need to provision servers or configure autoscaling — it just works.

How It Works

By default, agents scale from zero. The first request after an idle period experiences a “cold start” while the container loads your code and dependencies. Subsequent requests are served by warm containers until the agent goes idle again. When multiple requests arrive simultaneously, Tensorlake automatically creates more containers to handle the load in parallel. There’s no upper limit by default — your agent scales to meet demand.
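With no parameters set, you get exactly this behavior. A minimal sketch, using the same @function decorator as the examples below:

@function()
def agent(prompt: str) -> str:
    # Defaults: scales from zero (cold start after idle),
    # no upper limit, one request per container at a time
    ...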

Configuration Options

You can tune scaling behavior with a few optional parameters on @function():

Keep Containers Warm

If cold starts are problematic for your use case, keep some containers pre-warmed:
@function(warm_containers=2)
def agent(prompt: str) -> str:
    # 2 containers always ready, no cold start for first requests
    ...
This is useful for latency-sensitive agents where instant response matters.

Limit Maximum Concurrency

To control costs or respect API rate limits, cap the maximum number of concurrent executions:
@function(max_containers=10)
def agent(prompt: str) -> str:
    # Maximum 10 containers running at once
    # Additional requests are queued automatically
    ...
When the limit is reached, incoming requests queue up and process in FIFO order as containers become available.
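Conceptually, max_containers acts like a semaphore in front of your agent. The sketch below is not Tensorlake’s implementation, just an illustration of the queueing semantics, with a plain asyncio.Semaphore standing in for 10 containers:

import asyncio

CAPACITY = 10  # analogous to max_containers=10
slots = asyncio.Semaphore(CAPACITY)

async def handle(request_id: int) -> str:
    # Requests beyond capacity wait here; asyncio wakes
    # waiters in FIFO order as slots free up
    async with slots:
        await asyncio.sleep(0.1)  # stand-in for the agent's work
        return f"handled {request_id}"

async def main() -> None:
    # 25 simultaneous requests against 10 slots: the first 10 run
    # immediately, the remaining 15 queue and drain in arrival order
    results = await asyncio.gather(*(handle(i) for i in range(25)))
    print(len(results), "requests completed")

asyncio.run(main())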

Guarantee Minimum Capacity

Ensure a baseline level of capacity is always available:
@function(min_containers=3)
def agent(prompt: str) -> str:
    # At least 3 containers always running
    ...
This prevents scaling to zero and eliminates cold starts entirely.

Control Request Concurrency

By default, each container handles one request at a time. If your agent is I/O-bound (waiting on API calls, database queries), you can increase concurrency:
@function(concurrency=5)
def agent(prompt: str) -> str:
    # Each container handles up to 5 requests concurrently
    ...
Total concurrent requests = max_containers × concurrency
For example, max_containers=10 and concurrency=5 allows up to 50 concurrent requests.
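For instance, an agent that spends most of its wall-clock time waiting on an external model API gains real throughput from higher concurrency, since in-flight requests overlap in the container rather than competing for CPU. A hedged sketch, where the blocking HTTP call stands in for any I/O-bound work and the endpoint URL is hypothetical:

import urllib.request

@function(concurrency=5)
def agent(prompt: str) -> str:
    # Nearly all time here is spent blocked on the network, so up to
    # 5 in-flight requests can share one container without contention
    req = urllib.request.Request(
        "https://llm.example.com/v1/complete",  # hypothetical endpoint
        data=prompt.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")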

Combining Options

You can combine parameters for fine-grained control:
@function(
    warm_containers=2,   # 2 containers always ready
    max_containers=20,   # Scale up to 20 containers
    concurrency=2        # Each handles 2 requests
)
def agent(prompt: str) -> str:
    # Total capacity: 40 concurrent requests (20 × 2)
    # 2 containers pre-warmed for instant response
    ...

When to Configure Scaling

Most agents work fine with the defaults. Consider configuring scaling when:
  • Latency is critical — Use warm_containers to eliminate cold starts
  • You have cost constraints — Use max_containers to cap spending
  • External APIs have rate limits — Use max_containers and concurrency to stay within limits (see the sizing sketch after this list)
  • You need guaranteed capacity — Use min_containers to ensure availability
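
As a worked example of the rate-limit case: suppose your model provider allows at most 40 requests in flight (an illustrative number). Using the formula above, size max_containers × concurrency to that budget:

@function(
    max_containers=8,   # 8 containers × 5 requests each = 40 in flight,
    concurrency=5,      # matching the provider's limit; anything beyond
)                       # that queues in FIFO order until a slot frees up
def agent(prompt: str) -> str:
    ...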
