Use min_containers, max_containers, and warm_containers to trade off cost against latency.
How It Works
Every @function() runs in its own container. When a request arrives and no container is available, Tensorlake creates one — this incurs cold start latency while the container image is pulled and dependencies are loaded.
For latency-sensitive workloads, you can keep containers warm so they’re ready before requests arrive.
On-demand (default)
Without any scaling parameters, Tensorlake creates containers as requests come in and removes them when they're idle. This is cost-efficient but has cold start latency on the first request.
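As a point of reference, here is a minimal sketch of a function with no scaling parameters; the decorator and its behavior are as described above, but the import path is an assumption, so check the Tensorlake SDK reference.

```python
# Minimal sketch: no scaling parameters, so containers are created on demand.
# NOTE: the import path below is an assumption; check the Tensorlake SDK docs.
from tensorlake import function


@function()
def summarize(text: str) -> str:
    # The first request after idle pays the cold start while the container
    # image is pulled and dependencies are loaded.
    return text[:200]
```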
Warm containers
Set warm_containers to keep a pool of pre-started containers ready for immediate allocation. As containers are consumed by requests, Tensorlake reconciles the pool by creating replacements to maintain the warm count.
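A sketch of a warm pool, assuming the parameter is passed directly to the decorator (import path assumed as above):

```python
from tensorlake import function  # import path assumed


@function(warm_containers=4)
def extract_entities(document: str) -> list[str]:
    # Four pre-started containers are kept ready, so requests skip the cold
    # start; Tensorlake starts replacements to keep 4 warm as they're consumed.
    return document.split()
```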
Min and max containers
Use min_containers and max_containers to set bounds on scaling:
| Parameter | Effect |
|---|---|
| min_containers | Minimum containers always running, regardless of demand |
| max_containers | Upper limit on total containers — caps autoscaling |
| warm_containers | Additional unallocated containers kept ready beyond what's needed |
| max_concurrency | Number of concurrent function invocations per container (default: 1) |
For example, with min_containers=2, max_containers=20, and warm_containers=4, Tensorlake creates 6 containers at baseline (2 min + 4 warm). As requests consume warm containers, replacements are created to maintain the warm count — up to the max of 20.
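That configuration as a sketch (parameter names are the ones documented here; the import path is an assumption):

```python
from tensorlake import function  # import path assumed


@function(min_containers=2, max_containers=20, warm_containers=4)
def render_report(payload: dict) -> str:
    # Baseline: 6 containers (2 min + 4 warm). Under load, autoscaling adds
    # containers up to the cap of 20, creating warm replacements as the pool
    # is consumed.
    return str(payload)
```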
Concurrency per container
By default, each container handles one function invocation at a time (max_concurrency=1). For I/O-bound functions that spend most of their time waiting on network responses — LLM calls, API requests, database queries — you can increase max_concurrency to pack more work into fewer containers.
| max_concurrency value | Best for |
|---|---|
| 1 (default) | CPU-bound work, GPU inference, code execution |
| 5-10 | I/O-bound functions: LLM API calls, web requests, database queries |
| 10+ | Lightweight I/O like webhook delivery or notification dispatch |
Higher max_concurrency means fewer containers are needed for the same throughput, which reduces cost and cold starts. But CPU-bound functions won't benefit — they'll contend for the same core.
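For example, an I/O-bound function that mostly waits on a network response can pack several invocations into one container. This is a sketch; the import path is an assumption.

```python
import urllib.request

from tensorlake import function  # import path assumed


@function(max_concurrency=10)
def fetch_status(url: str) -> int:
    # Most of the wall-clock time here is spent waiting on the remote server,
    # not on the CPU, so 10 invocations can share this container.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status
```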
Concurrency interacts with the other scaling parameters:
- Total capacity = containers × max_concurrency. With max_containers=10 and max_concurrency=5, you can handle 50 concurrent invocations.
- Warm capacity = warm_containers × max_concurrency. With warm_containers=2 and max_concurrency=5, 10 requests can be served instantly.
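Putting the two formulas together, a sketch using the same numbers (import path assumed):

```python
from tensorlake import function  # import path assumed


@function(max_containers=10, warm_containers=2, max_concurrency=5)
def classify(item: str) -> str:
    # Total capacity: 10 containers x 5 = 50 concurrent invocations.
    # Warm capacity: 2 containers x 5 = 10 invocations with no cold start.
    return "ok" if item else "empty"
```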
Scaling behavior
- No warm containers specified — Tensorlake autoscales purely on demand. Containers are created when requests arrive and removed when idle.
- Warm containers specified — Tensorlake autoscales to meet current demand plus the warm count. As load increases, more containers are created. As load decreases, excess containers are removed, but the warm pool is maintained.
Choosing Values
| Workload | Recommendation |
|---|---|
| Batch processing, async pipelines | No scaling params needed — on-demand is cost-efficient |
| User-facing APIs with latency requirements | Set warm_containers to expected concurrent users |
| Predictable steady-state load | Set min_containers to baseline, warm_containers for burst headroom |
| Cost-sensitive with hard upper limits | Set max_containers to cap spend |
| Real-time interactive agents | Set warm_containers=1 or more for instant response on first request |
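For instance, a user-facing API from the table above might combine warm capacity sized for expected concurrent users with a spend cap. A sketch under the same assumptions as the earlier examples:

```python
from tensorlake import function  # import path assumed


@function(warm_containers=8, max_containers=30)
def answer_query(question: str) -> str:
    # Up to 8 requests land on warm containers with no cold start; autoscaling
    # never exceeds 30 containers, which caps spend.
    return f"received: {question}"
```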