Tensorlake automatically scales containers to match your workload. By default, containers are created on-demand when requests arrive and removed when idle. You can tune this behavior with min_containers, max_containers, warm_containers, and max_concurrency to trade off cost against latency.

How It Works

Every @function() runs in its own container. When a request arrives and no container is available, Tensorlake creates one — this incurs cold start latency while the container image is pulled and dependencies are loaded. For latency-sensitive workloads, you can keep containers warm so they’re ready before requests arrive.

On-demand (default)

Without any scaling parameters, Tensorlake creates containers as requests come in and removes them when they’re idle. This is cost-efficient but has cold start latency on the first request.
from tensorlake.applications import function

@function()
def process(data: str) -> str:
    """Containers created on-demand, scaled to zero when idle."""
    ...

Warm containers

Set warm_containers to keep a pool of pre-started containers ready for immediate allocation. As containers are consumed by requests, Tensorlake reconciles the pool by creating replacements to maintain the warm count.
@function(warm_containers=4)
def process(data: str) -> str:
    """4 containers always ready. New ones created as these are consumed."""
    ...
When a request arrives, it’s assigned to a warm container instantly — no cold start. Tensorlake then spins up a replacement to keep the pool at 4.

Min and max containers

Use min_containers and max_containers to set bounds on scaling:
@function(min_containers=2, max_containers=20, warm_containers=4)
def process(data: str) -> str:
    """2 always running + 4 warm = 6 containers ready at baseline.
    Scales up to 20 under load."""
    ...
Parameter        | Effect
min_containers   | Minimum containers always running, regardless of demand
max_containers   | Upper limit on total containers; caps autoscaling
warm_containers  | Additional unallocated containers kept ready beyond what's needed
max_concurrency  | Number of concurrent function invocations per container (default: 1)
With min_containers=2, max_containers=20, and warm_containers=4, Tensorlake creates 6 containers at baseline (2 min + 4 warm). As requests consume warm containers, replacements are created to maintain the warm count — up to the max of 20.

Concurrency per container

By default, each container handles one function invocation at a time (max_concurrency=1). For I/O-bound functions that spend most of their time waiting on network responses — LLM calls, API requests, database queries — you can increase max_concurrency to pack more work into fewer containers.
@function(max_concurrency=5, warm_containers=2)
def call_llm(messages: list[dict]) -> dict:
    """Each container handles 5 concurrent LLM calls.
    2 warm containers = 10 requests served without cold start."""
    from openai import OpenAI
    response = OpenAI().chat.completions.create(model="gpt-4o", messages=messages)
    # model_dump() converts the message object to a plain dict, matching the declared return type.
    return response.choices[0].message.model_dump()
max_concurrency value | Best for
1 (default)           | CPU-bound work, GPU inference, code execution
5-10                  | I/O-bound functions: LLM API calls, web requests, database queries
10+                   | Lightweight I/O like webhook delivery or notification dispatch
Higher max_concurrency means fewer containers are needed for the same throughput, which reduces cost and cold starts. CPU-bound functions won't benefit, though: concurrent invocations contend for the same core. Concurrency interacts with the other scaling parameters (see the sketch after this list):
  • Total capacity = containers × max_concurrency. With max_containers=10 and max_concurrency=5, you can handle 50 concurrent invocations.
  • Warm capacity = warm_containers × max_concurrency. With warm_containers=2 and max_concurrency=5, 10 requests can be served instantly.
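For a quick sanity check on that arithmetic, here's a minimal sketch in plain Python. The capacity helper is hypothetical (it is not part of the Tensorlake API); it just reproduces the two bullet-point calculations:
def capacity(containers: int, max_concurrency: int) -> int:
    """Concurrent invocations a given number of containers can absorb."""
    return containers * max_concurrency

# Total capacity: max_containers=10 and max_concurrency=5 -> 50 concurrent invocations.
assert capacity(containers=10, max_concurrency=5) == 50

# Warm capacity: warm_containers=2 and max_concurrency=5 -> 10 requests with no cold start.
assert capacity(containers=2, max_concurrency=5) == 10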

Scaling behavior

  • No warm containers specified — Tensorlake autoscales purely on demand. Containers are created when requests arrive and removed when idle.
  • Warm containers specified — Tensorlake autoscales to meet current demand plus the warm count. As load increases, more containers are created. As load decreases, excess containers are removed, but the warm pool is maintained (a rough model of this policy is sketched below).
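The sketch below is a rough mental model of the behavior described above, not Tensorlake's actual scheduler; desired_containers is a hypothetical helper used only for illustration:
from typing import Optional

def desired_containers(demand: int, min_containers: int = 0,
                       max_containers: Optional[int] = None,
                       warm_containers: int = 0) -> int:
    """Enough containers for current demand (never below the minimum),
    plus the warm pool, capped at the maximum."""
    target = max(demand, min_containers) + warm_containers
    return target if max_containers is None else min(target, max_containers)

# Idle, with the earlier example (min=2, max=20, warm=4): 2 min + 4 warm = 6 at baseline.
assert desired_containers(demand=0, min_containers=2, max_containers=20, warm_containers=4) == 6

# Heavy load: demand of 30 would want 34 containers, but max_containers caps it at 20.
assert desired_containers(demand=30, min_containers=2, max_containers=20, warm_containers=4) == 20

# No scaling parameters: purely on-demand, one container per concurrent invocation.
assert desired_containers(demand=3) == 3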

Choosing Values

Workload                                   | Recommendation
Batch processing, async pipelines          | No scaling params needed; on-demand is cost-efficient
User-facing APIs with latency requirements | Set warm_containers to expected concurrent users
Predictable steady-state load              | Set min_containers to baseline, warm_containers for burst headroom
Cost-sensitive with hard upper limits      | Set max_containers to cap spend
Real-time interactive agents               | Set warm_containers=1 or more for instant response on first request
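As one way of combining these recommendations, here is a sketch of a user-facing API function. The specific values (1, 50, 10, 5) and the handle_request name are illustrative assumptions, not defaults; tune them to your own traffic profile:
from tensorlake.applications import function

@function(
    min_containers=1,    # baseline container kept running even at zero traffic
    max_containers=50,   # hard cap on total containers, and therefore on spend
    warm_containers=10,  # roughly the expected number of concurrent users
    max_concurrency=5,   # I/O-bound handler: each container serves 5 requests at once
)
def handle_request(payload: dict) -> dict:
    """10 warm containers × 5 concurrency = 50 requests served with no cold start."""
    ...
With these illustrative values, warm capacity covers an expected burst of 50 concurrent requests, while max_containers bounds the worst case at 50 × 5 = 250 concurrent invocations.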