Use min_containers, max_containers, and warm_containers to trade off cost against latency.
How It Works
Every @function() runs in its own container. When a request arrives and no container is available, Tensorlake creates one — this incurs cold start latency while the container image is pulled and dependencies are loaded.
For latency-sensitive workloads, you can keep containers warm so they’re ready before requests arrive.
On-demand (default)
Without any scaling parameters, Tensorlake creates containers as requests come in and removes them when they're idle. This is cost-efficient but has cold start latency on the first request.
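As a point of reference, here is a minimal sketch of a function with no scaling parameters; the decorator and its behavior are as described above, but the import path is an assumption, so check the Tensorlake SDK reference.

```python
# Minimal sketch: no scaling parameters, so containers are created on demand.
# NOTE: the import path below is an assumption; check the Tensorlake SDK docs.
from tensorlake import function


@function()
def summarize(text: str) -> str:
    # The first request after idle pays the cold start while the container
    # image is pulled and dependencies are loaded.
    return text[:200]
```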
Warm containers
Set warm_containers to keep a pool of pre-started containers ready for immediate allocation. As containers are consumed by requests, Tensorlake reconciles the pool by creating replacements to maintain the warm count.
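A sketch of a warm pool, assuming the parameter is passed directly to the decorator (import path assumed as above):

```python
from tensorlake import function  # import path assumed


@function(warm_containers=4)
def extract_entities(document: str) -> list[str]:
    # Four pre-started containers are kept ready, so requests skip the cold
    # start; Tensorlake starts replacements to keep 4 warm as they're consumed.
    return document.split()
```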
Min and max containers
Use min_containers and max_containers to set bounds on scaling:
| Parameter | Effect |
|---|---|
| min_containers | Minimum containers always running, regardless of demand |
| max_containers | Upper limit on total containers — caps autoscaling |
| warm_containers | Additional unallocated containers kept ready beyond what's needed |
| max_concurrency | Number of concurrent function invocations per container (default: 1) |
For example, with min_containers=2, max_containers=20, and warm_containers=4, Tensorlake creates 6 containers at baseline (2 min + 4 warm). As requests consume warm containers, replacements are created to maintain the warm count — up to the max of 20.
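That configuration as a sketch (parameter names are the ones documented here; the import path is an assumption):

```python
from tensorlake import function  # import path assumed


@function(min_containers=2, max_containers=20, warm_containers=4)
def render_report(payload: dict) -> str:
    # Baseline: 6 containers (2 min + 4 warm). Under load, autoscaling adds
    # containers up to the cap of 20, creating warm replacements as the pool
    # is consumed.
    return str(payload)
```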
Concurrency per container
By default, each container handles one function invocation at a time (max_concurrency=1). For I/O-bound functions that spend most of their time waiting on network responses — LLM calls, API requests, database queries — you can increase max_concurrency to pack more work into fewer containers.
| max_concurrency value | Best for |
|---|---|
| 1 (default) | CPU-bound work, GPU inference, code execution |
| 5-10 | I/O-bound functions: LLM API calls, web requests, database queries |
| 10+ | Lightweight I/O like webhook delivery or notification dispatch |
Higher max_concurrency means fewer containers are needed for the same throughput, which reduces cost and cold starts. But CPU-bound functions won't benefit — they'll contend for the same core.
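For example, an I/O-bound function that mostly waits on a network response can pack several invocations into one container. This is a sketch; the import path is an assumption.

```python
import urllib.request

from tensorlake import function  # import path assumed


@function(max_concurrency=10)
def fetch_status(url: str) -> int:
    # Most of the wall-clock time here is spent waiting on the remote server,
    # not on the CPU, so 10 invocations can share this container.
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status
```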
Concurrency interacts with the other scaling parameters:
- Total capacity = containers × max_concurrency. With max_containers=10 and max_concurrency=5, you can handle 50 concurrent invocations.
- Warm capacity = warm_containers × max_concurrency. With warm_containers=2 and max_concurrency=5, 10 requests can be served instantly.
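Putting the two formulas together, a sketch using the same numbers (import path assumed):

```python
from tensorlake import function  # import path assumed


@function(max_containers=10, warm_containers=2, max_concurrency=5)
def classify(item: str) -> str:
    # Total capacity: 10 containers x 5 = 50 concurrent invocations.
    # Warm capacity: 2 containers x 5 = 10 invocations with no cold start.
    return "ok" if item else "empty"
```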
Scaling behavior
- No warm containers specified — Tensorlake autoscales purely on demand. Containers are created when requests arrive and removed when idle.
- Warm containers specified — Tensorlake autoscales to meet current demand plus the warm count. As load increases, more containers are created. As load decreases, excess containers are removed, but the warm pool is maintained.
Choosing Values
| Workload | Recommendation |
|---|---|
| Batch processing, async pipelines | No scaling params needed — on-demand is cost-efficient |
| User-facing APIs with latency requirements | Set warm_containers to expected concurrent users |
| Predictable steady-state load | Set min_containers to baseline, warm_containers for burst headroom |
| Cost-sensitive with hard upper limits | Set max_containers to cap spend |
| Real-time interactive agents | Set warm_containers=1 or more for instant response on first request |
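For instance, a user-facing API from the table above might combine warm capacity sized for expected concurrent users with a spend cap. A sketch under the same assumptions as the earlier examples:

```python
from tensorlake import function  # import path assumed


@function(warm_containers=8, max_containers=30)
def answer_query(question: str) -> str:
    # Up to 8 requests land on warm containers with no cold start; autoscaling
    # never exceeds 30 containers, which caps spend.
    return f"received: {question}"
```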