This page describes the architecture of Tensorlake’s Application Runtime — the distributed runtime that runs your agents and functions. Learn how the server, container scheduler, application scheduler, dataplanes, and language runtimes interact to serve requests. Tensorlake has many moving parts, and this page documents the system architecture to help you build a mental model of how they fit together.
Advanced topic. You do not need to understand these details to effectively use Tensorlake. The details are documented here for those who wish to learn about them without having to go spelunking through the source code.

High-Level Overview

When a request hits your application, the runtime creates a new sandbox in milliseconds and your agent starts in an isolated environment with its own filesystem. Every function decorated with @function() can run in its own remote sandbox with dedicated resources — from your code it looks like a normal function call, but under the hood the runtime is scheduling containers, managing state, and handling failures. At a high level, the system looks like this:

The server is the control plane. It receives requests from clients, persists all state, and runs two schedulers. The application scheduler manages the lifecycle of function calls — it builds the execution graph for each request, creates allocations, checkpoints outputs, and handles replay on failure. The container scheduler manages the infrastructure layer — it tracks resources across all dataplanes, places containers on worker nodes, manages warm pools, and scales containers up and down based on demand.

A dataplane manages containers on a pool of worker nodes. You can think of it as a regional cluster of compute capacity. Multiple dataplanes can run in parallel, and the server distributes work across them. Each dataplane maintains a persistent bidirectional gRPC stream with the server — it reports its current state (running containers, resource usage, allocation results) and receives new work assignments in return.

A language runtime is the sandbox that runs your code. Every @function() call runs in its own isolated container with its own filesystem, dependencies, and resource limits. When a function calls another function, the child runs in a separate sandbox — a lightweight orchestrator can dispatch work to GPU-equipped containers without needing GPU resources itself. From your code, this is invisible.
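To make the programming model concrete, here is a minimal sketch of what nested @function() calls look like. The import path and the GPU parameter are illustrative assumptions rather than the exact SDK surface; the point is that the orchestrator function can dispatch work to a GPU-equipped sandbox without holding GPU resources itself.

```python
# A minimal sketch of the programming model described above. The import path
# and the resource parameter on @function() are illustrative assumptions,
# not the exact SDK surface.
from tensorlake import function  # assumed import path

@function()  # lightweight orchestrator: needs no GPU itself
def summarize(url: str) -> str:
    text = extract_text(url)    # looks like a normal call, runs in its own sandbox
    return write_summary(text)  # another call, another separate sandbox

@function(gpu="A10G")  # hypothetical way to request a GPU-equipped container
def extract_text(url: str) -> str:
    return f"text extracted from {url}"

@function()
def write_summary(text: str) -> str:
    return text[:100]
```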

Why a Custom Scheduler

A common question is why Tensorlake built its own container scheduler instead of using Kubernetes. The short answer is that Kubernetes was designed for long-running services, not for workloads that create a new container for every request and need it running in milliseconds.

Tensorlake’s execution model is fundamentally different from what Kubernetes expects. When a request arrives, the runtime creates a fresh sandbox — an isolated container with its own filesystem — in single-digit milliseconds. At peak load, the scheduler creates hundreds of these per second. In Kubernetes, creating a pod involves writing to etcd, passing through admission controllers, waiting for the kubelet to sync, and pulling images. This takes seconds at best, often longer. Creating a pod per request at this rate would overwhelm the Kubernetes control plane.

Beyond raw speed, the container scheduler is tightly integrated with the application scheduler in ways that a general-purpose orchestrator can’t be. It understands function-level container pools — warm pools, minimum counts, and buffer sizes per function — and uses this to make smarter placement decisions. Its eviction algorithm knows which containers have active allocations and never evicts them, prioritizing containers above pool buffers first. It tracks container affinity per function so it can route work to dataplanes that already have warm containers, avoiding cold starts entirely. None of these concepts exist in Kubernetes scheduling.

The desired state model is also purpose-built for this workload. The server pushes desired state to dataplanes over a persistent gRPC stream, and dataplanes reconcile in real-time. Kubernetes uses a watch/list model over etcd that works well for long-running services but adds latency when you need the scheduling loop to react in milliseconds — for example, when a function completes and the next step in a workflow needs to start immediately.

Finally, there is an operational argument. Deploying agents on Kubernetes means writing YAMLs, configuring Horizontal Pod Autoscalers, managing image pull policies, setting up KEDA or Knative for scale-to-zero, and running a separate durable execution server for crash recovery. Tensorlake collapses all of that into a single runtime. You deploy your Python code, and the scheduler handles the rest.

The Server

The server is the single control plane for the entire system. It exposes the HTTP API that receives requests from clients and the SDK, persists all state to a durable store, and runs the two schedulers that coordinate all work. Every request, every function call, every allocation, and every container decision flows through the server.

When a request arrives, the server creates a request context — a record that tracks the full state of the request, including the function call graph, all function runs, and the final outcome. The request context is persisted immediately. From this point, the application scheduler and container scheduler work together to execute the request.
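As a rough mental model, you can picture the request context as a record like the sketch below. The field names and types are illustrative assumptions, not the server’s actual schema.

```python
# Rough mental model of the state the server persists per request.
# Field names and types are illustrative, not the server's actual schema.
from dataclasses import dataclass, field
from enum import Enum

class RunStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"

@dataclass
class FunctionRun:
    function_name: str
    status: RunStatus = RunStatus.PENDING
    output_checkpoint: str | None = None   # pointer into object storage

@dataclass
class RequestContext:
    request_id: str
    # the dynamically growing call graph: call id -> ids of child calls
    call_graph: dict[str, list[str]] = field(default_factory=dict)
    # one execution record per function call
    function_runs: dict[str, FunctionRun] = field(default_factory=dict)
    outcome: str | None = None             # final result once the graph completes
```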

Container Scheduler

The container scheduler is responsible for the infrastructure layer: deciding which containers run on which machines, and managing their lifecycle.

The container scheduler maintains a real-time view of every executor (worker node) in the system — its total and free resources (CPU, memory, GPU), and every container running on it. It also tracks container pools, which group containers by function. Each pool has configurable minimums, maximums, and buffer sizes that control scaling behavior.

When the application scheduler needs a container for a function, the container scheduler first checks whether a warm container already exists in the function’s pool — a pre-initialized container with no active work. If one is available, it claims it immediately, avoiding cold-start latency entirely.

If no warm container is available, the scheduler runs its placement engine. It finds candidate executors that satisfy the function’s resource requirements and constraints, then selects one. If no executor has enough free resources, the scheduler runs a vacuum pass — it looks for lower-priority containers that can be evicted to free up space. Eviction follows a priority order: containers above the pool’s buffer count are evicted first, then those above the minimum, and only as a last resort those at or below the minimum. Containers with active allocations are never evicted.

The container scheduler communicates with dataplanes through a desired state model. Rather than issuing imperative commands, it declares the desired state of each container (running or terminated) and the dataplane reconciles its actual state to match. This makes the system resilient to transient failures — if a message is lost, the next reconciliation cycle corrects the drift.

Scaling is driven by demand. When requests arrive, new containers are created. When traffic drops, idle containers are terminated. Functions with no traffic have no running containers and incur no cost. You can configure warm pools to keep a buffer of pre-initialized containers ready for latency-sensitive functions, or set concurrency caps to limit the total number of concurrent instances.
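The eviction priority used by the vacuum pass can be sketched as a simple ordering. The data model in this snippet is an assumption made for illustration; it is not the production scheduler.

```python
# Illustrative sketch of the eviction ordering described above, not the
# production algorithm. The data model below is an assumption.
from dataclasses import dataclass

@dataclass
class PoolState:
    running: int    # containers currently running for this function
    minimum: int    # configured pool minimum
    buffer: int     # configured warm buffer on top of the minimum

@dataclass
class ContainerState:
    container_id: str
    active_allocations: int
    pool: PoolState

def eviction_order(containers: list[ContainerState]) -> list[ContainerState]:
    """Idle containers, sorted in the order the vacuum pass would evict them."""
    # Containers with active allocations are never eviction candidates.
    idle = [c for c in containers if c.active_allocations == 0]

    def tier(c: ContainerState) -> int:
        if c.pool.running > c.pool.minimum + c.pool.buffer:
            return 0   # above the pool's buffer count: evicted first
        if c.pool.running > c.pool.minimum:
            return 1   # above the minimum: evicted next
        return 2       # at or below the minimum: only as a last resort

    return sorted(idle, key=tier)
```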

Application Scheduler

The application scheduler manages the execution of your code: the function call graph, allocations, checkpointing, and replay.

When a request arrives, the application scheduler creates an initial function call — a node in the execution graph that represents a function to invoke with specific inputs. For each function call, it creates a function run — an execution instance that tracks status (pending, running, completed) and stores the checkpointed output when the function finishes.

To actually execute a function run, the application scheduler creates an allocation — a unit of work that binds a function run to a specific container. It first checks whether an existing container has capacity (based on the function’s max_concurrency setting). If not, it asks the container scheduler to create a new one. The allocation is persisted and pushed to the dataplane through the desired state stream.

When a function calls another function, the language runtime reports the child function call back to the server. The application scheduler adds a new node to the execution graph, creates a function run for it, and the cycle repeats. This is how Tensorlake builds the full DAG of function calls for each request — the graph grows dynamically as your code executes.

When a function run completes, the application scheduler checkpoints its output. The output data is stored in object storage (not in the database), so your agents can pass large files between functions without workarounds. The scheduler then propagates the output to any downstream function calls that depend on it, creating new function runs as inputs become available.

Replay is how the system recovers from failures. When a request is replayed, the application scheduler walks the function call graph from the beginning. Function runs that already have checkpointed outputs return their results instantly without re-executing. The replay fast-forwards through completed work until it reaches the function that failed, then starts running it again from scratch. From the application’s perspective, it picks up right where it left off.

Retries handle individual function failures. When a function run fails — whether from an exception, a container crash, or a timeout — the application scheduler checks the function’s retry policy. If retries are available, it creates a new allocation and runs the function again with the same inputs.
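The replay behavior can be pictured as a walk over the call graph that reuses checkpointed outputs. This snippet uses an in-memory dictionary as a stand-in for object storage and is only an illustration of the idea, not the server’s implementation.

```python
# Simplified sketch of replay as described above, using an in-memory dict as
# a stand-in for object storage; not the server's actual implementation.
object_store: dict[str, object] = {}

def replay(calls_in_order, runs, execute):
    """Walk the call graph from the start, reusing checkpointed outputs."""
    results = {}
    for call_id, inputs in calls_in_order:
        run = runs[call_id]
        if run.get("checkpoint") in object_store:
            # Completed before the failure: fast-forward by returning the
            # stored output instead of re-executing the function.
            results[call_id] = object_store[run["checkpoint"]]
        else:
            # First call without a checkpoint: execution starts again here,
            # from scratch, with the same inputs.
            output = execute(call_id, inputs)
            key = f"checkpoint/{call_id}"
            object_store[key] = output
            run["checkpoint"] = key
            results[call_id] = output
    return results
```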

Dataplanes

A dataplane manages the containers on a pool of worker nodes. It is the bridge between the server’s scheduling decisions and actual code execution.

Each dataplane maintains two communication channels with the server. A bidirectional gRPC stream carries the desired state — the server pushes container specifications and allocations to the dataplane, and the dataplane acknowledges receipt. A heartbeat fires every five seconds, reporting the dataplane’s current state: which containers are running, their resource usage, and the results of completed allocations.

When the dataplane receives a new desired state, its state reconciler compares it against the actual state of the local system. If a container should exist but doesn’t, it creates one. If a container should be terminated, it shuts it down. If an allocation needs to be executed, it routes it to the appropriate language runtime controller.

Allocation results flow back to the server through the heartbeat channel. The dataplane buffers results and fragments large payloads across multiple heartbeats (with a 10MB limit per message) to avoid overwhelming the connection. Results are only removed from the buffer after the server acknowledges receipt, ensuring nothing is lost in transit.

If the gRPC stream disconnects, the dataplane reconnects automatically. If heartbeats fail, it uses exponential backoff. The desired state model means that temporary disconnections don’t cause inconsistency — the next successful sync brings everything back in line.
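Conceptually, the reconciler does something like the sketch below on every desired-state update. The message shape and the container API are assumptions made for illustration, not the dataplane’s real interfaces.

```python
# Rough sketch of the reconciliation step described above. The shape of the
# desired-state message and the container runtime API are assumptions.
def reconcile(desired_containers, actual_containers, runtime):
    """Bring local container state in line with the server's desired state."""
    desired = {c["id"]: c for c in desired_containers}
    actual = {c["id"]: c for c in actual_containers}

    # Containers that should be running but don't exist locally: create them.
    for container_id, spec in desired.items():
        if spec["state"] == "running" and container_id not in actual:
            runtime.create_container(spec)

    # Containers that should be terminated (or are no longer desired): stop them.
    for container_id in actual:
        spec = desired.get(container_id)
        if spec is None or spec["state"] == "terminated":
            runtime.terminate_container(container_id)

    # Allocations ride along with the desired state and are routed to the
    # language runtime controller of the target container.
    for spec in desired.values():
        for allocation in spec.get("allocations", []):
            runtime.dispatch_allocation(spec["id"], allocation)
```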

Language Runtimes

A language runtime is the sandbox that runs your function code. It is a container with its own filesystem, dependencies, and resource limits, managed by a controller on the dataplane.

When a language runtime receives an allocation, it processes it through a three-phase pipeline. In the preparing phase, the runtime downloads input data and presigns blob URLs for outputs. This phase does not occupy a concurrency slot, so the container can prepare multiple allocations in parallel while running others.

In the running phase, the function code executes. The language runtime streams state updates back to the dataplane controller in real-time: progress updates, output blob requests, child function calls, and the final result. If the function calls another @function()-decorated function, the language runtime reports the child call to the server, which creates a new allocation for it. For blocking calls, the language runtime registers a watcher and pauses until the child function’s result arrives from the server.

In the finalizing phase, the runtime completes any multipart uploads, cleans up blob handles, and releases the concurrency slot.

Each language runtime has a configurable max_concurrency that limits how many allocations it can execute simultaneously in the running phase. The application scheduler respects this limit when placing allocations — if all slots are full, it either queues the work or asks the container scheduler for a new container.
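The three-phase pipeline and the max_concurrency slot can be pictured roughly as follows. The helper callables and the semaphore-based slot are illustrative assumptions, not the runtime’s real interfaces.

```python
# Sketch of the three-phase allocation pipeline described above, using an
# asyncio semaphore as the running-phase concurrency slot. Function and
# parameter names are illustrative, not the runtime's real interfaces.
import asyncio

MAX_CONCURRENCY = 4                        # per-container max_concurrency
running_slots = asyncio.Semaphore(MAX_CONCURRENCY)

async def process_allocation(allocation, download_inputs, presign_outputs,
                             execute, finalize_uploads):
    # Preparing: does not hold a concurrency slot, so several allocations
    # can prepare in parallel while others are running.
    inputs = await download_inputs(allocation)
    output_urls = await presign_outputs(allocation)

    # Running: occupies one of the max_concurrency slots.
    async with running_slots:
        result = await execute(allocation, inputs, output_urls)

    # Finalizing: complete multipart uploads, clean up blob handles,
    # and (on exiting the block above) release the concurrency slot.
    await finalize_uploads(allocation, result)
    return result
```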

Getting in Depth

This has been a high-level overview of the Application Runtime architecture. The durable execution model, crash recovery behavior, autoscaling configuration, and queuing behavior are all documented in more detail.