Why Use Tensorlake + Outlines?
The Problem:
- LLMs return malformed JSON, mix up date formats, and hallucinate values
- Traditional solutions (regex cleanup, validation scripts, retry loops) don’t scale
- Production pipelines break when outputs don’t match expected schemas
The Solution:
- Guaranteed valid JSON on every run - no parsing failures
- Type-safe outputs that match your Pydantic models exactly
- No post-processing - outputs are ready for downstream systems
- Production-ready - eliminates an entire class of pipeline failures
Installation
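Both libraries ship as Python packages. A typical install (assuming the Tensorlake SDK is published on PyPI as `tensorlake`; add a model backend such as `transformers` depending on how you run Outlines) looks like:

```bash
pip install tensorlake outlines pydantic
```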
Quick Start
Step 1: Parse Documents with Tensorlake
Tensorlake converts your documents into structured fragments with metadata:
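A minimal parsing sketch follows. The import path, client, and method names (`DocumentAI`, `upload`, `parse`, `wait_for_completion`) and the result attributes are assumptions about Tensorlake's Python SDK, so verify them against the current SDK docs:

```python
import os

# Assumed import path and client name - check the Tensorlake SDK docs.
from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key=os.environ["TENSORLAKE_API_KEY"])

# Upload the document and run parsing (method names are assumptions).
file_id = doc_ai.upload(path="invoice.pdf")
parse_id = doc_ai.parse(file_id)
result = doc_ai.wait_for_completion(parse_id)

# The parse result exposes structured fragments (text, tables, key-value regions)
# plus metadata such as page numbers and fragment types (attribute names assumed).
for fragment in result.pages[0].page_fragments:
    print(fragment.fragment_type, fragment.content)
```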
Step 2: Define Your Schema
Use Pydantic to describe the fields you expect:
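An illustrative invoice schema; the field names and descriptions are examples, so shape them to your own documents:

```python
from datetime import date

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    """Fields we want extracted from each invoice."""

    invoice_number: str = Field(description="Identifier printed on the invoice")
    invoice_date: date = Field(description="Issue date in ISO format")
    vendor_name: str
    total_amount: float = Field(description="Grand total including tax")
    currency: str = Field(description="Three-letter currency code, e.g. USD")
```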
Step 3: Create Few-Shot Examples
Help the model understand the extraction pattern:
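One way to encode the pattern is a list of (text, expected output) pairs that get prepended to the prompt. This is an illustrative format built on the `Invoice` schema above, not a fixed API:

```python
from datetime import date

# Each example pairs a raw snippet with the structured output we expect back.
EXAMPLES = [
    (
        "Invoice INV-0042 from Acme Supplies, issued 2024-03-05. Total due: $1,250.00 (USD).",
        Invoice(
            invoice_number="INV-0042",
            invoice_date=date(2024, 3, 5),
            vendor_name="Acme Supplies",
            total_amount=1250.00,
            currency="USD",
        ),
    ),
]


def build_prompt(document_text: str) -> str:
    """Prepend the few-shot pairs to the text we actually want extracted."""
    shots = "\n\n".join(
        f"Text:\n{text}\nJSON:\n{expected.model_dump_json()}"
        for text, expected in EXAMPLES
    )
    return (
        "Extract the invoice fields as JSON matching the schema.\n\n"
        f"{shots}\n\nText:\n{document_text}\nJSON:\n"
    )
```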
Step 4: Extract with Schema Enforcement
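With the schema and examples in place, Outlines constrains decoding so the output always matches the `Invoice` model. The Outlines interface has changed across releases, so this sketch assumes the 0.x-style `outlines.generate.json` API and a local Transformers model; adjust it to the version you run:

```python
import outlines

# Any Outlines-supported backend works; a small local model is used here.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Decoding is constrained so the output always matches the Invoice schema.
generator = outlines.generate.json(model, Invoice)

document_text = "..."        # text pulled from the Tensorlake fragments in Step 1
prompt = build_prompt(document_text)

invoice = generator(prompt)  # an Invoice instance, never malformed text
print(invoice.invoice_number, invoice.total_amount)
```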
Output
The result is clean, type-safe, and ready for downstream systems:
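With the illustrative schema above, the output always has this fixed, validated shape (the values here are placeholders):

```json
{
  "invoice_number": "INV-0042",
  "invoice_date": "2024-03-05",
  "vendor_name": "Acme Supplies",
  "total_amount": 1250.0,
  "currency": "USD"
}
```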
How Outlines Enforces Schema Constraints
Language models normally work by outputting probability distributions over the next token at each step. Nothing stops the model from outputting invalid JSON or incorrect types. Outlines changes the decoding loop:
- Builds a finite state machine (FSM) that represents all valid outputs for your schema
- At each decoding step, masks out any tokens that would violate the schema
- Only allows tokens that keep the output valid
- If your schema says "total_amount" must be a number, Outlines prunes away every token that isn't a digit, decimal point, or valid number continuation
- If your schema requires valid JSON, the FSM ensures braces, commas, and quotes are placed correctly, preventing stray fragments like "{,]" or unclosed strings (a simplified sketch of this masking step follows below)
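The snippet below is a deliberately simplified sketch of that masking idea, not Outlines' actual implementation: a regex over valid number prefixes stands in for the FSM, and candidate tokens that would break a number-typed field are filtered out before sampling.

```python
import re

# Toy stand-in for the FSM state: "is this string still a valid prefix of a JSON number?"
NUMBER_PREFIX = re.compile(r"-?\d*\.?\d*")


def allowed_tokens(generated_so_far: str, vocabulary: list[str]) -> list[str]:
    """Keep only tokens that extend the output to a still-valid number prefix."""
    return [
        token
        for token in vocabulary
        if NUMBER_PREFIX.fullmatch(generated_so_far + token)
    ]


# After "12", a digit or "." stays allowed; "}" and "abc" are masked out.
print(allowed_tokens("12", ["3", ".", "}", "abc"]))  # ['3', '.']
```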
Use Cases
Financial Document Processing
Extract structured data from invoices, receipts, and financial statements with guaranteed field presence and type safety.
Contract Analysis
Parse contracts with complex schemas where missing fields or type mismatches break downstream workflows.
Insurance Claims Processing
Extract claim data with validated amounts, dates, and classifications that comply with downstream systems.
Legal Document Review
Structure legal documents into typed objects that can be safely stored in databases and processed by analytics pipelines.
Best Practices
1. Design Schemas Carefully
Keep schemas as simple as possible with low nesting levels. Experiment with different schema keys and descriptions.
2. Filter Before Extraction
Use Tensorlake’s page classification and fragment typing to discard irrelevant sections (like footers or signatures) before passing text to the model. This reduces noise and improves accuracy.
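A sketch of that filtering step; the fragment attributes (`fragment_type`, `content`) and the type labels are assumptions about the parse result from Step 1, so adapt them to what your Tensorlake response actually contains:

```python
# Keep only fragment types worth extracting from; drop footers, signatures, etc.
RELEVANT_TYPES = {"text", "table", "key_value_region"}  # assumed type labels

relevant_text = "\n\n".join(
    fragment.content
    for page in result.pages                  # `result` from Step 1 (assumed shape)
    for fragment in page.page_fragments
    if fragment.fragment_type in RELEVANT_TYPES
)
```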
3. Validate Twice
Outlines guarantees schema validity during decoding, but validate again downstream with Pydantic as an extra safety net before writing to databases.
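The second pass is a one-liner with Pydantic; `quarantine` and `save_to_database` below are hypothetical hooks standing in for your own error handling and persistence:

```python
from pydantic import ValidationError

try:
    # Re-validate the generated payload right before it touches storage.
    checked = Invoice.model_validate(invoice.model_dump())
except ValidationError as err:
    quarantine(invoice, err)      # hypothetical: route bad records for review
else:
    save_to_database(checked)     # hypothetical: your persistence layer
```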
4. Handle Missing Values Explicitly
Instead of letting models hallucinate, define optional fields in your schema so the absence of data is captured cleanly:
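A minimal illustration: mark fields that may legitimately be absent as `Optional` with a `None` default, so the model can return nothing instead of inventing a value:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel


class Invoice(BaseModel):
    invoice_number: str
    total_amount: float
    # These may be missing on some documents; None records the absence cleanly.
    purchase_order: Optional[str] = None
    due_date: Optional[date] = None
```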
5. Benchmark Cost and Latency
Constrained decoding has overhead, especially for large schemas. Measure the trade-offs between schema complexity and generation speed.
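A quick way to get a baseline number, assuming the `generator` and `prompt` from Step 4:

```python
import time

start = time.perf_counter()
invoice = generator(prompt)   # constrained generation from Step 4
elapsed = time.perf_counter() - start
print(f"Constrained generation took {elapsed:.2f}s for schema {Invoice.__name__}")
```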
Complete Example
Try the full working example in our Colab notebook:
Schema-Enforced Pipeline Notebook
Complete code walkthrough with invoice extraction example