Why Use Tensorlake + Outlines?
The Problem:
- LLMs return malformed JSON, mix up date formats, and hallucinate values
- Traditional solutions (regex cleanup, validation scripts, retry loops) don’t scale
- Production pipelines break when outputs don’t match expected schemas
The Solution:
- Guaranteed valid JSON on every run - no parsing failures
- Type-safe outputs that match your Pydantic models exactly
- No post-processing - outputs are ready for downstream systems
- Production-ready - eliminates an entire class of pipeline failures
Installation
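Both libraries ship as Python packages. A typical install (assuming the Tensorlake SDK is published on PyPI as `tensorlake`; add a model backend such as `transformers` depending on how you run Outlines) looks like:

```bash
pip install tensorlake outlines pydantic
```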
Quick Start
Step 1: Parse Documents with Tensorlake
Tensorlake converts your documents into structured fragments with metadata:
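A minimal parsing sketch follows. The import path, client, and method names (`DocumentAI`, `upload`, `parse`, `wait_for_completion`) and the result attributes are assumptions about Tensorlake's Python SDK, so verify them against the current SDK docs:

```python
import os

# Assumed import path and client name - check the Tensorlake SDK docs.
from tensorlake.documentai import DocumentAI

doc_ai = DocumentAI(api_key=os.environ["TENSORLAKE_API_KEY"])

# Upload the document and run parsing (method names are assumptions).
file_id = doc_ai.upload(path="invoice.pdf")
parse_id = doc_ai.parse(file_id)
result = doc_ai.wait_for_completion(parse_id)

# The parse result exposes structured fragments (text, tables, key-value regions)
# plus metadata such as page numbers and fragment types (attribute names assumed).
for fragment in result.pages[0].page_fragments:
    print(fragment.fragment_type, fragment.content)
```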
Step 2: Define Your Schema
Use Pydantic to describe the fields you expect:
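An illustrative invoice schema; the field names and descriptions are examples, so shape them to your own documents:

```python
from datetime import date

from pydantic import BaseModel, Field


class Invoice(BaseModel):
    """Fields we want extracted from each invoice."""

    invoice_number: str = Field(description="Identifier printed on the invoice")
    invoice_date: date = Field(description="Issue date in ISO format")
    vendor_name: str
    total_amount: float = Field(description="Grand total including tax")
    currency: str = Field(description="Three-letter currency code, e.g. USD")
```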
Step 3: Create Few-Shot Examples
Help the model understand the extraction pattern:
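One way to encode the pattern is a list of (text, expected output) pairs that get prepended to the prompt. This is an illustrative format built on the `Invoice` schema above, not a fixed API:

```python
from datetime import date

# Each example pairs a raw snippet with the structured output we expect back.
EXAMPLES = [
    (
        "Invoice INV-0042 from Acme Supplies, issued 2024-03-05. Total due: $1,250.00 (USD).",
        Invoice(
            invoice_number="INV-0042",
            invoice_date=date(2024, 3, 5),
            vendor_name="Acme Supplies",
            total_amount=1250.00,
            currency="USD",
        ),
    ),
]


def build_prompt(document_text: str) -> str:
    """Prepend the few-shot pairs to the text we actually want extracted."""
    shots = "\n\n".join(
        f"Text:\n{text}\nJSON:\n{expected.model_dump_json()}"
        for text, expected in EXAMPLES
    )
    return (
        "Extract the invoice fields as JSON matching the schema.\n\n"
        f"{shots}\n\nText:\n{document_text}\nJSON:\n"
    )
```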
Step 4: Extract with Schema Enforcement
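With the schema and examples in place, Outlines constrains decoding so the output always matches the `Invoice` model. The Outlines interface has changed across releases, so this sketch assumes the 0.x-style `outlines.generate.json` API and a local Transformers model; adjust it to the version you run:

```python
import outlines

# Any Outlines-supported backend works; a small local model is used here.
model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")

# Decoding is constrained so the output always matches the Invoice schema.
generator = outlines.generate.json(model, Invoice)

document_text = "..."        # text pulled from the Tensorlake fragments in Step 1
prompt = build_prompt(document_text)

invoice = generator(prompt)  # an Invoice instance, never malformed text
print(invoice.invoice_number, invoice.total_amount)
```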
Output
The result is clean, type-safe, and ready for downstream systems:
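With the illustrative schema above, the output always has this fixed, validated shape (the values here are placeholders):

```json
{
  "invoice_number": "INV-0042",
  "invoice_date": "2024-03-05",
  "vendor_name": "Acme Supplies",
  "total_amount": 1250.0,
  "currency": "USD"
}
```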
How Outlines Enforces Schema Constraints
Language models normally work by outputting probability distributions over the next token at each step. Nothing stops the model from outputting invalid JSON or incorrect types. Outlines changes the decoding loop:
- Builds a finite state machine (FSM) that represents all valid outputs for your schema
- At each decoding step, masks out any tokens that would violate the schema
- Only allows tokens that keep the output valid
- If your schema says "total_amount" must be a number, Outlines prunes away every token that isn't a digit, decimal point, or valid number continuation
- If your schema requires valid JSON, the FSM ensures braces, commas, and quotes are placed correctly, preventing stray fragments like "{,]" or unclosed strings (a simplified sketch of this masking step follows below)
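The snippet below is a deliberately simplified sketch of that masking idea, not Outlines' actual implementation: a regex over valid number prefixes stands in for the FSM, and candidate tokens that would break a number-typed field are filtered out before sampling.

```python
import re

# Toy stand-in for the FSM state: "is this string still a valid prefix of a JSON number?"
NUMBER_PREFIX = re.compile(r"-?\d*\.?\d*")


def allowed_tokens(generated_so_far: str, vocabulary: list[str]) -> list[str]:
    """Keep only tokens that extend the output to a still-valid number prefix."""
    return [
        token
        for token in vocabulary
        if NUMBER_PREFIX.fullmatch(generated_so_far + token)
    ]


# After "12", a digit or "." stays allowed; "}" and "abc" are masked out.
print(allowed_tokens("12", ["3", ".", "}", "abc"]))  # ['3', '.']
```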
Use Cases
Financial Document Processing
Extract structured data from invoices, receipts, and financial statements with guaranteed field presence and type safety.
Contract Analysis
Parse contracts with complex schemas where missing fields or type mismatches break downstream workflows.
Insurance Claims Processing
Extract claim data with validated amounts, dates, and classifications that comply with downstream systems.
Legal Document Review
Structure legal documents into typed objects that can be safely stored in databases and processed by analytics pipelines.
Best Practices
1. Design Schemas Carefully
Keep schemas as simple as possible with low nesting levels. Experiment with different schema keys and descriptions.
2. Filter Before Extraction
Use Tensorlake’s page classification and fragment typing to discard irrelevant sections (like footers or signatures) before passing text to the model. This reduces noise and improves accuracy.
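A sketch of that filtering step; the fragment attributes (`fragment_type`, `content`) and the type labels are assumptions about the parse result from Step 1, so adapt them to what your Tensorlake response actually contains:

```python
# Keep only fragment types worth extracting from; drop footers, signatures, etc.
RELEVANT_TYPES = {"text", "table", "key_value_region"}  # assumed type labels

relevant_text = "\n\n".join(
    fragment.content
    for page in result.pages                  # `result` from Step 1 (assumed shape)
    for fragment in page.page_fragments
    if fragment.fragment_type in RELEVANT_TYPES
)
```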
3. Validate Twice
Outlines guarantees schema validity during decoding, but validate again downstream with Pydantic as an extra safety net before writing to databases.
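The second pass is a one-liner with Pydantic; `quarantine` and `save_to_database` below are hypothetical hooks standing in for your own error handling and persistence:

```python
from pydantic import ValidationError

try:
    # Re-validate the generated payload right before it touches storage.
    checked = Invoice.model_validate(invoice.model_dump())
except ValidationError as err:
    quarantine(invoice, err)      # hypothetical: route bad records for review
else:
    save_to_database(checked)     # hypothetical: your persistence layer
```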
4. Handle Missing Values Explicitly
Instead of letting models hallucinate, define optional fields in your schema so the absence of data is captured cleanly:
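A minimal illustration: mark fields that may legitimately be absent as `Optional` with a `None` default, so the model can return nothing instead of inventing a value:

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel


class Invoice(BaseModel):
    invoice_number: str
    total_amount: float
    # These may be missing on some documents; None records the absence cleanly.
    purchase_order: Optional[str] = None
    due_date: Optional[date] = None
```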
5. Benchmark Cost and Latency
Constrained decoding has overhead, especially for large schemas. Measure the trade-offs between schema complexity and generation speed.
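A quick way to get a baseline number, assuming the `generator` and `prompt` from Step 4:

```python
import time

start = time.perf_counter()
invoice = generator(prompt)   # constrained generation from Step 4
elapsed = time.perf_counter() - start
print(f"Constrained generation took {elapsed:.2f}s for schema {Invoice.__name__}")
```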
Complete Example
Try the full working example in our Colab notebook:
Schema-Enforced Pipeline Notebook
Complete code walkthrough with invoice extraction example