Why These Benchmarks Matter
Traditional OCR metrics like Word Error Rate (WER) and Character Error Rate (CER) don’t predict production success. A document can achieve 99% text similarity while completely failing if:
- Tables collapse into flat text, destroying data relationships
 - Reading order is scrambled, corrupting RAG context
 - Critical fields are missing (invoice totals, ID numbers)
 - Charts and figures are ignored, losing key visual information
 
Our Evaluation Framework
Two-Stage Methodology
We mirror real-world workflows with a two-stage evaluation process:
Stage 1: Document → OCR (Structural Preservation)
- Models generate Markdown/HTML output
 - Evaluated using TEDS (Tree Edit Distance Similarity)
 - Measures preservation of reading order, table integrity, and layout coherence
 
Stage 2: OCR → Structured Data (Extraction Accuracy)
- Markdown passed through a standardized LLM (GPT-4o) with predefined schemas
 - Evaluated using JSON F1 (Field-Level Precision and Recall)
 - Isolates how OCR quality impacts real extraction workflows
 
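To make the split concrete, here is a minimal sketch of the per-page evaluation loop. The callables are supplied by the harness: `parse_document` stands in for whichever OCR/parsing model is under test, `llm_extract` for the standardized GPT-4o + schema step, and `compute_teds` / `json_f1` are sketched under the metric sections below. None of these names are real Tensorlake APIs.

```python
# Minimal sketch of the two-stage evaluation for a single page.
# All four callables are passed in by the harness; the names are illustrative.
def evaluate_page(page_image, gold_markdown, gold_fields, schema,
                  parse_document, llm_extract, compute_teds, json_f1):
    # Stage 1: document -> Markdown/HTML, scored for structural preservation
    predicted_markdown = parse_document(page_image)
    teds = compute_teds(predicted_markdown, gold_markdown)

    # Stage 2: Markdown -> structured JSON via the fixed LLM + schema,
    # scored with field-level precision/recall/F1
    predicted_fields = llm_extract(predicted_markdown, schema)
    precision, recall, f1 = json_f1(predicted_fields, gold_fields)

    return {"teds": teds, "precision": precision, "recall": recall, "json_f1": f1}
```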
Key Metrics Explained
TEDS (Tree Edit Distance Similarity)
- Compares predicted vs. ground-truth Markdown/HTML tree structures
 - Captures structural fidelity in tables and complex layouts
 - Widely adopted in OCRBench v2 and OmniDocBench
 - Answers: “Is this table still a table?” Not just “Is the text similar?”
 
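For intuition, here is a structure-only sketch of the idea. The reference TEDS implementations used by OCRBench v2 and OmniDocBench also weigh cell text and use the APTED algorithm; this sketch assumes well-formed markup and the open-source `zss` (Zhang-Shasha) library.

```python
# Structure-only TEDS sketch: parse two table markups into trees, take the
# tree edit distance, and normalise by the larger tree size.
# Assumes well-formed markup (xml.etree) and the `zss` package (pip install zss).
import xml.etree.ElementTree as ET
from zss import Node, simple_distance

def to_zss_tree(element):
    """Label each node with its tag name and recurse over children."""
    node = Node(element.tag)
    for child in element:
        node.addkid(to_zss_tree(child))
    return node

def tree_size(element):
    return 1 + sum(tree_size(child) for child in element)

def teds_structure(pred_markup: str, gold_markup: str) -> float:
    pred, gold = ET.fromstring(pred_markup), ET.fromstring(gold_markup)
    distance = simple_distance(to_zss_tree(pred), to_zss_tree(gold))
    # 1.0 means structurally identical; lower values mean the layout diverged
    return 1.0 - distance / max(tree_size(pred), tree_size(gold))

# A table whose rows were merged scores well below 1.0,
# even though the raw text would look "99% similar"
print(teds_structure(
    "<table><tr><td>Qty</td><td>Price</td></tr></table>",
    "<table><tr><td>Qty</td></tr><tr><td>Price</td></tr></table>",
))
```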
JSON F1 Score
- Precision: Correctness of extracted fields
 - Recall: Completeness of required field capture
 - F1: Harmonic mean balancing both
 - Answers: “Can automation use this data?” Not just “Is text present?”
 
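A field-level scorer can be sketched in a few lines, assuming flat {field: value} dictionaries. The invoice fields below are illustrative; a production scorer would also normalize values (dates, currency) and flatten nested schemas before comparing.

```python
# Minimal sketch of field-level JSON F1 over flat {field: value} dicts.
def json_f1(predicted: dict, gold: dict):
    # A field counts as correct only if it is present and its value matches
    true_pos = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

gold = {"invoice_number": "INV-001", "total_amount": "1,250.00", "due_date": "2024-05-01"}
pred = {"invoice_number": "INV-001", "total_amount": "1,250.00"}   # due_date missed
print(json_f1(pred, gold))   # perfect precision, 2/3 recall, F1 = 0.8
```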
Public Benchmark Results
Document Parsing (English) - OCRBench v2
Evaluated on 400 images from the public OCRBench v2 dataset, measuring both structural preservation and text accuracy:
Table Parsing - OmniDocBench
Evaluated on 512 document images with tables from OmniDocBench (CVPR-accepted benchmark):

| Model | TEDS | TEDS (Structure Only) |
|---|---|---|
| Marker¹ | 57.88% | 71.17% | 
| Docling | 63.84% | 77.68% | 
| Azure | 78.14% | 83.61% | 
| Textract | 80.75% | 88.78% | 
| Tensorlake | 86.79% | 90.62% | 
Enterprise Document Performance
Real-World Dataset (100 Pages)
We evaluated on 100 document pages spanning banking, retail, and insurance sectors. This represents actual production workloads: invoices with water damage, scanned contracts with skewed text, bank statements with multi-level tables.
- Tensorlake achieves 91.7% F1, demonstrating that superior OCR quality feeds better extraction
 - The gap between 91.7% and 68.9% F1 is massive: (0.917 − 0.689) × 20 ≈ 4.6, so roughly 5 extra fields correctly extracted out of every 20
 - In production processing thousands of documents daily, this accuracy gap compounds into significant error reduction
 
Production Impact Example
For an insurance claims processor handling 10,000 documents per month:
- At 85% F1: 1,500 documents require manual review
 - At 90% F1: 1,000 documents require manual review
 - At 91.7% F1 (Tensorlake): 830 documents require manual review
 
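These figures follow from a simple assumption: the share of documents routed to manual review scales with (1 − F1). A quick back-of-envelope check:

```python
# Back-of-envelope calculation behind the review volumes above:
# route (1 - F1) of the monthly volume to manual review.
monthly_volume = 10_000

for f1 in (0.85, 0.90, 0.917):
    needs_review = round(monthly_volume * (1 - f1))
    print(f"F1 {f1:.1%}: ~{needs_review:,} documents routed to manual review")
# F1 85.0%: ~1,500  |  F1 90.0%: ~1,000  |  F1 91.7%: ~830
```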
Cost & Performance Comparison
Accuracy without affordability isn’t practical. Here’s the complete picture:

| Provider | Cost/1,000 Pages | TEDS | JSON F1 |
|---|---|---|---|
| Docling (open-source) | Free* | 63.3% | 68.9% | 
| Marker (open-source) | $6 | 71.1% | 71.2% | 
| Azure Document Intelligence | $10 | 78.6% | 88.1% | 
| AWS Textract | $15 | 81.0% | 88.4% | 
| Tensorlake | $10 | 84.1% | 91.7% | 
Visual Comparison: Where Competitors Fail
Example: Contact Information Extraction
When parsing Section 21 (NOTICES) of a real estate contract:
One competitor’s output:
- Missing opening parenthesis in phone number
- Two-column layout collapsed into a confusing single column

Another competitor’s output:
- Completely wrong phone number in buyer field (shows seller’s phone)
- Buyer’s phone (123)456-7890 entirely missing

Tensorlake’s output:
- Perfect extraction of both phone numbers: (123)456-7890 and (456)789-1234
- Two-column structure preserved with clear buyer/seller separation
- All contact fields accurately captured
 
Why Tensorlake Wins: Multi-Modal Understanding
Documents communicate through more than text. Tensorlake’s multi-modal approach captures:

For RAG Applications
- Chart summarization: Converts figures into descriptive text for retrieval
 - Visual content capture: a research paper’s Figure 3 becomes “Bar chart showing 15% performance improvement across three benchmark datasets”
 - Preserved context: Reading order maintained for accurate semantic retrieval
 
For Workflow Automation
- Signature detection: Identifies stamps, signatures, and annotations
 - Form understanding: Preserves spatial relationships in complex layouts
 - Table integrity: Multi-level tables maintain hierarchical structure
 
The Tensorlake Advantage
- Superior OCR Performance - Best-in-class recognition on degraded, scanned documents representing real-world conditions
 - Reading Order Preservation - Ensures pipelines process documents in logical sequence (critical for RAG)
 - Spatial Structure Integrity - Tables stay tables, forms stay forms
 - Multi-Modal Parsing - Captures figures, charts, signatures, and annotations
 - Rigorous Evaluation - Public benchmarks + private enterprise datasets
 
Ground Truth & Reproducibility
Public Datasets
- OCRBench v2: We audited and corrected inconsistencies in published ground truth
 - OmniDocBench: CVPR-accepted benchmark, using v1.5 evaluation code
 
JSON Schema Generation
- Initial schemas generated via Gemini 2.5 Pro
 - Human reviewers audit and correct all gold standards
 - Ensures high-quality, unbiased evaluation
 
Reproducibility
To reproduce our table results:
- Generate Markdown outputs using the models listed above
- Run the evaluation from the OmniDocBench repository
- Use the table subset (512 images) with the v1.5 evaluation code