Our Evaluation Framework
Two-Stage Methodology
We mirror real-world workflows with a two-stage evaluation process:

Stage 1: Document Reading Abilities (OCR and Structural Preservation)
- Models generate Markdown/HTML output
- Evaluated using TEDS (Tree Edit Distance Similarity)
- Compares predicted Markdown against the ground truth, measuring structural fidelity in tables and complex layouts (see the sketch below)
- Answers: “Is this table still a table?” Not just “Is the text similar?”
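To make the metric concrete, here is a minimal sketch of TEDS, assuming tables serialized in bracket notation and scored with the `apted` tree edit distance package. The official benchmark scoring code (which operates on full HTML trees) is the source of truth; this is only an illustration.

```python
# Minimal TEDS sketch (assumes `pip install apted`). Illustrative only;
# the benchmark's official scoring code is authoritative.
from apted import APTED
from apted.helpers import Tree


def teds(pred_brackets: str, gt_brackets: str) -> float:
    """Tree Edit Distance Similarity between two trees in bracket notation,
    e.g. "{table{tr{td}{td}}{tr{td}{td}}}" for a 2x2 table."""
    pred, gt = Tree.from_text(pred_brackets), Tree.from_text(gt_brackets)
    distance = APTED(pred, gt).compute_edit_distance()
    # Normalize by the size of the larger tree: 1.0 means identical structure.
    n_pred, n_gt = pred_brackets.count("{"), gt_brackets.count("{")
    return 1.0 - distance / max(n_pred, n_gt, 1)


if __name__ == "__main__":
    gt = "{table{tr{td}{td}}{tr{td}{td}}}"   # 2x2 table
    pred = "{table{tr{td}{td}}{tr{td}}}"     # second row lost a cell
    print(f"TEDS = {teds(pred, gt):.3f}")
```

Because the score is computed on the tree, dropping a cell or flattening a row lowers TEDS even when every character of text is recognized correctly.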
Stage 2: Structured Extraction
- The Markdown output is passed through a standardized LLM (GPT-4o) with predefined schemas
- Evaluated using JSON F1 (Field-Level Precision and Recall)
- Isolates how OCR quality impacts real extraction workflows
- Precision measures the correctness of extracted fields; recall measures the completeness of required-field capture
- The F1 score combines both metrics for a holistic view (see the sketch below)
- Answers: “Can automation use this data?” Not just “Is text present?”
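As a minimal sketch of the field-level scoring, assuming flat key-value JSON and exact-match comparison of values (the `field_f1` helper and the invoice fields are illustrative, not the benchmark's actual scorer):

```python
# Minimal field-level F1 sketch; assumes flat JSON objects and exact-match
# comparison of field values. Illustrative only, not the benchmark scorer.
def field_f1(predicted: dict, ground_truth: dict) -> dict:
    # True positives: fields present in both with matching values.
    tp = sum(1 for k, v in ground_truth.items() if predicted.get(k) == v)
    precision = tp / len(predicted) if predicted else 0.0    # correctness of extracted fields
    recall = tp / len(ground_truth) if ground_truth else 0.0  # completeness of required fields
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    gt = {"invoice_number": "INV-1042", "total": "1,250.00",
          "currency": "USD", "due_date": "2024-07-01"}
    pred = {"invoice_number": "INV-1042", "total": "1,250.00",
            "currency": "USO"}  # one OCR error, one missing field
    print(field_f1(pred, gt))   # precision 0.667, recall 0.5, f1 ~0.571
```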
Document Reading Benchmark Results
Datasets
- OCRBench v2: 400 diverse document images (invoices, contracts, forms), measuring overall structural and text accuracy. The data was audited to ensure consistent ground truth.
- OmniDocBench: 512 document images with complex tables, focusing on table parsing capabilities. We use the v1.5 evaluation code from the official repository.

Table Parsing
Evaluated on 512 document images with tables from OmniDocBench (CVPR-accepted benchmark):
Structured Extraction Benchmark Results
Datasets used:
- We collected 100 document pages of proprietary data spanning banking, retail, and insurance sectors. This represents actual production workloads: invoices with water damage, scanned contracts with skewed text, bank statements with multi-level tables.
- Ground truth schemas were generated using Gemini Pro 2.5 and audited by human reviewers to ensure accuracy.
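To make the schema-driven extraction step from Stage 2 concrete, here is a hedged sketch of passing parsed Markdown through GPT-4o with a predefined schema via the OpenAI chat completions API. The example schema, prompt, and JSON-mode setup are illustrative assumptions, not the exact benchmark pipeline.

```python
# Illustrative sketch of the Stage 2 extraction step: parsed Markdown plus a
# predefined schema goes to GPT-4o, which returns JSON for field-level scoring.
# The schema and prompt here are assumptions, not the benchmark's exact setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INVOICE_SCHEMA = {
    "invoice_number": "string",
    "invoice_date": "string (YYYY-MM-DD)",
    "total_amount": "string",
    "currency": "string (ISO 4217)",
}


def extract_fields(markdown: str, schema: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # force valid JSON output
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Extract the requested fields from the document. "
                        "Return JSON matching this schema, using null for "
                        f"missing fields: {json.dumps(schema)}"},
            {"role": "user", "content": markdown},
        ],
    )
    return json.loads(response.choices[0].message.content)
```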

- Tensorlake achieves 91.7% F1, demonstrating that superior OCR quality feeds better extraction
- The gap between 91.7% and 68.9% F1 is massive: roughly 5 extra fields correctly extracted out of every 20
- For production workloads processing thousands of documents daily, this accuracy gap compounds into significant error reduction
Production Impact Example
For an insurance claims processor handling 10,000 documents per month:
- At 85% F1: 1,500 documents require manual review
- At 90% F1: 1,000 documents require manual review
- At 91.7% F1 (Tensorlake): 830 documents require manual review
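These counts follow from a simple model: assume every field-level error sends a document to manual review, so the review load is roughly (1 - F1) times monthly volume. A minimal sketch of the arithmetic:

```python
# Manual-review estimate, assuming every field-level error triggers a review:
# reviews ~= (1 - F1) * monthly volume. Illustrative model, not measured data.
def manual_reviews(f1: float, monthly_volume: int = 10_000) -> int:
    return round((1 - f1) * monthly_volume)


for f1 in (0.85, 0.90, 0.917):
    print(f"F1 = {f1:.1%}: ~{manual_reviews(f1):,} documents need manual review")
# F1 = 85.0%: ~1,500 | F1 = 90.0%: ~1,000 | F1 = 91.7%: ~830
```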
Cost & Performance Comparison
Accuracy without affordability isn’t practical. Here’s the complete picture:

| Provider | Cost/1,000 Pages | TEDS | JSON F1 |
|---|---|---|---|
| Docling (open-source) | Free* | 63.3% | 68.9% |
| Marker (open-source) | Free* | 71.1% | 71.2% |
| Azure Document Intelligence | $10 | 78.6% | 88.1% |
| AWS Textract | $15 | 81.0% | 88.4% |
| Tensorlake | $10 | 84.1% | 91.7% |
Visual Comparison: Where Competitors Fail
Example: Contact Information Extraction
When parsing Section 21 (NOTICES) of a real estate contract:
Reproducibility
To reproduce our table results:
- Generate Markdown outputs using the models listed above
- Run the evaluation from the OmniDocBench repository
- Use the document data with tables (512 images) with the v1.5 code version
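For a quick local sanity check outside the official pipeline, the sketch below averages TEDS over paired prediction and ground-truth files. The bracket-notation inputs, directory layout, and filename matching are assumptions; only the official OmniDocBench v1.5 code produces numbers comparable to the table above.

```python
# Quick sanity check only: averages TEDS over paired prediction / ground-truth
# files in bracket notation. The directory layout and file format are
# assumptions; use the official OmniDocBench v1.5 evaluation for real numbers.
from pathlib import Path

from apted import APTED
from apted.helpers import Tree


def teds(pred_brackets: str, gt_brackets: str) -> float:
    distance = APTED(Tree.from_text(pred_brackets),
                     Tree.from_text(gt_brackets)).compute_edit_distance()
    return 1.0 - distance / max(pred_brackets.count("{"),
                                gt_brackets.count("{"), 1)


def mean_teds(pred_dir: str, gt_dir: str) -> float:
    scores = []
    for gt_path in sorted(Path(gt_dir).glob("*.txt")):
        pred_path = Path(pred_dir) / gt_path.name  # same filename in both dirs
        if pred_path.exists():
            scores.append(teds(pred_path.read_text(), gt_path.read_text()))
        else:
            scores.append(0.0)                     # missing prediction scores zero
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    print(f"mean TEDS: {mean_teds('predictions', 'ground_truth'):.3f}")
```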
Deep Dive: Full Benchmark Analysis
Read our comprehensive blog post: The Document Parsing Benchmark That Actually Matters

The blog includes:
- Detailed failure mode analysis
- Additional benchmark datasets
- Technical methodology deep-dive
- Production deployment case studies
- Code examples and reproducibility guides