Document Ingestion Quickstart
The most basic use cases of the Document Ingestion API are:
- Converting a document to Markdown for feeding into an LLM.
- Extracting structured data from a document according to a JSON schema.
In this guide, you will convert a real estate purchase agreement to markdown chunks and extract structured data from it using a schema.
The example can be run in a Google Colab notebook.
Step-by-Step Guide
Prerequisites
- Python 3.10+
- A Tensorlake API key
Install the SDK
pip install tensorlake
Set your API key
Export the variable. The SDK looks for TENSORLAKE_API_KEY in your environment variables:
export TENSORLAKE_API_KEY=your-api-key-here
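If the variable is missing, the script may only fail later when the first request is made. A minimal sketch of an early check is shown below; the helper name require_api_key is hypothetical and is not part of the Tensorlake SDK.

```python
import os


def require_api_key(env=os.environ, name="TENSORLAKE_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the Tensorlake SDK.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the example")
    return key


if __name__ == "__main__":
    try:
        print("API key found, length:", len(require_api_key()))
    except RuntimeError as err:
        print(err)
```

The returned value can then be passed as api_key when constructing the client, in place of calling os.getenv directly.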
Parse a document
import json
import os
from typing import Optional
from pydantic import BaseModel, Field
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models import ParsingOptions, StructuredExtractionOptions
from tensorlake.documentai.models.enums import ChunkingStrategy
doc_ai = DocumentAI(api_key=os.getenv("TENSORLAKE_API_KEY"))
# Use a publicly accessible URL or upload a file to Tensorlake and use the file ID.
file_url = "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf"
# Define a JSON schema using Pydantic.
# The structured extraction model will identify the properties we want to extract from the document.
# In this case, we are extracting the names and signature dates of the buyer and seller.
class Signers(BaseModel):
    buyer_name: Optional[str] = Field(
        default=None, description="The name of the buyer, do not extract initials"
    )
    buyer_signature_date: Optional[str] = Field(
        default=None, description="Date and time that the buyer signed."
    )
    seller_name: Optional[str] = Field(
        default=None, description="The name of the seller, do not extract initials"
    )
    seller_signature_date: Optional[str] = Field(
        default=None, description="Date and time that the seller signed."
    )

# Create a structured extraction options object with the schema.
#
# You can send as many schemas as you want, and the API will return structured data for each schema,
# indexed by the schema name.
real_estate_agreement_extraction_options = StructuredExtractionOptions(
    schema_name="Signers",
    json_schema=Signers,
)

# Tune the options for finding data in the document.
#
# We provide sane defaults, but every document is different, so you can adjust
# every option to your needs.
#
# In this example, we are using the PAGE chunking strategy, which means that each page of the document will be a separate chunk.
parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.PAGE,
)

# Submit the parse operation. The call returns a parse ID; waiting for the job happens in the next step.
parse_id = doc_ai.parse(
    file=file_url,
    page_range="9-10",
    parsing_options=parsing_options,
    structured_extraction_options=[real_estate_agreement_extraction_options],
)
Wait for the job to complete
result = doc_ai.wait_for_completion(parse_id)
Use the results
# Write each markdown chunk to a file, labeled with its chunk number and page number.
markdown_chunks = result.chunks
with open("markdown_chunks.md", "w") as f:
    for chunk_number, chunk in enumerate(markdown_chunks):
        f.write(f"## CHUNK NUMBER {chunk_number}\n\n")
        f.write(f"## Page {chunk.page_number}\n\n{chunk.content}\n\n")

# Write the structured data extracted for each schema to a JSON file.
serializable_data = result.model_dump()
with open("structured_data.json", "w") as f:
    json.dump(serializable_data["structured_data"], f, indent=2)
Prerequisites
- A Tensorlake API key
Parse a document
async function parseFileUrl(fileUrl, tensorlakeApiKey) {
  // JSON schema describing the fields to extract from the document.
  const signersSchema = {
    title: "Signers",
    type: "object",
    properties: {
      buyerName: {
        type: "string",
        description: "The name of the buyer, do not extract initials",
        title: "Buyer Name"
      },
      buyerSignatureDate: {
        type: "string",
        description: "Date and time that the buyer signed.",
        title: "Buyer Signature Date"
      },
      sellerName: {
        type: "string",
        description: "The name of the seller, do not extract initials",
        title: "Seller Name"
      },
      sellerSignatureDate: {
        type: "string",
        description: "Date and time that the seller signed.",
        title: "Seller Signature Date"
      }
    }
  };

  const realEstateAgreementExtractionOptions = {
    schema_name: "Signers",
    json_schema: signersSchema,
  };

  // Each page of the document becomes a separate chunk.
  const parsingOptions = {
    chunking_strategy: "page",
  };

  const body = {
    file_url: fileUrl,
    page_range: "9-10",
    parsing_options: parsingOptions,
    structured_extraction_options: [realEstateAgreementExtractionOptions],
  };

  const options = {
    method: 'POST',
    headers: {
      Authorization: `Bearer ${tensorlakeApiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  };

  const response = await fetch(
    'https://api.tensorlake.ai/documents/v2/parse',
    options
  );
  const result = await response.json();
  console.log('result:', JSON.stringify(result, null, 2));
  return result.jobId;
}

// Use a publicly accessible URL for the document.
const fileUrl =
  'https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf';
const tensorlakeApiKey =
  'your-tensorlake-api-key-here';
const parseId = await parseFileUrl(fileUrl, tensorlakeApiKey);
Poll for parse completion and write the results
import { writeFileSync } from 'fs';

// Write the structured data and the markdown chunks to local files.
function writeParseResults(jobResult) {
  const structuredData = jobResult.structured_data;
  const markdownChunks = jobResult.chunks;
  writeFileSync(
    'structured_data.json',
    JSON.stringify(structuredData, null, 2)
  );
  let markdownContent = '';
  markdownChunks.forEach((chunk, chunkNumber) => {
    markdownContent += `## CHUNK NUMBER ${chunkNumber}\n\n`;
    markdownContent += `## Page ${chunk.page_number}\n\n${chunk.content}\n\n`;
  });
  writeFileSync('markdown_chunks.md', markdownContent);
}

// Poll the parse job every 5 seconds until it leaves the pending/processing states.
async function getParseResults(parseId, tensorlakeApiKey) {
  while (true) {
    const response = await fetch(
      `https://api.tensorlake.ai/documents/v2/parse/${parseId}`,
      {
        method: 'GET',
        headers: {
          Authorization: `Bearer ${tensorlakeApiKey}`,
          'Content-Type': 'application/json',
        },
      }
    );
    if (!response.ok) {
      console.error(`Error fetching job: ${response.statusText}`);
      return;
    }
    const result = await response.json();
    if (result.status === 'pending' || result.status === 'processing') {
      console.log('waiting 5s...');
      await new Promise((resolve) => setTimeout(resolve, 5000));
      console.log(`job status: ${result.status}`);
    } else {
      if (result.status === 'successful') {
        console.log(result);
        writeParseResults(result);
        return result;
      } else {
        console.error(`Job finished with status: ${result.status}`);
        return result;
      }
    }
  }
}

const parseId = 'your-parse-id-here';
const tensorlakeApiKey = 'your-tensorlake-api-key-here';
await getParseResults(parseId, tensorlakeApiKey);
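The polling loop above runs until the job leaves the pending or processing state, with no upper bound on the wait. As a rough sketch, the same pattern with a timeout could look like the following in Python. The function wait_for_parse and its fetch_status argument are hypothetical names for illustration; fetch_status stands in for whatever call retrieves the job (for example, the GET request shown above) and is assumed to return a dict with a "status" key.

```python
import time


def wait_for_parse(fetch_status, parse_id, interval_s=5.0, timeout_s=300.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(parse_id) until the job is no longer pending/processing.

    fetch_status is a hypothetical callable returning a dict with a "status" key.
    Raises TimeoutError if the job is still running once timeout_s has elapsed.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch_status(parse_id)
        if result["status"] not in ("pending", "processing"):
            # "successful" or a failure status; the caller decides what to do.
            return result
        if clock() >= deadline:
            raise TimeoutError(
                f"parse {parse_id} still {result['status']} after {timeout_s}s"
            )
        sleep(interval_s)


if __name__ == "__main__":
    # Simulated job that succeeds on the third poll; no network access needed.
    responses = iter([
        {"status": "pending"},
        {"status": "processing"},
        {"status": "successful"},
    ])
    result = wait_for_parse(lambda _id: next(responses), "parse-123",
                            sleep=lambda seconds: None)
    print(result["status"])  # prints "successful"
```

Passing sleep and clock as parameters is a design choice that keeps the loop testable without real delays; in normal use the defaults apply.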
The output you will see
When parsing is complete, you will see two files, structured_data.json and markdown_chunks.md, containing the structured data and the markdown chunks.
{
  "Signers": [
    {
      "data": {
        "buyer_name": "Nova Ellison",
        "buyer_signature_date": "September 10, 2025",
        "seller_name": "Juno Vegi",
        "seller_signature_date": "September 10, 2025"
      },
      "page_numbers": [
        9,
        10
      ],
      "schema_name": "Signers"
    }
  ]
}
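The structured data is keyed by schema name, and each entry carries the extracted fields under "data" along with the pages they came from. A minimal sketch of reading that shape is shown below. The helper name signers_from_structured_data is hypothetical, and the sample is copied from the output above rather than read from disk so the snippet runs on its own.

```python
import json

# Sample copied from the structured_data.json output shown above.
SAMPLE = """
{
  "Signers": [
    {
      "data": {
        "buyer_name": "Nova Ellison",
        "buyer_signature_date": "September 10, 2025",
        "seller_name": "Juno Vegi",
        "seller_signature_date": "September 10, 2025"
      },
      "page_numbers": [9, 10],
      "schema_name": "Signers"
    }
  ]
}
"""


def signers_from_structured_data(structured_data, schema_name="Signers"):
    """Return the "data" dict of every entry stored under schema_name.

    Hypothetical helper for illustration; returns an empty list if the
    schema name is absent.
    """
    return [entry["data"] for entry in structured_data.get(schema_name, [])]


if __name__ == "__main__":
    for signer in signers_from_structured_data(json.loads(SAMPLE)):
        print(signer["buyer_name"], "/", signer["seller_name"])
```

To use the real file instead of the sample, load it with json.load on an open handle to structured_data.json.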
## CHUNK NUMBER 0
## Page 9
relationships in accordance with any agreement(s) made with licensed real estate agent(s). Seller has read and acknowledges receipt of a copy of this Agreement and authorizes any licensed real estate agent(s) to deliver a signed copy to the Buyer.
Delivery may be in any of the following: (i) hand delivery; (ii) email under the condition that the Party transmitting the email receives electronic confirmation that the email was received to the intended recipient; and (iii) by facsimile to the other Party or the other Party’s licensee, but only if the transmitting fax machine prints a confirmation that the transmission was successful.
XXX. LICENSED REAL ESTATE AGENT(S). If Buyer or Seller have hired the services of licensed real estate agent(s) to perform representation on their behalf, he/she/they shall be entitled to payment for their services as outlined in their separate written agreement.
XXXI. DISCLOSURES. It is acknowledged by the Parties that: (check one)
- There are no attached addendums or disclosures to this Agreement.
- The following addendums or disclosures are attached to this Agreement: (check all that apply)
-
Lead-Based Paint Disclosure Form [ ]
- [ ]
- [ ]
- [ ]
-
-
-
XXXII. ADDITIONAL TERMS AND CONDITIONS.
None
XXXIII. ENTIRE AGREEMENT. This Agreement together with any attached addendums or disclosures shall supersede any and all other prior understandings and agreements, either oral or in writing, between the Parties with respect to the subject matter hereof and shall constitute the sole and only agreements between the Parties with respect to the said Property. All prior negotiations and agreements between the Parties with respect to the Property hereof are merged into this Agreement. Each Party to this Agreement acknowledges that no representations, inducements, promises, or agreements, orally or otherwise, have been made by any Party or by anyone acting on behalf of any Party, which are not embodied in this Agreement and that any agreement, statement or promise that is not contained in this Agreement shall not be valid or binding or of any force or effect.
e
Buyer's Initials NE
-
Seller's Initials JV.
Page 9 of 10
## CHUNK NUMBER 1
## Page 10
XXXIV. EXECUTION.
| | |
|----------------------------------------------------------------------------------|--------------------|
| Buyer Signature: Nova Ellison Date: Print Name: Nova Ellison | September 10, 2025 |
| Buyer Signature: Date: Print Name: | |
| Seller Signature: Juno Vegi Date: Print Name: J uno Vega | September 10, 2025 |
| Seller Signature: Date: Print Name: | |
| Agent Signature: Aster Polaris Date: Print Name: Aster Polaris Polaris Group LLC | September 10, 2025 |
| Agent Signature: Date: Print Name: | |
e
Page 10 of 10
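Because the example writes a "## CHUNK NUMBER n" header before each chunk, markdown_chunks.md can be split back into individual chunks later, for instance to feed them to an LLM one at a time. The sketch below assumes that header format exactly as written by the example; split_chunks is a hypothetical helper name.

```python
import re

# Matches the per-chunk header written by the quickstart example.
CHUNK_HEADER = re.compile(r"^## CHUNK NUMBER (\d+)[ \t]*$", re.MULTILINE)


def split_chunks(markdown_text):
    """Map chunk number -> chunk text, splitting on '## CHUNK NUMBER n' headers."""
    parts = CHUNK_HEADER.split(markdown_text)
    # With one capturing group, split() yields:
    # [text before first header, "0", body0, "1", body1, ...]
    return {int(num): body.strip() for num, body in zip(parts[1::2], parts[2::2])}


if __name__ == "__main__":
    sample = (
        "## CHUNK NUMBER 0\n\n## Page 9\n\nfirst page text\n\n"
        "## CHUNK NUMBER 1\n\n## Page 10\n\nsecond page text\n\n"
    )
    for number, text in split_chunks(sample).items():
        print(number, repr(text))
```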