Parse Documents - Tensorlake

cURL

curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/parse \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "parsing_options": {
    "table_output_mode": "html",
    "table_parsing_format": "tsr",
    "chunking_strategy": "none",
    "signature_detection": false,
    "remove_strikethrough_lines": false,
    "skew_detection": false,
    "disable_layout_detection": false,
    "ignore_sections": [],
    "cross_page_header_detection": false
  },
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<any>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ]
    }
  ],
  "page_classifications": [
    {
      "name": "<string>",
      "description": "<string>"
    }
  ],
  "enrichment_options": {
    "table_summarization": false,
    "table_summarization_prompt": null,
    "figure_summarization": false,
    "figure_summarization_prompt": null
  },
  "file_id": "<string>",
  "file_url": "https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf",
  "raw_text": "<string>",
  "mime_type": null,
  "page_range": "<string>",
  "labels": {
    "priority": "high",
    "source": "email"
  }
}'

{
  "parse_id": "<string>",
  "created_at": "<string>"
}

POST /documents/v2/parse


Submit an uploaded file, an internet-reachable URL, or raw text for document parsing. If you have configured a webhook, we will notify you when the job completes, whether it succeeds or fails. The API converts the document into Markdown and provides document layout information. You can also classify pages into categories and perform structured extraction using JSON Schema. Once the job is submitted, the API returns a parse response with a parse_id field. Use the Get Parse Result endpoint to query the status and results of the parse operation.
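The submission flow described above can be sketched in Python using only the standard library. This is a minimal illustration, not an official client: the URL and headers come from the cURL example, while the function names build_parse_request and submit_parse_job are hypothetical helpers introduced here.

```python
import json
import urllib.request

# Endpoint taken from the cURL example on this page.
PARSE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_parse_request(token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a parse job."""
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_parse_job(token: str, body: dict) -> str:
    """Send the request and return the parse_id from the JSON response."""
    with urllib.request.urlopen(build_parse_request(token, body)) as resp:
        return json.load(resp)["parse_id"]
```

Keeping request construction separate from sending makes it possible to inspect the payload and headers before any network call. The returned parse_id is the value you would pass to the Get Parse Result endpoint.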

Using a file

When submitting a parse job, you can provide the content of the file in one of three ways:

file_id: The ID of a file previously uploaded through the Upload Files endpoint. This is the most common method.
file_url: A publicly accessible URL that points to the file you want to parse. The API downloads the file from this URL. Redirects are supported, but both the URL and the target of the Location header must point to a publicly accessible file.
raw_text: Raw text content, for structured extraction from non-file sources such as emails, HTML, CSV, or XML.
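As an illustration of the three options above, the following sketch builds the content portion of the request body. It assumes the three fields are mutually exclusive, which the wording "one of three ways" suggests but does not state outright. The helper name content_source and the sample file ID are hypothetical.

```python
def content_source(file_id=None, file_url=None, raw_text=None) -> dict:
    """Return the single content field to merge into the request body.

    Assumption: exactly one of file_id, file_url, or raw_text is sent.
    """
    candidates = {"file_id": file_id, "file_url": file_url, "raw_text": raw_text}
    provided = {key: value for key, value in candidates.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("provide exactly one of file_id, file_url, or raw_text")
    return provided


# Example: reference a previously uploaded file (the ID is a placeholder).
body = {**content_source(file_id="file_123"), "page_range": "1-3"}
```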

Supported mime types

The following mime types are supported for file parsing:

text/plain: Plain text files (default for raw_text).
text/csv: CSV files.
text/html: HTML files.
application/pdf: PDF files.
application/vnd.openxmlformats-officedocument.wordprocessingml.document: DOCX files.
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet: XLSX files.
application/vnd.ms-excel.sheet.macroEnabled.12: XLSM files.
application/vnd.openxmlformats-officedocument.presentationml.presentation: PPTX files.
application/vnd.ms-excel: XLS files.
image/jpeg: JPEG images.
image/png: PNG images.

The API attempts to detect the mime type automatically from the file extension. You can provide a mime_type field to override the inferred mime type, which is useful if you know the content type of the file and want to ensure the model interprets it correctly. For the raw_text method, you must specify the mime_type field to indicate the type of content you are providing, so that the model can interpret the text correctly.
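The sketch below shows one way to pick a mime_type on the client side before submitting. The mime types are the ones listed above; the extension mapping and the helper names mime_type_for and raw_text_body are assumptions made for illustration, not part of the API.

```python
import os

# Mime types from the supported list; the extension keys are an assumption.
SUPPORTED_MIME_TYPES = {
    ".txt": "text/plain",
    ".csv": "text/csv",
    ".html": "text/html",
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xlsm": "application/vnd.ms-excel.sheet.macroEnabled.12",
    ".pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    ".xls": "application/vnd.ms-excel",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
}


def mime_type_for(filename: str):
    """Return the supported mime type for a filename, or None if unknown."""
    extension = os.path.splitext(filename)[1].lower()
    return SUPPORTED_MIME_TYPES.get(extension)


def raw_text_body(text: str, mime_type: str) -> dict:
    """Build a raw_text request body with an explicit mime_type."""
    if mime_type not in SUPPORTED_MIME_TYPES.values():
        raise ValueError(f"unsupported mime type: {mime_type}")
    return {"raw_text": text, "mime_type": mime_type}
```

For example, raw_text_body("<p>Invoice 42</p>", "text/html") yields a body that tells the API to treat the text as HTML.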

Page classification

You can classify the pages of a document into categories (tags). In the page_classifications field, pass an array of categories, each with a description that guides the classifier. The API returns the page class for each page of the document.
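A small sketch of building the page_classifications array follows. The name and description keys match the request example on this page; the helper name and the sample categories are hypothetical.

```python
def page_classifications(categories: dict) -> list:
    """Turn a {name: description} mapping into the page_classifications array."""
    return [
        {"name": name, "description": description}
        for name, description in categories.items()
    ]


# Sample categories, invented for illustration.
classes = page_classifications({
    "signature_page": "Pages that contain signature blocks for buyer or seller.",
    "terms": "Pages that list the terms and conditions of the purchase.",
})
```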

Structured extraction

For structured extraction, you can provide one or more schemas to guide the extraction process. Each schema must be a JSON Schema object, supplied as an entry in the structured_extraction_options array. Known limitations include:

The schema can be at most 5 levels deep.
All fields must be required.
Root-level fields must be objects.
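The limitations above can be checked locally before submitting a job. The following is a rough sketch under stated assumptions, not the API's own validation: it counts the root object as level 1 and each nested object as one more level, it does not count arrays as a level, and the function name schema_problems is hypothetical.

```python
def schema_problems(schema: dict, max_depth: int = 5) -> list:
    """Return human-readable problems found in a JSON Schema object."""
    problems = []

    def walk(node, path, depth):
        if not isinstance(node, dict):
            return
        if node.get("type") == "array":
            # Assumption: an array does not add a nesting level by itself.
            walk(node.get("items"), path + "[]", depth)
            return
        if node.get("type") != "object":
            return
        if depth > max_depth:
            problems.append(f"{path} nests deeper than {max_depth} levels")
            return
        properties = node.get("properties", {})
        required = set(node.get("required", []))
        for name in sorted(set(properties) - required):
            problems.append(f"{path}.{name} is not listed in required")
        for name, child in properties.items():
            walk(child, f"{path}.{name}", depth + 1)

    # Root-level fields must themselves be objects.
    for name, child in schema.get("properties", {}).items():
        if not isinstance(child, dict) or child.get("type") != "object":
            problems.append(f"root field {name} is not an object")

    walk(schema, "$", 1)
    return problems
```

An empty list means the schema passed these three checks under the assumptions stated; the API may apply its own, stricter rules.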

Page classification labels can be combined with structured extraction so that the API performs structured extraction on only a subset of pages.
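To illustrate the combination, the sketch below builds a request body in which one extraction schema is limited to pages of a given class. The field names come from the request example on this page. It assumes that the entries in page_classes refer to the name values in page_classifications; the helper name, schema, and sample values are hypothetical.

```python
def extraction_option(schema_name: str, json_schema: dict, page_classes=None) -> dict:
    """Build one entry for the structured_extraction_options array."""
    option = {"schema_name": schema_name, "json_schema": json_schema}
    if page_classes:
        # Assumption: these match names declared in page_classifications.
        option["page_classes"] = list(page_classes)
    return option


# Sample schema, invented for illustration.
signer_schema = {
    "type": "object",
    "properties": {
        "signer": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        }
    },
    "required": ["signer"],
}

body = {
    "file_id": "file_123",  # placeholder ID
    "page_classifications": [
        {"name": "signature_page", "description": "Pages with signature blocks."}
    ],
    "structured_extraction_options": [
        extraction_option("signers", signer_schema, page_classes=["signature_page"])
    ],
}
```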

Authorizations

Authorization (string, header, required): Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

This object defines the request body for the parse endpoint.

Response

200 (application/json): Created parse job details. The response is of type object.
