POST /documents/v2/parse
cURL
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/parse \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "page_range": "1-5,8,10",
  "file_name": "document.pdf",
  "file_id": "file_abc123xyz",
  "mime_type": "application/pdf",
  "parsing_options": {
    "table_output_mode": "html",
    "table_parsing_format": "tsr",
    "chunking_strategy": "none",
    "signature_detection": false,
    "remove_strikethrough_lines": false,
    "skew_detection": false,
    "disable_layout_detection": false,
    "ignore_sections": [],
    "cross_page_header_detection": false,
    "ocr_model": "model01"
  },
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<any>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ],
      "provide_citations": true
    }
  ],
  "page_classifications": [
    {
      "name": "<string>",
      "description": "<string>"
    }
  ],
  "enrichment_options": {
    "table_summarization": false,
    "table_summarization_prompt": null,
    "figure_summarization": false,
    "figure_summarization_prompt": null,
    "include_full_page_image": false
  },
  "labels": {
    "priority": "high",
    "source": "email"
  }
}'
{
  "parse_id": "<string>",
  "created_at": "<string>"
}
Submit an uploaded file, an internet-reachable URL, or raw text for document parsing. If you have configured a webhook, we will notify you when the job completes, whether it succeeds or fails.

This API is an advanced version of the Extract Document, Classify Document, and Read Document endpoints. We recommend it for more advanced use cases, such as getting structured data and page classifications in a single request.

The API converts the document into markdown and provides document layout information. You can also classify pages into categories and perform structured extraction using JSON Schema.

Once the job is submitted, the API returns a parse response with a parse_id field. You can query the status and results of the parse operation with the Get Parse Result endpoint.
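As a rough illustration of the submission flow described above, the sketch below prepares the same POST request as the cURL sample using only the Python standard library. The helper names build_parse_request and submit_parse_job are hypothetical and not part of any Tensorlake SDK; the URL and headers are taken from the cURL sample.

```python
import json
import urllib.request

# Endpoint URL as shown in the cURL sample.
PARSE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_parse_request(token: str, body: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the parse endpoint."""
    return urllib.request.Request(
        PARSE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_parse_job(token: str, body: dict) -> str:
    """Send the request and return the parse_id from the JSON response.

    Hypothetical helper: error handling and retries are left out.
    """
    with urllib.request.urlopen(build_parse_request(token, body)) as resp:
        return json.loads(resp.read())["parse_id"]
```

A minimal body such as {"file_id": "file_abc123xyz"} should be enough to try this, since the other top-level fields are optional.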

Using a file

When submitting a parse job, you can provide the content of the file in one of three ways:
  1. file_id: The ID of a file previously uploaded through the Upload Files endpoint. This is the most common method.
  2. file_url: A publicly accessible URL that points to the file you want to parse. The API will download the file from this URL. Redirects are also supported, but the URL and the Location header must point to a file that is publicly accessible.
  3. raw_text: Raw text content, for structured extraction from non-file sources such as emails, HTML, CSV, or XML.
The API will attempt to detect the mime-type automatically based on the file extension. You can provide a mime_type field to override the inferred mime-type. This is useful if you know the content type of the file and want to ensure the model interprets it correctly.
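Because exactly one of the three sources may be set, it can help to enforce that on the client before sending anything. The helper below is a minimal sketch; file_source is a hypothetical name, and the check only mirrors the rule stated in this page.

```python
def file_source(file_id=None, file_url=None, raw_text=None, mime_type=None) -> dict:
    """Return the source fields of a parse request body.

    Raises ValueError unless exactly one of file_id, file_url, or
    raw_text is given. mime_type is optional and overrides detection.
    """
    candidates = {"file_id": file_id, "file_url": file_url, "raw_text": raw_text}
    provided = {key: value for key, value in candidates.items() if value is not None}
    if len(provided) != 1:
        raise ValueError(
            "expected exactly one of file_id, file_url, raw_text; "
            f"got {sorted(provided) or 'none'}"
        )
    if mime_type is not None:
        provided["mime_type"] = mime_type
    return provided
```

For example, file_source(raw_text="Subject: Invoice 42", mime_type="text/plain") yields a body fragment with the raw text and an explicit MIME type.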

Page classification

You can classify the pages of a document into categories, or tags. In the page_classifications field, pass an array of categories along with their descriptions to guide the classifier. The API will return the page class for each page of the document.
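The page_classifications array is a list of objects with name and description keys, as in the request sample. A small sketch of building it from a mapping follows; the helper name and the category names in the usage note are made up for illustration.

```python
def page_classifications(categories: dict) -> list:
    """Turn {name: description} into the page_classifications array."""
    return [
        {"name": name, "description": description}
        for name, description in categories.items()
    ]
```

For instance, page_classifications({"invoice": "Pages listing billed items and totals"}) produces a one-element array that can be placed directly in the request body.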

Structured extraction

For structured extraction, you can provide one or more schemas to guide the extraction process. The schema must be in the form of a JSON Schema object. The JSON Schema object can be provided in the structured_extraction_options array, which can contain multiple objects. Known limitations include:
  • The schema can be at most 5 levels deep
  • All fields must be required
  • Root level fields must be objects
Page classification labels can be combined with structured extraction so that the API performs extraction on only a subset of pages.
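The first two limitations can be checked locally before submitting a schema. The function below is a sketch under assumptions: it counts depth as the number of nested object levels with the root as level 1, which is one plausible reading and may not match how the API counts. It does not check the third limitation. schema_problems is a hypothetical helper.

```python
def schema_problems(schema: dict, max_depth: int = 5) -> list:
    """Return human-readable problems found in a JSON Schema object.

    Checks that every property of every object is listed in "required"
    and that objects are not nested deeper than max_depth levels
    (assumption: the root object counts as level 1).
    """
    problems = []

    def walk(node, path, depth):
        if not isinstance(node, dict):
            return
        if node.get("type") == "array":
            # Arrays do not add a level here; descend into the item schema.
            walk(node.get("items"), path + "[]", depth)
            return
        props = node.get("properties")
        if not isinstance(props, dict):
            return  # Scalar or unconstrained node: nothing to check.
        if depth > max_depth:
            problems.append(f"{path}: nested deeper than {max_depth} levels")
            return
        missing = sorted(set(props) - set(node.get("required", [])))
        if missing:
            problems.append(f"{path}: not marked required: {', '.join(missing)}")
        for name, child in props.items():
            walk(child, f"{path}.{name}", depth + 1)

    walk(schema, "$", 1)
    return problems
```

An empty list means neither check found an issue; it does not guarantee the API will accept the schema.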

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
  • file_id
  • file_url
  • raw_text

File source - must be exactly one of: file_id, file_url, or raw_text

file_id
string
required

ID of a file previously uploaded to Tensorlake. The ID has a tensorlake- (V1) or file_ (V2) prefix.

Examples:

"file_abc123xyz"

parsing_options
object

The properties of this object define the configuration for the document parsing process.

Tensorlake provides sane defaults that work well for most documents, so this object is not required. However, every document is different, and you may want to customize the parsing process to better suit your needs.

structured_extraction_options
object[] | null

The properties of this object define the configuration for structured data extraction.

If this object is present, the API will perform structured data extraction on the document.

page_classifications
object[] | null

The properties of this object define the configuration for page classification.

If this object is present, the API will perform page classification on the document.

enrichment_options
object

The properties of this object help to extend the output of the document parsing process with additional information.

This includes summarization of tables and figures, which can help to provide a more comprehensive understanding of the document.

This object is not required, and the API will use default settings if it is not present.

labels
object | null

Additional metadata to identify the parse request. The labels are returned in the parse response.

Example:
{ "priority": "high", "source": "email" }

page_range
string

Comma-separated list of page numbers or ranges to parse (e.g., '1,2,3-5'). Default: all pages.
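To make the page_range syntax concrete, the sketch below expands such a string into the individual page numbers it selects. It illustrates the format only; the API does the actual parsing server-side, and expand_page_range is a hypothetical helper.

```python
def expand_page_range(page_range: str) -> list:
    """Expand a string like '1-5,8,10' into a sorted list of page numbers."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if not part:
            continue  # Tolerate stray commas.
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```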

Examples:

"1-5,8,10"

file_name
string

Name of the file. Only populated when using file_id.

Examples:

"document.pdf"

mime_type
enum<string>
Available options:
application/pdf,
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
application/msword,
application/vnd.openxmlformats-officedocument.presentationml.presentation,
application/vnd.apple.keynote,
image/jpeg,
text/plain,
text/html,
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,
application/vnd.ms-excel.sheet.macroenabled.12,
application/vnd.ms-excel,
text/xml,
text/csv,
image/png,
application/octet-stream

Response

Created parse job details

parse_id
string
required

The unique identifier for the parse job

This is the ID that can be used to track the status of the parse job. Used in the GET /documents/v2/parse/{parse_id} endpoint to retrieve the status and results of the parse job.

created_at
string
required

The creation date and time of the parse job.

The date is in RFC 3339 format.
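A sketch of reading the response follows: it extracts parse_id, parses created_at into a timezone-aware datetime, and builds the URL for the GET /documents/v2/parse/{parse_id} endpoint. The host is taken from the cURL sample and read_parse_response is a hypothetical helper. Before Python 3.11, datetime.fromisoformat does not accept a trailing "Z" and handles only some fractional-second lengths, so the "Z" is rewritten here and other RFC 3339 variants may need a dedicated parser.

```python
import json
from datetime import datetime


def read_parse_response(raw: str):
    """Return (parse_id, created_at, result_url) from a parse response body."""
    payload = json.loads(raw)
    parse_id = payload["parse_id"]

    created = payload["created_at"]
    if created.endswith(("Z", "z")):
        # Rewrite the UTC designator so older Python versions can parse it.
        created = created[:-1] + "+00:00"
    created_at = datetime.fromisoformat(created)

    result_url = f"https://api.tensorlake.ai/documents/v2/parse/{parse_id}"
    return parse_id, created_at, result_url
```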
