POST /documents/v2/extract
cURL
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/extract \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
  "page_range": "1-5,8,10",
  "file_name": "document.pdf",
  "file_id": "file_abc123xyz",
  "mime_type": "application/pdf",
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<any>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ],
      "provide_citations": true
    }
  ],
  "labels": {
    "priority": "high",
    "source": "email"
  }
}'
{
  "parse_id": "<string>",
  "created_at": "<string>"
}
Submit an uploaded file, an internet-reachable URL, or raw text for document parsing. If you have configured a webhook, we will notify you when the job completes, whether it succeeds or fails. Once the request is submitted, the API returns a response containing a parse_id field. Use this ID with the Get Parse Result endpoint to query the status and results of the parse operation.
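
The submit-then-query flow described above can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names (build_extract_request, build_result_request, send) are hypothetical, and the shape of the Get Parse Result response is not described on this page, so the sketch only builds the requests and decodes whatever JSON comes back.

```python
import json
import urllib.request

# Base URL taken from the cURL example on this page.
API_BASE = "https://api.tensorlake.ai/documents/v2"


def build_extract_request(token: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) the POST /documents/v2/extract request."""
    return urllib.request.Request(
        url=f"{API_BASE}/extract",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_result_request(token: str, parse_id: str) -> urllib.request.Request:
    """Build the GET /documents/v2/parse/{parse_id} request for a submitted job."""
    return urllib.request.Request(
        url=f"{API_BASE}/parse/{parse_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body (hypothetical helper)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would pass the first request to send(), read parse_id from the returned object, and then send the second request to check on the job. How often to poll, and which fields signal completion, are not specified here.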

Using a schema

For this operation, you must provide one or more schemas to guide the extraction process. Each schema must be a JSON Schema object, supplied in the json_schema field of an entry in the structured_extraction_options array, which can contain multiple objects. Known limitations:
  • A schema can be at most 5 levels deep
  • Root-level fields must be objects
Page classification labels can be combined with structured extraction so that the API performs structured extraction on only a subset of pages.
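
The two limitations above can be checked locally before submitting. The following is a rough sketch under stated assumptions: it counts one level per nested "properties" layer and reads "root-level fields must be objects" as "every top-level property has type object". The API may count depth or interpret the rule differently. The function names and the invoice schema are hypothetical.

```python
MAX_DEPTH = 5  # limit stated in the documentation


def schema_depth(node) -> int:
    """Return how many nested layers of "properties" a schema contains.

    Assumption: one level per "properties" layer; array "items" are
    traversed without adding a level of their own.
    """
    if not isinstance(node, dict):
        return 0
    children = node.get("properties", {}).values()
    below = max((schema_depth(child) for child in children), default=0)
    own = 1 if "properties" in node else 0
    return max(own + below, schema_depth(node.get("items")))


def root_fields_are_objects(schema: dict) -> bool:
    """Check that every top-level property declares type "object" (assumed reading)."""
    fields = schema.get("properties", {}).values()
    return all(field.get("type") == "object" for field in fields)


# Hypothetical schema: a single root-level object field with two leaf values.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {
                "number": {"type": "string"},
                "total": {"type": "number"},
            },
        }
    },
}

# One entry for the structured_extraction_options array.
option = {"schema_name": "invoice", "json_schema": invoice_schema}
```

With these definitions, a schema passes the local check when schema_depth(schema) <= MAX_DEPTH and root_fields_are_objects(schema) is true.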

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json

File source - must be exactly one of: file_id, file_url, or raw_text
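
A client can enforce the "exactly one of" rule before sending the request. This is a minimal sketch; check_file_source is a hypothetical helper, and treating a null value as "not provided" is an assumption rather than documented behavior.

```python
SOURCE_FIELDS = ("file_id", "file_url", "raw_text")


def check_file_source(body: dict) -> str:
    """Return the name of the single file source present in the request body.

    Raises ValueError when zero or more than one source field is set.
    Assumption: a field whose value is None counts as absent.
    """
    present = [name for name in SOURCE_FIELDS if body.get(name) is not None]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {SOURCE_FIELDS}, got {present or 'none'}"
        )
    return present[0]
```

For example, check_file_source({"file_id": "file_abc123xyz"}) returns "file_id", while a body containing both file_id and raw_text raises ValueError.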

file_id
string
required

ID of the file previously uploaded to Tensorlake. Has tensorlake- (V1) or file_ (V2) prefix.

Examples:

"file_abc123xyz"

structured_extraction_options
object[]

Each object in this array defines the configuration for one structured data extraction.

If this array is present, the API will perform structured data extraction on the document.

labels
object | null

Additional metadata to identify the extraction request. The labels are returned in the extraction response.

Example:
{ "priority": "high", "source": "email" }
page_range
string

Comma-separated list of page numbers or ranges to parse (e.g., '1,2,3-5'). Default: all pages.
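
To make the format concrete, here is a small sketch that expands a page_range string into the individual page numbers it selects. expand_page_range is a hypothetical helper; whether the API tolerates whitespace, duplicates, or reversed ranges is not stated here, so the sketch strips whitespace, removes duplicates, and rejects reversed ranges as its own choices.

```python
def expand_page_range(page_range: str) -> list[int]:
    """Expand a string such as "1-5,8,10" into a sorted list of page numbers."""
    pages = set()
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-5" is inclusive on both ends.
            start, end = (int(piece) for piece in part.split("-", 1))
            if start > end:
                raise ValueError(f"reversed range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)
```

Under these assumptions, "1,2,3-5" selects pages 1 through 5.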

Examples:

"1-5,8,10"

file_name
string

Name of the file. Only populated when using file_id.

Examples:

"document.pdf"

mime_type
enum<string>
Available options:
application/pdf,
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
application/msword,
application/vnd.openxmlformats-officedocument.presentationml.presentation,
application/vnd.apple.keynote,
image/jpeg,
text/plain,
text/html,
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,
application/vnd.ms-excel.sheet.macroenabled.12,
application/vnd.ms-excel,
text/xml,
text/csv,
image/png,
application/octet-stream
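
When the MIME type is not known in advance, one option is to guess it from the file name and confirm that the guess is in the list above. This sketch uses Python's standard mimetypes module, whose guesses can vary by platform; pick_mime_type is a hypothetical helper, and raising an error instead of falling back to application/octet-stream is a design choice of this sketch, not API behavior.

```python
import mimetypes

# The options listed above for the mime_type field.
SUPPORTED_MIME_TYPES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/msword",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "application/vnd.apple.keynote",
    "image/jpeg",
    "text/plain",
    "text/html",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.ms-excel.sheet.macroenabled.12",
    "application/vnd.ms-excel",
    "text/xml",
    "text/csv",
    "image/png",
    "application/octet-stream",
}


def pick_mime_type(file_name: str) -> str:
    """Guess a MIME type from the file name and verify that it is supported."""
    guessed, _encoding = mimetypes.guess_type(file_name)
    if guessed in SUPPORTED_MIME_TYPES:
        return guessed
    raise ValueError(f"unsupported or unknown MIME type for {file_name!r}: {guessed}")
```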

Response

Created parse job details

parse_id
string
required

The unique identifier for the parse job.

Use this ID to track the parse job: pass it to the GET /documents/v2/parse/{parse_id} endpoint to retrieve the job's status and results.

created_at
string
required

The creation date and time of the parse job.

The date is in RFC 3339 format.
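
Because created_at is an RFC 3339 string, it can be converted to a timezone-aware datetime on the client. This is a minimal sketch: parse_created_at is a hypothetical helper, and the timestamps used with it below are made-up examples, not real API output. Python versions before 3.11 reject a trailing "Z" in datetime.fromisoformat, so the sketch rewrites it to an explicit offset first; unusual fractional-second precision may still fail on those older versions.

```python
from datetime import datetime, timezone


def parse_created_at(value: str) -> datetime:
    """Parse an RFC 3339 timestamp and normalize it to UTC."""
    # Rewrite the "Z" suffix so older Python versions can parse it.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # RFC 3339 timestamps always carry an offset.
        raise ValueError("timestamp has no UTC offset")
    return parsed.astimezone(timezone.utc)
```

For instance, the hypothetical values "2024-05-01T12:30:00Z" and "2024-05-01T14:30:00+02:00" both parse to the same instant in UTC.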