POST /documents/v2/extract
cURL
curl --request POST \
  --url https://api.tensorlake.ai/documents/v2/extract \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '
{
  "file_id": "<string>",
  "page_range": "<string>",
  "file_name": "<string>",
  "mime_type": "application/pdf",
  "structured_extraction_options": [
    {
      "schema_name": "<string>",
      "json_schema": "<unknown>",
      "skip_ocr": true,
      "prompt": "<string>",
      "model_provider": "tensorlake",
      "partition_strategy": "none",
      "page_classes": [
        "<string>"
      ],
      "provide_citations": true
    }
  ],
  "labels": {
    "priority": "high",
    "source": "email"
  }
}
'
{
  "parse_id": "<string>",
  "created_at": "<string>"
}
Submit an uploaded file, an internet-reachable URL, or raw text for document parsing. If you have configured a webhook, we will notify you when the job completes, whether it succeeds or fails. Once the request is submitted, the API returns a parse response with a parse_id field. Use that ID with the Get Parse Result endpoint to query the status and results of the parse operation.
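The submission step can be sketched in Python with only the standard library. This mirrors the cURL example above; the token, the file ID, the schema name, and the schema contents are hypothetical placeholders, and the helper names build_extract_request and submit are illustrative, not part of any Tensorlake SDK. Whether the fields omitted here are optional is an assumption.

```python
import json
import urllib.request

API_URL = "https://api.tensorlake.ai/documents/v2/extract"


def build_extract_request(token, body):
    """Build (but do not send) the POST request for the extract endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit(request):
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Hypothetical values for illustration only.
example_body = {
    "file_id": "file_abc123",
    "structured_extraction_options": [
        {
            "schema_name": "invoice",
            "json_schema": {
                "type": "object",
                "properties": {
                    "invoice": {
                        "type": "object",
                        "properties": {"total": {"type": "number"}},
                    }
                },
            },
        }
    ],
    "labels": {"priority": "high", "source": "email"},
}
request = build_extract_request("YOUR_API_TOKEN", example_body)
# parse_id = submit(request)["parse_id"]  # uncomment to call the live API
```

Keeping request construction separate from sending makes it easy to inspect the payload before any network call is made.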

Using a schema

For this operation, you must provide one or more schemas to guide the extraction process. Each schema must be a JSON Schema object, supplied in an entry of the structured_extraction_options array; the array can contain multiple entries. Known limitations include:
  • A schema can be at most 5 levels deep
  • Root-level fields must be objects
Page classification labels can be combined with structured extraction so that the API performs structured extraction on only a subset of pages.
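A schema can be checked against the two limitations above before it is submitted. The sketch below is a client-side approximation only: the page does not say how the API counts levels, so the depth rule here (each nested object or array item adds one level, with the root counted as level 1) is an assumption, and schema_depth and check_schema are hypothetical helper names.

```python
def schema_depth(schema):
    """Count nesting levels, treating the schema itself as level 1 (assumed rule)."""
    if not isinstance(schema, dict):
        return 0
    children = list(schema.get("properties", {}).values())
    items = schema.get("items")
    if isinstance(items, dict):
        children.append(items)  # array element schemas add a level too
    return 1 + max((schema_depth(child) for child in children), default=0)


def check_schema(schema, max_depth=5):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for name, field in schema.get("properties", {}).items():
        if not isinstance(field, dict) or field.get("type") != "object":
            problems.append(f"root field '{name}' is not an object")
    depth = schema_depth(schema)
    if depth > max_depth:
        problems.append(f"schema is {depth} levels deep (limit {max_depth})")
    return problems


# Hypothetical schema: one root-level object field with a scalar inside it.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {"total": {"type": "number"}},
        }
    },
}
problems = check_schema(invoice_schema)
```

The server remains the authority on what it accepts; a check like this only catches obvious violations early.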

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Body

application/json
  • file_id
  • file_url
  • raw_text

File source - must be exactly one of: file_id, file_url, or raw_text
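The "exactly one of" rule can be enforced on the client before the request is sent. This is a minimal sketch; file_source_key is a hypothetical helper, and treating a key set to None as absent is an assumption about how you build the request body.

```python
FILE_SOURCE_KEYS = ("file_id", "file_url", "raw_text")


def file_source_key(body):
    """Return the single file-source key present in the body, or raise ValueError."""
    present = [key for key in FILE_SOURCE_KEYS if body.get(key) is not None]
    if len(present) != 1:
        raise ValueError(
            "expected exactly one of file_id, file_url, raw_text; "
            f"got {present or 'none'}"
        )
    return present[0]


# Hypothetical file ID for illustration.
source = file_source_key({"file_id": "file_abc123", "page_range": "1-3"})
```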

file_id
string
required

ID of a file previously uploaded to Tensorlake. The ID has a tensorlake- prefix (V1) or a file_ prefix (V2).

page_range
string

Comma-separated list of page numbers or ranges to parse (e.g., '1,2,3-5'). Default: all pages.
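To make the page_range format concrete, the sketch below expands a string such as '1,2,3-5' into individual page numbers. It illustrates the format only; the API's own parsing rules (for example, how it treats whitespace or reversed ranges) are not documented here, so those behaviors are assumptions, and parse_page_range is a hypothetical helper.

```python
def parse_page_range(page_range):
    """Expand a string like '1,2,3-5' into a list of page numbers."""
    pages = []
    for part in page_range.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"range '{part}' runs backwards")
            pages.extend(range(start, end + 1))  # ranges are inclusive
        else:
            pages.append(int(part))
    return pages


pages = parse_page_range("1,2,3-5")
```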

file_name
string

Name of the file. Only populated when using file_id.

mime_type
enum<string>
Available options:
application/pdf,
application/vnd.openxmlformats-officedocument.wordprocessingml.document,
application/msword,
application/vnd.openxmlformats-officedocument.presentationml.presentation,
application/vnd.apple.keynote,
image/jpeg,
image/tiff,
text/plain,
text/html,
text/markdown,
text/x-markdown,
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet,
application/vnd.ms-excel.sheet.macroenabled.12,
application/vnd.ms-excel,
text/xml,
text/csv,
image/png,
application/octet-stream,
application/pkcs7-mime,
application/x-pkcs7-mime,
application/pkcs7-signature
structured_extraction_options
object[]

Each object in this array defines a configuration for structured data extraction.

If this array is present, the API will perform structured data extraction on the document.

labels
object

Additional metadata to identify the extraction request. The labels are returned in the extraction response.

Example:
{ "priority": "high", "source": "email" }

Response

Created parse job details

parse_id
string
required

The unique identifier for the parse job

Use this ID to track the parse job: pass it to the GET /documents/v2/parse/{parse_id} endpoint to retrieve the job's status and results.
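Building the follow-up request from a parse_id might look like the sketch below. It assumes the GET endpoint lives on the same host as the extract endpoint and accepts the same Bearer token; the shape of its response is not described on this page, so the sketch stops at constructing the request. The parse ID and token are hypothetical placeholders.

```python
import urllib.parse
import urllib.request

BASE_URL = "https://api.tensorlake.ai"  # assumed to match the extract endpoint's host


def parse_result_request(token, parse_id):
    """Build (but do not send) the GET request for a parse job's status and results."""
    safe_id = urllib.parse.quote(parse_id, safe="")  # escape any unexpected characters
    return urllib.request.Request(
        f"{BASE_URL}/documents/v2/parse/{safe_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


status_request = parse_result_request("YOUR_API_TOKEN", "parse_abc123")
```

Pass the resulting request to urllib.request.urlopen to call the live API.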

created_at
string
required

The creation date and time of the parse job.

The date is in RFC 3339 format.
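An RFC 3339 timestamp can usually be converted to a timezone-aware datetime with the Python standard library. The sketch below rewrites a trailing Z as +00:00 because datetime.fromisoformat only accepts Z from Python 3.11 onward; the sample timestamp is invented for illustration, and unusual fractional-second precision may still need a dedicated parser on older Python versions.

```python
from datetime import datetime, timezone


def parse_created_at(value):
    """Parse an RFC 3339 timestamp string into a timezone-aware datetime."""
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"  # fromisoformat rejects 'Z' before Python 3.11
    return datetime.fromisoformat(value)


created = parse_created_at("2024-05-01T12:30:00Z")
```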