# Parse

Submit an uploaded file, an internet-reachable URL, or raw text for document parsing. If you have configured a webhook,
we will notify you when the job completes, whether it succeeds or fails.

This API is an advanced version of the [Extract Document](./extract), [Classify Document](./classify), and [Read Document](./read) endpoints.
We recommend it for more advanced use cases, such as getting structured data and page classifications in a single request.

The API converts the document into Markdown and provides document layout information. You can also classify pages into categories and
perform structured extraction using JSON Schema.

After you submit a request, the API returns a parse response with a `parse_id` field. You can query the status and results of the parse operation
with the [Get Parse Result](./get) endpoint.
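
The submit step can be sketched with Python's standard library. This is a minimal illustration, not an official client: the base URL, path, and bearer authentication come from the OpenAPI definition on this page, while the helper names `build_parse_request` and `submit_parse` and the way the API key is passed in are assumptions made for the example.

```python
import json
import urllib.request

# Server URL taken from the OpenAPI definition on this page.
API_BASE = "https://api.tensorlake.ai"


def build_parse_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST /documents/v2/parse request."""
    return urllib.request.Request(
        url=f"{API_BASE}/documents/v2/parse",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def submit_parse(api_key: str, body: dict) -> str:
    """Send the request and return the `parse_id` from the response body."""
    request = build_parse_request(api_key, body)
    with urllib.request.urlopen(request) as response:
        created = json.load(response)
    return created["parse_id"]
```

Calling `submit_parse(api_key, {"file_id": "file_abc123xyz"})` would return the `parse_id`, which you can then pass to the Get Parse Result endpoint.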

## Using a file

When submitting a parse job, you can provide the content of the file in one of three ways:

1. `file_id`: The ID of a file previously uploaded with the [Upload Files](../../v2/files/upload) endpoint. This is the most common method.
2. `file_url`: A publicly accessible URL that points to the file you want to parse. The API will download the file from this URL.
   Redirects are supported, but both the original URL and the target in the `Location` header must point to a file that is publicly accessible.
3. `raw_text`: Raw text content, for structured extraction from non-file sources such as emails, HTML, CSV, or XML. A `mime_type` is required with `raw_text`.

The API attempts to detect the MIME type automatically from the file extension. You can provide a `mime_type` field to override the inferred type. This is useful if you know the content type of the file and want to ensure the model interprets it correctly.
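
As a rough sketch of the three options above, the hypothetical helper below builds the file-source part of a request body and rejects combinations the request schema does not allow (more than one source, or `raw_text` without `mime_type`). The function name and the client-side checks are assumptions for illustration; the API performs its own validation.

```python
from typing import Optional


def file_source(
    *,
    file_id: Optional[str] = None,
    file_url: Optional[str] = None,
    raw_text: Optional[str] = None,
    mime_type: Optional[str] = None,
) -> dict:
    """Return the file-source fields for a parse request body."""
    candidates = (("file_id", file_id), ("file_url", file_url), ("raw_text", raw_text))
    provided = {name: value for name, value in candidates if value is not None}
    # The request schema accepts exactly one of the three sources.
    if len(provided) != 1:
        raise ValueError("provide exactly one of file_id, file_url, or raw_text")
    # The raw_text variant lists mime_type as a required field.
    if "raw_text" in provided and mime_type is None:
        raise ValueError("mime_type is required with raw_text")
    if mime_type is not None:
        provided["mime_type"] = mime_type
    return provided
```

For example, `file_source(raw_text="Hello", mime_type="text/plain")` produces a body fragment with both `raw_text` and `mime_type` set.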

## Page classification

You can classify the pages of a document into categories, or tags. In the `page_classifications` field, pass an array of categories along with
descriptions that guide the classifier. The API will return the page class for each page of the document.
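
A request body with page classification might look like the sketch below. The `name` and `description` fields follow the `PageClassConfig` schema on this page; the class names, descriptions, and the `page_class` helper are made up for illustration.

```python
def page_class(name: str, description: str) -> dict:
    """Build one entry for the `page_classifications` array."""
    # Both fields are required by the PageClassConfig schema.
    if not name or not description:
        raise ValueError("name and description are both required")
    return {"name": name, "description": description}


# Hypothetical categories for a mixed batch of financial documents.
request_body = {
    "file_id": "file_abc123xyz",
    "page_classifications": [
        page_class("invoice", "Pages that list line items, totals, and a payment due date."),
        page_class("cover_letter", "Pages with a greeting and free-form text addressed to a person."),
    ],
}
```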

## Structured extraction

For structured extraction, you can provide one or more schemas to guide the extraction process. The schema must be in the form of a JSON Schema object.

Provide each schema in the `json_schema` field of an entry in the `structured_extraction_options` array, together with a `schema_name`. The array can contain multiple entries.

Known limitations include:

* The schema can be at most 5 levels deep
* All fields must be required
* Root-level fields must be objects
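
One way to catch the "all fields must be required" limitation before submitting is a small client-side check. The sketch below is an assumption-laden illustration: `missing_required` is a hypothetical helper, it only understands plain `object` and `array` nodes, and it does not check the depth limit because how levels are counted is not specified here.

```python
from typing import List


def missing_required(schema: dict, path: str = "$") -> List[str]:
    """Return paths of properties that are not listed in `required`."""
    problems: List[str] = []
    if schema.get("type") == "object":
        required = set(schema.get("required", []))
        for name, child in schema.get("properties", {}).items():
            if name not in required:
                problems.append(f"{path}.{name}")
            # Recurse so nested objects are checked as well.
            problems.extend(missing_required(child, f"{path}.{name}"))
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        problems.extend(missing_required(schema["items"], f"{path}[]"))
    return problems


# Hypothetical schema: the root-level field `invoice` is an object,
# and every property is listed in `required`.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice": {
            "type": "object",
            "properties": {
                "number": {"type": "string"},
                "total": {"type": "number"},
            },
            "required": ["number", "total"],
        }
    },
    "required": ["invoice"],
}
```

An empty list from `missing_required(invoice_schema)` means every property is marked as required; any returned path points at a field to fix.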

<Icon icon="lightbulb" iconType="solid" /> Page classification labels can be combined
with structured extraction so that extraction runs only on a subset of pages. List the
labels in the `page_classes` field of a structured extraction entry.
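
The combination can be sketched as a single request body in which a structured extraction entry lists page classes in its `page_classes` field. The field names come from the schemas on this page; the class name, schema contents, and the `undefined_page_classes` sanity check are illustrative assumptions, and whether the API itself rejects undefined classes is not stated here.

```python
from typing import Set

# Hypothetical request: classify pages, then extract only from "invoice" pages.
request_body = {
    "file_id": "file_abc123xyz",
    "page_classifications": [
        {"name": "invoice", "description": "Pages that list line items and totals."},
    ],
    "structured_extraction_options": [
        {
            "schema_name": "invoice_fields",
            "json_schema": {
                "type": "object",
                "properties": {
                    "invoice": {
                        "type": "object",
                        "properties": {"total": {"type": "number"}},
                        "required": ["total"],
                    }
                },
                "required": ["invoice"],
            },
            "page_classes": ["invoice"],
        }
    ],
}


def undefined_page_classes(body: dict) -> Set[str]:
    """Return page classes used for extraction but not defined in the request."""
    defined = {entry["name"] for entry in body.get("page_classifications") or []}
    used = {
        name
        for options in body.get("structured_extraction_options") or []
        for name in options.get("page_classes") or []
    }
    return used - defined
```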


## OpenAPI

````yaml post /documents/v2/parse
openapi: 3.1.0
info:
  title: Tensorlake API
  description: >-
    Tensorlake Cloud APIs for Sandboxes, Document Ingestion, and Serverless
    Workflows
  license:
    name: ''
  version: 0.1.0
servers:
  - url: https://api.tensorlake.ai/
security:
  - bearerAuth: []
tags:
  - name: Tensorlake Cloud API
    description: >-
      Tensorlake Cloud APIs for Sandboxes, Document Ingestion, and Serverless
      Workflows
paths:
  /documents/v2/parse:
    post:
      tags:
        - parse
      operationId: post_parse
      requestBody:
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ParseRequest'
        required: true
      responses:
        '200':
          description: Created parse job details
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ParseCreatedResponse'
        '400':
          description: Invalid request
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiError'
        '401':
          description: Unauthorized. Invalid or missing credentials
        '403':
          description: Forbidden. You do not have permission to access this resource
        '404':
          description: Resource not found
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiError'
        '422':
          description: Invalid properties in request body
          content:
            text/plain: {}
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiError'
components:
  schemas:
    ParseRequest:
      allOf:
        - $ref: '#/components/schemas/RequestFileInfo'
        - $ref: '#/components/schemas/ParseConfiguration'
        - type: object
          properties:
            labels:
              type:
                - object
                - 'null'
              description: >-
                Additional metadata to identify the parse request. The labels
                are

                returned in the parse response.
              additionalProperties: {}
              propertyNames:
                type: string
              example:
                priority: high
                source: email
    ParseCreatedResponse:
      type: object
      required:
        - parse_id
        - created_at
      properties:
        parse_id:
          type: string
          description: >-
            The unique identifier for the parse job


            This is the ID that can be used to track the status of the parse
            job.

            Used in the `GET /documents/v2/parse/{parse_id}` endpoint to
            retrieve

            the status and results of the parse job.
        created_at:
          type: string
          description: |-
            The creation date and time of the parse job.

            The date is in RFC 3339 format.
    ApiError:
      type: object
      required:
        - message
        - code
        - timestamp
      properties:
        message:
          type: string
          description: A human-readable error message
        code:
          $ref: '#/components/schemas/ApiErrorCode'
          description: The error code, which can be used to programmatically handle errors
        timestamp:
          type: integer
          format: int64
          description: Milliseconds since the Unix epoch
        trace_id:
          type:
            - string
            - 'null'
          description: Optional request correlation-id for distributed tracing
        details:
          description: Optional field-level validation errors, etc.
    RequestFileInfo:
      allOf:
        - type: object
          properties:
            page_range:
              type: string
              description: >-
                Comma-separated list of page numbers or ranges to parse (e.g.,
                '1,2,3-5'). Default: all pages.
              examples:
                - 1-5,8,10
            file_name:
              type: string
              description: Name of the file. Only populated when using file_id.
              examples:
                - document.pdf
        - oneOf:
            - type: object
              title: file_id
              required:
                - file_id
              properties:
                file_id:
                  type: string
                  description: >-
                    ID of the file previously uploaded to Tensorlake. Has
                    tensorlake- (V1) or file_ (V2) prefix.
                  examples:
                    - file_abc123xyz
                mime_type:
                  type: string
                  enum:
                    - application/pdf
                    - >-
                      application/vnd.openxmlformats-officedocument.wordprocessingml.document
                    - application/msword
                    - >-
                      application/vnd.openxmlformats-officedocument.presentationml.presentation
                    - application/vnd.ms-powerpoint
                    - application/vnd.apple.keynote
                    - image/jpeg
                    - image/tiff
                    - text/plain
                    - text/html
                    - text/markdown
                    - text/x-markdown
                    - >-
                      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
                    - application/vnd.ms-excel.sheet.macroenabled.12
                    - application/vnd.ms-excel
                    - text/xml
                    - text/csv
                    - image/png
                    - text/rtf
                    - application/rtf
                    - application/octet-stream
                    - application/pkcs7-mime
                    - application/x-pkcs7-mime
                    - application/pkcs7-signature
            - type: object
              title: file_url
              required:
                - file_url
              properties:
                file_url:
                  type: string
                  format: uri-template
                  description: >-
                    External URL of the file to parse. Must be publicly
                    accessible.
                  examples:
                    - >-
                      https://pub-226479de18b2493f96b64c6674705dd8.r2.dev/real-estate-purchase-all-signed.pdf
                mime_type:
                  type: string
                  enum:
                    - application/pdf
                    - >-
                      application/vnd.openxmlformats-officedocument.wordprocessingml.document
                    - application/msword
                    - >-
                      application/vnd.openxmlformats-officedocument.presentationml.presentation
                    - application/vnd.ms-powerpoint
                    - application/vnd.apple.keynote
                    - image/jpeg
                    - image/tiff
                    - text/plain
                    - text/html
                    - text/markdown
                    - text/x-markdown
                    - >-
                      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
                    - application/vnd.ms-excel.sheet.macroenabled.12
                    - application/vnd.ms-excel
                    - text/xml
                    - text/csv
                    - image/png
                    - text/rtf
                    - application/rtf
                    - application/octet-stream
                    - application/pkcs7-mime
                    - application/x-pkcs7-mime
                    - application/pkcs7-signature
            - type: object
              title: raw_text
              required:
                - raw_text
                - mime_type
              properties:
                raw_text:
                  type: string
                  description: The raw text content to parse.
                  examples:
                    - This is the document content...
                mime_type:
                  type: string
                  enum:
                    - application/pdf
                    - >-
                      application/vnd.openxmlformats-officedocument.wordprocessingml.document
                    - application/msword
                    - >-
                      application/vnd.openxmlformats-officedocument.presentationml.presentation
                    - application/vnd.ms-powerpoint
                    - application/vnd.apple.keynote
                    - image/jpeg
                    - image/tiff
                    - text/plain
                    - text/html
                    - text/markdown
                    - text/x-markdown
                    - >-
                      application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
                    - application/vnd.ms-excel.sheet.macroenabled.12
                    - application/vnd.ms-excel
                    - text/xml
                    - text/csv
                    - image/png
                    - text/rtf
                    - application/rtf
                    - application/octet-stream
                    - application/pkcs7-mime
                    - application/x-pkcs7-mime
                    - application/pkcs7-signature
          description: 'File source - must be exactly one of: file_id, file_url, or raw_text'
    ParseConfiguration:
      type: object
      properties:
        parsing_options:
          $ref: '#/components/schemas/ParsingOptions'
          description: >-
            The properties of this object define the configuration for the
            document

            parsing process.


            Tensorlake provides sane defaults that work well for most

            documents, so this object is not required. However, every document

            is different, and you may want to customize the parsing process to

            better suit your needs.
        structured_extraction_options:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/StructuredExtractionOptions'
          description: >-
            The properties of this object define the configuration for
            structured

            data extraction.


            If this object is present, the API will perform structured data

            extraction on the document.
        page_classifications:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/PageClassConfig'
          description: |-
            The properties of this object define the configuration for page
            classification.

            If this object is present, the API will perform page classification
            on the document.
        enrichment_options:
          $ref: '#/components/schemas/EnrichmentOptions'
          description: >-
            The properties of this object help to extend the output of the
            document

            parsing process with additional information.


            This includes summarization of tables and figures, which can help to

            provide a more comprehensive understanding of the document.


            This object is not required, and the API will use default settings
            if it

            is not present.
    ApiErrorCode:
      oneOf:
        - type: string
          enum:
            - QUOTA_EXCEEDED
        - type: string
          enum:
            - INVALID_JSON_SCHEMA
        - type: string
          enum:
            - INVALID_CONFIGURATION
        - type: string
          enum:
            - INVALID_PAGE_CLASSIFICATION
        - type: string
          enum:
            - ENTITY_NOT_FOUND
        - type: string
          enum:
            - ENTITY_ALREADY_EXISTS
        - type: string
          enum:
            - INVALID_FILE
        - type: string
          enum:
            - INVALID_PAGE_RANGE
        - type: string
          enum:
            - INVALID_MIME_TYPE
        - type: string
          enum:
            - INVALID_DATASET_NAME
        - type: string
          enum:
            - INVALID_JOB_STATE
        - type: string
          enum:
            - INTERNAL_ERROR
        - type: string
          enum:
            - INVALID_MULTIPART
        - type: string
          enum:
            - MULTIPART_STREAM_END
        - type: string
          enum:
            - CLIENT_DISCONNECT
        - type: string
          enum:
            - INVALID_ID
        - type: object
          required:
            - INVALID_QUERY_PARAMS
          properties:
            INVALID_QUERY_PARAMS:
              type: object
              required:
                - property
              properties:
                property:
                  type: string
                message:
                  type:
                    - string
                    - 'null'
    ParsingOptions:
      type: object
      properties:
        table_output_mode:
          oneOf:
            - $ref: '#/components/schemas/TableOutputMode'
              description: |-
                The format for the tables extracted from the document.

                `html` - tables are represented as HTML strings.
                `markdown` - tables are represented as Markdown strings.

                The default is `html`.
          default: html
        table_parsing_format:
          oneOf:
            - $ref: '#/components/schemas/TableParsingFormat'
              description: >-
                Determines which model the system uses to identify and extract
                tables

                from the document.


                `tsr` - identifies the structure of

                the table first, and then the cells of the tables. Better suited
                for

                dense, long or grid-like tables.

                `vlm` - uses a VLM model to identify and extract the cells of
                the

                tables. Better suited for tables with merged cells or irregular

                structures.


                The default is `tsr`.
          default: tsr
        chunking_strategy:
          oneOf:
            - $ref: '#/components/schemas/ChunkingStrategy'
              description: >-
                Determines how the document is chunked into smaller pieces.


                `none` - no chunking is applied.

                `page` - chunks the document into pages.

                `section` - chunks the document into sections.

                `fragment` - chunks the document by objects detected in the
                document.

                Every text block, image, table, etc. is considered a fragment.


                The default is `none`.
          default: none
        signature_detection:
          type: boolean
          description: |-
            Flag to enable the detection of signatures in the document.

            This flag incurs additional billing costs.

            The default is `false`.
          default: false
        remove_strikethrough_lines:
          type: boolean
          description: >-
            Flag to enable the detection, and removal, of strikethrough text in
            the

            document.


            This flag incurs additional billing costs.


            The default is `false`.
          default: false
        skew_detection:
          type: boolean
          description: |-
            Boolean flag to detect and correct skewed or rotated pages in the
            document.

            Setting this to `true` will increase the processing time of the
            document.

            The default is `false`.
          default: false
        disable_layout_detection:
          type: boolean
          description: |-
            Disable bounding box detection for the document. Leads to faster
            document parsing.

            The default is `false`.
          default: false
        ignore_sections:
          type: array
          items:
            $ref: '#/components/schemas/PageFragmentType'
          description: >-
            A set of page fragment types to ignore during parsing.


            This can be used to skip certain types of content that are not
            relevant

            for the parsing process, such as headers, footers, or other

            non-essential elements.


            The default is an empty set.
          default: []
          uniqueItems: true
        cross_page_header_detection:
          type: boolean
          description: |-
            Enable header-hierarchy detection across pages.

            When set to `true`, the parser will consider headers from different
            pages when determining the hierarchy of headers within a single
            page.

            The default is `false`.
          default: false
        include_images:
          type: boolean
          description: |-
            Embed images from document in the markdown

            The default is `false`.
          default: false
        barcode_detection:
          type: boolean
          description: |-
            Enable barcode reader for the document.

            The default is `false`.
          default: false
        merge_tables:
          type: boolean
          description: >-
            Enable table merging for the document.


            When set to `true`, adjacent tables that are part of the same
            logical table will be

            merged into a single table.


            The default is `false`.
          default: false
        ocr_model:
          oneOf:
            - $ref: '#/components/schemas/OcrPipelineProvider'
              description: >-
                The model to use for OCR (Optical Character Recognition).


                `model01` - It's fast but could have lower accuracy on

                complex tables. It's good for legal documents with footnotes.

                `model02` - It's slower but could have higher accuracy on
                complex

                tables. It's good for financial documents with merged cells.

                `model03` (default model) - it is our best model in terms of
                accuracy for business documents.

                This model can be deployed on dedicated

                hardware in their own datacenter.

                `gemini3` - uses the Google Gemini 3 API for OCR processing.
          default: model03
      additionalProperties: false
    StructuredExtractionOptions:
      type: object
      required:
        - schema_name
        - json_schema
      properties:
        schema_name:
          type: string
          description: >-
            The name of the schema. This is used to tag the structured data
            output

            with a name in the response.
        json_schema:
          description: >-
            The JSON schema to guide structured data extraction from the file.


            This schema should be a valid JSON schema that defines the structure
            of

            the data to be extracted.


            The API supports a subset of the JSON schema specification.


            This value must be provided if `structured_extraction_options` is
            present in the request.
        skip_ocr:
          type: boolean
          description: >-
            Boolean flag to skip converting the document blob to OCR text before

            structured data extraction.


            If set to `true`, the API will skip the OCR step and directly
            extract

            structured data from the document.


            The default is `false`.
        prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to use for structured data extraction.

            If not provided, the default prompt will be used.
        model_provider:
          $ref: '#/components/schemas/Model'
          description: >-
            The model provider to use for structured data extraction.


            The default is `tensorlake`, which uses our private model, and runs
            on

            our servers.
        partition_strategy:
          $ref: '#/components/schemas/PartitionStrategy'
          description: >-
            Strategy to partition the document before structured data
            extraction.

            The API will return one structured data object per partition. This
            is

            useful when you want to extract certain fields from every page.


            Options -


            * `none` (*default*) - no partitioning is applied.

            * `page` - partition the document into pages.

            * `section` - partition the document into sections.

            A section is defined as a group of text blocks that are visually

            separated from other text blocks by whitespace or other visual
            elements.

            * `fragment` - partition the document by fragments.

            A fragment is defined as a group of text blocks, images, tables,
            etc.

            that are visually grouped together.

            * `patterns` - partition the document by custom patterns.

            This requires providing start_patterns and end_patterns to define
            the

            custom patterns. Patterns are defined as strings specific to the

            document content. The start_patterns and end_patterns are used to

            identify the beginning and end of each partition.
        page_classes:
          type:
            - array
            - 'null'
          items:
            type: string
          description: |-
            Filter the pages of the document to be used for structured data
            extraction by providing a list of page classes.

            The default is `None`, which means all pages will be used.
        provide_citations:
          type:
            - boolean
            - 'null'
          description: |-
            Flag to enable visual citations in the structured data output.
            Returns the bounding boxes of the document regions from which
            the structured data was extracted.

            The default is `false`.
      additionalProperties: false
    PageClassConfig:
      type: object
      required:
        - name
        - description
      properties:
        name:
          type: string
          description: The name of the page class.
        description:
          type: string
          description: |-
            The description of the page class to guide the model to classify the
            pages. Describe what the model should look for in the page to
            classify it.
    EnrichmentOptions:
      type: object
      properties:
        table_cell_grounding:
          type: boolean
          description: |-
            Grounding of table cells, providing the bounding box of the cells.

            The default is `false`.
          default: false
        table_summarization:
          type: boolean
          description: |-
            Generate a summary for parsed tables.

            The default is `false`.
          default: false
        table_summarization_prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to guide the table summarization.
            Ignored if `table_summarization` is `false`.
            Default prompt - "Summarize the table in a concise manner."
          default: null
        figure_summarization:
          type: boolean
          description: |-
            Generate a summary for parsed figures.

            The default is `false`.
          default: false
        figure_summarization_prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to guide the figure summarization.
            Ignored if `figure_summarization` is `false`.
            Default prompt - "Summarize the figure in a concise manner."
          default: null
        chart_extraction:
          type: boolean
          description: >-
            Extraction of chart type and structured data series from images,
            delivered as clean JSON suitable for analytics and ingestion.


            The default is `false`.
          default: false
        key_value_extraction:
          type: boolean
          description: |-
            Extraction of key/value pairs from forms as JSON.

            The default is `false`.
          default: false
        include_full_page_image:
          type: boolean
          description: >-
            Use full page image in addition to the cropped table and figure
            images.

            This provides Language Models context about the table and figure
            they

            are summarizing in addition to the cropped images, and could improve
            the

            summarization quality.


            The default is `false`.
          default: false
      additionalProperties: false
    TableOutputMode:
      type: string
      enum:
        - html
        - markdown
    TableParsingFormat:
      type: string
      enum:
        - tsr
        - vlm
    ChunkingStrategy:
      type: string
      enum:
        - none
        - page
        - section
        - fragment
    PageFragmentType:
      type: string
      enum:
        - section_header
        - title
        - text
        - table
        - figure
        - chart
        - formula
        - form
        - key_value_region
        - document_index
        - list_item
        - table_caption
        - figure_caption
        - formula_caption
        - page_footer
        - page_header
        - page_number
        - signature
        - strikethrough
        - tracked_changes
        - comments
        - barcode
    OcrPipelineProvider:
      type: string
      enum:
        - model01
        - model02
        - model03
        - gemini3
        - model06
    Model:
      type: string
      enum:
        - tensorlake
        - gemini3
        - sonnet
        - gpt4o_mini
    PartitionStrategy:
      oneOf:
        - type: string
          title: none
          description: |-
            No partitioning is applied. The entire document is used for
            structured data extraction.
          enum:
            - none
        - type: string
          title: page
          description: |-
            Partition the document into pages. Each page is used for structured
            data extraction separately.
          enum:
            - page
        - type: string
          title: section
          description: |-
            Partition the document into sections. Each section is used for
            structured data extraction separately.

            A section is defined as a group of text blocks that are visually
            separated from other text blocks by whitespace or other visual
            elements.
          enum:
            - section
        - type: string
          title: fragment
          description: >-
            Partition the document by fragments. Each fragment is used for

            structured data extraction separately.


            A fragment is defined as a group of text blocks, images, tables,
            etc.

            that are visually grouped together.
          enum:
            - fragment
        - type: object
          title: patterns
          description: >-
            Partition the document by custom patterns. Each pattern match is
            used

            for structured data extraction separately.


            This requires providing start_patterns and end_patterns to define

            the custom patterns.


            Patterns are defined as strings specific to the document content.

            The start_patterns and end_patterns are used to identify the

            beginning and end of each partition.
          required:
            - patterns
          properties:
            patterns:
              type: object
              description: >-
                Partition the document by custom patterns. Each pattern match is
                used

                for structured data extraction separately.


                This requires providing start_patterns and end_patterns to
                define

                the custom patterns.


                Patterns are defined as strings specific to the document
                content.

                The start_patterns and end_patterns are used to identify the

                beginning and end of each partition.
              required:
                - start_patterns
                - end_patterns
              properties:
                start_patterns:
                  type: array
                  items:
                    type: string
                end_patterns:
                  type: array
                  items:
                    type: string
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer

````
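
The `ApiError` schema above allows `code` to be either a plain string (for example `INVALID_FILE`) or an object with a single `INVALID_QUERY_PARAMS` key. A client could normalize both shapes as in the following sketch; the helper name `error_code` and the sample payloads are hypothetical, not captured API output.

```python
from typing import Optional, Tuple


def error_code(error: dict) -> Tuple[str, Optional[dict]]:
    """Return (code name, detail) from a decoded ApiError body."""
    code = error["code"]
    if isinstance(code, str):
        # Most codes are plain strings such as "INVALID_FILE".
        return code, None
    # Object variant: one key (INVALID_QUERY_PARAMS) mapping to its detail.
    (name, detail), = code.items()
    return name, detail


# Illustrative payloads shaped like the ApiError schema.
simple_error = {
    "message": "File could not be read",
    "code": "INVALID_FILE",
    "timestamp": 1700000000000,
}
query_error = {
    "message": "Bad query parameter",
    "code": {"INVALID_QUERY_PARAMS": {"property": "limit", "message": None}},
    "timestamp": 1700000000000,
}
```

Branching on the returned name keeps error handling uniform regardless of which shape the server sends.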