# List Parse Jobs

Retrieve a list of all parse jobs that have been submitted. This endpoint allows you to see the status and metadata of each parse job.

The endpoint is paginated. A page has the following fields:

* `items`: An array of parse jobs, each containing the fields described below.
* `has_more`: A boolean indicating whether there are more parse jobs available beyond the current page.
* `next_cursor`: A base64-encoded cursor for the next page of results. If `has_more` is `false`, this field will be `null`.
* `prev_cursor`: A base64-encoded cursor for the previous page of results. If this is the first page, this field will be `null`.
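
To walk every page, a client can keep passing the previous page's `next_cursor` back as the `cursor` query parameter until `has_more` is `false`. The following is a minimal Python sketch of that loop, not an official client: `iter_parse_jobs` and the `fetch_page` callable are hypothetical names, and `fetch_page` is assumed to perform the HTTP request and return one decoded page.

```python
from typing import Any, Callable, Dict, Iterator, Optional

Page = Dict[str, Any]


def iter_parse_jobs(
    fetch_page: Callable[[Optional[str]], Page],
) -> Iterator[Dict[str, Any]]:
    """Yield parse jobs page by page, following `next_cursor`.

    `fetch_page` is a hypothetical callable (not part of any SDK): it takes a
    cursor, or None for the first page, and returns one decoded response page.
    """
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        # Stop when the API reports no further pages or gives no cursor.
        if not page["has_more"] or cursor is None:
            return


# Usage with two canned pages standing in for real responses.
canned = {
    None: {"items": [{"parse_id": "parse_a"}], "has_more": True,
           "next_cursor": "Y3Vyc29y", "prev_cursor": None},
    "Y3Vyc29y": {"items": [{"parse_id": "parse_b"}], "has_more": False,
                 "next_cursor": None, "prev_cursor": "cHJldg=="},
}
job_ids = [job["parse_id"] for job in iter_parse_jobs(lambda c: canned[c])]
```

Keeping the HTTP call behind `fetch_page` lets the loop be exercised with canned pages, as in the usage lines, before it is pointed at the real endpoint.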

Each parse job in `items` includes the following fields, among others:

* `parse_id`: The unique identifier for the parse job.
* `status`: The current status of the parse job (e.g., `pending`, `processing`, `successful`, `failure`).
* `created_at`: The RFC 3339 timestamp when the parse job was created.
* `finished_at`: The RFC 3339 timestamp when the parse job completed or failed. This is `null` while the job is still pending or in progress.
* `options`: The configuration options used for the parse job, such as the file ID, file URL, raw text, MIME type, and structured extraction options.
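
As an illustration of how these fields fit together, the sketch below computes how long a finished job took from `created_at` and `finished_at`. `parse_rfc3339` and `job_duration_seconds` are hypothetical helpers, and they assume timestamps shaped like the schema's example (`2023-10-01T12:00:00Z`); other RFC 3339 variants may need a dedicated parser on older Python versions.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_rfc3339(value: str) -> datetime:
    """Parse a timestamp such as '2023-10-01T12:00:00Z'.

    Assumption: a trailing 'Z' is rewritten to '+00:00' because
    datetime.fromisoformat only accepts 'Z' from Python 3.11 onward.
    """
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def job_duration_seconds(job: Dict[str, Any]) -> Optional[float]:
    """Return the job's run time in seconds, or None if it has not finished."""
    finished_at = job.get("finished_at")
    if finished_at is None:
        return None
    started = parse_rfc3339(job["created_at"])
    return (parse_rfc3339(finished_at) - started).total_seconds()


# Illustrative items; the IDs and timestamps are made up.
done = {"parse_id": "parse_abcd1234", "status": "successful",
        "created_at": "2023-10-01T12:00:00Z",
        "finished_at": "2023-10-01T12:00:30Z"}
running = {"parse_id": "parse_efgh5678", "status": "processing",
           "created_at": "2023-10-01T12:05:00Z", "finished_at": None}
```

Here `job_duration_seconds(done)` gives the elapsed seconds between the two timestamps, while `job_duration_seconds(running)` returns `None` because `finished_at` is not set yet.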

### Filters

You can filter the list of parse jobs by providing query parameters:

* `cursor`: A base64-encoded cursor for pagination. If not provided, the first page will be returned.
* `direction`: The direction of pagination. Can be `next` or `prev`. Defaults to `next`.
* `limit`: The maximum number of parse jobs to return per page. Defaults to 100, with a maximum of 1000.
* `filename`: Filter by the original filename of the file used for parsing. This is useful to find parse jobs related to a specific file.
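* `status`: Filter by the status of the parse job. Commonly `pending`, `processing`, `successful`, or `failure`; the `ParseStatus` schema in the OpenAPI definition below also lists intermediate stages such as `detecting_layout` and `extracting_data`.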
* `id`: Filter by the unique identifier of the parse job. This retrieves a specific parse job, but the [Get Parse Result](./get) endpoint is preferable for that purpose.
* `created_after`: Filter by the creation date of the parse job. Only parse jobs created after this date will be returned. The date should be in RFC 3339 format.
* `created_before`: Filter by the creation date of the parse job. Only parse jobs created before this date will be returned. The date should be in RFC 3339 format.
* `finished_after`: Filter by the completion date of the parse job. Only parse jobs completed after this date will be returned. The date should be in RFC 3339 format.
* `finished_before`: Filter by the completion date of the parse job. Only parse jobs completed before this date will be returned. The date should be in RFC 3339 format.
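
The filters combine as ordinary query parameters on `GET /documents/v2/parse`. The sketch below builds such a URL for a subset of the filters and sends it with a bearer token, as the OpenAPI definition on this page specifies. `rfc3339`, `build_list_url`, and `list_parse_jobs` are hypothetical helper names rather than part of any SDK, and the API key is a placeholder you would supply.

```python
import json
import urllib.request
from datetime import datetime, timezone
from typing import Any, Dict, Optional
from urllib.parse import urlencode

# Server URL and path taken from the OpenAPI definition on this page.
LIST_URL = "https://api.tensorlake.ai/documents/v2/parse"


def rfc3339(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_list_url(
    *,
    status: Optional[str] = None,
    filename: Optional[str] = None,
    created_after: Optional[datetime] = None,
    created_before: Optional[datetime] = None,
    limit: Optional[int] = None,
    cursor: Optional[str] = None,
) -> str:
    """Build the list URL, leaving out every filter that was not given."""
    params: Dict[str, Any] = {
        "status": status,
        "filename": filename,
        "created_after":
            rfc3339(created_after) if created_after is not None else None,
        "created_before":
            rfc3339(created_before) if created_before is not None else None,
        "limit": limit,
        "cursor": cursor,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{LIST_URL}?{query}" if query else LIST_URL


def list_parse_jobs(api_key: str, **filters: Any) -> Dict[str, Any]:
    """Fetch one page of parse jobs (performs a real HTTP request)."""
    request = urllib.request.Request(
        build_list_url(**filters),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Build (but do not send) a request for successful jobs since 1 Oct 2023.
url = build_list_url(
    status="successful",
    created_after=datetime(2023, 10, 1, 12, 0, tzinfo=timezone.utc),
    limit=50,
)
```

Dropping unset filters keeps the request equivalent to the defaults described above, and `urlencode` percent-encodes the colons in the timestamp so the value survives as a single query parameter.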


## OpenAPI

````yaml get /documents/v2/parse
openapi: 3.1.0
info:
  title: Tensorlake API
  description: >-
    Tensorlake Cloud APIs for Sandboxes, Document Ingestion, and Serverless
    Workflows
  license:
    name: ''
  version: 0.1.0
servers:
  - url: https://api.tensorlake.ai/
security:
  - bearerAuth: []
tags:
  - name: Tensorlake Cloud API
    description: >-
      Tensorlake Cloud APIs for Sandboxes, Document Ingestion, and Serverless
      Workflows
paths:
  /documents/v2/parse:
    get:
      tags:
        - parse
      operationId: list_parse
      parameters:
        - name: cursor
          in: query
          description: |-
            Optional cursor for pagination.

            This is a base64-encoded string representing a timestamp.
            It is used to paginate through the results.
          required: false
          schema:
            oneOf:
              - type: 'null'
              - $ref: '#/components/schemas/Cursor'
        - name: direction
          in: query
          description: |-
            The direction of pagination.

            This can be either `next` or `prev`.

            The default is `next`, which means the next page of results will be
            returned.
          required: false
          schema:
            oneOf:
              - type: 'null'
              - $ref: '#/components/schemas/PaginationDirection'
        - name: dataset_name
          in: query
          description: |-
            The name of the dataset to filter the results by.

            This is an optional parameter because not every parse operation is
            associated with a dataset.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: limit
          in: query
          description: |-
            The maximum number of results to return per page.

            The default is 100.
          required: false
          schema:
            type: integer
            minimum: 0
        - name: filename
          in: query
          description: |-
            The filename to filter the results by.

            This is an optional parameter that can be used to filter the results
            by the filename of the parsed document.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: status
          in: query
          description: |-
            The status of the parse operation to filter the results by.

            This is an optional parameter that can be used to filter the results
            by the status of the parse operation.

            The possible values are `pending`, `processing`, `failure`, and
            `successful`.
          required: false
          schema:
            oneOf:
              - type: 'null'
              - $ref: '#/components/schemas/ParseStatus'
        - name: id
          in: query
          description: The ID of the parse operation to filter the results by.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: created_after
          in: query
          description: |-
            The date and time after which the parse operation was created.

            The date should be in RFC3339 format.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: created_before
          in: query
          description: |-
            The date and time before which the parse operation was created.

            The date should be in RFC3339 format.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: finished_after
          in: query
          description: |-
            The date and time after which the parse operation was finished.

            The date should be in RFC3339 format.
          required: false
          schema:
            type:
              - string
              - 'null'
        - name: finished_before
          in: query
          description: |-
            The date and time before which the parse operation was finished.

            The date should be in RFC3339 format.
          required: false
          schema:
            type:
              - string
              - 'null'
      responses:
        '200':
          description: List of parse jobs
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/PaginatedResult_ParseResult'
        '401':
          description: Unauthorized. Invalid or missing credentials
        '403':
          description: Forbidden. You do not have permission to access this resource
        '422':
          description: Invalid query parameters
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiError'
        '500':
          description: Internal server error
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiError'
components:
  schemas:
    Cursor:
      type: string
    PaginationDirection:
      type: string
      enum:
        - next
        - prev
    ParseStatus:
      type: string
      enum:
        - pending
        - processing
        - detecting_layout
        - detected_layout
        - extracting_data
        - extracted_data
        - formatting_output
        - formatted_output
        - successful
        - failure
    PaginatedResult_ParseResult:
      type: object
      required:
        - items
        - has_more
      properties:
        items:
          type: array
          items:
            type: object
            required:
              - parse_id
              - status
              - created_at
            properties:
              parse_id:
                type: string
                description: |-
                  The unique identifier for the parse job

                  This is the same as the value returned from the `POST
                  /documents/v2/parse` endpoint.
                default: ''
                example: parse_abcd1234
              dataset_id:
                type:
                  - string
                  - 'null'
                description: >-
                  If the parse job was scheduled from a dataset, this field
                  contains the

                  dataset id.


                  This is the identifier used in URLs and API endpoints to refer
                  to the

                  dataset.
                default: null
              parsed_pages_count:
                type: integer
                description: >-
                  The number of pages that were parsed successfully.


                  This is the total number of pages that were successfully
                  parsed in the

                  document.
                default: 0
                example: 5
                minimum: 0
              total_pages:
                type:
                  - integer
                  - 'null'
                description: >-
                  The total number of pages in the document.


                  This is the total number of pages in the original document
                  that was

                  parsed.


                  This value is only populated once the parse job is completed

                  successfully.
                default: null
                minimum: 0
              status:
                oneOf:
                  - $ref: '#/components/schemas/ParseStatus'
                    description: >-
                      The status of the parse job.


                      This indicates whether the job is pending, in progress,
                      completed, or

                      failed.


                      This can be used to track the progress of the parse
                      operation.
                default: pending
              error:
                type:
                  - string
                  - 'null'
                description: |-
                  Error occurred during any part of the parse execution.

                  This is only populated if the parse operation failed.
                default: null
              pages:
                type:
                  - array
                  - 'null'
                items:
                  $ref: '#/components/schemas/Page'
                description: >-
                  List of pages parsed from the document.


                  Each page has a list of fragments, which are detected objects
                  such as

                  tables, text, figures, section headers, etc.


                  We also return the detected text, the structure of the table
                  (if it is a

                  table), and the bounding box of the object.
                default: null
              chunks:
                type: array
                items:
                  $ref: '#/components/schemas/Chunk'
                description: >-
                  Chunks of the document.


                  This is a vector of `Chunk` objects, each containing a chunk
                  of the

                  document.

                  The number of chunks depends on the chunking strategy used
                  during

                  parsing.
                default: []
              structured_data:
                type:
                  - array
                  - 'null'
                items:
                  $ref: '#/components/schemas/StructuredData'
                description: >-
                  Structured data extracted from the document.


                  The structured data is a map where the keys are the schema
                  names

                  provided in the parse request, and the values are

                  `StructuredData` objects containing the structured data
                  extracted from

                  the document.


                  The number of structured data objects depends on the partition
                  strategy

                  **None** - one structured data object for the entire document.

                  **Page** - one structured data object for each page.
                default: null
              page_classes:
                type:
                  - array
                  - 'null'
                items:
                  $ref: '#/components/schemas/PageClass'
                description: >-
                  Page classes extracted from the document.


                  This is a map where the keys are page class names provided in
                  the parse

                  request under the `page_classification_options` field,

                  and the values are vectors of page numbers (1-indexed) where
                  each page

                  class appears.


                  This is used to categorize pages in the document based on the

                  classify options provided.
                default: null
              pdf_base64:
                type:
                  - string
                  - 'null'
                description: |-
                  The raw content of the generated PDF, encoded in base64.

                  At the moment, this is only populated for DOCX files.
                  The PDF is generated from the original DOCX file.
                default: null
              tasks_completed_count:
                type:
                  - integer
                  - 'null'
                description: >-
                  The number of tasks that have been completed for the parse
                  job.


                  This is the number of tasks that have been successfully
                  processed in the

                  parse job.


                  It can be used to track the progress of the parse operation.
                default: null
                minimum: 0
              tasks_total_count:
                type:
                  - integer
                  - 'null'
                description: >-
                  The total number of tasks that are expected to be completed
                  for the

                  parse job.


                  This is the total number of tasks that are expected to be
                  processed in

                  the parse job.
                default: null
                minimum: 0
              created_at:
                type: string
                description: |-
                  The date and time when the parse job was created.

                  The date is in RFC 3339 format.

                  This can be used to track when the parse job was initiated.
                default: ''
                example: '2023-10-01T12:00:00Z'
              finished_at:
                type:
                  - string
                  - 'null'
                description: >-
                  The date and time when the parse job was finished.


                  The date is in RFC 3339 format.


                  This can be undefined if the parse job is still in progress or
                  pending.
                default: null
              labels:
                type: object
                description: >-
                  Labels associated with the parse job.


                  These are the key-value, or JSON, pairs submitted with the
                  parse

                  request.


                  This can be used to categorize or tag the parse job for easier

                  identification and filtering.


                  It can be undefined if no labels were provided in the request.
                default: {}
                additionalProperties: {}
                propertyNames:
                  type: string
              options:
                oneOf:
                  - type: 'null'
                  - $ref: '#/components/schemas/ParseRequestOptions'
                default: null
              usage:
                oneOf:
                  - type: 'null'
                  - $ref: '#/components/schemas/Usage'
                    description: >-
                      Resource usage associated with the parse job.


                      This includes details such as number of pages parsed,
                      tokens used for

                      OCR and extraction, etc.


                      Usage is only populated for successful jobs.


                      Billing is based on the resource usage.
                default: null
              message_update:
                type:
                  - string
                  - 'null'
                description: >-
                  Message update associated with the parse job.


                  This is used to provide progress update information about the
                  parse job.
                default: null
        has_more:
          type: boolean
        next_cursor:
          type:
            - string
            - 'null'
        prev_cursor:
          type:
            - string
            - 'null'
    ApiError:
      type: object
      required:
        - message
        - code
        - timestamp
      properties:
        message:
          type: string
          description: A human-readable error message
        code:
          $ref: '#/components/schemas/ApiErrorCode'
          description: The error code, which can be used to programmatically handle errors
        timestamp:
          type: integer
          format: int64
          description: Millis since Unix epoch; easy to parse in every language
        trace_id:
          type:
            - string
            - 'null'
          description: Optional request correlation-id for distributed tracing
        details:
          description: Optional field-level validation errors, etc.
    Page:
      type: object
      description: >-
        Entity representing a single page in the parsed document.


        Each page contains a list of fragments, which are detected objects such
        as

        tables, text, figures, section headers, etc.
      required:
        - page_number
      properties:
        page_number:
          type: integer
          description: 1-indexed page number in the document.
          minimum: 0
        page_fragments:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/PageFragment'
          description: |-
            Vector of text fragments extracted from the page.

            Each fragment represents a distinct section of text, such as titles,
            paragraphs, tables, figures, etc.
        dimensions:
          type:
            - array
            - 'null'
          items:
            type: integer
            format: int32
          description: >-
            Dimensions is a 2-element vector representing the width and height
            of

            the page in points.
        page_dimensions:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/PageDimensions'
              description: >-
                Dimensions of the page.


                This is only populated if the page dimensions could be
                determined.
        classification_reason:
          type:
            - string
            - 'null'
          description: >-
            If the page was classified into a specific class, this field
            contains

            the reason for the classification.
    Chunk:
      type: object
      required:
        - content
        - page_number
      properties:
        content:
          type: string
        page_number:
          type: integer
          minimum: 0
    StructuredData:
      type: object
      required:
        - data
        - page_numbers
      properties:
        data:
          description: |-
            The structured data extracted from the document.

            This is a JSON object containing the extracted data in the
            shape of the JSON schema provided in the parse request.
        page_numbers:
          $ref: '#/components/schemas/OneOrMany_usize'
          description: |-
            A list of page numbers (1-indexed) where the structured data was
            detected.

            The value may be a single page number or a vector of page numbers.
        schema_name:
          type:
            - string
            - 'null'
          description: >-
            The name of the schema provided in the structured extraction options
            of

            the parse request.


            This is used to identify the schema used for the structured data

            extraction.
    PageClass:
      type: object
      description: |-
        The classification result for a parse request that included
        `page_classification_options`.
      required:
        - page_class
        - page_numbers
      properties:
        page_class:
          type: string
          description: |-
            The name of the page class given in the parse request.

            This value should match one of the class names provided in the
            `page_classification_options` field of the parse request.
        page_numbers:
          type: array
          items:
            type: integer
            format: int32
          description: >-
            A list of page numbers (1-indexed) where the page class was
            detected.
        classification_reasons:
          type:
            - object
            - 'null'
          description: >-
            A map of reasons for classifying each page into this class.


            The keys are the page numbers (1-indexed) and the values are the
            reasons

            for classifying that page into this class.


            This field is optional and may be omitted if no reasons were
            provided

            during classification.
          additionalProperties:
            type: string
          propertyNames:
            type: integer
            format: int32
    ParseRequestOptions:
      type: object
      required:
        - job_type
        - configuration
      properties:
        file_id:
          type:
            - string
            - 'null'
          description: >-
            The tensorlake file ID.


            This is the ID of the file used for the parse job. It has
            `tensorlake_`

            prefix.


            It can be undefined if the parse operation was created with a
            `file_url`

            or `raw_text` field instead of a file ID.
        file_url:
          type:
            - string
            - 'null'
          description: >-
            The URL of the file used for the parse job.


            It can be undefined if the parse operation was created with a
            `file_id`

            or `raw_text` field instead of a file URL.
        raw_text:
          type:
            - string
            - 'null'
          description: >-
            The raw_text for the parse job.


            This is only populated if the parse operation was created with a

            `raw_text` field and the mime type is a text-based format (e.g.,

            plain text, HTML).


            It can be undefined if the parse operation was created with a
            `file_id`

            or `file_url` field instead of raw_text.
        file_name:
          type:
            - string
            - 'null'
          description: |-
            The name of the file used for the parse job.

            This is only populated if the parse operation was created with a
            `file_id`.
        file_labels:
          type: object
          description: |-
            Labels associated with the file used for the parse job.

            These are the key-value, or JSON, pairs submitted with the file
            upload.
          additionalProperties: {}
          propertyNames:
            type: string
        mime_type:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/MimeType'
              description: >-
                The mime type of the file used for the parse job.


                This can be undefined if the file has been removed since the
                parse job

                was created, or if the parse operation was created with a
                `file_url`

                field instead of a `file_id` or `raw_text`.
        trace_id:
          type:
            - string
            - 'null'
          description: |-
            The trace ID for the parse job.

            It can be undefined if the operation is still in pending state.

            This is used for debugging purposes.
        page_range:
          oneOf:
            - type: 'null'
            - $ref: '#/components/schemas/PageRange'
              description: >-
                The page range that was requested for parsing.


                This is the same as the value provided in the `pages` field of
                the

                request.


                It can be undefined if the parse operation was created without a

                specific page range, meaning the whole document was parsed.
        job_type:
          $ref: '#/components/schemas/JobType'
          description: >-
            The type of job that was created.


            This indicates whether the job was created via the Parse, Read,
            Extract,

            Classification, Legacy, or Dataset endpoint.
        configuration:
          $ref: '#/components/schemas/ParseConfiguration'
          description: >-
            The configuration used for the parse job.


            This is derived from the configuration settings submitted with the
            parse

            request.


            It can be used to understand how the parse job was configured, such
            as

            the parsing strategy, extraction methods, etc.


            Values not provided in the request will be set to their default
            values.
    Usage:
      type: object
      required:
        - pages_parsed
        - signature_detected_pages
        - strikethrough_detected_pages
        - ocr_input_tokens_used
        - ocr_output_tokens_used
        - extraction_input_tokens_used
        - extraction_output_tokens_used
        - summarization_input_tokens_used
        - summarization_output_tokens_used
      properties:
        pages_parsed:
          type: integer
          format: int32
          description: |-
            The number of pages that were parsed.

            This is the total number of pages that were parsed in the document.
        signature_detected_pages:
          type: integer
          format: int32
          description: >-
            The number of pages that had signatures detected.


            This is the total number of pages that had signatures detected in
            the

            document. All pages are counted, even if multiple signatures were

            detected on a single page, or if no signatures were detected on

            other pages.


            This is only applicable if `signature_detection` was enabled in the

            parse configuration.
        strikethrough_detected_pages:
          type: integer
          format: int32
          description: >-
            The number of pages that were processed with strikethrough

            detection.


            This is the total number of pages that were processed with
            strikethrough

            detection in the document. All pages are counted, even if no

            strikethroughs were detected on some pages.


            This is only applicable if `remove_strikethrough_lines` was enabled
            in

            the parse configuration.
        ocr_input_tokens_used:
          type: integer
          format: int32
          description: The number of input tokens used for OCR.
        ocr_output_tokens_used:
          type: integer
          format: int32
          description: The number of output tokens used for OCR.
        extraction_input_tokens_used:
          type: integer
          format: int32
          description: |-
            The number of input tokens used for structured extraction.

            This will include tokens used for each JSON schema in the
            `structured_extraction_options` field of the parse configuration.
        extraction_output_tokens_used:
          type: integer
          format: int32
          description: |-
            The number of output tokens used for structured extraction.

            This will include tokens used for each JSON schema in the
            `structured_extraction_options` field of the parse configuration.
        summarization_input_tokens_used:
          type: integer
          format: int32
          description: The number of input tokens used for figure summarization.
        summarization_output_tokens_used:
          type: integer
          format: int32
          description: The number of output tokens used for figure summarization.
    ApiErrorCode:
      oneOf:
        - type: string
          enum:
            - QUOTA_EXCEEDED
        - type: string
          enum:
            - INVALID_JSON_SCHEMA
        - type: string
          enum:
            - INVALID_CONFIGURATION
        - type: string
          enum:
            - INVALID_PAGE_CLASSIFICATION
        - type: string
          enum:
            - ENTITY_NOT_FOUND
        - type: string
          enum:
            - ENTITY_ALREADY_EXISTS
        - type: string
          enum:
            - INVALID_FILE
        - type: string
          enum:
            - INVALID_PAGE_RANGE
        - type: string
          enum:
            - INVALID_MIME_TYPE
        - type: string
          enum:
            - INVALID_DATASET_NAME
        - type: string
          enum:
            - INVALID_JOB_STATE
        - type: string
          enum:
            - INTERNAL_ERROR
        - type: string
          enum:
            - INVALID_MULTIPART
        - type: string
          enum:
            - MULTIPART_STREAM_END
        - type: string
          enum:
            - CLIENT_DISCONNECT
        - type: string
          enum:
            - INVALID_ID
        - type: object
          required:
            - INVALID_QUERY_PARAMS
          properties:
            INVALID_QUERY_PARAMS:
              type: object
              required:
                - property
              properties:
                property:
                  type: string
                message:
                  type:
                    - string
                    - 'null'
    PageFragment:
      type: object
      required:
        - fragment_type
        - content
      properties:
        fragment_type:
          $ref: '#/components/schemas/PageFragmentType'
        content: {}
        reading_order:
          type:
            - integer
            - 'null'
          format: int64
        bbox:
          type:
            - object
            - 'null'
          additionalProperties:
            type: number
            format: double
          propertyNames:
            type: string
    PageDimensions:
      type: object
      required:
        - width
        - height
      properties:
        width:
          type: integer
          format: int32
          description: Width of the page in points.
        height:
          type: integer
          format: int32
          description: Height of the page in points.
    OneOrMany_usize:
      oneOf:
        - type: integer
          minimum: 0
        - type: array
          items:
            type: integer
            minimum: 0
      description: Common objects used across multiple endpoints in the API.
    MimeType:
      type: string
      enum:
        - application/pdf
        - >-
          application/vnd.openxmlformats-officedocument.wordprocessingml.document
        - application/msword
        - >-
          application/vnd.openxmlformats-officedocument.presentationml.presentation
        - application/vnd.ms-powerpoint
        - application/vnd.apple.keynote
        - image/jpeg
        - image/tiff
        - text/plain
        - text/html
        - text/markdown
        - text/x-markdown
        - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
        - application/vnd.ms-excel.sheet.macroenabled.12
        - application/vnd.ms-excel
        - text/xml
        - text/csv
        - image/png
        - text/rtf
        - application/rtf
        - application/octet-stream
        - application/pkcs7-mime
        - application/x-pkcs7-mime
        - application/pkcs7-signature
    PageRange:
      oneOf:
        - type: array
          items:
            type: integer
            format: int32
            minimum: 0
          uniqueItems: true
        - type: string
    JobType:
      type: string
      enum:
        - parse
        - read
        - extract
        - classify
        - legacy
        - dataset
        - edit
    ParseConfiguration:
      type: object
      properties:
        parsing_options:
          $ref: '#/components/schemas/ParsingOptions'
          description: >-
            The properties of this object define the configuration for the
            document

            parsing process.


            Tensorlake provides sane defaults that work well for most

            documents, so this object is not required. However, every document

            is different, and you may want to customize the parsing process to

            better suit your needs.
        structured_extraction_options:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/StructuredExtractionOptions'
          description: >-
            The properties of this object define the configuration for
            structured

            data extraction.


            If this object is present, the API will perform structured data

            extraction on the document.
        page_classifications:
          type:
            - array
            - 'null'
          items:
            $ref: '#/components/schemas/PageClassConfig'
          description: |-
            The properties of this object define the configuration for page
            classification.

            If this object is present, the API will perform page classification
            on the document.
        enrichment_options:
          $ref: '#/components/schemas/EnrichmentOptions'
          description: >-
            The properties of this object help to extend the output of the
            document

            parsing process with additional information.


            This includes summarization of tables and figures, which can help to

            provide a more comprehensive understanding of the document.


            This object is not required, and the API will use default settings
            if it

            is not present.
    PageFragmentType:
      type: string
      enum:
        - section_header
        - title
        - text
        - table
        - figure
        - chart
        - formula
        - form
        - key_value_region
        - document_index
        - list_item
        - table_caption
        - figure_caption
        - formula_caption
        - page_footer
        - page_header
        - page_number
        - signature
        - strikethrough
        - tracked_changes
        - comments
        - barcode
    ParsingOptions:
      type: object
      properties:
        table_output_mode:
          oneOf:
            - $ref: '#/components/schemas/TableOutputMode'
              description: |-
                The format for the tables extracted from the document.

                `html` - tables are represented as HTML strings.
                `markdown` - tables are represented as Markdown strings.

                The default is `html`.
          default: html
        table_parsing_format:
          oneOf:
            - $ref: '#/components/schemas/TableParsingFormat'
              description: >-
                Determines which model the system uses to identify and extract
                tables

                from the document.


                `tsr` - identifies the structure of

                the table first, and then the cells of the tables. Better suited
                for

                dense, long or grid-like tables.

                `vlm` - uses a vision-language model (VLM) to identify and
                extract the cells of the

                tables. Better suited for tables with merged cells or irregular

                structures.


                The default is `tsr`.
          default: tsr
        chunking_strategy:
          oneOf:
            - $ref: '#/components/schemas/ChunkingStrategy'
              description: >-
                Determines how the document is chunked into smaller pieces.


                `none` - no chunking is applied.

                `page` - chunks the document into pages.

                `section` - chunks the document into sections.

                `fragment` - chunks the document by objects detected in the
                document.

                Every text block, image, table, etc. is considered a fragment.


                The default is `none`.
          default: none
        signature_detection:
          type: boolean
          description: |-
            Flag to enable the detection of signatures in the document.

            This flag incurs additional billing costs.

            The default is `false`.
          default: false
        remove_strikethrough_lines:
          type: boolean
          description: >-
            Flag to enable the detection, and removal, of strikethrough text in
            the

            document.


            This flag incurs additional billing costs.


            The default is `false`.
          default: false
        skew_detection:
          type: boolean
          description: |-
            Boolean flag to detect and correct skewed or rotated pages in the
            document.

            Setting this to `true` will increase the processing time of the
            document.

            The default is `false`.
          default: false
        disable_layout_detection:
          type: boolean
          description: |-
            Disable bounding box detection for the document. Leads to faster
            document parsing.

            The default is `false`.
          default: false
        ignore_sections:
          type: array
          items:
            $ref: '#/components/schemas/PageFragmentType'
          description: >-
            A set of page fragment types to ignore during parsing.


            This can be used to skip certain types of content that are not
            relevant

            for the parsing process, such as headers, footers, or other

            non-essential elements.


            The default is an empty set.
          default: []
          uniqueItems: true
        cross_page_header_detection:
          type: boolean
          description: |-
            Enable header-hierarchy detection across pages.

            When set to `true`, the parser will consider headers from different
            pages when determining the hierarchy of headers within a single
            page.

            The default is `false`.
          default: false
        include_images:
          type: boolean
          description: |-
            Embed images from the document in the Markdown output.

            The default is `false`.
          default: false
        barcode_detection:
          type: boolean
          description: |-
            Enable barcode reader for the document.

            The default is `false`.
          default: false
        merge_tables:
          type: boolean
          description: >-
            Enable table merging for the document.


            When set to `true`, adjacent tables that are part of the same
            logical table will be

            merged into a single table.


            The default is `false`.
          default: false
        ocr_model:
          oneOf:
            - $ref: '#/components/schemas/OcrPipelineProvider'
              description: |-
                The model to use for OCR (Optical Character Recognition).

                `model01` - fast, but may have lower accuracy on complex
                tables. Well suited to legal documents with footnotes.
                `model02` - slower, but may have higher accuracy on complex
                tables. Well suited to financial documents with merged cells.
                `model03` (default) - our most accurate model for business
                documents. This model can be deployed on dedicated hardware
                in your own datacenter.
                `gemini3` - uses the Google Gemini 3 API for OCR processing.
          default: model03
      additionalProperties: false
    StructuredExtractionOptions:
      type: object
      required:
        - schema_name
        - json_schema
      properties:
        schema_name:
          type: string
          description: >-
            The name of the schema. This is used to tag the structured data
            output

            with a name in the response.
        json_schema:
          description: >-
            The JSON schema to guide structured data extraction from the file.


            This schema should be a valid JSON schema that defines the structure
            of

            the data to be extracted.


            The API supports a subset of the JSON schema specification.


            This value must be provided if `structured_extraction_options` is
            present in the request.
        skip_ocr:
          type: boolean
          description: >-
            Boolean flag to skip converting the document blob to OCR text before

            structured data extraction.


            If set to `true`, the API will skip the OCR step and directly
            extract

            structured data from the document.


            The default is `false`.
        prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to use for structured data extraction.

            If not provided, the default prompt will be used.
        model_provider:
          $ref: '#/components/schemas/Model'
          description: >-
            The model provider to use for structured data extraction.


            The default is `tensorlake`, which uses our private model, and runs
            on

            our servers.
        partition_strategy:
          $ref: '#/components/schemas/PartitionStrategy'
          description: |-
            Strategy to partition the document before structured data
            extraction. The API will return one structured data object per
            partition. This is useful when you want to extract certain fields
            from every page.

            Options:

            * `none` (*default*) - no partitioning is applied.
            * `page` - partition the document into pages.
            * `section` - partition the document into sections.
              A section is defined as a group of text blocks that are visually
              separated from other text blocks by whitespace or other visual
              elements.
            * `fragment` - partition the document by fragments.
              A fragment is defined as a group of text blocks, images, tables,
              etc. that are visually grouped together.
            * `patterns` - partition the document by custom patterns.
              This requires providing `start_patterns` and `end_patterns` to
              define the custom patterns. Patterns are defined as strings
              specific to the document content. The `start_patterns` and
              `end_patterns` are used to identify the beginning and end of each
              partition.
        page_classes:
          type:
            - array
            - 'null'
          items:
            type: string
          description: |-
            Filter the pages of the document to be used for structured data
            extraction by providing a list of page classes.

            The default is `null`, which means all pages will be used.
        provide_citations:
          type:
            - boolean
            - 'null'
          description: |-
            Flag to enable visual citations in the structured data output.
            It returns the bounding boxes of the document regions from which
            the structured data was extracted.

            The default is `false`.
      additionalProperties: false
    PageClassConfig:
      type: object
      required:
        - name
        - description
      properties:
        name:
          type: string
          description: The name of the page class.
        description:
          type: string
          description: |-
            The description of the page class to guide the model to classify the
            pages. Describe what the model should look for in the page to
            classify it.
    EnrichmentOptions:
      type: object
      properties:
        table_cell_grounding:
          type: boolean
          description: |-
            Grounding of table cells, providing the bounding box of the cells.

            The default is `false`.
          default: false
        table_summarization:
          type: boolean
          description: |-
            Generate a summary for parsed tables.

            The default is `false`.
          default: false
        table_summarization_prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to guide the table summarization.
            Ignored if `table_summarization` is `false`.
            Default prompt - "Summarize the table in a concise manner."
          default: null
        figure_summarization:
          type: boolean
          description: |-
            Generate a summary for parsed figures.

            The default is `false`.
          default: false
        figure_summarization_prompt:
          type:
            - string
            - 'null'
          description: |-
            The prompt to guide the figure summarization.
            Ignored if `figure_summarization` is `false`.
            Default prompt - "Summarize the figure in a concise manner."
          default: null
        chart_extraction:
          type: boolean
          description: >-
            Extraction of chart type and structured data series from images,
            delivered as clean JSON suitable for analytics and ingestion.


            The default is `false`.
          default: false
        key_value_extraction:
          type: boolean
          description: |-
            Extraction of key/value pairs from forms as JSON.

            The default is `false`.
          default: false
        include_full_page_image:
          type: boolean
          description: >-
            Use the full page image in addition to the cropped table and figure
            images.

            This gives language models additional context about the table or
            figure they are summarizing, which could improve the summarization
            quality.


            The default is `false`.
          default: false
      additionalProperties: false
    TableOutputMode:
      type: string
      enum:
        - html
        - markdown
    TableParsingFormat:
      type: string
      enum:
        - tsr
        - vlm
    ChunkingStrategy:
      type: string
      enum:
        - none
        - page
        - section
        - fragment
    OcrPipelineProvider:
      type: string
      enum:
        - model01
        - model02
        - model03
        - gemini3
        - model06
    Model:
      type: string
      enum:
        - tensorlake
        - gemini3
        - sonnet
        - gpt4o_mini
    PartitionStrategy:
      oneOf:
        - type: string
          title: none
          description: |-
            No partitioning is applied. The entire document is used for
            structured data extraction.
          enum:
            - none
        - type: string
          title: page
          description: |-
            Partition the document into pages. Each page is used for structured
            data extraction separately.
          enum:
            - page
        - type: string
          title: section
          description: |-
            Partition the document into sections. Each section is used for
            structured data extraction separately.

            A section is defined as a group of text blocks that are visually
            separated from other text blocks by whitespace or other visual
            elements.
          enum:
            - section
        - type: string
          title: fragment
          description: >-
            Partition the document by fragments. Each fragment is used for

            structured data extraction separately.


            A fragment is defined as a group of text blocks, images, tables,
            etc.

            that are visually grouped together.
          enum:
            - fragment
        - type: object
          title: patterns
          description: >-
            Partition the document by custom patterns. Each pattern match is
            used

            for structured data extraction separately.


            This requires providing start_patterns and end_patterns to define

            the custom patterns.


            Patterns are defined as strings specific to the document content.

            The start_patterns and end_patterns are used to identify the

            beginning and end of each partition.
          required:
            - patterns
          properties:
            patterns:
              type: object
              description: >-
                Partition the document by custom patterns. Each pattern match is
                used

                for structured data extraction separately.


                This requires providing start_patterns and end_patterns to
                define

                the custom patterns.


                Patterns are defined as strings specific to the document
                content.

                The start_patterns and end_patterns are used to identify the

                beginning and end of each partition.
              required:
                - start_patterns
                - end_patterns
              properties:
                start_patterns:
                  type: array
                  items:
                    type: string
                end_patterns:
                  type: array
                  items:
                    type: string
  securitySchemes:
    bearerAuth:
      type: http
      scheme: bearer

````
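
The `ParseConfiguration`, `ParsingOptions`, and `PartitionStrategy` schemas above can be easier to read next to a concrete payload. The following is a minimal sketch, not an official client: the helper names (`patterns_partition`, `check_partition_strategy`, `build_parse_configuration`) and the sample schema and patterns are hypothetical, and only the field names and enum values are taken from the spec. It assembles a dictionary shaped like `ParseConfiguration` and rejects values that fall outside the documented enums.

```python
import json

# Enum values copied from the schema above. The helper names below are
# hypothetical and are not part of any Tensorlake SDK.
CHUNKING_STRATEGIES = {"none", "page", "section", "fragment"}
TABLE_OUTPUT_MODES = {"html", "markdown"}
SIMPLE_PARTITION_STRATEGIES = {"none", "page", "section", "fragment"}


def patterns_partition(start_patterns, end_patterns):
    """Build the object variant of PartitionStrategy (`patterns`)."""
    return {
        "patterns": {
            "start_patterns": list(start_patterns),
            "end_patterns": list(end_patterns),
        }
    }


def check_partition_strategy(strategy):
    """Accept a string variant or a well-formed `patterns` object."""
    if isinstance(strategy, str):
        if strategy not in SIMPLE_PARTITION_STRATEGIES:
            raise ValueError(f"unknown partition strategy: {strategy!r}")
        return
    patterns = strategy.get("patterns") if isinstance(strategy, dict) else None
    required = {"start_patterns", "end_patterns"}
    if not isinstance(patterns, dict) or not required <= set(patterns):
        raise ValueError("patterns strategy needs start_patterns and end_patterns")


def build_parse_configuration(chunking_strategy="none",
                              table_output_mode="html",
                              structured_extraction_options=None):
    """Assemble a dict shaped like the ParseConfiguration schema."""
    if chunking_strategy not in CHUNKING_STRATEGIES:
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}")
    if table_output_mode not in TABLE_OUTPUT_MODES:
        raise ValueError(f"unknown table output mode: {table_output_mode!r}")

    config = {
        "parsing_options": {
            "chunking_strategy": chunking_strategy,
            "table_output_mode": table_output_mode,
        }
    }
    if structured_extraction_options:
        for options in structured_extraction_options:
            # `schema_name` and `json_schema` are the two required fields.
            missing = {"schema_name", "json_schema"} - set(options)
            if missing:
                raise ValueError(f"missing required fields: {sorted(missing)}")
            if "partition_strategy" in options:
                check_partition_strategy(options["partition_strategy"])
        config["structured_extraction_options"] = list(structured_extraction_options)
    return config


# Illustrative values only: the schema and patterns are made up.
config = build_parse_configuration(
    chunking_strategy="page",
    structured_extraction_options=[{
        "schema_name": "invoice",
        "json_schema": {
            "type": "object",
            "properties": {"total": {"type": "number"}},
        },
        "partition_strategy": patterns_partition(["Invoice #"], ["Total due"]),
    }],
)
print(json.dumps(config, indent=2))
```

Note that the enum values are lowercase (`page`, not `Page`), so a capitalized value fails validation in this sketch.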