Get Parse Result

Retrieve the results of a previously submitted parse job. The response will include:

- Parsed content
  - Markdown (chunked if a chunking strategy is specified)
  - Pages
- Structured extraction results (if schemas are provided during the parse request)
- Page classification results (if page classifications are provided during the parse request)

Response Structure

When the job finishes successfully, the response will contain a JSON object with the following fields:

chunks

The chunks field contains an array of text chunks extracted from the document. Each chunk is an object with a property called content, which is the text content of the chunk. If a chunking strategy was specified during the parse request, the text will be chunked accordingly.
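
As a minimal sketch of reading this field, the snippet below joins the content of every chunk back into a single string. The `response` dictionary is a hand-written stand-in for a decoded JSON response, and `join_chunks` is a hypothetical helper name, not part of the API.

```python
# Hand-written sample shaped like the chunks field described above (not real API output).
response = {
    "chunks": [
        {"content": "# Invoice 12345", "page_number": 1},
        {"content": "Total due: 100.0", "page_number": 1},
    ]
}


def join_chunks(result: dict, separator: str = "\n\n") -> str:
    """Concatenate the content of every chunk, in the order the array lists them."""
    chunks = result.get("chunks") or []  # tolerate a missing or null field
    return separator.join(chunk["content"] for chunk in chunks)


full_text = join_chunks(response)
```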

structured_data

The structured_data field contains a JSON object with every schema_name you provided in the parse request as a key. Each entry under a key represents a structured data item extracted from the document, adhering to the specified schema. For example, if you provided the following schema for an invoice:

{
  "title": "Invoice",
  "type": "object",
  "properties": {
    "invoice_number": {
      "type": "string"
    },
    "date": {
      "type": "string",
      "format": "date"
    },
    "total_amount": {
      "type": "number"
    },
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": {
            "type": "string"
          },
          "quantity": {
            "type": "number"
          },
          "price": {
            "type": "number"
          }
        }
      }
    }
  }
}

The structured_data field will contain objects that match that schema, such as:

{
  "invoice_number": "12345",
  "date": "2023-10-01",
  "total_amount": 100.0,
  "items": [
    {
      "description": "Item 1",
      "quantity": 2,
      "price": 50.0
    }
  ]
}

If our models were unable to find any text in the document that complies with the schema you provided, the structured_data field will be null.
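
The following sketch shows one way to consume an extracted invoice of the shape shown above while guarding against the null case. It checks that the line items add up to total_amount. The helper name `invoice_total_matches` is hypothetical, and the sample data is the example invoice from this page.

```python
import math
from typing import Optional


def invoice_total_matches(invoice: Optional[dict]) -> Optional[bool]:
    """Return None when nothing was extracted, else whether the items sum to total_amount."""
    if invoice is None:
        return None  # the extraction found nothing that matched the schema
    line_sum = sum(item["quantity"] * item["price"] for item in invoice.get("items", []))
    return math.isclose(line_sum, invoice["total_amount"])


# The example invoice from the text above.
invoice = {
    "invoice_number": "12345",
    "date": "2023-10-01",
    "total_amount": 100.0,
    "items": [{"description": "Item 1", "quantity": 2, "price": 50.0}],
}

is_consistent = invoice_total_matches(invoice)
```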

Errors

If a parse job is marked as failure, the errors field will contain an object with details about the error.

Lifecycle of a parse operation

The status field will indicate the current state of the parse job. Possible values are:

pending: The job is waiting to be processed.
processing: The job is currently being processed.
successful: The job has been successfully completed and the results are available.
failure: The job has failed, and the errors field will contain details about the error.

You can access the structured_data, chunks, and pages fields only when the job is in the successful state.
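
A small guard along these lines can keep client code from reading result fields too early. This is a sketch only: `result_fields` is a hypothetical helper, and because the prose above names the field `errors` while the response schema lists `error`, the code checks both.

```python
def result_fields(response: dict) -> dict:
    """Return structured_data, chunks and pages, but only for a successful job."""
    status = response.get("status")
    if status == "failure":
        # Assumption: error details live under "error" or "errors", whichever is present.
        details = response.get("error") or response.get("errors")
        raise RuntimeError(f"parse job failed: {details}")
    if status != "successful":
        raise LookupError(f"results are not ready yet, status is {status!r}")
    return {name: response.get(name) for name in ("structured_data", "chunks", "pages")}
```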

Authorizations

Authorization

string

header

required

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.

Path Parameters

parse_id

string

required

The public ID of the parse job

Query Parameters

with_options

boolean
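
To show how the header, path parameter, and query parameter fit together, here is a sketch that only builds the request object and does not send it. The host `https://api.example.com`, the GET path under `/documents/v2/parse/`, and the `true`/`false` encoding of with_options are assumptions for illustration; this page does not state them.

```python
import urllib.parse
import urllib.request
from typing import Optional


def build_get_request(
    base_url: str, parse_id: str, token: str, with_options: Optional[bool] = None
) -> urllib.request.Request:
    """Build (but do not send) a GET request for a parse result."""
    # Assumed path: the POST endpoint path with the parse_id appended.
    path = "/documents/v2/parse/" + urllib.parse.quote(parse_id, safe="")
    url = base_url.rstrip("/") + path
    if with_options is not None:
        # Assumed encoding of the boolean query parameter.
        url += "?" + urllib.parse.urlencode({"with_options": str(with_options).lower()})
    headers = {"Authorization": f"Bearer {token}"}
    return urllib.request.Request(url, headers=headers, method="GET")


request = build_get_request(
    "https://api.example.com", "parse_abcd1234", "my-token", with_options=True
)
```

Passing the resulting object to `urllib.request.urlopen` would perform the call.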

Response

Parse result details (JSON) or progress stream (SSE)
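
When the response arrives as a progress stream, a client needs to split it into events. The sketch below implements only the generic Server-Sent Events framing (`data:` lines terminated by a blank line). What each event payload contains is not documented here, so the helper `iter_sse_data`, a hypothetical name, yields the raw data strings without interpreting them.

```python
from typing import Iterable, Iterator


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each complete event in an SSE line stream."""
    buffer = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if buffer:
                yield "\n".join(buffer)
                buffer = []
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # A single leading space after the colon is not part of the value.
            buffer.append(value[1:] if value.startswith(" ") else value)
        # Other fields (event:, id:, retry:) and ":" comment lines are ignored in this sketch.


events = list(iter_sse_data(["data: a\n", "\n", ": ping\n", "data: b\n", "data: c\n", "\n"]))
```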

parse_id

string

default:""

required

The unique identifier for the parse job

This is the same as the value returned from the POST /documents/v2/parse endpoint.

Example:

"parse_abcd1234"

status

enum<string>

default:pending

required

The status of the parse job.

This indicates whether the job is pending, in progress, completed, or failed.

This can be used to track the progress of the parse operation.

Available options:

pending,

processing,

detecting_layout,

detected_layout,

extracting_data,

extracted_data,

formatting_output,

formatted_output,

successful,

failure
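
One way to use these values is a polling loop that stops at a terminal status. The sketch assumes that only successful and failure are terminal and that every other listed value means the job is still running. `wait_for_parse` and its parameters are hypothetical, and `fetch` stands for any callable that returns the decoded response.

```python
import time
from typing import Callable

# Assumption: every status other than these two means the job is still in progress.
TERMINAL_STATUSES = {"successful", "failure"}


def wait_for_parse(
    fetch: Callable[[], dict],
    interval_s: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch() until the job reaches a terminal status, then return that response."""
    for _ in range(max_polls):
        response = fetch()
        if response["status"] in TERMINAL_STATUSES:
            return response
        sleep(interval_s)
    raise TimeoutError("parse job did not reach a terminal status in time")
```

Injecting `fetch` and `sleep` keeps the loop independent of any HTTP client and makes it easy to exercise with canned responses.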

created_at

string

default:""

required

The date and time when the parse job was created.

The date is in RFC 3339 format.

This can be used to track when the parse job was initiated.

Example:

"2023-10-01T12:00:00Z"

dataset_id

string | null

If the parse job was scheduled from a dataset, this field contains the dataset id.

This is the identifier used in URLs and API endpoints to refer to the dataset.

parsed_pages_count

integer

default:0

The number of pages that were parsed successfully.

This is the total number of pages that were successfully parsed in the document.

Required range: x >= 0

Example:

5

total_pages

integer | null

The total number of pages in the document.

This is the total number of pages in the original document that was parsed.

This value is only populated once the parse job is completed successfully.

Required range: x >= 0

error

string | null

Error occurred during any part of the parse execution.

This is only populated if the parse operation failed.

pages

object[] | null

List of pages parsed from the document.

Each page has a list of fragments, which are detected objects such as tables, text, figures, section headers, etc.

We also return the detected text, the structure of the table (if it's a table), and the bounding box of the object.

pages.page_number

integer

required

1-indexed page number in the document.

Required range: x >= 0

pages.page_fragments

object[] | null

Vector of text fragments extracted from the page.

Each fragment represents a distinct section of text, such as titles, paragraphs, tables, figures, etc.

pages.page_fragments.fragment_type

enum<string>

required

Available options:

section_header,

title,

text,

table,

figure,

formula,

form,

key_value_region,

document_index,

list_item,

table_caption,

figure_caption,

formula_caption,

page_footer,

page_header,

page_number,

signature,

strikethrough,

tracked_changes,

comments

pages.page_fragments.content

any

required

pages.page_fragments.reading_order

integer<int64> | null

pages.page_fragments.bbox

object

pages.page_fragments.bbox.{key}

number<double>
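
As an illustration of working with fragments, the sketch below sorts a page's fragments by reading_order and filters them by fragment_type. The sample page is hand-written, the string values used for content are an assumption (the field is typed as any), and both helper names are hypothetical.

```python
def fragments_in_reading_order(page: dict) -> list:
    """Sort a page's fragments by reading_order, placing fragments without one last."""
    fragments = page.get("page_fragments") or []
    # sorted() is stable, so fragments with a null reading_order keep their relative order.
    return sorted(
        fragments,
        key=lambda f: (f.get("reading_order") is None, f.get("reading_order") or 0),
    )


def fragments_of_type(pages: list, fragment_type: str) -> list:
    """Collect fragments of one type across all pages, in reading order per page."""
    return [
        fragment
        for page in pages or []
        for fragment in fragments_in_reading_order(page)
        if fragment["fragment_type"] == fragment_type
    ]


# Hand-written sample page (not real API output).
page = {
    "page_number": 1,
    "page_fragments": [
        {"fragment_type": "text", "content": "Body", "reading_order": 2},
        {"fragment_type": "page_footer", "content": "1", "reading_order": None},
        {"fragment_type": "title", "content": "Invoice", "reading_order": 1},
    ],
}

ordered_types = [f["fragment_type"] for f in fragments_in_reading_order(page)]
```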

pages.dimensions

integer<int32>[] | null

Dimensions is a 2-element vector representing the width and height of the page in points.

pages.page_dimensions

object

Dimensions of the page.

This is only populated if the page dimensions could be determined.

pages.page_dimensions.width

integer<int32>

required

Width of the page in points.

pages.page_dimensions.height

integer<int32>

required

Height of the page in points.
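
Because a page can carry its size in either pages.page_dimensions or the two-element pages.dimensions array, a reader may want a single accessor. The sketch below prefers the object form and falls back to the array. That ordering is a choice made for this example, and `page_size` is a hypothetical helper.

```python
from typing import Optional, Tuple


def page_size(page: dict) -> Optional[Tuple[int, int]]:
    """Return (width, height) in points, or None when neither field is populated."""
    dims = page.get("page_dimensions")
    if dims:
        return dims["width"], dims["height"]
    pair = page.get("dimensions")
    if pair and len(pair) == 2:
        return pair[0], pair[1]  # documented order: width, then height
    return None
```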

pages.classification_reason

string | null

If the page was classified into a specific class, this field contains the reason for the classification.

chunks

object[]

Chunks of the document.

This is a vector of Chunk objects, each containing a chunk of the document. The number of chunks depends on the chunking strategy used during parsing.

chunks.content

string

required

chunks.page_number

integer

required

Required range: x >= 0
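
Since every chunk carries a page_number, chunks can be regrouped per page. A brief sketch, with `chunks_by_page` as a hypothetical helper and hand-written sample data:

```python
from collections import defaultdict


def chunks_by_page(chunks: list) -> dict:
    """Map each page_number to the list of chunk contents found on that page."""
    grouped = defaultdict(list)
    for chunk in chunks or []:
        grouped[chunk["page_number"]].append(chunk["content"])
    return dict(grouped)


# Hand-written sample (not real API output).
sample_chunks = [
    {"content": "Intro", "page_number": 1},
    {"content": "Line items", "page_number": 2},
    {"content": "Terms", "page_number": 2},
]

per_page = chunks_by_page(sample_chunks)
```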

structured_data

object[] | null

Structured data extracted from the document.

The structured data is a map where the keys are the schema names provided in the parse request, and the values are StructuredData objects containing the structured data extracted from the document.

The number of structured data objects depends on the partition strategy: with None, there is one structured data object for the entire document; with Page, there is one structured data object for each page.

structured_data.data

any

required

The structured data extracted from the document.

This is a JSON object containing the extracted data in the shape of the JSON schema provided in the parse request.

structured_data.page_numbers

required

A list of page numbers (1-indexed) where the structured data was detected.

The value may be a single page number or a vector of page numbers.

Required range: x >= 0

structured_data.schema_name

string | null

The name of the schema provided in the structured extraction options of the parse request.

This is used to identify the schema used for the structured data extraction.
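
The sketch below groups structured data items by schema_name and normalizes page_numbers, which may be a single number or a list. It assumes the items arrive as a flat list of objects with the three attributes listed above, and both helper names are hypothetical.

```python
def normalize_pages(value) -> list:
    """Turn a single page number or a list of page numbers into a list."""
    return [value] if isinstance(value, int) else list(value)


def group_by_schema(items: list) -> dict:
    """Map each schema_name to a list of (page_numbers, data) pairs."""
    grouped = {}
    for item in items or []:
        entry = (normalize_pages(item["page_numbers"]), item["data"])
        grouped.setdefault(item.get("schema_name"), []).append(entry)
    return grouped


# Hand-written sample (not real API output).
sample_items = [
    {"schema_name": "Invoice", "page_numbers": 1, "data": {"invoice_number": "12345"}},
    {"schema_name": "Invoice", "page_numbers": [2, 3], "data": {"invoice_number": "12346"}},
]

by_schema = group_by_schema(sample_items)
```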

page_classes

object[] | null

Page classes extracted from the document.

This is a map where the keys are page class names provided in the parse request under the page_classification_options field, and the values are vectors of page numbers (1-indexed) where each page class appears.

This is used to categorize pages in the document based on the classify options provided.

page_classes.page_class

string

required

The name of the page class given in the parse request.

This value should match one of the class names provided in the page_classification_options field of the parse request.

page_classes.page_numbers

integer<int32>[]

required

A list of page numbers (1-indexed) where the page class was detected.

page_classes.classification_reasons

object

A map of reasons for classifying each page into this class.

The keys are the page numbers (1-indexed) and the values are the reasons for classifying that page into this class.

This field is optional and may be omitted if no reasons were provided during classification.

page_classes.classification_reasons.{key}

string
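
To answer the question "which classes does page N belong to?", the page_classes entries can be inverted into a per-page index. This sketch assumes the list-of-objects shape given by the child attributes above and looks up reasons by the string form of the page number, since JSON object keys are strings. `classes_by_page` is a hypothetical helper.

```python
def classes_by_page(page_classes: list) -> dict:
    """Map each page number to a list of (page_class, reason) pairs."""
    index = {}
    for entry in page_classes or []:
        reasons = entry.get("classification_reasons") or {}
        for page in entry["page_numbers"]:
            # JSON object keys are strings, so the reason is looked up by str(page).
            index.setdefault(page, []).append((entry["page_class"], reasons.get(str(page))))
    return index


# Hand-written sample (not real API output).
sample_classes = [
    {
        "page_class": "invoice",
        "page_numbers": [1, 2],
        "classification_reasons": {"1": "Contains an invoice number"},
    },
    {"page_class": "terms", "page_numbers": [2]},
]

page_index = classes_by_page(sample_classes)
```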

pdf_base64

string | null

The raw content of generated PDF, encoded in base64.

At the moment, this is only populated for DOCX files. The PDF is generated from the original DOCX file.
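
Decoding the field back into PDF bytes needs only the standard library. A minimal sketch, with `decode_generated_pdf` as a hypothetical helper:

```python
import base64
from typing import Optional


def decode_generated_pdf(response: dict) -> Optional[bytes]:
    """Return the generated PDF as raw bytes, or None when pdf_base64 is absent or null."""
    encoded = response.get("pdf_base64")
    if not encoded:
        return None
    return base64.b64decode(encoded)
```

The returned bytes can then be written to disk, for example with `pathlib.Path("out.pdf").write_bytes(...)`.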

finished_at

string | null

The date and time when the parse job was finished.

The date is in RFC 3339 format.

This can be undefined if the parse job is still in progress or pending.
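
Together with created_at, this timestamp gives the job's wall-clock duration. The sketch below handles the common `Z` suffix by hand so that it also runs on Python versions older than 3.11. It is not a complete RFC 3339 parser, and both helper names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def parse_rfc3339(value: str) -> datetime:
    """Parse a timestamp such as 2023-10-01T12:00:00Z into an aware datetime."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11 onward.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def job_duration(response: dict) -> Optional[timedelta]:
    """Return finished_at minus created_at, or None while the job has not finished."""
    finished = response.get("finished_at")
    if not finished:
        return None
    return parse_rfc3339(finished) - parse_rfc3339(response["created_at"])
```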

labels

object

Labels associated with the parse job.

These are the key-value (JSON) pairs submitted with the parse request.

This can be used to categorize or tag the parse job for easier identification and filtering.

It can be undefined if no labels were provided in the request.

labels.{key}

any

usage

object

Resource usage associated with the parse job.

This includes details such as number of pages parsed, tokens used for OCR and extraction, etc.

Usage is only populated for successful jobs.

Billing is based on the resource usage.

usage.pages_parsed

integer<int32>

required

The number of pages that were parsed.

This is the total number of pages that were parsed in the document.

usage.signature_detected_pages

integer<int32>

required

The number of pages that had signatures detected.

This is the total number of pages that had signatures detected in the document. All pages are counted, even if multiple signatures were detected on a single page, or if no signatures were detected on other pages.

This is only applicable if signature_detection was enabled in the parse configuration.

usage.strikethrough_detected_pages

integer<int32>

required

The number of pages that were processed with strikethrough detection.

This is the total number of pages that were processed with strikethrough detection in the document. All pages are counted, even if no strikethroughs were detected on some pages.

This is only applicable if remove_strikethrough_lines was enabled in the parse configuration.

usage.ocr_input_tokens_used

integer<int32>

required

The number of input tokens used for OCR.

usage.ocr_output_tokens_used

integer<int32>

required

The number of output tokens used for OCR.

usage.extraction_input_tokens_used

integer<int32>

required

The number of input tokens used for structured extraction.

This will include tokens used for each JSON schema in the structured_extraction_options field of the parse configuration.

usage.extraction_output_tokens_used

integer<int32>

required

The number of output tokens used for structured extraction.

This will include tokens used for each JSON schema in the structured_extraction_options field of the parse configuration.

usage.summarization_input_tokens_used

integer<int32>

required

The number of input tokens used for figure summarization.

usage.summarization_output_tokens_used

integer<int32>

required

The number of output tokens used for figure summarization.
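
For a quick overview of a job's consumption, the token counters can be summed by direction. The sketch relies only on the `_input_tokens_used` and `_output_tokens_used` suffixes of the field names listed above. `token_totals` is a hypothetical helper, the sample numbers are invented for illustration, and the result is not a billing calculation.

```python
def token_totals(usage: dict) -> dict:
    """Sum the input and output token counters across OCR, extraction and summarization."""
    input_total = sum(v for k, v in usage.items() if k.endswith("_input_tokens_used"))
    output_total = sum(v for k, v in usage.items() if k.endswith("_output_tokens_used"))
    return {"input": input_total, "output": output_total}


# Invented sample values (not real API output).
sample_usage = {
    "pages_parsed": 5,
    "signature_detected_pages": 0,
    "strikethrough_detected_pages": 0,
    "ocr_input_tokens_used": 1000,
    "ocr_output_tokens_used": 200,
    "extraction_input_tokens_used": 300,
    "extraction_output_tokens_used": 50,
    "summarization_input_tokens_used": 0,
    "summarization_output_tokens_used": 0,
}

totals = token_totals(sample_usage)
```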

message_update

string | null

Message update associated with the parse job.

This is used to provide progress update information about the parse job.
