curl --request GET \
--url https://api.tensorlake.ai/documents/v2/datasets/{dataset_id}/data \
--header 'Authorization: Bearer <token>'

{
"items": [
{
"parse_id": "parse_abcd1234",
"status": "pending",
"created_at": "2023-10-01T12:00:00Z",
"dataset_id": null,
"parsed_pages_count": 5,
"total_pages": null,
"error": null,
"pages": null,
"chunks": [],
"structured_data": null,
"page_classes": null,
"pdf_base64": null,
"tasks_completed_count": null,
"tasks_total_count": null,
"finished_at": null,
"labels": {},
"options": "<unknown>",
"usage": "<unknown>",
"message_update": null
}
],
"has_more": true,
"next_cursor": "<string>",
"prev_cursor": "<string>"
}

Bearer authentication header of the form Bearer <token>, where <token> is your auth token.
The id of the dataset to retrieve data for
Optional cursor for pagination.
This is a base64-encoded string representing a timestamp. It is used to paginate through the results.
The direction of pagination.
This can be either next or prev.
The default is next, which means the next page of results will be
returned.
Available options: next, prev
The maximum number of results to return per page.
The default is 100.
Required range: x >= 0
The status of the parse operation to filter the results by.
This is an optional parameter that can be used to filter the results by the status of the parse operation.
The possible values are listed below.
Available options: pending, processing, detecting_layout, detected_layout, extracting_data, extracted_data, formatting_output, formatted_output, successful, failure
The ID of the parse operation to filter the results by.
This is an optional parameter that can be used to filter the results by the ID of the parse operation.
Prefer using /documents/v2/parse/{parse_id} endpoint to get the details of a specific parse operation instead of filtering by parse_id.
The name of the file to filter the results by.
This is an optional parameter that can be used to filter the results by the name of the file associated with the parse operation.
The date and time after which the parse operation was created.
The date should be in RFC3339 format.
The date and time before which the parse operation was created.
The date should be in RFC3339 format.
The date and time after which the parse operation was finished.
The date should be in RFC3339 format.
The date and time before which the parse operation was finished.
The date should be in RFC3339 format.
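The query parameters described above can be combined into a single request URL. The following is a minimal sketch using only the Python standard library. The parameter names `cursor`, `direction`, `limit`, `status`, and `created_after` are assumptions, since this page lists only the parameter descriptions; `build_list_url` and `to_rfc3339` are hypothetical helpers, not part of any Tensorlake SDK.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

BASE_URL = "https://api.tensorlake.ai/documents/v2"


def to_rfc3339(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    if moment.tzinfo is None:
        raise ValueError("datetime must be timezone-aware")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_list_url(dataset_id, cursor=None, direction="next", limit=100,
                   status=None, created_after=None):
    """Build the dataset data URL with optional pagination and filters."""
    # ASSUMPTION: the query parameter names below are illustrative only.
    if direction not in ("next", "prev"):
        raise ValueError("direction must be 'next' or 'prev'")
    if limit < 0:
        raise ValueError("limit must be >= 0")
    params = {"direction": direction, "limit": limit}
    if cursor is not None:
        params["cursor"] = cursor
    if status is not None:
        params["status"] = status
    if created_after is not None:
        params["created_after"] = to_rfc3339(created_after)
    return f"{BASE_URL}/datasets/{dataset_id}/data?{urlencode(params)}"
```

Timestamps are normalized to UTC before formatting, so a datetime in any time zone produces a valid RFC 3339 value.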
List of dataset jobs retrieved successfully
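The response carries `items`, `has_more`, and `next_cursor`, which is enough to walk through every page. The sketch below shows one way to do that in Python. `iter_parse_jobs` is a hypothetical helper, and `fake_fetch` stands in for a real HTTP call so the example runs without network access; how the cursor is sent back to the API is left to the caller.

```python
from typing import Callable, Iterator, Optional


def iter_parse_jobs(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield every item across pages, following next_cursor while has_more is true."""
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("items", [])
        cursor = page.get("next_cursor")
        if not page.get("has_more") or not cursor:
            break


# Stand-in for the HTTP request: maps a cursor to a canned response page.
_FAKE_PAGES = {
    None: {"items": [{"parse_id": "parse_a"}], "has_more": True, "next_cursor": "c1"},
    "c1": {"items": [{"parse_id": "parse_b"}], "has_more": False, "next_cursor": None},
}


def fake_fetch(cursor):
    return _FAKE_PAGES[cursor]


parse_ids = [job["parse_id"] for job in iter_parse_jobs(fake_fetch)]
```

The loop also stops when `next_cursor` is missing, which guards against requesting the first page again if a response reports `has_more` without a cursor.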
The unique identifier for the parse job
This is the same as the value returned from the POST /documents/v2/parse endpoint.
"parse_abcd1234"
The status of the parse job.
This indicates whether the job is pending, in progress, completed, or failed.
This can be used to track the progress of the parse operation.
Available options: pending, processing, detecting_layout, detected_layout, extracting_data, extracted_data, formatting_output, formatted_output, successful, failure
The date and time when the parse job was created.
The date is in RFC 3339 format.
This can be used to track when the parse job was initiated.
"2023-10-01T12:00:00Z"
If the parse job was scheduled from a dataset, this field contains the dataset id.
This is the identifier used in URLs and API endpoints to refer to the dataset.
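The status field described above can be used to separate finished jobs from those still in flight. The sketch below assumes that `successful` and `failure` are the only terminal statuses; the page lists the status values but does not say which ones are final, so treat that set as an assumption. `is_finished` and `split_by_completion` are hypothetical helpers.

```python
PARSE_STATUSES = (
    "pending", "processing", "detecting_layout", "detected_layout",
    "extracting_data", "extracted_data", "formatting_output",
    "formatted_output", "successful", "failure",
)

# ASSUMPTION: only these two statuses mean the job will not change any further.
TERMINAL_STATUSES = frozenset({"successful", "failure"})


def is_finished(status: str) -> bool:
    """Return True when the status is one of the assumed terminal statuses."""
    if status not in PARSE_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return status in TERMINAL_STATUSES


def split_by_completion(items):
    """Split response items into (finished, in_flight) lists by their status."""
    finished, in_flight = [], []
    for item in items:
        (finished if is_finished(item["status"]) else in_flight).append(item)
    return finished, in_flight
```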
The number of pages that were parsed successfully.
This is the total number of pages that were successfully parsed in the document.
Required range: x >= 0
5
The total number of pages in the document.
This is the total number of pages in the original document that was parsed.
This value is only populated once the parse job is completed successfully.
Required range: x >= 0
Error occurred during any part of the parse execution.
This is only populated if the parse operation failed.
List of pages parsed from the document.
Each page has a list of fragments, which are detected objects such as tables, text, figures, section headers, etc.
We also return the detected text, the structure of the table (if it is a table), and the bounding box of the object.
1-indexed page number in the document.
Required range: x >= 0
Vector of text fragments extracted from the page.
Each fragment represents a distinct section of text, such as titles, paragraphs, tables, figures, etc.
Available options: section_header, title, text, table, figure, formula, form, key_value_region, document_index, list_item, table_caption, figure_caption, formula_caption, page_footer, page_header, page_number, signature, strikethrough, tracked_changes, comments, barcode
Dimensions is a 2-element vector representing the width and height of the page in points.
If the page was classified into a specific class, this field contains the reason for the classification.
Structured data extracted from the document.
The structured data is a map where the keys are the schema names
provided in the parse request, and the values are
StructuredData objects containing the structured data extracted from
the document.
The number of structured data objects depends on the partition strategy:
None - one structured data object for the entire document.
Page - one structured data object for each page.
The structured data extracted from the document.
This is a JSON object containing the extracted data in the shape of the JSON schema provided in the parse request.
A list of page numbers (1-indexed) where the structured data was detected.
The value may be a single page number or a vector of page numbers.
Required range: x >= 0
The name of the schema provided in the structured extraction options of the parse request.
This is used to identify the schema used for the structured data extraction.
Page classes extracted from the document.
This is a map where the keys are page class names provided in the parse
request under the page_classification_options field,
and the values are vectors of page numbers (1-indexed) where each page
class appears.
This is used to categorize pages in the document based on the classify options provided.
The name of the page class given in the parse request.
This value should match one of the class names provided in the
page_classification_options field of the parse request.
A list of page numbers (1-indexed) where the page class was detected.
A map of reasons for classifying each page into this class.
The keys are the page numbers (1-indexed) and the values are the reasons for classifying that page into this class.
This field is optional and may be omitted if no reasons were provided during classification.
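Because page classes are keyed by class name, it is often convenient to flip them into a per-page view. The sketch below assumes the map shape described above, `{class_name: [page_numbers]}`; the exact JSON layout is not shown on this page, so adjust the accessor if the real payload nests these values differently. `classes_by_page` is a hypothetical helper.

```python
from collections import defaultdict


def classes_by_page(page_classes):
    """Invert {class_name: [page_numbers]} into {page_number: [class_names]}."""
    by_page = defaultdict(list)
    # page_classes may be null in the response, so treat None as an empty map.
    for class_name, page_numbers in (page_classes or {}).items():
        for page_number in page_numbers:
            by_page[page_number].append(class_name)
    return {page: sorted(names) for page, names in sorted(by_page.items())}
```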
The raw content of generated PDF, encoded in base64.
At the moment, this is only populated for DOCX files. The PDF is generated from the original DOCX file.
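Since `pdf_base64` holds base64 text rather than raw bytes, it has to be decoded before it can be written to disk. A minimal sketch using the Python standard library follows; `decode_pdf` is a hypothetical helper that operates on one entry of `items`.

```python
import base64


def decode_pdf(item):
    """Return the generated PDF as bytes, or None when pdf_base64 is not populated."""
    encoded = item.get("pdf_base64")
    if encoded is None:
        return None
    return base64.b64decode(encoded)
```

The returned bytes can be written with `open(path, "wb")` when a file on disk is needed.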
The number of tasks that have been completed for the parse job.
This is the number of tasks that have been successfully processed in the parse job.
It can be used to track the progress of the parse operation.
Required range: x >= 0
The total number of tasks that are expected to be completed for the parse job.
This is the total number of tasks that are expected to be processed in the parse job.
Required range: x >= 0
The date and time when the parse job was finished.
The date is in RFC 3339 format.
This can be undefined if the parse job is still in progress or pending.
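The two task counters above can be turned into a progress value. The sketch below is one way to do that; both fields may be null (as in the sample response), so the hypothetical `progress_fraction` helper returns `None` rather than guessing when either counter is missing or the total is zero.

```python
def progress_fraction(item):
    """Return completed/total as a float in [0, 1], or None when it cannot be computed."""
    completed = item.get("tasks_completed_count")
    total = item.get("tasks_total_count")
    if completed is None or not total:
        return None
    # Clamp in case the counters are briefly inconsistent.
    return min(completed / total, 1.0)
```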
Labels associated with the parse job.
These are the key-value (JSON) pairs submitted with the parse request.
This can be used to categorize or tag the parse job for easier identification and filtering.
It can be undefined if no labels were provided in the request.
The type of job that was created.
This indicates whether the job was created via the Parse, Read, Extract, Classification, Legacy, or Dataset endpoint.
Available options: parse, read, extract, classify, legacy, dataset
The configuration used for the parse job.
This is derived from the configuration settings submitted with the parse request.
It can be used to understand how the parse job was configured, such as the parsing strategy, extraction methods, etc.
Values not provided in the request will be set to their default values.
The properties of this object define the configuration for the document parsing process.
Tensorlake provides sane defaults that work well for most documents, so this object is not required. However, every document is different, and you may want to customize the parsing process to better suit your needs.
The format for the tables extracted from the document.
HTML - tables are represented as HTML strings.
Markdown - tables are represented as Markdown strings.
The default is HTML.
Available options: html, markdown
Determines which model the system uses to identify and extract tables from the document.
tsr - identifies the structure of
the table first, and then the cells of the tables. Better suited for
dense, long or grid-like tables.
vlm - uses a VLM model to identify and extract the cells of the
tables. Better suited for tables with merged cells or irregular
structures.
The default is tsr.
Available options: tsr, vlm
Determines how the document is chunked into smaller pieces.
None - no chunking is applied.
Page - chunks the document into pages.
Section - chunks the document into sections.
Fragment - chunks the document by objects detected in the document.
Every text block, image, table, etc. is considered a fragment.
The default is None.
Available options: none, page, section, fragment
Flag to enable the detection of signatures in the document.
This flag incurs additional billing costs.
The default is false.
Flag to enable the detection, and removal, of strikethrough text in the document.
This flag incurs additional billing costs.
The default is false.
Boolean flag to detect and correct skewed or rotated pages in the document.
Setting this to true will increase the processing time of the
document.
The default is false.
Disable bounding box detection for the document. Leads to faster document parsing.
The default is false.
A set of page fragment types to ignore during parsing.
This can be used to skip certain types of content that are not relevant for the parsing process, such as headers, footers, or other non-essential elements.
The default is an empty set.
Available options: section_header, title, text, table, figure, formula, form, key_value_region, document_index, list_item, table_caption, figure_caption, formula_caption, page_footer, page_header, page_number, signature, strikethrough, tracked_changes, comments, barcode
Enable header-hierarchy detection across pages.
When set to true, the parser will consider headers from different
pages when determining the hierarchy of headers within a single
page.
The default is false.
Embed images from the document in the Markdown output.
The default is false.
Enable barcode reader for the document.
The default is false.
The model to use for OCR (Optical Character Recognition).
model01 (default) - Fast, but may have lower accuracy on complex tables. Good for legal documents with footnotes.
model02 - Slower, but may have higher accuracy on complex tables. Good for financial documents with merged cells.
model03 - A compact model that we deliver to on-premise users. It takes about 2 minutes to start up on Tensorlake's Cloud because it is meant for testing by users who will eventually deploy this model on dedicated hardware in their own datacenter.
Available options: model01, model02, model03, gemini3, model06
The properties of this object define the configuration for structured data extraction.
If this object is present, the API will perform structured data extraction on the document.
The name of the schema. This is used to tag the structured data output with a name in the response.
The JSON schema to guide structured data extraction from the file.
This schema should be a valid JSON schema that defines the structure of the data to be extracted.
The API supports a subset of the JSON schema specification.
This value must be provided if structured_extraction is present in the
request.
Boolean flag to skip converting the document blob to OCR text before structured data extraction.
If set to true, the API will skip the OCR step and directly extract
structured data from the document.
The default is false.
The prompt to use for structured data extraction.
If not provided, the default prompt will be used.
The model provider to use for structured data extraction.
The default is tensorlake, which uses our private model, and runs on
our servers.
Available options: tensorlake, gemini3, sonnet, gpt4o_mini
Strategy to partition the document before structured data extraction. The API will return one structured data object per partition. This is useful when you want to extract certain fields from every page.
Options:
None (default) - no partitioning is applied.
Page - partition the document into pages.
Section - partition the document into sections. A section is defined as a group of text blocks that are visually separated from other text blocks by whitespace or other visual elements.
Fragment - partition the document by fragments. A fragment is defined as a group of text blocks, images, tables, etc. that are visually grouped together.
Patterns - partition the document by custom patterns. This requires providing start_patterns and end_patterns to define the custom patterns. Patterns are defined as strings specific to the document content. The start_patterns and end_patterns are used to identify the beginning and end of each partition.
Available options: none
Filter the pages of the document to be used for structured data extraction by providing a list of page classes.
The default is None, which means all pages will be used.
Flag to enable visual citations in the structured data output. It returns the bounding boxes of the document regions from which the structured data was extracted.
The default is false.
The properties of this object define the configuration for page classification.
If this object is present, the API will perform page classification on the document.
The name of the page class.
The description of the page class to guide the model to classify the pages. Describe what the model should look for in the page to classify it.
The properties of this object help to extend the output of the document parsing process with additional information.
This includes summarization of tables and figures, which can help to provide a more comprehensive understanding of the document.
This object is not required, and the API will use default settings if it is not present.
Generate a summary for parsed tables.
The default is false.
The prompt to guide the table summarization.
Ignored if table_summarization is false.
Default prompt - "Summarize the table in a concise manner."
Generate a summary for parsed figures.
The default is false.
The prompt to guide the figure summarization.
Ignored if figure_summarization is false.
Default prompt - "Summarize the figure in a concise manner."
Use full page image in addition to the cropped table and figure images. This provides Language Models context about the table and figure they are summarizing in addition to the cropped images, and could improve the summarization quality.
The default is false.
The tensorlake file ID.
This is the ID of the file used for the parse job. It has the tensorlake_ prefix.
It can be undefined if the parse operation was created with a file_url
or raw_text field instead of a file ID.
The URL of the file used for the parse job.
It can be undefined if the parse operation was created with a file_id
or raw_text field instead of a file URL.
The raw_text for the parse job.
This is only populated if the parse operation was created with a
raw_text field and the MIME type is a text-based format (e.g.,
plain text, HTML).
It can be undefined if the parse operation was created with a file_id
or file_url field instead of raw_text.
The name of the file used for the parse job.
This is only populated if the parse operation was created with a
file_id.
The mime type of the file used for the parse job.
This can be undefined if the file has been removed since the parse job
was created, or if the parse operation was created with a file_url
field instead of a file_id or raw_text.
Available options: application/pdf, application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/msword, application/vnd.openxmlformats-officedocument.presentationml.presentation, application/vnd.ms-powerpoint, application/vnd.apple.keynote, image/jpeg, image/tiff, text/plain, text/html, text/markdown, text/x-markdown, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel.sheet.macroenabled.12, application/vnd.ms-excel, text/xml, text/csv, image/png, text/rtf, application/rtf, application/octet-stream, application/pkcs7-mime, application/x-pkcs7-mime, application/pkcs7-signature
The trace ID for the parse job.
It can be undefined if the operation is still in pending state.
This is used for debugging purposes.
The page range that was requested for parsing.
This is the same as the value provided in the pages field of the
request.
It can be undefined if the parse operation was created without a specific page range, meaning the whole document was parsed.
Required range: x >= 0
Resource usage associated with the parse job.
This includes details such as number of pages parsed, tokens used for OCR and extraction, etc.
Usage is only populated for successful jobs.
Billing is based on the resource usage.
The number of pages that were parsed.
This is the total number of pages that were parsed in the document.
The number of pages that had signatures detected.
This is the total number of pages that had signatures detected in the document. All pages are counted, even if multiple signatures were detected on a single page, or if no signatures were detected on other pages.
This is only applicable if signature_detection was enabled in the
parse configuration.
The number of pages that were processed with strikethrough detection.
This is the total number of pages that were processed with strikethrough detection in the document. All pages are counted, even if no strikethroughs were detected on some pages.
This is only applicable if remove_strikethrough_lines was enabled in
the parse configuration.
The number of input tokens used for OCR.
The number of output tokens used for OCR.
The number of input tokens used for structured extraction.
This will include tokens used for each JSON schema in the
structured_extraction_options field of the parse configuration.
The number of output tokens used for structured extraction.
This will include tokens used for each JSON schema in the
structured_extraction_options field of the parse configuration.
The number of input tokens used for figure summarization.
The number of output tokens used for figure summarization.
Message update associated with the parse job.
This is used to provide progress update information about the parse job.