The Read API converts documents to Markdown and provides the spatial layout of each page. The Read API response contains:
  • A Markdown representation of each page, with elements ordered in their natural reading order
  • Tables encoded as Markdown or HTML
  • Summaries of tables and figures, guided by custom prompts
  • Bounding boxes for each page element (e.g. signature, key-value pair, figure)
Read the Overview to understand how to integrate Document Parsing into your existing workflows.

API Usage Guide

Calling the read endpoint creates a new document parsing job, which starts in the pending state. The job transitions to the processing state, and then to the successful state once the document has been parsed.
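Because a job moves through these states asynchronously, client code usually has to wait for it to finish. The following is a minimal polling sketch, not the SDK's own mechanism. `fetch_status` is a hypothetical callable that you would wire to whichever client call returns the job state, and the lowercase state strings are assumed to match the names used above.

```python
import time
from typing import Callable

# States named in the text above; the exact string spelling is an assumption.
IN_PROGRESS_STATES = {"pending", "processing"}


def wait_for_parse(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll until the job leaves the pending/processing states.

    `fetch_status` is a hypothetical callable returning the job's current
    state as a string. Any state other than pending or processing is
    treated as terminal and returned to the caller.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status not in IN_PROGRESS_STATES:
            return status  # e.g. "successful"
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(poll_interval)
```

Treating every state outside pending and processing as terminal keeps the loop from spinning forever if the service reports a state this page does not describe.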
If you are using the Python SDK, the parsing configuration options described below are expressed through the ParsingOptions class.
from tensorlake.documentai import (
  DocumentAI,
  ParsingOptions,
  ChunkingStrategy,
  TableOutputMode,
  TableParsingFormat,
)

doc_ai = DocumentAI(api_key="xxxx")
file_id = "file_xxxx"

parsing_options = ParsingOptions(
    chunking_strategy=ChunkingStrategy.FRAGMENT,
    table_output_mode=TableOutputMode.MARKDOWN
)

parse_id = doc_ai.read(file_id=file_id, page_range="1-2", parsing_options=parsing_options)

Options for Parsing Documents

Document Parsing can be customized by providing parsing_options and enrichment_options in your request.
| Parameter | Description |
| --- | --- |
| parsing_options | Customizes the OCR and table parsing process and chunking strategies. See Parsing Options. |
| enrichment_options | Enables and configures table and figure summarization. See Summarization. |
For a full list of configuration options, see the /parse section of the API reference.

Parsing Options

| Parameter | Description | Default Value |
| --- | --- | --- |
| chunking_strategy | Choose one of four chunking strategies, such as ChunkingStrategy.FRAGMENT in the Python SDK. | None |
| table_output_mode | Choose between Markdown and HTML. | HTML |
| table_parsing_format | Choose between two table parsing formats. | TSR |
| disable_layout_detection | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images. | false |
| skew_detection | Detect and correct skewed or rotated pages. This can increase processing time. | false |
| signature_detection | Detect signatures in the document. This can increase processing time and incurs additional costs. | false |
| remove_strikethrough_lines | Remove strikethrough lines from the document. This can increase processing time and incurs additional costs. | false |
| ignore_sections | A set of document fragments to ignore during parsing. Useful for excluding irrelevant sections from the output. | [] |
| cross_page_header_detection | Boolean flag to enable header hierarchy detection across pages. This can improve the accuracy of header extraction in multi-page documents. | false |
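When building a request by hand, it can help to start from the documented defaults and override only what you need. The sketch below mirrors the defaults in the table above as a plain dictionary. The helper name `build_parsing_options` is hypothetical, and the spelling of enum-like values on the wire (for example "HTML" versus "html") is an assumption, so check the API reference before sending this to the service.

```python
from typing import Any, Dict

# Defaults copied from the Parsing Options table. The string spelling of
# enum-like values is an assumption of this sketch, not a confirmed format.
PARSING_DEFAULTS: Dict[str, Any] = {
    "chunking_strategy": None,
    "table_output_mode": "HTML",
    "table_parsing_format": "TSR",
    "disable_layout_detection": False,
    "skew_detection": False,
    "signature_detection": False,
    "remove_strikethrough_lines": False,
    "ignore_sections": [],
    "cross_page_header_detection": False,
}


def build_parsing_options(**overrides: Any) -> Dict[str, Any]:
    """Return the documented defaults with the given overrides applied."""
    unknown = set(overrides) - set(PARSING_DEFAULTS)
    if unknown:
        # Catch typos such as `skew_detect` before they reach the service.
        raise ValueError(f"unknown parsing options: {sorted(unknown)}")
    options = {**PARSING_DEFAULTS, **overrides}
    # Copy the list so callers never mutate the shared default.
    options["ignore_sections"] = list(options["ignore_sections"])
    return options
```

For example, `build_parsing_options(skew_detection=True)` returns the full set of options with only skew detection switched on.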

Retrieve Output

The parsed document output can be retrieved using the /parse/{parse_id} endpoint, or using the get_parsed_result SDK function.
result = doc_ai.get_parsed_result(parse_id)

Markdown Chunks

A common next step after parsing a document is to work with its Markdown chunks.
# Print the page number and Markdown content of every chunk
for chunk in result.chunks:
    print(f"## Page Number: {chunk.page_number}\n")
    print(f"## Content: {chunk.content}\n")

See Parse Output for more details about the output.
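If you need one Markdown string per page rather than individual chunks, the chunks can be grouped by page number. This is a small sketch that relies only on the `page_number` and `content` attributes used in the loop above. The `Chunk` dataclass is a stand-in defined here so the example runs on its own, not the SDK's actual type.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Chunk:
    """Stand-in for an SDK chunk with the two fields used above."""
    page_number: int
    content: str


def markdown_by_page(chunks: Iterable[Chunk]) -> Dict[int, str]:
    """Join chunk contents per page, keeping chunk order within a page."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.page_number].append(chunk.content)
    # Sort by page number and separate chunks with a blank line.
    return {page: "\n\n".join(parts) for page, parts in sorted(pages.items())}
```

With the SDK result from the previous section, this would be called as `markdown_by_page(result.chunks)`, assuming the chunk objects expose those two attributes.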

Table and Figure Summarization

The Document Ingestion API can be used to summarize tables and figures in documents.
| Parameter | Description | Default Value |
| --- | --- | --- |
| table_summarization | Enable summarization of tables present in the document. This will generate a summary of the table content, including key insights and trends. | false |
| figure_summarization | Enable summarization of figures present in the document. This will generate a summary of the figure content, including key insights and trends. | false |
| table_summarization_prompt | A custom prompt to use for table summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
| figure_summarization_prompt | A custom prompt to use for figure summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used. | - |
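For REST callers, these parameters can be assembled into the enrichment_options JSON object as sketched below. The key names come from the table above. The helper `build_enrichment_options` is hypothetical, and rejecting a prompt whose summarization flag is off is a design choice of this sketch rather than documented API behavior.

```python
import json
from typing import Any, Dict, Optional


def build_enrichment_options(
    table_summarization: bool = False,
    figure_summarization: bool = False,
    table_summarization_prompt: Optional[str] = None,
    figure_summarization_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an enrichment_options object, omitting unset prompts.

    Leaving a prompt out lets the service fall back to its default prompt.
    """
    options: Dict[str, Any] = {
        "table_summarization": table_summarization,
        "figure_summarization": figure_summarization,
    }
    if table_summarization_prompt is not None:
        if not table_summarization:
            raise ValueError("table prompt given but table_summarization is off")
        options["table_summarization_prompt"] = table_summarization_prompt
    if figure_summarization_prompt is not None:
        if not figure_summarization:
            raise ValueError("figure prompt given but figure_summarization is off")
        options["figure_summarization_prompt"] = figure_summarization_prompt
    return options


# Serialize for use in a request body.
body_fragment = json.dumps(
    build_enrichment_options(
        table_summarization=True,
        table_summarization_prompt="Summarize the table in a concise manner.",
    )
)
```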

Tables

Tables can be summarized by setting table_summarization to true in the enrichment_options JSON object when calling the parse API.
from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models.options import (
    EnrichmentOptions,
)

enrichment_options = EnrichmentOptions(
    table_summarization=True,
    table_summarization_prompt="Summarize the table in a concise manner.",
)

doc_ai = DocumentAI(api_key="xxxx")  # Replace with your API key

parse_id = doc_ai.read(
    file_id="file_XXX",  # Replace with your file ID or URL
    enrichment_options=enrichment_options,
)

Figures

Figures can be summarized by setting figure_summarization to true in the enrichment_options JSON object when calling the parse API.
from tensorlake.documentai import (
    DocumentAI,
    EnrichmentOptions,
)

doc_ai = DocumentAI(api_key="xxxx")  # Replace with your API key

enrichment_options = EnrichmentOptions(
    figure_summarization=True,
    figure_summarization_prompt="Summarize the figure in a way that is easy to understand and use for answering questions.",
)

parse_id = doc_ai.read(
    file_id="file_XXX",  # Replace with your file ID or URL
    enrichment_options=enrichment_options,
)