# Read Documents

> Convert documents to Markdown with spatial page layouts — tables, figures, bounding boxes, and reading-order fragments returned by the Read API.

The Read API converts documents to Markdown and returns the spatial layout of each page.

The response of the Read API contains:

* Markdown representation of pages, with page elements ordered by their natural reading order
* Tables encoded as Markdown or HTML
* Summaries of tables and figures, guided by custom prompts
* Bounding boxes for each page element (e.g. signature, key-value pair, figure)

<Info>
  Read the [Overview](/document-ingestion/overview) to understand how to
  integrate Document Parsing into your existing workflows.
</Info>

## API Usage Guide

Calling the [read](/api-reference/v2/parse/read) endpoint creates a new document parsing job in the `pending` state. The job transitions to the `processing`
state and then to the `successful` state once the document has been parsed.

<Tabs>
  <Tab title="Python SDK">
    If you are using the Python SDK, the configuration options described below are expressed through
    the `ParsingOptions` class.

    ```python theme={null}
    from tensorlake.documentai import (
      DocumentAI,
      ParsingOptions,
      ChunkingStrategy,
      TableOutputMode,
    )

    doc_ai = DocumentAI(api_key="xxxx")
    file_id = "file_xxxx"

    parsing_options = ParsingOptions(
        chunking_strategy=ChunkingStrategy.FRAGMENT,
        table_output_mode=TableOutputMode.MARKDOWN
    )

    parse_id = doc_ai.read(file_id=file_id, page_range="1-2", parsing_options=parsing_options)
    ```
  </Tab>

  <Tab title="REST API">
    The HTTP API for parsing is documented in the [API reference](/api-reference/v2/parse/parse). The following example initiates a parsing job:

    ```bash theme={null}
    curl --request POST \
      --url https://api.tensorlake.ai/documents/v2/parse \
      --header 'Authorization: Bearer <token>' \
      --header 'Content-Type: application/json' \
      --data '{
        "file_id": "<string>",
        "page_range": "1-2",
        "parsing_options": {
          "chunking_strategy": "fragment",
          "table_output_mode": "markdown"
        }
      }'
    ```
  </Tab>
</Tabs>
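
The job lifecycle described above can be waited on with a simple polling loop. The following is a minimal sketch, not part of the SDK: `wait_for_job` and its `fetch_status` argument are hypothetical names, and the loop treats any state other than `pending` or `processing` as terminal because this page only names the `successful` end state.

```python
import time
from typing import Callable

# States named in this guide that mean the job is still running.
IN_PROGRESS_STATES = {"pending", "processing"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> str:
    """Poll `fetch_status` until the job leaves the in-progress states.

    `fetch_status` is a hypothetical callable that returns the job's current
    state as a string; wire it to the SDK or REST call you use.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status not in IN_PROGRESS_STATES:
            # "successful", or whatever other terminal state the API reports.
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} seconds")
        time.sleep(poll_interval)


# Usage with a stand-in status source that replays the documented lifecycle.
states = iter(["pending", "processing", "successful"])
final_state = wait_for_job(lambda: next(states), poll_interval=0.0)
print(final_state)
```

Checking the returned state before reading the output keeps failure handling in one place.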

## Options for Parsing Documents

Document Parsing can be customized by providing the `parsing_options` and `enrichment_options` in your request.

| Parameter            | Description                                                                                                                                    |
| -------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- |
| `parsing_options`    | Customizes the OCR and table parsing process and chunking strategies. See [Parsing Options](/document-ingestion/parsing/read#parsing-options). |
| `enrichment_options` | Enables and configures table and figure summarization. See [Summarization](/document-ingestion/parsing/read#table-and-figure-summarization).   |

<Note>
  See the [`/parse` section of the API reference](/api-reference/v2/parse/parse)
  for the full list of configuration options.
</Note>

### Parsing Options

| Parameter                     | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                           | Default Value |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `chunking_strategy`           | Choose between <Tooltip tip="None means we don't apply any chunking, the whole document is returned as a single markdown document.">None</Tooltip>, <Tooltip tip="Chunks documents across page boundaries">Page</Tooltip>, <Tooltip tip="Chunks Documents by Sections">Section</Tooltip>, or <Tooltip tip="Every element is considered a chunk">Fragment</Tooltip>.                                                                                                   | `None`        |
| `table_output_mode`           | Choose between Markdown, <Tooltip tip="HTML is a more robust format for encoding tables from documents as text, because they preserve the structure of the table better when tables contain merged cells or complex headers.">HTML</Tooltip>.                                                                                                                                                                                                                         | `HTML`        |
| `ocr_model`                   | Choose between `model01`, `model02`, `model03`, and `gemini3`                                                                                                                                                                                                                                                                                                                                                                                                         | `model03`     |
| `disable_layout_detection`    | Boolean flag to skip layout detection and directly extract text. Useful for documents with many tables or images.                                                                                                                                                                                                                                                                                                                                                     | `false`       |
| `skew_detection`              | Detect and correct skewed or rotated pages. Please note this can increase the processing time.                                                                                                                                                                                                                                                                                                                                                                        | `false`       |
| `signature_detection`         | Detect signatures in the document. Please note this can increase the processing time, and incurs additional costs.                                                                                                                                                                                                                                                                                                                                                    | `false`       |
| `remove_strikethrough_lines`  | Remove strikethrough lines from the document. Please note this can increase the processing time, and incurs additional costs.                                                                                                                                                                                                                                                                                                                                         | `false`       |
| `ignore_sections`             | A set of document fragments to ignore during parsing. This can be useful for excluding irrelevant sections from the output. Potential values include: `section_header`, `title`, `text`, `table`, `figure`, `chart`, `formula`, `form`, `key_value_region`, `document_index`, `list_item`, `table_caption`, `figure_caption`, `formula_caption`, `page_footer`, `page_header`, `page_number`, `signature`, `strikethrough`, `tracked_changes`, `comments`, `barcode`. | `[]`          |
| `cross_page_header_detection` | A boolean flag to enable header hierarchy detection across pages. This can improve the accuracy of header extraction in multi-page documents.                                                                                                                                                                                                                                                                                                                         | `false`       |
| `barcode_detection`           | A boolean flag to enable barcode detection and reading across pages. This is currently supported only with `model03` OCR model.                                                                                                                                                                                                                                                                                                                                       | `false`       |
| `merge_tables`                | A boolean flag to enable the merging of adjacent tables that are part of the same logical table.                                                                                                                                                                                                                                                                                                                                                                      | `false`       |
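
When building a REST request by hand, it can help to validate `ignore_sections` before sending it. The sketch below is illustrative only: `build_parsing_options` is a hypothetical helper, not an SDK function, and the string values `"fragment"` and `"markdown"` are taken from the REST example earlier on this page.

```python
import json

# Fragment types listed for `ignore_sections` in the table above.
IGNORABLE_FRAGMENTS = {
    "section_header", "title", "text", "table", "figure", "chart", "formula",
    "form", "key_value_region", "document_index", "list_item", "table_caption",
    "figure_caption", "formula_caption", "page_footer", "page_header",
    "page_number", "signature", "strikethrough", "tracked_changes", "comments",
    "barcode",
}


def build_parsing_options(ignore_sections=(), **flags):
    """Return a `parsing_options` dict, rejecting unknown fragment types."""
    unknown = sorted(set(ignore_sections) - IGNORABLE_FRAGMENTS)
    if unknown:
        raise ValueError(f"unknown fragment types in ignore_sections: {unknown}")
    options = dict(flags)
    if ignore_sections:
        # Deduplicate and sort so the request body is deterministic.
        options["ignore_sections"] = sorted(set(ignore_sections))
    return options


parsing_options = build_parsing_options(
    ignore_sections=["page_header", "page_footer", "page_number"],
    chunking_strategy="fragment",
    table_output_mode="markdown",
    merge_tables=True,
)
print(json.dumps({"parsing_options": parsing_options}, indent=2))
```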

## OCR Models

Tensorlake has a few different OCR models, with different strengths and weaknesses. We recommend experimenting with the models on your documents and using the best model for your use case.

1. `model03` - Our best model in terms of accuracy for business documents. It has the ability to read and describe complex tables and figures. Supports large scale ingestion of documents.
2. `model01` - Fast but could have lower accuracy on complex tables.
3. `model02` - Slower but could have higher accuracy on complex tables.
4. `gemini3` - Uses Google's Gemini3 for OCR processing.

A key difference between the models is bounding box support: `model01` and `model02` provide bounding boxes for table cells, while `model03` does not.
`gemini3` doesn't provide any bounding boxes.
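
The model constraints mentioned on this page can be encoded as a small pre-flight check. This is a sketch under the assumption that only the capabilities stated here matter; `check_model_choice` is a hypothetical helper, and capabilities the page does not mention are left out rather than guessed.

```python
# Capabilities stated in this guide.
TABLE_CELL_BBOX_MODELS = {"model01", "model02"}  # provide table cell bounding boxes
BARCODE_MODELS = {"model03"}  # `barcode_detection` is supported only here


def check_model_choice(ocr_model, need_table_cell_bboxes=False, barcode_detection=False):
    """Return a list of human-readable conflicts; an empty list means no known conflict."""
    problems = []
    if need_table_cell_bboxes and ocr_model not in TABLE_CELL_BBOX_MODELS:
        problems.append(f"{ocr_model} does not provide table cell bounding boxes")
    if barcode_detection and ocr_model not in BARCODE_MODELS:
        problems.append(f"barcode_detection is not supported with {ocr_model}")
    return problems


print(check_model_choice("model03", need_table_cell_bboxes=True))
print(check_model_choice("model03", barcode_detection=True))
```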

## Retrieve Output

The parsed document output can be retrieved using the [`/parse/{parse_id}`](/api-reference/v2/parse/get) endpoint, or using the `get_parsed_result` SDK function.

<CodeGroup>
  ```python Python SDK theme={null}
  result = doc_ai.get_parsed_result(parse_id)
  ```

  ```bash REST API theme={null}
  curl -X GET "https://api.tensorlake.ai/documents/v2/parse/parse_XXX" \
  -H "Authorization: Bearer <your-api-key>" \
  -H "Content-Type: application/json"
  ```
</CodeGroup>
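
The REST call above can also be prepared from Python without the SDK. The following is a minimal sketch using only the standard library; `build_get_parse_request` is a hypothetical helper name, and the URL and `Authorization` header are copied from the `curl` example. The request is built but not sent.

```python
import urllib.request

BASE_URL = "https://api.tensorlake.ai/documents/v2/parse"


def build_get_parse_request(parse_id: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the GET request for a parse job's output."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{parse_id}",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


request = build_get_parse_request("parse_XXX", "<your-api-key>")
print(request.get_method(), request.full_url)

# To send it, pass the request to urllib.request.urlopen(...) and decode
# the response body as JSON.
```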

## Markdown Chunks

A common next step after parsing is to iterate over the Markdown chunks in the result.

<CodeGroup>
  ```python Python SDK theme={null}
  for chunk in result.chunks:
      print(f"## Page Number: {chunk.page_number}\n")
      print(f"## Content: {chunk.content}\n")
  ```

  ```json JSON theme={null}
  {
    ...
    "chunks": [
      {
        "content": "....",
        "page_number": 0
      },
      {
        "content": "....",
        "page_number": 1
      },
      ...
    ],
    ...
  }
  ```
</CodeGroup>

<Tip>
  See [Parse Output](/document-ingestion/parsing/parse-output) for more details
  about the output.
</Tip>
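
When working with the raw JSON response rather than SDK objects, chunks can be regrouped by page. This sketch assumes only the `chunks`, `content`, and `page_number` fields shown in the JSON above; `chunks_by_page` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import defaultdict


def chunks_by_page(result: dict) -> dict:
    """Join the Markdown content of all chunks that share a page number."""
    pages = defaultdict(list)
    for chunk in result.get("chunks", []):
        pages[chunk["page_number"]].append(chunk["content"])
    # Sort by page number so pages come back in document order.
    return {page: "\n\n".join(parts) for page, parts in sorted(pages.items())}


# Illustrative payload shaped like the JSON example above.
sample = {
    "chunks": [
        {"content": "# Title", "page_number": 0},
        {"content": "Intro paragraph.", "page_number": 0},
        {"content": "## Details", "page_number": 1},
    ]
}

for page_number, markdown in chunks_by_page(sample).items():
    print(f"--- page {page_number} ---")
    print(markdown)
```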

## Bounding Boxes

Each page fragment includes bounding box coordinates that specify the exact location of the content on the page. This is useful for creating citations, highlighting source content in a UI, or debugging extraction quality.

<Info>
  [Google Colab Notebook](https://tlake.link/notebooks/bounding-boxes)
</Info>

### Accessing Bounding Boxes

```python theme={null}
result = doc_ai.parse_and_wait(file_id)

for page in result.pages:
  for fragment in page.page_fragments:
    bbox = fragment.bbox
    print(f"Fragment type: {fragment.fragment_type}")
    print(f"Top-left: ({bbox['x1']}, {bbox['y1']})")
    print(f"Bottom-right: ({bbox['x2']}, {bbox['y2']})")
```

### Coordinate System

Bounding boxes use the following coordinate system:

* **x1, y1**: Top-left corner of the bounding box
* **x2, y2**: Bottom-right corner of the bounding box
* **Origin (0,0)**: Top-left corner of the page
* **Units**: Pixels

All fragment types include bounding box coordinates.
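
For highlighting source content in a UI, pixel coordinates are often converted to fractions of the page size so the overlay scales with the rendered page. The sketch below assumes you already know the page's width and height in pixels; `normalize_bbox` is a hypothetical helper, and where the page dimensions come from is not covered on this page.

```python
def normalize_bbox(bbox: dict, page_width: float, page_height: float) -> dict:
    """Convert a pixel bounding box (top-left origin) to page-relative fractions."""
    return {
        "left": bbox["x1"] / page_width,
        "top": bbox["y1"] / page_height,
        "width": (bbox["x2"] - bbox["x1"]) / page_width,
        "height": (bbox["y2"] - bbox["y1"]) / page_height,
    }


# Illustrative values: a 200x100 pixel box on a 1000x500 pixel page.
box = normalize_bbox(
    {"x1": 100, "y1": 50, "x2": 300, "y2": 150},
    page_width=1000,
    page_height=500,
)
print(box)  # {'left': 0.1, 'top': 0.1, 'width': 0.2, 'height': 0.2}
```

Because the origin is the top-left corner, these fractions map directly onto CSS-style `left`/`top` percentages.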

## Table and Figure Summarization

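The Document Ingestion API can summarize tables and figures in documents, and can also extract table cells, charts, and key-value pairs. These features are configured through `enrichment_options`: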

| Parameter                     | Description                                                                                                                                                                        | Default Value |
| ----------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| `table_cell_grounding`        | Grounding of table cells, providing the bounding box of the cells. This will create a list of cells with their reference id `ref_id`, the bounding box and the cell text.          | `false`       |
| `table_summarization`         | Enable summarization of tables present in the document. This will generate a summary of the table content, including key insights and trends.                                      | `false`       |
| `figure_summarization`        | Enable summarization of figures present in the document. This will generate a summary of the figure content, including key insights and trends.                                    | `false`       |
| `table_summarization_prompt`  | A custom prompt to use for table summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used.      | -             |
| `figure_summarization_prompt` | A custom prompt to use for figure summarization. This can be used to provide additional context or instructions to the LLM. If not specified, the default prompt will be used.     | -             |
| `include_full_page_image`     | Include the full page image as additional context when summarizing tables and figures, which can improve accuracy by capturing surrounding headers, captions, and related content. | `false`       |
| `chart_extraction`            | Extraction of chart type and structured data series from images, delivered as clean JSON suitable for analytics and ingestion.                                                     | `false`       |
| `key_value_extraction`        | Extraction of key-value pairs from forms as clean JSON.                                                                                                                            | `false`       |

### Tables

Tables can be summarized by setting `table_summarization` to `true` in the `enrichment_options` JSON object when calling the `parse` API.

<Info>
  [Google Colab Notebook](https://tlake.link/notebooks/table-summaries)
</Info>

<CodeGroup>
  ```python Python SDK theme={null}
  from tensorlake.documentai import DocumentAI
  from tensorlake.documentai.models.options import (
      EnrichmentOptions,
  )

  enrichment_options = EnrichmentOptions(
      table_summarization=True,
      table_summarization_prompt="Summarize the table in a concise manner.",
  )

  doc_ai = DocumentAI(api_key=API_KEY)

  parse_id = doc_ai.read(
      file_id="file_XXX",  # Replace with your file ID or URL
      enrichment_options=enrichment_options,
  )
  ```

  ```json REST API theme={null}
  {
      "enrichment_options": {
          "table_summarization": true,
          "table_summarization_prompt": "Summarize the table in a way that is easy to understand and use for answering questions."
      }
  }
  ```
</CodeGroup>

### Figures

Figures can be summarized by setting `figure_summarization` to `true` in the `enrichment_options` JSON object when calling the `parse` API.

<Info>
  [Google Colab Notebook](https://tlake.link/notebooks/figure-summaries)
</Info>

<CodeGroup>
  ```python Python SDK theme={null}
  from tensorlake.documentai import (
      DocumentAI,
      EnrichmentOptions,
  )

  doc_ai = DocumentAI(api_key=API_KEY)

  enrichment_options = EnrichmentOptions(
      figure_summarization=True,
      figure_summarization_prompt="Summarize the figure in a way that is easy to understand and use for answering questions.",
  )

  parse_id = doc_ai.read(
      file_id="file_XXX",  # Replace with your file ID or URL
      enrichment_options=enrichment_options,
  )
  ```

  ```json REST API theme={null}
  {
      "enrichment_options": {
          "figure_summarization": true,
          "figure_summarization_prompt": "Summarize the figure in a way that is easy to understand and use for answering questions."
      }
  }
  ```
</CodeGroup>

### Full Page Image Context

When summarizing tables and figures, you can optionally include the full page image as additional context. This helps the model better understand the surrounding content, headers, footers, and relationships between elements on the page.

<Info>
  [Google Colab Notebook](https://tlake.link/notebooks/full-page-summary)
</Info>

```python theme={null}
enrichment_options = EnrichmentOptions(
    table_summarization=True,
    figure_summarization=True,
    include_full_page_image=True
)

result = doc_ai.parse_and_wait(
    file_id,
    enrichment_options=enrichment_options
)
```
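
For REST callers, the same options can be expressed as a JSON request body. This is a sketch that assumes the body follows the same shape as the other REST examples on this page; the `file_id` value is a placeholder.

```python
import json

# Field names come from the enrichment options table; Python's True
# serializes to JSON's true.
payload = {
    "file_id": "file_XXX",  # Replace with your file ID
    "enrichment_options": {
        "table_summarization": True,
        "figure_summarization": True,
        "include_full_page_image": True,
    },
}

body = json.dumps(payload, indent=4)
print(body)
```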

### Charts

Structured information about charts can be extracted by setting `chart_extraction` to `true` in the `enrichment_options` JSON object when calling the `parse` API.

<CodeGroup>
  ```python Python SDK theme={null}
  from tensorlake.documentai import DocumentAI
  from tensorlake.documentai.models.options import (
      EnrichmentOptions,
  )

  enrichment_options = EnrichmentOptions(
      chart_extraction=True,
  )

  doc_ai = DocumentAI(api_key=API_KEY)

  parse_id = doc_ai.read(
      file_id="file_XXX",  # Replace with your file ID or URL
      enrichment_options=enrichment_options,
  )
  ```

  ```json REST API theme={null}
  {
      "enrichment_options": {
          "chart_extraction": true
      }
  }
  ```
</CodeGroup>

### Key/Value Pairs

Key-value pairs can be extracted from forms by setting `key_value_extraction` to `true` in the `enrichment_options` JSON object when calling the `parse` API.

<CodeGroup>
  ```python Python SDK theme={null}
  from tensorlake.documentai import DocumentAI
  from tensorlake.documentai.models.options import (
      EnrichmentOptions,
  )

  enrichment_options = EnrichmentOptions(
      key_value_extraction=True,
  )

  doc_ai = DocumentAI(api_key=API_KEY)

  parse_id = doc_ai.read(
      file_id="file_XXX",  # Replace with your file ID or URL
      enrichment_options=enrichment_options,
  )
  ```

  ```json REST API theme={null}
  {
      "enrichment_options": {
          "key_value_extraction": true
      }
  }
  ```
</CodeGroup>
