Overview

Charts in PDFs and other documents are static images. Traditional parsers either skip them entirely or return a generic figure fragment with no underlying data. Tensorlake’s Agentic Chart Extraction converts those images into structured JSON: it detects the chart type, extracts data series and axis information, and produces output that can be fed into analytics and BI tools or re-plotted programmatically. Enable it with chart_extraction=True in your EnrichmentOptions.

Enabling Chart Extraction

Set chart_extraction=True in your EnrichmentOptions:
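from tensorlake.documentai import DocumentAI
from tensorlake.documentai.models.options import EnrichmentOptions

# Create a client with your Tensorlake Cloud API key
doc_ai = DocumentAI(api_key="YOUR_TENSORLAKE_CLOUD_API_KEY")

# Upload the document and keep the returned file ID
file_id = doc_ai.upload(path="document.pdf")

# Turn on chart extraction for this parse
enrichment_options = EnrichmentOptions(
    chart_extraction=True,
)

# Start the parse job with the enrichment options applied
parse_id = doc_ai.read(
    file_id=file_id,
    enrichment_options=enrichment_options,
)

# Wait until the parse job finishes and fetch the result
result = doc_ai.wait_for_completion(parse_id)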

How It Works

For each chart detected in the document, the system:
  1. Identifies the chart type (bar, line, scatter, or pie)
  2. Extracts axis definitions, series names, data points, and rendering hints (colors, markers, legend position)
  3. Outputs a standardized JSON object conforming to the schema for that chart type
All predictions conform to predefined schemas, so a single parser can consume every chart JSON produced without per-chart ad-hoc handling. Each JSON includes numeric arrays and rendering hints, making it directly plottable without any additional transformation.
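
As a rough illustration of the "single parser" idea, the sketch below dispatches on the "type" field and flattens each chart into uniform (series, x, y) rows. It relies only on the field names visible in the output examples on this page. The helper names (chart_to_rows, HANDLERS) are hypothetical and not part of the Tensorlake SDK, and pie charts are left out because their type string and field names are not shown here.

```python
from typing import Any

Row = dict[str, Any]


def _bar_rows(chart: dict) -> list[Row]:
    # Bar charts share one category list; each series has one value per category.
    categories = chart["x_axis"]["categories"]
    return [
        {"series": s["name"], "x": cat, "y": val}
        for s in chart["series"]
        for cat, val in zip(categories, s["data"])
    ]


def _line_rows(chart: dict) -> list[Row]:
    # Line charts share one list of x values across all series.
    xs = chart["x_axis"]["values"]
    return [
        {"series": s["name"], "x": x, "y": y}
        for s in chart["series"]
        for x, y in zip(xs, s["data"])
    ]


def _scatter_rows(chart: dict) -> list[Row]:
    # Scatter plots carry their own x_data/y_data arrays per series.
    return [
        {"series": s["name"], "x": x, "y": y}
        for s in chart["series"]
        for x, y in zip(s["x_data"], s["y_data"])
    ]


# Hypothetical dispatch table keyed by the "type" values shown in the examples.
HANDLERS = {
    "bar_chart": _bar_rows,
    "line_chart": _line_rows,
    "scatter_plot": _scatter_rows,
}


def chart_to_rows(chart: dict) -> list[Row]:
    handler = HANDLERS.get(chart.get("type"))
    if handler is None:
        raise ValueError(f"unsupported chart type: {chart.get('type')!r}")
    return handler(chart)
```

Because every handler returns the same row shape, downstream code (a DataFrame constructor, a database insert, a plotting call) does not need to know which chart type the rows came from.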

Supported Chart Types

Chart type | Schema highlights
Bar | orientation (vertical/horizontal), named series for grouped/stacked bars, x_axis.categories, optional axis bounds and per-bar display flags
Line | x/y axis definitions, explicit values arrays (numeric or categorical), multiple series with color, line_style, and marker styling
Scatter | Per-series x_data/y_data arrays, marker styling (size, alpha, edge_color), and axis bounds
Pie | Slice-centric schema with label, value, optional percentage, colors, and display flags

Output Examples

Bar chart

{
  "type": "bar_chart",
  "title": "Annual Energy Consumption by Source (TWh)",
  "orientation": "vertical",
  "x_axis": {
    "label": "REGION",
    "categories": ["North America", "Europe", "Asia", "Africa"]
  },
  "y_axis": { "label": "TWh", "min": 0, "max": 1000, "format": "number" },
  "series": [
    { "name": "Solar",        "data": [120, 150, 200,  80], "color": "#FFD700" },
    { "name": "Wind",         "data": [180, 220, 300,  60], "color": "#00BFFF" },
    { "name": "Hydro",        "data": [250, 180, 400, 120], "color": "#32CD32" },
    { "name": "Nuclear",      "data": [300, 450, 280,  20], "color": "#FF4500" },
    { "name": "Fossil Fuels", "data": [500, 400, 800, 350], "color": "#B0B0B0" }
  ],
  "bar_style": "grouped",
  "grid": true
}
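
One way to hand a bar chart like this to a spreadsheet or BI tool is to pivot it into a wide CSV, with one row per category and one column per series. The sketch below uses only the Python standard library and assumes the field names shown in the example above. bar_chart_to_csv is a hypothetical helper, and the sample is trimmed to two of the five series for brevity.

```python
import csv
import io


def bar_chart_to_csv(chart: dict) -> str:
    """Pivot a bar chart JSON into CSV text: one row per category."""
    categories = chart["x_axis"]["categories"]
    series = chart["series"]
    for s in series:
        # Each series is expected to hold exactly one value per category.
        if len(s["data"]) != len(categories):
            raise ValueError(f"series {s['name']!r} does not match the category count")

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    # Header: the x-axis label (or a fallback) followed by the series names.
    writer.writerow([chart["x_axis"].get("label", "category")] + [s["name"] for s in series])
    for i, category in enumerate(categories):
        writer.writerow([category] + [s["data"][i] for s in series])
    return buf.getvalue()


# Trimmed copy of the bar chart example (two series only).
chart = {
    "type": "bar_chart",
    "x_axis": {"label": "REGION", "categories": ["North America", "Europe", "Asia", "Africa"]},
    "series": [
        {"name": "Solar", "data": [120, 150, 200, 80]},
        {"name": "Wind", "data": [180, 220, 300, 60]},
    ],
}

print(bar_chart_to_csv(chart))
```

The length check is deliberate: if extraction ever yields a series with a missing or extra value, failing early is usually preferable to writing misaligned columns.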

Scatter plot

{
  "type": "scatter_plot",
  "title": "Urban vs Rural: Income vs Spending",
  "x_axis": { "label": "Annual Income ($k)", "min": 10, "max": 150, "scale": "linear" },
  "y_axis": { "label": "Annual Spending ($k)", "min": 0, "max": 100, "scale": "linear" },
  "series": [
    {
      "name": "Urban",
      "x_data": [24, 32, 35, 42, 67, 80, 91, 110, 125],
      "y_data": [4,  11, 22, 23, 27, 46, 59,  57,  71],
      "color": "#5da5da", "marker": "o", "alpha": 0.75
    },
    {
      "name": "Rural",
      "x_data": [25, 28, 38, 47, 63, 77, 98, 115, 129],
      "y_data": [7,  16, 17, 29, 27, 44, 63,  78,  81],
      "color": "#faa43a", "marker": "s", "alpha": 0.75
    }
  ],
  "legend_position": "upper right",
  "grid": true
}

Line chart

{
  "type": "line_chart",
  "title": "Uncorrelated Remote Sensor Readings",
  "x_axis": { "label": "Observation Minute", "values": [0, 2, 4, 6, 8, 10], "scale": "linear" },
  "y_axis": { "label": "Value", "min": 15, "max": 90, "scale": "linear" },
  "series": [
    { "name": "Room A (Stable)",   "data": [56, 54, 48, 55, 50, 54], "color": "#F472B6", "line_style": "-" },
    { "name": "Room B (Cooling)",  "data": [81, 80, 76, 80, 79, 71], "color": "#9CA3AF", "line_style": "-" },
    { "name": "Room C (Cyclic)",   "data": [31, 32, 34, 33, 38, 39], "color": "#FDE047", "line_style": "-" },
    { "name": "Outdoor (Variable)","data": [40, 41, 41, 39, 41, 40], "color": "#9CD9D3", "line_style": "-" }
  ],
  "legend_position": "upper right",
  "grid": true
}
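
In the line chart output, every series shares the x_axis.values list, so pairing the two arrays recovers the (x, y) points. The sketch below summarizes each series by its net change and its lowest and highest points. line_series_summary is a hypothetical helper, and the sample is trimmed to two series from the example above.

```python
def line_series_summary(chart: dict) -> dict[str, dict]:
    """Summarize each line series: net change plus (x, y) of its min and max."""
    xs = chart["x_axis"]["values"]
    summary = {}
    for s in chart["series"]:
        ys = s["data"]
        if len(ys) != len(xs):
            raise ValueError(f"series {s['name']!r} does not match the x-axis length")
        points = list(zip(xs, ys))
        summary[s["name"]] = {
            "change": ys[-1] - ys[0],                   # last value minus first value
            "min": min(points, key=lambda p: p[1]),     # (x, y) of the lowest reading
            "max": max(points, key=lambda p: p[1]),     # (x, y) of the highest reading
        }
    return summary


# Two series copied from the line chart example above.
chart = {
    "type": "line_chart",
    "x_axis": {"label": "Observation Minute", "values": [0, 2, 4, 6, 8, 10]},
    "series": [
        {"name": "Room B (Cooling)", "data": [81, 80, 76, 80, 79, 71]},
        {"name": "Room C (Cyclic)", "data": [31, 32, 34, 33, 38, 39]},
    ],
}

for name, stats in line_series_summary(chart).items():
    print(name, stats)
```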

Common Use Cases

  • Financial reports — extract revenue, cost, and margin trends from bar and line charts without manual transcription
  • Scientific papers — recover experimental data points from scatter plots for further analysis or comparison
  • Business presentations — pull KPI charts into structured data for dashboards and reporting pipelines
  • RAG pipelines — surface chart data as structured context so LLMs can answer quantitative questions about visuals
  • BI and analytics — re-plot or aggregate extracted series directly using the output JSON without rebuilding data manually
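
For the RAG use case, one possible approach is to render each chart JSON as a compact text block before adding it to a prompt or an index. The sketch below is a hypothetical helper (chart_to_context is not part of the Tensorlake SDK) that assumes the bar, line, and scatter field names shown in the output examples. A pie chart would yield only the header lines, because its slice fields are not shown on this page.

```python
def chart_to_context(chart: dict) -> str:
    """Render a chart JSON as plain text suitable for LLM context."""
    lines = [f"Chart: {chart.get('title', 'untitled')} ({chart.get('type', 'unknown type')})"]

    # Include axis labels when present so the values can be interpreted.
    for axis in ("x_axis", "y_axis"):
        label = chart.get(axis, {}).get("label")
        if label:
            lines.append(f"{axis}: {label}")

    # Bar charts share "categories", line charts share "values".
    x_axis = chart.get("x_axis", {})
    shared_x = x_axis.get("categories") or x_axis.get("values")

    for s in chart.get("series", []):
        if "x_data" in s:
            # Scatter series carry their own coordinates.
            pairs = zip(s["x_data"], s["y_data"])
        else:
            # Fall back to positional indices if no shared x values exist.
            xs = shared_x if shared_x else range(len(s["data"]))
            pairs = zip(xs, s["data"])
        points = ", ".join(f"{x}={y}" for x, y in pairs)
        lines.append(f"{s['name']}: {points}")

    return "\n".join(lines)
```

Keeping the title, axis labels, and series names next to the numbers gives the model the units and context it needs to answer quantitative questions about the chart.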