We are starting to support running Tensorlake’s Document Ingestion API in your own infrastructure. The main components are our OCR models, API services, and our open-source Compute Engine, which runs the models. Please contact us at support@tensorlake.ai to get access. At the moment, we only support running on AWS infrastructure, but if you have other requirements, please contact us and we will be happy to discuss them.

Compute and Storage Requirements

  1. 2 x 40GB A100 GPUs - For Layout Understanding, OCR, Summarization, and Table Extraction.
  2. 1 x OpenAI API key (optional) - For leveraging OpenAI models for structured output extraction.
  3. 1 x RDS or equivalent Postgres database - For managing user data and document metadata.
  4. 2 x S3 buckets - For storing your documents and outputs.
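
For reference, the two S3 buckets can be created ahead of time with the AWS CLI. The bucket names and region below are placeholders; use whatever naming fits your environment:

# one bucket for source documents, one for outputs (names are placeholders)
aws s3 mb s3://example-tensorlake-documents --region us-east-1
aws s3 mb s3://example-tensorlake-outputs --region us-east-1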

Onboarding

To get an on-premise version of Document Ingestion, please contact us at support@tensorlake.ai. We will provide you with the installation package, which includes all the necessary components and instructions to set up the system in your infrastructure. Our IAM policies are based on AWS STS roles, so you will need to provide a role with the necessary permissions to access the S3 buckets and RDS database.
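
Before installing, it can help to verify that the role can be assumed from your environment. A minimal sketch with the AWS CLI, where the role ARN and session name are placeholders for the role you provision:

# confirm the identity the installation host currently resolves to
aws sts get-caller-identity

# confirm the provisioned role can be assumed (ARN is a placeholder)
aws sts assume-role \
    --role-arn arn:aws:iam::123456789012:role/tensorlake-onprem-access \
    --role-session-name onprem-install-check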

Installation

Once you have access to the on-premise version of Document Ingestion, you will receive a link to download the installation package. Follow the instructions in the package to install the necessary components. The components that comprise the Document Ingestion API are:
  1. Document AI Server - This is the entrypoint for all document processing requests.
  2. Document AI Worker - This component handles asynchronous document processing tasks.
  3. Indexify - The workflow orchestration engine, built and run by Tensorlake.
  4. Executors - These are the individual processing units that run the document processing tasks.
  5. File Normalization Executor - This executor handles downloading files from the document storage bucket and normalizing their format for further processing.
  6. OCR Executor - This executor runs the OCR models to extract text from documents. It uses in-house models for layout understanding and text extraction.
  7. Structured Extraction Executor - This executor runs the structured output extraction models. You can use our private structured extraction model on an H100 or use OpenAI’s models.
  8. Output Formatter Executor - This executor handles the finalization of the document processing workflow, including formatting the output and storing it in the appropriate location.

Deploying the Document Ingestion Workflow

The installation package includes a deployment script that automates the setup of the Indexify workflow and its executors. The script will guide you through the configuration process and ensure that all components are properly connected. The script will need to be run on the machine where Indexify is installed, and it will require access to the S3 buckets.

Installation script

#!/bin/bash

set -e

# check if aws cli is installed
if ! command -v aws &> /dev/null
then
    echo "aws cli could not be found, please install it first"
    exit 1
fi

# check if indexify-cli is installed; if not, install it as a Python package in a virtual environment

if ! command -v indexify-cli &> /dev/null
then
    echo "indexify-cli could not be found, installing it now"
    python3 -m venv venv
    source venv/bin/activate
    pip install indexify
fi

# read the version number from the arguments
if [ "$#" -ne 1 ]; then
    echo "Usage: $0 <version>"
    exit 1
fi

VERSION=$1

# get the indexify URL from environment variable
if [ -z "$INDEXIFY_URL" ]; then
    echo "Please set the INDEXIFY_URL environment variable"
    exit 1
fi

# Download everything to a temporary directory. The S3 URI looks like this: s3://tensorlake-document-ingestion-workflows-prod/workflows/onprem/$VERSION/

TEMP_DIR=$(mktemp -d)
echo "Downloading workflows to temporary directory: $TEMP_DIR"

# check if the download was successful
if ! aws s3 sync "s3://tensorlake-document-ingestion-workflows-prod/workflows/onprem/$VERSION/" "$TEMP_DIR/"; then
    echo "Failed to download workflows from s3"
    exit 1
fi

echo "Deploying workflows from version: $VERSION to Indexify instance at $INDEXIFY_URL"
# deploy the workflows using indexify-cli to the onprem_workflows.py file
indexify-cli deploy $TEMP_DIR/onprem_workflows.py

This script will only work if:
  1. The instance running it uses the STS-authorized role to access the S3 bucket.
  2. The script can reach the Indexify instance via the INDEXIFY_URL environment variable.
  3. A valid version number is passed as the argument.
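
For example, assuming the script above is saved as deploy.sh (the file name and version number are illustrative):

export INDEXIFY_URL=http://indexify-server:8900
./deploy.sh 1.2.3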

AWS Credentials

Every service in the Document Ingestion API uses the official AWS SDK to access S3 buckets. You will need to provide your AWS credentials via environment variables; using configuration files is not supported. The system does, however, support the other standard AWS credential providers, including instance profiles and ECS task roles.
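
For example, when running the services with static credentials, the standard AWS SDK environment variables can be exported before starting each service (the values are placeholders):

export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_REGION=us-east-1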

Docker support

We provide Docker images for all components of the Document Ingestion API. You can use these images to run the services in a containerized environment. To get started with Docker, Tensorlake will provide you with a docker-compose.yml file that defines the services and their configurations. You can then use Docker Compose to start and manage the services.
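
As a sketch, once you have the docker-compose.yml file, the usual Docker Compose commands apply; the service name passed to logs below is the Indexify server from the provided compose file:

docker compose pull
docker compose up -d

# follow the logs of a single service, e.g. the Indexify server
docker compose logs -f indexify-server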

Configuration

Organization, projects and API Keys

By default, the on-premise installation of the Document Ingestion API comes with a default organization and project. You can configure additional organizations and projects using environment variables or a YAML configuration file. When there are no API keys configured, the system will allow unauthenticated access to the API. However, it is recommended to configure API keys for better security and access control. Once organizations, projects, and API keys are configured, the following headers will need to be provided with every request:
  1. X-Tensorlake-Organization-Id: The ID of the organization making the request.
  2. X-Tensorlake-Project-Id: The ID of the project belonging to the organization making the request.
  3. Authorization: Bearer token for the API key associated with the project.
The system does not provide a built-in user interface for managing organizations, projects, and API keys. You will need to manage these configurations manually. Every time you create a new organization or project, you will need to update the configuration files or environment variables accordingly, and restart the services for the changes to take effect.

Example configuration

listen_addr: 0.0.0.0:8700
on_prem_enabled: true

on_prem:
  organizations:
    - id: tensorlake_onprem_prg_1
      projects:
        - id: tensorlake_onprem_project_1
          api_keys:
            - tensorlake_onprem_project_apikey_123456789
            - tensorlake_onprem_project_apikey_987654321
        - id: tensorlake_onprem_project_2
          api_keys:
            - tensorlake_onprem_project2_apikey_123456789

With this configuration, only requests with the following headers would be valid:

X-Tensorlake-Organization-Id: tensorlake_onprem_prg_1
X-Tensorlake-Project-Id: tensorlake_onprem_project_1
Authorization: Bearer tensorlake_onprem_project_apikey_123456789
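
A request against this configuration would look like the following curl sketch, assuming the server listens on the listen_addr above; replace the request path with the endpoint you are calling:

curl http://localhost:8700/ \
  -H "X-Tensorlake-Organization-Id: tensorlake_onprem_prg_1" \
  -H "X-Tensorlake-Project-Id: tensorlake_onprem_project_1" \
  -H "Authorization: Bearer tensorlake_onprem_project_apikey_123456789"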

Executors networking

Each executor needs to be able to connect to the Indexify server. By default, the executors will try to connect to http://indexify-server:8900, which is the default hostname and port used in the provided docker-compose.yml file. If the executors run on a different host or network, you will need to configure the INDEXIFY_SERVER_HOST environment variable to point to the correct URL. The executors need to be able to reach the HTTP port (8900 by default) and the gRPC port (8901 by default).
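
A minimal sketch for executors running on a separate host, assuming the variable takes a full URL as the default suggests; the hostname is a placeholder and the reachability checks assume netcat is installed:

export INDEXIFY_SERVER_HOST=http://indexify.internal.example.com:8900

# verify both ports are reachable from the executor host
nc -z indexify.internal.example.com 8900 && echo "HTTP port reachable"
nc -z indexify.internal.example.com 8901 && echo "gRPC port reachable"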