We are starting to support running Tensorlake’s Document Ingestion API in your own infrastructure. The main components are our OCR models, API services, and our open-source Compute Engine, which runs the models. Please contact us at support@tensorlake.ai to get access. At the moment, we only support running on AWS infrastructure, but if you have other requirements, please contact us and we will be happy to discuss them.

Compute and Storage Requirements

  1. 2 x 40GB A100 GPUs - For layout understanding, OCR, summarization, and table extraction.
  2. 1 x H100 GPU - For structured output extraction (optional).
  3. 1 x OpenAI API key - For leveraging OpenAI models for structured output extraction (optional).
  4. 1 x RDS or equivalent Postgres database - For managing user data and document metadata.
  5. 2 x S3 buckets - For storing your documents and outputs.

Installation

Once you have access to the on-premise version of Document Ingestion, you will receive a link to download the installation package. Follow the instructions in the package to install the necessary components. The Document Ingestion API consists of the following components:
  1. Document AI Server - This is the entrypoint for all document processing requests.
  2. Document AI Worker - This component handles asynchronous document processing tasks.
  3. Indexify - The workflow orchestration engine, built and run by Tensorlake.
  4. Executors - These are the individual processing units that run the document processing tasks.
  5. File Normalization Executor - This executor handles downloading files from the document storage bucket and normalizing their format for further processing.
  6. OCR Executor - This executor runs the OCR models to extract text from documents. It uses in-house models for layout understanding and text extraction.
  7. Structured Extraction Executor - This executor runs the structured output extraction models. You can use our private structured extraction model on an H100 or use OpenAI’s models.
  8. Output Formatter Executor - This executor handles the finalization of the document processing workflow, including formatting the output and storing it in the appropriate location.

Deploying the Document Ingestion Workflow

The installation package includes a deployment script that automates the setup of the Indexify workflow and its executors. The script will guide you through the configuration process and ensure that all components are properly connected. Run the script on the machine where Indexify is installed; it requires access to the S3 buckets.

AWS Credentials

Every service in the Document Ingestion API uses the official AWS SDK to access the S3 buckets. You will need to provide your AWS credentials via environment variables; using configuration files is not supported. Apart from configuration files, the system supports every kind of AWS credential provider, including instance profiles and ECS task roles.
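For example, when running the services with Docker Compose, the standard AWS SDK environment variables can be passed to each service. A minimal sketch, assuming a service named document-ai-server (the service name is illustrative):

```yaml
# Passing AWS credentials through the standard AWS SDK environment variables.
# The service name is illustrative. On EC2 or ECS you can omit the static
# keys entirely and let the SDK pick up the instance profile or task role.
services:
  document-ai-server:
    environment:
      AWS_REGION: us-east-1
      AWS_ACCESS_KEY_ID: ${AWS_ACCESS_KEY_ID}
      AWS_SECRET_ACCESS_KEY: ${AWS_SECRET_ACCESS_KEY}
      # AWS_SESSION_TOKEN is only needed for temporary credentials:
      # AWS_SESSION_TOKEN: ${AWS_SESSION_TOKEN}
```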

Docker Support

We provide Docker images for all components of the Document Ingestion API. You can use these images to run the services in a containerized environment. To get started with Docker, Tensorlake will provide you with a docker-compose.yml file that defines the services and their configurations. You can then use Docker Compose to start and manage the services.
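As a rough sketch, the provided file will look something like the following. The image names and the document-ai-* / ocr-executor service names are assumptions; only the indexify-server hostname and its default ports come from this document:

```yaml
# Abbreviated, hypothetical docker-compose.yml. Image and service names are
# assumptions; the indexify-server hostname and ports 8900/8901 are the
# documented defaults.
services:
  indexify-server:
    image: tensorlake/indexify-server        # hypothetical image name
    ports:
      - "8900:8900"                          # HTTP API
      - "8901:8901"                          # gRPC
  document-ai-server:
    image: tensorlake/document-ai-server     # hypothetical image name
    depends_on:
      - indexify-server
  document-ai-worker:
    image: tensorlake/document-ai-worker     # hypothetical image name
    depends_on:
      - indexify-server
  ocr-executor:
    image: tensorlake/ocr-executor           # hypothetical image name
    environment:
      INDEXIFY_SERVER_HOST: http://indexify-server:8900
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia                 # reserve a GPU for the OCR models
              count: 1
              capabilities: [gpu]
```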

Configuration

Organizations, Projects, and API Keys

By default, the on-premise installation of the Document Ingestion API comes with a default organization and project. You can configure additional organizations and projects using environment variables or a YAML configuration file.

When no API keys are configured, the system allows unauthenticated access to the API. We recommend configuring API keys for better security and access control. Once an API key is configured, all requests to the API must include a valid API key in the Authorization header.

The system does not provide a built-in user interface for managing organizations, projects, and API keys; you will need to manage these configurations manually. Whenever you create a new organization or project, update the configuration file or environment variables accordingly and restart the services for the changes to take effect.
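The exact schema of the YAML configuration file ships with the installation package. The following is only a hypothetical sketch of how such a file might be laid out; every key name in it is an assumption, not the documented format:

```yaml
# Hypothetical organizations/projects/API-keys configuration. All key names
# are assumptions; consult the installation package for the real schema.
# Requests must then present a configured key in the Authorization header.
organizations:
  - name: acme
    projects:
      - name: invoices
        api_keys:
          - tl_example_key_123
```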

Executor Networking

Each executor needs to be able to connect to the Indexify server. By default, the executors try to connect to http://indexify-server:8900, which is the default hostname and port used in the provided docker-compose.yml file. If the executors run on a different host or network, you will need to set the INDEXIFY_SERVER_HOST environment variable to point to the correct URL. The executors need to be able to reach both the HTTP port (8900 by default) and the gRPC port (8901 by default).
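For example, if an executor runs on a separate GPU host, its Compose service definition on that host might override the endpoint like this (the hostname is illustrative):

```yaml
# Pointing an executor at an Indexify server on another host. The hostname
# is illustrative; both the HTTP port (8900) and the gRPC port (8901) on
# that host must be reachable from the executor.
services:
  ocr-executor:
    environment:
      INDEXIFY_SERVER_HOST: http://indexify.internal.example.com:8900
```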