Compute and Storage Requirements
- 2 x 40GB A100 GPUs - For Layout Understanding, OCR, Summarization, and Table Extraction
- 1 x OpenAI API key - For leveraging OpenAI models for structured output extraction (optional)
- 1 x RDS or equivalent Postgres database - For managing user data and document metadata.
- 2 x S3 buckets - For storing your documents and outputs.
Onboarding
To get an on-premise version of Document Ingestion, please contact us at support@tensorlake.ai. We will provide you with the installation package, which includes all the necessary components and instructions to set up the system in your infrastructure. Our IAM policies are based on AWS STS roles, so you will need to provide a role with the necessary permissions to access the S3 buckets and the RDS database.
Installation
Once you have access to the on-premise version of Document Ingestion, you will receive a link to download the installation package. Follow the instructions in the package to install the necessary components. The components that comprise the Document Ingestion API are:
- Document AI Server - This is the entrypoint for all document processing requests.
- Document AI Worker - This component handles asynchronous document processing tasks.
- Indexify - The workflow orchestration engine, built and run by Tensorlake.
- Executors - These are the individual processing units that run the document processing tasks.
  - File Normalization Executor - This executor handles downloading files from the document storage bucket and normalizing their format for further processing.
  - OCR Executor - This executor runs the OCR models to extract text from documents. It uses in-house models for layout understanding and text extraction.
  - Structured Extraction Executor - This executor runs the structured output extraction models. You can use our private structured extraction model on an H100 or use OpenAI’s models.
  - Output Formatter Executor - This executor handles the finalization of the document processing workflow, including formatting the output and storing it in the appropriate location.
Deploying the Document Ingestion Workflow
The installation package includes a deployment script that automates the setup of the Indexify workflow and its executors. The script will guide you through the configuration process and ensure that all components are properly connected. The script needs to be run on the machine where Indexify is installed, and it requires access to the S3 buckets.
Installation script
Before running the script, make sure that:
- The instance running it is using the STS-authorized role to access the S3 bucket.
- The script has access to the Indexify instance via the `INDEXIFY_URL` environment variable.
- A valid version number is being used.
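Putting those pieces together, an invocation might look like the sketch below; the script name and `--version` flag are hypothetical, since the actual command is documented in the installation package:

```bash
# Hypothetical invocation; the script name and --version flag are
# assumptions, not the documented interface.
export INDEXIFY_URL="http://indexify-server:8900"   # Indexify instance to configure
./deploy.sh --version 1.0.0                         # must be a valid version number
```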
AWS Credentials
Every service in the Document Ingestion API uses the official AWS SDK to access S3 buckets. You will need to provide your AWS credentials via environment variables; using configuration files is not supported. However, the system supports every kind of AWS credential provider, including instance profiles and ECS task roles.
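For example, static credentials can be supplied through the standard AWS SDK environment variables (the values below are placeholders):

```bash
# Standard AWS SDK environment variables; the values are placeholders.
export AWS_ACCESS_KEY_ID="<access-key-id>"
export AWS_SECRET_ACCESS_KEY="<secret-access-key>"
export AWS_REGION="us-east-1"   # region of your S3 buckets and RDS instance
```

When the services run with an instance profile or an ECS task role, the SDK's default credential provider chain resolves credentials automatically and these variables can be omitted.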
Docker support
We provide Docker images for all components of the Document Ingestion API. You can use these images to run the services in a containerized environment. To get started with Docker, Tensorlake will provide you with a `docker-compose.yml` file that defines the services and their configurations. You can then use Docker Compose to start and manage the services.
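To give a feel for the layout, here is a heavily trimmed sketch of what such a compose file could look like; the service names, image names, and tags are assumptions, and the file supplied by Tensorlake is the authoritative version:

```yaml
# Illustrative sketch only; service names, image names, and tags are
# assumptions. The docker-compose.yml provided by Tensorlake is authoritative.
services:
  indexify-server:
    image: tensorlake/indexify-server:latest   # assumed image name
    ports:
      - "8900:8900"   # HTTP API
      - "8901:8901"   # gRPC
  document-ai-server:
    image: tensorlake/document-ai-server:latest   # assumed image name
    depends_on:
      - indexify-server
  document-ai-worker:
    image: tensorlake/document-ai-worker:latest   # assumed image name
    depends_on:
      - indexify-server
  ocr-executor:
    image: tensorlake/ocr-executor:latest   # assumed image name
    environment:
      INDEXIFY_SERVER_HOST: "http://indexify-server:8900"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia   # the OCR models require GPU access
              count: 1
              capabilities: [gpu]
```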
Configuration
Organization, projects and API Keys
By default, the on-premise installation of the Document Ingestion API comes with a `default` organization and project. You can configure additional organizations and projects using environment variables or a YAML configuration file (see the example configuration below).
When no API keys are configured, the system allows unauthenticated access to the API. However, we recommend configuring API keys for better security and access control.
Once organizations, projects, and API keys are configured, the following headers will need to be provided with every request:
- `X-Tensorlake-Organization-Id`: The ID of the organization making the request.
- `X-Tensorlake-Project-Id`: The ID of the project belonging to the organization making the request.
- `Authorization`: Bearer token for the API key associated with the project.
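As a sketch, a request carrying all three headers might look like the following; the host, endpoint path, and identifier values are placeholders rather than the documented API surface:

```bash
# Illustrative request; host, path, and identifier values are placeholders.
curl -X POST "http://document-ai-server:8080/v1/documents" \
  -H "X-Tensorlake-Organization-Id: org-acme" \
  -H "X-Tensorlake-Project-Id: proj-invoices" \
  -H "Authorization: Bearer <api-key>" \
  -F "file=@invoice.pdf"
```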
Example configuration
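A minimal YAML sketch of organizations, projects, and API keys follows; the key names and nesting are assumptions, as the authoritative schema ships with the installation package:

```yaml
# Illustrative only; key names and nesting are assumed, not the
# documented schema.
organizations:
  - id: acme
    projects:
      - id: invoices
        api_keys:
          - name: ingestion-service
            key: <generated-api-key>   # placeholder value
```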
Executors networking
Each executor needs to be able to connect to the Indexify server. By default, the executors will try to connect to `http://indexify-server:8900`, which is the default hostname and port used in the provided `docker-compose.yml` file.
If the executors run on a different host or network, you will need to set the `INDEXIFY_SERVER_HOST` environment variable to point to the correct URL.
The executors need to be able to reach the HTTP port (8900 by default) and the gRPC port (8901 by default).
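For instance, an executor deployed on a separate host could be pointed at the server with a compose override like this sketch (the service name and URL are assumptions):

```yaml
# Sketch of pointing an executor on another host at the Indexify server;
# the service name and URL are assumptions.
services:
  ocr-executor:
    environment:
      INDEXIFY_SERVER_HOST: "http://indexify.internal.example.com:8900"
```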