This guide describes how to monitor and troubleshoot an Indexify deployment using the available metrics and internal troubleshooting endpoints.

Prometheus Metrics

Indexify Server exposes Prometheus metrics at {server_url}/metrics/service. These metrics are valuable for monitoring system health and performance.
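
As a quick check that the endpoint is reachable, the sketch below fetches the raw Prometheus exposition text over HTTP. The server address is an assumption; substitute the URL of your own deployment.

```python
# Minimal sketch: fetch the raw Prometheus exposition text from the Indexify server.
# SERVER_URL is a placeholder; point it at your own deployment.
import requests

SERVER_URL = "http://localhost:8900"  # assumed address, adjust as needed

resp = requests.get(f"{SERVER_URL}/metrics/service", timeout=10)
resp.raise_for_status()
print(resp.text)
```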

Key Metrics for Monitoring

| Metric | Description | Use Case |
| --- | --- | --- |
| active_invocations_gauge | Count of uncompleted invocations | Monitors system backlog |
| active_tasks | Count of uncompleted tasks | Tracks overall system load |
| unallocated_tasks | Count of tasks not allocated to executors | Identifies resource constraints |
| max_invocation_age_seconds | Age of the oldest running invocation | Detects stuck invocations |
| max_task_age_seconds | Age of the oldest running task | Identifies abnormally long-running tasks |
| task_completion_latency_seconds_bucket_count{outcome="Success"} | Count of successfully completed tasks | Tracks successful throughput |
| task_completion_latency_seconds_bucket_count{outcome="Failure"} | Count of failed tasks | Monitors system errors |
| task_completion_latency_seconds_bucket | Distribution of task completion times | Analyzes performance trends |

Additional internal metrics are available; the output of the /metrics/service endpoint lists them along with their descriptions.
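
To read these metrics programmatically rather than scraping them with Prometheus, one option is the exposition-format parser that ships with the prometheus_client Python package. The sketch below is only an illustration: the server URL is assumed, and it simply prints the gauges named in the table above.

```python
# Sketch: parse the exposition text and print the key gauges from the table above.
# Assumes the requests and prometheus_client packages are installed; SERVER_URL is a placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"  # adjust to your deployment

WATCHED = {
    "active_invocations_gauge",
    "active_tasks",
    "unallocated_tasks",
    "max_invocation_age_seconds",
    "max_task_age_seconds",
}

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCHED:
            print(f"{sample.name} = {sample.value}")
```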

Troubleshooting Endpoints

Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:

| Endpoint | Description | Use Case |
| --- | --- | --- |
| {server_url}/internal/allocations | Lists current allocations per executor | Debugging executor load balance |
| {server_url}/internal/unallocated_tasks | Lists all tasks not being allocated | Identifying resource bottlenecks |
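
When digging into a problem, it can help to dump both endpoints from a script for side-by-side inspection. The response format is not documented here, so the sketch below just pretty-prints whatever the server returns; the server URL is an assumption.

```python
# Sketch: dump the internal troubleshooting endpoints for manual inspection.
# No response schema is assumed; the payload is printed as returned.
import json
import requests

SERVER_URL = "http://localhost:8900"  # placeholder; use your server address

for path in ("/internal/allocations", "/internal/unallocated_tasks"):
    resp = requests.get(f"{SERVER_URL}{path}", timeout=10)
    resp.raise_for_status()
    print(f"--- {path} ---")
    try:
        print(json.dumps(resp.json(), indent=2))
    except ValueError:
        print(resp.text)  # fall back if the endpoint does not return JSON
```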

Common Troubleshooting Scenarios

High Count of Unallocated Tasks

If the unallocated_tasks metric is high (see the watchdog sketch after this list for one way to detect it):

  1. Check that you have executors capable of handling the specific task types: every unallocated task needs at least one executor started with a --function argument that matches it.
  2. Check the current load on executors via the /internal/allocations endpoint to see whether they are at capacity
  3. Examine executor logs for errors
  4. Examine server logs for errors
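
A small watchdog that scrapes the unallocated_tasks gauge and warns when it crosses a threshold is one way to notice this condition before it becomes a backlog. The threshold and server URL below are placeholders, not recommendations.

```python
# Sketch: warn when the unallocated_tasks gauge exceeds a chosen threshold.
# SERVER_URL and THRESHOLD are illustrative values only.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
THRESHOLD = 50  # tune to your workload

def unallocated_task_count() -> float:
    text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "unallocated_tasks":
                return sample.value
    return 0.0

count = unallocated_task_count()
if count > THRESHOLD:
    print(f"WARNING: {count:.0f} unallocated tasks; check executor --function coverage and capacity")
```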

Abnormally Long-Running Tasks

If max_task_age_seconds is unusually high (a threshold-check sketch follows this list):

  1. Use /internal/allocations to identify the specific long-running tasks
  2. Check the stdout of long-running tasks using the Indexify UI at {server_url}/ui
  3. Check the logs of the executors handling these tasks
  4. Consider adjusting resource allocations or timeouts
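
The same kind of check works for task age. The 30-minute threshold below is arbitrary; set it to something comfortably above your longest expected task, and treat the server URL as a placeholder.

```python
# Sketch: flag abnormally old tasks by checking the max_task_age_seconds gauge.
# MAX_AGE_SECONDS is an arbitrary example threshold; SERVER_URL is a placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
MAX_AGE_SECONDS = 30 * 60  # 30 minutes

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name == "max_task_age_seconds" and sample.value > MAX_AGE_SECONDS:
            print(
                f"Oldest task has been running for {sample.value:.0f}s; "
                f"inspect {SERVER_URL}/internal/allocations and the task stdout in the UI"
            )
```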

Failed Tasks

If task_completion_latency_seconds_bucket_count{outcome="Failure"} is increasing (see the polling sketch after this list):

  1. Make sure the invocation input payload is valid.
  2. Check for the root cause in the stdout or stderr of failed tasks using the Indexify UI at {server_url}/ui
  3. Review your Compute Graph code to see whether the output observed in stdout or stderr is explained by the code itself.
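
To see whether failures are actively accruing, you can sample the failure counter twice and compare. The polling interval and server URL in the sketch below are placeholders, and the counter name follows the table above.

```python
# Sketch: sample the failed-task counter twice and report how many tasks failed in between.
# SERVER_URL and POLL_SECONDS are illustrative values only.
import time
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
POLL_SECONDS = 60  # arbitrary polling interval

def failed_task_count() -> float:
    text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if (
                sample.name == "task_completion_latency_seconds_bucket_count"
                and sample.labels.get("outcome") == "Failure"
            ):
                return sample.value
    return 0.0

before = failed_task_count()
time.sleep(POLL_SECONDS)
after = failed_task_count()
if after > before:
    print(f"{after - before:.0f} task(s) failed in the last {POLL_SECONDS}s; check stdout/stderr in the UI")
```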