This guide explains how to monitor and troubleshoot your Indexify deployment using the available metrics and internal endpoints.

Prometheus Metrics

Indexify Server exposes Prometheus metrics at {server_url}/metrics/service. These metrics are valuable for monitoring system health and performance.
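
If you want to inspect these metrics outside of a Prometheus server, you can fetch and parse the endpoint directly. The sketch below is a minimal example, assuming a Python environment with the requests and prometheus_client packages installed; the server URL is a placeholder for your deployment.

```python
# Minimal sketch: fetch and parse the metrics endpoint directly.
# Assumptions: the `requests` and `prometheus_client` packages are installed,
# and SERVER_URL points at your Indexify Server (the URL below is a placeholder).
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"  # placeholder; use your deployment's URL

resp = requests.get(f"{SERVER_URL}/metrics/service", timeout=10)
resp.raise_for_status()

# Print each metric family's name, type, and number of samples.
for family in text_string_to_metric_families(resp.text):
    print(family.name, family.type, len(family.samples))
```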

Key Metrics for Monitoring

| Metric | Description | Use Case |
| --- | --- | --- |
| active_invocations_gauge | Count of uncompleted invocations | Monitors system backlog |
| active_tasks | Count of uncompleted tasks | Tracks overall system load |
| unallocated_tasks | Count of tasks not allocated to executors | Identifies resource constraints |
| max_invocation_age_seconds | Age of the oldest running invocation | Detects stuck invocations |
| max_task_age_seconds | Age of the oldest running task | Identifies abnormally long-running tasks |
| task_completion_latency_seconds_bucket_count{outcome="Success"} | Count of successfully completed tasks | Tracks successful throughput |
| task_completion_latency_seconds_bucket_count{outcome="Failure"} | Count of failed tasks | Monitors system errors |
| task_completion_latency_seconds_bucket | Distribution of task completion times | Analyzes performance trends |
Additional internal metrics are exposed and documented at the /metrics/service endpoint.
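
As a rough illustration of how the gauges above can drive simple alerting, the following sketch polls a few of them and flags threshold breaches. The threshold values are illustrative only, and the assumption that these metrics are exposed as plain, unlabeled gauges may not hold for every Indexify version.

```python
# Hedged sketch: poll a few key gauges and flag simple threshold breaches.
# The metric names come from the table above; the thresholds are illustrative,
# and the assumption that these are plain, unlabeled gauges may vary by version.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"  # placeholder; use your deployment's URL
THRESHOLDS = {                        # illustrative values, tune for your workload
    "unallocated_tasks": 100,
    "max_task_age_seconds": 3600,
    "max_invocation_age_seconds": 3600,
}

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
values = {
    sample.name: sample.value
    for family in text_string_to_metric_families(text)
    for sample in family.samples
    if sample.name in THRESHOLDS
}

for name, limit in THRESHOLDS.items():
    if name in values and values[name] > limit:
        print(f"ALERT: {name}={values[name]} exceeds threshold {limit}")
```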

Troubleshooting Endpoints

Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:
| Endpoint | Description | Use Case |
| --- | --- | --- |
| {server_url}/internal/allocations | Lists current allocations per executor | Debugging executor load balancing |
| {server_url}/internal/unallocated_tasks | Lists all tasks that have not been allocated | Identifying resource bottlenecks |
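
A minimal sketch for pulling both endpoints is shown below. The paths come from the table above; the assumption that they return JSON (and any particular JSON shape) is not guaranteed, so the code falls back to printing raw text.

```python
# Minimal sketch: dump the internal troubleshooting endpoints for inspection.
# The paths come from the table above; the JSON assumption is not guaranteed,
# so the code falls back to printing raw text.
import json
import requests

SERVER_URL = "http://localhost:8900"  # placeholder; use your deployment's URL

for path in ("/internal/allocations", "/internal/unallocated_tasks"):
    resp = requests.get(f"{SERVER_URL}{path}", timeout=10)
    resp.raise_for_status()
    try:
        print(path, json.dumps(resp.json(), indent=2)[:500])  # short preview
    except ValueError:
        print(path, resp.text[:500])
```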

Common Troubleshooting Scenarios

High Count of Unallocated Tasks

If the unallocated_tasks metric is high:
  1. Check that you have executors capable of handling the specific task types. In particular, make sure every unallocated task has at least one executor whose --function argument matches the task's function.
  2. Check the current load on executors by examining the /internal/allocations endpoint to see whether they are at capacity (see the sketch after this list).
  3. Examine executor logs for errors.
  4. Examine server logs for errors.
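
The following sketch illustrates steps 1 and 2 by comparing the functions needed by unallocated tasks against the functions executors are currently serving. The JSON field names used here (compute_fn, function) and the response shapes are hypothetical; inspect the actual output of your server version and adapt the extraction logic before relying on it.

```python
# Illustrative sketch for steps 1-2: find functions that have unallocated tasks
# but no executor serving them. The field names ("compute_fn" on a task,
# "function" on an allocation) and the response shapes are HYPOTHETICAL;
# inspect your server's actual JSON and adapt accordingly.
import requests

SERVER_URL = "http://localhost:8900"  # placeholder; use your deployment's URL

unallocated = requests.get(f"{SERVER_URL}/internal/unallocated_tasks", timeout=10).json()
allocations = requests.get(f"{SERVER_URL}/internal/allocations", timeout=10).json()

# Hypothetical shapes: a list of task objects, and a mapping of executor id to
# a list of allocation objects.
needed = {task.get("compute_fn") for task in unallocated}
served = {alloc.get("function") for allocs in allocations.values() for alloc in allocs}

missing = needed - served
if missing:
    print("Functions with unallocated tasks but no serving executor:", missing)
```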

Abnormally Long-Running Tasks

If max_task_age_seconds is unusually high:
  1. Use /internal/allocations to identify the specific long-running tasks.
  2. Check the stdout of long-running tasks using the Indexify UI at {server_url}/ui.
  3. Check the logs of the executors handling these tasks.
  4. Consider adjusting resource allocations or timeouts.

Failed Tasks

If task_completion_latency_seconds_bucket_count{outcome="Failure"} is increasing (a sketch for tracking these counters follows this list):
  1. Make sure the invocation's input payload is valid.
  2. Check for the root cause in the stdout or stderr of failed tasks using the Indexify UI at {server_url}/ui.
  3. Review your Compute Graph code to see whether the errors seen in stdout or stderr can be explained.
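
To watch the success and failure counters over time, a sketch like the one below can break the task-completion histogram down by its outcome label. The metric and label names follow the table above; whether the histogram carries additional labels is an assumption.

```python
# Hedged sketch: break the task-completion counters down by outcome so a rising
# failure count stands out. Names follow the metrics table; whether the histogram
# carries labels other than "outcome" is an assumption.
import collections
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"  # placeholder; use your deployment's URL

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
counts = collections.Counter()
for family in text_string_to_metric_families(text):
    if not family.name.startswith("task_completion_latency_seconds"):
        continue
    for sample in family.samples:
        if sample.name.endswith("_count"):
            counts[sample.labels.get("outcome", "unknown")] += sample.value

print(dict(counts))  # e.g. {"Success": 1234.0, "Failure": 5.0}
```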