This guide describes how to monitor and troubleshoot an Indexify deployment using the available metrics and internal troubleshooting endpoints.

Prometheus Metrics

Indexify Server exposes Prometheus metrics at {server_url}/metrics/service. These metrics are valuable for monitoring system health and performance.
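
As a quick check that the endpoint is reachable, the sketch below fetches the raw Prometheus exposition text over HTTP. The server address is an assumption; substitute the URL of your own deployment.

```python
# Minimal sketch: fetch the raw Prometheus exposition text from the Indexify server.
# SERVER_URL is a placeholder; point it at your own deployment.
import requests

SERVER_URL = "http://localhost:8900"  # assumed address, adjust as needed

resp = requests.get(f"{SERVER_URL}/metrics/service", timeout=10)
resp.raise_for_status()
print(resp.text)
```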

Key Metrics for Monitoring

| Metric | Description | Use Case |
| --- | --- | --- |
| active_invocations_gauge | Count of uncompleted invocations | Monitors system backlog |
| active_tasks | Count of uncompleted tasks | Tracks overall system load |
| unallocated_tasks | Count of tasks not allocated to executors | Identifies resource constraints |
| max_invocation_age_seconds | Age of the oldest running invocation | Detects stuck invocations |
| max_task_age_seconds | Age of the oldest running task | Identifies abnormally long-running tasks |
| task_completion_latency_seconds_bucket_count{outcome="Success"} | Count of successfully completed tasks | Tracks successful throughput |
| task_completion_latency_seconds_bucket_count{outcome="Failure"} | Count of failed tasks | Monitors system errors |
| task_completion_latency_seconds_bucket | Distribution of task completion times | Analyzes performance trends |

Additional internal metrics are available; the output of the /metrics/service endpoint lists them along with their descriptions.
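
To read these metrics programmatically rather than scraping them with Prometheus, one option is the exposition-format parser that ships with the prometheus_client Python package. The sketch below is only an illustration: the server URL is assumed, and it simply prints the gauges named in the table above.

```python
# Sketch: parse the exposition text and print the key gauges from the table above.
# Assumes the requests and prometheus_client packages are installed; SERVER_URL is a placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"  # adjust to your deployment

WATCHED = {
    "active_invocations_gauge",
    "active_tasks",
    "unallocated_tasks",
    "max_invocation_age_seconds",
    "max_task_age_seconds",
}

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in WATCHED:
            print(f"{sample.name} = {sample.value}")
```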

Troubleshooting Endpoints

Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:

| Endpoint | Description | Use Case |
| --- | --- | --- |
| {server_url}/internal/allocations | Lists current allocations per executor | Debugging executor load balance |
| {server_url}/internal/unallocated_tasks | Lists all tasks not being allocated | Identifying resource bottlenecks |
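
When digging into a problem, it can help to dump both endpoints from a script for side-by-side inspection. The response format is not documented here, so the sketch below just pretty-prints whatever the server returns; the server URL is an assumption.

```python
# Sketch: dump the internal troubleshooting endpoints for manual inspection.
# No response schema is assumed; the payload is printed as returned.
import json
import requests

SERVER_URL = "http://localhost:8900"  # placeholder; use your server address

for path in ("/internal/allocations", "/internal/unallocated_tasks"):
    resp = requests.get(f"{SERVER_URL}{path}", timeout=10)
    resp.raise_for_status()
    print(f"--- {path} ---")
    try:
        print(json.dumps(resp.json(), indent=2))
    except ValueError:
        print(resp.text)  # fall back if the endpoint does not return JSON
```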

Common Troubleshooting Scenarios

High Count of Unallocated Tasks

If the unallocated_tasks metric is high (see the watchdog sketch after this list for one way to detect it):

  1. Check that you have executors capable of handling the specific task types: every unallocated task needs at least one executor started with a --function argument that matches it.
  2. Check the current load on executors via the /internal/allocations endpoint to see whether they are at capacity
  3. Examine executor logs for errors
  4. Examine server logs for errors
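
A small watchdog that scrapes the unallocated_tasks gauge and warns when it crosses a threshold is one way to notice this condition before it becomes a backlog. The threshold and server URL below are placeholders, not recommendations.

```python
# Sketch: warn when the unallocated_tasks gauge exceeds a chosen threshold.
# SERVER_URL and THRESHOLD are illustrative values only.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
THRESHOLD = 50  # tune to your workload

def unallocated_task_count() -> float:
    text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if sample.name == "unallocated_tasks":
                return sample.value
    return 0.0

count = unallocated_task_count()
if count > THRESHOLD:
    print(f"WARNING: {count:.0f} unallocated tasks; check executor --function coverage and capacity")
```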

Abnormally Long-Running Tasks

If max_task_age_seconds is unusually high (a threshold-check sketch follows this list):

  1. Use /internal/allocations to identify the specific long-running tasks
  2. Check the stdout of long-running tasks using the Indexify UI at {server_url}/ui
  3. Check the logs of the executors handling these tasks
  4. Consider adjusting resource allocations or timeouts
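
The same kind of check works for task age. The 30-minute threshold below is arbitrary; set it to something comfortably above your longest expected task, and treat the server URL as a placeholder.

```python
# Sketch: flag abnormally old tasks by checking the max_task_age_seconds gauge.
# MAX_AGE_SECONDS is an arbitrary example threshold; SERVER_URL is a placeholder.
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
MAX_AGE_SECONDS = 30 * 60  # 30 minutes

text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name == "max_task_age_seconds" and sample.value > MAX_AGE_SECONDS:
            print(
                f"Oldest task has been running for {sample.value:.0f}s; "
                f"inspect {SERVER_URL}/internal/allocations and the task stdout in the UI"
            )
```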

Failed Tasks

If task_completion_latency_seconds_bucket_count{outcome="Failure"} is increasing (see the polling sketch after this list):

  1. Make sure the invocation input payload is valid.
  2. Check for the root cause in the stdout or stderr of failed tasks using the Indexify UI at {server_url}/ui
  3. Review your Compute Graph code to see whether the output observed in stdout or stderr is explained by the code itself.
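
To see whether failures are actively accruing, you can sample the failure counter twice and compare. The polling interval and server URL in the sketch below are placeholders, and the counter name follows the table above.

```python
# Sketch: sample the failed-task counter twice and report how many tasks failed in between.
# SERVER_URL and POLL_SECONDS are illustrative values only.
import time
import requests
from prometheus_client.parser import text_string_to_metric_families

SERVER_URL = "http://localhost:8900"
POLL_SECONDS = 60  # arbitrary polling interval

def failed_task_count() -> float:
    text = requests.get(f"{SERVER_URL}/metrics/service", timeout=10).text
    for family in text_string_to_metric_families(text):
        for sample in family.samples:
            if (
                sample.name == "task_completion_latency_seconds_bucket_count"
                and sample.labels.get("outcome") == "Failure"
            ):
                return sample.value
    return 0.0

before = failed_task_count()
time.sleep(POLL_SECONDS)
after = failed_task_count()
if after > before:
    print(f"{after - before:.0f} task(s) failed in the last {POLL_SECONDS}s; check stdout/stderr in the UI")
```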