Prometheus Metrics
Indexify Server exposes Prometheus metrics at {server_url}/metrics/service. These metrics are valuable for monitoring system health and performance.
Key Metrics for Monitoring
| Metric | Description | Use Case |
|---|---|---|
| active_invocations_gauge | Count of uncompleted invocations | Monitors system backlog |
| active_tasks | Count of uncompleted tasks | Tracks overall system load |
| unallocated_tasks | Count of tasks not allocated to executors | Identifies resource constraints |
| max_invocation_age_seconds | Age of oldest running invocation | Detects stuck invocations |
| max_task_age_seconds | Age of oldest running task | Identifies abnormally long-running tasks |
| task_completion_latency_seconds_bucket_count{outcome="Success"} | Count of successfully completed tasks | Tracks successful throughput |
| task_completion_latency_seconds_bucket_count{outcome="Failure"} | Count of failed tasks | Monitors system errors |
| task_completion_latency_seconds_bucket | Distribution of task completion times | Analyzes performance trends |

All of the metrics above are available from the /metrics/service endpoint.
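
As a rough illustration of reading these metrics outside of a Prometheus server, the sketch below fetches {server_url}/metrics/service and parses the Prometheus text exposition format into a dictionary. The function names `parse_metrics` and `fetch_metrics` are hypothetical helpers, not part of Indexify, and the parser is deliberately simplified: it assumes label values contain no `}` and it ignores optional timestamps. Dictionary keys include the label set exactly as the server prints it, so the real keys may differ from the names in the table if the server attaches additional labels.

```python
import re
import urllib.request

# Matches sample lines such as `active_tasks 12` or `name{label="x"} 3.5`.
# Simplified: assumes label values contain no `}` and ignores trailing timestamps.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Parse Prometheus text exposition into {series_name_with_labels: float}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP, and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        try:
            samples[name + (labels or "")] = float(value)
        except ValueError:
            continue  # skip values that are not valid floats
    return samples


def fetch_metrics(server_url, timeout=5.0):
    """Fetch and parse the metrics endpoint (hypothetical helper)."""
    url = server_url.rstrip("/") + "/metrics/service"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

With a reachable server, `fetch_metrics(server_url).get("unallocated_tasks")` would return the current gauge value, or `None` if the series is exposed under a different name or label set.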
Troubleshooting Endpoints
Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:

| Endpoint | Description | Use Case |
|---|---|---|
| {server_url}/internal/allocations | Lists current allocations per executor | Debugging executor load balance |
| {server_url}/internal/unallocated_tasks | Lists all tasks not being allocated | Identifying resource bottlenecks |
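
The two endpoints above can also be queried from a script. The following is a minimal sketch under stated assumptions: the response format is not specified here, so the code tries to decode the body as JSON and falls back to raw text. `INTERNAL_PATHS`, `internal_url`, and `fetch_internal` are hypothetical names introduced for illustration.

```python
import json
import urllib.request

# Paths taken from the table above; the dictionary keys are illustrative labels.
INTERNAL_PATHS = {
    "allocations": "/internal/allocations",
    "unallocated_tasks": "/internal/unallocated_tasks",
}


def internal_url(server_url, name):
    """Build the full URL for one of the internal troubleshooting endpoints."""
    return server_url.rstrip("/") + INTERNAL_PATHS[name]


def fetch_internal(server_url, name, timeout=5.0):
    """Return the endpoint body, decoded as JSON when possible, else as text.

    Assumption: the response encoding is not documented here, so both
    cases are handled.
    """
    with urllib.request.urlopen(internal_url(server_url, name), timeout=timeout) as response:
        body = response.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body
```

For example, `fetch_internal(server_url, "unallocated_tasks")` retrieves the list of tasks that are waiting for an executor, in whatever shape the server returns it.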
Common Troubleshooting Scenarios
High Count of Unallocated Tasks
If the unallocated_tasks metric is high:
- Check if you have executors capable of handling the specific task types.
  Note: Make sure every unallocated task has at least one executor whose --function argument matches that task.
- Check the current load on executors by examining the /internal/allocations endpoint to see if executors are at capacity
- Examine executor logs for errors
- Examine server logs for errors
Abnormally Long-Running Tasks
If max_task_age_seconds is unusually high:
- Use /internal/allocations to identify the specific long-running tasks
- Check the stdout of long-running tasks using the Indexify UI at {server_url}/ui
- Check the logs of the executors handling these tasks
- Consider adjusting resource allocations or timeouts
Failed Tasks
If task_completion_latency_seconds_bucket_count{outcome="Failure"} is increasing:
- Make sure the Invocation Input Payload is valid.
- Check for the root cause in the stdout or stderr of failed tasks using the Indexify UI at {server_url}/ui
- Review your Compute Graph code to see whether the logs seen in stdout or stderr can be explained.
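
The three scenarios above can be folded into a small automated check. The sketch below compares two metric snapshots, each a plain dictionary mapping series names to values, and reports which scenarios apply. The `triage` function and its thresholds (`max_unallocated`, `max_task_age`) are hypothetical; this document does not define what counts as "high", so suitable values depend on your workload. The failure series key assumes `outcome` is the only label on that series.

```python
# Series name as written in the metrics table; the exact label set exposed
# by a real server is an assumption.
FAILURE_SERIES = 'task_completion_latency_seconds_bucket_count{outcome="Failure"}'


def triage(previous, current, max_unallocated=10, max_task_age=600.0):
    """Return the names of the troubleshooting scenarios that currently apply.

    `previous` and `current` are metric snapshots taken at two points in time.
    The default thresholds are placeholders, not recommended values.
    """
    findings = []
    if current.get("unallocated_tasks", 0.0) > max_unallocated:
        findings.append("High Count of Unallocated Tasks")
    if current.get("max_task_age_seconds", 0.0) > max_task_age:
        findings.append("Abnormally Long-Running Tasks")
    # The failure count only ever grows, so any increase between snapshots
    # means new failures occurred in that interval.
    if current.get(FAILURE_SERIES, 0.0) > previous.get(FAILURE_SERIES, 0.0):
        findings.append("Failed Tasks")
    return findings
```

Comparing snapshots rather than looking at a single value matters for the failure count, since an old nonzero total does not indicate that tasks are failing now.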