Prometheus Metrics
Indexify Server exposes Prometheus metrics at {server_url}/metrics/service. These metrics are valuable for monitoring system health and performance.
Key Metrics for Monitoring
Metric | Description | Use Case |
---|---|---|
active_invocations_gauge | Count of uncompleted invocations | Monitors system backlog |
active_tasks | Count of uncompleted tasks | Tracks overall system load |
unallocated_tasks | Count of tasks not allocated to executors | Identifies resource constraints |
max_invocation_age_seconds | Age of oldest running invocation | Detects stuck invocations |
max_task_age_seconds | Age of oldest running task | Identifies abnormally long-running tasks |
task_completion_latency_seconds_bucket_count{outcome="Success"} | Count of successfully completed tasks | Tracks successful throughput |
task_completion_latency_seconds_bucket_count{outcome="Failure"} | Count of failed tasks | Monitors system errors |
task_completion_latency_seconds_bucket | Distribution of task completion times | Analyzes performance trends |
All of these metrics are served from the /metrics/service endpoint.
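
To inspect these values outside a full Prometheus setup, the endpoint can be scraped directly. The following is a minimal sketch, assuming the endpoint returns the standard Prometheus text exposition format. The helper names parse_metrics and fetch_metrics are illustrative and not part of Indexify, and the regular expression is a simplification that does not handle label values containing a closing brace.

```python
import re
import urllib.request

# Matches lines such as: name{label="value"} 12.5
# Simplified: assumes label values never contain "}".
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Parse Prometheus text format into {(name, labels): value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comments.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name = match.group(1)
        labels = match.group(2) or ""
        samples[(name, labels)] = float(match.group(3))
    return samples


def fetch_metrics(server_url):
    """Scrape {server_url}/metrics/service and return the parsed samples."""
    url = f"{server_url}/metrics/service"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Example (replace the placeholder with your own server URL):
#   samples = fetch_metrics("<server_url>")
#   print(samples.get(("unallocated_tasks", "")))
```

Unlabeled metrics are keyed with an empty label string, so a lookup such as ("active_tasks", "") returns the gauge value if the server reports it in that form.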
Troubleshooting Endpoints
Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:
Endpoint | Description | Use Case |
---|---|---|
{server_url}/internal/allocations | Lists current allocations per executor | Debugging executor load balance |
{server_url}/internal/unallocated_tasks | Lists all tasks not being allocated | Identifying resource bottlenecks |
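
The endpoints in the table above can be queried with any HTTP client. This document does not describe their response format, so the sketch below assumes nothing about it: it pretty-prints the body when it happens to be JSON and returns it unchanged otherwise. The helper names format_body and dump_internal are hypothetical.

```python
import json
import urllib.request


def format_body(body):
    """Pretty-print the body if it is valid JSON, else return it as-is."""
    try:
        return json.dumps(json.loads(body), indent=2, sort_keys=True)
    except json.JSONDecodeError:
        return body


def dump_internal(server_url, path):
    """Fetch an internal endpoint, e.g. path="/internal/allocations"."""
    with urllib.request.urlopen(f"{server_url}{path}", timeout=10) as resp:
        return format_body(resp.read().decode("utf-8"))


# Example (replace the placeholder with your own server URL):
#   print(dump_internal("<server_url>", "/internal/unallocated_tasks"))
```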
Common Troubleshooting Scenarios
High Count of Unallocated Tasks
If the unallocated_tasks metric is high:
- Check if you have executors capable of handling the specific task types. Note: Make sure every unallocated task has at least one executor whose --function argument matches it.
- Check the current load on executors by examining the /internal/allocations endpoint to see if executors are at capacity
- Examine executor logs for errors
- Examine server logs for errors
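
What counts as "high" depends on the workload. One way to make it concrete is to compare unallocated_tasks with active_tasks. The sketch below assumes unallocated tasks are also counted among active tasks, and the 0.5 warning ratio is an arbitrary illustrative threshold, not an Indexify recommendation. The input is a plain dictionary of metric values taken from a scrape.

```python
def backlog_report(metrics, warn_ratio=0.5):
    """Summarize how much of the active work is waiting for an executor.

    `metrics` maps metric names to their current values.
    `warn_ratio` is a hypothetical threshold; tune it for your deployment.
    """
    active = metrics.get("active_tasks", 0.0)
    unallocated = metrics.get("unallocated_tasks", 0.0)
    # Avoid dividing by zero when the system is idle.
    ratio = unallocated / active if active > 0 else 0.0
    return {
        "active": active,
        "unallocated": unallocated,
        "ratio": ratio,
        "warn": ratio >= warn_ratio,
    }
```

A ratio that stays close to 1 suggests that almost nothing is being scheduled, which points toward missing or mismatched executors rather than executors that are merely busy.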
Abnormally Long-Running Tasks
If max_task_age_seconds is unusually high:
- Use /internal/allocations to identify the specific long-running tasks
- Check the stdout of long-running tasks using the Indexify UI at {server_url}/ui
- Check the logs of the executors handling these tasks
- Consider adjusting resource allocations or timeouts
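
Whether an age is "unusually high" can be decided by comparing the age gauges against limits derived from your own expected run times. The sketch below is one possible check. The limits shown are hypothetical placeholders, and the function name stale_work is illustrative.

```python
# Hypothetical limits in seconds; derive real ones from your workloads.
DEFAULT_LIMITS = {
    "max_task_age_seconds": 600.0,
    "max_invocation_age_seconds": 3600.0,
}


def stale_work(metrics, limits=DEFAULT_LIMITS):
    """Return the age metrics in `metrics` that exceed their limit."""
    over = {}
    for name, limit in limits.items():
        age = metrics.get(name)
        # Metrics missing from the scrape are ignored rather than flagged.
        if age is not None and age > limit:
            over[name] = age
    return over
```

An empty result means every reported age is within its limit. A non-empty result is a cue to start with the /internal/allocations step above.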
Failed Tasks
If task_completion_latency_seconds_bucket_count{outcome="Failure"} is increasing:
- Make sure the invocation input payload is valid
- Check for the root cause in the stdout or stderr of failed tasks using the Indexify UI at {server_url}/ui
- Review your Compute Graph code to see whether it explains the output seen in stdout or stderr
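
Because the success and failure counts only grow, "increasing" is easiest to judge from the difference between two scrapes rather than from the raw totals. The sketch below computes a failure rate over such an interval. It assumes you have already read the outcome="Success" and outcome="Failure" counts from each scrape into small dictionaries; the keys "success" and "failure" and the function name failure_rate are illustrative.

```python
def failure_rate(prev, curr):
    """Fraction of tasks that failed between two scrapes.

    `prev` and `curr` each hold the cumulative "success" and "failure"
    counts from one scrape. Returns None if a counter went backwards,
    which usually means the server restarted between the scrapes.
    """
    failed = curr["failure"] - prev["failure"]
    succeeded = curr["success"] - prev["success"]
    if failed < 0 or succeeded < 0:
        return None
    total = failed + succeeded
    # No tasks completed in the interval, so there is nothing to report.
    if total == 0:
        return 0.0
    return failed / total
```

For example, going from 100 successes and 5 failures to 190 successes and 15 failures means 10 of the 100 tasks completed in that interval failed.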