# Monitoring and Troubleshooting
This guide explains how to monitor and troubleshoot your Indexify deployment using the available Prometheus metrics and internal troubleshooting endpoints.
## Prometheus Metrics

Indexify Server exposes Prometheus metrics at `{server_url}/metrics/service`. These metrics are valuable for monitoring system health and performance.
### Key Metrics for Monitoring

| Metric | Description | Use Case |
|---|---|---|
| `active_invocations_gauge` | Count of uncompleted invocations | Monitors system backlog |
| `active_tasks` | Count of uncompleted tasks | Tracks overall system load |
| `unallocated_tasks` | Count of tasks not allocated to executors | Identifies resource constraints |
| `max_invocation_age_seconds` | Age of the oldest running invocation | Detects stuck invocations |
| `max_task_age_seconds` | Age of the oldest running task | Identifies abnormally long-running tasks |
| `task_completion_latency_seconds_bucket_count{outcome="Success"}` | Count of successfully completed tasks | Tracks successful throughput |
| `task_completion_latency_seconds_bucket_count{outcome="Failure"}` | Count of failed tasks | Monitors system errors |
| `task_completion_latency_seconds_bucket` | Distribution of task completion times | Analyzes performance trends |
Additional internal metrics are available and documented at the `/metrics/service` endpoint.
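The same gauges can also be pulled programmatically for dashboards or quick health checks. Below is a minimal Python sketch that uses only the standard library; the server address (`http://localhost:8900`) is an assumption, so substitute the URL of your own deployment:

```python
# Minimal sketch: scrape the metrics endpoint and print a few of the gauges
# listed above. The server address is an assumption -- substitute your own.
import urllib.request

SERVER_URL = "http://localhost:8900"  # assumed Indexify Server address

WATCHED = {
    "active_invocations_gauge",
    "active_tasks",
    "unallocated_tasks",
    "max_invocation_age_seconds",
    "max_task_age_seconds",
}

def fetch_metrics(server_url: str) -> str:
    """Return the raw Prometheus exposition text from /metrics/service."""
    with urllib.request.urlopen(f"{server_url}/metrics/service", timeout=10) as resp:
        return resp.read().decode("utf-8")

def parse_gauges(text: str) -> dict:
    """Pick out unlabelled samples whose metric name is in WATCHED."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        # Skips HELP/TYPE comments, blank lines, and labelled series.
        if len(parts) >= 2 and parts[0] in WATCHED:
            values[parts[0]] = float(parts[1])
    return values

if __name__ == "__main__":
    for metric, value in sorted(parse_gauges(fetch_metrics(SERVER_URL)).items()):
        print(f"{metric}: {value}")
```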
## Troubleshooting Endpoints
Indexify provides internal endpoints for deeper troubleshooting when issues are detected through metrics:
| Endpoint | Description | Use Case |
|---|---|---|
| `{server_url}/internal/allocations` | Lists current allocations per executor | Debugging executor load balance |
| `{server_url}/internal/unallocated_tasks` | Lists all tasks not being allocated | Identifying resource bottlenecks |
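Both endpoints can be queried with any HTTP client. Here is a minimal sketch, reusing the assumed server address from the previous snippet; the response format is assumed to be JSON, with a fallback to printing the raw body:

```python
# Minimal sketch: dump the two internal troubleshooting endpoints.
# SERVER_URL is an assumption -- point it at your own Indexify Server.
import json
import urllib.request

SERVER_URL = "http://localhost:8900"

def fetch(path: str) -> str:
    """Return the raw response body for an internal endpoint."""
    with urllib.request.urlopen(f"{SERVER_URL}{path}", timeout=10) as resp:
        return resp.read().decode("utf-8")

for path in ("/internal/allocations", "/internal/unallocated_tasks"):
    body = fetch(path)
    try:
        # Pretty-print if the endpoint returns JSON; otherwise show the raw text.
        print(path, json.dumps(json.loads(body), indent=2), sep="\n")
    except json.JSONDecodeError:
        print(path, body, sep="\n")
```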
## Common Troubleshooting Scenarios
### High Count of Unallocated Tasks

If the `unallocated_tasks` metric is high:

- Check that you have executors capable of handling the specific task types. Note: make sure that every unallocated task has at least one executor whose `--function` argument matches that task's function.
- Check the current load on executors by examining the `/internal/allocations` endpoint to see if executors are at capacity.
- Examine executor logs for errors.
- Examine server logs for errors.
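If you want to be alerted before a backlog builds up, a simple approach is to poll the gauge and warn when it stays elevated for several consecutive samples. A rough sketch; the server address, threshold, and interval are illustrative assumptions:

```python
# Rough sketch: warn when unallocated_tasks stays above a threshold for several
# consecutive polls. The server address, threshold, and interval are
# illustrative assumptions.
import time
import urllib.request

SERVER_URL = "http://localhost:8900"  # assumed Indexify Server address
THRESHOLD, INTERVAL_S, MAX_STRIKES = 50, 30, 3

def unallocated_tasks() -> float:
    """Read the unallocated_tasks gauge from the metrics endpoint."""
    with urllib.request.urlopen(f"{SERVER_URL}/metrics/service", timeout=10) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "unallocated_tasks":
                return float(parts[1])
    return 0.0  # gauge absent -> treat as no backlog

strikes = 0
while True:
    value = unallocated_tasks()
    strikes = strikes + 1 if value > THRESHOLD else 0
    if strikes >= MAX_STRIKES:
        print(f"WARNING: unallocated_tasks={value} for {strikes} consecutive polls; "
              "check executor capacity and --function arguments")
    time.sleep(INTERVAL_S)
```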
### Abnormally Long-Running Tasks

If `max_task_age_seconds` is unusually high:

- Use `/internal/allocations` to identify the specific long-running tasks.
- Check the `stdout` of long-running tasks using the Indexify UI at `{server_url}/ui`.
- Check the logs of the executors handling these tasks.
- Consider adjusting resource allocations or timeouts.
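It can also help to distinguish a single stuck task from a general slowdown by comparing `max_task_age_seconds` with a completion-latency quantile estimated from `task_completion_latency_seconds_bucket`. The sketch below approximates p95 from the histogram buckets; it assumes the buckets carry `le` and `outcome` labels, which may differ in your deployment:

```python
# Rough sketch: approximate the p95 task-completion latency from the histogram
# buckets exposed at /metrics/service. Assumes buckets are labelled with
# le="..." and outcome="Success"; adjust if your label set differs.
import re
import urllib.request

SERVER_URL = "http://localhost:8900"  # assumed Indexify Server address
BUCKET_RE = re.compile(
    r'^task_completion_latency_seconds_bucket\{(?P<labels>[^}]*)\}\s+(?P<count>\S+)'
)

def p95_latency():
    cumulative = {}  # bucket upper bound -> cumulative count (summed over labels)
    with urllib.request.urlopen(f"{SERVER_URL}/metrics/service", timeout=10) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            m = BUCKET_RE.match(line)
            if not m or 'outcome="Success"' not in m["labels"]:
                continue
            le = re.search(r'le="([^"]+)"', m["labels"]).group(1)
            upper = float("inf") if le == "+Inf" else float(le)
            cumulative[upper] = cumulative.get(upper, 0.0) + float(m["count"])
    if not cumulative or cumulative[max(cumulative)] == 0:
        return None
    target = 0.95 * cumulative[max(cumulative)]  # the +Inf bucket holds the total
    for upper in sorted(cumulative):
        if cumulative[upper] >= target:
            return upper  # first bucket whose cumulative count covers p95
    return None

print("approx p95 completion latency (s):", p95_latency())
```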
### Failed Tasks

If `task_completion_latency_seconds_bucket_count{outcome="Failure"}` is increasing:

- Make sure the invocation input payload is valid.
- Check for the root cause in the `stdout` or `stderr` of failed tasks using the Indexify UI at `{server_url}/ui`.
- Review your Compute Graph code to see whether it explains the errors seen in `stdout` or `stderr`.
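To quantify the trend rather than eyeballing raw counter values, you can sample the Success and Failure counters twice and compute the share of newly completed tasks that failed. A rough sketch; the server address and the sampling interval are illustrative assumptions:

```python
# Rough sketch: sample the Success/Failure completion counters twice and report
# what fraction of the tasks completed in between failed. The server address
# and sampling interval are illustrative assumptions.
import time
import urllib.request

SERVER_URL = "http://localhost:8900"
INTERVAL_S = 60

def completion_counts():
    """Return {'Success': n, 'Failure': n} summed across any extra labels."""
    counts = {"Success": 0.0, "Failure": 0.0}
    with urllib.request.urlopen(f"{SERVER_URL}/metrics/service", timeout=10) as resp:
        for line in resp.read().decode("utf-8").splitlines():
            if not line.startswith("task_completion_latency_seconds_bucket_count{"):
                continue
            series, _, value = line.rpartition(" ")
            for outcome in counts:
                if f'outcome="{outcome}"' in series:
                    counts[outcome] += float(value)
    return counts

before = completion_counts()
time.sleep(INTERVAL_S)
after = completion_counts()

failed = after["Failure"] - before["Failure"]
total = failed + (after["Success"] - before["Success"])
if total > 0:
    print(f"{failed:.0f} of {total:.0f} tasks completed in the last "
          f"{INTERVAL_S}s failed ({100 * failed / total:.1f}%)")
else:
    print(f"no tasks completed in the last {INTERVAL_S}s")
```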