Observability and Health
Every Caracal service exposes HTTP endpoints for operational visibility: /health for liveness, /ready for readiness, and, on every service except the API, /metrics for service-specific counters.
Endpoint summary
| Service | Port | /health | /ready | /metrics |
|---|---|---|---|---|
| STS | 8080 | 200 | 200 / 503 | 200 JSON |
| API | 3000 | 200 | 200 / 503 | — |
| Gateway | 8081 | 200 | 200 / 503 | 200 JSON |
| Coordinator | 4000 | 200 | 200 / 503 | 200 JSON |
| Audit | 9090 | 200 / 503 | 200 / 503 | 200 JSON |
/health is a liveness check: it reports whether the process is alive, and an orchestrator should restart the process when it fails. /ready is a readiness check: use it to gate traffic until dependencies are available.
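Under an orchestrator such as Kubernetes, these endpoints map directly onto probe definitions. A minimal sketch for the STS container, using the port from the table above; the periods and thresholds are illustrative, not prescribed by Caracal:

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
```

Because /ready returns 503 while dependencies are down, the readiness probe keeps the pod out of rotation without restarting it; only a failing /health triggers a restart.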
STS
GET /health
HTTP 200 with empty body. Always succeeds if the process is running.
GET /ready
HTTP 200 on success. HTTP 503 with a plain-text error message if any dependency is unavailable:
```text
postgres unreachable: <error>
redis unreachable: <error>
audit replay unavailable: <error>
```

Checks: Postgres Ping(), Redis connectivity, audit replay buffer readiness.
GET /metrics
HTTP 200, JSON:
```json
{
  "sts": {
    "graph_traversals": 0,
    "graph_traversal_errors": 0,
    "audit_dropped": 0,
    "audit_replay_pending": 0,
    "audit_replay_replayed": 0,
    "audit_sink_errors": 0,
    "jwks_invalid_keys": 0
  },
  "opa": { ... },
  "audit_dropped": 0
}
```

| Metric | Meaning |
|---|---|
| graph_traversals | Delegation chain traversals performed |
| graph_traversal_errors | Failed traversals (alert if non-zero) |
| audit_dropped | Audit events dropped due to sink errors — these are not persisted |
| audit_replay_pending | Events in the replay buffer awaiting re-delivery |
| audit_replay_replayed | Events successfully re-delivered from the replay buffer |
| audit_sink_errors | Errors publishing to the audit stream |
| jwks_invalid_keys | Keys in the JWKS document that fail validation (alert if non-zero) |
A non-zero audit_dropped or a climbing audit_sink_errors indicates that audit events are being silently lost. Investigate Redis availability and outbox lag.
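These audit-loss counters are easy to check from a shell. A sketch using jq against a canned sample payload; in a real check you would substitute `curl -s http://<sts-host>:8080/metrics` (the host name is deployment-specific):

```shell
# Hypothetical sample of the STS /metrics payload; replace with a live curl.
metrics='{"sts":{"audit_dropped":3,"audit_sink_errors":1}}'

dropped=$(echo "$metrics" | jq '.sts.audit_dropped')
sink_errors=$(echo "$metrics" | jq '.sts.audit_sink_errors')

# Either counter above zero means audit events were lost.
if [ "$dropped" -gt 0 ] || [ "$sink_errors" -gt 0 ]; then
  echo "ALERT: audit events lost (dropped=$dropped sink_errors=$sink_errors)"
fi
```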
API
GET /health
HTTP 200, JSON: `{ "ok": true }`
GET /ready
HTTP 200 or 503, JSON:
```json
{ "ok": true, "draining": false }
```

On failure:

```json
{ "ok": false, "error": "<reason>" }
```

Checks: draining flag (set during graceful shutdown), Postgres SELECT 1, Redis PING.
The API exposes no /metrics endpoint directly. Monitor it via the Postgres pool state (connection counts) and Redis stream consumer lag.
Gateway
GET /health
HTTP 200 with empty body.
GET /ready
HTTP 200 or 503. Checks: bindings reload health, Postgres connectivity, Redis connectivity, STS /health reachable.
GET /metrics
HTTP 200, JSON:
```json
{
  "requests_total": 0,
  "requests_allowed": 0,
  "requests_denied": 0,
  "denials_missing_auth": 0,
  "denials_bad_bearer": 0,
  "denials_expiring": 0,
  "denials_bad_routing": 0,
  "denials_path_traversal": 0,
  "denials_signature": 0,
  "denials_jti_replay": 0,
  "denials_revoked": 0,
  "denials_binding": 0,
  "sts_exchange_errors": 0,
  "upstream_errors": 0,
  "bindings_loaded": 0,
  "revocations_active": 0
}
```

| Metric | Alert condition |
|---|---|
| denials_revoked | Non-zero: sessions being revoked are reaching the Gateway |
| denials_jti_replay | Non-zero: possible replay attack or clock skew |
| sts_exchange_errors | Elevated: STS unreachable or overloaded |
| upstream_errors | Elevated: upstream services degraded |
| denials_path_traversal | Non-zero: investigate request source |
| revocations_active | Should track known revoked sessions; sudden spike indicates batch revocation |
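The security-relevant denial counters above can be polled the same way. A jq sketch over a sample payload with illustrative values; swap in `curl -s http://<gateway-host>:8081/metrics` for real data:

```shell
# Hypothetical sample of the Gateway /metrics payload.
m='{"denials_jti_replay":2,"denials_revoked":0,"denials_path_traversal":1}'

# Each of these counters warrants an alert on any non-zero value.
for key in denials_jti_replay denials_revoked denials_path_traversal; do
  v=$(echo "$m" | jq ".$key")
  if [ "$v" -gt 0 ]; then
    echo "ALERT: $key=$v"
  fi
done
```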
Coordinator
GET /health
HTTP 200, JSON: `{ "ok": true }`
GET /ready
HTTP 200 or 503, JSON:
```json
{ "ok": true }
```

On failure:

```json
{ "ok": false, "error": "<reason>" }
```

Checks: Postgres SELECT 1, Redis PING.
GET /metrics
HTTP 200, JSON (structure approximate):
```json
{
  "invocations": { "pending": 0, "running": 0, "succeeded": 0, "failed": 0, "dead": 0 },
  "outbox": { "pending": 0, "published": 0, "dead": 0 },
  "ttl_sweeper": { ... },
  "retention_cleaner": { ... }
}
```

A non-zero outbox.dead means the Coordinator has outbox rows that exceeded OUTBOX_MAX_ATTEMPTS. These events (lifecycle, delegation, revocation) will not be delivered. Investigate Redis availability.
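A quick shell check for dead outbox rows, sketched against a sample payload (replace the sample with `curl -s http://<coordinator-host>:4000/metrics`; the host name is an assumption):

```shell
# Hypothetical sample of the Coordinator /metrics payload.
m='{"outbox":{"pending":4,"published":1200,"dead":2}}'

dead=$(echo "$m" | jq '.outbox.dead')
# Any dead row represents an event that will never be delivered.
if [ "$dead" -gt 0 ]; then
  echo "ALERT: $dead dead outbox rows; events will not be delivered"
fi
```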
Audit
GET /health
HTTP 200 or 503. Checks consumer.Healthy() — the audit stream consumer must be running and processing. Returns 503 if the consumer is unhealthy.
GET /ready
HTTP 200 or 503. Checks: Postgres Ping(), Redis Ping().
GET /metrics
HTTP 200, JSON (all fields are numeric unless noted):
```json
{
  "inserts_total": 0,
  "export_events_total": 0,
  "export_errors_total": 0,
  "export_duration_ms": 0,
  "consumer_lag": 0,
  "consumer_pel_oldest_secs": 0,
  "parse_errors_total": 0,
  "dlq_total": 0,
  "retries_total": 0,
  "hmac_failures_total": 0,
  "tamper_replay_total": 0,
  "tamper_checked_total": 0,
  "tamper_mismatch_total": 0,
  "tamper_chain_breaks": 0,
  "tamper_hmac_failures": 0,
  "tamper_last_sweep_unix": 0,
  "tamper_last_full_unix": 0,
  "retention_created_total": 0,
  "retention_dropped_total": 0,
  "is_export_leader": false,
  "is_retention_leader": false
}
```

| Metric | Alert condition |
|---|---|
| consumer_lag | Growing: Audit is falling behind the event stream |
| consumer_pel_oldest_secs | Large value: messages stuck in PEL, check consumer health |
| dlq_total | Any increment: events exceeding max deliveries |
| hmac_failures_total | Any increment: stream message signature verification failures |
| tamper_mismatch_total | Any increment: audit chain integrity violation detected |
| tamper_chain_breaks | Any increment: serious tamper event — escalate immediately |
| export_errors_total | Sustained non-zero: S3 export failing, events not archived |
| is_export_leader | Should be true on exactly one Audit replica |
| is_retention_leader | Should be true on exactly one Audit replica |
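The leader flags are per-replica, so any check has to aggregate across replicas. A jq sketch over two hypothetical replica payloads, verifying the exactly-one-leader invariant and the tamper counter at the same time:

```shell
# Hypothetical metrics from two Audit replicas; in practice, curl each
# replica's /metrics endpoint on port 9090.
r1='{"is_export_leader":true,"tamper_chain_breaks":0}'
r2='{"is_export_leader":false,"tamper_chain_breaks":0}'

leaders=0
for m in "$r1" "$r2"; do
  if [ "$(echo "$m" | jq '.is_export_leader')" = "true" ]; then
    leaders=$((leaders + 1))
  fi
  breaks=$(echo "$m" | jq '.tamper_chain_breaks')
  if [ "$breaks" -gt 0 ]; then
    echo "ALERT: tamper_chain_breaks=$breaks; escalate immediately"
  fi
done

if [ "$leaders" -ne 1 ]; then
  echo "ALERT: expected exactly one export leader, found $leaders"
fi
```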
Monitoring setup
The metrics endpoints return JSON, not Prometheus exposition format. To feed them into Prometheus, use a JSON exporter sidecar (e.g., json_exporter) configured to scrape each service’s /metrics endpoint and map fields to Prometheus metrics.
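As a lightweight alternative to a sidecar (or to debug the field mapping), a jq one-liner can flatten the nested metrics JSON into exposition-style `name value` lines, joining nested keys with underscores and mapping booleans to 0/1:

```shell
# Stand-in for `curl -s http://<service-host>:<port>/metrics`.
payload='{"sts":{"graph_traversals":5},"audit_dropped":0}'

# Walk every scalar leaf, join its path with "_", and coerce booleans to 0/1.
echo "$payload" | jq -r '
  paths(scalars) as $p
  | "\($p | join("_")) \(getpath($p) | if . == true then 1 elif . == false then 0 else . end)"'
```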
Minimum recommended alerts:
| Alert | Source | Condition |
|---|---|---|
| Service down | /ready HTTP status | Returns 503 for > 1 minute |
| Audit chain tamper | Audit /metrics | tamper_chain_breaks > 0 |
| Audit DLQ growth | Audit /metrics | dlq_total increasing |
| Gateway denials spike | Gateway /metrics | denials_revoked or denials_jti_replay > threshold |
| STS audit drops | STS /metrics | audit_dropped > 0 |
| Outbox dead messages | Coordinator /metrics | outbox.dead > 0 |
Check stream consumer lag directly from Redis for deeper visibility:
```shell
# Audit ingestor lag
redis-cli -a $REDIS_PASSWORD XINFO GROUPS caracal.audit.events

# Revocation consumer lag
redis-cli -a $REDIS_PASSWORD XINFO GROUPS caracal.sessions.revoke
```