Observability and Health

Every Caracal service exposes HTTP endpoints for operational visibility: /health for liveness, /ready for readiness, and /metrics for service-specific counters. (The API is the exception: it exposes no /metrics endpoint.)

| Service | Port | /health | /ready | /metrics |
| --- | --- | --- | --- | --- |
| STS | 8080 | 200 | 200 / 503 | 200 JSON |
| API | 3000 | 200 | 200 / 503 | (none) |
| Gateway | 8081 | 200 | 200 / 503 | 200 JSON |
| Coordinator | 4000 | 200 | 200 / 503 | 200 JSON |
| Audit | 9090 | 200 / 503 | 200 / 503 | 200 JSON |

/health is a liveness check: use it to decide whether the process is alive at all, and restart it if not. /ready is a readiness check: use it to gate traffic until dependencies are available.
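A quick way to spot-check all five services from a shell; a sketch, assuming they run on localhost at the ports in the table above (real deployments will use other hostnames):

# Print the HTTP status of /health and /ready for every service
for svc in sts:8080 api:3000 gateway:8081 coordinator:4000 audit:9090; do
  name=${svc%%:*} port=${svc##*:}
  for ep in health ready; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:$port/$ep")
    echo "$name /$ep -> $code"
  done
done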


STS

/health

HTTP 200 with empty body. Always succeeds if the process is running.

/ready

HTTP 200 on success. HTTP 503 with a plain-text error message if any dependency is unavailable:

postgres unreachable: <error>
redis unreachable: <error>
audit replay unavailable: <error>

Checks: Postgres Ping(), Redis connectivity, audit replay buffer readiness.
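In deploy or init scripts, readiness can be gated with a simple poll; a minimal sketch, assuming the STS is reachable on localhost:8080:

# Block until /ready returns 200, echoing the plain-text failure reason while waiting
until curl -fsS http://localhost:8080/ready >/dev/null 2>&1; do
  echo "STS not ready: $(curl -s http://localhost:8080/ready)"
  sleep 2
done
echo "STS ready"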

/metrics

HTTP 200, JSON:

{
  "sts": {
    "graph_traversals": 0,
    "graph_traversal_errors": 0,
    "audit_dropped": 0,
    "audit_replay_pending": 0,
    "audit_replay_replayed": 0,
    "audit_sink_errors": 0,
    "jwks_invalid_keys": 0
  },
  "opa": { ... },
  "audit_dropped": 0
}
| Metric | Meaning |
| --- | --- |
| graph_traversals | Delegation chain traversals performed |
| graph_traversal_errors | Failed traversals (alert if non-zero) |
| audit_dropped | Audit events dropped due to sink errors; these are not persisted |
| audit_replay_pending | Events in the replay buffer awaiting re-delivery |
| audit_replay_replayed | Events successfully re-delivered from the replay buffer |
| audit_sink_errors | Errors publishing to the audit stream |
| jwks_invalid_keys | Keys in the JWKS document that fail validation (alert if non-zero) |

audit_dropped and audit_sink_errors indicate that audit events are being silently lost. Investigate Redis availability and outbox lag.
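A minimal loss check built on the fields above; a sketch, assuming the STS on localhost:8080 and jq installed:

# Exit non-zero if any audit events were dropped or the sink is erroring
m=$(curl -s http://localhost:8080/metrics)
dropped=$(echo "$m" | jq '.sts.audit_dropped')
sink_errors=$(echo "$m" | jq '.sts.audit_sink_errors')
if [ "$dropped" -gt 0 ] || [ "$sink_errors" -gt 0 ]; then
  echo "audit loss: dropped=$dropped sink_errors=$sink_errors" >&2
  exit 1
fi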


API

/health

HTTP 200, JSON: { "ok": true }

/ready

HTTP 200 or 503, JSON:

{
  "ok": true,
  "draining": false
}

On failure:

{
  "ok": false,
  "error": "<reason>"
}

Checks: draining flag (set during graceful shutdown), Postgres SELECT 1, Redis PING.

/metrics

The API exposes no /metrics endpoint. Monitor it via Postgres pool state (connection counts) and Redis stream consumer lag.
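Since there is no /metrics to scrape, the equivalent signals come from Postgres and Redis directly. A sketch of both checks; the database name caracal is an assumption, and the stream name is taken from the lag commands at the end of this page:

# Connections held against the API's database, grouped by state ('caracal' is an assumed name)
psql -c "SELECT state, count(*) FROM pg_stat_activity WHERE datname = 'caracal' GROUP BY state;"
# Consumer-group lag on the revocation stream
redis-cli -a $REDIS_PASSWORD XINFO GROUPS caracal.sessions.revoke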


Gateway

/health

HTTP 200 with empty body.

/ready

HTTP 200 or 503. Checks: bindings reload health, Postgres connectivity, Redis connectivity, STS /health reachable.

/metrics

HTTP 200, JSON:

{
  "requests_total": 0,
  "requests_allowed": 0,
  "requests_denied": 0,
  "denials_missing_auth": 0,
  "denials_bad_bearer": 0,
  "denials_expiring": 0,
  "denials_bad_routing": 0,
  "denials_path_traversal": 0,
  "denials_signature": 0,
  "denials_jti_replay": 0,
  "denials_revoked": 0,
  "denials_binding": 0,
  "sts_exchange_errors": 0,
  "upstream_errors": 0,
  "bindings_loaded": 0,
  "revocations_active": 0
}
| Metric | Alert condition |
| --- | --- |
| denials_revoked | Non-zero: revoked sessions are still reaching the Gateway |
| denials_jti_replay | Non-zero: possible replay attack or clock skew |
| sts_exchange_errors | Elevated: STS unreachable or overloaded |
| upstream_errors | Elevated: upstream services degraded |
| denials_path_traversal | Non-zero: investigate the request source |
| revocations_active | Should track known revoked sessions; a sudden spike indicates batch revocation |
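To inspect just the denial counters, a jq one-liner; a sketch, assuming the Gateway on localhost:8081:

# Print every denials_* counter as "name value" pairs
curl -s http://localhost:8081/metrics \
  | jq -r 'to_entries[] | select(.key | startswith("denials_")) | "\(.key) \(.value)"'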

Coordinator

/health

HTTP 200, JSON: { "ok": true }

/ready

HTTP 200 or 503, JSON:

{
  "ok": true
}

On failure:

{
  "ok": false,
  "error": "<reason>"
}

Checks: Postgres SELECT 1, Redis PING.

/metrics

HTTP 200, JSON (structure approximate):

{
  "invocations": {
    "pending": 0,
    "running": 0,
    "succeeded": 0,
    "failed": 0,
    "dead": 0
  },
  "outbox": {
    "pending": 0,
    "published": 0,
    "dead": 0
  },
  "ttl_sweeper": { ... },
  "retention_cleaner": { ... }
}

outbox.dead non-zero means the Coordinator has outbox rows that exceeded OUTBOX_MAX_ATTEMPTS. These events (lifecycle, delegation, revocation) will not be delivered. Investigate Redis availability.
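A check suitable for cron or a monitoring sidecar; a sketch, assuming the Coordinator on localhost:4000:

# Alert (exit 1) if any outbox rows have exhausted their delivery attempts
dead=$(curl -s http://localhost:4000/metrics | jq '.outbox.dead')
if [ "$dead" -gt 0 ]; then
  echo "coordinator outbox has $dead dead rows; their events will not be delivered" >&2
  exit 1
fi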


Audit

/health

HTTP 200 or 503. Checks consumer.Healthy(): the audit stream consumer must be running and processing. Returns 503 if the consumer is unhealthy.

/ready

HTTP 200 or 503. Checks: Postgres Ping(), Redis Ping().

/metrics

HTTP 200, JSON (all fields are numeric unless noted):

{
  "inserts_total": 0,
  "export_events_total": 0,
  "export_errors_total": 0,
  "export_duration_ms": 0,
  "consumer_lag": 0,
  "consumer_pel_oldest_secs": 0,
  "parse_errors_total": 0,
  "dlq_total": 0,
  "retries_total": 0,
  "hmac_failures_total": 0,
  "tamper_replay_total": 0,
  "tamper_checked_total": 0,
  "tamper_mismatch_total": 0,
  "tamper_chain_breaks": 0,
  "tamper_hmac_failures": 0,
  "tamper_last_sweep_unix": 0,
  "tamper_last_full_unix": 0,
  "retention_created_total": 0,
  "retention_dropped_total": 0,
  "is_export_leader": false,
  "is_retention_leader": false
}
| Metric | Alert condition |
| --- | --- |
| consumer_lag | Growing: Audit is falling behind the event stream |
| consumer_pel_oldest_secs | Large value: messages stuck in the PEL; check consumer health |
| dlq_total | Any increment: events exceeding max deliveries |
| hmac_failures_total | Any increment: stream message signature verification failures |
| tamper_mismatch_total | Any increment: audit chain integrity violation detected |
| tamper_chain_breaks | Any increment: serious tamper event; escalate immediately |
| export_errors_total | Sustained non-zero: S3 export failing, events not archived |
| is_export_leader | Should be true on exactly one Audit replica |
| is_retention_leader | Should be true on exactly one Audit replica |
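The leader invariant can be verified across replicas with a short loop; a sketch, assuming two replicas reachable as audit-0 and audit-1 (substitute your actual hostnames):

# Exactly one replica should report is_export_leader == true
leaders=0
for host in audit-0 audit-1; do
  [ "$(curl -s "http://$host:9090/metrics" | jq '.is_export_leader')" = "true" ] && leaders=$((leaders + 1))
done
echo "export leaders: $leaders (expected 1)"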

Prometheus integration

The metrics endpoints return JSON, not Prometheus exposition format. To feed them into Prometheus, use a JSON exporter sidecar (e.g., json_exporter) configured to scrape each service's /metrics endpoint and map fields to Prometheus metrics.
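Once the sidecar is running, a translated scrape can be tested by hand; a sketch, assuming json_exporter's default listen port 7979 and a module named default in its configuration:

# Ask json_exporter to fetch the STS metrics JSON and emit Prometheus exposition format
curl -s "http://localhost:7979/probe?module=default&target=http://localhost:8080/metrics"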

Minimum recommended alerts:

| Alert | Source | Condition |
| --- | --- | --- |
| Service down | /ready HTTP status | Returns 503 for > 1 minute |
| Audit chain tamper | Audit /metrics | tamper_chain_breaks > 0 |
| Audit DLQ growth | Audit /metrics | dlq_total increasing |
| Gateway denials spike | Gateway /metrics | denials_revoked or denials_jti_replay > threshold |
| STS audit drops | STS /metrics | audit_dropped > 0 |
| Outbox dead messages | Coordinator /metrics | outbox.dead > 0 |
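The "Service down" alert can be approximated without a full monitoring stack by polling /ready; a sketch using the ports from the table at the top of this page, with localhost as an assumption:

# Report any service whose /ready is not returning 200 right now
for svc in sts:8080 api:3000 gateway:8081 coordinator:4000 audit:9090; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:${svc##*:}/ready")
  [ "$code" = "200" ] || echo "ALERT: ${svc%%:*} /ready returned $code"
done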

Check stream consumer lag directly from Redis for deeper visibility:

# Audit ingestor lag
redis-cli -a $REDIS_PASSWORD XINFO GROUPS caracal.audit.events
# Revocation consumer lag
redis-cli -a $REDIS_PASSWORD XINFO GROUPS caracal.sessions.revoke
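Two complementary checks using standard Redis commands (XLEN and XPENDING); the consumer group name below is a placeholder, not a documented value:

# Total entries currently retained in the audit stream
redis-cli -a $REDIS_PASSWORD XLEN caracal.audit.events
# Pending-entries summary for a consumer group (replace <group> with the real group name)
redis-cli -a $REDIS_PASSWORD XPENDING caracal.audit.events <group>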