
Monitor Health and Metrics

Every service exposes health/readiness endpoints; most services also expose metrics. Use readiness for automation gates and health for liveness.

| Service | Health | Readiness | Metrics |
| --- | --- | --- | --- |
| API | /health | /ready | /metrics |
| STS | /health | /ready | /metrics, /metrics.json |
| Gateway | /health | /ready | /metrics, /metrics.json |
| Audit | /health | /ready | /metrics, /metrics.json |
| Coordinator | /health | /ready | /metrics |
| Control | /health | /ready | Not primary operator surface |
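
An automation gate built on this split would poll /ready rather than /health. The following is a minimal sketch, not Caracal tooling: the function names, the retry settings, and the assumption that a ready service answers with a 2xx status are all illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status of a GET, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses still carry a status code.
        return err.code
    except OSError:
        # Covers URLError, connection refused, and timeouts.
        return 0


def wait_until_ready(
    base_url: str,
    probe: Callable[[str], int] = http_status,
    attempts: int = 30,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll <base_url>/ready until it returns 2xx (assumed to mean ready)."""
    ready_url = base_url.rstrip("/") + "/ready"
    for attempt in range(attempts):
        if 200 <= probe(ready_url) < 300:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deploy script could call `wait_until_ready(base_url)` with the service's own address and abort the rollout when it returns `False`.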

In published builds (CARACAL_MODE=rc or stable), every metrics endpoint fails closed: it returns 401 unless METRICS_BEARER is set and the scraper presents Authorization: Bearer <token>. This applies to API, STS, Gateway, Audit, and Coordinator, all of which are reachable on the internal service network. caracal up generates the managed metricsBearer secret and mounts it into every service via METRICS_BEARER_FILE; caracal doctor discovers the same secret and authenticates its metrics probes automatically. Point external scrapers at the same secret file. In dev mode metrics stay open for local inspection.
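
For an external scraper, the bearer requirement amounts to reading the shared secret file and sending it in the Authorization header. The sketch below assumes the secret file holds the raw token, possibly followed by a newline; the file path and URL passed in are placeholders chosen by the caller, not values defined by Caracal.

```python
import urllib.request
from pathlib import Path


def metrics_request(url: str, token_file: str) -> urllib.request.Request:
    """Build an authenticated GET request for a metrics endpoint.

    Assumption: token_file contains only the bearer token; surrounding
    whitespace (such as a trailing newline) is stripped.
    """
    token = Path(token_file).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"metrics bearer file {token_file!r} is empty")
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {token}"}
    )
```

Passing the result to `urllib.request.urlopen` performs the scrape. A 401 from a published build then points at a missing or mismatched token rather than a network problem.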

flowchart TB
  Health[Process health] --> Storage[Postgres and Redis]
  Storage --> Streams[Streams, outbox, revocation, policy invalidation]
  Streams --> Service[Service-specific readiness]
  Service --> Smoke[End-to-end smoke test]

| Rung | What it proves | User-facing impact when it fails |
| --- | --- | --- |
| Process health | The service process can answer liveness. | Restart loops or dead containers. |
| Postgres and Redis | Durable state and stream/cache dependencies are reachable. | Management writes, token exchange, audit, or revocation can fail. |
| Streams and outbox | Event delivery paths are draining. | Decisions may succeed while evidence or invalidation lags. |
| Service readiness | Service-specific invariants are met. | That service should not receive production traffic. |
| End-to-end smoke test | API, STS, Gateway, Audit, and Coordinator work together. | User workflows may fail even when individual services look healthy. |

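
The ladder is meant to be read bottom-up: the lowest failing rung is the one to investigate first, because every rung above it depends on it. A minimal sketch of that ordering follows. The rung names come from the table; the function and the shape of the `results` mapping are illustrative assumptions.

```python
from typing import Dict, Optional

# Rungs in dependency order, lowest first.
LADDER = [
    "Process health",
    "Postgres and Redis",
    "Streams and outbox",
    "Service readiness",
    "End-to-end smoke test",
]


def first_failing_rung(results: Dict[str, bool]) -> Optional[str]:
    """Return the lowest rung that is not passing, or None if all pass.

    A rung missing from `results` is treated as failing, so an
    unchecked rung is never silently assumed healthy.
    """
    for rung in LADDER:
        if not results.get(rung, False):
            return rung
    return None
```

With streams and service readiness both failing, the function reports "Streams and outbox", which is where the investigation should start.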
caracal status
caracal status --ready
bash infra/scripts/smokeTest.sh

smokeTest.sh probes API /ready and /health, Gateway /ready, STS /ready, Audit /ready, and Coordinator /ready.
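
The same probe list can be expressed outside the script, for example in a custom check. This sketch only mirrors the endpoints listed above and is not the implementation of smokeTest.sh. The `base_urls` mapping, the injected `probe` callable, and the 2xx success rule are assumptions.

```python
from typing import Callable, Dict, List

# (service, path) pairs as listed for smokeTest.sh.
SMOKE_PROBES = [
    ("API", "/ready"),
    ("API", "/health"),
    ("Gateway", "/ready"),
    ("STS", "/ready"),
    ("Audit", "/ready"),
    ("Coordinator", "/ready"),
]


def failed_probes(
    base_urls: Dict[str, str], probe: Callable[[str], int]
) -> List[str]:
    """Return "<service> <path>" for every probe that is not 2xx.

    `probe` takes a URL and returns an HTTP status code; `base_urls`
    maps each service name to its base URL (caller-supplied).
    """
    failures = []
    for service, path in SMOKE_PROBES:
        url = base_urls[service].rstrip("/") + path
        if not 200 <= probe(url) < 300:
            failures.append(f"{service} {path}")
    return failures
```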

kubectl -n caracal get pods
kubectl -n caracal get servicemonitor,prometheusrule
kubectl -n caracal logs deploy/caracal-api

Set serviceMonitor.enabled when using Prometheus Operator. Keep the chart's alert rules enabled or provide equivalent alerts.

Use Console diagnostics for API health, readiness, zone diagnostics, and local preflight checks. Use Console audit and request trace views for decision investigation.

| Symptom | Check |
| --- | --- |
| Health passes but readiness fails | Dependency, stream, outbox, policy, revocation, or audit readiness. |
| Metrics scrape returns unauthorized | Published builds require METRICS_BEARER; set it and match the scraper’s bearer token. |
| Readiness flaps | CPU throttling, OOM, Postgres/Redis latency, probe timeouts, or dependency restarts. |
| Smoke test fails only for API /health | Confirm the API liveness endpoint responds in the deployment shape being tested. |
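
The first three rows of the symptom table can be turned into a small triage helper. This is a sketch under stated assumptions: the inputs (a liveness result, a list of recent readiness samples, and the last metrics status code) and the flap threshold of three state changes are illustrative, not values Caracal defines.

```python
from typing import List, Sequence

CHECK_READINESS = "Dependency, stream, outbox, policy, revocation, or audit readiness."
CHECK_METRICS = "Set METRICS_BEARER and match the scraper's bearer token."
CHECK_FLAPPING = "CPU throttling, OOM, Postgres/Redis latency, probe timeouts, or dependency restarts."


def count_transitions(samples: Sequence[bool]) -> int:
    """Count how often consecutive readiness samples differ."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a != b)


def suggest_checks(
    health_ok: bool,
    ready_samples: Sequence[bool],
    metrics_status: int,
    flap_threshold: int = 3,  # assumed threshold, tune per environment
) -> List[str]:
    """Map observed symptoms to the checks from the table."""
    checks = []
    ready_now = bool(ready_samples) and ready_samples[-1]
    if health_ok and not ready_now:
        checks.append(CHECK_READINESS)
    if metrics_status == 401:
        checks.append(CHECK_METRICS)
    if count_transitions(ready_samples) >= flap_threshold:
        checks.append(CHECK_FLAPPING)
    return checks
```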

Use Configure Alerts to wire the readiness, audit, revocation, policy, and capacity signals into on-call response.