Monitor Health and Metrics
Every service exposes health/readiness endpoints; most services also expose metrics. Use readiness for automation gates and health for liveness.
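An automation gate built on readiness can be sketched as a bounded polling loop. This is a minimal sketch, not part of the product: the function names, the retry defaults, and the assumption that a ready service answers `/ready` with HTTP 200 are all illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True when the URL answers HTTP 200 (assumed 'ready' signal)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx responses all count as not ready.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll a readiness probe; True on first success, False when attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)  # no sleep after the final attempt
    return False
```

For example, `wait_until_ready(lambda: http_ok("http://localhost:8080/ready"))` would gate a deploy step on a hypothetical local API address. Injecting `probe` and `sleep` keeps the loop testable without a running service.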
Endpoint Map
| Service | Health | Readiness | Metrics |
|---|---|---|---|
| API | /health | /ready | /metrics |
| STS | /health | /ready | /metrics, /metrics.json |
| Gateway | /health | /ready | /metrics, /metrics.json |
| Audit | /health | /ready | /metrics, /metrics.json |
| Coordinator | /health | /ready | /metrics |
| Control | /health | /ready | Not primary operator surface |
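The endpoint map can be kept as data so that probes and scrape targets are generated rather than hand-written. The sketch below copies the paths from the table; the `probe_urls` helper and the base URL in the usage note are hypothetical.

```python
# Paths copied from the endpoint map above. Control has no metrics entry
# because the table does not list it as a primary operator surface.
HEALTH_PATH = "/health"
READY_PATH = "/ready"
METRICS_PATHS = {
    "api": ["/metrics"],
    "sts": ["/metrics", "/metrics.json"],
    "gateway": ["/metrics", "/metrics.json"],
    "audit": ["/metrics", "/metrics.json"],
    "coordinator": ["/metrics"],
    "control": [],
}


def probe_urls(service: str, base_url: str) -> dict:
    """Build the health, readiness, and metrics URLs for one service."""
    if service not in METRICS_PATHS:
        raise ValueError(f"unknown service: {service}")
    base = base_url.rstrip("/")  # tolerate a trailing slash on the base URL
    return {
        "health": [base + HEALTH_PATH],
        "ready": [base + READY_PATH],
        "metrics": [base + path for path in METRICS_PATHS[service]],
    }
```

Calling `probe_urls("sts", "http://sts.example.internal:8080")` would return one health URL, one readiness URL, and both metrics URLs for STS; the host name is a placeholder.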
Metrics Authentication
In published builds (CARACAL_MODE=rc or stable), every metrics endpoint fails closed: it returns 401 unless METRICS_BEARER is set and the scraper presents Authorization: Bearer <token>. This applies to API, STS, Gateway, Audit, and Coordinator, all of which are reachable on the internal service network. caracal up generates the managed metricsBearer secret and mounts it into every service via METRICS_BEARER_FILE; caracal doctor discovers the same secret and authenticates its metrics probes automatically. Point external scrapers at the same secret file. In dev mode, metrics stay open for local inspection.
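The scraper side of this contract can be sketched in a few lines. This is a minimal sketch under two assumptions: the secret file holds only the token (possibly followed by a newline), and the helper names are illustrative rather than part of any Caracal tooling.

```python
import urllib.request
from pathlib import Path


def read_bearer(secret_file: str) -> str:
    """Read the metrics bearer token from a mounted secret file."""
    # Assumption: the file contains only the token, optionally with whitespace.
    token = Path(secret_file).read_text(encoding="utf-8").strip()
    if not token:
        raise ValueError(f"bearer secret file is empty: {secret_file}")
    return token


def build_metrics_request(url: str, token: str) -> urllib.request.Request:
    """Build a scrape request carrying the Authorization: Bearer header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
```

Passing the resulting request to `urllib.request.urlopen` performs the scrape. In a published build, a 401 response would then point at a missing or mismatched token rather than at the service itself.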
Readiness Ladder
    flowchart TB
      Health[Process health] --> Storage[Postgres and Redis]
      Storage --> Streams[Streams, outbox, revocation, policy invalidation]
      Streams --> Service[Service-specific readiness]
      Service --> Smoke[End-to-end smoke test]

| Rung | What it proves | User-facing impact when it fails |
|---|---|---|
| Process health | The service process can answer liveness. | Restart loops or dead containers. |
| Postgres and Redis | Durable state and stream/cache dependencies are reachable. | Management writes, token exchange, audit, or revocation can fail. |
| Streams and outbox | Event delivery paths are draining. | Decisions may succeed while evidence or invalidation lags. |
| Service readiness | Service-specific invariants are met. | That service should not receive production traffic. |
| End-to-end smoke test | API, STS, Gateway, Audit, and Coordinator work together. | User workflows may fail even when individual services look healthy. |
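One way to use the ladder is to evaluate the rungs in order and stop at the first failure, since a failed lower rung makes the results above it hard to interpret. The sketch below copies rung names and impacts from the table; the `first_failed_rung` helper and the idea of supplying one check callable per rung are assumptions for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

# Rung names and user-facing impacts, copied from the table above, bottom rung first.
LADDER = [
    ("Process health", "Restart loops or dead containers."),
    ("Postgres and Redis", "Management writes, token exchange, audit, or revocation can fail."),
    ("Streams and outbox", "Decisions may succeed while evidence or invalidation lags."),
    ("Service readiness", "That service should not receive production traffic."),
    ("End-to-end smoke test", "User workflows may fail even when individual services look healthy."),
]


def first_failed_rung(
    checks: Dict[str, Callable[[], bool]],
) -> Optional[Tuple[str, str]]:
    """Run the checks in ladder order; return (rung, impact) for the first failure.

    Returns None when every rung passes. Checks above the first failed rung
    are not run. A rung with no entry in `checks` raises KeyError.
    """
    for rung, impact in LADDER:
        if not checks[rung]():
            return rung, impact
    return None
```

Each callable would wrap whatever probe suits the deployment, for example an HTTP call to `/health` for the first rung and the smoke test script for the last.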
Local Checks
    caracal status
    caracal status --ready
    bash infra/scripts/smokeTest.sh

smokeTest.sh probes API /ready and /health, Gateway /ready, STS /ready, Audit /ready, and Coordinator /ready.
Kubernetes Checks
    kubectl -n caracal get pods
    kubectl -n caracal get servicemonitor,prometheusrule
    kubectl -n caracal logs deploy/caracal-api

Enable serviceMonitor.enabled when using Prometheus Operator. Keep chart alert rules enabled or provide equivalent alerts.
Operator Diagnostics
Use Console diagnostics for API health, readiness, zone diagnostics, and local preflight checks. Use Console audit and request trace views for decision investigation.
Troubleshooting
| Symptom | Check |
|---|---|
| Health passes but readiness fails | Dependency, stream, outbox, policy, revocation, or audit readiness. |
| Metrics scrape returns unauthorized | Published builds require METRICS_BEARER; set it and match the scraper’s bearer token. |
| Readiness flaps | CPU throttling, OOM, Postgres/Redis latency, probe timeouts, or dependency restarts. |
| Smoke test fails only for API /health | Confirm the API liveness endpoint responds in the deployment shape being tested. |
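The first three rows of the table can be turned into a small triage helper that maps probe observations to the suggested check. This is a rough sketch: the `triage` function, its inputs, and the flap threshold of two state changes are assumptions, and the hint text is copied from the table.

```python
from typing import List, Optional

FLAP_TRANSITIONS = 2  # assumed threshold: two or more ready/not-ready changes


def triage(
    health_ok: bool,
    ready_samples: List[bool],
    metrics_status: Optional[int] = None,
) -> List[str]:
    """Map probe observations to the matching troubleshooting hints."""
    hints = []
    # Count how often consecutive readiness samples disagree.
    transitions = sum(1 for a, b in zip(ready_samples, ready_samples[1:]) if a != b)
    if transitions >= FLAP_TRANSITIONS:
        hints.append(
            "Readiness flaps: check CPU throttling, OOM, Postgres/Redis latency, "
            "probe timeouts, or dependency restarts."
        )
    elif health_ok and ready_samples and not ready_samples[-1]:
        hints.append(
            "Health passes but readiness fails: check dependency, stream, outbox, "
            "policy, revocation, or audit readiness."
        )
    if metrics_status == 401:
        hints.append(
            "Metrics scrape returns unauthorized: set METRICS_BEARER and match "
            "the scraper's bearer token."
        )
    return hints
```

`ready_samples` is a chronological list of recent readiness results, oldest first. An empty return value means none of these three symptoms matched.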
Next Step
Use Configure Alerts to wire the readiness, audit, revocation, policy, and capacity signals into on-call response.

