---
title: "Monitor Health and Metrics"
url: "https://docs.caracal.run/operations/observability/"
markdown_url: "https://docs.caracal.run/markdown/operations/observability.md"
description: "Monitor Caracal health, readiness, metrics, audit flow, and runtime diagnostics."
page_type: "workflow"
concepts: []
requires: []
---

# Monitor Health and Metrics

Every service exposes health and readiness endpoints, and most also expose metrics. Gate automation on readiness; use health only as a liveness signal.
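As an illustration of gating automation on readiness, the following is a minimal sketch of a polling loop. It assumes that a ready service answers `/ready` with a 2xx status; the address in the example comment and the function names `probe` and `wait_until_ready` are hypothetical and not part of Caracal.

```bash
#!/usr/bin/env bash
set -eu

# Return 0 when the URL answers with a 2xx status, non-zero otherwise.
# Assumption: a ready service responds to /ready with a 2xx status code.
probe() {
  curl --fail --silent --output /dev/null --max-time 2 "$1"
}

# Poll a readiness URL until it succeeds or the attempts run out.
# Usage: wait_until_ready URL [ATTEMPTS] [DELAY_SECONDS]
wait_until_ready() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if probe "$url"; then
      echo "ready after $i attempt(s): $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts: $url" >&2
  return 1
}

# Example gate (hypothetical address):
# wait_until_ready "http://localhost:8080/ready" 30 2
```

Polling `/ready` rather than `/health` keeps the gate closed while the process is alive but its dependencies are still unavailable.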

## Endpoint Map

| Service | Health | Readiness | Metrics |
| --- | --- | --- | --- |
| API | `/health` | `/ready` | `/metrics` |
| STS | `/health` | `/ready` | `/metrics`, `/metrics.json` |
| Gateway | `/health` | `/ready` | `/metrics`, `/metrics.json` |
| Audit | `/health` | `/ready` | `/metrics`, `/metrics.json` |
| Coordinator | `/health` | `/ready` | `/metrics` |
| Control | `/health` | `/ready` | Not a primary operator surface |

## Metrics Authentication

In published builds (`CARACAL_MODE=rc` or `stable`), every metrics endpoint fails closed: it returns `401` unless `METRICS_BEARER` is set and the scraper presents `Authorization: Bearer <token>`. This applies to API, STS, Gateway, Audit, and Coordinator, all of which are reachable on the internal service network. `caracal up` generates the managed `metricsBearer` secret and mounts it into every service via `METRICS_BEARER_FILE`; `caracal doctor` discovers the same secret and authenticates its metrics probes automatically. Point external scrapers at the same secret file. In `dev` mode metrics stay open for local inspection.
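To make the bearer requirement concrete, here is a minimal sketch of an authenticated scrape with `curl`. The host, port, and secret file path in the example comment are assumptions, and the helper names `metrics_auth_header` and `scrape_metrics` are hypothetical; substitute the location where your deployment exposes the managed `metricsBearer` secret.

```bash
#!/usr/bin/env bash
set -eu

# Build the Authorization header from a token file.
# Trailing newlines are stripped so the header value is exactly the token.
metrics_auth_header() {
  local token_file="$1"
  local token
  token="$(tr -d '\r\n' < "$token_file")"
  if [ -z "$token" ]; then
    echo "empty token in $token_file" >&2
    return 1
  fi
  printf 'Authorization: Bearer %s' "$token"
}

# Fetch a metrics endpoint with the bearer header.
# --fail makes curl exit non-zero on HTTP errors such as 401.
scrape_metrics() {
  local url="$1"
  local token_file="$2"
  local header
  header="$(metrics_auth_header "$token_file")" || return 1
  curl --fail --silent --show-error --max-time 5 -H "$header" "$url"
}

# Example (hypothetical address and secret path):
# scrape_metrics "http://localhost:8080/metrics" "/run/secrets/metricsBearer"
```

Reading the token from a file at request time keeps it out of scripts and shell history. Passing it with `-H` still exposes it in the local process list, so a production scraper would normally use its own credentials-file setting instead.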

## Readiness Ladder

```mermaid
flowchart TB
  Health[Process health] --> Storage[Postgres and Redis]
  Storage --> Streams[Streams, outbox, revocation, policy invalidation]
  Streams --> Service[Service-specific readiness]
  Service --> Smoke[End-to-end smoke test]
```

| Rung | What it proves | User-facing impact when it fails |
| --- | --- | --- |
| Process health | The service process can answer liveness. | Restart loops or dead containers. |
| Postgres and Redis | Durable state and stream/cache dependencies are reachable. | Management writes, token exchange, audit, or revocation can fail. |
| Streams and outbox | Event delivery paths are draining. | Decisions may succeed while evidence or invalidation lags. |
| Service readiness | Service-specific invariants are met. | That service should not receive production traffic. |
| End-to-end smoke test | API, STS, Gateway, Audit, and Coordinator work together. | User workflows may fail even when individual services look healthy. |

## Local Checks

```bash
caracal status
caracal status --ready
bash infra/scripts/smokeTest.sh
```

`smokeTest.sh` probes API `/ready` and `/health`, Gateway `/ready`, STS `/ready`, Audit `/ready`, and Coordinator `/ready`.
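When the repository script is not at hand, a similar sweep can be approximated by hand. The sketch below is not `smokeTest.sh` itself; it only probes the same list of endpoints. The base URLs and their environment variable names are hypothetical, and it assumes that a passing probe returns a 2xx status.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical base URLs; override them to match your deployment.
API_URL="${API_URL:-http://localhost:8080}"
GATEWAY_URL="${GATEWAY_URL:-http://localhost:8081}"
STS_URL="${STS_URL:-http://localhost:8082}"
AUDIT_URL="${AUDIT_URL:-http://localhost:8083}"
COORDINATOR_URL="${COORDINATOR_URL:-http://localhost:8084}"

# Print the HTTP status code for a URL (curl prints 000 if no response arrived).
http_status() {
  curl --silent --output /dev/null --max-time 3 --write-out '%{http_code}' "$1" || true
}

# Probe one endpoint and print a PASS or FAIL line; returns non-zero on FAIL.
check() {
  local name="$1"
  local url="$2"
  local code
  code="$(http_status "$url")"
  case "$code" in
    2??) echo "PASS $name $url" ;;
    *)   echo "FAIL $name $url (HTTP $code)"; return 1 ;;
  esac
}

# Run every probe even if an earlier one fails, then return the overall status.
probe_all() {
  local failed=0
  check "api-ready" "$API_URL/ready" || failed=1
  check "api-health" "$API_URL/health" || failed=1
  check "gateway-ready" "$GATEWAY_URL/ready" || failed=1
  check "sts-ready" "$STS_URL/ready" || failed=1
  check "audit-ready" "$AUDIT_URL/ready" || failed=1
  check "coordinator-ready" "$COORDINATOR_URL/ready" || failed=1
  return "$failed"
}

# Example: probe_all
```

Running all probes before returning shows every failing service in one pass, which makes it easier to tell a single broken service from a shared dependency outage.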

## Kubernetes Checks

```bash
kubectl -n caracal get pods
kubectl -n caracal get servicemonitor,prometheusrule
kubectl -n caracal logs deploy/caracal-api
```

Enable `serviceMonitor.enabled` when using Prometheus Operator. Keep chart alert rules enabled or provide equivalent alerts.
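One way to confirm that the monitoring objects were actually created is to check that both kinds exist in the namespace. This is a minimal sketch that relies only on `kubectl get ... -o name`; the function names `has_resource` and `check_monitoring` are hypothetical, and the `caracal` namespace default is taken from the commands above.

```bash
#!/usr/bin/env bash
set -eu

# Succeed only if at least one object of the given kind exists in the namespace.
# A kubectl error (for example, the CRD is not installed) also counts as missing.
has_resource() {
  local kind="$1"
  local ns="${2:-caracal}"
  local names
  names="$(kubectl -n "$ns" get "$kind" -o name 2>/dev/null)" || return 1
  [ -n "$names" ]
}

# Report whether ServiceMonitor and PrometheusRule objects are present.
# Returns non-zero if either kind is missing.
check_monitoring() {
  local ns="${1:-caracal}"
  local missing=0
  local kind
  for kind in servicemonitor prometheusrule; do
    if has_resource "$kind" "$ns"; then
      echo "found $kind in $ns"
    else
      echo "missing $kind in $ns"
      missing=1
    fi
  done
  return "$missing"
}

# Example: check_monitoring caracal
```

A missing `servicemonitor` usually points back to `serviceMonitor.enabled`, while a missing `prometheusrule` suggests the chart alert rules were disabled and need an equivalent replacement.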

## Operator Diagnostics

Use Console `diagnostics` for API health, readiness, zone diagnostics, and local preflight checks. Use Console `audit` and `request trace` views for decision investigation.

## Troubleshooting

| Symptom | Check |
| --- | --- |
| Health passes but readiness fails | Dependency, stream, outbox, policy, revocation, or audit readiness. |
| Metrics scrape returns unauthorized | Published builds require `METRICS_BEARER`; confirm it is set and that the scraper sends the same token as `Authorization: Bearer <token>`. |
| Readiness flaps | CPU throttling, OOM, Postgres/Redis latency, probe timeouts, or dependency restarts. |
| Smoke test fails only for API `/health` | Confirm the API liveness endpoint responds in the deployment shape being tested. |

## Next Step

Use [Configure Alerts](/operations/alerts/) to wire the readiness, audit, revocation, policy, and capacity signals into on-call response.
