Recover from Failures
Caracal is designed to fail closed at access-safety boundaries. Recovery should restore evidence, revocation state, and policy freshness before reopening risky traffic.
Failure Matrix
| Failure | User impact | Recovery |
|---|---|---|
| Postgres unavailable | API/STS/Gateway/Audit/Coordinator readiness fails or degrades. | Restore Postgres, confirm migrations, check pools, replay outboxes. |
| Redis unavailable | Streams, revocation, policy invalidation, audit ingestion, and coordination lag. | Restore Redis, verify streams/groups, watch pending entries and replay backlog. |
| STS unavailable | Token exchange and Gateway exchanges fail. | Restore STS readiness, policy bundle freshness, JWKS, and HMAC config. |
| Gateway unhealthy | Protected upstream traffic fails before provider dispatch. | Check bindings, STS exchange, revocation snapshot, upstream allowlist, and audit replay. |
| Audit unhealthy | Evidence ingestion delayed; DLQ or replay grows. | Recover Audit/Postgres/Redis, replay DLQ, verify tamper chain. |
| Coordinator unhealthy | Agent/delegation views and lifecycle management fail. | Restore Coordinator readiness, token, DB, Redis, and sweeper health. |
| Control unhealthy | Automation dispatch unavailable. | Use Console/Admin SDK directly if appropriate; restore Control gate, token, JWKS, Redis, and API reachability. |
Recovery Order
1. Freeze risky rollouts.
2. Restore Postgres and Redis.
3. Restore STS, API, Gateway, Audit, Coordinator.
4. Drain audit replay, DLQ, outboxes.
5. Confirm revocation and policy freshness.
6. Resume traffic or rollout.
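
The recovery order above is strictly sequential: a later stage should not start until the earlier one has passed. A minimal sketch of that gating in Python follows. The check callables, `run_recovery`, and `may_resume` are hypothetical names used for illustration; they are not part of Caracal, and each check would need to be wired to your own readiness probes.

```python
from typing import Callable, List, Tuple

# Stage names mirror the recovery order described above.
STAGE_NAMES = [
    "freeze risky rollouts",
    "restore Postgres and Redis",
    "restore STS, API, Gateway, Audit, Coordinator",
    "drain audit replay, DLQ, outboxes",
    "confirm revocation and policy freshness",
]

# A stage is a name plus a check that returns True once the stage is done.
# The checks are placeholders (assumption): supply your own probes.
Stage = Tuple[str, Callable[[], bool]]


def run_recovery(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first one whose check fails.

    Returns the names of the stages that passed.
    """
    completed: List[str] = []
    for name, check in stages:
        if not check():
            break  # fail closed: never skip ahead past a failing stage
        completed.append(name)
    return completed


def may_resume(stages: List[Stage], completed: List[str]) -> bool:
    """Traffic or rollout may resume only when every stage has passed."""
    return len(completed) == len(stages)
```

Stopping at the first failing stage keeps the runner aligned with the fail-closed posture: traffic resumes only after evidence, revocation, and policy freshness have all been confirmed.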
Data-Flow Reconciliation
- Check Postgres migrations and expected tables.
- Check Redis stream groups and pending entries.
- Check API and Coordinator outbox tables.
- Check Audit DLQ and audit replay directories.
- Check Gateway revocation snapshot freshness.
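
The checks above can be folded into a single pass/fail report. The sketch below assumes you have already collected the relevant counts and ages by your own means; the `ReconciliationSnapshot` fields, the `blocking_items` function, and the 60-second freshness threshold are all illustrative assumptions, not Caracal defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReconciliationSnapshot:
    """Hypothetical summary of the reconciliation checks listed above."""
    pending_migrations: int           # Postgres migrations not yet applied
    missing_tables: int               # expected tables that are absent
    redis_pending_entries: int        # stream entries delivered but not acked
    outbox_rows: int                  # unsent API and Coordinator outbox rows
    dlq_entries: int                  # Audit dead-letter queue depth
    replay_files: int                 # files left in audit replay directories
    revocation_snapshot_age_s: float  # age of the Gateway revocation snapshot


def blocking_items(
    snap: ReconciliationSnapshot,
    max_snapshot_age_s: float = 60.0,  # assumed threshold, tune for your setup
) -> List[str]:
    """Return the checks that still block resuming traffic (empty if clean)."""
    problems: List[str] = []
    if snap.pending_migrations > 0 or snap.missing_tables > 0:
        problems.append("postgres schema")
    if snap.redis_pending_entries > 0:
        problems.append("redis pending entries")
    if snap.outbox_rows > 0:
        problems.append("outbox backlog")
    if snap.dlq_entries > 0 or snap.replay_files > 0:
        problems.append("audit dlq or replay")
    if snap.revocation_snapshot_age_s > max_snapshot_age_s:
        problems.append("stale revocation snapshot")
    return problems
```

An empty result means every listed check is clean. Any non-empty result names the areas to drain or refresh before resuming.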
Troubleshooting
| Symptom | Action |
|---|---|
| Requests fail closed after outage | Confirm STS policy freshness, revocation snapshot, and audit replay drain. |
| Events appear duplicated | Verify idempotent consumers and dedupe keys before manual cleanup. |
| Rollback does not restore service | Schema may have moved forward; roll forward with a compatible fix. |
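
For the duplicated-events symptom, it helps to recall what an idempotent consumer with dedupe keys looks like. The sketch below is a generic illustration rather than Caracal's implementation. The `dedupe_key` field name and the `IdempotentConsumer` class are hypothetical, and the in-memory set stands in for whatever durable store a real consumer would use.

```python
from typing import Callable, Dict, Set


class IdempotentConsumer:
    """Applies each event at most once, keyed by a dedupe key."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        # Assumption: in-memory only; a real consumer needs durable storage
        # so that dedupe state survives restarts.
        self._seen: Set[str] = set()

    def handle(self, event: Dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        key = event["dedupe_key"]  # hypothetical field name
        if key in self._seen:
            return False
        self._handler(event)
        # Record the key only after the handler succeeds, so a failed
        # attempt can be retried when the event is redelivered.
        self._seen.add(key)
        return True
```

If replayed events pass through a consumer like this, duplicates in the stream are harmless and manual cleanup may be unnecessary.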
Next Step
Use Run Failure Drills to rehearse the recovery path before a production incident.

