Recover from Failures

Caracal is designed to fail closed at access-safety boundaries. Recovery should restore evidence, revocation, and policy freshness before reopening risky traffic.
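
As a minimal sketch of what failing closed could look like at a decision point, the following assumes hypothetical freshness timestamps for the policy bundle and revocation snapshot, plus a count of audit records still awaiting replay. The type, function, and thresholds are illustrative assumptions, not Caracal APIs; the point is only that any stale or unknown input results in a denial.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical thresholds; real values depend on the deployment.
MAX_POLICY_AGE_S = 300.0
MAX_REVOCATION_AGE_S = 60.0


@dataclass
class SafetyState:
    """Illustrative snapshot of safety inputs (not a Caracal type)."""
    policy_loaded_at: Optional[float]      # epoch seconds, None if unknown
    revocation_loaded_at: Optional[float]  # epoch seconds, None if unknown
    audit_replay_backlog: Optional[int]    # records awaiting replay, None if unknown


def may_reopen_traffic(state: SafetyState, now: float) -> bool:
    """Return True only when every safety input is known and fresh."""
    if state.policy_loaded_at is None or state.revocation_loaded_at is None:
        return False  # unknown freshness is treated as stale
    if state.audit_replay_backlog is None or state.audit_replay_backlog > 0:
        return False  # evidence has not fully drained yet
    if now - state.policy_loaded_at > MAX_POLICY_AGE_S:
        return False  # policy bundle is too old
    if now - state.revocation_loaded_at > MAX_REVOCATION_AGE_S:
        return False  # revocation snapshot is too old
    return True
```

The default answer is "no": traffic reopens only when every input is present and within its limit, so a missing signal during an outage cannot be mistaken for a healthy one.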

| Failure | User impact | Recovery |
| --- | --- | --- |
| Postgres unavailable | API/STS/Gateway/Audit/Coordinator readiness fails or degrades. | Restore Postgres, confirm migrations, check pools, replay outboxes. |
| Redis unavailable | Streams, revocation, policy invalidation, audit ingestion, and coordination lag. | Restore Redis, verify streams/groups, watch pending entries and replay backlog. |
| STS unavailable | Token exchange and Gateway exchanges fail. | Restore STS readiness, policy bundle freshness, JWKS, and HMAC config. |
| Gateway unhealthy | Protected upstream traffic fails before provider dispatch. | Check bindings, STS exchange, revocation snapshot, upstream allowlist, and audit replay. |
| Audit unhealthy | Evidence ingestion delayed; DLQ or replay grows. | Recover Audit/Postgres/Redis, replay DLQ, verify tamper chain. |
| Coordinator unhealthy | Agent/delegation views and lifecycle management fail. | Restore Coordinator readiness, token, DB, Redis, and sweeper health. |
| Control unhealthy | Automation dispatch unavailable. | Use Console/Admin SDK directly if appropriate; restore Control gate, token, JWKS, Redis, and API reachability. |

flowchart TD
  Freeze[Freeze risky rollouts] --> Storage[Restore Postgres and Redis]
  Storage --> Services[Restore STS, API, Gateway, Audit, Coordinator]
  Services --> Evidence[Drain audit replay, DLQ, outboxes]
  Evidence --> Safety[Confirm revocation and policy freshness]
  Safety --> Resume[Resume traffic or rollout]
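
The ordering in the flowchart above can be sketched as a small runner that executes phases in sequence and stops at the first one that is not yet healthy. This is an illustrative assumption about how a team might script the sequence; `run_recovery` and the phase checks are hypothetical names, and each check would be supplied by the operator.

```python
from typing import Callable, List, Tuple

# A phase is a label plus a check that returns True once the phase is healthy.
Phase = Tuple[str, Callable[[], bool]]


def run_recovery(phases: List[Phase]) -> Tuple[List[str], str]:
    """Run phases in order and stop at the first one whose check fails.

    Returns (names of completed phases, name of the blocking phase or "").
    """
    completed: List[str] = []
    for name, check in phases:
        try:
            ok = check()
        except Exception:
            ok = False  # a check that errors counts as not recovered
        if not ok:
            return completed, name
        completed.append(name)
    return completed, ""
```

A caller would pass the phases in flowchart order, for example `("restore Postgres and Redis", storage_is_ready)` followed by `("restore services", services_are_ready)`, where both check functions are placeholders. Because the runner never skips ahead, traffic is not resumed while an earlier phase is still failing.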

1. Check Postgres migrations and expected tables.
2. Check Redis stream groups and pending entries.
3. Check API and Coordinator outbox tables.
4. Check Audit DLQ and audit replay directories.
5. Check Gateway revocation snapshot freshness.
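
For the Redis step in the checklist above, a rough sketch of a pending-entry report might look like the following. It assumes a redis-py client created with `decode_responses=True` and uses only its `xinfo_groups()` method; the stream and group names in the usage note are placeholders, since the actual stream names are not listed here.

```python
from typing import Any, Dict, Iterable, List


def pending_by_group(client: Any, streams: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map each stream to {consumer group name: pending entry count}.

    `client` is assumed to behave like redis.Redis(decode_responses=True);
    only xinfo_groups(stream) is called on it.
    """
    report: Dict[str, Dict[str, int]] = {}
    for stream in streams:
        groups = client.xinfo_groups(stream)
        report[stream] = {g["name"]: int(g["pending"]) for g in groups}
    return report


def groups_over_limit(report: Dict[str, Dict[str, int]], limit: int) -> List[str]:
    """Return sorted "stream/group" labels whose pending count exceeds limit."""
    return sorted(
        f"{stream}/{group}"
        for stream, groups in report.items()
        for group, pending in groups.items()
        if pending > limit
    )
```

With a real client, `pending_by_group(client, ["example-stream"])` would report the backlog per consumer group, and `groups_over_limit(report, 100)` would list the groups that still need to drain. A stream that does not exist makes `xinfo_groups()` raise an error, which this sketch deliberately leaves unhandled so that a missing stream is visible.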

| Symptom | Action |
| --- | --- |
| Requests fail closed after outage | Confirm STS policy freshness, revocation snapshot, and audit replay drain. |
| Events appear duplicated | Verify idempotent consumers and dedupe keys before manual cleanup. |
| Rollback does not restore service | Schema may have moved forward; roll forward with a compatible fix. |
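
To illustrate why duplicated events are usually harmless when consumers are idempotent, here is a minimal sketch of a consumer that skips any event whose dedupe key it has already applied. The `source` and `event_id` fields and the in-memory `seen` set are assumptions for illustration; a real consumer would keep its keys in durable storage.

```python
from typing import Callable, Dict, Iterable, Set

Event = Dict[str, str]


def dedupe_key(event: Event) -> str:
    """Build a dedupe key from the hypothetical `source` and `event_id` fields."""
    return f'{event["source"]}:{event["event_id"]}'


def consume(events: Iterable[Event], seen: Set[str], apply: Callable[[Event], None]) -> int:
    """Apply each event at most once and return how many were applied."""
    applied = 0
    for event in events:
        key = dedupe_key(event)
        if key in seen:
            continue  # replayed or redelivered event, already applied
        apply(event)
        seen.add(key)  # record the key only after apply succeeds
        applied += 1
    return applied
```

Replaying the same batch twice applies each event once, so a second delivery after an outage changes nothing. If a consumer behaves this way, duplicates in a stream or outbox are safe to leave in place rather than clean up by hand.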

Use the Run Failure Drills page to rehearse the recovery path before a production incident.