Skip to content

Run Failure Drills

A failure drill injects one fault, confirms the expected alert fires, observes readiness behavior, then validates recovery. Run these against a non-production cluster before you depend on Caracal in production. The alerts referenced here are the PrometheusRule recipes shipped by the chart; the recovery steps extend Recover from Failures.

Caracal fails closed for access-safety boundaries, so a healthy drill ends with evidence, revocation, and policy freshness restored before traffic resumes.

  1. Confirm a green baseline: all /ready endpoints pass and no alerts are firing.
  2. Inject the fault from the drill.
  3. Confirm the expected alert transitions to firing within its for window.
  4. Observe the documented readiness and traffic behavior.
  5. Remove the fault and complete the recovery validation.
  6. Record time-to-detect and time-to-recover.
FieldValue
InjectScale managed Redis to unavailable, or block the Redis egress port.
Expected alertsCaracalAuditConsumerLagHigh, CaracalGatewayRevocationSnapshotStale, CaracalGatewayRevocationReloadErrors, and replay backlog via CaracalGatewayAuditReplayBacklogOld / CaracalSTSAuditReplayBacklogOld.
Expected behaviorStreams, revocation refresh, and audit ingestion lag; STS and Gateway write audit to replay volumes; readiness degrades for affected services.
RecoverRestore Redis, verify streams and consumer groups, drain replay backlog and DLQ, confirm revocation snapshot is fresh before resuming risky traffic.
FieldValue
InjectFail over or block the managed Postgres endpoint.
Expected alertsCaracalPostgresPoolSaturation, CaracalAPIOutboxPendingOldest, and CaracalReadinessFlapping.
Expected behaviorAPI, STS, Gateway, Audit, and Coordinator readiness fail or degrade; outbox delivery stalls.
RecoverRestore Postgres, confirm migrations applied, check connection pools, replay outboxes, confirm readiness stabilizes.
FieldValue
InjectScale STS to zero or block Gateway-to-STS traffic.
Expected alertsCaracalGatewaySTSExchangeErrors then CaracalGatewaySTSCircuitOpen.
Expected behaviorGateway token exchanges fail closed; protected upstream traffic is rejected before provider dispatch.
RecoverRestore STS readiness, JWKS, HMAC config, and policy bundle freshness; confirm the circuit closes and a canary exchange succeeds.
FieldValue
InjectPause policy distribution or hold the STS policy bundle past its freshness budget.
Expected alertsCaracalSTSPolicyBundleStale.
Expected behaviorSTS keeps evaluating the last good bundle; new policy activations do not take effect.
RecoverRestore distribution, confirm the active policy set version, and verify the alert clears.
FieldValue
InjectActivate a Rego policy set with a compile error in a test zone.
Expected alertsCaracalSTSOPACompileErrors.
Expected behaviorSTS rejects the broken bundle and continues on the last good policy; the activation does not widen access.
RecoverRoll the policy set forward to a valid version and confirm the alert clears.
FieldValue
InjectRevoke or expire the upstream provider credential the Gateway brokers.
Expected alertsCaracalSTSProviderRefreshErrors then CaracalSTSProviderCircuitOpen.
Expected behaviorProvider-backed exchanges fail closed; resources without brokered credentials are unaffected.
RecoverRestore the provider credential, confirm refresh succeeds, and verify the circuit closes.
FieldValue
InjectRevoke a session and delay the revocation stream consumer.
Expected alertsCaracalGatewayRevocationPropagationLag.
Expected behaviorRevocation eventually reaches the Gateway; the drill measures the propagation window against your budget.
RecoverClear the consumer delay, confirm the revoked session is denied at the Gateway, and verify the alert clears.
FieldValue
InjectPause the audit consumer or fault the audit datastore.
Expected alertsCaracalAuditDLQNonEmpty, CaracalAuditDLQGrowth, CaracalAPIOutboxDeadMessages.
Expected behaviorEvidence ingestion is delayed; DLQ and outbox backlogs grow but are not lost.
RecoverRestore the consumer and datastore, replay the DLQ and outboxes, and verify the tamper chain is intact.
FieldValue
InjectIn a disposable environment only, modify a stored audit record out of band.
Expected alertsCaracalAuditTamperDetected.
Expected behaviorThe integrity check flags the break; treat as a security incident.
RecoverRestore from a trusted backup, identify the blast radius, and follow incident response.

For every drill, confirm the green baseline returns: readiness passes, the alert clears, the DLQ and replay backlogs are empty, and revocation and policy freshness are confirmed. Capture the detect and recover timings so you can set realistic alert thresholds and on-call expectations.

Use Back Up and Retain Data to make sure recovery evidence, secrets, audit records, and durable state can be restored.