A failure drill injects one fault, confirms the expected alert fires, observes readiness behavior, then validates recovery. Run these drills against a non-production cluster before you depend on Caracal in production. The alerts referenced here are defined in the PrometheusRule recipes shipped by the chart; the recovery steps extend Recover from Failures.
Caracal fails closed for access-safety boundaries, so a healthy drill ends with evidence, revocation, and policy freshness restored before traffic resumes.
- Confirm a green baseline: all `/ready` endpoints pass and no alerts are firing.
- Inject the fault from the drill.
- Confirm the expected alert transitions to firing within its `for` window.
- Observe the documented readiness and traffic behavior.
- Remove the fault and complete the recovery validation.
- Record time-to-detect and time-to-recover.
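The steps above can be sketched as a small harness that refuses to start without a green baseline, always removes the fault, and records both timings. This is a minimal sketch, not part of Caracal: `run_drill`, `wait_until`, and `DrillTimings` are hypothetical names, and the four callables (`is_green`, `inject`, `alert_firing`, `remove`) are placeholders you would wire to your own cluster and alerting checks. Observing the documented readiness and traffic behavior stays a manual step.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class DrillTimings:
    time_to_detect: float   # seconds from fault injection to the alert firing
    time_to_recover: float  # seconds from fault removal to a green baseline


def wait_until(check: Callable[[], bool], timeout: float, interval: float,
               clock: Callable[[], float], sleep: Callable[[float], None]) -> float:
    """Poll check() until it returns True and return the elapsed seconds."""
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("condition not met within timeout")
        sleep(interval)


def run_drill(is_green: Callable[[], bool],
              inject: Callable[[], None],
              alert_firing: Callable[[], bool],
              remove: Callable[[], None],
              *,
              timeout: float = 600.0,
              interval: float = 5.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> DrillTimings:
    """Run one drill: baseline, inject, detect, remove, recover."""
    if not is_green():
        raise RuntimeError("baseline is not green; refusing to inject a fault")
    inject()
    try:
        detect = wait_until(alert_firing, timeout, interval, clock, sleep)
    finally:
        # Remove the fault even if the alert never fired.
        remove()
    recover = wait_until(is_green, timeout, interval, clock, sleep)
    return DrillTimings(time_to_detect=detect, time_to_recover=recover)
```

Passing `clock` and `sleep` as parameters keeps the harness testable without real waiting; in a real drill the defaults are enough.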
| Field | Value |
|---|---|
| Inject | Scale managed Redis to unavailable, or block the Redis egress port. |
| Expected alerts | CaracalAuditConsumerLagHigh, CaracalGatewayRevocationSnapshotStale, CaracalGatewayRevocationReloadErrors, and the replay backlog alerts CaracalGatewayAuditReplayBacklogOld and CaracalSTSAuditReplayBacklogOld. |
| Expected behavior | Streams, revocation refresh, and audit ingestion lag; STS and Gateway write audit to replay volumes; readiness degrades for affected services. |
| Recover | Restore Redis, verify streams and consumer groups, drain replay backlog and DLQ, confirm revocation snapshot is fresh before resuming risky traffic. |
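When a drill expects several alerts, as the Redis drill above does, it helps to check the whole set at once rather than eyeballing a dashboard. The sketch below assumes you fetch alert state from the Prometheus HTTP API (`GET /api/v1/alerts`) and pass the decoded JSON in; `missing_alerts` and `REDIS_DRILL_ALERTS` are hypothetical names, and the alert names are copied from the table.

```python
from typing import Iterable, Set

# Alert names taken from the Redis drill table; adjust per drill.
REDIS_DRILL_ALERTS = [
    "CaracalAuditConsumerLagHigh",
    "CaracalGatewayRevocationSnapshotStale",
    "CaracalGatewayRevocationReloadErrors",
    "CaracalGatewayAuditReplayBacklogOld",
    "CaracalSTSAuditReplayBacklogOld",
]


def missing_alerts(payload: dict, expected: Iterable[str]) -> Set[str]:
    """Return the expected alert names that are not firing in the payload.

    payload is the decoded body of a Prometheus /api/v1/alerts response.
    Alerts in the "pending" state do not count as firing.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    firing = {
        alert.get("labels", {}).get("alertname")
        for alert in alerts
        if alert.get("state") == "firing"
    }
    return set(expected) - firing
```

An empty result means every expected alert is firing; a non-empty result lists the alerts to investigate before you call the drill a pass.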
| Field | Value |
|---|---|
| Inject | Fail over or block the managed Postgres endpoint. |
| Expected alerts | CaracalPostgresPoolSaturation, CaracalAPIOutboxPendingOldest, and CaracalReadinessFlapping. |
| Expected behavior | API, STS, Gateway, Audit, and Coordinator readiness fail or degrade; outbox delivery stalls. |
| Recover | Restore Postgres, confirm migrations applied, check connection pools, replay outboxes, confirm readiness stabilizes. |
| Field | Value |
|---|---|
| Inject | Scale STS to zero or block Gateway-to-STS traffic. |
| Expected alerts | CaracalGatewaySTSExchangeErrors then CaracalGatewaySTSCircuitOpen. |
| Expected behavior | Gateway token exchanges fail closed; protected upstream traffic is rejected before provider dispatch. |
| Recover | Restore STS readiness, JWKS, HMAC config, and policy bundle freshness; confirm the circuit closes and a canary exchange succeeds. |
| Field | Value |
|---|---|
| Inject | Pause policy distribution or hold the STS policy bundle past its freshness budget. |
| Expected alerts | CaracalSTSPolicyBundleStale. |
| Expected behavior | STS keeps evaluating the last good bundle; new policy activations do not take effect. |
| Recover | Restore distribution, confirm the active policy set version, and verify the alert clears. |
| Field | Value |
|---|---|
| Inject | Activate a Rego policy set with a compile error in a test zone. |
| Expected alerts | CaracalSTSOPACompileErrors. |
| Expected behavior | STS rejects the broken bundle and continues on the last good policy; the activation does not widen access. |
| Recover | Roll the policy set forward to a valid version and confirm the alert clears. |
| Field | Value |
|---|---|
| Inject | Revoke or expire the upstream provider credential the Gateway brokers. |
| Expected alerts | CaracalSTSProviderRefreshErrors then CaracalSTSProviderCircuitOpen. |
| Expected behavior | Provider-backed exchanges fail closed; resources without brokered credentials are unaffected. |
| Recover | Restore the provider credential, confirm refresh succeeds, and verify the circuit closes. |
| Field | Value |
|---|---|
| Inject | Revoke a session and delay the revocation stream consumer. |
| Expected alerts | CaracalGatewayRevocationPropagationLag. |
| Expected behavior | Revocation eventually reaches the Gateway; the drill measures the propagation window against your budget. |
| Recover | Clear the consumer delay, confirm the revoked session is denied at the Gateway, and verify the alert clears. |
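One way to measure the propagation window in the revocation drill above is to probe the Gateway with the revoked session at a fixed interval and look for the point after which it is never accepted again. The sketch below works on recorded probe results only; `propagation_window`, `within_budget`, and the `(timestamp, allowed)` probe format are assumptions for illustration, not a Caracal interface.

```python
from typing import List, Optional, Tuple

Probe = Tuple[float, bool]  # (unix timestamp in seconds, request was allowed)


def propagation_window(revoked_at: float, probes: List[Probe]) -> Optional[float]:
    """Seconds from revocation until the session is denied and stays denied.

    Returns None if the last probe was still allowed or no probe was denied,
    which means propagation was not observed to complete.
    """
    settled_at = None
    for ts, allowed in sorted(p for p in probes if p[0] >= revoked_at):
        if allowed:
            settled_at = None   # still accepted, so propagation is not complete
        elif settled_at is None:
            settled_at = ts     # first denial of the current denied streak
    return None if settled_at is None else settled_at - revoked_at


def within_budget(window: Optional[float], budget_seconds: float) -> bool:
    """True only if propagation completed and took no longer than the budget."""
    return window is not None and window <= budget_seconds
```

Resetting on every allowed probe means a Gateway replica that still accepts the session after others have started denying it extends the measured window, which is the conservative reading for a fail-closed boundary.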
| Field | Value |
|---|---|
| Inject | Pause the audit consumer or fault the audit datastore. |
| Expected alerts | CaracalAuditDLQNonEmpty, CaracalAuditDLQGrowth, CaracalAPIOutboxDeadMessages. |
| Expected behavior | Evidence ingestion is delayed; DLQ and outbox backlogs grow, but no records are lost. |
| Recover | Restore the consumer and datastore, replay the DLQ and outboxes, and verify the tamper chain is intact. |
| Field | Value |
|---|---|
| Inject | In a disposable environment only, modify a stored audit record out of band. |
| Expected alerts | CaracalAuditTamperDetected. |
| Expected behavior | The integrity check flags the break; treat it as a security incident. |
| Recover | Restore from a trusted backup, identify the blast radius, and follow incident response. |
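To make the tamper drill above concrete, here is a generic hash chain in which each record commits to the hash of the previous one, so an out-of-band edit breaks verification from that record onward. This illustrates the general technique only; Caracal's actual record format, hash function, and integrity check are not described in this section, and every name below (`record_hash`, `build_chain`, `first_break`, `GENESIS`) is hypothetical.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # placeholder previous-hash for the first record


def record_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous hash together with a canonical JSON form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads: List[dict]) -> List[dict]:
    """Build a list of chained records from plain payloads."""
    chain, prev = [], GENESIS
    for payload in payloads:
        digest = record_hash(prev, payload)
        chain.append({"payload": payload, "prev_hash": prev, "hash": digest})
        prev = digest
    return chain


def first_break(chain: List[dict]) -> Optional[int]:
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for index, record in enumerate(chain):
        expected = record_hash(prev, record["payload"])
        if record["prev_hash"] != prev or record["hash"] != expected:
            return index
        prev = record["hash"]
    return None
```

The index returned by `first_break` is a starting point for scoping the blast radius: records before it still verify, and everything from it onward needs to be compared against a trusted backup.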
For every drill, confirm the green baseline returns: readiness passes, the alert clears, the DLQ and replay backlogs are empty, and revocation and policy freshness are confirmed. Capture the detect and recover timings so you can set realistic alert thresholds and on-call expectations.
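Timings are more useful once you have several runs per drill. The sketch below groups recorded runs by drill and reports the median and worst case for detection and recovery, which can inform alert `for` windows and on-call expectations. The `summarize` function, the tuple layout, and drill names such as `redis-outage` are illustrative assumptions.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

Run = Tuple[str, float, float]  # (drill name, time-to-detect s, time-to-recover s)


def summarize(runs: Iterable[Run]) -> Dict[str, Dict[str, float]]:
    """Group drill runs by name and report median and worst-case timings."""
    grouped = defaultdict(list)
    for drill, detect, recover in runs:
        grouped[drill].append((detect, recover))

    summary = {}
    for drill, pairs in grouped.items():
        detects = [d for d, _ in pairs]
        recovers = [r for _, r in pairs]
        summary[drill] = {
            "runs": len(pairs),
            "detect_median_s": median(detects),
            "detect_max_s": max(detects),
            "recover_median_s": median(recovers),
            "recover_max_s": max(recovers),
        }
    return summary
```

Comparing the worst-case detect time against the alert's configured window shows whether the threshold is realistic or whether the alert fires later than the rule suggests.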
Use Back Up and Retain Data to make sure recovery evidence, secrets, audit records, and durable state can be restored.