---
title: "Run Failure Drills"
url: "https://docs.caracal.run/operations/failure-drills/"
markdown_url: "https://docs.caracal.run/markdown/operations/failure-drills.md"
description: "Rehearse Caracal failures by injecting faults and confirming the expected alerts, readiness behavior, and recovery."
page_type: "workflow"
concepts: []
requires: []
---

# Run Failure Drills

A failure drill injects one fault, confirms the **expected alert** fires, observes readiness behavior, then validates recovery. Run these against a non-production cluster before you depend on Caracal in production. The alerts referenced here are the [PrometheusRule recipes](/operations/alerts/) shipped by the chart; the recovery steps extend [Recover from Failures](/operations/failure-modes/).

Caracal fails closed at access-safety boundaries, so a healthy drill ends with audit evidence, revocation state, and policy freshness all restored **before** traffic resumes.

## How to Run a Drill

1. Confirm a green baseline: all `/ready` endpoints pass and no alerts are firing.
2. Inject the fault listed in the drill's Inject row.
3. Confirm the expected alert transitions to firing within its `for` window.
4. Observe the documented readiness and traffic behavior.
5. Remove the fault and complete the recovery validation.
6. Record time-to-detect and time-to-recover.
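
The six steps above can be sketched as a small timing harness. This is a minimal sketch, not a Caracal tool: `run_drill`, `wait_until`, and `DrillResult` are hypothetical names, and the callbacks (`is_green`, `inject`, `alert_firing`, `remove`, `observe`) are placeholders you would implement against your own cluster and alerting stack. The default timeouts are arbitrary assumptions, not values from the chart.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DrillResult:
    name: str
    time_to_detect: float   # seconds from fault injection to the alert firing
    time_to_recover: float  # seconds from fault removal to a green baseline


def wait_until(check: Callable[[], bool], timeout: float, interval: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> float:
    """Poll check() until it returns True; return the elapsed seconds."""
    start = clock()
    while True:
        if check():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)


def run_drill(name: str,
              is_green: Callable[[], bool],
              inject: Callable[[], None],
              alert_firing: Callable[[], bool],
              remove: Callable[[], None],
              observe: Optional[Callable[[], None]] = None,
              *,
              detect_timeout: float = 600.0,    # assumed value
              recover_timeout: float = 1800.0,  # assumed value
              interval: float = 5.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> DrillResult:
    # Step 1: refuse to start from a degraded baseline.
    if not is_green():
        raise RuntimeError("baseline is not green; refusing to inject a fault")
    # Step 2: inject the fault.
    inject()
    try:
        # Step 3: wait for the expected alert to fire.
        ttd = wait_until(alert_firing, detect_timeout, interval, clock, sleep)
        # Step 4: optional hook for recording readiness and traffic behavior.
        if observe is not None:
            observe()
    finally:
        # Step 5: always remove the fault, even if detection timed out.
        remove()
    # Steps 5-6: wait for the green baseline and record both timings.
    ttr = wait_until(is_green, recover_timeout, interval, clock, sleep)
    return DrillResult(name, ttd, ttr)
```

The `clock` and `sleep` parameters exist so the harness can be exercised with a fake clock before it is pointed at a real cluster. Removing the fault in a `finally` block keeps a drill whose alert never fires from leaving the fault in place.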

## Drill Catalog

### Redis outage

| Field | Value |
| --- | --- |
| Inject | Scale managed Redis to unavailable, or block the Redis egress port. |
| Expected alerts | `CaracalAuditConsumerLagHigh`, `CaracalGatewayRevocationSnapshotStale`, `CaracalGatewayRevocationReloadErrors`, and replay backlog via `CaracalGatewayAuditReplayBacklogOld` / `CaracalSTSAuditReplayBacklogOld`. |
| Expected behavior | Streams, revocation refresh, and audit ingestion lag; STS and Gateway write audit to replay volumes; readiness degrades for affected services. |
| Recover | Restore Redis, verify streams and consumer groups, drain replay backlog and DLQ, confirm revocation snapshot is fresh before resuming risky traffic. |
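
A drill with several expected alerts, such as this one, is easier to verify with a script than by eye. The sketch below assumes the rules are evaluated by a Prometheus server and that you have already fetched and JSON-decoded its `GET /api/v1/alerts` response; `firing_alert_names` and `missing_alerts` are hypothetical helper names, and the sample payload is illustrative, not captured output.

```python
from typing import Iterable, Set


def firing_alert_names(alerts_payload: dict) -> Set[str]:
    """Return the alertname label of every alert in the 'firing' state."""
    alerts = alerts_payload.get("data", {}).get("alerts", [])
    return {
        alert["labels"]["alertname"]
        for alert in alerts
        if alert.get("state") == "firing"
        and "alertname" in alert.get("labels", {})
    }


def missing_alerts(alerts_payload: dict, expected: Iterable[str]) -> Set[str]:
    """Return the expected alert names that are not firing yet."""
    return set(expected) - firing_alert_names(alerts_payload)


# Illustrative payload in the shape of a Prometheus /api/v1/alerts response.
sample = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "CaracalAuditConsumerLagHigh"},
             "state": "firing"},
            {"labels": {"alertname": "CaracalGatewayRevocationSnapshotStale"},
             "state": "pending"},
        ]
    },
}

still_missing = missing_alerts(sample, [
    "CaracalAuditConsumerLagHigh",
    "CaracalGatewayRevocationSnapshotStale",
])
print(sorted(still_missing))
```

A `pending` alert is inside its `for` window and has not fired yet, so the sketch counts it as missing.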

### Postgres outage

| Field | Value |
| --- | --- |
| Inject | Fail over or block the managed Postgres endpoint. |
| Expected alerts | `CaracalPostgresPoolSaturation`, `CaracalAPIOutboxPendingOldest`, and `CaracalReadinessFlapping`. |
| Expected behavior | API, STS, Gateway, Audit, and Coordinator readiness fail or degrade; outbox delivery stalls. |
| Recover | Restore Postgres, confirm migrations applied, check connection pools, replay outboxes, confirm readiness stabilizes. |

### STS unreachable from Gateway

| Field | Value |
| --- | --- |
| Inject | Scale STS to zero or block Gateway-to-STS traffic. |
| Expected alerts | `CaracalGatewaySTSExchangeErrors` then `CaracalGatewaySTSCircuitOpen`. |
| Expected behavior | Gateway token exchanges fail closed; protected upstream traffic is rejected before provider dispatch. |
| Recover | Restore STS readiness, JWKS, HMAC config, and policy bundle freshness; confirm the circuit closes and a canary exchange succeeds. |

### Stale policy bundle

| Field | Value |
| --- | --- |
| Inject | Pause policy distribution or hold the STS policy bundle past its freshness budget. |
| Expected alerts | `CaracalSTSPolicyBundleStale`. |
| Expected behavior | STS keeps evaluating the last good bundle; new policy activations do not take effect. |
| Recover | Restore distribution, confirm the active policy set version, and verify the alert clears. |

### Bad policy activation

| Field | Value |
| --- | --- |
| Inject | Activate a Rego policy set with a compile error in a test zone. |
| Expected alerts | `CaracalSTSOPACompileErrors`. |
| Expected behavior | STS rejects the broken bundle and continues on the last good policy; the activation does not widen access. |
| Recover | Roll the policy set forward to a valid version and confirm the alert clears. |

### Provider credential refresh failure

| Field | Value |
| --- | --- |
| Inject | Revoke or expire the upstream provider credential the Gateway brokers. |
| Expected alerts | `CaracalSTSProviderRefreshErrors` then `CaracalSTSProviderCircuitOpen`. |
| Expected behavior | Provider-backed exchanges fail closed; resources without brokered credentials are unaffected. |
| Recover | Restore the provider credential, confirm refresh succeeds, and verify the circuit closes. |

### Revocation propagation lag

| Field | Value |
| --- | --- |
| Inject | Revoke a session and delay the revocation stream consumer. |
| Expected alerts | `CaracalGatewayRevocationPropagationLag`. |
| Expected behavior | Revocation eventually reaches the Gateway; the drill measures the propagation window against your budget. |
| Recover | Clear the consumer delay, confirm the revoked session is denied at the Gateway, and verify the alert clears. |
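
The propagation window is the time between the revocation and the first denial observed at the Gateway. A minimal sketch of that comparison follows, assuming you record both moments as ISO 8601 timestamps with an explicit UTC offset; the function names and the 60-second budget are hypothetical, not Caracal defaults.

```python
from datetime import datetime, timedelta


def propagation_window(revoked_at: str, first_denied_at: str) -> timedelta:
    """Time from the revocation to the first denial at the Gateway."""
    start = datetime.fromisoformat(revoked_at)
    end = datetime.fromisoformat(first_denied_at)
    if end < start:
        raise ValueError("denial was recorded before the revocation")
    return end - start


def within_budget(window: timedelta, budget: timedelta) -> bool:
    """True when revocation propagated within the allowed budget."""
    return window <= budget


# Illustrative timestamps; the budget is an assumed value.
window = propagation_window("2024-01-01T12:00:00+00:00",
                            "2024-01-01T12:00:42+00:00")
print(window.total_seconds(), within_budget(window, timedelta(seconds=60)))
```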

### Audit evidence pipeline backpressure

| Field | Value |
| --- | --- |
| Inject | Pause the audit consumer or fault the audit datastore. |
| Expected alerts | `CaracalAuditDLQNonEmpty`, `CaracalAuditDLQGrowth`, `CaracalAPIOutboxDeadMessages`. |
| Expected behavior | Evidence ingestion is delayed; the DLQ and outbox backlogs grow, but no records are lost. |
| Recover | Restore the consumer and datastore, replay the DLQ and outboxes, and verify the tamper chain is intact. |

### Audit tamper detection

| Field | Value |
| --- | --- |
| Inject | In a disposable environment only, modify a stored audit record out of band. |
| Expected alerts | `CaracalAuditTamperDetected`. |
| Expected behavior | The integrity check flags the break; treat it as a [security incident](/operations/incident-response/). |
| Recover | Restore from a trusted backup, identify the blast radius, and follow incident response. |

## After the Drill

For every drill, confirm the green baseline returns: readiness passes, the alert clears, the DLQ and replay backlogs are empty, and revocation and policy freshness are confirmed. Capture the detect and recover timings so you can set realistic alert thresholds and on-call expectations.
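
The baseline conditions above can be collected into one check that lists what is still wrong. This is a sketch under stated assumptions: `BaselineSnapshot` and its field names are hypothetical, and how you populate them (readiness probes, alert queries, queue depth metrics) depends on your environment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BaselineSnapshot:
    # All field names are hypothetical; fill them from your own probes.
    not_ready: List[str] = field(default_factory=list)      # services failing /ready
    firing_alerts: List[str] = field(default_factory=list)  # alerts still firing
    dlq_depth: int = 0            # messages left in the DLQ
    replay_backlog: int = 0       # entries left on replay volumes
    revocation_fresh: bool = True
    policy_fresh: bool = True


def baseline_failures(snapshot: BaselineSnapshot) -> List[str]:
    """Return a description of each condition that blocks a green baseline."""
    failures = []
    if snapshot.not_ready:
        failures.append("not ready: " + ", ".join(snapshot.not_ready))
    if snapshot.firing_alerts:
        failures.append("alerts firing: " + ", ".join(snapshot.firing_alerts))
    if snapshot.dlq_depth > 0:
        failures.append(f"DLQ not empty ({snapshot.dlq_depth})")
    if snapshot.replay_backlog > 0:
        failures.append(f"replay backlog not empty ({snapshot.replay_backlog})")
    if not snapshot.revocation_fresh:
        failures.append("revocation snapshot is stale")
    if not snapshot.policy_fresh:
        failures.append("policy bundle is stale")
    return failures
```

An empty list means the drill can be closed out; anything else names the condition to resolve before traffic resumes.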

## Next Step

Use [Back Up and Retain Data](/operations/backup-retention/) to make sure recovery evidence, secrets, audit records, and durable state can be restored.
