Incident Response
This page covers the operational playbooks for the most likely Caracal incidents: compromised sessions, key compromise, audit integrity alerts, and service degradation. Each playbook starts with detection, moves through immediate containment, and ends with verification.
Incident severity
| Severity | Criteria | Examples |
|---|---|---|
| P1 — Critical | Active compromise; data at risk; system unavailable | Key compromise, audit chain tamper, all services down |
| P2 — High | Security event contained but not resolved; partial service | Revocation backlog, STS down, audit DLQ growing |
| P3 — Medium | Degraded but functional; no active security event | Single service restart, export behind, PEL lag |
| P4 — Low | Informational; monitoring action required | Config drift, non-critical alert threshold crossed |
Playbook: Revoke a compromised session
Detection: Suspected credential theft, anomalous request patterns in the audit log, or user reports.
Step 1 — Identify the session:
    caracal session list --zone <zone-id> --subject <user-id> --status active
Or query the audit log:
    caracal audit tail --zone <zone-id> --limit 100 | grep "<subject>"
Step 2 — Revoke the session:
    caracal session revoke --zone <zone-id> --session-id <sid>
This marks the session as revoked in the sessions table (status → revoked) and publishes a revocation event to caracal.sessions.revoke via the transactional outbox. The event propagates to:
- STS within seconds (via the sts-revocation consumer group)
- Any service running RedisRevocationStore with RedisRevocationConsumer within one poll cycle (typically < 5 seconds)
Step 3 — Verify propagation:
Check the revocation stream has delivered the event:
    redis-cli -a "$REDIS_PASSWORD" XLEN caracal.sessions.revoke
    redis-cli -a "$REDIS_PASSWORD" XPENDING caracal.sessions.revoke sts-revocation - + 10
An empty pending entries list (PEL) for sts-revocation, that is, a PEL count of 0, means the STS has acknowledged the revocation.
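The Step 3 check can also be scripted as a poll. The following is a minimal sketch, assuming the summary form of XPENDING (stream and group only), whose first output line in redis-cli's non-interactive mode is the pending count. The function names, the REDIS_CLI override, and the retry defaults are illustrative and not part of Caracal.

```shell
# Hypothetical helpers; names and defaults are illustrative, not Caracal tooling.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print the number of pending (delivered but unacknowledged) entries for a group.
# Assumes redis-cli's raw (piped) output, where the summary reply starts with the count.
pel_count() {
  "$REDIS_CLI" -a "$REDIS_PASSWORD" XPENDING "$1" "$2" 2>/dev/null | head -n 1
}

# Poll until the group's PEL is empty; return 1 if it is still non-empty after all tries.
wait_for_ack() {
  stream="$1"; group="$2"; tries="${3:-10}"; delay="${4:-1}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$(pel_count "$stream" "$group")" = "0" ]; then
      echo "PEL empty for $group on $stream"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "PEL still non-empty for $group on $stream" >&2
  return 1
}
```

Calling wait_for_ack caracal.sessions.revoke sts-revocation right after the revoke gives a simple pass/fail signal. A non-zero exit after the retries run out suggests the STS consumer is stalled and should be investigated.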
Step 4 — Check Gateway metrics:
    curl -s http://localhost:8081/metrics | jq '.revocations_active, .denials_revoked'
revocations_active should reflect the revoked session. denials_revoked should increment if the session’s mandates are still being presented.
Important timing note: Per-call mandates have a 15-minute TTL. A mandate issued before revocation and still within its TTL will be rejected by the Gateway (which checks the revocation store on every request) but will be accepted by any service that does not run a RedisRevocationConsumer. Services behind the Gateway are protected immediately. Services accessed directly (bypassing the Gateway) accept revoked mandates until their TTL expires unless they subscribe to the revocation stream.
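The worst case in the timing note is a mandate issued just before the revocation, which stays valid for the full TTL on services that bypass the Gateway and do not consume the revocation stream. A small sketch of that arithmetic, working in epoch seconds, follows. The function names and the MANDATE_TTL_SECONDS variable are hypothetical; the 900-second default is the 15-minute per-call TTL stated above.

```shell
# Hypothetical helpers; the TTL default mirrors the 15-minute per-call mandate TTL.
MANDATE_TTL_SECONDS="${MANDATE_TTL_SECONDS:-900}"

# Print the epoch second at which the worst-case exposure window closes.
exposure_ends_at() {
  revoked_at="$1"
  echo $((revoked_at + MANDATE_TTL_SECONDS))
}

# Succeed (exit 0) while "now" is still inside the exposure window.
# The second argument defaults to the current time.
still_exposed() {
  revoked_at="$1"; now="${2:-$(date +%s)}"
  [ "$now" -lt "$(exposure_ends_at "$revoked_at")" ]
}
```

For a revocation recorded at epoch 1700000000, exposure_ends_at prints 1700000900. Any direct-access request seen in logs before that instant deserves review.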
Playbook: Suspected ZONE_KEK compromise
Detection: Unauthorized access to the environment where ZONE_KEK is stored, or audit log entries showing access to the secrets table.
Severity: P1. All zone signing keys are at risk if the KEK is compromised.
Step 1 — Generate a new KEK:
    openssl rand -hex 32
Step 2 — Re-encrypt all zone signing keys:
With the current (old) KEK still in the environment, run the re-encryption operation. This reads each row from the secrets table, decrypts it with the old KEK, and re-encrypts it with the new KEK in a single transaction per row. The exact tooling depends on your deployment automation.
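Before re-encrypting, it may be worth checking that the new KEK is well formed, since openssl rand -hex 32 produces exactly 64 hexadecimal characters. The sketch below is a hypothetical pre-flight check, not Caracal tooling, and assumes ZONE_KEK is stored as that hex string.

```shell
# Hypothetical pre-flight check; not part of Caracal.
# Succeeds only if the value is exactly 64 hexadecimal characters (32 bytes).
is_hex_kek() {
  case "$1" in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 64 ]
}

# Refuse to proceed if the new KEK is malformed or identical to the old one.
check_new_kek() {
  old="$1"; new="$2"
  is_hex_kek "$new" || { echo "new KEK is not 64 hex characters" >&2; return 1; }
  [ "$old" != "$new" ] || { echo "new KEK equals old KEK" >&2; return 1; }
}
```

Running check_new_kek with the old and new values before touching the secrets table catches truncated or accidentally reused keys early.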
Step 3 — Rolling restart with the new KEK:
Update ZONE_KEK in your secrets manager, then restart STS and API replicas in sequence. Monitor /ready on each replica before proceeding to the next.
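Waiting on /ready between restarts can be scripted. The following is a minimal sketch, assuming a ready replica answers /ready with HTTP 200 (the page only states that an unready STS returns 503). The wait_ready name, the CURL override, and the retry defaults are hypothetical, and the restart command itself depends on your deployment.

```shell
# Hypothetical helper; the restart command itself depends on your deployment.
CURL="${CURL:-curl}"

# Poll a replica's /ready endpoint until it answers HTTP 200.
# Returns 1 if the replica is still not ready after all attempts.
wait_ready() {
  base_url="$1"; tries="${2:-30}"; delay="${3:-2}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # "|| true" keeps a refused connection from aborting scripts that use "set -e".
    code="$("$CURL" -s -o /dev/null -w '%{http_code}' "$base_url/ready" || true)"
    [ "$code" = "200" ] && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$base_url not ready after $tries attempts" >&2
  return 1
}
```

Call wait_ready after restarting each replica and halt the rollout if it returns non-zero, so that a bad KEK never takes down more than one replica at a time.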
Step 4 — Rotate all zone signing keys:
After the KEK is updated, rotate every zone’s signing key to ensure no long-lived key material that was protected by the old KEK remains in use:
    for ZONE_ID in $(caracal zone list --json | jq -r '.[].id'); do
      caracal zone rotate-key --zone "$ZONE_ID"
    done
Step 5 — Revoke all active sessions:
Mandates signed with keys that may have been exposed should be considered untrusted. Revoke all active sessions in all zones and force re-authentication:
    # Enumerate active sessions per zone and revoke each
    # (assumes the list output is one session ID per line)
    caracal session list --zone <zone-id> --status active | xargs -n1 caracal session revoke --zone <zone-id> --session-id
Step 6 — Audit the incident:
Query the audit log for all token exchanges and session creations during the suspected compromise window:
    caracal audit tail --zone <zone-id> --since <ISO-8601-start> --until <ISO-8601-end> --event-type token_issued
Playbook: Audit chain tamper alert
Detection: tamper_mismatch_total > 0 or tamper_chain_breaks > 0 in Audit /metrics, or an entry in audit_ingest_alerts.
Severity: P1 if chain_breaks > 0; P2 if mismatch only.
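For alert routing, the severity rule can be expressed as a small function. This is a sketch only: tamper_severity is a hypothetical name, and its two arguments are assumed to be the current values of the chain-break and mismatch counters read from Audit /metrics.

```shell
# Hypothetical helper mapping the two tamper counters to the severity rule above.
# Usage: tamper_severity <chain_breaks> <mismatches>
tamper_severity() {
  chain_breaks="$1"; mismatches="$2"
  if [ "$chain_breaks" -gt 0 ]; then
    echo "P1"      # any chain break is critical, regardless of mismatches
  elif [ "$mismatches" -gt 0 ]; then
    echo "P2"      # mismatch only
  else
    echo "none"    # no tamper signal
  fi
}
```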
Step 1 — Identify affected events:
    SELECT kind, detail, zone_id, observed_at
    FROM audit_ingest_alerts
    WHERE observed_at > now() - interval '24 hours'
    ORDER BY observed_at DESC;
Step 2 — Determine scope:
A chain_break means one or more events in a zone’s sequence were deleted or reordered. A mismatch means an event’s payload was modified in place.
    SELECT chain_seq, id, zone_id, occurred_at, content_sha256
    FROM audit_events
    WHERE zone_id = '<affected-zone>' AND occurred_at > '<earliest-affected-time>'
    ORDER BY chain_seq;
Compare content_sha256 values against expected values from your S3 Parquet export if available.
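Both scope checks can be sketched with awk once the query result and the export are available as text. The helpers below are hypothetical and rest on two assumptions that are not confirmed by this page: that chain_seq increases by exactly 1 within a zone, and that both the database rows and the export have been dumped to CSV lines of the form chain_seq,content_sha256.

```shell
# Hypothetical helpers; assume chain_seq increases by exactly 1 within a zone
# and that both inputs are CSV lines of the form "chain_seq,content_sha256".

# Read chain_seq values (one per line, ascending) on stdin; print each missing range.
find_seq_gaps() {
  awk 'NR > 1 && $1 != prev + 1 { printf "%d-%d\n", prev + 1, $1 - 1 }
       { prev = $1 }'
}

# Compare database rows against the export and print every discrepancy.
# Assumes the export file is non-empty.
compare_hashes() {
  export_csv="$1"; db_csv="$2"
  awk -F, 'NR == FNR { want[$1] = $2; next }            # first file: expected hashes
           { seen[$1] = 1 }
           !($1 in want) { print $1 " not-in-export"; next }
           want[$1] != $2 { print $1 " mismatch" }
           END { for (k in want) if (!(k in seen)) print k " missing-from-db" }' \
      "$export_csv" "$db_csv"
}
```

A missing range from find_seq_gaps or a missing-from-db line points to deletion (a chain break), while a mismatch line points to in-place modification.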
Step 3 — Preserve evidence:
Export the affected partition to a separate read-only location before any further database operations:
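    COPY (SELECT * FROM audit_events WHERE zone_id = '<zone-id>' AND occurred_at >= '<start>' AND occurred_at < '<end>')
    TO '/tmp/audit_evidence.csv' WITH CSV HEADER;
Note that COPY with a file path writes on the database server’s filesystem and needs the corresponding server-side file privilege; psql’s \copy writes to the client machine instead.
Step 4 — Escalate: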
Audit chain tampering is a security event requiring forensic investigation. Do not modify or delete any data until an investigation is complete. Notify your security team and preserve all access logs to the Postgres instance.
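To support the read-only requirement from Step 3, the exported evidence file can be checksummed and write-protected as soon as it is created. The sketch below is a hypothetical helper, not Caracal tooling. It uses sha256sum where available and falls back to shasum -a 256.

```shell
# Hypothetical helper: checksum the evidence file and make both files read-only.
seal_evidence() {
  file="$1"
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$file" > "$file.sha256"
  else
    shasum -a 256 "$file" > "$file.sha256"   # fallback where sha256sum is unavailable
  fi
  chmod 0444 "$file" "$file.sha256"
}
```

Recording the checksum alongside the incident ticket lets investigators later confirm that the evidence file has not changed since it was exported.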
Playbook: STS service down
Detection: GET http://localhost:8080/ready returns 503, or all token exchanges are failing.
Impact: No new mandates can be issued. Existing mandates with remaining TTL continue to work for their lifetime (up to 15 minutes for per-call mandates, up to 1 hour for ambient mandates).
Step 1 — Check STS logs:
    docker compose logs sts --tail 100
Step 2 — Check dependencies:
    # Postgres
    docker compose exec postgres pg_isready -U $POSTGRES_USER -d $POSTGRES_DB
    # Redis
    redis-cli -a "$REDIS_PASSWORD" PING
Step 3 — Restart:
    docker compose restart sts
Step 4 — Verify recovery:
    curl http://localhost:8080/ready
Step 5 — Check for replay queue:
If the STS was down for longer than the audit replay buffer TTL, some audit events may have been dropped. Check:
    curl -s http://localhost:8080/metrics | jq '.sts.audit_dropped, .sts.audit_replay_pending'
A non-zero audit_dropped value means audit events were lost during the outage. Document the outage window for the audit record.
Playbook: Audit DLQ growing
Detection: dlq_total increasing in Audit /metrics, or XLEN caracal.audit.events.dlq > 0.
Impact: Audit events are not being persisted. Regulatory compliance is at risk if this persists.
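To tell a DLQ that is still growing from a static backlog left by an earlier incident, the stream length can be sampled twice. This is a minimal sketch; dlq_len, dlq_growth, the REDIS_CLI override, and the 30-second default interval are hypothetical choices, not Caracal tooling.

```shell
# Hypothetical helpers; names and the default interval are illustrative.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

# Print the current number of entries in a stream.
dlq_len() {
  "$REDIS_CLI" -a "$REDIS_PASSWORD" XLEN "$1" 2>/dev/null
}

# Print how many entries the stream gained over the sampling interval (seconds).
dlq_growth() {
  stream="$1"; interval="${2:-30}"
  before="$(dlq_len "$stream")"
  sleep "$interval"
  after="$(dlq_len "$stream")"
  echo $((after - before))
}
```

A positive result from dlq_growth caracal.audit.events.dlq means events are still failing right now; zero with a non-empty DLQ suggests the fault has stopped and only replay remains.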
Step 1 — Check Audit service health:
    curl http://localhost:9090/health
    curl http://localhost:9090/ready
    docker compose logs audit --tail 100
Step 2 — Check Postgres write capacity:
    docker compose exec postgres psql -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT count(*) FROM audit_events WHERE ingested_at > now() - interval '5 minutes';"
Step 3 — Inspect DLQ messages:
    redis-cli -a "$REDIS_PASSWORD" XRANGE caracal.audit.events.dlq - + COUNT 5
Review the payload to understand why delivery failed (malformed event, schema mismatch, Postgres write error).
Step 4 — Replay DLQ messages:
If the root cause is resolved, replay DLQ messages back to the main stream manually:
    # Read from DLQ and re-add to main stream
    redis-cli -a $REDIS_PASSWORD XRANGE caracal.audit.events.dlq - + | \
      # parse and re-add to caracal.audit.events
DLQ replay requires custom tooling to re-add messages to the main stream with valid fields.
Escalation contacts
Define escalation paths appropriate to your organization. At minimum:
| Tier | Scope | Contact |
|---|---|---|
| On-call | All P1/P2 | Primary on-call rotation |
| Security | P1 tamper, key compromise | Security team |
| Compliance | P1/P2 audit events lost | Compliance/legal |
Document all P1 incidents in a post-mortem within 48 hours of resolution.