
Incident Response

This page covers the operational playbooks for the most likely Caracal incidents: compromised sessions, key compromise, audit integrity alerts, and service degradation. Each playbook starts with detection, moves through immediate containment, and ends with verification.

| Severity | Criteria | Examples |
| --- | --- | --- |
| P1 — Critical | Active compromise; data at risk; system unavailable | Key compromise, audit chain tamper, all services down |
| P2 — High | Security event contained but not resolved; partial service | Revocation backlog, STS down, audit DLQ growing |
| P3 — Medium | Degraded but functional; no active security event | Single service restart, export behind, PEL lag |
| P4 — Low | Informational; monitoring action required | Config drift, non-critical alert threshold crossed |

Compromised session

Detection: Suspected credential theft, anomalous request patterns in the audit log, or user reports.

Step 1 — Identify the session:

caracal session list --zone <zone-id> --subject <user-id> --status active

Or query the audit log:

caracal audit tail --zone <zone-id> --limit 100 | grep "<subject>"

Step 2 — Revoke the session:

caracal session revoke --zone <zone-id> --session-id <sid>

This updates the sessions table (status → revoked), publishes a revocation event to caracal.sessions.revoke via the transactional outbox, and the event propagates to:

  • STS within seconds (via sts-revocation consumer group)
  • Any service running RedisRevocationStore with RedisRevocationConsumer within one poll cycle (typically < 5 seconds)
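The consumer side of this propagation can be sketched as a fold over stream entries. A minimal sketch in Python — the `session_id` field name and the `(entry_id, fields)` entry shape are assumptions about the stream payload, not the actual RedisRevocationConsumer implementation:

```python
def apply_revocations(entries, revoked):
    """One poll cycle of a revocation consumer: fold stream entries
    into a local revoked-session set.

    `entries` are (entry_id, fields) pairs as a consumer-group read
    might return them; the 'session_id' field name is an assumption.
    Returns the last processed entry ID, so the caller can acknowledge
    (XACK) up to that point and drain its PEL.
    """
    last_id = None
    for entry_id, fields in entries:
        revoked.add(fields["session_id"])
        last_id = entry_id
    return last_id
```

After the fold, the consumer acknowledges up to `last_id`; a drained PEL is exactly what Step 3 below verifies with XPENDING.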

Step 3 — Verify propagation:

Check the revocation stream has delivered the event:

redis-cli -a $REDIS_PASSWORD XLEN caracal.sessions.revoke
redis-cli -a $REDIS_PASSWORD XPENDING caracal.sessions.revoke sts-revocation - + 10

A PEL count of 0 for sts-revocation means the STS has acknowledged the revocation.

Step 4 — Check Gateway metrics:

curl http://localhost:8081/metrics | jq '.revocations_active, .denials_revoked'

revocations_active should reflect the revoked session. denials_revoked should increment if the session’s mandates are still being presented.

Important timing note: Per-call mandates have a 15-minute TTL. A mandate issued before revocation and still within its TTL will be rejected by the Gateway (which checks the revocation store on every request) but will be accepted by any service that does not run a RedisRevocationConsumer. Services behind the Gateway are protected immediately. Services accessed directly (bypassing the Gateway) accept revoked mandates until their TTL expires unless they subscribe to the revocation stream.
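The acceptance rule above can be expressed as a small decision function. This is a minimal sketch of the timing semantics for per-call mandates only, not the actual Gateway logic:

```python
from datetime import datetime, timedelta

# Per-call mandate TTL from the timing note above.
PER_CALL_TTL = timedelta(minutes=15)

def accepts_mandate(now, issued_at, revoked, service_checks_store):
    """Whether a service accepts a per-call mandate.

    The Gateway (and any service running a revocation consumer) checks
    the revocation store on every request. A service accessed directly,
    without such a check, honors a revoked mandate until its TTL expires.
    """
    if now - issued_at > PER_CALL_TTL:
        return False  # expired everywhere, regardless of revocation
    if service_checks_store and revoked:
        return False  # rejected by the Gateway and subscribed services
    return True       # accepted: not revoked, or nobody checked
```

The dangerous window is the third branch: revoked but unchecked, which lasts at most the remaining TTL.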


Key compromise

Detection: Unauthorized access to the environment where ZONE_KEK is stored, or audit-log evidence of access to the secrets table.

Severity: P1. All zone signing keys are at risk if the KEK is compromised.

Step 1 — Generate a new KEK:

openssl rand -hex 32

Step 2 — Re-encrypt all zone signing keys:

With the current (old) KEK still in environment, run the re-encryption operation. This reads each row from the secrets table, decrypts with the old KEK, and re-encrypts with the new KEK in a single transaction per row. The exact tooling depends on your deployment automation.
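The per-row flow is decrypt-with-old, encrypt-with-new. A minimal sketch: the toy SHA-256 XOR keystream below is a stand-in for whatever AEAD cipher your deployment actually uses (e.g. AES-256-GCM); only the re-encryption structure is the point, not the cryptography:

```python
import hashlib
import secrets

def _keystream(kek: bytes, nonce: bytes, n: int) -> bytes:
    # Toy keystream derived with SHA-256 in counter mode.
    # Stand-in only; a real deployment would use an AEAD cipher.
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(kek + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(kek: bytes, plaintext: bytes) -> bytes:
    nonce = secrets.token_bytes(12)
    ks = _keystream(kek, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, ks))

def decrypt(kek: bytes, blob: bytes) -> bytes:
    nonce, ct = blob[:12], blob[12:]
    ks = _keystream(kek, nonce, len(ct))
    return bytes(a ^ b for a, b in zip(ct, ks))

def reencrypt_row(blob: bytes, old_kek: bytes, new_kek: bytes) -> bytes:
    # One secrets-table row: decrypt with the old KEK, re-encrypt with
    # the new one. In production each row's UPDATE runs in its own
    # transaction, so a crash mid-run leaves every row readable by
    # exactly one of the two KEKs.
    return encrypt(new_kek, decrypt(old_kek, blob))
```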

Step 3 — Rolling restart with the new KEK:

Update ZONE_KEK in your secrets manager, then restart STS and API replicas in sequence. Monitor /ready on each replica before proceeding to the next.

Step 4 — Rotate all zone signing keys:

After the KEK is updated, rotate every zone’s signing key to ensure no long-lived key material signed with the old KEK remains:

for ZONE_ID in $(caracal zone list --json | jq -r '.[].id'); do
  caracal zone rotate-key --zone "$ZONE_ID"
done

Step 5 — Revoke all active sessions:

Mandates signed with keys that may have been exposed should be considered untrusted. Revoke all active sessions in all zones and force re-authentication:

# Enumerate active session IDs per zone and revoke each
caracal session list --zone <zone-id> --status active --json | jq -r '.[].id' \
  | xargs -n1 caracal session revoke --zone <zone-id> --session-id

Step 6 — Audit the incident:

Query the audit log for all token exchanges and session creations during the suspected compromise window:

caracal audit tail --zone <zone-id> --since <ISO-8601-start> --until <ISO-8601-end> --event-type token_issued

Audit integrity alert

Detection: tamper_mismatch_total > 0 or tamper_chain_breaks > 0 in the Audit service's /metrics, or an entry in audit_ingest_alerts.

Severity: P1 if chain_breaks > 0; P2 if mismatch only.

Step 1 — Identify affected events:

SELECT kind, detail, zone_id, observed_at
FROM audit_ingest_alerts
WHERE observed_at > now() - interval '24 hours'
ORDER BY observed_at DESC;

Step 2 — Determine scope:

A chain_break means one or more events in a zone’s sequence were deleted or reordered. A mismatch means an event’s payload was modified in place.

SELECT chain_seq, id, zone_id, occurred_at, content_sha256
FROM audit_events
WHERE zone_id = '<affected-zone>'
AND occurred_at > '<earliest-affected-time>'
ORDER BY chain_seq;

Compare content_sha256 values against expected values from your S3 Parquet export if available.
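Scoping boils down to two checks over the fetched range: payload hash versus stored content_sha256 (a mismatch) and consecutive chain_seq values (a chain break). A minimal sketch — the real Audit service may link events cryptographically rather than by sequence number alone:

```python
import hashlib

def check_chain(events):
    """Classify audit-chain problems over an ordered range of events.

    Each event is a dict with 'chain_seq', 'payload', and
    'content_sha256' (field shapes are illustrative). Returns
    ('mismatch', seq) when a payload was edited in place and
    ('chain_break', seq) when events were deleted or reordered.
    """
    alerts = []
    prev_seq = None
    for ev in events:
        digest = hashlib.sha256(ev["payload"].encode()).hexdigest()
        if digest != ev["content_sha256"]:
            alerts.append(("mismatch", ev["chain_seq"]))
        if prev_seq is not None and ev["chain_seq"] != prev_seq + 1:
            alerts.append(("chain_break", ev["chain_seq"]))
        prev_seq = ev["chain_seq"]
    return alerts
```

Run the same comparison against the S3 Parquet export where available, since an attacker with database access could rewrite both payload and stored hash together.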

Step 3 — Preserve evidence:

Export the affected partition to a separate read-only location before any further database operations:

COPY (SELECT * FROM audit_events WHERE zone_id = '<zone-id>' AND occurred_at >= '<start>' AND occurred_at < '<end>')
TO '/tmp/audit_evidence.csv' WITH CSV HEADER;

Step 4 — Escalate:

Audit chain tampering is a security event requiring forensic investigation. Do not modify or delete any data until an investigation is complete. Notify your security team and preserve all access logs to the Postgres instance.


STS down

Detection: GET http://localhost:8080/ready returns 503, or all token exchanges are failing.

Impact: No new mandates can be issued. Existing mandates with remaining TTL continue to work for their lifetime (up to 15 minutes for per-call mandates, up to 1 hour for ambient mandates).

Step 1 — Check STS logs:

docker compose logs sts --tail 100

Step 2 — Check dependencies:

# Postgres
docker compose exec postgres pg_isready -U $POSTGRES_USER -d $POSTGRES_DB
# Redis
redis-cli -a $REDIS_PASSWORD PING

Step 3 — Restart:

docker compose restart sts

Step 4 — Verify recovery:

curl http://localhost:8080/ready

Step 5 — Check the audit replay buffer:

If the STS was down for longer than the audit replay buffer TTL, some audit events may have been dropped. Check:

curl http://localhost:8080/metrics | jq '.sts.audit_dropped, .sts.audit_replay_pending'

A non-zero audit_dropped value means audit events were lost during the outage. Document the outage window for audit record purposes.


Audit DLQ growing

Detection: dlq_total increasing in the Audit service's /metrics, or XLEN caracal.audit.events.dlq > 0.

Impact: Audit events are not being persisted. Regulatory compliance is at risk if this persists.

Step 1 — Check Audit service health:

curl http://localhost:9090/health
curl http://localhost:9090/ready
docker compose logs audit --tail 100

Step 2 — Check Postgres write capacity:

docker compose exec postgres psql -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT count(*) FROM audit_events WHERE ingested_at > now() - interval '5 minutes';"

Step 3 — Inspect DLQ messages:

redis-cli -a $REDIS_PASSWORD XRANGE caracal.audit.events.dlq - + COUNT 5

Review the payload to understand why delivery failed (malformed event, schema mismatch, Postgres write error).

Step 4 — Replay DLQ messages:

If the root cause is resolved, replay DLQ messages back to the main stream manually:

# Read entries from the DLQ before replaying them to the main stream
redis-cli -a $REDIS_PASSWORD XRANGE caracal.audit.events.dlq - +

DLQ replay requires custom tooling to re-add messages to the main stream with valid fields.
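Such tooling can be sketched as a read / re-add / delete loop. The FakeRedis class below is an in-memory stand-in for a real Redis client (redis-py exposes xrange/xadd/xdel with similar shapes) so the sketch is self-contained; it deliberately omits partial-failure handling and field validation:

```python
class FakeRedis:
    """In-memory stand-in for a Redis client; streams only."""
    def __init__(self):
        self.streams = {}
        self._seq = 0

    def xadd(self, stream, fields):
        self._seq += 1
        entry_id = f"{self._seq}-0"
        self.streams.setdefault(stream, []).append((entry_id, dict(fields)))
        return entry_id

    def xrange(self, stream, start="-", end="+"):
        return list(self.streams.get(stream, []))

    def xdel(self, stream, entry_id):
        entries = self.streams.get(stream, [])
        self.streams[stream] = [e for e in entries if e[0] != entry_id]

def replay_dlq(client, dlq="caracal.audit.events.dlq",
               main="caracal.audit.events"):
    """Re-add each DLQ entry to the main stream, then delete it from
    the DLQ. Re-add before delete, so a crash mid-replay duplicates an
    event rather than losing it. Returns the number replayed."""
    replayed = 0
    for entry_id, fields in client.xrange(dlq):
        client.xadd(main, fields)  # the Audit consumer re-validates it
        client.xdel(dlq, entry_id)
        replayed += 1
    return replayed
```

Only run this after the root cause from Step 3 is fixed; otherwise the same entries cycle straight back into the DLQ.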


Escalation

Define escalation paths appropriate to your organization. At minimum:

| Tier | Scope | Contact |
| --- | --- | --- |
| On-call | All P1/P2 | Primary on-call rotation |
| Security | P1 tamper, key compromise | Security team |
| Compliance | P1/P2 audit events lost | Compliance/legal |

Document all P1 incidents in a post-mortem within 48 hours of resolution.