Incident Response

This page covers the operational playbooks for the most likely Caracal incidents: compromised sessions, key compromise, audit integrity alerts, and service degradation. Each playbook starts with detection, moves through immediate containment, and ends with verification.

Incident severity

Severity	Criteria	Examples
P1 — Critical	Active compromise; data at risk; system unavailable	Key compromise, audit chain tamper, all services down
P2 — High	Security event contained but not resolved; partial service	Revocation backlog, STS down, audit DLQ growing
P3 — Medium	Degraded but functional; no active security event	Single service restart, export behind, PEL lag
P4 — Low	Informational; monitoring action required	Config drift, non-critical alert threshold crossed

Playbook: Revoke a compromised session

Detection: Credential theft suspected, anomalous request patterns in audit log, user reports.

Step 1 — Identify the session:

caracal session list --zone <zone-id> --subject <user-id> --status active

Or query the audit log:

caracal audit tail --zone <zone-id> --limit 100 | grep "<subject>"

Step 2 — Revoke the session:

caracal session revoke --zone <zone-id> --session-id <sid>

This sets the session's status to revoked in the sessions table and publishes a revocation event to caracal.sessions.revoke via the transactional outbox. The event propagates to:

STS within seconds (via sts-revocation consumer group)
Any service running RedisRevocationStore with RedisRevocationConsumer within one poll cycle (typically < 5 seconds)

Step 3 — Verify propagation:

Check that the revocation stream has delivered the event:

redis-cli -a $REDIS_PASSWORD XLEN caracal.sessions.revoke
redis-cli -a $REDIS_PASSWORD XPENDING caracal.sessions.revoke sts-revocation - + 10

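XLEN reports only the stream length. XPENDING with a range lists up to 10 entries that have been delivered to sts-revocation but not yet acknowledged. An empty result (a PEL count of 0) means the STS has acknowledged every revocation delivered to it.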
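The check can be scripted so that it does not have to be re-run by hand. The following is a minimal sketch and not part of Caracal: pel_drained and wait_for_ack are hypothetical helper names, and the parsing assumes the summary form of XPENDING (stream and group only), whose first line of non-interactive redis-cli output is the pending count.

```shell
# Succeeds when the XPENDING summary read from stdin reports zero pending entries.
pel_drained() {
  IFS= read -r first || return 1
  first=${first#"(integer) "}   # tolerate interactive-style output
  [ "$first" = "0" ]
}

# wait_for_ack ATTEMPTS CMD... : re-runs CMD until pel_drained passes
# or the attempts are used up. POLL_INTERVAL (seconds) defaults to 1.
wait_for_ack() {
  attempts=$1; shift
  while [ "$attempts" -gt 0 ]; do
    if "$@" | pel_drained; then return 0; fi
    attempts=$((attempts - 1))
    sleep "${POLL_INTERVAL:-1}"
  done
  return 1
}

# Example:
# wait_for_ack 10 redis-cli -a "$REDIS_PASSWORD" XPENDING caracal.sessions.revoke sts-revocation
```

A non-zero exit after all attempts suggests the consumer group is stalled and the incident should not be treated as contained yet.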

Step 4 — Check Gateway metrics:

curl -s http://localhost:8081/metrics | jq '.revocations_active, .denials_revoked'

revocations_active should reflect the revoked session. denials_revoked should increment if the session’s mandates are still being presented.

Important timing note: Per-call mandates have a 15-minute TTL. A mandate issued before revocation and still within its TTL will be rejected by the Gateway (which checks the revocation store on every request) but will be accepted by any service that does not run a RedisRevocationConsumer. Services behind the Gateway are protected immediately. Services accessed directly (bypassing the Gateway) accept revoked mandates until their TTL expires unless they subscribe to the revocation stream.
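To make the exposure window concrete: for a service that does not consume the revocation stream, a mandate stays usable from the moment of revocation until its issue time plus TTL. A small sketch of that arithmetic follows; mandate_exposure is a hypothetical helper, and the 15-minute per-call TTL is used as the default.

```shell
# mandate_exposure ISSUED_EPOCH REVOKED_EPOCH [TTL_SECONDS]
# Prints how many seconds a mandate remains acceptable after revocation
# on a service that does not check the revocation store.
mandate_exposure() {
  ttl=${3:-900}                          # 15 minutes, per-call mandate
  remaining=$(( $1 + ttl - $2 ))
  if [ "$remaining" -lt 0 ]; then remaining=0; fi
  echo "$remaining"
}

# Example: issued 300 s before the session was revoked
# mandate_exposure 1000 1300
```

A mandate issued five minutes before revocation therefore leaves ten minutes of exposure on directly accessed services.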

Playbook: Suspected ZONE_KEK compromise

Detection: Unauthorized access to the environment where ZONE_KEK is stored, or audit log entries showing unexpected access to the secrets table.

Severity: P1. All zone signing keys are at risk if the KEK is compromised.

Step 1 — Generate a new KEK:

openssl rand -hex 32
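Before rolling a new key out, it can help to confirm that the value has the expected shape. This sketch assumes ZONE_KEK is consumed in the same form the command above produces (32 random bytes as 64 hexadecimal characters); is_valid_kek is a hypothetical helper, not a Caracal command.

```shell
# Succeeds only if the argument is exactly 64 hexadecimal characters.
is_valid_kek() {
  case $1 in
    *[!0-9a-fA-F]*) return 1 ;;   # contains a non-hex character
  esac
  [ "${#1}" -eq 64 ]
}

# Example:
# NEW_KEK=$(openssl rand -hex 32)
# is_valid_kek "$NEW_KEK" || echo "refusing to deploy malformed KEK" >&2
```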

Step 2 — Re-encrypt all zone signing keys:

With the current (old) KEK still in the environment, run the re-encryption operation. This reads each row from the secrets table, decrypts it with the old KEK, and re-encrypts it with the new KEK in a single transaction per row. The exact tooling depends on your deployment automation.

Step 3 — Rolling restart with the new KEK:

Update ZONE_KEK in your secrets manager, then restart STS and API replicas in sequence. Monitor /ready on each replica before proceeding to the next.
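The restart-then-wait sequence might be scripted along the following lines. This is a sketch under stated assumptions: restart_replica and replica_ready are hypothetical hooks, and their default bodies (Docker Compose service names that double as hostnames, a /ready endpoint on port 8080) should be replaced with whatever your deployment uses.

```shell
# Hypothetical hooks; override them to match your deployment.
restart_replica() { docker compose restart "$1"; }
replica_ready()   { curl -fsS "http://$1:8080/ready" >/dev/null; }

# rolling_restart REPLICA... : restarts replicas one at a time and stops
# at the first replica that does not become ready in time.
rolling_restart() {
  for replica in "$@"; do
    restart_replica "$replica" || return 1
    tries=${READY_TRIES:-30}
    until replica_ready "$replica"; do
      tries=$((tries - 1))
      if [ "$tries" -le 0 ]; then
        echo "replica $replica not ready; stopping rollout" >&2
        return 1
      fi
      sleep "${READY_INTERVAL:-2}"
    done
    echo "ready: $replica"
  done
}
```

Stopping at the first failure keeps the remaining replicas on the old configuration, so capacity is not lost while the problem is investigated.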

Step 4 — Rotate all zone signing keys:

After the KEK is updated, rotate every zone’s signing key so that no signing key that was encrypted under the old KEK, and may therefore have been exposed, remains in use:

for ZONE_ID in $(caracal zone list --json | jq -r '.[].id'); do
  caracal zone rotate-key --zone "$ZONE_ID"
done

Step 5 — Revoke all active sessions:

Mandates signed with keys that may have been exposed should be considered untrusted. Revoke all active sessions in all zones and force re-authentication:

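# Enumerate active sessions and revoke each; repeat for every zone.
# Assumes the list output is one session ID per line; adjust the parsing if your CLI prints a table.
caracal session list --zone <zone-id> --status active | xargs -n1 caracal session revoke --zone <zone-id> --session-id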
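During a P1 it is useful to know whether any revocation failed, rather than letting a failure scroll past. A minimal sketch, assuming session IDs arrive on stdin one per line; revoke_zone_sessions is a hypothetical wrapper around the caracal session revoke command shown earlier.

```shell
# revoke_zone_sessions ZONE_ID : reads session IDs from stdin, revokes
# each one, prints a summary, and fails if any revocation failed.
revoke_zone_sessions() {
  zone=$1
  failed=0
  while IFS= read -r sid; do
    [ -n "$sid" ] || continue                      # skip blank lines
    caracal session revoke --zone "$zone" --session-id "$sid" </dev/null \
      || failed=$((failed + 1))
  done
  echo "zone=$zone failed=$failed"
  [ "$failed" -eq 0 ]
}

# Example (per zone):
# caracal session list --zone "$ZONE_ID" --status active | revoke_zone_sessions "$ZONE_ID"
```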

Step 6 — Audit the incident:

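Query the audit log for all token exchanges and session creations during the suspected compromise window. The example filters on token_issued; repeat it with the event type that records session creation in your deployment: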

caracal audit tail --zone <zone-id> --since <ISO-8601-start> --until <ISO-8601-end> --event-type token_issued

Playbook: Audit chain tamper alert

Detection: tamper_mismatch_total > 0 or tamper_chain_breaks > 0 in Audit /metrics, or an entry in audit_ingest_alerts.

Severity: P1 if tamper_chain_breaks > 0; P2 if only tamper_mismatch_total > 0.
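The severity rule can be encoded directly in alert routing. The sketch below only restates that rule; tamper_severity is a hypothetical helper, and how the two counters are scraped from /metrics is left to your monitoring stack.

```shell
# tamper_severity MISMATCH_TOTAL CHAIN_BREAKS
# Prints P1, P2, or none according to the rule above.
tamper_severity() {
  mismatches=$1
  chain_breaks=$2
  if [ "$chain_breaks" -gt 0 ]; then
    echo P1
  elif [ "$mismatches" -gt 0 ]; then
    echo P2
  else
    echo none
  fi
}
```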

Step 1 — Identify affected events:

SELECT kind, detail, zone_id, observed_at
FROM   audit_ingest_alerts
WHERE  observed_at > now() - interval '24 hours'
ORDER  BY observed_at DESC;

Step 2 — Determine scope:

A chain_break means one or more events in a zone’s sequence were deleted or reordered. A mismatch means an event’s payload was modified in place.

SELECT chain_seq, id, zone_id, occurred_at, content_sha256
FROM   audit_events
WHERE  zone_id = '<affected-zone>'
  AND  occurred_at > '<earliest-affected-time>'
ORDER  BY chain_seq;

Compare content_sha256 values against expected values from your S3 Parquet export if available.
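Deleted events can show up as gaps in the sequence returned by the query above. The following sketch flags gaps and duplicates, assuming chain_seq increases by exactly 1 within a zone (an assumption, since the schema is not described here); chain_seq_gaps is a hypothetical helper that reads one chain_seq value per line in query order.

```shell
# Reads chain_seq values (one per line, ordered) and reports every place
# where a value is not exactly one more than its predecessor.
chain_seq_gaps() {
  awk 'NR > 1 && $1 != prev + 1 { printf "after %s got %s\n", prev, $1 }
       { prev = $1 }'
}

# Example:
# psql -At -c "SELECT chain_seq FROM audit_events WHERE zone_id = '<affected-zone>' ORDER BY chain_seq" | chain_seq_gaps
```

No output means the sequence is contiguous over the range that was queried.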

Step 3 — Preserve evidence:

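Export the affected rows to a separate read-only location before any further database operations. COPY ... TO writes the file on the Postgres server host and requires superuser or pg_write_server_files privileges; from psql, \copy writes to the client machine instead. Treat /tmp as a staging path only and move the file to read-only storage afterwards: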

COPY (SELECT * FROM audit_events WHERE zone_id = '<zone-id>' AND occurred_at >= '<start>' AND occurred_at < '<end>')
TO '/tmp/audit_evidence.csv' WITH CSV HEADER;
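Once the export exists, recording a checksum and removing write permission makes later modification easier to detect. A minimal sketch using coreutils sha256sum; seal_evidence is a hypothetical helper and the file path is whatever the export step produced.

```shell
# seal_evidence FILE : writes FILE.sha256 next to the export and marks
# both files read-only.
seal_evidence() {
  file=$1
  if [ ! -f "$file" ]; then
    echo "missing: $file" >&2
    return 1
  fi
  sha256sum "$file" > "$file.sha256" || return 1
  chmod 0444 "$file" "$file.sha256"
}

# Later verification:
# sha256sum -c /tmp/audit_evidence.csv.sha256
```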

Step 4 — Escalate:

Audit chain tampering is a security event requiring forensic investigation. Do not modify or delete any data until an investigation is complete. Notify your security team and preserve all access logs to the Postgres instance.

Playbook: STS service down

Detection: GET http://localhost:8080/ready returns 503, or all token exchanges are failing.

Impact: No new mandates can be issued. Existing mandates with remaining TTL continue to work for their lifetime (up to 15 minutes for per-call mandates, up to 1 hour for ambient mandates).

Step 1 — Check STS logs:

docker compose logs sts --tail 100

Step 2 — Check dependencies:

# Postgres
docker compose exec postgres pg_isready -U $POSTGRES_USER -d $POSTGRES_DB

# Redis
redis-cli -a $REDIS_PASSWORD PING

Step 3 — Restart:

docker compose restart sts

Step 4 — Verify recovery:

curl http://localhost:8080/ready

Step 5 — Check the audit replay queue:

If the STS was down for longer than the audit replay buffer TTL, some audit events may have been dropped. Check:

curl -s http://localhost:8080/metrics | jq '.sts.audit_dropped, .sts.audit_replay_pending'

A non-zero audit_dropped value means audit events were lost during the outage. Document the outage window for audit record purposes.

Playbook: Audit DLQ growing

Detection: dlq_total increasing in Audit /metrics, or XLEN caracal.audit.events.dlq > 0.

Impact: Events routed to the DLQ are not persisted to audit_events. Regulatory compliance is at risk if the backlog persists.
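Whether the DLQ is still growing, as opposed to holding a fixed backlog from an earlier fault, can be checked by sampling its length twice. A small sketch; sample_growth is a hypothetical helper that works with any command printing a single integer.

```shell
# sample_growth INTERVAL CMD... : runs CMD twice, INTERVAL seconds apart,
# and prints the difference (second sample minus first).
sample_growth() {
  interval=$1; shift
  first=$("$@") || return 1
  sleep "$interval"
  second=$("$@") || return 1
  echo $((second - first))
}

# Example: growth of the DLQ over 30 seconds
# sample_growth 30 redis-cli -a "$REDIS_PASSWORD" XLEN caracal.audit.events.dlq
```

A result of 0 with a non-empty DLQ suggests the fault has stopped and only the replay remains.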

Step 1 — Check Audit service health:

curl http://localhost:9090/health
curl http://localhost:9090/ready
docker compose logs audit --tail 100

Step 2 — Check Postgres write capacity:

docker compose exec postgres psql -U $POSTGRES_USER -d $POSTGRES_DB -c "SELECT count(*) FROM audit_events WHERE ingested_at > now() - interval '5 minutes';"

Step 3 — Inspect DLQ messages:

redis-cli -a $REDIS_PASSWORD XRANGE caracal.audit.events.dlq - + COUNT 5

Review the payload to understand why delivery failed (malformed event, schema mismatch, Postgres write error).

Step 4 — Replay DLQ messages:

If the root cause is resolved, replay DLQ messages back to the main stream manually:

# Read all entries from the DLQ
redis-cli -a $REDIS_PASSWORD XRANGE caracal.audit.events.dlq - +
# Then parse each entry and re-add its fields to caracal.audit.events with XADD

DLQ replay requires custom tooling to re-add messages to the main stream with valid fields.
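One possible shape for such tooling is sketched below. It is not a Caracal utility and rests on loud assumptions: DLQ entries carry the same field/value pairs the main stream expects, no value contains a newline, and redis_cmd is a hypothetical wrapper around redis-cli. It moves one entry at a time, so the entry ID is always the first line of output and the remaining lines are alternating fields and values.

```shell
DLQ=caracal.audit.events.dlq
MAIN=caracal.audit.events

# Hypothetical wrapper; adjust host, port, and auth to your deployment.
redis_cmd() { redis-cli -a "$REDIS_PASSWORD" "$@"; }

# replay_one : moves the oldest DLQ entry back to the main stream.
# Exit status: 0 replayed, 1 DLQ empty, 2 error (entry left in place).
replay_one() {
  entry=$(redis_cmd XRANGE "$DLQ" - + COUNT 1) || return 2
  [ -n "$entry" ] || return 1
  id=$(printf '%s\n' "$entry" | head -n 1)
  set --                                   # collect field/value pairs
  while IFS= read -r line; do
    set -- "$@" "$line"
  done <<EOF
$(printf '%s\n' "$entry" | tail -n +2)
EOF
  if [ $# -eq 0 ] || [ $(( $# % 2 )) -ne 0 ]; then
    echo "entry $id: cannot parse field/value pairs; not replayed" >&2
    return 2
  fi
  redis_cmd XADD "$MAIN" '*' "$@" >/dev/null || return 2
  redis_cmd XDEL "$DLQ" "$id" >/dev/null   # remove only after re-adding
}

# Example: drain the DLQ, stopping at the first error
# while replay_one; do :; done
```

The entry is deleted from the DLQ only after XADD succeeds, so a failure part-way through can duplicate an event but should not lose one. Replayed entries receive new stream IDs.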

Escalation contacts

Define escalation paths appropriate to your organization. At minimum:

Tier	Scope	Contact
On-call	All P1/P2	Primary on-call rotation
Security	P1 tamper, key compromise	Security team
Compliance	P1/P2 audit events lost	Compliance/legal

Document all P1 incidents in a post-mortem within 48 hours of resolution.