Respond to Incidents

Treat incidents involving audit tamper evidence, stale revocation state, STS/Gateway fail-open risk, secret exposure, or policy corruption as security incidents.

| Trigger | Severity |
| --- | --- |
| CaracalAuditTamperDetected | Critical security incident. |
| Gateway revocation snapshot stale or propagation lag | Critical access-safety incident. |
| Gateway STS circuit open during protected traffic | Critical availability/access incident. |
| Audit DLQ growth or replay backlog aging | Critical evidence pipeline incident when sustained. |
| Secret exposure | Critical until rotated and blast radius is understood. |
| Policy activation causes unexpected broad allow | Critical authorization incident. |

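The trigger table can be encoded as a lookup so that alert routing and paging agree with it. The following is a minimal sketch, not part of the platform: only `CaracalAuditTamperDetected` is an alert name taken from the table, and every other key, the `Classification` type, and the `classify` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Classification:
    severity: str
    incident_type: str


# Only "CaracalAuditTamperDetected" is an alert name from the trigger table.
# Every other key is a hypothetical label invented for this sketch.
TRIGGERS = {
    "CaracalAuditTamperDetected": Classification("critical", "security"),
    "revocation_snapshot_stale": Classification("critical", "access-safety"),
    "sts_circuit_open": Classification("critical", "availability/access"),
    "audit_dlq_growth": Classification("critical", "evidence pipeline"),
    "secret_exposure": Classification("critical", "secret exposure"),
    "policy_broad_allow": Classification("critical", "authorization"),
}

# Triggers the table rates critical only when the condition is sustained.
SUSTAINED_ONLY = {"audit_dlq_growth"}


def classify(trigger: str, sustained: bool = True) -> Optional[Classification]:
    """Return the table's classification, or None when the table does not decide."""
    if trigger in SUSTAINED_ONLY and not sustained:
        return None
    return TRIGGERS.get(trigger)
```

A `None` result means the table gives no answer (an unknown trigger, or a DLQ spike that has not been sustained), so the case should go to a human for triage rather than be silently downgraded.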
flowchart TD
  Detect[Detect alert or report] --> Contain[Contain risky traffic and freeze rollouts]
  Contain --> Preserve[Preserve logs, audit, DB, Redis, and config evidence]
  Preserve --> Diagnose[Diagnose root cause]
  Diagnose --> Recover[Recover service and safety invariants]
  Recover --> Validate[Validate readiness, audit, revocation, and policy]
  Validate --> Review[Post-incident review]
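The flow above is strictly sequential, which matters most for the Contain, Preserve, and Diagnose steps: evidence should be captured before anyone starts changing state. As a rough sketch, an incident record could refuse to skip phases. `Phase` and `IncidentRecord` are hypothetical names, not platform types.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Order mirrors the flowchart: each phase leads only to the next one.
    DETECT = 1
    CONTAIN = 2
    PRESERVE = 3
    DIAGNOSE = 4
    RECOVER = 5
    VALIDATE = 6
    REVIEW = 7


class IncidentRecord:
    """Tracks which phase an incident is in and rejects skipped phases."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.phase = Phase.DETECT
        self.history = [Phase.DETECT]

    def advance(self, next_phase: Phase) -> None:
        next_phase = Phase(next_phase)
        if next_phase != self.phase + 1:
            raise ValueError(
                f"cannot move from {self.phase.name} to {next_phase.name}"
            )
        self.phase = next_phase
        self.history.append(next_phase)
```

For example, calling `advance(Phase.RECOVER)` on an incident that is still in `CONTAIN` raises `ValueError`, which keeps recovery work from starting before evidence is preserved.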

| Symptom | Severity | Contain | Preserve | Validate recovery |
| --- | --- | --- | --- | --- |
| Audit tamper alert | Critical | Freeze changes and stop destructive cleanup. | Audit tables, exports, DLQ, service logs. | Tamper sweep result, audit ingestion, and timeline are recorded. |
| Revocation stale | Critical | Fail closed at Gateway or resource servers. | Redis streams, revocation snapshots, session IDs. | Revoked sessions are denied and readiness is stable. |
| Bad policy allow | Critical | Activate last known-good policy set. | Policy set version, request IDs, audit/explain output. | Canary allow/deny decisions match expected policy. |
| Secret exposure | Critical | Rotate exposed material and invalidate affected sessions. | Secret location, key IDs, affected services, audit records. | Old material no longer authenticates. |


Capture the following evidence before manual remediation:

  • Alert name, firing time, and labels.
  • Request IDs, zone IDs, resource IDs, policy-set versions, agent session IDs, delegation edge IDs.
  • Console audit/explain output.
  • Service logs for Web, API, STS, Gateway, Audit, Coordinator, and Control.
  • Redis stream status, pending entries, and DLQ contents.
  • Postgres backup or snapshot before manual remediation.
  • Helm values or Compose config diff.
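One way to keep evidence capture from being skipped under pressure is to gate manual remediation on a checklist. The sketch below assumes a simple set of collected item names; the keys in `REQUIRED_EVIDENCE` and both function names are hypothetical shorthand for the bullets above, not identifiers from the platform.

```python
# Hypothetical short keys, one per evidence bullet in the list above.
REQUIRED_EVIDENCE = (
    "alert_details",          # alert name, firing time, labels
    "identifiers",            # request, zone, resource, policy-set, session, delegation edge IDs
    "console_audit_explain",  # Console audit/explain output
    "service_logs",           # Web, API, STS, Gateway, Audit, Coordinator, Control
    "redis_streams",          # stream status, pending entries, DLQ contents
    "postgres_snapshot",      # backup or snapshot taken before manual remediation
    "config_diff",            # Helm values or Compose config diff
)


def missing_evidence(collected):
    """Return required items not yet collected, in checklist order."""
    have = set(collected)
    return [item for item in REQUIRED_EVIDENCE if item not in have]


def may_remediate(collected) -> bool:
    """Allow manual remediation only once every item has been captured."""
    return not missing_evidence(collected)
```

Returning the missing items in checklist order gives the responder a concrete to-do list instead of a bare yes/no.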

| Incident | Containment |
| --- | --- |
| Audit tamper | Freeze changes, preserve database and audit exports, rotate only after evidence capture. |
| Revocation stale | Fail closed at resource servers or Gateway, restore Redis/Postgres consumers, verify snapshot freshness. |
| Bad policy allow | Activate last known-good policy set, verify canary denies/allows, inspect audit impact. |
| Secret exposure | Rotate exposed material, roll dependent services, invalidate sessions/grants as needed. |


Before closing the incident, confirm that:

  • Relevant /ready endpoints pass.
  • Audit DLQ/replay/outbox backlogs are understood or drained.
  • Revocation and policy freshness metrics are healthy.
  • Canary token exchange and Gateway requests match expected decisions.
  • Evidence and timeline are preserved for review.
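Two of the checks above, readiness and canary decisions, lend themselves to a small script. The following is a sketch under stated assumptions: it assumes each service answers HTTP 200 on its `/ready` path when healthy, and that canary results have already been gathered into a dictionary of `"allow"`/`"deny"` strings. The function names, the service URLs a caller passes in, and the canary labels are all hypothetical.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, List


def http_ready(url: str, timeout: float = 3.0) -> bool:
    """Return True when a GET to the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection errors, timeouts, and non-2xx responses count as not ready.
        return False


def recovery_blockers(
    ready_urls: Dict[str, str],
    expected_decisions: Dict[str, str],
    observed_decisions: Dict[str, str],
    probe: Callable[[str], bool] = http_ready,
) -> List[str]:
    """List everything that still blocks closing the incident."""
    blockers = []
    for name in sorted(ready_urls):
        if not probe(ready_urls[name]):
            blockers.append(f"{name}: /ready check failed")
    for canary in sorted(expected_decisions):
        want = expected_decisions[canary]
        got = observed_decisions.get(canary)
        if got != want:
            blockers.append(f"{canary}: expected {want}, observed {got}")
    return blockers
```

The `probe` parameter is injectable so the logic can be exercised without network access. An empty result covers only these two checks; backlog, freshness metrics, and evidence preservation still need to be confirmed separately.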

Use Plan a Platform Rollout when the environment is stable and ready for controlled change.