Backups must preserve both access state and evidence state. A restore that loses audit, revocation, key, or delegation data can produce unsafe or unauditable behavior.

| Asset | Why it matters |
|---|---|
| Postgres database | Product state, policies, grants, sessions, audit events, agents, delegations, outboxes, gateway bindings. |
| Runtime secrets | Database/Redis credentials, admin token, Coordinator token, zone KEK, HMAC keys, service exchange keys. |
| STS/Gateway replay volumes | Audit replay files during Redis/Audit outages. |
| Redis snapshot or managed backup | Optional operational recovery for stream pending entries; Postgres remains authoritative. |
| Audit exports | Long-term evidence and SIEM/compliance integration. |
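
One way to keep these assets together is to record every backup run in a manifest and refuse to call the run complete while a required asset is missing. The following is a minimal sketch, not the product's tooling: the asset identifiers, `REQUIRED_ASSETS`, `BackupEntry`, and `missing_assets` are hypothetical names chosen to mirror the table above.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical identifiers mirroring the asset table; not product-defined names.
REQUIRED_ASSETS = {"postgres", "runtime_secrets", "replay_volumes", "audit_exports"}
OPTIONAL_ASSETS = {"redis_snapshot"}  # optional because Postgres remains authoritative


@dataclass(frozen=True)
class BackupEntry:
    asset: str     # one of the identifiers above
    location: str  # where the encrypted artifact was written
    sha256: str    # digest of the artifact, re-checked at restore time


def digest(data: bytes) -> str:
    """Return the hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def missing_assets(entries: list[BackupEntry]) -> set[str]:
    """Return the required assets that have no entry in this backup run."""
    present = {entry.asset for entry in entries}
    return REQUIRED_ASSETS - present
```

A backup job could call `missing_assets` as its last step and fail the run when the result is non-empty, so that a database dump without the matching secrets is never recorded as a usable backup.
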

```mermaid
flowchart LR
    PG[(Postgres)] --> Backup[Encrypted backup]
    Secrets[Secret manager] --> Backup
    Replay[Replay volumes] --> Backup
    Audit[Audit export] --> Archive[Immutable archive]
    Backup --> RestoreTest[Scheduled restore test]
```

| Area | Controls |
|---|---|
| Audit database | AUDIT_RETENTION_DAYS, partitions, audit export watermarks. |
| Coordinator data | DELEGATION_RETENTION_DAYS, OUTBOX_RETENTION_DAYS, sweeper intervals. |
| Redis streams | Maximum stream lengths intended by the provisioner, and managed Redis retention. |
| Backups | Platform backup policy and legal/compliance requirements. |
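
To make the retention settings concrete, the sketch below turns a day-count variable such as `AUDIT_RETENTION_DAYS` into a cutoff timestamp. It assumes the variables hold a positive integer number of days; how the actual sweepers parse and apply them is not described here, and `retention_cutoff` is a hypothetical helper.

```python
import os
from datetime import datetime, timedelta, timezone


def retention_cutoff(var_name: str, now: datetime, env=os.environ) -> datetime:
    """Return the timestamp before which rows fall outside the retention window.

    Assumption: the variable is a positive integer number of days.
    """
    raw = env.get(var_name)
    if raw is None:
        raise KeyError(f"{var_name} is not set")
    days = int(raw)
    if days <= 0:
        raise ValueError(f"{var_name} must be a positive number of days")
    return now - timedelta(days=days)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for name in ("AUDIT_RETENTION_DAYS", "DELEGATION_RETENTION_DAYS", "OUTBOX_RETENTION_DAYS"):
        try:
            print(name, retention_cutoff(name, now).isoformat())
        except (KeyError, ValueError) as exc:
            print(name, "unavailable:", exc)
```
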
- Restore Postgres into an isolated environment.
- Restore required secrets into the environment secret store.
- Run migration verification.
- Start services and verify /ready.
- Confirm audit query, policy-set activation state, Gateway bindings, sessions, agents, and delegation records.
- Run a canary token exchange and protected Gateway request.
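
The readiness step in the checklist above can be automated in a scheduled restore test. This sketch polls `/ready` on each restored service and reports the ones that never became ready. It assumes `/ready` answers HTTP 200 when a service is ready, which this document does not confirm; the function names and the example base URLs are hypothetical.

```python
import time
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status of a GET request, or 0 if the service is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the service answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, and similar


def wait_until_ready(base_urls, fetch=http_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll <base_url>/ready for each service; return the base URLs still not ready."""
    pending = list(base_urls)
    for _ in range(attempts):
        # Assumption: a 200 response means the service is ready.
        pending = [u for u in pending if fetch(u.rstrip("/") + "/ready") != 200]
        if not pending:
            return []
        sleep(delay)
    return pending
```

For example, `wait_until_ready(["http://sts.restore.internal", "http://gateway.restore.internal"])` returns an empty list once both services report ready. The restore test should stop before the canary token exchange if the returned list is non-empty.
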

| Symptom | Check |
|---|---|
| Restored STS cannot decrypt keys | ZONE_KEK does not match the database secrets. |
| Audit chain verification fails | Missing audit rows, wrong AUDIT_HMAC_KEY, or partial restore. |
| Gateway cannot route | Missing gateway binding rows or stale binding revision. |
| Revocation state is incomplete | Restore Postgres revocation/session state and replay Redis revocation events where needed. |
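
To illustrate why missing rows, a wrong `AUDIT_HMAC_KEY`, and a partial restore all surface as the same chain verification failure, here is a toy HMAC chain. The real audit chain format is not specified in this document. The scheme below (each row's MAC covers the previous MAC plus the row payload), the `GENESIS` value, and all function names are assumptions made for illustration only.

```python
import hashlib
import hmac

GENESIS = b"\x00" * 32  # hypothetical starting value for the chain


def link_mac(key: bytes, prev_mac: bytes, payload: bytes) -> bytes:
    """MAC for one row: HMAC-SHA256 over the previous MAC followed by the payload."""
    return hmac.new(key, prev_mac + payload, hashlib.sha256).digest()


def build_chain(key: bytes, payloads):
    """Return a list of (payload, mac) rows linked in order."""
    rows, prev = [], GENESIS
    for payload in payloads:
        prev = link_mac(key, prev, payload)
        rows.append((payload, prev))
    return rows


def first_broken_index(key: bytes, rows):
    """Return the index of the first row whose MAC does not verify, or None."""
    prev = GENESIS
    for i, (payload, mac) in enumerate(rows):
        expected = link_mac(key, prev, payload)
        if not hmac.compare_digest(expected, mac):
            return i
        prev = mac
    return None
```

Under this toy scheme, verifying with the wrong key fails at the first row, while a dropped row fails at the row that followed the gap. The position of the first failure is therefore a useful hint about which of the causes in the table applies.
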
Use Respond to Incidents to define containment, evidence preservation, and recovery validation.