Backup and Retention
The audit ledger is the most critical durable store in Caracal. This page covers how audit data is retained in Postgres, exported to S3 in Parquet format, and verified for tamper evidence, as well as how to approach disaster recovery.
Audit data model
Audit events are stored in the `audit_events` table, which is range-partitioned on `occurred_at` with one partition per calendar month. The table schema includes chain linkage fields for tamper detection:
| Column | Purpose |
|---|---|
| `id` | Event identifier |
| `zone_id` | Zone that generated the event |
| `chain_seq` | Monotonic per-zone sequence number |
| `content_sha256` | SHA-256 hash of the event payload |
| `prev_content_sha256` | Hash of the preceding event in the chain |
| `chain_hmac` | HMAC-SHA256 linking this event to its predecessor |
| `ingest_signature` | Signature added at ingest time |
| `occurred_at` | Event timestamp (partition key) |
| `ingested_at` | Time of database insertion |
The chain fields make any deletion or modification of an existing event detectable by the tamper sweeper.
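As an illustration of how the chain fields fit together, the sketch below recomputes one link with `openssl`. The exact byte layout of the HMAC input (content hash concatenated with the predecessor's hash) is an assumption here, not Caracal's documented wire format; the payload and key are made up for the example.

```shell
#!/bin/sh
# Sketch: recompute one link of the audit chain with openssl. The HMAC
# input layout (content hash || predecessor hash) is an assumption.
AUDIT_HMAC_KEY="example-key"

payload='{"action":"zone.update","zone_id":"z1"}'
prev_content_sha256="0000000000000000000000000000000000000000000000000000000000000000"

# Hash the payload, then HMAC it together with the previous event's hash.
content_sha256=$(printf '%s' "$payload" | openssl dgst -sha256 -r | cut -d' ' -f1)
chain_hmac=$(printf '%s%s' "$content_sha256" "$prev_content_sha256" \
  | openssl dgst -sha256 -hmac "$AUDIT_HMAC_KEY" -r | cut -d' ' -f1)

echo "content_sha256=$content_sha256"
echo "chain_hmac=$chain_hmac"
```

Because each `chain_hmac` covers the previous event's hash, editing or deleting any row changes every hash downstream of it, which is what the sweeper detects.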
Retention policy
`AUDIT_RETENTION_DAYS` (default: 365) controls how long audit events are retained. The Audit service’s retention rotator runs every 6 hours (leader-elected via advisory lock `0x4341524130303032`) and:
- Pre-creates monthly Postgres partitions for the current month and 3 future months.
- Identifies partitions whose entire date range falls before `now() - AUDIT_RETENTION_DAYS`.
- Drops those partitions.
Partition drop is irreversible. Data in dropped partitions is gone permanently unless it was exported to S3 before the partition was dropped. Configure S3 export before data reaches its retention cutoff if long-term archival is required.
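The rotator's drop test can be sketched as a month comparison. The `audit_events_YYYY_MM` partition naming below is an assumption based on the monthly scheme; confirm the actual names against `pg_class` in your deployment. Requires GNU `date`.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a monthly partition lies entirely past the
# retention cutoff. Partition naming audit_events_YYYY_MM is an assumption.
AUDIT_RETENTION_DAYS=365

cutoff_month=$(date -u -d "-${AUDIT_RETENTION_DAYS} days" +%Y_%m)

is_droppable() {
  # Droppable only if the partition's whole month is strictly before the
  # cutoff month; the cutoff month itself may still hold retained rows.
  local partition_month=${1#audit_events_}
  [[ "$partition_month" < "$cutoff_month" ]]
}

is_droppable "audit_events_2020_01" && echo "audit_events_2020_01 is past retention"
```

Note that a partition straddling the cutoff is kept whole; rows only disappear once their entire month has aged out.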
Check active partitions:
```sql
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname LIKE 'audit_events_%'
ORDER BY relname;
```

Parquet export to S3
When `AUDIT_EXPORT_S3_BUCKET` is set, the Audit service exports completed hours of audit events to S3 in OCSF v1.7.0 Parquet format. The export runs hourly, leader-elected via advisory lock `0x4341524130303031`.
Configuration:
```shell
AUDIT_EXPORT_S3_ENDPOINT=https://s3.amazonaws.com  # Or MinIO endpoint
AUDIT_EXPORT_S3_BUCKET=my-audit-archive
AUDIT_EXPORT_S3_REGION=us-east-1                   # default
```

For MinIO or other S3-compatible stores, set `AUDIT_EXPORT_S3_ENDPOINT` to the custom endpoint URL.
Export behavior:
- Exports the most recently completed whole hour of `ingested_at` data.
- On fresh deployment, exports only the most recent complete hour (no historical catch-up).
- After the first export, tracks a watermark in `audit_export_watermark` and catches up hour-by-hour if the exporter was down.
- Each export run emits `export_events_total`, `export_errors_total`, `export_duration_ms`, and `is_export_leader` metrics.
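The "most recently completed whole hour" window can be computed by truncating timestamps to the hour, for example (GNU `date`; window names are illustrative, not Caracal identifiers):

```shell
#!/usr/bin/env bash
# Sketch: the UTC window the next hourly export run would cover, i.e. the
# most recently completed whole hour. Requires GNU date.
hour_end=$(date -u +%Y-%m-%dT%H:00:00Z)
hour_start=$(date -u -d "1 hour ago" +%Y-%m-%dT%H:00:00Z)
echo "export window: ${hour_start} .. ${hour_end}"
```

Truncating both endpoints to `:00:00` is what makes the window "whole": events still arriving in the current hour are never exported early.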
Verifying export health:
```shell
curl http://localhost:9090/metrics | jq '{ export_events_total, export_errors_total, is_export_leader }'
```

If `export_errors_total` is increasing and `export_events_total` is not, check S3 connectivity and credentials.
Check the watermark:
```sql
SELECT last_exported_hour, updated_at
FROM audit_export_watermark
WHERE name = 'default';
```

If `last_exported_hour` is significantly behind `now()`, the exporter has fallen behind. This can happen after an extended outage or if the Audit service was stopped. The exporter will catch up automatically when restarted, processing one hour per run.
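To quantify the backlog, the watermark value can be converted into an hour count. The timestamp format below is an assumption about your client's output; adjust the parsing to match (requires GNU `date`).

```shell
#!/usr/bin/env bash
# Sketch: how many hours the exporter is behind, given last_exported_hour
# from the watermark query. Timestamp format is an assumption.
hours_behind() {
  # $1: last_exported_hour, e.g. "2024-06-01 13:00:00+00"
  local last now
  last=$(date -u -d "$1" +%s)
  now=$(date -u +%s)
  echo $(( (now - last) / 3600 ))
}

echo "exporter is $(hours_behind "2024-06-01 13:00:00+00") hours behind"
```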
Tamper detection
The Audit service runs a continuous tamper sweeper that verifies the chain integrity of audit events using the `AUDIT_HMAC_KEY`. The sweeper checks:
- `content_sha256` matches the actual payload hash.
- `chain_hmac` is a valid HMAC of this event linked to `prev_content_sha256`.
- `chain_seq` is strictly monotonic per zone.
Tamper metrics from GET /metrics on the Audit service:
| Metric | Meaning |
|---|---|
| `tamper_checked_total` | Events verified by the sweeper |
| `tamper_mismatch_total` | Events where hash or HMAC does not match |
| `tamper_chain_breaks` | Chain sequence breaks detected |
| `tamper_hmac_failures` | HMAC verification failures |
| `tamper_last_sweep_unix` | Unix timestamp of the last rolling sweep |
| `tamper_last_full_unix` | Unix timestamp of the last full sweep |
`tamper_mismatch_total` > 0 or `tamper_chain_breaks` > 0 indicates that audit event data has been modified after ingest. This is a security event; see Incident Response.
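A monitoring check over these metrics might look like the sketch below. The metric values are passed in directly (wiring to your scraper is deployment-specific), and the 2-hour sweep-staleness threshold is an assumption, not a documented default.

```shell
#!/usr/bin/env bash
# Sketch: two monitoring predicates over the tamper metrics. The 2-hour
# staleness threshold is an assumption.
check_tamper() {
  # $1: tamper_mismatch_total  $2: tamper_chain_breaks  $3: tamper_last_sweep_unix
  local now
  now=$(date -u +%s)
  if [ "$1" -gt 0 ] || [ "$2" -gt 0 ]; then
    echo "ALERT: audit events modified after ingest"
    return 1
  fi
  if [ $(( now - $3 )) -gt 7200 ]; then
    echo "WARN: no tamper sweep in over 2 hours"
    return 2
  fi
  echo "ok"
}

check_tamper 0 0 "$(date -u +%s)"
```

A stale `tamper_last_sweep_unix` is worth alerting on in its own right: a sweeper that has stopped cannot report mismatches.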
Tamper alerts are also written to the audit_ingest_alerts table:
```sql
SELECT kind, detail, zone_id, observed_at
FROM audit_ingest_alerts
ORDER BY observed_at DESC
LIMIT 20;
```

Audit DLQ
`caracal.audit.events.dlq` receives audit events that failed delivery more than `AUDIT_MAX_DELIVERIES` (default 5) times. Events in the DLQ are not inserted into Postgres.
Monitor DLQ depth:
```shell
redis-cli -a $REDIS_PASSWORD XLEN caracal.audit.events.dlq
```

Growing DLQ depth means events are being generated faster than the Audit service can persist them, or the Audit service has a database write error. Investigate Audit service logs and Postgres connectivity.
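The XLEN output can feed a simple threshold check. The threshold of 100 below is an assumption; tune it to your event volume.

```shell
#!/usr/bin/env bash
# Sketch: alert when DLQ depth crosses a threshold. The threshold is an
# assumption; XLEN returns a plain integer.
DLQ_ALERT_THRESHOLD=100

check_dlq_depth() {
  # $1: output of `redis-cli ... XLEN caracal.audit.events.dlq`
  if [ "$1" -ge "$DLQ_ALERT_THRESHOLD" ]; then
    echo "ALERT: audit DLQ depth is $1"
    return 1
  fi
  echo "dlq depth $1 within threshold"
}

check_dlq_depth 3
```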
Disaster recovery posture
Caracal does not provide built-in backup orchestration beyond the S3 Parquet export. For full disaster recovery:
PostgreSQL:
- Enable continuous WAL archiving (`wal_level=replica`, `archive_mode=on`).
- Take daily base backups with `pg_basebackup` to durable object storage.
- Test point-in-time recovery to a separate instance at least monthly.
- Target a recovery point objective (RPO) consistent with your audit and compliance requirements.
- The `audit_events` partitioned table is the highest-priority data to protect.
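A daily base backup invocation might look like the following sketch. The host, user, and paths are placeholders, not Caracal defaults.

```shell
# Sketch: daily base backup (host, user, and paths are placeholders).
# -X stream includes the WAL needed for a consistent restore;
# -Ft -z writes compressed tar files.
pg_basebackup \
  -h db.internal.example \
  -U replication_user \
  -D "/backups/base/$(date +%F)" \
  -Ft -z -X stream -P
```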
Redis:
- The `redisData` volume holds the AOF log (`appendonly yes`, `appendfsync everysec`). On crash, the worst-case data loss is 1 second of writes.
- For production, use Redis persistence (AOF + RDB) and back up the `redisData` volume.
- Redis stream data is also in Postgres via the transactional outbox; if Redis is lost, in-flight outbox rows will be re-delivered on restart. Events already acknowledged by consumers are gone, but those were already persisted.
Recovery order:
Restore and start services in this order:
- Postgres (restore backup, apply WAL)
- Redis (restore AOF/RDB, or start fresh and let outbox re-deliver)
- `init` container (re-provisions streams and consumer groups)
- STS → API → Audit → Coordinator → Gateway