Configure Alerts
The Helm chart can render PrometheusRule alerts for policy freshness, audit health, outbox health, Gateway exchange, revocation safety, database saturation, and readiness flapping.
Built-In Alerts
| Alert | Severity | First response |
|---|---|---|
| CaracalSTSOPACompileErrors | warning | Inspect policy activation history and STS logs; affected policy loads fail closed. |
| CaracalSTSPolicyBundleStale | warning | Check Redis policy invalidation and STS policy age. |
| CaracalSTSProviderRefreshErrors | warning | Check Redis coordination and upstream provider availability. |
| CaracalSTSProviderCircuitOpen | warning | Restore provider or credential health; provider-backed grants fail closed. |
| CaracalAuditDLQNonEmpty | warning | Inspect Audit DLQ records and producer HMAC/signature failures. |
| CaracalAuditDLQGrowth | critical | Stop risky rollouts and recover Audit/Redis/Postgres. |
| CaracalAuditConsumerLagHigh | warning | Scale Audit or restore Postgres/Redis performance. |
| CaracalGatewayAuditReplayBacklogOld | warning | Recover Redis/Audit and confirm Gateway replay drains. |
| CaracalSTSAuditReplayBacklogOld | warning | Recover Redis/Audit and confirm STS replay drains. |
| CaracalAuditTamperDetected | critical | Treat as a security incident. |
| CaracalAPIOutboxDeadMessages | critical | Reconcile outbox rows and Redis consumers before continuing rollouts. |
| CaracalAPIOutboxPendingOldest | warning | Check Redis health, API outbox workers, DB pool, and stream memory. |
| CaracalGatewaySTSExchangeErrors | warning | Check Gateway readiness, STS readiness, bindings, and service exchange key. |
| CaracalGatewaySTSCircuitOpen | critical | Restore STS or service exchange before routing protected traffic. |
| CaracalGatewayRevocationSnapshotStale | critical | Treat as access-safety incident until snapshot freshness returns. |
| CaracalGatewayRevocationPropagationLag | critical | Restore revocation stream consumers and pending-entry reclamation. |
| CaracalGatewayRevocationReloadErrors | critical | Fix Postgres/Redis snapshot load before relying on protected routes. |
| CaracalPostgresPoolSaturation | warning | Inspect long queries, pool sizing, and connection leaks. |
| CaracalReadinessFlapping | warning | Check dependencies, throttling, OOM, and probe timeouts. |
CaracalSTSOPACompileErrors
STS is reporting OPA policy compilation failures. Inspect policy activation history and STS logs for the failing bundle; affected policy loads fail closed until the bundle compiles successfully.
CaracalSTSPolicyBundleStale
STS policy age is above the configured freshness threshold. Check Redis policy invalidation, PostgreSQL polling, and STS readiness before relying on newly activated policies.
CaracalSTSProviderRefreshErrors
STS provider refresh coordination is failing. Restore Redis coordination and upstream provider availability; provider-backed grants may fail closed while refresh results cannot be coordinated.
CaracalSTSProviderCircuitOpen
STS has opened a provider refresh circuit after repeated provider failures. Restore the provider or credential health before retrying provider-backed grants.
CaracalAuditDLQNonEmpty
The Audit dead-letter queue contains records. Inspect Audit logs, DLQ records, producer HMAC/signature failures, and Redis stream health before treating evidence capture as healthy.
CaracalAuditDLQGrowth
The Audit dead-letter queue is growing. Stop risky rollouts and recover Audit, Redis, or Postgres before continuing changes that depend on complete audit evidence.
CaracalAuditConsumerLagHigh
Audit ingestion is behind the Redis stream. Scale Audit or restore Postgres and Redis performance until consumer lag returns below the configured threshold.
CaracalGatewayAuditReplayBacklogOld
Gateway has audit replay files waiting on disk beyond the configured threshold. Recover Redis or Audit and confirm Gateway replay drains before continuing risky rollouts.
CaracalSTSAuditReplayBacklogOld
STS has audit replay files waiting on disk beyond the configured threshold. Recover Redis or Audit and confirm STS replay drains before continuing risky rollouts.
CaracalAuditTamperDetected
Audit chain verification detected a mismatch, ordering break, or stored HMAC failure. Treat this as a security incident and preserve evidence before attempting recovery.
CaracalAPIOutboxDeadMessages
The API transactional outbox has abandoned messages. Recover Redis or stream consumers, inspect API logs, and reconcile affected outbox rows before continuing rollouts.
CaracalAPIOutboxPendingOldest
The API transactional outbox is not draining promptly. Check Redis health, API outbox workers, database pool saturation, and stream memory pressure.
CaracalGatewaySTSExchangeErrors
Gateway cannot reliably exchange mandates with STS. Check Gateway readiness, STS readiness, route bindings, and the service exchange key before routing protected traffic.
CaracalGatewaySTSCircuitOpen
Gateway is fast-failing STS exchanges after repeated STS-unavailable failures. Restore STS or service exchange health before routing protected traffic that requires exchanged authority.
CaracalGatewayRevocationSnapshotStale
Gateway cannot prove it has a fresh revocation baseline. Treat this as an access-safety incident until revocation snapshot loading from Postgres and Redis is healthy.
CaracalGatewayRevocationPropagationLag
Revocation stream messages are reaching Gateway outside the configured safety window. Restore Redis consumer health and confirm pending entries are reclaimed.
CaracalGatewayRevocationReloadErrors
Gateway cannot refresh revocation state from Postgres. Treat this as an access-safety incident until reloads succeed and snapshot freshness returns.
CaracalPostgresPoolSaturation
A service is holding most of its Postgres connection pool. Inspect long-running queries, statement timeouts, pool sizing, and connection leaks before the pool exhausts.
CaracalReadinessFlapping
A Caracal pod is repeatedly toggling between Ready and NotReady. Check dependency health, CPU throttling, OOM events, and readiness probe timeouts.
Alert Routing
| Severity | Route |
|---|---|
| Critical access-safety or tamper alerts | Security incident response. |
| Critical data-flow alerts | On-call plus release owner; freeze rollouts. |
| Warning readiness/capacity alerts | Service owner or platform queue. |
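
The routing table can be read as a small lookup. The following is a minimal Python sketch for illustration only and is not part of the chart. The `route_alert` function, the category labels (`access-safety`, `tamper`, `data-flow`), sending every warning to the service owner, and the fallback for combinations the table does not list are all assumptions; this page does not say how alerts are tagged with a category.

```python
# Hypothetical category labels. The routing table names these alert classes
# but does not define how individual alerts are tagged with them.
SECURITY_CATEGORIES = {"access-safety", "tamper"}


def route_alert(severity: str, category: str) -> str:
    """Return the route from the Alert Routing table for one alert."""
    if severity == "critical" and category in SECURITY_CATEGORIES:
        return "Security incident response"
    if severity == "critical" and category == "data-flow":
        return "On-call plus release owner; freeze rollouts"
    if severity == "warning":
        # Assumption: all warnings follow the readiness/capacity row.
        return "Service owner or platform queue"
    # Assumption: anything the table does not cover goes to manual triage.
    return "Unrouted; triage manually"
```

For example, `route_alert("critical", "tamper")` would resolve to the security incident response route, matching the first row of the table.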
First-Response Rule
When an alert involves audit, revocation, or policy freshness, prefer fail-closed recovery over traffic continuation. Restore evidence and revocation guarantees before resuming high-risk changes.
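
One way to apply this rule in tooling is to flag firing alerts that belong to the audit, revocation, or policy families. The sketch below is a hedged illustration in Python: matching on fragments of the built-in alert names is an assumed heuristic, and the function names are hypothetical rather than anything Caracal ships.

```python
# Assumption: name fragments that identify the alert families named in the
# rule. Treating OPA compile errors as a policy-freshness alert is a guess.
FAIL_CLOSED_MARKERS = ("Audit", "Revocation", "PolicyBundle", "OPACompile")


def requires_fail_closed(alert_name: str) -> bool:
    """True when the alert name suggests audit, revocation, or policy scope."""
    return any(marker in alert_name for marker in FAIL_CLOSED_MARKERS)


def should_pause_high_risk_changes(firing: list[str]) -> bool:
    """True when any firing alert falls under the first-response rule."""
    return any(requires_fail_closed(name) for name in firing)
```

A check like this could gate a rollout pipeline, but the decision to resume changes still belongs to whoever confirms that evidence and revocation guarantees are restored.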
Next Step
Use Troubleshoot by Symptom when an alert or user report starts from an SDK error, HTTP status, or request ID.

