Configure Alerts

The Helm chart can render PrometheusRule alerts for policy freshness, audit health, outbox health, Gateway exchange, revocation safety, database saturation, and readiness flapping.

| Alert | Severity | First response |
| --- | --- | --- |
| CaracalSTSOPACompileErrors | warning | Inspect policy activation history and STS logs; affected policy loads fail closed. |
| CaracalSTSPolicyBundleStale | warning | Check Redis policy invalidation and STS policy age. |
| CaracalSTSProviderRefreshErrors | warning | Check Redis coordination and upstream provider availability. |
| CaracalSTSProviderCircuitOpen | warning | Restore provider or credential health; provider-backed grants fail closed. |
| CaracalAuditDLQNonEmpty | warning | Inspect Audit DLQ records and producer HMAC/signature failures. |
| CaracalAuditDLQGrowth | critical | Stop risky rollouts and recover Audit/Redis/Postgres. |
| CaracalAuditConsumerLagHigh | warning | Scale Audit or restore Postgres/Redis performance. |
| CaracalGatewayAuditReplayBacklogOld | warning | Recover Redis/Audit and confirm Gateway replay drains. |
| CaracalSTSAuditReplayBacklogOld | warning | Recover Redis/Audit and confirm STS replay drains. |
| CaracalAuditTamperDetected | critical | Treat as a security incident. |
| CaracalAPIOutboxDeadMessages | critical | Reconcile outbox rows and Redis consumers before continuing rollouts. |
| CaracalAPIOutboxPendingOldest | warning | Check Redis health, API outbox workers, DB pool, and stream memory. |
| CaracalGatewaySTSExchangeErrors | warning | Check Gateway readiness, STS readiness, bindings, and service exchange key. |
| CaracalGatewaySTSCircuitOpen | critical | Restore STS or service exchange before routing protected traffic. |
| CaracalGatewayRevocationSnapshotStale | critical | Treat as access-safety incident until snapshot freshness returns. |
| CaracalGatewayRevocationPropagationLag | critical | Restore revocation stream consumers and pending-entry reclamation. |
| CaracalGatewayRevocationReloadErrors | critical | Fix Postgres/Redis snapshot load before relying on protected routes. |
| CaracalPostgresPoolSaturation | warning | Inspect long queries, pool sizing, and connection leaks. |
| CaracalReadinessFlapping | warning | Check dependencies, throttling, OOM, and probe timeouts. |
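
During triage it can help to see which of these alerts are firing right now, grouped by severity. The following is a minimal sketch, not part of the chart: it assumes the rules are loaded into a Prometheus server reachable at a base URL you supply, and that each rule carries a `severity` label matching the table above. The helper names `fetch_alerts` and `firing_caracal_alerts` are illustrative.

```python
import json
import urllib.request
from collections import defaultdict


def fetch_alerts(base_url):
    """Return active alerts from the Prometheus HTTP API (/api/v1/alerts)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/alerts", timeout=10) as resp:
        payload = json.load(resp)
    return payload["data"]["alerts"]


def firing_caracal_alerts(alerts):
    """Group firing alerts whose name starts with 'Caracal' by severity label."""
    grouped = defaultdict(list)
    for alert in alerts:
        labels = alert.get("labels", {})
        name = labels.get("alertname", "")
        if alert.get("state") == "firing" and name.startswith("Caracal"):
            # Assumption: the rendered rules set a "severity" label.
            grouped[labels.get("severity", "unknown")].append(name)
    return {severity: sorted(names) for severity, names in grouped.items()}
```

For example, `firing_caracal_alerts(fetch_alerts("http://localhost:9090"))` would return a dictionary keyed by severity, where the URL is a placeholder for your own Prometheus endpoint.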

CaracalSTSOPACompileErrors

STS is reporting OPA policy compilation failures. Inspect policy activation history and STS logs for the failing bundle; affected policy loads fail closed until the bundle compiles successfully.

CaracalSTSPolicyBundleStale

STS policy age is above the configured freshness threshold. Check Redis policy invalidation, Postgres polling, and STS readiness before relying on newly activated policies.

CaracalSTSProviderRefreshErrors

STS provider refresh coordination is failing. Restore Redis coordination and upstream provider availability; provider-backed grants may fail closed while refresh results cannot be coordinated.

CaracalSTSProviderCircuitOpen

STS has opened a provider refresh circuit after repeated provider failures. Restore the provider or credential health before retrying provider-backed grants.

CaracalAuditDLQNonEmpty

The Audit dead-letter queue contains records. Inspect Audit logs, DLQ records, producer HMAC/signature failures, and Redis stream health before treating evidence capture as healthy.

CaracalAuditDLQGrowth

The Audit dead-letter queue is growing. Stop risky rollouts and recover Audit, Redis, or Postgres before continuing changes that depend on complete audit evidence.

CaracalAuditConsumerLagHigh

Audit ingestion is behind the Redis stream. Scale Audit or restore Postgres and Redis performance until consumer lag returns below the configured threshold.

CaracalGatewayAuditReplayBacklogOld

Gateway has audit replay files that have waited on disk longer than the configured threshold. Recover Redis or Audit and confirm Gateway replay drains before continuing risky rollouts.

CaracalSTSAuditReplayBacklogOld

STS has audit replay files that have waited on disk longer than the configured threshold. Recover Redis or Audit and confirm STS replay drains before continuing risky rollouts.

CaracalAuditTamperDetected

Audit chain verification detected a mismatch, ordering break, or stored HMAC failure. Treat this as a security incident and preserve evidence before attempting recovery.

CaracalAPIOutboxDeadMessages

The API transactional outbox has abandoned messages. Recover Redis or stream consumers, inspect API logs, and reconcile affected outbox rows before continuing rollouts.

CaracalAPIOutboxPendingOldest

The API transactional outbox is not draining promptly. Check Redis health, API outbox workers, database pool saturation, and stream memory pressure.

CaracalGatewaySTSExchangeErrors

Gateway cannot reliably exchange mandates with STS. Check Gateway readiness, STS readiness, route bindings, and the service exchange key before routing protected traffic.

CaracalGatewaySTSCircuitOpen

Gateway is fast-failing STS exchanges after repeated STS-unavailable failures. Restore STS or service exchange health before routing protected traffic that requires exchanged authority.

CaracalGatewayRevocationSnapshotStale

Gateway cannot prove it has a fresh revocation baseline. Treat this as an access-safety incident until Postgres and Redis revocation snapshot loading is healthy.

CaracalGatewayRevocationPropagationLag

Revocation stream messages are reaching Gateway outside the configured safety window. Restore Redis consumer health and confirm pending entries are reclaimed.

CaracalGatewayRevocationReloadErrors

Gateway cannot refresh revocation state from Postgres. Treat this as an access-safety incident until reloads succeed and snapshot freshness returns.

CaracalPostgresPoolSaturation

A service is holding most of its Postgres connection pool. Inspect long-running queries, statement timeouts, pool sizing, and connection leaks before the pool exhausts.

CaracalReadinessFlapping

A Caracal pod is repeatedly toggling between Ready and NotReady. Check dependency health, CPU throttling, OOM events, and readiness probe timeouts.

| Severity | Route |
| --- | --- |
| Critical access-safety or tamper alerts | Security incident response. |
| Critical data-flow alerts | On-call plus release owner; freeze rollouts. |
| Warning readiness/capacity alerts | Service owner or platform queue. |
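
The routing rule can be sketched as a small lookup. This is illustrative only: the route strings and the `route_alert` helper are hypothetical, and the set of alerts treated as access-safety or tamper (the three revocation alerts plus `CaracalAuditTamperDetected`) is an assumption inferred from the alert names, not something the chart defines.

```python
# Assumption: these critical alerts count as access-safety or tamper alerts.
SECURITY_ALERTS = frozenset({
    "CaracalAuditTamperDetected",
    "CaracalGatewayRevocationSnapshotStale",
    "CaracalGatewayRevocationPropagationLag",
    "CaracalGatewayRevocationReloadErrors",
})


def route_alert(alert_name, severity):
    """Map an alert to one of the three routes in the routing table."""
    if severity == "critical":
        if alert_name in SECURITY_ALERTS:
            return "security-incident-response"
        # Remaining critical alerts are treated as data-flow alerts.
        return "on-call-and-release-owner"
    return "service-owner-or-platform-queue"
```

In practice this mapping usually lives in Alertmanager routing configuration; the function only makes the decision explicit.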

When an alert involves audit, revocation, or policy freshness, prefer fail-closed recovery over traffic continuation. Restore evidence and revocation guarantees before resuming high-risk changes.
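
One way to apply this guidance is a pre-rollout check that lists any firing alert touching audit, revocation, or policy freshness. The sketch below is a hypothetical helper, not a Caracal feature; matching on substrings of the alert name is an assumption that happens to fit the alert names listed on this page.

```python
# Assumption: alert names containing these markers involve audit,
# revocation, or policy freshness.
FAIL_CLOSED_MARKERS = ("Audit", "Revocation", "PolicyBundleStale")


def rollout_blockers(firing_alert_names):
    """Return the firing alerts that should pause high-risk changes."""
    return sorted(
        name
        for name in firing_alert_names
        if any(marker in name for marker in FAIL_CLOSED_MARKERS)
    )
```

An empty result means none of the fail-closed categories is firing; any other result is a reason to restore evidence and revocation guarantees first.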

Use Troubleshoot by Symptom when an alert or user report starts from an SDK error, HTTP status, or request ID.