---
title: "Configure Alerts"
url: "https://docs.caracal.run/operations/alerts/"
markdown_url: "https://docs.caracal.run/markdown/operations/alerts.md"
description: "Alert names, meanings, and first-response actions for Caracal operations."
page_type: "reference"
concepts: []
requires: []
---

# Configure Alerts

The Helm chart can render PrometheusRule alerts for policy freshness, audit health, outbox health, Gateway exchange, revocation safety, database saturation, and readiness flapping.

## Built-In Alerts

| Alert | Severity | First response |
| --- | --- | --- |
| `CaracalSTSOPACompileErrors` | warning | Inspect policy activation history and STS logs; affected policy loads fail closed. |
| `CaracalSTSPolicyBundleStale` | warning | Check Redis policy invalidation and STS policy age. |
| `CaracalSTSProviderRefreshErrors` | warning | Check Redis coordination and upstream provider availability. |
| `CaracalSTSProviderCircuitOpen` | warning | Restore provider or credential health; provider-backed grants fail closed. |
| `CaracalAuditDLQNonEmpty` | warning | Inspect Audit DLQ records and producer HMAC/signature failures. |
| `CaracalAuditDLQGrowth` | critical | Stop risky rollouts and recover Audit/Redis/Postgres. |
| `CaracalAuditConsumerLagHigh` | warning | Scale Audit or restore Postgres/Redis performance. |
| `CaracalGatewayAuditReplayBacklogOld` | warning | Recover Redis/Audit and confirm Gateway replay drains. |
| `CaracalSTSAuditReplayBacklogOld` | warning | Recover Redis/Audit and confirm STS replay drains. |
| `CaracalAuditTamperDetected` | critical | Treat as a security incident. |
| `CaracalAPIOutboxDeadMessages` | critical | Reconcile outbox rows and Redis consumers before continuing rollouts. |
| `CaracalAPIOutboxPendingOldest` | warning | Check Redis health, API outbox workers, DB pool, and stream memory. |
| `CaracalGatewaySTSExchangeErrors` | warning | Check Gateway readiness, STS readiness, bindings, and service exchange key. |
| `CaracalGatewaySTSCircuitOpen` | critical | Restore STS or service exchange before routing protected traffic. |
| `CaracalGatewayRevocationSnapshotStale` | critical | Treat as an access-safety incident until snapshot freshness returns. |
| `CaracalGatewayRevocationPropagationLag` | critical | Restore revocation stream consumers and pending-entry reclamation. |
| `CaracalGatewayRevocationReloadErrors` | critical | Fix Postgres/Redis snapshot load before relying on protected routes. |
| `CaracalPostgresPoolSaturation` | warning | Inspect long queries, pool sizing, and connection leaks. |
| `CaracalReadinessFlapping` | warning | Check dependencies, throttling, OOM, and probe timeouts. |
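
For scripted checks, such as confirming that every critical alert has a paging route, the table above can be held as a simple lookup. The following is a minimal sketch: the alert names and severities are copied from the table, while `BUILT_IN_ALERTS` and `alerts_with_severity` are hypothetical names chosen for illustration and are not part of Caracal or the Helm chart.

```python
# Severity of each built-in alert, copied from the Built-In Alerts table.
# The dictionary and helper names are illustrative assumptions.
BUILT_IN_ALERTS = {
    "CaracalSTSOPACompileErrors": "warning",
    "CaracalSTSPolicyBundleStale": "warning",
    "CaracalSTSProviderRefreshErrors": "warning",
    "CaracalSTSProviderCircuitOpen": "warning",
    "CaracalAuditDLQNonEmpty": "warning",
    "CaracalAuditDLQGrowth": "critical",
    "CaracalAuditConsumerLagHigh": "warning",
    "CaracalGatewayAuditReplayBacklogOld": "warning",
    "CaracalSTSAuditReplayBacklogOld": "warning",
    "CaracalAuditTamperDetected": "critical",
    "CaracalAPIOutboxDeadMessages": "critical",
    "CaracalAPIOutboxPendingOldest": "warning",
    "CaracalGatewaySTSExchangeErrors": "warning",
    "CaracalGatewaySTSCircuitOpen": "critical",
    "CaracalGatewayRevocationSnapshotStale": "critical",
    "CaracalGatewayRevocationPropagationLag": "critical",
    "CaracalGatewayRevocationReloadErrors": "critical",
    "CaracalPostgresPoolSaturation": "warning",
    "CaracalReadinessFlapping": "warning",
}


def alerts_with_severity(severity: str) -> list[str]:
    """Return built-in alert names with the given severity, in table order."""
    return [name for name, sev in BUILT_IN_ALERTS.items() if sev == severity]
```

Calling `alerts_with_severity("critical")` returns the seven critical alerts in the order they appear in the table, starting with `CaracalAuditDLQGrowth`.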

## CaracalSTSOPACompileErrors

STS is reporting OPA policy compilation failures. Inspect policy activation history and STS logs for the failing bundle; affected policy loads fail closed until the bundle compiles successfully.

## CaracalSTSPolicyBundleStale

STS policy age is above the configured freshness threshold. Check Redis policy invalidation, PostgreSQL polling, and STS readiness before relying on newly activated policies.

## CaracalSTSProviderRefreshErrors

STS provider refresh coordination is failing. Restore Redis coordination and upstream provider availability; provider-backed grants may fail closed while refresh results cannot be coordinated.

## CaracalSTSProviderCircuitOpen

STS has opened a provider refresh circuit after repeated provider failures. Restore the provider or credential health before retrying provider-backed grants.

## CaracalAuditDLQNonEmpty

The Audit dead-letter queue contains records. Inspect Audit logs, DLQ records, producer HMAC/signature failures, and Redis stream health before treating evidence capture as healthy.

## CaracalAuditDLQGrowth

The Audit dead-letter queue is growing. Stop risky rollouts and recover Audit, Redis, or Postgres before continuing changes that depend on complete audit evidence.

## CaracalAuditConsumerLagHigh

Audit ingestion is behind the Redis stream. Scale Audit or restore Postgres and Redis performance until consumer lag returns below the configured threshold.

## CaracalGatewayAuditReplayBacklogOld

Gateway has audit replay files waiting on disk beyond the configured threshold. Recover Redis or Audit and confirm Gateway replay drains before continuing risky rollouts.

## CaracalSTSAuditReplayBacklogOld

STS has audit replay files waiting on disk beyond the configured threshold. Recover Redis or Audit and confirm STS replay drains before continuing risky rollouts.

## CaracalAuditTamperDetected

Audit chain verification detected a mismatch, ordering break, or stored HMAC failure. Treat this as a security incident and preserve evidence before attempting recovery.

## CaracalAPIOutboxDeadMessages

The API transactional outbox has abandoned messages. Recover Redis or stream consumers, inspect API logs, and reconcile affected outbox rows before continuing rollouts.

## CaracalAPIOutboxPendingOldest

The API transactional outbox is not draining promptly. Check Redis health, API outbox workers, database pool saturation, and stream memory pressure.

## CaracalGatewaySTSExchangeErrors

Gateway cannot reliably exchange mandates with STS. Check Gateway readiness, STS readiness, route bindings, and the service exchange key before routing protected traffic.

## CaracalGatewaySTSCircuitOpen

Gateway is fast-failing STS exchanges after repeated STS-unavailable failures. Restore STS or service exchange health before routing protected traffic that requires exchanged authority.

## CaracalGatewayRevocationSnapshotStale

Gateway cannot prove it has a fresh revocation baseline. Treat this as an access-safety incident until revocation snapshot loading from Postgres and Redis is healthy.

## CaracalGatewayRevocationPropagationLag

Revocation stream messages are reaching Gateway outside the configured safety window. Restore Redis consumer health and confirm pending entries are reclaimed.

## CaracalGatewayRevocationReloadErrors

Gateway cannot refresh revocation state from Postgres. Treat this as an access-safety incident until reloads succeed and snapshot freshness returns.

## CaracalPostgresPoolSaturation

A service is holding most of its Postgres connection pool. Inspect long-running queries, statement timeouts, pool sizing, and connection leaks before the pool exhausts.

## CaracalReadinessFlapping

A Caracal pod is repeatedly toggling between Ready and NotReady. Check dependency health, CPU throttling, OOM events, and readiness probe timeouts.

## Alert Routing

| Severity and alert class | Route |
| --- | --- |
| Critical access-safety or tamper alerts | Security incident response. |
| Critical data-flow alerts | On-call plus release owner; freeze rollouts. |
| Warning readiness/capacity alerts | Service owner or platform queue. |
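
The routing table can be expressed as a lookup keyed on severity and alert class. This is a sketch only: the routes are taken from the table above, but `AlertClass`, `ROUTES`, and `route_for` are hypothetical names, and how an individual alert is assigned to a class is left to the operator because this page does not define that mapping for every alert.

```python
from enum import Enum
from typing import Optional


class AlertClass(Enum):
    """Alert classes named in the Alert Routing table (enum is illustrative)."""
    ACCESS_SAFETY_OR_TAMPER = "access-safety or tamper"
    DATA_FLOW = "data-flow"
    READINESS_OR_CAPACITY = "readiness/capacity"


# (severity, class) -> route, one entry per row of the routing table.
ROUTES = {
    ("critical", AlertClass.ACCESS_SAFETY_OR_TAMPER): "Security incident response",
    ("critical", AlertClass.DATA_FLOW): "On-call plus release owner; freeze rollouts",
    ("warning", AlertClass.READINESS_OR_CAPACITY): "Service owner or platform queue",
}


def route_for(severity: str, alert_class: AlertClass) -> Optional[str]:
    """Return the documented route, or None if the table has no matching row."""
    return ROUTES.get((severity, alert_class))
```

Returning `None` for combinations the table does not cover keeps the sketch from inventing a route; a real configuration would need an explicit default receiver.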

## First-Response Rule

When an alert involves audit, revocation, or policy freshness, prefer fail-closed recovery over traffic continuation. Restore evidence and revocation guarantees before resuming high-risk changes.
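
One rough way to flag alerts that fall under this rule is to match on the alert name. The sketch below assumes that the substrings `Audit`, `Revocation`, and `Policy` in an alert name indicate the areas named above; that heuristic, along with `FAIL_CLOSED_AREAS` and `prefers_fail_closed`, is an illustration and not something Caracal provides.

```python
# Substrings assumed to mark audit, revocation, and policy-freshness alerts.
# This name-based heuristic is an assumption made for illustration.
FAIL_CLOSED_AREAS = ("Audit", "Revocation", "Policy")


def prefers_fail_closed(alert_name: str) -> bool:
    """Return True if the alert name matches one of the fail-closed areas."""
    return any(area in alert_name for area in FAIL_CLOSED_AREAS)
```

The heuristic is deliberately narrow. For example, `CaracalSTSOPACompileErrors` relates to policy loading but does not match any of the substrings, so an explicit per-alert label would likely be more reliable than name matching.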

## Next Step

Use [Troubleshoot by Symptom](/operations/troubleshooting/) when an alert or user report starts from an SDK error, HTTP status, or request ID.
