Debug Infrastructure Issues
Debug from the boundary inward: runtime lifecycle, readiness, storage, streams, service config, then request-specific audit evidence.
For an application-level failure where you have an error or request ID but not the failing surface, start from the symptom-first Troubleshoot by Symptom instead.
Triage Flow
Section titled “Triage Flow”flowchart TD Symptom[Symptom] --> Ready[caracal status --ready or Kubernetes readiness] Ready --> Storage[Postgres and Redis] Storage --> Streams[Redis streams and outboxes] Streams --> Service[Service logs and metrics] Service --> Audit[Audit/explain request evidence] Audit --> Fix[Apply focused remediation]
Commands
Section titled “Commands”| Environment | Commands |
|---|---|
| Local | caracal status --ready, docker compose ps, docker compose logs <service> |
| Helm | kubectl -n caracal get pods,svc,jobs, kubectl -n caracal describe pod <pod>, kubectl -n caracal logs <pod> |
| Storage | infra/postgres/scripts/validateMigrations.sh, infra/redis/scripts/verify.sh |
| App-level | Console diagnostics, audit, and request trace |
Common Cases
Section titled “Common Cases”| Symptom | Likely area |
|---|---|
401 or 403 from API | Admin token, scope, Control token, or workload credential source. |
| STS exchange fails | Zone ID, application ID, grant, policy, client secret, step-up, or STS readiness. |
| Gateway fails before upstream | Mandate verification, STS exchange, binding, revocation snapshot, or upstream allowlist. |
| Agent views fail | Coordinator URL/token, selected zone, or Coordinator readiness. |
| Audit event missing | Redis stream health, Audit readiness, DLQ, replay backlog, or request never reaching protected boundary. |
Request Investigation
Section titled “Request Investigation”- Capture request ID, zone ID, subject, resource, and timestamp.
- Open Console
auditorrequest trace. - Confirm policy decision, scopes, target resource, session, agent session, and delegation edge.
- Compare token claims with resource-server verifier settings.
- Check revocation and step-up state when authority appears valid but access fails.
Escalation Bundle
Section titled “Escalation Bundle”Include readiness output, relevant logs, service versions, Helm values diff, Redis stream status, Postgres migration status, request ID, audit explanation, and any recent secret or policy changes.
Next Step
Section titled “Next Step”Use Recover from Failures when infrastructure debugging identifies a degraded dependency, service, stream, or safety guarantee.

