Debug Infrastructure Issues

Debug from the boundary inward: runtime lifecycle, readiness, storage, streams, service config, then request-specific audit evidence.

For an application-level failure where you have an error or request ID but do not know which surface failed, start from Troubleshoot by Symptom instead.

Triage Flow

flowchart TD
  Symptom[Symptom] --> Ready[caracal status --ready or Kubernetes readiness]
  Ready --> Storage[Postgres and Redis]
  Storage --> Streams[Redis streams and outboxes]
  Streams --> Service[Service logs and metrics]
  Service --> Audit[Audit/explain request evidence]
  Audit --> Fix[Apply focused remediation]
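
The first stages of the flow above can be sketched as a small shell helper that runs checks in order and stops at the first failing layer. The `run_triage` function and its `label::command` argument format are illustrative and not part of Caracal. The commands in the example are the ones listed under Commands, and the sketch assumes each of them exits non-zero on failure, which this page does not confirm.

```shell
# Run triage checks in boundary-inward order and stop at the first failure.
# Each argument has the form "label::command" (hypothetical format).
# Assumption: every command exits non-zero when its check fails.
run_triage() {
  local step label cmd
  for step in "$@"; do
    label="${step%%::*}"   # text before the first "::"
    cmd="${step#*::}"      # text after the first "::"
    printf '== %s: %s\n' "$label" "$cmd"
    if ! sh -c "$cmd"; then
      printf 'FAILED at %s; investigate this layer before moving inward\n' "$label"
      return 1
    fi
  done
  printf 'All checks passed; continue with service logs and audit evidence\n'
}

# Example for a local environment, run from the repository root:
# run_triage \
#   "readiness::caracal status --ready" \
#   "postgres::infra/postgres/scripts/validateMigrations.sh" \
#   "redis::infra/redis/scripts/verify.sh"
```

Stopping at the first failure keeps attention on the outermost broken layer, since a failed readiness or storage check usually makes findings further inward unreliable.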

Commands

Environment	Commands
Local	`caracal status --ready`, `docker compose ps`, `docker compose logs <service>`
Helm	`kubectl -n caracal get pods,svc,jobs`, `kubectl -n caracal describe pod <pod>`, `kubectl -n caracal logs <pod>`
Storage	`infra/postgres/scripts/validateMigrations.sh`, `infra/redis/scripts/verify.sh`
App-level	Console `diagnostics`, `audit`, and `request trace`

Common Cases

Symptom	Likely area
`401` or `403` from API	Admin token, scope, Control token, or workload credential source.
STS exchange fails	Zone ID, application ID, grant, policy, client secret, step-up, or STS readiness.
Gateway fails before upstream	Mandate verification, STS exchange, binding, revocation snapshot, or upstream allowlist.
Agent views fail	Coordinator URL/token, selected zone, or Coordinator readiness.
Audit event missing	Redis stream health, Audit readiness, DLQ, replay backlog, or the request never reached the protected boundary.

Request Investigation

1. Capture request ID, zone ID, subject, resource, and timestamp.
2. Open Console audit or request trace.
3. Confirm policy decision, scopes, target resource, session, agent session, and delegation edge.
4. Compare token claims with resource-server verifier settings.
5. Check revocation and step-up state when authority appears valid but access fails.
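
As a minimal sketch of the capture step, the helper below prints the five identifiers as `key=value` lines and fails if any is empty, so an incomplete record is caught before the investigation starts. The function names, key names, and the sample values in the usage comment are hypothetical and do not come from Caracal.

```shell
# Print one field as key=value, or report it as missing on stderr.
print_field() {
  if [ -z "$2" ]; then
    echo "missing: $1" >&2
    return 1
  fi
  printf '%s=%s\n' "$1" "$2"
}

# Print the request context; returns 1 if any field is empty, 2 on wrong usage.
print_request_context() {
  if [ "$#" -ne 5 ]; then
    echo "usage: print_request_context <request_id> <zone_id> <subject> <resource> <timestamp>" >&2
    return 2
  fi
  local rc=0
  print_field request_id "$1" || rc=1
  print_field zone_id "$2" || rc=1
  print_field subject "$3" || rc=1
  print_field resource "$4" || rc=1
  print_field timestamp "$5" || rc=1
  return "$rc"
}

# Example with made-up values:
# print_request_context req-123 zone-a alice orders-api 2024-01-01T00:00:00Z
```

The output can be pasted into a ticket or an escalation bundle so that everyone works from the same identifiers.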

Escalation Bundle

Include readiness output, relevant logs, service versions, Helm values diff, Redis stream status, Postgres migration status, request ID, audit explanation, and any recent secret or policy changes.
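
Part of the bundle can be collected with a small helper like the one below, shown as a sketch rather than a supported tool. The `collect` function, the bundle directory name, and the output file names are assumptions. The commands in the example are the ones listed under Commands, and items such as the Helm values diff and audit explanation still need to be added by hand.

```shell
# Run a command and save its combined stdout and stderr under the bundle directory.
# A failing command is recorded in the output file instead of aborting collection.
# Usage: collect <bundle_dir> <name> <command> [args...]
collect() {
  local out_dir="$1"
  local name="$2"
  shift 2
  mkdir -p "$out_dir"
  if ! "$@" >"$out_dir/$name.txt" 2>&1; then
    echo "command failed: $*" >>"$out_dir/$name.txt"
  fi
}

# Example (bundle path and file names are illustrative):
# bundle="caracal-bundle-$(date -u +%Y%m%dT%H%M%SZ)"
# collect "$bundle" readiness caracal status --ready
# collect "$bundle" compose-ps docker compose ps
# collect "$bundle" k8s-objects kubectl -n caracal get pods,svc,jobs
# collect "$bundle" postgres-migrations infra/postgres/scripts/validateMigrations.sh
# collect "$bundle" redis-verify infra/redis/scripts/verify.sh
# tar -czf "$bundle.tar.gz" "$bundle"
```

Recording failures instead of stopping means a degraded dependency still yields a complete bundle, with the failing command visible in its own file.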

Next Step

Use Recover from Failures when infrastructure debugging identifies a degraded dependency, service, stream, or safety guarantee.