
Debug Infrastructure Issues

Debug from the boundary inward: runtime lifecycle, readiness, storage, streams, service config, then request-specific audit evidence.

For an application-level failure where you have an error or request ID but do not know which surface failed, start from the symptom-first guide, Troubleshoot by Symptom, instead.

flowchart TD
  Symptom[Symptom] --> Ready[caracal status --ready or Kubernetes readiness]
  Ready --> Storage[Postgres and Redis]
  Storage --> Streams[Redis streams and outboxes]
  Streams --> Service[Service logs and metrics]
  Service --> Audit[Audit/explain request evidence]
  Audit --> Fix[Apply focused remediation]

| Environment | Commands |
| --- | --- |
| Local | `caracal status --ready`, `docker compose ps`, `docker compose logs <service>` |
| Helm | `kubectl -n caracal get pods,svc,jobs`, `kubectl -n caracal describe pod <pod>`, `kubectl -n caracal logs <pod>` |
| Storage | `infra/postgres/scripts/validateMigrations.sh`, `infra/redis/scripts/verify.sh` |
| App-level | Console diagnostics, audit, and request trace |

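The boundary-inward order can be sketched as a small runner that walks the stages in sequence and reports the first one that fails. This is a minimal sketch, not part of Caracal: the `Stage` type, `command_ok`, `first_failing_stage`, and `LOCAL_STAGES` are hypothetical names, and it assumes that `caracal status --ready` and the two storage scripts exit with a non-zero status on failure, which the commands table does not state. Stream and outbox checks are left out because no command for them is listed.

```python
import subprocess
from dataclasses import dataclass
from functools import partial
from typing import Callable, List, Optional


@dataclass
class Stage:
    """One step in the boundary-inward order (hypothetical helper type)."""
    name: str
    check: Callable[[], bool]


def command_ok(argv: List[str]) -> bool:
    """Return True only when the command runs and exits with status 0.

    Assumption: the listed commands signal failure through their exit code.
    A missing binary or a timeout is treated as a failed check.
    """
    try:
        result = subprocess.run(argv, capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run the checks in order and return the name of the first failure."""
    for stage in stages:
        if not stage.check():
            return stage.name
    return None


# Local checks in boundary-inward order, using only commands from the table.
LOCAL_STAGES = [
    Stage("readiness", partial(command_ok, ["caracal", "status", "--ready"])),
    Stage("postgres", partial(command_ok, ["infra/postgres/scripts/validateMigrations.sh"])),
    Stage("redis", partial(command_ok, ["infra/redis/scripts/verify.sh"])),
]
```

Calling `first_failing_stage(LOCAL_STAGES)` returns the name of the outermost failing stage, or `None` when every check passes. Stopping at the first failure keeps attention on the outermost broken boundary before looking at anything downstream of it.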
| Symptom | Likely area |
| --- | --- |
| 401 or 403 from API | Admin token, scope, Control token, or workload credential source. |
| STS exchange fails | Zone ID, application ID, grant, policy, client secret, step-up, or STS readiness. |
| Gateway fails before upstream | Mandate verification, STS exchange, binding, revocation snapshot, or upstream allowlist. |
| Agent views fail | Coordinator URL/token, selected zone, or Coordinator readiness. |
| Audit event missing | Redis stream health, Audit readiness, DLQ, replay backlog, or request never reaching protected boundary. |

  1. Capture request ID, zone ID, subject, resource, and timestamp.
  2. Open Console audit or request trace.
  3. Confirm policy decision, scopes, target resource, session, agent session, and delegation edge.
  4. Compare token claims with resource-server verifier settings.
  5. Check revocation and step-up state when authority appears valid but access fails.
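
Step 4 can be partly automated when the token is a compact JWT. The sketch below assumes that format, which this page does not confirm, and uses the standard `iss`, `aud`, and `exp` claims from RFC 7519. The function names and the `expected_issuer` and `expected_audience` parameters are hypothetical stand-ins for whatever the resource-server verifier is configured with. The payload is decoded without signature verification, so the result is for inspection only and must never be used to authorize anything.

```python
import base64
import json
import time
from typing import Any, Dict, List, Optional


def decode_claims_unverified(token: str) -> Dict[str, Any]:
    """Decode the payload of a compact JWT WITHOUT checking the signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part compact JWT")
    payload = parts[1]
    # base64url payloads omit padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def claim_mismatches(
    claims: Dict[str, Any],
    expected_issuer: str,
    expected_audience: str,
    now: Optional[float] = None,
    leeway: float = 0,
) -> List[str]:
    """List the differences between token claims and verifier settings."""
    now = time.time() if now is None else now
    problems = []

    if claims.get("iss") != expected_issuer:
        problems.append(f"iss is {claims.get('iss')!r}, verifier expects {expected_issuer!r}")

    # "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if expected_audience not in audiences:
        problems.append(f"aud is {aud!r}, verifier expects {expected_audience!r}")

    # "exp" is optional in a JWT; only flag it when present and in the past.
    exp = claims.get("exp")
    if exp is not None and now > exp + leeway:
        problems.append(f"token expired at {exp}, checked at {int(now)}")

    return problems
```

An empty list means the inspected claims agree with the settings that were passed in. If access still fails at that point, move on to the revocation and step-up checks in step 5.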

When you escalate or report an issue, include readiness output, relevant logs, service versions, Helm values diff, Redis stream status, Postgres migration status, request ID, audit explanation, and any recent secret or policy changes.
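
The evidence list can be turned into a quick completeness check before a report is sent. This is a minimal sketch: the key names in `REQUIRED_EVIDENCE` and the `missing_evidence` function are hypothetical labels for the items listed above, not a Caracal format.

```python
from typing import Dict, List

# Hypothetical keys, one per evidence item named in the list above.
REQUIRED_EVIDENCE = [
    "readiness_output",
    "relevant_logs",
    "service_versions",
    "helm_values_diff",
    "redis_stream_status",
    "postgres_migration_status",
    "request_id",
    "audit_explanation",
    "recent_secret_or_policy_changes",
]


def missing_evidence(bundle: Dict[str, str]) -> List[str]:
    """Return the evidence items that are absent or blank in the bundle."""
    return [key for key in REQUIRED_EVIDENCE if not bundle.get(key, "").strip()]
```

For example, `missing_evidence({"request_id": "req-123"})` returns every key except `request_id`, which shows what still needs to be collected.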

Use Recover from Failures when infrastructure debugging identifies a degraded dependency, service, stream, or safety guarantee.