Scale Capacity

Scale Caracal around three bottlenecks: token exchange and policy evaluation in STS, upstream proxying in Gateway, and durable writes in Postgres/Audit.

Scaling Levers

Component	Primary levers
API	Replicas, `DB_POOL_MAX`, outbox batch/interval settings, request rate limits.
STS	Replicas, OPA policy age, Redis invalidation health, `MAX_GRANT_TTL_SECONDS`, replay persistence.
Gateway	Replicas, `MAX_REQUEST_BYTES`, `STS_TIMEOUT`, `UPSTREAM_TIMEOUT`, upstream allowlist, revocation snapshot health.
Audit	Replicas, Postgres write capacity, DLQ thresholds, retention, S3 export settings.
Coordinator	Replicas, DB pool, sweeper intervals, outbox batch, service-agent leases.
Postgres	Connection limits, indexes, partitions, storage IOPS, backup windows.
Redis	Stream memory, pending entries, consumer lag, persistence, network latency.

Helm Defaults

The chart defaults to two replicas for API, STS, Gateway, Audit, and Coordinator. Gateway has a higher HPA ceiling (16 versus 8 for the other services) because protected traffic flows through it.

Service	Default port	Default max HPA replicas
API	`3000`	`8`
STS	`8080`	`8`
Gateway	`8081`	`16`
Audit	`9090`	`8`
Coordinator	`4000`	`8`
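
Each replica holds its own connection pool, so the HPA ceiling multiplies Postgres connection demand. The sketch below works out the worst case from the HPA ceilings in the table above. The per-pod pool size of 10, the choice of API, Audit, and Coordinator as the services that hold Postgres pools, and the `max_connections` value of 200 are illustrative assumptions, not chart defaults.

```python
# Worst-case Postgres connections if every service scales to its HPA ceiling.
# HPA ceilings come from the table above; every other number is an assumption.
HPA_MAX_REPLICAS = {"api": 8, "sts": 8, "gateway": 16, "audit": 8, "coordinator": 8}

# Assumption: these services hold a Postgres pool capped at 10 connections per pod.
ASSUMED_POOL_MAX = {"api": 10, "audit": 10, "coordinator": 10}


def worst_case_connections(replicas, pool_max):
    """Sum replicas * pool size over every service that holds a pool."""
    return sum(replicas[name] * size for name, size in pool_max.items())


def headroom(max_connections, replicas, pool_max, reserved=10):
    """Connections left after worst-case demand and a reserved admin margin."""
    return max_connections - reserved - worst_case_connections(replicas, pool_max)


print(worst_case_connections(HPA_MAX_REPLICAS, ASSUMED_POOL_MAX))  # 240
print(headroom(200, HPA_MAX_REPLICAS, ASSUMED_POOL_MAX))           # -50
```

A negative headroom under these assumed numbers means the pools could exhaust Postgres before the HPA ceiling is reached, so either the pool size or the database connection limit would need to change first.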

Capacity Signals

Signal	Meaning
Postgres pool ratio near `0.9`	Service pool saturation; inspect queries and pool size.
Audit consumer lag	Audit ingestion cannot keep up with Redis stream input.
Audit replay backlog age	STS/Gateway cannot emit audit events to Redis/Audit promptly.
Gateway STS circuit open	Gateway is fast-failing because STS exchange is unhealthy.
Revocation propagation lag	Access-safety state is not reaching Gateway within the expected window.
Readiness flapping	Pods or dependencies are unstable under current load.
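
As a rough illustration of the pool ratio signal, the sketch below treats the ratio as in-use connections divided by the pool maximum and flags saturation only when it stays at or above `0.9` for several samples in a row. The ratio definition, the `PoolSample` type, and the three-sample window are assumptions for illustration; the actual metric may be computed differently.

```python
from dataclasses import dataclass

POOL_RATIO_THRESHOLD = 0.9  # threshold named in the table above


@dataclass(frozen=True)
class PoolSample:
    """Hypothetical sample of one service pool at one scrape."""
    in_use: int
    pool_max: int


def pool_ratio(sample):
    """Assumed definition: in-use connections divided by the pool maximum."""
    if sample.pool_max <= 0:
        raise ValueError("pool_max must be positive")
    return sample.in_use / sample.pool_max


def sustained_saturation(samples, threshold=POOL_RATIO_THRESHOLD, min_consecutive=3):
    """True when the ratio stays at or above threshold for min_consecutive samples in a row."""
    streak = 0
    for sample in samples:
        streak = streak + 1 if pool_ratio(sample) >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False
```

Requiring consecutive samples keeps a single burst from triggering a scale-up; the window length is a tuning choice.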

Scale Procedure

1. Identify the bottleneck from metrics and logs.
2. Scale stateless service replicas first when storage is healthy.
3. Increase Postgres and Redis capacity before raising service pools.
4. Verify readiness, lag, replay backlog, and DLQ after each change.
5. Document the new limit and alert threshold.
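
The verification step can be sketched as a comparison of two snapshots taken before and after a change. The `Snapshot` fields and the `regressions` helper are hypothetical names for illustration; how each value is collected depends on your metrics stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical capture of the four values checked after each change."""
    ready: bool
    consumer_lag: int
    replay_backlog_age_s: float
    dlq_depth: int


def regressions(before, after):
    """List the checks that got worse after a change; an empty list means proceed."""
    problems = []
    if not after.ready:
        problems.append("readiness")
    if after.consumer_lag > before.consumer_lag:
        problems.append("consumer_lag")
    if after.replay_backlog_age_s > before.replay_backlog_age_s:
        problems.append("replay_backlog_age")
    if after.dlq_depth > before.dlq_depth:
        problems.append("dlq_depth")
    return problems


before = Snapshot(ready=True, consumer_lag=100, replay_backlog_age_s=5.0, dlq_depth=0)
after = Snapshot(ready=True, consumer_lag=40, replay_backlog_age_s=2.0, dlq_depth=3)
print(regressions(before, after))  # ['dlq_depth']
```

Any non-empty result is a reason to stop and investigate before applying the next change, which keeps each step attributable to one adjustment.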

Worked Escalation Patterns

Signal	First action	Expected outcome
Postgres pool ratio stays near `0.9`	Inspect slow queries, then raise storage capacity or service pool size one step.	`/ready` stabilizes and pool saturation falls before replicas increase further.
Gateway STS circuit opens	Check STS readiness and exchange latency, then scale STS or reduce Gateway fan-in.	Gateway stops fast-failing protected requests and audit evidence resumes.
Audit consumer lag grows	Check Postgres write IOPS, partition health, and Audit replicas before extending retention or exports.	Redis pending entries and DLQ growth stop increasing.
Revocation propagation lags	Check Redis latency, stream consumers, and Gateway snapshot freshness.	Gateway denial decisions reflect current revocation state.
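
If alerts are routed programmatically, the patterns above can be kept as a small lookup so each alert carries its first action. The signal keys below are hypothetical names; the action text is copied from the table, and unknown signals fall back to the first step of the scale procedure.

```python
# Hypothetical signal keys mapped to the first actions from the table above.
ESCALATION_FIRST_ACTION = {
    "postgres_pool_ratio_high": (
        "Inspect slow queries, then raise storage capacity or service pool size one step."
    ),
    "gateway_sts_circuit_open": (
        "Check STS readiness and exchange latency, then scale STS or reduce Gateway fan-in."
    ),
    "audit_consumer_lag_growing": (
        "Check Postgres write IOPS, partition health, and Audit replicas "
        "before extending retention or exports."
    ),
    "revocation_propagation_lagging": (
        "Check Redis latency, stream consumers, and Gateway snapshot freshness."
    ),
}

# Fallback is step one of the scale procedure.
DEFAULT_ACTION = "Identify the bottleneck from metrics and logs."


def first_action(signal):
    """Return the first action for a known signal, or the generic fallback."""
    return ESCALATION_FIRST_ACTION.get(signal, DEFAULT_ACTION)
```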

Troubleshooting

Symptom	First check
Higher replicas make failures worse	Postgres connection pressure or Redis latency.
Gateway latency spikes	STS exchange latency, upstream timeout, JWKS cache, and private upstream egress.
Audit cannot catch up	Postgres write IOPS, partition health, Audit replicas, and Redis pending entries.

Next Step

Use Monitor Health and Metrics to turn capacity signals into readiness gates and operator dashboards.