Scale Capacity

Scale Caracal around three bottlenecks: token exchange and policy evaluation in STS, upstream proxying in Gateway, and durable writes in Postgres/Audit.

| Component | Primary levers |
| --- | --- |
| API | Replicas, DB_POOL_MAX, outbox batch/interval settings, request rate limits. |
| STS | Replicas, OPA policy age, Redis invalidation health, MAX_GRANT_TTL_SECONDS, replay persistence. |
| Gateway | Replicas, MAX_REQUEST_BYTES, STS_TIMEOUT, UPSTREAM_TIMEOUT, upstream allowlist, revocation snapshot health. |
| Audit | Replicas, Postgres write capacity, DLQ thresholds, retention, S3 export settings. |
| Coordinator | Replicas, DB pool, sweeper intervals, outbox batch, service-agent leases. |
| Postgres | Connection limits, indexes, partitions, storage IOPS, backup windows. |
| Redis | Stream memory, pending entries, consumer lag, persistence, network latency. |

The chart defaults to two replicas each for API, STS, Gateway, Audit, and Coordinator. Gateway has a higher HPA ceiling because all protected traffic passes through it.

| Service | Default port | Default max HPA replicas |
| --- | --- | --- |
| API | 3000 | 8 |
| STS | 8080 | 8 |
| Gateway | 8081 | 16 |
| Audit | 9090 | 8 |
| Coordinator | 4000 | 8 |
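
Because every replica can hold its own database pool, the HPA ceilings above imply a worst-case connection count that Postgres must be able to accept. The following is a minimal sketch of that arithmetic, not a Caracal tool. The replica ceilings come from the table above; the per-pod pool sizes, the choice of which services hold a Postgres pool, and the `reserved` headroom are all illustrative assumptions that should be replaced with the values from your own deployment.

```python
# Default max HPA replicas, taken from the table above.
MAX_HPA_REPLICAS = {"api": 8, "sts": 8, "gateway": 16, "audit": 8, "coordinator": 8}

# ASSUMPTION: per-pod pool sizes are illustrative, not chart defaults.
# A service assumed to hold no Postgres pool is given 0.
ASSUMED_POOL_MAX = {"api": 10, "sts": 5, "gateway": 0, "audit": 10, "coordinator": 5}


def worst_case_connections(replicas, pool_max):
    """Connections Postgres must accept if every pod fills its pool."""
    return sum(count * pool_max.get(name, 0) for name, count in replicas.items())


def fits(replicas, pool_max, max_connections, reserved=10):
    """True if the worst case still leaves `reserved` connections free.

    ASSUMPTION: `reserved` is hypothetical headroom for admin and backup sessions.
    """
    return worst_case_connections(replicas, pool_max) <= max_connections - reserved


if __name__ == "__main__":
    total = worst_case_connections(MAX_HPA_REPLICAS, ASSUMED_POOL_MAX)
    print(f"worst-case connections at max replicas: {total}")
    print(f"fits in a 200-connection limit: {fits(MAX_HPA_REPLICAS, ASSUMED_POOL_MAX, 200)}")
```

Running the same calculation before raising a replica ceiling or a pool size shows whether the change pushes the worst case past the Postgres connection limit.
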

| Signal | Meaning |
| --- | --- |
| Postgres pool ratio near 0.9 | Service pool saturation; inspect queries and pool size. |
| Audit consumer lag | Audit ingestion cannot keep up with Redis stream input. |
| Audit replay backlog age | STS/Gateway cannot emit audit events to Redis/Audit promptly. |
| Gateway STS circuit open | Gateway is fast-failing because STS exchange is unhealthy. |
| Revocation propagation lag | Access-safety state is not reaching Gateway within the expected window. |
| Readiness flapping | Pods or dependencies are unstable under current load. |

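
Two of these signals reduce to simple calculations over sampled data. The sketch below shows one possible way to evaluate them; it is not how Caracal computes its metrics. The 0.9 ratio comes from the table above, while `PoolSample`, `min_consecutive`, and the function names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass

POOL_SATURATION_RATIO = 0.9  # threshold from the signal table above


@dataclass
class PoolSample:
    """ASSUMPTION: hypothetical shape for one pool measurement."""
    in_use: int
    pool_max: int


def pool_ratio(sample):
    """Fraction of the pool currently checked out."""
    if sample.pool_max <= 0:
        raise ValueError("pool_max must be positive")
    return sample.in_use / sample.pool_max


def sustained_saturation(samples, threshold=POOL_SATURATION_RATIO, min_consecutive=3):
    """True if the ratio is at or above `threshold` for `min_consecutive` samples in a row.

    ASSUMPTION: `min_consecutive` is an illustrative way to ignore brief spikes.
    """
    streak = 0
    for sample in samples:
        streak = streak + 1 if pool_ratio(sample) >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False


def readiness_flaps(states):
    """Count ready/not-ready transitions in a time-ordered list of booleans."""
    return sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
```

Requiring several consecutive samples keeps a single busy second from being read as saturation, and counting transitions rather than failures separates a flapping pod from one that is simply down.
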
  1. Identify the bottleneck from metrics and logs.
  2. Scale stateless service replicas first when storage is healthy.
  3. Increase Postgres and Redis capacity before raising service pools.
  4. Verify readiness, lag, replay backlog, and DLQ after each change.
  5. Document the new limit and alert threshold.
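
The ordering in steps 2 to 4 can be sketched as a small decision helper. This is an illustration of the procedure only, under stated assumptions: the `Checks` fields, the set of services treated as stateless, and the returned action strings are hypothetical and do not correspond to any Caracal API or metric name.

```python
from dataclasses import dataclass

# ASSUMPTION: the services the chart scales by replica count are treated as stateless.
STATELESS_SERVICES = {"api", "sts", "gateway", "audit", "coordinator"}


@dataclass
class Checks:
    """ASSUMPTION: hypothetical snapshot of the step 4 verification signals."""
    ready: bool
    consumer_lag: int
    replay_backlog_age_s: float
    dlq_depth: int


def next_action(bottleneck, storage_healthy):
    """Pick the next change for an identified bottleneck (steps 2 and 3)."""
    if not storage_healthy:
        # Step 3: storage capacity comes before larger service pools or more replicas.
        return "increase Postgres/Redis capacity"
    if bottleneck in STATELESS_SERVICES:
        # Step 2: storage is healthy, so stateless replicas are the first lever.
        return f"add {bottleneck} replicas"
    return "investigate further"


def regressions(before, after):
    """Step 4: names of checks that fail or got worse after a change."""
    problems = []
    if not after.ready:
        problems.append("ready")
    for name in ("consumer_lag", "replay_backlog_age_s", "dlq_depth"):
        if getattr(after, name) > getattr(before, name):
            problems.append(name)
    return problems
```

An empty result from `regressions` after a change is the point at which step 5 applies: record the new limit and adjust the alert threshold.
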

| Signal | First action | Expected outcome |
| --- | --- | --- |
| Postgres pool ratio stays near 0.9 | Inspect slow queries, then raise storage capacity or service pool size one step. | /ready stabilizes and pool saturation falls before replicas increase further. |
| Gateway STS circuit opens | Check STS readiness and exchange latency, then scale STS or reduce Gateway fan-in. | Gateway stops fast-failing protected requests and audit evidence resumes. |
| Audit consumer lag grows | Check Postgres write IOPS, partition health, and Audit replicas before extending retention or exports. | Redis pending entries and DLQ growth stop increasing. |
| Revocation propagation lags | Check Redis latency, stream consumers, and Gateway snapshot freshness. | Gateway denial decisions reflect current revocation state. |

| Symptom | First check |
| --- | --- |
| Higher replicas make failures worse | Postgres connection pressure or Redis latency. |
| Gateway latency spikes | STS exchange latency, upstream timeout, JWKS cache, and private upstream egress. |
| Audit cannot catch up | Postgres write IOPS, partition health, Audit replicas, and Redis pending entries. |

Use Monitor Health and Metrics to turn capacity signals into readiness gates and operator dashboards.