Key Management and Rotation

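Caracal protects zone signing with two categories of key material: the Zone Key Encryption Key (ZONE_KEK), which is an operator-supplied secret, and per-zone ES256 signing keys, which are generated by the STS and stored encrypted in the database. Two further operator-supplied secrets, AUDIT_HMAC_KEY and STREAMS_HMAC_KEY, protect audit events and Redis stream messages.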

ZONE_KEK is a 32-byte hex-encoded key encryption key supplied via environment variable to the STS and API. It is the root secret that protects all zone signing key material at rest.

Constraints enforced at startup:

  • Must be exactly 32 bytes when decoded from hex (64 hex characters).
  • Must not be all zeros. The all-zeros check prevents use of an accidentally unset or placeholder value that could make database snapshots trivially decryptable.
  • Must be identical across all STS and API replicas.
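The first two startup constraints can be sketched as a small validation function. This is an illustration of the checks described above, not Caracal's actual code; the name `parse_zone_kek` and the error messages are hypothetical.

```python
import binascii


def parse_zone_kek(hex_value: str) -> bytes:
    """Validate a hex-encoded ZONE_KEK and return the 32 raw key bytes.

    Hypothetical helper: name and error messages are illustrative only.
    """
    try:
        key = binascii.unhexlify(hex_value)
    except ValueError as exc:  # binascii.Error is a subclass of ValueError
        raise ValueError("ZONE_KEK must be valid hex") from exc
    if len(key) != 32:
        raise ValueError("ZONE_KEK must decode to exactly 32 bytes (64 hex characters)")
    if key == bytes(32):
        raise ValueError("ZONE_KEK must not be all zeros")
    return key
```

The third constraint (identical value on every replica) cannot be checked by a single process and has to be enforced by the deployment.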

Where it is used:

  • STS: Decrypts zone signing keys from the secrets table on demand.
  • API: Encrypts new signing keys when a zone is created or rotated.

ZONE_KEK_PROVIDER defaults to local, meaning the key is read directly from the environment variable. This is the only supported provider in the OSS release.


Each zone has its own ECDSA P-256 signing key used to sign ES256 JWTs (mandates). The key is stored in the secrets table as a PEM-encoded private key encrypted with ChaCha20-Poly1305 using ZONE_KEK as the encryption key and a randomly generated nonce stored alongside the ciphertext.

Encryption at rest:

PEM private key bytes
→ ChaCha20-Poly1305 encrypt (key=ZONE_KEK, nonce=random 12 bytes)
→ ciphertext + nonce stored in secrets table

Decryption path (per-request in STS):

  1. STS checks its in-memory KeyCache for the zone’s key (15-minute TTL).
  2. On cache miss, queries GetZoneSigningKeySecret(zone_id) from Postgres.
  3. Decrypts ciphertext with ChaCha20-Poly1305.Open(ZONE_KEK, nonce, ciphertext).
  4. Parses the resulting PEM bytes as an ECDSA private key.
  5. Stores the key in the cache for 15 minutes.
  6. Uses the key to sign the JWT.

With a 15-minute TTL, each STS replica loads a given zone’s key from Postgres at most 4 times per hour under normal operation, that is, when no invalidation event forces an earlier reload.

Key invalidation:

When a zone’s signing key is rotated, the API publishes an event to the caracal.keys.invalidate stream. The STS consumer group (sts-keys) processes this event and calls KeyCache.Invalidate(zoneID), forcing a fresh load from Postgres on the next request. This ensures all replicas pick up rotated keys within seconds.
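The interplay between the TTL and explicit invalidation can be sketched as below. This is a simplified model, not the STS implementation: the class mirrors the KeyCache name from the text, but the `loader` and `clock` parameters and the Python method names are assumptions made so the example is self-contained.

```python
import threading
import time
from typing import Callable, Dict, Tuple


class KeyCache:
    """Per-zone key cache with a fixed TTL (illustrative sketch)."""

    def __init__(self, loader: Callable[[str], bytes],
                 ttl_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader  # hypothetical: fetches and decrypts the key on a miss
        self._ttl = ttl_seconds
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, zone_id: str) -> bytes:
        # The lock is held across the load for simplicity; this serializes
        # cache misses but keeps invalidate() from racing with a store.
        with self._lock:
            now = self._clock()
            entry = self._entries.get(zone_id)
            if entry is not None and now < entry[0]:
                return entry[1]
            key = self._loader(zone_id)
            self._entries[zone_id] = (now + self._ttl, key)
            return key

    def invalidate(self, zone_id: str) -> None:
        """Drop the cached key so the next get() reloads it."""
        with self._lock:
            self._entries.pop(zone_id, None)
```

A consumer of the invalidation stream would call `invalidate(zone_id)` for each event it receives; the next signing request for that zone then takes the slow path.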


The STS serves the zone’s public signing keys at:

GET {ISSUER_URL}/.well-known/jwks.json?zone_id={zone_id}

zone_id is mandatory. The STS never serves all zones’ keys in a single document.

Response:

{
  "keys": [
    {
      "kty": "EC",
      "crv": "P-256",
      "x": "<base64url>",
      "y": "<base64url>",
      "kid": "<secret-id>",
      "alg": "ES256",
      "use": "sig"
    }
  ]
}

The STS returns up to two keys: the current active key and the previous key (if one exists). After a key rotation, both keys appear in the JWKS document for a 24-hour grace period. This lets verifiers keep validating tokens that were signed with the previous key before the rotation, even if they refetch the JWKS after it.
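Because the document can hold two keys, a verifier has to pick the key whose kid matches the token header rather than taking the first entry. The sketch below shows that lookup and the decoding of the P-256 coordinates using only the standard library. The function names are hypothetical, and signature verification itself is left to a JWT library.

```python
import base64
from typing import Any, Dict, Optional, Tuple


def b64url_decode(value: str) -> bytes:
    """Decode unpadded base64url, the encoding used for JWK members."""
    return base64.urlsafe_b64decode(value + "=" * (-len(value) % 4))


def find_p256_key(jwks: Dict[str, Any], kid: str) -> Optional[Tuple[int, int]]:
    """Return the (x, y) public point for `kid`, or None if no usable key matches."""
    for jwk in jwks.get("keys", []):
        if jwk.get("kid") != kid:
            continue
        if jwk.get("kty") != "EC" or jwk.get("crv") != "P-256":
            return None
        x, y = b64url_decode(jwk["x"]), b64url_decode(jwk["y"])
        # P-256 coordinates are always 32 bytes each.
        if len(x) != 32 or len(y) != 32:
            return None
        return int.from_bytes(x, "big"), int.from_bytes(y, "big")
    return None
```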

Cache-Control header: public, max-age=300, must-revalidate (5-minute client-side cache).

Consuming packages (@caracalai/identity, caracalai_identity, github.com/garudex-labs/caracal/identity) cache JWKS responses per issuer for 5 minutes with stale-while-revalidate on fetch errors.
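The client-side caching behaviour described above (5-minute freshness, falling back to the stale copy when a refetch fails) can be modelled as follows. This is a sketch of the general pattern, not the code of the consuming packages; `JwksCache`, `fetch`, and `clock` are hypothetical names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class JwksCache:
    """Caches one JWKS document per issuer, serving stale data on fetch errors."""

    def __init__(self, fetch: Callable[[str], Dict[str, Any]],
                 ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # hypothetical: performs the HTTP GET and parses JSON
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Dict[str, Any]]] = {}

    def get(self, issuer: str) -> Dict[str, Any]:
        now = self._clock()
        entry = self._entries.get(issuer)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh
        try:
            jwks = self._fetch(issuer)
        except Exception:
            if entry is not None:
                return entry[1]  # refetch failed: keep using the stale copy
            raise  # nothing cached yet, so the error must surface
        self._entries[issuer] = (now, jwks)
        return jwks
```

A failed refetch does not update the stored timestamp, so the next call tries the network again instead of treating the stale copy as fresh.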


1. Generate a new signing key for the zone:

caracal zone rotate-key --zone <zone-id>

This creates a new ES256 key pair, stores it encrypted in the secrets table (as version N+1), and publishes an invalidation event to caracal.keys.invalidate.

2. Verify the new key appears in JWKS:

curl "http://localhost:8080/.well-known/jwks.json?zone_id=<zone-id>" | jq '.keys | length'

During the grace period, this returns 2. After 24 hours, the old key ages out and the response returns 1.

3. Monitor for validation errors:

Watch the STS jwks_invalid_keys metric. A non-zero value indicates that a key is present in the JWKS but fails validation, which should not happen during normal rotation.

4. After the grace period:

No action required. The STS automatically stops serving the old key after 24 hours. The JWKS never contains more than the two most recent key versions; older versions are excluded.
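One way to model which key versions end up in the JWKS is sketched below. It assumes the 24-hour grace period is measured from the creation time of the newest key version; the text does not state the reference point, so treat that as an assumption. The `KeyVersion` type and `keys_to_serve` function are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List

GRACE_PERIOD = timedelta(hours=24)


@dataclass(frozen=True)
class KeyVersion:
    kid: str
    version: int
    created_at: datetime  # assumed to be the moment of rotation


def keys_to_serve(versions: List[KeyVersion], now: datetime) -> List[KeyVersion]:
    """Return the current key, plus the previous one while the grace period lasts."""
    ordered = sorted(versions, key=lambda v: v.version, reverse=True)
    if not ordered:
        return []
    current = ordered[0]
    served = [current]
    # The previous version is served only within 24 hours of the rotation;
    # anything older than the previous version is never served.
    if len(ordered) > 1 and now - current.created_at < GRACE_PERIOD:
        served.append(ordered[1])
    return served
```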


Rotating ZONE_KEK requires re-encrypting all zone signing keys in the database with the new key before the old key is removed. This is a sensitive operation with a risk window.

Procedure:

  1. Generate a new 32-byte hex secret: openssl rand -hex 32.
  2. Run the re-encryption migration (a custom script or admin operation that reads each row from secrets, decrypts with the old KEK, and re-encrypts with the new KEK within a single transaction).
  3. Update ZONE_KEK in the environment on all STS and API replicas in a rolling restart.
  4. Verify services are healthy (GET /ready returns 200 on all replicas).
  5. Revoke and delete the old ZONE_KEK from your secrets manager.

Do not leave replicas in a mixed state, with some using the new key and others the old one: replicas whose ZONE_KEK does not match the key used to encrypt the database rows fail to decrypt signing keys.
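The all-or-nothing property of step 2 can be illustrated with a short sketch. It uses SQLite from the standard library as a stand-in for Postgres, and the table layout (`secrets` with `id` and `ciphertext` columns, nonce assumed to be packed into the blob) is hypothetical. The two callables stand for decryption under the old KEK and encryption under the new KEK; a real script would implement them with ChaCha20-Poly1305 and a fresh random nonce per row.

```python
import sqlite3
from typing import Callable

Cipher = Callable[[bytes], bytes]


def reencrypt_secrets(conn: sqlite3.Connection,
                      decrypt_old: Cipher,
                      encrypt_new: Cipher) -> int:
    """Re-encrypt every row of `secrets`; either all rows change or none do."""
    rows = conn.execute("SELECT id, ciphertext FROM secrets").fetchall()
    with conn:  # commits on success, rolls back if any row raises
        for secret_id, ciphertext in rows:
            plaintext = decrypt_old(ciphertext)
            conn.execute(
                "UPDATE secrets SET ciphertext = ? WHERE id = ?",
                (encrypt_new(plaintext), secret_id),
            )
    return len(rows)
```

If decryption of any row fails, for example because the supplied old KEK is wrong, the exception propagates and every earlier update in the run is rolled back, so the table never holds a mix of old-KEK and new-KEK ciphertexts.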


AUDIT_HMAC_KEY is a 32-byte hex-encoded HMAC key used to sign audit event chain entries. It is required by the STS (for publishing events) and the Audit service (for verifying chain integrity).

Unlike ZONE_KEK, there is no in-flight re-keying path for AUDIT_HMAC_KEY. Rotating it requires accepting that historical chain verification will fail for events signed with the old key. Treat it as a long-lived secret and rotate it only when compromise is suspected.
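To see why rotation breaks historical verification, consider a generic HMAC chain in which each entry's tag covers the previous tag plus the entry's payload. The construction below (HMAC-SHA256, a 32-byte zero genesis value, tag over `prev_mac + payload`) is an assumption chosen for illustration; the text does not specify Caracal's actual chain format.

```python
import hashlib
import hmac
from typing import List, Sequence, Tuple

GENESIS = bytes(32)  # assumed starting value for the first entry


def chain_mac(key: bytes, prev_mac: bytes, payload: bytes) -> bytes:
    """Tag one entry; prev_mac has a fixed length, so the concatenation is unambiguous."""
    return hmac.new(key, prev_mac + payload, hashlib.sha256).digest()


def sign_chain(key: bytes, payloads: Sequence[bytes]) -> List[Tuple[bytes, bytes]]:
    """Return (payload, mac) entries, each linked to the one before it."""
    entries, prev = [], GENESIS
    for payload in payloads:
        prev = chain_mac(key, prev, payload)
        entries.append((payload, prev))
    return entries


def first_invalid(key: bytes, entries: Sequence[Tuple[bytes, bytes]]) -> int:
    """Return the index of the first entry that fails verification, or -1."""
    prev = GENESIS
    for index, (payload, mac) in enumerate(entries):
        if not hmac.compare_digest(chain_mac(key, prev, payload), mac):
            return index
        prev = mac
    return -1
```

Entries signed under the old key verify only with the old key. A verifier that holds just the new key rejects the chain at its first entry, which is why the old key would have to be retained for any historical verification.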


STREAMS_HMAC_KEY is a 32-byte hex-encoded HMAC key used to sign Redis stream messages (the _sig field). All services that publish or consume streams should share the same key. It is optional but strongly recommended in production.

Rotating STREAMS_HMAC_KEY:

  1. Update the key in the environment for all services in the same rollout (rolling restart).
  2. During the brief window when different replicas use different keys, some messages will fail signature verification and be skipped. Skipped messages are acknowledged without being processed, so they are dropped rather than retried.
  3. After all replicas have restarted with the new key, normal operation resumes.
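The verification failure during the rollout window follows directly from how a shared-key signature works. The sketch below signs a message's fields with HMAC-SHA256 and stores the result in the _sig field. The canonicalization (sorted, compact JSON over all fields except _sig) and the hex encoding are assumptions for illustration; the text does not describe the real wire format.

```python
import hashlib
import hmac
import json
from typing import Dict


def _canonical(fields: Dict[str, str]) -> bytes:
    # Assumed canonical form: compact JSON with sorted keys, excluding _sig.
    body = {k: v for k, v in fields.items() if k != "_sig"}
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode()


def sign_message(key: bytes, fields: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of `fields` with a hex HMAC-SHA256 tag in `_sig`."""
    sig = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    signed = {k: v for k, v in fields.items() if k != "_sig"}
    signed["_sig"] = sig
    return signed


def verify_message(key: bytes, fields: Dict[str, str]) -> bool:
    """Check `_sig` in constant time; a missing or wrong tag fails."""
    sig = fields.get("_sig")
    if sig is None:
        return False
    expected = hmac.new(key, _canonical(fields), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), sig.encode())
```

A message signed by a publisher that still holds the old key fails `verify_message` on a consumer that already holds the new key, and the reverse also holds, which is the mismatch described in step 2.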