docs/adr/0006-postgres-plus-redis-data-split.md

# 0006 — PostgreSQL for durable state, Redis for ephemeral state

- **Status:** Accepted
- **Date:** 2026-05-06

## Context and problem statement

The system has two distinct data access patterns:

- **Durable state** — users, survivors, mission history, mission logs. Must survive crashes, restarts, and deploys. Read patterns are infrequent (mostly on session start) but writes need full ACID guarantees.
- **Ephemeral state** — active mission state, mission lobbies, tick locks, `nextTickAt` timestamps, rate limit counters. Read and written every poll cycle (~5 seconds), can be reconstructed from durable state if lost, must be very fast.

What datastore strategy supports both?

## Decision drivers

- **Performance for hot path.** Tick processing reads and writes mission state every 5 seconds per mission. A traditional SQL roundtrip per access is wasteful.
- **Durability for cold path.** Mission history, logs, and user records must survive any failure mode.
- **Operational complexity.** Each datastore added is another system to monitor, back up, and reason about during incidents.
- **Crash recovery.** Ephemeral state should be recoverable from durable state, so loss of the ephemeral store is degraded-but-not-broken.
- **Atomic operations.** Tick scheduling needs `SET NX` semantics for distributed locks, which Postgres can simulate but Redis does natively.

## Considered options

1. **Postgres only, with `pg_notify` and advisory locks.** Single datastore, all state durable, ephemeral access via in-memory cache.
2. **Redis only, with periodic snapshot to disk.** Single datastore, ephemeral by nature, durability via Redis persistence (RDB/AOF).
3. **Postgres + Redis split.** Each datastore plays to its strengths.
4. **Postgres + in-memory state in the API process.** No second datastore, but loses state on restart and doesn't support multi-instance.

## Decision outcome

**Chosen: PostgreSQL for durable state, Redis for ephemeral state.**

Postgres tables (durable):

- `users` — internal ID, Twitch opaque user ID, created at.
- `survivors` — FK to user, stats, perk slots, current lifecycle state.
- `missions` — FK to survivor or group, difficulty, status, timestamps.
- `mission_logs` — FK to mission, tick index, encounter ID, rendered text, seed, modifiers applied.

Redis keys (ephemeral):

- `active_mission:{missionId}` — JSON snapshot of in-progress mission state.
- `mission_lobby:{lobbyId}` — lobby member list and ready flags.
- `tick_lock:{missionId}` — distributed lock (see ADR-0005).
- `rate_limit:{userId}:{endpoint}` — rate limiting counters.

Rule: anything in Redis must be reconstructable from Postgres. Loss of Redis means active missions resume from their last persisted tick on next worker poll, after a brief delay.

## Consequences

### Positive

- **Hot-path performance.** Tick processing operates against Redis with sub-millisecond latency, only writing to Postgres at end-of-tick (the durable log entry).
- **Native primitives where useful.** `SET NX PX` for locks, `SETEX` for TTLs, sorted sets for "missions due" queries — all clean in Redis, awkward in SQL.
- **Failure isolation.** A Postgres slowdown doesn't immediately stop tick processing (Redis state continues); a Redis outage doesn't lose mission history (Postgres persists).
- **Familiar operational tooling.** Both Postgres and Redis have decades of operational maturity.

### Negative

- **Two datastores to operate.** Backups, monitoring, capacity planning, security hardening multiplied by two.
- **Consistency boundary.** Redis can drift from Postgres if the API crashes between Redis write and Postgres write. Mitigated by treating Postgres as authoritative on every cold-start reconciliation.
- **Schema discipline.** The "what lives where" rule must be documented and respected — accidentally putting durable data only in Redis is a class of bug that's invisible until something restarts.

### Neutral

- This split is a common pattern in real-time systems and is well-understood. Hiring or onboarding contributors with experience in either or both is straightforward.
- We deliberately avoid more exotic stores (event sourcing, time-series databases, document stores) until the data model demonstrably needs them.

## Implementation notes

- `mission_logs` rows are append-only; never updated after creation. This makes them trivially safe under concurrent writes and supports replay/debug.
- Plan retention/archival from day one — `mission_logs` will grow fast. Default: partition by month, archive partitions older than N months to cold storage.
- Consider Redis ACL setup before production deploy. Local dev runs without auth; production must not.
Add Zod dependency and update API interfaces - Added Zod as a dependency in package.json. - Updated pnpm-lock.yaml to include Zod. - Refactored API interfaces: exported new modules for perk, survivor, mission, and encounter. - Removed obsolete api-interfaces.ts file. - Enhanced tests for new schemas in api-interfaces.spec.ts, covering various validation scenarios. 2026-05-07 00:46:03 +00:00			`# 0006 — PostgreSQL for durable state, Redis for ephemeral state`

			`- Status: Accepted`
			`- Date: 2026-05-06`

			`## Context and problem statement`

			`The system has two distinct data access patterns:`

			`- Durable state — users, survivors, mission history, mission logs. Must survive crashes, restarts, and deploys. Read patterns are infrequent (mostly on session start) but writes need full ACID guarantees.`
			- Ephemeral state — active mission state, mission lobbies, tick locks, `nextTickAt` timestamps, rate limit counters. Read and written every poll cycle (~5 seconds), can be reconstructed from durable state if lost, must be very fast.

			`What datastore strategy supports both?`

			`## Decision drivers`

			`- Performance for hot path. Tick processing reads and writes mission state every 5 seconds per mission. A traditional SQL roundtrip per access is wasteful.`
			`- Durability for cold path. Mission history, logs, and user records must survive any failure mode.`
			`- Operational complexity. Each datastore added is another system to monitor, back up, and reason about during incidents.`
			`- Crash recovery. Ephemeral state should be recoverable from durable state, so loss of the ephemeral store is degraded-but-not-broken.`
			- Atomic operations. Tick scheduling needs `SET NX` semantics for distributed locks, which Postgres can simulate but Redis does natively.

			`## Considered options`

			1. Postgres only, with `pg_notify` and advisory locks. Single datastore, all state durable, ephemeral access via in-memory cache.
			`2. Redis only, with periodic snapshot to disk. Single datastore, ephemeral by nature, durability via Redis persistence (RDB/AOF).`
			`3. Postgres + Redis split. Each datastore plays to its strengths.`
			`4. Postgres + in-memory state in the API process. No second datastore, but loses state on restart and doesn't support multi-instance.`

			`## Decision outcome`

			`Chosen: PostgreSQL for durable state, Redis for ephemeral state.`

			`Postgres tables (durable):`

			- `users` — internal ID, Twitch opaque user ID, created at.
			- `survivors` — FK to user, stats, perk slots, current lifecycle state.
			- `missions` — FK to survivor or group, difficulty, status, timestamps.
			- `mission_logs` — FK to mission, tick index, encounter ID, rendered text, seed, modifiers applied.

			`Redis keys (ephemeral):`

			- `active_mission:{missionId}` — JSON snapshot of in-progress mission state.
			- `mission_lobby:{lobbyId}` — lobby member list and ready flags.
			- `tick_lock:{missionId}` — distributed lock (see ADR-0005).
			- `rate_limit:{userId}:{endpoint}` — rate limiting counters.

			`Rule: anything in Redis must be reconstructable from Postgres. Loss of Redis means active missions resume from their last persisted tick on next worker poll, after a brief delay.`

			`## Consequences`

			`### Positive`

			`- Hot-path performance. Tick processing operates against Redis with sub-millisecond latency, only writing to Postgres at end-of-tick (the durable log entry).`
			- Native primitives where useful. `SET NX PX` for locks, `SETEX` for TTLs, sorted sets for "missions due" queries — all clean in Redis, awkward in SQL.
			`- Failure isolation. A Postgres slowdown doesn't immediately stop tick processing (Redis state continues); a Redis outage doesn't lose mission history (Postgres persists).`
			`- Familiar operational tooling. Both Postgres and Redis have decades of operational maturity.`

			`### Negative`

			`- Two datastores to operate. Backups, monitoring, capacity planning, security hardening multiplied by two.`
			`- Consistency boundary. Redis can drift from Postgres if the API crashes between Redis write and Postgres write. Mitigated by treating Postgres as authoritative on every cold-start reconciliation.`
			`- Schema discipline. The "what lives where" rule must be documented and respected — accidentally putting durable data only in Redis is a class of bug that's invisible until something restarts.`

			`### Neutral`

			`- This split is a common pattern in real-time systems and is well-understood. Hiring or onboarding contributors with experience in either or both is straightforward.`
			`- We deliberately avoid more exotic stores (event sourcing, time-series databases, document stores) until the data model demonstrably needs them.`

			`## Implementation notes`

			- `mission_logs` rows are append-only; never updated after creation. This makes them trivially safe under concurrent writes and supports replay/debug.
			- Plan retention/archival from day one — `mission_logs` will grow fast. Default: partition by month, archive partitions older than N months to cold storage.
			`- Consider Redis ACL setup before production deploy. Local dev runs without auth; production must not.`