0006 — PostgreSQL for durable state, Redis for ephemeral state

  • Status: Accepted
  • Date: 2026-05-06

Context and problem statement

The system has two distinct data access patterns:

  • Durable state — users, survivors, mission history, mission logs. Must survive crashes, restarts, and deploys. Read patterns are infrequent (mostly on session start) but writes need full ACID guarantees.
  • Ephemeral state — active mission state, mission lobbies, tick locks, nextTickAt timestamps, rate limit counters. Read and written every poll cycle (~5 seconds), can be reconstructed from durable state if lost, must be very fast.

What datastore strategy supports both?

Decision drivers

  • Performance for hot path. Tick processing reads and writes mission state every 5 seconds per mission. Paying a SQL round trip for every access adds latency and database load for state that does not need durability.
  • Durability for cold path. Mission history, logs, and user records must survive any failure mode.
  • Operational complexity. Each datastore added is another system to monitor, back up, and reason about during incidents.
  • Crash recovery. Ephemeral state should be recoverable from durable state, so loss of the ephemeral store is degraded-but-not-broken.
  • Atomic operations. Tick scheduling needs SET NX semantics for distributed locks, which Postgres can approximate (for example with advisory locks) but Redis provides natively.

Considered options

  1. Postgres only, with pg_notify and advisory locks. Single datastore, all state durable, ephemeral access via in-memory cache.
  2. Redis only, with periodic snapshot to disk. Single datastore, ephemeral by nature, durability via Redis persistence (RDB/AOF).
  3. Postgres + Redis split. Each datastore plays to its strengths.
  4. Postgres + in-memory state in the API process. No second datastore, but state is lost on restart and multi-instance deployments are not supported.

Decision outcome

Chosen: PostgreSQL for durable state, Redis for ephemeral state.

Postgres tables (durable):

  • users — internal ID, Twitch opaque user ID, created at.
  • survivors — FK to user, stats, perk slots, current lifecycle state.
  • missions — FK to survivor or group, difficulty, status, timestamps.
  • mission_logs — FK to mission, tick index, encounter ID, rendered text, seed, modifiers applied.
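
The four durable tables above could be mirrored in application code roughly as follows. This is only a sketch: the field names, the types, the status values, and the "exactly one owner" reading of "FK to survivor or group" are assumptions made for illustration, not the project's actual schema.

```typescript
// Illustrative row shapes only. Field names, types, and the status values
// are assumptions; the ADR lists the columns but not a concrete schema.
type MissionStatus = "pending" | "active" | "completed" | "failed";

interface UserRow {
  id: string;           // internal ID
  twitchUserId: string; // Twitch opaque user ID
  createdAt: Date;
}

interface SurvivorRow {
  id: string;
  userId: string;                // FK to users
  stats: Record<string, number>;
  perkSlots: (string | null)[];  // null marks an empty slot (assumed)
  lifecycleState: string;
}

interface MissionRow {
  id: string;
  survivorId: string | null; // assumed: exactly one of survivorId / groupId is set
  groupId: string | null;
  difficulty: number;
  status: MissionStatus;
  startedAt: Date;
  endedAt: Date | null;
}

interface MissionLogRow {
  missionId: string; // FK to missions
  tickIndex: number;
  encounterId: string;
  renderedText: string;
  seed: number;
  modifiersApplied: string[];
}

// A sample log row, as it might look after the first tick of a mission.
const exampleLog: MissionLogRow = {
  missionId: "m1",
  tickIndex: 0,
  encounterId: "enc-1",
  renderedText: "The survivor found a supply cache.",
  seed: 12345,
  modifiersApplied: [],
};
```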

Redis keys (ephemeral):

  • active_mission:{missionId} — JSON snapshot of in-progress mission state.
  • mission_lobby:{lobbyId} — lobby member list and ready flags.
  • tick_lock:{missionId} — distributed lock (see ADR-0005).
  • rate_limit:{userId}:{endpoint} — rate limiting counters.
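
One way to keep these key patterns consistent across the codebase is to build them in a single place. The sketch below assumes string IDs; the `redisKeys` object and its function names are hypothetical, and only the key patterns themselves come from the list above.

```typescript
// Central key builders for the four ephemeral key families.
// Names are illustrative; only the patterns are taken from this ADR.
const redisKeys = {
  activeMission: (missionId: string): string => `active_mission:${missionId}`,
  missionLobby: (lobbyId: string): string => `mission_lobby:${lobbyId}`,
  tickLock: (missionId: string): string => `tick_lock:${missionId}`,
  rateLimit: (userId: string, endpoint: string): string =>
    `rate_limit:${userId}:${endpoint}`,
};
```

Routing every Redis access through helpers like these also gives a single place to audit when checking that nothing durable lives only under a Redis key.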

Rule: anything in Redis must be reconstructable from Postgres. Loss of Redis means active missions resume from their last persisted tick on next worker poll, after a brief delay.
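
As a minimal sketch of "resume from the last persisted tick", the function below derives the resume point for one mission from its mission_logs rows. It assumes tick indices start at 0 and increase by one per tick; the type and function names are hypothetical, and a real snapshot would carry more state than shown here.

```typescript
// Subset of a mission_logs row needed to find the resume point (assumed shape).
interface PersistedLog {
  missionId: string;
  tickIndex: number;
}

interface ResumePoint {
  missionId: string;
  lastPersistedTick: number; // -1 when no tick has been persisted yet
  nextTickIndex: number;     // the tick a worker should process next
}

// Rebuilds the resume point purely from durable rows, ignoring Redis entirely.
function resumePointFromLogs(missionId: string, logs: PersistedLog[]): ResumePoint {
  let last = -1;
  for (const log of logs) {
    if (log.missionId === missionId && log.tickIndex > last) {
      last = log.tickIndex;
    }
  }
  return { missionId, lastPersistedTick: last, nextTickIndex: last + 1 };
}
```

Because the result depends only on Postgres rows, a worker can call something like this after a Redis flush and repopulate active_mission:{missionId} before processing the next tick.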

Consequences

Positive

  • Hot-path performance. Tick processing runs against Redis, where individual commands typically complete in under a millisecond, and writes to Postgres only at end-of-tick (the durable log entry).
  • Native primitives where useful. SET NX PX for locks, SETEX for TTLs, sorted sets for "missions due" queries — all clean in Redis, awkward in SQL.
  • Failure isolation. A Postgres slowdown doesn't immediately stop tick processing (Redis state continues); a Redis outage doesn't lose mission history (Postgres persists).
  • Familiar operational tooling. Both Postgres and Redis are mature, widely deployed systems with well-established operational tooling.
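
To make the SET NX PX lock semantics concrete, here is a small in-memory model of them. It is not a Redis client and does not describe the design in ADR-0005; the class `InMemoryLockStore` and its method names are hypothetical, and the injected clock exists only so the behavior can be exercised deterministically.

```typescript
// In-memory model of `SET key value NX PX ttlMs` semantics, for illustration only.
class InMemoryLockStore {
  private store = new Map<string, { value: string; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Like SET key value NX PX ttlMs: succeeds only if the key is absent or expired.
  setNxPx(key: string, value: string, ttlMs: number): boolean {
    const t = this.now();
    const current = this.store.get(key);
    if (current !== undefined && current.expiresAt > t) {
      return false; // someone else holds a live lock
    }
    this.store.set(key, { value, expiresAt: t + ttlMs });
    return true;
  }

  // Deletes the key only if the caller still owns it. Against real Redis this
  // compare-and-delete has to be atomic (commonly done with a Lua script).
  releaseIfOwner(key: string, value: string): boolean {
    const current = this.store.get(key);
    if (
      current === undefined ||
      current.expiresAt <= this.now() ||
      current.value !== value
    ) {
      return false;
    }
    this.store.delete(key);
    return true;
  }
}
```

The TTL is what keeps a crashed worker from holding tick_lock:{missionId} forever: once it expires, the next worker's acquire succeeds without any cleanup step.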

Negative

  • Two datastores to operate. Backups, monitoring, capacity planning, security hardening multiplied by two.
  • Consistency boundary. Redis can drift from Postgres if the API crashes between the Redis write and the Postgres write. This is mitigated by treating Postgres as authoritative during every cold-start reconciliation.
  • Schema discipline. The "what lives where" rule must be documented and respected — accidentally putting durable data only in Redis is a class of bug that's invisible until something restarts.
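
The cold-start reconciliation mentioned under "Consistency boundary" might look roughly like the sketch below. It models both stores as in-memory maps so the logic stands alone; the `reconcile` function, the `CachedSnapshot` shape, and the idea of comparing tick indices are assumptions for illustration, not the project's implementation.

```typescript
// Assumed minimal shape of what active_mission:{missionId} holds.
interface CachedSnapshot {
  missionId: string;
  tickIndex: number; // last tick the snapshot reflects
}

interface ReconcileResult {
  repaired: string[]; // cache entries overwritten from durable state
  dropped: string[];  // cache entries with no active mission behind them
}

// persisted: missionId -> last tick persisted in Postgres, active missions only.
// cache: missionId -> snapshot currently held in Redis.
// Postgres wins every disagreement.
function reconcile(
  persisted: Map<string, number>,
  cache: Map<string, CachedSnapshot>
): ReconcileResult {
  const repaired: string[] = [];
  const dropped: string[] = [];

  persisted.forEach((lastTick, missionId) => {
    const snapshot = cache.get(missionId);
    if (snapshot === undefined || snapshot.tickIndex !== lastTick) {
      cache.set(missionId, { missionId, tickIndex: lastTick });
      repaired.push(missionId);
    }
  });

  Array.from(cache.keys()).forEach((missionId) => {
    if (!persisted.has(missionId)) {
      cache.delete(missionId);
      dropped.push(missionId);
    }
  });

  return { repaired, dropped };
}
```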

Neutral

  • This split is a common pattern in real-time systems and is well-understood. Hiring or onboarding contributors with experience in either or both is straightforward.
  • We deliberately avoid more exotic stores (event stores, time-series databases, document stores) until the data model demonstrably needs them.

Implementation notes

  • mission_logs rows are append-only; never updated after creation. This makes them trivially safe under concurrent writes and supports replay/debug.
  • Plan retention/archival from day one — mission_logs will grow fast. Default: partition by month, archive partitions older than N months to cold storage.
  • Set up Redis ACLs before the production deploy. Local dev runs without auth; production must not.
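
For the monthly partitioning note, a helper along these lines could generate the DDL for one partition. It is a sketch under assumptions: mission_logs would need to be declared with PARTITION BY RANGE on a timestamp column (the column list above does not name one), and the partition naming scheme and the function `monthlyPartitionDdl` are hypothetical.

```typescript
// Builds the CREATE TABLE statement for one monthly partition of mission_logs.
// Assumes the parent table is range-partitioned on a timestamp column.
// `month` is 1-12.
function monthlyPartitionDdl(year: number, month: number): string {
  if (!Number.isInteger(year) || !Number.isInteger(month) || month < 1 || month > 12) {
    throw new RangeError("year must be an integer and month an integer in 1..12");
  }
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));

  // The upper bound is exclusive, so it is the first day of the next month.
  const nextYear = month === 12 ? year + 1 : year;
  const nextMonth = month === 12 ? 1 : month + 1;

  const name = `mission_logs_${year}_${pad(month)}`;
  const from = `${year}-${pad(month)}-01`;
  const to = `${nextYear}-${pad(nextMonth)}-01`;

  return (
    `CREATE TABLE ${name} PARTITION OF mission_logs ` +
    `FOR VALUES FROM ('${from}') TO ('${to}');`
  );
}
```

Creating partitions a month or two ahead of time avoids inserts failing because no partition covers the current date.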