- Added Zod as a dependency in package.json. - Updated pnpm-lock.yaml to include Zod. - Refactored API interfaces: exported new modules for perk, survivor, mission, and encounter. - Removed obsolete api-interfaces.ts file. - Enhanced tests for new schemas in api-interfaces.spec.ts, covering various validation scenarios.
4.6 KiB
Executable File
4.6 KiB
Executable File
0006 — PostgreSQL for durable state, Redis for ephemeral state
- Status: Accepted
- Date: 2026-05-06
Context and problem statement
The system has two distinct data access patterns:
- Durable state — users, survivors, mission history, mission logs. Must survive crashes, restarts, and deploys. Read patterns are infrequent (mostly on session start) but writes need full ACID guarantees.
- Ephemeral state — active mission state, mission lobbies, tick locks,
nextTickAttimestamps, rate limit counters. Read and written every poll cycle (~5 seconds), can be reconstructed from durable state if lost, must be very fast.
What datastore strategy supports both?
Decision drivers
- Performance for hot path. Tick processing reads and writes mission state every 5 seconds per mission. A traditional SQL roundtrip per access is wasteful.
- Durability for cold path. Mission history, logs, and user records must survive any failure mode.
- Operational complexity. Each datastore added is another system to monitor, back up, and reason about during incidents.
- Crash recovery. Ephemeral state should be recoverable from durable state, so loss of the ephemeral store is degraded-but-not-broken.
- Atomic operations. Tick scheduling needs
SET NXsemantics for distributed locks, which Postgres can simulate but Redis does natively.
Considered options
- Postgres only, with
pg_notifyand advisory locks. Single datastore, all state durable, ephemeral access via in-memory cache. - Redis only, with periodic snapshot to disk. Single datastore, ephemeral by nature, durability via Redis persistence (RDB/AOF).
- Postgres + Redis split. Each datastore plays to its strengths.
- Postgres + in-memory state in the API process. No second datastore, but loses state on restart and doesn't support multi-instance.
Decision outcome
Chosen: PostgreSQL for durable state, Redis for ephemeral state.
Postgres tables (durable):
users— internal ID, Twitch opaque user ID, created at.survivors— FK to user, stats, perk slots, current lifecycle state.missions— FK to survivor or group, difficulty, status, timestamps.mission_logs— FK to mission, tick index, encounter ID, rendered text, seed, modifiers applied.
Redis keys (ephemeral):
active_mission:{missionId}— JSON snapshot of in-progress mission state.mission_lobby:{lobbyId}— lobby member list and ready flags.tick_lock:{missionId}— distributed lock (see ADR-0005).rate_limit:{userId}:{endpoint}— rate limiting counters.
Rule: anything in Redis must be reconstructable from Postgres. Loss of Redis means active missions resume from their last persisted tick on next worker poll, after a brief delay.
Consequences
Positive
- Hot-path performance. Tick processing operates against Redis with sub-millisecond latency, only writing to Postgres at end-of-tick (the durable log entry).
- Native primitives where useful.
SET NX PXfor locks,SETEXfor TTLs, sorted sets for "missions due" queries — all clean in Redis, awkward in SQL. - Failure isolation. A Postgres slowdown doesn't immediately stop tick processing (Redis state continues); a Redis outage doesn't lose mission history (Postgres persists).
- Familiar operational tooling. Both Postgres and Redis have decades of operational maturity.
Negative
- Two datastores to operate. Backups, monitoring, capacity planning, security hardening multiplied by two.
- Consistency boundary. Redis can drift from Postgres if the API crashes between Redis write and Postgres write. Mitigated by treating Postgres as authoritative on every cold-start reconciliation.
- Schema discipline. The "what lives where" rule must be documented and respected — accidentally putting durable data only in Redis is a class of bug that's invisible until something restarts.
Neutral
- This split is a common pattern in real-time systems and is well-understood. Hiring or onboarding contributors with experience in either or both is straightforward.
- We deliberately avoid more exotic stores (event sourcing, time-series databases, document stores) until the data model demonstrably needs them.
Implementation notes
mission_logsrows are append-only; never updated after creation. This makes them trivially safe under concurrent writes and supports replay/debug.- Plan retention/archival from day one —
mission_logswill grow fast. Default: partition by month, archive partitions older than N months to cold storage. - Consider Redis ACL setup before production deploy. Local dev runs without auth; production must not.