Files
fog/docs/adr/0006-postgres-plus-redis-data-split.md

75 lines
4.6 KiB
Markdown
Raw Permalink Normal View History

# 0006 — PostgreSQL for durable state, Redis for ephemeral state
- **Status:** Accepted
- **Date:** 2026-05-06
## Context and problem statement
The system has two distinct data access patterns:
- **Durable state** — users, survivors, mission history, mission logs. Must survive crashes, restarts, and deploys. Read patterns are infrequent (mostly on session start) but writes need full ACID guarantees.
- **Ephemeral state** — active mission state, mission lobbies, tick locks, `nextTickAt` timestamps, rate limit counters. Read and written every poll cycle (~5 seconds), can be reconstructed from durable state if lost, must be very fast.
What datastore strategy supports both?
## Decision drivers
- **Performance for hot path.** Tick processing reads and writes mission state every 5 seconds per mission. A traditional SQL roundtrip per access is wasteful.
- **Durability for cold path.** Mission history, logs, and user records must survive any failure mode.
- **Operational complexity.** Each datastore added is another system to monitor, back up, and reason about during incidents.
- **Crash recovery.** Ephemeral state should be recoverable from durable state, so loss of the ephemeral store is degraded-but-not-broken.
- **Atomic operations.** Tick scheduling needs `SET NX` semantics for distributed locks, which Postgres can simulate but Redis does natively.
## Considered options
1. **Postgres only, with `pg_notify` and advisory locks.** Single datastore, all state durable, ephemeral access via in-memory cache.
2. **Redis only, with periodic snapshot to disk.** Single datastore, ephemeral by nature, durability via Redis persistence (RDB/AOF).
3. **Postgres + Redis split.** Each datastore plays to its strengths.
4. **Postgres + in-memory state in the API process.** No second datastore, but loses state on restart and doesn't support multi-instance.
## Decision outcome
**Chosen: PostgreSQL for durable state, Redis for ephemeral state.**
Postgres tables (durable):
- `users` — internal ID, Twitch opaque user ID, created at.
- `survivors` — FK to user, stats, perk slots, current lifecycle state.
- `missions` — FK to survivor or group, difficulty, status, timestamps.
- `mission_logs` — FK to mission, tick index, encounter ID, rendered text, seed, modifiers applied.
Redis keys (ephemeral):
- `active_mission:{missionId}` — JSON snapshot of in-progress mission state.
- `mission_lobby:{lobbyId}` — lobby member list and ready flags.
- `tick_lock:{missionId}` — distributed lock (see ADR-0005).
- `rate_limit:{userId}:{endpoint}` — rate limiting counters.
Rule: anything in Redis must be reconstructable from Postgres. Loss of Redis means active missions resume from their last persisted tick on next worker poll, after a brief delay.
## Consequences
### Positive
- **Hot-path performance.** Tick processing operates against Redis with sub-millisecond latency, only writing to Postgres at end-of-tick (the durable log entry).
- **Native primitives where useful.** `SET NX PX` for locks, `SETEX` for TTLs, sorted sets for "missions due" queries — all clean in Redis, awkward in SQL.
- **Failure isolation.** A Postgres slowdown doesn't immediately stop tick processing (Redis state continues); a Redis outage doesn't lose mission history (Postgres persists).
- **Familiar operational tooling.** Both Postgres and Redis have decades of operational maturity.
### Negative
- **Two datastores to operate.** Backups, monitoring, capacity planning, security hardening multiplied by two.
- **Consistency boundary.** Redis can drift from Postgres if the API crashes between Redis write and Postgres write. Mitigated by treating Postgres as authoritative on every cold-start reconciliation.
- **Schema discipline.** The "what lives where" rule must be documented and respected — accidentally putting durable data only in Redis is a class of bug that's invisible until something restarts.
### Neutral
- This split is a common pattern in real-time systems and is well-understood. Hiring or onboarding contributors with experience in either or both is straightforward.
- We deliberately avoid more exotic stores (event sourcing, time-series databases, document stores) until the data model demonstrably needs them.
## Implementation notes
- `mission_logs` rows are append-only; never updated after creation. This makes them trivially safe under concurrent writes and supports replay/debug.
- Plan retention/archival from day one — `mission_logs` will grow fast. Default: partition by month, archive partitions older than N months to cold storage.
- Consider Redis ACL setup before production deploy. Local dev runs without auth; production must not.