# 0005 — Per-mission jittered tick scheduling
- **Status:** Accepted
- **Date:** 2026-05-06
## Context and problem statement
The mission engine processes one encounter per mission every 60 seconds. The original plan called for a global heartbeat (a single timer that fires every 60 seconds and processes all active missions in lockstep). This design has a simple failure mode at scale: if many missions are active, every 60 seconds the system simultaneously executes hundreds or thousands of database writes, PubSub publishes, and Redis state updates.

How should ticks be scheduled to avoid this thundering herd?
## Decision drivers
- **Database load smoothness.** Synchronised tick processing creates a sawtooth load pattern on Postgres. A flat baseline is easier to provision for and easier to monitor.
- **PubSub rate limits.** Twitch's PubSub allows 1 message/sec per channel. A synchronised tick that touches 500 channels at once creates a queue spike.
- **Operational simplicity.** A single global timer is conceptually clean; per-mission timers add bookkeeping.
- **Crash recovery.** Whichever scheduling model is chosen, a worker crash mid-tick must not leave missions stuck or double-processed.
- **Distributed processing.** The system may eventually run on multiple worker instances. The scheduling model must support distributed locking.
## Considered options
1. **Global synchronised tick.** Single cron-style scheduler fires every 60s, processes all due missions in batch.
2. **Per-mission timers in memory.** Each mission has a `setTimeout` for its next tick.
3. **Per-mission `nextTickAt` timestamps in Redis, with a worker that polls for due missions and jitters new ticks across the next 60s window.**
4. **Message-queue-driven scheduling** (e.g., RabbitMQ, BullMQ) with delayed jobs.
## Decision outcome
**Chosen: Per-mission `nextTickAt` timestamps in Redis with jitter, processed by a polling worker.**

Each mission stores `nextTickAt` in Redis. When a mission starts (or after each tick is processed), `nextTickAt` is set to `now() + 60s + random(±15s)`, so consecutive ticks are scheduled 45 to 75 seconds apart. A worker polls Redis every ~5 seconds for missions where `nextTickAt <= now()`, processes them, and reschedules.

Distributed safety comes from a per-mission lock acquired with `SET key value NX PX <ttl>`, where the value is a unique token that is verified before release. This pattern handles both concurrent worker instances and crash recovery: a stuck lock expires after its TTL and another worker picks up the mission.
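
The scheduling rule described above can be sketched as follows. This is a minimal illustration, assuming TypeScript and an in-memory `Map` standing in for Redis. The names `computeNextTickAt` and `findDueMissions` are hypothetical and do not come from the codebase; only the 60-second interval and the ±15-second jitter are taken from this ADR.

```typescript
// Figures from this ADR; the constant names themselves are illustrative.
const TICK_INTERVAL_MS = 60 * 1000;
const JITTER_MS = 15 * 1000;

/**
 * Returns the next tick time: now + 60s + uniform jitter in [-15s, +15s).
 * The random source is injectable so the function can be tested deterministically.
 */
function computeNextTickAt(nowMs: number, random: () => number = Math.random): number {
  const jitter = Math.round((random() * 2 - 1) * JITTER_MS);
  return nowMs + TICK_INTERVAL_MS + jitter;
}

/**
 * Returns the IDs of missions whose nextTickAt has passed, as one poll would.
 * The Map (missionId -> nextTickAt in ms) is a stand-in for the Redis state.
 */
function findDueMissions(schedule: ReadonlyMap<string, number>, nowMs: number): string[] {
  const due: string[] = [];
  schedule.forEach((nextTickAt, missionId) => {
    if (nextTickAt <= nowMs) {
      due.push(missionId);
    }
  });
  return due;
}
```

A real worker would run the equivalent of `findDueMissions` against Redis on each ~5 second poll and call something like `computeNextTickAt` when rescheduling a mission.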
## Consequences
### Positive
- **Smooth Postgres write load.** Across many active missions, ticks are spread evenly through every 60-second window rather than clustered.
- **Headroom for PubSub batching.** Per-channel rate limits are easier to respect when channels' missions don't all tick at the exact same moment.
- **Crash-safe.** A worker that dies mid-tick releases its lock via TTL; another worker picks up the mission on the next poll.
- **Easy to reason about per-mission.** Each mission's lifecycle is independent in Redis, with no shared mutable state.
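
To illustrate the smoothing effect, the sketch below simulates missions that all start at the same instant and counts how many ticks land in each second of the final simulated minute. It is a toy model under stated assumptions (a seeded linear congruential generator in place of `Math.random`, no polling latency, hypothetical function names), not a measurement of the real system.

```typescript
/** Small deterministic PRNG (LCG) so the sketch is reproducible; not for production use. */
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // scale to [0, 1)
  };
}

/** Peak number of ticks that land in any single second of the final simulated minute. */
function peakTicksPerSecond(
  missions: number,
  minutes: number,
  jitterMs: number,
  random: () => number,
): number {
  const horizonMs = minutes * 60000;
  const windowStartMs = horizonMs - 60000;
  const buckets = new Array<number>(60).fill(0);
  for (let m = 0; m < missions; m++) {
    let t = 0; // worst case: every mission starts at the same instant
    while (true) {
      // Same rule as the ADR: 60s plus uniform jitter in [-jitterMs, +jitterMs).
      t += 60000 + Math.round((random() * 2 - 1) * jitterMs);
      if (t >= horizonMs) break;
      if (t >= windowStartMs) {
        const second = Math.floor((t - windowStartMs) / 1000);
        buckets[second] = (buckets[second] ?? 0) + 1;
      }
    }
  }
  return Math.max(...buckets);
}

// 1000 missions over 10 simulated minutes, without and with jitter.
const syncPeak = peakTicksPerSecond(1000, 10, 0, makeRandom(1));
const jitterPeak = peakTicksPerSecond(1000, 10, 15000, makeRandom(1));
```

With `jitterMs = 0` every mission ticks in the same second, so `syncPeak` equals the mission count. With `jitterMs = 15000` the phases drift apart over successive ticks, and `jitterPeak` should be a small fraction of that.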
### Negative
- **More moving parts than a global cron.** Polling logic, lock semantics, and TTL tuning are all distinct things to get right.
- **Polling latency.** In the worst case, a tick fires up to ~5 seconds late (one polling interval). That is acceptable for a 60-second cadence but would not be for a sub-second one.
- **Redis becomes operationally critical.** A Redis outage stops mission progression. Mitigate with monitoring and a clearly documented recovery path.
### Neutral
- The polling interval is a tuning knob: shorter intervals reduce tick latency but increase Redis load. Default to 5 seconds; revisit if mission counts grow into the tens of thousands.
- This pattern generalises to other periodic work (PubSub batched flushes, mission timeout enforcement) without architectural change.
## Implementation notes
- Lock key shape: `tick_lock:{missionId}`, value is a UUID generated by the worker, TTL is 30 seconds (longer than worst-case tick duration, shorter than the 60s cadence).
- Release pattern: Lua script that checks the lock value matches the worker's UUID before deleting. Prevents accidentally releasing another worker's lock if your tick took too long.
- `nextTickAt` is set by the lock-holding worker after the encounter resolves. If the worker crashes between resolving and setting, the lock expires and the next worker reprocesses — safe because tick processing is idempotent on `(missionId, tickIndex)` thanks to the seeded resolver (see ADR-0004).
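
The lock rules in these notes can be sketched as follows. The Lua script is the widely used compare-and-delete pattern for releasing a `SET ... NX PX` lock. The `FakeLockStore` class is a hypothetical in-memory stand-in that mimics those semantics so the example runs without a Redis server, and `lockKey` is an illustrative helper; neither comes from the codebase. A real worker would presumably issue `SET tick_lock:{missionId} <uuid> NX PX 30000` and run the script via `EVAL`.

```typescript
// Release script: delete the key only if it still holds this worker's token.
const RELEASE_SCRIPT = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`;

const LOCK_TTL_MS = 30 * 1000; // TTL from this ADR

/** Builds the lock key in the shape given above. */
function lockKey(missionId: string): string {
  return `tick_lock:${missionId}`;
}

/** In-memory stand-in for Redis (hypothetical), just enough to show the semantics. */
class FakeLockStore {
  private readonly locks = new Map<string, { token: string; expiresAt: number }>();

  /** Mimics SET key token NX PX ttl: succeeds only if no unexpired lock exists. */
  acquire(key: string, token: string, ttlMs: number, nowMs: number): boolean {
    const held = this.locks.get(key);
    if (held !== undefined && held.expiresAt > nowMs) return false;
    this.locks.set(key, { token, expiresAt: nowMs + ttlMs });
    return true;
  }

  /** Mimics RELEASE_SCRIPT: deletes only when the stored token matches. */
  release(key: string, token: string, nowMs: number): boolean {
    const held = this.locks.get(key);
    if (held === undefined || held.expiresAt <= nowMs || held.token !== token) return false;
    this.locks.delete(key);
    return true;
  }
}

// Worker A stalls past the TTL; worker B takes over; A's late release is a no-op.
const store = new FakeLockStore();
const key = lockKey("mission-42");
const gotA = store.acquire(key, "token-a", LOCK_TTL_MS, 0);          // lock was free
const gotB = store.acquire(key, "token-b", LOCK_TTL_MS, 1000);       // A still holds it
const gotBLater = store.acquire(key, "token-b", LOCK_TTL_MS, 31000); // A's TTL has elapsed
const lateRelease = store.release(key, "token-a", 32000);            // token mismatch
```

The token check is what keeps worker A from deleting worker B's lock after A's tick overran the TTL, which is the failure the release pattern above is meant to prevent.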