# 0005 — Per-mission jittered tick scheduling
- **Status:** Accepted
- **Date:** 2026-05-06
## Context and problem statement
The mission engine processes one encounter per mission every 60 seconds. The original plan called for a global heartbeat — a single timer that fires every 60 seconds and processes all active missions in lockstep. This design has a simple failure mode at scale: if many missions are active, every 60 seconds the system simultaneously executes hundreds or thousands of database writes, PubSub publishes, and Redis state updates.
How should ticks be scheduled to avoid this thundering herd?
## Decision drivers
- **Database load smoothness.** Synchronised tick processing creates a sawtooth load pattern on Postgres. A flat baseline is easier to provision for and easier to monitor.
- **PubSub rate limits.** Twitch's PubSub allows 1 message/sec per channel. A synchronised tick that touches 500 channels at once creates a queue spike.
- **Operational simplicity.** A single global timer is conceptually clean; per-mission timers add bookkeeping.
- **Crash recovery.** Whichever scheduling model is chosen, a worker crash mid-tick must not leave missions stuck or double-processed.
- **Distributed processing.** The system may eventually run on multiple worker instances. The scheduling model must support distributed locking.
## Considered options
1. **Global synchronised tick.** Single cron-style scheduler fires every 60s, processes all due missions in batch.
2. **Per-mission timers in memory.** Each mission has a `setTimeout` for its next tick.
3. **Per-mission `nextTickAt` timestamps in Redis, with a worker that polls for due missions and reschedules each one with random jitter around the 60s cadence.**
4. **Message-queue-driven scheduling** (e.g., RabbitMQ, BullMQ) with delayed jobs.
## Decision outcome
**Chosen: Per-mission `nextTickAt` timestamps in Redis with jitter, processed by a polling worker.**
Each mission stores `nextTickAt` in Redis. When a mission starts (or after each tick processes), `nextTickAt` is set to `now() + 60s + random(±15s)`. A worker polls Redis every ~5 seconds for missions where `nextTickAt <= now()`, processes them, and reschedules.
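
The scheduling rule above can be sketched as follows. This is a minimal TypeScript illustration, not the engine's actual code: `computeNextTickAt`, `dueMissions`, and the plain-object schedule standing in for Redis are hypothetical names introduced here, and the injectable `rand` parameter exists only to make the jitter testable.

```typescript
// Illustrative sketch; the names and the in-memory schedule are assumptions,
// not the engine's real API. In production the schedule would live in Redis.
const TICK_INTERVAL_MS = 60_000; // base cadence from this ADR
const JITTER_MS = 15_000;        // jitter half-width (the "±15s")

/** now + 60s, shifted by a uniform random offset in [-15s, +15s). */
function computeNextTickAt(nowMs: number, rand: () => number = Math.random): number {
  const offset = Math.round((rand() * 2 - 1) * JITTER_MS);
  return nowMs + TICK_INTERVAL_MS + offset;
}

/** Mission IDs whose nextTickAt is at or before nowMs (what one poll would return). */
function dueMissions(schedule: { [missionId: string]: number }, nowMs: number): string[] {
  return Object.keys(schedule).filter((id) => schedule[id] <= nowMs);
}
```

With this rule, consecutive ticks for one mission land between 45 and 75 seconds apart, so missions that start together drift out of phase after a few ticks.
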
Distributed safety comes from a per-mission lock acquired with `SET key value NX PX <ttl>`, where the value is a unique token that is verified before release. This pattern handles concurrent worker instances and crash recovery: a stuck lock expires after its TTL and another worker picks up the mission.
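
To make the locking semantics concrete, the sketch below models `SET key value NX PX <ttl>` with an in-memory class rather than a real Redis client, so it runs without a server. `InMemoryLockStore`, `tryAcquire`, and the explicit `nowMs` argument are hypothetical, introduced only for illustration; a real worker would issue the Redis command itself and rely on Redis to expire the key.

```typescript
// In-memory model of a per-mission lock; all names are illustrative assumptions.
interface LockEntry { token: string; expiresAtMs: number; }

class InMemoryLockStore {
  private locks: { [key: string]: LockEntry } = {};

  /** Models SET key token NX PX ttlMs: succeeds only if no unexpired lock exists. */
  tryAcquire(key: string, token: string, ttlMs: number, nowMs: number): boolean {
    const current = this.locks[key];
    if (current !== undefined && current.expiresAtMs > nowMs) {
      return false; // another worker holds a live lock
    }
    this.locks[key] = { token: token, expiresAtMs: nowMs + ttlMs };
    return true;
  }
}

const store = new InMemoryLockStore();
const lockKey = "tick_lock:mission-42"; // key shape from the implementation notes
const workerA = store.tryAcquire(lockKey, "token-a", 30_000, 0);         // acquires
const workerB = store.tryAcquire(lockKey, "token-b", 30_000, 10_000);    // blocked by A
const afterCrash = store.tryAcquire(lockKey, "token-b", 30_000, 30_000); // A's TTL has elapsed
```

The last call shows the crash-recovery path: if worker A never releases the lock, worker B can acquire it once the 30-second TTL has passed.
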
## Consequences
### Positive
- **Smooth Postgres write load.** Across many active missions, ticks are spread evenly through every 60-second window rather than clustered.
- **Headroom for PubSub batching.** Per-channel rate limits are easier to respect when channels' missions don't all tick at the exact same moment.
- **Crash-safe.** If a worker dies mid-tick, its lock expires after the TTL; another worker picks up the mission on a subsequent poll.
- **Easy to reason about per-mission.** Each mission's lifecycle is independent in Redis, with no shared mutable state.
### Negative
- **More moving parts than a global cron.** Polling logic, lock semantics, and TTL tuning are all distinct things to get right.
- **Polling latency.** In the worst case, a tick fires up to ~5 seconds late (the polling interval). This is acceptable for a 60-second cadence but would not be for a sub-second one.
- **Redis becomes operationally critical.** A Redis outage stops mission progression. Mitigate with monitoring and a clearly-documented recovery path.
### Neutral
- The polling interval is a tuning knob: shorter intervals reduce tick latency but increase Redis load. Default to 5 seconds; revisit if mission counts grow into the tens of thousands.
- This pattern generalises to other periodic work (PubSub batched flushes, mission timeout enforcement) without architectural change.
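
As a rough illustration of the polling trade-off, the expected number of due missions per poll can be estimated from the mission count and the two intervals. The sketch assumes tick phases are spread uniformly across the cadence (the steady state the jitter is meant to produce) and ignores processing time; `expectedDuePerPoll` is a hypothetical helper, not part of the engine.

```typescript
/** Average missions due per poll, assuming uniformly spread tick phases. */
function expectedDuePerPoll(
  activeMissions: number,
  pollIntervalMs: number,
  tickIntervalMs: number
): number {
  return (activeMissions * pollIntervalMs) / tickIntervalMs;
}

const perPollAt1k = expectedDuePerPoll(1_000, 5_000, 60_000);   // about 83.3
const perPollAt12k = expectedDuePerPoll(12_000, 5_000, 60_000); // 1000
```

Under these assumptions, 1,000 active missions produce roughly 83 due missions per 5-second poll, compared with 1,000 at once under a global synchronised tick.
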
## Implementation notes
- Lock key shape: `tick_lock:{missionId}`, value is a UUID generated by the worker, TTL is 30 seconds (longer than worst-case tick duration, shorter than the 60s cadence).
- Release pattern: Lua script that checks the lock value matches the worker's UUID before deleting. This prevents a worker whose tick outlasted the TTL from deleting a lock that another worker has since acquired.
- `nextTickAt` is set by the lock-holding worker after the encounter resolves. If the worker crashes between resolving and setting, the lock expires and the next worker reprocesses — safe because tick processing is idempotent on `(missionId, tickIndex)` thanks to the seeded resolver (see ADR-0004).
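
The release step described above is commonly written as a compare-and-delete Lua script. The script text below follows the pattern given in the Redis documentation for `SET`; the TypeScript function is an in-memory model of the same check so the behaviour can be exercised without a Redis server. `RELEASE_LOCK_LUA`, `releaseIfOwner`, and the plain-object store are hypothetical names used for illustration.

```typescript
// Lua script a worker could pass to EVAL: delete the lock only if we still own it.
// KEYS[1] is the lock key, ARGV[1] is the worker's UUID.
const RELEASE_LOCK_LUA = `
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end`;

/** In-memory model of the script: returns 1 if the lock was deleted, else 0. */
function releaseIfOwner(locks: { [key: string]: string }, key: string, token: string): number {
  if (locks[key] === token) {
    delete locks[key];
    return 1;
  }
  return 0;
}

// Scenario: worker A's tick overran the TTL, so worker B now holds the lock.
const heldLocks: { [key: string]: string } = { "tick_lock:mission-42": "uuid-b" };
const staleRelease = releaseIfOwner(heldLocks, "tick_lock:mission-42", "uuid-a"); // 0: B's lock survives
const ownerRelease = releaseIfOwner(heldLocks, "tick_lock:mission-42", "uuid-b"); // 1: lock removed
```

Running the check and the delete inside one script matters because Redis executes a script atomically; a separate `GET` followed by `DEL` would leave a window in which the lock could expire and be re-acquired between the two calls.
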