Rotation service
A stateless fleet health + load-balancing service for the proxy platform — weighted random selection with latency-based weights, hysteresis-protected health states, CAS-safe updates against Couchbase.
- status: active
- started: 2026-04
- updated: 2026-04
- tags: Backend · Systems
What
A small stateless service that does two things for the proxy fleet it sits in front of: decides which nodes are currently healthy, and picks one per incoming request. Given a request, it returns a single host:port string — the proxy fleet uses the answer immediately and discards it.
Why
Running ~50 proxy nodes is the point where a static config stops working. Nodes go degraded, latency distributions vary across the fleet, and naïve round-robin happily sends traffic to the box that just started flapping. I needed something that combined a liveness signal and a performance signal, rebalanced continuously, and was cheap enough per request that it could sit on the hot path.
How
- Stack: TypeScript + Express, Couchbase (per-node docs keyed node::host:port), Zod for input validation, Pino for structured logging.
- Two-phase load balancing behind an LB_PHASE flag, useful for rolling out the smart path against the dumb one:
  - Phase 1: uniform random. Baseline, deterministic-ish, easy to reason about and compare against.
  - Phase 2: Weighted Random Selection. Weight per node is base_weight * (1000 / latency_ms), so lower-latency nodes pick up proportionally more traffic. The selection itself is stateless: sum weights, draw a random point in [0, total), walk the list decrementing. No per-request writes, no DB round-trips on the hot path.
- Health state machine: healthy → degraded (1 failure) → down (3 consecutive failures). Exit requires 2 consecutive successes; the hysteresis on the way out is what prevents a flaky node from cycling the cache every 30 seconds.
- CAS-protected mutations on health updates, with a bounded 3-retry loop. Concurrent health reports on the same node used to lose updates under load; the retry closes the window without letting a hotkey collision stall ingestion forever.
- Regional cache of healthy-node sets, 30 s TTL, invalidated immediately on any health-state change. Most request traffic hits the cache; Couchbase sees rebuilds, not per-request reads.
- Input validation via Zod: schemas reject javascript: and data: URL schemes, and currency codes are uppercased and validated as 3-char ISO before they touch downstream logic.
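The Phase 2 selection described in the list above can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: the ProxyNode shape, the function names, the 1 ms latency floor, and the injectable random source are all assumptions added for the example. Only the weight formula and the sum/draw/walk steps come from the description.

```typescript
// Hypothetical node shape; field names are illustrative, not the service's schema.
interface ProxyNode {
  hostPort: string; // e.g. "10.0.0.1:8080"
  baseWeight: number;
  latencyMs: number;
}

// Weight formula from the write-up: base_weight * (1000 / latency_ms).
// Flooring latency at 1 ms is an assumption, made here to avoid dividing by zero.
function weightOf(node: ProxyNode): number {
  return node.baseWeight * (1000 / Math.max(node.latencyMs, 1));
}

// `rand` is injectable so the draw can be made deterministic in tests.
function pickNode(
  nodes: ProxyNode[],
  rand: () => number = Math.random,
): string | undefined {
  if (nodes.length === 0) return undefined;

  const weighted = nodes.map((node) => ({ node, weight: weightOf(node) }));
  const total = weighted.reduce((sum, entry) => sum + entry.weight, 0);
  const first = weighted[0]?.weight;
  const allEqual = weighted.every((entry) => entry.weight === first);

  // Explicit fallback: unusable or identical weights drop to uniform random.
  if (!isFinite(total) || total <= 0 || allEqual) {
    return nodes[Math.floor(rand() * nodes.length)]?.hostPort;
  }

  // Draw a point in [0, total) and walk the list, decrementing by each weight.
  let point = rand() * total;
  for (const entry of weighted) {
    point -= entry.weight;
    if (point < 0) return entry.node.hostPort;
  }
  // Floating-point rounding can leave the walk unfinished; take the last node.
  return nodes[nodes.length - 1]?.hostPort;
}
```

With two nodes at 100 ms and 50 ms and equal base weights, the weights come out as 10 and 20, so the faster node receives about two thirds of the draws. Nothing is written per request, which is what keeps the selection cheap on the hot path.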
What was hard (and what I learned)
- Exit hysteresis matters more than entry hysteresis. Easy to write “three fails = down”; harder to resist the urge to mark a node healthy the moment it answers one probe. Letting nodes come back too fast is how you get a 30-second flap loop.
- Latency-weighted selection has a degenerate case I didn’t initially handle: if every reporting node has sub-millisecond latency (cold start, just-booted reporter, test fixture), the weights all collapse to the same large number and the weighted selection degenerates into uniform. Fallback is explicit — if all computed weights are zero or equal, drop to uniform random. The lesson was writing the fallback first, not after.
- Cache invalidation design: the aggressive “clear the whole regional cache on any health change” strategy is simple and correct, but it sends a thundering herd at Couchbase the moment a busy node flips. Replacing it with invalidate-then-background-recache is on the list (see limitations).
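The exit-hysteresis point in the first bullet above can be made concrete with a small pure transition function. This is a sketch under stated assumptions: the type and function names are hypothetical, a success is assumed to reset the failure streak (and vice versa), and two consecutive successes are assumed to return a node straight to healthy from either degraded or down. The thresholds (1 failure, 3 consecutive failures, 2 consecutive successes) are the ones given in the How section.

```typescript
type HealthState = "healthy" | "degraded" | "down";

// Hypothetical per-node record; the real document layout may differ.
interface NodeHealth {
  state: HealthState;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
}

const DOWN_AFTER_FAILURES = 3; // from the write-up
const RECOVER_AFTER_SUCCESSES = 2; // from the write-up

// Pure transition: takes the previous record and one report, returns the next record.
function applyReport(prev: NodeHealth, ok: boolean): NodeHealth {
  if (!ok) {
    const failures = prev.consecutiveFailures + 1;
    // A down node stays down on failure; otherwise degrade until the threshold.
    const state: HealthState =
      prev.state === "down" || failures >= DOWN_AFTER_FAILURES ? "down" : "degraded";
    return { state, consecutiveFailures: failures, consecutiveSuccesses: 0 };
  }

  const successes = Math.min(prev.consecutiveSuccesses + 1, RECOVER_AFTER_SUCCESSES);
  // Exit hysteresis: one good probe is not enough to leave degraded or down.
  const state: HealthState =
    prev.state !== "healthy" && successes < RECOVER_AFTER_SUCCESSES
      ? prev.state
      : "healthy";
  return { state, consecutiveFailures: 0, consecutiveSuccesses: successes };
}
```

A node that alternates success and failure never accumulates two consecutive successes, so it stays degraded and does not trigger a state change (and therefore a cache invalidation) on every report.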
Scale / constraints
- Target fleet size: ~50 nodes.
- Stateless service. Scales horizontally behind a load balancer.
- Storage: Couchbase Capella, smart_routing bucket, core.nodes collection. Small documents, K/V access patterns; no N1QL on the hot path.
- Still under active development. This is the current iteration; the shape of the API and the health-check contract are stabilising but not yet final.
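Health updates are read-modify-write operations on these small K/V documents, which is where the bounded CAS retry mentioned under How comes in. The sketch below shows the general shape against an in-memory stand-in. CasStore, InMemoryStore, and updateWithCas are hypothetical names invented for this example and are not the Couchbase SDK's API; reading "3-retry" as three attempts in total is also an assumption.

```typescript
// Hypothetical minimal K/V interface standing in for a Couchbase collection.
interface CasStore<T> {
  get(key: string): Promise<{ value: T; cas: number }>;
  // Resolves true if the write applied, false if the CAS token was stale.
  replace(key: string, value: T, cas: number): Promise<boolean>;
}

// In-memory stand-in so the sketch runs without a cluster.
class InMemoryStore<T> implements CasStore<T> {
  private docs = new Map<string, { value: T; cas: number }>();

  seed(key: string, value: T): void {
    this.docs.set(key, { value, cas: 1 });
  }

  async get(key: string): Promise<{ value: T; cas: number }> {
    const doc = this.docs.get(key);
    if (doc === undefined) throw new Error(`missing document: ${key}`);
    return { value: doc.value, cas: doc.cas };
  }

  async replace(key: string, value: T, cas: number): Promise<boolean> {
    const doc = this.docs.get(key);
    if (doc === undefined || doc.cas !== cas) return false;
    this.docs.set(key, { value, cas: cas + 1 });
    return true;
  }
}

const MAX_ATTEMPTS = 3; // assumption: "3-retry" read as three attempts in total

async function updateWithCas<T>(
  store: CasStore<T>,
  key: string,
  mutate: (current: T) => T,
): Promise<T> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const { value, cas } = await store.get(key);
    const next = mutate(value);
    if (await store.replace(key, next, cas)) return next;
    // Stale CAS: another writer got in first, so re-read and try again.
  }
  throw new Error(`CAS conflict on ${key} after ${MAX_ATTEMPTS} attempts`);
}
```

The bound matters as much as the retry: two concurrent reporters both land their update, while a persistently contended key fails loudly after the last attempt instead of stalling ingestion.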
Honest limitations
Called out because they matter, and pretending they’re done would be worse than naming them:
- HMAC signature validation on /node-health is designed but not yet implemented. The architecture proposal mandates it; the code does not yet enforce it. Any caller with network reachability can currently push health updates. This is the next thing I’m shipping.
- Health-checking is reactive only. The service waits to be told a node is unhealthy; there’s no internal probe loop yet. A single bad reporter can cascade to 503 faster than it should.
- No circuit-breaker for in-flight traffic when a node transitions to down. Existing requests to that node will fail rather than being retried against a healthy peer.
- Single-flight cache protection is designed but not built. Cold-cache moments can stampede Couchbase. Low impact at 50 nodes, higher impact at 500.
- Observability is thin. Pino HTTP logs are in; Prometheus metrics and structured audit logs are on the list but not merged.
- Test coverage is uneven. The geo / currency mapping and the weighted-selection logic have unit tests. There are no integration tests for Couchbase CAS contention yet.
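For the single-flight item in the list above, one common shape is to share a single in-flight promise per cache key, so concurrent cold-cache callers wait on one rebuild instead of each hitting the database. The sketch below illustrates that general technique only; it is not the design referred to in this write-up, and the SingleFlight class and its run method are hypothetical names.

```typescript
// Hypothetical helper: coalesces concurrent loads for the same key into one call.
class SingleFlight<T> {
  private inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    if (existing !== undefined) return existing; // join the load already running

    // Clear the entry on success or failure so later calls trigger a fresh load.
    const pending = load().then(
      (value) => {
        this.inFlight.delete(key);
        return value;
      },
      (error) => {
        this.inFlight.delete(key);
        throw error;
      },
    );
    this.inFlight.set(key, pending);
    return pending;
  }
}
```

In this sketch the key would be something like a region name and the loader would be the healthy-node-set rebuild; failures are shared with every waiting caller and are not cached.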
What’s next
In rough order:
- HMAC validation on /node-health.
- Proactive probe loop alongside the reactive reports.
- Single-flight cache protection.
- Prometheus metrics + audit logs.
- Integration test harness for the Couchbase CAS paths.