Rotation service

A stateless fleet health + load-balancing service for the proxy platform — weighted random selection with latency-based weights, hysteresis-protected health states, CAS-safe updates against Couchbase.

status: active
started: 2026-04
updated: 2026-04
tags: Backend · Systems

What

A small stateless service that does two things for the proxy fleet it sits in front of: decides which nodes are currently healthy, and picks one per incoming request. Given a request, it returns a single host:port string — the proxy fleet uses the answer immediately and discards it.

Why

Running ~50 proxy nodes is the point where a static config stops working. Nodes go degraded, latency distributions vary across the fleet, and naïve round-robin happily sends traffic to the box that just started flapping. I needed something that combined a liveness signal and a performance signal, rebalanced continuously, and was cheap enough per request that it could sit on the hot path.

How

Stack: TypeScript + Express, Couchbase (per-node docs keyed node::host:port), Zod for input validation, Pino for structured logging.
Two-phase load balancing behind an LB_PHASE flag — useful for rolling out the smart path against the dumb one:
- Phase 1: uniform random. The baseline: no weights to tune, easy to reason about and compare against.
- Phase 2: Weighted Random Selection. Weight per node is base_weight * (1000 / latency_ms), so lower-latency nodes pick up proportionally more traffic. The selection itself is stateless: sum the weights, draw a random point in [0, total), then walk the list subtracting each node’s weight until the point drops below zero. No per-request writes, no DB round-trips on the hot path.
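A minimal sketch of the Phase 2 selection described above. The ProxyNode shape, the field names and the injectable rand parameter are assumptions made for illustration, not the service's actual types; only the weight formula and the sum/draw/walk steps come from the description.

```typescript
// Hypothetical node shape; the real document schema may differ.
interface ProxyNode {
  host: string;
  port: number;
  baseWeight: number;
  latencyMs: number;
}

// Weight formula from the text: base_weight * (1000 / latency_ms).
function weightOf(node: ProxyNode): number {
  return node.baseWeight * (1000 / node.latencyMs);
}

// rand is injectable so the draw can be pinned in tests; it must return a value in [0, 1).
function pickWeighted(nodes: ProxyNode[], rand: () => number = Math.random): string {
  if (nodes.length === 0) throw new Error("no healthy nodes to choose from");

  const weighted = nodes.map((node) => ({ node, weight: weightOf(node) }));
  const total = weighted.reduce((sum, entry) => sum + entry.weight, 0);

  // Random point in [0, total); walk the list subtracting weights until it goes negative.
  let point = rand() * total;
  let chosen = "";
  for (const { node, weight } of weighted) {
    // Remember the latest node so floating-point rounding cannot leave the result empty.
    chosen = `${node.host}:${node.port}`;
    point -= weight;
    if (point < 0) break;
  }
  return chosen;
}
```

With two nodes of equal base weight at 100 ms and 50 ms, the weights are 10 and 20, so the faster node should receive roughly two thirds of the picks.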
Health state machine: healthy → degraded (1 failure) → down (3 consecutive failures). Exit requires 2 consecutive successes — the hysteresis on the way out is what prevents a flaky node from cycling the cache every 30 seconds.
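The transitions can be sketched as a pure function over a small counter record. The NodeHealth shape and the applyReport name are hypothetical. The sketch also assumes that two consecutive successes return a node straight to healthy from either degraded or down, which the description leaves open.

```typescript
type HealthState = "healthy" | "degraded" | "down";

// Hypothetical per-node health record; field names are illustrative.
interface NodeHealth {
  state: HealthState;
  consecutiveFailures: number;
  consecutiveSuccesses: number;
}

const DOWN_AFTER_FAILURES = 3;     // 3 consecutive failures mark a node down
const RECOVER_AFTER_SUCCESSES = 2; // 2 consecutive successes are needed to exit

// Returns the next health record for one report; never mutates the input.
function applyReport(prev: NodeHealth, ok: boolean): NodeHealth {
  if (!ok) {
    const failures = prev.consecutiveFailures + 1;
    // A node that is already down stays down until it earns its way out.
    const state: HealthState =
      failures >= DOWN_AFTER_FAILURES || prev.state === "down" ? "down" : "degraded";
    return { state, consecutiveFailures: failures, consecutiveSuccesses: 0 };
  }

  const successes = prev.consecutiveSuccesses + 1;
  // Exit hysteresis: one good probe is not enough to leave degraded or down.
  const recovered = prev.state !== "healthy" && successes >= RECOVER_AFTER_SUCCESSES;
  return {
    state: recovered ? "healthy" : prev.state,
    consecutiveFailures: 0,
    consecutiveSuccesses: successes,
  };
}
```

Because any failure resets the success counter, a node that alternates success and failure never accumulates the two consecutive successes it needs, so it cannot flap back to healthy.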
CAS-protected mutations on health updates, with a bounded 3-retry loop. Concurrent health reports on the same node used to lose updates under load; CAS closes that window, and bounding the retries keeps a hot-key collision from stalling ingestion forever.
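A sketch of the read-modify-write loop, written against a hypothetical CasStore interface rather than the real Couchbase SDK so that it runs on its own. Three assumptions are made for brevity: the store reports a stale CAS by returning false (a real client would typically surface it as an error), CAS is modelled as a plain number, and "3 retries" is read as three attempts in total.

```typescript
// Hypothetical minimal store interface; not the Couchbase SDK's API.
interface CasStore<T> {
  get(key: string): Promise<{ content: T; cas: number }>;
  // Resolves to false when the supplied cas no longer matches the stored document.
  replace(key: string, content: T, cas: number): Promise<boolean>;
}

// Re-reads and re-applies the mutation on every CAS mismatch, up to maxAttempts times.
async function updateWithCas<T>(
  store: CasStore<T>,
  key: string,
  mutate: (current: T) => T,
  maxAttempts: number = 3,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const { content, cas } = await store.get(key);
    const next = mutate(content);
    if (await store.replace(key, next, cas)) return next;
  }
  // Bounded: give up instead of spinning on a hot key.
  throw new Error(`CAS update for ${key} failed after ${maxAttempts} attempts`);
}

// In-memory stand-in used only to exercise the loop.
class InMemoryStore<T> implements CasStore<T> {
  private readonly docs = new Map<string, { content: T; cas: number }>();

  seed(key: string, content: T): void {
    this.docs.set(key, { content, cas: 1 });
  }

  async get(key: string): Promise<{ content: T; cas: number }> {
    const doc = this.docs.get(key);
    if (!doc) throw new Error(`document not found: ${key}`);
    return { content: doc.content, cas: doc.cas };
  }

  async replace(key: string, content: T, cas: number): Promise<boolean> {
    const doc = this.docs.get(key);
    if (!doc || doc.cas !== cas) return false;
    this.docs.set(key, { content, cas: cas + 1 });
    return true;
  }
}
```

The mutation is passed as a function rather than a value so that each retry recomputes it from the freshly read document, which is what prevents the lost update.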
Regional cache of healthy-node sets, 30 s TTL, invalidated immediately on any health-state change. Most request traffic hits the cache; Couchbase sees rebuilds, not per-request reads.
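One way the cache could look, as a sketch only. The RegionalCache class, its method names and the injectable clock are hypothetical; the 30 s TTL and the clear-everything invalidation come from the description.

```typescript
// Hypothetical TTL cache keyed by region; T would be the healthy-node set for that region.
class RegionalCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // now is injectable so expiry can be tested without waiting 30 seconds.
  constructor(ttlMs: number = 30_000, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined when it is missing or expired.
  get(region: string): T | undefined {
    const entry = this.entries.get(region);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(region);
      return undefined;
    }
    return entry.value;
  }

  set(region: string, value: T): void {
    this.entries.set(region, { value, expiresAt: this.now() + this.ttlMs });
  }

  // Called on any health-state change: drop everything and let the next reads rebuild.
  invalidateAll(): void {
    this.entries.clear();
  }
}
```

A caller would try get first and rebuild the region's healthy set from storage only on undefined, which is what keeps per-request reads away from Couchbase.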
Input validation via Zod: schemas reject javascript: and data: URL schemes, and currency codes are uppercased and validated as 3-character ISO 4217 codes before they touch downstream logic.
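The two rules themselves are small. The sketch below deliberately uses plain TypeScript instead of Zod so that it runs without dependencies; in the service they would presumably live inside Zod schemas. The function names are hypothetical, and the currency check only verifies the three-letter shape, not membership in the ISO 4217 list.

```typescript
// Matches javascript: and data: schemes, case-insensitively, tolerating leading whitespace.
const BLOCKED_SCHEMES = /^\s*(javascript|data):/i;

// Hypothetical helper: true when the input does not use a blocked scheme.
function isAllowedUrl(input: string): boolean {
  return !BLOCKED_SCHEMES.test(input);
}

// Hypothetical helper: uppercases first, then checks the 3-letter shape.
// Returns null for anything that is not three ASCII letters.
function normaliseCurrency(input: string): string | null {
  const code = input.toUpperCase();
  return /^[A-Z]{3}$/.test(code) ? code : null;
}
```

Uppercasing before validating means "usd" and "USD" reach downstream logic as the same value, while anything that is not exactly three letters is rejected outright.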

What was hard (and what I learned)

Exit hysteresis matters more than entry hysteresis. Easy to write “three fails = down”; harder to resist the urge to mark a node healthy the moment it answers one probe. Letting nodes come back too fast is how you get a 30-second flap loop.
Latency-weighted selection has a degenerate case I didn’t initially handle: if every reporting node has sub-millisecond latency (cold start, just-booted reporter, test fixture), the weights all collapse to the same large number and the weighted selection degenerates into uniform. Fallback is explicit — if all computed weights are zero or equal, drop to uniform random. The lesson was writing the fallback first, not after.
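A sketch of the guard, working on a bare array of weights. The function names are hypothetical. Treating non-finite or negative weights as degenerate is an added assumption: in JavaScript 1000 / 0 evaluates to Infinity, so a reported latency of 0 would otherwise poison the total.

```typescript
// True when weighted selection would carry no information.
function isDegenerate(weights: number[]): boolean {
  // Assumption beyond the text: Infinity, NaN and negative weights also trigger the fallback.
  if (weights.some((w) => !Number.isFinite(w) || w < 0)) return true;
  // All-equal covers the all-zero case as well.
  const first = weights[0];
  return weights.every((w) => w === first);
}

// Returns the index of the chosen node. rand must return a value in [0, 1).
function pickIndex(weights: number[], rand: () => number = Math.random): number {
  if (weights.length === 0) throw new Error("no weights to choose from");

  // Fallback first: uniform random when the weights cannot discriminate.
  if (isDegenerate(weights)) return Math.floor(rand() * weights.length);

  const total = weights.reduce((sum, w) => sum + w, 0);
  let point = rand() * total;
  let index = 0;
  for (const w of weights) {
    point -= w;
    if (point < 0) return index;
    index++;
  }
  // Floating-point rounding guard: fall back to the last index.
  return weights.length - 1;
}
```

Checking the degenerate case before computing the total keeps a zero or infinite sum out of the draw entirely.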
Cache invalidation design: the aggressive “clear the whole regional cache on any health change” strategy is simple and correct, but it sends a thundering herd at Couchbase the moment a busy node flips. Replacing it with invalidate-then-background-recache is on the list (see limitations).

Scale / constraints

Target fleet size: ~50 nodes.
Stateless service. Scales horizontally behind a load balancer.
Storage: Couchbase Capella, smart_routing bucket, core.nodes collection. Small documents, K/V access patterns — no N1QL on the hot path.
Still under active development. This is the current iteration; the shape of the API and the health-check contract are stabilising but not yet final.

Honest limitations

Called out because they matter, and pretending these gaps are closed would be worse than naming them:

HMAC signature validation on /node-health is designed but not yet implemented. The architecture proposal mandates it; the code does not yet enforce it. Any caller with network reachability can currently push health updates. This is the next thing I’m shipping.
Health-checking is reactive only. The service waits to be told a node is unhealthy; there’s no internal probe loop yet to cross-check those reports. A single bad reporter can therefore push nodes out of rotation on its own, and the fleet can cascade to 503s faster than it should.
No circuit-breaker for in-flight traffic when a node transitions to down. Existing requests to that node will fail rather than being retried against a healthy peer.
Single-flight cache protection is designed but not built. Cold-cache moments can stampede Couchbase. Low impact at 50 nodes, higher impact at 500.
Observability is thin. Pino HTTP logs are in; Prometheus metrics and structured audit logs are on the list but not merged.
Test coverage is uneven. The geo / currency mapping and the weighted-selection logic have unit tests. There are no integration tests for Couchbase CAS contention yet.
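For the single-flight item above, which is designed but not built, one possible shape is a map of in-flight promises keyed by region. This is a sketch of the general technique, not the planned implementation; the SingleFlight class and its run method are hypothetical names.

```typescript
// Collapses concurrent loads for the same key into one underlying call.
class SingleFlight<T> {
  private readonly inFlight = new Map<string, Promise<T>>();

  run(key: string, load: () => Promise<T>): Promise<T> {
    const existing = this.inFlight.get(key);
    // A load for this key is already running: share its promise instead of starting another.
    if (existing) return existing;

    const pending = load();
    this.inFlight.set(key, pending);

    // Forget the entry once the load settles, whether it succeeded or failed,
    // so the next cache miss triggers a fresh load.
    const cleanup = (): void => {
      this.inFlight.delete(key);
    };
    pending.then(cleanup, cleanup);

    return pending;
  }
}
```

On a cold cache, every request for the same region would await the same rebuild, so storage sees one read per region instead of one per request. Failures are shared too: all waiters see the same rejection, which is usually acceptable for a rebuild but worth deciding explicitly.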

What’s next

In rough order:

HMAC validation on /node-health.
Proactive probe loop alongside the reactive reports.
Single-flight cache protection.
Prometheus metrics + audit logs.
Integration test harness for the Couchbase CAS paths.