Routing Algorithms: Requirements and Why Simple Solutions Fail

Five properties a routing algorithm must satisfy - and why round-robin, bucketing, and mapping tables each fail on at least one that can't be compromised.

April 26, 20265 min read2 / 5

The previous post established that routing and sharding are the same decision. That raises the stakes considerably - every algorithm we choose for routing also controls how data distributes across the cluster. Before comparing options, we need a clear benchmark for what a good algorithm actually looks like.

Five Requirements

1. Fast

The routing algorithm runs on every single request - not on writes only, not periodically. On every request. If computing the route adds even a few milliseconds, that latency becomes the dominant cost at scale. The algorithm must be cheap enough to treat as overhead-free.

2. Equal Distribution

With four servers, you do not want 90% of traffic hitting one of them while the others sit idle. Unequal distribution creates hot spots - one overloaded server becomes the bottleneck for the whole system. The algorithm should spread load roughly evenly.

3. Freely Add or Remove Servers

Servers crash. New capacity gets provisioned. The algorithm cannot assume a fixed cluster size. When a server goes down, requests that were going there must be rerouted automatically. When a new server is added, it should start absorbing load without manual reconfiguration.

4. Minimal Data Movement

When cluster topology changes, some data has to move. But the algorithm should keep that movement to the absolute minimum necessary - only move data that genuinely needs to move. A global reshuffle every time one server is added or removed is expensive and a window for data loss.

5. Deterministic Without Information Exchange

Given the same cluster state, the algorithm must always produce the same routing decision for a given user. And critically: if you have multiple Load Balancers, each must arrive at the same decision independently - without coordinating with the others.

This last requirement is the quiet killer for some otherwise-appealing approaches. Keeping distributed state in perfect sync is not just hard - as we will see when we get to the CAP theorem, it is mathematically impossible at low latency.

Round-Robin (Modulo Hashing)

The simplest approach: server = userId % n, where n is the number of active servers.

Plain text
userId=0 → 0 % 4 = 0 → Server A userId=1 → 1 % 4 = 1 → Server B userId=4 → 4 % 4 = 0 → Server A (wraps around)

Round-Robin Routing: Modulo-based assignment ExpandRound-Robin Routing: Modulo-based assignment

The pros are real: fast, equal distribution, and every Load Balancer independently arrives at the same answer. Three of the five requirements satisfied cleanly.

The fatal flaw: when n changes, everything reshuffles. If Server B crashes, n drops from 4 to 3. User 1 now maps to 1 % 3 = 1, which is Server C - but their data is still on Server B. User 4 maps to 4 % 3 = 1, also Server C, but its data is on Server A. Nearly every user gets reassigned, causing massive unnecessary data movement.

Round-robin only works when server count is fixed and never changes. Elasticsearch enforces a static shard count for exactly this reason. In a general-purpose system, server count is never fixed.

Bucketing

A variation: divide the user ID space into fixed-size ranges. With 400 users and 4 servers, Server A gets users 0-99, Server B gets 100-199, and so on.

Same problem. Remove a server, the bucket boundaries shift, and almost everyone gets reassigned. The root cause is identical - any algorithm that uses n in the denominator will reshuffle the entire dataset when n changes.

Mapping Table

A hash map: { userId -> serverId }. When a server crashes, iterate the map, find all users assigned to that server, and reassign them. When a new server is added, pick a proportional set of users and migrate them over.

This is genuinely better. Only the affected users move - data movement is minimal.

Two problems remain:

Problem 1 - Table size: With 5 billion users at roughly 12 bytes per entry (8 bytes for user ID, 4 for server ID), the table is about 60 GB. Large, but manageable on modern hardware.

Problem 2 - Cross-LB synchronization: This is the killer. Multiple Load Balancers must all have identical versions of this table at all times. If two LBs diverge even briefly, they route the same user to different servers - and data scatters. The CAP theorem proves keeping distributed state perfectly in sync at low latency is mathematically impossible.

The mapping table satisfies four of the five requirements and fails catastrophically on the one that cannot be compromised.

Routing Algorithms Compared: Round-robin vs Mapping Table vs Consistent Hashing ExpandRouting Algorithms Compared: Round-robin vs Mapping Table vs Consistent Hashing

None of the three satisfy all five requirements. This is exactly where consistent hashing enters the picture.

The Essentials

  1. A routing algorithm must be fast, evenly distributing, flexible, minimal on data movement, and deterministic without shared state - these five requirements are not negotiable, and every naive approach fails at least one of them.
  2. Round-robin breaks when n changes - modulo arithmetic causes nearly every user to be reassigned when a server is added or removed, making it unsuitable for dynamic clusters.
  3. The mapping table fails on synchronization - it handles data movement well but requires all Load Balancers to share identical state, which the CAP theorem proves impossible at low latency.

Further Reading and Watching

None of the three approaches satisfy all five requirements. The algorithm that does is built on a different geometric insight - treating the output space of a hash function as a ring rather than a line.

Practice what you just read.

Quiz: Routing Algorithms
1 exercise