Consistent Hashing: Operations and Internals

No-coordination property, adding and removing servers, cascading failure prevention, virtual node distribution, weighted nodes, and hash collisions.

April 4, 2026 · 6 min read · Part 6 of 7

No Coordination Between Load Balancers

Round-robin needed no coordination. The mapping table needed constant sync. Consistent hashing needs none.

Every load balancer runs the same code, uses the same hash functions, and uses the same value of k. Each one independently monitors server health via health checks and maintains its own copy of the server list. When all load balancers have the same server list, they all compute the exact same ring independently.

If Server B goes down, every load balancer detects this via its own health checks. Every load balancer removes Server B's virtual nodes from its ring independently. No message needs to pass between load balancers. They converge to the same ring on their own, because the ring is fully deterministic from the server list.

This is the property that makes consistent hashing work at scale: each load balancer can independently compute the correct routing for any request, without asking anyone else.
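A minimal sketch of this determinism (server names, user IDs, and the choice of SHA-256 and k are all illustrative, not a production recipe):

```python
import hashlib
from bisect import bisect_left

K = 64  # virtual nodes per server; an illustrative value

def position(key: str) -> int:
    """Map a string to a 64-bit point on the ring."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers):
    """The ring is a pure function of the server list: same list, same ring."""
    return sorted((position(f"{s}:{i}"), s) for s in servers for i in range(K))

def route(ring, user_id: str) -> str:
    """Walk clockwise from the user's position to the next virtual node."""
    idx = bisect_left(ring, (position(user_id),))
    return ring[idx % len(ring)][1]  # wrap around past the top of the ring

# Two load balancers that agree on the server list compute identical rings
# without exchanging a single message.
lb1 = build_ring(["ServerA", "ServerB", "ServerC"])
lb2 = build_ring(["ServerC", "ServerB", "ServerA"])  # list order is irrelevant
assert lb1 == lb2
assert route(lb1, "user42") == route(lb2, "user42")
```

Because `build_ring` is deterministic, "sync" between load balancers reduces to agreeing on the server list, which each one derives from its own health checks.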

Adding and Removing Servers

Adding a Server

When a new server C is added, it gets k virtual nodes placed on the ring. Some of those positions fall between existing server nodes. Any user whose clockwise walk now hits a C node before it would have hit A or B will be routed to C.

Critically, C takes load from multiple servers -- a few users from A and a few from B. It does not empty one server and leave another untouched. The load redistributes roughly evenly across all servers.

Only the users now assigned to C need their data moved. Users still assigned to A or B are unaffected. Their routing did not change and their data does not move.
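The "only reassigned users move" property can be checked directly. This sketch uses a SHA-256 ring with illustrative names and a small synthetic user population:

```python
import hashlib
from bisect import bisect_left

K = 64  # virtual nodes per server; illustrative

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers):
    return sorted((position(f"{s}:{i}"), s) for s in servers for i in range(K))

def route(ring, user_id: str) -> str:
    idx = bisect_left(ring, (position(user_id),))
    return ring[idx % len(ring)][1]

users = [f"user{i}" for i in range(1000)]
before = build_ring(["ServerA", "ServerB"])
after = build_ring(["ServerA", "ServerB", "ServerC"])  # ServerC joins

moved = [u for u in users if route(before, u) != route(after, u)]
# Every user whose routing changed now lands on the new server;
# everyone else keeps their old assignment, so their data stays put.
assert all(route(after, u) == "ServerC" for u in moved)
assert 0 < len(moved) < len(users)
```

Adding C only inserts virtual nodes; A's and B's nodes are untouched, so the only way a user's route can change is by hitting a C node first.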

Removing (or Crashing) a Server

When server C crashes, all its virtual nodes are removed from the ring. Users that were assigned to C's nodes now walk clockwise to the next server node, which might be A or B depending on the ring position.

Because the virtual nodes of C were spread around the ring, the users of C spread to multiple different servers -- not all to one. The load is absorbed roughly evenly.

Users assigned to A or B were not affected. Their routing stays the same. Their data stays in place.
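The mirror-image property holds on removal, and can be sketched the same way (same illustrative SHA-256 ring as above):

```python
import hashlib
from bisect import bisect_left

K = 64  # virtual nodes per server; illustrative

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers):
    return sorted((position(f"{s}:{i}"), s) for s in servers for i in range(K))

def route(ring, user_id: str) -> str:
    idx = bisect_left(ring, (position(user_id),))
    return ring[idx % len(ring)][1]

users = [f"user{i}" for i in range(1000)]
before = build_ring(["ServerA", "ServerB", "ServerC"])
after = build_ring(["ServerA", "ServerB"])  # ServerC has crashed

c_users = [u for u in users if route(before, u) == "ServerC"]
survivors = {route(after, u) for u in c_users}
assert len(survivors) > 1  # C's users scatter; they don't pile onto one server

# Users of A and B are untouched: same routing, data stays in place.
unaffected = [u for u in users if route(before, u) != "ServerC"]
assert all(route(after, u) == route(before, u) for u in unaffected)
```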

Figure: Adding and Removing Servers in Consistent Hashing

What Actually Moves the Data?

When a server crashes or is replaced, the data cannot be recovered from the dead machine. The data migration relies on replication -- backups that were created while the server was alive. Consistent hashing tells you which users need their data moved and where. Replication provides the source data for the move.

Cascading Failure Prevention

Consider what happens with a naive routing scheme (like round-robin with one server per position) when a server crashes:

All users of the crashed server get routed to a single neighbouring server. That server now handles double its previous load. It crashes. Its users get dumped onto the next server. That crashes too. The failure propagates through the entire cluster.

This is called a cascading failure.

Consistent hashing prevents it. Because each server has k virtual nodes spread across the ring, a crashed server's users scatter to k different neighbours. No single surviving server absorbs more than a small fraction of the extra load. The cluster absorbs the failure gracefully without any server tipping over.
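The scatter effect can be measured. In this sketch (hypothetical ten-server fleet, SHA-256, k = 64), no survivor of a crash absorbs anywhere near the crashed server's full load:

```python
import hashlib
from bisect import bisect_left
from collections import Counter

K = 64  # virtual nodes per server; illustrative

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers):
    return sorted((position(f"{s}:{i}"), s) for s in servers for i in range(K))

def route(ring, user_id: str) -> str:
    idx = bisect_left(ring, (position(user_id),))
    return ring[idx % len(ring)][1]

servers = [f"Server{i}" for i in range(10)]
users = [f"user{i}" for i in range(10_000)]

before = Counter(route(build_ring(servers), u) for u in users)
after = Counter(route(build_ring(servers[:-1]), u) for u in users)  # Server9 crashes

crashed_load = before["Server9"]
# The crashed server's users scatter across the neighbours of its k virtual
# nodes: no single survivor absorbs even half of the orphaned load.
for s in servers[:-1]:
    assert after[s] - before[s] < crashed_load / 2
```

In the round-robin-style failure mode described above, one neighbour would inherit all of `crashed_load`; here each survivor's gain stays close to `crashed_load / 9`.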

Why More Virtual Nodes Produce a More Equal Distribution

Hash functions are deterministic but unpredictable. Given a new input, you have no idea what output you will get. The output appears random even though it is calculated.

Consider three servers A, B, and C with k = 1 (one virtual node each). The hash function assigns each server one position on the ring. Because the positions are effectively random, they are unlikely to divide the ring into three equal arcs. You might end up with A occupying 68% of the ring, B occupying 7%, and C occupying 25%. A receives the majority of all requests.

Now increase to k = 64. Each server gets 64 positions distributed across the ring. Each individual position is still random, but 64 independent random positions are much less likely to all cluster in one region. The law of large numbers applies: with enough independent random samples from a uniform distribution, the aggregate coverage approaches uniformity.

Intuitively: if you throw one dart at a circle, it might land anywhere. If you throw 64 darts, they will spread across the circle much more evenly. The worst-case imbalance shrinks dramatically as the number of throws increases.
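The dart-throwing intuition can be simulated. This sketch (hypothetical ten-server fleet, SHA-256 positions) measures the largest share of the ring owned by any one server, for k = 1 versus k = 100:

```python
import hashlib

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

SPACE = 2 ** 64  # size of the ring's position space

def max_share(servers, k):
    """Largest fraction of the ring owned by any single server."""
    ring = sorted((position(f"{s}:{i}"), s) for s in servers for i in range(k))
    arcs = {s: 0 for s in servers}
    # The arc ending at a virtual node belongs to that node's server,
    # because the clockwise walk from anywhere on the arc lands there.
    for (p, _), (q, owner) in zip(ring, ring[1:] + ring[:1]):
        arcs[owner] += (q - p) % SPACE  # modulo handles the wrap-around arc
    return max(arcs.values()) / SPACE

servers = [f"Server{i}" for i in range(10)]
# More virtual nodes per server -> the worst-off server owns a share
# much closer to the ideal 1/10 of the ring.
assert max_share(servers, 100) < max_share(servers, 1)
```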

This is the reason k is chosen to be a moderately large number (typically 64 to 200 in production systems). The trade-off is memory: the sorted ring array grows with k × numServers. In practice, even k = 200 with 10,000 servers produces an array of 2,000,000 entries, which is well within the capacity of a modern load balancer's RAM.

Creating k Virtual Nodes Without k Separate Hash Functions

If you need 64 virtual nodes per server, do you need 64 different hash functions?

No. You parameterise the same hash function with a key.

SHA256("ServerA" + "1") → position p1
SHA256("ServerA" + "2") → position p2
SHA256("ServerA" + "3") → position p3
...
SHA256("ServerA" + "64") → position p64

Passing the same server name with a different numeric suffix produces a different output each time. These behave as independent hash functions -- they produce different positions on the ring -- even though they all call the same underlying function.

This is the same idea as salting passwords: you add a varying value to the input to produce different outputs from a single function. You can generate any number of virtual node positions for any server with a single hash function and a loop from 1 to k.
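In code, the loop above is a few lines (SHA-256 and the `"server:i"` key format are illustrative choices, not the only option):

```python
import hashlib

def vnode_positions(server: str, k: int):
    """Derive k ring positions from one hash function by varying a suffix."""
    positions = []
    for i in range(1, k + 1):
        digest = hashlib.sha256(f"{server}:{i}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big"))  # 64-bit position
    return positions

pos = vnode_positions("ServerA", 64)
assert len(set(pos)) == 64  # all 64 positions come out distinct in practice
```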

Hash Collisions on the Ring

A collision occurs when two virtual node hashes produce the same ring position. This is possible but extremely unlikely in practice.

Consider a 64-bit hash function like MurmurHash3. The output space is 2⁶⁴ -- roughly 18 billion billion distinct values. The probability that two specific virtual nodes hash to the same position is 1 in 2⁶⁴.

When a collision does happen between two virtual nodes, tie-breaking resolves which server "wins" the position -- for example, by always preferring the server with the lexicographically smaller name. The losing server effectively has one fewer virtual node: 63 instead of 64. Its share of the ring decreases by roughly 1/64, which is about 1.5%. The distribution remains almost equal. No special-case handling is required; the algorithm continues to work.
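The tie-break folds naturally into ring construction. A sketch using the lexicographic rule described above (SHA-256 and the names are illustrative):

```python
import hashlib

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def build_ring(servers, k=64):
    ring = {}  # ring position -> owning server
    for s in servers:
        for i in range(1, k + 1):
            p = position(f"{s}:{i}")
            # On a collision the lexicographically smaller name wins;
            # the loser simply ends up with one fewer virtual node.
            if p not in ring or s < ring[p]:
                ring[p] = s
    return sorted(ring.items())

ring = build_ring(["ServerA", "ServerB", "ServerC"])
# With a 64-bit space, collisions are vanishingly rare: all 192 positions
# in this small example come out distinct.
assert len(ring) == 3 * 64
```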

Weighted Virtual Nodes: Giving Powerful Servers More Load

The vanilla consistent hashing algorithm assigns the same number of virtual nodes (k) to every server regardless of its hardware. This means a 64-core machine with 512 GB of RAM gets the same share of traffic as a 4-core machine with 16 GB.

If your cluster has heterogeneous hardware, you can extend the algorithm by assigning different values of k per server based on its capacity. A server with twice the CPU and RAM gets twice as many virtual nodes on the ring, which means it receives roughly twice the traffic.

One practical way to implement this: when the load balancer sends a health check ping to a server, it also fetches the server's hardware configuration -- number of cores, available RAM, disk type. The load balancer uses that configuration to decide how many virtual nodes to assign that server.
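A sketch of the weighted variant, assuming capacity has already been normalised to small integer units (the `BASE_K` constant and server names are hypothetical):

```python
import hashlib

BASE_K = 16  # virtual nodes per unit of capacity; a hypothetical tuning knob

def position(key: str) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big")

def weighted_ring(capacities):
    """capacities: server name -> relative capacity (e.g. normalised cores)."""
    ring = []
    for server, cap in capacities.items():
        for i in range(1, BASE_K * cap + 1):  # bigger machine, more vnodes
            ring.append((position(f"{server}:{i}"), server))
    ring.sort()
    return ring

# A machine with twice the capacity holds twice the virtual nodes,
# and therefore owns roughly twice the ring and receives twice the traffic.
ring = weighted_ring({"big-server": 2, "small-server": 1})
assert sum(s == "big-server" for _, s in ring) == 2 * BASE_K
```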

This is a modification on top of the base algorithm. The standard consistent hashing treats all servers equally. The weighted variant is an extension you would build when operating a fleet with meaningfully different machine sizes.
