What is High-Level Design (HLD)?

Hands-on practice for this lecture. Work through the exercises and quizzes to reinforce what you've learned.


Exercise 1 of 2

Sort 50 Petabytes — Why a Single Machine Fails

sorted() needs the entire input in RAM. 50 PB is 3,125,000× the 16 GB of RAM on a single machine. Find the number of servers that splits the data into chunks that actually fit.

Sort 50 PB on a single 16 GB machine

File size vs available RAM

File to sort: 50 PB
Available RAM: 16 GB (invisible at this scale)
Ratio: 3,125,000× more data than available RAM

50,000,000 GB cannot fit in 16 GB — not even close
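The ratio above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 PB = 1,000,000 GB) and that every server has the same 16 GB of RAM and must hold its whole chunk in memory; the variable names are illustrative only. Servers with more RAM, or servers that sort on disk, would change the count.

```python
import math

PB_IN_GB = 1_000_000  # assumption: decimal units, 1 PB = 1,000,000 GB

data_gb = 50 * PB_IN_GB  # 50 PB expressed in GB
ram_gb = 16              # RAM per machine, from the exercise

# How many times larger the file is than one machine's RAM.
ratio = data_gb / ram_gb

# Lower bound on servers if each chunk must fit entirely in 16 GB of RAM
# (ignores interpreter and OS overhead, which would push the number higher).
min_servers = math.ceil(data_gb / ram_gb)

print(f"{data_gb:,} GB / {ram_gb} GB = {ratio:,.0f}x")
print(f"servers needed at 16 GB per chunk: {min_servers:,}")
```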

sort.py
with open("data.txt") as f:
    lines = f.readlines()  # loads ALL 50 PB into RAM
print(sorted(lines))  # never reached

❌ readlines() tries to pull all 50,000,000 GB into memory. RAM runs out long before the end of the file, so the process fails with a MemoryError or is killed by the OS, and sorted() is never reached.
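One way to avoid the failure is to read a bounded number of lines at a time and sort each chunk on its own. The sketch below is an illustration under assumptions, not the exercise's own solution: `sorted_runs` and `max_lines` are hypothetical names, and a short in-memory list stands in for the file (an open file object is also an iterable of lines, so it could be passed in the same way).

```python
from itertools import islice

def sorted_runs(lines, max_lines):
    """Yield sorted chunks ("runs") of at most max_lines lines each."""
    it = iter(lines)
    while True:
        # Pull at most max_lines items; memory use is bounded by the chunk size.
        chunk = list(islice(it, max_lines))
        if not chunk:
            return
        chunk.sort()
        yield chunk

# Stand-in for a huge file: five lines, chunks of two lines each.
sample = ["mango\n", "apple\n", "zebra\n", "grape\n", "banana\n"]
runs = list(sorted_runs(sample, max_lines=2))
print(runs)
```

Each run is sorted internally, but the runs are not sorted relative to each other. In a real system each run would be written to disk or handed to a different server rather than collected in one list.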


Exercise 2 of 2

MapReduce: How 50 PB Gets Sorted Across Thousands of Servers

Step through the four phases — raw data, local sort, shuffle by key range, and k-way merge — with 3 servers and 9 words standing in for 50,000 servers and 50 PB.
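The four phases can be simulated in a single process using the 3 servers and 9 words from this exercise. This is a sketch under assumptions: the key-range boundaries ("h" and "n") and the helper `reducer_for` are made up for illustration, and plain lists stand in for network transfers between servers.

```python
import heapq

# Phase 1: raw data, one unsorted slice per server.
servers = [
    ["mango", "apple", "zebra"],
    ["grape", "banana", "lemon"],
    ["peach", "cherry", "melon"],
]

# Phase 2: local sort, each server sorts only its own slice.
local = [sorted(words) for words in servers]

# Phase 3: shuffle by key range (boundaries are an assumption of this sketch).
def reducer_for(word):
    if word < "h":
        return 0  # a..g
    if word < "n":
        return 1  # h..m
    return 2      # n..z

# inbox[reducer][source] holds the words one source server sent to one reducer.
# Each source list was already sorted, so every inbox list stays sorted too.
inbox = [[[] for _ in servers] for _ in range(3)]
for source, words in enumerate(local):
    for word in words:
        inbox[reducer_for(word)][source].append(word)

# Phase 4: k-way merge, each reducer merges its sorted incoming lists.
merged = [list(heapq.merge(*incoming)) for incoming in inbox]

# Reducers own disjoint, ordered key ranges, so joining them in range order
# gives the globally sorted result without any further sorting.
result = [word for part in merged for word in part]
print(result)
```

No step ever sorts all 9 words in one place: every sort and merge touches only one server's share of the data.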

Merge data from 3 servers — naive approach

Three servers each hold a slice of the dataset. The naive approach: concatenate all three lists and call sort() on the combined result.

Server 1
mango, apple, zebra
Server 2
grape, banana, lemon
Server 3
peach, cherry, melon

Result: concatenate all three lists

mango, apple, zebra, grape, banana, lemon, peach, cherry, melon

❌ The output is unsorted — it's just the three lists stuck together. You'd need a full sort() pass over all 9 items on a single machine. At petabyte scale, that single machine doesn't exist.
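The naive approach is easy to reproduce. The short sketch below uses the same three lists (variable names are illustrative) and confirms that concatenation alone leaves the data unsorted, so a full sort over the combined list is still required.

```python
server_1 = ["mango", "apple", "zebra"]
server_2 = ["grape", "banana", "lemon"]
server_3 = ["peach", "cherry", "melon"]

# Naive step: stick the three lists together.
combined = server_1 + server_2 + server_3

# Check whether every item is <= its successor.
is_sorted = all(a <= b for a, b in zip(combined, combined[1:]))
print(combined)
print("sorted?", is_sorted)

# The fix only works when all items fit on the one machine doing the sort.
fixed = sorted(combined)
print(fixed)
```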

Practice: What is High-Level Design (HLD)? — Interactive Exercises | Durgesh Rai