Hands-on practice for this lecture. Work through the exercises and quizzes to reinforce what you've learned.
Exercise 1 of 2
sorted() loads everything into RAM. 50 PB is 3,125,000× larger than a 16 GB machine. Find the number of servers that splits the data into chunks that actually fit.
File size vs available RAM
3,125,000× more data than available RAM
50,000,000 GB cannot fit in 16 GB — not even close
with open("data.txt") as f:
    lines = f.readlines()  # loads ALL 50 PB into RAM
print(sorted(lines))  # never reached
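readlines() tries to build one in-memory list holding all 50,000,000 GB. The machine runs out of RAM long before the read completes, so the program fails (typically with a MemoryError, or the OS kills the process) and sorted() is never reached.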
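The arithmetic behind this exercise can be checked with a short sketch. It assumes decimal units (1 PB = 1,000,000 GB), an even split across servers, and that "fits" means a whole chunk fits in one machine's 16 GB of RAM; the lecture may define "fits" differently (for example, fitting on local disk). The helper names chunk_size_gb and fits_in_ram are hypothetical.

```python
import math

TOTAL_GB = 50_000_000  # 50 PB, assuming 1 PB = 1,000,000 GB
RAM_GB = 16            # RAM of a single machine, from the exercise


def chunk_size_gb(num_servers: int) -> float:
    """Size of each server's slice if the data is split evenly."""
    return TOTAL_GB / num_servers


def fits_in_ram(num_servers: int) -> bool:
    """True if one evenly sized chunk fits entirely in one machine's RAM."""
    return chunk_size_gb(num_servers) <= RAM_GB


# How many times larger the data is than one machine's RAM.
ratio = TOTAL_GB / RAM_GB

# Smallest server count for which every chunk fits in RAM (under the assumptions above).
min_servers = math.ceil(TOTAL_GB / RAM_GB)

print(f"data / RAM ratio: {ratio:,.0f}x")
for n in (1, 1_000_000, min_servers):
    print(f"{n:>9,} servers: {chunk_size_gb(n):>14,.1f} GB each, fits in RAM: {fits_in_ram(n)}")
```

Under these assumptions the ratio and the minimum server count are the same number, because each server must hold at most one RAM's worth of data.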
Exercise 2 of 2
Step through the four phases — raw data, local sort, shuffle by key range, and k-way merge — with 3 servers and 9 words standing in for 50,000 servers and 50 PB.
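A minimal sketch of the four phases on one machine, with plain lists standing in for servers. The nine words, the key-range boundaries (a-g, h-o, p-z), and the names target_server and inbox are assumptions for illustration, not taken from the exercise.

```python
import heapq

# Phase 1: raw data. Hypothetical words; each inner list is one server's slice.
servers = [
    ["pear", "apple", "mango"],
    ["zebra", "cherry", "kiwi"],
    ["banana", "grape", "orange"],
]

# Phase 2: local sort. Each server sorts only its own slice.
locally_sorted = [sorted(chunk) for chunk in servers]


# Phase 3: shuffle by key range. Assumed boundaries: a-g, h-o, p-z.
def target_server(word: str) -> int:
    """Return the index of the server that owns this word's key range."""
    first = word[0]
    if first <= "g":
        return 0
    if first <= "o":
        return 1
    return 2


# inbox[dest][src] holds the sorted run sent from server src to server dest.
inbox = [[[] for _ in servers] for _ in servers]
for src, chunk in enumerate(locally_sorted):
    for word in chunk:  # iterating a sorted chunk keeps each run sorted
        inbox[target_server(word)][src].append(word)

# Phase 4: k-way merge. Each server merges the sorted runs it received.
merged = [list(heapq.merge(*runs)) for runs in inbox]

# Reading the servers in key-range order yields the globally sorted result.
result = [word for part in merged for word in part]
print(result)
```

Because every word on server 0 precedes every word on server 1, and so on, no server ever needs to see the whole dataset; heapq.merge only walks already-sorted runs.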
Three servers each hold a slice of the dataset. The naive approach: concatenate all three lists and call sort() on the combined result.
Result: concatenate all three lists
The output is unsorted: it's just the three lists stuck together. You'd need a full sort() pass over all 9 items on a single machine. At petabyte scale, that single machine doesn't exist.
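The naive step can be reproduced in a few lines. The word lists below are hypothetical stand-ins for the three servers' slices.

```python
# Hypothetical slices held by three servers.
server_a = ["pear", "apple", "mango"]
server_b = ["zebra", "cherry", "kiwi"]
server_c = ["banana", "grape", "orange"]

# Concatenation only glues the lists together; it does not order them.
combined = server_a + server_b + server_c
print(combined)

# Checking against a real sort confirms the combined list is not sorted.
is_sorted = combined == sorted(combined)
print(is_sorted)

# Fixing it requires one machine to hold and sort all 9 items at once.
fully_sorted = sorted(combined)
print(fully_sorted)
```

With 9 words the final sorted() call is trivial; the point is that it needs every item in one place, which is exactly what stops working at petabyte scale.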