World’s Largest Data Storage System - S3

Sep 29, 2024

Source: How AWS Simple Storage Service (S3) Works

Amazon S3 is a massive object storage service with an HTTP REST API. Under the hood, it has several key parts:

A fleet of web servers handling incoming client requests
A global namespace service (tracking object locations)
A huge storage fleet (HDDs and some SSDs)
A background operations fleet (managing maintenance tasks)

The Magic of HDDs

S3 still uses many mechanical hard drives because they’re good for sequential I/O and cheaper than SSDs for bulk storage. Here’s a wild fact about modern HDD precision:

Imagine a Boeing 747 flying 75 mph over a large field of grass. The gap between the plane and the grass would be about two sheets of paper, and the “track width” on the spinning disk is only about 4.6 blades of grass wide. These drives miss reading a blade of grass about once every 25,000 times that plane circles the earth!

That kind of precision lets HDDs reach a data density of about 15 GB/cm². But they have a big limitation: random access is slow because the drive head has to physically move. An HDD might only do ~120 random I/O operations per second, which severely limits random access at large scale.

Scale and Heat Management

S3 uses the largest HDDs available (26 TB or more) and deals with heat management, i.e., avoiding hotspots. Hotspots happen when a few drives get overloaded, causing slow performance. The challenge is we don’t know future access patterns at the time data is written, yet we have to decide how to place that data.

A key solution is that:

Aggregate demand (from millions of customers) becomes more predictable overall.
S3 spreads and replicates data widely so individual disks only see a small fraction of a single customer’s workload.

Redundancy at Scale

Replication – Storing multiple copies (mirroring) of data across disks for both durability and performance.
Erasure Coding – Achieves better storage efficiency than full replication by splitting data into chunks and storing them with parity blocks. This protects against data loss without paying for 100% or 200% overhead.

By spreading objects across many disks, each customer can burst to huge I/O rates. Even with each disk limited to ~120 IOPS, having thousands of disks in parallel can handle massive workloads that would be impossible on a single, smaller system.