1
Finding a needle in Haystack:
Facebook’s photo storage
IBM Haifa Research
Storage Systems
Some Numbers (2010)
Over 260 Billion images (20 PB)
– 65 Billion X 4 different sizes for each image.
1 Billion (60 TB) are uploaded each week
3
Motivation
Started with traditional NFS system shouldered by a CDN.
The long tail pattern of access to photos, leaves much of the traffic to the
NFS system.
Key observation: The NFS system does not withstand the amount of
requests due to excessive amount of metadata disk operations
The NFS Based Design
• 80% CDN Hit Rate.
• Rest of 20% are on a long tail distribution, which is not cacheable.
Picture taken from “
Finding a needle in Haystack: Facebook’s photo
storage”,
OSDI'10 Proceedings
5
NFS and The Metadata Bottleneck
1. Starting point: More than 10 disk operations to retrieve a single image (thousands of images per directory)
2. Reducing directory size to hundreds images led to 3 disk operations
• Read directory metadata
• Load inode
• Read file content
3. Caching inodes
• Caching all inodes is an expensive requirement for current filesystems
The New Approach
Reduce the amount of filesystem per image metadata so it can all fit into
main memory.
Aggregate a 100GB worth of images into one single “file” or volume.
7
The Usual Design Goals 1/2
A storage system for write once, read often, never modified, and rarely deleted data.
High Throughput and Low Latency: – Need to facilitate good user experience
– Measurements show up to 12 ms (measured on the storage machine)
– Achieved by:
- keeping all metadata in main memory (ala GFS)
- Log structured multi writes/append only operations
Fault-tolerance:
– Replication in geographically distinct locations
– When a replica is lost, a new one is created
The Usual Design Goals 2/2
Cost Effective - Comparing their previous NFS Solution:
– Cost per usable terabyte of storage is 28% less
– X4 application layer read rate per terabyte of usable storage
9
1. 10TB of Server Capacity is organized as 100 physical volumes of 100 GB of storage each.
2. Physical volumes are grouped into logical volumes.
3. When a photo is stored in a logical volume, it is written to all corresponding physical volumes Maintains logical to physical
mapping Additional CDN Layer
Overview of the Haystack Architecture
Picture taken from “
Finding a needle in Haystack: Facebook’s photo
storage”,
OSDI'10 Proceedings
Serving A Photo
http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
Picture taken from “
Finding a needle in Haystack: Facebook’s photo
storage”,
OSDI'10 Proceedings
11
Uploading A Photo
Picture taken from “
Finding a needle in Haystack: Facebook’s photo
storage”,
OSDI'10 Proceedings
The Haystack Directory
Mapping from logical volumes to physical volumes. (Placement table)
– What about photo id to logical volumes mapping?
Identify read-only logical volumes
– Reached their storage capacity
– Due to operational reasons
Load balance writes across write-enabled logical storage volumes
13
The Haystack Store
Each store machine manages multiple physical volumes
Each physical volume can be thought of as large file (~100GB) saved as /hay/haystack_<logical volume id>
Keeps open file descriptor for each managed physical volume (xfs)
Keeps in memory mapping: <photo ID> ---> <file, offset, size>
The Physical Volume Structure
File,offset, size Photo ID
On Disk
In Memory
Picture taken from “
Finding a needle in Haystack: Facebook’s photo
storage”,
OSDI'10 Proceedings
15
Store Basic Operations
Read
– Get <logical volume id, key, alternate key, cookie> from Cache
– Lookup in memory metadata, if the photo exists/not marked as deleted
– seek
– read the entire needle (data + metadata)
– Verify cookie, integrity
– Return data to cache machine
Write
– Get <logical volume id, key, alternate key, cookie, data> from web server
– Synchronously append needle images to the appropriate physical volume
– Update in memory structure
Modify (e.g. when a photo is rotated)
– The new version is either written to a new logical volume, requiring a metadata update by the directory
– Or the new version is written to the same physical volume in a higher offset
Delete and Compact
– Set the delete flag both in memory and on disk synchronously
– Write the whole logical file into a new one skipping deleted photos. 25% of the photos gets deleted.
Recovery from Failures
Arsenal
– Replication
– RAID-6
– “pitchfork” background process that:
- Tests connections to store machines
- Checks availability of volume files
- Attempts to read data from store machines
17