• No results found

Finding a needle in Haystack: Facebook s photo storage IBM Haifa Research Storage Systems

N/A
N/A
Protected

Academic year: 2021

Share "Finding a needle in Haystack: Facebook s photo storage IBM Haifa Research Storage Systems"

Copied!
17
0
0

Loading.... (view fulltext now)

Full text

(1)

1

Finding a needle in Haystack:

Facebook’s photo storage

IBM Haifa Research

Storage Systems

(2)

Some Numbers (2010)



Over 260 Billion images (20 PB)

– 65 Billion X 4 different sizes for each image.



1 Billion (60 TB) are uploaded each week

(3)

3

Motivation



Started with traditional NFS system shouldered by a CDN.



The long tail pattern of access to photos, leaves much of the traffic to the

NFS system.



Key observation: The NFS system does not withstand the amount of

requests due to excessive amount of metadata disk operations

(4)

The NFS Based Design

• 80% CDN Hit Rate.

• Rest of 20% are on a long tail distribution, which is not cacheable.

Picture taken from “

Finding a needle in Haystack: Facebook’s photo

storage”,

OSDI'10 Proceedings

(5)

5

NFS and The Metadata Bottleneck

1. Starting point: More than 10 disk operations to retrieve a single image (thousands of images per directory)

2. Reducing directory size to hundreds images led to 3 disk operations

• Read directory metadata

• Load inode

• Read file content

3. Caching inodes

• Caching all inodes is an expensive requirement for current filesystems

(6)

The New Approach



Reduce the amount of filesystem per image metadata so it can all fit into

main memory.



Aggregate a 100GB worth of images into one single “file” or volume.

(7)

7

The Usual Design Goals 1/2

 A storage system for write once, read often, never modified, and rarely deleted data.

 High Throughput and Low Latency: – Need to facilitate good user experience

– Measurements show up to 12 ms (measured on the storage machine)

– Achieved by:

- keeping all metadata in main memory (ala GFS)

- Log structured multi writes/append only operations

 Fault-tolerance:

– Replication in geographically distinct locations

– When a replica is lost, a new one is created

(8)

The Usual Design Goals 2/2



Cost Effective - Comparing their previous NFS Solution:

– Cost per usable terabyte of storage is 28% less

– X4 application layer read rate per terabyte of usable storage

(9)

9

1. 10TB of Server Capacity is organized as 100 physical volumes of 100 GB of storage each.

2. Physical volumes are grouped into logical volumes.

3. When a photo is stored in a logical volume, it is written to all corresponding physical volumes Maintains logical to physical

mapping Additional CDN Layer

Overview of the Haystack Architecture

Picture taken from “

Finding a needle in Haystack: Facebook’s photo

storage”,

OSDI'10 Proceedings

(10)

Serving A Photo

http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>

Picture taken from “

Finding a needle in Haystack: Facebook’s photo

storage”,

OSDI'10 Proceedings

(11)

11

Uploading A Photo

Picture taken from “

Finding a needle in Haystack: Facebook’s photo

storage”,

OSDI'10 Proceedings

(12)

The Haystack Directory

 Mapping from logical volumes to physical volumes. (Placement table)

– What about photo id to logical volumes mapping?

 Identify read-only logical volumes

– Reached their storage capacity

– Due to operational reasons

 Load balance writes across write-enabled logical storage volumes

(13)

13

The Haystack Store

 Each store machine manages multiple physical volumes

 Each physical volume can be thought of as large file (~100GB) saved as /hay/haystack_<logical volume id>

 Keeps open file descriptor for each managed physical volume (xfs)

 Keeps in memory mapping: <photo ID> ---> <file, offset, size>

(14)

The Physical Volume Structure

File,offset, size Photo ID

On Disk

In Memory

Picture taken from “

Finding a needle in Haystack: Facebook’s photo

storage”,

OSDI'10 Proceedings

(15)

15

Store Basic Operations

 Read

– Get <logical volume id, key, alternate key, cookie> from Cache

– Lookup in memory metadata, if the photo exists/not marked as deleted

– seek

– read the entire needle (data + metadata)

– Verify cookie, integrity

– Return data to cache machine

 Write

– Get <logical volume id, key, alternate key, cookie, data> from web server

– Synchronously append needle images to the appropriate physical volume

– Update in memory structure

 Modify (e.g. when a photo is rotated)

– The new version is either written to a new logical volume, requiring a metadata update by the directory

– Or the new version is written to the same physical volume in a higher offset

 Delete and Compact

– Set the delete flag both in memory and on disk synchronously

– Write the whole logical file into a new one skipping deleted photos. 25% of the photos gets deleted.

(16)

Recovery from Failures



Arsenal

– Replication

– RAID-6

– “pitchfork” background process that:

- Tests connections to store machines

- Checks availability of volume files

- Attempts to read data from store machines

(17)

17

Reference



Doug Beaver, Sanjeev Kumar, Harry C. Li, Jason Sobel, and Peter Vajgel. 2010.

Finding a needle in Haystack: facebook's photo storage. In Proceedings of the 9th

USENIX conference on Operating systems design and implementation (OSDI'10).

USENIX Association, Berkeley, CA, USA, 1-8.

References

Related documents

s-process p-process Mass known Half-life known

Both the Cyber Security Research and Development Act of 2002 (PL 107-305) and the Homeland Security Act established new programs and authorized new funds for cybersecurity

Figure 7.22 Tensile fractured surface of non-modified OPA-filled natural rubber vulcanizates and CTAB-modified OPA- filled natural rubber vulcanizates at varying OPA loading

In memory of Harold Taub, beloved husband of Paula Taub by: Karen &amp; Charles Rosen.. Honouring Maria Belenkova-Buford on her marriage by: Karen &amp;

Presence of such trends in field of international students mobility within Ukrainian universities indicates at two major problems: (1) existence of information gap among

“ the seven words of our LORD on

To address these questions, the following goals were set: (a) to reproduce field explosions pertaining to primary blast injury as accurate as possible in a controlled