Haoyuan Li, Tachyon Nexus
September 30, 2015 @ Strata and Hadoop World NYC 2015
An Open Source Memory-Centric
Distributed Storage System
Outline
• Open Source
• Introduction to Tachyon
• New Features
History
• Started at UC Berkeley AMPLab
– From summer 2012
– Same lab produced Apache Spark and Apache Mesos
• Open sourced
– April 2013
– Apache License 2.0
– Latest Release: Version 0.7.1 (August 2015)
Contributors Growth
Figure: number of contributors at each release (v0.1 Dec 2012, v0.2 Apr 2013, v0.3 Oct 2013, v0.4 Feb 2014, v0.5 Jul 2014, v0.6 Mar 2015, v0.7 Jul 2015). Values shown on the chart: 1, 3, 15, 30, 46, 70.
Contributors Growth
> 150 Contributors (3x increase since the last Strata NYC)
Contributors Growth
One of the Fastest Growing Big Data Open Source Projects
Thanks to Contributors and Users!
One Tachyon Production Deployment Example
• Baidu (Dominant Search Engine in China, ~50 Billion USD Market Cap)
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Storage Media: MEM + HDD
• 100+ node deployment
• 1PB+ managed space
Outline
• Open Source
• Introduction to Tachyon
• New Features
Tachyon is an Open Source Memory-Centric Distributed Storage System
Performance Trend: Memory is Fast
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Price Trend: Memory is Cheaper
Is the
Missing a Solution
A Use Case Example with Spark
• Fast, in-memory data processing framework
– Keep one in-memory copy inside JVM
– Track lineage of operations used to derive data
– Upon failure, use lineage to recompute data
Figure: Lineage Tracking (a chain of map, filter, join, and reduce operations)
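The lineage idea in the bullets above can be illustrated with a toy model. This is a minimal sketch in plain Python, not Spark's implementation or API: the `Dataset` class, its fields, and `lose_cache` are hypothetical names invented for this example. Each dataset records the operation that derived it from its parent, keeps one in-memory copy, and recomputes that copy from the lineage after it is lost.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Dataset:
    """Toy dataset that remembers how it was derived (hypothetical, not Spark's RDD)."""
    parent: Optional["Dataset"]                      # upstream dataset; None for a source
    op: Optional[Callable[[List[int]], List[int]]]   # operation applied to the parent's data
    source: Optional[List[int]] = None               # durable input; only set on a source
    cache: Optional[List[int]] = None                # the single in-memory copy; may be lost

    def map(self, f):
        return Dataset(self, lambda xs: [f(x) for x in xs])

    def filter(self, pred):
        return Dataset(self, lambda xs: [x for x in xs if pred(x)])

    def compute(self):
        """Return the data, rebuilding it from the lineage if the cache is empty."""
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.op(self.parent.compute())
        return self.cache

    def lose_cache(self):
        """Simulate a failure: drop the in-memory copies along the whole chain."""
        self.cache = None
        if self.parent is not None:
            self.parent.lose_cache()


numbers = Dataset(None, None, source=[1, 2, 3, 4, 5, 6])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()       # computed once and cached
evens_squared.lose_cache()            # failure: every in-memory copy is gone
recomputed = evens_squared.compute()  # rebuilt by replaying filter, then map
```

Recovery here costs a full replay of the chain, which is the trade-off the slide describes: no replicated copy is kept, so a failure is paid for in recomputation time.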
Issue 1
Data sharing is the bottleneck in the analytics pipeline: slow writes to disk
Figure: Spark Job1 and Spark Job2 each hold blocks 1 and 3 in their own Spark mem block manager (storage engine & execution engine in the same process) and share data through HDFS / Amazon S3 (blocks 1-4).
Figure: a Spark job (Spark mem block manager, blocks 1 and 3) and a Hadoop MR job (YARN) share data through HDFS / Amazon S3 (blocks 1-4).
Issue 1 resolved with Tachyon
Memory-speed data sharing among jobs in different frameworks
Figure: a Spark job (Spark mem) and a Hadoop MR job (YARN) share in-memory blocks held by Tachyon, on top of HDFS / Amazon S3 (blocks 1-4 on disk). Label: execution engine & storage engine same process (fast writes).
Issue 2
Cache loss when process crashes
Figure: a Spark task holds blocks 1 and 3 in its Spark memory block manager (execution engine & storage engine in the same process); HDFS / Amazon S3 holds blocks 1-4. After the crash, the in-memory copies are gone and only the blocks in HDFS / Amazon S3 remain.
Issue 2 resolved with Tachyon
Keep in-memory data safe, even when the process crashes
Figure: the Spark task (Spark memory block manager) keeps its blocks in Tachyon (in-memory blocks 1, 3, and 4), on top of HDFS / Amazon S3 (blocks 1-4 on disk). After the Spark process crashes, the blocks held by Tachyon are still in memory.
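The difference between the two Issue 2 slides can be shown with a small simulation. This is a sketch under stated assumptions, not Tachyon or Spark code: `Executor`, `slow_reads`, and the dictionaries standing in for the under storage and the external memory store are all hypothetical. A crash is modeled by discarding the executor object and creating a new one.

```python
class Executor:
    """Toy execution engine (hypothetical). It reads blocks through a cache."""

    def __init__(self, under_storage, cache=None):
        self.under_storage = under_storage
        # cache=None models in-process caching: the cache is created with the
        # executor and disappears with it. Passing a dict models a store that
        # lives outside the executor process.
        self.cache = {} if cache is None else cache
        self.slow_reads = 0   # number of reads served from under storage

    def read(self, block_id):
        if block_id not in self.cache:
            self.slow_reads += 1
            self.cache[block_id] = self.under_storage[block_id]
        return self.cache[block_id]


under_storage = {"block 1": b"a", "block 3": b"c"}

# Case A: cache inside the executor process.
executor = Executor(under_storage)
executor.read("block 1")
executor.read("block 3")
executor = Executor(under_storage)          # crash and restart: the cache is lost
executor.read("block 1")
executor.read("block 3")
in_process_slow_reads = executor.slow_reads  # both blocks come from under storage again

# Case B: blocks kept in a store outside the executor process.
external_store = {}
executor = Executor(under_storage, cache=external_store)
executor.read("block 1")
executor.read("block 3")
executor = Executor(under_storage, cache=external_store)  # crash and restart
executor.read("block 1")
executor.read("block 3")
external_slow_reads = executor.slow_reads    # blocks are still in the external store
```

In the second case the restarted executor never touches the under storage, which is the behavior the "resolved" slide attributes to keeping in-memory data outside the execution engine.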
Issue 3
In-memory Data Duplication & Java Garbage Collection
Figure: Spark Job1 and Spark Job2 each hold their own copies of blocks 1 and 3 in a Spark mem block manager (execution engine & storage engine in the same process); the storage layer below holds blocks 1-4.
Issue 3 resolved with Tachyon
No in-memory data duplication, much less GC
Figure: Spark Job1 and Spark Job2 (Spark mem) share the in-memory blocks held by Tachyon, on top of HDFS / Amazon S3 (blocks 1-4 on disk). Label: execution engine & storage engine same process (no duplication & GC).
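The duplication argument comes down to simple arithmetic, sketched below. The function name, the 128 MB block size, and the job contents are assumptions chosen for illustration; the sketch counts bytes held in memory and does not model garbage collection.

```python
def memory_footprint(jobs, block_sizes, shared):
    """Return the total bytes held in memory (hypothetical helper).

    jobs        -- list of sets; each set holds the block ids one job caches
    block_sizes -- dict mapping block id to size in bytes
    shared      -- True if all jobs read one shared in-memory copy,
                   False if every job caches its own copy
    """
    if shared:
        unique_blocks = set().union(*jobs)
        return sum(block_sizes[b] for b in unique_blocks)
    return sum(block_sizes[b] for job in jobs for b in job)


MB = 1024 * 1024
block_sizes = {"block 1": 128 * MB, "block 3": 128 * MB}   # assumed sizes
jobs = [{"block 1", "block 3"}, {"block 3", "block 1"}]    # Job1 and Job2

per_job_copies = memory_footprint(jobs, block_sizes, shared=False)
single_copy = memory_footprint(jobs, block_sizes, shared=True)
```

With two jobs caching the same two blocks, the per-job layout holds twice as many bytes as the shared layout, and the gap grows with every additional job that reads the same data.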
Previously Mentioned
• A memory-centric storage architecture
Outline
• Open Source
• Introduction to Tachyon
• New Features
1) Eco-system:
Enable new workload in any storage;
2) Tachyon running in production environment, both
Use Case: Baidu
• Framework: SparkSQL
• Under Storage: Baidu’s File System
• Storage Media: MEM + HDD
• 100+ node deployment
• 1PB+ managed space
Use Case: a SaaS Company
• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
Use Case: an Oil Company
• Framework: Spark
• Under Storage: GlusterFS
• Storage Media: MEM only
Use Case: a SaaS Company
• Framework: Spark
• Under Storage: S3
• Storage Media: SSD only
What if data size exceeds memory capacity?
3) Tiered Storage:
Tachyon Manages More Than DRAM
Figure: storage tiers MEM, SSD, HDD; upper tiers are faster, lower tiers have higher capacity
Configurable Storage Tiers
MEM only
MEM + HDD
4) Pluggable Data Management Policy
Evict stale data to lower tier
Promote hot data to upper tier
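The evict and promote behavior can be sketched with a toy tiered store. This is not Tachyon's API or its actual policy: `TieredStore`, its methods, and the choice of least-recently-used (LRU) ordering are assumptions made for illustration, with LRU standing in for one of many policies a pluggable design would allow. Capacities are counted in blocks to keep the example short.

```python
from collections import OrderedDict


class TieredStore:
    """Toy tiered block store (hypothetical). Tier 0 is the fastest, e.g. MEM."""

    def __init__(self, tiers):
        # tiers: list of (name, capacity_in_blocks), ordered fastest to slowest
        self.names = [name for name, _ in tiers]
        self.capacities = [capacity for _, capacity in tiers]
        self.tiers = [OrderedDict() for _ in tiers]   # oldest entry first

    def _insert(self, level, block_id, data):
        """Place a block in a tier; if the tier overflows, push its stalest block down."""
        tier = self.tiers[level]
        tier[block_id] = data
        tier.move_to_end(block_id)                    # mark as most recently used
        if len(tier) > self.capacities[level]:
            victim_id, victim = tier.popitem(last=False)   # evict stale data
            if level + 1 < len(self.tiers):
                self._insert(level + 1, victim_id, victim)
            # A block evicted from the lowest tier is dropped from this store.

    def write(self, block_id, data):
        for tier in self.tiers:
            tier.pop(block_id, None)                  # avoid keeping an old copy
        self._insert(0, block_id, data)

    def read(self, block_id):
        for tier in self.tiers:
            if block_id in tier:
                data = tier.pop(block_id)
                self._insert(0, block_id, data)       # promote hot data to the top tier
                return data
        raise KeyError(block_id)

    def location(self, block_id):
        for name, tier in zip(self.names, self.tiers):
            if block_id in tier:
                return name
        return None


store = TieredStore([("MEM", 2), ("SSD", 2), ("HDD", 4)])
for block_id in ("block 1", "block 2", "block 3"):
    store.write(block_id, b"data")

before_read = store.location("block 1")   # evicted to SSD when block 3 arrived
store.read("block 1")
after_read = store.location("block 1")    # promoted back to MEM by the read
```

Reading "block 1" promotes it to MEM and pushes the least recently used MEM block ("block 2") down to SSD, so both directions of movement appear in one access.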
More Features
• 7) Remote Write Support
• 8) Easy deployment with Mesos and YARN
• 9) Initial Security Support
• 10) One Command Cluster Deployment
• 11) Metrics Reporting for Clients, Workers, and Master
Outline
• Open Source
• Introduction to Tachyon
• New Features
Memory-Centric Distributed Storage
• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source
Strata NYC 2015
• Visit us at booth #P18.
• Check out other Tachyon related talks.
– First-ever scalable, distributed deep learning architecture using Spark and Tachyon
• Christopher Nguyen (Adatao, Inc.), Vu Pham (Adatao, Inc.)
• 2:05pm–2:45pm Thursday, 10/01/2015
– Faster time to insight using Spark, Tachyon, and Zeppelin
• Nirmal Ranganathan (Rackspace Hosting)
• Try Tachyon: http://tachyon-project.org
• Develop Tachyon: https://github.com/amplab/tachyon
• Meet Friends: http://www.meetup.com/Tachyon
• Get News: http://goo.gl/mwB2sX
• Tachyon Nexus: http://www.tachyonnexus.com