Neptune
H.J. Kim 2009.07
http://dev.naver.com/projects/neptune http://www.openneptune.com
Distributed Data Storage
“Data Tsunami”
• 1 billions book
• 40 billions Web page
• 55 trillions Web link
• 281 exa-bytes
• 45 GB/person
• 2X growth in 4 years
Think Various Storage!
Data Volume Data
Complexity
Cassandra
NAS
Dynamo
Neptune
• Distributed Data Storage
– semi-structured data store(not file system) – Use Distributed File System for Data file – Supports real time and batch processing
• Google Bigtable clone
– Data Model, Architecture, Features
• Goal
– 1,000 nodes
– 100 ~ 200 GB per node, Peta bytes
Features
• Schema Management
– Create, drop, modify table schema
• Real-time Transaction
– Single row operation(no join, group by, order by) – Multi row operation: like, between
• Batch Transaction
– Scanner, uploader, map&reduce adapter
• Scalability
– Automatic table split & re-assignment
• Reliability
– Data file stored in Distributed File System(HDFS, Others) – Commit log stored in ChangeLog Cluster
• Failover
– Tablet takeover time: max 2 min.
• Utility
– Web Console, Shell(simple query), Data Verifier
Architecture
분산파일시스템(Hadoop or other)
TabletServer #1 TabletServer #2 TabletServer #n
Neptune Master
분산/병렬컴퓨팅 플랫폼(MapReduce) 사용자 애플리케이션
Neptune
(대용량분산 데이터 저장소)
논리적 Table 물리적 저장소
System Components
DFS #1
(DataNode) Computing #1 (Map&Reduce)
TabletServer #1 (Neptune)
Local disk
DFS #2
(DataNode) Computing #2 (Map&Reduce)
TabletServer #2 (Neptune)
Local disk
DFS #n
(DataNode) Computing #n (Map&Reduce)
TabletServer #n (Neptune)
Local disk Master
Neptune Master
Neptune Client NTable Scanner Lock Server
NChubby
failover
/ event
Pleidasfailover / event
ZooKeeper Neptune Master
Data/Control Control
Shell
LogServer
#1 LogServer
#2 LogServer
#n
Data Model
row #n row #m+1
row #m row #k+1
row #k row #1
Rowkey Column#1
TabletA-1 rk-1 ck-1 v1, t1
v2, t2 ck-2 v3, t2
v4, t3 v5, t4 ck-n vn, tn - Sorted by rowKey - Sorted by columnKey
Column#n
TabletA-2
TabletA-n
Table
Row#1
Cell1 Column1
Cell2 Cell3 Cell-n
Cell1 Cell2 Cell-k Column2
Cell1 Cell2 Cell-m Column-n Row.Key
… … …
Cell
Cell.Key
Cell.Value(t1) Cell.Value(t2) Cell.Value(tn)
…
Data Model Examples
Hbase Schema Design Case Studies (http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies)
• 1:N relation
- 1 user has 1+friends
- will lookup all friends of a user
T_FRIEND user_id friend_id type
T_USER_FRIEND
row info friend
<user_id> name sexage
<user_id>=type
RDBMS Neptune
select *
from T_USER, T_FRIEND where T_USER_ID.id = ?
and T_USER_ID.id = T_FRIEND.user_id T_USER
id(pk) name sex age
Data Model Examples
• access log
- each log line contains time, ip, domain, url…
- will be analyzed every 5 minutes, every hour, daily, weekly…
T_ACCESS_LOG
row http user
<time><INC_COUNTER> ip
domain urlreferer
login_id
RDBMS Neptune
T_ACCESS_LOG time
ip
domain url referer login_id
Data Model Examples
• N:M 관계
- 1 student – many courses - 1 course – many students
T_Student id(pk) name sex age
T_S_C s_id c_id type
T_Course id(pk) title
teacher_id
T_Student
row info course
s_id name sexage
c_id:<type>
T_Course
row info course
c_id title
teacher_id s_id:<type>
RDBMS
Neptune
Client TabletServer MemoryTable ChangeLogServer
Data Operation
MapFile#2 (HDFS)
put(key, value)
ChangeLog
MapFile#1
(HDFS) MapFile #n
(HDFS)
Minor
Compaction
Merged MapFile (HDFS)
Major Compaction
Searcher
get(key)
Failover
NeptuneMaster #1
NeptuneMaster #2
NeptuneMaster #3
ZooKeeper Cluster (5 nodes)
/neptune_master 1. Try lock
1. Try lock
1. Try lock 2. Get lock
3. Active elected 4. Master fail
5. Get lock 6. Active elected
NeptuneClient Where master?
TabletServer /tserver_host01 Get lock
Network fail
If can’t lock -> self kill Send event
• No shared data in Master
• Manage Tablet assignment
Failover
• Master 장애
– Table Schema Management, Tablet Split 기능만 장애 – Active-standby로 장애 대처
• TabletServer 장애
– Master에 의해 Tablet re-assign – 수초 ~ 수십 초 이내 복구
• ZooKeeper 장애
– 5개 node로 클러스터 구성 – 절대 장애 발생하지 않음
• Hadoop NameNode 장애 – 별도의 이중화 방안 필요
• Hadoop 전체 장애
– Neptune 클러스터도 장애
MapReduce
Tablet A-3
Tablet A-N
…
Tablet A-2 TabletA-1
TableA
META Table
Map Task TaskTracker
Map TaskMap
Task
Map Task TaskTracker
Map TaskMap
Task
Map Task TaskTracker
Map TaskMap
Task
TaskTracker Reduce Task
TaskTracker Reduce Task
TableB
Tablet B-2 Tablet B-1
Partition using key
DBMS or HDFS
TabletIn putF ormat
Client
• Client API
– Single row operation: put/get
– Multi row operation: like, between – Batch operation: scanner/uploader – MapReduce: TabletInputFormat
• Command line Shell
– NQL(Neptune Query Language)
• JDBC support
• Web Console
Client API Example
TableShema tableSchema = new TableSchema(“T_TEST”, new String[]{“col1”, “col2”});
NTable.createTable(tableSchema);
NTable ntable = Ntable.openTable(“T_TEST”);
Row row = new Row(new Row.Key(“RK1”));
Row.addCell(“col1”, new Cell(new Cell.Key(“CK1”), “test_value”.getBytes()));
ntable.put(row);
Row selectedRow = ntable.get(new Row.Key(“RK1”));
System.out.println(selectedRow.getCellList(“col1”).get(0));
TableScanner scanner = ScannerFactory.openScanner(ntable, new String[]{“col1”});
Row scanRow = null;
while( (scanRow = scanner.next()) = null) {
System.out.println(selectedRow.getCellList(“col1”).get(0));
}
scanner.close();
Neptune Shell
• Data Definition
– CREATE TABLE – DROP TABLE – SHOW TABLES – DESC
• Data Manipulation
– SELECT – DELETE – INSERT
– TRUNCATE COLUMN – TRUNCATE TABLE – SET CHARSET
• Cluster Monitoring
– PING TABLETSERVER – REPORT TABLE
– SHOW USERS – STOP ACTION
Web Console
Performance
Experiment Neptune HBase HBase(Cache)
Random read 495 578 1,623
Random write 1,223 2,864 8,300
Sequential read 498 600 2,109
Sequential write 1,327 2,635 6,553
Scan 40,329 22,795 30,840
Number of 1000-byte values read/written per second
HBase, Bigtable
Neptune Bigtable HBase
File System Hadoop DFS or other DFS
GFS Hadoop DFS
Computing Hadoop or others MapReduce Hadoop
Master failover Yes(ZooKeeper) Yes(Chubby) 0.20(ZooKeeper) Script Language No(NQL) Sawzall No
Change log 별도 구성 GFS HDFS + Memory
API Java, Thrift, REST C++ Java, Thrift, REST
ACL Yes Yes No
Memory Table No Yes No
Scanner Yes Yes Yes
Uploader Yes Unknown No
Storage
구분 데이터
(확장성)용량
실시간데이터 처리
데이터복잡성 안정성 분석작업
연계 비용
Local Disk X X X X X Low
NAS O X X O X Middle
RDBMS X O O △
(Difficult) X High
Distributed
RDBMS O O O △
(Difficult) △ Very High
Hadoop O X X O O Low
Bigtable 계열
(Neptune 등) O O △ O O Low
Dynamo 계열
(Dynomite 등) O O X O △ Low
Google Infra Usage
Application Type:
대용량 데이터 실시간 조회
Application Type:
대용량 데이터
분석 + 실시간 조회
Application Type:
대용량 데이터 저장
Application Type:
대용량 데이터 저장 + 분석
Neptune Usage
WebServer
RDBMS (Master)
RDBMS (Slave)
NeptuneNeptuneNeptune
NeptuneNeptuneHadoop
Batch Processing
실시간처리 (META 데이터) 실시간처리 (대량데이터)
첨부파일 분석용 In/Out
분석용 In/Out 분석용 Out 분석결과 조회
실시간조회 (입력데이터/
분석데이터)
Stress Test
• Cluster: 43 nodes
– 1 hadoop NameNode, 42 DataNode – 1 Job Tracker, 20 TaskTracker
– 7 TabletServer node: 2GB Heap – 15 ChangeLogServer node
– Disk: Hadoop and ChangeLogServer use different disk
• Hadoop: 0.19.0
• Map Task:
– 1024 map task, 1GB/Map
– 2 Maps/TaskTracker, 40 Map Task Concurrently – Map only
• Data:
– 1 row: 10,000 bytes
– Total 1TB, 110 million rows
- 20 40 60 80 100 120 140 160
- 100 200 300 400 500 600 700 800 900
2 20 38 56 74 92 110 128 146 164 182 200 218 236 254 272 290 308 326 344 362 380 398 416 434 452 470 488 506 524 542 560 578 596 614 632 650 668
TPS/TabletServer
TPS/TabletServer Data(GB)/TabletServer Time(min)
TPS GB
0 500 1000 1500 2000 2500 3000
1 25 49 73 97 121 145 169 193 217 241 265 289 313 337 361 385 409 433 457 481 505 529 553 577 601 625 649 673 697 721 745 769 793 817 841 865 889 913 937 961 985
Map Task Elapse Time(sec)
Map ID Time(sec)