Neptune Distributed Data Storage. H.J. Kim

(1)

Neptune

H.J. Kim 2009.07

http://dev.naver.com/projects/neptune http://www.openneptune.com

Distributed Data Storage

(2)

“Data Tsunami”

• 1 billions book

• 40 billions Web page

• 55 trillions Web link

• 281 exa-bytes

• 45 GB/person

• 2X growth in 4 years

(3)

Think Various Storage!

Data Volume Data

Complexity

Cassandra

NAS

Dynamo

(4)

Neptune

• Distributed Data Storage

– semi-structured data store(not file system) – Use Distributed File System for Data file – Supports real time and batch processing

• Google Bigtable clone

– Data Model, Architecture, Features

• Goal

– 1,000 nodes

– 100 ~ 200 GB per node, Peta bytes

(5)

Features

• Schema Management

– Create, drop, modify table schema

• Real-time Transaction

– Single row operation(no join, group by, order by) – Multi row operation: like, between

• Batch Transaction

– Scanner, uploader, map&reduce adapter

• Scalability

– Automatic table split & re-assignment

• Reliability

– Data file stored in Distributed File System(HDFS, Others) – Commit log stored in ChangeLog Cluster

• Failover

– Tablet takeover time: max 2 min.

• Utility

– Web Console, Shell(simple query), Data Verifier

(6)

Architecture

분산파일시스템(Hadoop or other)

TabletServer #1 TabletServer #2 TabletServer #n

Neptune Master

분산/병렬컴퓨팅 플랫폼(MapReduce) 사용자 애플리케이션

Neptune

(대용량분산 데이터 저장소)

논리적 Table 물리적 저장소

(7)

System Components

DFS #1

(DataNode) Computing #1 (Map&Reduce)

TabletServer #1 (Neptune)

Local disk

DFS #2

(DataNode) Computing #2 (Map&Reduce)

TabletServer #2 (Neptune)

Local disk

DFS #n

(DataNode) Computing #n (Map&Reduce)

TabletServer #n (Neptune)

Local disk Master

Neptune Master

Neptune Client NTable Scanner Lock Server

NChubby

failover

/ event

^Pleidas

failover / event

ZooKeeper Neptune Master

Data/Control Control

Shell

LogServer

#1 LogServer

#2 LogServer

#n

(8)

Data Model

row #n row #m+1

row #m row #k+1

row #k row #1

Rowkey Column#1

TabletA-1 rk-1 ck-1 v1, t1

v2, t2 ck-2 v3, t2

v4, t3 v5, t4 ck-n vn, tn - Sorted by rowKey - Sorted by columnKey

Column#n

TabletA-2

TabletA-n

Table

Row#1

Cell1 Column1

Cell2 Cell3 Cell-n

Cell1 Cell2 Cell-k Column2

Cell1 Cell2 Cell-m Column-n Row.Key

… … …

Cell

Cell.Key

Cell.Value(t1) Cell.Value(t2) Cell.Value(tn)

…

(9)

Data Model Examples

Hbase Schema Design Case Studies (http://www.slideshare.net/hmisty/20090713-hbase-schema-design-case-studies)

• 1:N relation

- 1 user has 1+friends

- will lookup all friends of a user

T_FRIEND user_id friend_id type

T_USER_FRIEND

row info friend

<user_id> name sexage

<user_id>=type

RDBMS Neptune

select *

from T_USER, T_FRIEND where T_USER_ID.id = ?

and T_USER_ID.id = T_FRIEND.user_id T_USER

id(pk) name sex age

(10)

Data Model Examples

• access log

- each log line contains time, ip, domain, url…

- will be analyzed every 5 minutes, every hour, daily, weekly…

T_ACCESS_LOG

row http user

<time><INC_COUNTER> ip

domain urlreferer

login_id

RDBMS Neptune

T_ACCESS_LOG time

ip

domain url referer login_id

(11)

Data Model Examples

• N:M 관계

- 1 student – many courses - 1 course – many students

T_Student id(pk) name sex age

T_S_C s_id c_id type

T_Course id(pk) title

teacher_id

T_Student

row info course

s_id name sexage

c_id:<type>

T_Course

row info course

c_id title

teacher_id s_id:<type>

RDBMS

Neptune

(12)

Client TabletServer MemoryTable ChangeLogServer

Data Operation

MapFile#2 (HDFS)

put(key, value)

ChangeLog

MapFile#1

(HDFS) MapFile #n

(HDFS)

Minor

Compaction

Merged MapFile (HDFS)

Major Compaction

Searcher

get(key)

(13)

Failover

NeptuneMaster #1

NeptuneMaster #2

NeptuneMaster #3

ZooKeeper Cluster (5 nodes)

/neptune_master 1. Try lock

1. Try lock

1. Try lock 2. Get lock

3. Active elected 4. Master fail

5. Get lock 6. Active elected

NeptuneClient Where master?

TabletServer /tserver_host01 Get lock

Network fail

If can’t lock -> self kill Send event

• No shared data in Master

• Manage Tablet assignment

(14)

Failover

• Master 장애

– Table Schema Management, Tablet Split 기능만 장애 – Active-standby로 장애 대처

• TabletServer 장애

– Master에 의해 Tablet re-assign – 수초 ~ 수십 초 이내 복구

• ZooKeeper 장애

– 5개 node로 클러스터 구성 – 절대 장애 발생하지 않음

• Hadoop NameNode 장애 – 별도의 이중화 방안 필요

• Hadoop 전체 장애

– Neptune 클러스터도 장애

(15)

MapReduce

Tablet A-3

Tablet A-N

…

Tablet A-2 TabletA-1

TableA

META Table

Map Task TaskTracker

Map TaskMap

Task

Map TaskMap

Task

Map TaskMap

Task

TaskTracker Reduce Task

TableB

Tablet B-2 Tablet B-1

Partition using key

DBMS or HDFS

TabletIn putF ormat

(16)

Client

• Client API

– Single row operation: put/get

– Multi row operation: like, between – Batch operation: scanner/uploader – MapReduce: TabletInputFormat

• Command line Shell

– NQL(Neptune Query Language)

• JDBC support

• Web Console

(17)

Client API Example

TableShema tableSchema = new TableSchema(“T_TEST”, new String[]{“col1”, “col2”});

NTable.createTable(tableSchema);

NTable ntable = Ntable.openTable(“T_TEST”);

Row row = new Row(new Row.Key(“RK1”));

Row.addCell(“col1”, new Cell(new Cell.Key(“CK1”), “test_value”.getBytes()));

ntable.put(row);

Row selectedRow = ntable.get(new Row.Key(“RK1”));

System.out.println(selectedRow.getCellList(“col1”).get(0));

TableScanner scanner = ScannerFactory.openScanner(ntable, new String[]{“col1”});

Row scanRow = null;

while( (scanRow = scanner.next()) = null) {

System.out.println(selectedRow.getCellList(“col1”).get(0));

}

scanner.close();

(18)

Neptune Shell

• Data Definition

– CREATE TABLE – DROP TABLE – SHOW TABLES – DESC

• Data Manipulation

– SELECT – DELETE – INSERT

– TRUNCATE COLUMN – TRUNCATE TABLE – SET CHARSET

• Cluster Monitoring

– PING TABLETSERVER – REPORT TABLE

– SHOW USERS – STOP ACTION

(19)

Web Console

(20)

Performance

Experiment Neptune HBase HBase(Cache)

Random read 495 578 1,623

Random write 1,223 2,864 8,300

Sequential read 498 600 2,109

Sequential write 1,327 2,635 6,553

Scan 40,329 22,795 30,840

Number of 1000-byte values read/written per second

(21)

HBase, Bigtable

Neptune Bigtable HBase

File System Hadoop DFS or other DFS

GFS Hadoop DFS

Computing Hadoop or others MapReduce Hadoop

Master failover Yes(ZooKeeper) Yes(Chubby) 0.20(ZooKeeper) Script Language No(NQL) Sawzall No

Change log 별도 구성 GFS HDFS + Memory

API Java, Thrift, REST C++ Java, Thrift, REST

ACL Yes Yes No

Memory Table No Yes No

Scanner Yes Yes Yes

Uploader Yes Unknown No

(22)

Storage

구분 데이터

(확장성)용량

실시간데이터 처리

데이터복잡성 안정성 분석작업

연계 비용

Local Disk X X X X X Low

NAS O X X O X Middle

RDBMS X O O △

(Difficult) X High

Distributed

RDBMS O O O △

(Difficult) △ Very High

Hadoop O X X O O Low

Bigtable 계열

(Neptune 등) O O △ O O Low

Dynamo 계열

(Dynomite 등) O O X O △ Low

(23)

Google Infra Usage

Application Type:

대용량 데이터 실시간 조회

Application Type:

대용량 데이터

분석 + 실시간 조회

Application Type:

대용량 데이터 저장

Application Type:

대용량 데이터 저장 + 분석

(24)

Neptune Usage

WebServer

RDBMS (Master)

RDBMS (Slave)

NeptuneNeptuneNeptune

NeptuneNeptuneHadoop

Batch Processing

실시간처리 (META 데이터) 실시간처리 (대량데이터)

첨부파일 분석용 In/Out

분석용 In/Out 분석용 Out 분석결과 조회

실시간조회 (입력데이터/

분석데이터)

(25)

Stress Test

• Cluster: 43 nodes

– 1 hadoop NameNode, 42 DataNode – 1 Job Tracker, 20 TaskTracker

– 7 TabletServer node: 2GB Heap – 15 ChangeLogServer node

– Disk: Hadoop and ChangeLogServer use different disk

• Hadoop: 0.19.0

• Map Task:

– 1024 map task, 1GB/Map

– 2 Maps/TaskTracker, 40 Map Task Concurrently – Map only

• Data:

– 1 row: 10,000 bytes

– Total 1TB, 110 million rows

(26)

- 20 40 60 80 100 120 140 160

- 100 200 300 400 500 600 700 800 900

2 20 38 56 74 92 110 128 146 164 182 200 218 236 254 272 290 308 326 344 362 380 398 416 434 452 470 488 506 524 542 560 578 596 614 632 650 668

TPS/TabletServer

TPS/TabletServer Data(GB)/TabletServer Time(min)

TPS GB

(27)

0 500 1000 1500 2000 2500 3000

1 25 49 73 97 121 145 169 193 217 241 265 289 313 337 361 385 409 433 457 481 505 529 553 577 601 625 649 673 697 721 745 769 793 817 841 865 889 913 937 961 985

Map Task Elapse Time(sec)

Map ID Time(sec)

(28)

Test Result

• Elapse time: 11 hour 40 min

• Average TPS/TabletServer: 394

• Average TPS/Cluster: 2,758

• Average put latency: 9 ms

• Total # Tablets: 8,133

– Average Tablet size: 130MB

• Each TabletServer

– # Tablets: 1,162

– Service data:143GB – Heap Usage:

• Free: 935,746 KB

• Total: 2,080,128 KB

(29)

Powered by Neptune

• GAIA(http://www.gaiaville.com)

– Cloud Searchable Storage Service

(30)

Milestone

• Neptune-1.4 release(2009.07)

– Split시 lock time 최소화 – Supports ganglia metrics – Tablet Balancer

– Add start key in META record

• Neptune-1.5(2009.10)

– get 성능향상: DFS Block Cache, Bloom Filter – Hive query 연동

– Tablet 할당 정책

(31)

Join Neptune project

http://www.openneptune.com

(32)