• No results found

Apache HBase 0.96 and What s Next Headline Goes Here Jonathan

N/A
N/A
Protected

Academic year: 2021

Share "Apache HBase 0.96 and What s Next Headline Goes Here Jonathan"

Copied!
42
0
0

Loading.... (view fulltext now)

Full text

(1)

Headline Goes Here

Speaker Name or Subhead Goes Here

DO NOT USE PUBLICLY

PRIOR TO 10/23/12

Apache HBase 0.96 and What’s Next

Jonathan Hsieh

|

@jmhsieh

Software Engineer at Cloudera

|

HBase PMC Member

(2)

Who Am I?

Cloudera:

Software Engineer

Tech Lead HBase Team

Apache HBase

committer / PMC

Apache Flume

founder / PMC

U of Washington:

Research in Distributed Systems

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(3)

What is Apache HBase?

Apache HBase

is an open

source, distributed,

consistent, non-relational

database that provides

low-latency, random

read/write operations.

ZK

HDFS

App

MR

(4)

Developer Community

Vibrant, Highly

Active community!

We’re Growing!

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(5)

Today: Apache HBase 0.96.0

5 10/29/13 Strata + Hadoop World /

HBase is

fault tolerant

HDFS for

durability

via replication

HBase has good

availability

Uses ZK for

coordination

via quorums

New features for fast RS recovery in 0.96

HBase strives for

predictable

performance

Some experimental features in 0.96, and

more improvements in development.

Reliable

Highly Available

(6)

Summary by Version

0.90 (CDH3)

0.92 /0.94 (CDH4)

0.96 (CDH5)

Next

New

Features

Stability

Reliability

Continuity

Multitenancy

MTTR

Recovery in

Hours

Recovery in Minutes

Recovery of writes in

seconds, reads in 10’s

of seconds

Recovery in Seconds

(reads+writes)

Perf

Baseline

Better Throughput

Optimizing

Performance

Predictable

Performance

Usability

HBase

Developer

Expertise

HBase Operational

Experience

Distributed Systems

Admin Experience

Application

Developers

Experience

6 10/29/13 Strata + Hadoop World /

(7)

MTTR Recap

Mean time to recovery

(8)

Mean Time to Recovery (MTTR)

Machine failures happen in distributed systems

Average unavailability when automatically recovering from a

failure.

Recovery time for a unclean data center power cycle

10/29/13 Strata + Hadoop World / Jonathan Hsieh 8

recovered

notify

repair

detect

Region

unavailable

Region available

client aware

Region available

client unaware

(9)

Distributed log splitting (0.92)

10/29/13 Strata + Hadoop World / 9

recovered

replay

assign

split

Region

unavailable

Region available

for RW

hdfs

detect

Repair == split, assign, replay

When many servers fail, many logs to recover

Instead of just at the master, RS’s split logs in parallel

Huge win in recovery time

hdfs hdfs

(10)

Fast notification and detection (0.96)

Proactive notification of HMaster failure (0.96)

Proactive notification of RS failure (0.96)

Notify client on recovery (0.96)

Fast server failover (Hardware)

10/29/13 Strata + Hadoop World / Jonathan Hsieh 10

recovered

replay

assign

split

Region

unavailable

Region available

for RW

hdfs hdfs

detect

hdfs

(11)

Previously had two IO intensive passes:

Log splitting to intermediate files

Assign and log replay

Now just one IO heavy pass: Assign first, then split+replay.

Improves read and write recovery times.

Off by default currently*.

Distributed log replay (0.96)

10/29/13 Strata + Hadoop World / 11

recovered

split + replay

assign

Region

unavailable

Region available

for RW

Region available

for replay writes

hdfs

detect

*Caveat: If you override time stamps you could have

READ REPEATED isolation violations (use tags to fix this)

(12)

recovered

split + replay

Distributed log replay with fast write recovery

10/29/13 Strata + Hadoop World / Jonathan Hsieh 12

assign

Region

unavailable

Region available

for RW

Region available

for all writes

hdfs

detect

Writes in HBase do not incur reads.

With distributed log replay, we’ve already have regions open

for write.

Allow fresh writes while replaying old logs*.

*Caveat: If you override time stamps you could have

READ REPEATED isolation violations (use tags to fix this)

(13)

Fast Read Recovery

Idea: Pristine Region fast read recovery

If region not edited it is consistent and can recover RW immediately

Idea: Shadow Regions for fast read recovery

Shadow region tails the WAL of the primary region

Shadow memstore is one HDFS block behind, catch up recover RW

Currently some progress for trunk

10/29/13 Strata + Hadoop World / 13

recovered

assign

Region

unavailable

Can guarantee no new edits?

Region available for all RW

detect

Can guarantee we have all edits?

Region available for all RW

(14)

Many apps and users in a single cluster

Multi-tenancy

14 10/29/13 Strata + Hadoop World /

(15)

Growing HBase

Pre 0.96.0: scaling up HBase for

single HBase applications

Essentially a single user for single

app.

Ex: Facebook messages, one

application, many hbase clusters

Shard users to different pods

Focused on continuity and disaster

recovery features

Cross-cluster Replication

Table Snapshots

Rolling Upgrades

10/29/13 Strata + Hadoop World / 15

# of isolated applications

#

of

c

lu

st

er

s

Sc

alab

ility

One giant application,

Multiple clusters

(16)

Growing HBase

In 0.96 we introduce primitives

for supporting

Multitenancy

Many users, many applications,

one HBase cluster

Need to have some control of

the interactions different users

cause.

Ex: Manage for MR analytics and

low-latency serving in one

cluster.

10/29/13 Strata + Hadoop World / Jonathan Hsieh 16

# of isolated applications

#

of

c

lu

st

er

s

multitenancy

Sc

alab

ility

One giant application,

Multiple clusters

Many applications

In one shared cluster

(17)

Namespaces (0.96)

Namespaces provide an abstraction for multiple tenants to

create and manage their own tables within a large HBase

instance.

10/29/13 Strata + Hadoop World / 17

(18)

Multitenancy goals

Security (0.96)

A separate admin ACLs for different sets of tables

Quotas (in progress)

Max tables, max regions.

Performance Isolation (in progress)

Limit performance impact load on one table has on others.

Priority (future)

Handle some tables before others

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(19)

Isolation with Region Server Groups

10/29/13 Strata + Hadoop World / 19

Region assignment distribution (no region server groups)

(20)

Isolation with Region Server Groups

10/29/13 Strata + Hadoop World / Jonathan Hsieh

20

RSG blue

RSG green orange

Namespace blue

Namespace green

Namespace orange

(21)

Cell Tags

Mechanism for attaching arbitrary metadata to

Cells.

Motivation: Finer-grained isolation

Use for Accumulo-style cell-level visibility

Main feature for 0.98 (in development).

Other uses:

Add sequence numbers to enable correct fast

read/write recovery

Potential for schema tags

10/29/13 Strata + Hadoop World / 21

(22)

Improving the 99%tile

Improving Predictability

22 10/29/13 Strata + Hadoop World /

(23)

Common causes of performance variability

Locality Loss

Favored Nodes, HDFS block affinity

Compaction*

Exploring compactor

GC

Off-heap Cache

Hardware hiccups

Multi WAL, HDFS speculative read

10/29/13 Strata + Hadoop World / 23

(24)

Performance degraded after recovery

After recovery, reads suffer a performance hit.

Regions have lost locality

To maintain performance after failover, we need to regain locality.

Compact Region to regain locality

We can do better by using HDFS features

10/29/13 Strata + Hadoop World / Jonathan Hsieh 24

performance recovered

recovered

Service recovered;

degraded performance

recovery

Performance recovered because

compaction restores locality

(25)

Control and track where block replicas are

All files for a region created such that blocks go to the same set of

favored nodes

When failing over, assign the region to one of those favored nodes.

Currently a preview feature in 0.96

Disabled by default because it doesn’t work well with the latest balancer or splits.

Will likely use upcoming HDFS block affinity for better operability

Originally on Facebook’s 0.89, ported to 0.96

performance recovered

Read Throughput: Favored Nodes (0.96)

10/29/13 Strata + Hadoop World / 25

Service recovered; performance sustained because

region assigned to favored node.

(26)

Read latency: HDFS speculative read

HBase’s HDFS client reads 1

of 3 replicas

If you chose the slow node,

your reads are slow.

Idea: If a read is taking too

long, speculatively go to

another that may be faster.

10/29/13 Strata + Hadoop World / Jonathan Hsieh 26

RS

1

2

3

Slow read!

Hdf

s

replic

as

RS

1

2

3

Hdf

s

rep

lic

as

(27)

Write latency: Multiple WALs

HBase’s HDFS client writes 3

replicas

Min write latency is bounded

by the slowest of the 3

replicas

Idea: If a write is taking too

long let’s duplicate it on

another set that may be

faster.

10/29/13 Strata + Hadoop World / 27

RS

1

2

3

Slow Write

Hdf

s

rep

lic

as

RS

1

2

3

Hdf

s

rep

lic

as

1

2

3

Hdf

s

rep

lic

as

Too slow, write

to other replica

(28)

An Ecosystem of projects built on HBase

HBase Extensions

28 10/29/13 Strata + Hadoop World /

(29)

Making HBase easier to use and tune.

With great power comes great responsibility.

Difficult to see what is happening in HBase

Easy to make poor design decisions early without realizing

New Developments

HTrace + Zipkin

Frameworks for Schema design

10/29/13 Strata + Hadoop World / 29

(30)

HTrace: Distributed Tracing in HBase and HDFS

Framework Inspired by

Google Dapper

Tracks time spent in calls

in RPCs across different

machines.

Threaded through HBase

(0.96) and future HDFS.

10/29/13 Strata + Hadoop World / Jonathan Hsieh 30

HB

ase

RS

1

HDF

S

DN

ZK

HB

ase

C

lien

t

HB

ase

m

et

a

HDF

S

NN

A span RPC calls

(31)

Zipkin – Visualizing Spans

UI + Visualization System

Written by Twitter

Zipkin HBase Storage

Zipkin HTrace integration

View where time from a

specific call is spent in

HBase, HDFS, and ZK.

10/29/13 Strata + Hadoop World / 31

(32)

HBase Schemas

HBase Application developers must iterate to find a suitable

HBase schema

Schema critical for Performance at Scale

How can we make this easier?

How can we reduce the expertise required to do this?

Old option: Learn the architecture. Use the API. Experiment

and learn.

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(33)

How should I arrange my data?

Isomorphic data representations!

10/29/13 Strata + Hadoop World / 33

rowkey

d:

bob-col1 aaaa

bob-col2 bbbb

bob-col3 cccc

bob-col4 dddd

jon-col1

eeee

jon-col2

ffff

jon-col3

gggg

jon-col4

hhhh

Rowkey

d:col1

d:col2

d:col3

d:col4

bob

aaaa

bbbb

cccc

dddd

jon

eeee

ffff

gggg

hhhhh

Rowkey

col1:

col2:

col3:

col4:

bob

aaaa

bbbb

cccc

dddd

jon

eeee

ffff

gggg

hhhhh

Short Fat Table using column qualifiers

Short Fat Table using column families

Tall skinny with

compound rowkey

(34)

Row key design techniques

Numeric Keys and lexicographic sort

Store numbers big-endian.

Pad ASCII numbers with 0’s.

Use

reversal

to have most significant traits first.

Reverse URL.

Reverse timestamp to get most recent first.

(MAX_LONG - ts) so “time” gets monotonically smaller.

Use

composite keys

to make key distribute nicely and work

well with sub-scans

Ex: User-ReverseTimeStamp

Do not use current timestamp as first part of row key!

34

Row100

Row3

Row 31

Row003

Row031

Row100

vs.

blog.cloudera.com hbase.apache.org strataconf.com

vs.

com.cloudera.blog com.strataconf org.apache.hbase

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(35)

Row key design techniques

Numeric Keys and lexicographic sort

Store numbers big-endian.

Pad ASCII numbers with 0’s.

Use

reversal

to have most significant traits first.

Reverse URL.

Reverse timestamp to get most recent first.

(MAX_LONG - ts) so “time” gets monotonically smaller.

Use

composite keys

to make key distribute nicely and work

well with sub-scans

Ex: User-ReverseTimeStamp

Do not use current timestamp as first part of row key!

35

Row100

Row3

Row 31

Row003

Row031

Row100

vs.

blog.cloudera.com hbase.apache.org strataconf.com

vs.

com.cloudera.blog com.strataconf org.apache.hbase

(36)

Phoenix

A SQL skin over HBase targeting low-latency queries.

JDBC SQL interface

Highlights

Adds Types

Handles Compound Row key encoding

Secondary indices in development

Provides some pushdown aggregations (coprocessor).

Open sourced by Salesforce.com

Work from James Taylor, Jesse Yates, et al

https://github.com/forcedotcom/phoenix

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(37)

Impala

Scalable Low-latency SQL querying for HDFS (and HBase!)

ODBC/JDBC driver interface

Highlights

Use’s Hive metastore and its hbase-hbase connector

configuration conventions.

Native code implementation, uses JIT for query

execution optimization.

Authorization via Kerberos support

Open sourced by Cloudera

https://github.com/cloudera/impala

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(38)

Kiji

APIs for building big data applications on HBase

based on Google’s Bigtable usage experience

Highlights

Provides types via Avro serialization and Schema encoding

Provides locality group that logically maps to HBase’s physical column

families.

Manages schema evolution

Provides framework for applying machine learning to data

Open sourced by WibiData

http://www.kiji.org/

10/29/13 Strata + Hadoop World / Jonathan Hsieh

(39)

Cloudera Development Kit (CDK)

APIs that provides a

Dataset abstraction

Provides get/put/delete API in avro objects

HBase Support in progress

Highlights

Supports multiple components of the hadoop distros

(flume, morphlines, hive, crunch, hcat)

Provides types using Avro and parquet formats for

encoding entities

Manages schema evolution

Open source by Cloudera

http://cloudera.github.io/cdk/docs/current

10/29/13 Strata + Hadoop World / 39

(40)

Conclusions

40 10/29/13 Strata + Hadoop World /

(41)

Summary by Version

0.90 (CDH3) 0.92 /0.94 (CDH4) 0.96 (CDH5) Next Major Features Stability •True Durability • Replication • Import/Export/Copy Reliability •True Consistency •Master-Master Replication •Coprocs + Security Continuity • Protobufs • Snapshots • Namespace Security Multitenancy • Cell-level Tags • Namespaces Isolation • Namespace Quotas MTTR Recovery in Hours •Distributed log splitting* Recovery in Minutes

•Distributed log splitting

Recovery of writes in seconds, reads in 10’s of Seconds

• Distributed log replay†

• Fast Failure Notification

Recovery in Seconds (reads+writes)

•Pristine Region read recover •Shadow Regions Perf Baseline •Metrics Better Throughput • CF+Region Metrics • HFile Checksums

• Short Circuit HDFS Read

• Blooms and Big Hfiles

Optimizing Performance

• HTrace

• Prefix Tree Encoding†

• Exploring compaction

• Stochastic load balancer

Predictable Performance

• Multi WAL

• Speculative Reads

• Favored nodes

Usability HBase Developer Expertise

HBase Operational Experience

Distributed Systems Admin Experience

Application Developers Experience

† experimental in progress *backported

(42)

42

More questions?

Come to the Cloudera

booth for an ask the expert

session with Jon and Lars George!

Questions?

References

Related documents

The process of selecting the workers in a crew and assigning crews to different tasks is crucial for ensuring the success of a construction project and improved labor

The result of this research study clearly indicate that there is weak relationship between the Fundamental Analysis and technical analysis with buy or sell trade but

It is striking, however, that despite compelling evidence from normative studies showing that typically developing males have an increased propensity towards risk taking compared

(iii) The information is provided by an employee through his or her employer’s internal reporting procedures before or at the same time the employee submits the information to

This tradition has been especially strong in the United States, where a major driver of continuing anti-Catholic attitudes was the perception that the political, social

Since the overall posterior probability of a cleave is simply the product of the individual posteriors of a cleave given a single feature, it is possible to

● [1] What is Hadoop?, The Apache Software Foundation, referenced on 2009-10-29, available at

Da li je to samo uzbuđenje što će smak sveta koji.?.