Applying filters - Getting started

2.6.4 Applying filters

It’s not always possible to design a rowkey to perfectly match your access patterns. Sometimes you’ll have use cases where you need to scan through a set of data in HBase but return only a subset of it to the client. This is where filters come in. Add a filter to your Scan object like this:

Filter f = ...
Scan s = new Scan();
s.setFilter(f);

A filter is a predicate that executes in HBase instead of on the client. When you specify a Filter in your Scan, HBase uses it to determine whether a record should be returned. This can avoid a lot of unnecessary data transfer. It also keeps the filtering on the server instead of placing that burden on the client.

The filter applied is anything implementing the org.apache.hadoop.hbase.filter.Filter interface. HBase provides a number of filters, but it’s easy to implement your own.

To filter all twits that mention TwitBase, you can use a ValueFilter in combination with a RegexStringComparator:

Scan s = new Scan();
s.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit"));
Filter f = new ValueFilter(
    CompareOp.EQUAL,
    new RegexStringComparator(".*TwitBase.*"));
s.setFilter(f);

HBase also provides a class for filter construction. The ParseFilter object implements a kind of query language used to construct a Filter instance for you. The same TwitBase filter can be constructed from an expression:

Scan s = new Scan();
s.addColumn(TWITS_FAM, TWIT_COL);
String expression = "ValueFilter(=,'regexString:.*TwitBase.*')";
ParseFilter p = new ParseFilter();
Filter f = p.parseSimpleFilterExpression(Bytes.toBytes(expression));
s.setFilter(f);

In either case, your regular expression is compiled and applied in the region before data ever reaches the client.
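To get a feel for which cell values the TwitBase expression lets through, the pattern can be tried out with plain java.util.regex. This is only a client-side sketch: it assumes the comparator applies the pattern in the usual java.util.regex way, and the class name, method name, and sample twits are made up for illustration. In a real scan the matching happens inside the region, not in your application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: applies the filter's regular expression to an
// in-memory list instead of to cells on a region server.
final class TwitRegexSketch {

    // Same expression that the text passes to RegexStringComparator.
    static final Pattern MENTIONS_TWITBASE = Pattern.compile(".*TwitBase.*");

    // Returns the values the pattern matches, preserving their order.
    static List<String> keepMatching(List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (MENTIONS_TWITBASE.matcher(value).find()) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample twits, not taken from the TwitBase data set.
        List<String> twits = Arrays.asList(
                "Hello, TwitBase!",
                "Lunch was great",
                "Learning HBase with TwitBase today");
        System.out.println(keepMatching(twits));
    }
}
```

Note that the pattern is case-sensitive, so a twit that mentions "twitbase" in lowercase would not be returned.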

This is a simple example of using a filter in your applications. Filters in HBase can be applied to rowkeys, column qualifiers, or data values. You can also compose multiple filters together using the FilterList and WhileMatchFilter objects. Filters also allow you to page over data, limiting the number of rows returned by the scanner. We cover the bundled filters in more depth in chapter 4.
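The idea of composing filters and limiting the number of results can be pictured with ordinary Java predicates. The sketch below is an analogy, not HBase code: and() plays the role of a FilterList in which every filter must pass, or() the role of one in which any filter may pass, and limit() stands in for paging. The class name, the scan() helper, and the sample values are assumptions made for this illustration, and the actual paging behavior of the bundled filters may differ in detail.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Analogy only: models filter composition and paging over an in-memory list.
final class FilterCompositionSketch {

    // Keeps values that pass the predicate, stopping after pageSize results.
    static List<String> scan(List<String> values, Predicate<String> filter, int pageSize) {
        return values.stream()
                .filter(filter)
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Predicate<String> mentionsTwitBase = v -> v.contains("TwitBase");
        Predicate<String> isShort = v -> v.length() < 20;

        // Hypothetical sample twits.
        List<String> twits = Arrays.asList(
                "Hello, TwitBase!",
                "TwitBase is the place to be today",
                "Lunch",
                "TwitBase rocks");

        // Both conditions must hold (compare FilterList.Operator.MUST_PASS_ALL).
        System.out.println(scan(twits, mentionsTwitBase.and(isShort), 10));
        // Either condition may hold (compare MUST_PASS_ONE), capped at two results.
        System.out.println(scan(twits, mentionsTwitBase.or(isShort), 2));
    }
}
```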

2.7 Atomic operations

The last command in the HBase arsenal is the Increment Column Value (ICV). It’s exposed both as the Increment command object, like the others, and as a method on the HTableInterface. Let’s use the HTableInterface version because it offers slightly more intuitive semantics. Using it to keep count of the number of twits per user looks like this:

long ret = usersTable.incrementColumnValue(
    Bytes.toBytes("TheRealMT"),
    Bytes.toBytes("info"),
    Bytes.toBytes("tweet_count"),
    1L);

This command allows you to change an integral value stored in an HBase cell without reading it back first. The data manipulation happens in HBase, not in your client application, which makes it fast. It also avoids a possible race condition where some other client is interacting with the same cell. You can think of the ICV as analogous to Java’s AtomicLong.addAndGet() method. The increment value can be any Java long value, positive or negative. We’ll cover atomic operations in more detail in the next section.
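The AtomicLong comparison can be made concrete with a few lines of plain Java. In this sketch several threads bump a shared counter, much as several clients might increment the same cell, and no update is lost because each addAndGet() is a single atomic read-modify-write. The class and method names are hypothetical, and the counter lives in local memory rather than in HBase.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration only: an in-memory counter standing in for an HBase cell.
final class AtomicCounterSketch {

    // Starts `threads` workers that each add 1 to the counter `perThread` times
    // and returns the final value once all of them have finished.
    static long countConcurrently(int threads, int perThread) {
        AtomicLong tweetCount = new AtomicLong(0L);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    tweetCount.addAndGet(1L); // atomic read-modify-write
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return tweetCount.get();
    }

    public static void main(String[] args) {
        // Four workers, 10,000 increments each: prints 40000.
        System.out.println(countConcurrently(4, 10_000));
    }
}
```

With a plain long field and count++ in place of addAndGet(), the same program could print a smaller number, which is the race condition the ICV avoids.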

Notice also that you’re not storing this data in the twits table but instead in the users table. You store it there because you don’t want this information as part of a scan. Keeping it in the twits table would upset the common access pattern of that table.

Like Java’s Atomic family of classes, the HTableInterface also provides checkAndPut() and checkAndDelete() methods. They allow for more fine-grained control while maintaining the atomic semantics. You could implement the incrementColumnValue() method using checkAndPut():

Get g = new Get(Bytes.toBytes("TheRealMT"));
Result r = usersTable.get(g);
// Read the current counter value from the cell
long curVal = Bytes.toLong(
    r.getColumnLatest(
        Bytes.toBytes("info"),
        Bytes.toBytes("tweet_count")).getValue());
long incVal = curVal + 1;
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(
    Bytes.toBytes("info"),
    Bytes.toBytes("tweet_count"),
    Bytes.toBytes(incVal));
// Apply the Put only if the cell still holds curVal
usersTable.checkAndPut(
    Bytes.toBytes("TheRealMT"),
    Bytes.toBytes("info"),
    Bytes.toBytes("tweet_count"),
    Bytes.toBytes(curVal),
    p);

This implementation is quite a bit longer, but you can do it. Using checkAndDelete() looks much the same.
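One detail worth keeping in mind is that checkAndPut() returns a boolean: if another client changed the cell between your read and your write, the Put isn't applied and the increment has to be retried. The sketch below shows that retry loop using AtomicLong.compareAndSet() as a stand-in for checkAndPut(). It's a local analogy under that assumption, with hypothetical class and method names, not code that talks to HBase.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustration only: compareAndSet() plays the role of checkAndPut().
final class CheckAndSetSketch {

    // Adds delta using only get() and compareAndSet(), retrying on conflict.
    static long incrementWithCas(AtomicLong cell, long delta) {
        while (true) {
            long curVal = cell.get();                 // like the Get in the text
            long incVal = curVal + delta;
            if (cell.compareAndSet(curVal, incVal)) { // like checkAndPut()
                return incVal;
            }
            // Another writer changed the cell first; re-read and try again.
        }
    }

    public static void main(String[] args) {
        AtomicLong tweetCount = new AtomicLong(41L);
        System.out.println(incrementWithCas(tweetCount, 1L)); // prints 42
    }
}
```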

Following the same patterns as before, you can now easily build a TwitsTool. The model, DAO, and command-line implementations look similar to what you’ve seen for the users table. An implementation is provided in the source code accompanying this book.

2.8 ACID semantics

If you’ve worked with database systems, you’ve heard about the ACID semantics that various systems provide. ACID is a set of properties that are important to be aware of when building applications that use database systems for storage. Using these properties, you can reason about the behavior of your application when it interacts with the underlying store. For the sake of simplicity, let’s define ACID again. Keep in mind that ACID is different from CAP, which we briefly touched on earlier:

■ Atomicity: Atomicity is the property of an operation being atomic, or, in other words, all or nothing. If the operation succeeds, the entire operation succeeds. If the operation fails, it fails in its entirety and the system is left in exactly the same state as it was in before the operation started.

■ Consistency: Consistency is the property of an operation taking the system from one valid state to another. If the operation makes the system inconsistent, it won’t be performed or it will be rolled back.

■ Isolation: Isolation means that no operation interferes with any other operation in the system. For instance, no two writes to a single object will happen at the same time. The writes will happen one after the other, but not at the exact same moment.

■ Durability: Durability is something we talked about earlier. It means that once data is written, it’s guaranteed to be read back and not lost in the normal course of operation of the system.

HBase is not an ACID-compliant database [6]

HBase isn’t an ACID-compliant database, but it provides some guarantees that you can use to reason about the behavior of your application’s interaction with the system. These guarantees are as follows:

1 Operations are row-level atomic. In other words, any Put() on a given row either succeeds in its entirety or fails and leaves the row the way it was before the operation started. There will never be a case where part of the row is written and some part is left out. This property holds regardless of the number of column families across which the operation is performed.

2 Interrow operations are not atomic. There are no guarantees that all operations will complete or fail together in their entirety. Each individual row operation is still atomic, as described in the previous point.

3 checkAnd* and increment* operations are atomic.

4 Multiple write operations to a given row are always independent of each other in their entirety. This is an extension of the first point.

5 Any Get() operation on a given row returns the complete row as it exists at that point in time in the system.

6 A scan across a table is not a scan over a snapshot of the table at any point. If a row R is mutated after the scan has started but before R is read by the scanner object, the updated version of R is read by the scanner. But the data read by the scanner is consistent and contains the complete row at the time it’s read.

From the context of building applications with HBase, these are the important points you need to be aware of.

[6] HBase’s ACID semantics are described in the HBase manual: http://hbase.apache.org/acid-semantics.html.

2.9 Summary

In case you missed something along the way, here is a quick overview of the material covered in this chapter.

HBase is a database designed for semistructured data and horizontal scalability. It stores data in tables. Within a table, data is organized over a four-dimensional coordinate system: rowkey, column family, column qualifier, and version. HBase is schema-less, requiring only that column families be defined ahead of time. It’s also type-less, storing all data as uninterpreted arrays of bytes. There are five basic commands for interacting with data in HBase: Get, Put, Delete, Scan, and Increment. The only way to query

The data model is logically organized as either a key-value store or as a sorted map of maps. The physical data model is column-oriented along column families, and individual records are stored in a key-value style. HBase persists data records into HFiles, an immutable file format. Because records can’t be modified once written, new values are persisted to new HFiles. Data view is reconciled on the fly at read time and during compactions.

The HBase Java client API exposes tables via the HTableInterface. Table connections can be established by constructing an HTable instance directly. Instantiating an HTable instance is expensive, so the preferred method is via the HTablePool because it manages connection reuse. Tables are created and manipulated via instances of the HBaseAdmin, HTableDescriptor, and HColumnDescriptor classes. All five commands are exposed via their respective command objects: Get, Put, Delete, Scan, and Increment. Commands are sent to the HTableInterface instance for execution. A variant of Increment is also available using the HTableInterface.incrementColumnValue() method. The results of executing Get, Scan, and Increment commands are returned in instances of Result and ResultScanner objects. Each record returned is represented by a KeyValue instance. All of these operations are also available on the command line via the HBase shell.

Schema designs in HBase are heavily influenced by anticipated data-access patterns. Ideally, the tables in your schema are organized according to these patterns. The rowkey is the only fully indexed coordinate in HBase, so queries are often implemented as rowkey scans. Compound rowkeys are a common practice in support of these scans. An even distribution of rowkey values is often desirable. Hashing algorithms such as MD5 or SHA1 are commonly used to achieve even distribution.
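As a rough illustration of a hashed, compound rowkey, the sketch below prefixes a timestamp with the MD5 digest of a user ID. The layout (16-byte hash followed by an 8-byte timestamp), the class name, and the sample values are assumptions chosen for this example. It uses only the JDK, so the resulting byte array would still need to be handed to a Put or Scan in real code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustration only: one possible compound rowkey layout.
final class HashedRowkeySketch {

    static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    // Builds md5(user) followed by the timestamp as an 8-byte big-endian long.
    static byte[] rowkey(String user, long timestamp) {
        try {
            byte[] hash = MessageDigest.getInstance("MD5")
                    .digest(user.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.allocate(MD5_LENGTH + Long.BYTES)
                    .put(hash)
                    .putLong(timestamp)
                    .array();
        } catch (NoSuchAlgorithmException e) {
            // Not expected: MD5 is a standard MessageDigest algorithm.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical user and timestamp.
        StringBuilder hex = new StringBuilder();
        for (byte b : rowkey("TheRealMT", 1329088818321L)) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}
```

Because the hash is a fixed length, every key built this way is 24 bytes long, and all keys for one user share the same 16-byte prefix, so a rowkey scan over that prefix returns that user's rows together.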

In document Hbase in Action (Page 73-78)