HBase table design
4.7.7 Cell versioning
HBase by default maintains three versions of each cell. This property is configurable. If you care about only one version, it’s recommended that you configure your tables to maintain only one. This way, HBase doesn’t hold on to multiple versions of cells that you update. The number of versions is configured at the column family level and can be specified when you create the table:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 1}
You can specify multiple properties for column families in the same create statement, like this:
hbase(main):002:0> create 'mytable',
{NAME => 'colfam1', VERSIONS => 1, TTL => '18000'}
You can also specify the minimum versions that should be stored for a column family like this:
hbase(main):002:0> create 'mytable', {NAME => 'colfam1', VERSIONS => 5,
MIN_VERSIONS => '1'}
This comes in handy in conjunction with setting the TTL on the family. When all versions currently stored are older than the TTL, at least the MIN_VERSIONS most recent versions are kept around. This ensures that you don’t get empty results if you query and the data is older than the TTL.
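The interaction among VERSIONS, TTL, and MIN_VERSIONS described above can be illustrated with a small model. The following is a minimal sketch in plain Java, not HBase’s own code: the class name VersionRetentionSketch, the method retained, and the use of a single arbitrary time unit for timestamps and the TTL are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model of the retention rules described in the text.
// This is an illustration only; it is not how HBase is implemented.
class VersionRetentionSketch {

    /**
     * Returns the timestamps of the versions that survive, newest first.
     * All time values share one arbitrary unit (an assumption of this sketch).
     */
    static List<Long> retained(List<Long> timestamps, int maxVersions,
                               int minVersions, long ttl, long now) {
        List<Long> sorted = new ArrayList<>(timestamps);
        sorted.sort(Collections.reverseOrder()); // newest version first

        List<Long> kept = new ArrayList<>();
        for (long ts : sorted) {
            if (kept.size() >= maxVersions) {
                break; // VERSIONS caps how many versions are stored
            }
            boolean expired = now - ts > ttl;
            // An expired version is kept only while fewer than
            // minVersions newer versions have been retained.
            if (!expired || kept.size() < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> stamps = Arrays.asList(10L, 20L, 30L);
        // Every version is older than the TTL of 18 at time 100.
        System.out.println(retained(stamps, 5, 1, 18L, 100L)); // prints [30]
        System.out.println(retained(stamps, 5, 0, 18L, 100L)); // prints []
    }
}
```

With the minimum set to 1, the newest version survives even though it has expired, so a read still returns a value. With the minimum set to 0, the same read comes back empty.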
4.8 Filtering data
You’ve learned so far that HBase has a flexible schema and a simple disk layout, which allows applications to work closer to the disk and network and optimize at that level. Designing effective schemas is one aspect of it, and by now you have a bunch of concepts that you can apply to do that. You can design your keys such that data you access together is placed close together on disk, so you save on disk seeks while reading or writing. Often you read data based on certain criteria, and you can use those criteria to further optimize access. Filters are a powerful feature that can come in handy in such cases.
We haven’t come across many real-life use cases that use filters much; generally the access pattern can be optimized with tweaks to the table designs. But sometimes you’ve tweaked your table design as much as you can, and optimized it for as many different access patterns as possible. When you still need to reduce the data returned to clients, that’s when you reach for a filter. Filters are sometimes called push-down predicates, allowing you to push data-filtering criteria down to the server (see figure 4.16). That logic is applied during reads and affects the data that is returned to the client. This saves network I/O by limiting the data transferred over the network. But data is still read off the disk into the RegionServers, and filters are applied at the RegionServers. The network I/O savings can be significant because you’re probably storing a large amount of data in HBase tables, and reading it all into the client application to filter out useful bits is expensive.
[Figure 4.16 diagram, two panels. Client-side filtering: the client issues a Get or Scan, the RegionServers send all data all the time, and the filtering logic runs in the client. Server-side filtering: the client issues a Get or Scan, filters run on the RegionServers, and only data that passes the filter criteria is sent.]
Figure 4.16 Filtering data can be done at the client side by reading data into the client application from the RegionServers and applying the filter logic there; or it can be done at the server side by pushing the filtering logic down to the RegionServers, thereby reducing the amount of data transferred over the network to the clients. Filters can essentially save on network I/O costs, and sometimes even on disk I/O.
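The difference between the two placements of the filtering logic can be made concrete with a toy model. This is a minimal sketch in plain Java that doesn’t use the HBase client API: the class FilterPlacementSketch, its Result holder, and the idea of counting rows as a stand-in for network I/O are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Toy model of client-side versus server-side filtering.
// Rows are plain strings; no HBase classes are involved.
class FilterPlacementSketch {

    /** The rows the client ends up with, and how many rows crossed the network. */
    static final class Result {
        final List<String> rows;
        final int rowsTransferred;

        Result(List<String> rows, int rowsTransferred) {
            this.rows = rows;
            this.rowsTransferred = rowsTransferred;
        }
    }

    /** The server ships every stored row; the client applies the criteria. */
    static Result clientSide(List<String> stored, Predicate<String> criteria) {
        List<String> shipped = new ArrayList<>(stored);
        List<String> matches = new ArrayList<>();
        for (String row : shipped) {
            if (criteria.test(row)) {
                matches.add(row);
            }
        }
        return new Result(matches, shipped.size());
    }

    /** The server applies the criteria and ships only the matching rows. */
    static Result serverSide(List<String> stored, Predicate<String> criteria) {
        List<String> shipped = new ArrayList<>();
        for (String row : stored) { // every row is still read on the server
            if (criteria.test(row)) {
                shipped.add(row);
            }
        }
        return new Result(shipped, shipped.size());
    }
}
```

Both methods hand the client the same rows. Only the number of rows transferred differs, which is the saving the figure describes. In both cases the server still reads every stored row.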
HBase provides an API you can use to implement custom filters. A bunch of filters come bundled as well. Filtering can be done based on the rowkeys, which happens the earliest in the read process. Thereafter, filtering can happen based on the KeyValues read off the HFiles. A filter has to implement the Filter interface that’s part of the HBase JAR or extend one of the abstract classes that implement it. We recommend extending the FilterBase abstract class so you can avoid having to write boilerplate code. Extending other classes such as CompareFilter is also an option and works equally well. The interface has the following methods that are called at various points while reading a row (see figure 4.17 for the order). They’re always executed in the sequence described next:
1 This method is the first one to be called and performs filtering based on the rowkey:
boolean filterRowKey(byte[] buffer, int offset, int length)
Based on the logic in there, it returns true if the row is to be filtered out (not included in the result set returned) or false if it’s to be sent to the client.
2 If the row isn’t filtered in the previous step, this method is invoked next for every KeyValue object that’s part of the current row:
ReturnCode filterKeyValue(KeyValue v)
This method returns a ReturnCode, which is an enum defined as a part of the Filter interface. The ReturnCode returned determines what happens to that particular KeyValue object.
3 This method comes after the KeyValues are filtered based on step 2:
void filterRow(List<KeyValue> kvs)
This method is passed the list of KeyValue objects that made it after filtering. Given that this method has access to that list, you can perform any transformations or operations you want to on the elements in the list at this point.
4 At this point, the framework provides another chance to filter out the row if you choose to do so:
boolean filterRow()
Returning true filters out the row under consideration.
5 You can build logic in your filter to stop a scan early. This is the method into which you put that logic:
boolean filterAllRemaining()
This is handy in cases where you’re scanning a bunch of rows, looking for something specific in the rowkey, column qualifier, or cell value, and you don’t care about the remaining rows once you’ve found it. This is the last method to be called in the sequence.
Another useful method is reset(). The server calls it after the filter has been applied to an entire row, so the filter can clear any per-row state before the next row is read.
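The call sequence described in the numbered steps can be traced with a small stand-alone model. This is a minimal sketch under stated assumptions, not the HBase Filter API: SketchFilter, FirstRowWithPrefix, and scan are hypothetical names. Rowkeys and cells are plain strings instead of byte arrays and KeyValue objects, and the ReturnCode enum here carries only two of the values the real enum defines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in types for illustration only; these are NOT the HBase classes.
class FilterLifecycleSketch {

    enum ReturnCode { INCLUDE, SKIP } // a subset of what the real enum offers

    // Hypothetical, simplified mirror of the callbacks listed in the text.
    interface SketchFilter {
        boolean filterRowKey(String rowKey);     // true = filter the row out
        ReturnCode filterKeyValue(String cell);  // decide the fate of one cell
        void filterRow(List<String> keptCells);  // may change the kept cells
        boolean filterRow();                     // true = filter the row out
        boolean filterAllRemaining();            // true = stop the scan
        void reset();                            // clear per-row state
    }

    // Keeps rows whose key has the prefix, skips empty cells, drops rows
    // left with no cells, and stops the scan after the first row it keeps.
    static class FirstRowWithPrefix implements SketchFilter {
        private final String prefix;
        private boolean found;
        private boolean rowIsEmpty = true;

        FirstRowWithPrefix(String prefix) { this.prefix = prefix; }

        public boolean filterRowKey(String rowKey) { return !rowKey.startsWith(prefix); }

        public ReturnCode filterKeyValue(String cell) {
            if (cell.isEmpty()) return ReturnCode.SKIP;
            rowIsEmpty = false;
            return ReturnCode.INCLUDE;
        }

        public void filterRow(List<String> keptCells) { /* no transformation here */ }

        public boolean filterRow() {
            if (!rowIsEmpty) found = true;
            return rowIsEmpty;
        }

        public boolean filterAllRemaining() { return found; }

        public void reset() { rowIsEmpty = true; }
    }

    // Drives a filter over the rows in the order the text describes.
    static Map<String, List<String>> scan(Map<String, List<String>> table, SketchFilter filter) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> row : table.entrySet()) {
            if (!filter.filterRowKey(row.getKey())) {                 // step 1
                List<String> kept = new ArrayList<>();
                for (String cell : row.getValue()) {                  // step 2
                    if (filter.filterKeyValue(cell) == ReturnCode.INCLUDE) kept.add(cell);
                }
                filter.filterRow(kept);                               // step 3
                if (!filter.filterRow()) results.put(row.getKey(), kept); // step 4
            }
            boolean stop = filter.filterAllRemaining();               // step 5
            filter.reset();                                           // between rows
            if (stop) break;
        }
        return results;
    }
}
```

The driver follows the order given in the text. A real RegionServer may invoke these callbacks at additional points, so treat the loop as a reading aid rather than a description of the server’s internals.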
NOTE This API is powerful, but we aren’t aware of it being used much in applications. In many cases, the requirement for using filters changes if the schema design is changed.