
Configuring the HBase BlockCache

In document Cloudera Administration (Page 87-91)

In the default configuration, HBase uses a single on-heap cache. If you configure the off-heap BucketCache, the on-heap cache is used for Bloom filters and indexes, and the off-heap BucketCache is used to cache data blocks. This is referred to as the Combined BlockCache configuration. The Combined BlockCache allows you to use a larger in-memory cache while reducing the negative impact of garbage collection in the heap, because HBase manages the BucketCache itself rather than relying on the garbage collector.

Contents of the BlockCache

In order to size the BlockCache correctly, you need to understand what HBase places into it.

• Your data: Each time a Get or Scan operation occurs, the result is added to the BlockCache if it was not already cached there. If you use the BucketCache, data blocks are always cached in the BucketCache.

• Row keys: When a value is loaded into the cache, its row key is also cached. This is one reason that it is important to make your row keys as small as possible. A larger row key takes up more space in the cache.

• hbase:meta: The hbase:meta catalog table keeps track of which RegionServer is serving which regions. It can consume several megabytes of cache if you have a large number of regions. It is given in-memory access priority, which means HBase attempts to keep it in the cache as long as possible.

• Indexes of HFiles: HBase stores its data in HDFS in a format called HFile. These HFiles contain indexes which allow HBase to seek for data within them without needing to open the entire HFile. The size of an index is a factor of the block size, the size of your row keys, and the amount of data you are storing. For big data sets, the size can exceed 1 GB per RegionServer, although it is unlikely that the entire index will be in the cache at the same time. If you use the BucketCache, indexes are always cached on-heap.

• Bloom filters: If you use Bloom filters, they are stored in the BlockCache. If you use the BucketCache, Bloom filters are always cached on-heap.

The sum of the sizes of these objects is highly dependent on your usage patterns and the characteristics of your data. For this reason, the HBase Web UI and Cloudera Manager each expose several metrics to help you size and tune the BlockCache.

Deciding Whether To Use the BucketCache

The HBase team has published the results of exhaustive BlockCache testing, which revealed the following guidelines.

• If the result of a Get or Scan typically fits completely in the heap, the default configuration, which uses the on-heap LruBlockCache, is the best choice, as the L2 cache will not provide much benefit. If the eviction rate is low, garbage collection can be 50% less than that of the BucketCache, and throughput can be at least 20% higher.

• Otherwise, if your cache is experiencing a consistently high eviction rate, use the BucketCache, which causes 30-50% of the garbage collection of LruBlockCache when the eviction rate is high.

BucketCache using file mode on solid-state disks has a better garbage-collection profile but lower throughput than BucketCache using off-heap memory.

Bypassing the BlockCache

If the data needed for a specific but atypical operation does not all fit in memory, using the BlockCache can be counter-productive: data that you are still using may be evicted, and even if no other data is evicted, excess garbage collection can adversely affect performance. For this type of operation, you may decide to bypass the BlockCache. To bypass the BlockCache for a given Scan or Get, use the setCacheBlocks(false) method.

In addition, you can prevent a specific column family's contents from being cached, by setting its BLOCKCACHE configuration to false. Use the following syntax in HBase Shell:

hbase> alter 'myTable', 'myCF', CONFIGURATION => {BLOCKCACHE => 'false'}

Cache Eviction Priorities

Both the on-heap cache and the off-heap BucketCache use the same cache priority mechanism to decide which cache objects to evict in order to make room for new objects. Three levels of block priority allow for scan-resistance and in-memory column families. Objects evicted from the cache are subject to garbage collection.

• Single access priority: The first time a block is loaded from HDFS, it is given single access priority, which means that it will be part of the first group to be considered during evictions. Scanned blocks are more likely to be evicted than blocks that are used more frequently.

• Multi access priority: If a block in the single access priority group is accessed again, it is assigned multi access priority, which means that it is part of the second group considered during evictions, and is therefore less likely to be evicted.

• In-memory access priority: If the block belongs to a column family which is configured with the in-memory configuration option, it is assigned in-memory access priority, regardless of its access pattern. This group is the last group considered during evictions, but its blocks can still be evicted. Catalog tables are configured with in-memory access priority.
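The ordering of the three priority groups can be illustrated with a small, simplified model. This is only a sketch of the behavior described above, not HBase's actual eviction code, which is more involved. The class name, method names, and block identifiers are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.EnumMap;
import java.util.Map;

// Toy model of the three eviction groups; NOT the HBase implementation.
final class PriorityEvictionSketch {
    // Declaration order doubles as the order in which groups are considered for eviction.
    enum Priority { SINGLE, MULTI, IN_MEMORY }

    private final Map<Priority, Deque<String>> groups = new EnumMap<>(Priority.class);

    PriorityEvictionSketch() {
        for (Priority p : Priority.values()) {
            groups.put(p, new ArrayDeque<>());
        }
    }

    /** First load of a block: in-memory families get IN_MEMORY, everything else SINGLE. */
    void load(String blockId, boolean inMemoryFamily) {
        Priority p = inMemoryFamily ? Priority.IN_MEMORY : Priority.SINGLE;
        groups.get(p).addLast(blockId);
    }

    /** A repeat access promotes a SINGLE block to MULTI; other groups are unchanged. */
    void access(String blockId) {
        if (groups.get(Priority.SINGLE).remove(blockId)) {
            groups.get(Priority.MULTI).addLast(blockId);
        }
    }

    /** Evicts the oldest block from the first non-empty group, or returns null if all are empty. */
    String evictOne() {
        for (Priority p : Priority.values()) {
            String victim = groups.get(p).pollFirst();
            if (victim != null) {
                return victim;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PriorityEvictionSketch cache = new PriorityEvictionSketch();
        cache.load("scanned-block", false);
        cache.load("hot-block", false);
        cache.load("meta-block", true);
        cache.access("hot-block");            // promoted to MULTI
        System.out.println(cache.evictOne()); // the block that was read only once goes first
    }
}
```

In this model a block that is read once during a scan is evicted before a block that has been read twice, which is the scan-resistance property the priorities are meant to provide.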

To configure a column family for in-memory access, use the following syntax in HBase Shell:

hbase> alter 'myTable', 'myCF', CONFIGURATION => {IN_MEMORY => 'true'}

To use the Java API to configure a column family for in-memory access, use the HColumnDescriptor.setInMemory(true) method.

Sizing the BlockCache

When you use the LruBlockCache, the blocks needed to satisfy each read are cached, evicting older cached objects if the LruBlockCache is full. The size of the cached objects for a given read may be significantly larger than the actual result of the read. For instance, if HBase needs to scan through 20 HFile blocks to return a 100 byte result, and the HFile blocksize is 100 KB, the read will add 20 * 100 KB to the LruBlockCache.
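The arithmetic in that example can be checked with a short sketch. The class and method names are hypothetical, and the conversion of 1 KB to 1024 bytes is an assumption.

```java
// Sketch: cache footprint of one read compared with the size of its result.
final class ReadFootprint {
    /** Bytes added to the cache when a read touches the given number of HFile blocks. */
    static long cachedBytes(int blocksTouched, long blockSizeBytes) {
        return blocksTouched * blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 100L * 1024;             // 100 KB, assuming 1 KB = 1024 bytes
        long cached = cachedBytes(20, blockSize); // 20 blocks scanned for one read
        long resultBytes = 100;                   // size of the value actually returned
        System.out.println("cached bytes: " + cached);
        System.out.println("cached-to-result ratio: " + (cached / resultBytes));
    }
}
```

Under these assumptions the read places roughly 2 MB in the cache to return 100 bytes, which is why the cache should be sized for the blocks touched rather than for the results returned.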

Because the LruBlockCache resides entirely within the Java heap, it is important to understand how much heap is available to HBase and what percentage of the heap is available to the LruBlockCache. By default, the amount of HBase's heap reserved for the LruBlockCache (hfile.block.cache.size) is .25, or 25%. To determine the amount of heap available for the LruBlockCache, use the following formula. The 0.99 factor allows 1% of heap to be available as a "working area" for evicting items from the cache. If you use the BucketCache, the on-heap LruBlockCache only stores indexes and Bloom filters, and data blocks are cached in the off-heap BucketCache.

number of RegionServers * heap size * hfile.block.cache.size * 0.99
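As a worked instance of this formula, the following sketch computes the total LruBlockCache capacity for a cluster. The cluster size and heap size are hypothetical, the 0.25 fraction is the value quoted in the preceding paragraph, and the class and method names are illustrative only.

```java
// Sketch of: number of RegionServers * heap size * hfile.block.cache.size * 0.99
final class LruCacheSizing {
    // 1% of the cache share is left as a working area for evictions.
    static final double WORKING_AREA_FACTOR = 0.99;

    static double lruCacheBytes(int regionServers, long heapBytesPerServer, double cacheFraction) {
        if (cacheFraction < 0.0 || cacheFraction > 1.0) {
            throw new IllegalArgumentException("hfile.block.cache.size must be between 0.0 and 1.0");
        }
        return regionServers * (double) heapBytesPerServer * cacheFraction * WORKING_AREA_FACTOR;
    }

    public static void main(String[] args) {
        long heap = 16L * 1024 * 1024 * 1024;         // hypothetical 16 GB heap per RegionServer
        double bytes = lruCacheBytes(10, heap, 0.25); // hypothetical 10 RegionServers
        System.out.printf("LruBlockCache across the cluster: %.1f GB%n",
                bytes / (1024.0 * 1024 * 1024));
    }
}
```

With these example inputs the cluster-wide LruBlockCache comes to about 39.6 GB, or just under 4 GB per RegionServer.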

To tune the size of the LruBlockCache, you can add RegionServers or increase the total Java heap on a given RegionServer to increase it, or you can tune hfile.block.cache.size to reduce it. Reducing it will cause cache evictions to happen more often, but will reduce the time it takes to perform a cycle of garbage collection. Increasing the heap will cause garbage collection to take longer but happen less frequently.

About the off-heap BucketCache

If the BucketCache is enabled, it stores data blocks, leaving the on-heap cache free for storing indexes and Bloom filters. The physical location of the BucketCache storage can be either in memory (off-heap) or in a file stored on a fast disk.

• Off-heap: This is the default configuration.

• File-based: You can use the file-based storage mode to store the BucketCache on an SSD or FusionIO device.

Starting in CDH 5.4 (HBase 1.0), it is possible to configure a column family to keep its data blocks in the L1 cache instead of the BucketCache, using the HColumnDescriptor.cacheDataInL1(true) method or by using the following syntax in HBase Shell:

hbase> alter 'myTable', CONFIGURATION => {CACHE_DATA_IN_L1 => 'true'}

Configuring the off-heap BucketCache

This table summarizes the important configuration properties for the BucketCache. To configure the BucketCache, see Configuring the off-heap BucketCache Using Cloudera Manager on page 90 or Configuring the off-heap BucketCache Using the Command Line on page 90.

Table 1: BucketCache Configuration Properties

| Property | Default | Description |
|---|---|---|
| hbase.bucketcache.combinedcache.enabled | true | When BucketCache is enabled, use it as a L2 cache for LruBlockCache. If set to true, indexes and Bloom filters are kept in the LruBlockCache and the data blocks are kept in the BucketCache. |
| hbase.bucketcache.ioengine | none (BucketCache is disabled by default) | Where to store the contents of the BucketCache. Either offheap or file:/path/to/file. |
| hfile.block.cache.size | 0.4 | A float between 0.0 and 1.0. This factor multiplied by the Java heap size is the size of the L1 cache. In other words, the percentage of the Java heap to use for the L1 cache. |
| hbase.bucketcache.size | 64 MB, which is the default blocksize | The total size of the BucketCache, in megabytes. The size you configure depends on the amount of memory available to HBase, or the size of your FusionIO device or SSD. |
| hbase.bucketcache.sizes | not set | A comma-separated list of sizes for buckets for the BucketCache if you prefer to use multiple sizes. The sizes should be multiples of the default blocksize, ordered from smallest to largest. The sizes you use will depend on your data patterns. This parameter is experimental. |
| -XX:MaxDirectMemorySize | not set | A JVM option to configure the maximum amount of direct memory available to the JVM. This value should be larger than the combination of the heap and off-heap BucketCache. |
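The two constraints stated for hbase.bucketcache.sizes (multiples of the blocksize, ordered from smallest to largest) can be checked before a configuration is deployed. The following is a minimal sketch. The class name, the method name, and the 65536 blocksize used in the example are assumptions, so pass whatever blocksize and units your cluster actually uses.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: sanity-check a candidate value for hbase.bucketcache.sizes.
final class BucketSizesCheck {
    /** Returns a list of problems found; an empty list means both constraints hold. */
    static List<String> validate(String sizesCsv, long blockSize) {
        List<String> problems = new ArrayList<>();
        long previous = 0;
        for (String token : sizesCsv.split(",")) {
            long size = Long.parseLong(token.trim());
            if (size <= 0 || size % blockSize != 0) {
                problems.add(size + " is not a positive multiple of " + blockSize);
            }
            if (size <= previous) {
                problems.add(size + " is not larger than the preceding size " + previous);
            }
            previous = size;
        }
        return problems;
    }

    public static void main(String[] args) {
        long assumedBlockSize = 65536; // assumption for illustration only
        System.out.println(validate("65536,131072,262144", assumedBlockSize));
        System.out.println(validate("131072,65536", assumedBlockSize));
    }
}
```

The first call reports no problems, and the second reports that the list is out of order.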

Configuring the off-heap BucketCache Using Cloudera Manager

1. Go to the HBase service.

2. Click the Configuration tab.

3. Edit the parameter HBASE_OFFHEAPSIZE in the HBase Service Advanced Configuration Snippet for hbase-env.sh and set it to a value (such as 5G) which will accommodate your desired L2 cache size, in addition to space reserved for cache management.

4. Edit the parameter HBASE_OPTS in the HBase Service Advanced Configuration Snippet for hbase-env.sh and add the JVM option -XX:MaxDirectMemorySize=<size>G, replacing <size> with a value large enough to contain your heap and off-heap BucketCache, expressed as a number of gigabytes.

5. Add the following settings to the HBase Service Advanced Configuration Snippet for hbase-site.xml, using values appropriate to your situation. See Table 1: BucketCache Configuration Properties on page 89.

<property>

<name>hbase.bucketcache.ioengine</name>

<value>offheap</value>

</property>

<property>

<name>hfile.block.cache.size</name>

<value>0.2</value>

</property>

<property>

<name>hbase.bucketcache.size</name>

<value>4096</value>

</property>

6. Click Save Changes to commit the changes.

7. Restart or rolling restart your RegionServers for the changes to take effect.
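The value for -XX:MaxDirectMemorySize in step 4 can be derived from the heap size and the configured hbase.bucketcache.size. The sketch below computes the smallest whole number of gigabytes that covers both, following the guidance in that step. The class name, method names, and the 16 GB heap are hypothetical, and you may want to add headroom on top of the computed minimum.

```java
// Sketch: derive a lower bound for -XX:MaxDirectMemorySize from heap and BucketCache sizes.
final class DirectMemoryOption {
    /** Smallest whole number of gigabytes that holds heap plus BucketCache (both given in MB). */
    static long requiredGigabytes(long heapMb, long bucketCacheMb) {
        long totalMb = heapMb + bucketCacheMb;
        return (totalMb + 1023) / 1024; // integer division, rounded up to a whole GB
    }

    /** Formats the JVM option in the <size>G form used in the steps above. */
    static String jvmOption(long heapMb, long bucketCacheMb) {
        return "-XX:MaxDirectMemorySize=" + requiredGigabytes(heapMb, bucketCacheMb) + "G";
    }

    public static void main(String[] args) {
        long heapMb = 16L * 1024;  // hypothetical 16 GB RegionServer heap
        long bucketCacheMb = 4096; // hbase.bucketcache.size from the example above
        System.out.println(jvmOption(heapMb, bucketCacheMb));
    }
}
```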

Configuring the off-heap BucketCache Using the Command Line

Important:

• If you use Cloudera Manager, do not use these command-line instructions.

• This information applies specifically to CDH 5.5.x. If you use an earlier version of CDH, see the documentation for that version located at Cloudera Documentation.

1. First, verify the RegionServer's off-heap size, and if necessary, tune it by editing the hbase-env.sh file and adding a line like the following:

HBASE_OFFHEAPSIZE=5G

Set it to a value which will accommodate your desired L2 cache size, in addition to space reserved for cache management.

2. Edit the parameter HBASE_OPTS in the hbase-env.sh file and add the JVM option -XX:MaxDirectMemorySize=<size>G, replacing <size> with a value large enough to contain your heap and off-heap BucketCache, expressed as a number of gigabytes.

3. Next, configure the properties in Table 1: BucketCache Configuration Properties on page 89 as appropriate, using the example below as a model.

<property>

<name>hbase.bucketcache.ioengine</name>

<value>offheap</value>

</property>

<property>

<name>hfile.block.cache.size</name>

<value>0.2</value>

</property>

<property>

<name>hbase.bucketcache.size</name>

<value>4096</value>

</property>

4. Restart each RegionServer for the changes to take effect.

Monitoring the BlockCache

Cloudera Manager provides metrics to monitor the performance of the BlockCache, to assist you in tuning your configuration.

You can view further detail and graphs using the RegionServer UI. To access the RegionServer UI in Cloudera Manager, go to the Cloudera Manager page for the host, click the RegionServer process, and click HBase RegionServer Web UI.

If you do not use Cloudera Manager, access the BlockCache reports at http://regionServer_host:22102/rs-status#memoryStats, replacing regionServer_host with the hostname or IP address of your RegionServer.
