HBase table design
4.5.1 Optimized for writes
When you’re writing lots of data into HBase tables, you want to optimize by distributing the load across RegionServers. This isn’t all that hard to do, but you may have to make trade-offs in optimizing your read patterns. Take the time-series data example: if you use the timestamp as the rowkey, every new row sorts after the previous one, so you’ll hot-spot on a single region during write time.
In many use cases, you don’t need to access the data based on a single timestamp. You’ll probably want to run a job that computes aggregates over a time range, and if that’s not latency sensitive, you can afford to do a parallel scan across multiple regions to do that for you. The question is, how do you distribute that data across multiple regions? There are a few options to consider, and the answer depends on what kind of information you want your rowkeys to contain.
"TheRealMT", "info", "password", 1329088321289, "Langhorne"
"TheRealMT", "info", "password", 1329088818321, "abc123"
"TheRealMT", "info", "email", 1329088321289, "[email protected]"
"TheRealMT", "info", "name", 1329088321289, "Mark Twain"
HFile for the info column family in the users table
HASHING
If you’re willing to lose the timestamp information from your rowkey (which may be okay in cases where you need to scan the entire table every time you want to do something, or you know the exact key every time you want to read data), making your rowkey a hash of the original data is a possible solution:
hash("TheRealMT") -> random byte[]
You need to know "TheRealMT" every time you want to access the row that is keyed by the hashed value of this function.
With time-series data, that generally isn’t the case. You most likely don’t know the specific timestamp when you access data; you probably have a time range in mind. But there are cases like the twits table or the relationship tables you created earlier, where you know the user ID and can calculate the hash to find the correct row. To achieve a good distribution across all regions, you can hash using MD5, SHA-1, or any other hash function of your choice that gives you random distribution.
The way you use your hash function is also important. The relationship tables you built earlier in this chapter use MD5 hashes of the user IDs, and you can easily regenerate those when you’re looking for a particular user’s information. But note that you’re concatenating the MD5 hashes of two user IDs (MD5(user1) + MD5(user2)) rather than concatenating the user IDs and then hashing the result (MD5(user1 + user2)). The reason is that when you want to scan all the relationships for a given user, you pass start and stop rowkeys to your scanner object, and you can derive both from MD5(user1) because it’s the prefix of every rowkey for that user. Doing that when the key is a hash of the combination of the two user IDs isn’t possible because you lose the information for the given user ID from that rowkey.
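The start and stop rowkeys for such a scan can be sketched as follows. This is a minimal illustration that uses only the JDK, not the code from the relationship tables themselves: the class RelationKeys, its method names, and the user ID "otherUser" are hypothetical, and the sketch assumes the MD5(user1) + MD5(user2) layout described above. The stop row is the 16-byte prefix incremented by one, which is the smallest key that sorts after every rowkey starting with that prefix.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the HBase API.
class RelationKeys {
    static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Rowkey layout: MD5(fromUser) followed by MD5(toUser), 32 bytes in total.
    static byte[] rowKey(String fromUser, String toUser) {
        byte[] key = Arrays.copyOf(md5(fromUser), 2 * MD5_LENGTH);
        System.arraycopy(md5(toUser), 0, key, MD5_LENGTH, MD5_LENGTH);
        return key;
    }

    // Start row (inclusive) for scanning every relationship of fromUser.
    static byte[] startRow(String fromUser) {
        return md5(fromUser);
    }

    // Stop row (exclusive): the prefix plus one, read as a big-endian unsigned number.
    static byte[] stopRow(String fromUser) {
        byte[] stop = md5(fromUser);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (++stop[i] != 0) { // this byte did not overflow, so no carry is needed
                return stop;
            }
        }
        return new byte[0]; // prefix was all 0xFF; an empty stop row scans to the end of the table
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] key = rowKey("TheRealMT", "otherUser");
        boolean inRange = compare(startRow("TheRealMT"), key) <= 0
                && compare(key, stopRow("TheRealMT")) < 0;
        System.out.println("rowkey falls inside the scan range: " + inRange);
    }
}
```

With MD5(user1 + user2) as the rowkey, no such prefix exists, so there’s nothing to build the start and stop rows from.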
Collisions
Hashing algorithms have a non-zero probability of collision. Some algorithms have more than others. When working with large datasets, be careful to use a hashing algorithm that has lower probability of collision. For instance, SHA-1 is better than MD5 in that regard and may be a better option in some cases even though it’s slightly slower in performance.
SALTING
Salting is another trick you can have in your tool belt when thinking about rowkeys. Let’s consider the same time-series example discussed earlier. Suppose you know the time range at read time and don’t want to do full table scans. Hashing the timestamp and making the hash value the rowkey requires full table scans, which is highly inefficient, especially if you have the ability to limit the scan. Making the hash value the rowkey isn’t your solution here. You can instead prefix the timestamp with a random number.
For example, you can generate a salt number by taking the hash code of the timestamp and taking its modulus with the number of RegionServers (or some multiple of it):
int salt = Math.floorMod(Long.valueOf(timestamp).hashCode(), <number of region servers>); // floorMod keeps the salt non-negative
This involves taking the salt number and putting it in front of the timestamp to generate your rowkey:
byte[] rowkey = Bytes.add(Bytes.toBytes(salt), Bytes.toBytes("|"), Bytes.toBytes(timestamp));
Now your rowkeys are something like the following:
0|timestamp1
0|timestamp5
0|timestamp6
1|timestamp2
1|timestamp9
2|timestamp4
2|timestamp8
These, as you can imagine, distribute across regions based on the first part of the key, which is the random salt number.
0|timestamp1, 0|timestamp5, and 0|timestamp6 go to one region unless the region splits, in which case they’re distributed across two regions. 1|timestamp2 and 1|timestamp9 go to a different region, and 2|timestamp4 and 2|timestamp8 go to the third. Data for consecutive timestamps is distributed across multiple regions.
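To see the effect on a run of consecutive timestamps, here’s a small sketch that groups salted rowkeys by their salt. It uses only the JDK and builds the keys as strings for readability; the class SaltedKeys, the bucket count of 3, and the sample timestamps are assumptions made for illustration. In an HBase client you’d build the byte[] rowkey with Bytes.add instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the HBase API.
class SaltedKeys {
    // The salt is derived from the timestamp itself, so it can be recomputed at read time.
    static int salt(long timestamp, int buckets) {
        return Math.floorMod(Long.hashCode(timestamp), buckets);
    }

    // String form of the rowkey, for example "2|1329088321289".
    static String rowKey(long timestamp, int buckets) {
        return salt(timestamp, buckets) + "|" + timestamp;
    }

    public static void main(String[] args) {
        int buckets = 3;            // assumed: one salt value per RegionServer
        long start = 1329088321289L; // sample timestamp in milliseconds

        // Group nine consecutive timestamps by the salt they receive.
        Map<Integer, List<String>> bySalt = new TreeMap<>();
        for (long ts = start; ts < start + 9; ts++) {
            bySalt.computeIfAbsent(salt(ts, buckets), k -> new ArrayList<>())
                  .add(rowKey(ts, buckets));
        }
        bySalt.forEach((s, keys) -> System.out.println(s + " -> " + keys));
    }
}
```

Because the salt is a function of the timestamp rather than a truly random number, a reader who knows the timestamp can rebuild the full rowkey without storing the salt anywhere else.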
But not everything is hunky-dory. Reads now involve distributing the scans to all the regions and finding the relevant rows. Because they’re no longer stored together, a short scan won’t cut it. It’s about trade-offs and choosing the ones you need to make in order to have a successful application.
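The read side of that trade-off can be sketched as one range scan per salt value, followed by a merge. The example below stands in a sorted in-memory map for the table so it runs with the JDK alone; the class SaltedRangeRead, the bucket count, and the string rowkey format are assumptions for illustration, and a real client would issue one HBase scan per salt with the corresponding start and stop rows. It also assumes all timestamps have the same number of digits, so string order matches numeric order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper for illustration; a TreeMap plays the role of the sorted table.
class SaltedRangeRead {
    static final int BUCKETS = 3; // assumed number of salt values

    static String rowKey(long ts) {
        return Math.floorMod(Long.hashCode(ts), BUCKETS) + "|" + ts;
    }

    // Reads timestamps in [from, to) with one range scan per salt, then merges the results.
    static List<Long> readRange(NavigableMap<String, Long> table, long from, long to) {
        List<Long> result = new ArrayList<>();
        for (int salt = 0; salt < BUCKETS; salt++) {
            String startRow = salt + "|" + from; // inclusive
            String stopRow = salt + "|" + to;    // exclusive
            result.addAll(table.subMap(startRow, true, stopRow, false).values());
        }
        Collections.sort(result); // rows come back grouped by salt, so restore time order
        return result;
    }

    public static void main(String[] args) {
        NavigableMap<String, Long> table = new TreeMap<>();
        long start = 1329088321289L;
        for (long ts = start; ts < start + 100; ts++) {
            table.put(rowKey(ts), ts);
        }
        System.out.println(readRange(table, start + 10, start + 20));
    }
}
```

Every read of a time range now costs as many scans as there are salt values, which is the price you pay for spreading the writes.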