HBase and Cassandra are two main stream open source key-value pair NoSQL databases that can scale out and are fault tolerant. We looked at the two databases because they both work well with the challenge of high velocity incoming data compared to relational database.
B+ tree vs Log Structured Merge
Relational database use B+ tree to store data, it’s a tree structure that supports efficient insertion, lookup and deletion. The problem it has with high velocity data is: Each node in B+ tree represents a “block” or “page”, a computer writes or read in units of blocks, if the data is smaller than one block, it’ll take this whole block, if it’s bigger than one block but smaller than 2, it’ll just take blocks. Database systems manage data by blocks to make efficient use of disk space. If one tree “node” exceeds the block capacity, the node will be split into two nodes, and it might also affect its parent node. Now if data comes very fast, the split and re-organization of blocks will happen frequently. Eventually, when the frequency of re-organization happens outweighs the advantage re-organization brings, MySQL won’t be able to stay healthy any more, that’s when you will see database response time spiking, causing timeout error on your servers.
HBase and Cassandra takes a different approach called Log Structured Merge. As I mentioned in the last article, data is written to an in memory structure, it’s called memstore in HBase, memtable in Cassandra. When this memory structure reaches a certain limit or the user manually flushes memory into disk, it will be written into disk as a file. As this kind of file accumulates, these two databases start compaction process to combine smaller files into bigger files. Each file also contains a tree structure data sorted.
Fundamentally, B+ tree re-organization requires disk seek to find the node and do stuff, LSM requires disk transfer to merge files. According to Moore’s Law, CPU, RAM and disk size always double every 18-24 months, however, disk seek time only has 5% improvement every year. As a result, at scale, disk seek is less efficient than disk transfer. LSM gains advantage of hardware to outperform B+ tree in this perspective. (However, LSM also brings its own problems due to this design with UPDATE and DELETE operation, I’ll talk about them later)
Except for high velocity, high volume is also a problem that MySQL has problem to deal with. Relational database is designed to work on one machine, depending your application’s data retention time, if it gets bigger than one machine’s capacity, you would either shard MySQL or look for other databases that can scale out. HBase and Cassandra employs different approach on this feature.
HBase arrange its data strictly ordered by its key byte order. For example, following is a sample sequence of keys in disk:
If a user wants to do range scan query, he would have to pad some numbers to his key design like the following:
You can keep inserting data into one region until it reaches a configurable maximum size, then it’ll be cut in half. Each region in HBase represents a key range, different key ranges are distributed in differet nodes. ZooKeeper helps HBase coordinate among different nodes and store metadata for clients to find the right node to write/read data. There is one master node and other normal nodes.
Cassandra actually provides a similar option called ByteOrderedPartitioner, which would order data like HBase does, however, it’s not recommended to be used in production since it’s hard to manage and will create hot spot issue. I tried it once and Cassandra would cut off a fixed length of my key and converts into some bytes, it turns out the cutting off part only occupied one small area of Cassandra supported key range, all my data end up in certain hot spot of the cluster. I didn’t dig further in this path. But HBase would do an auto-sharding for me, while in Cassandra you are on your own with this task.
The recommended way in Cassandra is called Murmur3 hash, it will hash your key evenly into its key range and help you evenly distribute your load across the cluster. Consistent hash and small virtual nodes are also techniques used to help ease the process of adding, removing node in cluster.
Cassandra doesn’t need ZooKeeper to help coordinate, it’s truely a peer to peer system, that the client could just talk to any node in cluster to write/read data. While in HBase, the client would talk to ZooKeeper first and remember which node to communicate, cache it memory for next time query, update when there’s a sharding going on.
As a result, both HBase and Cassandra can scale out with big volume data in their own approach, depending on your use case, they have strength and weakness for different query pattern, I’ll talk about details in next article.