ACID is good-Complexity in Cassandra consistency

If you are used to relational database and new to NoSQL database, you will soon start to miss ACID support as your friend in development. I’ll talk about some very subtle but important details about Cassandra that I wish I had known better when I started my relationship with Cassandra.

Cassandra data modeling is very query oriented. Each query, even for exactly the same data, often requires a new table, with same data, but different partition key.

For example, a user has several different accounts, each account has time series payment data for different purposes. Now if I need the following queries:

  1. Get first 100 payment data for this user.
  2. Get first 100 payment data for one of this user’s accounts.
  3. Get first 100 payment data for a specific purpose.
  4. Get first 100 payment data for a specific purpose for one of the user’s accounts.

Unlike in MySQL, where you can use index to get them easily with a WHERE condition. Just for the record, Cassandra also provides a secondary index, but it has a lot of problems that prevent people from using it. What people end up doing is to create one table for each query with the same data. In this case, it would be like the following:

  1. Table one: Partition by user
  2. Table two: Partition by account
  3. Table three: Partition by payment purpose
  4. Table four: Partition by combined key-account and payment purpose

Since it’s advertised that storage is cheap, Cassandra, like many other big data technologies, run on cheap commodity machines, it seems reasonable to just create these 4 tables, increase number of Cassandra nodes, query the table when you need to.

However, the story doesn’t end here. Now that you have 4 tables to populate when you insert one record. You would probably, as you always do in relational database, add a transaction to include all four inserts and just push them to database, so you can sleep well without worrying about data consistency across table. But it’s not so easy.

The first thing you look up would be Cassandra’s transaction. It really is not the transaction we are talking about. Cassandra’s transaction is called CAS(Compare and Set), and it only applies at individual statement level. Things like:

Insert x into xxx if x doesn’t exist;

Update xxx set a=’a’ if a=’b’;

It doesn’t support batching several statements into one transaction and roll back in case of partial failure. What Cassandra’s transaction really provides is Linearizable consistency, or, as in ACID term, Isolation. A good example is that two users trying to claim the same account id, only one will succeed with Cassandra transaction, or MySQL isolation. Since Cassandra is a distributed system, it’s hard to just create a lock like in MySQL, internally, Cassandra used Paxo’s algorithm to achieve isolation. Here is a good article explaining this.

Ok, transaction doesn’t work, what’s next?

BatchStatement comes to the rescue. A batch in Cassandra allows you to batch statements and send them together in one request, if you are using ‘Logged batch’, it will guarantee that either all succeed or none succeed, BUT, it only guarantees eventual consistency, and does NOT supports isolation. Let me explain this:

Suppose you are sending the payment data into 4 tables with 4 statements in a logged batch, firstly, the Cassandra node that receives this request write batch_log into itself and two other nodes in case this node dies, so other nodes can continue its task. Then the node starts to do its work, apply the four statements in random order unless you force order given some conditions. Now the work started, if, for whatever reason, one of the statements could not be completed, or takes too long to process, the user can see results from other statements, no isolation supported. The worst case would be like this:

Since one statement is not successful yet, Cassandra will retain the batch_log in system, throw a WriteTimeOut Exception to you, and EVENTUALLY completes that query, nobody knows when that statement will be written, and before that happens, you get a partial success. Only when that finally finishes, Cassandra will remove the  batch_log. This might be undesirable for your use case, so think about it.

Now this is still not the end of story. Another batch type-Counter batch, which is the only type of batch statement that allows you to inject counter update statement, and only allows you to inject counter update, does not even guarantee that all statements will eventually be successful. I wasn’t able to find documentation about it, but if you look into Cassandra source code, and follow the class BatchStatement execute method, you will find that at some point, the batch will make a decision to either use mutateAtomically or mutate method based on batch type, and for counter, it’s not atomic.

Counter is an interesting topic in Cassandra, it had a lot of accuracy issue before version 2.1, claimed to be fixed in 2.1 with a new implementation, which sacrificed some performance to get accuracy, however, when combining counter with batch, you might see some surprisingly bad performance and possibly data loss in pre 3.0 version.

Back to the topic. Now if you are ok with non isolation support with batch statement for transaction purpose, be prepared for the next surprise, batch statement, even if it requires just one network call from the client to Cassandra, it actually has worse performance than sending statements one by one. And the reasoning makes perfect sense:

When the node receives a batch request which contains several different statements, if each statement belongs to a different partition(in our case, different table), these partitions, based on the partitioner you use, might belong to different nodes, also, since Cassandra has replicas to store duplicate data for fault tolerance, each one statement must be sent to more than one node given your consistency level setting. Now all these network traffic states have to be managed by the coordinator node, causing worse performance as you increase the diversity of partitions in the batch. As a result, it is recommended to include a small number of different partitions in your request, or put only same partition statements into one batch, which is not really the most common data modeling use case in Cassandra.

Given so many constraints, we are cornered to the use case that: to gain atomic behavior across multiple tables, we are better off think really hard about how many tables are absolutely necessary to insert into without using any counter queries, and no isolation support. These might or might not work for your specific use case. And sadly, this is actually not the end of the story yet.

People usually keep their data only to a specific time, and purge the data when it expires. Except for all those common headaches of tombstone, compaction, repair that come along with Cassandra, you still need to worry about DELETE consistency across tables. Just like when you insert data, you want them to execute atomically, when you delete data, you want it too. And then you will encounter exactly the same batch statement, isolation issue, and remember that DELETE and compactions can cause high consumption of CPU, memory, even affecting your operational data, like timeouts with high resource usage.

Having said all those, I think the point is that one should be aware of so many details when conducting data modeling in Cassandra, and really understand what your application needs and what Cassandra can do, a lot of these problems I mentioned only come up when doing performance test, doing your due diligence to details can save you a lot of effort and time.

I remember a saying like this: Relational database was a one size fits all solution, now it doesn’t fit any problem unless your application only need low diversity, small data set, low velocity data, for all other problems, you can find a specific new technology that works better. And my experience tells me that, yes, you can find a better one for your application, but make sure you understand its use case really well to know that it will solve your specific problem and the effort required. In the case of Cassandra data modeling, I start to realize how convenient ACID is in relational database.

And for those of you who are interested, there is another kind of database, called NewSQL, that preserves old SQL ACID characteristics, and is memory based, thus providing faster process time, and reasonably good volume(at terabytes level), single threaded(but very fast process for each record process), I will write a blog to talk about it. If you define your data set to be kind of big but not so big, it’s worthwhile to take a look at it.




Leave a Reply

Your email address will not be published. Required fields are marked *