Time series data system with Cassandra (1)

Time series data is very common in science, engineering, and finance, and with the popularity of the Internet of Things it's becoming even more crucial for business. For a database, time series data presents many challenges. At Ayla Networks, we switched from MySQL to Cassandra to address them. In this article I'll compare MySQL and Cassandra with respect to one of those challenges; in a later article I'll compare Cassandra with other NoSQL databases.

The problem is high-velocity database WRITE operations. Think about an energy, temperature, or fluid pressure sensor in an HVAC system: data might arrive at your database several times a minute for each sensor. With batch processing, you could even have hundreds of data writes in one request from a talkative machine that has multiple sensors attached. This characteristic makes it hard for a traditional relational database (like MySQL) to achieve stable performance. For MySQL to write one record, it requires a disk seek to find the correct place to insert. Cassandra inserts in a different way: it has an in-memory structure called a memtable and an append-only commitlog file. When a new record comes in, the data is written to the memtable, which is very fast since it's in memory; when the memtable is full, Cassandra flushes its contents to disk. For every request, the data is also appended to the commitlog file for recovery purposes. The whole insert process requires no disk seek. Disk seek time varies with the hardware (SSD vs. spinning disk) and operating system, but given the same hardware and OS, Cassandra is clearly much faster with no seeks at all.
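The write path above can be sketched in a few lines. This is my own toy illustration of the idea, not Cassandra's actual code; all names are made up, and real memtables are sorted structures flushed as SSTable files.

```python
import json

class SketchStore:
    """Toy model of the memtable + append-only commitlog write path."""

    def __init__(self, memtable_limit=3):
        self.memtable = {}       # in-memory buffer: fast, no disk seek
        self.commitlog = []      # stands in for an append-only file on disk
        self.sstables = []       # immutable segments produced by flushes
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        # 1. Append to the commitlog for crash recovery: sequential I/O only.
        self.commitlog.append(json.dumps({"k": key, "v": value}))
        # 2. Update the memtable in memory.
        self.memtable[key] = value
        # 3. When the memtable fills up, flush it to disk as one
        #    sequential write and start a fresh memtable.
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

store = SketchStore()
for i in range(4):
    store.write(f"sensor-{i}", i * 1.5)

print(len(store.sstables))   # the first three writes triggered one flush
print(store.memtable)        # the fourth write sits in the new memtable
```

The point of the sketch: every step is either an in-memory update or an append, so no step needs to seek to "the right place" on disk the way an in-place B-tree insert does.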

Another factor that comes with Cassandra is its distributed design. To achieve high availability, one insert is sent to n different nodes, where n is configurable. This is the same principle HDFS uses, typically with a replication factor of 3 to guarantee no data loss if some machines go down. Cassandra introduces a configuration concept called "tunable consistency": you can specify, for example, that as soon as 2 of the 3 replica nodes have the data in their memtable and commitlog, Cassandra returns success to the client. The third node may or may not have received the data yet, but Cassandra will eventually make the data consistent across all 3 nodes through mechanisms like gossip, hinted handoff, the repair service, and read repair. One needs to make a trade-off between insert speed and consistency level, and note that you choose a consistency level for read operations in Cassandra the same way. This is an application-specific decision: if you care about immediate consistency, you can choose a higher consistency level for writes, or you can balance the consistency levels of writes and reads. Say you configure both writes and reads at 2 out of 3 replicas; then it's guaranteed that you will always read the latest data, because any 2 read replicas must overlap with any 2 write replicas. As a result of replicated writes, Cassandra incurs extra network latency compared to MySQL, which is designed to work on one machine. MySQL is often deployed in a master-slave pattern where data is replicated to slaves as read-only machines; in that case the same read consistency problem arises. Sometimes people also shard MySQL, which requires a lot of manual effort, and since sharding strategies vary from case to case, it adds a lot of devops work.
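The "2 out of 3 for both writes and reads" guarantee is just an overlap argument: with replication factor N, a write acknowledged by W replicas and a read that consults R replicas must share at least one replica whenever W + R > N. A minimal sketch of that rule (my own illustration, not part of Cassandra's API):

```python
def quorum_overlap(n, w, r):
    """True if every read set of r replicas must intersect every
    write set of w replicas out of n total replicas."""
    return w + r > n

print(quorum_overlap(3, 2, 2))  # True  -- the balanced config in the text
print(quorum_overlap(3, 1, 1))  # False -- fastest, but reads may be stale
print(quorum_overlap(3, 3, 1))  # True  -- slow writes buy cheap reads
```

This is why writing and reading at quorum (2 of 3) always returns the latest value, while writing and reading at consistency level ONE trades that guarantee for speed.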
There has been a lot of effort to improve MySQL scalability, but if you look at each strategy, it tends to trade away relational database advantages (ACID) and become more "NoSQL," so the problem becomes a spectrum of trade-offs rather than a binary decision. The CAP theorem describes this trade-off well across many different databases.


Of course, the fast WRITE speed doesn't come at no price. Cassandra's read queries are much less flexible than MySQL's. It has two protocols, Thrift and CQL; Thrift is deprecated at this point. CQL syntax looks very similar to SQL, but that's probably the only similarity between CQL and SQL. For example, CQL only supports ORDER BY on clustering columns of the primary key, it doesn't support inequality search (!=) at all, it doesn't support > or < except on primary key columns, and its secondary indexes have a lot of performance issues (the latest Cassandra 3.4 introduced a new secondary index implementation, though, which I haven't tested yet). Due to these limitations, schema design for Cassandra presents a lot of interesting problems, especially for time series data. I'll continue with these topics in the upcoming series.
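The ORDER BY and range-query restrictions follow from how rows are stored: within a partition, rows are kept physically sorted by the clustering key, so only queries along that key are cheap. A toy sketch of one partition (my own illustration of the storage idea, not Cassandra internals):

```python
from bisect import insort, bisect_left, bisect_right

class Partition:
    """One partition (e.g. one sensor): rows stay sorted by timestamp,
    playing the role of a clustering key."""

    def __init__(self):
        self.rows = []  # list of (timestamp, value), kept sorted

    def insert(self, ts, value):
        insort(self.rows, (ts, value))

    def range(self, start, end):
        # Cheap: binary search over the sorted clustering key.
        lo = bisect_left(self.rows, (start,))
        hi = bisect_right(self.rows, (end, float("inf")))
        return self.rows[lo:hi]

p = Partition()
for ts, v in [(3, 0.2), (1, 0.5), (2, 0.9)]:
    p.insert(ts, v)

print(p.range(1, 2))  # [(1, 0.5), (2, 0.9)]
# Filtering on `value` instead would mean scanning every row -- the kind
# of query CQL refuses or pushes onto a secondary index.
```

Timestamp range scans fall out of the sort order for free; anything off the clustering key does not, which is exactly the shape of CQL's restrictions.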


PS: This is the first tech blog post I'm writing. I'm sure there are lots of improvements to make, but I'm being agile: I hope to get suggestions, iterate faster, and get better sooner. Any constructive comment is welcome.
