- I have 10 million thermostats keeping sending data frequently(once per 10 seconds) to the cloud, at a certain time, I want to know how many and which thermostats are on and have temperature higher than 80.
- How can I find all thermostats that have ever had temperature higher than 90 and less than 100?
- How can I find anomaly thermostats based all time series data for each device in real time or near real time that would allow me to take actions before it affects customers?
In the effort to research solutions for the above 3 problems, I come across ElasticSearch and would like to share some findings here. It’s always a good practice to keep data in one system for consistency consideration, so I’m trying to see if it’s possible use ElasticSearch as OLTP engine.
ElasticSearch is a powerful search engine built on top of the revert index engine Apache Lucene, it provides a lot of complex aggregation and text analysis, including a Time Series data analysis tool TimeLion, and I also found a nice statistical anomaly detection implementation in ElasticSearch based on Ebay’s anomaly detection algorithm for their servers’ time series data health check.
Near real time
ES has a lot of similarities with Cassandra in writing, data is written into memory first, then flushed to disk later on in a batch fashion. However, unlike Cassandra where you can GET the latest data from memory before it’s flushed into disk, in ES, the data is not visible until it can be opened by filesystem cache, as a unit of segment(A term from Lucene). Every 1 second, data is synced from memory to searchable cache. As a result, you should think about your application tolerance to real time query.
Since data in ES is distributed in multiple nodes and shards and segments, a fully distributed and exhaustive search will be applied to all segments that partially has the data, transfer to one node, do a global sort, then return the requested page, and when you request another page next time, the same thing will be done again. Both the time complexity of sort and space requirement of putting all data in memory can be really bad for OLTP use case, especially when you query time series data in pages. This can be alleviated by better index design, for example, have a weekly/daily/hourly index, however, be aware of the worst case that each hour data might be still too bad for sorting in memory, think carefully about your data intensity.
There’s no transaction support for ES, even though there are two APIs that provide bulk operation: update_by_query, bulk CRUD, they are not atomic and are there only for better performance with no roll back support.
Consistency level of ONE, QUORUM, ALL are provided, a write to primary shard must be successful first and then data will be sent to replica shards in parallel. Note that when you query, you can also specify to query the primary shard only. In this case, if you have consistency level of ONE, which optimizes the insertion performance and query the primary shard, you can get better performance but might create a hot spot on primary shard nodes.
A Write ahead log called translog is available for recovery, just like Cassandra’s commitLog.
Let’s assume that writing traffic is not a problem.
For problem 1, if I want to search the latest value of all thermostats, I have to first find the latest of each device. Since for each time series, a sort based on time is done on the fly, it can take a huge amount of time to just find the last value of all devices, so this is not an option. Then maybe I can keep updating the value of each device in a separate index. However, like in Cassandra, an UPDATE in ES is a soft update, meaning a new record is created with a newer version number, both version of this record will stay in disk, later on compaction process will compact them, which is not a desirable cost of resource. What’s worse, since each field is indexed in ES, one update actually means not only updating the record itself, but also updating all relevant indices, causing even more resource than Cassandra. Perhaps an in memory datastore would work better with update intensive traffic.
For problem 2, ES might be a good solution since it’s a perfect reverse index use case.
For problem 3, the timelion plug-in can provide lots of good info for analytics use case. Further algorithm research needs to be done.