Feed: MemSQL Blog.
Author: Floyd Smith.
Time series data is as old as databases themselves, yet it is also the hot new thing: interest in the topic has more than doubled during this decade. MemSQL handles time series data effectively, with robust support for rapid ingest, stellar analytical and transactional performance, easy manageability, SQL support for queries and analytics tools, and low total cost of ownership.
Time series data is at the core of the Internet of Things, with frequent sensor readings coming in from all sorts of devices. Once you decide what readings to track for the servers in your data center, for example, you need a place to store the data, and you need to be able to quickly analyze and respond to it. A malfunctioning server needs to trigger an alert; lower or higher throughput than usual needs to be tracked, and action taken to follow up. (See our blog post, What is Time Series Data.)
Similar concerns come up for other Internet of Things devices, such as cars and card readers; in financial services, for stock trading, portfolio management, and risk analysis; in online marketplaces for ads, business needs, and consumer goods; and, increasingly, in day to day business management. Companies need to rapidly ingest original transactional data, use it to instantly trigger responses, and make it available for both real-time and longer-term analytics. These are all use cases that MemSQL excels at.

MemSQL is a high-performing, scalable SQL database that suits many time series data use cases well.
How Time Series Data is Used
Time series data is used differently than a lot of other kinds of data:
- Appends and upserts over updates. A time series record is usually added to the database rather than modifying existing data: either an append of a new row, or an upsert (from "update or insert") when data arrives out of order and time order must be preserved. Upserts and appends can often be carried out on data in memory. Corrections to existing records, if allowed by the database being used, are regular transactions.
- In-memory for value alerting. Alerting on a single out-of-bounds reading is a common use case for time series data; “the well is about to blow!”, or a stock value is crashing. As time series data comes in, it needs to be inspected and reacted to very quickly to handle alerting for out-of-bounds values.
- In-memory for trend alerting. Also common, but harder to manage, is alerting for trends; if a value doubles in less than a minute, for instance, send an alert. If the time span for the alert is long, it can be a challenge – or impossible – to have all the data needed for the alert in memory, hurting both responsiveness and overall performance.
- In-memory (or some on-disk) for real-time analytics. Real-time analytics, such as a trend display on a management dashboard, may include both older data and the most recent data. Real-time analytics do not need the near-instantaneous responsiveness of alerting but benefit strongly from needed data being kept in memory.
- In-memory and on-disk for after-the-fact analytics. After-the-fact analytics may primarily cover stale data (outside the alerting window), but should include the latest data as well. So most analytics, machine learning, and AI applications need access to all the data, and good performance, though not the near-instantaneous responsiveness of alerting.
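The value and trend alerting patterns described above can be sketched in a few lines. This is an illustrative Python sketch, not MemSQL-specific code; the thresholds and window sizes are invented for the example, and real readings are assumed to be positive.

```python
from collections import deque

def check_value_alert(reading, low=0.0, high=100.0):
    """Alert on a single out-of-bounds reading (hypothetical bounds)."""
    return reading < low or reading > high

class TrendAlerter:
    """Alert when a reading is at least double the minimum seen in
    the last `window` readings kept in memory."""
    def __init__(self, window=60):
        self.recent = deque(maxlen=window)  # bounded in-memory history

    def observe(self, reading):
        # Compare against the window *before* adding the new reading.
        alert = bool(self.recent) and reading >= 2 * min(self.recent)
        self.recent.append(reading)
        return alert
```

The trend check illustrates why long alerting windows are hard: the deque must hold every reading in the window in memory, so a long time span over a high-frequency feed quickly becomes expensive.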
Earthquake fault creep in the San Francisco Bay Area. Anomalies greater than two standard deviations from the norm are marked in red.
In the past, the time series-specific requirements above – appends, upserts, and alerting – have gotten most of the attention in discussing and implementing time series databases. This makes sense, given that time series databases often interacted with expensive machinery or very high-value processes. It was worth sacrificing longer-term considerations, such as analytics, in order to get the performance needed for alerting and, perhaps, dashboarding and similar monitoring. The raw time series data, voluminous as it was, was often thrown away quite soon after it was generated.
However, as the potential value of all the use cases for time series data grows, with the increasing use of advanced analytics, machine learning, and AI, the other requirements – upserts and transactions, real-time analytics (for device management and more traditional business purposes), and after-the-fact analytics to feed new predictive models – are increasing in importance, to the extent that they rival the alerting-oriented requirements.
The earthquake data shown above is an example of both the short-term value, and the increasing long-term value, of time series data. In the past, you needed to know whether an earthquake was happening, and afterward, you might store the data around the time of a quake for further study.
Today, advanced analytics, machine learning, and AI may come together to help us predict earthquakes. There’s no earthquake data set so voluminous that someone won’t want to run machine learning algorithms against it – and possibly generate a breakthrough in prediction by doing so.
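The two-standard-deviation rule used in the fault-creep chart is straightforward to express. Here is a minimal Python sketch using only the standard library; the sample data and threshold parameter are invented for illustration.

```python
import statistics

def flag_anomalies(readings, n_sigma=2.0):
    """Return the indexes of readings more than n_sigma standard
    deviations from the mean of the series."""
    mean = statistics.fmean(readings)
    sigma = statistics.pstdev(readings)  # population standard deviation
    return [i for i, r in enumerate(readings)
            if abs(r - mean) > n_sigma * sigma]
```

In a database with SQL support, the same logic can be pushed down into a query using aggregate functions, rather than pulling the full series into application code.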
NoSQL for Time Series Data?
In the past, relational databases couldn’t do the job for time series data. The main issue was cost-effective scalability. Relational databases couldn’t be efficiently scaled horizontally, and so couldn’t scale to handle the speed, volume, and real-time alerting requirements of time series data. NoSQL databases, on the other hand, scale well on standard commodity hardware.
Most relational databases also couldn’t be taken and modified for time series-specific needs. NoSQL databases, being mostly open source, allowed for innovation on top of existing, tested database code.
This is somewhat ironic, because time series data is structured in a way that makes it a great candidate for analytics. However, NoSQL databases do not natively support SQL. This renders their query interfaces off-limits to ad hoc SQL queries, analytics programs, traditional BI tools, and much more.
Additionally, purpose built time series databases may not interface well to ingest technologies such as Hadoop/HDFS, Kafka, S3 and others. In these cases, time series databases may create data islands, hard to get data into and hard to get information out of.
There are also a few time series databases that are relational, but run as a single process; these carry the scalability restrictions of traditional relational databases.
The query languages that do exist for NoSQL databases are optimized for in-memory alerting use cases. They are not structured for, and perform very poorly for, ad hoc queries and other exploratory analytics performed by data analysts.
NoSQL databases lack a query optimizer for, well, SQL queries. This means that every application developer has to write code that contains their own view of how best to frame an inquiry. This makes the application code larger and slower, while making the overall system inflexible.
Fortunately, on the relational database side, the picture is much different today. NewSQL databases have taken away much of the reason for existence of NoSQL databases – as described in our very popular blog post on the topic. NewSQL databases, such as MemSQL, are fully scalable – and, unlike NoSQL databases, they support structured and unstructured data, transactions, and SQL for queries, analytics, and business intelligence (BI) tools.
NewSQL databases also, increasingly, support specialized functionality that allows them to work well with time series data out of the box, while offering scale-out, in-memory and disk-based processing, and the JSON support that has been added to MemSQL. (This blog post describes how to use MemSQL's JSON support.)
Some NewSQL databases can also be further optimized much more easily than was the case for traditional relational databases. We’ll explain some of these features, and potential optimizations, for the specific case of MemSQL below.
Using MemSQL as a Time Series Database
NewSQL databases are a relatively new category, and not all of them are equally mature. Also, some NewSQL databases, such as Google Spanner, are restricted to a specific cloud platform and are focused on making transactions consistent on data distributed around the world, which can raise concerns about analytics flexibility and lock-in.
MemSQL is among the most mature of the NewSQL offerings, and it has several features that particularly recommend it for use with time series data. These features include:
- Speed plus scalability. MemSQL is very fast on a single machine, and it can be scaled out for performance by seamlessly adding machines to a cluster.
- Memory-optimized with on-disk compression. MemSQL started as an in-memory database, then added columnstore support with on-disk compression, as needed for voluminous time series data.
- Ecosystem of ingest tools. MemSQL works well with a wide range of ingest tools, such as enterprise ETL, Kafka, Hadoop/HDFS, S3 and more.
- Pipelines. MemSQL Pipelines apply transformations to data on ingest. They streamline the traditional ETL process into a very fast set of operations.
- Pipelines to stored procedures. MemSQL can run streaming data against stored procedures on ingest, expanding the range of operations you can quickly run on incoming data.
- Scalable transaction support. In addition to transformations via Pipelines and stored procedures, MemSQL runs transactions very quickly on a distributed cluster, ensuring performance for extremely high data ingest workloads.
- SQL support. As mentioned, MemSQL has full ANSI SQL support, with all the advantages this implies for queries, analytics, and machine learning and AI programs which use SQL.
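Because MemSQL speaks the MySQL wire protocol, a time series upsert can be issued from any standard MySQL-compatible client using `INSERT ... ON DUPLICATE KEY UPDATE`. The sketch below builds such a statement in Python; the table and column names are hypothetical, and in production you would pass values as driver parameters rather than interpolating them into the string.

```python
def upsert_reading_sql(table, ts, sensor_id, value):
    """Build an INSERT ... ON DUPLICATE KEY UPDATE statement that
    appends a reading, or corrects it if one already exists for
    the same key. Assumes a hypothetical schema
    (ts DATETIME, sensor_id INT, value DOUBLE) with a unique key
    on (ts, sensor_id).

    For illustration only: real code should use a parameterized
    query through a MySQL-compatible driver, not string formatting.
    """
    return (
        f"INSERT INTO {table} (ts, sensor_id, value) "
        f"VALUES ('{ts}', {sensor_id}, {value}) "
        f"ON DUPLICATE KEY UPDATE value = {value}"
    )
```

The same statement handles both the common append case (a new timestamp) and the out-of-order or correction case (an existing key), which is why upsert-style ingest suits time series workloads.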

MemSQL Pipelines combine a great deal of functionality into a single operation.
Because MemSQL has specialized functionality for smoothly handling data across memory and disk (including large memory caches for columnstore tables, and compression that performs well for both reads and writes), the severe disparity between in-memory and on-disk performance is greatly reduced. This allows you to use time series data much more flexibly across a range of use cases, from alerting to reporting and predictive analytics.
The table below summarizes the pluses and minuses of dedicated time series databases, largely based on NoSQL, vs. MemSQL for time series data.
| | Dedicated time series database | MemSQL |
|---|---|---|
| Ingest connections (Hadoop, Kafka, S3, etc.) | Varies | Strong |
| In-memory support | Strong | Strong |
| Columnstore/compression | Varies | Strong |
| Ingest performance | Strong | Strong |
| Transformations | Strong | Strong |
| Transaction support | Varies | Strong |
| SQL support | Poor | Strong |
| SQL optimizations | Poor | Strong |
| Time series-specific optimizations | Strong | Pipelines transformations or stored procedures |
You can use MemSQL in combination with a wide range of existing tools. MemSQL can handle all needed processing, or you can deploy it in more complex architectures. For instance, you may use Hadoop/HDFS for ingest and to store raw or lightly processed input data in a data lake; Kafka for messaging; and Oracle or another traditional relational database for billing. You can even continue to use a specialized time series database for direct machine interface and alerting and specialized reporting. In all cases, MemSQL provides the operational data store and support for responsive analytics.
Conclusion
MemSQL is being used for an ever-wider range of use cases. Many time series workloads will operate better on MemSQL – faster, with greater functionality, and more cost-effectively. MemSQL customers tend to be most impressed by its high performance across a wide range of use cases and by the support of the MemSQL team.
Adopting a new database is a big decision. We recommend that you review our case studies and then try MemSQL. It's free to use for clusters with up to 128GB of RAM, and with MemSQL's flexible columnstore support for tables stored on disk, that is enough to support datasets of hundreds of gigabytes.
You can also find out more directly from MemSQL. Simply contact us for more information. To see MemSQL’s time series capabilities in action, please view our webinar on the topic.