
Building an Analytics App with MemSQL – Webinar


Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar, MemSQL Product Manager Jacky Liang took a live audience through the process of building an analytics app in just a few minutes. You can view the webinar and download the slides here.

In just a few minutes in the middle of the webinar, Jacky created a free MemSQL account; installed MemSQL on AWS; loaded in some data; and created a dashboard using Looker, a popular analytics tool.

Jacky also explained some MemSQL basics and gave a quick overview of MemSQL’s architecture. We encourage you to read about the details here, then view the webinar to see MemSQL in action.

MemSQL Now Free to Use

It’s already widely known that MemSQL is a highly capable database. What’s not so widely known is that MemSQL is now free to use, up to 128GB of RAM usage. Because MemSQL is memory-led, not memory-only, the actual database size supported may be as much as several terabytes. A memory footprint of 128GB is more than enough storage to get a project started, through the proof of concept phase, and into deployment and initial scaling.

One of the points that Jacky makes during the webinar is that there are many projects that can be handled within the 128GB RAM capacity in which MemSQL is free to use, even when the project goes into production. At the point where you need MemSQL’s legendary support, and/or more capacity, a simple license switch and a reasonable monthly payment get you both.

MemSQL Basics

We published a somewhat detailed description of how MemSQL works just a few weeks ago. To sum it up in a list:

  • MemSQL’s core is tiered storage, memory to disk. In this memory-led architecture, you keep rowstore tables in memory and columnstore tables on disk, with a relatively large memory cache. You get the best of both worlds, with high performance and a manageable memory footprint.
  • Query optimizer across tables. MemSQL distributes storage across multiple clusters, with aggregator nodes holding schema and leaf nodes storing data. The MemSQL query optimizer compiles and speeds queries across table types and across nodes.
  • Memory and CPU management. The MemSQL memory and CPU manager uses available memory effectively and optimizes performance.
  • Lock-free distributed ingest and analytics. Write to and simultaneously read from an ACID-compliant SQL database that can ingest data in streaming or batch modes.
  • MemSQL Studio and command-line tools. Using the Studio interface and our command-line tools, you can easily manage multiple clusters and use what you learn to optimize both data storage and queries.
  • Enterprise security. MemSQL operates to very strict security standards and supports role-based access control, encryption, auditing support, and more.
  • SQL API and SQL queries. The secret of MemSQL’s success is our thoroughgoing support for ANSI SQL (like traditional transactional and analytics databases), combined with our support for distributed storage (like NoSQL databases) – the best of both worlds.
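The tiered-storage point above maps directly onto DDL. Here is an illustrative sketch (table and column names are invented), assuming the usual MemSQL convention that tables are in-memory rowstores by default and become on-disk columnstores when declared with a clustered columnstore key:

```sql
-- Rowstore: kept in memory, good for fast point reads and writes.
CREATE TABLE recent_events (
    event_id BIGINT PRIMARY KEY,
    user_id BIGINT,
    event_time DATETIME,
    payload JSON
);

-- Columnstore: kept on disk with a large memory cache, good for
-- large analytical scans over historical data.
CREATE TABLE historical_events (
    event_id BIGINT,
    user_id BIGINT,
    event_time DATETIME,
    payload JSON,
    KEY (event_time) USING CLUSTERED COLUMNSTORE
);
```

The query optimizer can then join across both table types in a single query.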

MemSQL fits very well into existing, often complex data processing architectures. The database works with a variety of other tools for data ingest and analytics. MemSQL runs on-premises or in a public cloud, in containers and on virtual machines – anywhere you can run Linux.

MemSQL works with a wide range of tools, including all SQL BI tools

The Demo

The webinar features a straightforward demo of implementing an analytics app with MemSQL.

There are just four steps:

  1. Create a free MemSQL account. This is quick and easy.
  2. Install MemSQL. For the demo, this is done on AWS.
  3. Load data. For the demo, we used the 100GB TPC-H data set.
  4. Run a dashboard. For the demo, Looker is used to create the dashboard.
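Steps 2 through 4 come down to ordinary SQL once the cluster is up. As a hedged sketch of step 3 (the file path is a placeholder, and the table is a simplified slice of TPC-H's lineitem, which really has 16 columns):

```sql
CREATE DATABASE tpch;
USE tpch;

-- Simplified subset of the TPC-H lineitem table.
CREATE TABLE lineitem (
    l_orderkey BIGINT,
    l_partkey BIGINT,
    l_quantity DECIMAL(15,2),
    l_extendedprice DECIMAL(15,2),
    l_shipdate DATE
);

-- Bulk-load pipe-delimited TPC-H data from a local file.
LOAD DATA INFILE '/data/tpch/lineitem.tbl'
INTO TABLE lineitem
FIELDS TERMINATED BY '|';
```

Once loaded, Looker (or any SQL BI tool) can point at the `tpch` database directly.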

It’s worth taking a quick, well, look at our use of Looker to create the demo dashboard. Due to its thoroughgoing ANSI SQL support, MemSQL works well with the wide range of business intelligence (BI) tools that use SQL. A MemSQL database is also easily queried by anyone who knows standard SQL – a number that is certainly in the hundreds of thousands, and has been estimated to be more than a million people.

Looker is one of scores of BI tools you can easily use with MemSQL.

We used Looker for the demo because it gives users straightforward and direct access to the underlying database(s) being queried. This is especially advantageous for us at MemSQL, as our architecture is actually rather elegant, garnering a lot of praise – especially from users who come to MemSQL with a lot of previous database experience and some feel for what an “ideal”, distributed SQL database might look like. (You don’t have to take our word for it – find a few such people and ask them directly.)

Looker is easiest to use with standard SQL databases like MemSQL. And like many BI tools, it is also highly optimized specifically for use with MemSQL, and vice versa.

MemSQL’s Architecture

To get the most out of MemSQL, it’s very helpful to understand a few things about our distributed architecture. The webinar concludes with a brief description of it.

A MemSQL database is implemented as a series of aggregator nodes and leaf nodes. Each aggregator node has a copy of the database schema and a list of the data elements each leaf is responsible for. Several leaf nodes are managed by each aggregator. Applications and other clients connect to one of the aggregators, the Master Aggregator, which fields queries.

In MemSQL, aggregator nodes manage leaf nodes, which store data in partitions.
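Once a cluster is running, you can inspect this topology from any aggregator. A quick sketch, assuming the standard MemSQL management commands and a hypothetical database named `mydb`:

```sql
-- List aggregator nodes, including the Master Aggregator.
SHOW AGGREGATORS;

-- List leaf nodes and their state.
SHOW LEAVES;

-- Show how mydb's partitions are spread across the leaves.
SHOW PARTITIONS ON mydb;
```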

Q&A

This webinar concluded with a lively Q&A period. The questions and answers include:

  • Can MemSQL connect to other BI tools? Yes, very much so. MemSQL is built around support for the SQL standard and, specifically, the MySQL wire protocol. This allows you to easily use MemSQL with the many hundreds of data integration and BI tools that support MySQL connectivity. A few of the many such tools include Looker (as used in our demo), MicroStrategy, Tableau, and Zoomdata.
  • Does MemSQL run on Windows? MemSQL runs on Linux systems, so it runs on Windows in a virtual machine. (Microsoft has been working on improving their support for virtual machine environments on Windows in general, and Linux in particular.) See our System Requirements and Recommendations page for more.
  • What size restrictions do you have? You can use MemSQL for free, but without official MemSQL support, on a system with up to 128GB of RAM. For customers with an Enterprise subscription, there are no physical limitations on the amount of RAM allocated. Customers decide the balance of RAM to disk usage based on performance requirements. Many MemSQL customers use a few terabytes of RAM to support hundreds of terabytes on disk.
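On the first question: because MemSQL speaks the MySQL wire protocol, any stock MySQL client or connector can talk to it; no special driver is needed. For example (host, port, and user are placeholders; 3306 is the default MySQL port):

```shell
# Connect to a MemSQL aggregator with the standard mysql client.
mysql --host=127.0.0.1 --port=3306 --user=root --password
```

Any BI tool that can use a MySQL connection string works the same way.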

There were also comparative questions about other database solutions:

  • How is MemSQL different from Snowflake? Snowflake is a managed service offering for AWS and the Azure cloud platform only. MemSQL runs on any system that can run Linux, including in containers and virtual machines, on all cloud platforms, on-premises, and in mixed deployments. Both support high performance data warehouse workloads, but Snowflake is optimized for back office analytics and priced on usage. MemSQL is optimized for operational, “live” business requirements that need always-on, continuously updated data with fast ingest, as well as back office analytics. MemSQL also offers transactional support. The billing differences mean that Snowflake’s costs are variable, depending on the size of semi-regular data loading and the frequency and complexity of queries. MemSQL cost of ownership is known up front and fixed for a given deployment size.
  • How does MemSQL ingest performance compare to Azure SQL Server data warehouse? While any comparison can be subject to criticism, MemSQL utilizes a very fast, high-throughput Skip List index methodology. This is backed by a distributed, memory-optimized platform, enabling ingest rates in the millions of events per second. Azure SQL Server Data Warehouse, on the other hand, uses a traditional B-tree index and other older methodologies, backed by a single-node architecture that limits parallel throughput. This is likely to result in ingest speeds more on the order of tens of thousands of events per second.

Next Steps

If you watch the webinar video, and have questions or feedback, feel free to email the main presenter, Jacky Liang, directly. He would also like to hear how you’re using the free tier. You can reach him at jliang@memsql.com.

We urge you to go here to view the webinar video and download the slides. Then download MemSQL and give it a try yourself, either in a public cloud or using hardware you have on-premises.


monday.com Repost: How our engineers supercharged our BI capabilities: A trilogy


Feed: MemSQL Blog.
Author: Daniel Mittelman.

Everything at monday.com starts and ends with data. We’re driven to succeed, and to do that, we must measure everything with precision and accuracy.

That’s why we built our own business intelligence (BI) solution from scratch: our beloved BigBrain. It tracks every single KPI we have here at monday.com and in the spirit of transparency, we make these numbers readily accessible to everyone on our team.

Our BigBrain team—currently 4 engineers and counting (we’re hiring!)—just wrapped up a big project to supercharge BigBrain as we embark on a new chapter of accelerated growth in our company. In this three-part series, BigBrain developer Daniel Mittelman will share how and why we moved to MemSQL and why it’s awesome.

Here at monday.com, we’re growing at an incredibly fast rate. More and more people are visiting our website, signing up for new accounts, and trying out the platform. In addition, an increasing number of happy customers are upgrading their plans to let more people on their team work with monday.com.

As we mentioned before, we collect and analyze data about our users’ behavior (with privacy concerns taken into account, of course). This helps us shape our business plan for years to come and make better decisions about how to improve our product.

But…there’s a cost. An increasing number of users, combined with an increasing number of available features in the product, means we’re accumulating data at an incredibly fast rate. Check it out in the chart below:

monday.com’s total number of engagement records generated by users over time.

In January 2016, we had around 50 million records in our engagement data. One year later in January 2017, this number had grown to 265 million records. We kickstarted 2018 with over 1 billion records.

This is not considered a very large amount in Big Data terms, but simple math dictates that by 2019, our data pool will surpass 4 billion records, and by 2020, we'll face data storage of 15 billion records.

Armed with this knowledge, we realized that our existing Elasticsearch-based solution would reach its limits within a year. Don't get me wrong—Elasticsearch is a great database for token-based searching. It also performs very well when given basic data analysis tasks (mainly counting and bucketing).

However, Elasticsearch lacks the ability to cope with greater challenges, such as performing aggregations over multiple indices and joining them together, or making complex funnel computations. Unsurprisingly, these more complex analyses are the ones we want most here at monday.com, as they reveal a more complete picture about user engagement.

We set out on the search for a new analytical database in mid-2017. We looked at (almost) every possible database that’s out there: from the cloud-based, ready-to-use data warehouses like Amazon’s Redshift, Google BigQuery and Microsoft’s Cosmos DB, to the expanding new field of GPU-based databases like MapD and Kinetica. We even checked out Amazon’s Athena backed by a JSON-formatted S3 bucket.

Each database had its advantages and downsides. Overwhelmed by all the options out there, we mapped our exact requirements for this new solution. Our checklist was, in descending order of importance:

  • Speed! Requests shouldn’t exceed 6 seconds in most cases, and 30 seconds at most
  • Versatile: Capable of joining different data types during a single computation
  • Distributed: Highly available with replication and fault tolerance
  • Secure: It must meet our needs for encryption, both at rest and in transit
  • Integrates with Ruby on Rails, the framework we use for backend development
  • Easily scalable: We’re growing and we don’t expect to stop anytime soon
  • SQL: Elasticsearch's JSON DSL is a nightmare
  • Reasonably priced: We paid $3,600 a month for our 9-node Elasticsearch cluster and did not want to exceed this price point

Speed and versatility were our most important requirements, and are also the places where most of the cloud-based solutions fail. We tested some of our more complex queries, both in Redshift and BigQuery, and got disappointing execution times, ranging from 50 seconds to several minutes. Cosmos DB actually showed impressive benchmark results, but doesn’t have official Ruby support. Same for MapD and Kinetica.

Our last contender was a memory-based database solution called MemSQL. At first sight, its Enterprise edition seemed to match all of our requirements, including versatility and support for SQL. But whether it would live up to our speed requirements remained to be seen.

Their website boasts the database's superior speed over many traditional database engines, but this did not sway us from putting the database to the test. The results, as you might predict from reading this blog post up to this point, were stellar, and we will dive more into the technical aspects in Part 2 (coming soon!).

Here are some real-life examples of how we use MemSQL within BigBrain for non-trivial analysis (everything beyond charts showing aggregations over time):

1. End-to-end marketing performance: Many marketing tools, like AdWords Account Manager or the Facebook Ad Manager, give detailed information about how campaigns perform by displaying the campaign’s cost and how many impressions/clicks it generated.

We took it a few steps further to include analysis of marketing sources, campaigns, and even single banners. We display the number of visitors each banner and associated campaign brought to our website, how many people signed up, and how many people eventually became paying customers.

This allows us to calculate KPIs for each segment, such as its customer acquisition cost (CAC) and the return on investment (ROI). This kind of computation requires us to join six tables, sometimes over tens of millions of records—something that our octa-core PostgreSQL database couldn’t accomplish in under a minute. MemSQL, however, completes the request in a matter of seconds.
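The shape of such a query, sketched here with an invented schema (the real version joins six tables and carries more dimensions):

```sql
-- Per-campaign spend, conversions, and customer acquisition cost (CAC).
SELECT
    c.campaign_name,
    SUM(sp.cost) AS total_spend,
    COUNT(DISTINCT v.visitor_id) AS visitors,
    COUNT(DISTINCT u.user_id) AS signups,
    COUNT(DISTINCT p.user_id) AS paying_customers,
    SUM(sp.cost) / NULLIF(COUNT(DISTINCT p.user_id), 0) AS cac
FROM campaigns c
JOIN ad_spend sp ON sp.campaign_id = c.campaign_id
LEFT JOIN visits v ON v.campaign_id = c.campaign_id
LEFT JOIN users u ON u.first_visit_id = v.visit_id
LEFT JOIN payments p ON p.user_id = u.user_id
GROUP BY c.campaign_name;
```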

2. Funnels: Our funnels tool allows us to construct an event-based process (out of our 1,800 different engagement types) and see how many visitors, users, and accounts pass through the process within a given time frame.

More importantly, it lets us identify weak points where the user experience is not optimal, and in turn fine-tune it in the platform. You can think of it as our UX debugging tool. We can also segment our funnels when running A/B tests, so that we can analyze the behavior of users that are served each of the test’s variants. We build funnels using an iterative process of “join and reduce,” such that each step in the funnel requires a different OUTER JOIN to the engagement storage, with special constraints in place. Here’s an example for our signup process funnel:

Our signup process statistics in H2 2017, segmented by browser, starting with 3.6 million unique visitors. Interestingly, people who visit our website on Chrome have a statistically significantly higher chance of actually signing up for our trial than Firefox users. This funnel was computed in 18 seconds.
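The "join and reduce" iteration can be sketched as one OUTER JOIN per funnel step. This is a simplified two-step example with invented table and column names; the real implementation adds ordering and windowing constraints:

```sql
-- Step 1: visitors who viewed the signup page in H2 2017.
-- Step 2: of those, visitors who later completed signup.
SELECT
    COUNT(DISTINCT s1.visitor_id) AS step1_reached,
    COUNT(DISTINCT s2.visitor_id) AS step2_reached
FROM engagements s1
LEFT JOIN engagements s2
       ON s2.visitor_id = s1.visitor_id
      AND s2.event_type = 'signup_completed'
      AND s2.event_time >= s1.event_time
WHERE s1.event_type = 'signup_page_view'
  AND s1.event_time BETWEEN '2017-07-01' AND '2017-12-31';
```

Each additional funnel step adds another LEFT JOIN back to the engagement storage, constrained to the previous step's matches.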

In my next post, we will dive deeper into how we store our data in MemSQL, how different parts of MemSQL work, and how we adjusted them to run our queries as quickly as possible. We will also show how we built our funnels system so that it can compute any funnel, including non-trivial ones where events don’t necessarily occur in the order specified in the funnel. (Spoiler alert—it’s not as easy as shown in most of the tutorials you will find online). In the third part, we will review the process of launching a MemSQL cluster from scratch and how to set it up as a critical piece of infrastructure in production.

The original posting of this blog can be found here.

Partner Repost: Using Streaming Analytics to Identify and Visualise Fraudulent ATM Transactions in Real-Time


Feed: MemSQL Blog.
Author: Saeed Barghi.

Storing and analysing larger and larger amounts of data is no longer what drives the decision-making process at successful companies. In today’s world, the key is how fast decision makers are provided with the right information, enabling them to make the right decision before it’s too late.

Streaming analytics help you identify perishable insights, which are the subject of their very own Forrester Report. Perishable insights are results “which must be acted on within a given timeframe or else the chance to influence business outcomes will pass,” according to the IBM Cloud Blog.

The trouble is, many companies see implementing a streaming analytics platform as a very challenging and costly project. This is because, when talking to internal experts or outside consultants, they are usually presented with a very complex architecture – one that needs a lot of effort and money to set up and get going.

That is not the reality of streaming analytics. When the proper set of technologies and tools is used, such a platform can be set up very quickly and effectively.

In facing challenges of this type, I've identified a platform that works perfectly. The high-level architecture of this platform is shown below.

Reference architecture for streaming analytics

Looks beautiful and simple, doesn’t it? We’re going to use this architecture and build a solution that identifies fraudulent ATM transactions in real time. (This platform can solve many other problems as well.) I’ll describe the solution in three steps.

Step 1: Build and Analyze Streams of Data

Everyone in the field of Big Data has heard of Kafka. It’s the backbone of most streaming applications.

Kafka was invented because, for most streaming data, the speed at which the data is generated at the source is a lot faster than the speed at which it can be consumed at the destination. Therefore, a Kafka construct called a topic is required to act as a buffer. It receives data from a source, called a publisher, and holds onto it until it’s read by the destination, called a consumer.

Like other streaming solutions, we’re going to use Kafka here as well. But not just any edition of it; we’ll be using Confluent Kafka. Confluent has not only built great add-ons around Kafka and made it a lot easier to deploy and manage; they are also pioneers in stream processing. Furthermore, Confluent Kafka scales very efficiently and is able to work very fast, with very big data, without any hiccups.

The most interesting component in the Confluent platform for me is KSQL. It provides a SQL-like querying capability on top of streams of data. And that sounds like heaven for someone like me, who has spent most of his professional life writing and optimizing SQL queries.

For the first part of this solution, I followed this blog post and created the streams and processed them with KSQL. The steps I took were:

  1. Download and set up Confluent platform: https://www.confluent.io/download-cp/
  2. Start your Confluent platform:
    ./bin/confluent start

  3. Download and set up gess: https://github.com/rmoff/gess

  4. Create the topic, using Confluent’s “kafka-topics” command:
    ./bin/kafka-topics --topic atm_txns --zookeeper localhost:2181 --create --partitions 1 --replication-factor 1

  5. Follow the blog post and create and process the stream.

Just as a reference, your final Stream should look like this:

CREATE STREAM ATM_POSSIBLE_FRAUD
WITH (PARTITIONS=1) AS
SELECT TIMESTAMPTOSTRING(T1.ROWTIME, 'yyyy-MM-dd HH:mm:ss') AS T1_TIMESTAMP, TIMESTAMPTOSTRING(T2.ROWTIME, 'yyyy-MM-dd HH:mm:ss') AS T2_TIMESTAMP,
GEO_DISTANCE(T1.location->lat, T1.location->lon, T2.location->lat, T2.location->lon, 'KM') AS DISTANCE_BETWEEN_TXN_KM,
(T2.ROWTIME - T1.ROWTIME) AS MILLISECONDS_DIFFERENCE,
(CAST(T2.ROWTIME AS DOUBLE) - CAST(T1.ROWTIME AS DOUBLE)) / 1000 / 60 AS MINUTES_DIFFERENCE,
GEO_DISTANCE(T1.location->lat, T1.location->lon, T2.location->lat, T2.location->lon, 'KM') / ((CAST(T2.ROWTIME AS DOUBLE) - CAST(T1.ROWTIME AS DOUBLE)) / 1000 / 60 / 60) AS KMH_REQUIRED,
T1.ACCOUNT_ID AS ACCOUNT_ID,
T1.TRANSACTION_ID, T2.TRANSACTION_ID,
T1.AMOUNT, T2.AMOUNT,
T1.ATM, T2.ATM,
T1.location->lat AS T1_LAT,
T1.location->lon AS T1_LON,
T2.location->lat AS T2_LAT,
T2.location->lon AS T2_LON
FROM ATM_TXNS T1
INNER JOIN ATM_TXNS_02 T2
WITHIN (0 MINUTES, 10 MINUTES)
ON T1.ACCOUNT_ID = T2.ACCOUNT_ID
WHERE T1.TRANSACTION_ID != T2.TRANSACTION_ID
AND (T1.location->lat != T2.location->lat OR
T1.location->lon != T2.location->lon)
AND T2.ROWTIME != T1.ROWTIME;

Step 2: Ingest Streams of Data into Data Store in Real-Time

The next layer we need to implement in this architecture is data ingestion and storage. There are different tools in the market that are able to ingest data in close to real time, such as NiFi, StreamSets, and maybe Talend. And then for storage, depending on your preference for on-premises or cloud, HDFS or object storage are the options.

The number one factor that I always consider when suggesting a solution to my clients is integrity and homogeneity in all layers of the purpose-built solutions. And when it comes to streaming, where performance is the number one factor, I can’t think of a solution more reliable and faster than MemSQL. If you’re curious to know how fast the database is, watch this video. And be prepared for your mind to be blown!

Another reason I love MemSQL for streaming use cases is how well it integrates with Kafka through MemSQL Pipelines. Take the following steps to set up MemSQL and integrate it with your Confluent platform:

  1. Install MemSQL on the environment of your choice:
    https://docs.memsql.com/guides/latest/install-memsql/
  2. Fire up the MemSQL command-line interface and create a new database:

create database streaming_demo_database;

  3. Create a new table for the records you receive from Confluent:
CREATE TABLE atm_possible_fraud (INGESTION_TIME TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
, MESSAGE_FROM_KAFKA JSON
, T1_TIMESTAMP AS MESSAGE_FROM_KAFKA::$T1_TIMESTAMP PERSISTED DATETIME
, T2_TIMESTAMP AS MESSAGE_FROM_KAFKA::$T2_TIMESTAMP PERSISTED DATETIME
, DISTANCE_BETWEEN_TXN_KM AS MESSAGE_FROM_KAFKA::$DISTANCE_BETWEEN_TXN_KM PERSISTED DOUBLE
,MILLISECONDS_DIFFERENCE AS MESSAGE_FROM_KAFKA::$MILLISECONDS_DIFFERENCE PERSISTED DOUBLE
,MINUTES_DIFFERENCE AS MESSAGE_FROM_KAFKA::$MINUTES_DIFFERENCE PERSISTED DOUBLE
,KMH_REQUIRED AS MESSAGE_FROM_KAFKA::$KMH_REQUIRED PERSISTED DOUBLE
,ACCOUNT_ID AS MESSAGE_FROM_KAFKA::$ACCOUNT_ID PERSISTED CHAR(100)
,T1_TRANSACTION_ID AS MESSAGE_FROM_KAFKA::$T1_TRANSACTION_ID PERSISTED CHAR(100)
,T2_TRANSACTION_ID AS MESSAGE_FROM_KAFKA::$T2_TRANSACTION_ID PERSISTED CHAR(100)
,T1_AMOUNT AS MESSAGE_FROM_KAFKA::$T1_AMOUNT PERSISTED DOUBLE
,T2_AMOUNT AS MESSAGE_FROM_KAFKA::$T2_AMOUNT PERSISTED DOUBLE
,T1_ATM AS MESSAGE_FROM_KAFKA::$T1_ATM PERSISTED CHAR(100)
,T2_ATM AS MESSAGE_FROM_KAFKA::$T2_ATM PERSISTED CHAR(100)
,T1_LAT AS MESSAGE_FROM_KAFKA::$T1_LAT PERSISTED DOUBLE
,T1_LON AS MESSAGE_FROM_KAFKA::$T1_LON PERSISTED DOUBLE
,T2_LAT AS MESSAGE_FROM_KAFKA::$T2_LAT PERSISTED DOUBLE
,T2_LON AS MESSAGE_FROM_KAFKA::$T2_LON PERSISTED DOUBLE
);

Note: A few points about this create table script:

  • The first column, INGESTION_TIME, is populated automatically when every record is ingested
  • The second column, MESSAGE_FROM_KAFKA, holds the records received from Confluent topics in JSON format
  • The rest are persisted computed columns, which are computed and populated as each JSON record lands in the table. This is another cool feature of MemSQL; it makes it incredibly easy to parse JSON data without the need for any additional scripts or coding.
  4. Create an index on the INGESTION_TIME column. This is needed when we build our real-time visualisation with Zoomdata:

CREATE INDEX inserttime_index ON atm_possible_fraud (Ingestion_Time);

  5. Create a pipeline that reads data from Confluent topics and inserts it into MemSQL in real time:

CREATE PIPELINE atm_possible_fraud AS LOAD DATA KAFKA '[IP_ADDRESS]:9092/ATM_POSSIBLE_FRAUD' INTO TABLE atm_possible_fraud (MESSAGE_FROM_KAFKA);

  6. Test and start the pipeline:

TEST PIPELINE atm_possible_fraud;
START PIPELINE atm_possible_fraud;

And we’re done. Now you can start ingesting data into Confluent Kafka topics and they will be replicated in your MemSQL table in real time.

On your Confluent server, make sure your Confluent platform is up by checking its status:

./bin/confluent status

Then go to the folder where you downloaded gess and run the following commands:

./gess.sh start

nc -v -u -l 6900 | [CONFLUENT_DIRECTORY]/bin/kafka-console-producer --broker-list localhost:9092 --topic atm_txns

…and start querying your MemSQL table. You should see records ingested in there as they are generated at the source and delivered into the appropriate Kafka topic in Confluent.
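A few quick sanity checks at this point, assuming the table and pipeline created above:

```sql
-- Confirm the pipeline exists and is running.
SHOW PIPELINES;

-- Watch rows arrive; this count should climb as gess publishes events.
SELECT COUNT(*) FROM atm_possible_fraud;

-- Peek at the latest suspicious transaction pairs.
SELECT INGESTION_TIME, ACCOUNT_ID, KMH_REQUIRED
FROM atm_possible_fraud
ORDER BY INGESTION_TIME DESC
LIMIT 10;
```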

Step 3: Visualise Your Data in Real Time

There are so many visualisation tools out there in the market, some of which claim they can visualise your data in real-time. But none of them can truly achieve that. Why? Because they all need to take custody of data to be able to visualise it: each and every record needs to be moved to the server where the visualisation engine runs, processed there, and then visualised to users in the form of graphs and dashboards.

There is one visualisation tool that is different from every other tool in the market, in that it pushes down the query processing to the source where data is stored. That tool is Zoomdata.

Zoomdata doesn’t move data. It doesn’t take custody of the data. How does it work? I’m glad you asked.

Zoomdata’s smart query engine takes the query that is meant to grab the data for the dashboard, applies its knowledge of the underlying data store and metadata about the tables involved in the visualisation, and breaks the query into many smaller queries called micro-queries.

The idea is that, instead of sending a big query down to the source and waiting for the result to come back, it makes more sense to send those smaller queries down, then progressively sharpen the visualisation as the results are returned from each micro-query.
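As a conceptual illustration only (not Zoomdata's actual generated SQL), a dashboard-wide aggregate can be decomposed into range-bounded micro-queries over the time attribute:

```sql
-- The full question the chart is answering:
SELECT DATE(ingestion_time) AS day, COUNT(*) AS txns
FROM atm_possible_fraud
GROUP BY day;

-- One micro-query: the same aggregate over a narrow slice, which
-- returns quickly; subsequent slices progressively fill in the chart.
SELECT DATE(ingestion_time) AS day, COUNT(*) AS txns
FROM atm_possible_fraud
WHERE ingestion_time >= '2018-06-01'
  AND ingestion_time <  '2018-06-08'
GROUP BY day;
```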

Another very important point about Zoomdata is that it is truly self-service, unlike some other tools, which require a certification to master.

To create a dashboard pointing to the data on our MemSQL database, follow these steps.

  1. Open the Zoomdata UI and log in as admin.
  2. From the top menu, click on Settings, then select Sources. From the installed connectors, click on the MemSQL logo and start configuring the data source.

Choose the fast MemSQL database from Sources in Zoomdata

  3. Give your data source a name and click Next.
  4. This is the page where you give the details and credentials to connect to your MemSQL cluster. Fill it in and then click Next.

  5. You’ll see the list of tables that exist in the data source you created, in our case the MemSQL database. Select “atm_possible_fraud.”

  6. The list of columns and a sample of the top 10 rows of the table will be loaded. Remember to toggle CACHING off, since we are building a real-time dashboard. Then click Next.

The first 10 rows of the MemSQL database load

  7. The next page has the list of columns and their data types. Zoomdata infers the data types for each column from the data source. Change them if they don’t match the type of analysis you want to do, or if they are not correct. Read more about this tab here. After you review the data types for each column, click Next.

Adjust the column types for the MemSQL data as needed

  8. The next tab is where you define how often you would like to refresh the cached data. We don’t need to make any changes here since we’re not caching any data. Click Save & Next.
  9. Now we’re on the Charts tab. This tab is used to define the default behaviour of each visualisation type. This means you can define which columns to use to render each visualisation when the dashboard loads. (Zoomdata is 100% self-service, meaning that users can change the dashboards at runtime without the need for IT or highly technical resources.)

Another very important and interesting feature of your visualisation dashboard will be defined in this tab as well: the Time Bar. Zoomdata follows a streaming architecture, which enables it to connect in “Live Mode” to any data source capable of real-time data. The technology behind this feature is called Data DVR.

In Live Mode, Zoomdata visualisations immediately reflect changes in the source data as data streams in. The live stream can be paused, rewound and replayed, essentially treating it as a digital video recorder (DVR) for your data. Follow these steps to set it up:

  • Select Ingestion Time from the Time Attribute drop-down. (That’s the column we created an index on in our MemSQL table, remember?) This is the column driving our time bar, and it makes sense to choose it: our real-time dashboard needs to be able to get the values from our real-time data source, based on this column, very fast.
  • Click the Enable Live Mode checkbox.
  • Select 1 for Refresh Rate and 1 second for Delay By. The idea is that Zoomdata will look for records added to the data source with 1 second delay. (In a future version, Zoomdata will be able to support one millisecond delays.)
  • Click Finish.

Stream the data from MemSQL

  10. You will be redirected to the source page. You’ll see on the top that your new data source has been created. Click on New Chart and Dashboard and start building live visualisations.

Start visualizing the data from the MemSQL database

  11. Finish the visualisations as needed.

Here is a short video showing the incredible performance this streaming solution can provide. On the left side of the video I kick off my script that publishes records to Confluent Kafka topics, and it takes less than three seconds from that point until the visualisation is updated.

Our solution was so simple and easy to implement that I’ve summarized it here in a single blog post. Yet, at the same time, it’s capable of providing incredible performance running on just three servers – one Confluent, one MemSQL, and one Zoomdata.

This post is from Saeed Barghi’s blog, The Business Intelligence Palace. The original version of this post appears on his blog.

Want to know more about any of these products? Reach out to the author at his blog or contact MemSQL.

Augmenting Hadoop with MemSQL for Faster Analytics at a Fortune 50 Company


Feed: MemSQL Blog.
Author: Floyd Smith.

A Fortune 50 company tried traditional SQL databases, then Hadoop, to manage financial updates company-wide. They finally found a successful solution by augmenting Hadoop with MemSQL.

Many companies today are straining to meet the needs for faster ad hoc analytics, dashboards, reporting, and support for modern applications that depend on fast access to the latest data – in particular, financial data.

Facing competitive and customer pressures, IT organizations must respond to a rapid increase in the volume of incoming data and the needs of the business to use this data for rapid decision-making, improving operations, and providing better customer experiences. These problems are especially severe in larger traditional companies that need to innovate.

Traditional relational databases, lacking scalability, have run out of gas for solving these problems. Many companies have turned to Hadoop, a NoSQL solution, to rapidly ingest data and move it to a data lake. However, the lack of SQL support and performance issues for queries – which affect ad hoc queries, reporting, dashboards, and apps – have made this approach unproductive for many.

A diversified Fortune 50 company approached MemSQL to help solve exactly this problem. The company first attempted to meet its analytics needs with an existing, legacy relational database. They then introduced a NoSQL solution based on Hadoop. The company found each of these solutions to be inadequate.

Ultimately, this company augmented Hadoop with MemSQL. They now use MemSQL to support an operational data store for company-wide data analytics, including access by executives. Hadoop still powers a data lake, which is used largely to support data scientists.

Detailed below is our customer’s journey and the benefits they saw from moving their core analytics processing to MemSQL. Included is a reference architecture for a combined implementation of MemSQL alongside Hadoop.

Displacing Hadoop for Analytics and Business Intelligence

The MemSQL customer in this case study needed fast analytics at the divisional level, as well as the ability to continually roll up data across the company for quarter-end reporting and communications to investors.

Originally, the company used a legacy relational database to drive analytics. Business analysts used business intelligence (BI) tools such as Business Objects and Tableau to derive insights. Analysts, knowledgeable in the needs of the business, were able to deliver real value to the company.

However, data volumes continued to increase, while more users wanted analytics access. Analytics users, from business analysts to top management, demanded greater responsiveness. And the analytics system needed to be ready to handle demands from machine learning and artificial intelligence software on both operational and legacy data.

The system that drove analytics was unable to keep up with the transaction speeds, query responsiveness, and concurrency levels required by the business. As a first attempt to solve the problem, the company augmented the legacy system with a Hadoop solution, with the final goal being to fully replace the legacy system with a Hadoop/Big Data solution.

The first try at change was a Hadoop/HDFS data lake
The company’s first try at a solution, a data lake supported by Hadoop
and HDFS, has become a common analytics architecture

Unfortunately, the move to Hadoop was unable to deliver for this customer – as it has for many others. Many data lake projects fail to make it into production, and others experience serious problems after implementation.

Hadoop is designed for unstructured data, and this causes many serious problems:

  • Batch processing. With Hadoop, input data is processed in batches, which is time-consuming. The data available to analytics users is not current.
  • Large appetite for hardware. Both batch processing and analytics responses take a lot of processing power, so companies spend a great deal to power Hadoop solutions.
  • Governance and security challenges. Because Hadoop data is less organized than data in a relational database, it’s hard to identify and protect the most sensitive items.
  • Complex system. Big Data solutions involve an entire ecosystem with many specialized components, which make for a difficult system to manage.
  • Complex queries. Queries have to be written against complex data, with careful study of each data source needed.
  • Slow queries. Queries run slowly, and users must either spend time up front optimizing them, or spend extra time waiting for results.
  • No SQL. NoSQL eliminates the ease, speed, and business intelligence (BI) tool support found with SQL, preventing companies from getting maximum value from their data.

Queries, Sorted: Augmenting Hadoop with MemSQL

The company had two competing sets of needs. Data scientists wanted access to raw data, and were willing to do the work to transform, catalog, and analyze data as needed. Business analytics users (including the executive team), BI tools, and existing applications and dashboards needed structured data and SQL support.

By augmenting Hadoop with MemSQL, the company can meet all these needs at once. MemSQL is a massively-scalable, high-performance relational database. MemSQL is a leader in a new category of databases, commonly described as NewSQL.

As a leading NewSQL database, MemSQL offers a cloud-native, distributed architecture (details here). It combines the ability to handle streaming ingest, transactions, and analytics, including SQL support.

MemSQL is a fast, scalable SQL database
MemSQL adds scalability to transactional databases, with
SQL support, offering a wide range of benefits

With MemSQL holding the operational data store, the company is once again able to unleash its business analysts on its most important data. The same operational data store is available for executive queries and financial reporting.

MemSQL drives analytics while HDFS is reserved for data science use
The client now runs an operational data store on MemSQL
for most needs, plus a data lake on HDFS for data science

Hadoop and HDFS are still used, but as a data archive, to support audits, and as a data lake for use by data scientists. (Who, of course, use the operational data store as well.) Work in the data lake yields valuable results, while the operational data store supports the ease of access, efficiency, and rapid response needed by the rest of the business.

Choosing an Ingest Strategy

With MemSQL used as an operational data store for real-time analytics, and Hadoop as a data lake, the customer had three choices as to how to ingest data:

  • MemSQL-first ingest. The customer could replicate the data, using change data capture (CDC) from some sources and MemSQL Pipelines from others. Some or all of the data could then be extracted and loaded to Hadoop/Hive.
  • Simultaneous ingest. The customer could replicate the data to both targets in parallel. In either case, some data could be discarded on ingest, but the default choice for any incoming data would be to go to both databases at once.
  • Hadoop-first ingest. The customer could bring all data into Hadoop in a batch operation, then transfer all of the data, or a subset, to MemSQL as needed.
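The MemSQL-first option above can be sketched with a MemSQL Pipeline subscribed to a Kafka topic; the broker address, topic, and table names here are hypothetical, not taken from the actual deployment:

```sql
-- Continuously ingest CSV records from a Kafka topic into the
-- operational data store (illustrative names throughout).
CREATE PIPELINE finance_ingest AS
LOAD DATA KAFKA 'kafka01.example.com:9092/transactions'
INTO TABLE transactions_ods
FIELDS TERMINATED BY ',';

START PIPELINE finance_ingest;
```

From the operational data store, the same data (or a subset) can then be exported to Hadoop/Hive on whatever batch schedule the data lake requires.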

Each approach has its own advantages and disadvantages.

Approach     | Advantages                                                                  | Disadvantages                                                                  | Use Case
MemSQL-first | ETL, CDC, and MemSQL Pipelines stream data; analytics run against live data |                                                                                | Fast analytics critical; MemSQL needs most or all of the data
Simultaneous |                                                                             | Added complexity, as both streams need to be managed as primary                | Fast analytics critical; both MemSQL and Hadoop need most or all of the data
Hadoop-first |                                                                             | Major delay for MemSQL ingest & analytics, happening after Hadoop batch ingest | Many users try this, only to realize they need an operational data store w/SQL

The MemSQL-first approach gets you analytics against live data,
while other approaches add complexity and delays

The customer chose direct ingest to MemSQL, which serves as the operational data store; data is separately batched to Hadoop. Their whole reason for using MemSQL was to speed up analytics, so putting Hadoop first – leading to delays in data getting into MemSQL – was a poor option. Because both MemSQL and Hadoop would be getting most of the data, and to simplify troubleshooting if a problem occurred, the customer decided to keep MemSQL and Hadoop ingest separate.

With MemSQL, the customer has the same advantages that customers have long sought from a traditional data warehouse: direct access to data for faster analytics performance. A traditional data warehouse requires pre-aggregation of data; with MemSQL, pre-aggregation is needed less often, or not at all. And MemSQL adds advantages of its own: better performance than traditional data warehousing products and the ability to query live data.

MemSQL compares even more positively to the customer’s former Hadoop-based data warehouse-type implementation. Hadoop is not designed to support ad hoc queries and dashboards. It lacks SQL support, meaning that it can’t be used easily with most business intelligence (BI) tools, and it can’t be queried conveniently by business analysts and others who want to use SQL for ad hoc queries, not write a computer program.

                   | Hadoop          | MemSQL
Performance        | Poor            | Excellent
Concurrency        | Poor            | Excellent
Queries            | Custom          | SQL ad hoc queries & BI tools
Accessible by      | Data scientists | Business analysts, executives, data scientists
Governance support | Poor            | Excellent

Hadoop lacks key analytics features supported
by MemSQL, especially SQL support

Analytics with No Limits

The company’s use of data is now on much firmer footing. Thousands of BI tool users and more than five hundred direct users, spread across roughly a dozen divisions, use MemSQL, BI tools, and SQL queries for all their analytics needs. Many terabytes of data are accessible to users across the globe. Analysts and other users work with up-to-date data, at high concurrency, with fast query performance.

Analytics work has dramatically improved:

  • From batch mode to near-real-time: Analytics now run against near-real-time data. Information that used to take days to arrive now takes minutes.
  • From spreadsheets to BI: This Fortune 50 company (like many) was doing much of its core analytics work by exporting data from Hadoop into Excel spreadsheets. With MemSQL, the full range of BI tools is now available as needed.
  • Instant access for live queries: Using SQL, any analyst can make any query, anytime, and get results quickly. In combination with BI tools, this supports the kind of interactive probing that leads to real, timely, actionable insights.
  • Executive accessibility: There’s now a single source of truth for investor reporting. The CEO and senior executives have direct access to insight into operations.

                        | Hadoop + Hive      | MemSQL
Processing mode         | Batch              | Real-time
Analytics tools used    | Excel spreadsheets | Wide range of BI tools
Access mode             | Custom queries     | SQL queries
Executive accessibility | Inaccessible       | Direct

MemSQL has improved analytics
performance and accessibility

In addition to the outcomes, the deployment process and operations for MemSQL are also considered a success for the IT team:

  • Time to go live was less than three quarters.
  • Deployment across the company was managed by a single team of just two people.
  • MemSQL runs faster on a six-node cluster than Hadoop + Hive did on fifty nodes.
  • MemSQL offered superior total cost of ownership, with operational savings more than offsetting licensing costs for MemSQL.

As impressive as it is, this use case only shows part of what MemSQL can do. The same company is now implementing MemSQL in a different use case that takes advantage of a wider range of MemSQL’s capabilities.

Spotlight on Pipelines

MemSQL’s Pipelines capability makes it uniquely easy to ingest data. With Pipelines, MemSQL streams data in from any source and transforms it “in flight”, then loads it rapidly into MemSQL. The traditional extract, transform, and load (ETL) process is performed on streaming data in minutes, rather than on batched data in a period of hours.

MemSQL pipelines replace ETL with data streaming
MemSQL Pipelines turn a slow ETL process into a rapid streaming operation.

To extend the Pipelines capability, MemSQL has introduced Pipelines to stored procedures. Stored procedures add flexibility and transactional guarantees to the streaming capabilities of Pipelines. Hadoop and HDFS, by contrast, ingest data in a slow batch process, with little transformation capability, and no access to stored procedures.
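As a hedged sketch of this capability (procedure, table, and topic names are hypothetical), a Pipeline can hand each ingested batch to a stored procedure, which can then write to more than one table within a single transaction:

```sql
DELIMITER //
CREATE OR REPLACE PROCEDURE process_readings(
    batch QUERY(sensor_id INT, ts DATETIME(6), reading DOUBLE))
AS
BEGIN
    -- Land the raw readings...
    INSERT INTO sensor_readings (sensor_id, ts, reading)
        SELECT sensor_id, ts, reading FROM batch;
    -- ...and maintain a per-sensor rollup in the same transaction.
    INSERT INTO sensor_counts (sensor_id, n)
        SELECT sensor_id, COUNT(*) FROM batch GROUP BY sensor_id
        ON DUPLICATE KEY UPDATE n = n + VALUES(n);
END //
DELIMITER ;

CREATE PIPELINE readings_pipeline AS
LOAD DATA KAFKA 'kafka01.example.com:9092/readings'
INTO PROCEDURE process_readings
FIELDS TERMINATED BY ',';
```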

Conclusion

The Fortune 50 company described here is facing the same challenges with its use of data as most other enterprises worldwide. MemSQL has allowed them to provide better access to data, to more people across their company, at a lower cost than other solutions. The results that they have achieved are encouraging them to increase their use of MemSQL for analytical, and also for transactional use cases. The same may be true for your organization.

For more discussion about the use of MemSQL with Hadoop, view our recent webinar and access the slides. And consider trying MemSQL for free today.

What is Time Series Data?


Feed: MemSQL Blog.
Author: Floyd Smith.

Time series data is as old as databases themselves – and also the hot new thing. Interest in the topic has more than doubled during this decade. In this blog post, we’ll explain what time series data is, why there’s an increasing focus on it, and how MemSQL handles it. In a companion blog post, we explain the considerations that go into choosing a time series database.

Time series data is at the core of the Internet of Things, with frequent sensor readings coming in from all sorts of devices. But what is time series data?

We provide a brief answer below, for quick reference. For more detail, please see Chapter 1 of our free excerpt from the O’Reilly book, Time Series Databases.

For related information, see the companion post, Choosing a Time Series Database.

Time series data is inherently well-structured… except when it isn’t. For a simple time series data record in IoT, for example, you might have the sensor number or other identifier, the date, the time, and a reading – that’s it.

Sensor ID | Date     | Time (sec) | Value
4257      | 01012019 | 011304     | 233518

Notice that, in this simple example, the data is inherently well-structured – and therefore suitable for processing by both traditional transactional databases and analytical data warehouses with SQL support, or more effectively through a “translytical” NewSQL database such as MemSQL.

Also notice the timestamp field. More precise – that is, longer – timestamps make time series data more useful. However, they also make the data more voluminous, both in flight and in storage, as well as slower to process for comparison and queries. Given the frequent use of time series data for alerting, as discussed in our next article on this topic, the ability to do quick comparisons on time series data is an especially important consideration.
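As an illustrative sketch (table and column names are assumptions, not from the original), a record like the one above maps naturally onto a MemSQL columnstore table, with DATETIME(6) providing microsecond timestamp precision:

```sql
-- A simple time series table; DATETIME(6) stores microseconds.
-- Sharding on sensor_id spreads ingest across the cluster.
CREATE TABLE sensor_readings (
    sensor_id INT NOT NULL,
    ts        DATETIME(6) NOT NULL,
    reading   DOUBLE,
    SHARD KEY (sensor_id),
    KEY (ts) USING CLUSTERED COLUMNSTORE
);
```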

When you begin to collect and work with time series data at scale, the data can overwhelm the capacity of traditional databases at every stage of the data lifecycle: ingest, storage, transactions (if any), and queries. For example, a modern airliner generates half a terabyte of data per flight. A connected car can generate 300TB of data a year. And data that used to be considered disposable, such as routine transaction data, is now being seen as worth capturing and keeping online.

Now, how much of this kind of data can you afford to throw away? Using our example above, cars are subject to recalls, safety investigations, lawsuits, and much more. Their manufacturing and performance can be optimized to a significant extent – if you have the information needed to perform the analysis necessary to do it.

Cases such as these, where companies collect, analyze, and act on huge amounts of time series data, are expected to grow exponentially in the next 10 years.

The sheer volume of time series data, paired with its increasing value, is where the trouble starts; trouble which is (partly) handled by creating more complex data structures (see below).

Transaction Records as Time Series Data

Time series data has traditionally been associated with simple processes that produce lots of data, such as sensor measurements within a machine. But those of us who focus on transactions have been using – and, in some cases, ignoring – time series data for years.

Think of a customer database that includes the customer’s address. Every time the customer moves, that’s treated as a database update. The transaction record comes in, is held, and might append or overwrite the previous customer record. And the transaction record is thrown away – or, at best, is held in a transaction log somewhere in cold storage.

In a strict transactional scenario, you now no longer know all sorts of things you could have known. How often do your customers move? Is a given customer moving from lower- to higher-income zipcodes, or heading in the other direction? Does this correlate with – or even predict the direction of – their credit score and creditworthiness?

The answers to these questions, and many more, in a strict transactional scenario, are all the same: I don’t know, I don’t know, and I don’t know.

Furthermore, if the transactions aren’t online, you have no way of ever knowing these potentially important facts again – at least not without mounting a bunch of tapes currently gathering dust in a storage facility.

Part of the current mania for storing every transaction record you get, in raw or lightly processed form, in storage that is at least warm, comes from management’s expecting to be able to answer simple questions like those mentioned above. To be able to answer these questions, the organization must collect, store, and be able to quickly analyze data that was once thought irrelevant.

Per-Minute Time Series Data

Some of the complexity around processing time series data comes from a clever method used to battle its voluminousness. Sensors are likely to report (or, to be polled) at varying intervals. Instead of creating a separate record for each new reading, you can create one record, for example, per minute (or other time period).

Within the record of that minute, you can store as many values as come in. (Or, you can only record changed values, or values that change more than a prescribed amount, storing the time at which the change occurred and the new value.)

Because this data is not prescriptively structured – the number of values is not known in advance, so the size of the data in the field can vary – the field that holds the data qualifies as a blob.

Sensor ID | Date     | Time (min) | Values
4257      | 01012019 | 0113       | [04,233518],[14,233649]

The use of blobs, such as this one, in time series data has long been used as a rationale for using NoSQL databases to store it. For a long time, if you had anything but structured data – if you had unstructured data, or even semi-structured data, such as this – a traditional transactional database, the kind that had SQL support, couldn’t efficiently store it.

However, the emergence of JSON as a standard enables a new approach that makes it possible to answer this question differently. You can use key-value pairs in JSON format to store this variable data, and this approach is getting increasingly common. Over the last few years, MemSQL has steadily increased both its ability to store JSON data and the performance of processing and queries against data stored in JSON format.
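A minimal sketch of that approach in MemSQL, with hypothetical names: the variable-length per-minute readings are stored in a JSON column, and typed JSON functions pull individual values back out.

```sql
CREATE TABLE minute_readings (
    sensor_id INT NOT NULL,
    minute_ts DATETIME NOT NULL,
    readings  JSON NOT NULL,   -- e.g. [[4, 233518], [14, 233649]]
    PRIMARY KEY (sensor_id, minute_ts)
);

INSERT INTO minute_readings VALUES
    (4257, '2019-01-01 01:13:00', '[[4, 233518], [14, 233649]]');

-- Pull the first [second, value] pair out of the JSON array.
SELECT sensor_id, JSON_EXTRACT_JSON(readings, 0) AS first_pair
FROM minute_readings;
```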

MemSQL is now able to provide excellent performance for JSON processing and queries against JSON data. This capability makes MemSQL fully competitive with bespoke time series databases for many of the use cases they’re optimized for. At the same time, MemSQL provides capabilities, such as fast transaction processing, support for many simultaneous users (concurrency), and query performance at volume, that most time series databases lack.

Time Series Data, Machine Learning, and AI

One of the big reasons for the newfound importance of time series data – and the increasing drive to keep as much data as possible, of all types, online – is the increasing use of machine learning and AI.

At the most basic level, executives are going to use machine learning and AI to ask the same questions they might have asked before – about such things as a customer’s house moves and their likely income. But now they might also ask far more detailed questions – about customers’ movements within a retail store, for instance, or across a website.

But machine learning and AI can also be more or less self-powered. Machine learning algorithms can run against a database and find out interesting things for themselves – things that no one could ever have predicted, such as a tendency among customers signed up at different times of year to be more or less valuable. (Companies have even gotten in hot water for sending baby product discounts to people whose families didn’t know they were expecting.)

Machine learning algorithms identifying hot spots
Machine learning algorithms can identify valuable “hot spots”
among seemingly random correlations. (Source: MemSQL)

The algorithms can only do their work, though, if the data is there to support this kind of investigation. Companies with more data will have a competitive advantage against their smaller competitors, as well as against those who ran their data storage policies in a more “lean and mean” fashion.

Don’t Isolate Your Time Series Data

Many organizations have only partially learned their lesson about the value of time series data.

There is an increasing drive to retain data and to keep it readily accessible for analytics and transactions, machine learning, and AI. However, the data is often kept in NoSQL databases, such as a Hadoop/HDFS data lake, where it’s harder to analyze.

Querying capability slows greatly when each query that you process has to do the work that a database with the right kind of structure – including, where needed, the ability to support semi-structured data in JSON format – has already done for you.

MemSQL gives you the best of both worlds. You can keep massive volumes of time series data in MemSQL, using it as an ultra-fast operational data store that also has excellent analytics support (something that NoSQL databases are inherently unsuited for). That way nothing is out of reach of your business.

For much more about choosing the right database for your time series data, see our blog post on choosing a time series database. For more about using MemSQL for time series applications, please watch our webinar on the topic.

Choosing a Time Series Database


Feed: MemSQL Blog.
Author: Floyd Smith.

Time series data is as old as databases themselves – and also the hot new thing. Interest in the topic has more than doubled during this decade. MemSQL handles time series data effectively. MemSQL has robust support for rapid ingest, stellar analytical and transactional performance, easy manageability, SQL support for queries and analytics tools, and low total cost of ownership.

Time series data is at the core of the Internet of Things, with frequent sensor readings coming in from all sorts of devices. Once you decide what readings to track for the servers in your data center, for example, you need a place to store the data, and you need to be able to quickly analyze and respond to it. A malfunctioning server needs to trigger an alert; lower or higher throughput than usual needs to be tracked, and action taken to follow up. (See our blog post, What is Time Series Data.)

Similar concerns come up for other Internet of Things devices, such as cars and card readers; in financial services, for stock trading, portfolio management, and risk analysis; in online marketplaces for ads, business needs, and consumer goods; and, increasingly, in day to day business management. Companies need to rapidly ingest original transactional data, use it to instantly trigger responses, and make it available for both real-time and longer-term analytics. These are all use cases that MemSQL excels at.

Use MemSQL to combine transactions and analytics in a translytical database.
MemSQL is a high-performing, scalable SQL database
that suits many time series data use cases well.


How Time Series Data is Used

Time series data is used differently than a lot of other kinds of data:

  • Appends and upserts over updates. A time series record is usually stored in addition to existing data, rather than replacing it – an upsert (from “update or insert”). More narrowly, a specific time series data point can be stored by appending it to an existing data table row or record, though data that arrives out of order may be upserted to preserve time order. Upserts and appends can often be carried out on data in memory. Corrections to existing records, if allowed by the database being used, are regular transactions.
  • In-memory for value alerting. Alerting on a single out-of-bounds reading is a common use case for time series data; “the well is about to blow!”, or a stock value is crashing. As time series data comes in, it needs to be inspected and reacted to very quickly to handle alerting for out-of-bounds values.
  • In-memory for trend alerting. Also common, but harder to manage, is alerting for trends; if a value doubles in less than a minute, for instance, send an alert. If the time span for the alert is long, it can be a challenge – or impossible – to have all the data needed for the alert in memory, hurting both responsiveness and overall performance.
  • In-memory (or some on-disk) for real-time analytics. Real-time analytics, such as a trend display on a management dashboard, may include both older data and the most recent data. Real-time analytics do not need the near-instantaneous responsiveness of alerting but benefit strongly from needed data being kept in memory.
  • In-memory and on-disk for after-the-fact analytics. After-the-fact analytics may primarily cover stale data (outside the alerting window), but should include the latest data as well. So most analytics, machine learning, and AI applications need access to all the data, and good performance, though not the near-instantaneous responsiveness of alerting.
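The append/upsert pattern in the first bullet can be sketched as a single SQL statement, assuming a rowstore table with a unique key on (sensor_id, ts); the names are hypothetical:

```sql
-- New readings are appended; a late-arriving duplicate timestamp
-- becomes an in-place correction instead of a second row.
INSERT INTO sensor_readings (sensor_id, ts, reading)
VALUES (4257, '2019-01-01 01:13:04', 233518)
ON DUPLICATE KEY UPDATE reading = VALUES(reading);
```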

Earthquake fault creep is straightforward time series data that MemSQL can handle well.
Earthquake fault creep in the San Francisco Bay Area. Anomalies
greater than two standard deviations from the norm are marked in red.

In the past, the time series-specific requirements above – appends, upserts, and alerting – have gotten most of the attention in discussing and implementing time series databases. This makes sense, given that time series databases often interacted with expensive machinery or very high-value processes. It was worth sacrificing longer-term considerations, such as analytics, in order to get the performance needed for alerting and, perhaps, dashboarding and similar monitoring. The raw time series data, voluminous as it was, was often thrown away quite soon after it was generated.

However, as the potential value of all the use cases for time series data grows, with the increasing use of advanced analytics, machine learning, and AI, the other requirements – upserts and transactions, real-time analytics (for device management and more traditional business purposes), and after-the-fact analytics to feed new predictive models – are increasing in importance, to the extent that they rival the alerting-oriented requirements.

The earthquake data shown above is an example of both the short-term value, and the increasing long-term value, of time series data. In the past, you needed to know whether an earthquake was happening, and afterward, you might store the data around the time of a quake for further study.

Today, advanced analytics, machine learning, and AI may come together to help us predict earthquakes. There’s no earthquake data set so voluminous that someone won’t want to run machine learning algorithms against it – and possibly generate a breakthrough in prediction by doing so.
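An anomaly scan like the fault-creep chart above, flagging readings more than two standard deviations from each sensor's mean, is a straightforward SQL query (a sketch against a hypothetical sensor_readings table):

```sql
SELECT r.sensor_id, r.ts, r.reading
FROM sensor_readings r
JOIN (
    SELECT sensor_id,
           AVG(reading)    AS mean_reading,
           STDDEV(reading) AS sd_reading
    FROM sensor_readings
    GROUP BY sensor_id
) stats ON r.sensor_id = stats.sensor_id
WHERE ABS(r.reading - stats.mean_reading) > 2 * stats.sd_reading;
```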

NoSQL for Time Series Data?

In the past, relational databases couldn’t do the job for time series data. The main issue was cost-effective scalability. Relational databases couldn’t be efficiently scaled horizontally, and so couldn’t scale to handle the speed, volume, and real-time alerting requirements of time series data. NoSQL databases, on the other hand, scale well on standard commodity hardware.

Most relational databases also couldn’t be taken and modified for time series-specific needs. NoSQL databases, being mostly open source, allowed for innovation on top of existing, tested database code.

This is somewhat ironic, because time series data is structured in a way that makes it a great candidate for analytics. However, NoSQL databases do not natively support SQL. This renders their query interfaces off-limits to ad hoc SQL queries, analytics programs, traditional BI tools, and much more.

Additionally, purpose-built time series databases may not interface well with ingest technologies such as Hadoop/HDFS, Kafka, S3, and others. In these cases, time series databases may become data islands: hard to get data into, and hard to get information out of.

There are also a few time series databases that are relational but run as a single process. These options have the scalability restrictions of traditional single-node relational databases, while offering only limited SQL support.

The query languages that do exist for NoSQL databases are optimized for in-memory alerting use cases. They are not structured for, and perform very poorly for, ad hoc queries and other exploratory analytics performed by data analysts.

NoSQL databases lack a query optimizer for, well, SQL queries. This means that every application developer has to write code that contains their own view of how best to frame an inquiry. This makes the application code larger and slower, while making the overall system inflexible.

Fortunately, on the relational database side, the picture is much different today. NewSQL databases have taken away much of the reason for existence of NoSQL databases – as described in our very popular blog post on the topic. NewSQL databases, such as MemSQL, are fully scalable – and, unlike NoSQL databases, they support structured and unstructured data, transactions, and SQL for queries, analytics, and business intelligence (BI) tools.

NewSQL databases also, increasingly, support specialized functionality that allows them to work well with time series data out of the box, while offering scale-out, in-memory and disk-based processing, and the JSON support that has been added to MemSQL. (This blog post describes how to use MemSQL’s JSON support to store blob-style data.)

Some NewSQL databases can also be further optimized much more easily than was the case for traditional relational databases. We’ll explain some of these features, and potential optimizations, for the specific case of MemSQL below.

Using MemSQL as a Time Series Database

NewSQL databases are a relatively new category, and not all of them are equally mature. Also, some NewSQL databases, such as Google Spanner, are restricted to a specific cloud platform and are focused on making transactions consistent on data distributed around the world, which can raise concerns about analytics flexibility and lock-in.

MemSQL is among the most mature of the NewSQL offerings, and it has several features that particularly recommend it for use with time series data. These features include:

  • Speed plus scalability. MemSQL is very fast on a single machine, and it can be scaled out for performance by seamlessly adding machines to a cluster.
  • Memory-optimized with on-disk compression. MemSQL started as an in-memory database, then added columnstore support with on-disk compression, as needed for voluminous time series data.
  • Ecosystem of ingest tools. MemSQL works well with a wide range of ingest tools, such as enterprise ETL, Kafka, Hadoop/HDFS, S3, and more.
  • Pipelines. MemSQL Pipelines apply transformations to data on ingest. They streamline the traditional ETL process into a very fast set of operations.
  • Pipelines to stored procedures. MemSQL can run streaming data against stored procedures on ingest, expanding the range of operations you can quickly run on incoming data.
  • Scalable transaction support. In addition to transformations via Pipelines and stored procedures, MemSQL runs transactions very quickly on a distributed cluster, ensuring performance for extremely high data ingest workloads.
  • SQL support. As mentioned, MemSQL has full ANSI SQL support, with all the advantages this implies for queries, analytics, and machine learning and AI programs which use SQL.

MemSQL Pipelines and stored procedures help you streamline a great deal of functionality into a single operation.
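
To give a flavor of the pipeline syntax, here is a sketch of a Kafka-backed pipeline; the broker, topic, table, and column layout are hypothetical, so treat this as illustration rather than a copy-paste recipe:

```sql
-- Hypothetical example: stream comma-delimited events from a
-- Kafka topic straight into a MemSQL table on ingest.
CREATE PIPELINE clicks_pipeline AS
  LOAD DATA KAFKA 'kafka-broker.example.com/clicks-topic'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';

START PIPELINE clicks_pipeline;
```

Replacing `INTO TABLE` with `INTO PROCEDURE` routes each batch through a stored procedure instead, which is how the pipelines-to-stored-procedures pattern described above is wired up.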

Because MemSQL has specialized functionality for smoothly interacting with data across memory and disk, including large memory caches for columnstore tables and compression that’s high-performance for both reads and writes, the severe disparity between in-memory and on-disk performance is greatly reduced with MemSQL. This allows you to use time series data much more flexibly across a range of use cases, from alerting to reporting and predictive analytics.
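
A sketch of the tiered approach, with invented table and column names: recent readings live in an in-memory rowstore table while history lives in a compressed, disk-based columnstore table.

```sql
-- Hot, recent readings in an in-memory rowstore table
CREATE TABLE readings_recent (
  sensor_id BIGINT NOT NULL,
  ts DATETIME(6) NOT NULL,
  value DOUBLE,
  PRIMARY KEY (sensor_id, ts)
);

-- Historical readings in a disk-based, compressed columnstore
CREATE TABLE readings_history (
  sensor_id BIGINT NOT NULL,
  ts DATETIME(6) NOT NULL,
  value DOUBLE,
  KEY (ts) USING CLUSTERED COLUMNSTORE
);
```

A periodic job can move aged rows from the rowstore table into the columnstore table, keeping the memory footprint bounded while leaving all the data queryable.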

The table below summarizes the pluses and minuses of dedicated time series databases, largely based on NoSQL, vs. MemSQL for time series data.

Feature | Dedicated time series database | MemSQL
Ingest connections (Hadoop, Kafka, S3, etc.) | Varies | Strong
In-memory support | Strong | Strong
Columnstore/compression | Varies | Strong
Ingest performance | Strong | Strong
Transformations | Strong | Strong
Transaction support | Varies | Strong
SQL support | Poor | Strong
SQL optimizations | Poor | Strong
Time series-specific optimizations | Strong | Via Pipelines transformations or stored procedures

You can use MemSQL in combination with a wide range of existing tools. MemSQL can handle all needed processing, or you can deploy it in more complex architectures. For instance, you may use Hadoop/HDFS for ingest and to store raw or lightly processed input data in a data lake; Kafka for messaging; and Oracle or another traditional relational database for billing. You can even continue to use a specialized time series database for direct machine interface and alerting and specialized reporting. In all cases, MemSQL provides the operational data store and support for responsive analytics.

Conclusion

MemSQL is being used for an ever-wider range of use cases. Many time series workloads will operate better on MemSQL – faster, with greater functionality, and more cost-effectively. MemSQL customers tend to be most impressed by its high performance across a wide range of use cases and by the support of the MemSQL team.

Adopting a new database is a big decision. We recommend that you review our case studies and then try MemSQL. It’s free to use for clusters with up to 128GB of RAM. With MemSQL’s flexible columnstore support for tables stored on disk, that allows you to support datasets with hundreds of gigabytes of data.

You can also find out more directly from MemSQL. Simply contact us for more information. To see MemSQL’s time series capabilities in action, please view our webinar on the topic.

Customer Repost: How GoGuardian stores and queries high throughput data with MemSQL


Feed: MemSQL Blog.
Author: JK Kim.

GoGuardian is an Education Technology company that specializes in moderating student web activities by using machine learning to facilitate a better learning environment. We combine dynamic data inputs to calibrate student engagement and help educators draw conclusions about their class sessions. This means that there are a large number of events happening from more than 5 million students every single day. As one can imagine, handling all of these events is quite a challenge.

This article will detail how we solved our data storage issues and querying challenges with our friends at MemSQL.

OUR ENGINEERING CHALLENGES

Here at GoGuardian, we understand the value and importance of our customers’ data. We consider and evaluate all of the following points, constantly, for all of our projects and products:

Security: Ensuring the security of our users’ data is our primary concern.
Data fidelity and retention: We want to reconstruct web activity and any gap in data is a functional failure for our products.
Scalability and availability: Our systems must scale to meet the need of millions of concurrent users.
Queryability: Data that is not accessible to our users is useless and should be avoided.

While data security is a priority of its own, and deserves its own write-up, I will not spend much time discussing it here; it is beyond the scope and intent of this article. However, to address the requirements of data retention and queryability, we have some specific technical challenges:

1. Data generation is cyclical:
Most of our users are students in schools, and many schools choose to disable our products when school is not in session. This means the rate of data generation outside of school hours is drastically lower than when school is in session. This is not as difficult to solve as other challenges, but it does pose a headache for resource allocation because the difference between our peak and trough traffic is quite large.

2. High data throughput:
Our servers receive traffic that is generated by more than 5 million students in real time, and each event translates into multiple writes across different tables and databases. (An event roughly corresponds to a collection of web clicks or navigation events.)

3. Data duplication:
A piece of data we saw at time T0 may be updated and reappear at T0 + t aggregated. These two pieces of data are not identical, but consist of the same key with expanded or updated data. For example, an event may have an array of start and end time pairs of [[T0, T1]]. However, an event with the same key may appear later with start and end time pairs of [[T0, T1], [T2, T3]] if the event re-occurred within a certain time threshold. The updated event encapsulates both new and old data. By storing only the most up-to-date version of each event, we save considerably on row count and storage for many tables, thus reducing overall compute time.

This means that event data is mutable and in some cases we need to update rather than insert. This poses challenges for some databases that are not mutation-friendly. To get around this, we could have redesigned our data generation to support immutable inserts only. However, this would have meant retaining the entire payload of all the generated data, which would make write performance faster but cause the row count to increase, leading to more expensive reads.

We chose to optimize for read performance over write performance due to the dynamic nature of our reads, which is discussed more in the next point.
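
In SQL terms, the update-rather-than-insert requirement is a classic upsert. A sketch in the MySQL dialect, with an invented schema standing in for ours:

```sql
-- Insert a new event, or replace the time-pair list if an
-- event with the same key has already been seen.
INSERT INTO events (event_key, time_pairs, last_seen)
VALUES ('evt-123',
        '[[1548990000, 1548990040], [1548990100, 1548990160]]',
        NOW())
ON DUPLICATE KEY UPDATE
  time_pairs = VALUES(time_pairs),
  last_seen  = VALUES(last_seen);
```

Each re-occurrence of an event rewrites the same row rather than adding a new one, which is what keeps row counts, and therefore read cost, down.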

4. Read query pattern:
Our reads are quite dynamic over many dimensions. We group, aggregate and filter by time, classrooms, student, school, URL, and many other factors. Also, most of our queries are aggregate in nature: less than 6 percent are at the row level while over 94 percent require some kind of ranking or aggregation over a dimension.

We did discuss pre-calculating some of the requests, but in order to make it feasible we would have had to reduce the degree of dimensions and also reduce how dynamic our queries can be. Doing so would have resulted in removing features from our products, which is unacceptable for us and our customers’ experience.

One consolation for us is that our read throughput for this data is not nearly as high as the write throughput. Thanks to various caching strategies, there are about 400 reads per minute.
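
To give a flavor of these reads, here is a hedged sketch of the sort of aggregate we run; the schema and literal values are invented for illustration:

```sql
-- Top URLs by total active time for one classroom on one school day
SELECT url,
       COUNT(*)           AS event_count,
       SUM(duration_secs) AS total_secs
FROM   browsing_events
WHERE  classroom_id = 42
  AND  ts >= '2019-02-01 08:00:00'
  AND  ts <  '2019-02-01 15:00:00'
GROUP  BY url
ORDER  BY total_secs DESC
LIMIT  10;
```

Swap `classroom_id` for `student_id`, `school_id`, or another dimension and you have the shape of most of the other 94 percent of our queries.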

LEGACY SOLUTIONS

To address our challenges, we had previously implemented various solutions to meet our query needs. Each solution worked well originally. However, we quickly outgrew the legacy implementations that we once relied on.

Before we discuss these, it is important to understand that our intention is not to convey that these solutions are inherently inadequate or insufficient. Instead, what we are trying to say is that when we designed these systems, we had different product requirements than we do now. Eventually, these other solutions no longer fit our needs.

Sharded SQL DBs:
We started out with a single SQL database. At a certain scale, we could no longer rely on a single-instance database. We implemented a sharding solution to split writes across multiple databases based on a key, with each database holding a subset of the data.

[Architecture diagram: writes split across multiple SQL database shards by key, routed by a shard router service]

One key thing to note here is that these are sets of relational databases that handle high-throughput browser events. This results in a large quantity of rows per table on each shard. When a table is at such a high scale, queries without indexes will have unreasonable latency and thus the queries need to be carefully crafted, especially when joining with other tables. Otherwise, the databases would lock up and result in cascading effects all the way to our customers.

The relational, row-based SQL databases handled writes relatively well, as expected. However, reads were problematic, especially considering that most of our queries are aggregate in nature, with many dimensions. Adding more SQL DBs and resharding would obviously help, but we were quickly approaching a point where the cadence of potential resharding couldn’t keep up with our growth. When we talk about databases, one of the most often overlooked factors is maintainability. We are often too focused on latency, performance, and cost, but rarely ever talk about maintainability. Shards do not score very high on the maintainability metric for two reasons: the resharding process and the need for a shard router.

Resharding is a resource-straining task and isn’t as simple as adding a database cluster. The new cluster needs to be registered with the shard router, load the data for the keys it is now serving, ensure even distribution of the load, etc. These are all possible tasks, but the dull, mundane, and time-consuming nature of that particular work was something we were not thrilled about having to do.

The shard router itself was another problem we faced. As you can see in the architecture diagram above, the operation of these shards is dependent on the shard router service that knows which shard is responsible for each key. The reason why we used this stateful mapping is because not all keys are equal in traffic load, and the degree of variance is quite high. To handle such a variance in workload, we decided to allocate keys to shards based on the expected traffic, which resulted in the need for the shard router service. Our database uptime and performance dependency on this shard router service was an undesirable situation and became even more challenging when resharding was involved.

pros:
-Writes are fast
-Fast, simple index-based fetches (key-based queries without aggregation)

cons:
-Aggregate queries are slow (94 percent of our queries)
-Not easy to do cross-shard queries
-We need to maintain a shard router service
-Resharding is an expensive operation

Druid:
Druid is, unlike the sharded SQL DBs we were using, a columnar database that we had adopted to complement our shards. Our shards were great at inserts, but terrible at aggregate queries, so Druid was the response to supplement our aggregate query requirements.

[Architecture diagram: Druid deployed alongside the SQL shards to serve aggregate queries]

The first point to note about Druid is that it doesn’t do mutation at the row level; there is no row update or delete. The only option for data mutation is to run an index replacement job, which replaces an entire data block in a batch process. Because of the way our data is generated, it necessitates updates of individual rows. This was a major roadblock for us. We ended up having to create non-trivial logic to eliminate the duplicated data during the time threshold when the newer, more correct data could show up. Once the data was finalized, we would trigger a batch job to replace the duplicate data with the finalized, condensed data.

Although we no longer had to maintain a shard router like we did in the case of the SQL shards, we now had to maintain another outside component: the index replacement job. While this is largely facilitated by Druid using a Hadoop cluster, it is yet another dependency we didn’t want to manage.

Druid is not a relational database, so there are no joins. Though this was not a dealbreaker for us, it was definitely something that the engineering team had to adapt to. The way we design queries and tables, as well as how we think about our data, had to change. On top of that, at the time, Druid did not have basic support for SQL, so the query DSL required us to change a lot of the code that we used to query the data.

Druid is a great database that does aggregations across vast amounts of data and dimensions at terrifying speeds (with some query results being “approximate” by design if you read its docs: topn and approx-histograms). I don’t think there was ever a time where we had to worry about the read latency of Druid that wasn’t induced by process or infrastructure failure, which is quite impressive. However, as we continued to use Druid it became painfully obvious that it did not fit our “upsert” use case. Druid is meant for “insert only” where reads can be very dynamic yet still maintain fast latency through various caches and approximations. I’ll be the first to admit that we abused Druid because it wasn’t a perfect fit for the data we were putting into it.

pros:
-Fast aggregate queries
-Distributed by nature, so scaling is easier

cons:
-Had to maintain index replacement job
-Many moving parts (Hadoop, SQL DB, Zookeeper, and various node types)
-No joins and limited SQL support

REDESIGN FROM THE GROUND UP

When we sat down and looked at our incoming data and query patterns, we were at an all-too-familiar place for an engineering team: we needed the best of both worlds. We had fast writes but slow reads with the SQL shards. We also had fast reads but slow, duplicated writes with Druid. What we needed was the fast writes of row-based databases and the fast aggregate reads of columnar databases. Normally this is where we engineers begin to use our “trade-off” and “expectation management” skills.

Nonetheless, in hopes that there were better and cheaper solutions that existed, we began experimenting.

What we have tried:
Again, I cannot emphasize enough that all the databases below have their own strengths and use cases.

1. Phoenix:
Phoenix is an Apache project that adds a layer on top of HBase that allows SQL queries and joins.

Configurations, adaptations, and usage were rather straightforward and we were excited for the potential of Phoenix. However, during our testing we got into an odd state where the entire database was bugged out and no amount of restarts or configuration changes could bring the database back to a functional state. It’s very possible that something went wrong during configuration or usage. However, our production database should be resilient and versatile to the point where any operations should not be able to bring the entire database into an unintentional, unrecoverable and inoperable state.

2. Druid:
Another option was to redesign not only how the data was generated so that updates would no longer be necessary, but to also redesign our Druid schema and services to adapt to such data.

However, the transition and implementation for this is difficult. For zero downtime, we would have had to replicate the data ingestion and storage for an extended period of time. Time, cost, and engineering effort for this was significant. Furthermore, we weren’t completely convinced that insert-only data generation was a better choice over our current method of data generation.

3. BigQuery | Presto | Athena:
Although each of these products is different, they share the principal idea of a query engine decoupled from storage; they have similar characteristics of great parallel wide queries but not-so-ideal write throughput.

Of these solutions, BigQuery has the most optimal write throughput, when writing to native BigQuery storage rather than imposing a schema on top of files. However, we would still need to redesign our data generation to reduce write throughput because even BigQuery didn’t fully address our write needs.

Overall, despite us trying various partition strategies and schemas, we couldn’t come up with a confident solution for any of the above. We either ran into another transition difficulty, as we did with Druid, or we had to make compromises in business requirements. They are great for the kind of ad-hoc, non-latency-sensitive queries that are run by our analytics team, but not for our customer-facing products.

4. Spanner:
Spanner is Google’s proprietary, geographically-distributed database that was recently made available to the public via GCP. It is another relational database that shines on strong consistency across multiple regions. For this use case, we didn’t necessarily need the tight and strong consistency that Spanner is known for, but it was a very fast and exciting database to work with.

Spanner is a great product with in-depth concepts and fascinating features (such as interleaved tables) that I was really excited about, and it was one of the most impressive candidates during our testing phase.

The problem we ran into was that the cost projection for our usage was higher than that of our existing legacy systems.

MEMSQL

We first learned about MemSQL in the crowded vendor hall of AWS re:Invent 2017. Another engineer and I started sharing our problems with one of their representatives and ended up talking about data throughput, consistency, high availability, transaction isolation, and databases in general.

It was one of the most exciting and enlightening conversations I’ve had, and it changed how we served data at GoGuardian.

Why MemSQL:
MemSQL is a distributed, SQL-compliant database. Multiple aggregator nodes serve as the brains of the operation, multiple leaf nodes store the data, and the cluster is coordinated by a single master aggregator.

Through simplicity of design, MemSQL was able to achieve complex operations at low latency.

1. Both row and columnar:
If someone were to ask “is MemSQL columnar or row-based?”, my answer would be “yes”. MemSQL supports both types of storage, defined at table creation time. Perhaps most importantly, it allows unions and joins across row-based and columnar tables.

I cannot stress enough how important this feature is to us, as it fundamentally changed how we served data by giving us the best of both worlds: the fast writes of a row-store and the fast aggregate reads of a column-store.

I don’t know of many database solutions that can support both row and columnar storage types. I certainly don’t know many database solutions that support seamless joins and unions across both types. These features allowed us a degree of flexibility we never had previously.

2. High availability
Machine failure is inevitable and it is something all engineers anticipate and prepare for. We create as many preventions as we can while also preparing for incident mitigation. MemSQL achieves this by replicating every write into both a master and a secondary partition. If the master partition fails, the secondary partition is promoted to master.

3. Speed
It’s fast. There are some databases that, by design, cannot get better than 500ms latency, regardless of how small or simple the data being queried is. With MemSQL, we are able to see some queries under 30ms when using proper indexes and partition keys.
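
Reaching that kind of latency depends on choosing partition keys well. A sketch of a shard key declaration (names invented) that keeps each student's rows on a single partition, so per-student queries touch one leaf node:

```sql
CREATE TABLE student_events (
  student_id BIGINT NOT NULL,
  ts DATETIME(6) NOT NULL,
  payload JSON,
  -- Rows with the same student_id hash to the same partition
  SHARD KEY (student_id),
  KEY (student_id, ts)
);
```

A skewed shard key (say, one student generating most of the traffic) concentrates load on one leaf, so the key should spread rows as evenly as the query pattern allows.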

4. Friendly support
I’ve worked with many companies, big and small, representing various databases. Sometimes we as technologists run into a product that is new and difficult for us to understand, and we need to ask new questions. Some companies or representatives do not communicate well, whether through their documentation or in response to direct questions, and I’ve been reluctant to use some products based on unresponsiveness and the perceived difficulty of their documentation and community.

The folks at MemSQL were generally very helpful. Ever since our conversation on the crowded floor of AWS re:Invent, all the way through the time when we dropped the multiple legacy databases that were replaced by MemSQL, we have always enjoyed their assistance and friendliness; either in the form of 100+ long email threads or support tickets. It has definitely been a pleasant working experience.

Our architecture:
Let’s recap our challenges.

  1. Data duplication: data is volatile for a period of time after creation.
  2. High throughput: capturing all browsing events of five million students.
  3. Aggregate queries: most queries are not simple row retrieval, but aggregates over several dynamic dimensions.

What if we write to the row-based table and pay a read latency penalty for the short period of time while the data is volatile, and union that table with a columnar table that holds the final data past that volatile time period?

MemSQL allowed us to do this by allowing unions and joins across row and columnar tables seamlessly.

[Architecture diagram: a stream processor writes to a row table in real time; a batch process moves stable data to a columnar table; reads query a view that unions the two]

As described above, our stream processor continuously writes our data in real time into a row table. Periodically, our batch process dumps data that has become stable into a columnar table with the same schema. When we read the data, the queries are run against a view that is the union of the row and columnar tables.
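
In SQL terms, the pattern boils down to a view over a union; the table names here are hypothetical stand-ins for our schema:

```sql
-- Recent, still-mutable events live in a rowstore table;
-- finalized events live in a compressed columnstore table.
-- Readers query a single view that unions the two.
CREATE VIEW events AS
  SELECT * FROM events_row
  UNION ALL
  SELECT * FROM events_columnar;
```

Because the batch process deletes rows from the row table as it lands them in the columnstore, the union never double-counts, and readers see one continuous table.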

Once we figured out the appropriate partition keys that would minimize our data skew, the speed and efficiency we were able to achieve from this architecture was quite stellar.

It is also important to mention that joins are now possible for us. Previously, we couldn’t do joins at all in our columnar storage, and heavily frowned upon it in our row storage due to inefficiency. However, because MemSQL allows us to do both (and it actually works pretty well) we can now have the properly organized and normalized data we data engineers dream about.

Tests:
Any engineering decision requires a sufficient amount of testing and questioning. Below are some of the results of the tests we ran during our experiments with MemSQL.

Here are some points to mention:

  1. These tests were run during our proof-of-concept phase.
  2. These test results are based on our specific use case.
  3. Setup, configuration, design, and code were done by us.
  4. Each event was inserted into multiple, semi-normalized tables.
  5. Reads may be joined, aggregated, filtered, or ordered based on sample production load.
  6. Ran against MemSQL 6.5.
  7. Our goal was to prove that it could work. Our POC cluster definitely could not handle our production load and our finalized prod environment is bigger and more optimized than the POC.
  8. You should do your own testing.

1. Isolated read throughput test

Test Setup:

Node | Type | Count
Master | m4.2xlarge | 1
Aggregator | m4.2xlarge | 3
Leaf | m4.2xlarge | 4

Test result:
[Chart: read latency over time as the query rate increases]

QPS | Test Duration | Latency
1 qps | ~20 mins | <200 ms
5 qps | 12 mins | <200 ms
10 qps | 48 mins | <200 ms
20 qps | 96 mins | <300 ms
40 qps | 32 mins | <600 ms
100 qps | 8 mins | ~15 sec

-Number of parallel read queries per second was continuously increased during the test.

-Read queries were our top most frequent queries with a distribution similar to our production load.

2. Isolated write to row table throughput test
Test Setup:

Node | Type | Count
Master | m4.2xlarge | 1
Aggregator | m4.2xlarge | 3
Leaf | m4.2xlarge | 4

Test result:
[Chart: write throughput and resource usage during the row table write test]

Stat | Value
Avg latency per 100 events | 27 ms
Avg throughput | 2.49M events/min
MemSQL row (RAM) usage | 14.04 GB
Avg leaf CPU | 62.34%
Avg aggregator CPU | 36.33%
Avg master CPU | 13.43%

-The reduced activity from around 17:40 to 17:55 was due to faulty code in our test writer that caused an out-of-memory error; the test server was restarted and terminated soon after.


3. Read while write

Test Setup:

Node | Type | Count
Master | m4.2xlarge | 1
Aggregator | m4.2xlarge | 3
Leaf | m4.2xlarge | 8

Test result:
[Chart: read latency while sustaining concurrent writes]

-Read throughput was pegged at 40 qps.
-Write throughput was around 750,000 events per second.
-The read latency bump was due to running ANALYZE and OPTIMIZE queries during the run to observe their effects.

Final thoughts

At the end of the day, all the databases that are listed and not listed in here, including MemSQL, Druid, MySQL, Spanner, BigQuery, Presto, Athena, Phoenix, and more, have their own place in this world. The question always comes down to what it takes to make it work for your company’s specific use cases.

For us at GoGuardian, we found MemSQL to be the path of least resistance.

The ability to perform joins and unions across row and columnar tables is definitely a game-changer that allows us to do so much more. MemSQL isn’t without its own problems, but no solution is perfect. There are some workarounds we had to do to make it work for us, and there were a few times when we were disappointed. But they are listening to our feedback and improving the product accordingly. For example, when we told them that we really needed the ability to back up our data to S3, one of their engineers sent us an unreleased version with that feature to start testing with. From there we were able to establish a line of communication with their engineering team to round out the feature and even iron out a bug on their end. This line of communication increased our confidence in adopting MemSQL.

Now that everything is up and running smoothly, I’m both proud of what we have accomplished and thankful for the product and the support we have received from the MemSQL team. I have more confidence than ever before in our infrastructure and its ability to handle strenuous loads during peak hours.

Oh, and have I mentioned that we are saving about $30,000 a month over our previous solutions?

The original posting of this blog can be found here.

Using MemSQL and Looker for Real-Time Data Analytics


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL is a fast, scalable SQL database. Looker is a fast, scalable analytics platform. You can use MemSQL and Looker to create a fast, scalable – yes, those words again – analytics solution that works well across a wide range of data ingest, transaction processing, and analytics needs.

Both MemSQL and Looker are flexible and powerful tools. With full ANSI SQL support, MemSQL works with a wide range of analytics tools. Looker, in turn, can connect to any SQL data source, which allows it to work well with a vast number of databases. Looker also optimizes its database interface to take advantage of specific database features, as you will see below.

When paired together, MemSQL and Looker combine these areas of strength to deliver consistent and concrete results. For instance, one of the most popular applications for real-time analytics is to create a real-time dashboard. There may not be an easier or more effective way to create such dashboards than to first implement MemSQL and Looker together atop your existing architecture. Use Looker to make creating your dashboard easy, and use MemSQL to make it fast.

Speeding Up Analytics with MemSQL and Looker

You can use the combination of Looker and MemSQL atop an existing data architecture to make data much easier to access and greatly speed performance. MemSQL is faster than competing solutions; often twice as fast, at half the cost. You can also use MemSQL to take over some or all of the work currently done by an existing SQL or NoSQL database, further improving performance.

A solid example of an organization using MemSQL to speed analytics performance is the online retail company Fanatics. Fanatics sources and sells branded merchandise for some of the world’s leading sports teams, including the NBA and the NFL, along with global brands such as Manchester United. Fanatics uses MemSQL to create a fast and reliable data architecture for all their analytics needs – including apps, business intelligence (BI) tools, and ad hoc SQL queries.

[Diagram: Fanatics’ Fanflow feeds data from all sources into streaming analytics powered by MemSQL]

Looker can also be used alongside existing BI and analytics tools. Where you use Looker, you’ll gain high-performance SQL query performance and ease of use, thanks to the LookML modeling layer. By implementing Looker, you can begin to create and foster a true data culture at your organization.

One company that has done this is Kollective, which has the demanding job of distributing video content for a wide range of customers, including Fortune 500 companies like ExxonMobil, HSBC, and T-Mobile. Kollective chose MemSQL and Looker and uses the tools together for real-time analytics. You can read the case study or watch a joint webinar presented by people from both companies.

You can also use MemSQL to replace both your existing database types – transactional and analytical – with a single, converged database, MemSQL. As you do so, you’re removing existing batch loading and extract, transform, and load (ETL) processes from your overall data flow. This change brings apps and analytics tools – including Looker – closer to your source data. With a properly architected solution, you can achieve near-real-time or real-time analytics.

Together, MemSQL and Looker support much broader access to your data. By making data much easier to access, Looker increases the number of people who want to dig into data and the frequency with which they access it. MemSQL contributes, with its outstanding performance and its high degree of concurrency. As a scalable SQL database, MemSQL lets you power your data with the amount of hardware that you need to get the performance that you want, for all the users who need it.

MemSQL’s unique ability to offer this kind of solution is mentioned by Looker in Looker’s Pocket Guide to Databases. Looker describes MemSQL as a database that is:

  • Powered by both rowstore functionality, traditionally used mostly for transactions, and columnstore functionality, traditionally used mostly for analytics.
  • A massively parallel processing (MPP) database, capable of smoothly scaling out across multiple nodes.
  • Both a self-managed (aka on-premises) MPP database and an on-demand (aka cloud) MPP database.
Looker’s guide to databases describes a wide range of databases, including MemSQL, and how best to use them with Looker.

Setting Up MemSQL to Work Well with Looker

A typical “small” MemSQL implementation has two aggregator nodes, four leaf nodes, 128GB of RAM, and – through the use of mixed rowstore and columnstore data – up to perhaps a terabyte of total data. You add nodes to support larger and larger amounts of data.

In its early implementations, MemSQL worked as a very fast, rowstore, in-memory database. Several years ago, MemSQL added columnstore functionality, which keeps data – including strongly compressed data – on disk, with a solid chunk of RAM dedicated for use as a cache over the columnstore.

Because MemSQL functions as both a rowstore and columnstore database in one, most operations proceed at or near in-memory speed. This allows data that’s presented in Looker to appear as near-real-time analytics, at a cost closer to that of a disk-based system.

More recently, MemSQL has added support for semi-structured data. Geospatial data, JSON data, and AVRO data (a specialized, compressed format based on JSON) are all supported, easy to manage, and with performance very close to fully structured data.
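
As a hedged sketch of the JSON support (table and column names invented), MemSQL extracts JSON fields with `::` path operators, and those extractions can be filtered, grouped, and aggregated like ordinary columns:

```sql
CREATE TABLE page_views (
  view_id BIGINT PRIMARY KEY,
  props JSON NOT NULL
);

INSERT INTO page_views VALUES
  (1, '{"country": "US", "ms": 120}'),
  (2, '{"country": "US", "ms": 80}');

-- ::$ extracts a field as text, ::% extracts it as a number
SELECT props::$country AS country,
       AVG(props::%ms) AS avg_ms
FROM   page_views
GROUP  BY country;
```

Because the extraction happens in SQL, tools like Looker can query semi-structured columns through the same interface as fully structured ones.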

You don’t really need to do anything special to MemSQL to make it work well with Looker. In fact, Looker is designed to take full advantage of MemSQL’s capabilities.

Looker supports MemSQL’s semi-structured data formats. You don’t have to limit their use in order to keep your data available for analytics, and you don’t have to worry that Looker, as your analytics tool, will bog down on semi-structured data. (In order to take advantage of the data, you first need to schematize it into a relational format.) You can store and manage your data in the way that makes sense for the specific data you’re storing and the queries you’ll be making against it.

In addition, Looker can flexibly use rowstore or columnstore data. This allows you to maximize your use of either, or both, without worrying about the needs of your analytics program. For example, if you really feel the need for speed, you can keep more of your MemSQL data in memory, presumably in rowstore format. You can then let Looker do the work needed to efficiently run queries that would normally only work well against columnstore data.

Setting Up Looker to Work Well with MemSQL

One of the advantages of using Looker with MemSQL is that Looker “gets” MemSQL. Looker works smoothly and well across rowstore and columnstore tables, hiding the implementation details from the people and applications generating the queries, with excellent performance.

Looker has specific setup instructions for use with a set of databases that are deeply MySQL-compatible: MySQL, Clustrix, MariaDB, and MemSQL. For all of these databases, you can enable either persistent or regular derived tables. Derived tables are powerful tools that can give you more capability in LookML and more performance from your SQL queries.

Looker and MemSQL together also help users to resolve a challenge in using any database that supports SQL, including MemSQL. The challenge is to easily generate SQL that’s optimized for the database in question.

With Looker and MemSQL together, you have four options:

  1. Write your own SQL. Many people are so SQL-conversant that this is an easy option for them, for simple queries.
  2. Let LookML generate SQL for you. Looker generates highly-performant SQL queries that query your database, and are optimized for it, directly from LookML’s modeling layer.
  3. Use Looker’s SQL Runner to optimize your query. SQL Runner has a wide range of capabilities, including the ability to test derived tables.
  4. Use MemSQL Studio and MemSQL’s command-line tools. With these tools, you can profile and optimize your database for maximum performance against the queries generated as ad hoc SQL queries (#1 above), from Looker (#2 and #3 above), and from other sources, including machine learning and AI programs.

Note: Code generated from LookML by Looker (see #2 above) is likely to run faster than handwritten SQL. For instance, Looker takes advantage of the MemSQL Persistent Derived Tables capability to generate optimized tables – in MemSQL – for extremely fast performance of one-time or repeated queries.

You can further speed analytics by optimizing your data storage choices in many ways, taking advantage of MemSQL’s flexible use of rowstore and columnstore tables. For instance, you can construct a dashboard in Looker that’s backed entirely by rowstore tables in MemSQL for optimal performance. Or you can mix columnstore and rowstore data flexibly to target the price-performance that you need.

You can use Looker and MemSQL together to iteratively optimize your database structure and analytics needs. The two companies have been working together for years. For a quick demo of building an analytics app with MemSQL, using Looker as the analytics tool, please view our webinar on the topic.

Looker is one of scores of BI tools you can easily use with MemSQL.
Our MemSQL webinar shows how to set up MemSQL,
connect to it from Looker, and quickly create an analytics app.

Ready To Get Started?

Want to learn more and get started with MemSQL? You can get started for free. Or reach out to our team to learn more about how MemSQL can work for you.

And if you’re ready to find out how Looker can enable data-driven insights at your organization, contact the Looker team to request a demo and connect with their experts.


Case Study: MemSQL Powering AI Breakthroughs at diwo

Feed: MemSQL Blog.
Author: Floyd Smith.

diwo® is a new, AI-powered platform that promises to help business users answer real-life challenges, acting as its “cognitive decision-making superpower.” diwo – which stands for “data in, wisdom out” – uses MemSQL to power a significant part of its functionality.

diwo was developed to reveal hidden business opportunities and empower users to act on them in a timely manner. The software can run in an AI-powered conversational mode, rather than through a programming or scripting language (though many layers of sophisticated coding underlie the system). diwo’s conversational persona, ASK, is powered by a series of distributed microservices, and uses MemSQL for transactions and queries.

diwo has been built by a team with strong experience in decision science and data engineering, united by a strong desire to create useful, business-first solutions. “We came to see data science as always in research mode—geared toward exploration and experimentation, rather than value. Businesses were not getting what they should in terms of results,” says the project’s visionary leader, Krishna Kallakuri. “We want to bridge that gap, to build something designed primarily for use by the business community, something that actually demonstrates value on day one.”

diwo has many systems and subsystems, using MemSQL to power machine learning and AI
Going beyond “Self-Service BI,” diwo provides contextual information
and actionable insights about causes—not just “answering the question.”

diwo runs in the background of day-to-day operations, learning about your business and discovering opportunities to maximize revenues or minimize risks. The software then guides users through optimized, customizable strategies to address those opportunities in real time. While most such systems focus on generating insights—leaving the application of these suggestions up to the user—diwo goes further, achieving what early AI scientists envisioned: talking to a computer to walk you through your real-life business decisions.

MemSQL Powers Interactive Querying

With diwo, the platform is an active partner in making business decisions. The diwo platform features three personas, ASK, WATCH, and DECIDE. Each has a different role in the information-gathering and decision-making process.

Currently, MemSQL helps to power the ASK persona. diwo ASK goes beyond traditional Search BI—it’s designed as a system-led conversation that works to identify the user’s intent in order to solve underlying issues and reach the optimal decision. In responding to a query, diwo needs to assemble, generate, and plumb a dataset that can easily reach dozens of terabytes – quickly and easily. The process is often recursive, with one question/answer pair generating another.

Because the user asks queries in natural language, it’s up to diwo to convert that query into machine-actionable requests, steadily working through all the data available to diwo and converging on a useful answer. The distributed nature of diwo’s architecture, and its extreme demands on both data processing and query performance, make MemSQL a natural fit.

“Every piece of code we have is distributed, on several levels,” said Kallakuri. After founding one of the fastest growing analytics companies in the midwest 15 years ago, he is now leading the launch of Loven Systems, which owns the diwo project.

diwo uses MemSQL for real-time database interaction.
MemSQL is used in diwo’s ASK conversational persona,
which draws on a wide range of data sources in real time.

What Drove the Adoption of MemSQL?

The development team needed a fast, scalable database to underpin the diwo platform. The team was initially attracted by the speed and flexibility offered by Redis, an in-memory database that runs properly composed queries quite quickly. They also tried Cassandra, the open source database and data management system.

However, they found difficulty on two sides:

  • Composing useful queries
  • Getting acceptable performance from queries

“The more we dug into it, the more we found that the ability to query is a bottleneck,” said Kallakuri.

The problem that the diwo project encountered is the same problem that users so often find with NoSQL: you’re not made to put structure on your data, which initially seems to make things easier. But then you don’t get the benefit of decades of SQL query optimization that SQL databases inherit and can build on.

MemSQL has additional advantages, offering the best of both worlds along several different axes:

  • SQL vs. scalability. Traditionally, you could either have structure and SQL, in traditional transactional and analytics databases, or you could have scalability across multiple machine instances, in a NoSQL system. MemSQL is a NewSQL system—SQL that scales.
  • Rowstore vs. columnstore. Most databases offer rowstore, which is in-memory friendly for data ingest and transaction processing, or columnstore, which requires the larger capacity of disk to power analytics. MemSQL supports both, with a memory-led architecture that allows you to decide just how many machine instances and how much RAM, SSD, and disk to use for the performance you need.
  • In-memory vs. disk-based. Some databases are in-memory-only or in-memory-mostly, while others prioritize disk-based storage. MemSQL is truly open to both, separately or at the same time, as needed for your data structure and performance requirements.
  • Structured vs. semi-structured and unstructured. SQL databases long forced you to structure data, or pushed you to NoSQL if you had semi-structured data (including JSON, which is more and more popular) or unstructured data. But MemSQL has high-performance support for geospatial data types and JSON data, supporting hybrid data structures such as BLOBs. This allows you to use semi-structured data freely, with performance close to that of structured data, and with everything in one place.

These advantages are part of what attracted diwo to MemSQL. Another aspect is high performance across all of MemSQL’s features. Capabilities such as scanning a trillion rows per second are very useful indeed when you have to offer interactive, conversational-speed responses to business questions that demand complex processing to answer.

“We tried to push MemSQL to the worst possible extent to see if it would break,” said Kallakuri. It didn’t. diwo has had to add functionality on top of MemSQL to support its project needs, mostly around dynamic SQL and dynamic stored procedures. The performance and stability of MemSQL make it a solid base to build on.

What’s Next for diwo?

diwo is just getting started, showing its technology to interested customers—who are generally enthralled. “Nearly every time we do a demo, we get an order,” says Kallakuri, noting that diwo’s Cognitive Decision-Making architecture is industry-agnostic. “We’re in the early stages of working with retail, financial, and automotive organizations, and have taken initial steps into gaming as well.”

Now that MemSQL is part of its tech stack, diwo is likely to keep finding novel ways to use it.

Case Study Update: How Novus Partners Manages $2 Trillion with Help from MemSQL

Feed: MemSQL Blog.
Author: Noah Zucker.

This case study was first published as a video with a brief description. Noah Zucker, Senior Vice President for Technology at Novus Partners, talks about how they use MemSQL for portfolio intelligence, applied to their $2 trillion – yes, two trillion dollars – in assets under management. The case study has proven popular, and the content is still highly relevant today, so we’re releasing a transcript of the content as a blog post.

I’m here today to talk about Novus Partners and how we’re using MemSQL to change how the world invests.

What is Novus Partners? We are a portfolio intelligence company. We provide a platform used by over 100 investment managers – that’s hedge funds, pension funds, large allocators, home offices – to gain better insights on their investments and to better understand risk. Our platform currently encompasses over two trillion dollars in assets under management.

In addition, we have a research platform that our clients use to explore investments from publicly sourced data. Any hedge fund that’s large enough has to file 13F data about their investments. Our users can log in and explore and understand other hedge funds out there, where the risk is, and get trading ideas from that.

How Novus Partners Helps Investors

Essentially, our mission is to help investors discover their true investment acumen – where their true strengths are – and also understand their risk. Our users log in to the platform, the Alpha platform, and they are presented with a series of pages of interactive graphs, charts, and other tools they can use to explore their investments and get deeper insights than they previously had when they were just looking at things in spreadsheets, or at a graph showing simply returns from the last 10 years.

Novus Alpha is near-real-time analytics for portfolio management
The Novus Alpha platform lets investors take a deep dive into investments

You know, lots of these hedge funds have glossy brochures where they show how they beat the market over the last 10 years, but those don’t show the deeper picture of where the returns came from. Maybe they actually have high returns from last quarter – but they also carry a large risk, like an illiquid position, or investments in some area that they probably didn’t truly understand.

So our users are able to gain deeper insights and bring a kind of moneyball approach to investing, whereas the hedge fund industry traditionally has been more sort of gut instinct investing.

Here are some of the clients who use the Novus platform today – some of the top investment managers in the world.

Novus serves top clients needing real-time analytics with MemSQL
Novus clients include top names in investing

Before MemSQL: ETL Headaches

Let’s talk about the story before MemSQL, because Novus Partners didn’t always use MemSQL as its main database backing our investment analytics platform. Before MemSQL, we used MongoDB, and when I joined Novus in 2013 I immediately saw that we had some problems.

As a distributed SQL database, MemSQL meets Novus' needs well
Novus faced big ETL and business headaches before MemSQL

You know, we have a client data team. That’s our team that works with our investor customers. The members of our team are very skilled portfolio analysts themselves. They understand investments. They understand the data.

But they were spending most of their time not actually doing that, but managing our ETL (extract, transform, and load – Ed.) pipeline. What that meant was they had a 24/7 operation. They had to babysit and handhold the ETL process that loaded the metrics into our platform.

If there was a job failure then they’d have to spring into action, shuffling around their data load schedule. In the worst case they’d have to load a large job during the day and that would mean an application slowdown for all of our users while the database was under strain.

So, being the new guy, I asked: why do we only have 12 of these compute nodes, implemented in Scala? Why can’t we just put them out on the cloud, scale out, and have 100 of these blasting through all that data?

The answer I got back was a little bit interesting. What I was told was that they had tried doing that, they tried scaling up, but they really couldn’t go much higher because the database we were using, MongoDB, just couldn’t keep up. So it got to a point where we had to actually investigate making a change.

We either had to do the work to scale out our existing database, or we had to investigate using something else. And of course, that was an opportunity to investigate other technologies, and MemSQL was one of those.

You know, one reason why we decided to make a change was that, while there are well-understood ways to scale out with Mongo, it basically would have meant a full re-write of our application. We’d have to revisit our data model, introduce sharding, and as an application developer you’re now having to think about scalability and that sort of stuff alongside your business logic. So that is a big undertaking.

MemSQL Cuts Load Times by 98%

So this is where MemSQL comes in. This is what our actual data pipeline looks like. As I mentioned, our clients provide us with data in all sorts of formats: flat text files, Excel spreadsheets, even PDF. We scrape data out of the PDFs and load everything through a pretty standard ETL process – just data clean-up – and it’s stored in a persistent store of record.

MemSQL scalability makes it a fast SQL database for Novus
MemSQL cut load times by 98% and simplified the architecture

Then our platform takes that data out of our store of record, sends it into our Scala-based distributed compute layer, and that does the computations of the portfolio analytics and the metrics that I referred to earlier. It caches that in MemSQL so that, when our customers log in, all that data is available to them at their fingertips.

From their perspective these high intensity computations are being done immediately, something that they’re not able to do without us. They might be waiting minutes or hours if they’re trying to crunch those numbers in a spreadsheet or on a traditional database. So that’s of great value to them. And the bottom line for our ETL team was that a typical hedge fund data load went down from 90 minutes to two minutes.

So, even if there’s a failure, we can just re-run it. It doesn’t cause a load on our system intraday. And from a developer’s perspective – and actually, I’m a Scala developer myself, not a data engineer – from my perspective, MemSQL brought a lot of value.

“The Learning Curve is Basically Non-Existent”

Now you hear that we’re moving to a new database, and the first thing you want to know is: what’s its interface? And the answer you get back is that it just uses the MySQL interface. I think that’s something overlooked perhaps in the MemSQL buzz, but as a developer, that’s a huge win.

The learning curve is basically non-existent. You have the whole tool chain available to you. There’s a lot of documentation online. So that’s a great value as a developer.

Support for semi-structured data makes MemSQL a good JSON database and IoT database
Novus is a heavy user of MemSQL’s first-class JSON support

In addition, MemSQL has first-class JSON support. Being a Mongo shop, that’s really important, because we did have a stable schema at the time we were doing the migration. We were able to map a lot of our data to a relational schema, but there were parts that we had to leave in JSON – parts we were still iterating over quickly.

(MemSQL’s JSON support is being used by more and more MemSQL customers as both IoT and the use of time series data for IoT and a range of other purposes, continue to increase. – Ed.)

We wanted to do more rapid development on data that was still changing, so we left it in JSON format. That means there’s a whole lot of code that we don’t have to re-write, that we can just leave as is.

Novus is also known for Salat, an open source Scala case class serialization library developed at Novus that works with Mongo, and we could just leave a lot of that code in place. Moving from Mongo to MemSQL didn’t mean we had to scrap a lot of code. We just left a lot of it as is. So that was a big win for us.

So the bottom line, in terms of MemSQL’s impact on our business: that client data team I mentioned earlier is now focusing more on delivering value to our customers, helping them understand their data and their investments during the data integration process.

MemSQL replaced a slow database with a fast one
MemSQL supports 10x the workers – and cuts ops overhead by more than half

You know, when we have a failure intraday we don’t have that application slowdown. So that’s important. Our end users don’t have these bad days where things are going slow because we had to run a big job during the middle of the day. Our architecture is not limited by the database at this point.

10X the Workers – with No Code Changes

You know, we are able to scale up from just 12 workers to 126. So that’s over a 10X improvement. And, if we want to scale even further, with MemSQL we can just add more servers and we’re not limited by our database architecture.

As an application developer I don’t even have to think about changing my code for this increase in scale. I don’t have to revisit the data model. I just have a set of really well written SQL queries, well indexed.

I just ask the operations team, “We need to add more servers,” and they get on provisioning those and now we have more scale. So it’s really convenient.

Now with Half the Operations Workload

One more thing to mention, from a system administration perspective. Before, on Mongo, we had two full-time sysadmins and an architect putting in a significant amount of time during their week, just the care and feeding of that Mongo operation.

If we had scaled out Mongo, that was going to potentially be more of that type of work. But with MemSQL we now actually have just one DBA and an architect, maybe a few hours a week, if that, working on MemSQL.

For a small company like Novus, less than 100 employees, it really fits our operational model. We don’t have to devote a whole team just to the care and feeding of this data platform. It pretty much takes care of itself, once we have it configured and set up.

You can try MemSQL for free today.

Webinar: Choosing the Right Database for Time Series Data

Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar, MemSQL Product Marketing Manager Mike Boyarski describes the growth in popularity of time series data and talks about the best options for a time series database, including a live Q&A. You can view the webinar and download the slides here.

Here at MemSQL, we’ve had a lot of interest in our blog posts on time series data and choosing a time series database, as well as our O’Reilly time series ebook download. However, this webinar does a particularly good job of explaining what you would want in a time series database, and how that fits with MemSQL. We encourage you to read this blog post, then view the webinar.

Time series data is growing in use because it’s getting easier and cheaper to generate time series data (more and cheaper sensors), transmit it (faster online and wireless connections), store it (better databases), act on it (more responsive websites and other online systems), and report on it (better analytics tools).

Time series databases have lately attracted more interest than graph databases, key-value stores, document stores, and other database types.
In the last twelve months, interest in time series databases has risen sharply.

Time series data is used for device monitoring, for energy systems such as oil wells, for manufacturing, for computer operations, in financial pricing and trading, and for marketing automation. You can use it for alerting, monitoring, and – a usage that’s getting more and more important – for real-time response to all sorts of signals.

Just for one example, an e-commerce site can monitor current sales of hot products. Combining sales trends with relevancy data, the site can offer each visitor the hottest product that they’re most likely to buy.
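
As a hedged sketch of that idea (the scoring rule and all the numbers here are made up for illustration), you could rank each product by its current sales trend weighted by the visitor’s relevancy score:

```python
def recommend(sales_trend, relevancy):
    """Rank products by recent sales momentum weighted by per-visitor relevancy."""
    scores = {p: sales_trend[p] * relevancy.get(p, 0.0) for p in sales_trend}
    return max(scores, key=scores.get)

sales_trend = {"jersey": 950, "cap": 400, "scarf": 120}   # units sold this hour
relevancy = {"jersey": 0.2, "cap": 0.9, "scarf": 0.1}     # this visitor's affinity
recommend(sales_trend, relevancy)  # "cap": 400 * 0.9 beats 950 * 0.2
```

The point is the combination: trend data alone would pick the jersey, but weighting by relevancy picks what this visitor is most likely to buy.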

Time series data goes through a life cycle. It’s generated by software to reflect a real-world event, such as a pressure valve reading or a completed transaction. The data then often goes into a pipeline, to move it on from the issuer and provide functionality such as data recovery and the ability to move to multiple potential consumers.

From the pipeline, the data is then transformed by software – for instance, it can be normalized or reformatted. A series of readings that tend to be several seconds apart, for instance, can be consolidated into a single record per minute, using JSON to handle the resulting semi-structured data. (With MemSQL, this can happen during ingest, via the Pipelines capability, including the use of Pipelines to stored procedures.)
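
As a rough sketch of that consolidation step (names and record shape are hypothetical; in MemSQL this kind of transform could run on ingest via Pipelines), the snippet below rolls second-level readings up into one JSON record per minute:

```python
import json
from collections import defaultdict

def consolidate_per_minute(readings):
    """Group (epoch_seconds, value) readings into one JSON record per minute."""
    by_minute = defaultdict(list)
    for ts, value in readings:
        by_minute[ts - ts % 60].append(value)  # truncate timestamp to the minute
    return {minute: json.dumps({"n": len(vals), "values": vals})
            for minute, vals in by_minute.items()}

readings = [(120, 3.1), (125, 3.3), (181, 2.9)]
consolidate_per_minute(readings)
# {120: '{"n": 2, "values": [3.1, 3.3]}', 180: '{"n": 1, "values": [2.9]}'}
```

Each output record is semi-structured JSON, ready to land in a JSON column.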

MemSQL handles transformation, real-time queries (via rowstore tables) and ad hoc queries (via columnstore tables)
Time series data has its own life cycle.

To effectively support time series, a database needs to meet specific requirements in terms of its ability to support transactions; its scalability; its effectiveness as an operational database; and its usefulness and responsiveness for analytics.

Mike delivers a summary assessment of different kinds of database on each of these axes – transaction support, scalability, operational capabilities, and analytics support. For instance, a NoSQL database is likely to be scalable – but unlikely to be all that useful for analytics, because of the very fact that it doesn’t support SQL.

Legacy databases lack scalability; distributed databases are missing SQL; NewSQL is distributed and has SQL
Different kinds of databases have different
strengths and weaknesses for time series data.

Fanatics, the leader in branded team merchandise from the NBA, NFL, Champions League football (soccer) teams, and many others, is a proud user of MemSQL. They are also a great example of time series data in use.

All the data that Fanatics takes in – from their website, from mobile users, and from point of sale (POS) systems – has time series aspects to it. In the NFL playoffs, for example, sales of team jerseys for the Super Bowl contenders and winners are going to spike.

Fanatics can use time series data in the run-up to the big game to predict jersey sales for the winning team and gear up production accordingly. Heck, maybe they even have advance insight into who’s going to win each Super Bowl – but if so, they haven’t shared it with us.

Time series data comes from multiple sources into the MemSQL-based event-driven architecture
Fanatics’ FanFlow analytics architecture, which ingests
time series data of several kinds, is driven by MemSQL.

The webinar finished with a brief Q&A, including these questions and answers on time series, implementing MemSQL, and MemSQL vs. other databases.

Does MemSQL perform integrity constraints while streaming?
Yes, of course. How much checking you do may depend on how fast the data is coming in, but full checks are supported during streaming ingest.

What do you do if you get bursts of old data? For example, some of our devices don’t have an Internet connection, so the data comes in in bursts.
With MemSQL, you can use the transaction capability to integrate the out-of-sequence data. And you can use tools that come with MemSQL to help you write queries that give you good answers to data series that have gaps.
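
One simple way to sketch the gap-handling idea (done here in application code, with a hypothetical helper; MemSQL’s query tools offer their own approaches) is to forward-fill missing intervals so every minute has a value:

```python
def forward_fill(series, start, stop, step=60):
    """Fill gaps in a sparse {timestamp: value} series by carrying the last value forward."""
    filled, last = {}, None
    for ts in range(start, stop + 1, step):
        if ts in series:
            last = series[ts]   # a real reading arrived for this interval
        filled[ts] = last       # otherwise repeat the most recent reading
    return filled

sparse = {0: 10.0, 120: 12.5}   # the reading at t=60 never arrived
forward_fill(sparse, 0, 180)
# {0: 10.0, 60: 10.0, 120: 12.5, 180: 12.5}
```

Downstream queries then see a complete series even when devices report in bursts.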

How does MemSQL perform against Redis?
Redis has somewhat limited analytics; it’s not really suited to exploratory analytics. To support that, you then have to copy the data into something else. MemSQL avoids that by supporting full SQL analytics directly on the same data it ingests.

Can you say that, using MemSQL, we can skip using data lakes?
Yes, we have a customer that is doing this, using MemSQL instead. However, we also have an HDFS connector, so you can also use Hadoop as a data lake, then move appropriate data – or all the data – into MemSQL.

How is MemSQL compared to Snowflake?

Snowflake is a one-workload environment that’s very good at data warehousing, as is MemSQL. Where we differ is that MemSQL was additionally designed for fast data ingestion, and will give you much better performance. MemSQL also runs many more places – basically everywhere, vs. just two public clouds for Snowflake. Snowflake can also become quite expensive if you run it continually.

Intrigued? It’s easy to learn more. View the webinar, including Q&A, and download the slides here.

DZone Webinar – MemSQL for Time Series, Real Time, and beyond

Feed: MemSQL Blog.
Author: Eric Hanson.

Eric Hanson, Principal Product Manager at MemSQL, is an accomplished data professional with decades of relevant experience. This is an edited transcript of a webinar on time series data that he recently delivered for developer website DZone. Eric provided an architect’s view on how the legacy database limits of the past can be solved with scalable SQL. He shows how challenging workloads like time series and big data analytics are addressed by MemSQL, without sacrificing the familiarity of ANSI SQL. You can view the webinar on DZone.

Time series data is getting more and more interest as companies seek to get more value out of the data they have – and the data they can get in the future. MemSQL is the world’s fastest database – typically 10 times faster, and three times more cost effective, than competing databases. MemSQL is a fully scalable relational database that supports structured and semi-structured data, with schema and ANSI SQL compatibility.

MemSQL has features that support time series use cases. For time series data, key strengths of MemSQL include a very high rate of data ingest, with processing on ingest as needed; very fast queries; and high concurrency on queries.

Key industries with intensive time series requirements that are using MemSQL today include energy and utilities; financial services; media and telecommunications; and high technology. These are not all the industries using MemSQL, but we have a lot of customers in these four in particular, and they all make heavy use of time series data.

Introduction to MemSQL

MemSQL has a very wide range of attributes that make it a strong candidate for time series workloads. You can see from the chart that MemSQL connects to a very wide range of other data technologies; supports applications, business intelligence (BI) tools, and ad hoc queries; and runs everywhere – on bare metal or in the cloud, in virtual machines or containers, or as a service. No matter where you run it, MemSQL is highly scalable. It has a scale-out, shared-nothing architecture.

The MemSQL ecosystem works with Kafka, Spark, Hadoop/HDFS, AWS S3, and SQL outputs

Ingest

The MemSQL database has a very broad ecosystem of tools that work well with it. On the left of the diagram are major data sources that work with MemSQL.
MemSQL has a very high-speed ability to ingest data. That includes fast bulk loading and streaming data with our Pipelines feature.

MemSQL is partnering with a company to build connectors to relational data stores. We support loading and connectivity with data lakes including Hadoop, HDFS, and Amazon S3.

For example, you can stream data in real time into MemSQL using Kafka streaming into pipelines in MemSQL. It’s a very convenient way to load data into MemSQL. Just dump data into a Kafka queue and MemSQL will subscribe to that queue, and be able to load data in real time into tables without writing any code to do that. Also, we support transformations with tools such as Spark and Informatica.

Rowstore and Columnstore

MemSQL stores data in two types of tables: memory-optimized rowstore tables and columnstore tables that combine the speed of memory and the capacity of disk storage.

In-memory rowstore tables are used for extremely low-latency transactions, as well as very fast analytics on smaller quantities of data.

Disk-optimized columnstore tables support tremendous scale – petabyte scale. They include built-in compression on write, analytics and queries against compressed data, and high performance.
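
A toy illustration of why columnar storage compresses so well: values within a column tend to be repetitive or sorted, so even a naive run-length encoding collapses them dramatically (MemSQL’s actual columnstore compression is far more sophisticated; this only shows the principle):

```python
def run_length_encode(column):
    """Compress a column of values into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

column = ["US", "US", "US", "US", "EU", "EU", "APAC"]
run_length_encode(column)  # [("US", 4), ("EU", 2), ("APAC", 1)]
```

Seven stored values become three pairs, and aggregates can even be computed directly over the runs without decompressing.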

Querying and Reporting

Lots of application software can already connect to MemSQL because we support ANSI SQL and use standard relational database connectivity capabilities like ODBC. Pretty much anything that connects to a MySQL system can connect to MemSQL. You can write custom apps that connect to MemSQL easily, with APIs that you’re already accustomed to.

In addition, a broad range of BI and dashboarding tools can connect to MemSQL, such as Tableau, Looker, and MicroStrategy.

MemSQL for Time Series

With these broad capabilities, and some specific features we’ll describe here, MemSQL is a fantastic time series database.

MemSQL has specific features that provide strong time series support

Where MemSQL really shines, and does a fantastic job for time series data, is in those time series use cases where there are one, two, or three of the following requirements:

  • High ingest rate. You’ve got a lot of events per second coming in, or you need to load data extremely fast, perhaps periodically. So high ingest rate.
  • Low-latency queries. You need a very fast, interactive response time for queries. These can come from apps, from BI tools, from ad hoc queries, or a combination.
  • Concurrency needs. MemSQL shines when you have a strong need for concurrency. You simply scale the number of machine instances supporting MemSQL to support as many simultaneous users as you need.

When you have one, two, or three of these requirements, MemSQL is a really good choice for time series data.

ANSI SQL

MemSQL’s built-in support for schema and ANSI SQL is first and foremost among the features that make us a good fit for time series data. SQL provides powerful capabilities for filtering, joining, and aggregating, and you need that kind of capability when processing time series data. SQL support is a general-purpose capability, and it’s needed and applicable for time series as well.

Unlike many databases that more or less specialize in time series data, MemSQL supports transactions. So if you’re doing time series, as with any application, you want your data to be permanent and secure and consistent. Transaction support in MemSQL makes that possible.

Ingest

One of the things that people really love about processing time series data with MemSQL is fast and easy ingest. You can ingest data in multiple ways, including regular INSERT statements, bulk loading, and MemSQL Pipelines. You can use whichever technique is most convenient for you, and performance will be very fast.
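As a sketch of the plain-SQL ingest path (the table and column names here are hypothetical, not from the webinar), an insert that doubles as an upsert looks like standard MySQL-style SQL:

```sql
-- Hypothetical events table; ON DUPLICATE KEY UPDATE turns the insert
-- into an upsert when a row with the same primary key already exists
INSERT INTO events (id, ts, value)
VALUES (42, NOW(6), 3.14)
ON DUPLICATE KEY UPDATE value = VALUES(value);
```

The same statement run from many concurrent connections is essentially the workload used in the 2.85-million-rows-per-second test described below.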

I’d like to give some more detail on MemSQL Pipelines. You can create a pipeline that references a Kafka queue, or a file folder in your Linux file system, or AWS S3, or an Azure Blob store. Then you start the pipeline and we directly load the messages from a Kafka queue, or files that land in the file folder that you pointed us at. You don’t need a fetch-execute loop in your application; we handle that for you with our pipelines approach.

You can also transform the data with pipelines using external scripts written in Python, or any language you want, using a standard interface. It’s convenient to load with pipelines, and you can transform the data if you need to.
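As a hedged sketch (the pipeline name, file paths, and target table are assumptions, and exact options vary by MemSQL version), a filesystem pipeline with an external transform might look like:

```sql
-- Load CSV files dropped into a Linux directory, running each batch
-- through an external transform script before insertion
CREATE PIPELINE events_pipeline AS
LOAD DATA FS '/data/incoming/*.csv'
INTO TABLE events
WITH TRANSFORM ('file:///opt/transforms/clean.py', '', '')
FIELDS TERMINATED BY ',';

START PIPELINE events_pipeline;
```

Once started, the pipeline watches the directory and loads new files as they land; no fetch-execute loop is needed in your application.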

We support transactional consistency in our loader. If you load a file, it’s either all going to load, or none of it’s going to load. Queries will see all of the file or none of it. We support exactly-once semantics with Kafka pipelines.
You can ingest data into MemSQL phenomenally fast. I can’t emphasize this enough.
I got the opportunity to work with our partner Intel using a Xeon Platinum server – a single server with two Xeon Platinum chips, with 28 cores each, and high performance solid state disks (SSDs) attached.

I configured it as a cluster – some people call it a cluster in a box. It had two leaf nodes and one aggregator, but all installed on this same machine.

Then I just loaded data with a driver that was highly concurrent – using lots and lots of concurrent connections that were running inserts and updates, or upserts, simultaneously, and was able to drive 2.85 million rows per second insert and update on a single server.

That is a phenomenal rate, I mean very few applications need much more than that. And, if you need to scale to ingest data faster than that, we can do it. You just have to add more nodes and scale out.

Queries

MemSQL also supports fast query processing with vectorization and compilation. I described earlier how MemSQL is typically an order of magnitude or more faster than legacy, rowstore-oriented databases. One key reason is that we compile our queries to machine code. When a query accesses a rowstore table, it’s executing compiled code directly, rather than being interpreted, as legacy databases do. We also support window functions, which are very helpful for time series data, and I’ll get into that in detail later on.

We support extensibility extensively: stored procedures, user-defined functions, and user-defined aggregate functions. All three of these are useful for managing time series.

MemSQL also has excellent data compression. Time series event data can be very voluminous, and so it’s important to be able to compress data to save storage. MemSQL’s columnar data compression does an excellent job of compressing time series events.

So let’s talk about how to query data effectively with time, when it’s time series data. Window functions allow you to aggregate data over a window of rows. You have to specify your input set of rows by a partition key. Then within each partition, you will have an ordering or sort order.

MemSQL supports SQL window functions that are highly useful for time series

The window functions compute a result for each row within a partition, so the window function result depends on the order of the rows within that partition. As an example, as illustrated on the right above, you may have an input set of rows like that entire larger block. The purple region between start and end is your window, and there’s a concept of the beginning of the window, the end of the window, and the current row within the window, all of which you can specify with the SQL window function extensions.

MemSQL supports a number of different window functions. We support ranking functions, value functions like LAG and LEAD, and aggregate functions like SUM, MIN, MAX, AVG, and COUNT. Then there are special percentile functions to give you a percentile value.

Specific window functions include rank, value, aggregation, and percentile functions
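As a sketch of how these combine (it assumes the simple tick table with ts, symbol, and price columns introduced in the example below), a ranking function and a value function in one query might look like:

```sql
-- RANK orders trades by price within each symbol;
-- LAG fetches the previous trade's price in time order
SELECT symbol, ts, price,
       RANK() OVER (PARTITION BY symbol ORDER BY price DESC) AS price_rank,
       LAG(price)  OVER (PARTITION BY symbol ORDER BY ts)    AS prev_price
FROM tick;
```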

MemSQL for Time Series Examples

I’m going to give a simple example of a time series application, based on a tick stream. Imagine that you have a very simple financial tick stream stored in a table called tick, which has a high-resolution DATETIME(6) timestamp.

MemSQL supports a DATETIME(6) or TIMESTAMP(6) type, which has six digits after the decimal point. So it supports resolution down to the microsecond, which is the high-resolution timestamp you may need for some of your time series applications.

In this example table, I’ve got a high resolution timestamp, then a stock symbol, and then the trade price. This is oversimplified, but I just want to use this simple example for some future queries I’m going to show you.
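A minimal sketch of such a table (the column names, types, and key are assumptions based on the description above, not the webinar’s exact DDL):

```sql
CREATE TABLE tick (
  ts     DATETIME(6)    NOT NULL,  -- microsecond-resolution timestamp
  symbol VARCHAR(8)     NOT NULL,  -- stock symbol
  price  DECIMAL(18, 4) NOT NULL,  -- trade price
  KEY (symbol, ts)                 -- supports per-symbol, time-ordered queries
);
```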

Data for the MemSQL tick stream table example

One important thing that you may need to do with time series data is to smooth the data. So you might have ticks coming in, or you might have events coming into a time series. Perhaps there’s a lot of noise in those events, and the curve looks jagged; you want to smooth it out for easier display and easier understanding for your users.

You can do that with window functions in MemSQL. This is an example of a query that computes the moving average over the last four entries in a time series. In this case, the example is just pulling out the stock ABC, and we want to know what the stock symbol is, what the timestamp is, the original price, and the smoothed price. You can see in the tabular output below how the price moves up and down, a little bit choppy. Then the smoothed price averages over a sliding window between the three preceding rows and the current row.

Using averaging to smooth time series data in MemSQL

So that’s how we get the four entries that we’re averaging over for the smoothed price. You can define that window any way you want. You could average over the entire partition, or over rows preceding the current row, or from rows before the current row to rows after it. You can define the window however you like.
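The smoothing query described above can be sketched as follows (it assumes the simple tick table with ts, symbol, and price columns from the example):

```sql
-- Moving average over the current row plus the three preceding rows,
-- in timestamp order, for the stock ABC
SELECT symbol, ts, price,
       AVG(price) OVER (
         ORDER BY ts
         ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
       ) AS smoothed_price
FROM tick
WHERE symbol = 'ABC'
ORDER BY ts;
```

Changing the ROWS BETWEEN clause changes the window frame, so the same pattern handles any of the framing variants just described.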

Another important operation that people want to perform on time series data is to aggregate over well-defined time buckets. This is called time bucketing. Maybe you want to convert an irregular time series to a regular one. An irregular time series has entries at irregular intervals. They may arrive at random moments, so you might not have, say, one entry every second; you might have, on average, one entry every half second, with arrivals following a statistical process such as a Poisson process.

You can group by time to create time bucketed records

You might have several seconds between arrivals of Ticks, and you might want to convert that so you’ve got one entry every second. That’s something you can do with time bucketing.

Another application of time bucketing is if you may want to convert a high resolution time series to a lower resolution. For example, you might have one entry per second, and you might want to convert it to have one entry per minute. That’s reducing the resolution of your time series.

There are a couple of different ways you can do time bucketing in MemSQL. One is to group by a time expression. For example, consider a query that says SELECT ts :> DATETIME and so on. The expression ts :> DATETIME is a typecast in MemSQL. It converts that timestamp to a DATETIME, not a DATETIME(6). So it will have a resolution of a single second; we won’t have fractional seconds.

You can convert to DATETIME, take an aggregate, then group by the first expression, and order by the first expression. That’ll convert a high-resolution time series to a one-second-granularity time series.
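A sketch of that group-by-typecast approach (again assuming the tick table from the example):

```sql
-- ts :> DATETIME drops the fractional seconds, so grouping on it
-- produces one aggregated row per second
SELECT ts :> DATETIME AS second_ts,
       AVG(price)     AS avg_price
FROM tick
WHERE symbol = 'ABC'
GROUP BY ts :> DATETIME
ORDER BY ts :> DATETIME;
```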

You can also use user-defined functions to do time bucketing. I’ve written a time bucket function which takes two arguments. The first argument is a time pattern that says something like one second, one minute, three seconds, et cetera. So you can define with a phrase what your time bucket is. The second argument is the timestamp.

You can do this with an expression in the query, but just as a simplification, I’ve written this time bucket user-defined function and there’s a blog post that’ll be coming out soon. (For more detail, and for an additional example with candlestick charts, view the webinar. We will link to the more detailed blog post when it’s available – Ed.)
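Usage of such a function might look like this (time_bucket here is the speaker’s user-defined function, not a MemSQL built-in; only its two-argument signature is as described above):

```sql
-- One-minute buckets: the first argument names the bucket width,
-- the second is the timestamp column
SELECT time_bucket('1 minute', ts) AS bucket_ts,
       MIN(price) AS low,
       MAX(price) AS high
FROM tick
WHERE symbol = 'ABC'
GROUP BY 1
ORDER BY 1;
```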

More on MemSQL for Time Series

MemSQL can solve time series development challenges for you in a number of different ways, because we have incredibly high performance through our scale-out, our query compilation, and our vectorization. Your time series queries will run fast and we can scale to handle large workloads and large data volumes.

Also, we support very fast ingest of time series events through just regular insert statements, upserts, load, or pipelines. Also, we support powerful SQL Window function extensions that are terrific for time series. They’re fully built into MemSQL, as native implementations. If you want, you can make your time series processing a little bit easier using user-defined functions, and user-defined aggregate functions, and stored procedures, just to add more power and flexibility to your time series processing.

Finally, MemSQL provides all the capabilities of a distributed SQL DBMS. So if you’re going to build a time series application, you may have a choice between using a purpose-built time series database that’s specific to time series, or a general purpose database like MemSQL.

If you choose a general purpose database, you’re going to have a lot of extra benefits that come with that, including SQL transactions, backup and restore, full cluster management, rowstore, indexes on the rowstore, columnstore, concurrency, high availability. You can handle general purpose applications that are transactional. You can handle analytical applications like data warehouses, data marts, operational data stores, or analytical extensions to your operational apps.
You have full SQL capability to do outer joins and other kinds of things that may be difficult to do in a time series specific database. So the generality of MemSQL can handle a lot of different application needs. Moreover, MemSQL can still handle time series really effectively with powerful Window functions and user-defined extensions.

We invite you to try MemSQL today, if you haven’t tried it already. You can download MemSQL today. MemSQL can be used free in production for up to 128GB of RAM capacity; if some of your data is in columnstore format, that can represent hundreds of gigabytes of total database size. Or, you can use our Enterprise trial, which has unlimited scale, for up to 30 days for development.

Q&A

Q: There are a few technologies out there that build time series functions on top of PostgreSQL. So the general question is, what would make the MemSQL offering different vs. something that is implemented on top of a PostgreSQL-based solution?

A: The main difference is the performance that you’re going to get with MemSQL for ingest and for large-scale query. So a PostgreSQL-oriented system, depending on which one you’re looking at, it may not provide scale-out. Also, MemSQL has an in-memory rowstore and a disk-based columnstore. We compile queries to machine code and use vectorization for columnar queries. We have a higher-performance query processing engine than you might get for standard query processing operations on PostgreSQL.

Q. What would be some differences or advantages between a NoSQL-based time series implementation, versus something that MemSQL offers?

A. Standard NoSQL systems don’t necessarily have special time series support, and they often don’t support full SQL. So one thing that you find different about MemSQL is that we do support full SQL and Window functions, which are good for processing time series style queries. So that’s the main difference that I see.

Q. What are techniques that can allow MemSQL to extend the throughput from an ingest perspective?

A. You saw, earlier in the talk, I showed ingesting 2.85 million rows per second into a single server transactionally. So that’s a pretty phenomenal rate. As I said, if you need to scale more than that because we support scale-out, you can add more servers and we can load data – basically limited only by how much hardware you’re providing. You can add extra aggregators and extra leaves. If you’re inserting data directly by adding aggregators, you can increase the insert rate that you can handle. Also, our pipelines capability is fully parallel, at the leaf level. So if you want to do ingest through pipelines, you can scale that out by adding more leaves to your cluster.

In addition, our loader is pretty high performance and we’ve done some work in that area recently. In the last release, the 6.7 release, we introduced dynamic data compression into our loader – because when you’re loading data, you may need to shuffle the data across your nodes as part of that operation.

We have an adaptive data compression strategy that compresses data as it’s flowing between the leaves over our internal network, if we’re bandwidth limited on the network. So there’s a bunch of techniques you can use to increase the performance of MemSQL for loading by scaling. Then just the implementation of our load is pretty high performance, through techniques like compilation to machine code, and also dynamic compression.

Q. Are there limitations to pipelines in terms of the number of concurrent, or parallel pipelines used. Is there any sort of physical limitation there?

A. I’m not aware of any limitations other than the capacity of your cluster, the hardware capacity of the cluster. I don’t think there are any fixed limits.

Case Study: Wealth Management Dashboards Powered by MemSQL


Feed: MemSQL Blog.
Author: Floyd Smith.

In this case study, we describe how MemSQL powers wealth management dashboards – one of the most demanding financial services applications. MemSQL’s scalability and support for fast, SQL-based analytics, with high concurrency, mean that it’s well-suited to serve as the database behind these highly interactive tools.

Dashboards have become a hugely popular technique for monitoring and interacting with a range of disparate data. Like the dashboard in a car or an airplane, an effective dashboard consolidates data from many inputs into a single, easy-to-understand display that responds instantly to both external conditions and user actions. MemSQL is widely used to power dashboards of many different kinds, including one of the most demanding: wealth management dashboards for families, individuals, and institutions.

Banks and other financial services companies work hard to meet the needs of these highly valuable customers. These highly desired customers have correspondingly high expectations, and hold those who provide them services to a very high standard.

Data is the lifeblood of financial services companies. More than one bank has described itself as “a technology company that happens to deal with money,” and many now employ more technology professionals than some large software companies. These financial institutions differentiate themselves on the basis of the breadth, depth, and speed of their information and trading support. So wealth management dashboards offer an important opportunity for these companies to provide the highest possible level of service and stand out from the competition.

A wealth management dashboard from a specialist provider
A wealth management example from Private Wealth Systems.

The Wealth Management Opportunity

Technology allows wealth management divisions of financial services companies to offer a previously unheard-of level of high-touch, custom services to customers. Little more than a decade ago, much of the banking business was transacted by telephone and paper mail. Financial reporting, even for quite wealthy clients, was in the form of a quarterly statement, compiled and mailed several weeks after the close of a quarter.

The proliferation of digitized financial services has made the old way of doing things insufficient – and also offered an alternative. The financial options available to high-net-worth clients are almost unlimited, and the complexity of one well-off person’s or family’s portfolio may rival what was once considered normal for a medium-sized company.

Reporting and regulatory requirements have also exploded, especially since the financial crash of 2007-2009 – when the S&P 500, a broad and “safe” index of stocks, lost more than 50% of its value. Banks are responsible for everything they recommend to clients, as well as being held liable for some recommendations or warnings that they don’t offer. Some clients have successfully sued to recover a large portion of serious financial losses, if the financial institution(s) involved have made actionable mistakes or failures to disclose relevant information.

The size of both the opportunity, in terms of varied investment options, and the risk, from market losses, loss of customers, and legal and regulatory exposure, makes customers and those serving them both eager and anxious. Timely, accurate, and voluminous information helps both sides of this charged investment environment: providing accurate information to clients increases their opportunities for productive action and reduces the risk of misunderstandings and mistakes that can prove very expensive for all concerned.

Financial service providers combine “make” and “buy” approaches in delivering wealth management systems. At the “buy” end, companies like Private Wealth Systems deliver purpose-built wealth management systems. These can be provided to end clients directly, or through white labeling under a financial institution’s brand. The financial institution can endeavour to provide more or less added value on top of the third-party service.

At the “make” end, a financial services institution can create a service all its own – designing, architecting, building, and running it in-house. Such a service will have privileged access to a financial services company’s proprietary offerings, but must also provide information – if not trading capability – for externally controlled assets.

The Requirements and Technical Architecture of Portfolio Dashboard Platforms

Providing portfolio management opportunities to wealthy individuals occurs in two steps: the creation of a portfolio dashboard platform, and extending that platform to the individuals and families being served.

A portfolio dashboard solution can be thought of as a kind of command center for all the financial information flows available to anyone, anywhere in the world. A financial services company’s clients are likely, as a group, to be invested in nearly every possible kind of investment, in every country in the world. And their asset allocations are likely to shift continuously as they enact complex trading strategies to reduce risk and maximize reward.

So the platform itself must receive all relevant financial data, with no lags or delays, and process it as needed. For instance, a customer may have purchased a financial services company’s index fund, made up of many individual investments. Calculations of the index’s current value and its risk exposures must be made accurately and instantly.

The platform must also serve its clients in multi-tenant fashion – the same software platform must support many customers at once, smoothly and efficiently, with no lags, delays, or barriers.

Traditional software architectures face many barriers in meeting demands of this kind. It’s become the norm, for example, for one system to bring data in – often in batch mode, imposing delays – and process it. An extract, transform, and load (ETL) process then “lifts and shifts” the data to a data warehouse system, which supports apps, business intelligence (BI) tools, and ad hoc queries. Note that the data warehouse is now separated from incoming data by a batch process and an ETL process – totalling many minutes, or even several hours, in a financial environment where competitive advantage is measured in fractions of a second.

Traditional dashboards depend on multiple databases and a complex data transformation process
Traditional dashboards are slowed by a complex ETL process.

The Challenges of Operating Portfolio Dashboards Today

The goal of a system serving portfolio dashboards is to provide end-to-end real-time decisioning. Streaming data powers this capability, requiring data streams from a very wide range of reporting sources. The data is processed as it arrives and is immediately made available to users. Traditional systems are inherently unable to meet these requirements.

The technical requirements for a portfolio dashboard include:

  • Fast, lock-free transactions. There can be no delays in, or barriers to, bringing incoming data into the system, while also ensuring precision and accuracy of the data.
  • Fast, scalable data processing. Data may be normalized, calculated, merged, or combined to create output data streams.
  • Fast query response. Specific queries – whether generated by an app, a BI tool, or an ad hoc user query – must be answered in tens of milliseconds. The number of queries that can be answered per second gates the performance of apps, BI tools, and analysts.
  • Concurrency. High concurrency is a key requirement for any multi-tenant system. Given the market demands placed on portfolio dashboards, when markets shift, dramatically increased utilization of the dashboard occurs. Given the importance of the insights provided by the system during fast-changing market events, as many customers as reasonably possible must be supported, with no delays.
  • SQL-compliant. Custom apps, BI tools, and ad hoc queries from savvy individuals nearly all use SQL as a lingua franca for communication across the boundaries of messaging systems and data stores. Non-SQL systems impose delays and heavy development burdens on app developers, BI tool providers, and end users.

Underlying all of these changes is a step change in the characteristics of the user base. First, the number of people who are wealthy has increased sharply. In addition, more and more of them are “digital natives” – people who either grew up in an always-online environment, or who have adapted rapidly to the new digital reality. These users expect constant access and constant interaction, whenever they feel the need for it. As a result, even the wealth management divisions of banks, where the number of clients numbers in the thousands, rather than the millions, are facing what IT providers used to refer to as “webscale” problems.

In addition, all of these requirements are only growing more stringent as new query profiles arrive, driven by predictive analytics, machine learning, and AI. These algorithms differentiate themselves, and their providers, by running more times per second against more streams of data, and providing results faster – further straining the platform.

The additional workload can push response times up for all users of a system, whether their own requirements are simple or demanding. And slow ingest or slow data processing can mean the difference between an actionable insight and watching an asset’s value crash, long seconds behind other market actors.

The Case of the Single Slow Query

In one telling incident, a high net worth individual became accustomed to the fast data updates and sub-second response times that were reliably provided by their wealth management dashboard. But then one query went awry. It took six seconds for the customer to get a response. Six… long… seconds.

The customer complained. His bank scrambled to find out what happened. They pinned down the problem, fixed it, and showed the customer how such a delay would never darken their day again.

But it wasn’t enough. A rival promised the customer uninterrupted access to their portfolio, with sub-second response times – 24 hours a day, seven days a week, forever. The customer changed banks.

The key to solving these problems is in a reliable high performance data architecture, followed by relentless tuning and quality control.

How To Improve Performance, Reduce Latency, and Eliminate Complexity with NewSQL

Wealth management systems are only one example – if a somewhat extreme one – of increasing demands on messaging, data processing, and query responsiveness. For wealth management systems, the challenges involved are highlighted by the demands of the clients that consume these services. But versions of these challenges have arisen – and, in many cases, not been met – for years, across a wide variety of use cases.

One crucial underlying problem is that traditional relational database systems mostly run on a single node. That is, their core process can only run on one machine at a time. So they can scale up – that is, they get faster if the single machine they run on is replaced by a more powerful one. But they can’t efficiently scale out; that is, they can’t quickly and cost-effectively use the power of multiple servers, yoked together, to deliver faster performance or support more concurrent users.

NewSQL is a new class of databases that combines the scalability of NoSQL with the schema and SQL support of traditional databases. This kind of software is hard to create, and the category is still maturing. Some of the leading NewSQL offerings are limited to a specific cloud provider, for instance. MemSQL is the leading database of its kind: a platform-independent, fully scalable, relational solution that fully supports schema and ANSI SQL.

NewSQL databases such as MemSQL provide the best of both worlds - legacy relational SQL databases and NoSQL
Some institutions have tried NoSQL as a step to move on from legacy relational databases.

This slide summarizes the differences among the classes of databases. NoSQL, a newer category, suffers on transactions, operational support, and SQL capabilities. Legacy relational databases have just about everything wanted, except the scalability needed for so many systems today. NewSQL combines the best of both.

Caching as a Limited Solution

Wealth management dashboards are one of the most intense examples of a problem that has plagued the database world ever since relational databases standardized around SQL in the 1970s and 1980s. These databases were relational, and fast for small and medium-sized data loads. But they were not horizontally scalable; at the core, performance was restricted by the capabilities of the most robust single server that can be brought to bear on the problem.

The companies that offered these database systems were largely unable to work past these limits. (Oracle’s RAC offering is a brave, but expensive and fragile, attempt to do so.) What the industry tried to offer, instead of a scalable solution, was in-memory caches. RAC is an example of a database provider offering a caching-based solution. Numerous third parties also offered caches to bolt onto existing, single-node database solutions.

Unfortunately, such caches bring with them several problems:

  • Unexpectedly slow performance. The increasing demands of users lead to increasing numbers of cache misses. A cache miss is more expensive than a direct read from disk, with no cache at all. It doesn’t take too many cache misses to render a cache counterproductive.
  • Response delays. Even the suspicion of a cache becoming stale leads to the cache being dumped and reloaded. This process causes a delay for all processing, again in excess of the time required for a direct read.
  • SQL breakage. SQL queries that produce an optimized response from a disk-based system, or an even faster response from an entirely in-memory system, produce long waits or even fail when some of the answer is in the cache and some isn’t.
  • Incorrect results. Caching creators face a tough trade-off between allowing a relatively small number of potentially incorrect results and frequently jettisoning cache contents in favor of a cache reload. It’s all too easy for these design choices, or unanticipated conditions, to lead to some number of incorrect results.

The answer to these problems is a system – and, in particular, a query processor – built from the ground up to make smart decisions about in-memory, on-disk, and cached data. Legacy database providers have not made the investments required for this kind of holistic solution. It’s been left to new relational database providers, described by the label NewSQL, to find new ways to offer a relational database that supports SQL and works flexibly with disk and memory.

Modernizing Wealth Management Dashboards with Kafka and MemSQL

A wealth management dashboard must incorporate many streams of data – structured data such as account records, time series data such as stock market updates, and unstructured data such as video feeds. And new feeds may need to be added at any time.

These requirements call for a standardized messaging interface within the data architecture, and Kafka can provide this.

Kafka is now widely used as a messaging queue within data architectures, and it integrates well with MemSQL. MemSQL then interacts with a standardized input source, simplifying system design and reducing operational burdens.

Among the desirable attributes of Kafka that work well with MemSQL:

  • Streaming data support. Kafka can be used in either asynchronous (batch) or synchronous (streaming) mode to accommodate stock position data, news, and valuable research data. MemSQL’s high performance supports both options well.
  • Distributed. Like MemSQL, Kafka is distributed, so it’s scalable. As a result, Kafka can handle scale and bursts of data without incurring costly offline re-sharding or shuffling.
  • Persistent. Kafka is resilient against data loss, with the ability to copy data into one or more stores before successful receipt of the data is acknowledged. MemSQL is also fully persistent, even for rowstore tables, making the “chain of custody” for data much easier to manage.
  • Publish-subscribe model. Kafka can accommodate a wide range of data inputs (publishers) and data consumers (subscribers). MemSQL then sees a simplified range of inputs, as they’re mostly or entirely coming in through Kafka, and has a robust ability to support analytics outputs, due to its native support for SQL.
  • Exactly-once semantics. Kafka can be used to guarantee that data is accepted into a Kafka pipeline once and only once, eliminating duplicates or incomplete data. MemSQL works with Kafka to help provide end-to-end exactly-once guarantees.
  • “Source of truth.” Kafka’s attributes make it a good candidate as a source of truth for incoming data that may then be divided among other processes and data stores. MemSQL can ingest most, or all, of the data streams distributed in Kafka.

The wealth management use case is a good fit for MemSQL’s Pipelines feature. Data feeds from Kafka – or AWS S3, Hadoop/HDFS, and other sources – can be ingested with a simple CREATE PIPELINE command. Ingest is scalable, distributed, and can be fed directly into a memory-optimized rowstore, a disk-optimized columnstore, or both at the same time, delivering ingest rates in the millions of events per second.

The code to create a pipeline is quite simple. In this example, from a recent Kafka and MemSQL webinar, a pipeline is created to load tweets from Twitter into a table:

CREATE PIPELINE twitter_pipeline AS
LOAD DATA KAFKA "public-kafka.memcompute.com:9092/tweets-json"
INTO TABLE tweets
WITH TRANSFORM ('/path/to/executable', 'arg1', 'arg2')
(id, tweet);

START PIPELINE twitter_pipeline;

The combination of Kafka data pipelines and MemSQL’s Pipelines for data ingest makes the architecture supporting the dashboard display very simple. (Even though the overall architecture of the system may be complex, with different data sources, each of which requires more or less processing.)

Kafka and MemSQL combine to simplify core architecture.
The core wealth management dashboard architecture is simple and easy to manage.

MemSQL, The Secret Ingredient for Reliably Fast Wealth Management Dashboards

MemSQL, along with Kafka pipelines, has been selected and implemented to power wealth management dashboards at a Top 10 US financial institution – one of the five such institutions that have already adopted MemSQL.

As in other implementations, MemSQL is paired with Kafka as a kind of messaging bus. Because MemSQL can handle all the inputs, process them as needed, and support a very wide range of user demands – with high concurrency – the MemSQL customer is able to cost-effectively provide an outstanding level of service.

Requirements are strict – stricter in the private wealth management area that supports wealth management dashboards than in the general banking part of the business. In the private wealth management area only, the customer is willing to overprovision systems in order to give MemSQL all the resources it needs for optimal performance.

In this environment, MemSQL must reliably meet query service level agreements (SLAs) of less than a quarter of a second, even while ingesting batch loads of new data under heavy and variable load. The customer also demands low variance – no single query can take much longer than the SLA, even if the average query time stays low.

MemSQL has not only met these strict requirements for several years running; it is also expanding its footprint within this Top 10 institution. At the same time, MemSQL adoption is growing right across the financial services industry.

If you are providing wealth management dashboards, similar financial services, or implementing other kinds of dashboards, this case study may demonstrate that MemSQL can serve as an important part of your solution.

Case Study: Improving Risk Management Performance with MemSQL and Kafka


Feed: MemSQL Blog.
Author: Floyd Smith.

Risk management is a critical task throughout the world of finance (and increasingly in other disciplines as well). It is a significant area of investment for IT teams across banks, investors, insurers, and other financial institutions. MemSQL has proven to be very well suited to support risk management and decisioning applications and analytics, as well as related areas such as fraud detection and wealth management.

In this case study we’ll show how one major financial services provider improved the performance and ease of development of their risk management decisioning by replacing Oracle with MemSQL and Kafka. We’ll also include some lessons learned from other, similar MemSQL implementations.

Starting with an Oracle-based Data Warehouse

At many of the financial services institutions we work with, Oracle is used as a database for transaction processing and, separately, as a data warehouse. In this architecture, an extract, transform, and load (ETL) process moves data between the operational database and the analytics data warehouse. Other ETL processes are also typically used to load additional data sources into the data warehouse.

The customer started with a complicated and slow architecture based on Oracle
The original architecture was slowed by ETL processes that ran at
irregular intervals and required disparate operations skills

This architecture, while functional and scalable, is not ideal to meet the growing concurrency and performance expectations that risk management systems at financial institutions need to meet. MemSQL customers have seen a number of problems with these existing approaches:

  • Stale data. Fresh transaction data that analytics users want is always a batch load (into the transaction database), a transaction processing cycle, and an ETL process away from showing up in the OLAP database.
  • Variably aged data. Because there are different data sources with different processing schedules, comprehensive reporting and wide-ranging queries might have to wait until the slowest process has had a chance to come up to date.
  • Operational complexity. Each ETL process is its own hassle, taking up operators’ time, and confusingly different from the others.
  • Fragility. With multiple processes to juggle, a problem in one area causes problems for all the analytics users.
  • Expense. The company has too many expensive contracts for databases and related technology and needs too many people with varied, specialized skills in operations.

What’s Needed in a Database Used for Risk Management

The requirements for a database used to support risk management are an intensification of the requirements for other data-related projects. A database used for risk management must power a data architecture that is:

  • Fast. Under intense regulatory pressure, financial services companies are responsible for using all the data they have in their possession, now. Slow answers to questions are not acceptable.
  • Up-to-date. The common cycle of running data through an OLTP database, an ETL process, and into an OLAP database / data warehouse results in stale data for analytics. This is increasingly unacceptable for risk management.
  • Streaming-ready. There is increasing pressure for financial services institutions to stream incoming data into and through a database for immediate analytics availability. Today, Kafka provides the fast connections; databases must do their part to process data and move it along smartly.
  • High concurrency. Top management wants analytics visibility across the entire company, while more and more people throughout the company see analytics as necessary for their daily work. This means that the database powering analytics must support large numbers of simultaneous users, with good responsiveness for all.
  • Flexible. A risk management database may need to be hosted near to where its incoming data is, near to where its users are, or a combination. So it should be able to run in any public cloud, on premises, in a container or virtual machine, or in a blended environment to mix and match strengths, as needed to meet these requirements.
  • Scalable. Ingest and processing requirements can grow rapidly in any part of the data transmission chain. A database must be scalable so as to provide arbitrarily large capacity wherever needed.
  • SQL-enabled. Scores of popular business intelligence tools use SQL, and many users know how to compose ad hoc queries in SQL. Also, SQL operations have been optimized over a period of decades, meaning a SQL-capable database is more likely to meet performance requirements.

Two important capabilities for a risk management system highlight the importance of these valuable characteristics in the database driving the risk management system.

The first area is the need for pre-trade analysis. Traders want active feedback to their queries about the risk profile of a trade. They – and the organization – also need background analysis and alerting for trades that are unusually risky, or beyond a pre-set risk threshold.

Pre-trade analysis is computationally intense, but must not slow other work. (See “fast” and “high concurrency” above.) This analysis can be run as a trade is executed, or can be run as a precondition to executing the trade – and the trade can be flagged, or even held up, if the analysis is outside the organization’s guidelines.

What-if analysis – or its logical complement, exposure analysis – is a second area that is highly important for risk management. An exposure analysis answers questions such as, “What is our direct exposure to the Japanese yen?” That is, what part of our assets are denominated in yen?

It’s equally important to ask questions about indirect exposure – all the assets that are affected if the yen’s value moves strongly up or down. With this kind of analysis, an organization can avoid serious problems that might arise if its portfolios, as a group, drift too strongly into a given country, currency, commodity, and so on.

A what-if analysis addresses these same questions, but makes them more specific. “What if the yen goes up by 5% and the Chinese renminbi drops by 2%?” This is the kind of related set of currency movements that might occur if one country’s economy heats up and the other’s slows down.

These questions are computationally intense, require wide swaths of all the available data to answer – and must be able to run without slowing down other work, such as executing trades or powering real-time analytics dashboards. MemSQL characteristics such as speed, scalability, and support for a high degree of concurrency allow these risk management-specific needs to be addressed smoothly.
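To make the two analyses concrete, here is a minimal sketch in Python, with SQLite standing in for the operational database. The table, holdings, values, and shock percentages are all hypothetical, not drawn from the case study.

```python
import sqlite3

# Hypothetical portfolio positions (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE positions (asset TEXT, currency TEXT, value_usd REAL)")
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [("Toyota bonds", "JPY", 500.0),
     ("Sony equity", "JPY", 300.0),
     ("Tencent equity", "CNY", 400.0),
     ("US Treasuries", "USD", 800.0)],
)

# Direct exposure: what share of assets is denominated in each currency?
exposure = dict(conn.execute(
    "SELECT currency, SUM(value_usd) FROM positions GROUP BY currency "
    "ORDER BY currency"))
print(exposure)  # {'CNY': 400.0, 'JPY': 800.0, 'USD': 800.0}

# What-if scenario: yen up 5%, renminbi down 2%.
shocks = {"JPY": 0.05, "CNY": -0.02}
impact = sum(value * shocks.get(currency, 0.0)
             for currency, value in conn.execute(
                 "SELECT currency, value_usd FROM positions"))
print(round(impact, 2))  # 32.0 -> net portfolio impact of the scenario
```

Direct exposure is a simple GROUP BY over positions; the what-if scenario reprices the same positions under the assumed currency moves. At production scale, the same queries run over millions of positions, which is where speed and concurrency matter.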

Improving Performance, Scale, and Ease of Development with MemSQL

Oracle, and other legacy relational databases, are relatively slow. They can only serve as an OLTP or OLAP database (not both in one); they do not support high concurrency or scale without significant added cost and complexity; and they require specialized hardware for acceleration. These legacy relational databases are also very expensive to license and operate compared to modern databases.

Oracle has worked to address many of these problems as its customers’ needs have changed. Its single-node architecture can partly meet the scalability requirement by scaling up, albeit with Exadata, a massively expensive and hard-to-manage system. Oracle also meets the SQL requirement, which gives it an advantage over NoSQL systems – but not over modern “NewSQL” databases like MemSQL.

After due consideration, the customer chose to move their analytics support from an Oracle data warehouse to an operational database running MemSQL.

The Solution: A Database for Operational Analytics

To address the challenges with an Oracle-centric legacy architecture, one company we work with decided to move to operational analytics. An operational approach to analytics puts all the data that’s needed by the company on an ongoing basis into a single data store and makes it available for rapid, ongoing decision-making.

This approach also seeks to reduce the lag time from the original creation of a data item to its reflection in the operational data store. As part of this effort, all messaging between data sources and data stores is moved to a single messaging system, such as Apache Kafka. ETL processes are eliminated where possible, and standardized as loads into the messaging system where not.

The operational data store does a lot – but not everything. It very much supports ad hoc analytics queries, reporting, business intelligence tools, and operational uses of machine learning and AI.

What it doesn’t do is store all of the data for all of the time. There are cost, logistical, and speed advantages to not keeping all potentially relevant company data in the operational data store.

Non-operational data is either deleted or – an increasingly common alternative – batch loaded into a data lake, often powered by Hadoop/HDFS, where it can be stored long-term, and also plumbed as needed by data scientists.

The new architecture replaces Oracle with MemSQL, speeding analytics
The new architecture has fast and robust support for analytics and data science users

The data lake also serves a valuable governance function by allowing the organization to keep large amounts of raw or lightly processed data, enabling audits and far-reaching analytical efforts to access the widest possible range of data, without interfering with operational requirements.

MemSQL is well suited for operational analytics. It offers fast ingest via its Pipelines feature, and it can handle transactions on data coming in via a pipeline – either directly, for lighter processing, or through Pipelines to stored procedures for more complex work. Stored procedures add capability to the ingest and transformation process.

MemSQL can support data ingest, transactions, and queries against the operational data store, all running at the same time. Because it’s a distributed system, MemSQL can scale out to handle as much data ingest, transformational processing, and query traffic as needed.

A separate instance of MemSQL can also be used for the data lake, but that function is more often handled by Hadoop/HDFS or another system explicitly designed as a data lake.

Implementing MemSQL as an Operational Data Warehouse

The financial services company described above wanted to significantly improve their portfolio risk management capabilities, as well as other analytics capabilities. They also wanted to support both real-time operational use and research use of machine learning and AI.

In support of these goals, the company implemented an increasingly common architecture based on three modern data tools:

  • Messaging with Apache Kafka. The company standardized on Kafka for messaging, speeding data flows and simplifying operations.
  • Analytics database consolidation to MemSQL. A single data store running on MemSQL was chosen as the engine and source of truth for operational analytics.
  • Standalone data lake with Apache Hadoop. The data lake was taken out of the operational analytics flow and used to store a superset of the operational data.

As you can see, the core of the architecture became much simpler after the move to MemSQL as an operational data warehouse. The architecture is made up of four silos.

Inputs

Each operational system, every external data source, and each internal source of behavioral data outputs to the same destination – a data streaming cluster running a Kafka-based streaming platform from Confluent.

Streaming Data Ingestion

The data streaming cluster receives all inputs and sends data to three different destinations:

  • Operational data warehouse. Most of the data goes to the operational data warehouse.
  • Data science sandbox. Some structured and semi-structured data goes to the data science sandbox.
  • Hadoop/HDFS. All of the data is sent to Hadoop/HDFS for long-term storage.

Data Stores

MemSQL stores the operational data warehouse and the data science sandbox. Hadoop/HDFS holds the data lake.

Queries

Queries come from several sources: ad hoc SQL queries; business apps; Tableau, the company’s main business intelligence tool; Microsoft Excel; SAS, the statistics tool; and data science tools.

Benefits of the Updated Data Platform

The customer who implemented risk management and other analytics, moving from ETL into Oracle to Kafka, MemSQL, and Hadoop, achieved a wide range of benefits.

They had begun with nightly batch loads for data, but needed to move to more frequent, intraday updates – without causing long waits or delays in analytics performance. For analytics, they needed sub-second response times for dozens of queries per second.

With MemSQL, the customer was able to load data in as soon as it became available. This led to better query performance, with query results that include the latest data. The customer has achieved greater performance, more uptime, and simpler application development. Risk managers have access to much more recent data.

Risk management users, analytics users overall, and data scientists share in a wide range of overall benefits, including:

  • Reduction from Oracle licensing costs
  • Reduced costs due to less need for servers, compute cores, and RAM
  • Fresher data – new data available much faster
  • Less coding for new apps
  • Lower TCO
  • Cloud connectivity and flexibility
  • Reduction in operations costs
  • Elimination of maintenance costs for outmoded batch apps
  • More analytics users supported
  • Faster analytics results
  • Faster data science results
  • New business opportunities

Why MemSQL for Risk Management?

MemSQL is fast – with the ability to scan up to one trillion rows per second. It’s a distributed SQL database, fully scalable. MemSQL supports streaming, in combination with messaging platforms such as Apache Kafka, and supports exactly-once guarantees. MemSQL supports high levels of concurrency and runs everywhere – on premises or in the cloud, in containers or virtual machines.

MemSQL customers often begin by moving some or all of their analytics to MemSQL for better responsiveness, greater concurrency, and reduced costs for the platform – including software licensing, hardware requirements, and operations expenses.

Customers then tend to find that MemSQL can take over more and more of the data pipeline. The combination of Kafka for messaging, MemSQL for data processing, Hadoop/HDFS as a data lake, and BYOBI (bring your own business intelligence, or BI, tools), can serve as a core architecture for a wide range of data analytics needs.

You can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

Webinar: Data Innovation in Financial Services


Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar, MemSQL Product Marketing Manager Mike Boyarski describes trends in data initiatives for banks and other financial services companies. Data is the lifeblood of modern financial services companies, and MemSQL, as the world’s fastest database, is rapidly growing its footprint in financial services.

Financial services companies are leaders in digital transformation initiatives, working diligently to wring business value from the latest initiatives. According to Gartner’s 2019 CIO Agenda survey, digital transformation is the top priority for banks, followed by growth in revenue; operational excellence; customer experience; cost optimization/reduction; and data and analytics initiatives (which are also likely to show up in the “digital transformation” bucket). As Gartner puts it, “… the digital transformation of banks creates new sources of revenue, supports new enterprise operating models and delivers digital products and services.”

Digital transformation is the leading priority for banks and other financial services companies

We encourage you to review these highlights, then view the webinar recording for Mike’s in-depth treatment of this important topic.

Banks Bring Digital Transformation to Life

The key data initiatives that banks are pursuing as part of digital transformation include:

  • Premium customer experiences. Areas such as wealth management, always-on advisory services, and personalization are receiving focused attention to improving the customer experience.
  • Smart risk management. Areas of interest include automating compliance reporting, risk management analysis carried out before a trade is made, and governance for big data, as well as greater responsiveness for risk management analysis.
  • Cloud computing. Cloud initiatives include agile machine learning (ML) and AI services, elastic computing and storage, and hybrid architectures across cloud providers and on-premises data processing.
  • AI/ML. Machine learning and AI are being used for fraud detection (often, in real time rather than in daily batch analysis), “robo advisors” that use computer-based intelligence to provide trading and investment advice, and portfolio analysis, including for risk management.

The cloud, machine learning, and AI are key financial services data initiatives

How do data innovations support improvements to banking applications? Mike cites the elimination of latency between real-world events, such as a stock trade, and actionable insights, such as offering relevant services; faster data architectures for greater agility; and reducing operational burdens, even while improving responsiveness and concurrency.

How Does MemSQL Help?

MemSQL helps to meet requirements that are commonly found in banking:

  • Single-node database augmentation. Banks are often looking to improve performance for legacy, single-node relational databases such as Oracle, Sybase, and Netezza.
  • Hadoop acceleration. Hadoop/HDFS implementations are often slow and hard to query, even when additional layers such as Hive and Spark are used to try to improve performance. MemSQL provides scalable SQL for performance and usability.
  • Path to cloud. As a cloud-native database that also excels in on-premises deployments, MemSQL provides a path for banks to innovate while keeping the door open between cloud and on-premises use.
  • Real-time AI/ML. As machine learning and AI move from research to deployment, scalable MemSQL has the capacity, concurrency support, and performance to deliver business value.

In particular, banks have strong needs in both of the data table types that mark different use cases in data management. MemSQL is unique in performing strongly on both while offering flexibility between the two:

  • Row-oriented tables (often memory-based). Frequently updated data tables and transactions commonly run in row-oriented tables, as do data with critical timing requirements for accessibility. (Think device control, as in IoT.)
  • Column-oriented tables (usually disk-based). Less frequently updated data tables, often featuring different data aggregations and sort orders for stellar query performance. These tables tend to have high concurrency requirements. (Think business intelligence, operational databases for app support, and operational AI/ML.)

MemSQL has strong support for both rowstore and columnstore tables
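The trade-off between the two orientations can be illustrated with a toy, pure-Python sketch. This only mimics the two layouts in memory; MemSQL’s actual storage engines are far more sophisticated.

```python
# Toy illustration of row vs. column orientation (illustrative data only).
rows = [  # row-oriented: each record stored together -> cheap point updates
    {"id": 1, "symbol": "AAPL", "price": 190.0},
    {"id": 2, "symbol": "MSFT", "price": 410.0},
    {"id": 3, "symbol": "GOOG", "price": 150.0},
]

columns = {  # column-oriented: each attribute stored together -> fast scans
    "id": [1, 2, 3],
    "symbol": ["AAPL", "MSFT", "GOOG"],
    "price": [190.0, 410.0, 150.0],
}

# Point update (natural fit for a rowstore): touch one record in place.
rows[1]["price"] = 412.5

# Analytical scan (natural fit for a columnstore): read one column,
# skip every other attribute entirely.
avg_price = sum(columns["price"]) / len(columns["price"])
print(avg_price)  # 250.0
```

A rowstore keeps everything a transaction needs for one record in one place, while a columnstore lets an analytical query read only the columns it touches, which also compresses far better on disk.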

Q&A

Q. How does MemSQL work with Apache Spark and what use cases does it support?

A. Spark is great at doing acceleration for your Hadoop architecture, but in terms of how it works with MemSQL, we see folks using it in two ways: they use it for transforming data in the stream and for using the ML library for processing the data before it lands in MemSQL. We see Spark used more as a transformation layer, especially for the ML logic.

Q. How is MemSQL used for geospatial support?

A. This is crucial functionality for several users, and Uber has really led the way in taking advantage of our geospatial support (see the Uber engineering blog, view a live video presentation, and visit our customers page – Ed.). But for details, let me refer you to the MemSQL documentation.

Q. Can MemSQL work as a dedicated data warehouse, like Netezza?

A. MemSQL definitely can work that way. But MemSQL is a unique technology. We provide an ingest engine that’s largely rowstore-based and in-memory. That’s a capability that traditional data warehouses just don’t have. We absolutely can also function as a dedicated data warehouse, and we have customers using us for that. We look forward to publishing some benchmarks to show just how fast we are at these functions.

Following Up

Intrigued? It’s easy to learn more.

Read the case studies referred to in the webinar – wealth management dashboards and portfolio risk management.

View the webinar, including Q&A, and download the slides here.


How We Use Exactly-Once Semantics with Apache Kafka


Feed: MemSQL Blog.
Author: Floyd Smith.

A version of this blog post first appeared in the developer-oriented website, The New Stack. It describes how MemSQL works with Apache Kafka to guarantee exactly-once semantics within a data stream.

Apache Kafka usage is becoming more and more widespread. As the amount of data that companies deal with explodes, and as demands on data continue to grow, Kafka serves a valuable purpose. This includes its use as a standardized messaging bus due to several key attributes.

One of the most important attributes of Kafka is its ability to support exactly-once semantics. With exactly-once semantics, you avoid losing data in transit, but you also avoid receiving the same data multiple times. This avoids problems such as a resend of an old database update overwriting a newer update that was processed successfully the first time.

However, because Kafka is used for messaging, it can’t keep the exactly-once promise on its own. Other components in the data stream have to cooperate – if a data store, for example, were to make the same update multiple times, it would violate the exactly-once promise of the Kafka stream as a whole.
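One common way a downstream store cooperates is by making its writes idempotent, so that a replayed message cannot apply twice. Here is a minimal sketch using SQLite’s keyed upsert; the table, key, and message shape are hypothetical, and this models the idea rather than MemSQL’s actual mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")

def apply_update(msg):
    # Keyed upsert: replaying the same message leaves the table unchanged.
    conn.execute(
        "INSERT INTO accounts (id, balance) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET balance = excluded.balance",
        (msg["id"], msg["balance"]),
    )

msg = {"id": 42, "balance": 100.0}
apply_update(msg)
apply_update(msg)  # duplicate delivery, e.g. after a producer retry

count, balance = conn.execute(
    "SELECT COUNT(*), SUM(balance) FROM accounts").fetchone()
print(count, balance)  # 1 100.0 -> the replay did not create a second row
```

Because the write is keyed, a redelivered message overwrites itself instead of duplicating the row, which is one building block for end-to-end exactly-once behavior.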

Kafka and MemSQL are a very powerful combination. Our resources on the topic include instructions on quickly creating an IoT Kafka pipeline; how to do real-time analytics with Kafka and MemSQL; a webinar on using Kafka with MemSQL; and an overview of using MemSQL pipelines with Kafka in MemSQL’s documentation.

How MemSQL Works with Kafka

MemSQL is fast, scalable, relational database software, with SQL support. MemSQL works in containers, virtual machines, and in multiple clouds – anywhere you can run Linux.

This is a novel combination of attributes: the scalability formerly available only with NoSQL, along with the power, compatibility, and usability of a relational, SQL database. This makes MemSQL a leading light in the NewSQL movement – along with Amazon Aurora, Google Spanner, and others.

The ability to combine scalable performance, ACID guarantees, and SQL access to data is relevant anywhere that people want to store, update, and analyze data, from a venerable on-premises transactional database to ephemeral workloads running in a microservices architecture.

NewSQL allows database users to gain both the main benefit of NoSQL – scalability across industry-standard servers – and the many benefits of traditional relational databases, which can be summarized as schema (structure) and SQL support.

In our role as NewSQL stalwarts, Apache Kafka is one of our favorite things. One of the main reasons is that Kafka, like MemSQL, supports exactly-once semantics. In fact, Kafka is somewhat famous for this, as shown in my favorite headline from The New Stack: Apache Kafka 1.0 Released Exactly Once.

What Is Exactly-Once?

To briefly describe exactly-once, it’s one of three alternatives for processing a stream event – or a database update:

  • At-most-once. This is the “fire and forget” of event processing. The initiator puts an event on the wire, or sends an update to a database, and doesn’t check whether it’s received or not. Some lower-value Internet of Things streams work this way, because updates are so voluminous, or may be of a type that won’t be missed much. (Though you’ll want an alert if updates stop completely.)
  • At-least-once. This is checking whether an event landed, but not making sure that it hasn’t landed multiple times. The initiator sends an event, waits for an acknowledgement, and resends if none is received. Sending is repeated until the sender gets an acknowledgement. However, the initiator doesn’t bother to check whether one or more of the non-acknowledged event(s) got processed, along with the final, acknowledged one that terminated the send attempts. (Think of adding the same record to a database multiple times; in some cases, this will cause problems, and in others, it won’t.)
  • Exactly-once. This is checking whether an event landed, and freezing and rolling back the system if it doesn’t. Then, the sender will resend and repeat until the event is accepted and acknowledged. When an event doesn’t make it (doesn’t get acknowledged), all the operators on the stream stop and roll back to a “known good” state. Then, processing is restarted. This cycle is repeated until the errant event is processed successfully.
Exactly-once semantics are more exacting than other kinds.
MemSQL Pipelines provide exactly-once semantics
when connected to the right message broker
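The difference between the last two modes shows up in a toy simulation: a lost acknowledgment triggers a resend, and receiver-side deduplication restores the exactly-once effect. This models the idea only, not Kafka’s broker protocol; all names here are illustrative.

```python
# Toy simulation of at-least-once vs. exactly-once delivery.
def deliver(messages, ack_fails_on, dedupe):
    """Send each message; a lost ack (index in ack_fails_on) forces one
    resend. With dedupe=True the receiver drops duplicates by message id."""
    received, seen = [], set()
    for i, msg in enumerate(messages):
        attempts = 2 if i in ack_fails_on else 1  # lost ack -> one retry
        for _ in range(attempts):
            if dedupe and msg["id"] in seen:
                continue  # duplicate detected, drop it
            seen.add(msg["id"])
            received.append(msg["value"])
    return received

msgs = [{"id": n, "value": n * 10} for n in range(3)]

# At-least-once: the retried message lands twice.
print(deliver(msgs, ack_fails_on={1}, dedupe=False))  # [0, 10, 10, 20]

# Exactly-once effect: retries still happen, but duplicates are discarded.
print(deliver(msgs, ack_fails_on={1}, dedupe=True))   # [0, 10, 20]
```

The retry loop is the same in both runs; what changes is whether the receiving side tracks what it has already processed, which is exactly the kind of cooperation a database must provide.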

How MemSQL Joins In with Pipelines

The availability of exactly-once semantics in Kafka gives an opportunity to other participants in the processing of streaming data, such as database makers, to support that capability in their software. MemSQL saw this early. The MemSQL Pipelines capability was first launched in the fall of 2016, as part of MemSQL 5.5; you can see a video here. There’s also more about the Pipelines feature in our documentation – original and updated. We also have specific documentation on connecting a Pipeline to Kafka.

The Pipelines feature basically hotwires the data transfer process, replacing the well-known ETL (extract, transform, and load) process with a direct connection between the database and a data source. Some limited changes can be applied to the data as it streams in, and it’s then loaded into the MemSQL database.

From the beginning, Pipelines have supported exactly-once semantics. When you connect a message broker with exactly-once semantics, such as Kafka, to MemSQL Pipelines, we support exactly-once semantics on database operations.

The key feature of a Pipeline is that it’s fast. That’s vital to exactly-once semantics, which represent a promise to back up and try again whenever an operation fails.

Like most things worth having in life, exactly-once semantics place certain demands on those who wish to benefit from them. Making the exactly-once promise worthwhile requires two things:

  • Having few operations fail.
  • Running each operation so fast that retries, when needed, are not too extensive or time-consuming.

If these two conditions are both met, you get the benefits of exactly-once semantics without a lot of performance overhead, even when a certain number of crashes occur. If either of these conditions is not met, the costs can start to outweigh the benefits.

MemSQL 5.5 met these challenges, and the Pipelines capability is popular with our customers. But to help people get the most out of it, we needed to widen the pipe. So, in the recent MemSQL 6.5 release, we announced Pipelines to stored procedures. This feature does what it says on the tin: you can write SQL code and attach it to a MemSQL Pipeline. Adding custom code greatly extends the transformation capability of Pipelines.

Stored procedures can both query MemSQL tables and insert into them, which means the feature is quite powerful. However, in order to meet the desiderata for exactly-once semantics, there are limitations on it. Stored procedures are MemSQL-specific; third-party libraries are not supported; and developers have to be thoughtful as to overall system throughput when using stored procedures.

Because MemSQL is SQL-compliant, stored procedures are written in standard ANSI SQL. And because MemSQL is very fast, developers can fit a lot of functionality into them, without disrupting exactly-once semantics.

Pipelines are Fast and Flexible

The Pipelines capability is not only fast – it’s also flexible, both on its own, and when used with other tools. That’s because more and more data processing components can support exactly-once semantics.

For instance, here are two ways to enrich a stream with outside data. The first is to create a stored procedure to do the work in MemSQL.

The following stored procedure uses an existing MemSQL table to join an incoming IP address batch with existing geospatial data about its location:


CREATE PROCEDURE proc(batch query(ip varchar, ...))
AS
BEGIN
INSERT INTO t
SELECT batch.*, ip_to_point_table.geopoint
FROM batch
JOIN ip_to_point_table
ON ip_prefix(ip) = ip_to_point_table.ip;
END

(For a lot more on what you can do with stored procedures, see our documentation, which also describes how to add SSL and Kerberos to a Kafka pipeline.)
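
For intuition, the join in that stored procedure can be prototyped in plain Python. The ip_prefix() helper and the geo table below are made-up stand-ins for whatever prefix logic and reference data a real deployment would use:

```python
def ip_prefix(ip):
    """Hypothetical prefix function: keep the first two octets."""
    return ".".join(ip.split(".")[:2])

def enrich(batch, ip_to_point):
    """Join each incoming IP with its geopoint, as the stored
    procedure does with INSERT ... SELECT ... JOIN."""
    return [(ip, ip_to_point[ip_prefix(ip)])
            for ip in batch
            if ip_prefix(ip) in ip_to_point]
```

Rows whose prefix has no match in the reference table drop out of the result, matching the inner-join semantics of the SQL version.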

You can also handle the transformation with Apache Spark, and you can do it in such a way as to support exactly-once semantics, as described in this article. (As the article’s author, Ji Zhang, puts it: “But surely knowing how to achieve exactly-once is a good chance of learning, and it’s a great fun.”)

Once Apache Spark has done its work, stream the results right on into MemSQL via Pipelines. (Which were not available when we first described using Kafka, Spark, and MemSQL to power a model city.)

MemSQL Pipelines support data transformations from Kafka, including machine learning with Apache Spark.
Use Kafka, Spark, MemSQL Pipelines, and stored procedures
for operational flexibility with exactly-once semantics

Try it Yourself

You can try all of this yourself, quickly and easily. MemSQL software is now available for free, with community support, up to a fairly powerful cluster. This allows you to develop, experiment, test, and even deploy for free. If you want to discuss a specific use case with us, contact MemSQL.

What MemSQL Can Do for Time-Series Applications


Feed: MemSQL Blog.
Author: Eric Hanson.

In earlier blog posts we described what time series data is and key characteristics of a time series database. In this blog post, which originally appeared in The New Stack, Eric Hanson, principal product manager at MemSQL, shows you how to use MemSQL for time series applications.

At MemSQL we’ve seen strong interest in using our database for time series data. This is especially the case when an organization needs to accommodate the following: (1) a high rate of event ingestion, (2) low-latency queries, and (3) a high rate of concurrent queries.

In what follows, I show how MemSQL can be used as a powerful time-series database and illustrate this with simple queries and user-defined functions (UDFs) that show how to do time series-frequency conversion, smoothing, and more. I also cover how to load time series-data points fast, with no scale limits.

Manipulating Time Series with SQL

Unlike most time series-specific databases, MemSQL supports standard SQL, including inner and outer joins, subqueries, common table expressions (CTEs), views, rich scalar functions for date and time manipulation, grouping, aggregation, and window functions. We support all the common SQL data types, including a datetime(6) type with microsecond accuracy that’s perfect as a time series timestamp.

A common type of time-series analysis in financial trading systems is to manipulate stock ticks. Here’s a simple example of using standard SQL to do this kind of calculation. We use a table with a time series of ticks for multiple stocks, and produce high, low, open, and close for each stock:

CREATE TABLE tick(ts datetime(6), symbol varchar(5),
   price numeric(18,4));
INSERT INTO tick VALUES
  ('2019-02-18 10:55:36.179760', 'ABC', 100.00),
  ('2019-02-18 10:57:26.179761', 'ABC', 101.00),
  ('2019-02-18 10:59:16.178763', 'ABC', 102.50),
  ('2019-02-18 11:00:56.179769', 'ABC', 102.00),
  ('2019-02-18 11:01:37.179769', 'ABC', 103.00),
  ('2019-02-18 11:02:46.179769', 'ABC', 103.00),
  ('2019-02-18 11:02:59.179769', 'ABC', 102.60),
  ('2019-02-18 11:02:46.179769', 'XYZ', 103.00),
  ('2019-02-18 11:02:59.179769', 'XYZ', 102.60),
  ('2019-02-18 11:03:59.179769', 'XYZ', 102.50);

This query uses standard SQL window functions to produce high, low, open, and close values for each symbol in the table, assuming that the tick table contains data for the most recent trading day.

WITH ranked AS
(SELECT symbol,
    RANK() OVER w as r,
    MIN(price) OVER w as min_pr,
    MAX(price) OVER w as max_pr,
    FIRST_VALUE(price) OVER w as first,
    LAST_VALUE(price) OVER w as last
    FROM tick
    WINDOW w AS (PARTITION BY symbol
    ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING
        AND UNBOUNDED FOLLOWING))
 
SELECT symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1;

Results:

+--------+----------+----------+----------+----------+
| symbol | min_pr   | max_pr   | first    | last     |   
+--------+----------+----------+----------+----------+
| XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
| ABC    | 100.0000 | 103.0000 | 100.0000 | 102.6000 |
+--------+----------+----------+----------+----------+

Similar queries can be used to create “candlestick charts,” a popular report style for financial time series that looks like the image below. A candlestick chart shows open, high, low, and close prices for a security over successive time intervals:

You can use standard SQL queries to create candlestick charts.

For example, this query generates a table that can be directly converted to a candlestick chart over three-minute intervals:

WITH ranked AS
   (SELECT symbol, ts,
    RANK() OVER w as r,
    MIN(price) OVER w as min_pr,
    MAX(price) OVER w as max_pr,
    FIRST_VALUE(price) OVER w as first,
    LAST_VALUE(price) OVER w as last
 
   FROM tick
   WINDOW w AS (PARTITION BY symbol, time_bucket('3 minute', ts)
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING
                AND UNBOUNDED FOLLOWING))
 
SELECT symbol, time_bucket('3 minute', ts), min_pr, max_pr,
first, last
FROM ranked
WHERE r = 1
ORDER BY 1, 2;

Results:

+--------+-----------------------------+----------+----------+----------+----------+
| symbol | time_bucket('3 minute', ts) | min_pr   | max_pr   | first    | last     |
+--------+-----------------------------+----------+----------+----------+----------+
| ABC    | 2019-02-18 10:54:00.000000  | 100.0000 | 100.0000 | 100.0000 | 100.0000 |
| ABC    | 2019-02-18 10:57:00.000000  | 101.0000 | 102.5000 | 101.0000 | 102.5000 |
| ABC    | 2019-02-18 11:00:00.000000  | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| XYZ    | 2019-02-18 11:00:00.000000  | 102.6000 | 103.0000 | 103.0000 | 102.6000 |
| XYZ    | 2019-02-18 11:03:00.000000  | 102.5000 | 102.5000 | 102.5000 | 102.5000 |
+--------+-----------------------------+----------+----------+----------+----------+

Smoothing is another common need in managing time series data. This query produces a smoothed sequence of prices for stock “ABC,” averaging the price over the current tick and up to three preceding ticks:

SELECT symbol, ts, price,
AVG(price) OVER (ORDER BY ts ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS smoothed_price
FROM tick
WHERE symbol = 'ABC';

Results:

+--------+----------------------------+----------+----------------+
| symbol | ts                         | price    | smoothed_price |
+--------+----------------------------+----------+----------------+
| ABC    | 2019-02-18 10:55:36.179760 | 100.0000 |   100.00000000 |
| ABC    | 2019-02-18 10:57:26.179761 | 101.0000 |   100.50000000 |
| ABC    | 2019-02-18 10:59:16.178763 | 102.5000 |   101.16666667 |
| ABC    | 2019-02-18 11:00:56.179769 | 102.0000 |   101.37500000 |
| ABC    | 2019-02-18 11:01:37.179769 | 103.0000 |   102.12500000 |
| ABC    | 2019-02-18 11:02:46.179769 | 103.0000 |   102.62500000 |
| ABC    | 2019-02-18 11:02:59.179769 | 102.6000 |   102.65000000 |
+--------+----------------------------+----------+----------------+
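
As a sanity check, the same trailing average can be computed in a few lines of Python; the window matches ROWS BETWEEN 3 PRECEDING AND CURRENT ROW:

```python
def smooth(prices, window=4):
    """Trailing average over the current value and up to the three
    preceding values (window=4 rows total)."""
    smoothed = []
    for i in range(len(prices)):
        chunk = prices[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed
```

Feeding in the ABC prices above reproduces the smoothed_price column, including the shorter windows at the start of the series.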

Using Extensibility to Increase the Power of MemSQL for Time Series

MemSQL supports extensibility with user-defined functions and stored procedures. MemSQL compiles UDFs and stored procedures to machine code for high performance.

I used MemSQL’s extensibility to create the time_bucket() function that appeared as a UDF in the queries above. This function provides equivalent capability to similar functions in time-series-specific products. You can easily create a function or expression to bucket by time intervals, such as second, minute, hour, or day.

A common need with time-series data is to perform interpolation. For example, suppose you have a time series with points at random intervals that are 30 seconds apart on average. There may be some minutes with no data point. So, if you convert the raw (irregular) time-series data to a regular time series with a point a minute, there may be gaps.

If you want to provide output for plotting with no gaps, you need to interpolate the values for the gaps from the values before and after them. It’s straightforward to implement a stored procedure in MemSQL that takes a query result and outputs the row set, with the gaps interpolated, into a temporary table.
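
The interpolation step itself is simple. Here is an illustrative Python sketch (not the stored procedure) that fills gaps in a regular per-minute series by linear interpolation between the nearest surrounding points:

```python
def fill_gaps(series):
    """series: {minute_index: value} with possible gaps.  Returns a
    list covering the full range, with each missing minute linearly
    interpolated from the nearest points on either side."""
    keys = sorted(series)
    filled = []
    for i in range(keys[0], keys[-1] + 1):
        if i in series:
            filled.append(float(series[i]))
        else:
            lo = max(k for k in keys if k < i)
            hi = min(k for k in keys if k > i)
            frac = (i - lo) / (hi - lo)
            filled.append(series[lo] + frac * (series[hi] - series[lo]))
    return filled
```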

This can then be sent back to the client application using the ECHO command.

In addition, MemSQL supports user-defined aggregate functions. These functions can be used to implement useful time series operations, such as shorthand for getting the first and last values in a sequence without the need for specific window functions.

Consider this query to get the first value for stock ABC in each three minutes of trading, based on a user-defined aggregate function (UDAF) called FIRST():

SELECT time_bucket('3 minute', ts), first(price, ts)
FROM tick
WHERE symbol = 'ABC'
GROUP BY 1
ORDER BY 1;

Results:

+-----------------------------+------------------+
| time_bucket('3 minute', ts) | first(price, ts) |
+-----------------------------+------------------+
| 2019-02-18 10:54:00.000000  | 100.0000         |
| 2019-02-18 10:57:00.000000  | 101.0000         |
| 2019-02-18 11:00:00.000000  | 102.0000         |
+-----------------------------+------------------+

The implementations of the FIRST() UDAF, and the analogous LAST() UDAF, are shown in the Supplemental Material section below.
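
The init/iterate/merge/terminate shape of such a UDAF is easy to model. This Python sketch mirrors the FIRST() implementation shown in the Supplemental Material, keeping the value with the smallest timestamp:

```python
def first_init():
    # sentinel state: no value yet, timestamp later than any real one
    return (None, float("inf"))

def first_iter(state, value, ts):
    # keep the (value, ts) pair with the smallest timestamp so far
    return (value, ts) if ts < state[1] else state

def first_merge(s1, s2):
    # combine partial states computed on different partitions
    return s1 if s1[1] < s2[1] else s2

def first_agg(pairs):
    """Fold (value, ts) pairs; return the value with the earliest ts."""
    state = first_init()
    for value, ts in pairs:
        state = first_iter(state, value, ts)
    return state[0]
```

The merge step is what lets the aggregate run in parallel across partitions, just as in the distributed SQL version.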

Time Series Compression and Life Cycle Management

MemSQL is adept at handling both bursty insert traffic for time series events and historical time series information where space savings are important. For bursty insert traffic, you can use a MemSQL rowstore table to hold time series events.

For larger and longer-lived sets of time series events, or older time series data sets that have aged and are unlikely to be updated anymore, the MemSQL columnstore is a great format. It compresses time-series data very effectively, with MemSQL supporting fast operations on compressed columnstore data. Moreover, columnstore data resides on disk, so main memory size is not a limit on how much data you can store.

Scalable Time Series Ingestion

When building a time series application, data can come at high rates from many sources. Sources include applications, file systems, AWS S3, Hadoop HDFS, Azure Blob stores, and Kafka queues. MemSQL can ingest data incredibly fast from all these sources.

MemSQL Pipelines are purpose-built for fast and easy loading of data streams from these sources, requiring no procedural coding to establish a fast flow of events into MemSQL.

MemSQL can ingest data at phenomenal data rates. In a recent test, I inserted 2,850,500 events per second directly from an application, with full transactional integrity and persistence, using a two-leaf MemSQL cluster. Each leaf ran on an Intel Xeon Platinum 28-core system.

Comparable or even better rates can be had using direct loading or Kafka pipelines. If you have to scale higher, just add more nodes — there’s no practical limit.

When General-Purpose MemSQL Is Right for Time Series

We’ve seen the market for time-series data management bifurcate into special-purpose products for time series, with their own special-purpose languages, and extended SQL systems that can interoperate with standard reporting and business intelligence tools that use SQL. MemSQL is in this second category.

MemSQL is right for time series applications that need rapid ingest, low-latency query, and high concurrency, without scale limits, and which benefit from SQL language features and SQL tool connectivity.

Many time-series-specific products have shortcomings when it comes to data management. Some lack scale-out, capping the size of problems they can tackle, or forcing application developers to build tortuous sharding logic into their code to split data across multiple instances, which costs precious dollars for labor that could better be invested into application business logic.

Other systems have interpreted query processors that can’t keep up with modern compiled query execution like MemSQL’s. Some lack the transaction processing integrity features common to SQL databases.

MemSQL lets time series application developers move forward confidently, knowing they won’t hit a scale wall, and they can use all their familiar tools — anything that can connect to a SQL database.

Summary

MemSQL is a strong platform for managing time series data. It supports the ability to load streams of events fast and conveniently, with unlimited scale. It supports full SQL that enables sophisticated querying using all the standard capabilities of SQL 92, plus the more recently added window function extensions.

MemSQL supports transactions, high rates of concurrent update and query, and high availability technologies that many developers need for all kinds of applications, including time series. And your favorite SQL-compatible tools, such as business intelligence (BI) tools, can connect to MemSQL. Users and developers – in areas such as real-time analytics, predictive analytics, machine learning, and AI – can use the SQL interfaces they’re familiar with, as described above. All of this and more makes MemSQL a strong platform for time series.

Download and use MemSQL for free today and try it on your time series data!

Supplemental Material: Full Text of the FIRST() and LAST() UDAFs

-- Usage: first(value, timestamp_expr)
-- Example:
--   Get first value of x for each day from a time series in table t(x, ts)
--   with timestamp column ts.
--
--   SELECT ts :> date, first(x, ts) FROM t GROUP BY 1 ORDER BY 1;
 
DELIMITER //
CREATE OR REPLACE FUNCTION first_init() RETURNS RECORD(v TEXT, d datetime(6)) AS
  BEGIN
    RETURN ROW("_empty_set_", '9999-12-31 23:59:59.999999');
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION first_iter(state RECORD(v TEXT, d DATETIME(6)),
   v TEXT, d DATETIME(6))
  RETURNS RECORD(v TEXT, d DATETIME(6)) AS
  DECLARE
    nv TEXT;
    nd DATETIME(6);
    nr RECORD(v TEXT, d DATETIME(6));
  BEGIN
    -- if new timestamp is less than lowest before, update state
    IF state.d > d THEN
      nr.v = v;
      nr.d = d;
      RETURN nr;
    END IF;
    RETURN state;
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION first_merge(state1 RECORD(v TEXT, d DATETIME(6)),
   state2 RECORD(v TEXT, d DATETIME(6))) RETURNS RECORD(v TEXT, d DATETIME(6)) AS
  BEGIN
    IF state1.d < state2.d THEN
      RETURN state1;
    END IF;
    RETURN state2;
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION first_terminate(state RECORD(v TEXT, d DATETIME(6))) RETURNS TEXT AS
  BEGIN
    RETURN state.v;
  END //
DELIMITER ;
 
CREATE AGGREGATE first(TEXT, DATETIME(6)) RETURNS TEXT
  WITH STATE RECORD(v TEXT, d DATETIME(6))
  INITIALIZE WITH first_init
  ITERATE WITH first_iter
  MERGE WITH first_merge
  TERMINATE WITH first_terminate;

A LAST() UDAF that is analogous to FIRST(), but returns the final value in a sequence ordered by timestamp, is as follows:

-- Usage: last(value, timestamp_expr)
-- Example:
--   Get last value of x for each day from a time series in table t(x, ts)
--   with timestamp column ts.
--
--   SELECT ts :> date, last(x, ts) FROM t GROUP BY 1 ORDER BY 1;
 
DELIMITER //
CREATE OR REPLACE FUNCTION last_init() RETURNS RECORD(v TEXT, d datetime(6)) AS
  BEGIN
    RETURN ROW("_empty_set_", '1000-01-01 00:00:00.000000');
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION last_iter(state RECORD(v TEXT, d DATETIME(6)),
   v TEXT, d DATETIME(6))
  RETURNS RECORD(v TEXT, d DATETIME(6)) AS
  DECLARE
    nv TEXT;
    nd DATETIME(6);
    nr RECORD(v TEXT, d DATETIME(6));
  BEGIN
    -- if new timestamp is greater than largest before, update state
    IF state.d < d THEN
      nr.v = v;
      nr.d = d;
      RETURN nr;
    END IF;
    RETURN state;
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION last_merge(state1 RECORD(v TEXT, d DATETIME(6)),
   state2 RECORD(v TEXT, d DATETIME(6))) RETURNS RECORD(v TEXT, d DATETIME(6)) AS
  BEGIN
    IF state1.d > state2.d THEN
      RETURN state1;
    END IF;
    RETURN state2;
  END //
DELIMITER ;
 
DELIMITER //
CREATE OR REPLACE FUNCTION last_terminate(state RECORD(v TEXT, d DATETIME(6))) RETURNS TEXT AS
  BEGIN
    RETURN state.v;
  END //
DELIMITER ;
 
CREATE AGGREGATE last(TEXT, DATETIME(6)) RETURNS TEXT
  WITH STATE RECORD(v TEXT, d DATETIME(6))
  INITIALIZE WITH last_init
  ITERATE WITH last_iter
  MERGE WITH last_merge
  TERMINATE WITH last_terminate;

Webinar: Data Trends for Predictive Analytics and ML


Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar, Mike Boyarski and Eric Hanson of MemSQL describe the promise of machine learning and AI. They show how businesses need to upgrade their data infrastructure for predictive analytics, machine learning, and AI. They then dive deep into using MemSQL to power operational machine learning and AI.

In this blog post, we will first describe how MemSQL helps you master the data challenges associated with machine learning (ML) and artificial intelligence (AI). We’ll then show how to implement ML/AI functions in MemSQL. At any point, feel free to view the (excellent) webinar.

Challenges to Machine Learning and AI

Predictive analytics is helping to transform how companies do business, and machine learning and AI are a huge part of that. The McKinsey Global Institute analysis shows ML/AI having trillions of dollars of impact in industry sectors ranging from telecommunications to banking to retail. AI investments are focused on automation, analytics, and fraud, among other areas.

However, McKinsey goes on to report that only 15% of organizations have the right technology infrastructure, and only 8% of the needed data is available to AI systems across an organization. The vast majority of AI projects have serious challenges in moving from concept to production, and half the time needed to deploy an AI project is spent in preparation and aggregation of large datasets.

Most CIOs expect an increase in AI investment.

The machine learning and AI lifecycle has ten steps, and several of them have data-related challenges. MemSQL addresses many of the toughest ones:

  1. Define ML use cases. Find specific ML use cases for the project.
  2. Data exploration. Perform exploratory data analysis. MemSQL’s ability to aggregate disparate sources of data and its speed and responsiveness, even on large data sets, help here.
  3. Select algorithm. Choose the ML algorithm that will best perform the task.
  4. Data pipeline and feature engineering. Profile your incoming data to identify the relevant features that will support the ML task. MemSQL helps here due to its flexibility in dealing with data in memory (rowstore) and on disk (columnstore), as well as its scalability to handle arbitrarily large data volumes.
  5. Build ML model. Develop the first iteration of the ML model.
  6. Iterate ML model. Refine the model to improve performance and efficacy – increasingly, this process can itself be ML-assisted.
  7. Present results. Present results of the model in a way that demonstrates its value to stakeholders. MemSQL can power dashboards or other tools for showing the results of ML model refinement.
  8. Plan for deployment. Prepare for deployment to production.
  9. Operationalize model. Deploy and operationalize ML model in production. This is where MemSQL makes the biggest difference. MemSQL’s speed, scalability, and SQL support all make the implemented model faster and more effective.
  10. Monitor model. Monitor the model in production; retrain or rebuild it to add features or improve performance. MemSQL has built-in monitoring tools that can play an important role in the overall monitoring process.
MemSQL addresses many of the key challenges of ML & AI, especially operationalization.

To sum up, key challenges in ML/AI implementation that are addressed by MemSQL include modernizing data infrastructure; simplifying and accelerating query performance against big data; and adding scalability and convergence to the process of operationalizing AI.

Overview of AI/ML Support in MemSQL

MemSQL has features that support key aspects of the machine learning and AI lifecycle:

  • Integration with ML/AI tools
  • The transforms capability in MemSQL Pipelines, for seamless scoring of relevant features as data is loaded
  • Using MemSQL extensibility for additional and more complex scoring
  • Built-in vector similarity functions with very fast performance

Here’s an example of using the transforms capability in MemSQL Pipelines:

CREATE PIPELINE mypipeline AS
LOAD DATA KAFKA '192.168.1.100:9092/my-topic'
WITH TRANSFORM ('http://www.memsql.com/my-transform.tar.gz', 'my-executable.py', '')
INTO TABLE t

For more information, see the MemSQL Documentation.
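
A transform is an executable that reads records on stdin and writes transformed records to stdout. The sketch below shows one plausible shape in Python; the line-per-record input format, the scoring rule, and the tab-separated output are all assumptions for illustration:

```python
import sys

def transform_line(line):
    """Illustrative scoring step: tag each numeric value with a band."""
    value = float(line.strip())
    band = "high" if value > 0.5 else "low"
    return "%s\t%s" % (value, band)

def run(stdin, stdout):
    # Stream records through the transform; a real transform binary
    # would call run(sys.stdin, sys.stdout) from its entry point.
    for line in stdin:
        if line.strip():
            stdout.write(transform_line(line) + "\n")
```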

Image recognition is an important capability enabled by ML/AI, and MemSQL has several customers using this today. You can train the model with other data components that connect well to MemSQL, including Apache Spark, TensorFlow, and Gluon. You can then use your model to extract feature vectors (called embeddings) from images. The feature vectors can then be stored in a MemSQL table for fast processing.

There are several MemSQL functions that are directly useful for vector similarity matching:

  • DOT_PRODUCT(vector, vector)
  • EUCLIDEAN_DISTANCE(vector, vector)
  • JSON_ARRAY_PACK(‘[float [, …]]’)
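
These measures are simple to state precisely. The Python sketch below (illustrative; in MemSQL the built-ins above evaluate over packed binary vectors) finds the stored embedding most similar to a query vector by dot product:

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query, embeddings):
    """embeddings: {id: vector}; return the id whose vector has the
    highest dot product with the query vector."""
    return max(embeddings, key=lambda k: dot_product(query, embeddings[k]))
```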

MemSQL’s capabilities are applicable to a variety of different job tasks in the machine learning and AI lifecycle.

MemSQL's capabilities enable professionals in a variety of AI and ML roles.
MemSQL makes people more productive right across the ML and AI lifecycle.

MemSQL’s connectivity, capabilities, and speed make it a solid choice for machine learning and AI development and deployment.

MemSQL is highly connectible to data sources and other data stores, making it a key component of machine learning and AI architectures.

For more information, view the webinar.

How CEOs Can Stay Relevant in the Age of AI


Feed: MemSQL Blog.
Author: Peter Guagenti.

The most important new skills for business leaders are not what you might think.

You’ve read the headlines. Data is the new oil; it’s the new currency; data capital can create competitive advantage. We also hear, over and over again, that machine learning (ML) and artificial intelligence (AI) are the future.

Few will dispute that these things are true, but the trite language masks a deeper challenge.

Data must be collected, analyzed, and acted upon in the right way and at the right time for a business to create value from it. ML and AI are only as powerful as the data that drive them.

In this world, big companies – which may throw away as much data in a day as a startup will generate in a year – should have a significant competitive advantage. However, new approaches are needed to move forward effectively in the age of ML and AI. And the new approaches all start with the data itself.

To build and maintain a successful business in today’s insight-driven economy, business leaders need to develop a new set of skills. We outline what we believe those skills are below.

Skill #1: A Drive to Find Key Data and Make it Useful

Business leaders need to be on a mission to collect and (more importantly) expose to their organizations all of the data that might create a competitive advantage.

We don’t always know exactly what data or insights might be the ones that will allow us to break away from the pack until after we have analyzed and acted on that data, then measured the results and repeated the cycle, over and over.

Business leaders need to encourage collecting as much data as possible in the day-to-day operations of the business, with a particular eye towards where your organization has advantages or challenges. Make sure that the data is not simply collected, but stored in such a way that your teams can easily access, understand, and analyze it.

“Big data” was a great start to enabling the future of our businesses, but what we need today instead is “fast data” – data made available to everyone, to drive fast insight.

Skill #2: The Ability to Create a Culture of Constant Analysis and Action

As the French writer Antoine de Saint-Exupéry stated, “If you want to build a ship, don’t drum up people together to collect wood and don’t assign them tasks and work, but rather teach them to long for the vast and endless sea.”

This adage applies to becoming an insight-driven business. Data is not insight, and insights are not outcomes.

What we seek in collecting and analyzing data is to identify and carry out the actions that will accelerate and transform our business. The best way to leverage data for creating competitive advantage is to encourage a culture of inquisitiveness, of always asking “the 5 Whys” – a series of “why” questions that take us to the root of what’s important, and why.

Compel your teams to constantly look for ways to not just gather and share insights, but to look for ways to turn insights into immediate actions that add value to the business. Innovations such as ecommerce product recommendations, dynamic pricing based on demand, or sensor-based maintenance are all insight-driven innovations that have arisen in the last decade or so and that have generated dramatic competitive advantage.

ML and deep learning – the most practical form of AI currently available to business – accelerate this process. You can use them together to test multivariate alternatives, to vary assumptions and audiences around your current performance, to help you maximize the value of the insights that you find and implement today, and then to help you take your insights to another higher level.

Skill #3: The Insight to Choose the Right Tools and Technologies

The agile movement does not get nearly enough credit for the transformative effect it’s had, and continues to have, on business. But a business can only be agile with the right tools and technologies, and the ability to use them to drive action and change.

It’s no surprise that, up to this point, most of the companies and leaders that are making the best use of data to drive their businesses are digital natives – think Google, Facebook, Uber, Airbnb, et al. They have done this by applying the agile mindset of software development to data architecture, data engineering, and data-driven decisioning.

While the large digital players may have leapt to the forefront in the last 10 years, the traditional enterprise can use its long operational history, its existing volumes of data, and its ability to generate fresh, useful data, to level the playing field and compete effectively in the modern economy.

In order to maximize and utilize these resources, business leaders need to lead the decision making around data infrastructure. The insight-driven enterprise needs the best possible tools and technology to enable fast, flexible, and efficient use of the company’s data. This means shifting the traditional IT mindset from maintaining legacy data infrastructure, overly strict controls, and inflexibility, to one that puts agility first.

Analysts, data scientists, and application developers need access to real-time or near-real-time data sources. And they, and the businesspeople who work with them most closely, need to be empowered to act on that data – be it for rapid decision making or to create insight-driven, dynamic experiences for customers and employees.

This shift requires a new set of tools, processes, and culture that is so critical to the future of the business that business leaders – all the way up to the CEO – need to ensure that agility is the primary order of the day.

Peter Guagenti is CMO at MemSQL, and is an advisor and a board member for several AI-focused companies. Peter spent more than a decade helping Fortune 500 companies to embrace digital transformation and to use real-time and predictive decisions to improve their businesses.

What Makes a Database Cloud-Native?


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL has been designed and developed as a distributed relational database, bringing the effectiveness of the relational database model into the new world of the cloud, containers, and other software-defined infrastructure – as described in a new report from 451 Research. Today, most of our customers run our software using some combination of the cloud and containers, with many also running it on-premises.

Today, we are purveyors of the leading platform-independent NewSQL database. Having recently joined the Cloud Native Computing Foundation, we’d like to take this opportunity to answer the question: “What makes a database cloud-native?”

Cloud-Native Software Definition

There are many definitions of “cloud-native software” available. 451 Research states that cloud-native software is “designed from the ground up to take advantage of cloud computing architectures and automated environments, and to leverage API-driven provisioning, auto-scaling and other operational functions.”

The company continues: “Cloud-native architecture and software include applications that have been redesigned to take advantage of cloud computing architectures, but are not limited to cloud applications – we see cloud-native technologies and practices present in on-premises environments in the enterprise.”

The point is repeated in one of the major headings in the report: “Cloud-native isn’t only in the cloud.” 451 Research commonly finds cloud-native technologies and practices being used in on-premises environments.

What Cloud-Native Means for MemSQL

Let’s break down the 451 Research definition of cloud-native and see how it applies to MemSQL.

Takes Advantage of Cloud Features

The first point from the 451 Research report states that cloud-native software is “designed from the ground up to take advantage of cloud computing architectures and automated environments”.

MemSQL has been available on the major public cloud platforms for years, and deployments are balanced across cloud and on-premises environments. More importantly, MemSQL’s unique internal architecture gives it both the scalability inherent to the cloud and the ability to support SQL for transactions and analytics.

An important step has been qualifying MemSQL for use in containers. MemSQL has been running in containers for a long time, and we use a containerized environment for testing our software.

451 Research shows a spectrum of cloud-native software services.

Leverages Software Automation

The report then goes into more detail on this point. Cloud-native software will “leverage API-driven provisioning, auto-scaling and other operational functions.” The ultimate goal here is software-defined infrastructure, in which the software stack is platform-independent and can be managed automatically, by other software.

MemSQL has command-line tools that integrate easily with on-premises deployment tools, such as Ansible, Chef, and Puppet, and cloud deployment mechanisms such as Azure Resource Management and CloudFormation. This is crucial to the definition and nature of cloud-native, and MemSQL’s automatability is crucial to its inclusion as cloud-native software.

MemSQL Studio provides a monitoring environment for MemSQL across deployment platforms – that is, across public cloud providers, private cloud, and on-premises.

Not Limited to Cloud Applications

Concluding their key points, 451 Research then states: “Cloud-native architecture and software include applications that have been redesigned to take advantage of cloud computing architectures, but are not limited to cloud applications… .”

The point here is that “cloud-native” doesn’t mean “cloud-only”. Cloud-native describes a set of capabilities that can be deployed anywhere — in public cloud providers, in modernized data centers, and increasingly at the edge.

The cloud-native movement combines with the unique features of MemSQL to create something exceptional: a database that can move between deployment locations with ease. Together, that flexibility and portability deliver a capability that hasn't been available before: the same database, unchanged, wherever the workload lives.

Specific MemSQL features make it particularly suitable for cloud-native deployments:

  • Container-friendly. As mentioned above, MemSQL runs well in containers – which is a defining characteristic for cloud-native software.
  • Fully scalable. Like NoSQL databases, and unlike traditional relational databases, MemSQL scales out fully, whether deployed in a cloud, on-premises, or across both.
  • Kafka and Spark integration. Apache Kafka and Apache Spark are widely used for data transfer in cloud-native applications, and both work very smoothly with MemSQL Pipelines.
  • Microservices support. MemSQL’s performance, scalability, and flexibility are useful in microservices implementations, considered emblematic of cloud-native software.
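The Kafka integration noted above is exposed through MemSQL Pipelines, which ingest directly from a Kafka topic with a single DDL statement. A brief sketch, following the `CREATE PIPELINE ... LOAD DATA KAFKA` syntax from the MemSQL documentation (broker address, topic, and table names are illustrative):

```sql
-- Continuously ingest events from a Kafka topic into a table.
CREATE PIPELINE clicks_pipeline AS
LOAD DATA KAFKA 'kafka-broker.example.com:9092/clicks'
INTO TABLE clicks;

START PIPELINE clicks_pipeline;
```

Once started, the pipeline pulls new messages in parallel across the cluster's leaf nodes, with no external ETL process to deploy or monitor.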

Next Steps

MemSQL’s architecture and capabilities are unique and allow for unbeatable performance and effortless scale — especially when paired with elastic cloud infrastructure. An example is customers who want to move from on-premises Oracle deployments to cloud-native technologies. MemSQL improves on Oracle’s performance and reduces cost while modernizing data infrastructure.

Try MemSQL for free today, or contact us to learn how we can help support your cloud adoption plans.
