
MemSQL SingleStore – And Then There Was One

Feed: MemSQL Blog.
Author: Eric Hanson.

MemSQL SingleStore is a new vision for how databases can work – first blurring and then, for most use cases, erasing any apparent difference between today’s rowstore and columnstore tables. In MemSQL SingleStore Phase 1, shipping as part of MemSQL 7.0, rowstore tables get null compression, lowering TCO in many cases by 50%. Columnstore tables get seekable columnstores, which support fast seeks and updates, giving columnstore tables many of the performance and usability features of rowstore tables. The hard choices developers have faced up to now between rowstore and columnstore tables – or between separate rowstore and columnstore database software offerings – are significantly reduced, cutting costs and improving performance. With the new system of record improvements, also offered in MemSQL 7.0, our vision of “one database to rule them all” begins to be realized.

Introducing MemSQL SingleStore

MemSQL SingleStore is a breakthrough in database storage architecture that allows operational and analytical workloads to be processed using a single table type. It simplifies the developer’s job while providing tremendous scalability and performance, and minimizing costs.

As a significant first step, in MemSQL 7.0, SingleStore Phase 1 allows OLTP applications to use columnstore tables to run operational transactions on data much bigger than RAM. This is supported via new hash indexes and related speed and concurrency improvements for disk-based columnstore tables, delivering seeks and updates at in-memory speeds. SingleStore Phase 1 also now supports transactional applications on larger data sets more economically, via in-memory compression for null values in fast, memory-based rowstore tables, with memory savings of roughly 50% for many use cases.

Together, these improvements give MemSQL customers better performance at lower cost, the flexibility to get the most from computing resources, and the ability to tackle the largest data management problems economically.

Our Vision for the Ultimate Table Format

MemSQL supports two types of data tables in the same database: in-memory rowstores, which are ideal for online transaction processing (OLTP) and hybrid transactional/analytical (HTAP) applications, and disk-based columnstores, which are the best choice for purely analytical applications. Customers love the speed and predictability of rowstores for OLTP and HTAP. They also love the truly incredible analytical performance of columnstores, plus their ability to store far more data than will fit in RAM economically.

But customers have been asking us to improve the total cost of ownership (TCO) of rowstores, because they have to provision servers with large amounts of RAM when tables get big, which can be costly. They’ve also asked us to add OLTP-like features to columnstores, such as fast UPSERTs and unique constraints. In response, we’ve developed a vision of the future in which one table type can be used for OLTP, HTAP, and analytics on arbitrarily large data sets, much bigger than the available RAM, all with optimal performance and TCO. We call this MemSQL SingleStore.

Our ultimate goal for SingleStore is that performance for OLTP and HTAP is the same as for a rowstore table, if the amount of RAM that would have been required for an explicitly defined rowstore table is available, and that OLTP performance degrades gracefully if less RAM than that is available. For analytics, utilizing large scans, joins, aggregates, etc., the goal is to provide performance similar to that of a columnstore table.

The old-fashioned way to support OLTP on tables bigger than RAM would be to use a legacy storage structure like a B-tree. But that could lead to big performance losses; we need to do better. In the SingleStore vision, we preserve the performance and predictability of our current storage structures and compiled, vectorized query execution capability, and even improve on it. All while reducing the complexity and cost of database design, development, and operations by putting more capability into the database software.

In the 7.0 release, we are driving toward solving the customer requirements outlined above, and ultimately realizing our vision for a “SingleStore” in two different ways. One is to allow sparse rowstores to store much more data in the same amount of RAM, thus improving TCO while maintaining great performance, with low variance, for seeks. We do this through sparse in-memory compression. The other is to support seekable columnstores that allow highly concurrent read/write access. Yes, you read that right, seekable columnstores. It’s not an oxymoron. We achieve this using hash indexes and a new row-level locking scheme for columnstores, plus subsegment access, a method for reading small parts of columnstore columns independently and efficiently.

In what follows, we’ll explain how MemSQL 7.0 takes a critical first step toward realizing our vision for a SingleStore. And this is just the beginning.

Sparse Rowstore Compression

To address the TCO concerns of our customers with wide tables that have a high proportion of NULL values – a situation often found in the financial sector – we’ve developed a method of compressing in-memory rowstore data. This relies on a bitmap to indicate which fields are NULL. But we’ve put our own twist on this well-established method of storing less data for NULL values by maintaining a portion of the record as a structure of fixed-width fields. This allows our compiled query execution and index seeking to continue to perform at the highest levels. We essentially split the record into two parts: a fixed-width portion containing non-sparse fields and index keys, and a variable-width portion containing the fields which have been designated as sparse.

Our NULL bitmap uses four bits per field instead of one, leaving room for future growth. We’ve created a pathway where we can add the ability to compress out default values like blanks and zeros, and also to store normally fixed-width fields as variable-width fields – e.g., storing small integers in a few bytes, rather than 8 bytes, if they are declared as bigints.

The following figure illustrates how sparse compression works for a table with four columns, the last three of which are designated as SPARSE. (Assume the first column is a unique NOT NULL column, so it is not designated as SPARSE.) The fact that the first column is not sparse and the last three columns may be sparse is recorded in table-level metadata.

Rowstore compression reduces storage space and TCO due to memory usage.
Setting the SPARSE flag causes appropriate fields to have null values indicated by flags, rather than by a null value taking up the normal field width.

In the figure, the width of the wide fields represents 32 bits and the width of the narrow fields represents 4 bits. Actual space usage also depends on the presence of indexes and storage allocation structures, and can’t be easily summarized, but the illustration is a useful visual representation of the space savings for sparse fields that can be NULL.

The sweet spot for SPARSE compression is a wide rowstore table with more than half NULL values. Here’s an example:

CREATE TABLE t (
  c1 double,
  c2 double,
  …
  c300 double) compression = sparse;

Specifying compression = sparse at the end of CREATE TABLE causes MemSQL to use sparse encoding for nullable structured fields, including numbers, dates, datetimes, timestamps, times, and varchars.

Using this table schema with MemSQL 7.0, loaded with 1.05 million rows in which two-thirds of the field values are NULL, we observe the following memory usage without compression, versus roughly half that memory usage with sparse compression:

Compression Setting    Memory Use    Savings (percent)
NONE                   2.62 GB       N/A
SPARSE                 1.23 GB       53%
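As a rough back-of-envelope check on these numbers: 1,048,576 rows × 300 double columns × 8 bytes is about 2.5 GB of raw field data, close to the 2.62 GB measured without compression (the remainder is row headers and allocation overhead). With sparse compression, each row stores roughly 100 non-NULL doubles at 8 bytes each, plus a 300-field × 4-bit NULL bitmap of about 150 bytes – roughly 950 bytes per row, or about 1.0 GB, in the neighborhood of the 1.23 GB measured.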

So, for this wide table with two-thirds NULL values, you can store more than twice the data in the same amount of RAM. This of course can lead to big TCO savings, or enable you to tackle bigger problems to generate more business value with MemSQL.

The MPSQL code used to do this experiment is given in Appendix A.

Note. A handy query for calculating table memory use is:

select database_name, table_name, format(sum(memory_use),0) m
from information_schema.table_statistics
group by 1, 2;

This gives the actual memory used for each rowstore table.
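To look at a single database (db1 here is a placeholder name), the same view can be filtered:

select table_name, format(sum(memory_use),0) m
from information_schema.table_statistics
where database_name = 'db1'
group by 1;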

Seekable Columnstores with Support for High Concurrency

The sparse rowstore compression method described in the previous section goes a long way toward increasing the scope of operational analytics and OLTP applications that MemSQL can handle with best-in-class performance and TCO. But it still requires all data to be kept in RAM, which imposes a TCO and scale limit for some potential users. To drive to truly low TCO per row, while enabling operational applications, we’ve also enhanced our columnstore. (MemSQL’s columnstore is disk-based, with portions cached in memory for better performance.)

This enhancement, the second prong of SingleStore Phase 1, is to make columnstores seekable. How, you ask, is that possible? Aren’t columnstores designed for fast scanning, with no thought given to making OLTP-style seeks fast enough? Based on other columnstore implementations available on the market, you might think this is the case. But MemSQL 7.0 introduces new technology to make columnstores seekable, and updatable at fine grain with high concurrency.

First, here’s a little background on our columnstore implementation, to help understand what we’ve changed. MemSQL columnstore tables are broken into one-million-row chunks called segments. Within each segment, columns are stored independently – in contiguous parts of a file, or in a file by themselves. These stored column chunks are called column segments. Prior to 7.0, accessing a field of a single row required scanning the entire, million-row column segment for that field.

Subsegment Access

MemSQL 7.0 speeds up access to a row in a columnstore by allowing the system to calculate the location of the data for that row, and then read only the portion of a column segment that is needed to materialize that row. This is called subsegment access. This may involve reading data for up to a few thousand rows, but it is nowhere near the full million rows of the segment. Once the offset of a row in a segment is known, only portions of that row and adjacent rows need to be retrieved, in the typical case. In some other cases, such as a run-length-encoded column, only a small number of bytes of data may need to be retrieved to materialize a column, simply due to the nature of the compression strategy. This also allows efficient seeking.

The figure below illustrates how just a portion of the file data for a segment needs to be read to find one or a few rows at a specific position in a segment.

Illustration of small portion of column segment files read to perform a seek using sub-segment access in a columnstore segment.

Subsegment access allows rows to be materialized efficiently via seek-style access once the row position in a segment is known. So the question is, how can we come to know the row position, and do that efficiently? One way is to scan a single column to apply a filter, and record the row numbers that match the filter. While this can be extremely fast in MemSQL, based on the use of operations on encoded data and vectorization, it can ultimately require time proportional to the number of rows.

Columnstore Hash Indexes

To run selective queries even faster, we need indexes. Hence, MemSQL 7.0 introduces hash indexes on columnstores. You can create a hash index on any individual column in a columnstore table. Filters on these columns can thus be solved using the index. Seeks of a hash index to solve selective filters can identify the positions of the qualifying rows at a speed that’s orders of magnitude faster than a scan. Then, once the row positions are known, the new subsegment access capability is used to seek into each column referenced by the query to retrieve the data for the qualifying records.

Multi-column filters can also be solved with multiple hash indexes via index intersection, i.e. intersecting the rowid lists from multiple indexes to produce a final set of qualifying rowids.
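As a hedged illustration of both ideas – the table, columns, and values below are hypothetical, using the same index syntax as the test schema shown later in this post:

-- Hypothetical columnstore with a hash index on each filter column
create table events (
  user_id bigint,
  device_id bigint,
  ts datetime,
  shard key(user_id), key(user_id) using clustered columnstore,
  key(user_id) using hash, key(device_id) using hash
);

-- Each single-column filter can be solved by a hash index seek; for the
-- two-column filter, the rowid lists from both indexes can be intersected.
select * from events where user_id = 42 and device_id = 7;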

Fine-Grain Locking for Columnstores

Now that highly-selective, OLTP-style queries can be processed fast on columnstores, what else stands in the way of highly efficient read/write access for hundreds or even thousands of transactions per second? Anything that could make these fine-grain read and write transactions wait, of course. What might they wait for? Each other.

In MemSQL 6.8 and earlier, columnstore updates that conflict with other updates may have to wait for each other, because locking happens at the granularity of a million-row segment. Since this granularity is somewhat coarse, it can limit total concurrency. We’ve dramatically improved concurrency in 7.0 via row-level locking for columnstores.

Performance Gains

To measure the performance of columnstore seeks through hash indexes, I created a table with 1.074 billion (1024 × 1024 × 1024) rows, which has two columns and a hash index on each column. The two columns are in completely different orders, and each value in each column is unique, or close to it. The schema of this table is as follows:

create table f(a bigint, b bigint,
 shard key(a), key(a) using clustered columnstore, 
 key(a) using hash, key(b) using hash
);

I created the same table on a 7.0.4 cluster and a 6.8.9 cluster, except that on the 6.8.9 cluster, the hash keys were omitted. I wrote a stored procedure to seek N times into the index and measure the average time to run the query (ignoring client-to-server communication time). This gave the following performance results.

Column being seeked    Table size (millions of rows)    Runtime (ms), 6.8, no hash indexes    Runtime (ms), 7.0, w/hash indexes    Speedup (times)
a                      1,074                            6.70                                  2.46                                 2.72X
b                      1,074                            271                                   2.54                                 107X

Notice the dramatic 107X speedup for seeking on column b, which was not ordered by the columnstore key (a). This shows the combined benefit of hash indexes and subsegment access. The key point is that the seek time on column b has gone from being too slow for a highly concurrent OLTP application (271ms) in Version 6.8, to fast enough for OLTP (2.54ms) in Version 7.0, with hash indexes. This greatly expands the kinds of workloads you can run on columnstore tables.

The MPSQL code used for these tests is given in Appendix B. You may wish to try variations of the test, such as adding more columns to the table. Of course, adding more columns will slow down the seek time, since each column will have to be accessed. But even with dozens of columns, seek time can be in single-digit milliseconds, depending on your hardware — good enough to no longer be a bottleneck for many concurrent operational applications. Moreover, even for very wide tables, say with hundreds of columns, if the query selects from one to a few dozen columns, subsegment access and hash indexes can provide single-digit millisecond row lookup times.

Design and Operational Improvements

MemSQL offers both rowstore and columnstore tables in a single database software offering, with JOINs and other operations working smoothly across the two table types. (And more so in MemSQL 7.0.) This makes database design and operations simpler than with some competing solutions.

However, even with MemSQL, there is still complexity in choosing which table type, or set of table types, to use to solve various problems. And operations work becomes more complex as more separate tables and different table types are used.

For many use cases, SingleStore simplifies the database design process, and therefore reduces the operations work needed on the resulting implementation. Much of this is already realized by SingleStore Phase 1 in MemSQL 7.0, with room to grow as SingleStore is more fully realized in future releases.

The specifics of how these improvements affect our customers depend, of course, on the specifics of what they need to accomplish. In general, improvements include:

  • Rowstore covers more requirements. Customers doing OLTP and HTAP work are sometimes forced to use columnstore for part or all of their needs because the amount of RAM required is too expensive, or impractically large. The null compression in SingleStore Phase 1 helps considerably – as we’ve mentioned, by a factor of 2 for many of our customers. Future improvements in rowstore compression will extend these gains. (Intel’s Optane memory can help too.)
  • Columnstore covers more requirements. Customers who have rowstore-type problems to solve, but are forced to use columnstore for cost or practicality reasons – or for whom columnstore is a good fit, except for a few impracticably slow operations – will find columnstore performance much improved. This improvement is quite strong in the initial implementation of SingleStore, and also has more room to grow in future versions.
  • Fewer two-headed beasts. Customers often have to put new, active data in rowstore for speed, then move it to columnstore as it ages for cost reasons, despite the complexity this adds to their applications. Or they put a subset of data in a logical table in rowstore for some operations, and the full data set in columnstore for broad analytics purposes, duplicating substantial amounts of data across table types. With SingleStore, the need to use multiple table types often goes away. Future versions will reduce this need further, as MemSQL absorbs the former need for complex table architectures into enhancements in the database software.

Summary and Where We Go From Here

MemSQL 7.0 introduces two major categories of enhancements that allow more data to be processed efficiently for OLTP, HTAP, and analytics workloads:

  1. Sparse rowstore compression to improve TCO for OLTP-style applications that need very fast and predictable row lookup times on multiple access paths, and
  2. Subsegment access, hash indexes, and improved concurrency for columnstore tables, to enable more OLTP-style work to function efficiently with columnstore tables that are larger than the available RAM.

The SingleStore features we’ve built for 7.0 are a down payment on additional capabilities we envision in future releases, including:

  • Hybrid row/columnstore tables that are an evolution of our existing columnstores, allowing you to tune the amount of RAM used for the updatable, in-memory rowstore segment, and have more granular index control over all parts of the data. The goal here is that customers won’t have to try to store the data in two different ways, under application control, thus simplifying application development
  • Unique hash indexes and unique constraints on columnstores
  • Multi-column hash indexes on columnstores
  • Ordered secondary indexes on columnstores
  • Rowstores to cache results of columnstore seeks
  • Automatic adaptation of the size of the updatable rowstore segment of a columnstore
  • A columnstore buffer pool managed directly by MemSQL, rather than simply relying on the file system buffer pool
  • Rowstore compression for zeros and blanks, and variable width encoding of small values

As you can see, the MemSQL 7.0 release delivers a big part of the SingleStore vision. And if you get on board with MemSQL now, you can expect the speed, power, and simplicity of MemSQL to continue to grow and improve. MemSQL SingleStore will truly become a store to rule them all!

Appendix A: Data Generation SPs for SPARSE Compression Measurement

set sql_mode = PIPES_AS_CONCAT;

-- After defining the procedures below, load t each way for comparison:
-- load t using SPARSE compression
call buildTbl(300, 0.666, 1000*1000, "compression = sparse");
-- load t with no compression
call buildTbl(300, 0.666, 1000*1000, "compression = none");

delimiter //
create or replace procedure buildTbl(numCols int, sparsePercent float, nRows bigint,
  compression text)
as
begin
  drop table if exists t;
  call createTbl(numCols, compression);
  call loadTbl(numCols, sparsePercent, nRows);
end //

create or replace procedure createTbl(numCols int, compression text) as 
declare stmt text;
begin
  stmt = "create table t(";
  for i in 1..numCols - 1 loop
    stmt = stmt || "c" || i || " double, ";
  end loop;
  stmt = stmt || "c" || numCols || " double) " || compression || ";";
  execute immediate stmt;
end //

delimiter //
create or replace procedure loadTbl(numCols int, sparseFraction float,
  nRows bigint) as
declare stmt text;
declare q query(c bigint) = select count(*) from t;
declare n int;
begin
  stmt = "insert into t values(";
  for i in 1..ceil(sparseFraction * numCols) loop
    stmt = stmt || "NULL,";
  end loop;
  for i in (ceil(sparseFraction * numCols) + 1)..numCols - 1 loop
    stmt = stmt || "1,"; 
  end loop;
  stmt = stmt || "1);";
  execute immediate stmt;
  n = scalar(q);
  -- Double table size repeatedly until we exceed desired number 
  -- of rows.
  -- (The loop below is reconstructed; the original listing was truncated
  -- after "while n".)
  while n < nRows loop
    insert into t select * from t;
    n = scalar(q);
  end loop;
end //
delimiter ;

Appendix B: Exercising Columnstore Hash Indexes

create database if not exists db1;
use db1;
drop table if exists f;
 
-- hash indexes on columnstores
create table f(a bigint, b bigint,
 shard key(a), key(a) using clustered columnstore, key(a) using hash,
 key(b) using hash
);
/*
-- Create table without hash indexes, for comparison
create table f(a bigint, b bigint,
 shard key(a), key(a) using clustered columnstore
);
*/
 
-- This will keep increasing the size of f until it has a t least n rows.
-- the b column is a hash function of the a column and will 
-- be unique for ascending
-- a values 1, 2, 3, ... up until 941083987-1. The data in the b column will
-- be in a different order than the a column. So you can seek on a value of
-- a and a value of b separately to show the benefit of hash indexes without
-- worrying about whether the sort key on a is skewing performance on seeks
-- on column b.
delimiter //
create or replace procedure inflate_data(n bigint) as
declare q query(c bigint) = select count(*) as c from f;
declare tbl_size bigint;
begin
tbl_size = scalar(q);
if tbl_size = 0 then
 insert f values(1, 1);
 tbl_size = 1;
end if;
-- (The loop below is reconstructed; the original listing was truncated
-- after "while (tbl_size". Each pass doubles the table: new a values
-- continue the ascending sequence, and b is a multiplicative hash of a
-- modulo the prime 941083987. The multiplier 104729 is an assumed
-- stand-in for whatever hash function the original used.)
while (tbl_size < n) loop
  insert into f
    select a + tbl_size, (a + tbl_size) * 104729 % 941083987
    from f;
  tbl_size = scalar(q);
end loop;
end //
delimiter ;

-- Build the ~1.07-billion-row (1024 * 1024 * 1024) table used above.
call inflate_data(1024*1024*1024);
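-- A hedged sketch (not the original benchmark procedure) of timing n
-- repeated seeks on column a. The procedure name, the probed value, and
-- the use of now(6)/timestampdiff are assumptions, not the article's code.
delimiter //
create or replace procedure seek_test(n int) as
declare q query(c bigint) = select count(*) from f where a = 12345;
declare v bigint;
declare t0 datetime(6);
declare t1 datetime(6);
begin
  t0 = now(6);
  for i in 1..n loop
    v = scalar(q);
  end loop;
  t1 = now(6);
  -- report the average time per seek, in microseconds
  echo select timestampdiff(microsecond, t0, t1) / n as avg_seek_us;
end //
delimiter ;

call seek_test(1000);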

            

Replication at Speed – System of Record Capabilities for MemSQL 7.0

Feed: MemSQL Blog.
Author: Nate Horan.

System of record capability is the holy grail for transactional databases. Companies need to run their most trusted workloads on a database that has many ways to ensure that transactions are completed and to back up completed transactions, with fast and efficient restore capability. MemSQL 7.0 includes new features that deliver very fast synchronous replication – including a second copy in the initial write operation, atomically – and incremental backup, which offers increased flexibility and reliability. With these features, MemSQL 7.0 offers a viable alternative for Tier 1 workloads that require system of record capability. When combined with MemSQL SingleStore, and MemSQL’s long-standing ability to combine transactions and analytics on the same database software, MemSQL 7.0 now offers unprecedented design and operational simplicity, lower costs, and higher performance for a wide range of workloads.

The Importance of System of Record Capability

The ability to handle system of record (SoR) transactional workloads is an important characteristic for a database. When a database serves as a system of record, it should never lose a transaction that it has told the user it has received.

In providing system of record capability, there’s always some degree of trade-off between the speed of a transaction and the degree of safety that the system provides against losing data. In MemSQL 7.0, two new capabilities move MemSQL much further into SoR territory: fast synchronous replication and incremental backups.

Synchronous replication means that a transaction is not acknowledged as complete – “committed” – until it’s written to primary storage, called the master, and also to a replica, called the slave. In MemSQL 7.0, synchronous replication can be turned on with a negligible performance impact.

Synchronous durability – requiring transactions to be persisted to disk before a commit – is an additional data safety tool. It does take time, but writing to disk on the master happens in parallel with sending the transaction to the slave; there is an additional wait while the transaction is written to disk on the second system. The performance penalty is, of course, greater than for synchronous replication alone.
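As a minimal sketch – assuming MemSQL 7.0’s per-database durability and replication clauses; check the documentation for the exact spelling in your version – the combination walked through below, sync replication with async durability, is requested when the database is created:

-- Assumed clause syntax: sync replication with async durability
create database app_db with async durability sync replication;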

Fast sync replication in MemSQL 7.0 makes it possible to run high availability with a small performance hit.

In addition to synchronous replication and synchronous durability, a system of record database needs flexible restore options. In MemSQL 7.0, we add incremental backups, which greatly increase backup flexibility. An incremental backup stores only the data changed since the last backup, so the time it takes to run a backup, and the resources required, are significantly reduced. That allows a user to run backups far more often, without additional impact on the system, which in turn means a shorter RPO (Recovery Point Objective) – less data is lost in the event of an error that requires restoring a backup.

The rest of this blog post focuses on synchronous replication, a breakthrough feature in MemSQL 7.0.

Sync Replication in Action

Synchronous replication in releases before MemSQL 7.0 was very deliberate, and quite slow. Data was replicated as it was committed, so if there were lots of small commits, you paid the overhead of sending many separate transactions, each with a small amount of data, over the network. In addition, data sent to the slave partition was replayed into memory on that system, then acknowledged by the slave to the master – and, finally, acknowledged in turn to the user. This was slow enough to restrict throughput in workloads that did many writes.

In MemSQL 7.0, we completely revamped how replication works. Commits are now grouped to amortize the cost of sending data on the network. The replication is also done lock-free. Lastly, the master doesn’t have to wait for the slave to replay the changes. As soon as the slave receives the data, an acknowledgement is sent back to the master, who then sends back success to the user.

Because MemSQL is a distributed database, it implements high availability by keeping multiple copies of the data and failing over to another copy when it detects that a machine has failed. The following steps demonstrate why a single failure – a network partition, a node reboot, a node running out of memory or disk space – can’t cause data to be lost. In the next section, we’ll describe how this failure-resistant implementation is also made fast.

To provide failure resistance, here are the steps that are followed:

  1. A CREATE DATABASE command is received. The command specifies Sync Replication and Async Durability. MemSQL creates partitions on the three leaves, calling the partitions db_0, db_1, and db_2. (In an actual MemSQL database, there would be many partitions per leaf, but for this example we use one partition per leaf to make it simpler.)
  2. For redundancy 2 – that is, high availability (HA), with a master and slave copy of all data – the partitions are each copied to another leaf. Replication is then started, so that all changes on the master partition are sent to the slave partition.
  3. An insert hits db_1. The update is written to memory on the master, then copied to memory on the slave.
  4. The slave receives the page and acknowledges it to the master. The master database acknowledges the write to the master aggregator, which finally acknowledges it to the user. The write is considered committed.

This interaction between the master partition and its slave makes transactions failure-resistant. If either machine were to fail, the system still has an up-to-date copy of the data. It’s fast because of the asynchronous nature of log replay on the slave system: the acknowledgement to the master takes place after the log page is received, but before it’s replayed on the slave.

Making Log Page Allocation Distributed and Lock-Free

There’s still a danger to this speedy performance. Even if the number of transactions is large, if the transactions are all relatively small, they can be distributed smoothly across leaves, and fast performance is maintained. However, occasional large transactions – for instance, loading a large block of data – can potentially prevent any smaller transactions from occurring until the large operation is complete.

The bottleneck doesn’t occur in the actual data updating, as that can be distributed. It occurs in the allocation of log pages. So, to make synchronous replication fast in MemSQL, we made log reservation and replication lock-free, reducing blocking. The hardest part of building the new sync replication was making log page allocation distributed and lock-free. Several pieces work together to prevent locking.

The first part to understand is the replication log. A transaction interacts with the replication log in three steps: reserve space, write out its log record(s), and commit.

The replication log is structured as an ordered sequence of 4KB pages, each of which may contain several transactions (if transactions are small), parts of different transactions, or just part of a transaction (if a transaction is > 4KB in size). Each 4KB page serves as a unit of group commit, reducing network traffic – full pages are sent, rather than individual transactions – and simplifying the code needed, as it operates mostly on standard-size pages rather than on variable-sized individual transactions.

To manage pages, each one is identified by a Log Sequence Number (LSN), a unique ID that starts at zero for the first page and increments by one with each subsequent page. Each page has a 48-byte header containing two LSNs: the LSN of the page itself, and the committed LSN – the LSN up to which all pages had been successfully committed at the time the page in question was created. So a page could have LSN 53 and record a committed LSN of 48, meaning every page up to and including page 48 had been committed, but page 49 (and possibly other, higher-numbered pages) had not been.

When a transaction wants to record what it is doing in the log, it calls a reserve API, which gives it logical space in the log and enough physical resources to guarantee the operation cannot fail, barring the node itself crashing. Next, the transaction writes all of its data into the reserved space in the log. Finally, it calls the commit API, which is basically a signal to the log that the data is ready to be shipped to the slave machine, to disk, or to both.

With this background, we can look at how the log works internally. The log contains a 128-bit structure called the anchor, which we use to implement a lock-free protocol for log reservations. The anchor consists of two 64-bit numbers: the LSN of the current page in the log, and the offset within that page where the next payload of data can be written.

All threads operate on the anchor using compare-and-swap, a CPU primitive that lets you check that a particular location in memory holds an expected value and, if it does, change it – atomically, in a single instruction. It is very useful for lock-free operations, as we will see in a moment.

MemSQL 7.0 Sync Replication Demonstration

Let’s say we have four threads, and the anchor’s current LSN is 1000. For simplicity, we’ll follow only the LSN half of the anchor, not the page offset.

  1. With all compare and swaps, the threads working on trying to write to the log start by loading the most recent LSN, which has the value 1000.
  2. Each thread reserves the number of pages it needs for the operation it’s trying to commit. In this case, Thread 1 is reserving only part of a page, so it wants to advance the most recent LSN to 1001, while Thread 2 is reserving a large number of pages and wants to advance it to 2000. Both threads attempt to compare and swap (CAS) at the same time. In this example, Thread 2 gets there first; it expects the LSN to be 1000, which it is, so it performs the swap, replacing the LSN in the anchor with 2000. It now owns this broad swathe of pages and can stay busy with it for a long time.
  3. Then Thread 1 reads the anchor expecting it to be 1000. Seeing that it’s a different number, 2000, the compare fails.
  4. Thread 1 tries again, loading the new value of 2000 into its memory. It then goes on to succeed.

It’s important to note that the CAS operations are fast. Once a thread is successful, it starts doing a large amount of work to put its page together, write the log to memory, and send it. The CAS operation, by comparison, is much faster. Also, when it does fail, it’s because another thread’s CAS operation succeeded – there’s always work getting done. A thread can fail many times without a noticeable performance hit, for the thread or the system as a whole.

By contrast, in the previous method that MemSQL used, it was as if there were a large mutex (lock) around the LSN value. All the threads were forced to wait, instead of getting access and forming their pages in parallel. Compared to the new method, the older method was very slow.

On failovers, the master data store fails, and the slave is promoted to master. The new master now replays all the updates it has received.

It is possible that the old master received a page that was not also forwarded to the slave, because that’s the point at which the primary failed. However, with synchronous replication this is no problem – the page that only got to the master would not have been acknowledged to the user. The user will then retry, and the new primary will perform the update, send it to the new slave, receive an acknowledgement of successful receipt, and acknowledge to the user that the update succeeded.

Performance Impact

In the best case, there’s one round trip required per transaction, from user to master to slave, and back from slave to master to user. This is a low enough communication overhead that it is mostly amortized across other transactions doing work.

As we mentioned above, the cost of turning on synchronous replication is a single-digit percentage impact on TPC-C, a high-concurrency OLTP benchmark. This makes the performance hit of adding a much better data consistency story effectively negligible for most users!

The steps above show highlights, but there are many other interesting pieces that make the new synchronous replication work well. Just to name them, these features include async replication; multi-replica replication; chained replication, for higher degrees of HA; blob replication; garbage collection on blobs; divergence detection; and durability, which we’ve mentioned. Combined, all of these features keep the impact of turning on sync replication very low, and give both the user and the system multiple ways to accomplish shared goals.

Conclusion

Synchronous replication without compromising MemSQL’s very fast performance opens up many new use cases that require system of record (SoR) capability for use with MemSQL. Also, the incremental backup capability, also new in MemSQL 7.0, further supports SoR workloads.

We are assuming here that these SoR workloads will run on MemSQL’s rowstore tables, which are kept in memory. Both rowstore and columnstore tables also support different kinds of fast analytics.

So MemSQL can now be used for many more hybrid use cases in which MemSQL database software combines transactions and analytics, including joins and similar operations across multiple tables and different table types.

These hybrid use cases may get specific benefits from other MemSQL features in this release, such as MemSQL SingleStore. Our current customers are already actively exploring the potential for using these new capabilities with us. If you’re interested in finding out more about what MemSQL can do for you, download the MemSQL 7.0 Beta or contact MemSQL today.

Announcing MemSQL Helios: The World’s First SingleStore, System of Record Cloud Database

Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL extends our operational data platform with an on-demand, elastic cloud service, and new features to support Tier 1 workloads.

MemSQL is proud to announce two exciting new product releases today: MemSQL Helios, our on-demand, elastic cloud database-as-a-service, and MemSQL 7.0 Beta 2, the next major release of our database engine, featuring MemSQL SingleStore – a breakthrough new way of managing data – and new features to fully support Tier 1, system of record workloads.

With MemSQL Helios, you get instant, effortless access to the world’s fastest, most scalable data platform for operational analytics, machine learning, and AI, with platform operations handled by MemSQL. Helios is a fully managed cloud service that provides you with instant access to our best-in-class operational database — on demand and at elastic scale — in public cloud environments around the world.

MemSQL 7.0, which will be generally available in the coming months, brings two exciting new advances: MemSQL SingleStore and key features to make MemSQL “system of record” capable for Tier 1 workloads.

Data platform spending is seeing explosive growth as new applications are built to support real-time decisions, predictive analytics, and automation leveraging machine learning and AI. With this growth, and the desire for greater flexibility and cost control, more and more data workloads are moving to the cloud; see Gartner’s perspective in this recent analyst research.

The vast majority of MemSQL’s customers are already deploying our database in the cloud, typically using deployment automation or container infrastructure. Many have chosen MemSQL as the cloud migration path from older, on-premises operational database implementations.

The availability of MemSQL Helios, along with the imminent release of SingleStore and the new system of record capabilities, makes MemSQL the ideal choice both for new applications and for moving legacy operational databases to the cloud.

Announcing MemSQL Helios

MemSQL Helios has the best-in-class speed, scale, and robust features you have come to expect from MemSQL, but now available on demand. Customers can focus on using their data to build breakthrough applications and analytical systems instead of mundane infrastructure management and tuning.

With MemSQL Helios, there’s no software to deploy or manage. The most important steps in getting started are choosing an initial deployment size from a pull-down menu, then clicking a button to initiate deployment. Compute and storage resources are automatically assigned, the necessary software is installed and configured, and clusters are created, ready to store data — all in a few minutes. High availability is built in, with MemSQL handling data backup and, if needed, restore operations.

MemSQL Helios provides a managed service for a fast, scalable SQL database.
MemSQL Helios handles infrastructure monitoring, cluster configuration, management, maintenance, and support.

Benefits of MemSQL Helios include:

  • Effortless deployment and management. As we have all come to expect from cloud services, deployment and upgrades are built in. With MemSQL Helios you get the full benefits and capabilities of the MemSQL data platform without having to worry about deployment, management, or maintenance. There’s no need to rack servers, script deployments, or manage VMs.
  • Avoid cloud lock-in through multi-cloud flexibility. Helios is available today on Amazon Web Services and Google Cloud Platform. MemSQL operates exactly the same whether deployed on-premises on bare metal, across on-premises or cloud infrastructure, using the MemSQL Kubernetes Operator, or within the Helios service. You can use MemSQL to support a broad set of operational and analytical use cases, allowing for a simple, single platform across applications, analytical systems, and cloud deployments.
  • Superior TCO. Compared to both legacy databases and proprietary databases from the cloud service providers, MemSQL Helios offers superior total cost of ownership (TCO). MemSQL offers high performance, scalability, ANSI SQL support, and the ability to replace traditional databases like Oracle Exadata and SAP Hana, at a fraction of the cost. When compared to the proprietary databases offered on Amazon Web Services and Google Cloud Platform, MemSQL’s unique architecture and high-performance query engine mean that many operational analytics workloads run with far less resource consumption, offering significant cost savings.

MemSQL Helios is available in limited public preview today. The service is secure, stable, and ready for your production workloads. There is a time-limited trial open to all users; however, Helios is initially available only to a limited number of customers to purchase. Try it for yourself now, or request a technical deep dive with a product specialist. If you are interested in purchasing MemSQL Helios immediately, please contact us to request an invitation.

Also, see our getting started videos for MemSQL Helios on YouTube.

Introducing SingleStore: Breakthrough Data Management from MemSQL

With legacy databases, customers had their data in silos: transaction data in rowstore tables, analytics and data warehousing systems using columnstore tables, and extract, transform, and load (ETL) operations to bridge the gap. MemSQL brings rowstore and columnstore tables together in a single database; ETL is eliminated, and SQL queries can combine data from both types of tables. With SingleStore, in future versions of MemSQL, we will eliminate the need to choose one table type or another. The system will optimize storage and data access for you.
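For instance, a single query can join the two table types directly. A minimal sketch, with hypothetical tables (a table created without a clustered columnstore key is a rowstore):

-- orders: in-memory rowstore; order_history: disk-based columnstore
create table orders (id bigint primary key, user_id bigint, total double);
create table order_history (
  id bigint, user_id bigint, total double,
  shard key(user_id), key(user_id) using clustered columnstore
);

select o.user_id, count(*) as n, sum(h.total) as lifetime_total
from orders o join order_history h on o.user_id = h.user_id
group by o.user_id;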

MemSQL is seeking to eliminate data duplication, reduce complexity, and cut total cost of ownership with the launch of MemSQL SingleStore, a new breakthrough in data architecture. In the future, SingleStore will offer the fastest possible performance, at the lowest possible cost, for every kind of workload – transactional, analytical, and hybrid – by storing all data in a single table type.

In MemSQL 7.0, SingleStore is delivered through improvements to both rowstore and columnstore tables, allowing each to handle workloads that previously only worked well on the other. Rowstore tables – used for transactions, for seeking a few rows of data, and for analytics on rapidly-changing data – get null elimination via the new sparse compression feature. In null elimination, fields that are subject to sparse compression are labeled as SPARSE when the table is created.
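A minimal sketch of such a declaration – the table and columns here are hypothetical, and the SingleStore article above also shows a table-wide compression = sparse form:

-- Hypothetical rowstore with individual nullable columns labeled SPARSE
create table vitals (
  id bigint not null,
  systolic double sparse,
  diastolic double sparse,
  heart_rate double sparse
);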

For each sparse field, a flag is created, half a byte wide. When the field’s value is null, the flag is set by the database software, and no value is stored for that column, saving that amount of storage – often 4, 8, or 16 bytes per null. Null elimination reduces the memory footprint of many rowstore tables by 50% or more, cutting memory usage and costs by half or more.

Rowstore compression reduces storage space and TCO due to memory usage.
Setting the SPARSE flag causes appropriate fields to have null values indicated by flags, rather than by a null value taking up the normal field width.

Also in MemSQL 7.0, columnstore tables – highly useful for most analytics purposes, but much harder to update efficiently than rowstore tables – get multiple secondary indexes, allowing fast seeks on multiple access paths, and locking at row level, which enables higher concurrency of updates and deletes.

With the initial implementation of MemSQL SingleStore delivered in MemSQL 7.0, you can use columnstore for more workloads than you could before – workloads that previously required rowstore. Using columnstore for these additional workloads means lower total cost of ownership (TCO), while still meeting service level agreements (SLAs) and user expectations. In addition, for the workloads that still require the updating capabilities and even faster performance of rowstore, compression lowers TCO for those workloads as well.

New Features Enable “System of Record” Capability

MemSQL 7.0 also introduces two key availability features which combine to enable Tier 1, system of record workloads: much faster sync replication, turned on by default, and incremental backup.

Fast sync replication introduces a high-availability replication option — the ability to always keep a replicated, live copy of your database as it’s receiving updates. Together with the availability of sync durability (copy to disk required for an update to be logged), long supported in MemSQL, sync replication enables the level of data safety required for Tier 1, system of record workloads.

Fast sync replication in MemSQL 7.0 makes it possible to run high availability with a small performance hit.

Incremental backup extends the software’s capabilities beyond the ability to do full backups, to include regular backups of changed data from select time frames, and the ability to restore reliably from the previous full backup plus incremental backups.
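As a sketch of that flow – assuming the with init / with differential backup clauses for the initial full backup and the incrementals on top of it; the exact syntax and supported storage targets may differ by version:

-- Assumed syntax: one initial full backup, then periodic incrementals
backup database app_db with init to "/backups/app_db";
backup database app_db with differential to "/backups/app_db";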

Together, the ability to regularly back up recently changed data and the ability to synchronously replicate your data mean that MemSQL can now be trusted for critical data workloads.

Get Started Today

Helios is now available in limited public preview. You can try it now for free, or contact us to purchase Helios – availability is limited – or speak to a product specialist for details.

MemSQL Helios is secure, stable, and ready for production immediately. The service is priced based on consumption, with options for both on-demand purchase and discounted prepaid subscriptions. Not all deployment regions are supported initially.

Existing MemSQL customers can download MemSQL 7.0 Beta 2 to try out our new system of record features and SingleStore. If you are interested in evaluating MemSQL 7.0, please contact us to request access. MemSQL 7.0 will be generally available in the cloud and for download later this year. Sign up for our upcoming Helios webinar to learn more.

Case Study: Medaxion Brings Real-Time Decision-Making to MedTech

Feed: MemSQL Blog.
Author: Floyd Smith.

Medaxion has found a solution to the analytics problems that plague companies across the health care sector and beyond. John Toups, CTO of Medaxion, puts it plainly. “By combining the ease of presentation and abstraction of Looker for analytics, with the technical prowess of MemSQL as the database behind Looker, we now have simply the best analytics in healthcare.”

Medaxion has changed the working lives of anesthesiologists, its core customers. These highly skilled professionals must deliver absolutely critical medical care, day after day, while also meeting demanding business requirements. These men and women, who once generated a blizzard of paper at every turn, now “record their anesthetics in Medaxion Pulse,” according to Toups, “and through Looker and MemSQL, that data is now actionable.”

Today, surgical teams and patients who sometimes waited hours for an anesthesiologist to arrive are now helped much faster, with the aid of predictive analytics. “If it’s cold and snowy, some elective surgeries might get cancelled because it’s hard to get around,” Toups shared. “At the same time, ER admissions might spike because of hazards on the road and the outdoors. We can monitor the data, minute by minute, and help put anesthesiologists where they’re needed most” – getting care to more people, faster.

Looker Exposes a Slow Database

Looker is an analytics tool that’s designed to make analytics fast and easy. Looker integrates deeply with multiple data sources, rapidly converting user requests to live results. Looker also provides an abstraction layer, LookML, a modeling language that keeps users from having to learn SQL or other code.

With Looker’s speed, and the ease of use provided by LookML, demand for analytics at Medaxion rose sharply. But increased demand meant more queries coming into Looker. And, while Looker itself could handle the volume, MySQL – Medaxion’s previous underlying database – couldn’t keep up.

MemSQL Solves the Problem

Toups and Medaxion had a problem. How did they solve it? Simple: by leaving MySQL and moving to MemSQL.

“Much of what my customers want from analytics is real-time operational information, and there is enormous interest in monitoring and improving quality of care,” said Toups. For instance, anesthesiologists need to be near patients who will be needing surgery. It’s a dispatching problem that is, in a way, similar to the issue faced by Lyft and Uber: have the right person, providing the right service, in the right place, at the right time. But for Medaxion’s anesthesiologist clients, and the patients who need them, the stakes are higher.

With Looker running on MySQL, Medaxion experienced significant problems. “At first, our analytical reporting was incredibly slow,” said Toups.

“When we first started implementing Looker, a couple of years ago, we did a traditional ETL ‘lift and load’ into a MySQL reporting warehouse,” reported Toups. “This resulted in about 600GB of data. Now, on MemSQL, I use columnstore with compression and MemSQL’s native sharding.”

“The underlying data includes measurements such as systolic and diastolic blood pressure, along with heartbeat and respiration and a variety of medical history information, all of which are strongly correlated to one another,” Toups continued. “I use associative techniques to compress that 600GB down to a 20GB dataset. Simply unprecedented compression. We always had similarity on our data, but we couldn’t take advantage of it in the MySQL environment.”

“It was taking 30 or 40 minutes to generate retrospective quality data on the MySQL platform,” Toups said. “Because we couldn’t easily cache the entire data set in memory on MySQL, it was constant disk thrashing. But on MemSQL, the same analysis runs in less than a minute, and most queries return in under a second.”

Toups summed up Medaxion’s progress: “We replaced the middle of the analytics chain with MemSQL. Looker is the body of the rocket ship that carries the information my customers – and their patients – need. MemSQL is the engine of that rocket ship.”

Medaxion saw dramatic performance improvements with the move to MemSQL.

Re-Plumbing the MemSQL+Looker Solution

Solving complicated data flow problems is not always easy. It took a certain amount of expertise and effort to find the right solution to specific problems.

Medaxion wanted a solution that would be easy to implement. It wanted as little change as possible in its pre-existing processing architecture. It didn’t want to have architects working for months, nor did it want to end up with a big team of operations people to monitor and manage a complicated solution.

MemSQL frees Medaxion from a lot of operational overhead, while empowering the anesthesiologist users. The technical people at Medaxion also appreciate the architectural purity of MemSQL. “Scale can be difficult,” said Toups. “But scale done well is a joy.”

New Architecture Delivers Results

With MemSQL and Looker, each day sees new operational efficiencies. “Installing and maintaining MemSQL has been a no-effort proposition,” commented Toups. “And, because the previous solution was so slow, we used to do a lot of efficiency programming, indexing, and query planning. With MemSQL, I’ll pre-aggregate when there’s an obvious need. But I can also just let our customers do as they wish with the data.”

The results are impressive. “We’ve worked to reduce the elapsed time to less than 30 seconds between a data event and a reportable fact,” said Toups. “This is an enormous improvement, and we achieved it without much change to the underlying architecture.”

Medaxion Deploys MemSQL Helios

Medaxion has continued to innovate in its use of MemSQL, serving as an early adopter of MemSQL Helios, MemSQL’s elastic cloud service. Medaxion served as a design partner for the new offering, contributing feedback and ideas as they stood their instance up on Helios and moved to production.

“We get all of the advantages of MemSQL, with no increase in cost relative to hosting it ourselves in the public cloud, but tremendous advantages in operational savings and simplicity for the team. We save time and effort on managing our deployment, and MemSQL’s people can often fix issues before we’re even aware there might be a problem. We’ve helped make their product better, and they’ve helped make us even savvier users of their product.”   

Looking Forward

It’s only been six months since Medaxion became a MemSQL customer, and in that time its rapid progress has fundamentally altered the company’s future and the future of its anesthesiologist customers.

And Medaxion is just getting started. By deploying MemSQL and Looker together, Medaxion has enabled a data culture amongst its customers. Now, the anesthesiologists can do their own data mining, with results that are already amazing.

Medaxion is building up a huge base of data to be mined going forward. Predictive analytics, machine learning, and AI can help Medaxion, and its empowered customers, to improve both business practice and medical practice around the crucial discipline of anesthesiology.

There may also still be room for improvement in Medaxion’s data architecture. Toups and others at Medaxion are actively investigating the potential for far-reaching change in how Medaxion operates.

MemSQL has helped Medaxion to save time, save money, and dramatically improve outcomes in the short term. At the same time, Medaxion is opening the door to even bigger, and better, changes in the future.

Case Study: Emmy-Winning SSIMWAVE Chooses MemSQL for Scalability, Performance, and More

Feed: MemSQL Blog.
Author: Floyd Smith.

SSIMWAVE customers – from film producers to network engineers to media business executives – work to some of the highest standards in the world. They demand to work with the best. SSIMWAVE also works at that level, as the company’s 2015 Emmy award for engineering achievement demonstrates. And they hold their technology vendors and partners to the same high standards. For SSIMWAVE’s rather comprehensive analytics needs, only one database makes the grade: MemSQL.

SSIMWAVE has unique technology and unique analytics needs. Its software mimics the human visual system, distilling the quality of a video stream, as perceived by viewers, into a single viewer score that correlates strongly with how actual human beings rate the video. Video delivery systems can then be architected, engineered, and configured to manage against this score. This allows SSIMWAVE users to make informed trade-offs between resources and perceived quality, automatically or manually, and all in real time.

SSIMWAVE Cracks the Code

According to Cisco, video data accounted for 73 percent of Internet traffic in 2017, a share that is projected to grow to 82 percent by 2022. Maximizing the quality of this video content, with the least bandwidth usage and at the lowest cost possible, is one of the most important engineering, business, and user experience issues in the online world.

The barrier to balancing video quality against compression has been that only human beings could accurately assess the quality of a given video segment once it was compressed and displayed on different devices. Further complicating the picture (no pun intended), people asked to rate video quality give different answers, with varying levels of consistency over time, so a panel of several people was needed to render a useful assessment. As a result, a software engineer or operations person wanting to process and deliver video within acceptable levels had no reliable, affordable way of knowing how much compression was just enough – saving bandwidth without seriously compromising the viewer’s experience.

SSIMWAVE uses MemSQL for video streaming analytics to drive improved customer experiences.
The SSIMWAVE website demonstrates what the company’s breakthrough algorithm and technology can enable for the media & entertainment industry.

SSIMWAVE appears to have cracked the code on this problem with its proprietary SSIMPLUS® algorithm, described on their website, which provides capabilities not found elsewhere. The company’s technology assesses video quality with a single, composite number that achieves a correlation greater than 90 percent between machine assessment and subjective human opinion scores. With this technology, video professionals can make much more efficient use of network resources, while consistently maintaining the desired level of quality.

SSIMWAVE users are able to achieve significant bandwidth savings by configuring delivery against a viewer score. The company’s customers include the largest IPTV providers in the US and Canada, and its platform affects the streams of tens of millions of subscribers in North America. MemSQL already has a strong position in media and communications solutions, including having Comcast as a customer, and it was natural for SSIMWAVE to consider MemSQL for its own analytics needs.

SSIMWAVE’s Need for State-of-the-Art Analytics

SSIMWAVE’s business is, in the end, all about numbers. For the company to deliver a complete and reliable service, it needs a high-performance database that can store very large quantities of data and respond very quickly to ad hoc analytics queries.

SSIMWAVE has ambitious analytics goals. In addition to comprehensive internal requirements, it needs to offer state-of-the-art analytics capabilities to customers.

SSIMWAVE needs both up-to-the-moment reporting, on data volumes that will increase exponentially as new data streams in, and the ability to retain all that data to meet customer service level agreements (SLAs).

SSIMWAVE Chooses MemSQL

SSIMWAVE was ready for an innovative solution, and compared the three technologies that seemed most likely to meet its requirements.

The database assessment was led by Peter Olijnyk, Director of Technology at SSIMWAVE. Peter has 20 years’ experience as a software developer, architect, and engineering leader, along with a passion for playing guitar in his rock band.

Olijnyk and his team at SSIMWAVE found the choice relatively easy, and decided on MemSQL. Among the key considerations were:

  • Scalability. SSIMWAVE needs a seamlessly scalable database, as its business needs may drive it to arbitrarily large scale requirements. MemSQL’s distributed architecture fits the bill.
  • Performance. SSIMWAVE needs high performance for its own internal needs, but also for its customers, who will be using the SSIMWAVE data architecture.
  • Ease of setup. SSIMWAVE was able to use MemSQL’s documentation to get its first cluster running easily, in a matter of hours. This ease of setup and comprehensibility will extend to SSIMWAVE customers.
  • Direct SQL queries. SSIMWAVE needs a database that integrates with third-party tools and supports direct SQL queries that are fast and responsive.
  • Rowstore and columnstore support. Although its current use case is “99 percent columnstore,” SSIMWAVE likes having the door open to rowstore use cases with MemSQL.
  • Data streaming architecture support. MemSQL works smoothly with leading stream-processing software platforms, including support for exactly-once updates. The benefit of MemSQL is its ability to scale out, enabling very high levels of performance.
  • Wide range of integrations. MemSQL supports a wide range of integrations, including the MySQL wire protocol and other standard interfaces. “We use the ODBC interface in a standard way,” said Olijnyk. “We have found MemSQL’s ODBC interface to be customizable and flexible.”
MemSQL interoperates with a wide range of other technologies for ingest and analytics. This wide range of integrations is important to SSIMWAVE and other MemSQL customers.

“The main thing that tipped the scales was the ease of use and out-of-box experience,” according to Olijnyk. “We went from reading about MemSQL to having clusters running in a matter of hours.”

“We implement real-time data streaming and MemSQL for ingest and query response,” he reports. “Also, we recently needed a way to share state across our architecture. We considered ZooKeeper and Redis, but we ended up using MemSQL rowstore, because it gives us such high performance.”

This architectural thinking was never far from Olijnyk’s mind. “We prioritize ease of use and ease of installation. We have to concern ourselves with this approach; otherwise, costs and support effort would rise quickly. The fewer technicians we have to manage to support our customers, the better.”

SSIMWAVE Chooses Helios

SSIMWAVE was able to move quickly and smoothly into production to provide its service at scale to over-the-top (OTT) media companies. A few months after deployment, SSIMWAVE moved to MemSQL Helios, the new, high-performance, elastic cloud database service. With MemSQL Helios, SSIMWAVE gets the same high performance as before, delivered as a managed cloud service, but with much less operations effort.

“Our focus is to make sure each video stream delivered makes its way to a happy customer. SSIMWAVE tunes video content quality to balance feasibility with the best experience possible. We moved to MemSQL Helios as soon as it was available, because it helps us maintain that focus,” according to Olijnyk. He had cited ease of use and the out-of-the-box experience as drivers in the original move to MemSQL. With MemSQL Helios, both are improved further.

To see the benefits of MemSQL for yourself, you can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

The Beauty of a Shared-Nothing SQL DBMS for Skewed Database Sizes


Feed: MemSQL Blog.
Author: Eric Hanson.

The limitations of a typical, traditional relational database management system (RDBMS) have forced all sorts of compromises on data processing systems: from limitations on database size, to the separation of transaction processing from analytics. One such compromise has been the “sharding” of various customer accounts into separate database instances, partly so each customer could fit on a single computer server – but, in a typical power law, or Zipf, distribution, the largest databases don’t fit. In response, database administrators have had to implement semi-custom sharding schemes. Here, we describe these schemes, discuss their limitations, and show how an alternative, MemSQL, makes them unnecessary.

Prologue

The primary purpose of many database implementations is to respond to a large volume of queries and updates, from many concurrent users and applications, with short, predictable response times. MemSQL is becoming well-known for delivering under these demanding conditions, in particular for large databases, bigger than will fit on a single server. And it can do so for both transactional (online transaction processing, or OLTP) and analytical (online analytics processing, or OLAP) applications.

These characteristics make MemSQL attractive for both:

  • In-house development teams that create different databases for different customers, and
  • Cloud service providers that create separate databases for each of their customers…

…when even just one, or a few, of the databases can be very large.

What follows are tales of two different database application architects who face the same problem—high skew of database size for different customer data sets, meaning a few are much larger than others—and address this problem in two different ways. One tries to deal with it via a legacy single-box database and through the use of “clever” application software. The other uses a scalable database that can handle both transactions and analytics—MemSQL. Judge for yourself who’s really the clever one.

The Story of the Hero Database Applications Architect

Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They came up with what seemed like a great idea. Each customer’s data would be placed in its own database. Then they would allocate one or more databases to a single-node database server. Each server would handle operational queries and the occasional big analytical query.

When a server filled up, they’d just allocate additional databases to a different server.

Everything was going great during development. The application code only had to be written once, for one scenario — all data for a customer fitting one DBMS server. If a database was big, no problem, just provision a larger server and put that database alone on that server. Easy.

Then they went into production. Everything was good. Success brought in bigger customers with more data. Data grew over time. The big customer data grew and grew. The biggest one would barely fit on a server. The architect started losing sleep. He kept the Xanax his doctor prescribed for anxiety in the top drawer and found himself dipping into it too often.

Then it happened. The biggest customer’s data would not fit on one machine anymore. A production outage happened. The architect proposed trimming the data to have less history, but the customers screamed. They needed 13 months minimum or else. He bought time by trimming to exactly 13 months. They only had two months of runway before they hit the wall again.

He got his top six developers together for an emergency meeting. They’d solve this problem by sharding the data for the biggest customer across several DBMS servers. Most queries in the app could be directed to one of the servers. The app developers would figure out where to connect and send the query. Not too hard. They could do it.

But some of the queries had aggregations over all the data. They could deal with this. They’d just send the query to every server, bring it back to the app, and combine the data in the app layer. His best developers actually thought this was super cool. It was way more fun than writing application software. They started to feel really proud of what they’d built.

Then they started having performance problems. Moving data from one machine to the other was hard. There were several ways they could do things. Which way should they do it? Then someone had the great idea to write an optimizer that would figure out how to run the query. This was so fun.

Around this time, the VP from the business side called the architect. She said the pace of application changes had slowed way down. What was going on? He proudly but at the same time sheepishly said that his top six app developers had now made the leap to be database systems software developers. Somehow, she did not care. She left his office, but it was clear she was not ready to let this lie.

He checked the bug count. Could it be this high? What were his people doing? He’d have to fix some of the bugs himself.

He started to sweat. A nervous lump formed in the pit of his stomach. The clock struck 7. His wife called and said dinner was ready. The kids wanted to see him. He said he’d leave by 8.

The Story of the Disciplined Database Applications Architect

Once there was a database application architect. His company managed a separate database for each customer. They had thousands of customers. They at first considered what seemed like a great idea. Each customer’s data would be placed in its own database on a single-node database server. But, asked the architect, what happens when there’s more data than will fit on one machine?

One of the devs on his team said he’d heard of this scale-out database called MemSQL that runs standard SQL and can do both operational and analytical workloads on the same system. If you run out of capacity, you can add more nodes and spread the data across them. The system handles it all automatically.

The dev had actually tried the free version of MemSQL for a temporary data mart and it worked great. It was really fast. And running it took half the work of running their old single-box DBMS. All their tools could connect to it too.

They decided to run just a couple of MemSQL clusters and put each customer’s data in one database on one cluster. They got into production. Things were going great; business was booming. Their biggest customer got really big really fast. It started to crowd out work for other customers on the same cluster. They could see a problem coming. How could they head it off?

They had planned for this. They just added a few nodes to the cluster and rebalanced the biggest database. It was all done online. It took an hour, running in the background. No downtime.

The VP from the business side walked in. She had a new business use case that would make millions if they could pull it off before the holidays. He called a meeting the next day with the business team and a few of his top developers. They rolled up their sleeves and sketched out the application requirements. Yeah, they could do this.

Annual review time came around. His boss showed him his numbers. Wow, that is a good bonus. He felt like he hadn’t worked too hard this year, but he kept it to himself. His golf score was down, and his pants still fit just like in college. He left the office at 5:30. His kids welcomed him at the door.

The Issue of Skewed Database Sizes

The architects in our stories are facing a common issue. They are building services for many clients, where each client’s data is to be kept in a separate database for simplicity, performance and security reasons. The database sizes needed by different customers vary dramatically, following what’s known as a Zipf distribution [Ada02]. In this distribution, the largest databases have orders of magnitude more data than the average ones, and there is a long tail of average and smaller-sized databases.

In a Zipf distribution of database sizes, the size of the database of rank r follows a pattern like

size(r) = C * r^(-b)

with b close to one, where r is the rank and C is a constant, with the largest database having rank one, the second-largest rank two, and so on.

The following figure shows a hypothetical, yet realistic Zipf distribution of database size for b = 1.3 and C = 10 terabytes (TB). Because of the strong variation among database sizes, the distribution is considered highly skewed.
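
To make the skew concrete, here’s a minimal Python sketch of the distribution above, using the b = 1.3 and C = 10 TB parameters (the ranks printed are illustrative only):

    # Zipf-distributed database sizes: size(r) = C * r^(-b),
    # with C = 10 TB (the rank-1 database) and b = 1.3, as above.
    C_TB = 10.0  # size of the largest (rank-1) database, in TB
    B = 1.3      # Zipf exponent; values near 1 are typical

    def db_size_tb(rank: int) -> float:
        """Size in TB of the database at the given rank (1 = largest)."""
        return C_TB * rank ** (-B)

    for rank in (1, 2, 3, 4, 10, 100, 1000):
        print(f"rank {rank:5d}: {db_size_tb(rank):8.3f} TB")

    # Rank 1 is 10 TB; rank 10 is about 0.5 TB; rank 1000 is roughly 1 GB.
    # A handful of databases dwarf the long tail of small ones.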

If your database platform doesn’t support scale-out, then it may be impossible to handle, say, the largest four customer databases when database size is distributed this way – unless you make tortuous changes to your application code, and maintain them indefinitely.

I have seen this kind of situation in real life more than once. For example, the method of creating an application layer to do distributed query processing over sharded databases across single-node database servers, alluded to in the “hero” story above, was tried by a well-known advertising platform. They had one database per customer, and the database sizes were Zipf-distributed. The largest customers’ data had to be split over multiple nodes. They had to create application logic to aggregate data over multiple nodes, and use different queries and code paths to handle the same query logic for the single-box and multi-box cases.

Their top developers literally had to become database systems software developers. This took them away from application development and slowed the pace of application changes. Slower application changes took money off the table.

An Up-to-Date Solution for Skewed Database Sizes

Writing a distributed query processor is hard. It’s best left to the professionals. And anyway, isn’t the application software what really produces value for database users?

Today’s application developers don’t have to go the route of application-defined sharding and suffer the pain of building and managing it. There’s a better way. MemSQL supports transactions and analytics on the same database, on a single platform. It handles sharding and distributed query processing automatically. It can scale elastically via addition of nodes and online rebalancing of data partitions.
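
For example, the elastic-scaling step amounts to adding nodes and issuing a rebalance. Here’s a minimal sketch – hypothetical host, credentials, and database name – connecting over the MySQL wire protocol with the pymysql driver:

    # Sketch of "add nodes, then rebalance" (hypothetical names).
    # REBALANCE PARTITIONS redistributes a database's partitions
    # across the cluster's leaf nodes as an online operation.
    import pymysql

    conn = pymysql.connect(host="master-aggregator.example.com",
                           user="admin", password="<password>")
    with conn.cursor() as cur:
        # After new leaf nodes have been added via the cluster tooling,
        # spread the biggest database across the expanded cluster:
        cur.execute("REBALANCE PARTITIONS ON biggest_customer_db")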

Some of our customers are handling this multi-database, Zipf-distributed size scenario by creating a database per customer and placing databases on one or more clusters. Even though most of their databases fit on one machine, the biggest ones don’t always fit – yet these customers get a “warm fuzzy feeling” knowing that they will never hit a scale wall, because when a database grows, they can easily expand their hardware to handle it. They only have to write and maintain the app logic one way, one time, for all of their customers. No need to keep Xanax in the top drawer.

MemSQL doesn’t require performance compromises for transactions or analytics [She19]. Quite the contrary, MemSQL delivers phenomenal transaction rates and crazy analytics performance [She19, Han18] via:

  • in-memory rowstore structures [Mem19a],
  • multi-version concurrency control [Mem19c],
  • compilation of queries to machine code rather than interpretation [Mem19e], and
  • a highly-compressed, disk-based columnstore [Mem19b] with
  • vectorized query execution and use of single-instruction-multiple-data (SIMD) instructions [Mem19d].

Moreover, it supports strong data integrity, high availability, and disaster recovery via:

  • transaction support
  • intra-cluster replication of each data partition to an up-to-date replica (a.k.a. redundancy 2)
  • cluster-to-cluster replication
  • online upgrades.

Your developers will love it too, since it supports popular language interfaces (via MySQL compatibility) as well as ANSI SQL, views, stored procedures, and user-defined functions.

And it now supports delivery as a true platform as a service, Helios. Helios lets you focus even more energy on the application rather than running – let alone creating and maintaining – the database platform. Isn’t that where you’d rather be?

References

[Ada02] Lada A. Adamic, Zipf, Power-laws, and Pareto – a ranking tutorial, HP Labs, https://www.hpl.hp.com/research/idl/papers/ranking/ranking.html, 2002.

[Han18] Eric Hanson, Shattering the Trillion-Rows-Per-Second Barrier With MemSQL, https://www.memsql.com/blog/memsql-processing-shatters-trillion-rows-per-second-barrier/, 2018.

[Mem19a] Rowstore, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/rowstore/, 2019.

[Mem19b] Columnstore, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/columnstore/, 2019.

[Mem19c] MemSQL Architecture, https://www.memsql.com/content/architecture/, 2019.

[Mem19d] Understanding Operations on Encoded Data, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/understanding-ops-on-encoded-data/, 2019.

[Mem19e] Code Generation, MemSQL Documentation, https://docs.memsql.com/v6.8/concepts/code-generation/, 2019.

[She19] John Sherwood et al., We Spent a Bunch of Money on AWS And All We Got Was a Bunch of Experience and Some Great Benchmark Results, https://www.memsql.com/blog/memsql-tpc-benchmarks/, 2019.

MemSQL Helios & 7.0 Overview, Part 1


Feed: MemSQL Blog.
Author: Rick Negrin.

This blog post shares the initial section of our recent webinar, MemSQL Helios and MemSQL 7.0 Overview. In this first part, Rick draws the big picture as to the need for a new approach to data processing, and shows how MemSQL fits the bill. In the second and third parts, Rick introduces MemSQL Helios, the new, elastic, on-demand cloud service from MemSQL, and describes highlights of the upcoming MemSQL 7.0.

MemSQL Helios is brand new, and MemSQL 7.0 is the best version of our product we’ve done yet. It’s a great opportunity to be able to talk about them. I’m first going to give a MemSQL overview, describing what we built and why we built it, including the problems that businesses – especially large, successful businesses – face in meeting today’s demands, given their outdated infrastructure. Then we’ll go into the details around our launch of Helios, our managed service, and describe the new features coming in MemSQL 7.0. And we’ll finish up with some questions at the end.

Today’s Successful Businesses – Old and New – Use Operational Data at Scale

If you look at the most successful businesses across industries, there’s one thing that they all have in common: they’re powered by live, operational data. They take advantage of the operational data they have to greatly enhance the customer experience and the way they run their business. This includes everything from Uber, using the information they have about how people are calling cars to deploy more drivers when they’re needed and to figure out pricing; to financial companies like Goldman Sachs, trying to deliver the optimal experiences around portfolio analytics, online banking risk models, and such; to media companies that are delivering streaming media and need to keep track of exactly what the quality level is at any time, so they can guarantee that people are getting the best experience possible. Truly, across companies in pretty much every industry, data is the key to their success and to the differentiation that makes them the best in their business.

Breakthrough businesses use operational analytics to deliver valuable insights nearly instantaneously.

This isn’t something that just those companies wanted. It’s something that every company wants. But it’s hard, because the requirements are much higher and more complex than they’ve ever been. Data volumes are rising, which means the amount of data you have to store, manage, and work on is increasing at frightening speeds. And the complexity around that data – the variety of data sources, where the data is coming from, and the formats you have to process – is rising as well.
On top of that, consumers of the data have higher expectations than ever. This is true on the consumer side, where you think about things like people using airline apps to know whether their flight is delayed; they expect instant notifications. Or banking applications – 15 years ago, or maybe even 10 years ago, your way of getting updates was a paper statement you got in the mail once a month. Now, if you swipe your credit card and don’t instantly see it in your app, you’re unhappy with the bank and feel like they’re failing. So expectations have risen – and not just for consumers; this applies to enterprises as well.

As data volume and complexity rise, MemSQL can be a crucial part of your response.

The users within the enterprise expect to see the data that they need in real time. They make decisions day-to-day. It’s not just a couple of analysts in the back room who have access to the data. Everyone wants access to data, and they expect it to be up-to-date and easy to understand. And so the expectations from the users have been growing tremendously as well.

And lastly, we’ve already moved from a historical view of the world, where we’re looking back in time to try to understand what happened, to looking at what’s happening currently, in real time. That’s not where people want to stop. They want to move on from there to the predictive: what’s going to happen in the future, and how can I take advantage of that? Already, today, the app that shows you a movie to rent, a book to buy, or when the car you’re booking is likely to appear, is using predictive analytics.

For internal use, people want to know about pricing opportunities, what deals are likely to be successful, and what inventory will be needed where, so they can proactively move it around. The final step is to take humans off of the front line. The system can start ordering new supplies, moving inventory, and so on, automatically, then tell the humans what it’s done.

So every enterprise is on this journey, moving through that maturity cycle, and they’re all trying to get to the end as fast as possible. But they’re all struggling to keep up. The data infrastructure we use today, which worked well enough against the requirements of the last couple of decades, is not keeping up with these new demands. You’re trying to get a tighter time to insight – to drive the time between when data is born and when you can get insight or take action on it down to zero. And the infrastructure is your bottleneck; it’s not allowing you to do that. It’s taking minutes or hours or sometimes even days to move the data through the system, so insight arrives too late, and you’re unable to meet the SLA that you’re trying to hit.

On top of that, with the rise in the volume and complexity of data also come rising costs. As you try to scale the systems that you have, systems that weren’t designed for that level of scale, they either hit a bottleneck or they just hit a ceiling. The workload keeps getting bigger, so you need the system to grow, but the costs of growing to that size are so astronomical, they’re not practical.

And last, as you expose your data to, say, all of your employees rather than just a select few, you’re moving from tens of people to hundreds, thousands, or even tens of thousands – depending on the size of your organization – all trying to get access to that data. The numbers extrapolate even further as you expose it to your customers or your partners in ways you haven’t tried before. The number of simultaneous users – that is, concurrency demands – grows by orders of magnitude. And again, the data infrastructure of the past wasn’t designed to deal with that level of concurrency and still maintain the SLAs that you’re trying to hit.

How MemSQL Solves Operational Problems

All these new demands, and these trends in the industry, have created new requirements that didn’t exist before. That’s why we built MemSQL. It’s a cloud native operational database built for speed and scale. “Cloud-native” meaning it’s distributed and easy to scale on existing hardware. It’s a relational database, meaning it supports a relational model and a standard SQL interface, but with a modern architecture under the covers, to deliver the speed and the scale requirements that your new, modern applications need.

MemSQL supports the workload we call operational analytics, where you’re doing a primarily analytical workload – aggregations and group-bys, table scans of large sets of data – and where you need to meet an SLA. The database has to be resilient, available, and durable in a way that the existing data warehouse technologies are not. At the intersection of those requirements, where the existing architectures are failing, MemSQL does things an order of magnitude better.

An example is portfolio analytics, where bank customers want to look at their portfolios and have up-to-date market data in real time. They’re able to quickly drill in and understand how their portfolio is performing, and do what-if analysis to see how it would shift if they make a change. And they want to be able to do that even when the market’s busy, and hundreds or thousands of users all connect at the same time. The system scales smoothly, meeting its SLAs at every step, to deliver what customers need. And portfolio analytics is just one of many applications that MemSQL is used for.

As an organization moves through the maturity curve and starts to do more predictive ML and AI, it needs a highly scalable system that’s familiar and easy to use, but still delivers on the speed and scale of the mathematical operations required to support an ML model. MemSQL supports that very well.

What’s driving people to consider new architectures – more so now than any time before – is the move to cloud and the need to replace the legacy architectures. Often, the legacy architecture is dependent on custom hardware, particularly if you’re using an appliance-based technology. And as we move to cloud, there’s a realization you just can’t take that architecture, let alone that appliance, to the cloud.

And so data architects and enterprises are being forced to rethink: “Hey, how should I make this work in the cloud?” They’re therefore open to looking at newer, more modern architectures, which is opening up opportunities for technologies like MemSQL to replace the legacy systems. But you don’t have to believe me. You can look at who our customers are. We have half of the top 10 banks in North America using us, like I said, for real-time fraud analysis and risk analytics. We have telcos using us for things like managing the amount of data coming in as they move from 3G to 5G, with the size and complexity of the data growing tremendously while the existing systems can’t keep up.

MemSQL supports half of the Top Ten banks and other leading companies.

Large media companies like Comcast, Hulu, and Pandora are using us to track and evaluate the quality of their streaming media to make sure that their customers are having the best experience. (SSIMWAVE, which specializes in this, is using MemSQL to power not only internal workloads, but customer-facing workloads as well. – Ed.) And really, across pretty much all industries, we find this pattern showing up more and more, as customers try to meet the requirements of these new, modern workloads and need an infrastructure that can support the speed and scale required to realize them.

So how does MemSQL fit into your architecture? You can think of it as an operational database that supports both analytics and custom applications. So – whether you’re doing dashboards or ad hoc queries, and running third-party BI tools like Tableau or Looker, or building custom dashboards or custom applications to do real-time decisioning or Internet of Things (IOT), MemSQL sits as the database, or the data storage layer, underneath any of those applications or tools.

MemSQL can sit at the center of your data architecture, with data streaming in and query responses streaming out.

And when you need to bring data in, MemSQL offers native connectivity to things like Kafka; NoSQL back-end storage systems like HDFS; cloud blob storage such as S3, Azure Blob Storage, or Google Cloud Storage; and relational database systems such as Oracle, SQL Server, MySQL, or PostgreSQL. For programmatic systems like Spark, there’s a native connector to bring data in as well.

No matter how the data is coming into the system, MemSQL can easily connect to it. And internally, MemSQL can store any kind of data: standard relational data in tables, geospatial data, or JSON data. You can have a native column of type JSON, so you can easily store JSON documents but also project out properties and index them, giving you fast query access to that data. And MemSQL does very well with time series data, also.
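
As a rough illustration of that JSON pattern – a hypothetical schema, not one from the webinar – a persisted computed column can project a property out of a JSON column so it can be indexed. Here it’s driven through the pymysql driver, since MemSQL speaks the MySQL wire protocol:

    # Hypothetical example: a JSON column plus a persisted computed
    # column that projects out one property so it can be indexed.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="")
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE IF NOT EXISTS demo")
        cur.execute("USE demo")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                id BIGINT PRIMARY KEY,
                properties JSON NOT NULL,
                city AS properties::$city PERSISTED TEXT,
                KEY (city)
            )
        """)
        cur.execute("INSERT INTO events (id, properties) VALUES (%s, %s)",
                    (1, '{"city": "Toronto", "device": "mobile"}'))
        cur.execute("SELECT id FROM events WHERE city = 'Toronto'")
        print(cur.fetchall())  # index-backed seek on a JSON property
    conn.commit()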

And all of this can be run on the infrastructure that works best for your business: you can run the software yourself on-premises – bare-metal, on VMs, or with Kubernetes – or you can self-host and manage the software in the cloud. And lastly, if you don’t want to run it yourself, you can use Helios. We will manage the underlying infrastructure for you, and you can focus on just building your database and application.

Now, there are a lot of different database companies out there, all claiming to be the thing you need. So what makes MemSQL unique? What makes us different from all the other players out there? One thing is that we were built from the beginning with a distributed, cloud-native architecture. It’s a shared-nothing distributed system that can run on industry-standard hardware. Once you’ve installed MemSQL, if you need more capacity, you simply add more nodes to the system. You can do that as an online operation and grow the cluster as large as you need it, or shrink it down if you don’t need the capacity.

Unique MemSQL features include streaming ingest, lock-free processing, and fast query compilation.

Streaming ingest has been a key focus for the company from day one, with our Pipelines feature in particular, which allows you to bring data in from other parallel systems like Kafka or the blob storage technologies in the cloud. So you can bring data in massively, in parallel, with exactly-once semantics.

But underneath the covers, the thing that makes MemSQL truly unique is the architecture. Most legacy databases were built on a data structure called the B-tree. B-trees were built around getting data off spinning disks efficiently and very fast. But the world has moved on, and technology has moved on, and spinning disks are no longer the standard mechanism. There’s no reason to stay tied down to that data structure.

So MemSQL, because it was designed for newer hardware, where memory is much more accessible, uses a data structure called a skip list. Skip lists are much more efficient, especially because they can be built lock-free, so they don’t have the same locking semantics; B-trees are a lot more prone to locking.
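
To make the contrast concrete, here’s a textbook skip list sketch in Python – the classic probabilistic structure, not MemSQL’s actual lock-free implementation. Each key appears in a random number of levels, so search and insert run in expected O(log n) time without the rebalancing that B-trees require:

    # Textbook skip list (illustrative only, not lock-free).
    import random

    MAX_LEVEL = 16
    P = 0.5  # probability a node is promoted to the next level

    class Node:
        def __init__(self, key, level):
            self.key = key
            self.forward = [None] * (level + 1)

    class SkipList:
        def __init__(self):
            self.head = Node(None, MAX_LEVEL)
            self.level = 0

        def _random_level(self):
            lvl = 0
            while random.random() < P and lvl < MAX_LEVEL:
                lvl += 1
            return lvl

        def search(self, key):
            node = self.head
            for i in range(self.level, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
            node = node.forward[0]
            return node is not None and node.key == key

        def insert(self, key):
            # Track the rightmost node visited at each level.
            update = [self.head] * (MAX_LEVEL + 1)
            node = self.head
            for i in range(self.level, -1, -1):
                while node.forward[i] and node.forward[i].key < key:
                    node = node.forward[i]
                update[i] = node
            lvl = self._random_level()
            self.level = max(self.level, lvl)
            new = Node(key, lvl)
            for i in range(lvl + 1):
                new.forward[i] = update[i].forward[i]
                update[i].forward[i] = new

    s = SkipList()
    for k in (3, 1, 2):
        s.insert(k)
    print(s.search(2), s.search(5))  # True False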

That’s the underlying mechanism that makes it possible for MemSQL to stream data in on ingest while continuing to let you run queries at the same time. This gives MemSQL the scale, the concurrency, and the performance that you need – coupled with innovations like query compilation, which makes our queries run faster, the use of SIMD instructions within the processor to do vectorization, and a host of other innovations in the query compilation space. These are what give us our speed advantage.

And we couple that with an interface that’s familiar, and compatible with what your people already know how to use. It’s ANSI SQL compliant. We support pretty much all the standard SQL function capabilities in the ANSI SQL standard, and even further, we’re wire-level protocol compatible with MySQL. So the existing ecosystem of tools, whether it’s BI tools or ETL tools or programmatic tools, all work with MemSQL right out of the box.

And, as I mentioned, MemSQL supports all the different data sources that you might want to use, giving you the flexibility to store the data in whatever format and shape makes the most sense for your application. It’s the combination of these that makes MemSQL so powerful to use across your different applications and use cases.

MemSQL Helios & 7.0 Overview, Part 2


Feed: MemSQL Blog.
Author: Rick Negrin.

This blog post shares the second part of our recent webinar, MemSQL Helios and MemSQL 7.0 Overview. (Part 1 is here.) In this part, Rick delivers a deep dive into MemSQL Helios, the new, elastic, on-demand cloud service from MemSQL. He describes the features of MemSQL Helios in depth, and delivers a demo. In the last part, Rick describes highlights of the upcoming MemSQL 7.0.

Helios is our fully managed, on-demand, elastic version of MemSQL. Up to now, you could only download MemSQL software and run it yourself. Now, with Helios, MemSQL takes care of things like provisioning, deployment, upgrades, and management alerting – all of that can be offloaded from you to the MemSQL team.

MemSQL Helios handles provisioning, deployment, upgrades, and more.

And you’re responsible for the logical management of the data: creating your database, writing and tuning your queries, creating your indexes, and creating users and granting permissions. So you handle all of the logical aspects you need in order to build your application, and you leave the management of the system to MemSQL, freeing you up to move faster.

This gives you effortless deployment. You can spin up a cluster in just a couple of minutes, and you get elastic scale, able to grow and shrink the cluster as your requirements demand. You get a much superior total cost of ownership (TCO), particularly versus legacy databases. The cost of MemSQL, especially the cost per query, is far more efficient in terms of how it uses hardware. And the fact that it runs on commodity hardware makes it much more cost efficient than the databases you’ve typically been using.

MemSQL Helios delivers scale, lower TCO, and cloud flexibility.

And we’re multi-cloud and hybrid cloud. Meaning, you can run us, of course, on-premises using MemSQL software or you can run us in any cloud. Today we support AWS and GCP for Helios, with Azure coming early next year. And of course, all of this leads to better agility for you. Making it faster and easier for you to build your applications and get up and running as quickly as possible, getting that superior scale and performance.

Now, how do we do it? It turns out that it was much easier than we expected, actually. When we started building Helios, we looked around at different technologies that can help us do the orchestration of spinning up the cluster and all those deployments and upgrades and pieces. And we settled on using Kubernetes. Now, there was some trepidation around whether Kubernetes was ready for a stateful service like a distributed database, because it had been primarily used for non-stateful application level systems. But we found that there had been enough investment by the community that it worked quite well for us – and in fact, a couple of people, over about six months, were able to get the system up and running quite easily.

MemSQL Helios is built on Kubernetes, and uses containers, with HA built in.

Kubernetes has capabilities like auto-healing if there’s a node failure. So if a node fails in the system, it automatically spins up a VM, attaches it to the cluster, and gets it up and running. It allows us to do auto-scaling, easily growing the cluster as needed, and things like rolling upgrades to make sure that we can upgrade the software without having any impact on the end user.

We use containers to pin your compute, so your cores and memory are dedicated to you via containers running on the host machines, and we use block storage in the background. So, for example, on AWS, we use Elastic Block Store for the storage, which we can easily detach from and attach to the containers as needed.

When it comes to things like the instance type and the node configuration, all of that’s handled by MemSQL. We look at your workload and pick the optimal configuration that works best for you, and you don’t have to worry about it. It’s all transparent – as is high availability (HA). We always keep two copies of the data within the cluster. If there’s any problem with a node, or any failure, the system automatically fails over to the other copy, and then either fixes the node or grows a new one, so that you safely get back to having two copies. And all this happens without any interruption, invisible to the user.

And of course, security is top of mind for many people as they come to the cloud, and all the typical security options are enabled by default. So you get encryption on the wire and on disk, we support the authentication mechanisms that people expect, and there’s full support for role-based access control as well.

But enough talk; let me show you how it works. So this is our customer portal. When you log into portal.memsql.com with an account, this is what you’ll see. You click on this Clusters link on the left-hand side, and from here you can create a cluster. So I’m going to say, Create cluster. I’m going to give it a name, Demo2. I can choose a cluster type; in this case, I’m going to choose development, but this helps you keep track of which clusters are your staging, production, and development systems. We support four regions today, with more coming in the near future: GCP Virginia, AWS Virginia, AWS Oregon, and GCP Mumbai.

(For the demo, see the recorded webinar, MemSQL Helios and MemSQL 7.0 Overview. For an animation, see our MemSQL Helios page. Example screenshots are shown here. To try MemSQL Studio, including the built-in demo described here, download MemSQL for free. – Ed)

I’m going to choose AWS Virginia. I can choose the number of units; a unit is the amount of capacity that you specify for the size of the cluster, and whatever you choose, you can always change it later. I’m going to choose a four-unit cluster, which gives me 32 vCPUs, 256 gigs of RAM, and 4 terabytes of storage across the cluster. Click Next. I put in my password, and then you can specify an IP range in order to lock the system down, so it’s only accessible by the machines you want it to be accessible to. In my case, I’m just going to allow access from anywhere, because this is a demo. Then it’s Create cluster.

So really, with just a handful of clicks and a few minutes’ work, I’ve now spun up a four-unit cluster within MemSQL. It’ll take a handful of minutes to spin up, so while we’re waiting, I’ve got another cluster over here that I’m going to use to show you a little bit more about what it looks like. What you’ll see are the cluster properties: the region, the size, and information about when it was created and what version it’s on. I can change the password over here, or edit the IP addresses that can connect to it. And then I’m given the endpoints that let me connect in order to run commands from an application.

Now, you can either use a SQL driver – a MySQL or ODBC driver – to connect from your application, or you can use Studio. Studio is our tool for monitoring and managing MemSQL. So I’m going to click this link here to load data with MemSQL Studio, and put in my password. Now I’m in Studio, which gives me an overall dashboard that tells me the number of nodes that I have, gives me some information about database usage, and gives me a number of options here on the left.

MemSQL Studio helps you monitor and manage MemSQL.

One of the things we introduced recently was a tutorial to help people understand how to make use of MemSQL, so I’m going to walk you through the first couple of steps of that tutorial. Let’s say the first thing I want to do is run the tutorial and load some sample data. I click this link here to load sample data; I’m going to load the sample stock data set. The first thing you have to do before you load new data, of course, is create a database, which means you have to come up with the schema: what tables do I want, what columns do I want?

And so I’m going to click Paste queries here, and this jumps me down to the SQL editor, which allows me to create, edit, and run queries. What we see is that it creates a database called Trades, with a table called Trade – a columnstore table with a number of different columns – and another table called Company that’s a rowstore table. So I’m just going to run that. If we go over to the Databases link here, you can see that there’s a new database called Trades, and it’s got two tables: one a rowstore and one a columnstore.

Now, you’ll notice these tables are empty; that’s because we haven’t put any data in them. So the next thing we’re going to do is load some data. I’m going to paste the load-data queries. What this is doing is using that feature I mentioned earlier, MemSQL Pipelines. The Pipelines feature lets you create an object in the system that will load data from some source location. As I mentioned, it supports Kafka, as well as Linux file systems and the cloud vendors’ blob storage. In this case, we’re using AWS, so we’re going to load the data from S3. And it’s as simple as saying Create Pipeline, give it a name, load data from S3, and then you give it a bucket and a bit of configuration, and away you go.
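
Pieced together, the statements behind that flow look roughly like the sketch below – hypothetical endpoint, bucket, and column definitions, since the tutorial’s actual schema differs; see the Pipelines documentation for all CREATE PIPELINE options:

    # Sketch of the tutorial's flow: create the database and tables,
    # then a pipeline that bulk-loads from S3 in parallel.
    import pymysql

    conn = pymysql.connect(host="<helios-endpoint>", user="admin",
                           password="<password>")
    statements = [
        "CREATE DATABASE IF NOT EXISTS trades",
        "USE trades",
        # Columnstore table for the large, scan-heavy trade history.
        """CREATE TABLE IF NOT EXISTS trade (
               symbol VARCHAR(10), qty INT, price DECIMAL(18,4),
               ts DATETIME, KEY (ts) USING CLUSTERED COLUMNSTORE)""",
        # Rowstore (the default) for the small, seek-heavy company list.
        """CREATE TABLE IF NOT EXISTS company (
               symbol VARCHAR(10) PRIMARY KEY, name VARCHAR(100))""",
        # Pipeline that ingests files from an S3 bucket; add a
        # CREDENTIALS clause if the bucket isn't public.
        """CREATE PIPELINE load_company AS
               LOAD DATA S3 'my-sample-bucket/company'
               CONFIG '{"region": "us-east-1"}'
               INTO TABLE company""",
        "START PIPELINE load_company",
    ]
    with conn.cursor() as cur:
        for stmt in statements:
            cur.execute(stmt)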

So now the data has started loading, and if you want to see the pipeline in action, you can click the Pipelines UI down here. You can see that the pipeline has already finished. So now if I go back to Databases, to this Trades database, and refresh my screen, you can see that over 3,000 rows have now been loaded into the Company table. In fact, if I go down here, I can see that there are 3,288 rows loaded in the table. If I want to get some idea of what they are, I can do a simple query that gets me the first 10 rows of the table, and investigate the data from there.

Now I’m going to stop here, even though there’s a lot more to this tutorial, just in the interest of time. You can run this tutorial yourself and go through it all the way; it will generate more data and have you run queries over millions of rows to show you the power of MemSQL. But this gives you an idea of how easily and simply you can get up and running, loading data and getting the value of MemSQL as quickly as possible.

So next, let’s talk about pricing. How much does it cost? The way we do pricing is very similar to Amazon’s. If you’re familiar with Amazon or any of the big cloud vendors, we have two billing models. We have an on-demand model, where you pay for what you use on an hour-to-hour basis, and that hourly usage is added up and billed to you monthly. Units are the way you buy capacity: a unit is 8 vCPUs, 64 gigs of RAM, and one terabyte of storage. You specify how many units you want per cluster that you spin up. We add up all those units, times the price per unit, and that’s your bill at the end of the month.
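
As a worked example of that arithmetic – with a made-up per-unit rate, since actual rates vary by region and provider, as noted below:

    # Hypothetical on-demand billing math for a Helios cluster.
    # The $/unit-hour rate is invented for illustration only.
    UNIT_VCPUS, UNIT_RAM_GB, UNIT_STORAGE_TB = 8, 64, 1
    PRICE_PER_UNIT_HOUR = 2.00  # hypothetical rate, in USD

    units = 4                   # the four-unit cluster from the demo
    hours_in_month = 24 * 30

    print("capacity: %d vCPUs, %d GB RAM, %d TB storage" % (
        units * UNIT_VCPUS, units * UNIT_RAM_GB, units * UNIT_STORAGE_TB))
    # -> capacity: 32 vCPUs, 256 GB RAM, 4 TB storage (matches the demo)

    bill = units * hours_in_month * PRICE_PER_UNIT_HOUR
    print("on-demand bill for the month: $%.2f" % bill)  # -> $5760.00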

Now, if you’re running clusters 24/7 and you know you’re not going to be spinning them up and down, you can get a discount by paying upfront for the usage, for either a one-year or a three-year term; you see the pricing here for Reserved as well. And you can mix the reserved and on-demand pricing – again, similar to what you can do with the existing cloud vendors. Please keep in mind that this is the North Virginia pricing, and that pricing varies by region and by cloud provider. If you want more details on the pricing for other regions or cloud providers, you can contact our Sales team.

And last, let’s finish up with how MemSQL compares with the other players in the market. If you look at the other operational database players, first, you can see that MemSQL has a modern, highly available architecture, compared to the legacy architectures from some of the other players out there. MemSQL is really top of its class when it comes to performance – how fast we bring data in, how fast you can run queries, and the kind of concurrency you can scale to – doing much better than NoSQL systems like MongoDB or single-box systems like Amazon Aurora.

MemSQL Helios leads as a modern, high-performance, flexible cloud database with ML and AI integrations.

Even Oracle’s cloud systems, which are scalable, don’t have a very good price-performance ratio; it costs a lot to get the scale you need. In analytical query performance in particular, MemSQL shines, and really shows up better than any of the other competitors – particularly the single-box systems like Amazon Aurora, which will tap out once they hit the largest box size you can get. And of course, the fact that we can deploy both on-prem and in the cloud, on any cloud that you want, gives you a flexibility that almost none of the legacy players have.

That’s certainly not true of the big cloud providers like Amazon, who will only run their services in their own cloud. And then, as people move into the higher levels of maturity, AI and ML integrations become a key feature, and MemSQL shines there as well. So that finishes up the discussion around Helios.


Why MemSQL? MemSQL 7.0 In Depth


Feed: MemSQL Blog.
Author: Rick Negrin.

MemSQL VP of Product Rick Negrin describes the upcoming MemSQL 7.0 in depth. He describes crucial features such as fast synchronous replication, for transaction resilience, and SingleStore, which leads to lower total cost of ownership (TCO) and more operational flexibility, at both the architectural and operational levels. Negrin also describes new tools for deploying and managing MemSQL, and new features in MemSQL Studio for monitoring, investigating active processes, and editing SQL.

This blog post focuses on MemSQL 7.0 and the new features in MemSQL Studio. It’s based on the third part of our recent webinar, MemSQL Helios and MemSQL 7.0 Overview. (Part 1 is here; Part 2 is here.) This blog post includes our most detailed description yet of the upcoming MemSQL 7.0, and new features in MemSQL Studio.
Let’s dig in and see what’s new in MemSQL 7.0 and MemSQL Studio. Before we do that, I want to set a little context for how we think about the database workloads that are out there, and what’s driving the features that get built. If you look at the history of databases, there have been two common workloads that people use a database for. One is analytics, also known as Online Analytical Processing (OLAP), or data warehousing. That workload typically comprises requirements around queries being very fast, particularly large and complex queries that are doing aggregations, or group-bys, or applying a large set of filters.

These days you often have a large data size, measured in terabytes, or hundreds of terabytes, sometimes even petabytes. Usually a company has large, infrequent batch loads, with data coming in periodically. And then there’s a need to resource-govern the different workloads that are making use of the system, because you often have different users running different types of queries.

The other side is the transactional applications – the Online Transaction Processing (OLTP) workload. That has different requirements. In that case, both the reads and the writes are coming in from the application, as opposed to coming from different sources, and the queries tend to be less complex – more focused on things like fast record lookups or small, narrow-range queries. But there are much stronger requirements around the service level agreements (SLAs), around concurrency, and around the availability and resiliency of the system. The more mission-critical the application, the less tolerance there is for downtime.

Whereas on the data warehouse side – the OLAP, or Online Analytical Processing, side – things are often run offline at night. If an analyst is forced offline for an hour, they go and get coffee and maybe are unhappy but, you know, it’s not the end of the world. Whereas on the transactional side, often it’s an application, sometimes customer-facing or internal-facing to many users within your organization. If it’s down, it can be very bad to catastrophic for the business. And so the SLAs around durability, resilience, and availability are pretty critical.

What we’re seeing with the new, modern workloads is that they often have a combination of needs around transactionality and analytics: they need the fast queries and aggregations and the large data sizes, but there’s a change in one of the key requirements. It’s not just large data loads – they need fast data loads. There’s an SLA not just on how fast the queries are, but also on ingest, on how quickly the data gets into the system, in order to meet the near-real-time requirement that people need.

And then combine that with some use cases where, in addition to needing the aggregations, they also need the fast record lookups and all the availability, durability, and resiliency that we expect from an operational system. So it’s the same requirements that you have for the data warehouse, plus the operational ones, all combined in one system. There really aren’t many systems that can do that.

MemSQL has made solid progress pretty much across all of these requirements. At this point, we can do a vast majority of the workloads out there. But that’s not enough for us. We’re not going to be happy or settled until we can do all the workloads, which means being able to be as available and as resilient as the most mission-critical, complicated, highest-tier enterprise applications that are out there. And to do that, we need to get even more investments in things like resiliency. That’s why we focused on the two key features of 7.0, which are around fast synchronous replication and incremental backup.

Up until this current version, we’ve had synchronous and asynchronous replication. The difference shows up in how we deliver high availability (HA), in which we keep two copies of the data. With asynchronous replication, you return success to the user as soon as at least one copy of the data is written. With synchronous replication, you wait until both copies are written before you return success to the user. That guarantees that if there’s a problem or a failover, no data will get lost, because you have the data in both copies.

Now, we’ve always offered both mechanisms, and customers would choose which one worked best for them, making a trade-off between performance and durability. In some cases customers made one choice versus the other, but it was always unfortunate that they had to make that trade-off; nobody wants to choose between two things they both want. So with 7.0, we revamped how we replicate the copies, such that synchronous replication is so close to the speed of async that there’s a negligible difference between them.

And so we’ve enabled synchronous replication as the default, so that everybody gets the durability guarantees of synchronous replication without having to trade away performance. This enables you to survive any one machine failure without having to worry about any loss of data. Additionally, we’ve moved from having only full backups to also doing incremental backups. Full backups are great: they allow you to make sure that you have a copy of your data off the cluster in the event of a total cluster failure or major disaster.

But even though our backups are online operations that don’t stop you from running your existing workload, full backups do take up resources within the cluster. Moving to a model with incremental backups allows you to run backups more often, reducing your recovery point objective (RPO) and reducing the load on the cluster, so you don’t need as much capacity in order to maintain the SLA you need while a backup is running – really driving down the overall TCO of the system, making it more resilient, and driving down the RPO.

Now, there’s a lot more in 7.0, but those are the two marquee features we’ve delivered that will make a huge difference to customers at that upper tier of enterprise workloads. The other big investment we made was in the storage mechanisms in MemSQL. In 6.8, the current version, we have what we call Dual Store: you can have a row-oriented table or a column-oriented table within your database, and you choose for every table you create. Most of our customers end up choosing a mix of the two, because row- and column-oriented tables come with different sets of trade-offs and advantages.

Rowstores tend to be good for more OLTP-like applications, where you need fast seeks, updates, or deletes on a set of rows, but the downside is that you get a higher TCO, because the rowstore lives entirely in memory, and memory can get expensive when you get to large data sizes.

Column-oriented tables are much better for big data aggregation, scanning billions of rows, and you get much better compression on them – but you don’t get very good seek times if you need to seek to one or two rows, or a small number of rows, and you don’t get secondary indexes. This has put customers in an unfortunate position where they’ve had to choose between row and column. If you need the big data aggregation because you’re doing table scans sometimes, but you’re doing seeks at other times, you’re kind of stuck: you have to give up one or the other when you choose the solution for your application.

And so with 7.0, we’ve made investments to make those trade-offs less harsh, investing in compression within the row store to drive down the TCO and implementing fast seeks and secondary hash indexes so that users who need those can just use column [inaudible 00:38:40] data. We’re not going to stop there. Long term what we want is to have a single table [inaudible 00:38:47] that has all those capabilities and under the covers, we autonomously and automatically will use the right row or column format and make use of memory and disk so that you don’t have to make the choice at design-time. We make the choice for you at an operational time based on how your workload is working and choose what’s optimal for you. And that’s what you’re going to see over the next several versions of MemSQL.

And last but not least, of course, is in order to manage and make use of a distributed database, you need to have the right tools for both deploying and managing it. And so we have a number of new capabilities within our tool chain to make it easier to set up your cluster if you’re using the self-manage and to allow you to do things like online upgrade and do monitoring of your data over time, so you can do capacity management and troubleshoot problems that are intermittent. And then on the visual side, the studio tool, which I briefly showed you during the demo allows you to do things like logical monitoring to visualize the component states of the nodes within a cluster to make sure there’s no hotspots or data skew or other problems that need attention.

Physical monitoring of the actual hosts, so you can see if any one of them is using more resources whether it’s CPU or memory or disk or I/O than it should be using and take action if needed. Of course, the physical monitoring is only something that you can do when you’re self-managed. When using Helios, the physical monitoring is taken care of by MemSQL. We also let you have tools for letting you look for a long running queries, so you can troubleshoot if a query perhaps at a plan change is now got a less optimal plan and using too much capacity or too much resources. So you can find the query, figure out what the problem is and kill it if needed. And of course, the SQL editor, which you saw in the demo that allows you to write queries and experiment with the system as well as manage it.

And that concludes our whirlwind tour of the Helios and 7.0. Of course, you don’t have to believe anything I say. You can try it for you today yourself. You can access Helios, we made our trial available to you, which is available at MemSQL.com/free or you can get started with the MemSQL 7.0 beta at the MemSQL.com/7-beta-2. A

Why MemSQL? 3. MemSQL 7.0 In Depth

Feed: MemSQL Blog.
Author: Rick Negrin.

MemSQL VP of Product Rick Negrin describes the upcoming MemSQL 7.0 in depth. He describes crucial features such as fast synchronous replication for transaction resilience, and SingleStore, which lowers total cost of ownership (TCO) and increases operational flexibility, at both the architectural and operational levels. Negrin also describes new tools for deploying and managing MemSQL and new features in MemSQL Studio for monitoring, investigating active processes, and editing SQL.

This blog post focuses on MemSQL 7.0 and the new features in MemSQL Studio. It’s based on the third part of our recent webinar, MemSQL Helios and MemSQL 7.0 Overview. (Part 1 is here; Part 2 is here.) This blog post includes our most detailed description yet of the upcoming MemSQL 7.0, and new features in MemSQL Studio.

Let’s dig in and see what’s new in MemSQL 7.0 and MemSQL Studio. Before we do that, I just want to set a little context for how we think about the database workloads that are out there, and what’s driving the features we build. If you look at the history of databases, there have been two common workloads that people use a database for. One is analytics, also known as Online Analytical Processing (OLAP), or data warehousing. That workload typically comprises requirements for very fast queries, particularly large, complex queries doing aggregations, group bys, or large sets of filters.

Now you often have a large data size, these days measured in terabytes, hundreds of terabytes, or sometimes even petabytes. Usually a company has large, infrequent batch loads, with data coming in periodically. And then there’s a need to resource-govern the different workloads making use of the system, because you often have different users running different types of queries.

Now the other side is transactional applications – the Online Transaction Processing (OLTP) workload. That has different requirements. In that case, both the reads and the writes are coming in from the application, rather than from different sources, and the queries tend to be less complex, more focused on things like fast record lookups or small, narrow-range queries. But there are much stronger requirements around the service level agreements (SLAs), concurrency, and the availability and resiliency of the system. The more mission-critical the application, the less tolerance there is for downtime.

Whereas on the data warehouse side – the OLAP, or Online Analytical Processing, side – you’re often running things at night, and the system is often offline at night. If an analyst is forced offline for an hour, they go and get coffee and maybe are unhappy but, you know, it’s not the end of the world. With the transactional side, often it’s an application, and sometimes it’s customer-facing, or internal-facing to many users within your organization. If it’s down, it can be very bad to catastrophic for the business. And so the SLAs around durability, resilience, and availability are pretty critical.

And what we’re seeing with new, modern workloads is that they often have a combination of needs around transactionality and analytics: they need the fast queries and aggregations and the large data size, but there’s a change in one of the key requirements here. It’s not just large data loads; they need fast data loads. There’s an SLA not just on how fast the query is, but also on the ingest – how quickly the data gets into the system – in order to meet the near-real-time requirements that people have.

And then combine that with use cases where, in addition to the aggregations, they also need the fast record lookups and all the availability, durability, and resiliency that we expect from an operational system. So it’s the same requirements that you have for the data warehouse, plus the operational ones, all combined in one system. There really aren’t many systems that can do that.

MemSQL has made solid progress pretty much across all of these requirements. At this point, we can handle the vast majority of the workloads out there. But that’s not enough for us. We’re not going to be happy or settled until we can do all the workloads, which means being as available and as resilient as the most mission-critical, complicated, highest-tier enterprise applications that are out there. And to do that, we need to make even more investments in things like resiliency. That’s why we focused on the two key features of 7.0, which are fast synchronous replication and incremental backup.

So up until this current version, we’ve had synchronous and asynchronous replication. The difference matters when we have high availability (HA) enabled, in which we keep two copies of the data. With asynchronous replication, you return success to the user as soon as at least one copy of the data is written. With synchronous replication, you wait until both copies are written before you return success to the user. That guarantees the user that if there’s a problem or a failover, no data will be lost, because you have the data in both copies.

Now, we’ve always offered both mechanisms, and customers would choose which one worked best for them, making a trade-off between performance and durability. In some cases, customers made one choice versus the other, but it was always unfortunate that they had to make that trade-off. Trade-offs are hard, and nobody wants to have to choose between two things they want. And so with 7.0, we revamped how we do synchronous replication, so it’s so close to the speed of async that there’s really just a negligible difference between them.

We’ve enabled synchronous replication as the default, so that everybody gets the durability guarantees without having to trade away performance. This enables you to survive any single machine failure without having to worry about any loss of data.
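
As an illustration, here is how that choice surfaces in SQL. Synchronous replication is the default in 7.0; the opt-out clause shown below is a sketch to verify against the 7.0 docs, and the database names are hypothetical.

-- In MemSQL 7.0, synchronous replication is the default for new databases:
CREATE DATABASE orders;

-- A database can still opt into asynchronous replication explicitly
-- (clause shown as an assumption; verify against the 7.0 docs):
CREATE DATABASE clickstream WITH ASYNC REPLICATION;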

Additionally, we’ve moved from having full backups only, to also offering incremental backups. Full backups are great. They allow you to make sure that you have a copy of your data off the cluster in the event of a total cluster failure or other major disaster. But full backups have a cost. Even though our backups are online operations that don’t stop you from running your existing workload, they do take up resources within the cluster. Moving to a model with incremental backups allows you to run backups more often, reducing your recovery point objective (RPO) – the age of the data that you have to recover before you can return to normal operations – and reducing the load on the cluster, so you don’t need as much extra capacity to maintain the SLA you need while backups and restores are running. So incremental backups really drive down the overall total cost of ownership (TCO) of the system, make it more resilient, and drive down the RPO.
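
As a sketch, both kinds of backup are driven by the same BACKUP DATABASE statement; the WITH INIT and WITH DIFFERENTIAL clauses follow the 7.0 syntax as we understand it, and the bucket name and credentials are placeholders.

-- Take an initial full backup that starts an incremental chain:
BACKUP DATABASE orders WITH INIT
  TO S3 "my-bucket/orders-backups"
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';

-- Later, capture only what changed since the previous backup:
BACKUP DATABASE orders WITH DIFFERENTIAL
  TO S3 "my-bucket/orders-backups"
  CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';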

Now, there’s a lot more stuff in 7.0, but synchronous replication and incremental backups are the two marquee features that we’ve delivered that will make a huge difference to customers at that upper tier of enterprise workloads.

MemSQL SingleStore in (Some) Depth

Now, the other big investment we made was around the storage mechanisms that we have in MemSQL. If you look at MemSQL 6.8, which is still the current version, we have what we call Dual Store. We allow you to have a row-oriented table or a column-oriented table within your database, and you can choose which one you want for every table that you create. Most of our customers end up choosing a mix of the two, because they get a certain set of trade-offs and advantages from a rowstore versus a columnstore table.

Row-oriented tables tend to be good for OLTP-type applications, where you need fine-grain aggregation or seeks, for updates or deletes on a set of rows. But the downside is a higher TCO, because rowstore data is all stored in RAM, and memory can get expensive when you get to large data sizes.

The column-oriented tables are much better for big data aggregation, scanning billions of rows. You get much better compression on a column-oriented table, but you don’t get very good seek time if you need to seek just one or two rows, or a small number of rows. You also don’t get secondary indexes. And so this put customers in an unfortunate position, where they’ve had to choose between row-oriented and column-oriented tables. If you need, for example, big data aggregation, because you’re doing table scans sometimes, but you’re doing seeks other times, you’re kind of stuck. You have to give up one or the other when you choose the solution for your application.

Coming up, with MemSQL 7.0, we’ve made investments to make those trade-offs less harsh, investing in compression within the rowstore to drive down the TCO and implementing fast seeks and secondary hash indexes so that users who need those can just use columnstore data.

Here’s the comparison, for rowstore tables (transactional):

  • Benefits: Fine-grain aggregation: seek, update, and delete up to millions of rows, fast; secondary indexes (regular & spatial)
  • Former detriment (pre-MemSQL 7.0): High TCO
  • Now (with MemSQL 7.0 and SingleStore): Average 50% storage compression, with new null compression feature

And here’s a similar comparison, for columnstore tables (analytical):

  • Benefits: Big data aggregation; scan billions of rows, very fast; full-text indexes; 5x-10x compression
  • Former detriments (pre-MemSQL 7.0): Slow seeks; no secondary indexes
  • Now (with MemSQL 7.0 and SingleStore): Fast seeks; secondary hash indexes
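
To make that concrete, here is a rough sketch of what the two 7.0 improvements look like in DDL: table-level sparse (null) compression on a rowstore table, and a secondary hash index on a columnstore table. Treat the exact clauses as 7.0-beta syntax to verify against the docs; the table and column names are hypothetical.

-- Rowstore table with null (sparse) compression for wide, mostly-NULL rows:
CREATE TABLE readings_row (
  sensor_id BIGINT NOT NULL,
  reading DOUBLE,
  note VARCHAR(200),
  PRIMARY KEY (sensor_id)
) COMPRESSION = SPARSE;

-- Columnstore table with a secondary hash index for fast seeks:
CREATE TABLE readings_col (
  sensor_id BIGINT NOT NULL,
  ts DATETIME NOT NULL,
  reading DOUBLE,
  KEY (ts) USING CLUSTERED COLUMNSTORE,
  KEY (sensor_id) USING HASH
);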

We’re not going to stop there. Long term, what we want is to have a single table that has all those capabilities. Under the covers, we autonomously and automatically will use the right rowstore or columnstore format and make use of memory and disk so that you don’t have to make the choice at design time. We make the choice for you, in operational time, based on how your workload is working. Progress in that direction is what you’re going to see over the next several versions of MemSQL.

Here are the benefits for SingleStore tables (transactional + analytical), with their original source – either rowstore or columnstore:

  • Fine-grain aggregation (from rowstore) and big data aggregation (from columnstore)
  • Seek, update, and delete up to millions of rows, fast (rowstore) and scan billions of rows (columnstore)
  • Secondary indexes – regular, spatial, and full-text (rowstore) – and 5x-10x compression (columnstore)

Persistent storage will be in columnstore format, on disk. The database software will create other elements, such as indexes and hash tables, as needed – first in memory, up to the amount of memory available, and then spilling over to disk where needed.

Data storage in memory vs. on-disk will be managed intelligently, with the following results:

  • Where needed indices and data fit entirely in memory: Rowstore-type performance
  • Where needed indices and data fit partly in memory: More or fewer bursts of rowstore-like performance, where the needed indices and data fit entirely in memory, along with columnstore-like performance for the indices and data that are spilled over to disk
  • Where most of the data, and perhaps some of the indices, are spilled over to disk: Optimized columnstore performance, faster than pre-MemSQL 7.0 and faster than competitors

And last, but not least: in order to manage and make use of a distributed database, you need to have the right tools for both deploying and managing it. So we have a number of new capabilities within our tool chain to make it easier to set up your cluster, if you’re using the self-managed software (rather than MemSQL Helios, where MemSQL does those things for you). These tools also allow you to do things like online upgrades, and to monitor your data over time, so you can do capacity management and troubleshoot intermittent problems.

And then on the visual side, the MemSQL Studio tool, which I briefly showed you during the demo, allows you to do things like logical monitoring to visualize the component states of the nodes within a cluster, to make sure there’s no hotspots or data skew or other problems that need attention.

Physical monitoring of the actual hosts, so you can see if any one of them is using more resources whether it’s CPU or memory or disk or I/O than it should be using, and take action if needed. (Of course, the physical monitoring is only something that you can do when you’re self-managed; when using Helios, the physical monitoring is taken care of by MemSQL.)

We also give you tools to look for long-running queries, so you can troubleshoot if a query has problems. Perhaps there’s been a plan change, and the query now has a less optimal plan, using too much capacity or too many resources. You can find the query, figure out what the problem is, and kill it if needed.
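
Since MemSQL speaks the MySQL wire protocol, the same investigation also works from any SQL client; here is a minimal sketch (the query ID is hypothetical).

-- Find queries that have been running for more than 60 seconds:
SELECT id, user, time, info
FROM information_schema.PROCESSLIST
WHERE time > 60
ORDER BY time DESC;

-- Kill a problematic query by its ID:
KILL QUERY 12345;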

And of course, there’s the SQL Editor, which you saw in the demo, which allows you to write queries and experiment with the system as well as manage it.

And that concludes our whirlwind tour of MemSQL Helios and the upcoming MemSQL 7.0. Of course, you don’t have to believe anything I say. You can try it yourself today. You can access MemSQL Helios through our trial, available at MemSQL.com/free, or you can get started with the MemSQL 7.0 beta at MemSQL.com/7-beta-2. If you’re not a customer of MemSQL, you can still access the beta for free, or download our free tier to use MemSQL. Thank you very much.

MemSQL Q&A

Q. Is the software the same for MemSQL Helios, your elastic managed service in the cloud, as in the self-managed software? (Which you can deploy yourself, in the cloud or on-premises. Ed.) Are there any differences in capability?

A. Self-managed MemSQL is the exact same engine used in Helios. It’s what we ship on-prem. As we upgrade the software, those versions show up in the cloud, in MemSQL Helios. In fact, they’re likely to show up on the managed cloud first, before they show up in the self-managed software that we ship.

There are some slight differences. They’re mostly temporary. There are features that are disabled now that will be enabled in the future. For example, the transforms feature, where you can put in arbitrary code that transforms the data as it goes through the pipeline. We don’t enable that on Helios because we’re not yet ready to support it; we can’t simply take arbitrary code and run it inside of our managed service. However, in time, we’ll figure out how to do that safely, and we’ll enable that feature.

There are a few features like that which aren’t supported yet, but will be in the future. All of this is covered in our documentation.

There are also some features that are more around managing the physical aspects of the cluster – for example, the ability to take a node and add it to the cluster, or remove it from the cluster. All of these are managed for you by the system with MemSQL Helios.

Q. Does MemSQL support high availability (HA), and how does it work?

A. HA is built into the system. With the self-managed software, you can choose whether or not you want HA at all, or if you want one or two copies. With Helios, HA is always turned on. You can’t turn it off and it’s turned on automatically, so you won’t have to do anything. It just happens, transparently.
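
For self-managed clusters, here is a minimal sketch of how that choice is expressed; the redundancy_level variable is set on the master aggregator, and the exact syntax should be verified against the docs.

-- Check the current redundancy level (1 = single copy, 2 = HA pairs):
SELECT @@redundancy_level;

-- Enable two copies of every partition (run on the master aggregator):
SET @@GLOBAL.redundancy_level = 2;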

Q. Is there a plan to allow for real-time, web-based replication from MySQL to MemSQL? If not, are there current workarounds that can get this done for us, such as MySQL to Kafka to MemSQL, or another solution that is already proven?

A. Yes. You can use existing replication tools, such as Attunity and other replication tools that support MySQL. (You can use these tools because MemSQL supports the MySQL wire protocol – Ed.) You can also stage data to Kafka or S3, and use MemSQL Pipelines to move it into MemSQL. In addition, MemSQL will have a solution that will move data from MySQL, Oracle, SQL Server, and other relational databases to MemSQL later this year.
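
As a sketch of the Kafka staging route, a MemSQL Pipeline subscribes to a topic and loads it continuously; the broker address, topic, and table names here are hypothetical.

-- Continuously ingest change records staged in a Kafka topic:
CREATE PIPELINE mysql_orders AS
  LOAD DATA KAFKA 'kafka-broker:9092/orders-topic'
  INTO TABLE orders
  FIELDS TERMINATED BY ',';

START PIPELINE mysql_orders;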

Q. I was told common table expression (CTE) materialization is coming in MemSQL 7.0. Is this true?

A. Recursive CTE materialization is not coming in MemSQL 7.0, but it’s on the roadmap for the future.

Q. When is MemSQL 7.0 reaching general availability (GA)?

A. We’re in Beta 2 right now, we’ll have a release candidate (RC) in the next month or two, and GA should follow soon after that.

Q. Can you explain how you do fast synchronous replication in MemSQL 7.0, without any penalties?

A. This blog post talks about our sync replication design: https://www.memsql.com/blog/replication-system-of-record-memsql-7-0/.

Q. For MemSQL Helios, do you provide an SLA, and if so, what is it? When will MemSQL 7.0 be available on Helios? Will EU regions be available soon? And also, when will MemSQL Helios be on Azure?

A. MemSQL Helios provides an SLA of 99.9% availability. MemSQL 7.0 will be available on MemSQL Helios when MemSQL 7.0 goes GA – in fact, MemSQL 7.0 is likely to be available on MemSQL Helios first. EU regions will be available in 2019. And MemSQL Helios will be available on Azure in the first half of 2020.

Q. Does MemSQL provide data masking on personally identifiable information (PII) columns?

A. MemSQL does not have masking as a built-in feature, but you can use views on top of tables to accomplish the same effect.
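
For example, here is a rough sketch of the view-based approach; the table, columns, and user names are hypothetical.

-- Expose only the last four digits of a sensitive column:
CREATE VIEW customers_masked AS
  SELECT id,
         CONCAT('***-**-', RIGHT(ssn, 4)) AS ssn,
         city
  FROM customers;

-- Grant analysts access to the view rather than the base table:
GRANT SELECT ON mydb.customers_masked TO 'analyst'@'%';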

Q. Are RabbitMQ pipelines in the works?

A. These are on the roadmap, but not in the near-term.

Q. Is there a cost calculator for Helios?

A. Cost is # of units * hours used * price per unit. For example, a cluster of four units running for 100 hours at a hypothetical price of $2 per unit-hour would cost 4 * 100 * $2 = $800. For more information on pricing, please contact MemSQL Sales.

If you have any questions, send an email to team@MemSQL.com. We invite you to learn more about MemSQL at MemSQL.com and give us a try for free at MemSQL.com/free. Again, thank you for participating, and have a great remainder of the day.

From Big Data to Fast Data with SME Solutions Group and MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

SME Solutions Group helps institutions manage risks and improve operations, and they’ve chosen MemSQL as a database partner. Their premier services include data analytics and business intelligence (BI) tools integration, areas where MemSQL adds tremendous value. In this webinar, SME Solutions Group describes how they use MemSQL to solve a wide range of problems, including big data and fast data needs for utilities.

How MemSQL Fits SME

Mike Czabator, a systems engineer with MemSQL, describes MemSQL as a fully relational, fully distributed, memory-first database. It loads data live; scales out to support arbitrarily large numbers of ad hoc SQL queries, BI tools users, application users, machine learning (ML) models and AI algorithms. There is no delay between data ingest and data availability for queries, which can span historical and current data at terabyte and petabyte scale.

Cloud, AI/ML, big data and more

MemSQL supports streaming ingest, rowstore tables that run in memory – traditionally used for transactional (OLTP) applications – and disk-based columnstore tables, widely used in analytical (OLAP) applications. With MemSQL, queries can scan rowstore and columnstore tables, and there is no time-consuming, fraught extract, transform, and load (ETL) process; instead, MemSQL Pipelines move data around.

MemSQL is MySQL wire protocol-compatible and connects to Kafka, Spark, and many other data sources and destinations. MemSQL runs on-premises, in the cloud, in containers, with MemSQL’s Kubernetes Operator, and as a managed service.

Converged transactions and analytics

One of the critical applications for MemSQL is speeding up dashboards, whether created by BI tools or custom-built; analytics-driven applications, such as Uber ride hailing; and ML and AI applications, such as Epigen, which works with business and government clients. This is a perfect fit for the analytics and BI tools integration work done by SME Solutions Group.

Building the Ideal Data Ecosystem

Ron Katzman is Director of Strategy & Operations at SME Solutions Group. He describes current industry trends that the company helps customers align themselves with: data-driven decision making; the need for predictive analytics rather than reactive reports; and the need for speed, agility, and flexibility.

IoT is crucial for utilities; so is AI

In the energy industry, key needs include energy storage, cybersecurity, outage management, and distributed, rather than centralized, energy resources. These areas are ripe for IoT, machine learning and AI, and other emerging technologies. But utility companies still need help with strategy and implementation.

Hadoop has limited security

Traditional data warehouses are simple but slow; the “event to insight cycle time” is long. Today, architectures are complex. The use of NoSQL solutions, as described in MemSQL’s Hadoop/HDFS case study, adds some capability, but also adds both complexity and cost.

MemSQL's convergence offers 10x the performance at one-third the cost

MemSQL offers a converged solution. It has nearly unlimited capability, fast ingest, and fast queries. Unlike Hadoop/HDFS and the whole NoSQL world, it has native support for ANSI SQL, leveraging existing developer skills. SME describes MemSQL as the heart of a modern digital transformation ecosystem, offering 10 times the performance of competing solutions at one-third the cost.

Utility Case Studies

There are several existing case studies for MemSQL client solutions in the energy industry. A Top 10 US utility streamed data through MemSQL for real-time analytics. They started identifying theft within 10 days of going live with MemSQL. Processing time for one legacy job dropped from 22 hours to less than 20 seconds. Ironically, the use of MemSQL increased the lifespan of existing platforms, since emerging, complex tasks could be offloaded to MemSQL.

Scalable analytics platform based on MemSQL

Another top energy company used MemSQL for data integration across billions of data points, with fast, efficient analytics running on top. Quality of service has improved, leading to happier customers. And a leading US energy company mitigates more than $1M a day in drilling costs using machine learning, taking out SAP HANA for a much faster, lower-cost solution based on MemSQL.

SME Group summarizes MemSQL as fast, scalable SQL

SME describes MemSQL as “the fastest thing on the planet”; simple, high-performance, low-cost, and very flexible.

Q&A, and More

Two questions were answered at the end of the webinar. SME has seen time to value as short as two weeks with MemSQL. And, a user asked about the comparison of MemSQL to Snowflake. SME describes Snowflake as a fine database, but limited to the cloud, which is not always the favorite for utilities. MemSQL is unmatched on ingest capability, among other attributes, and gets the job done at a lower price point than other solutions.

The webinar also describes a detailed case study of a MemSQL implementation that supports a new meter deployment with much greater data ingest and processing requirements. We’ll share a deep dive into this case study in a future blog post.

In the meantime, you can schedule a demo with the SME Solutions Group; download and run MemSQL for free; or contact MemSQL.

Building a Database-as-a-Service with Kubernetes

Feed: MemSQL Blog.
Author: Micah Bhakti.

Our new database-as-a-service offering, MemSQL Helios, was relatively easy to create – and will be easier to maintain – thanks to Kubernetes. The cloud-native container management software has been updated to more fully support stateful applications. This has made it particularly useful for creating and deploying MemSQL Helios, as we describe here.

From Cloud-Native to Cloud Service

MemSQL is a distributed, cloud-native SQL database that provides in-memory rowstore and on-disk columnstore to meet the needs of transactional and analytic workloads. MemSQL was designed to be run in the cloud from the start. More than half of our customers run MemSQL on major cloud providers, including Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.

Even with the relative simplicity of deploying infrastructure in the cloud, more and more of our customers are looking for MemSQL to handle infrastructure monitoring, cluster configuration, management, maintenance, and support, freeing customers to focus on application development and accelerating their overall pace of innovation.

MemSQL Helios delivers such a “managed service” for the MemSQL database. Thanks to the power of Kubernetes and the advancements made in that community, we were able to build an enterprise database platform-as-a-service with a very small team in just six months, a fraction of the time it would have taken previously.

Making MemSQL Helios Portable

Many of the members of the MemSQL team have built SaaS offerings on other platforms, and one of the key things we’ve learned is that applications developed on one cloud platform are not inherently portable to another platform. If you want to be able to move workloads from one platform to another, you have to make careful design choices.

Each cloud provider builds unique features, services, and methods of operation into their offerings to reflect their own ideas as to what users need and to gain competitive advantage. These differences make it harder for customers to move resources – code, data, and operational infrastructure – from one cloud to another. This stickiness, which is often very strong indeed, benefits the cloud provider. Switching becomes expensive. Additionally, developers and operations people become expert on one platform, and have a steep learning curve if they want to move to another.

In response, many companies now follow a “multi-cloud” strategy, where they deploy their IT assets across 2 or more providers. By developing a cloud-agnostic offering, we sought to empower MemSQL customers to deploy their database on the infrastructure of their choice, so that it works the same way across clouds. With cloud provider-specific services like AWS Aurora, or Microsoft SQL Database on Azure, this easy portability disappears.

Achieving True Portability with Kubernetes

Kubernetes allows application containers to be run on multiple platforms, thus reducing the development cost needed to be infrastructure agnostic, and it’s proven at large scale – for example, Netflix serves 139 million customers from their Kubernetes-based platform. And, with Kubernetes 1.5, a new capability called StatefulSets was introduced. StatefulSets give devops staffers resources for dealing with stateful containers, including both ephemeral and persistent storage volumes.

When we began developing our managed service, we actually began by using the Google Kubernetes Engine (GKE). What we discovered was that while Amazon provides Elastic Kubernetes Service (EKS), and Microsoft provides Azure Kubernetes Service (AKS), each of these offerings runs different versions of Kubernetes.

Figure 1. The first option MemSQL considered was to use three distinct, cloud provider-specific versions of Kubernetes – EKS, GKS, and AKS.

In some cases, the Kubernetes version on offer is significantly outdated. Also, each is implemented in such a way as to make it hard to migrate applications and services between them. Providing true platform portability was incredibly important to us, so we made the decision not to use EKS, GKE, or AKS. Instead, we chose to deploy our own Kubernetes stack on each of the cloud platforms.

We needed a way to repeatedly deploy infrastructure on each of the clouds in each of the regions we wanted to support. There are currently 16 AWS regions, 15 GCP regions, and 54 (!) Azure regions. That’s an unreasonable amount of infrastructure to manually deploy. Enter Kubernetes Operations (KOPS).

KOPS is an open-source tool for creating, destroying, upgrading, and maintaining Kubernetes clusters. It gives us a repeatable way to stand up the Kubernetes clusters on which kubernetes and kubectl then manage our Docker containers. By using KOPS, we are able to programmatically deploy Kubernetes clusters to each of the regions we want to support, and then tie the deployments into our back-end infrastructure to create MemSQL clusters.

Creating a Kubernetes Operator

In the past, MemSQL was managed using a stateful ops tool that ran individual clients on each of the MemSQL nodes. This type of architecture is problematic when the master and client get out of sync, or if the client processes crash, or if they fail to communicate with the MemSQL engine.

In light of this, last year we built a new set of stateless tools that interact directly with MemSQL via an engine interface called memsqlctl. Because the memsqlctl interface is built into the engine, users don’t have to worry about the version getting out of sync, or about the client thinking it’s in a different state than the engine expects.

Memsqlctl seemed like the perfect way to manage MemSQL nodes in a Kubernetes cluster, but we needed a way for Kubernetes to communicate with memsqlctl directly.

In order to allow Kubernetes to manage MemSQL operations, such as adding nodes or rebalancing the cluster, we created a Kubernetes Operator. In Kubernetes, an Operator is a process that allows Kubernetes to interface with Custom Resources like MemSQL. Both the ability and the need to create Operators was introduced, along with StatefulSets, in Kubernetes 1.5, as mentioned above.

Figure 2. The option we chose was to create our own portable Kubernetes stack and a toolset based on KOPS and our operator.

Custom Resources for the Kubernetes Operator

We began by creating a Custom Resource Definition (CRD) – a pre-defined structure, for use by Kubernetes Operators – for memsql. Our CRD looks like this:

memsql-cluster-crd.yaml

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: memsqlclusters.memsql.com
spec:
  group: memsql.com
  names:
    kind: MemsqlCluster
    listKind: MemsqlClusterList
    plural: memsqlclusters
    singular: memsqlcluster
    shortNames:
      - memsql
  scope: Namespaced
  version: v1alpha1
  subresources:
    status: {}
  additionalPrinterColumns:
  - name: Aggregators
    type: integer
    description: Number of MemSQL Aggregators
    JSONPath: .spec.aggregatorSpec.count
  - name: Leaves
    type: integer
    description: Number of MemSQL Leaves (per availability group)
    JSONPath: .spec.leafSpec.count
  - name: Redundancy Level
    type: integer
    description: Redundancy level of MemSQL Cluster
    JSONPath: .spec.redundancyLevel
  - name: Age
    type: date
    JSONPath: .metadata.creationTimestamp

Then we create a Custom Resource (CR) from that CRD.

memsql-cluster.yaml

apiVersion: memsql.com/v1alpha1
kind: MemsqlCluster
metadata:
  name: memsql-cluster
spec:
  license: "memsql_license"
  releaseID: 722ce44d-6f95-4855-b093-9802a9ae7cc9
  redundancyLevel: 1

  aggregatorSpec:
    count: 3
    height: 0.5
    storageGB: 256
    storageClass: standard

  leafSpec:
    count: 1
    height: 1
    storageGB: 1024
    storageClass: standard

The MemSQL Operator running in Kubernetes understands that memsql-cluster.yaml specifies the attributes of a MemSQL cluster, and it creates nodes based on the releaseID and the aggregator and leaf node specs listed in the custom resource.

There are many benefits to MemSQL in having an Operator, beyond using it for MemSQL Helios. MemSQL customers and partners started requesting an Operator as soon as the capability was introduced; now that it’s available, several of them are experimenting with the MemSQL Kubernetes Operator for their own Kubernetes implementations.

Benefits of Kubernetes and Managed Service Infrastructure

Our original goal was to get MemSQL running in containers managed by Kubernetes for portability and ease of management. It turns out that there are a number of other benefits that we can take advantage of by building on the Kubernetes architecture.

Online Upgrades

The MemSQL architecture is composed of master aggregators, child aggregators, and leaf nodes that run in highly-available pairs. Each of our nodes is running in a container, and we have created independent availability groups for the nodes. This means that when we want to perform an upgrade of MemSQL, we can simply launch containers with the updated memsql process. By replacing the leaf containers one availability group at a time, then the child aggregators, and then the master aggregator, we can perform an online upgrade of the entire cluster, with no downtime for data manipulation language (DML) operations.

Declarative Configuration

Kubernetes uses a declarative configuration to specify cluster resources. This means that it monitors the configuration yaml files and, if the contents of the files change, Kubernetes automatically re-configures the cluster to match. So cluster configuration can be changed at any time; and, because Kubernetes and the MemSQL Operator understand how to handle MemSQL operations, the cluster configuration can change seamlessly, initiated by nothing more than a configuration file update.

Recovering from Failure

Kubernetes is designed to monitor all the containers currently running and, if a host fails or disappears, Kubernetes creates a replacement node from the appropriate container image automatically. Because MemSQL is a distributed and fault-tolerant database, this means that not only is the database workload unaffected by the failure; Kubernetes resolves the issue automatically, the database recovers the replaced node, and no user input is required.

This capability works well in the cloud, because you can easily add nodes on an as-needed basis – only paying for what you’re using, while you’re using it. So Kubernetes’ ability to scale, and to support auto-scaling, only works well in the cloud, or in a cloud-like on-premises environment.

Scalability – Scale Up/Scale Down

By the same mechanism used to replace failed instances, Kubernetes can add new instances to, or remove instances from, a cluster, in order to handle scale-up and scale-down operations. The Operator is also designed to trigger rebalances, meaning that the database information is automatically redistributed within the system when the cluster grows or shrinks.

In this initial release of MemSQL Helios, the customer requests increases or decreases in the cluster size from MemSQL, which is much more convenient than making the changes themselves. Internally, this changes a state file that causes the Operator to implement the change. In the future, the Operator gives us a growth path to add a frequently requested feature: auto-resizing of clusters as capacity requirements change.

Parting Thoughts

Using Kubernetes allowed us to accomplish a tremendous amount with a small team, in a few months of work. We didn’t have to write a lot of new code – and don’t have a ton of code to maintain – because we can leverage so much of the Kubernetes infrastructure. Our code will also benefit from improvements made to that infrastructure over time.

Integrating MemSQL with Kubernetes allowed us to build a truly cloud-agnostic deployment platform for the MemSQL database, but it also gave us a platform for offering new features and increased flexibility over traditional deployment architectures. Because of the declarative nature of Kubernetes, and because we built a custom MemSQL Operator for Kubernetes, we can more easily create repeatable and proven processes for all types of MemSQL operations. As a result, we were able to build this with just a couple of experienced people over a period of roughly six months.

Now that we have a flexible and scalable architecture and infrastructure, we can continue to build capabilities on top of the platform. We are already considering features such as region-to-region disaster recovery, expanded operational simplicity – with cluster-level APIs for creating, terminating, or resizing clusters – and building out our customer portal with telemetry and data management tools to let our customers better leverage their data.

This is just the beginning.

Leveraging AWS Sagemaker and MemSQL for Real-Time Streaming Analytics

Feed: MemSQL Blog.
Author: Mark Lochbihler.

On Tuesday, November 12th, MemSQL will be presenting a workshop at AWS Partner Developer Day. (Click here to join, or here for the Presentation and Student Guide on Github.) The workshop will be held at the AWS office at 350 West Broadway in New York City. The workshop will be focused on teaching customers how to enable real-time, data-driven insights. In the workshop, Amazon Sagemaker, the AWS managed service for deploying machine learning models quickly, will be shown working with MemSQL, a highly performant, cloud-native database. MemSQL supports streaming data analytics and the ability to run operational analytics workloads on current data, blended in real time with historical data, not yesterday’s closed data from a nightly batch update.

Putting Machine Learning into Production

Productionalizing a machine learning (ML) model has three phases. The lifecycle of a model begins with a build phase, then a training phase. After that, a model is selected to be used by the business.

AWS SageMaker and MemSQL together handle real-time data.

Some of the selected champion models are deployed into production. These are the models that have the potential to drive significant, immediate business value.

Many organizations today are leveraging Amazon SageMaker’s highly scalable algorithms and distributed, managed data science and machine learning platform to develop, train and deploy their models. In the workshop, we will begin by leveraging a Sagemaker Notebook to build and train an ML model.

SageMaker is a powerful tool for implementing machine learning (ML) models in AWS.

What MemSQL Adds to ML and AI

MemSQL adds a great deal to machine learning and AI, and a large share of MemSQL customers are running machine learning models and AI programs in production using MemSQL.

MemSQL helps modernize existing data infrastructure. The rapid flow of data through the infrastructure is vital to machine learning and AI. New tools, such as Python programs or the use of SparkML for running models, can be integrated with MemSQL directly. Processing is fast, running at production speeds.

Big data systems based on Hadoop were sold and installed with the promise of supporting machine learning and AI. These systems do indeed gather large amounts of data together, where it’s used for the data science work that builds machine learning models.

However, when it comes time to run the models, HDFS – the file system for Hadoop – is often too slow. Data is not stored in SQL format, and custom queries are needed to retrieve key information. The queries, however, run too slowly for production.

So data is transferred from the data lake to MemSQL for production. This has three key benefits: the data is now accessible to SQL queries; the queries run much faster; and the performance achieved is scalable, simply by adding more servers.

MemSQL is an instant AI/ML upgrade for data structures.

MemSQL integrates with a wide range of tools popular for use in developing and implementing machine learning models – many of which use Python as the programming language. Tools include pandas, NumPy, TensorFlow, scikit-learn, and named connectors in Microsoft SQL Server Analysis Services (SSAS), an analytics processing and data mining tool for Microsoft SQL Server. SSAS can pull in information from a wide range of different sources. The R programming language is also used, due to its facility with statistical computing and its widespread use for data mining.

A wide range of ML tools integrate with MemSQL.

All of these tools can be used along with MemSQL Pipelines. Data can be “scored on load” against a machine learning model. As data is transferred from a source – AWS S3, the Hadoop HDFS database, Kafka pipelines, or the computer’s file system – it is operated on by Python or executable code, scored against the model. This flexible capability allows models to be operationalized rapidly and for scoring to run at high speed.

Scoring ML models with MemSQL Pipelines.

With MemSQL, user code can be implemented as user-defined functions (UDFs), stored procedures – the ability to run stored procedures on ingest is called Pipelines to Stored Procedures – user-defined aggregate functions, and more.
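Putting those pieces together, here is a minimal sketch of scoring on load: a pipeline that runs an executable transform (such as a Python scorer) over each batch, then hands the rows to a stored procedure. The transform URL, topic, and procedure name are hypothetical.

-- Score each incoming record against the model, then let a stored
-- procedure decide how to write the results:
CREATE PIPELINE scored_events AS
  LOAD DATA KAFKA 'kafka-broker:9092/events'
  WITH TRANSFORM ('http://models.example.com/score.tar.gz', 'score.py', '')
  INTO PROCEDURE upsert_scored_event;

START PIPELINE scored_events;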

Because MemSQL is relational, with ANSI SQL support, common SQL functions such as joins and filters, and SQL-compatible tools (such as most business intelligence [BI] programs) are not only easy to use, but run fast, with scalability.

User code is easily integrated, along with SQL calls.

These capabilities allow MemSQL to work well as a partner environment for SageMaker.

What Happens in the Workshop

Once we have developed the ML model, it will be deployed in production, enabling potentially significant value-add to a real-time business process. This deployment step is also referred to as “operationalizing the model.”

With Sagemaker, our deployed model will be exposed as an inference API endpoint. At this point, we are ready to consume the ML model in a streaming data ingestion pipeline.

With MemSQL, both real time and historical data are available to both your training and operationalization environments, helping organizations greatly reduce time to value and operational costs, while at the same time improving the customer experience. As more data is made available to your model, it can be optimized over time, and dynamically redeployed and leveraged within MemSQL on AWS.

Operational analytics works, with SageMaker and MemSQL.

Participants in this half-day, hands-on workshop will be taught how to jointly deploy MemSQL real-time streaming ingest pipelines with Sagemaker inference ML endpoints. If you are a Data Engineer, Architect, or Developer, this session is designed for you.

We hope you can join us on November 12th in New York City, or at one of our future sessions. Click here to join.

For further information, email Mark Lochbihler, Director of Technical Alliances at MemSQL, or Chaitanya Hazarey, ML Specialist Solutions Architect at AWS.

Case Study: Moving to Kafka and AI at a Major Technology Services Company

Feed: MemSQL Blog.
Author: Floyd Smith.

How does a major technology services company equip itself to compete in the digital age – and provide services so outstanding that they can significantly advance the business prospects of their customers? All while reducing complexity, cutting costs, and tightening SLAs – in some cases, by 10x or more? For one such company, the solution is to deliver real-time, operational analytics with Kafka and MemSQL.

In this company, data flowed through several data stores, and from a relational, SQL database, into NoSQL data stores for batch query processing, and back into SQL for BI, apps, and ad hoc queries. Now, data flows in a straight line, through Kafka and into MemSQL. Airflow provides orchestration.

Before MemSQL: Custom Code, PostgreSQL, HDFS, Hive, Impala, and SQL Server

At this technology services company, analytics is absolutely crucial to the business. The company needs analytics insights to deliver services and run their business. And they use their platform to provide reports and data visualizations to their customers. (We’re leaving the company unidentified so they can speak more freely about their process and their technology decisions.)

The company’s original data processing platform was developed several years ago, and is still in use today – soon to be replaced by Kafka and MemSQL. Like so many companies at that time, they chose a NoSQL approach at the core of their analytics infrastructure.

Data flowed through the analytics core in steps:

  • A custom workflow engine brings in data and schedules jobs. The engine was written in Python to maximize flexibility in collecting data and scheduling data pipelines.
  • The data is normalized and stored in PostgreSQL, one of the leading relational databases.
  • Data then moves into HBase, the data store for Hadoop – a NoSQL system that provides the ability to version data at an atomic (columnar) level.
  • In the next step, data moves to Apache Hive, the data warehousing solution for Hadoop. Then new, updated Parquet tables are created on Cloudera’s version of the Apache Impala Hadoop-to-SQL query engine.
  • Data then moves to SQL Server, another leading relational database, where it can be accessed by traditional, SQL-based business intelligence (BI) tools.
The previous architecture had data going from SQL to NoSQL, then back to SQL.

This system has worked well for batch-type analytics and other batch-oriented use cases, which still make up most of what the company does with data. And, at different stages, data was available through either a traditional SQL interface, or through the ecosystem that has developed around Hadoop/HDFS (i.e., Impala).

Costs were reasonable, due to use of a combination of in-house, open source, and licensed software. And, because the data was in relational format, both before and after storage in HDFS, it was well-understood and orderly, compared to much of the data that is often stored in NoSQL systems.

However, the company is moving into real-time operational analytics, machine learning (ML), and AI. Looking at ML and AI highlighted many of the issues with the analytics processing core at the company:

  • Stale data. Data is batch-processed several times as it moves through the system. At each step, analytics are being run on older and older data. Yet, as the prospect of implementing AI showed, it’s the newest data that’s often the most valuable.
  • Loss of information. As data moves into a relational database (PostgreSQL), then into a NoSQL storage engine (HDFS), then into a cache-like query system (Cloudera Impala), and finally to another relational database (Microsoft SQL Server), the level of detail in the data that can be pulled through at each step is compromised.
  • Clumsy processes. All of the steps together have considerable operational overhead, and specific steps have their own awkward aspects. For instance, Cloudera Impala works by taking in entire files of data from Hive and making them available for fast queries. Updating data means generating entire new files and sending them to Impala.
  • Operational complexity. The company has had to develop and maintain considerable technical expertise dedicated just to keeping things going. This ties up people who could otherwise be building new solutions.
  • Not future-ready. As the company found when it wanted to move to ML and AI, their complex infrastructure prevented the embrace of new technology.

Moving to Kafka and MemSQL

The previous architecture used by the technical services company had been blocking their path to the future. So they’re moving to a new, simpler architecture, featuring streaming data and processing through MemSQL, with support for ML and AI. The company will even be driving robotics processes. They will track and log changes to the data. This will allow them to have a sort of “time travel,” as they refer to it.

The company has described what they need in the new platform:

  • Simplicity. Eliminate Hadoop to reduce complexity and delays. Eliminate Impala to cut out 40-50 minute waits to load big tables on updates.
  • Data sources. Oracle, Salesforce, Sharepoint, SQL Server, Postgres (100-plus sources).
  • Concurrent processing. Scale from 20-25 concurrent jobs, maximum, to 200 or more.
  • Query types. Simple, complex, hierarchical (analytics), aggregates, time-series; ad hoc.
  • Query speed. Cut ETL from 12 hours to 1 hour or less.
  • SQL support. Most of the company’s engineers are strong SQL developers – but they need a scalable platform, not a traditional, single-process relational database.
  • Business benefits. Take processing past thresholds that are currently being reached; reduce worker wait times (more jobs/day, more data/day, distributed container support).

What platform could do the job? As one of the project leads puts it, “We were awesomely pleased when we saw MemSQL. We can enter SQL queries and get the result in no time. This, we thought, can solve a lot of problems.”

The company quickly knew that Kafka would be the best way to ingest data streaming in at high volume. So, when they investigated MemSQL, they were especially happy to find the Kafka pipeline capability, including exactly-once updating, in line with Kafka. This helped them move Kafka to a larger role in their planning – from only being used to feed AI, to a streaming pipeline for all their analytics data.

The company is still in the design phase. Kafka and MemSQL have handled all the use cases they’ve thrown at the combination so far. They can replicate the same environment in the cloud and on-premises, then move workloads wherever it’s cost-effective and convenient.

The company can also mix and match rowstore and columnstore tables. For instance, data can flow into a rowstore table, where they will perform concurrent transformations on it. They then aggregate new and existing data in a columnstore table, eventually totaling petabytes of information.

Unlike most databases – such as with the Impala solution and Parquet files, in the current solution – MemSQL can update columnstore tables without having to re-build and re-load them. (This capability is enhanced in the upcoming MemSQL 7.0, with faster seeks and, as a result, faster updates.)

This matches one of the company’s biggest use cases: support for lab queries. The labs need nearly infinite disk space for ongoing storage of data. They have columnstore tables with hundreds of columns, and their old solution had trouble pulling them through the Hive metastore. MemSQL gives them the ability to support thousands of discrete columns. Since MemSQL also has built-in support for JSON, as the project lead puts it, “there really are no limits.” They can then move selected data into rowstore tables for intensive processing and analysis.
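
As a sketch of what that flexibility looks like, a columnstore table can hold a JSON column next to relational ones, and MemSQL’s :: operators extract typed values in queries; the names here are hypothetical.

-- Columnstore table mixing relational and JSON data:
CREATE TABLE lab_results (
  run_id BIGINT NOT NULL,
  payload JSON NOT NULL,
  KEY (run_id) USING CLUSTERED COLUMNSTORE
);

-- ::$ extracts a JSON field as a string, ::% as a number:
SELECT run_id, payload::$instrument AS instrument
FROM lab_results
WHERE payload::%temperature > 98.5;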

The company stopped looking at alternatives after it found MemSQL. For instance, Citus for Postgres provides a distributed query engine, Citus, on top of a very mature database, PostgreSQL. “However,” says the project lead, “it has limitations, and requires a lot of up-front planning to make it work.”

Design and Implementation: Airflow, Kafka, and MemSQL, plus Spark

There are many databases that can do some of these things in a pilot project, with limited amounts of data – or, in particular, with limited numbers of users. But, as the project lead says, “The beauty of MemSQL is, if you get more users, you just add more nodes.”

The technical services company will use Airflow as a scheduler. Airflow sends commands to remote systems to send their data. Commands such as Create Pipeline get data flowing. Airflow sends the data to be ingested into a Kafka topic, then drops the connection to the remote system. Airflow doesn’t hold data itself; it’s simply there for orchestration.

A straight flow from Kafka to MemSQL, with processing against MemSQL, speeds operations with fresh data.

The company then uses a MySQL interface – MemSQL is MySQL wire protocol-compatible – to ingest the data into MemSQL. With the data in MemSQL, they have almost unlimited options. They can run Python code against database data, using MemSQL transforms and Pipelines to stored procedures.

“Our only cost,” says the team lead, “is team members’ time to develop it. This works for us, even in cases where we have a one-man design team. Data engineers leverage the framework that we’ve built, keeping costs low.”

The former analytics processing framework, complex as it was, included only open source code. “This is the first time we’ve introduced subscription-based licensing” into their analytics core, says the project lead. “We want to show the value. In the long run, it saves money.”

They will be exploring how best to use Spark with MemSQL – a solution many MemSQL users have pioneered. For instance, they are intrigued by the wind turbine use case that MemSQL co-CEO Nikita Shamgunov demonstrated at Spark Summit. This is a real, live use case for the company.

They currently have a high-performance cluster with petabytes of storage. They will be scaling up their solution with a combination of MemSQL and Spark, running against a huge dataset, kept up to date in real time.

“The core database, by itself, is the most valuable piece for us,” says the project lead. “And the native Kafka integration is awesome. Once we also get the integration with Spark optimized – we’ve got nothing else that compares to it, that we’ve looked at so far.”

The company runs Docker containers for nearly everything they do and is building up use of Kubernetes. They have been pleased to learn that MemSQL Helios, MemSQL’s new, elastic cloud database, is also built on Kubernetes. It gives the project lead a sense of what they can do with MemSQL and Kubernetes, going forward. “That’s the point of containerization,” says the project lead. “Within a minute or two, you have a full cluster.”

“We’ve gone from a screen full of logos,” says the project lead, “in our old architecture, to three logos: Airflow, Kafka, and MemSQL. We only want to deal with one database.” And MemSQL will more than do the job.

Spin Up a MemSQL Cluster on Docker Desktop in 10 Minutes

Feed: MemSQL Blog.
Author: Floyd Smith.

Even though MemSQL is a distributed system, you can run a minimal version of MemSQL on your laptop in Docker. We tell you how in this blog post. The combination of free use of the software, and being able to run MemSQL on your laptop, can be extremely convenient for demos, software testing, developer productivity, and general fooling around.

In this post we’ll quickly build a single-instance MemSQL cluster running on Docker Desktop, on a laptop computer, for free. You’ll need a machine with at least 8GB RAM and four CPUs. This is ideal for quickly provisioning a system to understand the capabilities of the SQL engine. Everything we build today will be running on your machine, and with the magic of Docker containers, we’ll not need to install or configure much of MemSQL to get it running.

MemSQL and Docker are a natural pair for running on a laptop.

The steps here are: install Docker Desktop; get a free MemSQL license; create a Docker Compose file; start the MemSQL cluster through Docker; and browse to MemSQL Studio. You’ll have a bare-bones MemSQL cluster running on your laptop machine in no time.

Why Docker, and Why a Single Docker Container?

Docker is a great way to run software in a protected sandbox and easy-to-manage environment, with less overhead than a virtual machine (VM) – and much less than a dedicated server. You can use it to stand up applications, systems, and virtual hardware to try out software, or to quickly spin up a database to support local application development. We’ll use Docker here to provision and spin up a free MemSQL cluster, and just as easily, destroy it when we’re done.

Using Docker makes it much easier to run a small MemSQL cluster without interfering with other software running on the machine. Pre-installed in the cluster-in-a-box container image are an aggregator node, a leaf node, and MemSQL Studio, a browser-based SQL editor and database maintenance tool – all running in one place, all pre-configured to work together. The minimal hardware footprint wouldn’t be nearly enough for production workloads, but it allows us to quickly spin up a cluster, connect it to our project, and try things out.

We could use a virtual machine (VM) instead, but Docker containers are lighter-weight. Like VMs, containers provide a sandbox between processes. But unlike VMs, containers virtualize the operating system instead of the hardware, and Docker’s configuration-as-code mindset ensures we can quickly provision a complete virtual system from a small text file stored in Git.

With the single-container MemSQL cluster described here, you can craft the simplest of tables all the way up to running a complex app, a dashboard, a machine learning model, streaming ingest from Kafka or Spark, or anything else you can think of against MemSQL. You’ll quickly understand the methodology and features of MemSQL, and can plan accordingly.

The MemSQL cluster-in-a-box container has the minimum hardware spec checks disabled, but you’ll still want a machine with at least 8 GB RAM and four CPUs. With specs well below MemSQL’s minimums, you’ll see poor performance, so this system is definitely not the right setup for a proof of concept (PoC). But you can use this setup to experience the full features of MemSQL, and understand how it applies to your business problems.

Once you know how MemSQL works, you can take these experiments and use what you learn to help you achieve your service-level agreements (SLAs) on distributed clusters that meet MemSQL’s minimum requirements and your high-availability needs. That’s when you can really open up the throttle, learn how your data performs on MemSQL, and dial in system performance for your production workloads.

Cluster-in-a-box vs. Multi-node Cluster

  • Hardware: a laptop computer (cluster-in-a-box) vs. many hefty servers (multi-node cluster).
  • Best use case: trying out MemSQL, testing MemSQL capabilities, and prototyping (cluster-in-a-box) vs. proof of concept (PoC), production workloads, and high availability (multi-node cluster).
  • Cost: free up to four nodes with 32GB RAM each, with community support (either configuration).

Sign Up for MemSQL

To get a free license for MemSQL, register at memsql.com/download and click the link in the confirmation email. Then go to the MemSQL customer portal at portal.memsql.com and log in. Click “Licenses” and you’ll see your license for running MemSQL for free. This license never expires, and is good for clusters up to four machines and up to 128GB of combined RAM. This is definitely not the license you’ll want for a production cluster, but it’s great for these “kick the tires” scenarios. Note this license key. We’ll need to copy/paste it into place next.

Install Docker Desktop

The first step in getting our MemSQL cluster running in Docker Desktop is to get Docker Desktop installed. If you already have a recent version of Docker Desktop, you need only ensure you’re in Linux containers mode.

Docker’s install requirements are quite specific, though most modern mid-range systems will do. Docker Desktop for Windows runs a Linux VM in Hyper-V, and Hyper-V requires Windows 10 Pro or Enterprise. Docker Desktop for Mac runs a Linux VM in xhyve, and requires a 2010 or newer model with macOS 10.13 or better.

To install Docker Desktop, go to Docker Hub and choose the Docker Desktop version for your operating system. The download will require you to create a free account. Run the downloaded installer and accept all the defaults.

Windows machines can run MemSQL too – in Linux containers mode.

Note for Windows users: If you are doing a fresh install, ensure you choose “Linux Containers” mode. If you installed Docker previously, ensure you’re running in Linux containers mode. Right-click on the Docker whale in the system tray (bottom-right by the clock), and choose “Switch to Linux Containers”. If it says “Switch to Windows Containers”, you’re already in the right place – that is, in Linux Containers mode.

Adding more RAM: Though not required, MemSQL will definitely behave better when Docker Desktop has more capacity. Click on the Docker whale, choose “Settings…” on Windows or “Preferences…” on Mac, and click on the “Advanced” tab. If your machine has more than 8 GB RAM, set the memory to 8192 MB. If your machine has 8 GB RAM or less, set it as high as you can. Then change the CPU count from 2 to 4.

Create a Docker Compose file

A docker-compose.yaml file gives Docker Desktop instructions to spin up one or more containers together. It’s a great way to capture all the docker pull, docker build, and docker run details. This file doesn’t replace Dockerfiles but rather makes it much easier to use them.

We’ll use the memsql/cluster-in-a-box image built by MemSQL and available on Docker Hub. Pre-installed in this image are the MemSQL database engine and MemSQL Studio. The minimum system requirements are disabled in this “cluster-in-a-box” configuration.

Create an empty directory and create a file named docker-compose.yaml inside. Open this file in your favorite code editor and paste in this content:
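
Based on the section descriptions that follow, a minimal sketch of the file looks like this (the ${LICENSE_KEY} value references an environment variable that we’ll set in the terminal shortly):

version: '2'

services:
  memsql:
    image: 'memsql/cluster-in-a-box'
    ports:
      - 3306:3306
      - 8080:8080
    environment:
      START_AFTER_INIT: 'Y'
      LICENSE_KEY: ${LICENSE_KEY}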

A yaml file is a text file that’s great for capturing our architecture setup. As with Python source code, white space is significant. Yaml uses two spaces, not tabs. Double-check the yaml file to ensure each section is indented with exactly two spaces. If you have more or fewer spaces, or if you’re using tabs, you’ll get an error on startup.

Here are the sections in the docker-compose.yaml file:

  1. We’re using the Docker Compose syntax Version 2.
  2. The services array lists all the containers that should start up together. We’ve only defined a single container here, named memsql, which uses the memsql/cluster-in-a-box image built by MemSQL.
  3. In the ports section, we identify inbound traffic that’ll route into the container. Open port 3306 for the database engine and port 8080 for MemSQL Studio. If either of these ports is in use on your machine, change only the left port. For example, to connect from outside Docker to the database on port 3307, use 3307:3306.
  4. The first environment variable, START_AFTER_INIT, exists for legacy reasons. Without this environment variable, the container will spin up, initialize the MemSQL cluster, and immediately stop. This is great for debugging, but not the behavior we want here.
  5. The second environment variable, LICENSE_KEY, is the placeholder for the license we got from portal.memsql.com. Don’t copy the license key into place here — you’ll accidentally leak secrets into source control. Instead, this syntax notes that we’ll reference an environment variable set in the terminal.

In time, we could easily add our application, business intelligence (BI) dashboard, and other resources to this file. If you have an existing docker-compose.yaml file, you can copy that content into place here too. Save the file, and we’re ready to launch Docker.

Starting the MemSQL Cluster

Open a new terminal window in the same directory as the docker-compose.yaml file. This could be Powershell, a command prompt, or a regular terminal.

First, we’ll set the license key as an environment variable. Copy the license from portal.memsql.com in the Licenses tab, and create an environment variable in the terminal:

Command prompt:
set LICENSE_KEY=paste_license_key_here
Powershell:
$env:LICENSE_KEY = 'paste_license_key_here'
Mac/Linux/Git Bash:
export LICENSE_KEY=paste_license_key_here

Paste your actual license key in place of paste_license_key_here. It’s really long and probably ends with ==.

Next, type this in the shell:

docker-compose up

This tells Docker to pull or build all the images, start up all the containers in our docker-compose.yaml file, and stream the console output from each container to our terminal. I find it fascinating to watch each application spew its innards here.

If you get an error starting the cluster, double-check that the license key is correct and that Docker is running. If you get an image pull failure, ensure your network connection is working as expected. To retry, press Ctrl-C in the terminal, then type:

docker-compose down
docker-compose up

Congratulations! We’ve launched a MemSQL cluster. Let’s dive in and start using it.

Start MemSQL Studio

Now that MemSQL is running in Docker, let’s dive in and start using it. Open the browser to http://localhost:8080 to launch MemSQL Studio. Click on the local cluster, enter the username root, leave the password blank, and log in.

On this main dashboard screen, we can see the health of the cluster. Note that this is a two-node cluster. Clicking on Nodes on the left, we see one node is a leaf node, one is an aggregator node, and they’re both running in the same container. In production, we’d want more machines running together to support production-level workloads and to provide high availability.

Click on the SQL Editor page, and we see the query window. In the query window, type each command, select the line, then push the execute button on the top-right:
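
As an illustration (the database and table names here are hypothetical), a first session might look like this:

CREATE DATABASE hello_memsql;
USE hello_memsql;
CREATE TABLE test (id INT PRIMARY KEY, message VARCHAR(50));
INSERT INTO test VALUES (1, 'Hello, MemSQL!');
SELECT * FROM test;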

For more details on MemSQL Studio, check out the docs or watch the MemSQL Studio tour video.

Where Can We Go From Here?

MemSQL is now ready for all the “kick the tires” tasks we need. You can:

  • Hook up your analytics dashboard to MemSQL.
  • Add your application to the docker-compose.yaml file and connect it to MemSQL using any MySQL connector.
  • Create an ingest pipeline from Kafka, S3, Spark, or other data source.

Being able to run these tasks from such a simple setup is a real time-saver. However, don’t expect the same robustness or performance as you would have with a full install on a full hardware configuration.

Cleanup

We’ve finished our experiment for today. To stop the database, press Ctrl-C in the terminal, and type:

docker-compose down

This will stop all the containers in the docker-compose.yaml file. If you’re done experimenting with MemSQL, delete the image by running docker image rm memsql/cluster-in-a-box in the terminal. Or better yet, leave this image in place to quickly start up your next experiment. Run docker system prune to remove any stored data or dangling containers left in Docker Desktop and free up this space on your hard disk.

Conclusion

With the MemSQL cluster-in-a-box container and Docker Desktop, we quickly provisioned a “kick the tires” MemSQL cluster. Aside from Docker itself, there was nothing to install, and thus cleanup is a breeze. We saw how easy it is to spin up a cluster, connect to it with MemSQL Studio, and start being productive. Now get a copy of MemSQL for free and go build great things!


Spin Up a MemSQL Cluster on Kubernetes in 10 Minutes

Feed: MemSQL Blog.
Author: Floyd Smith.

Even though MemSQL is a distributed system, you can run a minimal version of MemSQL on your laptop in Kubernetes. We tell you how in this blog post. The combination of free access, and being able to run MemSQL on your laptop, can be extremely convenient for demos, software testing, developer productivity, and general fooling around.

In this post we’ll quickly build a single-instance MemSQL cluster running inside Kubernetes on Docker Desktop, on a laptop computer, for free. You’ll need a machine with at least 8GB RAM and four CPUs. This is ideal for quickly provisioning a system to understand the capabilities of the SQL engine. Everything we build today will be running on your machine, and with the magic of containers, we’ll not need to install or configure much of MemSQL to get it running.

You can run MemSQL in Kubernetes on your laptop.

The steps here are: get a free MemSQL license; install Docker Desktop; create a Kubernetes yaml file; start the MemSQL cluster through Kubernetes; and browse to MemSQL Studio. You’ll have a bare-bones MemSQL cluster running on your laptop machine in no time.

Why Kubernetes, and Why a Single Container?

Containers are a great way to run software in a protected sandbox and easy-to-manage environment, with less overhead than a virtual machine (VM) – and much less than a dedicated server. You can use containers to spin up applications and systems to try out software, or to quickly spin up a database to support local application development. We’ll use Docker Desktop in Kubernetes mode to provision and spin up a free MemSQL cluster, and just as easily, destroy it when we’re done.

Using Kubernetes makes it much easier to run a small MemSQL cluster without interfering with other software running on the machine. The cluster-in-a-box container image that we’ll use here includes an aggregator node, a leaf node, and MemSQL Studio (our browser-based SQL editor and database maintenance tool), all running in one place, all pre-configured to work together. The minimal hardware footprint wouldn’t be nearly enough for production workloads, but it allows us to quickly spin up a cluster, connect it to our project, and try things out.

You also have the option of using Docker containers without Kubernetes. We believe that having Kubernetes in the mix makes it easier to manage your cluster and introduces you to a powerful modus operandi for running MemSQL. However, if you don’t already have Kubernetes in your production environment, or much experience running it, you may want to consider running MemSQL in Docker containers, without Kubernetes. The steps to do that are very similar to the steps described in this blog post, and you can view them here.

We could also use a virtual machine (VM), but containers are lighter-weight. Like VMs, containers provide a sandbox between processes. But unlike VMs, containers virtualize the operating system instead of the hardware, and the configuration-as-code mindset shared by Docker and Kubernetes ensures that we can quickly provision a complete virtual system from a small text file stored in Git.

With the single-container MemSQL cluster described here, you can craft the simplest of tables all the way up to running a complex app, a dashboard, a machine learning model, streaming ingest from Kafka or Spark, or anything else you can think of against MemSQL. You’ll quickly understand the methodology and features of MemSQL, and can plan accordingly.

The MemSQL cluster-in-a-box container has the minimum hardware spec checks disabled, but you’ll still want a machine with at least 8 GB RAM and four CPUs. With specs well below MemSQL’s minimums, you’ll see poor performance, so this system is definitely not the right setup for a proof of concept (PoC). But you can use this setup to experience the full features of MemSQL, and understand how it applies to your business problems.

Once you know how MemSQL works, you can take these experiments and use what you learn to help you achieve your service-level agreements (SLAs) on distributed clusters that meet MemSQL’s minimum requirements and your high-availability needs. That’s when you can really open up the throttle, learn how your data performs on MemSQL, and dial in system performance for your production workloads.

Cluster-in-a-box vs. Multi-node Cluster

  • Hardware: a laptop computer (cluster-in-a-box) vs. many hefty servers (multi-node cluster).
  • Best use case: trying out MemSQL, testing MemSQL capabilities, and prototyping (cluster-in-a-box) vs. proof of concept (PoC), production workloads, and high availability (multi-node cluster).
  • Cost: free up to four nodes with 32GB RAM each, with community support (either configuration).

Sign Up For MemSQL

To get a free license for MemSQL, register at memsql.com/download and click the link in the confirmation email. Then go to the MemSQL customer portal at portal.memsql.com and log in. Click “Licenses” and you’ll see your license for running MemSQL for free. This license never expires, and is good for clusters up to four machines and up to 128GB of combined RAM. This is not the license you’ll want for a production cluster, but it’s great for these “kick the tires” scenarios. Note this license key. We’ll need to copy/paste it into place next.

Install Docker Desktop

The first step in getting our MemSQL cluster running in Kubernetes (k8s) is to get Docker Desktop installed. We’ll use Docker Desktop’s Kubernetes mode as the simplest way to a Kubernetes cluster. Though beyond the scope of this article, you can also use another K8s cluster such as MiniKube, K3s, MicroK8s, or kind.

Docker’s install requirements are quite specific, though most modern mid-range systems will do. Docker Desktop for Windows runs a Linux VM in Hyper-V, and Hyper-V requires Windows 10 Pro or Enterprise. Docker Desktop for Mac runs a Linux VM in xhyve, and requires a 2010 or newer model with macOS 10.13 or better. (Kubernetes does not add any system requirements beyond those needed for Docker.)

To install Docker Desktop, go to Docker Hub and choose the Docker Desktop version for your operating system. The download will require you to create a free account. Run the downloaded installer and accept all the defaults.

Note for Windows users: If you are doing a fresh install, ensure you choose “Linux Containers” mode. If you installed Docker previously, ensure you’re running in Linux containers mode. Right-click on the Docker whale in the system tray (bottom-right by the clock), and choose “Switch to Linux Containers”. If it says “Switch to Windows Containers”, you’re already in the right place – that is, in Linux Containers mode.

Note – adding more RAM: Though not required, MemSQL will definitely behave better when Docker Desktop has more capacity. Click on the Docker whale, choose “Settings…” on Windows or “Preferences…” on Mac, and click on the “Advanced” tab. If your machine has more than 8 GB RAM, set the memory to 8192 MB. If your machine has 8 GB RAM or less, set it as high as you can. Then change the CPU count from 2 to 4.

To turn on Kubernetes, open the Docker whale, choose “Settings…” on Windows or “Preferences…” on Mac, click the Kubernetes tab, and check “Enable Kubernetes”. If you don’t see this option, ensure you’re running in Linux containers mode or upgrade Docker Desktop. The first time you enable Kubernetes mode, it’ll take quite a while to download all the K8s control plane containers and start the cluster. Next time you start Docker, it’ll start much faster.

Kubernetes Configuration Files

Kubernetes stores configuration details in yaml files. (A yaml file is a text file that’s great for capturing our architecture setup.) Typically each yaml file contains a single resource. For simplicity, we’ll create one yaml file that includes both a deployment and a service.

We’ll connect to the service, the service will proxy to the pod, and the pod will route the request into the container.

We’ll use the memsql/cluster-in-a-box image built by MemSQL and available on Docker Hub. This image comes with the MemSQL database engine and MemSQL Studio preinstalled. The minimum system requirements are disabled in this “cluster-in-a-box” configuration.

Create an empty directory, and create a file named kubernetes-memsql.yaml inside. Open this file in your favorite code editor and paste in this content. As with Python source code, white space is significant.

Yaml uses two spaces, not tabs. (It is not our intent to argue spaces vs. tabs. We are simply letting you know how yaml files do it.) Double-check the yaml file to ensure each section is indented with exactly two spaces. If you have more or fewer spaces, or if you’re using tabs, you’ll get an error on startup.
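
Based on the section descriptions that follow, here is a minimal sketch of the file; the resource names and the app: memsql labels are illustrative:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: memsql
spec:
  replicas: 1
  selector:
    matchLabels:
      app: memsql
  template:
    metadata:
      labels:
        app: memsql
    spec:
      containers:
        - name: memsql
          image: memsql/cluster-in-a-box
          ports:
            - containerPort: 3306
            - containerPort: 8080
          env:
            - name: START_AFTER_INIT
              value: 'Y'
            - name: LICENSE_KEY
              value: paste_license_key_here
---
apiVersion: v1
kind: Service
metadata:
  name: memsql
spec:
  type: NodePort
  selector:
    app: memsql
  ports:
    - name: db
      port: 3306
      nodePort: 30306
    - name: studio
      port: 8080
      nodePort: 30080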

Here are the sections in the kubernetes-memsql.yaml file:

The --- line designates the break between the two resources. The content above it is the deployment; the content below it is the service. The deployment manages and restarts pods on failure, and the service load-balances traffic across all matching pods. A pod is a Kubernetes wrapper around one or more containers.

Deployment:

  1. replicas: 1 notes that we only want one pod (container) to spin up.
  2. The metadata section is a list of key/value pairs. There’s nothing magic about the names and values here, but they must match between the references in the deployment and the service.
  3. In the containers list, we list only one container. We pull the memsql/cluster-in-a-box image built by MemSQL, and name the container memsql.
  4. In the ports section, we identify inbound traffic that’ll route into the container. Open port 3306 for the database engine and port 8080 for MemSQL Studio.
  5. The first environment variable, START_AFTER_INIT, exists for legacy reasons. Without this environment variable, the container will spin up, initialize the MemSQL cluster, and immediately stop. This is great for debugging, but not the behavior we want here.
  6. The second environment variable, LICENSE_KEY, is the placeholder for the license we got from portal.memsql.com. In production scenarios, we’d pull this value from a k8s secret, but for this demo, paste your license key from portal.memsql.com into place in this file.

Service:

  1. The selector section matches the metadata from the deployment. This is how the service knows which pods to use to load-balance incoming traffic.
  2. Ports of type NodePort are exposed between 30,000 and 32,767, so we adjust the port numbers into this range. In the service, we route database traffic into k8s from port 30306 to the container on port 3306, and we route MemSQL Studio traffic into k8s from port 30080 to the container on port 8080. Through the magic of Kubernetes, only traffic on these two ports routes from the WAN side of the k8s router to the LAN side of the container. All other traffic is blocked. If either 30306 or 30080 is in use on your machine, change these to an open port between 30,000 and 32,767. It’s also possible to get Kubernetes to randomly assign a port, though that’s out of scope for this article.

Save the file, and we’re ready to launch the resources in Kubernetes.

Starting the MemSQL Cluster

Open a new terminal window in the same directory with the kubernetes-memsql.yaml file. This could be Powershell, a command prompt, or a regular terminal.

Type this in the shell:

kubectl apply -f kubernetes-memsql.yaml

This tells Kubernetes to create (or adjust) the service and deployment definitions, and to start up the container. The output from the container isn’t streamed to the console.

To see the status of the pod as it starts up, type:

kubectl get all

The results look like this:

Results as part of running MemSQL in Kubernetes on your laptop.

If the pod status doesn’t say Ready, then we need to look at the container’s logs. Grab the pod name (in this case it’s pod/memsql-6cfd48586b-8b2fj), then type:

kubectl logs pod/memsql-YOUR_POD_HERE

Substitute your pod name into place.

If you get an error starting the cluster, double-check that the license key is correct and, from the Docker whale icon, ensure that both Docker and Kubernetes are running. If you get an image pull failure, ensure your network connection is working as expected.

To relaunch the Kubernetes content, type:

kubectl delete -f kubernetes-memsql.yaml
kubectl apply -f kubernetes-memsql.yaml

Congratulations! We’ve launched a MemSQL cluster. Let’s dive in and start using it.

Start MemSQL Studio

Now that MemSQL is running in Kubernetes, let’s dive in and start using it. Open the browser to http://localhost:30080 to launch MemSQL Studio. Click on the local cluster, enter the username root, leave the password blank, and log in.

On this main dashboard screen, we can see the health of the cluster. Note that this is a two-node cluster. Clicking on Nodes on the left, we see one node is a leaf node, one is an aggregator node, and they’re both running in the same container. In production, we’d want more machines running together to support production-level workloads and to provide high availability.

Click on the SQL Editor page, and we see the query window. In the query window, type each command, select the line, then push the execute button on the top-right.
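
For example (again with hypothetical names), you might run:

CREATE DATABASE k8s_demo;
USE k8s_demo;
CREATE TABLE messages (id INT PRIMARY KEY, body VARCHAR(80));
INSERT INTO messages VALUES (1, 'Hello from Kubernetes!');
SELECT * FROM messages;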

For more details on MemSQL Studio, check out the docs or watch the MemSQL Studio tour video.

Where Can We Go From Here?

MemSQL is now ready for all the “kick the tires” tasks we need. You could:

  1. Hook up your analytics dashboard to MemSQL, connecting to localhost:30306.
  2. Start your application and connect it to MemSQL using any MySQL connector.
  3. Create an ingest pipeline from Kafka, S3, or other data source.

Being able to run these tasks from such a simple setup is a real time-saver. However, don’t expect the same robustness or performance as you would have with a full install on a full hardware configuration.

Cleanup

We’ve finished our experiment today. To stop the database, run this command:

kubectl delete -f kubernetes-memsql.yaml

This will delete the service and the deployment, which will delete the pod and stop the container. If you’re done experimenting with MemSQL, delete the image by running docker image rm memsql/cluster-in-a-box in the terminal. Or better yet, leave this image in place to quickly start up your next experiment. Run docker system prune to remove any stored data or dangling containers left in Docker Desktop and free up this space on your hard disk.

Conclusion

With the MemSQL cluster-in-a-box container and Kubernetes, we quickly provisioned a “kick the tires” MemSQL cluster. Aside from Docker itself, there was nothing to install, and thus cleanup is a breeze. We saw how easy it is to spin up a cluster, connect to it with MemSQL Studio, and start being productive. Now go build great things!

Case Study: Thorn Frees Up Resources with MemSQL Helios to Identify Trafficked Children Faster

Feed: MemSQL Blog.
Author: Floyd Smith.

Thorn’s child sex trafficking investigations tool, Spotlight, gathers information from escort sites to help law enforcement find trafficked children, fast. (A Special Agent for the Wisconsin Human Trafficking Task Force describes Spotlight this way: “It is the greatest tool we have in the fight against human trafficking.”) And using MemSQL is one of the ways they do it. MemSQL is a powerful solution that meets Thorn’s requirements, including SQL support; fast query response time; support for machine learning and AI; a large and scalable number of simultaneous users; and horizontal scale-out. Also, MemSQL runs just about anywhere, notably including on-premises installations and all the major public clouds.

Still, Thorn had a business problem. As a tech non-profit, they are highly skilled at identifying and making tradeoffs that will allow their small team to deliver the biggest impact. In a constantly shifting digital environment, they know they need to focus on keeping Spotlight agile, to help find victims faster. So they need to keep the operation and maintenance of Spotlight as simple and easy to manage as possible.

In support of this strategy, Thorn is moving to MemSQL Helios, the fully managed, on-demand, and elastic cloud database from MemSQL. Where MemSQL 7.0 meets Thorn’s database needs, MemSQL Helios meets Thorn’s operational needs – removing work from Thorn’s development and operations personnel, and leaving it in the hands of MemSQL.

Peter Parente, data engineer at Thorn, puts it well: “We want to focus our time on building the application for our mission, rather than managing every detail of exactly how the data is going to be stored.” Now, Thorn can focus on growing Spotlight to meet the needs of its users and fulfill its mission: to build technology to defend children from sexual abuse.

What Thorn Delivers

As Thorn describes it, new technologies can be used by abusers to facilitate abuse – and, thankfully, the same new technologies can be leveraged to stop this abuse. Thorn leverages data to find trafficked children faster, building technology to create a world where every child can be safe, curious, and happy.

There are more than 150,000 escort ads posted daily across the US, totaling in the millions of ads a year – and, somewhere in that mountain of data, children are being sold for sex. Thorn’s research shows that 63% of child sex trafficking survivors were advertised online at some point. Harnessing that data, Spotlight is offered for free to users who are involved in actively investigating child sex trafficking cases.

When Thorn started several years ago, they only focused on a few problematic sites and online sources. Now, the number of sites with child sex trafficking content is increasing, and the user base for Spotlight has grown. Thorn is a strong example of the need that so many organizations have for nearly limitless scalability and concurrent access.

“As time passes, we have greater data complexity. More data to store and more users that need to analyze that data,” says Parente. “There are more sites, and some of the sites have added features that increase the data flowing in from them as well.”

But, even as the demands increase, so does Thorn’s effectiveness. Thorn has huge impact. Spotlight has been very successful, helping to identify over 10,000 trafficked children. On average, eight children a day are identified with it. And Thorn is proud of having cut law enforcement investigation time by as much as 63% – that is, investigations now take nearly two-thirds less time. (Thorn also educates people on these topics; more than 3.5 million teens have learned to identify and prevent sextortion – extortion focused on nude images of the victim – through Thorn projects.)

To work effectively, such a system needs to meet a number of technical requirements:

  • Fast ingest and fast processing. Processing a site of interest quickly; finding matches in minutes, not hours; and synthesizing results to users for easier analysis.
  • Fully scalable. Thorn needs to be able to speed up or extend the system by adding capacity in a horizontal, linear fashion.
  • Fast query response time. As with finding matches and reporting, query response time must be fast – seconds, not minutes or hours.
  • High concurrency. Thorn needs to be able to support an ever-increasing number of signed-in recipients and interrogators of its data from a small computing footprint, with full scalability to meet new demands.

In addition, Thorn identified two business-oriented requirements, to allow them to fulfill their specific mission most effectively:

  • Low-maintenance. Thorn needs to spend as much of their engineering time as possible improving Spotlight by expanding its feature set. Building a reliable, flexible, data pipeline to support their solution needs to be as hassle-free and worry-free as possible. No one but Thorn can do this work. By making the core system as low-maintenance as possible, Thorn frees up their technical talent for this vital work.
  • Stateless. Thorn quickly identified Kubernetes as a core element of any solution. Kubernetes is very good for managing stateless components; stateful support has recently been added, but it’s still somewhat of a work in progress. (And will always be more complex than managing the stateless parts.) So Thorn sought to keep its solution stateless in as many components as possible, if not all of them.

How MemSQL Helios Helps Thorn Succeed

Thorn built a tool that meets all their requirements:

  1. Thorn finds new or updated content on targeted websites.
  2. The content is placed in an Amazon Simple Storage Service (S3) bucket.
  3. A scalable, Python data pipeline using the Dramatiq library (similar to Celery) receives notifications of new text and media content in S3 via Amazon’s Simple Queue Service (SQS) and processes it.
  4. The data pipeline stores the processed, transformed data in MemSQL Helios for exploration in the Spotlight application.
  5. Trained investigators look for key details that indicate a child trafficking victim, to build their case, and to locate the most vulnerable victims.

MemSQL Helios sits at the heart of the system. “It’s currently our primary data store,” according to Parente.

MemSQL Helios, running MemSQL 7.0, brings the latest and greatest, plus ease of use, to Thorn.

Using Machine Learning and AI to Facilitate Identifications

Thorn uses MemSQL’s Euclidean distance function for computing image similarity, resulting in very high throughput rates for image comparisons. The process is described in detail in this blog post from MemSQL co-CEO Nikita Shamgunov: MemSQL as a Data Backbone for Machine Learning and AI.

The slide below shows the use of this function. Thorn has previously worked with MemSQL on advances in machine learning for image recognition.

MemSQL has special functions that support ML and AI, including image recognition.
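
As a hedged sketch (the table name, column name, and vector literal are hypothetical), a nearest-neighbor image-similarity query using this function might look like:

SELECT id,
       EUCLIDEAN_DISTANCE(features, JSON_ARRAY_PACK('[0.1, 0.4, 0.3, 0.2]')) AS distance
FROM image_features
ORDER BY distance
LIMIT 10;

Here JSON_ARRAY_PACK converts a JSON array into the packed binary vector format that the distance function expects.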

Using Amazon SQS as a Data Pipeline

Thorn uses Amazon S3 and SQS as the input source for their data pipeline. Many other MemSQL customers have used Kafka in similar situations. (We recently published a case study featuring the Kafka-plus-MemSQL architecture from a major technology services company.) But Thorn finds Amazon SQS easier to maintain and manage.

According to Parente, “Our data is not currently delivered to S3 in a streaming fashion. It’s more a set of micro batches. We don’t currently have a need for the streaming support that you typically see associated with Kafka.”

“We rely on SQS to provide us with the notifications we need, as data is delivered into our S3 buckets,” continues Parente. “When we receive a notification, our data pipeline runs a set of machine learning models and natural language processing annotators before storing the results in MemSQL for use by our application.”

MemSQL Helios Helps Thorn Achieve Statelessness

Why has Thorn chosen MemSQL Helios, rather than self-managed MemSQL software, which they could install and run on AWS themselves? The main reason is to focus their technical resources on other areas. Every hour saved in database administration is an hour freed up for work that will speed up an investigator’s process, providing timely insights and aggregating information across time and space to find child victims faster.

The features of MemSQL Helios lend themselves to Thorn’s needs. Thorn has designed their system in such a way as to offload software maintenance and management to the greatest degree possible, using Kubernetes as their management tool for most of the system, and MemSQL Helios – which is built on Kubernetes, and managed using it – as their core database.

Kubernetes was originally developed for stateless services, and Thorn built their data pipeline (above) to be as stateless as possible. Parente says, “The pipeline workers are all stateless. If we fail processing some input data, the pipeline simply retries the input from S3 at some point in the future. Our processing is idempotent.”

More recently, Kubernetes has added features for managing stateful software. To make these features work, stateful software such as MemSQL (or any database) requires a Kubernetes Operator, which serves as an interface between the database and Kubernetes. MemSQL has created a Kubernetes Operator and uses it for managing MemSQL Helios. MemSQL customers are also using this Operator in their own development efforts.

Thorn could have used the MemSQL Operator to integrate self-managed MemSQL software into their Kubernetes management framework. Instead, they chose MemSQL Helios. “So in some way,” Parente continues, “the indirect answer to the question, ‘Are we depending on the stateful features of Kubernetes?,’ is ‘Yes – but indirectly, through Helios.’” Thorn maintains their stateless management framework by leaving the management of stateful software – their MemSQL database – to MemSQL, the company, through Helios.

“One of the reasons we’re using MemSQL Helios is to offload having to manage that stateful data store,” continued Parente. “If we weren’t using Helios, and instead hosting our own database, we would be responsible for scaling it on Kubernetes, making sure data is retained after nodes restart, repartitioning data to take advantage of new nodes, and so on.”

Thorn defers to other industry-leading experts for its other data store. “For S3, Amazon is managing the complexity,” says Parente. “The files are written, and then we assume that S3 works as advertised.”

The same questions arise for both technologies: “Are we sure it’s backed up? Is it going to scale? We want to offload that onto other vendors, including AWS and MemSQL. That’s time better spent for our mission-oriented work. We focus more on how we build out our system, or surface the processed information to our users in the best available fashion.”

This approach allows Thorn to work more closely with their users, improve the system to meet user needs, and get data out to them in the way they need it, in the formats and with the timeliness they need to prioritize the identification of child sex trafficking victims.

Conclusion

As Julie Cordua, CEO of Thorn, has said: “MemSQL is delivering a real impact for our organization by making real-time decisions and predictive analytics easier. And, because it easily scales to support our machine learning and AI needs, MemSQL helps us continually build better tools to find victims of trafficking and sexual abuse, faster. It is a true case of technology being applied in a way that will make a real difference in people’s lives.”

You too can take advantage of the ease of use, ease of management, and reliability of MemSQL Helios. Use MemSQL for free or contact MemSQL today.

It’s About Time: Getting More from Your Time-Series Data With MemSQL 7.0

Feed: MemSQL Blog.
Author: Eric Hanson.

MemSQL is uniquely suited to real-time analytics, where data is being ingested, updated, and queried concurrently with aggregate queries. Real-time analytics use cases often are based on event data, where each separate event has a timestamp. It’s natural to interpret such a sequence of events as a time series.

Prior to the 7.0 release, MemSQL delivered many capabilities that make it well-suited to time-series data management [Han19]. These include:

  • a scaled-out, shared-nothing architecture that supports transactional and analytical workloads with a standard SQL interface,
  • fast query execution via compilation and vectorization, combined with scale out,
  • ability to load data phenomenally fast using the Pipelines feature, which supports distributed, parallel ingestion,
  • non-blocking concurrency control so readers and writers never make each other wait,
  • window functions for ranking, moving averages, and so on,
  • a highly-compressed columnstore data format suitable for large historical data sets.

Hence, many of our customers are using MemSQL to manage time series data today.

For the MemSQL 7.0 release, we decided to build some special-purpose features to make it even easier to manage time-series data. These include FIRST(), LAST(), TIME_BUCKET(), and the ability to designate a table column as the SERIES TIMESTAMP [Mem19a-d]. Taken together, these allow specification of queries to summarize time series data with far fewer lines of code and fewer complex concepts. This makes expert SQL developers more productive, and opens up the ability to query time series data to less expert developers.

We were motivated to add special time series capability in MemSQL 7.0 for the following reasons:

  • Many customers were using MemSQL for time series data already, as described above.
  • Customers were asking for additional time series capability.
  • Bucketing by time, a common time series operation, was not trivial to do.
  • Use of window functions, while powerful for time-based operations, can be complex and verbose.
  • We’ve seen brief syntax for time bucketing in event-logging data management platforms like Splunk [Mil14] and Azure Data Explorer (Kusto) [Kus19] be enthusiastically used by developers.
  • We believe we can provide better overall data management support for customers who manage time series data than the time series-specific database vendors can. We offer time series-specific capability and also outstanding performance, scalability, reliability, SQL support, extensibility, rich data type support, and so much more.

Designating a Time Attribute in Metadata

To enable simple, brief SQL operations on time series data, we recognized that all our new time series functions would have a time argument. Normally, a table has a single, well-known time attribute. Why not make this attribute explicit in metadata, and an implicit argument of time-based functions, so you don’t have to reference it in every query expression related to time?

So, in MemSQL 7.0 we introduced a special column designation, SERIES TIMESTAMP, that indicates a default time column of a table. This column is then used as an implicit attribute in time series functions. For example, consider this table definition:

CREATE TABLE tick(
  ts datetime(6) series timestamp,
  symbol varchar(5),
  price numeric(18,4));

It defines a table, tick, containing hypothetical stock trade data. The ts column has been designated as the series timestamp. In examples to follow, we’ll show how you can use it to make queries shorter and easier to write.

The Old Way of Querying Time Series

Before we show the new way to write queries briefly using time series functions and the SERIES TIMESTAMP designation in 7.0, consider an example of how MemSQL could process time series data before 7.0. We’ll use the following data for examples:

INSERT INTO tick VALUES
 ('2020-02-18 10:55:36.179760', 'ABC', 100.00),
 ('2020-02-18 10:57:26.179761', 'ABC', 101.00),
 ('2020-02-18 10:59:16.178763', 'ABC', 102.50),
 ('2020-02-18 11:00:56.179769', 'ABC', 102.00),
 ('2020-02-18 11:01:37.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:46.179769', 'ABC', 103.00),
 ('2020-02-18 11:02:59.179769', 'ABC', 102.60),
 ('2020-02-18 11:02:46.179769', 'XYZ', 103.00),
 ('2020-02-18 11:02:59.179769', 'XYZ', 102.60),
 ('2020-02-18 11:03:59.179769', 'XYZ', 102.50);

The following query works in MemSQL 6.8 and earlier. As output, it produces a separate row, for each stock, for each hour it was traded at least once. (So if a stock is traded ten or more times, in ten separate hours, ten rows are produced for that stock. A row will contain either a single trade, if only one trade occurred in that hour, or a summary of the trades – two or more – that occurred during the hour.) Each row shows the time bucket, stock symbol, and the high, low, open, and close for the bucket period. (If only one trade occurred in that hour, the high, low, open, and close will all be the same – the price the stock traded at in that hour.)

WITH ranked AS
(SELECT symbol,
    RANK() OVER w as r,
    MIN(price) OVER w as min_pr,
    MAX(price) OVER w as max_pr,
    FIRST_VALUE(price) OVER w as first,
    LAST_VALUE(price) OVER w as last,
    from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) as ts
    FROM tick
    WINDOW w AS (PARTITION BY symbol, 
               from_unixtime(unix_timestamp(ts) div (60*60) * (60*60)) 
               ORDER BY ts
               ROWS BETWEEN UNBOUNDED PRECEDING
               AND UNBOUNDED FOLLOWING))
 
SELECT ts, symbol, min_pr, max_pr, first, last
FROM ranked
WHERE r = 1
ORDER BY symbol, ts;

This query produces the following output, which can be used to render a candlestick chart [Inv19], a common type of stock chart.

+---------------------+--------+----------+----------+----------+----------+
| ts                  | symbol | min_pr   | max_pr   | first    | last     |
+---------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+---------------------+--------+----------+----------+----------+----------+

The query text, while understandable, is challenging to write because it uses a common table expression (CTE), window functions with a non-trivial window definition, a subtle use of ranking to pick one row per group, and a non-obvious divide/multiply trick to group time to a 60*60 second bucket.

New Time-Series Functions in MemSQL 7.0

Here I’ll introduce the new time series functions, and then show an example where we write an equivalent query to the “candlestick” query above using the new functions. I think you’ll be impressed by how concise it is!

Also see the latest documentation for analyzing time series data and for the new time series functions.

FIRST()

The FIRST() function is an aggregate function that takes two arguments, as follows:

FIRST (value[, time]);

Given a set of input rows, it returns the value for the smallest associated time.

The second argument is optional. If it is not specified, it is implicitly the SERIES TIMESTAMP column of the table being queried. It’s an error if there is no SERIES TIMESTAMP available, or if there is more than one available in the context of the query where FIRST is used; in that case, you should specify the time explicitly.

For example, this query gives the symbol of the first stock traded among all stocks in the tick table:

SELECT first(symbol) FROM tick;

The result is ABC, which you can see is the first one traded at 10:55:36.179760 in the rows inserted above.

LAST()

LAST is just like FIRST except it gives the value associated with the latest time.
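
For example, this query gives the symbol of the last stock traded among all stocks in the tick table:

SELECT last(symbol) FROM tick;

The result is XYZ, which was traded at 11:03:59.179769, the latest timestamp in the rows inserted above.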

TIME_BUCKET()

TIME_BUCKET takes a time value and buckets it to a specified width. You can use very brief descriptions of bucket width, like ‘1d’ for one day, ‘5m’ for five minutes, and so on. The function takes these arguments:

TIME_BUCKET (bucket_width [, time [,origin]])

The only required argument is bucket_width. As with FIRST and LAST, the time argument is inferred to be the SERIES TIMESTAMP if it is not specified. The origin argument is used if you want your buckets to start at a non-standard boundary – say, if you want day buckets that begin at 8am every day.
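
For example, this query counts the trades in each one-hour bucket; since ts is the SERIES TIMESTAMP, it doesn’t need to be named:

SELECT time_bucket('1h') AS hour, COUNT(*) AS trades
FROM tick
GROUP BY 1
ORDER BY 1;

With the sample data inserted above, the rows fall into the 10:00 and 11:00 buckets.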

Putting It All Together

Now that we’ve seen FIRST, LAST, TIME_BUCKET, and SERIES TIMESTAMP, let’s see how to use all of them to write the candlestick chart query from above. A new version of the same query is simply:

SELECT time_bucket('1h') as ts, symbol, min(price) as min_pr,
    max(price) as max_pr, first(price) as first, last(price) as last
FROM tick
group by 2, 1
order by 2, 1;

The new version of the query produces this output, which is essentially the same as the output of the original query.

+----------------------------+--------+----------+----------+----------+----------+
| ts                         | symbol | min_pr   | max_pr   | first    | last     |
+----------------------------+--------+----------+----------+----------+----------+
| 2020-02-18 10:00:00.000000 | ABC    | 100.0000 | 102.5000 | 100.0000 | 102.5000 |
| 2020-02-18 11:00:00.000000 | ABC    | 102.0000 | 103.0000 | 102.0000 | 102.6000 |
| 2020-02-18 11:00:00.000000 | XYZ    | 102.5000 | 103.0000 | 103.0000 | 102.5000 |
+----------------------------+--------+----------+----------+----------+----------+

Look how short this query is! It is 5 lines long vs. 18 lines for the previous version. Moreover, it doesn’t use window functions or CTEs, nor require the divide/multiply trick to bucket time. It just uses standard aggregate functions and scalar functions.

Conclusion

MemSQL 7.0 makes it much simpler to specify many time-series queries using special functions and the SERIES TIMESTAMP column designation. For a realistic example, we reduced lines of code by more than three-fold, and eliminated the need to use some more advanced SQL concepts.

Given the high performance, unlimited scalability, and full SQL support of MemSQL, it was a strong platform for time series data in earlier releases. Now, in MemSQL 7.0, we’ve taken that power and added greater simplicity with these new built-in capabilities. How can you apply MemSQL 7.0 to your time-oriented data?

References

[Han19] Eric Hanson, What MemSQL Can Do For Time Series Applications, https://www.memsql.com/blog/what-memsql-can-do-for-time-series-applications/, March 2019.

[Inv19] Understanding Basic Candlestick Charts, Investopedia, https://www.investopedia.com/trading/candlestick-charting-what-is-it/, 2019.

[Kus19] Summarize By Scalar Values, Azure Data Explorer Documentation, https://docs.microsoft.com/en-us/azure/kusto/query/tutorial#summarize-by-scalar-values, 2019.

[Mem19a] FIRST, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/first/, 2019.

[Mem19b] LAST, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/last/, 2019.

[Mem19c] TIME_BUCKET, MemSQL Documentation, https://docs.memsql.com/v7.0-beta/reference/sql-reference/time-series-functions/time_bucket/, 2019.

[Mem19d] CREATE TABLE Topic, SERIES TIMESTAMP, https://docs.memsql.com/v7.0-beta/reference/sql-reference/data-definition-language-ddl/create-table/, 2019.

[Mil14] James Miller, Splunk Bucketing, Mastering Splunk, O’Reilly, https://www.oreilly.com/library/view/mastering-splunk/9781782173830/ch03s02.html, 2014.

MemSQL 7.0 Now Generally Available on MemSQL Helios

Feed: MemSQL Blog.
Author: Peter Guagenti.

MemSQL’s breakthrough new “SingleStore” data management, enhanced system of record capabilities, time series enhancements, and more, are now available instantly and on demand with MemSQL Helios. You can try MemSQL 7.0 and Helios for free, instantly.

MemSQL is proud to announce the general availability of MemSQL 7.0, available first on MemSQL Helios, the company’s elastic cloud database available on public cloud providers around the globe. Fully managed and available on demand, MemSQL Helios delivers instant, effortless access to the world’s fastest, most scalable data platform for operational analytics, machine learning and AI. MemSQL 7.0 delivers new, fast resilience features; the first iteration of MemSQL SingleStore, delivering table type convergence; new time series features; and other new features, described below.

With the public availability of MemSQL 7.0 – available now on MemSQL Helios and for download on December 10th – MemSQL further cements itself as a powerful fit for a company’s innovative and most critical operational workloads. MemSQL Helios delivers enhanced ease of use and reduced management complexity, lower total cost of ownership (TCO) compared to both on-premises and cloud provider offerings, and the flexibility to run your data workloads in multiple cloud providers and hybrid deployments.

Also see the AWS-MemSQL press release for MemSQL Helios and the MemSQL 7.0 press release.

What’s New in MemSQL 7.0

We previously described two key sets of features in MemSQL 7.0:

  • Resilience features. Resilience features allow database content to survive server and software failures. Resilience features in MemSQL 7.0 include much faster synchronous replication and synchronous durability, and incremental backup, which captures only recently changed data.
  • MemSQL SingleStore features. MemSQL SingleStore will eventually see data stored in a single table type whose size is not limited by the available memory, yet still provides the highest level of performance for both transactional and analytical operations. MemSQL 7.0 features include rowstore table data compression and fast seeks in columnstore tables.
MemSQL 7.0 pairs synchronous replication (sync rep) and synchronous durability to provide strong resilience options.
Fast sync replication and sync durability in MemSQL 7.0 make it possible
to run in high availability mode with only a small performance hit.

MemSQL 7.0 also includes a wide range of additional features. Highlights include:

  • Improvements in MemSQL Studio. MemSQL Studio features available now include logical monitoring of nodes; physical monitoring of cluster resource usage; the ability to find, understand, and kill long-running queries; and the ability to separate out SQL Editor results into tabs, then export them into a CSV file for analysis.
  • Improvements in MemSQL Tools. MemSQL tools enable installation and control of self-managed MemSQL clusters, on-premises and in the cloud. MemSQL Tools include an improved cluster setup utility and easier migration from the original MemSQL Ops to the new tools. Tools also perform upgrades from MemSQL 6.7 and MemSQL 6.8 to MemSQL 7.0.
  • Automatic gathering of MemSQL statistics. All types of statistics, including range and cardinality, are now automatically gathered, on all types of tables. This is a big ease-of-use improvement and dovetails with the benefits of MemSQL Helios and the move to MemSQL SingleStore.
  • Time Series functions. MemSQL 7.0 includes the new FIRST, LAST, and TIME_BUCKET functions, and the ability to designate a SERIES TIMESTAMP column in metadata, greatly simplifying code for time series processing with MemSQL.
  • Additional improvements. These include improvements to query execution, query optimization, data storage, data loading, and data backup, as well as cross-database views.

MemSQL Helios and MemSQL 7.0 continue to include robust free options. MemSQL Helios has a free 8-hour trial; the self-managed MemSQL software can be used indefinitely for free, within limits on the number of nodes used, and with community support only.

Visual Explain shows query processing in MemSQL Studio.
The updated Visual Explain feature in MemSQL Studio.

Benefits of MemSQL Helios

With the initial release of MemSQL 7.0 occurring through MemSQL Helios, on AWS and other cloud platforms, the combined advantages of MemSQL Helios and MemSQL 7.0 become available on AWS.

MemSQL Helios advantages include:

  • Effortless deployment and elastic scale. With MemSQL Helios, you get the full capabilities of one-click deployment and easy cloud scalability. Users skip all the initial stages that are usually required for database availability: hardware procurement; racking servers; operating system and library installation and testing; database software deployment and configuration; and management of VMs or containers and their uptime. With MemSQL Helios, customers simply choose the number of nodes they wish to run. Helios starts them up and keeps them running. It handles data backup and restore, and managing and maintaining the software, using the cloud-native MemSQL Kubernetes Operator and the MemSQL Kubernetes stack.
  • Enhanced ease of use and reduced TCO. Automatic setup and automated resizing of MemSQL clusters greatly increases ease of use for customers. (Resizing is handled on request, and will be further automated in future versions of MemSQL Helios.) In Helios, MemSQL handles much of the work, and many of the issues, that were formerly left to customer DevOps staff – either alone, or in cooperation with MemSQL Support. Staff are freed up as MemSQL Helios helps eliminate tedious, time-consuming tasks. As a cloud resource, MemSQL Helios ensures that the organization pays only for what it needs. (These aspects of MemSQL Helios are greatly enhanced by the new features in MemSQL 7.0, below.)
  • Greatly increased flexibility. MemSQL combines the capabilities of OLTP databases and online analytical processing (OLAP) databases into a single database. The same database manages streaming data on ingest and processing, and replaces complex extract, transform, and load (ETL) processes with MemSQL Pipelines. It also supports both high query concurrency and high query complexity. This eliminates data sprawl – the need for multiple copies of the same data – and reduces the number of data tools you need to support a given application. You get less complexity and up-to-date data for your applications.

A converged database, managed by MemSQL as the provider, with SQL support and full scalability – that is, MemSQL Helios, powered by MemSQL 7.0 – offers a breakthrough capability to power rapid growth for organizations large and small.

Next Steps

MemSQL Helios, powered by MemSQL 7.0, is available now. You can get started instantly with MemSQL today for free or contact Sales.

The Story Behind MemSQL’s Skiplist Indexes

Feed: MemSQL Blog.
Author: Adam Prout.

This blog post was originally published in January 2014, and it has long been the first blog post on the MemSQL blog – and one of the best. In it, MemSQL co-founding engineer Adam Prout explains one of the key technical features that distinguish MemSQL: its use of skiplist indexes rather than Btrees and similar structures. Adam has now revised and updated the post to cover the recently released MemSQL 7.0 and MemSQL SingleStore™.

The most popular data structure used for indexing in relational databases is the Btree (or its variant, the B+tree). Btrees rose to popularity because they do fewer disk I/O operations to run a lookup compared to other balanced trees. To the best of my knowledge, MemSQL is the first commercial relational database in production today to use a skiplist, not a Btree, as its primary index backing data structure for in-memory data.

MemSQL, founded in 2011, began as an in-memory, rowstore database. MemSQL’s storage design evolved, in the years following, to support on-disk data in columnstore format. We then added more intelligence around when rows are stored in memory vs. on disk, and in rowstore or columnstore format, with the SingleStore project. Through all this, the skiplist has remained the index of choice for in-memory rowstore data.

A lot of research and prototyping went into the decision to use a skiplist. I hope to provide some of the rationale for this choice, and to demonstrate the power and relative simplicity of MemSQL’s skiplist implementation. As a very basic demonstration, I’ll show some very simple single-threaded table scans that run more than eight times faster on MemSQL than on MySQL. (MemSQL performs even better than this on more aggressive and complex workloads.) This article sticks to high-level design choices and leaves most of the nitty-gritty implementation details for other posts.

What is a Skiplist?

Btrees made a lot of sense for databases when the data lived most of its life on disk and was only pulled into memory and cached as needed to run queries. On the other hand, Btrees do extra work to reduce disk I/O that is needless overhead if your data fits into memory. As memory sizes have increased, it is feasible today to support indexes that only function well for in-memory data, and are freed from the constraints of having to index data on disk. A skiplist is well suited for this type of indexing.

Skiplists are a relatively recent invention. The seminal skiplist paper was published in 1990 by William Pugh: Skip Lists: A Probabilistic Alternative to Balanced Trees. This makes the skiplist about 20 years younger than the Btree, which was first proposed in the 1970s.

A skiplist is an ordered data structure providing expected O(log(n)) lookup, insertion, and deletion complexity. It provides this level of efficiency without the need for complex tree balancing or page splitting like that required by Btrees, red-black trees, or AVL trees. As a result, it’s a much simpler and more concise data structure to implement.

Lock-free skiplist implementations have recently been developed; see this paper, Lock-Free Linked Lists and Skiplists, published in 2004 by Mikhail Fomitchev and Eric Ruppert. These implementations provide thread safety with better parallelism under a concurrent read/write workload than thread-safe balanced trees that require locking. I won’t dig into the details of how to implement a lock-free skiplist here, but to get an idea of how it might be done, see this blog post about common pitfalls in writing lock-free algorithms.

Diagram: the structure of a skiplist.

A skiplist is made up of elements attached to towers. Each tower in a skiplist is linked, at each of its levels, to the next tower that reaches that level, forming a group of linked lists, one per level of the skiplist. When an element is inserted into the skiplist, its tower height is determined randomly via successive coin flips (a tower of height n occurs roughly once in every 2^n insertions).

Once its height has been determined, the element is linked into the linked list at each level of its tower. The towers support a binary-search-like lookup: start at the highest level and work towards the bottom, using the tower links to decide when to move forward in the list and when to move down the tower to a lower level.
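
To make this concrete, here is a minimal, single-threaded sketch of a skiplist node, the coin-flip height selection, and the top-down search just described. All names here (Node, kMaxHeight, and so on) are illustrative assumptions for this post, not MemSQL’s actual implementation, which is lock-free and stores table rows rather than bare integer keys:

#include <cstdlib>

constexpr int kMaxHeight = 32;

struct Node {
    int key;
    int height;              // number of levels this tower is linked into
    Node* next[kMaxHeight];  // next[l]: the following tower linked at level l
};

// Tower height via successive coin flips: a tower of height n appears
// roughly once per 2^n insertions.
int RandomHeight() {
    int h = 1;
    while (h < kMaxHeight && (std::rand() & 1)) ++h;
    return h;
}

// Search: start at the top of the head sentinel (assumed linked at every
// level), move forward while the next key is still too small, and drop
// down a level otherwise.
Node* Find(Node* head, int key) {
    Node* cur = head;
    for (int l = kMaxHeight - 1; l >= 0; --l)
        while (cur->next[l] != nullptr && cur->next[l]->key < key)
            cur = cur->next[l];
    Node* hit = cur->next[0];
    return (hit != nullptr && hit->key == key) ? hit : nullptr;
}

Note that the search does no rebalancing bookkeeping at all; the random tower heights alone keep expected lookups at O(log(n)).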

Why a Skiplist Index for MemSQL

There are many reasons why skiplists are the best fit for MemSQL. Primarily: they’re memory-optimized, simple (requiring many fewer lines of code to implement), much easier to implement in a lock-free fashion, fast, and flexible.

1) Memory-Optimized

MemSQL supports both in-memory rowstore and on-disk columnstore table storage. The rowstore is designed for fast, high-throughput access to data stored in memory. The columnstore is designed for scanning and aggregating large amounts of data quickly. The columnstore is disk-backed but keeps recently written data in-memory in rowstore layout before flushing it to disk in columnstore layout.

The columnstore also stores all metadata about files in memory, in internal rowstore metadata tables – i.e., metadata such as the maximum and minimum values of each column, a bitmap of deleted rows, etc. This data is needed in memory so that query execution can quickly and easily eliminate entire files when evaluating a filter. See our documentation for more information about our columnstore.

These two storage types are being converged by the MemSQL SingleStore project to create a storage design with most of the benefits of both table types, while eliminating the need to think about the details of storage layouts when designing and managing an application. Thus, both of MemSQL’s table types have a need for a memory-optimized rowstore index, as does our future ideal design of a SingleStore table type. (You can see a deep dive on SingleStore from Eric Hanson, and some useful information about our current and future implementation of SingleStore in Rick Negrin’s webinar.)

Being memory-optimized means indexes are free to use pointers to rows directly, without the need for indirection. In a traditional database, rows need to be addressable by some other means than a pointer to memory, as their primary storage location is on disk. This indirection usually takes the form of a cache of memory-resident pages (often called a buffer pool) that is consulted in order to find a particular row’s in-memory address, or to read it into memory from disk if needed.

This indirection is expensive and usually done at the page level (e.g., 8K at a time in SQL Server). MemSQL doesn’t have to worry about this overhead. This makes data structures that refer to rows arbitrarily by pointer, like a skiplist does, feasible. Dereferencing a pointer is much less expensive than looking up a page in the buffer pool.

2) Simple

MemSQL’s skiplist implementation is about 1500 lines of code, including comments. Having recently spent some time in both SQL Server’s and InnoDB’s Btree implementations, I can tell you they are both close to 50 times larger in terms of lines of code, and both have many more moving parts. For example, a Btree has to deal with page splitting and page compaction, while a skiplist has no equivalent operations. The first generally available build of MemSQL took a little over a year to build and stabilize. This feat wouldn’t have been possible with a more complex indexing data structure.

3) Lock-Free

A lock-free or non-blocking algorithm is one in which some thread is always able to make progress, no matter how all the threads’ executions are interleaved by the OS. MemSQL is designed to support highly concurrent workloads running on hardware with many cores. These goals make lock-free algorithms desirable for MemSQL. (See our original blog post on lock-free algorithms and our description of sync replication, by Nate Horan.)

The algorithms for writing a thread-safe, lock-free skiplist are now a solved problem in academia; a number of papers have been published on the subject in the past decade. It’s much harder to make a lock-free skiplist perform well when there is low contention (e.g., a single thread iterating over the entire skiplist, with no other concurrent operations executing). Optimizing this case is a more active area of research. Our approach to solving this particular problem is a topic for another time.
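
To give a flavor of the building block involved, below is a heavily simplified, insert-only sketch of a lock-free insert into one sorted level of the list using a compare-and-swap – the operation a lock-free skiplist applies at each level of a new tower. This is a toy under assumed names; production implementations must also handle deletion marking, safe memory reclamation, and the ABA problem, all omitted here:

#include <atomic>

struct LFNode {
    int key;
    std::atomic<LFNode*> next{nullptr};
};

// Lock-free insert into one sorted level. If another thread links a node
// into the same spot first, the CAS fails and we simply retry the search;
// some thread always makes progress, so no thread ever blocks.
void Insert(LFNode* head, LFNode* node) {
    for (;;) {
        LFNode* pred = head;  // head: sentinel node at the front of the list
        LFNode* succ = pred->next.load();
        while (succ != nullptr && succ->key < node->key) {
            pred = succ;
            succ = pred->next.load();
        }
        node->next.store(succ);
        if (pred->next.compare_exchange_weak(succ, node))
            return;  // published atomically; visible to all threads
    }
}

The key property: the CAS both validates that the neighborhood hasn’t changed and publishes the new node in a single step, so a lost race costs only a retry, never a blocked thread.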

Btrees, on the other hand, have historically needed to use a complex locking scheme to achieve thread safety. Some newer lock-free, Btree-like data structures, such as the Bw-tree, have recently been proposed to avoid this problem. Again, the complexity of the Bw-tree far outpaces that of a skiplist or even a traditional Btree. (The Bw-tree requires more complex compaction algorithms than a Btree, and depends on a log-structured storage system to persist its pages.) The simplicity of the skiplist is what makes it well suited for a lock-free implementation.

4) Fast

The speed of a skiplist comes mostly from its simplicity. MemSQL executes fewer instructions to insert, delete, search, or iterate compared to other databases.

5) Flexible

Skiplists also support some extra operations that are useful for query processing and that aren’t readily implementable in a balanced tree.

For example, a skiplist is able to estimate the number of elements between two elements in the list in logarithmic time. The general idea is to use the towers to estimate how many rows lie between two elements linked together at the same level. If we know the level at which the nodes are linked, we can estimate how many elements are expected to lie between them, because we know the expected distribution of towers at that height.
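
A sketch of that idea, reusing the toy Node from earlier: walk from one element toward the other, always taking the highest link that doesn’t overshoot the target, and credit each step taken at level l with the roughly 2^l elements such a link is expected to skip. This is an illustrative approximation, not MemSQL’s actual estimator:

#include <cstddef>

// Estimate how many elements lie between `from` and `to`, assuming
// from->key < to->key and both are in the list. Each link at level l
// skips an expected 2^l elements, so weight each step by its level.
size_t EstimateDistance(Node* from, Node* to) {
    size_t estimate = 0;
    Node* cur = from;
    while (cur != nullptr && cur->key < to->key) {
        // Take the highest link on this tower that exists and does not
        // jump past `to`.
        int l = cur->height - 1;
        while (l > 0 &&
               (cur->next[l] == nullptr || cur->next[l]->key > to->key))
            --l;
        estimate += static_cast<size_t>(1) << l;  // a level-l link ~ 2^l rows
        cur = cur->next[l];
    }
    return estimate;
}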

Knowing how many elements are expected to be in an arbitrary range of the list is very useful for query optimization, when calculating how selective different filters in a select statement are. Traditional databases need to build separate histograms to support this type of estimation in the query optimizer.

Addressing Some Common Concerns About Skiplists

MemSQL has addressed several common concerns about skiplists: memory overhead, CPU cache efficiency, and reverse iteration.

Memory Overhead

The best-known disadvantage of a skiplist is its memory overhead. The skiplist towers require storing a pointer at each level of each tower. Because a tower of height n occurs once in every 2^n insertions, the expected tower height is the sum over n of n/2^n, which equals 2. This means, on average, each element carries two 8-byte pointers – 16 bytes of overhead for the skiplist towers.

The significance of this overhead depends on the size of the elements being stored in the list. In MemSQL, the elements stored are rows of some user’s table. The average row size in a relational database tends to be hundreds of bytes in size, dwarfing the skiplist’s memory overhead.

Btrees have their own memory overhead issues that make them hard to compare directly to skiplists. After a Btree does a page split, both split pages are usually only 50% full. (Some databases have other heuristics, but the result of a split is pages with empty space on them.)

Depending on a workload’s write patterns, Btrees can end up with fragmented pages all over the tree, due to this splitting. Compaction algorithms to reclaim this wasted space are required, but they often need to be triggered manually by the user. A fully compacted Btree, however, will be more memory-efficient than a skiplist.

Another way MemSQL is able to improve memory use compared to a traditional database is in how it implements secondary indexes. Secondary indexes need only contain pointers to the primary key row. There is no need to duplicate the data in key columns like secondary Btree indexes do.

CPU Cache Efficiency

Skiplists do not provide very good memory locality because traversing pointers during a search results in execution jumping somewhat randomly around memory. The impact of this effect is very workload-specific and hard to accurately measure.

For most queries, the cost of executing the rest of the query (sorting, executing expressions, protocol overhead to return the query’s result) tends to dominate the cost of traversing the tower pointers during a search. The memory locality problem can also be mostly overcome by using prefetch instructions (_mm_prefetch on Intel processors). The skiplist towers can be used to read ahead of a table scan operation and load rows into CPU caches, so they can be quickly accessed by the scan when it arrives.
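
Here is a sketch of how such read-ahead might look during a bottom-level scan, again with the toy Node. _mm_prefetch is the real SSE intrinsic, but the look-ahead distance and the ProcessRow helper are illustrative assumptions:

#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

void ProcessRow(Node* row);  // hypothetical per-row work, defined elsewhere

// Scan the bottom level, hinting the CPU to pull the node two links ahead
// into cache so it is already resident by the time the scan reaches it.
void ScanWithPrefetch(Node* head) {
    for (Node* cur = head->next[0]; cur != nullptr; cur = cur->next[0]) {
        Node* ahead = cur->next[0];
        if (ahead != nullptr && ahead->next[0] != nullptr)
            _mm_prefetch(reinterpret_cast<const char*>(ahead->next[0]),
                         _MM_HINT_T0);
        ProcessRow(cur);
    }
}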

Reverse Iteration

Most skiplist implementations use backwards pointers (double linking of list elements) to support iterating backwards in the list. The backwards pointers add extra memory overhead and extra implementation complexity; lock-free, doubly-linked lists are difficult to implement.

MemSQL’s skiplist employs a novel reverse iterator that uses the towers to iterate backwards without the need for reverse links. The idea is to track the last tower link visited at each level of the skiplist while seeking to the end of the skiplist, or to a particular node. These links can then be used to find the element preceding the current one: each time the reverse iterator moves backwards, it refreshes the tracked link at each level it touches.
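
A single-threaded sketch of the trick, reusing the toy Node (MemSQL’s lock-free version is considerably more involved). Assume an initial seek has filled pred[l] with the last tower strictly before the current element at each level; stepping backwards then reuses and refreshes those links:

struct ReverseIterator {
    Node* head;              // full-height sentinel at the front of the list
    Node* pred[kMaxHeight];  // last tower strictly before `cur`, per level
    Node* cur;               // set by the initial seek (not shown)

    // Step to the previous element using only forward links. The caller
    // must ensure a predecessor exists (i.e., pred[0] is not the sentinel).
    void Prev() {
        Node* p = pred[0];   // level-0 predecessor is the previous element
        // Refresh pred[] for every level of p's tower by walking forward,
        // at each level, from a tower known to be strictly before p.
        // (Levels above p's tower already hold nodes strictly before p.)
        for (int l = p->height - 1; l >= 0; --l) {
            Node* n = (l + 1 < p->height)
                          ? pred[l + 1]
                          : (p->height < kMaxHeight ? pred[p->height] : head);
            while (n->next[l] != p) n = n->next[l];
            pred[l] = n;
        }
        cur = p;
    }
};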

This iterator saves the memory required for backwards pointers, but does result in a higher reverse-iteration execution cost. Reverse iteration is important for a SQL database because it allows ORDER BY queries to run without sorting, even if the ORDER BY wants the opposite sort order (i.e., ascending instead of descending) of the order the index provides.

Quick Performance Comparison

Performance benchmarking of databases and data structures is very difficult. I’m not going to provide a comprehensive benchmark here. Instead, I’ll show a very simple demonstration of our skiplist’s single-threaded scan performance compared to InnoDB’s Btree.

I’m going to run SELECT SUM(score) FROM users over a 50 million-row users table. The test is set up to, if anything, favor MySQL. There is no concurrent write workload in this demonstration (which is where MemSQL really shines), and MemSQL is running with query parallelism disabled; both MySQL and MemSQL are scanning using only a single thread. InnoDB is running with a big enough buffer pool to fit the entire table in memory, so there is no disk I/O going on.

CREATE TABLE `users` (
  `user_id` bigint(20) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(100) CHARACTER SET utf8,
  `last_name` varchar(100) CHARACTER SET utf8,
  `install_date` datetime,
  `comment` varchar(500),
  `score` bigint(20),
  PRIMARY KEY (`user_id`)
);

Chart: MemSQL skiplist vs. MySQL Btree single-threaded scan times.

MemSQL’s single-threaded scan performance is 5 times faster in the first case and 8 times faster in the second case. There is no black magic involved here. MemSQL needs to run far fewer CPU instructions to read a row, for the reasons discussed above. MemSQL’s advantages really come to the fore when there are concurrent writes and many concurrent users.

Conclusion

MemSQL takes advantage of the simplicity and performance of lock-free skiplists for in-memory rowstore indexing. These indexes are used to back rowstore tables, to buffer rows in memory for columnstore tables, and to store metadata about columnstore blob files on disk. The result is a more modern indexing design, based on recent developments in data structure and lock-free/non-blocking algorithms research. The simplicity of the skiplist is the source of a lot of MemSQL’s speed and scalability.
