
Webinar: Time Series Data Capture & Analysis in MemSQL 7.0


Feed: MemSQL Blog.
Author: Floyd Smith.

With the MemSQL 7.0 release, MemSQL has added more special-purpose features, making it even easier to manage time series data within our best-of-breed operational database. These new features allow you to structure queries on time series data with far fewer lines of code and with less complexity. With time series features in MemSQL, we make it easier for any SQL user, or any tool that uses SQL, to work with time series data, while making expert users even more productive. In a recent webinar (view the recording here), Eric Hanson described the new features and how to use them.

The webinar begins with an overview of MemSQL, then describes how customers have been using MemSQL for time series data for years, prior to the MemSQL 7.0 release. Then there’s a description of the time series features that MemSQL has added, making it easier to query and manage time series data, and a Q&A section at the end.

Introducing MemSQL

MemSQL is a very high-performance scalable SQL relational database system. It’s really good for scalable operations, both for transaction processing and analytics on tabular data. Typically, it can be as much as 10 times faster, and three times more cost-effective, than legacy database providers for large volumes under high concurrency.

We like to call MemSQL the No-Limits Database because of its amazing scalability. It’s the cloud-native operational database that’s built for speed and scale. We have capabilities to support operational analytics. So, operational analytics is when you have to deliver very high analytical performance in an operational database environment where you may have concurrent updates and queries running at an intensive, demanding level. Some people like to say that it’s when you need “Analytics with an SLA.”

Now, I know that everybody thinks they have an SLA when they have an analytical database, but when you have a really demanding SLA like requiring interactive, very consistent response time in an analytical database environment, under fast ingest, and with high concurrency, that’s when MemSQL really shines.

We also support predictive ML and AI capabilities. For example, we’ve got some built-in functions for vector similarity matching. Some of our customers are using MemSQL in a deep learning environment to do things like face and image matching, and customers are prototyping deep learning applications like fuzzy text matching. The built-in dot product and Euclidean distance functions we have can help you make those applications run with very high performance. (Nonprofit Thorn is one organization that uses these ML and AI-related capabilities at the core of their app, Spotlight, which helps law enforcement identify trafficked children. – Ed.)

Also, people are using MemSQL when they need to move to cloud or replace legacy relational database systems. When they reach some sort of inflection point, like they know they need to move to cloud, they want to take advantage of the scalability of the cloud, they want to consider a truly scalable product, and so they’ll look at MemSQL. Also, when it comes time to re-architect the legacy application – if, say, the scale of data has grown tremendously, or is expected to change in the near future, people really may decide they need to find a more scalable and economical platform for their relational data, and that may prompt them to move to MemSQL.

Here are examples of the kinds of workloads and customers we support: half of the top 10 banks in North America, two of the top three telecommunications companies in North America, over 160 million streaming media users, 12 of the Fortune 50 largest companies in the United States, and technology leaders from Akamai to Uber.

If you want to think about MemSQL and how it’s different from other database products, you can think of it as a very modern, high-performance, scalable SQL relational database. We have all three: speed, scale, and SQL. We get our speed because we compile queries to machine code. We also have in-memory data structures for operational applications, an in-memory rowstore structure, and a disk-based columnstore structure.

MemSQL is the No-Limits Database

We compile queries to machine code and we use vectorized query execution on our columnar data structure. That gives us tremendous speed on a per-core basis. We’re also extremely scalable. We’re built for the cloud. MemSQL is a cloud-native platform that can gang together multiple computers to handle the work for a single database, in a very elegant and high-performance fashion. There’s no real practical limit to scale when using MemSQL.

Finally, we support SQL. There are some very scalable database products out there in the NoSQL world that are fast for certain operations, like put and get-type operations that can scale. But if you try to use these for sophisticated query processing, you end up having to host a lot of the query processing logic in the application, even to do simple things like joins. It can make your application large and complex and brittle – hard to evolve.

So SQL, the relational data model, was invented by E. F. Codd back around 1970 for a reason: to separate your query logic from the physical data structures in your database, and to provide a non-procedural query language that makes it easier to find the data that you want from your data set. The benefits that were put forth when the relational model was invented are still true today.

We’re firmly committed to relational database processing and non-procedural query languages with SQL. There are tremendous benefits to that, and you can have the best of both. You can have speed, and you can have scale, along with SQL. That’s what we provide.

How does MemSQL fit into the rest of your data management environment? MemSQL provides tremendous support for analytics and application systems like dashboards, ad-hoc queries, and machine learning, as well as other types of applications like real-time decision-making apps, Internet of Things apps, and dynamic user experiences. The kind of database technology that was available before couldn’t provide the real-time analytics necessary to give the truly dynamic user experience people are looking for today; we can provide that.

MemSQL architectural chart CDC and data types

We also provide tremendous capabilities for fast ingest and change data capture (CDC). We have the ability to stream data into MemSQL from multiple sources like file systems and Kafka. We have a feature called Pipelines, which is very popular, to automatically load data from file folders, AWS S3, or Kafka. You can transform data as it’s flowing into MemSQL, with very little coding. We also support a very high-performance, scalable bulk load system.
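As a rough sketch of what a Pipeline definition looks like – the broker address, topic name, and target table below are hypothetical, made up for illustration:

    CREATE PIPELINE tick_pipeline AS                   -- hypothetical pipeline name
    LOAD DATA KAFKA 'kafka-broker:9092/stock-ticks'    -- hypothetical broker and topic
    INTO TABLE tick                                    -- hypothetical target table
    FIELDS TERMINATED BY ',';                          -- parse incoming rows as CSV

    START PIPELINE tick_pipeline;                      -- begin continuous ingest

Once started, the pipeline keeps pulling batches from the topic and loading them into the table, without an external ETL scheduler.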

We have support for a large variety of data types, including relational data, standard structured data types, key-value, JSON, geospatial, time-oriented data, and more. We run everywhere. You can run MemSQL on-premises, you can run it in the cloud as a managed database platform, or as a service in our new Helios system, which was just delivered in September.

We also allow people to self-host in the cloud. If they want full control over how their system is managed, they can self-host on all the major cloud providers and also run in containers; so, wherever you need to run, we are available.

I mentioned scalability earlier, and I wanted to drill into that a little bit to illustrate how our platform is organized. To the database client application, MemSQL looks like just a database. You have a connection string, you connect, you set your connection to use us as a database, and you can start submitting SQL statements. It’s a single system image. The application doesn’t really know that MemSQL is distributed – but, under the covers, it’s organized as you see in this diagram.

MemSQL node and leaf architecture

There are one or more aggregator nodes, which are front-end nodes that the client application connects to. Then, there can be multiple back-end nodes. We call them leaf nodes. The data is horizontally partitioned across the leaf nodes – some people call this sharding. Each leaf node has one or more partitions of data. Those partitions are defined based on some data definition language (DDL); when you create your table, you define how to shard the data across nodes.

MemSQL’s query processor knows how to take a SQL statement and divide it up into smaller units of work across the leaf nodes, and final assembly of the results is done by the aggregator node. Then, the results are sent back to the client. As you need to scale, you can add additional leaf nodes and rebalance your data, so that it’s easy to scale the system up and down as needed.
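For example, a table definition along these lines (the table and column names are made up for illustration) tells MemSQL how to distribute rows across the leaf nodes:

    CREATE TABLE readings (
      meter_id BIGINT NOT NULL,
      ts DATETIME(6) NOT NULL,
      reading DOUBLE,
      SHARD KEY (meter_id)   -- rows are hash-partitioned across leaf nodes by meter_id
    );

Queries that filter on the shard key can be routed to a single partition; other queries fan out across all the leaf nodes and are assembled by the aggregator.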

How Customers Have Used MemSQL for Time Series Data

So with that background on MemSQL, let’s talk about using MemSQL for time series data. First of all, for those of you who are not really familiar with time series, a time series is simply a time-ordered sequence of events of some kind. Typically, each time series entry has, at least, a time value and some sort of data value that’s taken at that time. Here’s an example time series of the price of a stock over a period of an hour and a half or so.

MemSQL time series stock prices

You can see that the data moves up and down as you advance in time. Typically, data at any point in time is closely correlated to the immediately previous point in time. Here’s another example, of flow rate. People are using MemSQL for energy production, for example, in utilities. They may be storing and managing data representing flow rates. Here’s another example, a long-term time series of some health-oriented data from the US government, from the Centers for Disease Control, about chronic kidney disease over time.

These are just three examples of time series data. Virtually every application that’s collecting business events of any kind has a time element to it. In some sense, almost all applications have a time series aspect to them.

Let’s talk about time series database use cases. It’s necessary, when you’re managing time-oriented data, to store new time series events or entries, to retrieve the data, to modify time series data – to delete or append or truncate the data, or in some cases, you may even update the data to correct an error. Or you may be doing some sort of updating operation where you are, say, accumulating data for a minute or so. Then, once the data has sort of solidified or been finalized, you will no longer update it. There are many different modification scenarios for time series data.

Another common operation on time series data is to do things like convert an irregular time series to a regular time series. For example, data may arrive with a random sort of arrival process, and the spacing between events may not be equal, but you may want to convert that to a regular time series. Like maybe data arrives every 1 to 10 seconds, kind of at random. You may want to create a time series which has exactly 1 data point every 15 seconds. That’s an example of converting from an irregular to a regular time series.

MemSQL time series use cases

Another kind of operation on time series is to downsample. That means you may have a time series with one tick every second, and you want one tick every minute. That’s downsampling. Another common operation is smoothing. You may have some simple smoothing capability, like a five-second moving average of a time series, where you average together the previous five seconds’ worth of data from the series, or a more complex kind of smoothing – say, where you fit a curve through the data to smooth it, such as a spline curve. There are many, many more kinds of time series use cases.
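For example, a simple moving average can be written with a standard SQL window function. This is a sketch assuming the tick table (ts, symbol, price) introduced later in this webinar:

    SELECT symbol, ts, price,
           AVG(price) OVER (
             PARTITION BY symbol
             ORDER BY ts
             ROWS BETWEEN 4 PRECEDING AND CURRENT ROW  -- this row plus the four before it
           ) AS moving_avg_price
    FROM tick
    ORDER BY symbol, ts;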

A little history about how MemSQL has been used for time series is important to give, for context. Customers already use MemSQL for time series event data extensively, using our previously shipped releases, before the recent shipment of MemSQL 7.0 and its time series-specific features. Lots of our customers store business events with some sort of time element. We have quite a few customers in the financial sector that are storing financial transactions in MemSQL. Of course, each of these has a time element to it, recording when the transaction occurred.

MemSQL Time series plusses

Also, lots of our customers have been using us for Internet of Things (IoT) events – for example, in utilities, in energy production, in media and communications, and in web and application development, such as advertising applications. As I mentioned before, MemSQL is really tremendous for fast and easy streaming. With our Pipelines capability, it’s fast and easy to load data, and we have very high-performance INSERT data manipulation language (DML). You can do millions of inserts per second on a MemSQL cluster.

We have a columnstore storage mechanism which has tremendous compression – typically in the range of 5x to 10x, compared to raw data. It’s easy to store a very large volume of historical data in a columnstore table in MemSQL. Because of the capabilities that MemSQL provides – high scalability, high-performance SQL, fast and easy ingest, and high compression with columnar data storage – MemSQL has become a really attractive destination for people who are managing time series data.

New Time Series Features in MemSQL 7.0

(For more on what’s in MemSQL 7.0, see our release blog post, our deep dive into resiliency features, and our deep dive into MemSQL SingleStore. We also have a blog post on our time series features. – Ed.)

Close to half of our customers are using time series in some form, or they look at the data they have as time series. What we wanted to do for the 7.0 release was to make time series querying easier. We looked at some of our customers’ applications, and some internal applications we had built on MemSQL for historical monitoring. We saw that, while the query language is very powerful and capable, it looked like some of the queries could be made much easier.

MemSQL easy time series queries

We wanted to provide a very brief syntax to let people write common types of queries – to do things like downsampling, or converting irregular time series to regular time series. We wanted to make that really easy. We wanted to let more typical developers do things they couldn’t do before with SQL because it was just too hard, and to let experts do more, and do it faster, so they could spend more time on other parts of their application rather than writing tricky queries to extract information from time series.

That said, we were not trying to be the ultimate time series specialty package – for example, if you need curve fitting, very complex kinds of smoothing, or the ability to add together two different time series. We’re not really trying to make those use cases as easy and fast as they can possibly be. We’re looking at a conventional ability to manage large volumes of time series data, ingest the time series fast, and do typical and common query use cases through SQL easily. That’s what we want to provide. If you need some of these specialty capabilities, you probably want to consider a more specialized time series product, like kdb+ or something similar.

Throughout the rest of the talk, I’m going to be referring a few times to an example based on candlestick charts. A candlestick chart is a typical kind of chart used in the financial sector to show high, low, open, and close data for a security, during some period of time – like an entire trading day, or by minute, or by hour, et cetera.

MemSQL time series candlestick chart

This graphic shows a candlestick chart with high, low, open, and close values: the little lines at the top and bottom show the high and low, respectively, and the box shows the open and close. Just to start off, I wanted to show a query using MemSQL 6.8 to calculate the information that is required to render a candlestick chart like you see here.

MemSQL time series old and new code

On the left side, this is a query that works in MemSQL 6.8 and earlier to produce a candlestick chart from a simple series of financial trade or transaction events. On the right-hand side, that’s how you write the exact same query in MemSQL 7.0. Wow. Look at that. It’s about one third as many characters as you see on the left, and also it’s much less complex.

On the left, you see you’ve got a common table expression with a nested select statement that’s using window functions, sort of a relatively complex window function, and several aggregate functions. It’s using rank, and then using a trick to pick out the top-ranked value at the bottom. Anyway, that’s a challenging query to write. That’s an expert-level query, and even experts struggle a little bit with that. You might have to refer back to the documentation.

I’ll go over this again in a little more detail, but just please remember this picture. Look how easy it is to manage time series data to produce a simple candlestick chart on the right compared to what was required previously. How did we enable this? We provide some new time series functions and capabilities in MemSQL 7.0 that allowed us to write that query more easily.

New MemSQL time series functions

We provide three new built-in functions: FIRST(), LAST(), and TIME_BUCKET(). FIRST() and LAST() are aggregate functions that provide the first or last value in a time window or group, based on some time period that defines an ordering. I’ll say more about those in a few minutes. TIME_BUCKET() is a function that maps a timestamp to a one-minute or five-minute or one-hour window, or one-day window, et cetera. It allows you to do it in a very easy way with a very brief syntax, that’s fairly easy to learn and remember.

Finally, we’ve added a new SERIES TIMESTAMP column designation, which allows you to mark one of your columns as the time column for your time series. That allows some shorthand notations that I’ll talk about more.

Time series timestamp example

Here’s a very simple example table that holds time series data for financial transactions. We’ve got a ts column that’s a DATETIME(6), marked as the SERIES TIMESTAMP. DATETIME(6) is a standard datetime with six places to the right of the decimal point – it’s accurate down to the microsecond. Symbol is like a stock symbol – a character string of up to five characters. Price is a decimal, with up to 18 digits and 4 places to the right of the decimal point. So it’s a very simple time series table for financial information.
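In SQL, that table looks roughly like this – a sketch based on the description above; check the MemSQL 7.0 documentation for the exact SERIES TIMESTAMP syntax:

    CREATE TABLE tick (
      ts     DATETIME(6) SERIES TIMESTAMP,  -- the designated time column for the series
      symbol VARCHAR(5),                    -- stock symbol
      price  DECIMAL(18, 4)                 -- up to 18 digits, 4 to the right of the decimal point
    );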

In the examples that follow, I’m going to use this simple data set. We’ve got two made-up stocks, ABC and XYZ, with data that arrived on a single day, February 18th of next year, over a period of a few minutes. We’ll use that data in the examples below.

Let’s look in more detail at the old way of querying time series data with MemSQL, using window functions. For each symbol and each hour, I want to produce high, low, open, and close. This uses a window function that partitions by the symbol and time bucket, ordered by timestamp, with rows between unbounded preceding and unbounded following. “Unbounded” means that any aggregates we calculate over this window will be over the entire window.

Old code for time series with SQL

Then, we compute the rank, which is the serial number based on the sort order – 1, 2, 3, 4, 5; one is first, two is second, and so forth. Then, the minimum and maximum over the window, and the first value and last value over the window. First value and last value are the very original value and the very final value in the window, based on the sort order of the window. Then, you see FROM_UNIXTIME of UNIX_TIMESTAMP(ts), divided by 60 times 60, then multiplied by 60 times 60.

This is a trick that people who manage time series data with SQL have learned. Basically, you can divide a timestamp by a window width and then multiply by the window width again, and that will chunk up a fine-grained timestamp into a coarser grain aligned to a window boundary. In this case, the window is 60 times 60 seconds – one hour. Then, finally, in the select block at the end, you’re selecting the timestamp from above, the symbol, min price, max price, first, and last – but the query above produced an entry for every single point in the series, and we really only want one per bucket, so we pick out the top-ranked one.
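To make the trick concrete, here is a minimal sketch of just the bucketing expression for one-hour buckets, assuming the tick table described earlier. It covers the high and low of the candlestick, but not open and close, which is why the window functions above are needed:

    SELECT FROM_UNIXTIME(UNIX_TIMESTAMP(ts) DIV (60 * 60) * (60 * 60)) AS hour_bucket,
           symbol,
           MIN(price) AS low,
           MAX(price) AS high
    FROM tick
    GROUP BY 1, 2;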

Anyway, this is tricky. This is the kind of thing that will take an expert user several minutes to many minutes to write, with references back to the documentation. Can we do better than this? How can we do better? We introduced FIRST() and LAST() as regular aggregate functions in order to enable this kind of use case with less code. Here’s a very basic example: SELECT FIRST(price, ts) FROM tick. The second argument to the FIRST() aggregate is a timestamp, and it’s optional.

If it’s not present, then we infer that you meant to use the SERIES TIMESTAMP column of the table that you’re querying. The top query uses the full notation, but in the bottom query, you just say SELECT FIRST(price), LAST(price) FROM tick. FIRST(price) and LAST(price) implicitly use the series timestamp column ts as the time argument – the second argument to those aggregate functions. It just makes the query easier to write; you don’t have to remember to explicitly put the series time value in the right place when you use those functions.
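Here is a sketch of the two notations side by side, assuming the tick table above:

    -- full notation: the time argument is given explicitly
    SELECT symbol, FIRST(price, ts) AS open, LAST(price, ts) AS close
    FROM tick
    GROUP BY symbol;

    -- shorthand: ts is inferred because it is the SERIES TIMESTAMP column
    SELECT symbol, FIRST(price) AS open, LAST(price) AS close
    FROM tick
    GROUP BY symbol;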

Next, we have a new function for time bucketing. You don’t have to write that tricky divide-then-multiply expression that I showed you before – this is much easier to use and more intuitive. TIME_BUCKET() takes a bucket width, which is a character string like '5m' for five minutes, '1h' for one hour, and so forth. Then there are two optional arguments – the time and the origin.

New code with MemSQL Time Series functions

The time is optional, just like before. If you don’t specify it, then we implicitly use the series timestamp column from the table that you’re querying. Then, origin allows you to provide an offset. For example, if you want to bucket by day but start your day at 8 AM instead of midnight, you can put in an origin argument.

Again, this is far easier than the tricky math expression that we used for that candlestick query before. Here’s an example of using origin, with an 8 AM origin. We’ve got a table t, where ts is the series timestamp and v is a value that’s a double-precision float. In the query, you say TIME_BUCKET('1d', ts, origin), and for the origin you pick a date near the timestamps that you’re working with – in this case, with an 8 AM time of day.
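A sketch of that query, assuming the table t(ts, v) just described; the specific origin date is arbitrary, as long as it is near the data and carries the 8 AM offset:

    SELECT TIME_BUCKET('1d', ts, '2020-02-18 08:00:00') AS day_bucket,  -- daily buckets starting at 8 AM
           SUM(v) AS total_v
    FROM t
    GROUP BY 1
    ORDER BY 1;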

You can see in the results that the day bucket boundaries start at 8 AM. Normally, you’re not going to need to use an origin, but if you do need an offset, you can do that. Again, let’s look at the new way of writing the candlestick chart query. We select TIME_BUCKET('1h'), which is a one-hour bucket, then the symbol, the minimum price, the maximum price, the first price, and the last price.
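Here is a sketch of that new candlestick query, as described on the slide, assuming the tick table above:

    SELECT TIME_BUCKET('1h') AS ts_bucket,   -- one-hour buckets; ts is implied
           symbol,
           MIN(price)   AS low,
           MAX(price)   AS high,
           FIRST(price) AS open,             -- earliest price in each bucket, by ts
           LAST(price)  AS close             -- latest price in each bucket, by ts
    FROM tick
    GROUP BY 2, 1
    ORDER BY 2, 1;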

Notice that in FIRST(), LAST(), and TIME_BUCKET(), we don’t even have to refer to the timestamp column in the original data set, because it’s implicit. Some of you may have worked with specialty products for managing web events, like Splunk or Azure Kusto, so this concept of a time bucket function with an easy notation like this may be familiar to you from those kinds of systems.

One of the reasons people like those products so much, for the use cases they’re designed for, is that it’s really easy to query the data. The queries are very brief. We’re trying to bring that brevity for time series data to SQL with this new capability, with the series timestamp as an implicit argument to these functions. Then, you just GROUP BY 2, 1 – the time bucket and the symbol – and ORDER BY 2, 1. So it’s a very simple query expression.

Just to recap, MemSQL for several years has been great for time series ingest and storage. People loved it for that. We have very fast ingest, powerful SQL capability with time-oriented functions as part of our window function capability, high-performance query processing based on compilation to machine code and vectorization, as well as scalability through scale-out, and the ability to support high concurrency, where you’ve got lots of writers and readers concurrently working on the same data set. And not to mention, we provide transaction support and easy manageability, and we’re built for the cloud.

Now, given all the capabilities we already had, we’re making it even easier to query time series data with this new brief syntax – the new functions FIRST(), LAST(), and TIME_BUCKET(), and the SERIES TIMESTAMP concept – which allows you to write queries very briefly, without having to refer, repeatedly and redundantly, to the time column in your table.

Time series functions recap

This lets non-expert users do more than they could before – things they just weren’t capable of before with time series data – and it makes expert users more productive. I’d like to invite you to try MemSQL for free today, or contact Sales. Try it for free by using our free version, or go on Helios and do an eight-hour free trial. Either way, you can try MemSQL for no charge. Thank you.

Q&A: MemSQL and Time Series

Q. What’s the best way to age out old data from a table storing time series data?

A. The life cycle management of time series data is really important in any kind of time series application. One of the things you need to do is eliminate or purge old data. It’s really pretty easy to do that in MemSQL. All you have to do is run a delete statement periodically to delete the old data. Some other database products have time-oriented partitioning capabilities, and their delete is really slow, so they require you to, for instance, swap out an old partition once a month or so to purge old data from a large table. In MemSQL, you don’t really need to do that, because our delete is really, really fast. We can just run a delete statement to delete data prior to a certain time, whenever you need to remove old data.
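For example, a sketch of a periodic purge, with an arbitrary 30-day retention window:

    DELETE FROM tick
    WHERE ts < NOW() - INTERVAL 30 DAY;   -- remove rows older than 30 days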

Q. Can you have more than one time series column in a table?

A. You can only designate one column in a table as the SERIES TIMESTAMP. However, you can have multiple time columns in a table, and if you want to use different columns, you can use them explicitly with our new built-in time functions – FIRST(), LAST(), and TIME_BUCKET(). There’s an optional time argument, so if you have a secondary time column on a table that’s not your primary series timestamp, but you want to use it with some of those functions, you can do that. You just have to name the time column explicitly in the FIRST(), LAST(), and TIME_BUCKET() functions.
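For example, a sketch assuming a hypothetical secondary time column named settlement_ts on the tick table:

    SELECT symbol,
           FIRST(price, settlement_ts) AS first_settled,   -- use the secondary time column explicitly
           LAST(price, settlement_ts)  AS last_settled
    FROM tick
    GROUP BY symbol;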

Q. Does it support multi-tenancy?

A. Does it support multi-tenancy? Sure. MemSQL supports any number of concurrent users, up to a very high number of concurrent queries. You can have multiple databases on a single cluster, and each application can have its own database if you want, to have multi-tenant applications running on the same cluster.

Q. Does MemSQL keep a local copy of the data ingested or does it only keep references? If MemSQL keeps a local copy, how is it kept in sync with external sources?

A. MemSQL is a database system. You create tables, you insert data in the tables, you query data in the tables, you can update the data, delete it. So when you add a record to MemSQL, a copy of the information in that record – the record itself – is kept in MemSQL. It doesn’t store data by reference; it stores copies of the data. If you want to keep it in sync with external sources, then as the external values change, you’ll need to update the record that represents that information in MemSQL.

Q. How can you compute a moving average on a time series in MemSQL?

A. Sure. You can compute a moving average; it depends on how you want to do it. If you just want to average the data in each time bucket, you can use AVG() to do that. If you want a moving average, you can use window functions for that, and compute an average over a window as it moves. For example, you can average over a window from three preceding rows to the current row, to average the last four values.

Q. Did you mention anything about Python interoperability? In any event, what Python interface capabilities do you offer?

A. We do have Python interoperability, in that you can have client applications that connect to MemSQL and insert data, query data, and so forth in just about any popular programming language. We support connectivity to applications through drivers that are MySQL wire protocol-compatible. Essentially, any application software that can connect to the MySQL database and insert data, update data, and so forth, can also connect to MemSQL.
We have drivers for Python that allow you to write a Python application and connect it to MemSQL. In addition, in our Pipeline capability, we support what are called transforms. Those are programs or scripts that can be applied to transform batches of information that are flowing into MemSQL through the Pipeline. You can write transforms in Python as well.

Q. Do I need to add indexes to be able to run fast select queries on time series data, with aggregations?

A. So, depending on the nature of the queries and how much data you have, how much hardware you have, you may or may not need to use indexes to make certain queries run fast. I mean, it really depends on your data and your queries. If you have very large data sets and high-selectivity queries and a lot of concurrency, you’re probably going to want to use indexes. We support indexes on our rowstore table type, both ordered indexes and hash indexes.

Then, for our columnstore table type, we have a primary sort key, which is like an index in some ways, as well as support for secondary hash indexes. However, the ability to shard your data across multiple nodes in a large cluster, and to use columnstore data storage structures with very fast vectorized query execution, makes it possible to run queries with response times of a fraction of a second, on very large data sets, without an index.
That can make things easier for you as an application developer: you can let the power of your computing cluster and database software do the work, and not have to be so clever about defining your indexes. Again, it really depends on the application.
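As a sketch of the index options mentioned above – table and column names are made up, and secondary hash indexes on columnstore tables are a newer feature, so check the documentation for your version:

    -- rowstore table: ordered (skiplist) index on ts, plus a hash index for point lookups
    CREATE TABLE meter_events_rowstore (
      meter_id BIGINT,
      ts DATETIME(6),
      reading DOUBLE,
      SHARD KEY (meter_id),
      KEY (ts),
      KEY (meter_id) USING HASH
    );

    -- columnstore table: the clustered columnstore key acts as the primary sort key
    CREATE TABLE meter_events_columnstore (
      meter_id BIGINT,
      ts DATETIME(6),
      reading DOUBLE,
      SHARD KEY (meter_id),
      KEY (ts) USING CLUSTERED COLUMNSTORE
    );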

Q. Can you please also talk about encryption and data access roles, management for MemSQL?

A. With respect to encryption, for those customers that want to encrypt their data at rest, we recommend that they use Linux file system capabilities or cloud storage platform capabilities to do that, to encrypt the data through the storage layer underneath the database system.
Then, with respect to access control, MemSQL has a comprehensive set of data access capabilities. You can grant permission to access tables and views to different users or groups. We support single sign-on through a number of different mechanisms. We have a pretty comprehensive set of access control policies. We also support row-level security.

Q. What kind of row locking will I run into when using many transactions – selects, updates, deletes – at once?

A. MemSQL has multi-version concurrency control, so readers don’t block writers and vice versa. Write-write conflicts are usually handled at row-level lock granularity.

Q. How expensive is it to reindex a table?

A. CREATE INDEX is typically fast. I have not heard customers have problems with it.

Q. Your reply on moving averages seems to pertain to simple moving averages, but how would you do exponential moving averages or weighted moving averages, where a window function may not be appropriate?

A. For that you’d have to do it in the client application or in a stored procedure. Or consider using a different time series tool.

Q. Are there any utilities available for time series data migration to/from existing datastores like Informix?

A. For straight relational table migration, yes. But you’d have to probably do some custom work to move data from a time series DataBlade in Informix to regular tables in MemSQL.

Q. Does the series timestamp accept an integer data type, or does it have to be a datetime data type?

A. The data type must be time or datetime or timestamp. Timestamp is not recommended because it has implied update behavior.

Q. Any plans to support additional aggregate functions with the time series functions? (e.g. we would have liked to get percentiles like first/last without the use of CTEs)

A. Percentile_cont and percentile_disc work in MemSQL 7.0 as regular aggs. If you want other aggs, let us know.

Q. Where can I find more info on AI (ML & DL) in MemSQL?

A. See the documentation for the dot_product and euclidean_distance functions, see our past webinar recordings on this topic, and see this blog post: https://www.memsql.com/blog/memsql-data-backbone-machine-learning-and-ai/

Q. Can time series data be associated with asset context and queried in asset context? (Like a tank, with temperature, pressure, etc., within the asset context of the tank name.)

A. A time series record can have one timestamp and multiple fields. So I think you could use regular string table fields for context and numeric fields for metrics to plot and aggregate.

Q. I’m guessing the standard role-based security model exists to restrict access to time series data?

A. Yes.

(End of Q&A)

We invite you to learn more about MemSQL at https://www.memsql.com, or give us a try for free at https://www.memsql.com/free.


Case Study: MemSQL Replaces Hadoop for Managing 2 Million New-Style Utility Meters


Feed: MemSQL Blog.
Author: Floyd Smith.

SME Solutions Group is a MemSQL partner. An SME customer, a utility company, was installing a new meter network, comprising 2 million meters, generating far more data than the old meters. The volume of data coming in, and reporting needs, were set to overwhelm their existing, complex, Hadoop-based solution. The answer: replacing 10 different data processing components with a single MemSQL cluster. The result: outstanding performance, scalability for future requirements, the ability to use standard business intelligence tools via SQL, and low costs.

SME Solutions Group (LinkedIn page here) helps institutions manage risks and improve operations, through services such as data analytics and business intelligence (BI) tools integration. MemSQL is an SME Solutions Group database partner. George Barrett, Solutions Engineer at SME, says: “MemSQL is like a Swiss Army knife – able to handle operational analytics, data warehouse, and data lake requirements in a single database.” You can learn more about how the two companies work together in this webinar and in our previous blog post.

Introduction

A utility company had installed a complex data infrastructure. Data came in from all the company’s systems of record: eCommerce, finance, customer relationship management, logistics, and more.

The utility required fast ingest and fast processing, along with concurrency.

The new meter network was going to blow up ingest requirements to 100,000 rows per second, with future expansion planned. The existing architecture was insufficient, and it lacked the ability to scale up quickly. It featured ten components:

  • HDFS, Hive, and Druid. The core database was made up of the Hadoop Distributed File System (HDFS); Hive, for ETL and data warehousing; and Druid, an online analytics processing (OLAP) database, all frequently used together for big data / data lake implementations.
  • ODS, EDW, and Spark. An operational data store (ODS) and an enterprise data warehouse (EDW) fed Spark, which ran analytics and machine learning models.
  • Key-value store, MongoDB (document-based), Cassandra (columnstore data), and ElasticSearch (semi-structured data). Four different types of NoSQL databases stored data for different kinds of reporting and queries.

The Old Solution Fails to Meet New Requirements

This mix of different components was complex, hard to manage, and hard to scale. Worse, it was simply not up to the task of handling the anticipated ingest requirements, even if a lot of effort and investment were expended to try to make it work.

The utility required simultaneous writes and reads and high performance, along with time series bucketing.

Core requirements would have hit different parts of this complex system:

  1. Ingest data as fast as it was getting streamed. HDFS, for example, is best used for batch processing. Though it can handle micro-batches, it’s not really adapted for true streaming, as required for the new meter system.
  2. Aggregate data on three levels. The utility needed to aggregate data continuously, at five second intervals per meter, per day, and per month, with the ability to add aggregations going forward. The interaction between HDFS and Hive could not run fast enough to provide the needed aggregations without resorting to multi-second, or longer, response times.
  3. Bucket selected reads into high alarms and low alarms. Some reads needed to be marked off as alarms due to a value being too high or too low, with the alarm levels changed as needed by the utility. (MemSQL’s new time series features can help here.) The HDFS/Druid pairing could not handle this requirement flexibly and with the needed performance.
  4. Query data throughout process. The utility needed to allow operations personnel, analysts, and management to interactively query data throughout the process, and to be able to add applications and machine learning models in the future. The 10-component stack had a variety of interfaces, with different levels of responsiveness, altogether too limited and too complex to support query needs from the business.
  5. Maintain performance. Performance that could meet or exceed tight service level agreements (SLAs) was needed throughout the system. The variety of interfaces and the low performance levels of several components almost guaranteed that performance would be inadequate initially, and unable to scale up.

MemSQL Meets and Exceeds Requirements

The utility built and tested a new solution, using Kafka to stream data into MemSQL. The streaming solution, with MemSQL at its core, pulled all the functionality together into a single database, instead of 10 different components, as previously. And it more than met all the requirements.

MemSQL and Kafka deliver exactly-once updates in real time.

  1. Ingest data as fast as it’s getting streamed. The combination of Kafka and MemSQL handles ingest smoothly, with Kafka briefly holding data when MemSQL is performing the most complex operations, such as monthly aggregations.
  2. Aggregate data on three levels. MemSQL handles the three aggregations needed – per meter, meter/day, and meter/month – with room for more.
  3. Bucket selected reads into high alarms and low alarms. MemSQL’s ability to run comparisons fast, on live data, makes it easy to bucket reads into alarm and non-alarm categories as needed.
  4. Query data throughout process. MemSQL supports SQL at every step, as well as the MySQL wire protocol, making it easy to interface MemSQL to any needed tool or application.
  5. Maintain performance. MemSQL is so fast that a modestly sized cluster handles the entire operation. If data volumes, query volumes, or other business needs require it, MemSQL scales linearly to handle the increased demands.

MemSQL meets all requirements, and delivers room to grow.

There are also obvious operational advantages to using a single database that supports the SQL standard, compared to ten disparate components that don’t.

Machine learning and AI are now also much easier to implement. With a single data store for all kinds of data, live data and historical data can be kept in separate tables in the same overall database. Standard SQL operations such as JOINs can unify the data for comparison, queries, and more complex operations, with maximum efficiency.

The Future with MemSQL

With MemSQL at the core, SME’s customer is able to run analytics and reporting across their entire data-set using a wide variety of tools and ad-hoc processes. Although the original use case was 140 million rows of historical meter read data, they are easily able to scale their environment as their data grows to billions and even trillions of rows.

George and others are also excited about the new SingleStore capability, launched in MemSQL 7.0. In this initial implementation of SingleStore, rowstore tables have compression, and columnstore tables have fast seeks. The tables are more alike, and the need to use multiple tables, of two different types, to solve problems is greatly reduced. Over time, more and more problems will be solved in one table type, further simplifying the already much-improved operations workload.

You can learn more about how the two companies work together in this webinar and in our previous blog post. To get in touch with the SME Solutions Group, you can schedule a demo or visit their Linkedin page. To try MemSQL, you can run MemSQL for free; or contact MemSQL.

The Write Stuff


Feed: MemSQL Blog.
Author: Floyd Smith.

Why did the world need MemSQL? In this blog post, updated from a few years ago, early MemSQL Product Manager Carlos Bueno explains why MemSQL works better, for a wide range of purposes, than a NoSQL setup. (Thanks, Carlos!) We’ve updated the blog post with new MemSQL product features, graphics, and relevant links. Along those lines, you should also see Rick Negrin’s famous blog post on NoSQL and our recent case study on replacing Hadoop with MemSQL.

Tell us if this sounds familiar. Once upon a time a company ran its operations on The Database Server, a single machine that talked SQL. (Until the advent of NewSQL, most relational database systems – the ones that support SQL – ran on a single machine at a time. – Ed.) It was tricked out with fast hard drives and cool blue lights. As the business grew, it became harder for The Database to keep up. So they bought an identical server as a hot spare and set up replication, at first only for backups and failover. That machine was too tempting to leave sitting idle, of course. The business analysts asked for access so they could run reports on live-ish data. Soon, the “hot spare” was just as busy – and just as mission-critical – as the master server. And each machine needed its own backup.

The business grew some more. The cost of hardware to handle the load went way up. Caching reads only helped so much, and don’t get us started about maintaining cache consistency. It was beginning to look like it would be impossible for The Database Server to handle the volume of writes coming in. The operations people weren’t happy either. The latest semi-annual schema change had been so traumatic and caused so much downtime that they were still twitching.

It was then that the company took a deep breath, catalogued all their troubles and heartache, and decided to ditch SQL altogether. It was not an easy choice, but these were desperate times. Six months later the company was humming along on a cluster of “NoSQL” machines acting in concert. It scaled horizontally. The schemas were fluid. Life was good.

For a while, anyway. It turned out that when scaled up, the NoSQL cluster worked fine except for two minor things: reading data and writing data. Reading data (“finding documents”) could be sped up by adding indexes. But each new index slowed down write throughput. The business analysts weren’t about to learn how to program just to make their reports. That task fell back onto the engineers, who had to hire more engineers, just to keep up. They told themselves all this was just the price of graduating to Big Data.

The business grew a little more, and the cracks suddenly widened. They discovered that “global write lock” essentially means “good luck doing more than a few thousand writes per second.”

A few thousand sounds like a lot, but there are only 86,400 seconds in a day, and the peak hour of traffic is generally two or three times the average – because people sleep. A peak limit of 3,000 writes per second therefore means an average of roughly 1,000 writes per second, which translates to roughly 90 million writes a day. And let’s not talk about reads. Flirting with these limits became as painful as the database platform they’d just abandoned.

Tell us if this sounds familiar. I’ve seen a lot of companies suddenly find themselves stuck up a tree like this. It’s not a fun place to be. Hiring performance experts to twiddle with the existing system may or may not help. Moving to a different platform may or may not help either. A startup you’ve definitely heard of runs four – count ‘em, four – separate NoSQL systems, because each one had some indispensable feature (e.g., sharding or replication) that the others didn’t. That way lies madness.

Hypothetical Corp

Let’s look at the kinds of hardware running Hypothetical Corp’s business.

  • 50 application servers (lots of CPU)
  • 10 Memcached servers (lots of RAM)
  • Four NoSQL servers (lots of disk)

The interesting thing is that Hypothetical has several times more RAM in its fleet than the size of their database. If you ask them why, they’ll tell you “because accessing data from RAM is much faster than from disk.” This is, of course, absolutely true. Accessing a random piece of data in RAM is 100,000 times faster than accessing it from a spinning hard disk, and 1,000 times faster than from an SSD.

MemSQL can sit at the center of your data architecture, with data streaming in and query responses streaming out.

Here’s a crazy idea: instead of throwing a bunch of memory cache around a disk-based NoSQL database that has only half the features you want, what if you cut to the chase and used a database with in-memory rowstore tables, and disk-based columnstore tables, instead? One that talks SQL? And has replication? And sharding that actually works? And high availability? And massive write throughput via lock-free data structures? And transactions – including transactions in stored procedures? And flexible schemas with JSON & non-blocking ALTER TABLE support…

…and one that’s steadily growing in capabilities and features. Since this blog post was written, MemSQL has added columnstore tables (see above), MemSQL Studio visual tool for managing clusters, MemSQL Helios – our elastic cloud database service, MemSQL SingleStore, the ability to run MemSQL for free – on premises or in the cloud – and so much more.

As mentioned above, please see Rick Negrin’s NoSQL blog post and our case study on replacing Hadoop with MemSQL. You can then get started with MemSQL for free or contact Sales.

Notes:

http://docs.mongodb.org/manual/core/write-performance/
“After every insert, update, or delete operation, MongoDB must update every index associated with the collection in addition to the data itself. Therefore, every index on a collection adds some amount of overhead for the performance of write operations.”

https://tech.dropbox.com/2013/09/scaling-mongodb-at-mailbox/
“…one performance issue that impacted us was MongoDB’s database-level write lock. The amount of time Mailbox’s backends were waiting for the write lock was resulting in user-perceived latency.”

http://redis.io/topics/partitioning
“The partitioning granuliary [sic] is the key, so it is not possible to shard a dataset with a single huge key like a very big sorted set.”

The G2 Crowd Has (Much) More to Say About MemSQL


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL is garnering a lot of positive attention on business solutions review site G2.com (formerly known as G2 Crowd). Since our look at the site last July, dozens of new reviews have been posted, citing MemSQL’s speed, high capacity, and SQL compatibility, among other features. However, the most recent review was in December; now, some lucky MemSQL user has the chance to register the first comment of the new year. (Also, as most would say, of the new decade, though – as xkcd points out – this is a matter of some controversy.)

G2.com features information and frank user reviews of business solutions – software and services – across a wide range of categories. Software categories include CRM software, demand generation software, BI tools, AI development and delivery tools, marketplace platforms, CAD/CAM tools, and many more. Services categories include lawyers, tax people, sales training, graphic design, website design, staffing services, channel partners, and much more. 

It’s easy to link from a product to its overarching category (such as Relational Databases Software, for MemSQL and related software), or to interact with a G2 Advisor to help you narrow your search with help from a live person. (Not a bot – at least, not last time we checked.) You can quickly share your comments on social media, ask a vendor question, or request a demo from a vendor.

Using G2.com will help you see whether a product is a strong player in the market; assess its strengths and weaknesses; look at the competition; and get ready to answer intelligent questions in a vendor call. Your review comments encourage the vendors of products you use to keep up the good work and to fix any problems. 

Note: Be sure to hit the Show More link on G2.com reviews. Otherwise, you might miss comments about recommendations to others, the problems being solved, benefits being realized, and other valuable information. 

What MemSQL Users on G2.com Say About MemSQL

The recent comments on G2.com have yielded a star rating of 4 out of a possible 5 points. The major positive areas cited include speed, capacity, and ease of implementation. (Some of the comments have been lightly edited for spelling, syntax, spelling out acronyms, and so on – the original comments are on the G2 site.)

Speed

The major focus of many comments, and a common thread through nearly all of them, is MemSQL’s speed. 

MemSQL achieves its high speed through a distributed, lock-free architecture. A key architectural component, which helps in avoiding locks, is the use of skiplists over B-trees for scans. 

MemSQL’s architecture features the use of skiplists – especially efficient for in-memory operations – over B-trees.

Comments include:

  • “We saw a sub-second response on a test case with 1000 concurrent heavy API calls (scanning billions of rows across many tables + window functions etc) along with applying role-based RBAC functionalities on the fly.” (MemSQL has processed over a trillion rows per second.)
  • “Incredibly fast performance for dashboards!” This was the heading of a five-star review from a VP of IT Operations. They went on, “We have had excellent luck with ingest rates and the ability to do lightning fast counts and other general math on time-based data sets.”
  • “We’ve been able to pump through many millions of records in seconds, where in other database systems we were having our queries time out.”
  • “I… recommend it for businesses who have issues with data retrieval slowness and are looking to improve data retrieval speeds. We solved slowness in our grid reports by using MemSQL. This reduced the time clients waited for reports to be generated and made their experience much more pleasant.”
  • “The best thing about MemSQL is the speed at which queries are run. Add Kafka pipelines for your ETL process and everything data-related becomes much easier.”
  • “We handle extremely large data set analysis. MemSQL provides us with the capability to do so at incredible speed!”
  • “We, a Big 4 accounting firm, use MemSQL for rapid creation and destruction of clusters. MemSQL’s speed and reliability are key for our use case, since speed is our priority. Not having to keep clusters active saves time and money. MemSQL’s speed enables that competitive advantage.”
  • Problem being solved: “Slow queries in MySQL.” Benefits realized: “MemSQL allows us to run all queries, with no limitation, at every scale of data.”
  • “Currently the teams are using Memsql to parse extremely large datasets in times that used to take days.”

Capacity – and “Solving too Many Problems”

What’s truly remarkable about MemSQL is its linear scalability – the ability to maintain high performance, for both transactions and analytics, at high volumes. This includes high ingest volumes, processing large volumes of data, high volumes of queries, and high concurrency – many queries coming from many sources, including ad hoc queries, business intelligence tools, application demands, machine learning models, and more. When people refer to MemSQL’s speed, as in the previous section, they often really mean its ability to maintain very high speeds with very large volumes of data. 

A number of the more recent comments on G2.com address this core MemSQL capability:

  • “Our primary BI tool is Tableau. Before we implemented MemSQL, we were connecting Tableau to Microsoft SQL Server and running reports off there. However, the performance was extremely slow, especially when querying large chunks of data. With MemSQL, we can query as much data as we need, with little to no impact on performance time.”
  • “We replaced the databases of our core operational systems with MemSQL. We mostly go after greater performance and scalability.”
  • “Very scalable, fabulous, and very easy to use. I recommend it for the management of the organization’s data.”
  • “Great for extreme data processing. For extremely large data mining, MemSQL stands above the rest. The ability to process an extreme amount of data expediently is by far the greatest part of MemSQL.”
  • “(We implemented a) data lake with MemSQL. Performance increased multi-fold and scaled for a large volume of data.”
  • “We are solving too many problems with the help of MemSQL. Now we can analyze large amounts of data with efficiency.”

Two Sides to Ease of Use

The only major aspect of MemSQL’s capabilities that receives mixed reviews on G2 Crowd is ease of use. Some commenters describe it as very easy to use, and very easy to implement; this is especially true for those who have MySQL experience. (Not only do both databases use SQL, but MemSQL directly supports the MySQL wire protocol for connectivity.) Other commenters describe MemSQL as complicated or taking time to master; “It has some learning curve,” says one. 

The commenters are, indeed, talking about the same database. The point here is that, when dealing with large volumes of data, MemSQL eliminates the need to shard your own database, or to buy in a complex and expensive solution such as Oracle Exadata, when capacity demands outstrip what can be handled on a single server. So MemSQL is easy to use, compared to the big-data alternatives. But it’s harder to understand how it works than with some single-server relational databases. 

MemSQL is also unusual in its ability to handle both transactions and analytics, while replacing slow and complex extract, transform, and load (ETL) processes with easy-to-use MemSQL Pipelines. Several comments recognize this; “it has potential to be used for both for transaction processing as well analytical queries.” 

But a couple of comments seem to reflect a view of MemSQL as an in-memory database (only); “Excellent in-memory option for high-powered data analysis,” says one. MemSQL rowstore tables, used for transactions, do run in memory; but columnstore tables, used for most analytics operations, are disk-based, compressed 5-10x, and very efficient. Most of the comments recognize both sets of capabilities. 

Finally, MemSQL 7.0 includes the first iteration of MemSQL SingleStore, which will eventually unify rowstore and columnstore tables in a single table type, for most workloads. In MemSQL 7.0, SingleStore allows for compression of in-memory rowstore tables of about 50%, saving a great deal of money. At the same time, columnstore tables get big performance improvements from first-generation SingleStore features in the new release.  

A senior data engineering manager sums up many users' comments in a five-star review, one day after Christmas.

Summing Up

One comment sums up many of MemSQL’s benefits, well, elegantly. When asked, “What benefits have you realized?”, this customer comments: “The problem of running streaming analytics on high-velocity, high volume data sets with sub-second API responses. This product elegantly solves the problem.”

If you’re already a MemSQL user – whether you have an Enterprise license, or are using MemSQL for free – consider posting a review today. (And remember that you can get questions answered on the MemSQL Forums as well.) Your efforts will benefit the community as a whole.

If you haven’t yet tried MemSQL, take a look at the reviews on G2.com. Post questions there, or on the MemSQL Forums. And consider trying MemSQL for free today.

New Year, New FAQs

Feed: MemSQL Blog.
Author: Floyd Smith.

As interest in MemSQL increases, we get many questions about how MemSQL works, how to get started with it, and more. Performance reviews website G2.com has a MemSQL Q&A section where potential customers can ask questions. Here are some of the questions we hear most often – with answers – lightly edited, for context and clarity.

Q. What is the advantage of MemSQL over other distributed databases?

A. Compared to relational databases – those which support SQL – we believe that MemSQL is the fastest, most efficient SQL database. MemSQL features full, linear scalability, unlike competitors. We also handle both transactions and analytics in one database, as described by customers in the reviews on G2.com.

Compared to NoSQL, MemSQL is at least as scalable, far more efficient with machine resources, and of course, unlike NoSQL databases, we have full ANSI SQL support. MemSQL also supports data types, such as JSON and geospatial data, that may otherwise only be supported by NoSQL databases. 
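
To make that concrete, here is a small sketch (the table, values, and query are invented for illustration, not drawn from a customer workload) of relational, JSON, and geospatial data living side by side in one MemSQL table and queried with plain SQL:

CREATE TABLE deliveries (
  id BIGINT PRIMARY KEY,
  details JSON,                 -- semi-structured attributes
  location GEOGRAPHYPOINT       -- geospatial point (longitude latitude)
);

INSERT INTO deliveries VALUES
  (1, '{"status": "out_for_delivery", "items": 3}', 'POINT(-122.3959 37.7775)');

-- Filter on a JSON field and compute a distance (in meters) in the same statement:
SELECT id,
       details::$status AS status,
       GEOGRAPHY_DISTANCE(location, 'POINT(-122.4194 37.7749)') AS meters_away
FROM deliveries
WHERE details::%items >= 1;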

Q. How can I simplify scaling of a MemSQL cluster? We would like our microservices to use in-memory processing and storage for analytics purposes.

A. This question does seem particularly pertinent to microservices, as you are more likely to have multiple data stores. There are several parts to the answer: 

  1. This tutorial describes how to scale your cluster for optimal performance. 
  2. You can use Kubernetes, specifically the MemSQL Kubernetes Operator, to scale clusters flexibly. 
  3. With MemSQL Helios, our elastic managed service in the cloud, you simply send a request, and MemSQL will quickly rescale the cluster for you. 
  4. For more specifics, please use the MemSQL Forums to give a more detailed description and get a more detailed answer – or file a support ticket, if you have an Enterprise license. 
  5. Alternatively, contact MemSQL directly for more in-depth information. 

Q. Is MemSQL a row-based or column-based store?

A. We are happy to report that the answer is: Yes. MemSQL supports both rowstore and columnstore tables in the same database (or in separate databases), with the ability to efficiently run JOINs and other SQL operations across both table types. We also support new capabilities, under the umbrella of MemSQL SingleStore, which will gradually unify the two table types for most workloads; see the description of SingleStore-related changes in MemSQL 7.0, below. For more information about rowstore and columnstore tables, see the MemSQL review comments on G2.com, or contact MemSQL directly.

Rowstore transaction tables and columnstore analytics tables are merging into SingleStore over time.
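
As a rough sketch of how the two table types coexist (the schema and query here are illustrative only), a rowstore table can handle transactional point lookups while a columnstore table handles analytics, and one SQL statement can join across them. In MemSQL 7.0, a plain CREATE TABLE produces a rowstore; a KEY ... USING CLUSTERED COLUMNSTORE clause produces a columnstore:

CREATE TABLE accounts (               -- rowstore: fast point reads and writes
  account_id BIGINT PRIMARY KEY,
  name VARCHAR(100)
);

CREATE TABLE events (                 -- columnstore: compressed, scan-friendly analytics
  account_id BIGINT,
  event_ts DATETIME,
  amount DECIMAL(12,2),
  KEY (event_ts) USING CLUSTERED COLUMNSTORE
);

-- JOIN across the two table types with ordinary SQL:
SELECT a.name, SUM(e.amount) AS total_30d
FROM events e
JOIN accounts a ON a.account_id = e.account_id
WHERE e.event_ts >= NOW() - INTERVAL 30 DAY
GROUP BY a.name;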

Q. What is the relationship between MemSQL and MySQL?

A. MemSQL and MySQL are both ANSI SQL-compliant, so the same queries work on both – as is, or with minor changes. In addition, MemSQL directly supports the MySQL wire protocol for connectivity. 

Q. What versions of MemSQL are available?

A. The current version, MemSQL 7.0, is the only version available. MemSQL 7.0 is available for (free) download and is also the version that powers MemSQL Helios, our on-demand, elastic, managed service in the cloud. (Helios is compared to other cloud databases in the table below.) Some existing MemSQL customers are still running older versions, as is typical with infrastructure software.  

MemSQL Helios beats Mongo, Oracle Cloud, and Amazon Aurora for features.

Q. Where can I go for help if I do not have access to paid support?

A. Please visit the MemSQL Forums.

Q. How can I get a copy of MemSQL?

A. You can use MemSQL for free. You can download a fully capable version of the MemSQL software for use on your on-premises servers or in the cloud. This comes with community support and has fairly liberal capacity constraints. Alternatively, you can get a free 8-hour trial of MemSQL Helios, our elastic managed service in the cloud. Or, contact MemSQL to discuss using MemSQL for an enterprise proof of concept. 

Stop the Insanity: Eliminating Data Infrastructure Sprawl

Feed: MemSQL Blog.
Author: Rick Negrin.

There is a trend in industry which says that modern applications need to be built on top of one or more special-purpose databases. That every application benefits from using the best-of-breed technology for each requirement. And that the plethora of special-purpose options available from certain cloud providers is reasonable to manage. That’s all BUNK. The reality is that navigating the choices, figuring out how to use them effectively, and dealing with the ETL and inevitable data sprawl, is so difficult that the pain far outweighs any technical advantage you might get. In the vast majority of use cases, a single modern, scalable, relational database can support all of an application’s needs, across cloud providers and on-premises. 

Over the past decade, applications have become more and more data-intensive. Dynamic data, analytics, and models are now at the core of any application that matters. In order to support these requirements, there is a commonly held belief that modern applications need to be built on top of a variety of special-purpose databases, each built for a specific workload. It is said that this allows you to pick the best ones to solve your application needs. 

This trend is apparent when you look at the plethora of open source data tools that have proliferated in recent years. Each one was built to scratch an itch; optimized for specific, narrow use cases seen in a smattering of projects. In response, some of the cloud vendors have packaged up these multiple database technologies for you to choose from, commonly forking from existing open source projects. You’re then meant to wire together several of these tools into the needed data solution for each application.  

On the surface, the argument seems logical. Why bother to build or use general-purpose technology across multiple problem domains, maybe having to work around limitations that come from being a general-purpose solution, when you can use multiple tools, purpose-built for each of the specific problems you are trying to solve? 

Andy Jassy, CEO of AWS, made this point in his keynote at the company’s re:Invent conference recently, saying: “In the past, customers primarily used relational databases, and the day for that has come and gone…. Customers have asked for, and demand, purpose-built databases.”

The claim is that relational databases are too expensive; not performant; don’t scale. That is supposedly why Amazon offers eight operational databases. (It had been seven, but they announced another one at the conference: Amazon Managed Apache Cassandra Service.) This is not including the various analytic data warehouse technologies from AWS, such as Redshift, Athena, and Spectrum. Jassy goes on a rant about how anyone who tries to convince you otherwise is fooling you, and you should just ignore the person and walk away. (See the 1:18:00 mark in the keynote.)

Well, I am that person – and I am not alone.

This is not to say that there is no value in each of the special-purpose database offerings. There are certainly use cases where those special purpose databases shine, and are truly the best choice for the use case. These are cases where the requirements in one specific dimension are so extreme that you need something special-purpose to meet them. 

But the absolute requirement for specialty databases is mostly in outlier use cases, which are a tiny fraction of the total of workloads out there. In the vast majority of apps that people build, the requirements are such that they can be satisfied by a single, operational NewSQL database – a distributed relational database, supporting a mix of transactional and analytical workloads, multi-model, etc. – such as MemSQL. This is especially true when you find you need more than just a couple of special-purpose databases in your solution, or when your requirements are expected to change over time.

The burden of choice has always been the dilemma of the software engineer. It used to be that the choice was whether to buy an existing component or to build it yourself. You had to make the trade-off between the dollar cost to purchase the component – and the risk it might not be as good as you hoped – vs. the cost, in time and engineering resources, to build and maintain a custom solution. 

Most experienced engineers would likely agree that, in most cases, it is better to buy an existing component if it can meet the requirements. The cost to build is always higher than you think, and the cost to work out issues and to maintain the solution over time often dwarfs the initial cost. In addition, having someone to call when something breaks is critical for a production system with real customers. 

But then things changed.

How Choices Have Ballooned with Open Source and the Cloud

The emergence of open source software has fundamentally changed the “build vs. buy” choice. Now, it is a choice of build, buy – or get for free. And people love free. 

Most engineers who use open source don’t really care about tinkering with the source code and submitting their changes back to the code base, or referring to the source code to debug problems. While that certainly does happen (and kudos to those who contribute), the vast majority are attracted to open source because it is free. 

The availability of the Internet and of modern code repositories like GitHub has made the cost to build software low, and the cost to distribute software virtually nothing. This has given rise to new technology components at a faster rate than ever seen before. GitHub has seen massive growth in the number of new projects and the number of developers contributing, with 40 million contributors in 2019, 25% of whom are new, and 44 million repositories. 

On the face of it, this seems great. The more components that exist, the better the odds that the one component that exactly matches my requirements has already been built. And since they are all free, I can choose the best one. But this gives rise to a new problem. How do I find the right one(s) for my app? 

Too Many Options

There are so many projects going on that navigating the tangle is pretty difficult. In the past, you generally had a few commercial options. Now, there might be tens or hundreds of options to choose from. You end up having to narrow it down to a few choices based on limited time and information. 

Database technology in particular has seen this problem mushroom in recent years. It used to be you had a small number of choices: Oracle, Microsoft SQL Server, and IBM DB2 as the proprietary choices, or MySQL if you wanted a free and open source choice.

Then, two trends matured: NoSQL, and the rise of open source as a model. The number of choices grew tremendously. In addition, as cloud vendors are trying to differentiate, they have each added both NoSQL databases and their own flavors of relational (or SQL) databases. AWS has more than 10 database offerings; Azure and GCP each have more than five flavors. 

AWS database options, from S3 to DynamoDB
AWS offers a bewildering plethora of database choices.

DB-Engines (a site that tracks the popularity of database engines) lists more than 300 databases, with new ones getting added all the time. Even the definition of what is a “database” has evolved over time, with some simple data tools, such as caches, marketing themselves as databases. This makes it difficult to know, without a lot of research, whether a particular technology will match the requirements of your application. Fail to do enough research, and you can waste a lot of time building on a data technology, only to find it has some important gap that tanks your design.

Choosing a Specialty Database Type

There are many different flavors of databases on the list. Operational databases and data warehouses are the most common types, but there are several more. Each addresses a different set of requirements, as summarized in Table 1 below.

Database Types and the Requirements They Target

Operational Databases (Oracle, SQL Server, Postgres, MySQL, MariaDB, AWS Aurora, GCP Spanner)
  • Fast Insert
  • Fast Record Lookup
  • High Concurrency
  • High Availability
  • High Resilience
  • Relational Model
  • Complex Query
  • Extensibility

Data Warehouses (Teradata, Netezza, Vertica, Snowflake)
  • Fast Query
  • Aggregations
  • Large Data Size
  • Large Data Load
  • Resource Governance

Key-Value Stores (Redis, GridGain, Memcached)
  • Fast Insert
  • Fast Record Lookup
  • High Concurrency
  • High Availability

Document Stores (MongoDB, AWS DocumentDB, AWS DynamoDB, Azure Cosmos DB, CouchDB)
  • Fast Record Lookup
  • High Availability
  • Flexible Schema

Full-Text Search Engines (Elasticsearch, Amazon Elasticsearch Service, Solr)
  • Fuzzy Text Search
  • Large Data Sets
  • High Availability

Time Series Databases (InfluxDB, OpenTSDB, TimescaleDB, AWS Timestream)
  • Simple Queries over Time Series Data

Graph Databases (Neo4j, JanusGraph, TigerGraph, AWS Neptune)
  • Graph-Based Data Relationships
  • Complex Queries

Table 1. Fitting Your Prospective Application to Different Database Types

Every database is slightly different in the scenario it excels at. And there are new specialty databases emerging all the time.

If you’re building a new solution, you have to decide what data architecture you need. Even if you assume the requirements are clear and fixed – which is almost never the case – navigating the bewildering set of choices as to which database to use is pretty hard. You need to assess requirements across a broad set of dimensions – such as functionality, performance, security, and support options – to determine which ones meet your needs. 

AWS has you choose latency, durability, scale, and more to try to choose an AWS database service.
AWS boils it down to a 54-slide deck to help you choose.

If you have functionality that cuts across the different specialty databases, then you will likely need multiple of them. For example, you may want to store data using a standard relational model, but also need to do full text queries. You may also have data whose schema is changing relatively often, so you want to use a JSON document as part of your storage. 

The number of possible combinations of databases you could use in your solution is large. It’s hard to narrow that down by just scanning the marketing pages and the documentation for each potential solution. Websites cannot reliably tell you whether a database offering can meet your performance needs. Only prior experience, or a proof of concept (PoC), can do that effectively.

How Do I Find the Right People?

Once you have found the right set of technologies, who builds the application? You likely have a development team already, but the odds of them being proficient in programming applications on each specific, new database are low. 

This means a slower pace of development as they ramp up. Their work is also likely to be buggier as they learn how to use the system effectively. They also aren’t likely to know how to tune for optimal performance. This affects not just developers, but the admins who run, configure, and troubleshoot the system once it is in production.

How Do I Manage and Support the Solution with So Many Technologies?

Even after you pick the system and find the right people, running the solution is not easy. Most likely you had to pick several technologies to build the overall solution. Which means probably no one in your organization understands all the parts. 

Having multiple parts also means you have to figure out how to integrate all the pieces together. Those integration points are both the hardest to figure out, and the weakest point in the system. It is often where performance bottlenecks accumulate. It is also a source of bugs and brittleness, as the pieces are most likely not designed to work together. 

When the solution does break, problems are hard to debug. Even if you have paid for support for each technology – which defeats the purpose, if you’re using things which are free – the support folks for each technology are not likely to be helpful in figuring out the integration problems. (They are just as likely to blame each other as to help you solve your problem).

The Takeaway

Going with multiple specialty databases is going to cost you, in time, hassle, money and complexity:

  • Investigation and analysis. It takes a lot of energy and time to interrogate a new technology to see what it can do. The number of choices available is bewildering and overwhelming. Every minute you spend doing the investigation slows down your time to market.
  • Many vendors. If you end up choosing multiple technologies, you are likely to have different vendors to work with. If the solution is open source, you are either buying support from a vendor, or figuring out how to support the solution yourself. 
  • Specialized engineers. It takes time and experience to truly learn how to use each new data technology. The more technology you incorporate into your solution, the harder it is to find the right talent to implement it correctly.
  • Complicated integrations. The most brittle parts of an application are the seams between two different technologies. Transferring data between systems with slightly different semantics, protocols that differ, and connection technologies that have different scale points are the places where things break down (usually when the system is at its busiest).
  • Performance bottlenecks. Meshing two different technologies is also where performance bottlenecks typically occur. With data technologies, it is often because of data movement.
  • Troubleshooting integration problems. Tracking down and fixing these issues is problematic, as the people doing the tracking down are rarely experts in all the technologies. This leads to low availability, frustrated engineers, and unhappy customers.

Considering MemSQL – a New Solution for a New Era

Ideally, there would be a database infrastructure which is familiar; which has an interface that most existing developers know how to use and optimize; and which has functionality needed to handle 90% or more of the use cases that exist. It would need to be cloud-native – meaning it natively runs in any cloud environment, as well as in an on-premises environment, using cloud-friendly tools such as Kubernetes. 

This ideal technology would also be distributed, so that it scales easily, as required by the demands of the application. This database would be the default choice for the vast majority of applications, and developers would only need to look for other solutions if they hit an outlier use case. Using the same database technology for the vast majority of solutions means the engineers will be familiar with how to use it, and able to avoid the issues listed above.

This is why we built MemSQL. Legacy databases like Oracle and SQL Server served this function for a long time. But the scale and complexity requirements of modern applications outgrew their capabilities. 

These needs gave rise to the plethora of NoSQL systems that emerged out of the need to solve for the scale problem. (I discuss this in my blog about NoSQL and relational databases.) But the NoSQL systems gave up a lot of the most useful functionality, such as structure for data and SQL query support, forcing users to choose between scale and functionality. 

NewSQL systems like MemSQL allow you to have the best of both worlds. You get a highly scalable cloud native system that is durable, available, secure, and resilient to failure – but with an interface that is familiar to developers and admins. It supports a broad set of functionality. It supports ANSI SQL. It supports all major data types – relational, semi-structured (native support for storing JSON and for ingesting JSON, AVRO and Parquet), native geo-spatial indexes, and Lucene-based full text indexes that allow Lucene queries to be embedded in relational queries. It has both rowstore and columnstore tables – currently merging into SingleStore tables – supporting complex analytical workloads, as well as transactional and operational workloads. 
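
As one small illustration of the full-text support mentioned above (the table and query are hypothetical), a FULLTEXT index can be declared on a columnstore table and combined with ordinary relational predicates in a single query:

CREATE TABLE articles (
  id BIGINT,
  published DATETIME,
  title VARCHAR(255),
  body TEXT,
  KEY (id) USING CLUSTERED COLUMNSTORE,
  FULLTEXT (title, body)              -- Lucene-based full-text index
);

SELECT id, title
FROM articles
WHERE MATCH (title, body) AGAINST ('replication durability')
  AND published >= '2019-01-01';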

MemSQL has support for transactions. It supports stored procedures and user-defined functions (UDFs), for all your extensibility needs. It can ingest data natively from all the major data sources from legacy database systems, to blob stores like S3 and Azure Blob, as well as modern streaming technologies such as Kafka and Spark. 
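
For example, a MemSQL Pipeline can subscribe directly to a Kafka topic with a couple of statements. The broker address, topic, and table below are placeholders, not a real configuration:

CREATE TABLE clicks (
  user_id BIGINT,
  url VARCHAR(2048),
  ts DATETIME
);

CREATE PIPELINE clicks_from_kafka AS
  LOAD DATA KAFKA 'kafka-broker.example.com:9092/clickstream'
  INTO TABLE clicks
  FIELDS TERMINATED BY ',';           -- incoming messages assumed to be CSV

START PIPELINE clicks_from_kafka;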

MemSQL runs everywhere and offers operational analytics, and much more.

The combination of a shared-nothing scale-out architecture and support for in-memory rowstore tables means there is no need for a caching layer for storing and retrieving key-value pairs. Because MemSQL is wire protocol-compatible with MySQL, it supports a huge ecosystem of third-party tools. 

MemSQL has a rich set of security features, such as multiple forms of authentication (username/password, Kerberos, SAML, and PAM). It supports role-based access control (RBAC) and row-level security for authorization. MemSQL supports encryption. You can use the audit feature to determine who accessed your data, and what was accessed. 

Lastly, MemSQL can be deployed using standard command-line tools, on Kubernetes via a native Kubernetes Operator, or managed by MemSQL, via our managed service, MemSQL Helios.

MemSQL in Action

Let’s walk through some examples of a few customers who ran into these problems while first trying to build their solution using a combination of technologies, and how they ultimately met their requirements with MemSQL.

Going Live with Credit Card Transactions

A leading financial services company ran into challenges with its credit and debit card fraud detection service. The firm saw rising fraud costs and customer experience challenges that ultimately prompted it to rebuild the service in-house. The goal was to build a new data platform that could provide a faster, more accurate service – one that could catch fraud before the transaction was complete, rather than after the fact, and be easier to manage.

You can see in the diagram below that the customer was using ten distinct technology systems. Stitching these systems together, given the complexity of interactions, was very hard. 

Ultimately, the overall system did not perform as well as they hoped. The latency for the data to traverse all the systems was so high that they could only catch fraud after the transaction had gone through. (To stop a transaction in progress, you need to make a decision in tens or hundreds of milliseconds. Anything longer means an unacceptable customer experience.) 

It was also hard to find people with the right experience in each of the technologies to be sure the system was being used correctly. Lastly, it was hard to keep the system up and running, as it would often break at the connection points between the systems.

Financial services company faces data sprawl.

They replaced the above system with the architecture below. They reduced the ten technologies down to two: Kafka and MemSQL. They use Kafka as the pipeline to flow all the incoming data from the upstream operational systems. 

All of that lands in MemSQL, where the analysis is done and surfaced to the application. The application then uses the result to decide whether to accept the credit card transaction or reject it. In addition, analysts use MemSQL to do historical analysis to see when and where fraud is coming from. They are now able to meet their service level agreements (SLAs) for running the machine learning algorithm and reporting the results back to the purchasing system, without impacting the user experience of the live credit card purchase.

MemSQL replaces many systems with one.

Making e-Commerce Fanatics Happy

Another MemSQL customer, Fanatics, also used MemSQL to reduce the number of technologies they were using in their solution. 

Fanatics has its own supersite, Fanatics.com, but also runs the online stores of all major North American sports leagues, more than 200 professional and collegiate teams, and several of the world’s largest global football (soccer) franchises. Fanatics has been growing rapidly, which is great for them as a business – but which caused technical challenges.

Fanatics’ workflows were very complex and difficult to manage during peak traffic events, such as a championship game. Business needs evolve frequently at Fanatics, meaning schemas had to change to match – but these updates were very difficult.

Maintaining the different query platforms and the underlying analytics infrastructure cost Fanatics a lot of time to keep things running, and to try to meet SLAs. So the company decided on a new approach.

Fanatics had three different analytics pipelines for three audiences.

MemSQL replaced the Lucene-based indexers. Spark and Flink jobs were converted to SQL-based processing, which allows for consistent, predictable development life cycles, and more predictability in meeting SLAs.

The database has grown to billions of rows, yet users are still able to run ad hoc queries and scheduled reports, with excellent performance. The company ingests all its enterprise sources into MemSQL, integrates the data, and gains a comprehensive view of the current state of the business. Fanatics was also able to unify its users onto a single, SQL-based platform. This sharply lowers the barriers to entry, because SQL is so widely known.

MemSQL replaces three analytics pipelines with one pipeline.
More details about the Fanatics use cases are in this blog post from MemSQL.

The above are just two examples. MemSQL has helped many customers simplify their applications and improve their performance.

Conclusion

Some cloud providers claim that you need to have eight different purpose-built databases, if not more, to build your application – and that it is impractical for one database to meet all of the requirements for an application. 

We at MemSQL respectfully disagree. While there are some outlier use cases that may require a specialty database, the vast majority of applications can have all their key requirements satisfied by a single NewSQL database, such as MemSQL. 

Even if you can build your solution using a number of special-purpose databases, the cost to investigate, build, optimize, implement, and manage those systems will outweigh any perceived benefit. 

Keep it simple; use the database that meets your requirements. Try MemSQL for free today, or contact us for a personalized demo.

Gartner Peer Insights Applauds MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

Gartner Peer Insights features technology product reviews from enterprise users, with more than 300,000 reviews covering nearly 6,000 products in more than 330 categories. Upholding their reputation as top industry analysts for enterprise technology, Gartner sees to it that the reviews are, in their words, “rigorously vetted,” with “no vendor bias.” MemSQL has nearly two dozen reviews, with an overall rating of 4.5 stars, highlighting key points of what the software does for users. For those who want to know more about MemSQL, these reviews are a stellar resource. And, for those who are already MemSQL users, you can post a review today.

Each rating includes dozens of areas for comment, in several distinctive areas: evaluation & contracting, such as reasons for purchase; integration & deployment, such as listing other platforms and products the software will be integrated with; service & support, such as the quality of support; product capabilities, such as the variety of data types supported; and additional context. 

Gartner Peer Insights has MemSQL reviews from enterprise users.

Across the reviews, several key points come up:

  • Speed. MemSQL is fast – “blazing fast aggregate queries,” says one user. “The queries perform really well on sharded data,” says another. A third liked “fast data ingest – running at a trillion rows per second.” A senior consultant with a large manufacturer described “10x to 100x performance improvements compared to Microsoft SQL Server.” 
  • Scalability. MemSQL is a scalable SQL database, a rare commodity, since legacy databases don’t scale easily, if at all. This opens up a range of capabilities; “creating pipelines within SQL is amazing.” 
  • MySQL compatibility. MemSQL supports ANSI SQL, and is wire protocol-compatible with MySQL, making it easy to integrate and easy to use. “The learning curve with our engineers was minimal,” said one user. “It’s friendly for a MySQL user, but faster,” said another. 
  • Flexibility. “MemSQL has a hybrid, rowstore and columnstore solution to resolve different use cases.” This flexible architecture was mentioned by many reviewers. 

Reviewers also shared a few things they wanted others to be aware of, in addition to the positives. “Make sure you understand the difference between the memory engine and the columnstore engine,” reads one review. Several said scaling to a larger or smaller footprint should be smoother. 

Reviewers integrated MemSQL with a wide range of widely used technologies, including:

  • MySQL, S3, and PHP. 
  • Python, Databricks, and Tableau. 
  • Other business intelligence (BI) products, such as DundasBI. 

One user replaced SAP Sybase with MemSQL; two others replaced Microsoft SQL Server. Most deployments took from zero to three months. 

Several comments summarized users’ reactions. One question asks, “If you could start over, what would your organization do differently?” The answer: “Start using MemSQL earlier”; another echoed, “Pick MemSQL sooner.” A highly satisfied user purchased MemSQL to “create internal/operational efficiencies” and “drive innovation.” 

Other comments described the “strong customer focus” of MemSQL, the way in which the product is “evolving fast,” and the “strong roadmap” for feature development. “An awesome team,” said one user in finance, who also said that their own “engineering teams are very happy with the product.” 

Much of the value of review sites like this one comes from the large number of distinctive voices that contribute. If you are already a MemSQL user, you can post a review today. If not, you can also read about the reviews on G2.com, and ask questions on G2.com, or on the MemSQL Forums. And you can try MemSQL for free today.

Find and Fix Problems Fast with MemSQL Tools

Feed: MemSQL Blog.
Author: Roxanna Pourzand.

MemSQL Tools is a new set of command line programs for managing the clusters of servers or instances that make up your database.  You can use MemSQL Tools to help you find and fix problems with MemSQL quickly and incisively. Our legacy management tool, MemSQL-Ops, generated cluster reports, with output logs and diagnostics per cluster. For efficiency, the MemSQL team developed an internal tool called ClusteRx, to parse the reports coming out of MemSQL-Ops. We are now sharing this functionality with our customers as two new commands in MemSQL Tools, memsql-report collect and memsql-report check. Read on to learn how these tools can help you find and fix problems fast. 

At MemSQL, we are continuously working to enhance the monitoring and troubleshooting tools for the database, and aiming to create an engine that is fully self-healing. While this is the end goal we are striving towards, it is not a small feat. Today, it is often beneficial to understand the point-in-time health of the MemSQL system (the host, the nodes, and the database itself). These types of health assessments are most useful when you are troubleshooting an ongoing issue that does not have a clear root cause. Specifically, these health checks can be useful in cases where a user may need some hints or indicators that will give them direction on where and how to start investigating a particular symptom.

Perhaps you notice that some of your application queries are failing intermittently, but there is no obvious culprit. Where do you start troubleshooting the issue? You might go through a slew of system checks, but you aren’t sure whether you’re simply chasing the white rabbit. How do you narrow down the problem? We’ll get back to this example shortly…

Past Health Checks at MemSQL

MemSQL has a legacy management tool called MemSQL-Ops, which runs in conjunction with the database. MemSQL-Ops can generate what we call a cluster report, which outputs logs and other diagnostics on a given cluster. It’s an informative report – but it can be challenging to navigate if you don’t know where to start your investigation. It’s a classic ‘finding a needle in a haystack’ problem. When a customer filed a customer support ticket, the MemSQL support team typically requested that they run this report, and the support team then used the report to help fix problems. (If you have filed a case with our teams, you are probably very familiar with how to collect a report.)

Over time, the MemSQL support team learned what critical data points in these dense cluster reports offer the most insight to nail down or troubleshoot an issue. This feedback loop led to an internal tool that was developed by the support team, called ‘ClusteRx’ (notice the ‘Rx’ health pun!). ClusteRx parses the cluster report and provides an output that notifies the support team of various pass/fail checks that can be used as indicators on where the root cause of a particular problem may lie. (We will provide a practical example a little later on in the article).

Making the Internal Tool External

This internal tool, developed by our support team, became so useful for helping to troubleshoot a cluster experiencing problems that we decided to make it available to our customers. This is very exciting, because making this feature available to our customers enables them to troubleshoot MemSQL without assistance, and it also ensures they are equipped with the information and tools they need to manage their MemSQL cluster successfully.

Fast forward to the present day: We redesigned our entire management tool into a new framework called MemSQL Tools, which replaces MemSQL-Ops. We took the lessons learned from the internal ClusteRx tool for point-in-time health assessments that MemSQL support and engineering iterated on together, and we applied them to new functionality within MemSQL Tools, called memsql-report collect and memsql-report check.

What the New Tool Does

The newly redesigned version of this tool does the following: 

  • memsql-report collect gathers the diagnostic report from the given cluster. 
  • memsql-report check runs various pass/fail checkers on the report and outputs the result to the user, highlighting actionable tasks if a certain check failed. 

As of this blog post, memsql-report check has 55 different checkers, and we are continuously developing more. Below are a few examples of the checkers in memsql-report check:

  • outOfMemory – reports on any recent out of memory failures.  
  • leavesNotOnline – provides information on MemSQL leaves that are offline in the cluster. 
  • userDatabaseRedundancy – confirms your data is properly replicated across the cluster. 
  • defaultVariables – checks certain MemSQL system variables and ensures they are set to recommended values. 
The checkers for memsql-report check help you find and fix problems fast.
MemSQL documentation describes the 55 different checkers for memsql-report check.

The Real-Life Problem

Back to the real-life user example that we introduced at the beginning of this article…

A user noticed that a subset of their application query workload was failing, but there were no leading indicators as to why. How can we use memsql-report check to help us? 

Some definitions, before we go any further: 

  • A MemSQL aggregator is a cluster-aware query router. It is responsible for sending all the queries to the nodes that have the data on them (MemSQL leaf nodes). 
  • The ready queue is a queue of all processes that are waiting to be scheduled on a core.

The customer with this issue filed a ticket with support, and support instructed them to run memsql-report collect and memsql-report check on their cluster. Using the output of memsql-report check, the customer immediately detected that the ready queue was saturated on one of the aggregators. 

Each query that runs on a cluster – including internal queries used by the nodes to communicate with each other – requires exactly one thread on an aggregator; the ready queue saturation message means that the maximum number of connections allowed on that aggregator has been reached. Saturation of the ready queue typically means that queries will be queued on that aggregator, and depending on your application logic, it can lead to timeouts and failures. 

This explains why some, but not all, queries were failing in the customer’s application. Tying this back to MemSQL’s report-check functionality, the customer was able to identify that the ready queue was saturated by looking at one particular checker that stood out, the readyQueueSaturated checker. 

Here is the example output on the ready queue that piqued our interest:

readyQueueSaturated [FAIL]

The ready queue has not decreased on 10.22.182.7

The customer shared the output of the checker with us, and we focused on the aggregator that exhibited this failure (10.22.182.7) and identified in the processlist that there were about 50 open connections on this aggregator. Hmm. This finding was puzzling to our team because MemSQL aggregators are typically meant to handle more than 50 concurrent connections at a time. So, why were queries failing? 

The Real-Life Solution

It turns out memsql-report check clued the customer in on another issue, which they brought to our attention. The aggregator was configured to only have 50 connection threads (max_connection_threads) open at once. The max_connection_threads setting on an aggregator is essentially a limit on the number of queries – including internal MemSQL queries – the aggregator will run simultaneously. The value recommended for aggregator connection threads is 192, so this aggregator was configured to service roughly a quarter of the connections it was supposed to!

readyQueueSaturated [FAIL]

The ready queue has not decreased on node 10.22.182.7

Warn: Value of max_connection_threads 50 doesn't follow the recommended value 192 for node 10.22.182.7

As soon as the customer updated the value for max_connection_threads to the recommended level, 192, the issue was resolved.  
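
If you want to check this setting on your own aggregators, a quick read-only query does it. (How you change the value depends on how your cluster is deployed; it is typically adjusted through the MemSQL management tools rather than an ad hoc statement, so treat this as a check, not a fix.)

-- Run on the aggregator in question and compare the result to the recommended value of 192:
SHOW VARIABLES LIKE 'max_connection_threads';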

Without report-check, it would have taken a lot of investigation time to get to the bottom of the issue. With this tool, you can find problems in minutes that could have taken hours otherwise.

For example, in this case, the customer would have had to check the logs for every node to identify that the ready queue was saturated on a given aggregator. Furthermore, the user would have had to check each node’s setting for max_connection_threads to find the misconfiguration. Both of these could have taken a significant amount of time, especially with a large MemSQL cluster. 

Trying It Yourself

This scenario is one of many examples of how useful memsql-report check can be for quickly identifying and mitigating issues and limiting break-to-fix times. Many of our customers also use this tool after they deploy MemSQL, before they do upgrades, maintenance, or other significant activities, as a sanity check to confirm their cluster is in good shape. 

If you haven’t used memsql-report check, you should check it out. We encourage our customers to use this to troubleshoot issues on their own! And, if you’re still stuck, reach out. Your efforts will help MemSQL Support to help you, faster and more effectively. (If you’re not yet a paying customer, your efforts will help you find help on the MemSQL Forums.) 

Looking ahead, we want to expand this tool so that customers can do system checks to validate their environment before they install MemSQL, including validating performance on their hardware. Additionally, we want to incorporate some of the applicable health checks into the database engine directly. 

If you have feedback on what we should add to the checker, please post in the MemSQL Forums. And see the MemSQL documentation for a full list of all the checkers. If you haven’t done so yet, you can try MemSQL for free, or contact MemSQL today. 

memsql-report check Example

Below is an example of output for memsql-report check. I suggest that, when using it, you look at all the checks and review the ones that failed. For example, for the memory check failure, this user allocated a large percentage more memory to the database than is actually available on their machines. In this case, I would adjust maximum_memory on all my nodes to ensure my cluster is within physical memory limits. 

✘ maxMemorySettings ……………………….. [FAIL]
FAIL total maximum_memory of all nodes on 127.0.0.1 too high (180% of RAM) [7106/3947]
bash-4.2$ memsql-report check --report-path /home/memsql/report-2019-12-20T012005.tar.gz
✓ explainRebalancePartitionsChecker …………. [PASS]
✓ queuedQueries …………………………… [PASS]
✘ defaultVariables ………………………… [WARN]
WARN internal_keepalive_timeout is 90 on F91D002A777E0EB9A8C8622EC513DA6F0D359C4A (expected: 99)
WARN internal_keepalive_timeout is 90 on 5E8C86A7D53EFE278FD70499683041D4968F3356 (expected: 99)
✓ memsqlVersions ………………………….. [PASS]
NOTE version 6.8.13 running on all nodes
✘ vmOvercommit ……………………………. [WARN]
WARN vm.overcommit_memory = 1 on 127.0.0.1. The Linux kernel will always overcommit memory, and never check if enough memory is available. This increases the risk of out-of-memory situations
✘ maxMemorySettings ……………………….. [FAIL]
FAIL total maximum_memory of all nodes on 127.0.0.1 too high (180% of RAM) [7106/3947]
✓ leafAverageRoundtripLatency ………………. [PASS]
✘ detectCrashStackTraces …………………… [WARN]
WARN data from MemsqlStacks collector unavailable on host 127.0.0.1: /tmp/memsql-report656183805/127.0.0.1-MA-LEAF/memsqlStacks.files.json not found
✓ failedBackgroundThreadAllocations …………. [PASS]
✓ columnstoreSegmentRows …………………… [PASS]
NOTE columnstore_segment_rows = 1024000 on all nodes
✓ tracelogOOD …………………………….. [PASS]
✓ runningBackup …………………………… [PASS]
✓ interpreterMode …………………………. [PASS]
NOTE interpreter mode INTERPRET_FIRST found on all nodes
✘ userDatabaseRedundancy …………………… [WARN]
WARN this cluster is not configured for high availabililty
✘ maxOpenFiles ……………………………. [WARN]
WARN fs.file-max = 524288 might be low on 127.0.0.1, recommended minimum is 1024000
WARN open files ulimit (1048576) is set higher than fs.file-max value (524288) on 127.0.0.1
✓ offlineAggregators ………………………. [PASS]
✘ outOfMemory …………………………….. [WARN]
WARN dmesg unavailable on host 127.0.0.1: error running command: `”/usr/bin/dmesg”`: exit status 1
✓ kernelVersions ………………………….. [PASS]
NOTE 4.9 on all
✓ replicationPausedDatabases ……………….. [PASS]
✓ mallocActiveMemory ………………………. [PASS]
✓ missingClusterDb ………………………… [PASS]
✓ failedCodegen …………………………… [PASS]
✘ numaConfiguration ……………………….. [WARN]
WARN NUMA nodes unavailable on host 127.0.0.1: exec: “numactl”: executable file not found in $PATH
✓ duplicatePartitionDatabase ……………….. [PASS]
✓ tracelogOOM …………………………….. [PASS]
✘ transparentHugepage ……………………… [FAIL]
FAIL /sys/kernel/mm/transparent_hugepage/enabled is [madvise] on 127.0.0.1
FAIL /sys/kernel/mm/transparent_hugepage/defrag is [madvise] on 127.0.0.1
NOTE https://docs.memsql.com/memsql-report-redir/transparent-hugepage
✓ defunctProcesses ………………………… [PASS]
✓ orphanDatabases …………………………. [PASS]
✓ pendingDatabases ………………………… [PASS]
✓ orphanTables ……………………………. [PASS]
✓ leafPairs ………………………………. [PASS]
NOTE redundancy_level = 1
✓ unrecoverableDatabases …………………… [PASS]
✓ unkillableQueries ……………………….. [PASS]
✓ versionHashes …………………………… [PASS]
✓ filesystemType ………………………….. [PASS]
✓ runningAlterOrTruncate …………………… [PASS]
✓ leavesNotOnline …………………………. [PASS]
✘ maxMapCount …………………………….. [FAIL]
FAIL vm.max_map_count = 262144 too low on 127.0.0.1
NOTE https://docs.memsql.com/memsql-report-redir/configure-linux-vm-settings
✓ defaultWorkloadManagement ………………… [PASS]
✓ longRunningQueries ………………………. [PASS]
✓ diskUsage ………………………………. [PASS]
✓ validLicense ……………………………. [PASS]
NOTE you are using 32.0 GB out of 320.0 GB cluster capacity (licensed: 320.0 GB)
NOTE License expires on 2020-01-31 08:00:00 UTC
✓ cpuFeatures …………………………….. [PASS]
✓ unmappedMasterPartitions …………………. [PASS]
✓ blockedQueries ………………………….. [PASS]
✓ orchestratorProcesses ……………………. [PASS]
✓ delayedThreadLaunches ……………………. [PASS]
✓ disconnectedReplicationSlaves …………….. [PASS]
✓ minFreeKbytes …………………………… [PASS]
✓ cpuModel ……………………………….. [PASS]
NOTE Intel(R) Core(TM) i7-6567U CPU @ 3.30GHz on all
✓ readyQueueSaturated ……………………… [PASS]
✓ failureDetectionOn ………………………. [PASS]
✓ collectionErrors ………………………… [PASS]
✓ secondaryDatabases ………………………. [PASS]
Some checks failed: 44 PASS, 7 WARN, 3 FAIL

A Balanced Approach to Database Use with Microservices

Feed: MemSQL Blog.
Author: Rob Richardson.

Microservices architectures have focused heavily on compute services, with data storage and retrieval – a crucial topic – sidelined, and left as an exercise for the developer. Crucially, Kubernetes did not include direct support for stateful services (ie, databases) initially. The Red Hat developers blog has suggested that data management is The Hardest Part About Microservices. In response, in this blog post, we suggest that a NewSQL database, such as MemSQL, can provide data services to multiple microservices implementations in a manageable way, simplifying the overall architecture and boosting performance. 

There are many ways to tackle data needs for a microservices architecture. Because microservices architectures have such a strong following in the open source community, and because Kubernetes was slow to support stateful services such as databases, many architects and developers seem to assume several predicates:

  • Every service in a microservices architecture will have its own data store
  • Across data stores, data will be eventually consistent – not, well, consistently consistent (ie, not meeting ACID guarantees)
  • The application developer will be responsible for managing data across services and for ultimate consistency. 

But not every microservices thinker is ready to throw out the baby – the valuable role that can be played by a relational database – with the bathwater of a restriction to open source, and usually NoSQL, databases. 

Kubernetes Operators and Microservices 

Both the tools available for using databases with microservices, and the thinking that a developer can draw on when considering their options, are evolving. In the area of tools, Kubernetes has developed stateful services support. Additions such as PersistentVolume, PersistentVolumeClaim, and StatefulSet make these services workable. The emergence of Operators in the last few years makes it much, much easier to use Kubernetes for complex apps that include persistent data, as most apps do. (You can see the blog post that introduces, and explains, Operators here.) 

You can learn how to build an Operator from Red Hat’s OpenShift site.

As an example, MemSQL has used Kubernetes to build and manage its, well, managed service, MemSQL Helios. After earlier attempts to develop such a service ran into difficulties, Kubernetes, and the development of a MemSQL Kubernetes Operator by the team, made it possible for MemSQL to bring Helios to market with just a few months of work by a few individuals. With an elastic, on-demand, cloud database as the very definition of a stateful service, this is just one demonstration that the Kubernetes Operator, and Kubernetes as a whole, are fully ready for prime time. 

New Thinking About the State (of Data)

Some daring thinkers – in one particular case, at Red Hat – have focused on reminding their fellow developers of some of the long-taken-for-granted advantages of a single, central, relational database: ACID transactions, one place to find and update data, a single thing to manage, and a long history of research and development. The authors then go on to develop a primer on how best to share a relational data store among multiple services. In their sample, they use MySQL as the relational data store. 

One microservices maven, Chris Richardson, offers both options. He gives robust descriptions of the use of both a database-per-service approach and a shared database approach in microservices development.  

Microservices maven Chris Richardson describes varied
approaches to database access in microservices apps. 

But Red Hat’s reference to MySQL, as a venerable and widely used relational database, incidentally highlights one of the primary objections to the use of legacy relational databases for microservices development: their lack of scalability. Scalability is one of the chief attributes of a microservices architecture, perhaps the single most important one. It’s so important that many microservices developers restrict themselves to NoSQL architectures, which assume scalability as an attribute, simply to avoid having to deal with artificial constraints on things like database size or transaction volume. 

A Modest Proposal (for Microservices Data) 

We would like to suggest here that MemSQL is a solid candidate for use as a shared relational database for microservices applications. This choice is not restrictive; specific services can still use local databases, and those can be of any type needed. But for complex operations such as transactions, and even for many incidental operations such as logging users, a relational database that can be shared or sectioned as needed, used on a database-per-service basis when that’s required, and that works well alongside NoSQL data stores can be a valuable asset. 
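
As a sketch of what “shared or sectioned” can look like in practice (the service names and schemas below are invented for illustration), each microservice can own its own database within one MemSQL cluster, while a reporting service joins across them with plain SQL:

CREATE DATABASE orders_service;
CREATE DATABASE accounts_service;

CREATE TABLE orders_service.orders (
  order_id BIGINT PRIMARY KEY,
  account_id BIGINT,
  total DECIMAL(10,2),
  created_at DATETIME
);

CREATE TABLE accounts_service.accounts (
  account_id BIGINT PRIMARY KEY,
  email VARCHAR(255)
);

-- A reporting query that spans two services' data without an ETL hop:
SELECT a.email, SUM(o.total) AS lifetime_value
FROM orders_service.orders o
JOIN accounts_service.accounts a ON a.account_id = o.account_id
GROUP BY a.email;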

NewSQL databases in general, and MemSQL in particular, have the attributes needed to serve this role, including: 

  • Speed. MemSQL is very fast, for ingest, processing, and transaction responsiveness. 
  • Scalability. MemSQL retains its speed across arbitrarily large data sizes and concurrency demands, such as application queries. 
  • SQL support. Not only is the SQL standard ubiquitous, and therefore convenient, it’s also been long optimized for both speed and reliability. 
  • Multiple data types. MemSQL supports an unusually wide range of data types for a relational, SQL database: relational data, JSON data, time series data, and geospatial data. It can also ingest the AVRO format typically used with Kafka, and it offers full-text search on data. 
  • Transactions plus analytics. MemSQL supports both transactions and analytics in a single database; you simply create rowstore tables for some data and columnstore tables for others. Also, with MemSQL 7.0 having reached GA last December, you can now use SingleStore features to depend more often on just one table type or the other. 
  • Cloud-native. MemSQL is truly cloud-native software that runs unchanged on-premises, on public cloud providers, in virtual machines, in containers, and anyplace you can run Linux. MemSQL Helios, which is itself built on Kubernetes, offers a managed service, reducing operational cost and complexity. 

While MemSQL has other features that are beneficial in any context, these are the key features that stand out the most in a microservices environment. 

Getting Started

MemSQL offers free use of the software for an unlimited time, within a generous footprint limit, and with community support rather than paid support. You can typically run an entire proof of concept on MemSQL before you are likely to need to scale up enough to purchase an Enterprise license. So we suggest that you try MemSQL for free, and get started with your next big project, today.

Webinar Recap, Part 1: Building Fast Distributed Synchronous Replication at MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

This is the first part of a two-part blog post; part two is here. The recent release of MemSQL 7.0 has fast replication as one of its major features. With this release, MemSQL offers high-throughput, synchronous replication that, in most cases, only slows MemSQL’s very fast performance by about 10%, compared with asynchronous replication. This is achieved in a very high-performing, distributed, relational database. In this talk, available on YouTube, Rodrigo Gomes describes the high points as to how MemSQL achieved these results. 

Rodrigo Gomes is a senior engineer in MemSQL’s San Francisco office, specializing in distributed systems, transaction processing, and replication. In this first part of the talk (part two is here), Rodrigo describes MemSQL’s underlying architecture, then starts his discussion of replication in MemSQL. 

Introduction to MemSQL Clusters

Okay. So before I start, I’m just going to give some context and a quick intro to MemSQL. This is a rough sketch of what a MemSQL cluster looks like. 

MemSQL clusters are made up of aggregators and leaves.

I’m not going to go into all of this. We’re going to explore a very small part of this; specifically, replication. 

Sync replication is focused on keeping a copy of each partition.

But to motivate why we need it, basically we have a database here, and that database is sharded into multiple partitions, and each partition has redundancy. And we do that so that we can provide durability and availability to our partitions. So if there are failures, you don’t lose your data; your transactions don’t go away. And today I’m going to talk about redundancy and replication within a single shard (of the database – Ed.).

Focusing on Transactions

So a quick intro: I’m going to talk about transactions. Transactions are what your application uses to talk with a database. So if you build an application, what it’s doing is sending read transactions to the database to see what the status is, and also updates, writes, or deletes to make modifications to that state.

And you want a bunch of very nice properties from transactions so that writing applications is easy. For today, the one we care about is that transactions are durable. What that means is that when the system tells you your transaction has committed, if you wrote something to the database – if you made a change to the state – and something fails, that change is not magically going to go away. 

Durability is the ACID property addressed by synchronous replication.

And the way you maintain durability, in most systems, is by doing two things: 

  • Maintaining a transaction log
  • Replication

The first thing most systems do is they have a transaction log. What the transaction log allows you to do is persist to disk the binary format of your transactions. So you can imagine that you’re making changes to the database. You say, “Write the number 10 on my database.” And the way it goes about doing this is it first will persist that to disk before telling you that it’s committed or necessarily even making those changes visible.

The transaction log persists the binary format of transactions.

And when the disk tells you that the number 10 is going to be committed, you tell the user that it’s committed. This is an oversimplification, and a lot of this presentation is going to have oversimplifications, because replication, transactions, and durability is a fairly hard topic. I’ll try to note what oversimplifications I use, so that if you have any questions during the coffee break, I am very happy to talk about it. 

But basically, transaction logging is what you use so that if you crash when you restart, you have a log of your transactions, and the user never hears that a transaction is committed before it’s persisted to the log. Some systems actually write the “how to undo the transaction” part first. 

MemSQL has one simplifying factor, which is all the mutable state actually exists in memory, so you never need to undo anything. Systems that write to disk need to undo the changes because they might crash in the middle of doing them, but in memory we just lose everything and we just replay the redo log. So the problem is just how to apply them.

Replication and MemSQL 7.0

The other way we maintain durability is with replication. This doesn’t give you just durability, this also gives you availability. The idea is that you never really want your data to live only on one machine. You want it to live on multiple machines, because you might lose one machine – and, even if it’s temporary, then your system is completely offline. So if it lives on some other machine, you can make it the new go-to machine for this piece of data. 

Also, you might lose one machine forever. Imagine that – I don’t know – one of the racks in your data center suddenly went into a volcano, of its own will, and now it’s not coming back. But at least you always have another machine – hopefully, on another rack – that has all your data, and you just keep going. 

MemSQL uses primary-secondary replication.
The type of replication I’m going to be talking about today is primary-secondary. (There are others, consensus being one of the other popular ones.) The idea with primary-secondary replication is that there’s one node that is the primary, and that’s what the user interacts with. 

The user doesn’t necessarily interact with the primary directly. For example, in a MemSQL cluster we have an indirection layer, because the data is sharded, so we have one node that’s responsible for knowing where things go. But for this presentation, you can assume that the user is interacting directly.

In MemSQL, those would be the nodes we call aggregators, but for our purposes here they act as the users of this copy. And then the primary node sends the transaction log over the network to the other nodes.

There are other kinds of replication. You can do statement replication, where you send along the statements the user sends you – which is still a kind of transaction log – but we use physical replication and physical durability, which means that we actually persist all the effects of a transaction onto disk.

That allows us to do some more interesting things. With statements, the order of the statements matters a lot, whereas with the physical changes a transaction makes, the order doesn’t necessarily matter as much. So we just persist all the effects and send them over the network, such that the secondary always has a logically equivalent copy.

So I’ve been working on this for longer than I care to admit – or than anyone else should know, outside of MemSQL – but we just shipped a revamped replication system in MemSQL 7. I’m not going to describe everything we’ve done; there’s something like 50,000 lines of code there, and there’s not enough time in this day to describe everything that goes into it. But I will go through how you would build a replication system of your own, and I’m also going to go into some detail on one of the optimizations we made to make our new replication system very fast.

MemSQL 7.0 has a revamped system for synchronous replication.

Again, be warned there’s going to be a lot of oversimplification. I’m going to gloss over things like failures. I’m pretty sure about two-thirds of those lines are failure handling, so that things don’t blow up, but let’s go into it. 

Conclusion – Part 1

This is the end of Part 1 of this webinar recap. You can see Part 2; view the talk described in this blog post; read a detailed description of fast, synchronous replication in MemSQL 7.0 in another technical blog post; and read this description of a technically oriented webinar on MemSQL 7.0. If you want to experience MemSQL yourself, please try MemSQL for free or contact our sales team.

Webinar Recap, Part 2: Building Fast Distributed Synchronous Replication at MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

This is the second part of a two-part blog post; part one is here. The recent release of MemSQL 7.0 has fast replication as one of its major features. With this release, MemSQL offers high-throughput, synchronous replication that, in most cases, only slows MemSQL’s very fast performance by about 10%, compared with asynchronous replication. This is achieved in a very high-performing, distributed, relational database. In this talk, available on YouTube, Rodrigo Gomes describes the high points as to how MemSQL achieved these results. 

Rodrigo Gomes is a senior engineer in MemSQL’s San Francisco office, specializing in distributed systems, transaction processing, and replication. In this second part of the talk (part one is here), Rodrigo looks at alternatives for replication, then describes how MemSQL carries it out. 

Considering Replication Alternatives

First, we should define what the goals are. What are the objectives? We have a primary and secondary, as before, two nodes – and we want the secondary to be a logically equivalent copy of the primary. That just means that if I point my workload at the secondary, I should get the same responses as I would on the primary. 

What is highly desirable is performance. You don’t want replication to take a very long time out of your system, and you don’t really want to under-utilize any resources – resource utilization goes hand in hand with performance.

The first effort for synchronous replication falls short.

At some point you’re going to be bottlenecked on the pipe of the network or the pipe to disk, whichever one is smaller, and if you are under-utilizing those pipes, that means you leave performance on the table, and you can get more. 

So how would one go about doing this? Here’s what a naive solution would look like. 

Now the idea is that when your transaction finishes, before you make its effects visible, you go through a few steps, described next. This is somewhat based on what MemSQL does, so it’s not necessarily the only way to do this, but it is a useful abstraction of how one would go about doing it.

So you serialize your transaction into a binary format, you write it to disk, you send it to the secondaries, and you declare it committed. On the secondary, you’re just receiving from the primary, also persisting to disk – because you should have your transaction log there too – and then applying it on the secondary as well. Does anyone know what is wrong with this?

Audience: You didn’t wait for the secondary to get back to you.

Exactly. So does anyone know why that is a problem? The answer was, you didn’t wait for the secondary to get back to you. So it might not be completely obvious, because you might think that the sends are blocking, but most systems would not do that. They would just say, I’m going to send and forget; right, it’s completely asynchronous. But there we’re not waiting for the secondary to tell us it’s received the transaction. What could go wrong with that?

Audience: Because then, if your first one dies, you’ll have corrupted data.

Yep. So here’s a helpful diagram I drew in my notebook earlier. 

How the naive solution #1 for synchronous replication fails.

So the user sends a transaction, the primary sends it to the secondary, but it just kind of gets lost halfway. The primary says, okay, transaction has committed to the user – and then it goes offline. We promote the secondary to primary so that the workload can keep going. The user asks about its transaction and the secondary just says, “I don’t know, never happened.” 

All right, so the way we fix it is we wait for acknowledgements. The secondary also has to send it back. Here’s a description of that. 

Naive solution #2 for synchronous replication also falls short.
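
As a rough sketch of this second attempt – again, not MemSQL’s actual wire protocol, just an invented framing where the secondary replies with a one-byte acknowledgement once it has persisted the record – the primary now blocks on that acknowledgement before reporting the commit:

    import os
    import socket
    import struct

    def replicate_and_commit(log_file, secondary: socket.socket, payload: bytes) -> None:
        """Naive solution #2: persist locally, ship the record to the secondary,
        and wait for its acknowledgement before telling the user it committed."""
        record = struct.pack(">I", len(payload)) + payload

        log_file.write(record)            # persist to the local transaction log
        log_file.flush()
        os.fsync(log_file.fileno())

        secondary.sendall(record)         # ship the same record to the secondary
        ack = secondary.recv(1)           # block until the secondary says it has persisted it
        if ack != b"\x01":
            raise RuntimeError("secondary did not acknowledge the transaction")

        print("committed")                # only now does the user hear "committed"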

Now what is wrong with this one? This one is a little subtler.

Audience: This is a typical two-phase commit approach.

Right. This is a typical two-phase commit, but there is still something incorrect here.

Audience: Now you have a problem with waiting for it – latency.

So that is a great point. We can wait forever here. 

The second naive solution for synchronous replication fails due to latency concerns.

For this presentation, we’re not going to fix that entirely. What we care about is that there’s a secondary that’s eligible for promotion, one that has all the data. But if there is a failure in the secondary, you are never going to get an ACK, because it’s gone.

Audience: Can’t you just get the ACK after the transaction is stored?

Well, but it’s stored in one place, you want it to be stored in two places, right?

Audience: Can’t you go from the transaction log and still load the disk, even if it fails?

Well, the log is on disk still, but we’re saying we want it on some different machine – because if your primary machine fails, you don’t want the workload to die, and you don’t want to lose data if your primary machine fails forever. 

What you could do for this is you can have some other system that notices the failure in the secondary – and, in a way, demotes it from being a promotable secondary, and that way you don’t need to wait for ACKs from that secondary anymore. You would lose one copy of the data, but the same system can try to guarantee there is a new copy before it allows that to happen. 

I’m always talking about a single secondary here, but you could also have multiple copies ready. You don’t necessarily have to have just one other copy – and, in a way, that allows you to stay available. If you have something detecting the failure that says, “I don’t care about that secondary anymore, I have a copy somewhere else,” you just keep going.

What else is wrong with this?

Audience: It looks slow. Can’t we send ACK right after sending to disk?

That’s a good observation.

Audience: Not before?

Before sending to disk, we can’t, because then if the secondary dies right there, your data is gone. We could do it right after sending to disk if all you care about is that the data is persisted, but not necessarily visible – and in this case, we kind of care. 

But that means when you promote the secondary, you have to add an extra step, waiting for the data to become visible so that you can use it. Otherwise you would get the same kind of problem. But at that point the data is persisted, so it will eventually become visible.

Audience: Can you send to the secondaries first? I know that’s strange.

There is nothing that stops you from doing that.

Audience: Then you can have a protection on the other side when the first one fails.

But it’s equivalent to having the transaction on the first one and then the other one failing. Right? You could do that. You could also just pretend the disk is one of your secondaries, and basically pretend it’s like a special machine, and now you just treat it the same way you would treat all other secondaries. 

It’s actually a very useful, simplifying way for coding it, because then you don’t have to replicate a lot of logic. But disks are special; that’s one of the things I’m going to oversimplify here. We’re not going into how disks are special, because it’s awful. They tell you they’ve persisted their data; they haven’t. They tell you they’ve persisted their data – you write something to them, they overwrite your data. You have to be very careful when dealing with disks. 

There’s something else wrong here – or not so much wrong as weird, because it makes things very, very hard to code. So imagine that you have many concurrent transactions running at the same time. How could that break things?

Audience: Are the transactions getting read from the secondary, or they all are always persisted by the primary?

Let’s assume they’re always processed by the primary. 

So if you have multiple transactions going at the same time, they can reorder when they send to disk, and when they send to the secondaries.

Audience: Because if the first line takes too much, then the write …

So you always have the guarantee that the secondary has all the data, but the transaction log is not necessarily the same. This is not necessarily incorrect, because the main goal we stated is that they are logically equivalent, and these two are logically equivalent. They have the same set of transactions in their transaction logs, but things become really tricky if you need to catch up a secondary. 

So imagine that the primary fails, and you failover to a secondary, and maybe there’s another secondary, and now this secondary also fails, and now you have three different transaction logs. This secondary comes back up, and at this point, you don’t necessarily want to send everything to it from scratch, because that could take a long time. 

The failures could be temporary, if your restarts are quick, or it could be a network issue that caused you to detect a failure – whereas the node was still online, and maybe the network issue just lasted a second.

And now, if you have a lot of data, it could take maybe an hour or more to provision this secondary from scratch. But you know that the secondary at some point was synchronized with you – it had the same transactions; you just want to send the missing transactions. 

And now the transaction logs are different, though, so you don’t necessarily know what it’s missing compared to you. If the transaction logs were the same, then you know that there is an offset before which everything is the same. 

The first reasonable solution gets closer to supporting synchronous replication.

That offset is actually not too complicated to compute when the logs are the same: if something is committed – if you’ve got an ACK for that transaction – then everything before that transaction is the same on every node, as long as the transaction logs are in the same order.

And you can basically send that offset from the primary to the secondary, saying, okay, everything before this has been committed. You can either do that periodically or, as in MemSQL, kind of lazily: when we need to write something new, we just send that value if it’s changed. You can then use that value as the catch-up offset, if everything is the same. If everything is not the same, then you may have a different set of transactions before that offset on each node, and the offset just doesn’t make sense anymore. So how would we go about fixing this?

Audience: Okay. Can you just send the last version of the transaction log from the first one to the second one? The second one just grabs that and applies it?

Not sure I follow.

Audience: Imagine you said you have the transactions happening on one, and you send that transaction, together with data, to the second one. The second one grabs that data and persists both the transaction and the data itself.

Right, but-

Audience: What if you sent the transaction log already computed – the transaction log – and the second one just… So you always send a valid copy of the transaction log.

Oh, so I guess the idea is that, instead of sending when a transaction commits, you just send the log as you write to it. So you could do that. That becomes very slow, because you basically have to wait for the log to be written up to where you wrote to it. And that’s not necessarily going to be fast. 

You can imagine you have a transaction that’s going to take an hour to write to the log, and then another transaction that’s going to write to the log next, and that’s going to be a bit slow. In a way, it’s a reasonable solution; it’s sort of what this is doing. 

Basically, put a lock in there so that you send to disk and to secondaries at the same time, and you’re just going to send as you write to the log. The difference is the transactions can write to disk in that solution, all at the same time, and then only later does it send to the secondaries. But you still have to wait for that to happen.

The first reasonable solution for sync replication can still be slow.

So I kind of spoiled it. One of the big issues here is that it’s going to be pretty slow. I drew a fun graphic here. You can imagine you have a bunch of transactions all doing work, all making progress at the same time, and then they’re all going to compete on the same lock. This is not necessarily exactly how it happens, but if you read the steps as written, you’re going to send to disk first, and then send to the secondaries.

So while you’re waiting on the network, your disk is idle – it’s not doing anything. While you’re waiting on disk, the network is not doing anything. You could do these in parallel, so you could push one back, but there’s always going to be one that’s slower, either disk or network; and while one is executing, the other one is waiting for it.

That can be a pretty big performance bottleneck and that was actually something that could sort of happen on older MemSQL versions. So can anyone tell me what they would do to fix this? To avoid this bottleneck?

Audience: Generate, like, sequence numbers, since it’s the primary that processes all and the transactions and generates the sequence number for each one. Then send without logging to the secondaries, and the secondaries need to always know what was the last one that they wrote, and check if what they receive is the next one. Or if they should wait for… If they are receiving what was one, they receive three, so they know that two should arrive.

Audience: So basically moving the lock to the secondary. It’s the same performance?

Not quite the same performance.

Audience: The way it writes on the log.

So the suggested idea is, each transaction has a sequence number, and you write them in that order to the log, but you have to do it locally as well – which can also become a bottleneck, but you don’t need to lock around the sending over the network. And the secondary just has to make sure that it writes them in the same order. 

That has the problem that it would still kind of lock on the disk, right? So you’d still all be waiting for this; each transaction has to wait on every other transaction to write to disk before it can send. So I think it still isn’t ideal, but it’s close, it’s very close.

Audience: What about the difference between that and also using a random generated number. So something like three options that just seem like the random generator and ID like in one, adjusting the one the three, and then two. You know already they are representing the same transaction but… just get the order correctly at the end on the second node.

I’m not sure I follow. So the idea is to add the random number generator-

Audience: To represent the transaction-

So each transaction has a unique ID basically.

Audience: Basically, but the transactions also could be represented by, let’s say, the three actions, like write, write, write. So you don’t have to wait on the disk; you persist with the the serial… but an order could be 3-1-2 for instance. But when you’re collecting information from the transaction log, we correct… The persistence on this should be…

So you persisted out of order but then you reorder them at the end?

Audience: Yeah.

That has the same issue as before, where you don’t necessarily have an offset to send things after. The problem from before was that computing where to start catching up, in a new secondary, is very hard, if things are ordered differently on different nodes. I think that would still have the same problem.

Audience: Is this solution like sacrificing the secondary start-up time?

No, you shouldn’t have to sacrifice the start up time. I’m going to count to five to see if anyone has any other ideas. One –

Audience: I think you can afford to have some rollback when you have conflicts, and if you have at least one piece that provides you the commits, to have some kind of eventual consistency. On top of that, you can have different groups of primary backups. And you can at least attenuate or diminish a little bit that lock. It won’t prevent it, but you’ll at least reduce it. So, different groups of primary backups, according to your workload? You could do some kind of sharding also.

You could do sharding, and it does help, because now you have some parallelism for different locks, but we still want it to be faster per shard. 

How the Revamped Replication System in MemSQL 7.0 Works

So MemSQL does do sharding, and there were some debates we had over whether it’s a good idea to make replication much faster, because you still have parallelism across shards. We found that it was still a very good idea to do it, to make it really fast. All right, one, two, three, four, five. Okay.

Out of order replication solves the synchronous replication problem for MemSQL.

It’s very similar to the sequence-number ideas suggested earlier. This is one of the optimizations we implemented in Replog; there are a few others, but this is the one I’m going to go into in more detail here. You still allow sending to secondaries out of order, but you don’t just send the transaction content – you also send the offset in the log where it’s written. And here’s the way we do that.

So you have to be very careful when you do that, because you can’t just write to the disk – otherwise, if you have two concurrent writes, they might write to the same offset. What we do is keep a counter for each shard that says where we’re at on the disk. Every time we want to write a new transaction to the log, we increment that counter by the size of the transaction, and the counter’s value before the increment gives us our offset. You can do the incrementing in an atomic manner without using locks.

The hardware is sort of using locks, but it’s pretty fast. Then you have your offset, and you know that no other thread, no other transaction, is going to touch your reserved portion of the disk, so when you write to disk, there are no locks there. The socket still sort of has a lock – it doesn’t even need to be an explicit lock; when you call send on the socket, it goes to the OS, and the OS just sends everything serially. But at that point you just send the offset and the transaction log content, and you’re able to effectively max out at least the network pipe.
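
Here is a minimal sketch of that offset-reservation idea in Python. It is only an illustration of the technique, not MemSQL’s code: a per-shard counter is advanced by each transaction’s record size (a Python lock stands in for the hardware fetch-and-add), and the reserved offset is then used for a positional write on a Unix-like system, so concurrent transactions never serialize around the disk write itself.

    import os
    import threading

    class LogOffsetAllocator:
        """Per-shard offset reservation: each transaction atomically advances a
        counter by its record size, reserving a disjoint region of the log file."""

        def __init__(self, start: int = 0):
            self._next = start
            self._guard = threading.Lock()   # stand-in for an atomic fetch-and-add

        def reserve(self, size: int) -> int:
            with self._guard:
                offset = self._next          # the pre-increment value is our offset
                self._next += size
            return offset

    def append_record(log_fd: int, allocator: LogOffsetAllocator, record: bytes) -> int:
        """Write a record into its reserved slot; concurrent callers never overlap."""
        offset = allocator.reserve(len(record))
        os.pwrite(log_fd, record, offset)    # positional write: no shared file cursor, no lock
        return offset                        # this offset is also sent to the secondaries

On the wire, the primary would then send each (offset, record) pair, so a secondary can place the record at the same position in its own log even when records arrive out of order.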

Audience: Don’t you have an empty disk space at this portion if you reserve the offset, but then it fails?

That is a great observation. Yes. So one of the great challenges of doing something like this is that, if something fails, you have portions of the disk that are empty, and now you have to be able to distinguish those from non-empty portions. And you also have to deal with those when trying to figure out where to do the catch-up. 

For the catch-up part, it’s not actually very hard; we do basically the same thing as before. We have something in the background in our replication system that maintains an underestimate: everything below this offset is already committed. So you know that everything below that offset doesn’t have any gaps, and it is safe to use that offset to start catching up.

Above that offset, the challenging part is this: suppose that your primary fails, and you have a secondary that has a hole – you have to know how to identify that hole. What we do – and actually there’s a lot more that we do that’s not covered here – is keep a checksum for each transaction block. We just read the entire log and check each block against its checksum, which lets us distinguish between something that was actually written and something that’s a hole – or, as we call them, torn pages – and that allows you to skip it. And we can basically overwrite other secondaries’ logs with that hole, and they know to skip it as well, if they need to recover.
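
A hedged sketch of that torn-page check, under an assumed record layout of a 4-byte length plus a 4-byte CRC32 in front of each payload: scan the log, accept records whose checksum matches, and stop at the first block that looks like a hole. The real system does more – for example, propagating and skipping known holes – which is glossed over here.

    import struct
    import zlib

    HEADER = struct.Struct(">II")   # assumed record header: payload length, then CRC32

    def scan_log(data: bytes):
        """Yield (offset, payload) for each record whose checksum verifies,
        stopping at the first torn page or never-written gap."""
        pos = 0
        while pos + HEADER.size <= len(data):
            length, crc = HEADER.unpack_from(data, pos)
            if length == 0:
                break                # an all-zero region: treat it as unwritten space
            start = pos + HEADER.size
            payload = data[start:start + length]
            if len(payload) < length or zlib.crc32(payload) != crc:
                break                # checksum mismatch: a hole / torn page, not a real record
            yield pos, payload
            pos = start + length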

Audience: If you’re applying the log right after writing the log to disk – if you have a gap, you’re basically applying the transaction in the wrong order.

Not necessarily. That’s actually a great question.

Audience: Unless they are commutative.

Right. So the question is, if you have a gap – and actually, if you’re just replicating out of order and applying data right when you receive it – you might be applying things out of order. And that is true: you’re not necessarily applying things in the same order as on your primary. But they are necessarily going to be commutative, because a transaction can’t depend on another transaction that hasn’t committed yet.

If you have a hole, that means that everything in that hole or that gap is not committed yet necessarily, because you haven’t acknowledged it. So if you’re going to apply something after the gap that you’ve already received, that means that the two transactions are running concurrently, and are committing at the same time. And if that is true, then they will not conflict, because they both validated their commits at the same time.

Audience: Oh, when you reach this point, you already validated… so they are commutative.

So in MemSQL, we actually use pessimistic concurrency control, which means that they just have locks on everything they are going to write, which is how we can guarantee at that point they’re not going to conflict. You could have other mechanisms, with optimistic concurrency control, where you just scan your write set. If we did that at that point, we would already have validated.

Audience: You forgot to mention what is the isolation level provided by MemSQL.

So today it is pretty bad; it’s read committed. The read committed isolation level just guarantees that you’ll never read data that was not committed, but it’s pretty weak, because you might read something in the middle of two transactions – for example, you might see half of one transaction and half of another. That’s actually what I’m going to be working on at MemSQL for the next few months, increasing the isolation levels we support. So pretty exciting stuff.

The main reason we don’t have higher isolation levels right now is that, due to the distributed nature of the system, everything is sharded. Right now, we don’t have a way to tell, across two shards of the data, that two rows were committed in the same transaction. Within one shard, we actually do support higher isolation levels: you can do snapshot reads, which guarantee that you never see half a transaction. And it’s fairly easy to build higher isolation levels, such as snapshot isolation, on top of that within a shard.

But we are working, for the next major release of MemSQL in a few months, to have higher isolation levels, hopefully. We are all engineers here, we all know how planning can go. How am I doing on time?

Audience: Good. We are okay.

So does anyone have questions about this?

Audience: The only kind of weird question I have is it feels weird to not write to disk continuously. I’m not sure if that’s actually a problem.

When you say continuously, you mean like always appending, right?

Audience: Yes, always appending, because it’s weird to go back in offsets and writing to disk.

It may feel weird, but it’s actually not a problem, because you still kind of write continuously when you reserve the offset. One way to think about it is that when you’re appending to the end of a file, you’re actually doing two operations: you’re moving the size up, so there’s a metadata change on the file system, and you’re also writing the pages to disk. What we’re doing is moving the size change into the transaction. So these two operations together are roughly equivalent to writing at the end of the file in the file system, because it’s still two operations.

That’s actually a tricky thing, because if you’re doing synchronous writes to a file, and your file is not a fixed size, and you’re moving the size up, your writes might go through and tell you they’re persisted, but the metadata change could be independent. And so you have to either pay a very high cost to make sure that the metadata change goes through, and it’s persisted, or you can pre-size files, which is what we do. 

But that has the separate cost that now you have to keep track of the end of your log separately, and manage that yourself – so you still need that metadata somewhere. We just thought we could do better. I think we do better.
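
A small sketch of that pre-sizing trick, assuming a Unix-like platform (os.ftruncate and os.pwrite): the log file is created at its full size once, so later writes land inside the existing file and never change the file-size metadata. The file name and size here are made up for illustration.

    import os

    LOG_PATH = "shard_0.log"         # hypothetical log file name
    LOG_SIZE = 64 * 1024 * 1024      # pre-size the log so writes never grow the file

    def create_presized_log(path: str, size: int) -> int:
        """Create the log at its full size up front, so later pwrite() calls touch
        only data pages and never the file-size metadata."""
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        os.ftruncate(fd, size)       # reserve the whole file once...
        os.fsync(fd)                 # ...and persist that one-time metadata change
        return fd

    fd = create_presized_log(LOG_PATH, LOG_SIZE)
    os.pwrite(fd, b"transaction record", 4096)   # lands inside the pre-sized file
    os.fsync(fd)                                  # only data pages need flushing now
    os.close(fd)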

Any other questions? Right. That is it for the presentation. We are hiring in Lisbon if anyone’s interested in working on interesting problems like this one. We also have other teams that work on different types of problems – like front-end, back-end systems management. And if you have any questions about this presentation, the work we do, or if you’re interested, we have a bunch of people from MemSQL here – right now, and later. Thank you.

Audience: Okay. Thank you Rodrigo.

Conclusion

This is the conclusion of the second, and final, part of this webinar recap. You can see part one of this talk; view the talk described in these two blog posts; read a detailed description of fast, synchronous replication in MemSQL 7.0 in another technical blog post; and read this description of a technically oriented webinar on MemSQL 7.0. If you want to experience MemSQL yourself, please try MemSQL for free or contact our sales team.

Capterra Captures Positive Reviews of MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

Capterra describes itself as “the leading online resource for business software buyers,” boasting more than one million verified software reviews across hundreds of software categories. Those one million reviews include many positive takes on MemSQL software, including this comment from a big data manager: “Best Distributed RDBMS Ever!” 

Capterra is growing fast, with more than five million monthly users, and the number of verified reviews on the site nearly doubling between 2018 and mid-2019. Capterra research asserts that reviews reduce purchase cycles by more than six months. More than four in ten small and medium business (SMB) users surveyed use reviews in their purchase process. 

And these users, by and large, love MemSQL. It’s a “great product” with “similarity to MySQL” and “more scalability.” It’s “very fast,” “easy to operate,” and “works perfect on Linux.” “If you are looking for something for real-time analytics/dashboard, this is the go-to option.” 

The Capterra site is focused on SMB users. For enterprise users, we have a roundup of comments and an update with newer comments from reviews site G2 Crowd, which is more focused on enterprise users. And we’ve captured highlights from reviews site Gartner Peer Insights, which also focuses on the enterprise. (Gartner owns both Gartner Peer Insights and Capterra, which it purchased for more than $200M in 2019.) 

Together, these review sites can give you a fair picture of MemSQL’s suitability for your needs – and, hopefully, shorten your purchase cycle, as Capterra does for many of its users. 

Most Helpful Reviews (Nearly) Say It All 

The most helpful reviews show at the top of the list for Capterra’s software reviews. Several of the most helpful reviews for MemSQL include an awful lot of the best features of MemSQL, as seen on Capterra, and across all three of the reviews sites we’ve described: 

  • “One solution for streaming analytics on big data,” says a senior manager for data engineering. He’s focused on machine learning and AI, and he describes the software as “super simple.” His shop runs “multi-petabyte S3” stores with “huge Kafka clusters.” They see “sub-second response on a test case with 1000 concurrent heavy API calls (scanning billions of rows).” MemSQL is “incredibly fast,” “fantastic partners” who offer “access to their core engineering team.” 
  • A director of IT infrastructure at a large oil and energy company has built a “simplified data lake system” around MemSQL. He sees “large amounts of IoT data (trillions of rows)” that “can be queried in milliseconds.” Processes that “took hours to run” are now “running sub-second on MemSQL.” The software offers “amazing performance” and is “highly and easily scalable.” 
  • A senior architect for IT and services at a large company calls MemSQL a “supersonic DB” that “aces every database” that he has worked with. MemSQL is “the database of the new generation” with “huge potential for both on-premises and cloud.” It features “high compatibility,” “resilience,” and “scalability.” MemSQL is “highly recommended to any organization wanting to get rid of old-fashioned databases.” 

Many of the comments offer real insight. One big data manager lists pros which include “JSON support and full-text search,” “drop-in replacement to the famous MySQL,” and “in-memory tables for high OLTP workloads and on-disk columnar storage for OLAP workloads.”

Users are able to “ingest millions of documents every day and run sophisticated dashboards against them.” They achieve a “huge performance win,” see MemSQL as “easy to connect to Kafka” and “easy to set up on Kubernetes.” MemSQL is a “great replacement for Hadoop for a fraction of the cost,” with aggregation times dropping from over 2 hours to less than 20 minutes. 

And “You can seamlessly join both row and columnar tables and query across it.”

A few more adjectives, from these and other reviews: “elegant”; “excellent”; “amazing”; “the go-to option” for real-time analytics and dashboards; “great support”; “blazing fast”; “good engineering principles”; “fast implementation” (in a one-day hackathon); “too easy to set up.”

A senior data engineering manager for AI offers insightful comments in a five-star review.

The Bottom Line

One user sums it up, with words that many users of other solutions would like to be able to say: “We are within our SLA.” 

If you use MemSQL already, consider posting a review today. Your efforts will benefit the community as a whole.

If you haven’t yet tried MemSQL, take a look at the reviews on the Capterra site. You can also post questions on the MemSQL Forums or try MemSQL for free today.

Dresner’s Wisdom of Crowds Report Puts MemSQL at the Top

Feed: MemSQL Blog.
Author: Floyd Smith.

A recent Dresner report, the Analytical Data Infrastructure Market Study, includes MemSQL for the first time. The report shows MemSQL as the leader among full-service database providers – ahead of Google, Amazon, Microsoft, IBM, SAP, and Oracle. (A few specialty analytics products also rank highly.) The Dresner report is part of their Wisdom of Crowds series, and rates MemSQL as an Overall Leader in both major categories of the report: Customer Experience and Vendor Credibility, with a perfect Recommend score. 

About the Dresner Study

Dresner Advisory Services is well-known for their insightful market studies on business intelligence (BI) and analytics. Howard Dresner, the firm’s founder, coined the term “business intelligence.” Their marquee report series, the Wisdom of Crowds, asks real users to rate BI and analytics products on key performance indicators.

The reports “contain feedback from real life implementers, IT directors, and contractors that actually are working with BI products in the field.” In these reports, Dresner asks actual users to rate their products on the key indicators that users themselves require. 

The Analytical Data Infrastructure report targets “technology components for integrating, modeling, managing, storing, and accessing the data sets that serve as sources for analytic/BI consumers, e.g., analytic/business applications, tools, and users.” The report focuses on databases, and MemSQL is one of more than a dozen vendors studied. MemSQL, as we’ll describe below, was the top-rated general purpose database (that is, not solely an analytics product). 

MemSQL is the top-rated general purpose database in the Dresner analysis

Use cases for the products studied include (from most-used to least-used):

  • Reporting and dashboards 
  • Discovery and exploration by business users
  • Data science, which includes the use of machine learning and AI for predictive and advanced analytics
  • Embedded analytics within business applications (high volume, low latency)

Users named performance, security, and scalability as their most-desired features. Performance was rated as especially desirable within the embedded analytics use case, where MemSQL shines, and in the largest companies – those with more than 10,000 employees (usually those with more than $1B in revenue). 

The top feature desired is scalability, which is core to MemSQL’s differentiation as a fast, scalable relational database with SQL support. Scalability is also strongly desired by tech companies, financial services, retail, and consumer services companies, as well as government. The largest and smallest companies are the most interested in scalability, while smaller and medium-sized companies put data life cycle management first. 

SQL support and columnar data support are seen as critical or very important by the most companies, especially for business uses. Among analytical features, aggregations lead the way, along with multi-dimensional/OLAP-type queries. MemSQL’s strong SQL support and ability to mix rowstore and columnstore data tables make it well-suited to meet these requirements. Also highly desired are user-defined functions and machine learning support, both areas where MemSQL stands out.

Dresner rates MemSQL highly for analytics

About MemSQL in This Report

MemSQL ranks very strongly on both major axes that Dresner measures for customer experience: product/technology, where MemSQL is in the top five, near Google, and sales and service, where MemSQL is in a near-tie for second with Snowflake, which is exclusively an analytics provider. Taking the two measures together, MemSQL is the leader among full-service databases (that handle both transactions and analytics), ahead of Google, Amazon, Microsoft, IBM, SAP, and Oracle. (Among these companies, only Google ranked above average on both measures; Oracle, IBM, and SAP ranked worst overall.) In the Vendor Credibility model, MemSQL ranked very high on user confidence, and above average for value. 

The Dresner spider chart shows MemSQL with high rankings

According to Dresner, MemSQL stands out as “an Overall Leader in both the Customer Experience and Vendor Credibility models.” Further, MemSQL is best in class for technical support professionalism, for responsiveness, and for consulting professionalism. MemSQL’s recommend score is perfect. 

Conclusion

Considered alongside industry leaders, MemSQL rates very highly – and is #1 among full-service database providers in both Customer Experience and Vendor Credibility. You can download the report for free. You can also try MemSQL for free today, or contact us for a personalized demo.

What’s After the MEAN Stack?

Feed: MemSQL Blog.
Author: Rob Richardson.

The MEAN stack – MongoDB, Express.js, Angular.js, and Node.js – has served as a pattern for a wide variety of web development. But times have changed, and the components of the MEAN stack have failed to keep up with the times. Let’s take a look at how the MEAN stack superseded the previous stack, the LAMP stack, and at the options developers have now for delivering efficient Web applications.

Introduction

We reach for software stacks to simplify the endless sea of choices. The MEAN stack is one such simplification that worked very well in its time. Though the MEAN stack was great for the last generation, we need more; in particular, more scalability. The components of the MEAN stack haven’t aged well, and our appetites for cloud-native infrastructure require a more mature approach. We need an updated, cloud-native stack that can boundlessly scale as much as our users expect to deliver superior experiences.

Stacks

When we look at software, we can easily get overwhelmed by the complexity of architectures or the variety of choices. Should I base my system on Python?  Or is Go a better choice? Should I use the same tools as last time? Or should I experiment with the latest hipster toolchain? These questions and more stymie both seasoned and newbie developers and architects.

Some patterns emerged early on that help developers quickly provision a web property to get started with known-good tools. One way to do this is to gather technologies that work well together in “stacks.” A “stack” is not a prescriptive validation metric, but rather a guideline for choosing and integrating components of a web property. The stack often identifies the OS, the database, the web server, and the server-side programming language.

In the earliest days, the famous stacks were the “LAMP-stack” and the “Microsoft-stack”. The LAMP stack represents Linux, Apache, MySQL, and PHP or Python. LAMP is an acronym of these product names. All the components of the LAMP stack are open source (though some of the technologies have commercial versions), so one can use them completely for free. The only direct cost to the developer is the time to build the experiment.

The “Microsoft stack” includes Windows Server, SQL Server, IIS (Internet Information Services), and ASP (90s) or ASP.NET (2000s+). All these products are tested and sold together. 

Stacks such as these help us get started quickly. They liberate us from decision fatigue, so we can focus instead on the dreams of our start-up, or the business problems before us, or the delivery needs of internal and external stakeholders. We choose a stack, such as LAMP or the Microsoft stack, to save time.

In each of these two example legacy stacks, we’re producing web properties. So no matter what programming language we choose, the end result of a browser’s web request is HTML, JavaScript, and CSS delivered to the browser. HTML provides the content, CSS makes it pretty, and in the early days, JavaScript was the quick form-validation experience. On the server, we use the programming language to combine HTML templates with business data to produce rendered HTML delivered to the browser. 

We can think of this much like mail merge: take a Word document with replaceable fields like first and last name, add an Excel file with columns for each field, and the engine produces a file for each row in the sheet.

As browsers evolved and JavaScript engines were tuned, JavaScript became powerful enough to make real-time, thick-client interfaces in the browser. Early examples of this kind of web application are Facebook and Google Maps. 

These immersive experiences don’t require navigating to a fresh page on every button click. Instead, we could dynamically update the app as other users created content, or when the user clicks buttons in the browser. With these new capabilities, a new stack was born: the MEAN stack.

What is the MEAN Stack?

The MEAN stack was the first stack to acknowledge the browser-based thick client. Applications built on the MEAN stack primarily have user experience elements built in JavaScript and running continuously in the browser. We can navigate the experiences by opening and closing items, or by swiping or drilling into things. The old full-page refresh is gone.

The MEAN stack includes MongoDB, Express.js, Angular.js, and Node.js. MEAN is the acronym of these products. The back-end application uses MongoDB to store its data as binary-encoded JavaScript Object Notation (JSON) documents. Node.js is the JavaScript runtime environment, allowing you to do backend, as well as frontend, programming in JavaScript. Express.js is the back-end web application framework running on top of Node.js. And Angular.js is the front-end web application framework, running your JavaScript code in the user’s browser. This allows your application UI to be fully dynamic. 

Unlike previous stacks, both the programming language and operating system aren’t specified, and for the first time, both the server framework and browser-based client framework are specified.

In the MEAN stack, MongoDB is the data store. MongoDB is a NoSQL database, making a stark departure from the SQL-based systems in previous stacks. With a document database, there are no joins, no schema, no ACID compliance, and no transactions. What document databases offer is the ability to store data as JSON, which easily serializes from the business objects already used in the application. We no longer have to dissect the JSON objects into third normal form to persist the data, nor collect and rehydrate the objects from disparate tables to reproduce the view. 

The MEAN stack webserver is Node.js, a thin wrapper around Chrome’s V8 JavaScript engine that adds TCP sockets and file I/O. Unlike previous generations’ web servers, Node.js was designed in the age of multi-core processors and millions of requests. As a result, Node.js is asynchronous to a fault, easily handling intense, I/O-bound workloads. The programming API is a simple wrapper around a TCP socket. 

In the MEAN stack, JavaScript is the name of the game. Express.js is the server-side framework offering an MVC-like experience in JavaScript. Angular (now known as Angular.js or Angular 1) allows for simple data binding to HTML snippets. With JavaScript both on the server and on the client, there is less context switching when building features. Though the specific features of Express.js’s and Angular.js’s frameworks are quite different, one can be productive in each with little cross-training, and there are some ways to share code between the systems.

The MEAN stack rallied a web generation of start-ups and hobbyists. Since all the products are free and open-source, one can get started for only the cost of one’s time. Since everything is based in JavaScript, there are fewer concepts to learn before one is productive. When the MEAN stack was introduced, these thick-client browser apps were fresh and new, and the back-end system was fast enough, for new applications, that database durability and database performance seemed less of a concern.

The Fall of the MEAN Stack

The MEAN stack was good for its time, but a lot has happened since. Here’s an overly brief history of the fall of the MEAN stack, one component at a time.

Mongo got a real bad rap for data durability. In one Mongo meme, it was suggested that Mongo might implement the PLEASE keyword to improve the likelihood that data would be persisted correctly and durably. (A quick squint, and you can imagine the XKCD comic about “sudo make me a sandwich.”) Mongo also lacks native SQL support, making data retrieval slower and less efficient. 

Express is aging, but is still the de facto standard for Node web apps and APIs. Many of the modern frameworks – both MVC-based and Sinatra-inspired – still build on top of Express. Express could do well to move from callbacks to promises, and to better handle async and await, but sadly, Express 5 alpha hasn’t moved in more than a year.

Angular.js (1.x) was rewritten from scratch as Angular (2+). Arguably, the two products are so dissimilar that they should have been named differently. In the confusion as the Angular reboot was taking shape, there was a very unfortunate presentation at an Angular conference. 

The talk was meant to be funny, but it was not taken that way. It showed headstones for many of the core Angular.js concepts, and sought to highlight how the presenters were designing a much easier system in the new Angular. 

Sadly, this message landed really wrong. Much like the community backlash to the changes to Visual Basic that the community termed “Visual Fred,” the community was outraged. The core tenets they trusted every day for building highly interactive and profitable apps were getting thrown away, and the new system wouldn’t be ready for a long time. Much of the community moved on to React, and now Angular is struggling to stay relevant. Arguably, Angular’s failure here was the biggest factor in React’s success – much more so than any React initiative or feature.

Nowadays many languages’ frameworks have caught up to the lean, multi-core experience pioneered in Node and Express. ASP.NET Core brings a similarly light-weight experience, and was built on top of libuv, the OS-agnostic socket framework, the same way Node was. Flask has brought light-weight web apps to Python. Ruby on Rails is one way to get started quickly. Spring Boot brought similar microservices concepts to Java. These back-end frameworks aren’t JavaScript, so there is more context switching, but their performance is no longer a barrier, and strongly-typed languages are becoming more in vogue.

As a further deterioration of the MEAN stack, there are now frameworks named “mean,” including mean.io, meanjs.org, and others. These products seek to capitalize on the popularity of the “mean” term. Sometimes they offer more options on top of the original MEAN products, sometimes scaffolding to get started faster, and sometimes they merely look to cash in on the SEO value of the term.

With MEAN losing its edge, many other stacks and methodologies have emerged.

The JAM Stack

The JAM stack is the next evolution of the MEAN stack. The JAM stack includes JavaScript, APIs, and Markup. In this stack, the back-end isn’t specified – neither the webserver, the back-end language, or the database.

In the JAM stack, we use JavaScript to build a thick client in the browser; it calls APIs and mashes the data with Markup – likely the same HTML templates we would build in the MEAN stack. The JavaScript frameworks have evolved as well. The new top contenders are React, Vue.js, and Angular, with additional players including Svelte, Aurelia, Ember, Meteor, and many others.

The frameworks have mostly standardized on common concepts like virtual dom, 1-way data binding, and web components. Each framework then combines these concepts with the opinions and styles of the author.

The JAM stack focuses exclusively on the thick-client browser environment, merely giving a nod to the APIs, as if magic happens behind there. This has given rise to backend-as-a-service products like Firebase, and API innovations beyond REST including gRPC and GraphQL. But, just as legacy stacks ignored the browser thick-client, the JAM stack marginalizes the backend, to our detriment.

Maturing Application Architecture

As the web and the cloud have matured, as system architects, we have also matured in our thoughts of how to design web properties.

As technology has progressed, we’ve gotten much better at building highly scalable systems. Microservices offer a much different application model where simple pieces are arranged into a mesh. Containers offer ephemeral hardware that’s easy to spin up and replace, leading to utility computing.

As consumers and business users of systems, we almost take for granted that a system will be always on and infinitely scalable. We don’t even consider the complexity of geo-replication of data or latency of trans-continental communication. If we need to wait more than a second or two, we move onto the next product or the next task.

With these maturing tastes, we now take for granted that an application can handle near infinite load without degradation to users, and that features can be upgraded and replaced without downtime. Imagine the absurdity if Google Maps went down every day at 10 pm so they could upgrade the system, or if Facebook went down if a million people or more posted at the same time.

We now take for granted that our applications can scale, and the naive LAMP and MEAN stacks are no longer relevant.

Characteristics of the Modern Stack

What does the modern stack look like? What are the elements of a modern system? I propose that a modern system is cloud-native, utility-billed, infinitely scalable, and low-latency; uses machine learning to keep results relevant to users; stores and processes disparate data types and sources; and delivers personalized results to each user. Let’s dig into these concepts.

A modern system allows boundless scale. As a business user, I can’t handle if my system gets slow when we add more users. If the site goes viral, it needs to continue serving requests, and if the site is seasonally slow, we need to turn down the spend to match revenue. Utility billing and cloud-native scale offers this opportunity. Mounds of hardware are available for us to scale into immediately upon request. If we design stateless, distributed systems, additional load doesn’t produce latency issues.

A modern system processes disparate data types and sources. Our systems produce logs of unstructured system behavior and failures. Events from sensors and user activity flood in as huge amounts of time-series events. Users produce transactions by placing orders or requesting services. And the product catalog or news feed is a library of documents that must be rendered completely and quickly. As users and stakeholders consume the system’s features, they don’t want or need to know how this data is stored or processed. They need only see that it’s available, searchable, and consumable.

A modern system produces relevant information. In the world of big data, and even bigger compute capacity, it’s our task to give users relevant information from all sources. Machine learning models can identify trends in data, suggesting related activities or purchases, delivering relevant, real-time results to users. Just as easily, these models can detect outlier activities that suggest fraud. As we gain trust in the insights gained from these real-time analytics, we can empower the machines to make decisions that deliver real business value to our organization.

MemSQL is the Modern Stack’s Database

Whether you choose to build your web properties in Java or C#, in Python or Go, in Ruby or JavaScript, you need a data store that can elastically and boundlessly scale with your application. One that solves the problems that Mongo ran into – that scales effortlessly, and that meets ACID guarantees for data durability. 

We also need a database that supports the SQL standard for data retrieval. This brings two benefits: a SQL database “plays well with others,” supporting the vast number of tools out there that interface to SQL, as well as the vast number of developers and sophisticated end users who know SQL code. The decades of work that have gone into honing the efficiency of SQL implementations is also worth tapping into. 

These requirements have called forth a new class of databases, which go by a variety of names; we will use the term NewSQL here. A NewSQL database is distributed, like Mongo, but meets ACID guarantees, providing durability, along with support for SQL. CockroachDB and Google Spanner are examples of NewSQL databases. 

We believe that MemSQL brings the best SQL, distributed, and cloud-native story to the table. At the core of MemSQL is the distributed database. In the database’s control plane are a master node and other aggregator nodes responsible for splitting queries across leaf nodes and combining the results into deterministic data sets. ACID-compliant transactions ensure each update is durably committed to the data partitions and available for subsequent requests. In-memory skiplists speed up seeking and querying data, and completely avoid data locks.

MemSQL Helios delivers the same boundless scale engine as a managed service in the cloud. No longer do you need to provision additional hardware or carve out VMs. Merely drag a slider up or down to ensure the capacity you need is available.

MemSQL is able to ingest data from Kafka streams, from S3 buckets of data stored in JSON, CSV, and other formats, and deliver the data into place without interrupting real-time analytical queries. Native transforms allow shelling out into any process to transform or augment the data, such as calling into a Spark ML model.

MemSQL stores relational data; stores document data in JSON columns; provides time-series windowing functions; and supports both super-fast in-memory rowstore tables, snapshotted to disk, and disk-based columnstore tables, heavily cached in memory.
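
As a small, hedged illustration of what that looks like from application code: because MemSQL speaks the MySQL wire protocol, an ordinary MySQL driver works. The table name, schema, and connection details below are invented for the example, and the DDL sticks to defaults rather than any specific rowstore or columnstore options.

    import json
    import pymysql  # MemSQL is MySQL wire-protocol compatible, so a MySQL driver connects

    # Placeholder connection details for illustration only.
    conn = pymysql.connect(host="memsql-host", user="app", password="secret", database="demo")

    with conn.cursor() as cur:
        # A hypothetical table mixing relational columns with a JSON document column.
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                event_time DATETIME NOT NULL,
                user_id BIGINT NOT NULL,
                payload JSON NOT NULL
            )
        """)
        cur.execute(
            "INSERT INTO events (event_time, user_id, payload) VALUES (NOW(), %s, %s)",
            (42, json.dumps({"action": "click", "page": "/pricing"})),
        )
        # Ordinary SQL over the same data: events per user in the last hour.
        cur.execute("""
            SELECT user_id, COUNT(*) AS events_last_hour
            FROM events
            WHERE event_time > NOW() - INTERVAL 1 HOUR
            GROUP BY user_id
        """)
        for user_id, n in cur.fetchall():
            print(user_id, n)

    conn.commit()
    conn.close()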

As we craft the modern app stack, include MemSQL as your durable, boundless cloud-native data store of choice.

Conclusion

Stacks have allowed us to simplify the sea of choices to a few packages known to work well together. The MEAN stack was one such toolchain that allowed developers to focus less on infrastructure choices and more on developing business value. 

Sadly, the MEAN stack hasn’t aged well. We’ve moved on to the JAM stack, but this ignores the back-end completely. 

As our tastes have matured, we assume more from our infrastructure. We need a cloud-native advocate that can boundlessly scale, as our users expect us to deliver superior experiences. Try MemSQL for free today, or contact us for a personalized demo.

There’s Life After Oracle: Finding Database Happiness 

Feed: MemSQL Blog.
Author: Peter Guagenti.

“Breakup Day” is February 21st. In the spirit of moving on from unhealthy relationships, we decided to write our very own “Dear John” letter to the vendor we hear our customers complain the most about… or in this case, a “Dear Larry” letter. We hope this gives anyone who’s feeling taken advantage of by their behemoth vendor the courage to sever ties. Cheers.

Dear Larry,

Let’s be honest – this relationship is not working. I’m ditching you for someone who treats me like I matter…  and who doesn’t take all of my money! 

You’ve been mistreating me for years, Larry. And to say that you’ve been painfully slow to react to my wishes, inflexible as a “partner”, and unwilling to evolve to meet my changing needs, is a huge understatement. There is no one who makes me feel as taken advantage of as you do. Now that there are clearly so many better options for me, I finally have the courage and conviction to leave.

So why is all of my data packed up and ready to go? You deserve to know:

You’re the definition of “legacy,” and you can’t easily support the kind of diversity I need. How difficult and expensive you make it to work with mixed workloads, mixed data types, and streaming systems in this day and age is a joke. You might think having lots of expensive, specialized tools makes you desirable, but trust me – it’s the opposite. I don’t want to have to somehow figure out how to stitch everything together and make it work. The things I’m trying to do are already a challenge, without you making everything more complicated.

Then there’s your performance and scale limitations. You were fast, in your day, but software has evolved! I need speed and scale, without having to give up an arm and a leg for it. 

The new database companies give me the performance I need, without compromises. And they do it at a fraction of what you charge me. Not to mention I can run them anywhere! 

And, no, don’t try to sell me on Exadata being the answer to these problems. We both know that it’s too little, too late, too limited. Oh, and too expensive, of course.

Most importantly, I need someone who supports me and behaves like a real partner. Who understands where the world is going, and appreciates what I’m going through. A vendor who is flexible to suit me, and doesn’t expect me to flex to do things their way. 

All that being said, I also want to thank you, Larry. Seriously. It’s really hard to make the first move, and you served us well in your time.

You came along when there was pretty much no other game in town. It’s next to impossible to be first and get everything right from the get-go. So, props to you for putting yourself out there and using the best available resources at the time to just plain make things happen. We all thank you for what you helped us do for many years.

Our relationship was an important chapter in my life, Larry, but I need to move on. I have new needs, and I must embrace the modern world we’re living in. I need to move fast – like, real-time fast – and I want someone who is there at my side and helps me try new and exciting things. 

I need real speed and elastic scale in my life, not to mention someone who is all-in on the cloud, and not a Johnny Come Lately (and Reluctantly and Cost-Prohibitively, I might add). Frankly, Larry, I need a partner who understands where I’m going, and who is heading there with me… 

I’m through waiting for you to change – beyond the superficial fixes which you claim are game-changing, but we both know they’re just hot air. Expensive hot air, at that.

It’s time to face reality, Larry: the world is leaving you behind, and so am I. You had a million chances, and you blew them all. I’m taking up with the NewSQL crowd. They get me. And we’re about to make beautiful, data-driven music together.

Good-bye,

A soon-to-be-former-customer

P.S. If you want to see what a “healthy” database vendor relationship looks like, with a more focused and committed partner, check out our Oracle vs. MemSQL comparison or contact us for a personalized demo!


Webinar Recap #1 of 3: Migration Strategy for Moving Operational Databases to the Cloud

Feed: MemSQL Blog.
Author: Floyd Smith.

This webinar describes the benefits and risks of moving operational databases to the cloud. It's the first webinar in a three-part series focused on migrating operational databases to the cloud. Migrating to cloud-based operational data infrastructure unlocks a number of key benefits, but it's also not without risk or complexity. The first session uncovers the motivations and benefits of moving operational data to the cloud and describes the unique challenges of migrating operational databases to the cloud. (Visit here to view all three webinars and download slides.) 

About This Webinar Series

Today starts the first in a series of three webinars:

  • In this webinar, we'll discuss, in broad strokes, migration strategy and cloud migration, and how those strategies are influenced by a larger IT transformation or digital transformation strategy. 
  • In our next webinar, we’ll go into the next level of details in terms of database migration best practices, where we’ll cover processes and techniques of database migration across any sort of database, really. 
  • In the final webinar, we’ll get specific to the technical nuts and bolts of how we do this in migrating to Helios, which is MemSQL’s database as a service. 

In this webinar, we’ll cover the journey to the cloud, a little bit about the current state of enterprise IT landscapes, and some of the challenges and business considerations that go into making a plan, making an assessment, and choosing what kind of workloads to support. 

Next we’ll get into the different types of data migrations that are typically performed. And some of the questions you need to start asking if you’re at the beginning of this kind of journey. And finally, we’ll get into some specific types of workloads along the way. 

Any sort of change to a functioning system can invoke fear and dread, especially when it comes to operational databases, which of course process the critical transactions for the business. After all, they’re the lifeblood of the business. And so, we’ll start to peel the onion and break that down a little bit.

If you’re just starting your journey to the cloud, you’ve probably done some experimentation, and you’ve spun up some databases of different types in some of the popular cloud vendors. And these cloud providers give guidelines oriented towards the databases and database services that they support. There’s often case studies which relate to transformations or migrations from Web 2.0 companies, companies like Netflix, who famously have moved all of their infrastructure to AWS years ago.

But in the enterprise space, there's a different starting point: many years, perhaps decades, of different heterogeneous technologies. When it comes to databases themselves, there's a variety of databases and versions accumulated over the years – some mainframe-resident, some from the client-server era, older versions of Oracle and Microsoft SQL Server, IBM DB2, et cetera.

And these databases perform various workloads and may have many application dependencies on them. So, unlike those Web 2.0 companies, most enterprises have to start with a really sober inventory analysis of what their applications are. They have to look at that application portfolio and understand the interconnections and dependencies among the systems. In the last 10 to 15 years especially, we've seen the uptake of new varieties of data stores, particularly NoSQL data stores such as Cassandra, key-value stores, in-memory data grids, streaming systems, and the like.

Note: See here for MemSQL's very widely read take on NoSQL.

Introduction

In companies that have been started in the last 15 to 20 years, you can run the entire business without your own data center. In that case, your starting point is often a SaaS application for payroll, human resources, et cetera, in addition to the new custom apps that you build – and of course, those will run on some infrastructure or platform as a service (PaaS) provider.

Some of this is intentional, in that large enterprises may want to hedge their bets across different providers. That's consistent with traditional IT strategy in the pre-cloud era, where a company might have had an IBM Unix machine alongside an HP Unix machine, or more recently Red Hat Linux alongside Windows applications.

Cloud Migration Webinar - Enterprise Cloud Strategy

But these days, it’s seen as the new platform where I want that choice is cloud platforms. Other parts of this are unintentional, like I said, with the lines of business, just adopting SaaS applications. And what you see here on the right, in the bar chart is that the hybrid cloud is growing. And to dig into that a little bit, to see just how much hybrid cloud has grown just from the year prior and 2018, it’s quite dramatic in the uptake of hybrid, and that speaks to the challenge that enterprise IT has, in that legacy systems don’t go away overnight.

Cloud Migration Webinar - State of the Cloud

It's not surprising that cloud spend is the first thing that bites businesses. But the cloud does have an advantage for experimentation with new applications and new go-to-market efforts, especially customer-facing applications.

Cloud Migration Webinar - Poll 1 Provider

Because it’s so easily scalable, you may not be able to predict how popular the mobile app may be, for instance, or your API, or your real-time visualization dashboard. So putting it in an elastic environment makes sense. But the cost may explode pretty quickly as other applications get there too. 

Cloud Migration Webinar - Database Migrations

And with governance and security, I think the challenges are obvious: when you're across a multi-cloud environment, you've got to either duplicate or integrate those security domains to ensure that you have the right control over your data and your users. There are also regulatory concerns about data privacy, depending on the business – data protection rules in the U.S. and in California, or in Europe with the General Data Protection Regulation (GDPR).

We’re now at a point in the adoption of cloud, that it’s not just sort of SaaS applications and ancillary supporting services around them, but it’s also the core data itself, like the databases service, in particular relational databases. And this might be a surprise given the popularity of NoSQL in recent years, you’ll see that NoSQL databases service are growing, but to lesser extent than relational. And what’s happening across relational data warehousing or OLTP, traditional OLAP, and NoSQL databases, is that there’s been a proliferation of all of these different types. But the power of relational still is what is most useful in many applications.

Gartner's view of this is that in just the next two years, 75% of all databases will be deployed on or migrated to a cloud platform. So that's a lot of growth. That number doesn't necessarily mean the retirement of existing databases; I think it speaks to the growth of new databases going into the cloud, because launching those new systems is so convenient, so easy, and – for the right kinds of workload – affordable.

So at this point, let's pause and put a question to the audience: who is your primary cloud service provider? You see the popular ones listed there. You may have more than one cloud service provider, but we're asking about your predominant or standard one. We'll wait a few moments while responses come in. 

Okay, this result matches what we've seen from other industry reports in terms of the popularity of AWS, with Azure second. Given their time in the market, this isn't such a surprise. A year from now, we might see a very different mix, given the adoption and uptake of Google and Azure services. So let's move on.

So what are the challenges of database migrations? Within enterprise IT, the first thing that needs to be done is to understand what that application dependency is. 

And when it comes to a database, you particularly need to understand how the application is using it. Some examples of the dependency points to look for: What data types are going to be used – varchars, integers? How heavily are stored procedures used, and how are they distributed? 

Although families of databases share a common language, there are often nuances in what's available within a stored procedure in terms of processing, so the migration of stored procedures takes some effort. Most traditional SQL databases also provide user-defined functions that let users extend the built-in functions. 

Then there's the query language itself: the data manipulation language (DML) for queries – select, insert, update, delete, et cetera – and the data definition language (DDL) for defining objects in the database, covering how tables are created, for instance, along with triggers, stored procedures, and constraints. 

There's also a hardware dependency to look at: depending on the age of the application, the software might be tied to a particular processor or machine type. And the application itself may only be available on that platform combination. 

In my own experience, I've seen this many times in airlines, where the systems for gate and boarding, check-in, and ground operations were written decades ago, typically provided by an industry-specific technology provider, and suited the business processes of that airline for many years.

But now the airline wants to do more customer experience interactions, collecting data about the customer's experience from existing touch points like check-in, the kiosk, and the mobile app, and enhancing it with operational data. A lot of these operational data systems – in logistics, airlines, and other operations and manufacturing businesses – don't lend themselves well to that.

So migrating these applications can be more difficult. Often it becomes an application modernization effort where you eventually move off of that platform entirely. Initially, you would integrate with these systems, storing the data that you event out of them in your target – a new database in the cloud. And finally, there is often a management mismatch: the configuration of that application and its database doesn't quite fit the infrastructure model of the cloud you're migrating to.

The assets aren't easily divided, parameterized, and put into your DevOps process and your CI/CD pipeline. Often they're not easy to containerize. These are some of the challenges that make it more difficult, in an enterprise IT context, to migrate the applications – which, of course, drag along their databases.

Charlie Feld, a visionary in the area of IT transformation, has his Twelve Timeless Principles:

  1. No Blind Spot
  2. Outcomes: Business, Architecture, Productivity
  3. Zoom Out
  4. Progressive Elaboration & Decomposition
  5. Systems Thinking
  6. The WHO is Where All the Leverage Is
  7. 30 Game Changers
  8. Functional Excellence is Table Stakes
  9. Think Capabilities
  10. Architecture Matters
  11. Constant Modernization
  12. Beachhead First, Then Accelerate

So let’s talk about the phases of migration. So we’ll go into this more in the second webinar, where we talk about best practices, but I’ll summarize them here. 

Cloud Migration Webinar - Migration Phases

Assessing applications and workloads for cloud readiness allows organizations to:

  • Determine which applications and data can – and cannot – be readily moved to a cloud environment 
  • Determine which delivery models (public, private, or hybrid) can be supported
  • Identify which applications you do not want to move to the cloud 

You've got to classify these different workloads. You can look at them in terms of: What's most amenable to the move? How many concurrent users do I expect? Where are they geographically distributed? Can I replicate data more easily in the cloud to provide that service without interrupting it?

Cloud Migration Webinar - Steps

Do I have new applications and transactions coming online? Perhaps there are new IoT sensors whose data I now need to bring into these applications. So you need to categorize these workloads in terms of data size, data frequency, and the shape and structure of the data, and look at what kind of compute resources you're going to need, because each will be a little different. Of course, this will require some testing by workload.

So at this point, I'd like to pause and ask Alicia to launch another polling question: what types of workloads have you migrated to the cloud so far? Given the statistics we see from the surveys, most of you have likely done some sort of migration, or are aware of one in your business. And you might be embarking on new types of applications, such as streaming IoT.

Cloud Migration Webinar - Poll2 Workloads

So roughly a third have not been involved in a migration so far, and for another third, it's been analytics and reporting. That result on analytics and reporting, I think, is insightful, because when you think about the risks and rewards of migrating workloads, the offline historical reporting infrastructure is the least risky. 

If you have a business scenario where you’re providing weekly operational reports on revenue or customer churn or marketing effectiveness, and those reports don’t get reviewed perhaps until Monday morning, then you can do the weekly reporting generation over the weekend. If it takes two hours or 10 hours to process the data, it’s not such a big deal. Nobody’s going to look at it until Monday.

So there's a broader array of fallbacks and safety measures, and it's less time-critical. Those are the easier ones. Meanwhile, 16% of you reported that you've been involved in, or are aware of, moving transactional or operational databases to the cloud. And that is really what's happening right now – we find this at MemSQL as well. The first wave was analytical applications, and now you see more of the operational transactions moving, which is the core part of the business.

Here are criteria to choose the right workloads for data migration: 

  • Complexity of the application 
  • Impact to the business
  • Transactional and application dependencies 
  • Benefits of ending support for legacy applications 
  • Presence or absence of sensitive data content 
  • Likelihood of taking advantage of the cloud’s elasticity

What are the most suitable candidates for cloud migration? Here are a few keys:

  • Applications and databases which already need to be modernized, enhanced, or improved to support new requirements or increased demands
  • Consider apps having highly variable throughput
  • Apps used by a broad base of consumers, where you do not know how many users will connect and when 
  • Apps that require rapid scaling of resources
  • Development, testing and prototyping of application changes

Cloud Migration Webinar - Workloads and Benefits

Q&A and Conclusion

How do I migrate from Oracle to MemSQL?

Well, we’ve done this for several customers. And we have a white paper available online that goes into quite a lot of detail on how to approach that, and have a plan for an Oracle to MemSQL migration. 

What makes MemSQL good for time series?

That's a whole subject in itself; we've got webinars and blog articles available on that. But I'll give a few reasons here. First, MemSQL allows you to ingest that data without blocking writes, and often in parallel. If you're reading from Kafka, for instance, which itself is deployed with multiple brokers and multiple partitions, MemSQL is a distributed database and can ingest that time series data in real time and in parallel. So the first point is ingestion.

Secondly, we provide time series-specific functions to query that data conveniently, so it's not necessary to go to a separate, special-purpose database. Again, MemSQL is a unified, converged database that handles relational, analytical, key-value, document, time series, and geospatial data all in one place. So it's well suited to the new cloud-native era, where you're going to have these different data types and access patterns.
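As a hedged sketch of those time series helpers, the query below buckets raw ticks into one-minute OHLC bars using TIME_BUCKET, FIRST, and LAST from MemSQL 7.0. The ticks table and its columns are hypothetical; check the MemSQL 7.0 documentation for the exact argument forms.

```python
# Hedged sketch (hypothetical ticks table): bucket raw ticks into one-minute
# OHLC bars with the MemSQL 7.0 time series helpers TIME_BUCKET, FIRST, and LAST.
import pymysql

conn = pymysql.connect(host="memsql-aggregator.example.com", user="admin",
                       password="secret", database="app")
with conn.cursor() as cur:
    cur.execute("""
        SELECT symbol,
               TIME_BUCKET('1m', ts) AS minute,
               FIRST(price, ts)      AS open_price,
               MAX(price)            AS high_price,
               MIN(price)            AS low_price,
               LAST(price, ts)       AS close_price
        FROM ticks
        GROUP BY 1, 2
        ORDER BY 1, 2
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```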

What is the difference between MemSQL and Amazon Aurora?

Yeah, so that question probably comes up because when you're migrating to a cloud database, you're typically looking at one of the major cloud providers – AWS, Google Cloud Platform, or Microsoft Azure – and each of these providers offers various types of databases. 

Amazon Aurora is a database built on Postgres, with a MySQL-compatible version as well, so it's worth a look. But what you'll find when you're building a high-performance application is that the system architecture of Aurora itself is its biggest Achilles' heel: it's composed of the single-node database engines of MySQL or Postgres, depending on the edition you've chosen, essentially sharded across multiple instances with a sharding middleware layer above them.

And that has inefficiencies; it's going to use more cloud resources. At small volumes, that might not manifest as a problem. But when you're doing this at scale, across many applications, those compute resources really add up in terms of cost. 

MemSQL is a much more efficient approach because it was written from the ground up; it's not built out of some other single-node, traditional SQL database the way Aurora is. MemSQL is built from the storage layer all the way up to take advantage of current cloud hardware and modern hardware features such as AVX2 SIMD instruction sets and, where available, non-volatile memory.

Secondly, I'd say that Aurora differs in a major way in that it's oriented to just transactions – OLTP-type processing. MemSQL does that, but not only that: it pairs a rowstore with a columnstore, which is what a traditional analytical database like Amazon Redshift has. So, in a way, you could say that with Amazon you would need two databases to do what MemSQL can do with a single database.
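To illustrate that "two databases in one" point, here is a hypothetical sketch that creates an in-memory rowstore table for transactional point lookups and a disk-based columnstore table for analytical scans, side by side in the same MemSQL database.

```python
# Illustrative sketch (hypothetical tables): an in-memory rowstore table for
# transactional point lookups alongside a disk-based columnstore table for
# analytical scans, both in the same MemSQL database.
import pymysql

conn = pymysql.connect(host="memsql-aggregator.example.com", user="admin",
                       password="secret", database="app", autocommit=True)
with conn.cursor() as cur:
    # Rowstore (the default table type here): suited to high-rate OLTP access.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts (
            account_id BIGINT PRIMARY KEY,
            balance    DECIMAL(18, 2) NOT NULL
        )
    """)
    # Columnstore: suited to large scans and aggregations over history.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS account_history (
            account_id BIGINT NOT NULL,
            ts         DATETIME(6) NOT NULL,
            balance    DECIMAL(18, 2) NOT NULL,
            KEY (ts) USING CLUSTERED COLUMNSTORE,
            SHARD KEY (account_id)
        )
    """)
conn.close()
```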

We invite you to learn more about MemSQL at memsql.com, or get started with your trial of MemSQL Helios.

Webinar Recap #2 of 3: Ensuring a Successful Cloud Data Migration

Feed: MemSQL Blog.
Author: Floyd Smith.

This webinar describes what’s needed to successfully migrate your data to the cloud. What does a good cloud data migration look like? What are the strategic decisions you have to make as to priorities, and what are the challenges you’ll face in carrying out your plans? We will describe a tried and tested approach you can use to get your first wave of applications successfully migrated to the cloud. 

About This Webinar Series

This is the second part of a three-part series. Last week we covered migration strategies: broad-brush business considerations to think about, beyond just the technical lift-and-shift strategy or data migration, and the business decisions and strategy to guide you in picking what sorts of workloads you will migrate and what sorts of new application architectures you might take advantage of.

In today’s webinar we’ll go one level deeper. Once those decisions are mapped out, what does a good cloud data migration look like? And in the final session, we’ll go a layer deeper, and we’ll get into migrating a particular source to target database.

Challenges of Database Migrations

So you’re looking at migrating an on-premises, so-called legacy database. You have particular IT responsibilities that span lots of different areas, lots of different infrastructure, lots of different layers. You’ve got the responsibility of all of these things. But a lot of this work doesn’t provide any differentiation for you in the marketplace or for your business.

So when you look at what’s possible in moving to a cloud database, the main thing that you get to take advantage of is a lot of that infrastructure is taken care of for you. And so any cloud database is going to greatly reduce that cost of ownership in terms of the infrastructure management. 

So the general value proposition of a cloud database, or any SaaS service, is that it allows you to reduce all of this work and focus on the business differentiation of your application.

Cloud data migration: Challenges

There are still challenges that you have to address in moving to a cloud database. First is the dependencies of applications. So an application may use particular proprietary data types. There may be custom logic and stored procedures, custom functions, etc., and SQL extensions that you’ll need to look at in your initial assessment.

But it’s not just the database itself that has to be considered when you’re doing a migration. There’s an ecosystem around the database, tooling and such, doing things such as replication. You may have real-time data integration in and out of your database through ETL, Kafka, middleware products, that sort of thing.

You want to look at what visibility you have in terms of monitoring and management, and any practices or automation you have around that existing visibility and monitoring. You have to carry those over in the migration, or redo those processes and techniques, along with backup and recovery. And you have to discover and set your goals. 

Ask the right questions:

  • Determine which applications and data can – and cannot – be readily moved to a cloud environment. 
  • Identify the workload types – maybe high-frequency transactions on an OLTP type of database, or needs further into the analytics spectrum.
  • Determine which applications you do not want to move to cloud. 

Cloud data migration: Separated workloads

There's a data integration process – extract, transform, and load (ETL) – that introduces significant latency between when data is written to the original source and when it lands in an analytic store for queries. The trade-off and the cost of this is the latency introduced for the queries, shown to the right.

So this might be fine if this is a weekly report. It’s not a report that has to run in a particular timeframe. But, as you move into real-time scenarios, this is what we at MemSQL call operational analytics.

There's no time for ETL anymore. Hence your reasons for moving to a cloud database: one is to scale further because your data is expected to grow; you also want to avoid the cost of all of that infrastructure; and you want a better consumption model going forward. Things like Oracle tend to be very expensive. They perform, up to a point, but they're expensive and difficult to scale. 

Cloud data migration: Data aspects

So what about the data itself? What should we look for here? There are essentially four aspects when you're looking at the migration from a particular database to a cloud database. The first is the shape, which follows from what I just discussed in terms of data models: how the data is structured in the schema, and what types are used, dictate that shape. 

If you've got row-structured data along with JSON that you may want to join, it would be easier for your application – and would simplify it – if you could join the array structure of JSON with rows, and if you could do distributed joins across any of the distributed data. MemSQL allows this converged model across these different data shapes. 
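A hypothetical example of that convergence: joining an orders table that stores raw JSON documents to a relational customers table in a single distributed query. The tables, columns, and connection details are illustrative only.

```python
# Hypothetical example: join an orders table holding raw JSON documents to a
# relational customers table in one distributed query.
import pymysql

conn = pymysql.connect(host="memsql-aggregator.example.com", user="admin",
                       password="secret", database="app")
with conn.cursor() as cur:
    cur.execute("""
        SELECT o.order_id,
               c.name,
               JSON_EXTRACT_DOUBLE(o.doc, 'total') AS order_total
        FROM orders o
        JOIN customers c
          ON c.customer_id = JSON_EXTRACT_BIGINT(o.doc, 'customer_id')
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```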

So what is the total dataset size? Is it somewhat fixed, or is it unbounded – something you expect to grow on a continual basis? You collect this data because it has consequences for your business operations – how efficiently and how well the operation is doing – and time matters. 

You want to make decisions in the moment. What’s the subset of the data that’s going to be queried? So it could be terabytes and terabytes, maybe petabytes. We can query that in a way that we can spread data across a distributed database, such as MemSQL, and get parallelism. The goal should be to serve the largest variety of workloads with the least amount of infrastructure. 

You can do that in a cloud native database like MemSQL Helios that handles all of these workloads. So that simplification is really important, especially in the cloud era, when infrastructure in the cloud is so easy to create. With just a push of a button – or not even that, an automated API call.  

So let’s take a poll of you, the people attending. 

Cloud data migration: Database types

So overwhelmingly, you attendees reported that traditional relational databases – Oracle; SQL Server; Aurora, which is either a Postgres or MySQL variant; and MySQL – are your most common. And that's not a surprise; MySQL is the most popular database in the world. 

Typically we also find that the analytic warehouses are the first to move to the cloud, because there's less risk. But what we're seeing now, and it's happening in a big way, is the operational database moving to the cloud and handling these cloud-scale, mixed workloads. 

One pattern that's worked for MemSQL with our customers is to take an incremental step: split off some of the workload from the transactional system that might be handling all of the transaction ingestion as well as analytics. You can then do analytics against those transactions in near real time, with the smallest lag possible, and scale the concurrency. So a good initial first step is to do a partial migration and replicate the data needed for analytics – provide it to the applications, the data scientists, and the analysts.

MemSQL, as a distributed database, allows you to handle highly concurrent reads. The nodes of the database for handling those inbound queries can be scaled independently of the data nodes themselves. And so that allows you to do this cost-effectively. If you’re using MemSQL Helios for this, it’s as simple as pushing a button and resizing your cluster. But if you have this on-prem, or self-managed MemSQL in the cloud, again you’re scaling the aggregators to allow this concurrency. 

So this pattern is a good initial first step that helps to minimize risk. It works especially well when you've got really high read-to-write ratios on this data. Talking to some of our banking customers, for retail banking with a mobile banking application or a web application, it can be as much as nine-to-one or 10-to-one in terms of reads to writes. 

Data Migration

In moving applications to the cloud, you’re dealing typically with really large amounts of data. And so how you handle the replication itself, for example, should be elastic. Just in the way that a modern database like MemSQL is distributed and MemSQL Helios is elastic. 

Cloud data migration: Migration process

You often have massive amounts of data. It could be hundreds of terabytes, from tens to hundreds of databases. So doing it in a serial single fashion can take quite a lot of time. So you want to be able to distribute that migration and replication work and parallelize it as much as possible. 

Cloud data migration: MemSQL Replicate

MemSQL Replicate is a built-in capability of the product – it's not something separate. It allows you to do this replication from sources such as Oracle, Oracle RAC, and SQL Server. 

In the future we'll be doing more with it; today these are the sources. You'll find more information at docs.memsql.com. It supports what I described earlier – the essential characteristics of reliable cloud data migration when it comes to the data migration and replication itself – and it supports the essential elasticity. 

It's distributed, it can replicate in parallel, and it can recover from checkpoints, so that when you're moving massive datasets from your operational database, you can get all of that reliably into your target system. 

So at this point, I'll pause for a polling question: what types of replication are you using today? Some of this might be dictated by the database ecosystem your company currently uses. GoldenGate, for instance, is part of the Oracle suite; Informatica is independent. 

Cloud data migration: Replication types

So what types of replication are you using today? That's an interesting result. This space of data integration and ETL has grown quite a lot in recent years, especially in the cloud context. There are other cloud-native options, such as Google's Alooma, for integrations with their databases; Matillion is another one that comes to mind. 

So I think this result, to put it in perspective, reflects the large number of options that have grown up. The fact that there are so many choices shows how much data migration – and database migration to the cloud – is happening. 

Cloud data migration: Success criteria

I'll leave you with these three takeaways; there's a lot to consider when migrating an operational database to the cloud. 

First and foremost, assess the workload. It's important to understand that you may find different shapes of data, different data types, and different data models in use, and that you're no longer restricted to moving in a one-to-one or one-to-many fashion. Consider the way Amazon has moved from Oracle databases to five or more different types of databases. That's a complex scenario, and it has expanded their infrastructure in terms of variety and types. 

That's more complex to manage. So you have to consider: does your business have the staff and skills to manage a growing variety of database types? Or can you move to a cloud database that supports multiple models in a converged fashion, as MemSQL Helios does? 

Secondly, migration of a database is not an all-or-nothing proposition. You can do a partial migration using a pattern that we've seen succeed with our customers: if the workload's ratio of reads to writes is very high, and it needs high concurrency for reporting, web, and mobile access, then consider first replicating just the data needed for the analytical workload to Helios, and then come back and move the transactional workload. 

Thirdly, automate as much as possible in the migration process. Tools are a big part of the answer, but they’re not the only part. 

And as I said, no matter what source or target database you're moving from or to, stored procedures and procedural languages are where you should expect some manual work. Even if you're moving from one version of MySQL to another, or one version of PostgreSQL to another, you'll have those issues. 

For the next session, the final session of the series, I'll be talking about database migration best practices. I'll get into one more level of technical specifics of how you do this with MemSQL and MemSQL Replicate, as I showed you, so you have something concrete in terms of seeing how this process is done. 

Q&A and Conclusion

Have any MemSQL customers migrated from Oracle? 

Yes. I would say that's the most common. Oracle has a very large footprint in the enterprise, and we've migrated from Oracle RAC, Oracle Database, and Oracle Exadata. Not only Oracle, though; we've also migrated from Microsoft SQL Server and SAP HANA, and from newer databases as well, such as Snowflake. In the Snowflake example, it's often because there's a need for lower latency for operational analytics, including greater concurrency. The way that MemSQL was built allows for that low-latency, real-time result – a hybrid transactional/analytical processing (HTAP) use case. 

Besides replicate, what are ways to get data into MemSQL?

A real-time way to get data in is through MemSQL Pipelines. That's an integration capability built into the product that allows you to subscribe to Kafka topics. You can do ingestion in parallel: if you've got multiple Kafka partitions, those can map to the distributed partitions of MemSQL Helios, so you can ingest in parallel. It's a somewhat different integration strategy, because you're not getting all of the guarantees I just described for replication through MemSQL Replicate, but it is an initial way to get data in. And then finally, there's bulk loading of data from a flat-file source – CSV, Hadoop, S3 buckets – what I would call data at rest, or static data. 
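To make the bulk-load path concrete, here is a hedged sketch of a Pipeline that reads CSV files from an S3 bucket into an existing table. The bucket, credentials, and table are placeholders, and the full option list is in the MemSQL Pipelines documentation.

```python
# Hedged sketch (placeholder bucket, credentials, and table): a Pipeline that
# bulk-loads CSV files from an S3 bucket into an existing orders table.
import pymysql

conn = pymysql.connect(host="memsql-aggregator.example.com", user="admin",
                       password="secret", database="app", autocommit=True)
with conn.cursor() as cur:
    cur.execute("""
        CREATE PIPELINE orders_from_s3 AS
        LOAD DATA S3 'example-bucket/orders/'
        CONFIG '{"region": "us-east-1"}'
        CREDENTIALS '{"aws_access_key_id": "YOUR_KEY_ID",
                      "aws_secret_access_key": "YOUR_SECRET"}'
        INTO TABLE orders
        FIELDS TERMINATED BY ','
    """)
    cur.execute("START PIPELINE orders_from_s3")
conn.close()
```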

Can I manage MemSQL in my own cloud environment? 

Yes. Helios provides MemSQL as a database as a service, and with MemSQL Helios you're not managing any of the infrastructure. But you can also deploy MemSQL in a cloud environment and manage it yourself. We make that easy by providing a Docker image and a Kubernetes Operator, so if that's how you run things in your VPC, it's fairly easy to do. Or you can install the native binaries in your cloud environment yourself. You'll also find MemSQL on the Amazon marketplace, and you can try setting it up there. 

You can try MemSQL for free – either MemSQL Helios, or the MemSQL software that you manage yourself – or contact MemSQL today.

Webinar Recap #3 of 3: Best Practices for Migrating Your Database to the Cloud

Feed: MemSQL Blog.
Author: Floyd Smith.

This webinar concludes our three-part series on cloud data migration. In this session, Domenic Ravita breaks down the steps of actually doing the migration, including all the key things you have to do to prepare and to guard against problems. Domenic then demonstrates part of the data migration process, using MemSQL tools to move data into MemSQL Helios. 

About This Webinar Series

This is the third part of a three-part series. First, we had a session on migration strategies: broad-brush business considerations to think about, beyond just the technical lift-and-shift strategy or data migration, and the business decisions and strategy to guide you in picking what sorts of workloads to migrate, as well as the different kinds of new application architectures you might take advantage of. Then, last week, we got down to the next layer and talked about ensuring a successful migration of apps and databases to the cloud in general. 

In today's webinar we'll talk about the actual migration process itself. We'll go into a little more detail in terms of what to consider with the data definition language, queries, DML, that sort of thing. Then I'll cover one aspect of that, which is change data capture (CDC), or replication, from Oracle to MemSQL, and show you what that looks like. 

Database Migration Best Practices

I’ll talk about the process itself here in terms of what to look at, the basic steps that you are going to be performing, what are the key considerations in each of those steps. Then we’ll get into more specifics of what a migration to Helios looks like and then I’ll give some customer story examples to wrap up, and we’ll follow this with a Q and A. 

The actual process of cloud data migration, such as from Oracle to MemSQL Helios, requires planning and care.

The process that we covered in the last session has these key steps: set the goals based on the business strategy in terms of timelines; decide which applications and databases are going to move; and consider what types of deployment you're going to have and what the target environment is. 

We do all this because it's not just the database that's being moved; it's the whole ecosystem around the database: connections for integration processes, the ETL processes from your operational database to your data warehouse or analytic stores, and, where you've got multiple applications sharing a database, an understanding of what that looks like and what the new environment is going to look like.

Whether the application will use the database in a similar way, or you're going to redefine, refactor, or modernize that application – perhaps splitting a monolithic application into microservices, for instance – will have an effect on what your data management design is going to be. 

So just drilling into step three there, for migration, that’s the part we’ll cover in today’s session. 

Cloud data migration requires the transfer of schema, existing data, and new, incoming data.

Within that, you're going to do a specific assessment of the workloads: what sorts of datasets are returned, what tables are hit, and what the frequency and concurrency of these are. This will help in capacity sizing, and it will also help in understanding which functions of the source database are being used in terms of features and capabilities, such as stored procedures.

Once you do that assessment – and there are automated tools to do this – you'll plan the schema migration of step two. The schema migration involves the table definitions, but also all sorts of other objects in the database that you should be prepared to adapt; some will map one-to-one, depending on your type of migration. Then comes the initial load, and then ongoing replication with a CDC process. 

For a successful cloud data migration, identify the applications that use a database and database table dependencies.

Let's take the first one, the assessment of the workloads, and what you want to consider there. When you think about best practices, you want to think about how applications are using the database. Are they using it in a shared manner? And specifically, what are they using from the database? For instance, you may need to determine which SQL queries are executed by which application so that you can sequence the migration appropriately.

So find the dependencies: first, which applications use the database; then, more fine-grained, which objects are used – stored procedures, tables, etc.; and finally, what the specific queries are. 

One tactic that's helpful in understanding that application usage is to find a way to intercept the SQL queries being executed by those applications. If this is a Java application, for instance, you can wrap the connection object when it's created, and also the object for dynamic SQL, and use this kind of wrapper to collect metrics and capture the query itself. That gives you specific data on how applications use the database – which tables, which queries, and how often they're fired. 

You can do this in other languages as well, as long as you wrap that client library, whether it's ODBC, JDBC, et cetera. This technique helps you build a dataset as you assess the workload, giving really precise information about which queries are executed and against which objects.
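As a conceptual sketch of that interception technique (not a MemSQL tool), here is a Python wrapper around a MySQL-compatible DB-API connection that records every statement an application executes, along with its timing. The connection details are placeholders.

```python
# Conceptual sketch (not a MemSQL tool): wrap a MySQL-compatible DB-API
# connection so every SQL statement the application runs is captured with its
# timing, building the workload inventory described above. Connection details
# are placeholders.
import time
import pymysql

class AuditingCursor:
    def __init__(self, cursor, log):
        self._cursor = cursor
        self._log = log

    def execute(self, query, args=None):
        start = time.monotonic()
        try:
            return self._cursor.execute(query, args)
        finally:
            self._log.append({"query": query,
                              "elapsed_s": time.monotonic() - start})

    def __getattr__(self, name):
        # Delegate fetchall(), fetchone(), close(), etc. to the real cursor.
        return getattr(self._cursor, name)

class AuditingConnection:
    def __init__(self, conn):
        self._conn = conn
        self.captured = []  # the accumulated query inventory

    def cursor(self):
        return AuditingCursor(self._conn.cursor(), self.captured)

    def __getattr__(self, name):
        return getattr(self._conn, name)

raw = pymysql.connect(host="legacy-db.example.com", user="app",
                      password="secret", database="app")
conn = AuditingConnection(raw)
cur = conn.cursor()
cur.execute("SELECT 1")
cur.close()
print(conn.captured)  # e.g. [{'query': 'SELECT 1', 'elapsed_s': 0.0012}]
```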

Secondly, when you have that data, the next thing you want to do is look at the table dependencies. Again, if this is an operational database that you're migrating, it's typical that you might have an ETL process that keys off of one or many tables to replicate that data into a historical store – a data warehouse, a data mart, etc. Understanding what those export and/or ETL processes are, and on which tables they depend, is fairly key here. 

These are just two examples of the kinds of things that you want to look at for the workload. And of course with the first one, once you have the queries, and you can see what tables are involved, you can get runtime performance on that, you can have a baseline for what you want to see in the target system, once the application and the database and the supporting system pipelines have been migrated.

Migrating schema is a key part of a successful cloud data migration, and may require creating new code to replace existing code.

So now let’s talk a little bit about schema migration. And there’s a lot involved in schema migration because we’re talking about all the different objects in the database, the tables, integrity constraints, indexes, et cetera. But we could sort of group this into a couple of broad areas, the first being the data definition language (DDL) and getting a mapping from your source database to your target. 

In previous sessions in this series we talked about the migration type or database type, whether it’s homogeneous or heterogeneous. Homogeneous is like for like, you’re migrating from a relational database source to the same version even, of a relational database, just in some other location – in a cloud environment or some other data center. 

That's fairly straightforward and simple. Often the database itself provides out-of-the-box tools for that sort of migration and replication. When you're moving from a relational database to another relational database from a different vendor, that's when you're going to have more of an impedance mismatch in some of the implementations – of DDL, for instance.

You'll find many of the same constructs because they're both relational databases. But despite decades of so-called standards, there's going to be variation; there are going to be some vendor-specific things in each database. For instance, if you're migrating from Oracle to MemSQL, as far as data types you'll find a pretty close match: from an Oracle VARCHAR2 to a MemSQL VARCHAR, from an NVARCHAR2 to a MemSQL VARCHAR, and from an Oracle FLOAT to a MemSQL DECIMAL. 

Those are just some examples, and we have a white paper that describes this in detail and gives you those mappings, such that you can use automated tools to do much of this schema migration as far as the data types and the table definitions, etc. 
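As a hypothetical illustration of that mapping, here is a sketch that shows an Oracle table definition (as a comment) and a MemSQL equivalent created through any MySQL-wire-compatible driver. Verify each mapping against the white paper and the MemSQL docs for your actual schema; the column types below follow the guidance above but are illustrative only.

```python
# Hypothetical illustration of the type mapping: the Oracle source table is
# shown as a comment; a MemSQL equivalent is created through any
# MySQL-wire-compatible driver. Verify each mapping against the white paper
# and the MemSQL docs for your actual schema.
import pymysql

# -- Oracle source (for reference only):
# CREATE TABLE orders (
#   order_id   NUMBER(19)    PRIMARY KEY,
#   customer   VARCHAR2(100),
#   note       NVARCHAR2(400),
#   amount     FLOAT,
#   created_at TIMESTAMP(6)
# );

conn = pymysql.connect(host="memsql-aggregator.example.com", user="admin",
                       password="secret", database="app", autocommit=True)
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS orders (
            order_id   BIGINT NOT NULL,    -- NUMBER(19)   -> BIGINT
            customer   VARCHAR(100),       -- VARCHAR2     -> VARCHAR
            note       VARCHAR(400),       -- NVARCHAR2    -> VARCHAR
            amount     DECIMAL(38, 10),    -- FLOAT        -> DECIMAL
            created_at DATETIME(6),        -- TIMESTAMP(6) -> DATETIME(6)
            PRIMARY KEY (order_id),
            SHARD KEY (order_id)
        )
    """)
conn.close()
```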

After the data types, the next thing that you would be looking at would be queries and the data manipulation language (DML). So, when you look at queries, you’ll be thinking, “Okay, what are the different sorts of query structures? What are the operators in the expression language of my source, and how do they map to the target?”

How can I rewrite the queries from the source to be valid in the target? You're going to look at particular syntax – outer join syntax, for instance, or whether you have recursive queries. Just using Oracle as an example, MemSQL has a fairly clear correspondence for those capabilities from relational data stores like Oracle, PostgreSQL, MySQL, etc. 

If your source is a MySQL database, you'll find that the client libraries can be used directly with MemSQL, because MemSQL is MySQL wire protocol compatible. You can use basically any of the hundreds of MySQL client drivers available across every programming language to connect to MemSQL, which simplifies some of your testing in that case.

The third thing I'd point out is that even though you may be migrating from one relational database to another, you should still consider this a heterogeneous move, because the architecture of the source database is often – almost always, these days – a legacy single-node architecture. Meaning that it's built on a disk-first design and meant to scale vertically: to get more performance on a single machine, you scale up to bigger hardware with more CPUs. 

When you come to MemSQL, you can run it as a single node, but the power of MemSQL is that it's a scale-out distributed database, so you can grow the database along with your growing dataset simply by adding nodes to the MemSQL cluster. MemSQL is distributed by default – distributed-native, you might say – and that's one of the attributes that makes it a good cloud-native database with Helios, allowing us to elastically scale a Helios cluster up and down. I'll come back to that in a moment.

But as part of that, when you think about mapping the structure of a relational source to a target like Helios, you're mapping a single-node database to a distributed one. So there are a few extra things to consider, like the sharding of data – some people call this partitioning or distributing the data – across the cluster. 

The benefit of that is that you get resiliency, in the case of node failures you don’t lose data, but you also get to leverage the processing power of multiple machines in parallel. And this helps when you’re doing things like real-time raw data ingestion from Kafka pipelines and other sources like that. 

This is described in more detail in our white paper, which I’ll point out in just a moment. So once you’ve got those things mapped, and you may be using an automated tool to do some of that schema mapping, you’ll have to think about the initial load.

To perform the initial data load in your cloud data migration, disable constraints temporarily.

This, depending on your initial dataset, could take a significant amount of time when you consider the size of the source dataset and the bandwidth of the network link across which the data must move. 

So if you're planning a migration cutover, say over a weekend, you'll want to estimate based on those factors: what the initial load is going to be, and when that initial data load will complete, so that you can plan when replication of new transactions starts. Also, when you're doing the load, consider what other data prep needs to happen to ensure that integrity constraints and other things like that are working correctly. I'll touch a little on how we address that through parallelism.

Your cloud data migration needs a process to replicate new data into the cloud, for instance by using CDC.

So finally, once that initial load is done, then you’re looking to see how you can keep up with new transactions that are written to the source database. So you’re replicating, you’ve got a snapshot for the initial load, and now you’re replicating from a point in time, doing what’s called change data capture (CDC). 

As the data is written, you want the minimal latency possible to replicate – copy – that data to the target system. There are various tools on the market to do this. Generally you want certain capabilities: you should expect some sort of failure, so you need checkpointing in the process so you don't have to start over from the very beginning. 

Again, this could be tens or hundreds of terabytes if it's an operational or analytic database that has accumulated data over time. Or, if it's multiple databases, each may hold a small amount of data, but together you've got a lot in flight at the same time. So you want replication that can be done in parallel, with checkpointing so you can restart from the point of failure rather than from the very beginning. 
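To show why checkpointing matters, here is a purely conceptual sketch of a checkpointed replication loop. This is not MemSQL Replicate's interface; the read_changes and apply_batch callbacks are hypothetical stand-ins for a real CDC reader and writer.

```python
# Purely conceptual sketch (not MemSQL Replicate's interface). It shows why
# checkpointing matters: if the copy job fails mid-stream, it resumes from the
# last committed position instead of rereading the whole source. read_changes
# and apply_batch are hypothetical stand-ins for a real CDC reader and writer.
import json
import os

CHECKPOINT_FILE = "replication.checkpoint"

def load_checkpoint():
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["position"]
    return 0  # start from the initial-load snapshot position

def save_checkpoint(position):
    # Write atomically so a crash never leaves a half-written checkpoint.
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"position": position}, f)
    os.replace(tmp, CHECKPOINT_FILE)

def replicate(read_changes, apply_batch, batch_size=1000):
    position = load_checkpoint()
    while True:
        batch = read_changes(position, batch_size)  # e.g. rows of the source change log
        if not batch:
            break
        apply_batch(batch)                # write the batch to the target database
        position = batch[-1]["position"]  # each change event carries its log position
        save_checkpoint(position)         # advance only after a durable apply
```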

Your cloud data migration should include data validation and repair processes.

And then finally, data validation and repair. With all these different objects in the source and the target, there's room for error, so you've got to have a way to automatically validate the data and run tests against it – think about automating that as much as possible. In testing, validate the data after the initial load, before starting to replicate; and as data replicates, run a series of ongoing validations to ensure nothing is mismatched and your logic is not incorrect.

Let’s go to our first polling question. You’re attending this webinar probably because you’ve got cloud database migration on your mind. Tell us when you are planning a database migration. Coming up in the next few weeks, in which case you would have done a lot of this planning and testing already. Or maybe in the next few months, and you’re trying to find the right target database. Or maybe it’s later in this year or next year, and maybe you’re still in the migration strategy, business planning effort. 

Most of our webinar attendees for cloud data migration are not currently planning a database migration process.

Okay. Most of you are not in the planning phase yet, so you’re looking to maybe see what’s possible. You might be looking to see what target databases are available, and what you might be able to do there. We hope you take a look at what you might do with Helios in the Cloud. 

We'll talk about moving workloads to Helios. Helios is MemSQL's database as a service. Helios is, at its essence, the same code base as MemSQL self-managed, as we call it, but provided as a service, so you don't have to do any of the infrastructure management yourself. It takes all the non-differentiated heavy lifting away, so you can focus just on your data.

Like MemSQL self-managed, MemSQL Helios lets you run analytics and transaction workloads simultaneously on the same database, or on multiple databases in a Helios cluster. You can run Helios in the major cloud providers – AWS and GCP today, with Azure coming soon – in multiple regions.

For a successful cloud data migration, assess the database type of the source and the target database.

Looking at some of the key considerations in moving to Helios … I mentioned before identifying the database type of the source and the target. There are multiple permutations to consider, even like-for-like with homogeneous moves. Although it may be a relational-to-relational migration, as in the examples I just gave with Oracle and MemSQL, there are still details to be aware of.

The white paper we provide gives a lot of that guidance on the mapping. There are things that you can take advantage of in Helios that are just not available or not as accessible in Oracle. Again, that’s things like the combination of large, transactional workloads simultaneously with the analytical workloads.

The next thing is the application architecture, which I mentioned earlier. Is your application architecture going to stay the same as you move? Most likely it's going to change in some way, because when these migrations are done, a business is typically making selections in the application portfolio for new replacement apps – often SaaS applications – to replace on-prem applications.

A product lifecycle management (PLM) system on-prem, for example, often is not carried over into the cloud environment; some SaaS provider is used instead. But you still have integrations that need to be done – there could be analytical databases that need to pull from that PLM system, and now they're going to be in the cloud environment.

So look at the selections in the application portfolio – or the application rationalization, as many people think of it – and what that means for the database. Then, for any particular app, if it's going to be refactored from a monolith to a microservices-based architecture, what does that mean for the database?

Our view of MemSQL in microservices architectures is that you can have a level of independence between the service and the database, yet keep the infrastructure as simple as possible. We live in an era where it's really convenient to spin up lots of different databases easily, but even when they're in the cloud, those are more pieces of infrastructure whose life cycle you now have to manage.

As much as possible, you should try to minimize the amount of cloud infrastructure you have to manage – not just the number of database instances, but also the variety of types. Our view of purpose-built databases and microservices is that you can have the best of purpose-built – support for different data structures and access methods, such as a document store, geospatial data, and full-text search – together with relational, transactions, and analytics, all living together, without your application having the complexity of communicating with different types of databases and different instances to get that work done.

Part of the reason purpose-built databases caught on is that they provided flexibility to start simply, such as with a document database, and then grow and expand quickly. Now we, as an industry, have gone to the extreme, where there's an explosion of different types of data stores. To get a handle on that complexity, we've got to simplify and bring that back in.

MemSQL provides all of those functions I just described in a single service, and Helios does as well. You can still choose to segment workloads into different database instances within the Helios cluster, yet you have the same database type, and you can handle these different types of workloads. For a microservices-based architecture, it gives you the best of both worlds: the purpose-built, polyglot-persistence, NoSQL-style capabilities and scale-out, with the benefits of robust ANSI SQL and relational joins.

Finally, the third point here is optimizing the migration. As I said, with huge datasets, the business needs continuity during that cutover time. You’ve got to maintain service availability during the cutover. The data needs to be consistent, and the time itself needs to be minimized on that cutover.

Advantages of migrating your cloud data into MemSQL Helios include scalability, predictable costs, reliability, and less operations work.

Let me run through some of the advantages of moving to Helios. As I said, it's a fully managed cloud database as a service, and, as you would expect, you can elastically scale a MemSQL cluster up and down.

Scaling down is perhaps the more important part, because if you have a cyclical or seasonal business like retail, there will be a peak toward the end of the year – Thanksgiving, Christmas, the holiday season. You want to be able to match the infrastructure to that demand without provisioning for full peak load for the whole year. This is, of course, one of the major benefits of cloud computing – but your database has to be able to take advantage of it.

Helios does that through, again, its distributed nature. If you’re interested in how this works exactly, go to the MemSQL YouTube channel. You’ll see quick tutorials on how to spin up a Helios cluster and resize it. The example there shows growing the cluster. Then once that’s done, it rebalances the data, but you can also size that cluster back down.

As I mentioned, it eliminates a lot of infrastructure and operations management, and it gives you predictability in costs. Without going into the full pricing of Helios, our pricing is structured around units, or nodes. Each unit is defined by its computing resources – eight virtual CPUs and 64 gigabytes of RAM – and how many units you need is driven by your data growth and usage patterns. That's the only thing you need to be concerned about in terms of cost, which makes financial forecasting for applications a lot simpler.

Again, since Helios is offered on multiple cloud providers – AWS, GCP, and soon Azure – and in multiple regions, you can co-locate your Helios cluster with, or place it in network proximity to, your target application environment, so that you minimize data ingress and egress costs.

When you bring data into Helios, you just get the one cost, so the Helios unit cost. From your own application, your cloud-hosted application or your datacenter-hosted application that’s bringing data into Amazon, or Azure, or GCP, you may incur some costs from those providers, but from us, it’s very simple. It’s just the per-unit, based on the node.

Helios is reliable out of the box in that it's a high availability (HA) deployment: if any one node fails, you're not losing data, because data is replicated to another leaf node. Leaf nodes in Helios are the data nodes that store data. Every leaf node holds one or more partitions, and each partition is guaranteed to have a copy on another machine. Most of this happens under the covers for you, and you should not experience any slowdown in your queries, provided that your data is well distributed.
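If you want to see that layout for yourself, the partition placement is visible with plain SQL. A minimal sketch, assuming a hypothetical database named mydb (the exact output columns vary by version):

    -- Show each partition of the database, which leaf node hosts its
    -- master copy, and where the replica copy lives.
    SHOW PARTITIONS ON mydb;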

Next, freedom without sprawl. What I'm talking about is that Helios allows you, as I said earlier, to combine multiple types of workloads – mixed transactions and analytics – and different types of data structures, like documents. If you're creating and querying a product catalog, or you have orders structured as documents, then with Helios, as with self-managed MemSQL, you can store those documents in raw JSON format directly in MemSQL. We index into that data, so JSON queries become part of your normal application logic.

In that way, MemSQL can act as a document or key-value store in the same way that MongoDB or AWS DocumentDB or other types of document databases do. But, we’re more than that, in that you’re not just limited to that one kind of use case. You can add relational queries. A typical use case here is storing the raw JSON but then selecting particular parts of the nested array to put into relational or table columns, because those can be queried as a columnstore in MemSQL. That has the advantage of compression.
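As a rough illustration of that pattern – not the schema of any real customer – here is a minimal sketch of storing raw JSON orders and promoting a couple of nested fields into persisted computed columns that the columnstore can compress and scan. The table and field names are hypothetical:

    -- Raw documents land in the JSON column; two fields are promoted into
    -- persisted computed columns for fast relational access.
    CREATE TABLE orders (
        order_id BIGINT NOT NULL,
        doc JSON NOT NULL,
        customer_id AS doc::$customer_id PERSISTED TEXT,
        amount AS doc::%total PERSISTED DOUBLE,
        KEY (order_id) USING CLUSTERED COLUMNSTORE
    );

    -- Query the promoted columns relationally, or reach into the document directly.
    SELECT customer_id, SUM(amount) AS total_paid
    FROM orders
    WHERE doc::$status = 'paid'
    GROUP BY customer_id;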

There’s a lot of advantages in doing these together, relational with document, for instance, or relational with full-text search. Again, you can have this freedom of the different workloads, but without the sprawl of having to set up a document database and then separately a relational database to handle the same use case.

Then, finally, I would say a major advantage of Helios is that it provides a career path for existing DBAs of legacy single-node databases. There’s a lot of similarity in the basic fundamental aspects of a database management system, but what’s different is that, with MemSQL, you get a lot of what was previously confined to the NoSQL types of data stores, key-value stores, and document stores, for instance. But, you’d get those capabilities in a distributed nature right in MemSQL. It’s, in some ways, the ideal path from a traditional single-node relational database career and experience into a cloud-native operational and distributed database like MemSQL.

The demo shows how to take a snapshot of an Oracle database and migrate the data to the cloud, in the form of MemSQL Helios.

So what I would like to do at this point is show you a simple demonstration of how this works. I'd refer you to our whitepaper for more details about what I discussed earlier on migrating Oracle to Helios; you'll find it by navigating from our homepage to Resources, then Whitepapers, then the Oracle and MemSQL migration paper. So with that, let me switch screens for just a moment.

And what I’m going to do. So I’ve got a Telnet session here or SSH session into a machine running an Amazon where I’ve got an Oracle database, and I’m going to run those two steps I just described. Basically the initial data load, and then I’m going to start the replication process. Once that replication process, with MemSQL Replication, is running, then I’ll start inserting data, new data into the source of Oracle database. And you’re going to see that written to my target. And I’ll show a dashboard to make this easy to visualize and the data that’s being written. 

So the data here is billing data for a utility billing system. I’ve got new bills and payments and clearance notifications that come through that source database. I’ll show you the schema in just a moment. So what I’ll do is I’ll start my initial snapshot. I’ve got one more procedure to run here.

Okay. So that’s complete and now I’ll start my application process. And so from my source system, Oracle, we’re writing to MemSQL Helios. And you see it’s written 100,000 rows to the BILLS table, 300,000 to CCB and 100,000 to PAYMENTS. 

So now let’s take a look at our view there, and we can take a look at the database. It’s writing to a database called UTILITY. And if I refresh this, I’ll see that I will have some data here in those three tables… it gave me the count there, but I can quickly count the rows, see what I got there.

MemSQL Studio shows progress in a demo of cloud data migration to MemSQL Helios.

I also have a dashboard, which I'll show here, and it confirms that we're going against the same database I just queried. So at this point I've got my snapshot of the initial data for the bills, payments, and clearance notices.

Looker also shows progress in a demo of cloud data migration to MemSQL Helios.

So what I’ll do now is start another process that’s going to write data into this source Oracle database. And we’ll see how quickly this happens. Again, I’m running from a machine image in Amazon US East. I’ve got my Helios cluster also on Amazon US East. And so let’s run this to insert into these three tables here. 

And as that runs, you’ll see MemSQL Replicate, which you’re seeing here, it’s giving us how many bytes per second are being written, how many rows are being inserted into each of these tables, and what’s our total run time in terms of the elapsed and the initial load time for that first snapshot data. So here you’ll see my dashboard’s refreshing. You start to see this data being written here into Helios.

MemSQL Studio displays the results of the demo of cloud data migration to MemSQL Helios.

We can use MemSQL Studio to view the data as it's being written. Let's first take a look at the dashboard: you can see we're writing roughly anywhere from 4,000 to 10,000 rows per second against the database, which is a fairly small rate. We can get rates much higher than that – tens of thousands or hundreds of thousands of rows written per second, depending on row size, and sometimes millions of rows per second when the rows are small.

And let’s take a look at the schema here. And you’ll see that this data is increasing in size. As I write that data and MemSQL gives you this studio view such that you can see diagnostics on the process on the table as it’s happening, as data’s being written. Now you may notice that these three tables are columnstore tables. Columnstores are used for analytic queries and they have really superior performance for analytic queries and they compress the data quite a lot. And our column stores use a combination of memory and disk. 

After some period of time this data and memory will persist, will write to disk, but even when it’s in memory, you’re guaranteed the durability and resilience. Again, because Helios provides high availability by default, which you can see that redundancy through the partitions of the database through this view.

Case Studies

I’ll close out with a few example case studies. First… Well I think we’re running a little bit short on time, so I’m going to go directly to a few of these case studies here. 

Helios was first launched back in fall of last year and since then we’ve had several migrations. It’s been one of the fastest launches in terms of uptake of new customers that we’ve seen in the company’s history. 

This is a migration from an existing self-managed MemSQL environment for a company called SSIMWAVE, which provides video compression and acceleration for all sorts of online businesses. Their use case is interactive analytics and ad hoc queries: they want to analyze how to optimally scale their video serving and video compression.

We have a case study of the move by SSIMWAVE to MemSQL Helios.

They are a real-time business, and they need operational analytics on this data. To draw an analogy: if you're watching Netflix or Prime Video and you get jitter or a pause, for any of these online services it's an immediately customer-impacting, customer-facing scenario. So this is a great example of a business that depends on Helios and the cloud to reliably deliver analytics for a customer-facing application – what we call analytics with an SLA. They've been on Helios for several months now, and you can see the quote here on why they moved and the time savings they've gained with Helios.

A second example is Medaxion. They initially moved to MemSQL from a MySQL instance, and then over to Helios. Their business is providing information to anesthesiologists, and for them, again, it's a customer-facing scenario for operational analytics.

We also have a case study of the move by Medaxion to MemSQL Helios.

They’ve got to provide instantaneous analysis through Looker dashboards and ad hoc queries against this data. And Helios is able to perform in this environment for an online SAS application essentially where every second counts in terms of looking at what’s the status of the records that Medaxion handles. 

We also have a case study of the move by Thorn to MemSQL Helios.

And then finally, I'll close with Thorn. They are a nonprofit that focuses on helping law enforcement agencies around the world identify trafficked children faster.

If there's any example that shows the time criticality and importance of in-the-moment operational analytics, I think this is it, because most of the data that law enforcement needs exists in silos – various systems at different agencies – and what Thorn does is unify and bring all of this together in a convenient, searchable way.

They take the raw sources, among which are online posts, feed them through a machine learning process, and land that processed data in Helios so that their Spotlight application can support instant, in-the-moment searches by law enforcement – identifying, based on image recognition and matching, whether a child is in a dangerous situation, and correlating these different law enforcement records.

So those are three great examples of Helios in real-time operational analytics scenarios that we thought we’d share with you. And with that I’ll close, and we’ll move to questions.

Q&A and Conclusion

How do I learn more about migration with MemSQL?

On our resources page you'll find a couple of whitepapers on migration – one about Oracle specifically, and one more generally about migrating. Just navigate to the home page and go to Resources, then Whitepapers, and you'll find them. There's also a webinar we did last year on the five reasons to switch; you can catch that recording. Of course, you can also contact us directly, and we'll share an email address here for that.

Where do I find more about MemSQL Replicate?

So that’s part of 7.0, so you’ll find all of our product documentation is online and MemSQL Replicate is part of the core product, so if you go to docs.memsql.com, then you’ll find it there under the references.

Is there a charge for moving my data into Helios?

There’s no ingress charge that you incur using self-managed MemSQL or MemSQL Helios. Our pricing for Helios is purely based on the unit cost as we call it. And the unit again is the computing resources for a node and it’s just the leaf node, it’s just the data node. So eight vCPUs, 64GB of RAM, that is a Helios leaf node. All of that is just a unit. That’s the only charge. 

But you may incur data charges from depending on where your source is, your application or other system for the data leaving, or the egress from that environment, if it’s a cloud environment. So not from us per se, but you may from your cloud provider.

MemSQL Helios Technical Overview

Feed: MemSQL Blog.
Author: Floyd Smith.

Solutions Engineer Mihir Bhojani presents a 20-minute technical overview of MemSQL Helios. In this webinar recap, we present the key points of the webinar and give you a chance to review them at your leisure. You can also view the recorded MemSQL Helios Technical Overview webinar itself.

In this webinar, we’ll cover what exactly MemSQL Helios is and how it compares with self-managed MemSQL, which you download yourself, provision, and run in the cloud or on-premises. With MemSQL Helios, MemSQL provisions the hardware in the cloud and runs MemSQL itself; you just set up tables and manage your data. After describing Helios, I’ll have a hands-on demo that will show you the whole end-to-end process of getting started with Helios. 

Helios is fully managed, on-demand, and elastic

Helios is basically the MemSQL product that you’ve known and you have used before, except that it’s a fully managed service. We offer it on an on-demand model so you can spin up clusters on the fly temporarily for a few hours, or keep them long-running. 

MemSQL Helios is also an elastic database, because you can grow or shrink your cluster on the fly, on demand. All of this is done online, so there's no downtime when you scale out or scale down. With Helios, we here at MemSQL take care of your cluster provisioning and software deployment, we do biweekly maintenance upgrades on your cluster, and ongoing management and operations are handled by MemSQL experts. That leaves you, as the user, responsible only for logical management of your data, helping you keep your focus on application development.

Helios delivers effortless deployment, superior TCO, cloud flexibility and more

With MemSQL Helios, you get effortless deployment and elastic scale, so you can have a cluster up and running in five minutes. We’re going to demonstrate that today. You have superior TCO when you compare to legacy databases, so you can do a lot more with Helios with a lot less hardware allocated. You have multi-cloud and hybrid flexibility, so you have your pick when it comes to which cloud provider and which region you want to deploy in.

Helios uses the MemSQL Kubernetes Operator and containers

Currently MemSQL Helios is available on AWS and GCP; we have multiple global regions in both of those platforms, and we have Azure support coming in the next five to six months.

Let’s talk about how Helios works under the hood, then we can jump into the demo. Helios is facilitated by, and built on, Kubernetes. (Using the beta MemSQL Kubernetes Operator – Ed.) This helps us enable unique features like auto-healing, handling node failures, and also lets us enable features like auto-scaling and to do rolling online upgrades. 

Because MemSQL is inherently a distributed system, it’s really important that there’s high availability at all times. For storage, MemSQL makes use of basically the block storage of your cloud provider. We choose optimal instance type and node configuration when you start creating the cluster.

High availability (HA) is transparently built in, so you have HA as soon as you spin up the cluster. The way HA works in MemSQL is that leaf nodes – the data nodes – are paired up, with data duplicated from one leaf node to another and vice versa. So if one leaf node goes down, its pair can keep the cluster up and live. Security is also enabled by default: connections are encrypted in flight using TLS, and we use your cloud provider's encryption services for encryption at rest.

Let’s see Helios in action. I want to outline what exactly we’ll be seeing today. 

Our demo shows an ad tech use case built with Helios

The use case for today that we’re going to handle, and this is a real-life use case that we have companies using MemSQL for, is ad tech. So this data set is a digital advertising data example, which basically drills down and gets data from a traditional funnel. So we’re going to see multiple different advertisers and campaigns and we’re going to see what kind of engagements and events they are generating. 

We’re going to facilitate that by creating a cluster from scratch. We’ll create the database and tables from scratch as well. And then we’ll create a pipeline to start ingesting data from an existing Kafka topic. Then finally, we’ll access all of this data from Helios via a business intelligence (BI) dashboard.

Let’s get started. So if you go to portal.memsql.com, you’ll be able to spin up a set 48-hour clusters. That’s the amount of time that you get to spin up a cluster as a trial. So if you go to portal.memsql.com and you go to the Clusters tab, you’ll see a create button and when you click on that, you’ll be brought to this screen. I’m just going to name this helios-webinar. You’ll have your pick when it comes to which cloud provider and region you want to be in. And we’re always adding new regions as well. 

Creating a four-unit Helios cluster on AWS Virginia

So right now I’ll leave it at AWS Virginia. Cluster size – so, because MemSQL is a distributed system in this case, when we refer to units, we’re really talking about leaf nodes. (Leaf nodes hold data. They are accompanied by one or more aggregator nodes, which hold schema and which dispatch queries to the leaves – Ed.) One unit in MemSQL Helios is actually eight VCPUs, 64GB of RAM and one terabyte storage. You can only purchase units in pairs, so in increments of two essentially. For now we’re going to leave this at four, so it’s a modest-sized cluster. We’ll generate a password and make sure to copy it.

Then you can also configure cluster access restrictions. So if you want to white-list specific IP ranges, we absolutely recommend that you do that. For now, I’ll just hide my own IP address. Then, as an advanced setting, you can also configure if you want to automatically expire this cluster after a certain number of hours. 

Once we click on Create Cluster, we'll see that the cluster starts reconciling resources. What that means is that, behind the scenes, it goes to AWS and, using Kubernetes, spins up a cluster on the fly. This process is expected to take about five minutes. In the interest of saving time, I already spun up a cluster before the webinar, so I'm just going to use that going forward. It's essentially the same size cluster as the one I just created, and once it's up and running, you'll see something like this on your screen, with all the connection information on the right side.

Displaying cluster properties in the MemSQL Portal

MemSQL Studio, which we're going to tour in just a second, is the graphical interface for interacting with MemSQL Helios. But we also get these endpoints: if you want to connect from your own applications or your own BI tools using JDBC/ODBC, you can use these endpoints to connect to the MemSQL cluster. For now, let's just go into MemSQL Studio.

Showing cluster health and usage in MemSQL Studio

Here, we’ll enter the password that we generated earlier. As soon as you log in you’ll see a screen, something like this, which is basically a dashboard that shows you what current health and usage looks like on your cluster. As we can see, there’s currently no pipelines, no data, nothing. The reason why it says seven nodes is because when we chose four units, when spinning up the cluster, so we got four leaf nodes as expected; the rest of the three nodes are basically aggregator notes. So you get one master agg and two child agg, plus the leaf nodes. All of those are obviously load balanced.

Borrowing sample schema from MemSQL Docs

Let’s start creating some schemas. So here we have this sample use case basically. Here is a sample schema that we’re going to use. 

I’m just going to explain what’s happening here. So here we’re going to create a database and then two tables. The first table is the events table and this is basically the fact table, right? So this is where all the raw events from the Kafka topic are going to stream into. 

This is a columnstore table. When you create tables in MemSQL, you have two options: rowstore and columnstore. The default table type in MemSQL is rowstore, so if you don't specify, you get a rowstore table, which means all your data goes in memory. But if you do specify a key using CLUSTERED COLUMNSTORE, you get a columnstore table, where your data actually lives on disk; we use memory to optimize performance and keep index metadata in memory. Columnstore is recommended for OLAP (online analytical processing) workloads, and rowstore is recommended for OLTP (online transaction processing) workloads.

The second table here is campaigns, which is just a lookup table. This is the table we're going to join against when we need to look up specific campaigns, and that's why we're creating it as a reference table. Reference tables store a full copy of the table on every single leaf node, which really helps in situations where you need to join this table often. You just need to be careful that the table is small enough to fit comfortably on every node. We're also going to populate the campaigns table with some static values – about 14 campaigns to begin with.
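The sample schema in the docs is longer than this, but its shape is roughly the following sketch – a columnstore fact table for raw events plus a small reference table for campaigns. The column lists and seed values here are abbreviated and partly hypothetical:

    CREATE DATABASE adtech;
    USE adtech;

    -- Fact table: raw events streamed in from Kafka, stored as a columnstore.
    CREATE TABLE events (
        user_id BIGINT,
        event_name VARCHAR(128),     -- e.g. Impression, Click, Downstream
        advertiser VARCHAR(128),
        campaign INT,
        event_time DATETIME,
        KEY (event_time, user_id) USING CLUSTERED COLUMNSTORE
    );

    -- Lookup table: a full copy is kept on every leaf node, so joins against it are local.
    CREATE REFERENCE TABLE campaigns (
        campaign_id INT PRIMARY KEY,
        campaign_name VARCHAR(255)
    );

    -- Seed the lookup table with a few static campaigns (illustrative values only).
    INSERT INTO campaigns VALUES
        (1, 'summer brand push'),
        (2, 'holiday clearance'),
        (3, 'new product launch');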

Creating tables in MemSQL Studio

Let’s run everything. This should take about eight to nine seconds. 

Now that we have everything running, we have the ad tech database now, and then we have two tables inside this database. The next step is, we have this Kafka topic here, it’s a Kafka cluster that we’re hosting in-house. How do we get that data into the MemSQL Helios cluster? That’s where the pipelines feature comes in.

Borrowing a pipeline definition from MemSQL Docs

The pipelines feature can connect to Kafka, obviously, but it can also connect to other sources like AWS S3, Azure Blob Storage, and HDFS. If you have data sitting in any of those places, you can use a pipeline to natively stream it directly into MemSQL.

For now we’re going to use Kafka, as it’s one of our most-used technologies. You can also specify a batch interval, how often you want to batch data into MemSQL. So right now it’s at 2,500 milliseconds or 2.5 seconds. I’m just going to make this 100 milliseconds and see what kind of results we get. We’re going to obviously send all the data into the events table that we created earlier. Then here are just all the fields that we want to populate coming in from the topic.

This pipeline is created. But if we go back to the dashboard, we’ll see that there’s no data being actually written yet because we haven’t started it. I just need to go back to the SQL editor and then alter the pipeline offsets so that we only get data basically from the latest offset. Then I’m going to finally start the pipeline. 
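Pulled together, the pipeline steps just described look roughly like the sketch below. The broker address and topic name are placeholders, and the column list matches the abbreviated events table sketched earlier:

    -- Define the pipeline: read from a Kafka topic and batch rows into events.
    CREATE PIPELINE events_pipeline AS
        LOAD DATA KAFKA 'kafka-broker.example.com:9092/ad_events'  -- placeholder broker/topic
        BATCH_INTERVAL 100                                         -- milliseconds between batches
        INTO TABLE events
        FIELDS TERMINATED BY '\t'
        (user_id, event_name, advertiser, campaign, event_time);

    -- Skip any history in the topic and read only from the latest offsets.
    ALTER PIPELINE events_pipeline SET OFFSETS LATEST;

    -- Begin streaming.
    START PIPELINE events_pipeline;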

Checking our cluster again in MemSQL Studio

Now if I go back to the dashboard, we should see rows being written at a steady pace into MemSQL. Just to reiterate, this demo is not to show you any performance aspects of MemSQL, it’s basically just a functional demo. MemSQL Helios can handle very high throughputs when it comes to ingestion and also query performance.

Now that we have data flowing in, let's start running some queries. Here are some SQL queries we wrote ahead of time. Let's explore what we have so far; then, as I mentioned, we'll go over to a BI tool to see how this data looks in a dashboard format.

A simple first question: how many events have we processed so far? That number right now is 51,000, and every time you run this, we expect it to go up by a few thousand. So if I run it again, it should go up – now it's at 57,000, and running it once more, it's at 60,000.

Running a SQL query in MemSQL Studio

What campaigns are we running? This is the lookup table that we populated earlier, showing all 14 campaigns that we're running right now. Then here's the traditional funnel query, which lets analysts see everything broken down by campaign: how many impressions, clicks, and downstream events they're getting per campaign.
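A funnel query of that shape is just an aggregation joined to the reference table. A rough sketch against the hypothetical schema above (the real demo query differs in its details):

    -- Impressions, clicks, and downstream events per campaign.
    SELECT
        c.campaign_name,
        SUM(CASE WHEN e.event_name = 'Impression' THEN 1 ELSE 0 END) AS impressions,
        SUM(CASE WHEN e.event_name = 'Click'      THEN 1 ELSE 0 END) AS clicks,
        SUM(CASE WHEN e.event_name = 'Downstream' THEN 1 ELSE 0 END) AS downstream
    FROM events e
    JOIN campaigns c ON c.campaign_id = e.campaign
    GROUP BY c.campaign_name
    ORDER BY impressions DESC;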

Let’s see how this data looks like in a Looker dashboard. Now Looker is a BI tool that we partner with and MemSQL works, integrates with almost all the BI tools out there. We are also MySQL compatible. If there’s a tool out there that does not have a MemSQL connector, you can always leverage the MySQL connector to connect to us. 

This is the dashboard that’s basically reading off of the Helios cluster that we just created. This number of ad events should match up with the number here. So yes, it’s 101. If I go back here, it should be close to that number.

Our example Looker Ad Events dashboard

All right. This is just the raw number of events coming into the MemSQL cluster right now, along with some charts – and by the way, all of them are updating in real time. This is a dashboard that's refreshing every second. If I click on Edit and go to Settings – yep, there it is.

All these charts are set to auto-refresh every second. Almost all BI tools have that option, but it's almost never used, because the underlying database has to handle queries coming in at a steady pace and respond to all of them in time so that the dashboard doesn't choke up. We can see here that Helios and MemSQL are able to do just that.

Ad events by region in our Looker dashboard

Here we can see the number of ad events for the top five advertisers, and here it's broken down by region and by company. Then here is the raw table of ad events as they come in. The point we want to emphasize with MemSQL is that the time from an event happening to someone gaining insight from it is extremely short.

In the last 10 minutes or so, we created a cluster, created a database and tables, created a pipeline, and started flowing data in. Now we have a dashboard we can get insight from – and, now that we have the insight, we can take the appropriate action.

Even though this use case focuses basically on a lot of the ad tech aspects, it’s really important to know that we have done this for multiple industries, whether that’s IoT, financial services… We work with a variety of customers out there. The real value of MemSQL here is that you basically get the best of all worlds when it comes to speed, scale, and SQL. With speed, that could mean the time it takes to actually spin up a cluster in a distributed system that’s complex to set up. Now we are able to get it up and running in less than five minutes. Or it could be the ultra-fast ingest and the high performance you get when it comes to querying your data.

Helios delivers MemSQL's triad of speed, scale, and SQL

The scale-out mechanism of MemSQL is obviously really unique, the way that you can always expand your aggregator or leaf nodes. Then, in Helios, it’s even easier now, because all it takes is one click of a button if you want to expand your cluster or shrink down. And all of those things are facilitated using SQL. 

It’s really important to the companies that we work with and the analysts that we work with that they’re able to continue using SQL, because it is a powerful analytical language and you basically get the best of all worlds when it comes to using MemSQL. You still get to use SQL, but you get all the benefits of the speed and the scalability that you traditionally get only with NoSQL databases. (Because previously, those were the only databases that were scalable; now, NewSQL databases combine scalability and SQL. See our popular blog post on NoSQL. – Ed) Now we basically combine all three and give it to you in one package.

There’s also another use case that I wanted to cover. Medaxion is a company that I also personally work with. Medaxion is basically a medical analytics platform. They provide a platform for anesthesiologists to analyze medical data. It’s really important for them to have a 24/7, always-on analytics solution so that they can help and change lives. 

Medaxion uses Helios to help anesthesiologists save lives

Before MemSQL, they had some challenges – pretty simple and fair challenges: their event-to-insight time was way too slow. That was partly because of legacy, single-box systems that couldn't keep up with the growth and demand that a lot of startups face. In this case, MySQL couldn't scale and wasn't able to fit their needs, and that was leading to a poor customer experience. As a result, Medaxion couldn't help anesthesiologists be in the right place at the right time, or automate their work as much as possible.

When Medaxion came to us, they had these three requirements – of many more, but these three were the main requirements. They wanted something that was fast and they wanted something that had SQL. They wanted a database that could work natively with the Looker platform that we just saw. That’s what Medaxion uses as well. And they wanted to be able to write ad hoc SQL queries. 

So they wanted something scalable, with really high performance for both ingest and querying, and accessible using SQL. They also wanted a managed service. Just like other startups, they faced a lot of operational complexity in managing a distributed system – and when it's a database that's mission critical to your application, there is inevitably some complexity involved.

They were very clear that they wanted a managed service because they didn’t want to add additional personnel just to handle the operations of a system like MemSQL. They wanted something obviously that had reliable performance, so something that was running 24/7, and no downtime whatsoever. When they switched to MemSQL, these were the results that we got. 

They got really fast and scalable SQL – exactly what they wanted – and now they have dramatic performance gains with no learning curve. That ties into the second point: we're MySQL compatible. There was very little work and risk involved in switching from a system like MySQL to MemSQL, because we're wire protocol-compatible, you're already familiar with the language, and you're already familiar with the tools that integrate with those technologies.

With MemSQL Helios, they can eliminate the operational headaches. They don't have to worry about upgrading every two weeks, or about configuring high availability, encryption, or load balancing – those things are now taken care of by MemSQL experts. That leaves you – and in this case Medaxion – free to focus on development. As a result, they now have near-real-time results, and what the CTO of Medaxion calls "the best analytics platform in the healthcare industry," facilitated by MemSQL.

I just want to thank you guys for your time today. If you do want to spin up your own Helios trial, please feel free to go to memsql.com/helios. All our trials are 48 hours, giving you more time to ingest your data, more time to test your queries. Of course, if you have any questions about pricing and billing, please reach out to us at Helios@MemSQL.com

Q&A

What is the uptime SLA guaranteed on Helios?

That’s a good question. Uptime SLA on Helios, we’re guaranteeing three nines of uptime, so 99.9%.

If the underlying storage were in S3, would storage be cheaper?

That’s a good question. Right now you can’t have underlying storage be S3. You basically do have the ability to ingest data directly from S3, but the data does need to be within the MemSQL cluster.

How about the Kinesis pipeline?

Right now we only support Kafka pipelines, and Kinesis is something that would take some custom work. We do have some custom scripts available if you do want to ingest data from Kinesis, but right now natively we only connect to Kafka.

How does backup/restore work on Helios?

We will take daily backups on Helios for you and we’ll retain a copy of that backup for seven days. So you can always request a restore from a particular day in the last week or so. But you also have the ability to kick off your own backups whenever you desire. So you can always back up into your S3 bucket, or your Azure blobstore bucket, and then you can always restore from those technologies as well.
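For the self-service side of that, an on-demand backup to your own bucket is a single SQL statement, and so is the restore. A minimal sketch with a hypothetical bucket, path, and credentials (check the BACKUP DATABASE and RESTORE DATABASE references in the docs for the options on your version):

    -- Kick off an on-demand backup of the adtech database to your own S3 bucket.
    BACKUP DATABASE adtech TO S3 'my-backups/adtech/latest'
        CONFIG '{"region": "us-east-1"}'
        CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';

    -- Later, restore from that same location on a cluster where adtech doesn't already exist.
    RESTORE DATABASE adtech FROM S3 'my-backups/adtech/latest'
        CONFIG '{"region": "us-east-1"}'
        CREDENTIALS '{"aws_access_key_id": "...", "aws_secret_access_key": "..."}';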

Cloud Database Trend Report from DZone Features MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

A new Trend Report from DZone highlights the move to cloud databases. The report paints a bright picture for cloud database adoption, with roughly half of developers asserting that all of their organization’s data will be stored in the cloud in three years or fewer. You can get a copy of the report by request from MemSQL.

DZone has issued a new trend report on cloud databases. In the report, leaders in the database space focus on specific use cases, calling out the factors that help you decide what you need in any database, especially one that's in the cloud.

The advantages of cloud databases include flexibility to scale up and scale back, easier backups of data, moving database infrastructure out of house, and offloading some database maintenance.

MemSQL is a database that runs anywhere Linux does, on-premises and on all three major cloud providers – AWS, Google Cloud Platform (GCP), and Microsoft Azure. MemSQL Helios is a managed service with the MemSQL database at its core. Helios is available on AWS and GCP, with Azure support to follow soon. The MemSQL Kubernetes Operator gives you the flexibility to manage this cloud-native database with cloud-native tools.

MemSQL is also a fast, scalable SQL database that includes many features that are normally claimed only by NoSQL databases: easy scalability, fast ingest, fast query response at volume, and support for a wide range of data types, especially JSON and time series data.

Between self-managed MemSQL (the version you download and run on Linux) and MemSQL Helios, the advantages of cloud databases – scalability, easy and reliable backups, and moving both infrastructure and maintenance out of house – are readily available, on a solution that works identically on-premises.

The report points out several interesting facts:

  • Slightly more than half of organizations that have a cloud database solution in place have had one for two years or less.
  • More than two-thirds of cloud database users either use multiple clouds (40%) or are seriously considering doing so (26%).
  • Analytics is the #1 reason for moving databases to the cloud, with modernization of existing apps and becoming cloud native also ranking highly.
  • The database as a service (DBaaS) model, represented by MemSQL Helios and many other options, has a slight lead over those who use a self-managed database.
  • About half of respondents believe all of their data will be in the cloud in three years or fewer.

The report goes on to interview Adam Ballai, CEO at RevOps, in depth about the research findings.

You can access a copy of the report by contacting MemSQL. This version includes a customer case study from Thorn, which seeks to eliminate child sexual abuse from the Internet, using machine learning, AI – and MemSQL. You can also try MemSQL for free.
