
Pre-Modern Databases: OLTP, OLAP, and NoSQL


Feed: MemSQL Blog.
Author: Rick Negrin.

In this blog post, the first in a two-part series, I’m going to describe pre-modern databases: traditional relational databases, which support SQL but don’t scale out, and NoSQL databases, which scale out but don’t support SQL. In the next part, I’m going to talk about modern databases – which scale out, and which do support SQL – and how they are well suited for an important new workload: operational analytics.

In the Beginning: OLTP

Online transaction processing (OLTP) emerged several decades ago as a way to enable database customers to create an interactive experience for users, powered by back-end systems. Prior to the existence of OLTP, a customer would perform an activity. Only at some point, relatively far off in the future, would the back-end system be updated to reflect the activity. If the activity caused a problem, that problem would not be known for several hours, or even a day or more, after the problematic activity or activities.

The classic example (and one of the main drivers for the emergence of the OLTP pattern) was the ATM. Prior to the arrival of ATMs, a customer would go to the bank counter to withdraw or deposit money. Back-end systems, either paper or computer, would be updated at the end of the day. This was slow, inefficient, error prone, and did not allow for a real-time understanding of the current state. For instance, a customer might withdraw more money than they had in their account.

With the arrival of ATMs, around 1970, a customer could self-serve cash withdrawals, deposits, and other transactions. The customer moved from nine-to-five access to 24/7 access. ATMs also allowed a customer to understand in real time what the state of their account was. With these new features, the requirements for the back-end systems became a lot more complex: specifically, data lookups, transactionality, availability, reliability, and scalability – the latter becoming more and more important as customers demanded access to their information and money from any point on the globe.

The data access pattern for OLTP is to retrieve a small set of data, usually by doing a lookup on an ID. For example, the account information for a given customer ID. The system also must be able to write back a small amount of information based on the given ID. So the system needs the ability to do fast lookups, fast point inserts, and fast updates or deletes.
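As a rough sketch (the table and column names here are invented for illustration), the OLTP access pattern amounts to key-based reads and small key-based writes:

    -- Point lookup: fetch one customer's account by its ID
    SELECT account_id, balance, currency
    FROM accounts
    WHERE account_id = 123456;

    -- Small point insert: record a single new piece of activity
    INSERT INTO account_activity (account_id, activity_type, amount, activity_time)
    VALUES (123456, 'ATM_WITHDRAWAL', 40.00, NOW());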

Transaction support is arguably the most important characteristic that OLTP offers, as reflected in the name itself. A database transaction means a set of actions that are either all completed, or none of them are completed; there is no middle ground. For example, an ATM has to guarantee that it either gave the customer the money and debited their account, or did not give the customer money and did not debit their account. Only giving the customer money, but not debiting their account, harms the bank; only debiting the account, but not giving the customer money, harms the customer.

Note that doing neither of the actions – not giving the money, and not debiting the account – is an unhappy customer experience, but still leaves the system in a consistent state. This is why the notion of a database transaction is so powerful. It guarantees the atomicity of a set of actions, where atomicity means that related actions happen, or don’t happen, as a unit.
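In SQL terms, the ATM guarantee maps naturally onto a transaction; the following sketch (hypothetical schema) either commits both statements or neither:

    -- Debit the account and log the cash dispensed as one atomic unit
    START TRANSACTION;

    UPDATE accounts
    SET balance = balance - 100.00
    WHERE account_id = 123456
      AND balance >= 100.00;   -- refuse to overdraw the account

    INSERT INTO atm_dispense_log (account_id, amount, dispensed_at)
    VALUES (123456, 100.00, NOW());

    COMMIT;  -- a ROLLBACK anywhere before this point leaves both tables untouched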

Reliability is another key requirement. ATMs need to be always available, so customers can use one at any time. Uptime for the ATM is critical, even overcoming hardware or software failures, without human intervention. The system needs to be reliable because the interface is with the end customer and banks win on how well they deliver a positive customer experience. If the ATM fails every few times a customer tries it, the customer will get annoyed and switch to another bank.

Scalability is also a key requirement. Banks have millions of customers, and they will have tens of thousands of people hitting the back-end system at any given time. But the usage is not uniform. There are peak times when a lot more people hit the system.

For example, Friday is a common payday for companies. That means many customers will all be using the system around the same time to check on the balance and withdraw money. They will be seriously inconvenienced – and very unimpressed – if one, or some, or all of the ATMs go down at that point.

So banks need to scale to hundreds of thousands of users hitting the system concurrently on Friday afternoons. Hard to predict, one-off events, such as a hurricane or an earthquake, are among other examples that can also cause peaks. The worst case is often the one you didn’t see coming, so you need a very high level of resiliency even without having planned for the specific event that ends up occurring.

These requirements for the OLTP workload show up in many other use cases, such as retail transactions, billing, enterprise resource planning (widely known as ERP), customer relationship management (CRM), and just about any application where an end user is reviewing and manipulating data they have access to and where they expect to see the results of those changes immediately.

Existing legacy database systems were built over the last few decades to solve these use cases, and they do a very good job of it, for the most part. The market for OLTP-oriented database software is in the tens of billions of dollars a year. However, with the rise of the Internet, and more and more transactional systems being built for orders of magnitude more people, legacy database systems have fallen behind in scaling to the level needed by modern applications.

The lack of scale out also makes it difficult for OLTP databases to handle analytical queries while successfully, reliably, and quickly running transactions. In addition, they lack the key technologies to perform the analytical queries efficiently. This has contributed to the need for separate, analytics-oriented databases, as described in the next section.

A key limitation is that OLTP databases have typically run on a single computing node. This means that the transactions that are the core of an OLTP database can only happen at the speed and volume dictated by the single system at the center of operations. In an IT world that is increasingly about scaling out – spreading operations across arbitrary numbers of servers – this has proven to be a very serious flaw indeed.

OLAP Emerges to Complement OLTP

After OLTP, the other major pattern to emerge was OLAP. It arrived a bit later, as enterprises realized they needed fast and flexible access to the data stored in their OLTP systems.

OLTP system owners could, of course, directly query the OLTP system itself. However, OLTP systems were busy with transactions – any analytics use beyond the occasional query threatened to bog the OLTP systems down, limited to a single node as they were. And the OLAP queries quickly became important enough to have their own performance demands.

Analytics use would tax the resources of the OLTP system. Since the availability and reliability of the OLTP system were so important, it wasn’t safe to have just anyone running queries that might use up resources to any extent which would jeopardize the availability and reliability of the OLTP system.

In addition, people found that the kinds of analytics they wanted to do worked better with a different schema for the data than was optimal for the OLTP system. So they started copying the data over into another system, often called a data warehouse or a data mart. As part of the copying process, they would change the database schema to be optimal for the analytics queries they needed to do.

At first, OLTP databases worked reasonably well for analytics needs (as long as they ran analytics on a different server than the main OLTP workload). The legacy OLTP vendors included features such as grouping and aggregation in the SQL language to enable more complex analytics. However, the requirements of the analytics systems were different enough that a new breed of technology emerged that could satisfy analytics needs better, with features such as column-storage and read-only scale-out. Thus, the modern data warehouse was born.

The requirements for a data warehouse were the ability to run complex queries very fast; the ability to scale to handle large data sets (orders of magnitude larger than the original data from the OLTP system); and the ability to ingest large amounts of data in batches, from OLTP systems and other sources.

Query Patterns

Unlike the OLTP data access patterns that were relatively simple, the query patterns for analytics are a lot more complicated. Trying to answer a question such as, “Show me the sales of product X, grouped by region and sales team, over the last two quarters,” requires a query that uses more complex functions and joins between multiple data sets.
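A hedged sketch of what that question might look like in SQL (the tables, columns, and product ID are invented for illustration):

    -- Sales of product X, grouped by region and sales team, over the last two quarters
    SELECT r.region_name,
           t.team_name,
           SUM(s.sale_amount) AS total_sales
    FROM sales s
    JOIN regions r     ON s.region_id = r.region_id
    JOIN sales_teams t ON s.team_id   = t.team_id
    WHERE s.product_id = 42
      AND s.sale_date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
    GROUP BY r.region_name, t.team_name
    ORDER BY total_sales DESC;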

These kinds of operations tend to work on aggregates of data records, grouping them across a large amount of data. Even though the result might be a small amount of data, the query has to scan a large amount of data to get to it.

Picking the right query plan to optimally fetch the data from disk requires a query optimizer. Query optimization has evolved into a specialty niche within the realm of computer science; there are only a small number of people in the world with deep expertise in it. This specialization is key to the performance of database queries, especially in the face of large data sets.

Building a really good query optimizer and query execution system in a distributed database system is hard. It requires a number of sophisticated components including statistics, cardinality estimation, plan space search, the right storage structures, fast query execution operators, intelligent shuffle, both broadcast and point-to-point data transmission, and more. Each of these components can take months or years of skilled developer effort to create, and more months and years to fine-tune.

Scaling

Datasets for data warehouses can get quite big. This is because you are not just storing a copy of the current transactional data, but taking a snapshot of the state periodically and storing each snapshot going back in time.

Businesses often have a requirement to go back months, or even years, to understand how the business was doing previously and to look for trends. So while operational data sets range from a few gigabytes (GBs) to a few terabytes (TBs), a data warehouse ranges from hundreds of GBs to hundreds of TBs. For the raw data in the biggest systems, data sets can reach petabytes (PBs).

For example, imagine a bank that is storing the transactions for every customer account. The operational system just has to store the current balance for the account. But the analytics system needs to record every transaction in that account, going back for years.

As the systems grew into the multiple TBs, and into the PB range, it was a struggle to get enough computing and storage power into a single box to handle the load required. As a result, a modern data warehouse needs to be able to scale out to store and manage the data.

Scaling out a data warehouse is easier than scaling an OLTP system. This is because scaling queries is easier than scaling changes – inserts, updates, and deletes. You don’t need as much sophistication in your distributed transaction manager to maintain consistency. But the query processing needs to be aware of the fact that data is distributed over many machines, and it needs to have access to specific information about how the data is stored. Because building a distributed query processor is not easy, there have been only a few companies who have succeeded at doing this well.

Getting the Data In

Another big difference is how data is put into a data warehouse. In an OLTP system, data is entered by a user through interaction with the application. With a data warehouse, by contrast, data comes from other systems programmatically. Often, it arrives in batches and at off-peak times. The timing is chosen so that the work of sending data does not interfere with the availability of the OLTP system where the data is coming from.

Because the data is moved programmatically by data engineers, you don’t need the database platform to enforce constraints on the data to keep it consistent. Because it comes in batches, you want an API that can load large amounts of data quickly. (Many data warehouses have specialized APIs for this purpose.)
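A nightly batch load into a warehouse table, for example, often looks something like this (MySQL-style syntax; the file path and table name are placeholders):

    -- Bulk-load a day's worth of transactions exported from the OLTP system
    LOAD DATA INFILE '/staging/transactions_export.csv'
    INTO TABLE fact_transactions
    FIELDS TERMINATED BY ','
    LINES TERMINATED BY '\n'
    IGNORE 1 LINES;  -- skip the header row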

Lastly, the data warehouse is not typically available for queries during data loading. Historically, this process worked well for most businesses. For example, in a bank, customers would carry out transactions against the OLTP system, and the results could be batched and periodically pushed into the analytics system. Since statements were only sent out once a month, it didn’t matter if it took a couple of days before the data made it over to the analytics system.

So the result is a data warehouse that is queryable by a small number of data analysts. The analysts run a small number of complex queries during the day, and the system is offline for queries while loading data during the night. The availability and reliability requirements are lower than an OLTP system because it is not as big a deal if your analysts are offline. You don’t need transactions of the type supported by the OLTP system, because the data loading is controlled by your internal process.

The NoSQL Work Around

For more information on this topic, read our previous blog post: Thank You for Your Help, NoSQL, But We Got It from Here.

As the world “goes digital,” the amount of information available increases exponentially. In addition, the number of OLTP systems has increased dramatically, as has the number of users consuming them. The growth in data size and in the number of people who want to take advantage of the data has outstripped the capabilities of legacy databases to manage. As scale-out patterns have permeated more and more areas within the application tier, developers have started looking for scale-out alternatives for their data infrastructure.

In addition, the separation of OLTP and OLAP has meant that a lot of time, energy, and money go into extracting, transforming, and loading data – widely known as the ETL process – between the OLTP and OLAP sides of the house.

ETL is a huge problem. Companies spend billions of dollars on people and technology to keep the data moving. In addition to the cost, the consequence of ETL is that users are guaranteed to be working on stale data, with the newest data up to a day old.

With the crazy growth in the amount of data – and in demand for different ways of looking at the data – the OLAP systems fall further and further behind. One of my favorite quotes, from a data engineer at a large tech company facing this problem, is: “We deliver yesterday’s insights, tomorrow!”

NoSQL came along promising an end to all this. NoSQL offered:

  • Scalability. NoSQL systems offered a scale-out model that broke through the limits of the legacy database systems.
  • No schema. NoSQL abandoned schema for unstructured and semi-structured formats, abandoning the rigid data typing and input checking that make database management challenging.
  • Big data support. Massive processing power for large data sets.

All of this, though, came at several costs:

  • No schema, no SQL. The lack of schema meant that SQL support was not only lacking from the get-go, but hard to achieve. Moreover, NoSQL application code is so intertwined with the organization of the data that application evolution becomes difficult. In other words, NoSQL systems lack the data independence found in SQL systems.
  • No transactions. It’s very hard to run traditional transactions on unstructured or semi-structured data. So data was left unreconciled, but discoverable by applications, which would then have to sort things out.
  • Slow analytics. Many of the NoSQL systems made it very easy to scale and to get data into the system (i.e., the data lake). While these systems did allow the ability to process larger amounts of data than ever before, they are pretty slow. Queries could take hours or even tens of hours. It was still better than not being able to ask the question at all, but it meant you had to wait a long while for the answer.

NoSQL was needed as a complement to OLTP and OLAP systems, to work around the lack of scaling. While it had great promise and solved some key problems, it did not live up to all its expectations.

The Emergence of Modern Databases

With the emergence of NewSQL systems such as MemSQL, much of the rationale for using NoSQL in production has dissipated. We have seen many of the NoSQL systems try to add back important, missing features – such as transaction support and SQL language support – but the underlying NoSQL databases are simply not architected to handle them well. NoSQL is most useful for niche use cases, such as a data lake for storing large amounts of data, or as a kind of data storage scratchpad for application data in a large web application.

The core problems still remain. How do you keep up with all the data flowing in and still make it available instantly to the people who need it? How can you reduce the cost of moving and transforming the data? How can you scale to meet the demands of all the users who want access to the data, while maintaining an interactive query response time?

These are the challenges giving rise to a new workload, operational analytics. Read our upcoming blog post to learn about the operational analytics workload, and how NewSQL systems like MemSQL allow you to handle the challenges of these modern workloads.


Emmy-Winning SSIMWAVE Chooses MemSQL for Scalability, Performance, and More


Feed: MemSQL Blog.
Author: Floyd Smith.

SSIMWAVE customers work to some of the highest standards in the world – from film producers to network engineers to media business executives. They demand to work with the best. SSIMWAVE also works at that level, as the company’s 2015 Emmy award for engineering achievement demonstrates. They demand the same high standards of their technology vendors and partners. For SSIMWAVE’s rather comprehensive analytics needs, only one database makes the grade: MemSQL.

SSIMWAVE has unique technology and unique analytics needs. SSIMWAVE mimics the human visual system, enabling the software to quantify the quality of video streams, as perceived by viewers, into a single viewer score. Video delivery systems can then be architected, engineered, and configured to manage against this score. This score correlates strongly to what actual human beings would perceive the video quality to be. This allows SSIMWAVE users to make informed trade-offs among resources and perceived quality, automatically or manually, and all in real time.

SSIMWAVE Cracks the Code

According to Cisco, video data accounted for 73 percent of Internet traffic in 2017, a share that is projected to grow to 82 percent by 2022. Maximizing the quality of this video content, with the least bandwidth usage and at the lowest cost possible, is one of the most important engineering, business, and user experience issues in the online world.

The barrier to balancing video quality against compression has been that only human beings could accurately assess the quality of a given video segment when it was compressed, then displayed on different devices. Further complicating the picture (no pun intended) is the fact that people, when asked to rate video quality, give different answers with varying levels of consistency over time. This has meant that a panel of several people was needed to render a useful assessment. As a result, a software engineer or operations person wanting to process and deliver data within acceptable levels didn’t have a reliable, affordable method for knowing how much was just enough, without serious compromise to the viewer’s experience.

SSIMWAVE uses MemSQL for video streaming analytics to drive improved customer experiences.
The SSIMWAVE website demonstrates what the company’s breakthrough algorithm and technology can enable for the media & entertainment industry.

SSIMWAVE appears to have cracked the code on this problem with its proprietary SSIMPLUS® algorithm, described on their website, which provides capabilities not found elsewhere. The company’s technology assesses video quality with a single, composite number that achieves a correlation greater than 90 percent between machine assessment and subjective human opinion scores. With this technology, video professionals can make much more efficient use of network resources, while consistently maintaining the desired level of quality.

SSIMWAVE users are able to achieve significant bandwidth savings by configuring to deliver on a viewer score. The company’s customers include the largest IPTV providers in the US and Canada. Their platform affects the streams of tens of millions of subscribers in North America. MemSQL already has a strong position in media and communications solutions, including having Comcast as a customer, and it was natural for SSIMWAVE to consider MemSQL for its own analytics needs.

SSIMWAVE’s Need for State-of-the-Art Analytics

SSIMWAVE’s business is, in the end, all about numbers. For the company to deliver a complete and reliable service, it needs a high-performance database that can store very large quantities of data and respond very quickly to ad hoc analytics queries.

SSIMWAVE has ambitious analytics goals. In addition to comprehensive internal requirements, it needs to offer state-of-the-art analytics capabilities to customers.

SSIMWAVE needs both up-to-the-moment reporting, on data volumes that will increase exponentially as new data streams in, and the ability to retain all that data to meet customer service level agreements (SLAs).

SSIMWAVE Chooses MemSQL

SSIMWAVE was ready for an innovative solution, and compared the three technologies that seemed most likely to meet its requirements.

The database assessment was led by Peter Olijnyk, Director of Technology at SSIMWAVE. Peter has 20 years of experience as a software developer, architect, and engineering leader, along with a passion for playing guitar in his rock band.

Olijnyk and his team at SSIMWAVE found the choice relatively easy, and decided on MemSQL. Among the key considerations were:

  • Scalability. SSIMWAVE needs a seamlessly scalable database, as its business needs may drive it to arbitrarily large scale requirements. MemSQL’s distributed architecture fits the bill.
  • Performance. SSIMWAVE needs high performance for its own internal needs, but also for its customers, who will be using the SSIMWAVE data architecture.
  • Ease of setup. SSIMWAVE was able to use MemSQL’s documentation to get its first cluster running easily, in a matter of hours. This ease of setup and comprehensibility will extend to SSIMWAVE customers.
  • Direct SQL queries. SSIMWAVE needs a tool with integrations to third party tools, allowing for direct SQL queries which are fast and responsive.
  • Rowstore and columnstore support. Although its current use case is “99 percent columnstore,” SSIMWAVE likes having the door open to rowstore use cases with MemSQL.
  • Data streaming architecture support. MemSQL works smoothly with leading stream-processing software platforms, including support for exactly-once updates. The benefit of MemSQL is its ability to scale out, enabling very high levels of performance.
  • Wide range of integrations. MemSQL supports a wide range of integrations, including the MySQL wire protocol and other standard interfaces. “We use the ODBC interface in a standard way,” said Olijnyk. “We have found MemSQL’s ODBC interface to be customizable and flexible.”
MemSQL interoperates with a wide range of other technologies for ingress and analytics.
MemSQL’s wide range of integrations is important
to SSIMWAVE and other MemSQL customers.

“The main thing that tipped the scales was the ease of use and out-of-box experience,” according to Olijnyk. “We went from reading about MemSQL to having clusters running in a matter of hours.”

“We implement real-time data streaming and MemSQL for ingest and query response,” he reports. “Also, we recently needed a way to share state across our architecture. We considered ZooKeeper and Redis, but we ended up using MemSQL rowstore, because it gives us such high performance.”

The move to this architecture for SSIMWAVE was never far from Olijnyk’s mind. “We prioritize ease of use and ease of installation. We have to concern ourselves with this approach; otherwise, costs and support effort would rise quickly. The fewer technicians we have to manage to support our customers, the better.”

SSIMWAVE Implements MemSQL

SSIMWAVE currently has MemSQL up and running. It’s “rolling out a single instance and will be ready for production within weeks.”

To see the benefits of MemSQL for yourself, you can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

150 Days Later: How Customers are Using MemSQL for Free


Feed: MemSQL Blog.
Author: Kristi Lewandowski.

On November 6, 2018, we made our then-new production product, MemSQL 6.7, free to use, up to certain limits, described below. To date, we’ve had more than 4,500 sign-ups for this version, with numerous individuals and companies using MemSQL to do some amazing things.

To quickly recap, with MemSQL any customer can sign up and start using our full featured product, with all enterprise capabilities, including high availability and security, for free.

In talking to customers about their experience and what they’ve built, feedback has been astounding, with folks telling us they can’t believe what we’re giving away for free.

We originally stated that what could be done for free with MemSQL, legacy database companies would charge you up to $100,000 a year. Now, we want to tell you what companies are actually doing with MemSQL.

Culture Amp

Culture Amp is an employee feedback platform and was looking for a way to improve data-driven decision making with internal reporting tools. The company’s initial solution had low adoption due to poor performance, mostly due to a slow database engine.

Scott Arbeitman, analytics and data engineering lead at Culture Amp, started investigating a better database to power its reports. “Trying MemSQL for free was a no-brainer,” according to Scott.

The results were outstanding. Tests running the company’s new data model on MemSQL versus the previous database showed an improvement of more than 28x in speed. This speed-up increased reporting usage, made everyone more productive, and helped Culture Amp incorporate more data into the decision-making process.

Nikkei

Nikkei is a large media corporation based in Japan that shares news about Asia in English. Nikkei needed a better way to get real-time analytics on the readers coming to its website and its widely used mobile app.

Having better reader data allows Nikkei to understand what articles are resonating with readers, and what type of ads to show readers. With its previous database, Nikkei was only able to get reader analytics 45 minutes after someone was on the site. Now, with MemSQL, Nikkei is able to get analytics on readers in about 200 milliseconds. That’s an improvement of 13,500 times – four orders of magnitude – in performance. This allows Nikkei to actually respond to their site visitors’ activities in real time.

Because MemSQL is compatible with MySQL, Nikkei is easily able to integrate the data collected into MemSQL with its other databases. The company is getting all of this performance improvement and flexibility for free.

How to Use MemSQL for Free

If you missed the original announcement, here is a quick recap on what you get when using MemSQL for free:

  • You can create four leaf nodes, each with a dedicated CPU and up to 32GB of RAM.
  • You also create a single aggregator node, which serves as the master aggregator.
  • You can create as many rowstore tables, stored entirely in-memory, as you like. If your database is entirely made up of rowstore tables, and you have 128GB of RAM in your leaf nodes, that’s the total database size limit.
  • You can also create as many columnstore tables, stored on disk, as you like. A mostly-columnstore database might comfortably reach 1TB in size.
  • Free use of the product includes community support. For dedicated professional support, or more than four leaf nodes, you’ll need a paid license.

Response to the free tier has been positive. Most experimental, proof of concept, and in-house-only projects run easily within a four-(leaf)-node configuration, and don’t require professional support. For projects that move to production, both dedicated professional support and the ability to add more nodes – in particular, for paired development and production environments, and for robust high availability implementations – make sense, and are enabled by an easy switch to our enterprise offering.

The case study snapshots for Culture Amp and Nikkei are good examples of what our customers have accomplished while using MemSQL for free. It’s always fun sharing the benefits our customers achieve with MemSQL versus other databases, but we get even more excited showing these performance increases when they’re achieved using the free option.

We consider all users of MemSQL to be our customers, whether you’re starting with us for the first time for free or you’ve been with MemSQL from the early days. These are just some examples of the cool work happening with free use of MemSQL. To get performance improvements similar to Culture Amp and Nikkei, all you have to do is download MemSQL.

Webinar: How Kafka and MemSQL Deliver Intelligent Real-Time Applications


Feed: MemSQL Blog.
Author: Floyd Smith.

Using Apache Kafka and MemSQL together makes it much easier to create and deliver intelligent, real-time applications. In a live webinar, which you can view here, MemSQL’s Alec Powell discussed the value that Kafka and MemSQL each bring to the table, showed reference architectures for solving common data management problems, and demonstrated how to implement real-time data pipelines with Kafka and MemSQL.

Kafka is an open source messaging queue that works on a publish-subscribe model. It’s distributed (like MemSQL) and durable. Kafka can serve as a source of truth for data across your organization.

What Kafka Does for Enterprise IT

Today, enterprise IT is held back by a few easy to identify, but seemingly hard to remedy, factors:

  • Slow data loading
  • Lengthy query execution
  • Limited user access

Kafka and MemSQL share complementary characteristics that address these factors.

These factors interact in a negative way. Limited data messaging and computing capabilities limit user access. The users who do get on suffer from slower data loading and lengthy query execution. Increasing organizational needs for data access – for reporting, business intelligence (BI) queries, apps, machine learning, and artificial intelligence – are either blocked, preventing progress, or allowed, further overloading the system and slowing performance further.

Organizations try a number of fixes for these problems – offered by both existing and new vendors, usually with high price tags for solutions that add complexity and provide limited relief. Solutions include additional CPUs and memory, specialized hardware racks, pricey database add-ons, and caching tiers with limited data durability, weak SQL coverage, and high management costs and complexity.

NoSQL solutions offer fast ingest and scalability. However, they run queries slowly, demand significant developer time for even basic query optimization, and break compatibility with BI tools.

How MemSQL and Kafka Work Together

MemSQL offers a new data architecture that solves these problems. Unlike NoSQL solutions, MemSQL offers both scalability – which affords extreme performance – and an easy-to-use SQL architecture. MemSQL is fully cloud-native; it is neither tied to just one or two cloud platforms, nor cloud-unfriendly, as with most alternatives.

MemSQL is a good citizen in all kinds of modern and legacy data management deployments.

In the webinar, Alec shows how MemSQL works. Running as a Linux daemon, MemSQL offers a fully distributed system, and is cloud-native – running in the cloud and on-premises, in containers or virtual machines, and integrating with a wide range of existing systems. Within a MemSQL cluster, an aggregator node communicates with the database client, manages schema, and shares work across leaf nodes. (A master aggregator serves as a front-end to multiple aggregator nodes, if the scale of the database requires it.)

MemSQL runs multiple aggregator and leaf nodes to distribute work fully across a cluster.
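If you want to inspect this topology on a running cluster, recent MemSQL versions expose it through SQL commands like the ones below (run from a client connected to the master aggregator; the output columns vary by version):

    SHOW AGGREGATORS;  -- lists the master aggregator and any child aggregators
    SHOW LEAVES;       -- lists the leaf nodes that store and process the data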

MemSQL Pipelines integrate tightly with Kafka, supporting the exactly-once semantics for which Kafka has long been well-known. (See the announcement blog post in The New Stack: Apache Kafka 1.0 Released Exactly Once.) MemSQL polls for changes, pulls in new data, and executes transactions atomically (and exactly once). Pipelines are mapped directly to MemSQL leaf nodes for maximum performance.
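A minimal sketch of wiring a Kafka topic into MemSQL with a pipeline follows; the broker address, topic, and table names are placeholders, and the exact LOAD DATA options depend on your MemSQL version and data format:

    -- Continuously ingest a Kafka topic into a MemSQL table
    CREATE PIPELINE clicks_pipeline AS
    LOAD DATA KAFKA 'kafka-broker:9092/clickstream'
    INTO TABLE clicks
    FIELDS TERMINATED BY ',';

    START PIPELINE clicks_pipeline;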

MemSQL Pipelines take data from Kafka, et al, optionally transform it, and store it - fast.

Together, Kafka and MemSQL allow live loading of data, which is a widely needed, but rarely found capability. Used with Kafka, or in other infrastructure, MemSQL handles mixed workloads and meets tight SLAs for responsiveness – including with streaming data and strong demands for concurrency.

Kafka-MemSQL Q&A

There was a lively Q&A session. The questions and answers here include some that were handled in the webinar and some that could not be answered in the live webinar because of time constraints.

Q. Can Kafka and MemSQL run in the cloud?
A. Both Kafka and MemSQL are cloud-native software. Roughly half of MemSQL’s deployments today are in the cloud; for instance, MemSQL often ingests data from AWS S3, and has been used to replace Redshift. The cloud’s share of MemSQL deployments is expected to increase rapidly in the future.

Q. Can MemSQL replace Oracle?
A. Yes, very much so – and other legacy systems too. Because of the complexities of many data architectures, however, MemSQL is often used first to augment Oracle. For instance, customers will use a change data capture (CDC) to copy data processed by Oracle to MemSQL. Then, analytics run against MemSQL, offloading Oracle (so transactions run faster) and leveraging MemSQL’s faster performance, superior price-performance, scalability, and much greater concurrency support for analytics.

Q. How large can deployments be?
A. We have customers running from the hundreds of megabytes up into the petabytes.

Q. With MemSQL Pipelines, can we parse JSON records?
A. Yes, MemSQL has robust JSON support.
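As a sketch of that support (hypothetical table; see the MemSQL JSON documentation for the full operator set), a JSON column can be queried directly, with shorthand operators for extracting fields:

    -- Hypothetical table with a JSON column
    CREATE TABLE events (
      event_id BIGINT,
      payload JSON
    );

    -- Extract fields from the JSON payload in a query
    SELECT payload::$user_id     AS user_id,      -- ::$ extracts a string value
           payload::%duration_ms AS duration_ms   -- ::% extracts a numeric value
    FROM events;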

Summing Up Kafka+MemSQL and the Webinar

In summary, MemSQL offers live loading of batch and streaming data, fast queries, and fully scalable user access. Together, Kafka and MemSQL remove barriers to data streaming and data access right across your organization. You can view the webinar now. You can also try MemSQL for free today or contact us to learn how we can help support your implementation plans.

MemSQL Offers Streaming Systems Download from O’Reilly


Feed: MemSQL Blog.
Author: Floyd Smith.

More and more, MemSQL is used to help add streaming characteristics to existing systems, and to build new systems that feature streaming data from end to end. Our new ebook excerpt from O’Reilly introduces the basics of streaming systems. You can then read on – in the full ebook and here on the MemSQL blog – to learn about how you can make streaming part of all your projects, existing and new.

Streaming has been largely defined by three technologies – one that’s old, one that’s newer, and one that’s out-and-out new. Streaming Systems covers the waterfront thoroughly.

Originally, Tyler Akidau, one of the book’s authors, wrote two very popular blog posts: Streaming 101: The World Beyond Batch, and Streaming 102, both on the O’Reilly site. The popularity of the blog posts led to the popular O’Reilly book, Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing.

In the excerpt that we offer here, you will see a solid definition of streaming and how it works with different kinds of data. The authors address the role of streaming in the entire data processing lifecycle with admirable thoroughness.

They also describe the major concerns you’ll face when working with streaming data. One of these is the difference between the order in which data is received and the order in which processing on it is completed. Reducing such disparities as much as possible is a major topic in streaming systems.

Scatter plot of event arrival time vs. processing completion time.
In this figure from the Streaming Systems excerpt, the authors plot event
arrival time vs. processing completion time.

In both the excerpt, and the full ebook, the authors also tackle three key streaming technologies that continue to play key roles in the evolution of MemSQL: Apache Kafka, Apache Spark, and – perhaps surprisingly, in this context – SQL.

Apache Kafka and Streaming

Apache Kafka saw its 1.0 version introduced by Confluent in late 2017. (See Apache Kafka 1.0 Introduced Exactly Once at The New Stack.) MemSQL works extremely well with Kafka. Both Kafka and MemSQL are unusual in supporting exactly-once updates, a key feature that not only adds valuable capabilities, but affects how you think about data movement within your organization.

It’s very easy to connect Kafka streams to MemSQL Pipelines for rapid ingest. And MemSQL’s Pipelines to stored procedures feature lets you handle complex transformations without interfering with the streaming process.
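Roughly, a pipeline can target a stored procedure instead of a table, so each incoming batch is transformed on the way in. The sketch below is illustrative only (hypothetical names; MemSQL’s stored procedure syntax may differ slightly across versions):

    -- Stored procedure that aggregates each incoming batch before writing it
    CREATE OR REPLACE PROCEDURE process_clicks(batch QUERY(user_id BIGINT, url TEXT)) AS
    BEGIN
      INSERT INTO click_counts (user_id, clicks)
      SELECT user_id, COUNT(*) FROM batch GROUP BY user_id
      ON DUPLICATE KEY UPDATE clicks = clicks + VALUES(clicks);
    END;

    -- Point the pipeline at the procedure rather than directly at a table
    CREATE PIPELINE clicks_proc_pipeline AS
    LOAD DATA KAFKA 'kafka-broker:9092/clickstream'
    INTO PROCEDURE process_clicks;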

Bar chart showing Spark and Kafka as recent streaming arrivals.
In this figure from the full Streaming Systems ebook, Kafka
and Spark both appear as relatively recent arrivals.

Apache Spark and Streaming

Apache Spark is an older streaming solution, initially released in 2014. (One of the key components included in the 1.0 release was Spark SQL, for ingesting structured data into Spark.) Spark was first developed to address concerns with Google’s MapReduce data processing approach. While widely used, Spark is perhaps as well known today for its machine learning and AI capabilities as for its core streaming functionality.

MemSQL first introduced the MemSQL Spark Connector in 2015, then included full Spark support in MemSQL Pipelines and Pipelines to stored procedures. Today, Spark and MemSQL work very well together. MemSQL customer Teespring used Kafka and Spark together for machine learning implementations.

A reference architecture shows data streaming in from S3, Redshift, Kafka, and Spark to real-time analytics.
The MemSQL case study for Teespring shows Kafka
and Spark used together for machine learning.

SQL and Streaming

Ironically, one of the foundational data technologies, SQL, plays a big role in Streaming Systems, and in the future of streaming. SQL is all over the full book’s Table of Contents:

  • Streaming SQL is Chapter 8 of the full ebook. In this chapter, the authors discuss how to use SQL robustly in a streaming environment.
  • Streaming Joins is Chapter 9. Joins are foundational to analytics, and optimizing them has been the topic of decades of work in the SQL community. Yet joins are often neglected in the NoSQL movement that is most closely associated with streaming. Streaming Systems shows how to use joins in a streaming environment.

MemSQL is, of course, a leading database in the NewSQL movement. NewSQL databases combine the best of traditional relational databases – transactions, structured data, and SQL support – with the best of NoSQL: scalability, speed, and flexibility.

Saeed Barghi of MemSQL partner Zoomdata shows Kafka and MemSQL used together for business intelligence.

A reference architecture with data streaming from Confluent Kafka into MemSQL through a Pipeline.
MemSQL partner ZoomData uses Kafka and a
MemSQL Pipeline for real-time data visualization.

Next Steps to Streaming

We recommend that you download and read our book excerpt from Streaming Systems today. If you find it especially valuable, consider getting the full ebook from O’Reilly.

If you wish to move to implementation, you can start with MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.

The Need for Operational Analytics


Feed: MemSQL Blog.
Author: Rick Negrin.

The proliferation of streaming analytics and instant decisions to power dynamic applications, as well as the rise of predictive analytics, machine learning, and operationalized artificial intelligence, have introduced a requirement for a new type of database workload: operational analytics.

The two worlds of transactions and analytics, set apart from each other, are a relic of a time before data became an organization’s most valuable asset. Operational analytics is a new set of database requirements and system demands that are integral to achieving competitive advantage for the modern enterprise.

This new approach was called out by Gartner as a Top 10 technology for 2019, under the name “continuous analytics.” Delivering operational analytics at scale is the key to real-time dashboards, predictive analytics, machine learning, and enhanced customer experiences which differentiate digital transformation leaders from the followers.

However, companies are struggling to build these new solutions because existing legacy database architectures cannot meet the demands placed on them. The existing data infrastructure cannot scale to the load put on it, and it doesn’t natively handle all the new sources of data.

The separation of technologies between the transactional and analytic technologies results in hard tradeoffs that leave solutions lacking in operational capability, analytics performance, or both. There have been many attempts in the NoSQL space to bridge the gap, but all have fallen short of meeting the needs of this new workload.

Operational analytics enables businesses to leverage data to enhance productivity, expand customer and partner engagement, and support orders of magnitude more simultaneous users. But these requirements demand a new breed of database software that goes beyond the legacy architecture.

The industry calls these systems by several names: hybrid transaction and analytics processing (HTAP) from Gartner; hybrid operational/analytics processing (HOAP) from 451 Research; and translytical from Forrester.

Consulting firms typically use the term we have chosen here, operational analytics, and CapGemini has even established a full operational analytics consultancy practice around it.

The Emergence of Operational Analytics

Operational Analytics has emerged alongside the existing workloads of Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP). I outlined the requirements of those workloads in this previous blog entry.

To summarize, OLTP requires data lookups, transactionality, availability, reliability, and scalability, whereas OLAP requires support for running complex queries very fast, handling large data sets, and batch-ingesting large amounts of data. OLTP- and OLAP-based systems served us well for a long time. But over the last few years things have changed.

Decisions should not have to wait for the data

It is no longer acceptable to wait for the next quarter, or week, or even day to get the data needed to make a business decision. Companies are increasingly online all the time; “down for maintenance” and “business hours” are quickly becoming a thing of the past. Companies that have a streaming real-time data flow have a significant edge over their competitors. Existing legacy analytics systems were simply not designed to work like this.

Companies must become insight driven

This means that, instead of a handful of analysts querying the data, you have hundreds or thousands of employees hammering your analytics systems every day in order to make informed decisions about the business. In addition, there will be automated systems – ML/AI and others – also running queries to get the current state of the world to feed their algorithms. The existing legacy analytics systems were simply not designed for this kind of usage.

Companies must act on insights to improve customer experience

Companies want to expose their data to their customers and partners. This improves the customer experience and potentially adds net new capabilities. For example, a cable company tracks users as they try to set up their cable modems so they can proactively reach out if they see there is a problem. This requires a system that can analyze and react in real-time.

Another example is an electronics company that sells smart TVs and wants to expose which shows customers are watching to its advertisers. This dramatically increases the number of users trying to access your analytics systems.

In addition, the expectations of availability and reliability are much higher for customers and partners. So you need a system that can deliver an operational service level agreement (SLA). Since your partners don’t work in your company, it means you are exposing the content outside the corporate firewall, so strong security is a must. The existing legacy analytics systems were simply not designed for this kind of usage.

Data is coming from many new sources and in many types and formats

The amount of data being collected is growing tremendously. Not only is it being collected from operational systems within the company; data is also coming from edge devices. The explosion of IoT devices, such as oil drills, smart meters, household appliances, and factory machinery, is a key contributor to this growth.

All this data needs to be fed into the analytics system. This leads to increased complexity in the types of data sources (such as Kafka, Spark, etc.), as well as in data types and formats (geospatial, JSON, AVRO, Parquet, raw text, etc.) and in throughput requirements for ingesting the data. Again, the existing legacy analytics systems were simply not designed for this kind of usage.

The Rise of Operational Analytics

These changes have given rise to a new database workload, operational analytics. The short description of operational analytics is an analytical workload that needs an operational SLA. Now let’s unpack what that looks like.

Operational Analytics as a Database Workload

Operational analytics primarily describes analytical workloads. So the query shapes and complexity are similar to OLAP queries. In addition, the data sets are just as large as in OLAP, although often it is the most recent data that matters most. (This is usually a fraction of the total data set.) Data loading is similar to OLAP workloads, in that data comes from an external source and is loaded independently of the applications or dashboards that are running the queries.

But this is where the differences end. Operational analytics has several characteristics that set it apart from pure OLAP workloads. Specifically, the speed of data ingestion, scaling for concurrency, availability and reliability, and speed of query response.

Operational analytics workloads require an SLA on how fast the data needs to be available. Sometimes this is measured in seconds or minutes, which means the data infrastructure must allow streaming the data in constantly, while still allowing queries to be run.

Sometimes this means there’s a window of time (usually a single-digit number of hours) during which all the data must be ingested. As data sets grow, the existing data warehouse (DW) technologies have had trouble loading the data within the time window (and certainly don’t allow streaming). Data engineers often have to do complex tricks to continue meeting data loading SLAs with existing DW technologies.

Data also has to be loaded from a larger set of data sources than in the past. It used to be that data was batch-loaded from an operational system during non-business hours. Now data comes in from many different systems.

In addition, data can flow from various IoT devices far afield from the company data center. The data gets routed through various types of technologies (in-memory queues like Kafka, processing engines like Spark, etc.). Operational analytics workloads need to easily handle ingesting from these disparate data sources.

Operational analytics workloads also need to scale to large numbers of concurrent queries. With the drive towards being data driven and exposing data to customers and partners, the number of concurrent users (and therefore concurrent queries) in the system has increased dramatically. In an OLAP workload, five to ten queries at a time was the norm. Operational analytics workloads often must be able to handle high tens, hundreds, or even thousands of concurrent queries.

As in an OLTP workload, availability and reliability are also key requirements. Because these systems are now exposed to customers or partners, the SLA required is a lot stricter than for internal employees.

Customers expect a 99.9% or better uptime and they expect the system to behave reliably. They are also less tolerant of planned maintenance windows. So the data infrastructure backing these systems needs to have support for high availability, with the ability to handle hardware and other types of failure.

Maintenance operations (such as upgrading the system software or rebalancing data) need to become transparent, online operations that are not noticeable to the users of the system. In addition, the system should self-heal when a problem occurs, rather than waiting for an operator to get alerted to an issue and respond.

Strong durability is important as well. This is because even though data that is lost could be reloaded, the reloading may cause the system to break the availability SLA.

The ability to retrieve the data you are looking for very quickly is the hallmark feature of database systems. Getting access to the right data quickly is a huge competitive advantage. Whether it is internal users trying to get insights into the business, or you are presenting analytics results to a customer, the expectation is that the data they need is available instantly.

The speed of the query needs to be maintained regardless of the load on the system. It doesn’t matter if there is a peak number of users online, the data size has expanded, or there are failures in the system. Customers expect you to meet their expectations on every query with no excuses.

This requires a solid distributed query processor that can pick the right plan to answer any query and get it right every time. It means the algorithms used must scale smoothly with the system as it grows in every dimension.

Supporting Operational Analytics Use Cases with MemSQL

MemSQL was built to address these requirements in a single converged system. MemSQL is a distributed relational database that supports ANSI SQL. It has a shared-nothing, scale-out architecture that runs well on industry standard hardware.

This allows MemSQL to scale in a linear fashion simply by adding machines to a cluster. MemSQL supports all the analytical SQL language features you would find in a standard OLAP system, such as joins, group by, aggregates, etc.

It has its own extensibility mechanism so you can add stored procedures and functions to meet your application requirements. MemSQL also supports the key features of an OLTP system: transactions, high availability, self-healing, online operations, and robust security.

It has two storage subsystems: an on-disk column store that gives you the advantage of compression and extremely fast aggregate queries, as well as an in-memory row store that supports fast point queries, aggregates, indices, and more. The two table types can be mixed in one database to get the optimal design for your workload.
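As a sketch of how the two table types can sit side by side in one database (the schema is hypothetical; the syntax follows MemSQL 6.x conventions):

    -- In-memory rowstore table: fast point lookups and updates
    CREATE TABLE current_positions (
      account_id BIGINT,
      symbol VARCHAR(16),
      quantity DOUBLE,
      PRIMARY KEY (account_id, symbol)
    );

    -- On-disk columnstore table: compressed, built for fast scans and aggregates
    CREATE TABLE trade_history (
      account_id BIGINT,
      symbol VARCHAR(16),
      trade_time DATETIME,
      quantity DOUBLE,
      price DOUBLE,
      KEY (trade_time) USING CLUSTERED COLUMNSTORE
    );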

Finally, MemSQL has a native data ingestion feature, called Pipelines, that allows you to easily and very quickly ingest data from a variety of data sources (such as Kafka, AWS S3, Azure Blob, and HDFS). All these capabilities offered in a single integrated system add up to making it the best data infrastructure for an operational analytics workload, bar none.

Describing the workload in general terms is a bit abstract, so let’s dig into some of the specific use cases where operational analytics is the most useful.

Portfolio Analytics

One of the most common use cases we see in financial services is portfolio analytics. Multiple MemSQL customers have written financial portfolio management and analysis systems that are designed to provide premium services to elite users.

These elite users can be private banking customers with high net worth or fund managers who control a large number of assets. They will have large portfolios with hundreds or thousands of positions. They want to be able to analyze their portfolio in real-time, with graphical displays that are refreshed instantaneously as they filter, sort, or change views in the application. The superb performance of MemSQL allows sub-second refresh of the entire screen with real-time data, including multiple tables and charts, even for large portfolios.

These systems also need to scale to hundreds or thousands of users concurrently hitting the system, especially when the market is volatile. Lastly, they need to bring in the freshest market data, without compromising the ability to deliver the strict latency SLAs for their query response times.

They need to do all of this securely without violating relevant compliance requirements nor the trust of their users. High availability and reliability are key requirements, because the market won’t wait. MemSQL is ideal data infrastructure for this operational analytics use case as it solves the key requirements of fast data ingest, high scale concurrent user access, and fast query response.

Predictive Maintenance

Another common use case we see is predictive maintenance. Customers who have services or devices that are running continuously want to know as quickly as possible if there is a problem.

This is a common scenario for media companies that do streaming video. They want to know if there is a problem with the quality of the streaming so they can fix it, ideally before the user notices the degradation.

This use case also comes up in the energy industry. Energy companies have devices (such as oil drills, wind turbines, etc.) in remote locations. Tracking the health of those devices and making adjustments can extend their lifetime and save millions of dollars in labor and equipment to replace them.

The key requirements are the ability to stream the data about the device or service, analyze the data – often using a form of ML that leverages complex queries – and then send an alert if the results show any issues that need to be addressed. The data infrastructure needs to be online 24/7 to ensure there is no delay in identifying these issues.

Personalization

A third use case is personalization. Personalization is about customizing the experience for a customer. This use case pops up in a number of different verticals, such as a user visiting a retail web site, playing a game in an online arcade, or even visiting a brick and mortar store.

The ability to see a user’s activity and, more importantly, learn what is attractive to them, gives you the information to meet their needs more effectively and efficiently. One of MemSQL’s customers is a gaming company. They stream information about the user’s activity in the games, process the results against a model in MemSQL, and use the results to offer the user discounts for new games and other in-app purchases.

Another example is a popular music delivery service that uses MemSQL to analyze usage of the service to optimize ad spend. The size of data and the number of employees using the system made it challenging to deliver the data in a timely way to the organization and allow them to query the data interactively. MemSQL significantly improved their ability to ingest and process the data and allowed their users to get a dramatic speedup in their query response times.

Summary

Operational analytics is a new workload that encompasses the operational requirements of an OLTP workload – data lookups, transactionality, availability, reliability, and scalability – as well as the analytical requirements of an OLAP workload – large data sets and fast queries.

Coupled with the new requirements of high user concurrency and fast ingestion, the operational analytics workload is tough to support with a legacy database architecture or by cobbling together a series of disparate tools. As businesses continue along their digital transformation journey, they are finding that more and more of their workloads fit this pattern, and they are searching for modern data infrastructure, like MemSQL, with the performance and scale to handle them.

How to Use MemSQL with Intel’s Optane Persistent Memory

Feed: MemSQL Blog.
Author: Eric Hanson.

Intel’s new Optane DC persistent memory adds a new performance option for MemSQL users. After careful analysis, we’ve identified one area in which MemSQL customers and others can solve a potentially urgent problem using Optane today, and we describe that opportunity in this blog post.

We also point out other areas where MemSQL customers and others should keep an eye on Optane-related developments for the future. If you need broader information than what’s offered here, there are many sources for more comprehensive information about Optane and what it can do for you, beginning with Intel itself.

Optane persistent memory fits below DRAM but above SSDs as a new form of memory.
Intel Optane extends a system’s “hot tier” with slower, but persistent and capacious
memory. SSD serves as a warm tier, and HDD and tape as the cold tier.

What Optane Offers

Intel’s Optane is a new kind of memory option that’s less expensive than RAM, and also nonvolatile. Optane is only available on systems running some of the latest hardware and software from Intel.

Optane offers a new way to improve the performance of systems, especially servers. However, it will take the industry – from hardware vendors, to operating system developers, to application developers – a good while to take full advantage of it.

Optane runs in two modes, each with different advantages and trade-offs: memory mode, which we will concentrate on here, and application mode. In memory mode, the computer’s RAM is used as a fast, volatile cache above larger, lower-cost Optane memory modules, managed entirely by Intel’s built-in memory management. In this mode, taking advantage of Optane does not require any changes to application code.

In application mode, Optane can be explicitly written to and read from, separate from traditional RAM or hard disk storage. We will only discuss application mode briefly here.

Intel Optane persistent memory has both memory mode, used with MemSQL, and application mode.
MemSQL databases with large rowstore tables can benefit from systems
that offer Intel Optane persistent memory in memory mode.

Optane has two features that make it interesting: price and capacity. First, Optane is much cheaper, on a like-for-like basis, than traditional RAM. For traditional DRAM, a 128GB ECC DDR4 module is the largest available today, and costs roughly $4,500. For Optane, a 128GB DIMM is the smallest available, and costs about $850 at this writing – more than 80% less than traditional RAM.

Of just as much interest, however, is capacity. Optane offers DIMMs up to 512GB in capacity – four times the capacity of the largest traditional DRAM module. A 512GB Optane DIMM costs about $7,000, roughly eight times the price of the 128GB Optane DIMM. However, the 512GB Optane DIMM is still less than half the price of four 128GB ECC DDR4 modules of traditional RAM, which total about $18,000.

A Brief Mention of Application Mode

Use cases for Optane in application mode are interesting, but we won’t go into much depth on that here. When running Optane in application mode, since you can address Optane memory directly, you can treat it almost as a replacement for disk. That’s because writes to the Optane memory become permanent, or “persistent.”

The main benefit of application mode is persistence. This allows databases that keep information in memory to restart faster, and experiment with new architectures for making update transactions permanent. MemSQL’s scale-out architecture already allows us to restart quite fast, since each processor core reads data locally to re-populate RAM, in a fully parallel fashion. Restart time has not been a significant problem for our customers to date. So while we are intrigued by the possibilities of using application mode for MemSQL, we have not so far made changes to our code to use it.

Under the Covers of Memory Mode

What’s more widely interesting for now is the use of Optane in memory mode. When you add Optane persistent memory to your system, the total Optane capacity acts as your addressable RAM capacity. Traditional RAM then serves as a cache over the Optane persistent memory.

In memory mode, you need to consider three factors in gauging performance:

  • Traditional RAM performance. Adding Optane does not change performance when the data you need resides in traditional RAM. The Intel memory controller and drivers work to maximize the number of times needed data is fetched from traditional RAM rather than Optane memory.
  • Optane memory performance. When you add Optane, and application code and data in your address space exceed the capacity of traditional RAM, some RAM accesses will miss in traditional RAM. These accesses will then be served by Optane memory instead. The total time to access will be the traditional RAM access time plus the Optane memory access time, which can be a multiple of the traditional RAM access time.
  • Cache hit percentage. When you access data in RAM, some of it will be served by a cache hit – the data is in traditional RAM – and some by a cache miss – the access to traditional RAM fails, and an additional access to Optane memory is required. The percentage of cache hits determines whether you get something close to traditional RAM performance or something closer to Optane memory performance.

For most applications, with Optane memory at perhaps four times the size of traditional RAM, the overall performance reduction may average around 25%, compared to the same total RAM size made up of traditional RAM only. This applies across the board, including for queries as well as size-of-data operations like backup, restore, and snapshots. But the traditional RAM + Optane memory combination will be much less expensive – and, as we describe below, you can get far more Optane memory into a single server than traditional RAM.

Because no one has to rewrite their software to take advantage of memory mode, and because operationally savvy users can take full advantage of Optane against suitable workloads, it’s expected that Optane will be very popular with cloud providers and in large data centers. These deployments can offer large cost savings for the server owners, with an impact on end users that may not even be noticeable.

The “Hidden” Feature of Optane – and the Impact on MemSQL Customers

What’s most exciting about Optane, when used in memory mode, is not necessarily the direct cost savings on like-for-like servers. The exciting part is that the total RAM capacity of a single machine can be multiplied by a large amount. For instance, where an organization currently supports servers with 256GB of RAM, it can cost-effectively add support for servers with 256GB of traditional RAM plus, perhaps, 1TB of Optane memory. Where logical RAM capacity is the gating factor for the number of servers used, one such Optane-enabled server may be able to replace up to four traditional servers.

This applies directly to MemSQL customers who have very large rowstore tables. If the total rowstore table size is 2TB, it would take eight or nine servers with 256GB of RAM to support the leaf nodes for the tables, plus additional servers as aggregators (this assumes no redundancy — using high availability with “redundancy 2” would double the number of needed servers).

It would only require two leaf node servers, though, if each is provisioned with a mix of 256GB of traditional RAM and 1TB of Optane memory. The total hardware cost would be far less than for eight servers sporting traditional RAM, and management and maintenance of the server pool would be far easier.
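
To see whether your own deployment is in this range, you can estimate the per-database in-memory table footprint from MemSQL’s metadata. This is a rough sketch; it assumes the information_schema.table_statistics view with a memory_use column (check the documentation for your version), and it counts all in-memory table use, not only rowstore tables:

-- Approximate in-memory table footprint per database, in GB.
SELECT database_name,
       SUM(memory_use) / (1024 * 1024 * 1024) AS memory_gb
FROM information_schema.table_statistics
GROUP BY database_name
ORDER BY memory_gb DESC;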

MemSQL is so efficient that it’s common for a server that’s maxed out on RAM to be only using 20-30% of its CPU capacity. However, past RAM limitations prevented some customers from getting more useful performance from each server. With Optane, effective RAM capacity can be upgraded much further, and quite cost-effectively.

Some MemSQL customers have been held back from deploying applications with truly large rowstore tables, totaling 10TB and more, due to the cost and operational complexity of managing the number of servers that would be required. With Optane, these projects become quite feasible.

These same dynamics apply to other workloads that use rowstore tables in MemSQL, and these are the workloads that will be moved to Optane-powered servers within cloud providers and in large data centers. It’s no accident that launch partners for Optane persistent memory include Google Cloud, as well as server makers like Cisco, Dell EMC, and Lenovo, plus consultancy Accenture.

It’s also possible for Optane to be useful with columnstore tables, or in applications that use a mix of rowstore and columnstore. However, columnstore needs for large databases are already served effectively by SSDs, and it will take careful testing to establish where Optane will or won’t offer a clear price/performance benefit.

What You Should Do Today

If you are a current MemSQL customer, running very large tables (1TB-plus total data size) in rowstore today – or if you have a near-term need to do so – reach out to your MemSQL contacts soon. You may be able to save money on your current deployment and cost-effectively grow your application’s memory footprint and effectiveness.

If you are not running workloads of this type, Optane may not offer immediate benefit, at least for the database part of your applications. As software – everything from operating systems, to business applications, to end-user applications – is rewritten to take advantage of Optane, you will have the opportunity to leverage those improvements by including Optane in your server mix.

If you are not a MemSQL customer, but are running (or contemplating) applications that will require high rates of ingest, high transaction performance, fast query response, and a high degree of concurrent usage, Optane helps make MemSQL an even more attractive option. We encourage you to try MemSQL for free today, or contact us to learn how MemSQL can help with even very ambitious data management goals.

Customer Saves $60K per Month on Move from AWS RDS and Druid.io to MemSQL

Feed: MemSQL Blog.
Author: Floyd Smith.

A MemSQL customer previously ran their business on two databases: the Amazon Web Services Relational Database Service (AWS RDS) for transactions and Druid.io for analytics, with a total bill that reached over $93,000 a month. They moved both functions to MemSQL, and their cost savings have been dramatic – about two-thirds of the total cost.

The customer’s new monthly cost is about $31,000, for a savings of $62,000 a month, or 66%. In addition, the customer gains increased performance, greater concurrency, and easier database management. Future projects can also claim these benefits, giving the customer lower costs for adding new features to their services and greater strategic flexibility.

AWS cost report shows dramatic drop with a move to MemSQL.
A new MemSQL customer reduced their monthly AWS costs by 93%, and their
total monthly costs by two-thirds, with a simple move to MemSQL

The chart above shows the reduction in AWS RDS costs only – from a total of roughly $68,000 per month with AWS RDS and Druid.io, to roughly $5,700 a month with MemSQL replacing both. Licensing costs for Druid.io were roughly $25,000/month, the same as for MemSQL. So total costs dropped from roughly $93,000, with AWS RDS and Druid.io, to roughly $31,000, with MemSQL running in AWS.

In addition to the very large cost savings, strategic flexibility is enhanced by the fact that MemSQL can run in a variety of public and private clouds, on premises, in mixed on-premises and cloud environments, and in virtual machines as well as containers. “Cloud lock-in” becomes a problem of the past.

Moving from AWS RDS to MemSQL

AWS RDS is a flexible service that allows the use of multiple databases. In this case, the customer was using AWS Aurora. Aurora is a MySQL and PostgreSQL-compatible relational database offered by AWS as one of a wide range of database offerings. Because MemSQL is MySQL wire protocol-compatible, the move from Aurora to MemSQL was very easy.

And, because MemSQL is much more flexible than any single AWS offering, it can accomplish many more tasks in a single database. For instance, MemSQL has strong support for both in-memory rowstore and disk-based columnstore data formats, with data moving flexibly between them. In AWS, by contrast, you might need to use either AWS ElastiCache for Memcached or AWS ElastiCache for Redis for in-memory performance, then use AWS Redshift for disk-based storage and analytics.

MemSQL can also be used to augment, rather than replace, Aurora and other AWS offerings. Data stored in Aurora can be copied to MemSQL for analytics, for example. The Aurora database then runs faster because it no longer has to handle analytics inquiries; analytics queries and analytics apps run faster because they have a dedicated MemSQL database, and because of MemSQL’s faster query performance and greater concurrency support.

Conclusion

The customer’s ability to save so much money from such a simple change is somewhat ironic, as one of the claimed selling points of AWS Aurora is: “Performance and availability of commercial databases at 1/10th the cost.” Yet this customer was able to save roughly two-thirds of their costs, and gain better performance, greater concurrency, and strategic flexibility, through a simple and easy move from AWS Aurora and Druid.io to MemSQL.

Contact MemSQL for more information or start using MemSQL for free today!


Introducing the MemSQL Kubernetes Operator

Feed: MemSQL Blog.
Author: Carl Sverre.

Kubernetes has taken the world by storm, transforming how applications are developed, deployed, and maintained. For a time, managing stateful services with Kubernetes was difficult, but that has changed dramatically with recent innovations in the community. Building on that work, MemSQL is pleased to announce the availability of our MemSQL Kubernetes Operator, and our certification by Red Hat to run on the popular OpenShift container management platform.

Kubernetes has quickly become one of the top three most-loved platforms by developers. Now, with the MemSQL Kubernetes Operator, technology professionals have an easy way to deploy and manage an enterprise-grade operational database with just a few commands.

The new Operator is certified by Red Hat to run MemSQL software on Red Hat OpenShift, or you can run it with any Kubernetes distribution you choose. Running MemSQL on Kubernetes gives data professionals the highest level of deployment flexibility across hybrid, multi-cloud, or on-premises environments. As Julio Tapia, director of the Cloud Platforms Partners Ecosystem for Red Hat, put it in our press release, services in a Kubernetes-native infrastructure “‘just work’ across any cloud where Kubernetes runs.”

As a cloud-native database, MemSQL is a natural fit for Kubernetes. MemSQL is a fully distributed database, deploys and scales instantly, and is configured quickly and easily using the native MemSQL API. MemSQL customers have requested the Kubernetes Operator, and several participated in testing prior to this release.

The majority of MemSQL customers today deploy MemSQL on one or more public cloud providers. Now, with the Kubernetes Operator, they can deploy on any public or private infrastructure more easily.

How to Use the MemSQL Kubernetes Operator

You use the MemSQL Kubernetes Operator like other standard Kubernetes tools. Use the Kubernetes command-line interface (CLI) and the Kubernetes API to interact with the Operator and manage the application. The task of managing the cluster is greatly simplified. DevOps and administration teams can also use the Operator to implement partial or full automation.

The Operator enables you to create, read, update, and delete MemSQL clusters. Among the options you specify (see here for details):

  • The cluster size. Cluster size is defined in units, where one unit of cluster size is equal to one leaf node.
  • The memory and CPU assigned. This is defined as height, where one unit of height equals 8 vCPUs and 32GB of RAM.
  • The redundancy level. Level 1 is no redundancy, level 2 is one redundant copy (recommended for production use).
  • The storage size. How much disk space you want to reserve.

Because Kubernetes is a declarative, rather than imperative, environment, you describe the state of the cluster that you want Kubernetes to create and maintain. Kubernetes then maintains that state for you. The commands and operations are the same across all the major public clouds, private clouds, and on-premises installations as well.

The minimum cluster size you should specify is a single leaf unit with height 1, three aggregator units (automatically created, with height 0.5), and redundancy level 2. When you create the cluster, a DDL endpoint is returned to you. You connect to the cluster using the DDL endpoint.
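
Once connected to the DDL endpoint with any MySQL-compatible client, you can confirm that the cluster matches what you asked for, for example:

-- List the aggregator and leaf nodes the Operator provisioned.
SHOW AGGREGATORS;
SHOW LEAVES;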

The MemSQL Kubernetes Operator does not currently include the ability to split and merge partitions. You will need to perform this function manually, outside of Kubernetes. We expect to include partition management in a future release.

Next Steps

If you’re already a MemSQL customer, you can begin using the Kubernetes Operator today. Access it here.

If you are not already a customer, you’ll find MemSQL a great fit for a wide range of operational analytics use cases. Try MemSQL for free today or contact us to learn how we can help you.

Mapping and Reducing the State of Hadoop

Feed: MemSQL Blog.
Author: Jacky Liang.

In this blog post, part one of a two-part series, we look at the state of Hadoop from a macro perspective. In the second part of this series, we will look at how Hadoop and MemSQL can work together to solve many of the problems described here.

2008 was a big year for Apache Hadoop. With the rise of the mobile and desktop web, it appeared organizations had finally found the panacea for working with exploding quantities of data.

Yahoo launched the world’s largest Apache Hadoop production application. They also won the “terabyte sort” benchmark, sorting a terabyte of data in 209 seconds. Apache Pig – a language that makes it easier to query Hadoop clusters – and Apache Hive – a SQL-ish language for Hadoop – were actively being developed, by Yahoo and Facebook respectively. Cloudera, now the biggest software and services company for Apache Hadoop, was also founded.

Data sizes were exploding with the continued rise of web and mobile traffic, pushing existing data infrastructure to its absolute limits. As a result, the term “big data” was coined around this time too.

Then came Hadoop, promising organizations that it could answer any question they had about their data.

The promise: You simply need to collect all your data in one location and run it on free Apache Hadoop software, using cheap scalable commodity hardware. Hadoop also introduced the concept of the Hadoop Distributed File System (HDFS), allowing data to be spanned over many disks and servers. Not only is the data stored, but it’s also replicated 2 – 3 times across servers, ensuring no data loss even when a server goes down. Another benefit to using Hadoop is that there is no limit to the sizes of files stored in HDFS, so you can continuously append data to the files, as in the case of server logs.

Facebook claimed to have the largest Hadoop cluster in the world, at 21 petabytes of storage, in 2010. By 2017, more than half of the Fortune 50 companies were using Hadoop. Cloudera and Hortonworks became multi-billion dollar public companies. For an open source project that had only begun in 2006, Hadoop became a household name in the tech industry in the span of under a decade.

The only direction is up for Hadoop, right?

However, many industry veterans and experts are saying Hadoop perhaps isn’t the panacea for big data problems that it’s been hyped up to be.

Just last year in 2018, Cloudera and Hortonworks announced their merger. The CEO of Cloudera announced an optimistic message about the future of Hadoop, but many in the industry disagree.

“I can’t find any innovation benefits to customers in this merger,” said John Schroder, CEO and Chairman of the Board at MapR. “It is entirely about cost cutting and rationalization. This means their customers will suffer.”

Ashish Thusoo, the CEO of Qubole, also has a grim outlook on Hadoop in general — “the market is evolving away from the Hadoop vendors – who haven’t been able to fulfill their promise to customers”.

While Hadoop promised the world a single data store for all of your data, on cheap and scalable commodity hardware, the reality of operationalizing that data was not so easy. After speaking with a number of data experts at MemSQL, reading articles by industry experts, and reviewing surveys from Gartner, we noticed a number of things that are slowing Hadoop growth and deployment within existing enterprises. The data shows that the rocketship growth of Hadoop has been partly driven by fear of being left behind, especially among technology executives – the ones who overwhelmingly initiate Hadoop adoption, with 68% of adoption initiated within the C-suite, according to Gartner. We will also explore the limitations of Hadoop in various use cases, especially in this ever-changing enterprise data industry.

Let’s dive in.

Hype

In a Gartner survey released in 2015, the research firm says that an important point to look at with Hadoop adoption is the low number of Hadoop users per organization, which indicates that “Hadoop is failing to live up to its promise.”

Gartner says that hype and market pressure were among the main reasons for interest in Hadoop. This is not a surprise to many, as it’s hard to avoid hearing Hadoop and big data in the same sentence. Gartner offers the following piece of advice for the C-suite interested in deploying Hadoop:

“CEOs, CIOs and CTOs (either singularly or due to pressure from their boards) may feel they are being left behind by their competitors, based on press and hype about Hadoop or big data in general. Being fresh to their roles, the new chief of innovation and data may feel pressured into taking some form of action. Adopting Hadoop, arguably ‘the tallest tree in the big data forest’, provides the opportunity.”

The survey warns against adopting Hadoop out of fear of being left behind — Hadoop adoption remains at an early-adopter phase, with skills and successes still rare. A concrete piece of advice from Gartner is to start with small projects backed by business stakeholders to see if Hadoop is helpful in addressing core business problems. Starting with small deployments allows an organization to develop skills and build a record of success before tackling larger projects.

Skills Shortage

When using Hadoop for analytics, you lose the familiar benefits of SQL.

According to the same survey cited above, it appears that around 70% of organizations have relatively few Hadoop developers and users. The low number of Hadoop users per organization is attributed to Hadoop being innately unsuitable for large numbers of simultaneous users. It also points to the difficulty of hiring Hadoop developers, due to a skills shortage, which leads to our next point: cost.

Cost

Two facts about Apache Hadoop:

  1. It’s free to use. Forever.
  2. You can use cheap commodity hardware.

But Hadoop is still very expensive. Why?

While Hadoop may have a cheap upfront cost in software use and hosting, everything after that is anything but cheap.

As explained before, to make Hadoop work for more than just engineers, there need to be multiple abstraction layers on top. Having additional copies of the data for Hive, Presto, Spark, Impala, etc, means additional cost in hardware, maintenance, and operations. Adding layers on top also means requiring additional operations and engineering work to maintain the infrastructure.

While Hadoop may seem cheap in terms of upfront cost, the costs for maintenance, hosting, storage, operations, and analysis make it anything but.

Easy to Get Data In, but Very Tough to Get Data Out

Getting data into Hadoop is very easy, but it turns out, getting data out and deriving meaningful insight to your data is very tough.

A person working on data stored in Hadoop – usually an engineer, not an analyst – is expected to have at least some knowledge of HDFS, MapReduce, and Java. One also needs to learn how to set up the Hadoop infrastructure, which is another major project in itself. In conversations with industry people who have deployed Hadoop, or who work closely with organizations that use it, this came up as the biggest pain point of the technology — how hard it is to run analytics on Hadoop data.

Many technologies have been built to tackle the complexities of Hadoop, such as Spark (data processing engine), Pig (data flow language), and Hive (a SQL-like extension on top of Hadoop). These extra layers add more complexity to an already-complex data infrastructure. This usually means more potential points of failure.

Hiring Software Engineers is Expensive

An assortment of software skills is needed to make Hadoop work. If it’s used with no abstraction layer, such as Hive or Impala, on top, querying Hadoop needs to be done in MapReduce, which is written in Java. Working in Java means hiring software engineers rather than analysts who are proficient in SQL.

Software engineers with Hadoop skills are expensive, with an average salary in the U.S. at $84,000 a year (not including bonuses, benefits, etc). In a survey by Gartner, it’s stated that “obtaining the necessary skills and capabilities [is] the largest challenge for Hadoop (57%).”

Your engineering team is likely the most expensive, constrained, and tough-to-hire-for resource in your organization. When you adopt Hadoop, you then require engineers for a job that an analyst proficient in SQL could otherwise do. On top of the Hadoop infrastructure and abstraction layers you need to more easily get data out, you now need to account for the engineering resources needed. This is not cheap at all.

Businesses Want Answers NOW

As businesses go international and customers demand instant responsiveness around the clock, companies are pushed to become real-time enterprises. Whether the goal is deriving real-time insights into product usage, identifying financial fraud as it happens, providing customer dashboards that show investment returns in milliseconds rather than hours, or understanding ad spend results on an up-to-the-second basis, waiting for queries to Map and Reduce simply no longer serves the immediate business need.

It remains true that Hadoop is incredible for crunching through large sets of data, as batch processing is its core strength. There are ways to augment Hadoop’s real-time decision abilities, such as using Kafka streams. But in that case, what’s meant to be real-time processing slows down to micro-batching.

Spark Streaming is another way to speed up Hadoop, but it has its own limitations. Finally, Apache projects like Storm are also micro-batching, so they are nowhere near real time.

Another point to consider is that the technologies mentioned above add yet more complexity to an already-complex data infrastructure. Adding multiple layers between Hadoop and SQL-based analytic tools also means slow response, multiplied cost, and additional maintenance.

In short, Hadoop is not optimized for real-time decision making. This means it may not be well-suited to the evolving information demands of businesses in the 21st century.

In this, part one of this two-part series on Hadoop, we talked about the rise of Hadoop, why it looked like the solution to organizations’ big data problems, and where it fell short. In the next part of this series, we will explore why combining Hadoop with MemSQL may help businesses that are already invested in Hadoop.

Query Processing Improvements in MemSQL 6.8

Feed: MemSQL Blog.
Author: Eric Hanson.

In this blog post, I’ll focus on new query processing capabilities in MemSQL 6.8. The marquee query feature is just-in-time (JIT) compilation, which speeds up query runtimes on the first run of a query – now turned on by default. We have also improved performance of certain right and left outer joins and related operations, and Rollup and Cube. In addition, we add convenience features, including sub-select without an alias, and extended Oracle compatibility for date and time handling functions. Finally, new array functions for splitting strings and converting JSON data are added.

Other improvements in 6.8 are covered elsewhere. These include:

  • secured HDFS pipelines
  • improved pipelines performance
  • LOAD DATA null column handling extensions
  • information schema and management views enhancements

Now, let’s examine how just-in-time query compilation can work in a database.

Speeding up First Query Runtimes

MemSQL compiles queries to machine code, which allows us to get amazing performance, particularly when querying our in-memory rowstore tables. By spending a bit more time compiling than most databases – which interpret all queries, not compiling them – we get high performance during execution.

This works great for repetitive query workloads, such as real-time dashboards with a fixed set of queries and transactional applications. But our customers have been asking for better performance the first time a query is run, which is especially applicable for ad hoc queries – when slower performance can be especially noticeable.

In MemSQL 6.7, we first documented a JIT feature for SQL queries, enabled by running ‘set interpreter_mode = interpret_first’. Under this setting, MemSQL starts out interpreting a query, compiles its operators in the background, then dynamically switches from interpretation to execution of compiled code for the query as the query runs the first time. The interpret_first setting was classified as experimental in 6.7, and was off by default.

In 6.8, we’re pleased to say that interpret_first is now fully supported and is on by default. This setting can greatly improve the user’s experience running ad hoc queries, or when using any application that causes a lot of new SQL statements to be run, as when a user explores data through a graphical interface. The interpret_first setting can speed up the first run of a large and complex query – say, a query with more than seven joins – several times by reducing compile overhead, with no loss of performance on longer-running queries for their first run.
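
If you want to confirm or override the setting on a particular cluster, the engine variable quoted above can be inspected and set per session; a minimal sketch:

-- Check the current setting (interpret_first is the default in 6.8)...
SHOW VARIABLES LIKE 'interpreter_mode';

-- ...and set it explicitly for the current session if needed.
SET interpreter_mode = interpret_first;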

Rollup and Cube Performance Improvements

Cube and Rollup operator performance has been improved in MemSQL 6.8 by pushing more work to the leaf nodes. In prior releases, Cube and Rollup were done on the aggregator, requiring more data to be gathered from the leaves to the aggregator, which can take more time. For example, consider the following query from the Cube and Rollup documentation:

SELECT state, product_id, SUM(quantity)
FROM sales
GROUP BY CUBE(state, product_id)
ORDER BY state, product_id;

The graphical query plan for this in 6.8, obtained using MemSQL Studio, is the following:

MemSQL Studio demonstrates some of the performance improvements in MemSQL 6.8.

Notice the Grouping Set operator, third from the bottom, which is used for the Cube calculation. The Grouping Set operator is below the Gather operator, which means it is done on the leaves in this case.

This enhancement speeds up several queries in the TPC-DS benchmark. In particular, query 67, which contains a large Rollup, improved by 5.5X compared with MemSQL 6.7.

Right Semi/Anti/Outer Join Support

MemSQL 6.8 introduces a new approach to executing certain outer-, semi-, and anti-joins. This does not add any new functional surface area to our SQL implementation; rather, it speeds up execution of some queries. A true right join operator is now supported, and certain kinds of left joins can be rewritten to right joins to enable them to run faster.

For example, consider two tables, S and L, where S is a small table and L is a large table. Suppose this query or subquery is encountered:

S left join L

This can be rewritten and executed as

L right join S

Here, the hash build side is for S. Then L is scanned and the L rows are used to probe the hash table for S. Since L is large and S is small, this is a good strategy for this query, since it results in a smaller hash table that can more easily fit in cache.
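
Spelled out with hypothetical tables (a small dim_store table as S, and a large sales_fact table as L), the rewrite looks like this:

-- Written as a left join from the small side...
SELECT s.store_id, s.store_name, f.amount
FROM dim_store s
LEFT JOIN sales_fact f ON f.store_id = s.store_id;

-- ...MemSQL 6.8 can execute it as the equivalent right join, building the
-- hash table on the small dim_store side and probing it with sales_fact.
SELECT s.store_id, s.store_name, f.amount
FROM sales_fact f
RIGHT JOIN dim_store s ON f.store_id = s.store_id;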

This enhancement can substantially speed up certain queries. For example, query 21 of the TPC-H benchmark sped up by about 4.3X with this approach, from MemSQL 6.7 to MemSQL 6.8.

Subselect Without Alias

MemSQL 6.8 now allows you to use a subquery without an alias (name) for it, when leaving off the alias will not be ambiguous. For example, you can now say this:

-- find average of the top 3 quantities
SELECT AVG(quantity)
FROM (SELECT * FROM sales ORDER BY quantity DESC LIMIT 3);

Rather than this:

SELECT AVG(quantity)
FROM (SELECT * FROM sales ORDER BY quantity DESC LIMIT 3) as x;

Resource Governor Extensions

MemSQL 6.8 includes several extensions to the resource governor, which ensure that resources for more operations are governed under the desired pools. These extensions are:

  • LOAD DATA operations now run in the pool for the current connection where the load operation is running.
  • Stored procedures run in the resource pool of the current connection from where they are called.
  • Query optimization work always runs in the pool of the current connection.
  • Pipelines can be run under a pool you specify when you create the pipeline, using a new clause: [RESOURCE POOL pool_name].
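
As a rough sketch of how these pieces fit together (the pool definition, pool name, and clause placement here are illustrative assumptions, so check the resource governor documentation for the exact syntax on your version):

-- Create a pool and bind the current connection to it; LOAD DATA, stored
-- procedures, and query optimization for this connection then run in it.
CREATE RESOURCE POOL ingest_pool WITH MEMORY_PERCENTAGE = 50;
SET resource_pool = ingest_pool;

-- A pipeline can also name its pool directly via the new clause
-- (hypothetical events table and Kafka endpoint).
CREATE PIPELINE events_pipeline
RESOURCE POOL ingest_pool
AS LOAD DATA KAFKA 'kafka-host:9092/events'
INTO TABLE events;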

New Built-In Functions

Two new built-in functions related to arrays are provided: SPLIT() and JSON_TO_ARRAY().

SPLIT()

The SPLIT() function has the following prototype:

split(s text [, separator text NOT NULL]) returns array(text)

It splits a string into an array. If no delimiter is specified, any amount of whitespace acts as the delimiter; if a delimiter is specified, it splits at that delimiter. For example, the query

SELECT array_length(split('a b  c') :> array(text));

returns 3. Normally you would use SPLIT in a stored procedure (SP) or a user-defined function (UDF).
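
The same pattern works with an explicit delimiter; mirroring the example above, the following should also return 3:

SELECT array_length(split('a,b,c', ',') :> array(text));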

JSON_TO_ARRAY()

The JSON_TO_ARRAY() function takes a JSON array and returns a MemSQL array. If your applications use JSON data that contains arrays, you can use JSON_TO_ARRAY when processing it in SPs or UDFs. For example:

JSON_TO_ARRAY('[ "foo", 1, {"k1" : "v1", "k2" : "v2"} ]')

would produce a result array of JSON elements (we’ll call it r) of length 3 like this:

r[0] = "foo"
r[1] = 1
r[2] = '{"k1":"v1","k2":"v2"}'

Expression and Built-In Function Changes

A number of small changes have been made to new expression capabilities. First, in MemSQL 6.7, we introduced a number of functions to make it easier to port Oracle applications and queries to MemSQL, and easier for experienced Oracle developers to use MemSQL. These include:

  • NVL(), TO_DATE(), TO_TIMESTAMP(), TO_CHAR(), DECODE()
  • REGEXP_REPLACE(), REGEXP_INSTR()

We’ve improved the compatibility of these functions as follows:

  • Enable TO_DATE() to support format strings with time-related format options (“HH”, “SS”, etc.)
  • Enable TO_DATE() to support the “DD” format option
  • Enable TO_TIMESTAMP() to support “YY” and “FF” format options
  • Enable TO_TIMESTAMP(), TO_DATE(), and TO_CHAR() to support the “D” format option
  • Enable TO_TIMESTAMP() and TO_DATE() to support using different punctuation as a separator
  • Enable TO_TIMESTAMP() and TO_DATE() to raise an error, instead of returning NULL, for certain error cases
  • Enable TO_CHAR() to support “AM” and “PM”
  • Modify how TO_TIMESTAMP() parses “12” when using the “HH” format option

In addition, we increased the precedence of || as concat (under sql_mode = ‘PIPES_AS_CONCAT’) to be compatible with MySQL and Postgres.
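
Here are a couple of small examples of the behaviors listed above; exact output formats may vary by version, so treat these as sketches:

-- TO_DATE with the "DD" option and punctuation separators.
SELECT TO_DATE('01-05-2019', 'DD-MM-YYYY');

-- With PIPES_AS_CONCAT, || acts as string concatenation at the expected
-- precedence.
SET sql_mode = 'PIPES_AS_CONCAT';
SELECT 'MemSQL ' || '6.8';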

Summary

This post has covered the query processing extensions in MemSQL 6.8. We hope you’ll especially enjoy the new JIT compilation feature that improves first-run query times by default. Try it for your ad hoc workloads. Please download MemSQL 6.8 today!

HDFS Pipelines Supporting Kerberos and Wire Encryption in MemSQL 6.8

Feed: MemSQL Blog.
Author: Jacky Liang.

Many companies, including our customers, have invested heavily in Hadoop infrastructure. In a recent blog post, we explored the topic of hype when it came to enterprises deploying Hadoop across their organizations, and ultimately where Hadoop falls short for certain use cases.

Using MemSQL, many of our customers have been able to augment Hadoop using our HDFS pipelines feature, allowing them to quickly ingest from the Hadoop Distributed File System (HDFS) and perform analysis of their data in real time.

With MemSQL 6.8, we are happy to announce our support for Kerberos and wire encryption for HDFS pipelines. Kerberos is a widely used method for authenticating users, including users of Hadoop clusters.

Similarly, wire encryption protects data as it moves through Hadoop. Combining Kerberos and wire encryption in Hadoop is the standard in enterprises demanding the highest level of security.

In MemSQL 6.8, with the release of Kerberos and wire encryption support for HDFS pipelines, customers now get comprehensive security through standards-based authentication and encrypted over-the-wire data delivery.
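
For reference, a bare-bones HDFS pipeline looks like the sketch below; the path, table, and file format are placeholders, and the Kerberos and wire encryption settings themselves are supplied through the pipeline’s configuration, for which you should consult the MemSQL 6.8 documentation:

-- Hypothetical target table and HDFS path.
CREATE PIPELINE hdfs_events AS
LOAD DATA HDFS 'hdfs://namenode:8020/data/events/'
INTO TABLE events
FIELDS TERMINATED BY ',';

START PIPELINE hdfs_events;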

MemSQL Delivers Top Performance on Industry Benchmarks with Latest Releases

Feed: MemSQL Blog.
Author: Mike Boyarski.

MemSQL has spent the last six years on a single mission: developing the fastest, most capable database on the market for a new generation of workloads. Today’s businesses are beginning to win or lose on their ability to use data to create competitive advantage, and technology teams at these companies need data infrastructure that can meet an increasingly broad set of requirements, perform well at scale, and fit easily with existing processes and tools. MemSQL 6.8 is the latest release of the database that meets this challenge.

Whether accelerating and simplifying the use of data to improve customer experiences, or accelerating analytics to drive better decision-making and optimize operations, both legacy databases and the current sprawl of specialty tools are failing for many data professionals.

Today, we are proud to announce that we have reached an amazing milestone. MemSQL is the first cloud native database to pass all of the three leading industry benchmarks for analytics and transactions: TPC-C, TPC-H, and TPC-DS. MemSQL meets or beats all published performance and cost data across a broad array of competitors.

These results are now possible with our latest software releases. We are announcing the general availability of MemSQL 6.8 and the first beta release of MemSQL 7.0. MemSQL 6.8 offers even faster analytics performance and advanced security for Hadoop environments. The MemSQL 7.0 beta previews new, mission-critical transactional capabilities for system of record applications and even more dramatic query performance improvements.

Proving Database Scale and Performance for Both Transactions and Analytics

We believe that the newest analytical systems and data-intensive applications, which include streaming, real-time decisions, predictive decisions, and dynamic, personalized experiences, represent a new set of workloads, which we call operational analytics. Operational analytics involves a specific combination of key database features, notably fast ingest through transaction processing, fast processing, and low-latency queries for reports and dashboards.

Current industry benchmarks were designed to individually highlight the capabilities of different kinds of databases, since no one database, until recently, could run both transactional and analytics workloads at scale. Yet that is exactly what today’s real-time and near-real-time operations and analytics environments require.

In order to demonstrate what MemSQL can do for these use cases, we ran three different benchmarks from the TPC family. This allows us to showcase that MemSQL has the broad capabilities, performance across use cases, and scalability required today.

The results were astounding. Using a single database product on standard cloud hardware, MemSQL was able to meet or beat (up to 8x) the results of databases designed for only doing either transaction processing or analytics – not both. MemSQL is the only modern database that can successfully perform, scale, and deliver the full breadth of capabilities required to support today’s demanding analytics and operational applications.

If you would like to learn more about these benchmarks, we’ve documented our infrastructure configuration, the detailed results, and comparisons to other database products in our benchmark blog post.

But Wait, There’s More … Product Improvements in MemSQL 6.8

We are also excited to highlight the new capabilities in MemSQL 6.8. Our drive to build a fast, easy to use, enterprise-capable database means ongoing work to both optimize query performance and to ensure comprehensive security capabilities.

In MemSQL 6.8, we introduced two key improvements: improved query performance and advanced security for Hadoop environments.

Improved Query Performance

We optimized our query compilation feature set to deliver what we call interpret-first query compilation. This innovative feature automatically speeds up first-run queries commonly used in ad hoc analytics environments, or when one-off queries are required.

We see a 5x speedup of CUBE and ROLLUP queries with the new optimizations. ROLLUP queries let you calculate subtotals and grand totals for a set of columns, while CUBE queries let you calculate subtotals and grand totals for all permutations of the columns specified in the query. We also saw a nearly 5x performance improvement for a range of JOIN queries. You can learn more about our query performance improvements on our blog post here.

Advanced Security for Hadoop Environments

MemSQL has had pipeline support for HDFS since MemSQL 6.5. HDFS pipeline support allows MemSQL to quickly and easily ingest data from Hadoop environments, with MemSQL providing faster ANSI SQL query response, leveraging its power as a fully distributed database.

Now, with MemSQL 6.8, we have added Kerberos support for HDFS pipelines, along with wire encryption for over-the-wire data delivery, to provide support for security standards which are commonly used in Hadoop deployments. You can read more about our HDFS security improvements at our blog post here.

To learn more about MemSQL and our improvements with MemSQL 6.8, please join our upcoming live webinar. You can also sign up for our benchmarking webinar. Or, you can get started with MemSQL for free today.

Managing MemSQL with Kubernetes

Feed: MemSQL Blog.
Author: Floyd Smith.

With the arrival of the cloud, organizations face new opportunities – and new challenges. Chief among them is how to take the greatest advantage of public and private cloud resources without being locked into a specific cloud or being barred from access to existing infrastructure. Container solutions such as Docker offer part of the solution, making it much easier to develop, deploy, and manage software. In our webinar, product manager Micah Bahti describes how to take advantage of the next step: Using Kubernetes, and the MemSQL Kubernetes Operator, to manage containers across public clouds and existing infrastructure.

Until recently, however, Kubernetes didn’t manage stateful services. Recently, that support has been added, and MemSQL has stepped into a leading position. Ahead of other widely used database platforms, MemSQL has developed and made available a Kubernetes Operator. The Operator was announced at Red Hat Summit early in May.

You can use the Operator, currently in beta, to easily deploy and manage MemSQL clusters. Like Kubernetes itself, the Operator works smoothly across operating systems, platforms, clouds, and privately held infrastructure.

You can also use the Operator with MemSQL on small deployments, including free instances of MemSQL. It scales smoothly to large databases as well; MemSQL scales to databases in the petabytes.

Deploying and Installing Kubernetes for MemSQL

Deploying and installing Kubernetes for MemSQL is very similar to using Kubernetes with other, stateless software. First, find the needed components. They’re available in the OpenShift container catalog and on Docker Hub.

Listing MemSQL components for Kubernetes

To start deployment, load the image of the MemSQL Kubernetes Operator and the configuration files into your Kubernetes cluster.

First step in installing MemSQL with the Kubernetes Operator

Then, edit the YAML file, memsql-cluster.yaml, to define the attributes of your cluster. The most important is the size of the cluster, in gigabytes. One of the advantages of Kubernetes is that it’s very easy to change this later, quickly and at no cost. For other attributes, the minimum configuration for production should be:

  • 1 leaf unit @ height 1
  • 3 aggregator units @ height 0.5
  • redundancyLevel = 2

Second step in installing MemSQL with the Kubernetes Operator

Note: You can’t downsize a cluster below the amount of data in it. For instance, if you create a 2GB cluster, then put 1.1GB of data in it, telling Kubernetes to downsize the cluster to 1GB will result in an error message.

Finally, create the cluster, and manage it using kubectl. You can connect to the cluster with mysql.

Third step in installing MemSQL with the Kubernetes Operator

You will find support for all this in the MemSQL documentation and the MemSQL Forums. Reach out to us in the Kubernetes forum or by email at team@memsql.com.

Benefits of Using MemSQL with Kubernetes

Because MemSQL offers fast, scalable SQL, the combination of the MemSQL database and the Kubernetes Operator gives you the ability to use a single relational database for transactions and analytics, without the need to move data. MemSQL easily ingests data from a range of sources and supports analytics platforms such as Looker, PowerBI, and Tableau.

When you use MemSQL with Kubernetes, you get complete freedom to deploy or redeploy across physical or cloud infrastructure, as needed. Installation and deployment take minutes, not days or weeks; scaling is elastic; upgrades happen smoothly online. You can upgrade supporting hardware or software, with no effect on your MemSQL cluster.

For more details about the Operator, see our initial announcement. And, if you are not yet a MemSQL user, you can try MemSQL for free today, or contact us to learn how we can help you.

Case Study: Kurtosys – Why Would I Store My Data In More Than One Database?

Feed: MemSQL Blog.
Author: Floyd Smith.

One of MemSQL’s strengths is speeding up analytics, often replacing NoSQL databases to provide faster performance. Kurtosys, a market-leading digital experience platform for the financial services industry, uses MemSQL exclusively, gaining far faster performance and easier management across transactions and analytics.

Kurtosys is a leader in the Digital Experience category, with the first truly SaaS platform for the financial services industry. In pursuing its goals, Kurtosys became an early adopter of MemSQL. Today, MemSQL is helping to power Kurtosys’ growth.

Stephen Perry, head of data at Kurtosys, summed up the first round of efforts in a blog post several years ago, titled Why Would I Store My Data In More Than One Database? (Among his accomplishments, Steve is one of the first MemSQL-certified developers.)

In the following blog post, we describe how usage of MemSQL has progressed at Kurtosys. In the first round, Kurtosys had difficulties with their original platform, using Couchbase. They moved to MemSQL, achieving numerous benefits.

Further customer requests, and the emergence of new features in MemSQL, opened the door for Kurtosys to create a new platform, which is used by Kurtosys customers to revolutionize the way they deliver outstanding digital and document experiences to their sales teams and to their external communities of clients and prospects. In this new platform, MemSQL is the database of record.

At Kurtosys, Infrastructure Powers Growth

Kurtosys has taken on a challenging task: hosting complex financial data, documents, websites, and content for the financial services industry. Kurtosys customers use the Kurtosys platform for their own customer data, as well as for their sales and marketing efforts.

The customer list for Kurtosys features many top tier firms, including Bank of America, the Bank of Montreal, Generali Investments, and T. Rowe Price. Kurtosys’ customers require high performance and high levels of security.

Customer focus on security is greater in financial services than in most other business segments. A single breach – even a potential breach that is reported, but never actually exploited – can cause severe financial and reputational damage to a company. So customers hold technology suppliers such as Kurtosys to very high standards.

Alongside security, performance is another critical element. Financial services companies claim performance advantages to gain new customers, so suppliers have to deliver reliably and at top speed. And, since financial services companies also differentiate themselves on customer service, they require suppliers to provide excellent customer service in turn.

(Like Kurtosys, MemSQL is well-versed in these challenges. Financial services is perhaps our leading market segment, with half of the top 10 US financial services firms being MemSQL customers.)

With all of these strict requirements, for financial services companies to trust an external provider to host their content – including such crucial content as customer financial data – is a major step. Yet, Kurtosys has met the challenge and is growing quickly.

“Our unique selling proposition is based around the creative use of new and unique technology,” says Steve. “We’ve progressed so far that our original internal platform with MemSQL, which we launched four years ago, is now a legacy product. Our current platform employs a very modern approach to storing data. We are using MemSQL as the primary database for the Kurtosys platform.”

Kurtosys Chooses Infrastructure for Growth

Kurtosys is adept at innovating its infrastructure to power services for demanding customers. For instance, several years ago, Kurtosys used SQL Server to execute transactions and Couchbase as a high-performance, scalable, read-only cache for analytics.

Initially, the combination made sense. Customers of Kurtosys wanted to see the company executing transactions on a database that’s among a handful of well-established transactional databases. SQL Server fit the bill.

However, like other traditional relational databases, SQL Server is, at its core, limited by its dependence on a single core update process. This dependency prevents SQL Server, and other traditional relational databases, from being able to scale out across multiple, affordable servers.

This means that the single machine running SQL Server is usually fully occupied with transaction processing and would struggle to meet Kurtosys’ requirements, such as the need for ad-hoc queries against both structured and semi-structured data. That left Kurtosys needing to copy data to another system, initially Couchbase, and run analytics off that – the usual logic for purchasing a data warehouse or an operational analytics database.

Couchbase seemed to be a logical choice. It’s considered a leading NoSQL database, and is often compared to other well-known NoSQL offerings such as Apache Cassandra, Apache HBase, CouchDB, MongoDB, and Redis. Couchbase tells its target audience that it offers developers the opportunity to “build brilliant customer experiences.”

NoSQL databases have the ability to scale out that traditional relational databases lack. However, NoSQL databases face fundamental limitations in delivering on promises such as those made by Couchbase. NoSQL databases favor unstructured or less-structured data. As the name implies, they don’t support SQL.

Users of these databases don’t benefit from decades of research and experience in performing complex operations on structured and, increasingly, semi-structured data using SQL. With no SQL support, Couchbase can be difficult to work with, and requires people to learn new skills.

Running against unstructured data and semi-structured JSON data, and without the benefit of SQL, Kurtosys found it challenging to come up with an efficient query pattern that worked across different data sets.

Kurtosys Moves to MemSQL to Power Fast Analytics

As a big data database, Couchbase is an excellent tool for data scientists running analytics projects. However, for day in and day out analytics use, it was difficult to write queries, and query performance was subpar. Couchbase was not as well suited for the workloads and high degree of concurrency – that is, large numbers of simultaneous users – required for internal user and customer analytics support, including ad hoc SQL queries, business intelligence (BI) tools, and app support.

At the same time, Kurtosys needed to stay on SQL Server for transactions. Kurtosys had invested a lot in SQL Server-specific stored procedures. Its customers also liked the fact that Kurtosys uses one of the top few best-known relational databases for transactions.

So, after much research, Kurtosys selected a fully distributed database which, at the time, ran in-memory: MemSQL. Because MemSQL is also a true relational database, and supports the MySQL wire protocol, Kurtosys was able to use the change data capture (CDC) process built into SQL Server to keep MemSQL’s copy of the data up to date. MemSQL received updates a few seconds after each transaction completed in SQL Server. Queries then ran against MemSQL, allowing both updates and queries to run fast against the respective databases.

MemSQL fast database replaces SQL Server and a CDC process.
In the original platform, updates ran against SQL Server.
CDC moved updates to MemSQL, which supported queries.

SQL Server was now fully dedicated to transaction support, with the CDC process imposing little overhead on processing. And, because of MemSQL’s speed, the database was able to easily keep up with the large and growing transaction volume going through the Kurtosys platform.

Kurtosys summed up its approach at the time in a slide deck that’s available within a Kurtosys blog post. The key summary slide is below.

MemSQL-Based Platform Powers New Applications

Kurtosys has now created a new internal platform. One of the key capabilities in the new platform is support for JSON data. In a recent MemSQL release, MemSQL 6.7, JSON data support is a core feature. In fact, comparing JSON data to fully structured data, “performance is about the same, which is a testament to MemSQL,” says Steve. With this capability, Kurtosys can keep many of the same data structures that it had previously used in Couchbase, but with outstanding performance.
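
As a small illustration of what that looks like in practice (the table and field names below are hypothetical, not Kurtosys’s actual schema), MemSQL lets you store JSON documents in a column and filter on their fields directly with SQL:

-- A document table with a JSON column.
CREATE TABLE fund_documents (
  id BIGINT PRIMARY KEY,
  doc JSON NOT NULL
);

-- ::$ extracts a JSON field as text, so semi-structured documents can be
-- queried with ordinary SQL predicates.
SELECT id, doc::$title
FROM fund_documents
WHERE doc::$fund_type = 'equity';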

Also, when Kurtosys first adopted MemSQL, several years ago, MemSQL was largely used as an in-memory database. This gave truly breakthrough performance, but with accompanying higher costs. Today, MemSQL flexibly supports both rowstore tables in memory and disk-based columnstore. “Performance,” says Steve, “is almost too good to believe.”

The new platform runs MemSQL for both transactions and queries. In the new platform, there’s no longer a need for CDC. Kurtosys runs MemSQL as a transactional database, handling both transactions and analytics.

MemSQL is a translytical, converged, HOAP, HTAP, NewSQL database.
In the new platform, updates and queries all run on MemSQL.

The new internal platform powers Kurtosys applications with thousands of concurrent users, accessing hundreds of gigabytes of data, and with a database growing by several gigabytes of data a day.

Kurtosys is looking forward to using the new features of MemSQL to power the growth of their platform. As Steve Perry says, in a separate blog post, “What they do, they do right… we use MemSQL to improve the performance of query response.”

Stepping Ahead with MemSQL

MemSQL is a fundamental component of the key value proposition that Kurtosys offers its customers – and cutting-edge platforms, like the one being developed at Kurtosys today, will continue to push MemSQL forward.

To see the benefits of MemSQL for yourself, you can try MemSQL today for free. Or, contact us to speak with a technical professional who can describe how MemSQL can help you achieve your goals.


Webinar: The Benchmark Breakthrough Using MemSQL

$
0
0

Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL has reached a benchmarking breakthrough: the ability to run three very different database benchmarks, fast, on a single, scalable database. The leading transactions benchmark, TPC-C, and analytics benchmarks, TPC-H and TPC-DS, don’t usually run on the same scale-out database at all. But MemSQL runs transactional and analytical workloads simultaneously, on the same data, and with excellent performance.

As we describe in this webinar write-up, our benchmarking breakthrough demonstrates this unusual, and valuable, set of capabilities. You can also read a detailed description of the benchmarks and view the recorded webinar.

MemSQL stands out because it is a relational database, with native SQL support – like legacy relational databases – but also fully distributed, horizontally scalable simply by adding additional servers, like NoSQL databases. This kind of capability – called NewSQL, translytical, HTAP, or HOAP – is becoming more and more highly valued for its power and flexibility. It’s especially useful for a new category of workloads called operational analytics, where live, up-to-date data is streamed into a data store to drive real-time decision-making.

MemSQL benchmarked successfully against TPC-C, TPC-H, and TPC-DS

The webinar was presented by two experienced MemSQL pros: Eric Hanson, principal product manager, and Nick Kline, director of engineering. Both were directly involved in the benchmarking effort.

MemSQL and Transaction Performance – TPC-C

The first section of the webinar was delivered by Eric Hanson.

The first benchmark we tested was TPC-C, which measures transaction throughput for a simulated order-entry workload at various data sizes. This benchmark uses two newer MemSQL capabilities:

  • SELECT FOR UPDATE, added in our MemSQL 6.7 release (see the sketch after this list).
  • Fast synchronous replication and durability, which make synchronous operations fast; part of our upcoming MemSQL 7.0 release. (The relevant MemSQL 7.0 beta is available.)
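
As a sketch of the first item – using a generic order table rather than the actual TPC-C schema – SELECT FOR UPDATE lets a transaction lock the rows it reads, so a concurrent transaction can't change them before the update commits:

  BEGIN;

  -- Lock the row for this order line while the new quantity is computed.
  SELECT quantity
  FROM order_lines
  WHERE order_id = 1001 AND line_no = 3
  FOR UPDATE;

  UPDATE order_lines
  SET quantity = quantity - 1
  WHERE order_id = 1001 AND line_no = 3;

  COMMIT;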

To demonstrate what MemSQL can do in production, we disabled rate limiting and used asynchronous durability. This makes the results more representative of production use, but it means they can’t be compared directly to certified TPC-C results.

MemSQL transaction throughput scales at a close-to-linear rate as servers are added.

These results showed high sync replication performance, with excellent transaction rates, and near-linear scaling of performance as additional servers are added. For transaction processing, MemSQL delivers speed, scalability, simplicity, and both serializability and high availability (HA) to whatever extent needed.

MemSQL and Analytics Performance – TPC-H and TPC-DS

The second section of the webinar was delivered by Nick Kline.

The data warehousing benchmarks were run at a 10TB scale factor. MemSQL is very unusual in being able to handle both fast transactions, as shown by the TPC-C results, and fast analytics, as shown by these TPC-H and TPC-DS results – on the same data, at the same time.

MemSQL is now being optimized, release to release, in both areas at once. Query optimization is an ongoing effort, with increasingly positive results. Nick described, in some detail, how two queries from the TPC-H benchmark get processed through the query optimizer and executed. The breakdown for one query, TPC-H Query 3, is shown here.
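
For reference, the standard form of TPC-H Query 3 (the Shipping Priority query), with its usual validation parameter values filled in, is roughly the following – a three-table join plus aggregation that the optimizer has to order and distribute well:

  SELECT l_orderkey,
         SUM(l_extendedprice * (1 - l_discount)) AS revenue,
         o_orderdate,
         o_shippriority
  FROM customer, orders, lineitem
  WHERE c_mktsegment = 'BUILDING'
    AND c_custkey = o_custkey
    AND l_orderkey = o_orderkey
    AND o_orderdate < '1995-03-15'
    AND l_shipdate > '1995-03-15'
  GROUP BY l_orderkey, o_orderdate, o_shippriority
  ORDER BY revenue DESC, o_orderdate
  LIMIT 10;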

Breaking down a TPC-H query to show how MemSQL avoids slow queries.

The TPC-DS benchmark is an updated and more complex version of the TPC-H benchmark alluded to above. In fact, it’s so challenging that many databases – even those optimized for analytics – can’t run it effectively, or can’t run some of the queries. MemSQL can run all the queries for both TPC-H and TPC-DS, as well as for TPC-C, and all with good results.

For TPC-H, smaller numbers are better. MemSQL was able to achieve excellent results on TPC-H with a relatively moderate hardware budget.

MemSQL gets excellent database benchmarking results against moderate hardware.

Results for TPC-DS were also very good. Because queries on TPC-DS vary greatly in their complexity, query times vary between very short and very long. As a result, the geometric mean is commonly used to express the results. We compared MemSQL to several existing published results. Smaller is better.

MemSQL shows itself as a fast database against the somewhat intimidating TPC-DS benchmark.

Q&As for the MemSQL Benchmarks Webinar

The Q&A was shared between Eric and Nick. Also, these Q&As are paraphrased; for the more detailed, verbatim version, view the recorded webinar. Both speakers also referred to our detailed benchmarking blog post.

Q. Does MemSQL get used for these purposes in production?

A. (Hanson) Yes. One example is a wealth management application at a top 10 US bank, running in real-time. Other examples include gaming consoles and IoT implementations in the energy industry.

Q. Should we use MemSQL for data warehousing applications, operational database needs, or both?

A. (Hanson) Our benchmarking results show that MemSQL is excellent across a range of applications. However, MemSQL is truly exceptional for operational analytics, which combines aspects of both. So we find that many of our customers begin their usage of MemSQL in this area, then extend it to aspects of data warehousing on the one hand, transactions on the other, and merged operations.

Q. How do we decide whether to use rowstore or columnstore?

A. (Kline) Rowstore tables fit entirely in memory and are best suited to transactions, though they get used for analytics as well. For rowstore, you have to spec the implementation with enough memory to hold all the rowstore data. Columnstore also does transactions, somewhat more slowly, and is disk-based, though MemSQL still does much of its work in memory. And columnstore is the default choice for analytics at scale. (Also, see our rowstore vs. columnstore blog post. – Ed.)
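
As a rough sketch of what this choice looks like in DDL – the table and column names here are illustrative – a MemSQL table is an in-memory rowstore by default, while a columnstore table is declared with a clustered columnstore key:

  -- In-memory rowstore (the default table type), suited to transactions.
  CREATE TABLE recent_events (
      event_id BIGINT NOT NULL PRIMARY KEY,
      account_id BIGINT NOT NULL,
      event_time DATETIME NOT NULL,
      payload JSON
  );

  -- Disk-based columnstore, suited to analytics at scale.
  CREATE TABLE event_history (
      event_id BIGINT NOT NULL,
      account_id BIGINT NOT NULL,
      event_time DATETIME NOT NULL,
      payload JSON,
      KEY (event_time) USING CLUSTERED COLUMNSTORE
  );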

Q. How do you get the performance you do?

A. (Hanson) There’s a lot to say here, but I can mention a few highlights. Our in-memory data tables are very fast. We compile queries to machine code, and we also work against compressed data, without the need to decompress it first – this can cut out 90% of the time that would otherwise be needed to, for instance, scan a record.

We have super high performance for both transactions and analytics against rowstore. For columnstore, we use vectorized query execution – an approach that emerged in the early 2000s, in which you process not single rows, but thousands of rows at a time. So for filtering a column, as an example, we do it 4000 rows at a time, in tight loops. Finally, we use single instruction, multiple data (SIMD) instructions as part of parallelizing operations.

Conclusion

To learn more about MemSQL and the improvements in MemSQL 6.8, view the recorded benchmarking webinar; you can also read the benchmarking blog post. Also, you can get started with MemSQL for free today.

“MemSQL’s Columnstore Blows All of the Free and Open Source Solutions Out of the Water” — Actual User

$
0
0

Feed: MemSQL Blog.
Author: Jacky Liang.

A columnstore database takes all the values in a given column – the ZIP code column in a customer database, for instance – and stores all the ZIP code values together, in effect as a single long row, with the column number as the first entry. So the start of a columnstore database’s ZIP code record might look like this: 5, 94063, 20474, 38654… The “5” at the beginning means that the ZIP code data is stored in the fifth column of the rowstore table of customer names and addresses that the original data comes from.

Columnstore databases make it fast and easy to execute reporting and querying functions. For instance, you can easily count how many customers you have living in each US zip code – or combine your customer data with a zip code marketing database.
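
For instance, a “customers per ZIP code” report is a single aggregation query, and a columnstore only needs to scan the one column involved (the table and column names here are illustrative):

  SELECT zip_code, COUNT(*) AS customer_count
  FROM customers
  GROUP BY zip_code
  ORDER BY customer_count DESC;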

MemSQL combines rowstore and columnstore data tables in a single, scalable, powerful database that features native SQL support. (See our blog post comparing rowstore and columnstore.) And, in addition to its fast-growing, paid enterprise offering, MemSQL also has a highly capable free option.

You can use MemSQL for free, with community support from our busy message board. MemSQL is free for clusters of up to four nodes – four separate server instances – with up to 32GB of RAM each, or 128GB of RAM total. This large, free capacity is particularly useful for columnstore tables, where 128GB of RAM is likely to be enough to support a terabyte or so of data on disk, with excellent performance.

We have several existing customers doing important work using MemSQL for free. And when you need more nodes, or paid support, simply contact MemSQL to move to an enterprise license.

Why Use Columnstore?

The columnstore is used primarily for analytical applications where the queries mainly involve aggregations over datasets that are too large to fit in memory. In these cases, the columnstore performs much better than the rowstore.

A column-oriented store, or “columnstore,” treats each column as a unit and stores segments of data for each column together in the same physical location. This enables two important capabilities. The first is to scan each column individually – in essence, being able to scan only the columns needed for the query, with good cache locality during the scan. These features of columnstore get you excellent performance and low resource utilization – an important factor in the cloud, particularly, where every additional operational step adds to your cloud services bill.

The other capability is that columnstores lend themselves well to compression. For example, repeating and similar values can easily be compressed together. MemSQL compresses data up to about 90% in many cases, with very fast compression and decompression as needed. As with the data design of columnstore tables, compression delivers cache locality, excellent performance, low resource utilization, and cost savings.

In summary, you should use a columnstore database if you need great analytics performance. It also helps that MemSQL, as a scalable SQL database with built-in support for the MySQL wire protocol, natively supports popular analytic tools like Tableau, Looker, and Zoomdata.

MemSQL ingests data from multiple sources and supports BI tools, SQL queries, and more.
MemSQL combines fast ingest, in-memory rowstore tables, disk-based columnstore tables, and native SQL support for maximum power and usability.

A big advantage with MemSQL is that you get both rowstore and columnstore tables in a single database, with built-in SQL support. This gives you a number of advantages:

  • If you need to have rowstore table data duplicated and, in many cases, augmented in one or more columnstore tables, you can do this in a single database.
  • You can run queries that join, or otherwise operate on, data spread across multiple rowstore and columnstore tables (see the example query after this list).
  • You can make “game-time” price-performance decisions between storing your data in super-fast, in-memory rowstore tables vs. large, disk-based columnstore tables, then modify your decision as your business needs change.
  • The training and experience you gain in using MemSQL for one use case extends automatically to many other use cases, whether rowstore or columnstore. For some of our customers, MemSQL is the last new database they’ll ever need – and ends up replacing one or more competing database options.
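
As a sketch of the second point above – the table names are hypothetical – a single query can combine a hot, in-memory rowstore table with a large, disk-based columnstore table:

  -- recent_orders is an in-memory rowstore table;
  -- order_history is a disk-based columnstore table.
  SELECT r.customer_id,
         r.order_total AS todays_order,
         h.lifetime_orders
  FROM recent_orders r
  JOIN (
      SELECT customer_id, COUNT(*) AS lifetime_orders
      FROM order_history
      GROUP BY customer_id
  ) h ON h.customer_id = r.customer_id;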

You can also read more about the difference between rowstore and columnstore in our documentation.

Some Current Columnstore Options

There are several columnstore options out in the market. Here are a few of the more popular ones that we see.

Note. Most of these options are not fully comparable to MemSQL because they don’t support both rowstore and columnstore, in-memory and disk-based tables, as MemSQL does. (See the list of benefits to this converged capability above.) However, you should consider a range of options before choosing any database provider, including MemSQL.

ClickHouse

ClickHouse is an open source columnstore database developed by Yandex specifically for online analytical processing (OLAP). ClickHouse allows for parallel processing of queries using multiple cores and very fast scanning of rows, while offering good data compression.

However, there are disadvantages to using ClickHouse. There is no real DELETE/UPDATE support and no support for transactions. ClickHouse also uses its own query protocol, which means limited SQL support. This also means your favorite SQL tools may not be supported if you choose to use ClickHouse. Also, if you are migrating from a SQL database, you will likely have to rewrite all your queries that have joins – a common operation in SQL.

MariaDB Columnstore

MariaDB is an open source fork of MySQL. This fork was done by Michael “Monty” Widenius, the author of MySQL, after Oracle purchased Sun Microsystems.

MariaDB supports an open and vibrant community that has frequent updates and excellent community support. Additionally, MariaDB maintains high compatibility with MySQL, so it can be used as a drop-in replacement which supports library binary parity and exact matching with MySQL APIs. MariaDB also offers a columnstore engine for analytical use cases.

However, since MariaDB only supports storing data on disk, if query speed and latency are priorities, then you may not be too happy with the performance. Additionally, MariaDB’s columnstore product is still quite new, so there is likely still work to be done.

Pivotal Greenplum

Greenplum is a columnar data warehouse based on PostgreSQL. Greenplum uses massively parallel processing (MPP) techniques, with each database cluster containing different types of nodes, such as a master node and segment nodes. This allows for parallel processing of queries and storage of data. Greenplum is also fully SQL-compliant and ACID-compliant. Finally, unlike most columnstore databases – but like MemSQL – Greenplum also supports both row and columnstore data storage.

However, customers sometimes complain about the performance and usability of Greenplum. Many customers found the product difficult to tune, as Greenplum tends to use all the available system resources for every single query, which can lead to performance degradation when multiple queries are executed at the same time. Also, under high write loads, Greenplum has been reported to trigger Linux journaling errors. Errors of this type may require rebuilding the entire database, which might take many hours to complete.

SAP HANA

HANA is an entirely in-memory columnstore database developed by SAP. A major strength with SAP HANA is that it’s built as a data platform — there are multiple “engines” that sit inside HANA columnstore. There are specialty engines built for calculations, spatial use cases, predictive algorithms, and more, allowing users to pick and choose the right engine for their specific use case without having to use materialized views.

However, a common complaint among SAP HANA users is the specialized skills one may need to work with the product. Furthermore, since SAP HANA keeps all data in memory, with no disk-based tier, the RAM needed to hold all your data can get fairly expensive. Finally, the licensing costs of SAP HANA can get fairly high as well.

Where MemSQL Shines

In November 2018, we launched MemSQL 6.7. With MemSQL 6.7, as well as later MemSQL releases, you can use MemSQL for free, within fairly robust limits. When using MemSQL for free, you can create clusters with up to four nodes, with no limit on the amount of data stored on disk. You also receive community support via online forums, rather than direct, paid support.

Since launching MemSQL 6.7, we have been listening to how people use MemSQL for free. In these conversations, our users – including those that run production workloads on the free tier – have consistently praised our columnstore.

Purcado's analytics benefit from MemSQL's storage efficiency across a wide range of hardware.
Purcado helps people get the best deals, on the best shoes, from the best retailers, quickly and easily.

What makes it so good?

  • You can use it for free on up to four nodes with unlimited disk. But that’s not all — it also consistently outperforms other free, open source, and even enterprise-grade columnstore databases. This conclusion came from a number of people who have tested many rival databases – sometimes, even ten or more – before finally arriving at MemSQL.
  • MemSQL has built-in support for ANSI SQL, so the query language is very familiar. We also support the MySQL wire protocol, meaning we support a wide range of tools in the data ecosystem.
  • MemSQL offers incredible compression in its disk-based columnstore, allowing you to store more data and save precious storage space at the same time. Real customers like Pandora are able to reliably achieve 85–90% on-disk compression for columnar data.
  • MemSQL’s fully distributed nature means you can simply add affordable commodity hardware to increase query performance, concurrency, and ingest speed as your data grows.
  • Finally, unique to MemSQL, the ability to combine rowstore and columnstore data in one query means you get the benefits of real-time and historical data unified in one query! This means simplicity for your data engineering stack, lower maintenance costs, and improved performance, as MemSQL can, in many cases, be the one database to rule them all.

Did we mention you can use all this for free?

What People Are Saying

We can tout the benefits of MemSQL all we want, but we think it’s even better to let people who are using MemSQL for free do the talking for us. These are testimonials we have received directly from software developers and data engineers, company founders, and others using MemSQL to run their applications, answer their queries, and drive their businesses forward.

Paul Moss, E-commerce Startup in the United Kingdom

“I use MemSQL primarily for its columnstore. Your columnstore blows all of the free / open source columnstores in the market currently out of the water — it’s just so fast. PostgreSQL and CitusDB are inferior to your product. It’s not even close, especially since I’m running MemSQL on a single CentOS workstation machine. Additionally, as a business owner, you want the simplest engineering stack possible.

MemSQL is one database to rule them all, replacing three to four different databases. It does it all well.”

Hajime Sano, Nikkei in Tokyo, Japan

“The performance of MemSQL free tier is just as good as the enterprise version, which means performance for each query is really fast, the fastest in the columnstore databases out there. That is the greatest thing. MemSQL also supports both rowstore and columnstore in one query. We’re now able to balance real-time query performance (in rowstore) with lower hardware cost (in columnstore). 24/7 operational data goes in in-memory, while archival data goes to disk.”

Nikkei Asian Review uses MemSQL to help with analytics for their multinational media presence.
Nikkei Asian Review, with scores of bureaus and more than 1000 journalists throughout
Asia, delivers both business-focused and general coverage across the region.

Software Developer in Publishing Company in Germany

“The incredible compression and speed of the columnstore engine really is something, querying gigabytes of data in seconds was amazing to see. Also the possibility of combining rowstore and columnstore in one query is a very nice feature.”

Elad Levy, Entrepreneur in the Mobile Games Industry

“MemSQL in particular has columnstore, which is free, and it’s amazing. If you want to analyze data and get business insight, just go with MemSQL’s columnstore. You also get the ability to mix and match transactions (OLTP) and analytics (OLAP) in a single query, which saves us from deploying and querying another database. It’s a 2-in-1 solution.”

Peter Baylies, Purcado in Durham, NC

“I appreciate MemSQL’s speed even on modest, single-box hardware, as well as its storage efficiency on disk.”

Next Steps

Don’t take our word for it — you can find out for yourself why our customers say such positive things about our columnstore and choose to run their businesses on MemSQL, both for free and paid, with support. We have a tutorial on loading data into MemSQL and a webinar for building an analytics app using MemSQL’s columnstore. These resources show just how fast and easy it is to set up and use MemSQL.

To sum up, when using MemSQL for free, you can:

  • Use up to 4 nodes, with no specific limit on disk storage
  • Get rich community support at forums.memsql.com
  • Deploy to production
  • Not face any time limits

Want 24/7 support and even more nodes? You can contact us to begin the conversation.

Video: Modernizing Data Infrastructure for AI and Machine Learning

$
0
0

Feed: MemSQL Blog.
Author: Floyd Smith.

The AI Data Science Summit 2019 featured a keynote by MemSQL’s CEO, Nikita Shamgunov, where he was hosted by MemSQL partner Twingo. Nikita, a co-founder of MemSQL and the technical lead from the beginning, has shepherded MemSQL’s development toward a world where cloud, AI, and machine learning are leading trends in information technology. Now that these trends are becoming predominant, MemSQL is playing an increasing role, as Nikita discussed in his keynote. What follows is an abbreviated version of his presentation, which you can view in full here. – Ed.

Today I want to talk about the demands of AI and machine learning data infrastructure. Certainly the promise is very big, right? I couldn’t be more excited about all the innovation that’s coming in retail, in health care, in transport and logistics.

Investment into AI is very, very strong, with predictive analytics and customer analytics – areas where MemSQL is experiencing rapid and widespread adoption – as two of the top three planned uses for AI technology.

However, the data challenges remain. Only 15% of the organizations have the right architecture and the right data infrastructure for AI, and only 8% of all systems are accessible to AI workflows. And we see this all the time. You walk into a major organization, data is siloed, it’s locked into databases, SaaS services, data warehouses, and more. As a data scientist, data management becomes kind of one of the first challenges that you need to solve, because your AI programs and your AI technology are only as good as the data that is flowing in.

Databricks says the majority of AI projects have challenges moving from concept into production. What causes those delays? Typically, an AI workflow is a multi-step process. MemSQL can simplify and accelerate a lot of the steps in the AI life cycle.

MemSQL fixes delays in the ML and AI life cycle.

MemSQL plugs into modern applications and plugs into modern workflows, such as AI workflows, a lot better than old school technology. It allows you to close the loop and automate the loop and remove a person looking at dashboards from the workflow and make the system completely automatic. And that’s what an operational system allows you to do, so you can go from analytics to pixels, from analytics into an app, almost instantaneously, with an automatic workflow.

Some of the key challenges that MemSQL addresses in the ML and AI life cycle.

We are currently testing technology internally that reliably allows us to get responses to a specific set of query types in 2ms. We showed this to one customer, and they had a truly interesting response. Our customer said: “First, we actually don’t believe that you can do this. But if you can do it, we want it first.”

We’re working with top U.S. banks on fraud prevention, which is another very, very typical example of using MemSQL. And fraud needs to be detected these days in real time. You swipe a credit card, and a fraudulent charge needs to be rejected within that same transaction. And so, for that, you need to have a very performant, very efficient data backbone.

MemSQL's architecture gets real-time and streaming data lined up for ML algorithms and AI programs.

MemSQL is particularly well-suited for use in speeding up the workflows that are needed for all kinds of AI and machine learning applications. We jokingly call this a “markitecture diagram” – it shows the many ways that MemSQL brings together all the different strands of input and output, providing a fast, scalable, SQL database, which can ingest and store nearly any kind of data, for AI and machine learning programs to work against.

MemSQL's position at the center of an ML and AI ecosystem makes it a strong choice for machine learning and AI.

We hope to work with many of you on AI and machine learning applications going forward. You can see my conference presentation here. For more information, please reach out. You can download and use MemSQL for free, up to certain fairly generous limits, with community support from the MemSQL Forums. For more ambitious uses of MemSQL, or to subscribe to our excellent paid support plans, contact MemSQL today.

Forrester Finds Millions in Savings and New Opportunities in Digital Transformation with MemSQL

$
0
0

Feed: MemSQL Blog.
Author: Floyd Smith.

A new wave of digital transformation is in progress, and this new wave is powered by the exponential growth in the volume and complexity of data. To make data valuable it must be collected, stored, analyzed, and operationalized so as to drive value. Forrester has conducted a Total Economic Impact (TEI) analysis showing the savings and opportunities made possible for organizations moving to MemSQL.

In order to put savings and benefits into context, Forrester conducted case studies with four MemSQL customers. These customers face many of the data infrastructure problems that prevent organizations from using their data effectively, including:

  • Data in multiple silos
  • Stale data
  • Complex data architectures
  • Scalability limited, with expensive and fragile efforts to manage scalability
  • Poor performance for specific functions and across the board

The result? Brittle, overly complex data processing systems which, in many cases, are starting to come crashing down.

The Forrester TEI Methodology

The four customers that Forrester studied all upgraded to MemSQL to solve specific problems, as described below. The four customers were in online services, professional services, utilities, and online security services. Each customer had from one to several use cases for MemSQL running during the study period.

Forrester then used their trademark TEI methodology, scaling the results to a representative composite organization with 15,000 employees and $3B in revenues. The results were impressive – $15M in cost savings and new benefits across several initiatives and new opportunities generated by improved flexibility, all within a three-year period.

Forrester found $15M in cost savings and benefits in three years with MemSQL.

MemSQL is The No-Limits Database™, offering a database solution that emphasizes three things: speed, with accelerated time to insight; scale, the ability to grow data management and company operations at low and stable costs; and SQL, support for the lingua franca that has powered business solutions for decades.

Taking advantage of these capabilities, the specific companies that were studied, and the composite company that Forrester created for the analysis, experienced the benefits that MemSQL promises:

  • No more missed SLAs.
  • No more fragmented data architecture.
  • No more “we can’t do that.”

Customer Benefits with MemSQL

The benefits achieved in the composite company include:

  • Reducing legacy database costs. Lower software license fees and reduced hardware costs for running the software, saved $4.1M over three years.
  • Avoiding fraud-related costs. The cost of detecting fraud dropped, and success in detecting fraud improved, with a net benefit of $2.2M over three years.
  • Reducing hardware issues. The composite organization avoided 25 business-critical failures over three years, for savings of $5.4M over that time period.
  • Improved employee productivity in analytics. Reporting and data analysis time dropped from many hours to minutes, for productivity gains of $2.4M over three years.
  • Better decision-making. All across the business, managers take advantage of opportunities more quickly, with direct revenue benefits of nearly $1M in three years.
  • Improved product and services quality. Employees have more and better data at hand for helping colleagues, partners, and customers, with benefits compounding over time.

A composite organization, by taking steps based on Forrester’s case study analysis of four actual MemSQL customers, would experience benefits of $15M against costs of roughly $3.7M, for a net present value (NPV) of $11.3M and an ROI of nearly 300%.

How Customers Grow MemSQL’s Impact

In working with customers, we find that the benefits of MemSQL compound in another, powerful way. Customers tend to start by adopting MemSQL, as a new database in their arsenal, for a limited, specific use case where the cost/benefit ratio is hugely favorable and crystal clear. Once they get hands-on experience with MemSQL, however, they come up with new ideas for how to use it more broadly.

Fanatics, for instance, dramatically expanded their ambitions, scaling MemSQL up to use it as the core engine for their entire companywide, worldwide transaction capability. Similarly, a financial services company made a two-step move away from Oracle. They now run operational analytics on a combination of Kafka streaming and the MemSQL database.

Fanatics uses MemSQL as a compute engine and data repository.
Fanatics Uses MemSQL as their Core Analytics Engine for Global Operations.

See the Benefits for Yourself

To see the benefits for yourself, download the Forrester TEI Report. And for hands-on experience, download and run MemSQL for free.

Webinar: Delivering Operational Analytics with MemSQL

$
0
0

Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL product marketing leader Mike Boyarski led the webinar, describing how MemSQL is well-positioned to power operational analytics. Operational analytics is the ongoing use of live and historical data to drive decision-making by both people and programs, including predictive analytics, machine learning, and AI. To do operational analytics you need to quickly ingest, analyze, and act on both incoming and existing data – all of which is in the wheelhouse of MemSQL. To view the webinar, click here.

Traditionally, companies have their most important operational data stuck in silos. They can’t access it quickly or easily, and they can’t meet service level agreements (SLAs) for data delivery and data access.

To solve these problems, businesses that need to move fast use MemSQL. This includes half of the top 19 banks, two of the top three telcos, tech leaders – from Akamai to Uber – and many others.

MemSQL powers Uber, Goldman Sachs, Netflix, Capital One, and Fitbit.

Why Operational Analytics is Vital

The demands that organizations make on their data are growing. Data volume and complexity are rising; business expectations are growing, and analytics are evolving to keep pace. Whereas reports and occasional queries from experts were once considered enough, today, businesses want to use real-time data to power predictive analytics, machine learning, and AI.

Today’s systems either struggle to keep up, or don’t even try. Their responsiveness from event to insight may not meet SLAs – and often, the SLAs themselves are not enough to keep pace with new competitors. Costs and complexity continue to increase, and demands for access – from people with SQL queries, from SQL-compatible business intelligence (BI) programs, from management dashboards, and from programs that power predictive analytics, machine learning, and AI – are all rising. These requirements are now table stakes for organizations to be competitive, as digital native companies win more and more slices of the economic pie.

In order to power analytics, organizations need to ingest, analyze, and act on blended real-time and historical data. MemSQL’s capabilities make it capable of meeting this challenge, where legacy databases fall short.

MemSQL excels at operational analytics - ingest, analyze, and act.

A New Architecture – with MemSQL at the Core

MemSQL has the relational database capabilities needed to handle structured data for both transactions and analytics, the scalability to grow to meet demand for ingest, analysis, concurrency, and action, and the flexibility to handle semi-structured JSON data and full-text search for unstructured data. MemSQL also runs on-premises and in multiple clouds, in containers and virtual machines, and with a new Kubernetes Operator for open source or Red Hat Open Shift Kubernetes distributions, making for a truly cloud-native option that lives where you need it to.

MemSQL can ingest from a wide range of sources, including change data capture (CDC), at very high speed, and works with relational data, key-value data, semi-structured JSON data, geospatial data, and time series data.

MemSQL handles data from systems of engagement (SOEs), such as social media, internet of things (IoT) data, and mobile phone data, including the full range of supported formats, with excellent performance. On the other side of the data store, analytics demands include lookups, aggregates, ad hoc queries, machine learning (ML), and artificial intelligence (AI). This wide range of demands, including many more users wanting direct query access, drives a strong need for increased concurrency.

MemSQL can live right at the core of a modern data infrastructure, handling both transactions and analytics. You can augment existing systems or replace them at each stage of your infrastructure.

MemSQL sits at the center of a reference architecture for operational analytics.

MemSQL Meets Operational Requirements

Companies from Comcast to Uber find MemSQL vital to meeting their operational analytics requirements. MemSQL excels vs. competitors in meeting the needs of operational data workloads. The leading operational analytics competitors include:

  • MongoDB. Mongo is fully modern in its structure, and flexible in its deployment options. But performance for ingest, transactions, and analytics falls far behind MemSQL, and even lags other competitors.
  • Oracle Exadata. Oracle’s Exadata database machine is legacy, rather than modern; inflexible; hard to manage and keep available; and very expensive.
  • Amazon Aurora. Aurora is not modern, making it hard to match MemSQL or competitors in key areas of performance and flexibility.

MemSQL beats MongoDB, Oracle Exadata, and Amazon Aurora for operational analytics.

Q&A

Mike took questions from the audience, including:

Q. How do you write custom code to apply to ingested data?
A. You can create stored procedures containing all kinds of custom code and invoke them from the MemSQL Pipelines feature, using Pipelines to stored procedures.
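
For instance – the Kafka broker, topic, table, and procedure names below are placeholders – a pipeline can be pointed at a Kafka topic and routed into a stored procedure that applies custom logic to each incoming batch:

  -- Stored procedure that receives each pipeline batch as a query-typed parameter.
  DELIMITER //
  CREATE PROCEDURE process_events(batch QUERY(event_id BIGINT, payload JSON))
  AS
  BEGIN
      INSERT INTO events (event_id, payload)
      SELECT event_id, payload FROM batch;
  END //
  DELIMITER ;

  -- Pipeline that streams from Kafka into the procedure.
  CREATE PIPELINE events_pipeline
  AS LOAD DATA KAFKA 'kafka-broker:9092/events'
  INTO PROCEDURE process_events;

  START PIPELINE events_pipeline;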

Q. Are you an in-memory database?
A. MemSQL started out as an in-memory database, using rowstore only. We’ve now added a robust columnstore capability that is used by many of our customers. Though columnstore is disk-based, many customers have been pleasantly surprised by its performance and functionality.

Q. Do you support time series data?
A. MemSQL has strong support for many time series capabilities. However, MemSQL does not have all of the functionality of a specialized time series database out of the box. We are doing work internally on this, and a number of MemSQL customers are using us for time series data today. Please contact us to find out how we handle time series workloads.

Q. Can MemSQL be deployed on AWS, GCP, and Azure?
A. Yes! We have customers on each of these platforms. Also, MemSQL’s Kubernetes Operator makes it easy to manage MemSQL on these platforms, as well as on-premises.

Q. Has MemSQL been used to replace Exadata? What about Oracle’s version of SQL?
A. Yes, a number of well-known customers have made this move. We give you all the performance you need on a modern infrastructure, at a much lower TCO – often a third of Oracle’s cost, or less. For Oracle’s PL/SQL, we have a migration path and partners to help move up to thousands of stored procedures to MemSQL’s own stored procedure language, which includes many PL/SQL-friendly features.

Conclusion

To learn more about MemSQL and how it can help you deliver operational analytics, view the recorded webinar. You can also get started with MemSQL for free today.
