
MemSQL Forums Announces First Community Star


Feed: MemSQL Blog.
Author: Jacky Liang.

MemSQL Forums are available anytime you have a question about MemSQL, want to know more about MemSQL, or are willing and able to help others. And now the MemSQL Forums, coming up on their one-year anniversary, have their first Community Star: Ziv Meidav, a big data architect.

The community of MemSQL users, developers, and employees gather at the MemSQL Forums to ask questions, offer support, and discuss strategies for leveraging MemSQL. Got a nasty error message? Post it on a Forum. Want to suggest a feature? Same thing.

What the MemSQL Forums Do

The MemSQL Forums are organized into topics. Topics include:

  • Announcements. All the latest from MemSQL and partners.
  • Feature Requests. What you want to see in MemSQL going forward.
  • Cluster Operations. Managing your MemSQL clusters.
  • Documentation Feedback. Suggest improvements to MemSQL’s well-regarded documentation.
  • MemSQL Studio. Get your questions answered about MemSQL’s cluster monitoring and debugging tool.
  • MemSQL Management Tools. Ask about tools in the MemSQL Toolbox package and MemSQL Ops.
  • Third-Party Integrations. Using MemSQL with third-party tools and frameworks, such as Kafka and Spark.
  • MemSQL Development. If you’re developing against MemSQL, put your questions and comments here.
  • Uncategorized. A catchall category.
  • Site Feedback. Tell us how to improve the Forums.

The great thing about the Forums is the mix of people aboard. MemSQL engineers, experienced MemSQL users from organizations large and small, new users, and students all come on, asking and answering questions, from the simplest to the most arcane. Everyone pitches in to help.

The Forums are particularly valuable to users of MemSQL who don’t have an Enterprise license, but instead run MemSQL for free. This works on-premises as well as in the cloud, up to certain limits on the nodes and resources used. For such users, the Forums are the only source of support. And users of MemSQL’s free tier are regular contributors to the Forums.

One of the things that makes it fun to create and use MemSQL is that it is, fairly literally, a computer science project. Building and using “the next great database,” which is our goal, touches on all sorts of problems. They range from the momentary and entirely practical – “my cluster just crashed” – to the architectural, mathematical, and sometimes almost philosophical. It all gets hashed out on the Forums.

Ziv to the Rescue

Like many MemSQL users and fans, Ziv asks penetrating questions on the MemSQL Forums. Ziv’s Forums activity shows comments on columnstore performance for JSON data, a question on MemSQL Studio, and a group of comments on a performance issue – which turns out to be a fairly deep topic.

It all started with a SELECT statement that had varying execution speeds, ranging from 40ms to 500ms, when run on the MemSQL 7.0 beta. Ziv posted extensive details, and MemSQL’s Haoran Xu chimed in with an initial response.

MemSQL Forums Community Star Ziv poses a tough question to MemSQL staff.
Ziv Meidav’s focused, specific question helped uncover a bug in the MemSQL 7.0 beta.

This started a long exchange, totaling 17 pages over 5 days. It ended with Haoran recognizing a bug in the beta, described in detail in his final message in the exchange. MemSQL found and fixed the cause of the bug immediately, and the new and improved code will appear in MemSQL 7.0 soon.

As you see here, the Forums, with questions like Ziv’s, help MemSQL people as well as customers. Through the Forums, we learn about bugs, documentation issues, feature requests, and more. Software work can be very individual; through the Forums, MemSQL people worldwide and customers can work together to make MemSQL better.

Conclusion

The MemSQL Forums are a great resource, community, and unofficial school for MemSQL users and employees to teach and learn from each other, while solving problems and minimizing downtime. They’re available to anyone who has a free account from the MemSQL Customer Portal or signs up with a valid email. And if you participate in the forums, perhaps you’ll get chosen next month as our next Community Star.

There is, of course, an easy way to get started with MemSQL. Download MemSQL for free and start using the MemSQL Forums today!


Katoni Migrates from Elasticsearch to MemSQL for Scandinavian SaaS


Feed: MemSQL Blog.
Author: Floyd Smith.

Since MemSQL became free to use last November – for up to four nodes, and with community support – new, creative uses of MemSQL have abounded. One of the most impressive new implementations is from Katoni, an ecommerce hub that offers SEO tools via software as a service (SaaS) to mostly Scandinavian clients. Katoni has replaced Elasticsearch with MemSQL because of MemSQL’s speed, scalability, native SQL support, and ability to handle complex queries.

MemSQL is currently the primary database powering Katoni’s SaaS suite of SEO tools. MemSQL runs alongside PostgreSQL, which serves as a secondary database for specific areas such as projects, users, and billing.

Elasticsearch to MemSQL move.
Katoni runs a popular e-commerce portal and an SEO offering delivered as SaaS.

Moving to MemSQL

Like many companies, Katoni originally ran on other technologies, then moved to MemSQL as problems such as slow query performance dogged their efforts.

According to Martin Skovvang, software engineer at Katoni, “We started out with Elasticsearch, but soon needed a replacement as our queries became more complex. Especially, the JOINs, subselects, and HAVING features provided the major benefit of moving to MemSQL, along with the transaction support, scalability, and ease of use.”

Katoni uses a combination of rowstore and columnstore tables. They are excited about some upcoming MemSQL features that promise to combine the best of both.

Getting MemSQL for free, while they grow their SaaS business, is crucial to Katoni’s success. As they hit their business milestones in SaaS, they expect to move to a paid subscription.

“Impressive Stability”

Katoni runs MemSQL on Google Cloud Platform, using three VMs. They collect millions of rows of data a day into tables with tens of millions of rows in rowstore, using several tens of gigabytes of memory, and more than a billion rows in columnstore, consuming additional tens of gigabytes of disk storage.

Katoni describes MemSQL’s stability as “impressive.” According to Skovvang, “What usually worries me most about managing databases? Two things: backups and crash recovery.” With other databases, crashes required involving the vendor’s support team, costing Katoni hours of downtime. With MemSQL, the recovery process is almost fully automatic; they never need to involve support.

Katoni does have a short wishlist for MemSQL. The interpret_first feature, which was offered as an option in MemSQL 6.7, then turned on by default in MemSQL 6.8, met one wish. Katoni is also looking for a native UUID data type, unique indexes in columnstore, and enhanced support for foreign keys.

Schema design options will open up if MemSQL can shard on non-PRIMARY keys. Enhanced multi-language support for certain full text indexes will help as well. Most of the features Katoni is looking for are either already planned, or under active discussion for MemSQL’s development road map.

Conclusion

Many MemSQL customers get traction running MemSQL for free, with the option of moving to a paid subscription as their needs grow. You can ask about MemSQL on the MemSQL Forums, download and run MemSQL for free, or contact MemSQL Sales today.

MemSQL Customers Speak Up on Reviews Site G2 Crowd


Feed: MemSQL Blog.
Author: Floyd Smith.

MemSQL is featured on business solutions review site G2 Crowd. Customers cite advantages like MySQL-friendliness, data compression of 60-80%, strong training, and excellent support – and that’s just in a single review.

What G2 Crowd Does for Users

G2 Crowd features frank reviews of business solutions – software and services – across a wide range of categories. For software, categories include:

  • Sales. CRM software, sales acceleration, and a whole range of other sales solutions.
  • Marketing. Tools for A/B testing, content marketing, demand generation, lead management, social media marketing, and more.
  • Analytics. Business intelligence, predictive analytics, and statistical software are among the categories featured.
  • AI. Platform software, chatbots, deep learning, and image recognition are key artificial intelligence (AI) categories.
  • AR/VR. In augmented reality (AR) and virtual reality (VR), CAD software, content management, and game engines are among the offerings.
  • B2B marketplace platforms. Merchants, catering, and on-demand delivery for grocery stores and restaurants are included.
  • CAD & PLM. For computer-aided design (CAD) and product lifecycle management (PLM), categories include 3D design, computer-aided manufacturing, and geographic information systems (GIS).
  • IT infrastructure. Cryptocurrency software, relational databases (such as MemSQL), NoSQL databases, key-value stores, and many other types of infrastructure software are covered.

Additional categories are collaboration & productivity; content management; customer service; (software) development; digital advertising; e-commerce; ERP (enterprise resource planning); governance, risk & compliance; (content and software) hosting; HR; IoT; IT management; (general) office; security; supply chain & logistics; and vertical industry software.

G2 Crowd has features designed to help. You can easily link from a product to its overarching category (such as Relational Databases Software, for MemSQL), making comparisons easy. You can interact with a G2 Advisor who will help you find the products that are worth your time and attention. It’s easy to share comments on social media. And you can, in many cases, ask a question of, or request a demo from the vendor.

Using G2 Crowd for research can deliver a lot of benefits:

  • Validation. To begin with, you can quickly see that a provider is “real,” with validated and verified users who speak to the plusses and minuses of a product.
  • Strengths and weaknesses. Reviews tell you what a product has done for people, and where it has fallen short.
  • Sales call preparation. If you’re going to interact with a vendor’s salespeople, G2 Crowd can give you a running start on what to ask about.
  • Research completeness. You can find competitors to a product you’re considering and get a quick read on relative strengths and weaknesses. This can be very helpful in creating a shortlist for serious consideration, for example.

Posting a review on G2 Crowd also helps you, as a current user of a product. By posting, you encourage the good features of the products you review to get developed further – and the bad features to get fixed. Your review is likely to elicit responses and comments, amplifying your feedback and possibly helping you work around concerns. And letting vendors know that you post on G2 Crowd may give your feedback more weight with them as they develop their products further.

What the G2 Crowd Says About MemSQL

So far, so good – at this writing, MemSQL has a star rating of 4.5 out of a possible 5 points on G2 Crowd. Among the key positive comments:

  • Speed. Queries are reported to be very fast, for example.
  • SQL and MySQL compatibility. MemSQL uses standard ANSI SQL and is compatible with MySQL wire protocol, making it very easy to drop into existing workflows and development efforts.
  • Distributed database. MemSQL is fully distributed, which is unusual for a relational database and means you can scale MemSQL to meet a very wide range of needs.
  • Specific use cases. Geolocation queries, high rates of compression for columnstore tables, and online transaction processing (OLTP) at high rates of throughput are mentioned.
  • Affordable. Several users describe the licensing costs as reasonable, even “cheap.”

Not all the reviewers on G2 Crowd write in perfect English, as the site attracts users from all over the world. But the enthusiasm behind the many positive comments comes through: “We mostly go after greater performance and scalability”; “Very scalable, fabulous, and very easy to use”; “You can seamlessly join across both row and columnar based tables.” One comment seems to sum up many of the positives holistically: “We’re now able to keep pace with increasing data volume and provide faster insights for our customers.”

First review of MemSQL on reviews site G2 Crowd - five stars.
This MemSQL review includes positives like “fast queries” that appear in other reviews as well.

There are some recommendations for the company as well: an easier deployment process, better tools, built-in performance monitoring, and making the product easier for less-savvy users. Here at MemSQL – as is likely true at other vendors who see concerns listed – these issues have been noted, and efforts to address them are well underway.

And there are tips. One financial services administrator recommends taking the time to get to know the product in some depth, and using Prometheus and Grafana for monitoring.

Next Steps

If you’re already a MemSQL user – whether you have an Enterprise license, or are using MemSQL for free – consider posting a review today. (And remember that you can get questions answered on the MemSQL Forums as well.) Your efforts will benefit the community as a whole.

If you haven’t yet tried MemSQL, take a look at the reviews on G2 Crowd. Post questions as comments there, or on the MemSQL Forums. And consider trying MemSQL for free today.

A Scalable SQL Database Powers Real-Time Analytics at Uber


Feed: MemSQL Blog.
Author: Floyd Smith.

What is real-time analytics? Uber provides a useful answer. Uber is driven by analytics. The fastest-growing company ever, by some accounts, Uber uses up-to-date information to entice customers, direct drivers, and run its business – all on a database that provides answers with sub-second response times. In this presentation, James Burkhart of Uber explains how MemSQL has helped solve critical latency issues for the company at a crucial point in its growth. We encourage you to review the key points listed here, then view the presentation from Strata + Hadoop World 2017 and review the slides on Slideshare.

Four Challenges for Uber Analytics

The MemSQL database helps Uber make real-world decisions based on analytical results that the company uses to take action on the go. As a fast, scalable SQL database, MemSQL gives Uber a broad range of capabilities. Uber wanted to overcome four challenges:

  1. Business Intelligence
  2. Real-time analytics
  3. Time series aggregates
  4. Geospatial data

Uber was looking at metrics for just about every important aspect of their business: riders, drivers, trips, financials, etc. So business intelligence is the first challenge. An example of an actionable metric could be something like a surge of business in a contained geographic area. The action to take in response is to physically reposition supply in the real world.

They wanted this to be in real time, which is the next challenge. Before using MemSQL, Uber was facing p99 ingest latency ranging from 1.5 seconds to 3 minutes, depending on the ingestion source. They needed consistent responsiveness – fractions of a second, to a few seconds at most – for all sources.

Another area of interest is aggregation of time series data. In their real-time analytics system, Uber does not store the entire history of all time series data; there are other systems to do this. In this system, they want business metrics, which have strong seasonality components and are looked at in that context.

The real-time analytics system is not a source of truth, but a source for aggregated data. For example, a morning commute has very different marketplace dynamics than a Friday night out on the town. An example query for this purpose would be something like, “Hourly count of trips in San Francisco.” The whole system is designed and optimized for aggregated data, not individual records.

Geospatial data is the final key point. Uber’s business operates across the physical world, and they need to provide granular geospatial slicing and dicing of data to help analysts understand the marketplace properties of some geo-temporal segment of data. Knowing that, somewhere in San Francisco, the unfulfilled rate of requests went up at a particular point in time is not very useful, compared to understanding that, for example, near AT&T Park, when a San Francisco Giants game let out, there was a spike in demand.
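The presentation doesn’t show Apollo’s actual schema or queries, but a time series aggregate with a geospatial filter of the kind described above might look roughly like the following in MemSQL SQL. The trip_requests table, its columns, and the coordinates are hypothetical, used only for illustration.

SELECT
  DATE_FORMAT(request_time, '%Y-%m-%d %H:00:00') AS hour_bucket,
  COUNT(*) AS requested_trips
FROM trip_requests
-- Keep only requests within roughly 1 km of a point of interest (e.g., the ballpark).
WHERE GEOGRAPHY_WITHIN_DISTANCE(pickup_location, 'POINT(-122.389 37.778)', 1000)
GROUP BY hour_bucket
ORDER BY hour_bucket;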

The slide (available in this Uber presentation on Slideshare) shows an example of all of these issues coming together. It shows aggregated demand for Uber rides on New Year’s Eve 2016 (orange dots) vs. the early hours of New Year’s Day 2017 (green dots). You can see where demand is and the hour of the day, all relative to the terminator – the line between one day and the next, which sweeps across the globe every 24 hours. In the slide, the terminator is moving across the Arabian peninsula and eastern Europe, where it’s midnight, and heading toward very dense concentrations of Uber demand across Europe and key cities in Africa, such as Cairo, Egypt, as 2016 ends and 2017 approaches.

Uber uses real-time analytics to generate an image of ride demand on New Year's Eve.
Uber demand peaks in key cities on New Year’s Eve.

Why Uber Needs A SQL Database for Uber Analytics

Apollo is Uber’s internal platform for real-time analytics. Uber stores only recent data – about seven weeks’ worth. Ingestion latency is low, on the order of seconds to minutes between data being logged and that data becoming available for query. Apollo supports ad hoc exploration of data, arbitrary drill-down including geospatial filtering, and geospatial dimensioning of the data.

Another key property is deduplication. Kafka is used heavily at Uber, and a Kafka deployment provides an at-least-once delivery guarantee. One of the flaws in the previous system – the one that Uber has replaced with Apollo – was that it would double count in many scenarios involving hardware or software failures. With the new system, Uber needed to be able to assert that a uniquely identifiable record exists exactly once and is never double-counted. And they have to do all of this with low latency to the end user.
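The talk doesn’t spell out Apollo’s deduplication mechanism, but in plain SQL terms the general idea can be sketched with a unique key plus an upsert, so that an at-least-once redelivery from Kafka never produces a second copy of a record. The table and column names here are hypothetical.

CREATE TABLE trip_events (
  event_id VARCHAR(64) PRIMARY KEY,  -- unique identifier carried with each record
  city VARCHAR(64),
  fare DECIMAL(10, 2),
  event_time DATETIME
);

-- A redelivered message hits the primary key and updates the existing row
-- instead of inserting a duplicate, so aggregates are never double-counted.
INSERT INTO trip_events VALUES ('evt-42', 'San Francisco', 12.50, NOW())
  ON DUPLICATE KEY UPDATE fare = VALUES(fare), event_time = VALUES(event_time);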

The Apollo real-time analytics system has MemSQL at the core.
MemSQL is at the core of the Apollo analytics system at Uber.

So MemSQL is where Uber stores the data. They investigated some alternatives during the research and planning phase, but found that MemSQL serves their needs well. They find it to be super fast, that it supports ingestion at a rate significantly beyond their requirements, and that it meets their reliability needs.

Another feature that they started using, somewhat after the initial implementation, is MemSQL’s columnstore, alongside the in-memory rowstore. They started periodically moving some of the older data into columnstore, which can reduce storage costs by 90% or more in some cases. (Editor’s note: This is partly due to strong data compression in columnstore, in addition to the lower cost of storage on disk – as used for columnstore – vs. storage in memory, as used in rowstore.)
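As a rough illustration of that pattern – not Uber’s actual schema – a rowstore table for recent, hot data can be paired with a columnstore table for history, with older rows migrated periodically:

-- Rowstore (in-memory) table for recent data; rowstore is the default table type.
CREATE TABLE trips_recent (
  trip_id BIGINT PRIMARY KEY,
  city VARCHAR(64),
  created_at DATETIME
);

-- Columnstore (on-disk, heavily compressed) table for older data.
CREATE TABLE trips_history (
  trip_id BIGINT,
  city VARCHAR(64),
  created_at DATETIME,
  KEY (created_at) USING CLUSTERED COLUMNSTORE
);

-- Run periodically to move rows older than about seven weeks out of memory.
INSERT INTO trips_history
  SELECT * FROM trips_recent
  WHERE created_at < DATE_SUB(NOW(), INTERVAL 49 DAY);

DELETE FROM trips_recent
  WHERE created_at < DATE_SUB(NOW(), INTERVAL 49 DAY);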

Conclusion

In summary, James described how MemSQL provided a fast, responsive, scalable solution that allows Uber to run important parts of their business using MemSQL as a standard SQL database. In the conclusion to the presentation, James goes on to talk about many specifics of the Apollo system, including Apollo Query Language (AQL), their query layer on top of SQL. He shows how they optimized many aspects of the system to support AQL while delivering high performance.

To learn more about how Uber maximized their use of MemSQL, view the recorded presentation and review the slides. Also, you can get started with MemSQL for free today.

Webinar: Operationalizing Predictive and ML Applications with MemSQL


Feed: MemSQL Blog.
Author: Floyd Smith.

Predictive analytics, machine learning, and AI are being used to power interactive queries and outstanding customer experiences in real time, changing how companies do business. MemSQL is widely used to help power these advanced applications, which require fast access to recent data, fast processing to combine new and existing information, and fast query response. In this webinar, Eric Hanson, Principal Product Manager at MemSQL, shows how MemSQL customers are using this fast, scalable SQL database in cutting-edge applications. You can read this summary – then, to get the whole story, read the transcript, view the webinar, and access the slides.

Mapping MemSQL to Development and Deployment

Eric describes how MemSQL helps at each step as you build and run machine learning models:

  • Training models. You can use MemSQL as a fast, scalable source for training data, accessing the data via standard SQL. With MemSQL, you can complete many more training runs in less time.
  • Integration with machine learning and AI tools. A wide range of machine learning and AI tools use the MySQL wire protocol, which is supported directly in MemSQL. So they connect directly to MemSQL as well. This includes the Python data analysis library pandas, the scikit-learn Python library, NumPy, the R language, SAS analytics, and the TensorFlow machine learning platform.
  • Fast ingest. MemSQL accepts streaming data, for real-time ingest, or bulk uploads from a very wide range of sources.
  • Scoring on load. As you load data through a MemSQL Pipeline, you can transform data with a Python script or any other executable code that you create. This allows you to, for example, compute a “score” column from existing input columns very quickly during the load process (see the sketch below).

AI tools such as TensorFlow integrate well with MemSQL.
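As a sketch of what scoring on load can look like – the broker address, topic, transform location, and table name below are all hypothetical – a MemSQL Pipeline can pull records from Kafka, pass each batch through a user-supplied executable such as a Python script, and land the scored rows in a table:

CREATE PIPELINE scored_transactions
  AS LOAD DATA KAFKA 'kafka-broker:9092/transactions'
  -- The transform is any executable (here a Python script) that reads raw records
  -- on stdin and writes transformed rows, for example with a score column added.
  WITH TRANSFORM ('http://models.example.com/score_transform.tar.gz', 'score.py', '')
  INTO TABLE transactions_scored
  FIELDS TERMINATED BY ',';

START PIPELINE scored_transactions;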

Using these features together supports high productivity during development and very fast execution in production. Developers use their accustomed, purpose-built machine learning and AI tools, and are able to build and test models separately from the production pipeline.

In production, MemSQL’s scalability, very fast ingest, and fast execution of pre-written, pre-tested Python scripts and pre-compiled executable code support very high performance at scale. Systems that have feedback loops benefit greatly, as the gains from fast ingestion and fast execution compound over time.

Mapping MemSQL Features to Machine Learning and AI

While the overall philosophy of MemSQL is to integrate well with existing tools, there are a few MemSQL capabilities that further boost machine learning and AI use cases. Eric describes them:

  • Scalability. Operationalizing machine learning and AI programs is difficult enough, given the ongoing focus on research and development over implementation in these areas. It’s even more challenging when your shiny new ML pipeline can’t scale to match demand. Because MemSQL is fully distributed – for ingest, transactions, and analytics – it serves as the solution for a wide range of scalability problems, including this one.
  • Vector functions. MemSQL has a few functions that are especially useful for vector similarity matching, producing amazingly fast execution of complex operations at scale. DOT_PRODUCT and EUCLIDEAN_DISTANCE, each taking two vectors, are highly useful, specific functions. JSON_ARRAY_PACK, which packs a JSON array of numbers into a binary vector, and VECTOR_SUB for two vectors, are helpful supporting functions (see the example below).
  • ANSI SQL support. AI is often seen as a scientific function, isolated from business concerns. The full ANSI SQL support in MemSQL is highly useful operationally and can also serve as a bridge between the AI group and the business side.

Vector similarity matching functions run fast in MemSQL.
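As a small illustration – the table, column, and vector values below are hypothetical, with three-element vectors for brevity – vector similarity matching can be expressed directly in SQL:

CREATE TABLE product_embeddings (
  product_id BIGINT PRIMARY KEY,
  features BLOB  -- packed binary vector produced by JSON_ARRAY_PACK
);

-- JSON_ARRAY_PACK turns a JSON array of numbers into the packed vector format.
INSERT INTO product_embeddings VALUES
  (1, JSON_ARRAY_PACK('[0.10, 0.40, 0.80]')),
  (2, JSON_ARRAY_PACK('[0.05, 0.15, 0.90]'));

-- Rank stored vectors by similarity to a query vector using DOT_PRODUCT.
SELECT product_id,
       DOT_PRODUCT(features, JSON_ARRAY_PACK('[0.12, 0.37, 0.79]')) AS similarity
FROM product_embeddings
ORDER BY similarity DESC
LIMIT 10;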

Approving Credit Card Swipes in 50ms of Processing Time

Despite years of hype, it’s still early days for real-world implementations of machine learning models and AI programs in production applications. As deployments increase, MemSQL is being used for an ever-widening range of applications. (Streaming technologies such as Kafka and Spark are often deployed for the same purpose, despite the degree of change to data flows, skills, and vendor relationships required to take advantage of them.)

One example is a credit card fraud detection application developed by a major US bank. Fraud detection used to be a batch process that ran at night, allowing many illicit purchases to be made on a stolen card or stolen card data. But this bank is implementing fraud detection “on the swipe,” with approval decided within one second of the swipe – most of which is taken up by data transmission time. Only with MemSQL have they been able to assemble, process, and make a decision on their 70-feature model in real time.

Reference architecture for ML scoring for credit card swipes

Conclusion

For research and development, MemSQL is free to use for workloads up to 4 nodes (typically, 128GB of RAM), with much larger on-disk data sizes possible. When running MemSQL without paying, you get community support through the MemSQL Forums. To run larger workloads, and to receive dedicated support (as many organizations require for production), contact MemSQL for an enterprise license.

MemSQL’s connectivity, capabilities, and speed make it a solid choice for machine learning and AI development and deployment. For more information you can read the transcript, view the webinar, and access the slides. Or, download and run MemSQL for free today!

The Ideal Stack for Real-Time Analytics


Feed: MemSQL Blog.
Author: Floyd Smith.

Real-time analytics is necessary to enable real-time decision making and to deliver enhanced customer experiences (download real-time whitepaper). Building a real-time application starts with connecting the pieces of your data pipeline. To make fast and informed decisions, organizations need to rapidly ingest application data, transform it into a digestible format, store it, and make it easily accessible. All at sub-second speed.

In this video, we show how an earlier version of MemSQL was able to serve as the core of a real-time analytics stack.

How MemSQL Supports Analytics

Let’s briefly review MemSQL’s analytical approach. MemSQL is a real-time database – the fastest database for operational analytics. MemSQL is scalable, so you can scale to as many nodes as you want in your system. And MemSQL uses proven SQL syntax; it is a relational SQL database at its core. Analysts have ranked MemSQL as the number one database for operational analytics.

A real-time stack powers BI dashboards

The mission of MemSQL is to make every company a real-time enterprise. We mean to enable every company to build a real-time data pipeline and to have real-time analytics on that pipeline. So we very much care, as a company, about accelerating the way you get analytics from your data as it comes in. (Editor’s note: This includes support for business analytics (BI) dashboards, as supported by major BI tools.)

The purpose of real-time analytics is to gain insights that provide meaning to your business from data as it comes in. So you don’t want to do a batch job first and then do the analytics, or the analysis, later. You want to do the analytics as your data is piping into the system. Analytics is used for building a business intelligence dashboard that’s real-time, for improving the customer experience, for increasing efficiency, and for creating new revenue opportunities.

A typical real-time data pipeline is architected as follows:

  • Application data is ingested through a distributed messaging system to capture and publish feeds.
  • A transformation tier is called to distill information, enrich data, and deliver the right formats.
  • Data is stored in an operational (real-time) data warehouse for persistence, easy application development, and analytics.
  • From there, data can be queried with SQL to power real-time BI dashboards.

Real-time visualization is the purpose of BI dashboards.

Let’s dive into understanding the ideal real-time stack. To create real-time analytics, to sell the idea of real-time analytics in the business, you need a BI dashboard. For all the customers we’ve worked with, there’s some sort of visualization. It’s either home-grown or it’s built using one of the third-party platforms: Tableau, Zoomdata, Looker, MicroStrategy, Qlik, you name it.

Now, the common element of all these business intelligence tools that support business dashboards like this is that they only provide the interface. They usually don’t provide the backing data store. You need a place that stores your data, that’s really fast, to be able to come up with the real-time insight.

And so, to be able to make your visualization real-time, you need a real-time backing store. Moreover, these sorts of visualizations only attach to backing stores that speak a certain language, meaning they have to be compliant with certain syntax, the most prevalent of which out there is SQL.

A persistent data store powers fast queries.

In the image above, the element on the far right is real-time visualization. Next to it is the data persistence piece.

Following are four characteristics of a data persistence element for real-time analytics:

  1. In-Memory and Solid State. It needs to leverage in-memory and solid state storage. If you have a database that doesn’t have the ability to use in-memory and solid state storage these days, it’s probably not fast enough for your purposes.
  2. Distributed Architecture. The fastest databases are distributed, because you want to be storing your data in many nodes and processing all of that in parallel. You want massively parallel processing to be able to retrieve your data.
  3. Data Source Connectivity. You need a persistent data store that can connect to various sources. It’s important not only to support pure SQL coming in; the data store also needs to be able to go and grab data from other sources.
  4. Flexible Deployment. You need to deploy it on the cloud and on-premises. More and more workloads that we see are moving to the cloud, and the cloud is an increasingly strong player in this space.

So in other words, MemSQL delivers on all four of the above points. We are a scalable SQL database. So, when you think about MemSQL, think about it as you would think about MySQL or PostgreSQL or Oracle – SQL databases that are compliant with the ANSI SQL standard. The difference is that MemSQL has the ability to scale.

The Pinterest Use Case

When you use Pinterest, all the pins and re-pins that you do from your mobile devices or from the desktop are pumped into Kafka. From Kafka, they are then enriched in Spark. Pinterest adds location information and other information to them, and then stores them in MemSQL.

Pinterest's real-time architecture is driven by MemSQL

From MemSQL, they can then perform ad hoc queries for testing – usually A/B tests for ad-targeting purposes. And the big innovation, or the big technical benefit here, is that they are doing one gigabyte per second of ingest – around 72 terabytes – and they’re getting real-time analytics from all that data streaming into the system.

An Energy Use Case

This energy company has created a real-time analytics pipeline to determine the current state of their drills. This is an oil and gas company drilling for oil. If they drill in a certain direction, and the drill hits bedrock, that is very bad. Each drill bit costs them millions of dollars. So, what they need to do is determine, in real-time, whether or not the drill is going the right way.

A major energy company uses Spark and MemSQL to power operational analytics.

So, they have real-time inputs coming from their sensors. They pump all that information into Kafka – all the sensor information from all their drill bits. Once there, the data is sent through Spark, and they run a predictive analytics model on it.

The model is developed in SAS, and they use Spark to run it – to execute the machine learning model, or to do the model scoring. Once there, they can then put the results in MemSQL to be able to decide whether or not they should take certain proactive actions with their drill bits. So, this is another example of a real-time data pipeline that provides real-time analytics.

And finally, just to bring it all home, refer to the demo below; you can also view the presentation.

Wind Turbines Demo

The image here shows you data from 2 million sensors that are on wind turbines located all over the world. This demo is a simulation. The locations of the wind turbines are real. This is a real data set, but the activity of the sensors is what’s simulated here.

Wind turbine data across North America is quickly analyzed from MemSQL.

You’ll notice here that I can move around the globe and see where all my wind farms and wind turbines are. I’m specifically going into Europe, because I know that there are lots of wind farms and wind turbines in eastern Europe.

Dense networks of wind turbines in Europe store data in MemSQL too.

The pipeline here is as follows, and it’s very similar to the energy company use case I was just describing. Each of these wind turbines has sensors, and that sensor data all goes into Kafka. From Kafka, it then goes into Spark, where we run a simple regression analysis model to determine how we expect each wind turbine to behave.

So, the ones that we see as yellow, green, or red essentially represent our prediction of how those turbines will behave over time. The ones that are red, we expect you should probably replace soon. The ones that are yellow are getting there. And you’ll notice that the red ones eventually turn back to green, because the model assumes that turbines slowly degrade over time and then get repaired. So you can see here a visualization of all of that in real time.

MemSQL Ops gives a graphical dashboard on the system that powers graphical dashboards.

This is a view of MemSQL Ops, which is our graphical dashboard for looking at the MemSQL clusters. (Editor’s note: MemSQL now also offers MemSQL Studio, a more advanced graphical interface.) As I described earlier, MemSQL is a distributed database, which means it exists or you install the same software across many systems, or across many machines.

So here you see it has 16 machines, and all those machines are humming. If you see them green, it means the CPU is busy with MemSQL. So, why is it busy? Because, in real time, the data is coming from Kafka to Spark to MemSQL, and being rendered here in this dashboard. You can notice that it refreshes about every second, showing users new insights based on the real-time data coming into the pipeline.

Conclusion

To learn more about how MemSQL supports generating analytics in real time, view the recorded presentation and review the slides. Also, you can read more in our resource center.

Webinar: Providing Better Wealth Management


Feed: MemSQL Blog.
Author: Floyd Smith.

Wealth management is an intensely competitive offering for banks and other financial services institutions. It requires high concurrency – the ability to serve many users, fast – low latency, and the ability to access vast amounts of current and historical data in real time. Institutions have used in-memory databases and streaming data to try to meet the demand. In this webinar, MemSQL’s Sourabh Mehta shows how MemSQL improves the wealth management experience for users and gives institutions the ability to stand out in this competitive area. You can view the wealth management webinar here.

In the webinar, Sourabh describes the “before” and “after” architecture for a bank that provides wealth management dashboards to clients, whether individuals or family offices. By replacing a Hadoop/HDFS data store with MemSQL, the bank was able to deliver much more responsive updates to users, with query response times in the tens of milliseconds; support tens of thousands of simultaneous users, and add more users without additional engineering work or expense; analyze five times as much historical data to provide better answers to user queries; and avoid the processing and responsiveness delays that had previously occurred when important market news hit and usage surged.

This webinar is the second in a three-part series, How Data Innovation is Transforming Banking. You can register for the webinar series, or view the blog posts on each webinar as they appear:

  • Real-Time Fraud Detection for an Improved Customer Experience
  • Providing Better Wealth Management with Real-Time Data (this blog post)
  • Modernizing Portfolio Analytics for Reduced Risk and Better Performance (upcoming)

The Importance of Digital Transformation and Wealth Management

According to a Gartner survey, digital transformation is the top priority for banks, with almost double the interest of any other priority. It’s also a factor in all the other priorities they describe – revenue and business growth, operational excellence, customer experience, cost optimization and reduction, and data and analytics.

Digital transformation is banks' #1 priority, and they're using MemSQL to help.

Wealth management is a critical business area for banks, and a top target for digital transformation:

  • Deloitte says that 80% of retail bank profits are generated by high net worth individuals.
  • Aite says that US brokerages and registered investment advisers manage a total of $24.2 trillion in assets – a few trillion dollars more than the size of the US economy.
  • Assets controlled by these individuals are expected to rise by 25% in the current five-year period to 2021.

Wealth management generates the most profit for banks.

So wealth management is large, fast-growing, and strategic – perhaps the #1 strategic focus, in many cases – for banks and other financial institutions. Digital transformation of wealth management offerings is a top priority.

New Data Initiatives for Banks

Speeding up the delivery of data to the wealth management dashboard allows the user to get the information they need to make decisions without any visible delay; “we don’t want them to see a spinning wheel,” is how banks often describe the desired experience.

MemSQL powers the data innovations that banks need.

What banks want to eliminate, for their users, is called “event to insight latency” – that is, the waiting time between a request for information, or new information arriving from data sources, and the appearance of the information onscreen. This allows the user to interact smoothly with financial information, take action, see the response, and move on.

In order to deliver a positive experience for customers, companies set up service level agreements (SLAs) across their digital delivery infrastructure. Meeting these SLAs is crucial, but difficult to manage as the number of users increases, the complexity of queries grows at the same time, and news events that affect financial markets drive sudden surges in usage.

Case Study – Wealth Management Dashboards

A large US bank was struggling with their wealth management solution. Based on Hadoop/HDFS, the solution had many problems, but the worst was batch data loading. Data didn’t stream into the solution; instead, it was pooled for an hour, then a batch upload was run. (Batch uploads are the default for Hadoop/HDFS.) During the batch uploads, queries were locked out, which users found unacceptable.

Before MemSQL, the bank had slow queries.

In addition, queries were slow, ranging from tenths of a second into seconds. And concurrency support was poor; when critical market events occurred, and users moved onto the system en masse, response times slowed to a crawl.

The MemSQL Solution

The bank augmented the Hadoop/HDFS database with MemSQL, using MemSQL to power the wealth management solution and leaving Hadoop/HDFS as a data lake for long-term data storage. The results have been excellent:

  • Data is streamed into MemSQL in real-time; no waiting for data to be batched.
  • Ingest and query processing run lock-free, simultaneously; no query downtime during batch updates.
  • Five years’ history instead of one; applications and user-driven queries can draw on five times as much data at hand for deeper analysis.
  • Fast responsiveness; queries are answered in 10s of milliseconds, with no spinning wheel.
  • High concurrency; 40,000 users are supported with no contention, even when market events cause spikes in usage.

With MemSQL, queries run faster, reliably, and with high concurrency.

Conclusion

MemSQL is great for augmenting Hadoop/HDFS and other existing systems that suffer from slow responsiveness, batch update timeouts, and concurrency issues. You can view the wealth management webinar, download and run MemSQL for free, or contact MemSQL today.

Leveraging Web Workers For Client-Side Applications with React & Redux


Feed: MemSQL Blog.
Author: Floyd Smith.

If you’ve ever had a web application freeze while it was calculating something, chances are that performing that computation in a JavaScript Web Worker would help. In this blog post, MemSQL’s David Gomes shows how to create a fully client-side application, using the JavaScript libraries React for the view layer and Redux for the application state layer.

Introduction

In this article, we’re going to explore how we leverage Web Workers, together with React & Redux, to build a fully client-side web application here at MemSQL. My goal with this article is to highlight a specific use case of Web Workers, as well as detail how we were able to build on top of the relatively low-level Web Workers API to make our code more organized and easier to iterate on.

Web Workers are one of the most underrated features of JavaScript. Despite having been around for 10 years, they’re relatively unknown, and are not used very often in web applications. Most desktop GUI applications take advantage of multithreading to make sure their UIs are responsive while the application does other background work. Historically, web applications haven’t been able to apply the same strategy, but that’s where Web Workers come in.

As an example, if you’re building a CodePen-like application and want to parse the code in the editor, and add syntax highlighting to it, a Web Worker is a great idea, since you can perform the work in parallel, without incurring the large network cost of sending the entire code to a web server.

So, what is a Web Worker? A Web Worker is a feature of JavaScript that enables parallel execution of code in the browser. In other words, it allows for the execution of JavaScript in the background. The main use case of Web Workers is performing expensive computations in the browser without blocking the main thread, where the DOM is rendered. If you’ve ever had a web application freeze while it was calculating something, chances are that performing that computation in a Web Worker would help.

In this blog post, we’re going to dive into our usage of Web Workers in a specific application. That application is MemSQL Studio, a visual user interface that allows our customers to easily monitor, debug, and manage their MemSQL clusters.

MemSQL Studio is implemented as a fully client-side web application that runs in the browser. It connects to MemSQL and runs queries on behalf of the user in order to show all kinds of information about the state of the cluster. Additionally, this tool also allows users to run arbitrary queries against their cluster via an embedded SQL development environment.

Integrating React & Redux with Web Workers

The frontend of MemSQL Studio is implemented using React for the view layer and Redux for the application state layer. The “backend” of the application runs in the browser inside a Web Worker. This allows us to perform all the expensive work of connecting to MemSQL, running queries, and parsing the results in the background. This is convenient, since the queries Studio runs against MemSQL may return millions of rows. As such, we want to parse and clean up the outputs from these queries without blocking the main thread.

We used React and Redux to move database interaction to a background process and create a smoother user experience.

So far, all of this sounds wonderful. However, there’s one issue that I haven’t mentioned yet. How should the main thread and the Web Worker communicate?

Historically, the Web Worker API offered a shared memory protocol for coordination between the main thread and worker threads. Unfortunately, due to the Spectre Vulnerability, most browsers disabled the API. Because of this, MemSQL Studio leverages the transfer protocol to communicate. This is what the transfer protocol API looks like:

main-thread.js
// Spawn the worker and send it a message.
const expensiveComputationWorker = new Worker("worker-thread.js");
expensiveComputationWorker.postMessage({ n: 8 });

// The payload sent back by the worker arrives on the event's .data property.
expensiveComputationWorker.onmessage = (msg) => {
  console.log("received message from my web worker", msg.data);
};

worker-thread.js
onmessage = (msg) => {
  console.log("received message from main thread", msg.data);

  // Reply to the main thread with the result of the expensive computation.
  postMessage(getNthPrime(msg.data.n));
};

Since the backend of our application runs inside the Web Worker, this API is too low level for our application. We need something more high level, which allows our components to easily request the data that they need from the backend.

The first thing that comes to mind is GraphQL, a query language that allows clients to declaratively state which pieces of data they need from an agreed-upon schema. So, we gave it a spin and built a GraphQL server that lives inside the Web Worker. Then, we built resolvers for each piece of data that our client could possibly need, so our components could simply tell a GraphQL client (we used Apollo) what they needed.

After a while, this approach became cumbersome, since we now had two type systems that we had to keep in sync:
  1. TypeScript [1]
  2. The GraphQL Type System

Having to write all type definitions twice slowed us down significantly [2]. Moreover, we were not taking advantage of the GraphQL query language at all. Most of the pages in our application request all the information about all the records of a given record type (e.g., all the databases in the cluster, all the nodes in the cluster, etc.). For this type of query, GraphQL is not very helpful; it makes our architecture more complex without giving us any real benefits. I actually gave a talk just about this entire experiment at React Fest last year.

So, once we decided to drop GraphQL, we explored other options. This is the flow that we wanted – and ended up achieving:

  1. View Layer asks for data using some custom API
  2. Worker computes data
  3. Redux is populated with data
  4. View is subscribed to Redux updates and eventually displays data

This general pattern is very standard for React+Redux applications. The interesting bit here is how to populate the Redux state (which lives on the main thread) from the worker thread. Here’s what we came up with. First, the view layer:

page-databases.tsx
import * as React from "react";
import { queryDatabases } from "worker/api/schema";

class DatabasesPage extends React.Component<any> {
  componentDidMount() {
    // Dispatching this "worker action" asks the worker thread for the list of databases.
    this.props.dispatch(queryDatabases());
  }
}

We can see that it’s very easy for this React component to ask for the data it needs by dispatching a Redux action. But how does it work? How is the worker thread notified of this action, and how is the Redux state populated? We implemented this using Redux middleware. (If you are not familiar with Redux middleware, I recommend the official documentation.)

We created middleware in Redux that listens to all the dispatched actions. Whenever it finds that a “worker action” was dispatched, it passes it to the web worker. A “worker action” is a specific type of Redux action object that the Redux reducers don’t listen to; instead, a “worker action” represents a specific API call on the worker thread.

The middleware looks for a specific object structure to distinguish “worker actions” from regular Redux actions. So, our Redux middleware calls the worker thread (using a custom postMessage wrapper) and the worker thread then parses the “worker action” object to figure out which API call it should run.

worker/api/schema.tsx
export const queryStructure = makeActionCreator({
  name: "queryStructure",

  handle: (ctx: HandlerContext): Observable => {
    // ...connects to MemSQL and emits Redux actions carrying the query results.
  },
});

The exported function from “worker/api/schema.tsx” generates a “worker action,” which the “page-databases.tsx” file dispatches. However, the function that the middleware will cause to run on the worker is the handle() function, which performs the actual work of connecting to MemSQL and returning the list of databases.

Since the actual API call returns an Observable of plain Redux actions, each such action will be sent back to the main thread, where our middleware will dispatch them, allowing the reducers to listen to them. This completes the cycle that I mentioned earlier: (main thread → worker thread → Redux (main thread)).

Observables as the output of API endpoints are extremely powerful, since they allow an API endpoint to emit multiple times. This makes the following patterns (and others) trivial:

  • Emit the output of a query in batches, for a smoother experience
  • Emit loading and finished (success, error) states individually, so that the Redux store contains the current state for a request (which will be shown in the view)

One final thing to note is that all of our communication between the worker thread and the main thread is JSONified. We do this so that we can easily serialize and deserialize class instances using JSON revivers. If you’re curious about the performance consequences of this, you can check out this article. (It’s equivalent to the performance of native postMessage, which uses the structured clone algorithm.)

Conclusion

Web Workers are very powerful and their simple API allows one to easily build an abstraction layer on top of them. In our case, we figured out how to integrate React, Redux, and Web Workers in a way that works very well for us. We’ve found that this framework allows us to iterate quickly while achieving our main goal of running the heavy computation work without blocking the UI. If you are interested in an open source version of this solution, please reach out to us at david@memsql.com.

The MemSQL database is renowned for its performance, so it only makes sense that our UIs follow suit. For this reason, our frontend engineering team leverages the best web framework technologies to ensure our customers are guaranteed a stellar experience. If you are an Application Engineer with a similar passion for quality, we are hiring in Portugal and San Francisco.

[1]: MemSQL Studio is written using TypeScript.

[2]: There are some type definition generators that can help with this process. However, we found that this didn’t work well for us (we were using Flow at the time).


Webinar: Real-Time Fraud Detection for an Improved User Experience


Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar, which you can view here, MemSQL’s Mike Boyarski describes how real-time fraud detection represents a major digital initiative for banks. He shows how MemSQL gives fraud detection services, along with other real-time analytics, an edge with faster ingest, real-time scoring, and rapid response to a broader set of events. He also describes a major US bank’s implementation of this approach, which is described separately in our case study about fraud detection on the swipe.

New Data Initiatives for Banks

There’s a great deal of digital transformation occurring around data initiatives for banks. Banks are finding significant value in moving to a digital experience for their customers. Digitizing operations allows banks to deliver new services and products to the market, creating new sources of revenue.

MemSQL's architecture is well suited to detecting fraud against banks.

These initiatives create pressure on existing data infrastructure, which tends to have a great deal of latency between events in the real world and insights that banks can use to drive new applications and make better decisions. Banks, like other organizations, are seeking to enable more continuous data-driven actions and decision-making. They need a data infrastructure that can adapt to those changing conditions.

More and more queries need to fit into a service level agreement window. MemSQL has customers that need all of their queries to run within a 200 millisecond window – ultimately to deliver the best experience for their own clients. We also see customers that want to innovate, but want to do it using their existing operations and skills.

This means augmenting or supplementing what’s already in place to keep compatibility with existing tools and existing skills – SQL and other standard, relational technology – and also to provide a path for cloud adoption. So cloud-native technologies that work in multi-cloud or hybrid cloud configurations are the technologies we find banks investing in.

Fraud Detection Challenges and Opportunities

The market for fraud detection is significant. It’s moving from, in this study, $14 billion in 2018 to $34 billion in 2024. This represents a recognition that fraud is becoming very challenging, and that fraud detection can be augmented with technology. RSA reports a 600% increase in mobile fraud over the last three years. So obviously, as more transactions occur on mobile channels, the reality of fraud occurring on that engagement path is going up.

Digital transformation is Job 1 for banks.

Fraud is also very broad. It applies to online payments, detecting insider trading, building or creating new accounts, and the synthetic identity issue – fake accounts created using a blend of details drawn from hacked information. This creates a very complex, hard problem for banks to mitigate, while they also want to provide the best possible experience to their customers in order to compete. Banks have to figure out the balance between enabling a frictionless digital experience and protecting the assets of the business.

Fraud detection is very much powered by analytics, the ability to apply data to detection through anomaly detection, or predictive classification, or clustering. Where the battle really gets fought is identifying the appropriate model and analytic functions to identify something to block or approve. Having the sort of advanced analytics powered by big data to find the best fit, to find the most accurate models, that’s where a lot of the art and science of fraud detection exists. You can see more about this in our case study of a bank’s fraud detection application, where they have moved from overnight fraud processing to checking for card fraud in real time, on the user’s card swipe.

MemSQL Overview

What makes MemSQL a great solution for something like fraud detection? MemSQL describes itself as the “no-limits database” because of the software’s architecture. Ultimately we have a distributed, node-based scale-out system, so our ability to support growing workloads and the growth in data is baked in. MemSQL also has an innovative lock-free architecture, supporting the continuous ingestion and continuous data movement that are part of what we call operational analytics applications. MemSQL is an operational database that can do very fast analytics.

Fraud prevention is a critical, large, and fast-growing problem.

Machine learning (ML) and AI can bring a great deal of value to the business, so we see a lot of customers taking advantage of real-time ML with MemSQL. And we see a lot of customers making a transition from their legacy, on-premises architecture to the cloud, and doing it in a flexible, adaptable way. You can run MemSQL and deploy that on any cloud and/or on your own on-premises infrastructure; it’s all about flexibility.

MemSQL’s claim to fame is around delivering speed, scale, and SQL all in one package. Think of our system as being able to efficiently take data into the platform and then run queries on that data as fast as any data warehouse product in the market, including both legacy platforms like Teradata and Vertica, and some of the newer cloud-based data warehouses.

MemSQL came into existence about five or six years ago, so it was built in the cloud era. We took advantage of distributed processing and the cloud-native technologies that you would expect, like Kubernetes and containers, making MemSQL a really good choice for modern cloud-based platforms.

Q&A

How does MemSQL compare to Cassandra?

Cassandra is used pretty heavily in the fraud detection market, largely because of its ingest capability – the ability to ingest data into Cassandra is very solid. The challenge with Cassandra, though, is that if you want to do any sort of advanced additional query logic, we have found that customers really have a hard time getting insights out of Cassandra. So it’s okay for running a well-defined analytic function, but if you want to change that analytic function or iterate on it quickly, that’s where the limitations come into play.

Cassandra is a NoSQL system. That means its support for joins, its ad hoc query support, and its ability to run additional query and analytic functions are not standard.

At MemSQL, we are on par with the ingest requirement, if not better. And what’s best about MemSQL is that you’re getting a relational SQL environment, whereas with Cassandra you’re getting its own custom SQL-like query language, which means you have to learn that language. So you’re taking on a little more complexity around continuous improvement of the analytics.

So the takeaway is, you get the same ingest performance as Cassandra, but then you get the power of SQL using the MemSQL platform.

Can you explain how MemSQL works with Kafka?

Yes. In terms of ingest, MemSQL has support for a number of ingest sources, whether it’s file systems, S3 – and MemSQL has a built-in connector to Kafka. What that means is that, in a very simple way, using one command called “start pipeline,” it essentially automatically configures and identifies the connection point from your Kafka topic into MemSQL. It’s very easy to set up; there’s not a lot of hand-wringing and custom coding required to connect a Kafka environment to MemSQL.

And then out of the box, you get exactly-once semantics, you get extremely high throughput, and then you can do really fun things like put stored procedure logic onto the Kafka stream process so that you can do scoring on the stream, for example, if you’d like. You can also do pretty advanced things around applying logic to the data movement from Kafka into MemSQL – for example, to identify where to land the data, on a particular node or into a particular type of table.
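For readers who haven’t seen it, here’s a minimal sketch of what that looks like. The broker address, topic, and table names below are made up for illustration, but the CREATE PIPELINE / START PIPELINE pattern is the one described above:

CREATE PIPELINE card_swipes AS
  LOAD DATA KAFKA 'kafka-broker.example.com:9092/card-swipes'  -- broker and topic are assumptions
  INTO TABLE swipe_events                                      -- assumed destination table
  FIELDS TERMINATED BY ',';

START PIPELINE card_swipes;

Once started, the pipeline consumes the topic continuously and loads records in parallel across the cluster.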

So there’s a lot of power that you get with Kafka and MemSQL. That’s probably why it’s the number one ingest technology used in conjunction with MemSQL. It’s a top-notch feature, and it’s really well regarded within our community.

Can MemSQL be used with Hadoop? And how do customers typically deploy in that kind of environment?

We see Hadoop being used more and more as an archive storage layer. So we can do a couple of things with Hadoop. One is that you can stream data into both MemSQL and Hadoop concurrently. (As shown in this case study – Ed.) It’s your classic Lambda-style architecture. That means you can use MemSQL to do your analytics on the most current, active data, while your Hadoop cluster can be used for other types of analytics – more data-science-oriented or just archival-style analytics.

We do see some customers also landing data first into Hadoop, and then they use the HDFS connector to pull data from HDFS into MemSQL. And you can do that in a continuous fashion. So there’s an ability to stream data from Hadoop directly into MemSQL, which allows you to land the data once and then pull it into MemSQL for spot queries, or maybe a segment of queries, or a segment of data. And then when that period of time ends, or that query project goes away, you can flush the data out of MemSQL and keep all your data in HDFS.
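As a rough sketch of that land-once-then-pull pattern (the HDFS path and table names here are hypothetical), an HDFS pipeline looks much like the Kafka one, and can be stopped and its working table flushed when the project ends:

CREATE PIPELINE archive_pull AS
  LOAD DATA HDFS 'hdfs://namenode.example.com:8020/data/touch_points/'  -- assumed HDFS path
  INTO TABLE touch_points_recent
  FIELDS TERMINATED BY ',';

START PIPELINE archive_pull;

-- When the query project ends, stop the pipeline and flush the working copy;
-- the full history stays in HDFS.
STOP PIPELINE archive_pull;
TRUNCATE TABLE touch_points_recent;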

So there are a lot of different ways that people use MemSQL with Hadoop. And it has a lot to do with the application and the query requirements that you have. But I guess the net of it all is our HDFS pipeline is super robust and very commonly used to accelerate query performance of Hadoop. You get the nice relational SQL structure, all that interactive query response that you want.

Conclusion

You can take a free download of MemSQL. It’s a no-time-bomb version of our product. You can deploy it to production. Of course, it is limited by its scale and the number of nodes deployed, but you can do a lot with what’s available there. And then you can get support from the MemSQL Forums, which are community-driven, but also supported by some folks here at MemSQL.

Replicating PostgreSQL into MemSQL’s Columnstore


Feed: MemSQL Blog.
Author: Oryan Moshe.

Thanks to Oryan Moshe for this awesome blog post, which originally appeared on DEV Community. In the blog post, Oryan describes how to achieve the high performance of MemSQL’s columnstore for queries while keeping transaction data in PostgreSQL for updates – Ed.

Making the impossible hard

MemSQL cluster with a capacity of 550GB that has 914GB of data in it.
I… don’t think this is how it should look.

So it’s that time of the year again: we need to upgrade our MemSQL cluster and expand our contract to fit the new cluster topology.

We really outdid ourselves this time. Expanding to a 1TB cluster is impressive, especially when it’s completely not justified.

The background

Wait. A 1TB Cluster?

Yeah yeah, call us spoiled, but querying on PostgreSQL (PG from now on) is just not the same.

Sure, you can get OK speeds if you’re using the correct indexes and optimize your queries, but it’s not even comparable to the performance you get from the memory-based rowstore in MemSQL (Mem from now on), or the insanely fast aggregations of the columnstore.

A Short (Short) Summary of Mem’s Different Storage Types

So we basically have 2 types of storage in Mem, rowstore and columnstore.

The rowstore is stored pretty much like any other database, but in memory instead of on disk (crazy fast). This means each row is stored together with all of its columns.

The columnstore is sort of a transposed rowstore. Instead of storing rows, we store columns (thank you Captain Obvious), which allows us to make aggregations stupid fast. (Think about it; instead of going to each row and summing the “cost” column, we can just go to the “cost” column and sum it up.) The columnstore is stored on the disk.

(The MemSQL blog has an article on making the most of both rowstore and columnstore tables. – Ed.)
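To make the difference concrete, here’s a minimal sketch (table and column names invented for illustration). In Mem, rowstore is the default table type, while a columnstore table is declared with a clustered columnstore key:

-- Rowstore: in-memory, the default, good for point lookups and updates.
CREATE TABLE touch_points_hot (
  id BIGINT PRIMARY KEY,
  visitor_id BIGINT,
  cost DECIMAL(10, 2),
  created_at DATETIME
);

-- Columnstore: on disk, declared with a clustered columnstore key,
-- great compression and stupid-fast aggregations.
CREATE TABLE touch_points_cold (
  id BIGINT,
  visitor_id BIGINT,
  cost DECIMAL(10, 2),
  created_at DATETIME,
  KEY (created_at) USING CLUSTERED COLUMNSTORE
);

-- The "sum the cost column" example from above: the columnstore only has to read one column.
SELECT SUM(cost) FROM touch_points_cold;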

The issue is MemSQL’s license costs more as we have more memory in our cluster, not to mention the cost of the machines themselves (1TB of memory isn’t exactly cheap).

“So why not store everything in the columnstore? It’s cheaper both license and infrastructure wise, and it’s stupid fast!,” you might ask (if you talk to yourself while reading tech articles).

So here’s the catch – the way the data is stored in a columnstore makes it incredibly fast in aggregated queries, and allows amazing compression, but updating a row is slow.

(Some people here at MemSQL are thinking creatively about a possible solution to the problem of slow updates to columnstore tables that Oryan mentions here; stay tuned. – Ed.)

How slow? If we need to update some columns for rows in a specific day, it’s faster for us to delete the data from that day and re-insert the updated rows instead of updating the existing rows in place.
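For illustration, the delete-and-re-insert pattern looks roughly like this (the dates and the staging table are made up; the point is to replace a whole day’s slice rather than update rows in place):

DELETE FROM touch_points_columnstore
WHERE created_at >= '2019-06-01' AND created_at < '2019-06-02';

INSERT INTO touch_points_columnstore
SELECT * FROM touch_points_staging   -- assumed staging table holding the corrected rows
WHERE created_at >= '2019-06-01' AND created_at < '2019-06-02';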

So, How Do We Store Our Data?

Well, in my team we use 7 flavors of databases (might be more, can’t really keep track these days), but the main ones are PostgreSQL, hosted and managed by AWS RDS (for transactional processing), and MemSQL, hosted on EC2 and managed by yours truly (for analytical processing – including, but not limited to, analytics and transactions).

Instinctively, most of our data is stored in PG (excluding some large columnstore tables containing north of 8B records).

The problem is, once you go Mem you never go back, so we created a replication service that can replicate a row from PG to Mem’s rowstore in real-time. This allows us to enrich our columnstore-only tables, create ETLs, and most importantly, speed up queries.

If you’re here, you either use Mem and thus know its performance, or just like to go around dev.to, reading random articles about niche DBs. If you’re the latter, let me hit you with some numbers.

A completely reasonable query, consisting of 6 joins, took 30 minutes to run on PG. After optimizing it for 2–3 hours, adding indexes, banging my head against the wall and praying for a swift ending, I was able to cut it down to 3 minutes.

Taking exactly the original query (the 30 minutes one) and running it on Mem, it took 1.87 seconds.

The Real Deal

Problems Definition, AKA What’s Making Me Lose Sleep

So Mem is expensive, we’re almost at our new license limit (after more than doubling it) and there’s no way we can go back to querying exclusively on PG.

The solution seems simple: move big tables to the columnstore, free up some memory so you don’t have to increase your license, and upgrade your machines.

For this article I’ll use our table touch_points as an example. It’s our largest table stored in a rowstore (both in memory and row count) – it has over 180M rows, and weighs more than 190GB.
Why is it in our rowstore? First, because we replicate it from PG, and so far our service only supports replicating to rowstore tables. But, more importantly, it needs to be updated. Out of 30 columns, 2 might get updated – visitor_id and cost.

Solutions

The First Solution

So this was the “correct” solution, design-wise.

In short, using ActiveRecord callbacks, I kept 2 tables up to date. One is the touch_points table in the columnstore, containing all columns that exist presently on touch_points except the 2 that get updated. Other than touch_points, I created a table called touch_points_extra_data in the rowstore, containing the 2 missing columns and 1 ID column that allows me to connect the 2 tables.

As I said, this was the correct solution design-wise. The problem is that so much could go wrong. With so many moving parts, all dependent on Rails hooks, we were sure to get out of sync at some point. Not to mention the fact that we’d have to edit all of our queries on touch_points to add that extra JOIN.

The Second Solution, AKA “The Bruteforce”

So we realized our top priority is to keep the data correct, and we were willing to make some compromises (foreshadowing).

I decided to replicate the whole table, as is, from PG once in a while. This way we can make sure that (up to the moment of replicating) our data will be identical in both DBs.

The compromise is that we are used to having this data updated in real time, and now it’ll be outdated until the next replication. This is a compromise I’m willing to take.

The Technical Part

Easier Said Than Done

So apparently replicating a whole table from one DB to another isn’t as straightforward as you would think. Especially when the two DBs run on different engines entirely.

The first thing I tried was using pg_dump with the plain file format (which essentially creates a file with loads of INSERT statements), then converting it to MySQL syntax and loading it into Mem.

Sounds great, right? I started the pg_dump, and 5 hours later it wasn’t even close to finishing, while the dump file was already at 60GB. pg_dump with the plain option is the most inefficient way to store data. 5 hours delay in replication is unacceptable.

If at First You Don’t Succeed… Fail Again

The next thing I tried was using the COPY command of PG. This command can copy (duh) a table, or a query result, to a FILE, a PROGRAM, or STDOUT.

First I tried using the STDOUT option (the simplest one, and it doesn’t create a footprint of a huge dump file).

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c \
"COPY (SELECT * FROM touch_points) TO STDOUT
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" > touch_points.csv

And it worked! I got a “dump” file from PG containing our whole touch_points table, in just under 20 minutes.

Now we just need to import it to Mem, but why do I need the file? I can just pipe the result right from PG straight into Mem!

So I needed to create the part where Mem receives this csv-like table and loads it into the db. Luckily Mem is MySQL-compatible and provides us with the LOAD DATA clause!

LOAD DATA LOCAL INFILE '/dev/stdin'
  SKIP DUPLICATE KEY ERRORS
  INTO TABLE touch_points_columnstore
  FIELDS
    TERMINATED BY ','
    ENCLOSED BY '"'
    ESCAPED BY ''
  LINES
    TERMINATED BY '\n'
  MAX_ERRORS 1000000;

Now, as I said we want to pipe that data right into Mem, so we need to create a connection to our DB:

mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db \
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e \
"LOAD DATA LOCAL INFILE '/dev/stdin' SKIP DUPLICATE KEY ERRORS
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','
ENCLOSED BY '\"' ESCAPED BY '' LINES TERMINATED BY '\n' MAX_ERRORS 1000000;"

And then just pipe the data from PG to that connection!

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c \
"COPY (SELECT * FROM touch_points) TO STDOUT
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" |
mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db \
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e \
"LOAD DATA LOCAL INFILE '/dev/stdin' SKIP DUPLICATE KEY ERRORS
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','
ENCLOSED BY '\"' ESCAPED BY '' LINES TERMINATED BY '\n' MAX_ERRORS 1000000;"

And… It worked! But it took 2 hours to complete. I’m sure we can do better than that.

Compression is Your Friend

So two cool things that are important to understand about loading data into Mem are:

  1. When inserting a data file into Mem, it copies the file locally to the aggregator and splits the file between the nodes of the cluster, speeding up the data load significantly.
  2. Mem supports receiving gzip-compressed data files.

Combining these two pieces of information made me understand that creating the file in the middle maybe isn’t as bad as I thought.

I can compress that file, making storage a non-issue. It’ll also speed up the transfer of the file to the aggregator (before splitting) by cutting out most of the network related latency, and it’ll allow Mem to split the data between the nodes.

Let’s do it!

First of all I need to modify the PG part so that instead of piping the content to STDOUT, it pipes it to a PROGRAM – in our case, gzip.

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c \
"COPY (SELECT * FROM touch_points) TO PROGRAM 'gzip > /data/tmp/replication/touch_points_columnstore.gz'
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');"

After we’ve created this tmp file, we need to load it. Luckily, the only thing we have to do is change the source of the input file!
Our finished script looks like this:

psql -U read_user -h very-cool-hostname.rds.amazonaws.com -p 5432 -d very_cool_db -c \
"COPY (SELECT * FROM touch_points) TO PROGRAM 'gzip > /data/tmp/replication/touch_points_columnstore.gz'
WITH(DELIMITER ',', FORMAT CSV, NULL 'NULL', QUOTE '\"');" &&
mysql -h memsql.very-cool-hostname.com -u write_user -P 3306 -D very_cool_db \
-p'4m4z1nglyS3cur3P455w0rd' -A --local-infile --default-auth=mysql_native_password -e \
"LOAD DATA LOCAL INFILE '/data/tmp/replication/touch_points_columnstore.gz' SKIP DUPLICATE KEY ERRORS
INTO TABLE touch_points_columnstore FIELDS TERMINATED BY ','
ENCLOSED BY '\"' ESCAPED BY '' LINES TERMINATED BY '\n' MAX_ERRORS 1000000;"

And that’s it!

The created file weighs 7GB, and the whole process takes less than 20 minutes, so we can run it once an hour and have semi-realtime data!

Obviously this wasn’t the end. I wrapped it up in a nice Rails module that allows me to replicate any query from PG to Mem easily, including truncating the old data and using 2 tables to minimize the downtime during replication.

Feel free to contact me with any questions! (Twitter: @oryanmoshe. Github: oryanmoshe.)

Case Study: Replacing Exadata with MemSQL to Power Portfolio Analytics and Machine Learning


Feed: MemSQL Blog.
Author: Floyd Smith.

This case study was presented as part of a webinar session by Rick Negrin, VP of Product Management at MemSQL. In the webinar, which you can view here, and access the slides here, Rick demonstrates how a major financial services company replaced Oracle Exadata with MemSQL to power portfolio analytics, with greatly increased responsiveness for users and the ability to easily incorporate machine learning models into their applications. In this case study, we’ll first describe the bank’s previous digital infrastructure, built on Exadata, then present their implementation of MemSQL as a reference architecture that you can consider for your own organization’s needs.

Case Study, Before: The Previous Architecture, with ETL, Exadata, and RAC

This case study describes an asset management company. It’s a fairly large asset management company, with about a thousand employees and probably just under half a trillion dollars in assets under management. They have been in business for several decades. They’ve invested heavily in a lot of different technologies, primarily some legacy database technologies. Things were working pretty well for a while, but as new requirements started to come in and more users started using the system, they ran into trouble.

So this is what their architecture looked like, and it should be fairly familiar to most of you. They have a variety of data sources – obviously their own internal operational systems, mostly legacy databases – combined with some external third-party data and partner data that they would bring in, as well as behavioral data from how their users are using the system, both on the web and on mobile. And all of that data was moved, via standard extract, transform, and load (ETL) processes, into a traditional data warehouse.

MemSQL for portfolio analytics - replacing Oracle Exadata and RAC

And then that data was accessed by a variety of different users. So you had business users using custom business applications, doing data exploration and data prep on that data, as well as business users using a combination of Tableau and Excel to do analysis, and some data scientists using SAS to do data science and data exploration, trying to move the models forward.

Now this resulted in a number of problems. One is that they were stuck with batch ETL. That was initially okay, but as they tried to move to a more streaming, more real-time system, it was becoming a bottleneck. The existing database technology and the ETL technology they had were just not sufficient. They couldn’t make it run more often than nightly refreshes and hourly updates. This basically resulted in the system being offline whenever they would ingest large amounts of data.

On top of that, the data models in use by their data scientists were aging and somewhat limited. The setup didn’t allow continuous development, so it was tough to evolve the models as they learned new things and got new data. And probably the most painful thing was that as more and more users tried to use the system, the queries were getting slower and slower. As concurrency ratcheted up, the queries would slow down.

MemSQL for portfolio analytics - slow queries

On top of that, people wanted to be able to use the data all the time, not just nine to five. And so they want to be able to use this system even when the data’s loading constantly.

They tried to meet these new challenges by leveraging newer hardware, or appliances like Oracle RAC and Exadata. And those are extremely expensive, given the kind of hardware needed to try to solve the problem.

MemSQL for portfolio analytics - ETL

Case Study, After: The New Architecture, with Kafka, Spark, and MemSQL

To solve these problems, they replaced the old architecture with something that looks like this. Basically with the combination of MemSQL and Kafka and Spark.

MemSQL for portfolio analytics - Kafka

So the first step was to replace all the ETL technologies with a Kafka queue. For those who aren’t familiar, Kafka is a distributed message queue. It’s fairly easy to set up, scale, and manage, and it’s a great landing place for data that’s waiting to be processed. So they changed the older data sources to funnel into a single Kafka queue. And then from there they fork the data into a couple of MemSQL instances, as well as into a data lake for long-term storage.

On top of that, they would then take a combination of the securities data in their data science sandbox MemSQL instance, as well as some data from the data lake, and pull that into a Spark cluster – leveraging the native integration that MemSQL has with Spark – so they could train their machine learning models with the newest market data all the time, driving a continuous evolution of their machine learning algorithms. At the same time, they could continue to run queries in Tableau and Excel, continue to use SAS, and continue to run their business applications without having to disturb those approaches too much.

And lastly, they were able to get much better performance than they were getting before. They got significantly faster queries. Also, because of the more efficient use of storage and better cost effectiveness, they’re able to store the required five years of history, versus the three years they were able to store in Oracle. And they did all this while still being three times cheaper than the Oracle solution.

MemSQL for portfolio analytics - benefits

To summarize the benefits: the combination of Kafka, Spark, and MemSQL enabled them to do continuous trade and risk analysis using live market data, moving from batch into real time. They reduced their overall spend by 3x while still improving performance. And they have a new data platform for driving their ML and operational analytics delivery, making them much more agile and able to move faster.

Conclusion

You can take a free download of MemSQL. It’s a no-time-bomb version of our product. You can deploy it to production. Of course, it is limited by its scale and the number of nodes deployed, but you can do a lot with what’s available there. And then you can get support from the MemSQL Forums, which are community-driven, but also supported by some folks here at MemSQL.

Webinar: Modernizing Portfolio Analytics for Reduced Risk and Better Performance


Feed: MemSQL Blog.
Author: Floyd Smith.

In this webinar Rick Negrin, Product Management VP at MemSQL, describes the importance of portfolio analytics, enhanced by machine learning models, to financial services institutions – helping them to meet customer needs and edge out competitors. He shows how MemSQL speeds up portfolio analytics at scale, with unmatched support for large numbers of simultaneous users – whether connecting via ad hoc SQL queries, business intelligence tools, apps, or machine learning models. You can view the recorded webinar and download the slides. He also describes how a major US financial services institution implemented Kafka, Spark, and MemSQL, replacing Oracle and widespread use of cumbersome extract, transform, and load (ETL) routines, in this separate case study.

The business problem Rick discusses is the need to modernize portfolio analytics for reduced risk and better performance – both for the customer managing their portfolio, and for the institution offering portfolio management tools to customers. Institutional investors want smarter portfolio management services that deliver optimal returns while reducing their exposure to any one industry, currency, or other specific source of risk.

Portfolio managers want guided insights to help them avoid sudden or dramatic rebalancing of funds that can drive up costs and reduce confidence and customer loyalty. MemSQL powers a number of portfolio dashboards and what-if analysis, leveraging live market data for the most up-to-date view of the market. The separate case study shows how a major financial services company used MemSQL to solve these problems, supporting their leadership position in the market.

This webinar was originally presented as part of our webinar series, How Data Innovation is Transforming Banking (click the link to access the entire series of webinars and slides). This series includes several webinars, described in these three blog posts:

Also included are these two case studies:

You can also read about MemSQL’s work in financial services – including use cases and reference architectures that are applicable across industries – in MemSQL’s Financial Services Solutions Guide. If you’d like to request a printed and bound copy, contact MemSQL.

The Role of the Database in Digital Transformation

Digital transformation remains the top priority by far for banks. This is confirmed by a Gartner study from 2019, but we at MemSQL hear it anecdotally in all the conversations that we have with financial institutions. This is because of the opportunity that digital transformation provides. When you take advantage of new technologies, you can create new sources of revenue, and you can drive down your costs with new operating models, allowing you to deliver digital products and services that just weren’t possible before.

To make this happen, you need to have an architecture and operating platform that supports a new set of requirements. One need is to drive down latency: the time from when a new piece of information is born to the time you’re able to gain insight and take action on it. The effort is to get that as close to zero as possible.

When you do that, you can make faster data-driven actions in your business. So when something’s going on in the financial markets, the customer wants to understand it, to know what’s going on, as quickly as possible. And to be able to take action on it, in order to either reduce risk in a portfolio or perhaps take advantage of some new opportunity that’s come up.

Data innovations needed for financial applications - MemSQL

You also need adaptable analytics. The days of getting a static report once a week to your desk and then using that as information are far in the past. You need to be able to have an interactive experience with the data that’s flowing in. To be able to slice and dice it, looking at it across many different dimensions, to find the key insight that’s going to allow you to take advantage of what’s going on.

And you also want to be able to apply historical data in order to take the best advantage of what’s happening in real time in the market. This is especially important in the context of the machine learning algorithms that are being developed. Using the historical data to understand the patterns just so you can identify, given what you’re seeing in the market right now, what’s likely occurring, and how best to take advantage of it.

The second pillar is around service level agreements (SLAs), and particularly around analytics. So you are moving the systems from the back end, where you have maybe a couple of backend analysts who are working with the data, to the front end, where the people working with the data are the end users or the portfolio managers or even the end customers. The bar for the experience goes up dramatically. As does the need for concurrency – the need to support many simultaneous users, including those backend analysts but also BI tools and apps, at the same time.

You want interactive experiences that are snappy and responsive and allow you to get answers as quickly as possible. But to make that happen, you have to have SLAs on all the different dimensions of usage within the system: how fast the data is ingested into the system, how quickly you can query it out, how fast the storage is growing. You need SLAs across all those dimensions in order to guarantee a positive customer experience. And you need to do that not just under average load, but maintain those SLAs even at peak times.

Think of when some momentous event, or series of events, happens in the financial markets. You know, think 2008, or even 2000, when everybody’s coming in to use the system and you’ve got ten times more users concurrently trying to run their queries and trying to hit the storage system, the database system. You want to maintain those SLAs even in the face of that – perhaps especially in the face of that. And to do that, you need a system that can scale depending on the load.

And last, you want a system that supports your operational standards; it should plug in with your existing tools so you can leverage all of the tool sets you already have. This means robust support for ANSI SQL. It also means preserving the experience that your users have with those tools, so you don’t have to retrain all your users on how to operate, manage, and optimize the system.

The more you can leverage the tools that you have, the easier it is to plug the system into the overall ecosystem. And it’s got to be a system that’s not just built for today’s problems, but also for where everyone’s headed. And the place people are headed these days is into the cloud, into the public cloud systems. So it can’t be a legacy system, especially one that’s tied to legacy hardware, because those won’t go where you’re headed. And you want something that is highly available and scalable, and able to meet these requirements.

The Rise and Rise of Portfolio Analytics

Portfolio analytics is a huge market. It’s $23 billion today, expected to grow to nearly double that in the next five years. What’s driving that is the combination of compliance, digitalization, and the drive to automate the backend processes. And as that happens it basically allows new opportunities in the market. It’s all coming from technological innovations that are happening in the fintech industry and providing a number of opportunities to go and take advantage of those technologies.

MemSQL - Portfolio analytics is a growing business

Now more concretely, what are the problems that financial services companies are facing? One is the need to combat passive investment vehicles. Those could become the default that people gravitate to because they’re easier and lower cost. You’re also seeing more competition among the large asset managers; it’s said that the number of asset managers has gone up by something like 10x over the last 20 years. There are more people trying to do asset management, and they are all using similar kinds of tools to do it.

Business challenges and outcomes for database technology

And then, because the passive investment vehicles have come to be so dominant, it’s driving down the fee structures, which means financial services companies need to be more cost effective and more cost efficient in how they operate. (For more on how one large financial services company met these requirements by moving from Oracle Exadata to MemSQL, see our case study – Ed.)

How MemSQL Powers Portfolio Analytics (and Digital Transformation Overall)

Now let’s go into why MemSQL is so good for portfolio analytics. Why did we build MemSQL, and how does it serve the market?

MemSQL is a cloud-native operational database built for speed and scale. But what we do is what we call operational analytics. There are two common patterns in the database industry, two workloads that were the dominant patterns for a very long time. One is OLTP – online transaction processing – which is all about serving applications and being reliable, distributed, transactional, and durable. The other is OLAP – online analytical processing, also known as data warehousing – which is all about serving analytical queries, complex joins, and aggregations. Separating the two is ETL – the extract, transform, and load process that moves data out of OLTP, reshapes it, and stores it in OLAP. This means the OLAP world lived on stale data, with reporting and analytics mostly done in an offline manner (despite the name), for people to make business decisions and future plans.

MemSQL excels in operational analytics

Operational analytics is a third workload that combines the requirements of the other two.

So it has all the analytical requirements in terms of the complexity of the queries, the need for joins and aggregations, time series and window functions, and all the stuff you’d expect from an analytical data technology – combined with the requirements around maintaining an SLA, so it needs the reliability, availability, and scalability of the operational systems. And when you have requirements that span both of those things, MemSQL is the best database, hands down.

So we boil this down into kind of a pithy statement of, when you need analytics with an SLA, MemSQL is the best fit. And this is the bread and butter of what we do with MemSQL and the kind of problems that we solve.

And we’re also seeing more and more people moving into predictive ML and AI use cases. It turns out what you need to solve those problems is a system that can operationalize your model at scale with a combination of historical and real-time data – and MemSQL is a great fit for that. So we see people doing more and more work in this space, and it’s really just the evolution of operational analytics.

A good example of this is we have a large bank in North America that’s built their credit card fraud system on top of MemSQL. And they chose it because the system they had before was too slow. And so they could catch the fraud, but only after it had happened – or even after it had happened several times. And then they would refund the money to the customer, but they would lose out.

What they wanted was a system that could identify and stop the fraud in its tracks before it happened. And to do that, you need to be able to load data continuously and in real time. Leverage historical data, for example, the user’s past purchasing history on the credit card. And then combine that with what happened with the credit card right now. And then make an instant decision around whether or not they’ll let the charge go through. And by leveraging MemSQL, they are able to do that. And using a custom model they built, they were able to implement that and achieve their objectives.

And the third pillar of what we do is around helping customers move to the cloud and replace their legacy systems. As everybody is moving to the public cloud, there’s a need to replace the legacy systems that won’t make it there – either because they are using hardware that’s just not possible to take to the cloud, or because they have legacy algorithms, technology, and intellectual property that was built for slow-spinning disk 20 years ago and just isn’t applicable, or doesn’t make as much sense and doesn’t work as well in a cloud environment.

And MemSQL has the advantage of being a modern system built with modern data structures that works very well in cloud environments, giving you the availability, reliability, and scalability of a cloud system, combined with a front end that is familiar and easy to use, because it looks just like a regular relational database. So it really gives you the best of both worlds, making it easy to move from legacy systems to something more modern that will run in the environments you need to run in. So that’s what we do.

And now, who do we do it for? As you may have surmised, finance is a top industry for us. Over half the top 10 banks in North America make use of MemSQL, and they do it not just for portfolio analytics, but for a number of other use cases like the fraud use case I mentioned, risk management, trade analytics, and pretty much anything that fits that operational analytics workload. There are more and more use cases popping up all the time.

We have a number of MemSQL customers in the media space – Comcast and Pandora are just a couple of them – as well as in the telco space, like Verizon. These customers are doing things like tracking the quality of video or audio streaming, doing user behavior analysis in order to implement things like personalization, and doing ad tracking and delivering analytics to partners, such as advertisers, for how the ads are performing.

And the third key vertical for us is really in the high tech space. Everything from companies as big as Akamai and Uber making major investments to leverage MemSQL, to fast growing startups that need systems that can help maintain their growth as they’re building their customer-facing SaaS products.

The MemSQL database excels in speed, scale, and SQL

So the key problems that we solve are around speed, scale, and SQL. When you need speed – meaning bringing data in fast and pouring data out fast, where performance matters – coupled with scale, where you need to be able to scale the system up as your usage and number of customers grow, but you want to do it with a familiar interface that works with the existing tools, technology, and skill set of your people – when you have all three of those requirements, then MemSQL is the absolute best fit for you. Speed, scale, SQL is how we describe our differentiation, the key pain points that we solve. And that’s what’s driving our business.

MemSQL capabilities include fast ingest and cloud-native deployment

Now in terms of how we fit into the overall ecosystem, you can think of us like a regular relational database, and that data comes in from a data source. That data source can be standard operational legacy databases, could be streaming data sources like Spark and Kafka as I mentioned. Or bulk loading from big data systems.

And we have technologies that bring data in from all those different sources, regardless of the type or how you’re bringing it in. We have things like CDC technology for bringing in data from operational systems. We have a feature called Pipelines, which lets you bring data in easily from Kafka and from data stores. And we have the ability to do transformations on the data, so you can easily get it into the right shape before you put it into the database. Now, when you put it in the database, we have two different storage engines: an in-memory rowstore and a disk-based columnstore.

We find customers tend to use a mix of the two. The rowstore tables are particularly good for hot data that needs to have a strict SLA, specifically when you do transactions or seeks. And the columnstore is much better for analytics and aggregation and the more analytical queries. And usually customers have the combination of the two on their system.

We support a number of different data types – whatever type of data you want to store, we can support it in MemSQL. Obviously we handle relational data, but we also have a native geospatial type, so you can easily index that and put it alongside your relational data. We support a native JSON column type, so you can store data in JSON, project out properties, and easily index them and reference them within your SQL query.

We support time series and key-value patterns as well. And then we also support a full-text index, so you can index text elements of the database and reference them in your queries. And whether you’re using third-party analytical tools like Tableau, Looker, Zoomdata, or MicroStrategy, MemSQL supports all of those. Many of our customers use third-party tools, many build custom applications, and some use a mix, based on their needs and how they’re operating.
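As a rough sketch of how those types sit side by side (the table, columns, and JSON property here are invented for illustration), a single table can mix relational, geospatial, and JSON columns, and a query can project a JSON property like any other column:

CREATE TABLE store_visits (
  visit_id BIGINT PRIMARY KEY,
  store_location GEOGRAPHYPOINT,   -- native geospatial point type
  details JSON,                    -- native JSON column type
  visited_at DATETIME
);

-- Project a property out of the JSON column and use it in ordinary SQL.
SELECT visit_id, details::$campaign AS campaign
FROM store_visits
WHERE visited_at >= '2019-01-01';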

And of course in terms of how we run, you can run us on-premises, on bare metal. You can leverage VMs, or more recently, you can now leverage Kubernetes – we now have a Kubernetes operator. If you want to run us inside of a Kubernetes cluster on-prem or in the cloud, you can leverage our operator to easily deploy and manage MemSQL. And we run it across all the different cloud vendors – AWS, Google and Azure. We have customers who run us self-managed in all three. We also offer a managed service. If you don’t want to manage MemSQL yourself, we’ll do it for you.

How do we do all this? We do it through a set of features around scalable SQL. We have full ACID support, meaning we support transactions and rollbacks, combined with full ANSI SQL – pretty much any SQL function you would want is supported. And, as I mentioned, we have the full set of data types: whether you’re storing documents or JSON, geospatial or full-text search, it’s all natively supported within the system.

We have fast ingest, and we have a standard bulk load API supporting the MySQL wire protocol. You can load data using files in CSV or other formats; bulk data is easily loaded into the system. We have a native parallel streaming ingest feature called Pipelines. It’s a first-class object in the system, so you can easily load data in with a simple statement, CREATE PIPELINE. You point it at a Kafka queue or an S3 bucket or any of your favorite storage mechanisms, and it immediately starts loading the data in parallel. And we’re able to do this loading and run queries simultaneously, because of the lock-free semantics within our system. Like I mentioned, we use a different data structure under the covers that allows us to do this in a way that legacy databases cannot.

Core Technologies that Power MemSQL Include MVCC, Skiplists

Pretty much all legacy databases were built on top of a data structure called a b-tree. This was great in the days of slow spinning disks, when we needed to bring data back in chunks. But it came with certain locking semantics that made it very difficult to load data and run queries at the same time. That wasn’t a requirement in the early days of databases.

MemSQL is built with a data structure called a skiplist. Skiplists can be built in a lock-free way, and that, combined with multi-version concurrency control (MVCC), our concurrency control system, allows us to load data and query it simultaneously while it’s being loaded. This means you can stream data in constantly without blocking the system and preventing you from running queries. And this is one of the key underlying mechanisms that allows us to do what we do in a way that no one else can.

Under the covers, we’re a distributed, shared-nothing, massively parallel system. So when you put your data in the system, it’s transparently sharded across a number of nodes. If you need more power, you can just add more nodes to the system, and the system will automatically rebalance the data, transparently to the developer or to the application. We do all this on commodity hardware. We don’t require any special hardware; any machine that meets the minimum number of cores and amount of memory, and runs a modern version of Linux, can be up and running with MemSQL.

We also provide high availability (HA), with transparent failover. We keep two copies of your data on two different machines at any given time, so if one of the machines dies, we transparently fail over and the application doesn’t even really notice. And we’re cloud-native. We sell software that you can deploy on-premises, and you can deploy us on any of the major cloud vendors. As I mentioned, we now have a Kubernetes operator, so you can run us in a Kubernetes cluster. So wherever you need to deploy, we can make that work for you.

So at this point, that’s the overview of MemSQL. I hope you learned something interesting. I’m happy to take questions.

Q&A

Q. What are some of the other use cases MemSQL is solving in financial services?

Great question. I think I may have touched on this earlier, but portfolio analytics is definitely one of the primary ones. We also do things like trade analytics. We have a large investment bank that wants to track each step in a trade process and identify any bottlenecks, so if there are any bottlenecks, they can immediately respond and go fix them, to make sure the trades are flowing as quickly as possible. It’s something that they modeled within MemSQL, and they actually tried a number of different technologies; MemSQL was the only one that was able to meet their requirements for the number of updates per second they had to do. We’ve had a couple of other customers who’ve also implemented trade analytics that way.

Risk management is another use case, where you want to keep track of how much risk you’ve taken on during the day. In the past, larger banks – especially investment banks – could often only tell how much risk they had taken on by doing a report at the end of the day. They would try to leave quite a bit of buffer in order to make sure they didn’t take on too much risk and then violate the FTC rules. By having that information available in real time, they can be much more specific and precise about how much risk they’re taking on, and not have to leave opportunities on the table out of fear of possibly going over a line that they might not actually be going over.

And then one of the other use cases I mentioned was fraud. We actually have a fair amount of fraud use cases where people are tracking down either credit card fraud or other forms of fraud. But yeah, we have more use cases popping up all the time.

Q. Can we get some help if we want to do a proof of concept with MemSQL?

Absolutely. We have a number of highly trained sales engineers. We’ll be happy to help you get up and running and give you sort of the best practices. And help make sure that your system is optimally tuned to achieve the performance and scale you require.

Q. What’s coming in the future for MemSQL?

Great question. We have a lot of good stuff, but there’s a lot more to build. So we’re making additional investments on the transactional side, particularly around recovery from disaster. Today we support backups, and the backups are now online operations, so you can run them without blocking your system. Those are full backups, so we’ve had a number of requests for incremental backups, which allow you to run backups more often and reduce your RPO. That’s a feature that’s going to be coming soon.

The same goes for things like point-in-time restore, so you can restore back to a particular point in time – again, to reduce that RPO to as little as possible.

We’re making efforts around simplifying the choices people have to make. So I mentioned we have a rowstore and a columnstore. And so having both of those in one system, with SQL queries able to span the two types of tables, is a huge innovation, and hugely valuable to customers. But it then presents a decision point when you’re designing your app around, “Hey, how do I decide whether I should put this data in rows or in columns? Make this table rowstore or columnstore?” And so we’re working on merging the rowstore and columnstore technologies together in something we call SingleStore, to allow you to just have one table type, say CREATE TABLE, and you don’t have to think about how the data is stored underneath. MemSQL takes care of that and organizes it for you.

And then the third pillar of investment is around the managed service, which I mentioned is currently in private preview. We’ll be taking that into public preview soon. And we’re making investments to automate the management of the system, to make it easy for you to just spin up a cluster and use MemSQL, without having to worry about the physical management and troubleshooting.

We invite you to learn more about MemSQL at www.memsql.com, or give us a try for free at memsql.com/download.

Case Study: Fraud Detection “On the Swipe” For a Major US Bank


Feed: MemSQL Blog.
Author: Floyd Smith.

This case study was originally presented as part of a webinar session by Mike Boyarski, Sr. Director of Product Marketing at MemSQL. It’s been updated to include additional information and references. In the webinar, which you can view here, and access the slides here, Mike describes the challenges facing financial services institutions which have decades’ worth of accumulated technology solutions – and which need to evolve their infrastructure immediately to meet today’s needs. In this case study, which was also described in the webinar, Mike shows how a major US bank created a new streaming data architecture with MemSQL at its core. Using MemSQL enabled them to move from overnight, batch fraud detection to fraud detection “on the swipe,” applying machine learning models in real time. He presents a reference architecture that can be used for similar use cases, in financial services and beyond.

This case study presents a reference architecture that can be used by leading retail banks and credit card issuers to fight fraud in real time, or adapted for many other real-time analytics use cases as well. In addition, it describes how MemSQL gives fraud detection services an edge by delivering a high-performing data platform that enables faster ingest, real-time scoring, and rapid response to a broader set of events. A similar architecture is being used by other MemSQL customers in financial services, as described in our Areeba case study.

This case study was originally presented as part of our webinar series, How Data Innovation is Transforming Banking (click the link to access the entire series of webinars and slides). This series includes several webinars, described in these three blog posts:

Also included are these two case studies:

You can also read about MemSQL’s work in financial services – including use cases and reference architectures that are applicable across industries – in MemSQL’s Financial Services Solutions Guide. If you’d like to request a printed and bound copy, contact MemSQL.

Real-Time Fraud Case Study

This application is a credit card solution. The MemSQL customer was looking to deliver a high-performance, agile fraud detection platform using standard SQL, with challenging performance requirements. And so I’ll talk about what that means around agility and some of the sort of performance demands they have.

The customer has a time budget of one second from the time the card is swiped to the approval or refusal. There’s a very sort of sophisticated set of queries that need to be run in a very short window of time. They have about a 50 millisecond budget to work with to run a number of queries. In this application they are looking at about a 70-value feature record. And so we’ll spend a little bit of time on how that looks.

MemSQL is crucial to real-time fraud detection at this major US bank

Processing starts with a request, which is a transaction at a terminal or a point of sale system. The request hits, and that event is collected into the bank’s online transaction processing (OLTP) application. That’s a transactional operational database that is collecting that information. And that request is then converted by that OLTP app into a number of disparate queries.

There are various models they have to use to identify this event and match it against a number of other activities that may have occurred over time. And in this application, again, it’s roughly 70 queries that are being run. So it’s running queries like trying to identify engagement between this customer and that vendor over the past days, months, and years, trying to identify a trend around that customer and that merchant. They’re also looking at other activity, like geolocation information about prior transactions and the location where the current event is taking place.

Without listing every single query: there is a distinct set of queries that are run, all in parallel, against a reference store, which is on MemSQL. MemSQL is the real-time operational data store. So this is all about delivering a check against a fairly sophisticated model to produce a score. The data is analyzed in MemSQL; we score against the 70-odd queries and provide, essentially, a yes or no against that most recent transaction. And that scoring service can occur within the timeframe that they required, which is around a 50 millisecond window.
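To give a flavor of what one of those feature-extraction queries might look like – the table and column names below are illustrative, not the bank’s actual schema – a single feature could be as simple as counting and averaging recent activity between this card and this merchant:

SELECT COUNT(*)    AS txn_count_90d,
       AVG(amount) AS avg_amount_90d
FROM card_transactions
WHERE card_id = ?                    -- the card being swiped right now
  AND merchant_id = ?                -- the merchant at the point of sale
  AND txn_time >= NOW() - INTERVAL 90 DAY;

Dozens of such queries, run in parallel against the reference store, feed the roughly 70-value feature record that the scoring model consumes.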

And what, ultimately, this particular customer was looking to do was get to more agility, so they can add even more feature extraction queries over time and continue to optimize their model using continuous insights from their data. Think of this as a continuously moving model that needs to adjust to the insights they’re gaining on the different and new fraud techniques being used against some of their customers.

And so when you compare and contrast the previous solution that they were using before MemSQL, they were taking advantage of nightly batch jobs and accumulating these feature records for each customer, doing analysis overnight and trying to identify if the score in fact was correct or not. As any of you know who’s ever lost their credit card or debit card details, a fraudster can accumulate multiple charges in a day, if the check doesn’t occur until nighttime. So fraud events were getting through the system, resulting in lost revenue.

And also, the other challenge they had was that they couldn’t easily change their fraud model. Iterating on features and adding new queries was very slow in their update process. When they could identify a profile or fingerprint of fraudulent activity, it took them, in some cases, many weeks to make an update to the system in order to catch that source or type of fraud on a go-forward basis.

MemSQL’s Advantages for Real-Time Fraud Detection

As a result, they had a number of reasons for moving to MemSQL. One was that they wanted to run a fairly sophisticated set of queries concurrently, all within this 50 millisecond window. That was not possible for them with the previous system, which was inconsistent, so they weren’t getting reliable performance.

MemSQL advantages for real-time fraud detection include speed, scalability, concurrency, and SQL support

What they got from MemSQL, and what they are really excited about, is the ability to add new features to their model scores using standard SQL, and do that on a more iterative basis. This gives them the flexibility to make further updates to their platform without having to re-engineer the system or wait for the lengthy change management process that was part of their prior system. A lot of the issues with the previous system had to do with the fact that they were using some technologies that were non-standard, meaning non-SQL or non-relational.

And so, ultimately, what they were able to model out was that this continuous improvement and real-time refinement was going to save them literally tens to potentially hundreds of millions of dollars in losses from fraud events. So for them, it’s all about getting more agile with their fraud detection platform using standard SQL, getting great performance so that their customers don’t notice any disruption in their experience and service, and then of course saving money and making money from that more advanced service.

MemSQL Overview

MemSQL is used by a lot of banks, because a lot of banks like the performance of MemSQL. They like the familiar relational SQL. And we typically beat out the competition on a price/performance basis.

This diagram sums up MemSQL’s features. We jokingly call it a “markitecture” diagram, because it sums up our selling points in a form that relates to a lot of the reference architectures we derive from customer implementations of MemSQL.

MemSQL sits at the core of modern data processing implementations.

Our claim to fame is around delivering speed, scale, and SQL, all in one package. So of course most databases will say they’re fast, but I would argue that MemSQL is probably the world’s fastest database, because of our ability to really optimize the entire pipeline of data to the platform. So that includes ingestion. We have a lock-free architecture, so that means we can handle streaming events from Kafka or custom application logic and/or change data capture (CDC) logic.

So just think of our system as being able to efficiently take data into the platform and then run queries on that data as fast as any data warehouse product in the market. That includes the great legacy platforms like Teradata and Vertica and others, and also some of the newer cloud-based data warehouses. We are fast, we have a strong columnstore engine that’s disk-based, along with a strong rowstore engine that runs in memory. I’ll talk a little bit about our underpinnings and our architecture in a moment, but it’s all about speed, getting data in, and then once the data has landed, running those queries as quickly as possible.

I mentioned earlier about MemSQL’s scale, which is powered by our distributed scale-out architecture. It’s a node-based, shared-nothing architecture. That’s what makes MemSQL really, really fast. And we believe strongly in relational SQL, because we think that’s the easiest way to get data out of your system. It works really well with existing tools. But also, more importantly, it works with the skill set that already exists inside most organizations.

In terms of our architecture and ecosystem support, as you can see on the left, we can ingest data from a variety of sources: Kafka, Hadoop/HDFS, AWS S3, Spark, and custom file systems. So our ingestion technology is top notch, and that’s another reason why a lot of customers, mostly banks, really like our platform. It works very well for those types of high-ingest environments.

And then once data lands in MemSQL, we have two different storage engines. You have an in-memory rowstore that’s really fast for point lookups and transactions, and is fully ACID compliant. And of course we also have a columnstore engine that looks and feels like a traditional data warehouse: great compression, and it can run fast aggregate queries. We did a performance test scanning a trillion rows in a second. So it’s very, very fast for data warehousing type jobs as well.
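
To make the two engines concrete, here is a minimal sketch of creating one table of each kind; the table and column names are hypothetical, not taken from the customer example above:

    -- In-memory rowstore (the default table type): fast point lookups and transactions
    CREATE TABLE recent_transactions (
        txn_id BIGINT PRIMARY KEY,
        account_id BIGINT,
        amount DECIMAL(18,2),
        details JSON,
        created_at DATETIME
    );

    -- Disk-based columnstore: compressed, built for fast aggregate scans
    CREATE TABLE transactions_history (
        txn_id BIGINT,
        account_id BIGINT,
        amount DECIMAL(18,2),
        created_at DATETIME,
        KEY (created_at) USING CLUSTERED COLUMNSTORE
    );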

And MemSQL has all the flexible data types that you would expect of a modern database, whether it’s JSON, whether it’s relational structure, key-value, time series, etc. Our deployment flexibility is mostly based on our Kubernetes and container support. You can run us on anything, whether that’s in the cloud or on your own, on-premises infrastructure. Of course, if you want to run it on bare metal or any other Linux environment, you can do that as well.

Lastly, getting into a little bit more depth on our architecture, we are fully ACID compliant. That means that transactions are committed, guaranteed, and logged to disk. We treat every event as its own entity. MemSQL is fully ANSI SQL-compliant, and we have all of the flexible data type support that you would expect of a modern database.
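
As a quick illustration, a multi-statement transaction in MemSQL looks like standard SQL. This is a minimal sketch with a hypothetical table and values:

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 1;
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 2;
    COMMIT;  -- both updates commit atomically and are logged to disk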

MemSQL’s ingestion is, again, world-class. We can do parallel stream ingest, and we can ingest directly to rowstore, or columnstore, or both, depending on your application needs. And that’s ultimately why customers really like the platform: it gives you the flexibility to determine the best outcome, the best process flow, to get the SLA result that you need.

We are a shared-nothing, massively parallel, highly concurrent database, which means that if you’ve got lots of users accessing your system or lots of data ingestion points coming into the system, concurrency for our platform, unlike most other platforms, is really not a challenge.

That wraps up the story of how MemSQL is being applied to real-time fraud detection. While there’s a lot more depth we could go into around the types of functions and queries that a fraud detection application requires, our goal today was to give you a starting point for understanding how we fit this real-time fraud use case.

We invite you to learn more about MemSQL at MemSQL.com, or give us a try for free at MemSQL.com/download.

Why Do Banks Need Real-Time Transaction Processing?

Feed: MemSQL Blog.
Author: Floyd Smith.

A new report from RT Insights describes the benefits of real-time transaction processing in banking and financial services and shows how traditional database architectures interfere with real-time data movement. In order to get the benefits of real-time transaction processing, such as improved portfolio management, fast credit card fraud and acceptance checks, and others, banks and other financial services institutions need to use a translytical database, combining the best of transaction and analytical data processing capabilities in a single, fast, scalable system.

A Real-Time Database for Banking

What is a real-time database? And why would you need one for banking and financial services companies?

A real-time database is a database that can support real-time processing. According to Wikipedia (as of the publication date), “Real-time processing means that a transaction is processed fast enough for the result to come back and be acted on right away.” That certainly sounds like something banks could use – when you go to the ATM, or use a credit card, or apply for a home loan, you certainly want the systems you’re using to return the right answers, right away. (These functions are also good examples of the use of machine learning in financial services, another MemSQL specialty.)

Indeed, accounting and banking are two of the areas where real-time databases are said to be most useful. The RTi report cites many important applications for “faster and more intelligent decision-making”: fraud monitoring; dynamic portfolio analysis; regulatory compliance; and protection from cyberthreats.

Real-time banking needs a real-time database, says RTi

What Kind of Database Can Be Real-Time?

Traditional data processing depends on databases that seem designed not to enable real-time data, but to keep data from being real-time. These databases are not scalable, so they’re limited to the capabilities of a single machine. In order to make the most of what one machine can do, transactions are handled on a specific database type, called online transaction processing (OLTP).

Then, the OLTP system is tied up for a while so a specialized process, extract, transform, and load (ETL), can copy data off it. The data is then remixed with other in-house and outside data, reformatted for faster analytics performance, and moved to a different kind of database for analytics, broadly called online analytics processing (OLAP).

https://www.predictiveanalyticstoday.com/top-free-extract-transform-load-etl-software/

ETL has its own ecosystem of products, reports, etc.

Different kinds of analytics databases exist; data warehouses have specialized tools for slicing and dicing data, while operational analytics databases are better suited for supporting applications, such as a mapping or ride hailing app on your phone. Even NoSQL gets in the act, with data lakes used for data science queries and even for business intelligence (BI) tools, though the fit there is not very strong.

The movement of data from ingest, to OLTP, through ETL, to OLAP can take many hours and even days – far from real-time. So the RTI report puts forward translytical databases, which combine transactional and analytical capabilities, as the right place to look for an answer.

A translytical database is both fast and scalable – when a workload is too much for a single server to support, a second server can be added, extending the processing power, RAM, and disk space available for the stored data. By combining both functions into a single database, eliminating the intermediate steps inherent in the OLTP/ETL/OLAP split, the translytical database can serve as a real-time database, supporting crucial applications in banking and financial services.

Additional Benefits of a Real-Time Database for Financial Services

Organizations are so used to traditional, siloed data structures that they don’t see some of the hidden costs involved – costs that are removed when slow-moving data becomes real-time data. Here are some of the benefits that banks and other financial services organizations receive when they move to real-time transaction processing:

  • Improved customer experience. When real-time data becomes the expectation, improvements appear in all the different ways that a customer interacts with a financial services institution, from using a credit card to managing an investment portfolio.
  • Better risk management. For regulatory reasons, financial services organizations must measure, and manage, the riskiness of customer portfolios, requiring a move to real-time transaction processing for a large part of what a bank does. It’s then relatively easy for the organization to offer real-time interactivity to customers.
  • Creation of a “data culture.” By making real-time data the default, organizations can always be managing risks, always be in compliance with regulations, and always be offering the best possible services and support to their customers. To fully achieve this, an organization must move to having a “data culture” right across all functions and all levels, spurring even closer involvement with live data.

Get Your Copy of the Report

You can download and read the RTi report at no charge, giving you the chance to learn about real-time transaction processing and its benefits for banking and financial services. Get your copy today!

A Technical Introduction to MemSQL

Feed: MemSQL Blog.
Author: John Sherwood.

John Sherwood, a senior engineer at MemSQL on the query optimizer team, presented at MemSQL’s Engineering Open House in our Seattle offices last month. He gave a technical introduction to the MemSQL database, including its support for in-memory rowstore tables and disk-backed columnstore tables, its SQL support and MySQL wire protocol compatibility, and how aggregator and leaf nodes interact to store data and answer queries simultaneously, scalably, and with low latencies. He also went into detail about code generation for queries and query execution. Following is a lightly edited transcript of John’s talk. – Ed.

This is a brief technical backgrounder on MemSQL, our features, our architecture, and so on. MemSQL: we exist. Very important first point. We have about 50 engineers scattered across our San Francisco and Seattle offices for the most part, but also various offices across the rest of the country and the world.

With any company, and especially a database company, there is the question of why do we specifically exist? There’s absolutely no shortage of database products out there, as probably many of you could attest from your own companies.

Technical Introduction to MemSQL database 1

Scale-out is of course a bare minimum these days, but the primary feature of MemSQL has traditionally been the in-memory rowstore which allows us to circumvent many of the issues that arise with disk-based databases. Along the way, we’ve added columnstore, with several of its own unique features, and of course you’re presented all this functionality through a MySQL wire protocol-compatible interface.

Technical Introduction to MemSQL database 2

The rowstore requires that all the data fit in main memory. By completely avoiding disk IO, we are able to make use of a variety of techniques to speed up execution with minimal latencies. The columnstore is able to leverage encoding techniques that – with code generation and modern hardware – allow for incredibly fast scans.

The general market we find ourselves in is: companies who have large, shifting datasets, who are looking for very fast answers, ideally with minimal changes in latency, as well as those who have large historical data sets, who want very quick, efficient queries.

So, from 20,000 feet as mentioned, we scale out as well as up. At the very highest level, our cluster is made up of two kinds of nodes, leaves and aggregators. Leaves actually store data, while aggregators coordinate the data manipulation language (DML). There’s a single aggregator which we call the master aggregator – actually, in our codebase, we call it the Supreme Leader – which is responsible for coordinating the data definition language (DDL) and is the closest thing we have to a Hadoop-style namenode; it actually runs our cluster.

Technical Introduction to MemSQL database 3

As mentioned, the interface at MemSQL is MySQL compatible with extensions and our basic idiom remains the same: database, tables, rows. The most immediate nuance is that our underlying system will automatically break a logical database into multiple physical partitions, each of which is visible on the actual leaf. While we are provisionally willing to shard data without regard to what the user gives us, we much prefer it if you actually use a shard key which allows us to set up convenient joins, et cetera, for actual exploration of data.
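
For illustration, here is a hedged sketch of what declaring a shard key might look like, using hypothetical tables; putting related tables on the same shard key lets joins on that key run locally within each partition:

    CREATE TABLE customers (
        customer_id BIGINT,
        name VARCHAR(128),
        SHARD KEY (customer_id)
    );

    CREATE TABLE orders (
        order_id BIGINT,
        customer_id BIGINT,
        total DECIMAL(18,2),
        SHARD KEY (customer_id)
    );

    -- This join matches the shard key, so each partition joins its own slice of the data
    SELECT c.customer_id, SUM(o.total)
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id;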

The aggregator then is responsible for formulating query plans, bridging out across leaves as necessary to service the DML. Of particular note is that the engine that we use is able to have leaves perform computations with the same full amount of functionality that the aggregator itself can perform, which allows us to perform many worthwhile optimizations across the cluster.

Technical Introduction to MemSQL database 4

A quick, more visual example will better show what I’m talking about. Here we have an example cluster. We have a single master aggregator and three leaf nodes. A user has given us the very imaginatively named database “db”, which we’re supposed to create. Immediately, the aggregator’s job is to stripe this into multiple sub-databases, here shown as db_0 through db_2. In practice, we find that a database partition per physical core on the host works best, as it allows parallelization and so on, but drawing out 48 of these boxes per host would probably be a little bit much.
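
As a sketch of what this looks like from the SQL side (the partition count here is hypothetical; check the exact statements against the MemSQL documentation), you can request a specific number of partitions when creating a database and then inspect how they are laid out:

    CREATE DATABASE db PARTITIONS 48;
    SHOW PARTITIONS ON db;  -- lists db_0, db_1, ... and which leaf masters each one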

Technical Introduction to MemSQL database 5

So beyond just creating the database, as mentioned, our job as a database is to persist data. And since running on a single host does not get you very far in the modern world, we have replication. We do this by database partition, replicating data from each leaf to a chosen slave.
So as you can see here, we’ve created a cluster such that there is no single point of failure. If a node goes down, such as this leaf mastering db_2, the other leaf, which currently masters db_0, will be promoted, step up, and start serving that data.

Technical Introduction to MemSQL database 6

I’d also note that while I’m hand-waving a lot of things here, all of this takes place under a fairly heavyweight two-phase commit protocol, such that we do handle failures properly; but for hopefully obvious reasons, I’m not going to go into that here.

So in a very basic example, let’s say a user is actually querying this cluster. As mentioned, they talk to the master aggregator, which presents the logical database, db, which they treat like any other database. The master aggregator in this case is going to have to fan out across all the leaves, query them individually, and merge the results.

One thing that I will note here, is that I mentioned that we can actually perform computations on the leaves, in a way that allows us not to do so on the master. Here we have an order-by clause, which we actually push down to each leaf. Perhaps there was actually an index on A that we take advantage of.

Technical Introduction to MemSQL database 7

Here the master aggregator will simply combine, merge, stream the results back. We can easily imagine that even for this trivial example, if each leaf is using its full storage for this table, the master aggregator (on homogenous hardware at least) will not be able to do a full quick sort, whatever you want to use, and actually sort all the data without spooling. And so even this trivial example shows how our distributed architecture allows faster speeds.

Before I move on, here’s an example of inserts. Here, as with point lookups and so on in the DML, we’re able to determine the exact leaf that owns this row.

Technical Introduction to MemSQL database 8

So here we talk to a single leaf, transparently, without the master aggregator necessarily knowing about it. That leaf replicates the write down to db_1’s slave on the other host, giving us durability, replication, all that good stuff.

Again, as a database, we are actually persisting all the data that has been entrusted to us. We draw a distinction between durability, the actual persistence on a single host, and replication across multiple hosts.

Like many databases, the strategy that we use for this is a streaming write-ahead-log which allows us to rephrase the problem from, “How do I stream transactions across the cluster?” to simply, “How do I actually replicate pages in an ordered log across multiple hosts?” As mentioned, this works at the database level, which means that there’s no actual concept of a schema, of the actual transactions themselves, or the row data. All that happens is that this storage layer is responsible for replicating these pages, the contents of which it is entirely agnostic to.

Technical Introduction to MemSQL database 9

The other large feature of MemSQL is its code generation. Essentially, the classic way for a database to work is to inject what we would call, in the C++ world, virtual functions. The idea is that in the common case, you might have an operator comparing a field of a row to a constant value.

Technical Introduction to MemSQL database 10

In a normal database you might inject an operator class that holds a constant value, do a virtual function lookup to actually check it, and go on with our lives. The nuance here is that this is suboptimal in a couple of ways. The first is that if we’re using a function pointer, a function call, we’re not inlining. The second is simply that in making a virtual function call, we’re having to dynamically look it up. Code generation, on the other hand, allows us to make those decisions beforehand, well before anything actually executes. This allows us both to make these basic optimizations where we could say, “this common case any engine would have – just optimize around it,” and to do very complex things outside of queries in a kind of precognitive way.

An impressive thing for most people when they look through our code base is just the amount of metadata we collect. We have huge amounts of data on various columns, on the tables and databases, and everything else. And at runtime, if we were to attempt to read this, look at it, and make decisions on it, we would be hopelessly slow. But instead, by using code generation, we’re able to make all the decisions up front, efficiently generate code, and go on with our lives without incurring runtime costs. A huge lever for us is the fact that we use an LLVM toolchain under the hood, such that by generating IR – intermediate representation – for LLVM, we can take advantage of the entire toolchain they’ve built up. In fact, it’s the same toolchain that we all love – or we would love if we actually used it here for our main code base – to use in our day-to-day lives. We get all those advantages: function inlining, loop unrolling, vectorization, and so on.

And so between those two features we have the ability to build a next generation, amazing, streaming database.


MemSQL as a Data Backbone for Machine Learning and AI

Feed: MemSQL Blog.
Author: Nikita Shamgunov.

MemSQL co-founder and co-CEO Nikita Shamgunov gave the keynote address and a session talk at the AI Data Science Summit in Jerusalem earlier this summer. The session talk, presented here, describes the use of MemSQL as a data backbone for machine learning and AI. In this talk, he gives a demonstration of the power of MemSQL for AI, describes how MemSQL relates to a standard ML and AI development workflow, and answers audience questions. This blog post is adapted from the video for Nikita’s talk, titled Powering Real-Time AI Applications with MemSQL. – Ed.

Demonstrating the Speed of MemSQL for AI Applications

MemSQL allows you to deal with large datasets. And here, on my laptop, running Linux, I am going to show you a table here using this wonderful tool which we call MemSQL Studio. I’m going to show you a table that has 1.7 billion records. So I’m storing 1.7 billion records on my laptop. (Note 1761516864 highlighted in MemSQL Studio – Ed.)

MemSQL scans 1.7B records in 25 milliseconds.

How many of you know what SQL is, and how to use SQL? (Most of the audience raises their hands. – Ed.) That’s very good. That’s very convenient, and I always ask this question because usually I get a very positive response. So people in general know SQL, and in a data science world, people often start with SQL to pull data into their own tools or frameworks because SQL is so convenient to slice and dice data before you move it somewhere else.

So again, to level-set on what MemSQL is before we go into the actual application: the system is a relational database. It stores data in an incredibly compressed format. So here I’m storing 1.7 billion records on my laptop, and the whole thing takes 11 gigabytes.

The kind of data here is somewhat typical: data I collected from the telemetry of my own laptop. I used a tool called Sysdig, and it starts capturing all that information, the telemetry that is happening on my own laptop.

Sysdig generates tons of data from your laptop to use for ML experimentation with MemSQL.

All the system calls, temperature, all the various telemetry and data points and events that are happening on my laptop. So obviously, you can run machine learning models to start predicting what’s going to happen next and there are startups in San Francisco that are doing exactly that for monitoring very large data centers, for monitoring Kubernetes containers and whatnot.

So I captured that data and I captured that data in MemSQL, loaded it. And the trick is that you can load that data in real time, so you can connect to anything that emits data. You can connect to stream processing systems like Kafka, and the data will just start flowing into MemSQL. And you can assume that MemSQL is completely limitless, right? So you can land as much data in MemSQL as you want.

But the data is not just stored in some sort of cold storage. If you want to get an individual data point out of the data, you can retrieve that data in an incredibly fast way. So those who work with S3, Hadoop… If you want to take one individual record stored on HDFS in a Parquet file and you want to get it back, some sort of big gears need to shift before you get this record back.

Well, in this case, let’s just try to fetch a record here, out of MemSQL, out of my laptop. I can put limit one to get one record and stuff comes back in two milliseconds.
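
The query itself is nothing exotic. A sketch against a hypothetical telemetry table might look like this:

    -- fetch any one record
    SELECT * FROM laptop_events LIMIT 1;

    -- or a true point lookup on a hypothetical key
    SELECT * FROM laptop_events WHERE event_id = 123456;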

MemSQL finds a specific record in 6ms.

And what that means is that once you deploy those models into production, it’s usually the combination of intelligence delivered through a model and some source that serves data to the application. You put them together and you can achieve incredibly low latencies, ultimately delivered to what people have recently been calling smart apps.

So that’s what MemSQL is. Typically, in data science there are multiple parts of the workflow, as you put systems into production, when you need to extract individual data points out of the system, you can do it with incredibly low latency. People do it for fraud detection all the time. You swipe a credit card, you need to understand if it’s a fraudulent transaction, or you’re walking into a building, you’re doing face recognition. That’s the thing … it has to be a real time system.

Or you want to extract data from the system and slice and dice it. And I’ll give you an example of that as well. In this query I’m going to scan the whole 1.7 billion records and understand some basic information. So group, group, group those 1.7 billion records by this field that’s called GID. And that is actually happening also very, very quickly, in 1.2 seconds. In 1.2 seconds we processed a billion records here.
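
The aggregate query described here would look roughly like the following; the table name is hypothetical, and gid is the field mentioned above:

    SELECT gid, COUNT(*) AS events
    FROM laptop_events
    GROUP BY gid;  -- full scan of ~1.7 billion rows, grouped by gid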

MemSQL processes more than 1B records in 1.2 seconds.

So now, why this matters is because the system is incredibly convenient as the backbone where you want to land data, where you want to do feature engineering, where you want to expose that data set to your team. It’s very lightweight. It can run on your laptop and it’s free to run on your laptop. And then you can open up your laptop and put it on the server, put it in the cloud, scale it, and then go from analytics and data science to pixels that are displayed on people’s apps on the website in an incredibly efficient way.

Q. What is MemSQL focused on here?

A. MemSQL is focused on both fetching data and updating data. So here let’s go and update something. There’s an update statement. I can run it and it’s instant. And the idea is you can build apps. It’s a transactional database. And usually when you have apps you want to insert, update, delete, aggregate, produce reports, deliver analytics back to the application. And we give you basically infinite compute.
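
A sketch of the kind of update being run here, with a hypothetical column and value:

    UPDATE laptop_events
    SET reviewed = 1
    WHERE event_id = 123456;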

We have this mechanism of locking your record. And in fact, we only lock on writes, so you will only have conflict on writes, and reads never take locks. So that makes it incredibly convenient to build apps because what people hate is when apps start to spin. There’s a spinning, like a wheel, on the app. Very often, when we win business from large companies, the definition of success is no spin.

Q. What’s the drawback?

A. What’s the drawback? Well, humbly, there are no drawbacks. I think the only drawback is that it is in fact a distributed system, so you need to run a distributed system even when the workload is small enough to fit on a single small machine, where a distributed system is no better. So you’re kind of wasting that ability to scale. (The advantages of MemSQL are not evident at small scale – Ed.)

Using MemSQL for Machine Learning and AI Workflow

Q. What about the workflow for AI and ML? (This refers to the ML and AI workflow Nikita showed during his keynote address on data infrastructure for ML and AI at the same conference. – Ed.)

A. So the question is: what’s the workflow, and when do you build a model? Typically, this is a very high-level definition of the workflow. You define machine learning use cases, you do some data exploration. You don’t need MemSQL to define ML use cases, but when you want to do data exploration, typically, people want to play with the data.

MemSQL serves as crucial data support for ML/AI development and deployment.

And once you push data into MemSQL, it’s very, very efficient. You can explore data yourself by running SQL statements directly against the database. You can attach a business intelligence (BI) tool. You can visualize it, you can attach a notebook. Now you’ll be pulling data into Python or Spark, and MemSQL integrates and works with Spark incredibly well. So you can play with the data.

Q. Where can I get the connector?

A. The connector is on GitHub, so you download the MemSQL connector, and what we do is give you very fast data exchange between MemSQL and Spark data frames. And that’s what people use all the time, where MemSQL gives you basically storage and compute with SQL, and Spark gives you training, ad hoc analysis, ETL, and feature engineering.

And what people do is they create a data frame and they call “save to MemSQL” on a data frame and the data drops in there and it’s reliable, transactional, all those good things.

Data pipeline. (Refers to step 4., Data pipeline and feature engineering, in the workflow diagram. – Ed.) So everywhere I go into modern organizations, there are data pipelines. You take data, extract it from multiple systems, and oftentimes people use Spark. Sophisticated customers and some of them are here in Israel, use change data capture (CDC) tools when they pull data from relational databases, MySQL, Oracle, massage the data and push it into MemSQL.

So we spent years and put a lot of work into making that process very, very simple. And one of the most common patterns with data pipelines … usually there’s Kafka somewhere and once data is in Kafka, you can put in data into MemSQL at the click of a button. You say, “Create pipeline,” point at Kafka, data’s flowing in.

Build ML model. (Refers to step 5., Build ML model, in the workflow diagram. – Ed.) So this is what you’re going to do using Python packages, and then people iterate, right? Obviously, there’s a lot going on when you develop a model and you think about a million things, but once you’re kind of out and proud with a model, you want to present results and you want to deploy that model. And typically, you deploy a model and then you run in some sort of parallel environment to make sure that you don’t screw up.

And really depending on the use case, right? In some cases, the bar for quality is low and we have some customers that perform fraud detection on electricity IoT data such that, you know, “This household spent more on electricity than last month.”

Okay, we want to look at that. Well, that’s not very sophisticated. And then anything you do there will improve their quality of fraud detection dramatically. And then we have customers that do financial analysis, risk analysis, understanding risk of an individual loan. That’s where the bar is very, very high because you improve the model by a little bit and then you’re saving millions of dollars.

Then you plan for deployment and then you operationalize the model. (Refers to step 8., Plan for deployment, and step 9., Operationalize model, in the workflow diagram. – Ed.) And so that’s where some people deconstruct that model, and let’s say they do image recognition … And I showed you that video in the keynote.

Maybe I’ll show that again here. So in this video, you can point a smartphone at anything and it will go and find that that’s an item in the catalog. So it does it in a very, very fluid way and it allows people to compare prices, efficiently shop. Or you’re talking to a friend and you want to buy the same thing that the friend has and whatnot.

Using MemSQL to recognize anything with ML and AI.

So in this case, they took a model and they deconstructed that model and they expressed that model in feature vectors and used basic operations that we offer in MemSQL, such as vector dot products, Euclidean distance, to run a very simple query. These are the primitive operations that we offer in MemSQL, so if you store feature vectors in a table, now, using a scale-out kind of limitless compute, you can run a query that will scan all the records in that table, compute dot product against the feature vector which you got from an image with all the feature vectors already there.

MemSQL has some very fast functions that are highly useful for ML and AI.

Well, I’m not going to explain to you what a dot product is, but basically, running that query where the dot product against the stored feature vectors is greater than 0.9, that’s your similarity search.
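
Under the stated assumptions (feature vectors stored in a packed binary column, a hypothetical catalog_items table, and a query vector that would normally come from the image model rather than a literal), the similarity search might be sketched as:

    SELECT item_id,
           DOT_PRODUCT(features, JSON_ARRAY_PACK('[0.11, 0.27, 0.83]')) AS score
    FROM catalog_items
    WHERE DOT_PRODUCT(features, JSON_ARRAY_PACK('[0.11, 0.27, 0.83]')) > 0.9
    ORDER BY score DESC
    LIMIT 10;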

MemSQL compares feature vectors very, very quickly

MemSQL’s Future for Machine Learning and AI

Now, the advantage of doing this in a database is that the actual model … in this case, incredibly primitive … but co-locating the model and the data opens up all the possibilities that you can do. And now, what we’re working on now is the ability to push TensorFlow and Caffe and PyTorch models directly into MemSQL.

And with that, you’re able to run those models in production, inside the database, right next to the data and deliver great user experiences by building smart apps.

Final Q&A and Conclusion

Q. How do transactions work for Spark?

A. Very good question. So the way transactions work in MemSQL, as a transactional system, everything that’s between begin transaction and commit is atomic. So with Spark, it’s no different. If you take a data frame and you save to MemSQL, this data frame drops into MemSQL in a transactional way. So until every record makes it from the data frame into MemSQL, nobody else sees that data. There are transaction boundaries around “Save to MemSQL.” In the case of Kafka, we go micro-batch to micro-batch and the transaction boundaries are around the micro-batch.

And in fact, if there is a failure mid-batch, we will never persist half of a micro-batch. Each micro-batch is going to be delivered to MemSQL as a whole, which gives you exactly-once semantics from Kafka.

Q. Do I use MemSQL for training my model?

A. Yeah. Training is done outside of MemSQL, so we do not support training, but we can be the data backbone for training. If training needs to pull data with queries periodically, MemSQL is a perfect low-latency solution for that. All right. Thank you.

You can download and run MemSQL for free or contact MemSQL – Ed.

Community Stars Light Up MemSQL Forums

Feed: MemSQL Blog.
Author: Floyd Smith.

The MemSQL Forums are seeing more and more community contributions – not just timely and incisive questions, but answers from the community, and valuable content contributions as well. We have two new community stars for the summer, and some valuable Q&As flying back and forth.

What Makes an Online Community Work?

Participation in any online community is mostly optional. People always have a lot of ways they can spend their time, so for an online community to take off, it has to offer a lot to people.

So it’s notable that the MemSQL Forums are seeing more and more answered questions, and solid contributions, from customers and users – while MemSQL employees continue to help out as well.

The Forums are also important to a key MemSQL initiative, offering free use of MemSQL for small and, in our humble opinion, even medium-sized deployments. Only Enterprise users have direct access to MemSQL’s highly regarded, responsive support team, so the Forums are a crucial source of help.

Summer Community Stars

The first Community Star, named in June, and featured in a previous blog post, was Ziv Meidav. Now we have a second, and a third Community Star.

The July Community Star, Brandon Vincent, has all the tools needed to play the game, as they say in baseball. Not only does he dig in to help other users with complex technical questions; he recently posted an important piece of documentation, the excellent Columnstore Key Guidelines. Covering both shard keys and columnstore table keys, the Guidelines have a lot of important answers – and a couple of in-depth questions as well.

The August Community Star, Mani Gandhi, is part of the broader MemSQL community, out beyond the Community Forums. Mani has amassed more than 14,000 karma on Hacker News, a good chunk of that while making insightful comments about MemSQL. (Do go read his posts, but don’t upvote him because of this reference, as Hacker News frowns on that.)

Fun on the Forums

There’s a lot of serious stuff on the Forums, of course – discussions of potential errors in a MemSQL Pipeline, a question about GROUP_CONCAT sorting (coming soon, in MemSQL 7.0), and a question and answer about unwanted increases in memory usage.

The MemSQL Forums include a question about MemSQL as a plagiarism detector.

However, one has to wonder if the question and answers about using MemSQL as a plagiarism detector are entirely serious. Probably, but someone should double-check the question and answers to see if they’re all entirely original.

School’s Back In

As school goes back into session, and summer winds down, we expect things to get more business-like again – and to see an uptick of activity on the MemSQL Forums. See you there.

CREATE PIPELINE: Real-Time Streaming and Exactly-Once Semantics with Kafka

Feed: MemSQL Blog.
Author: Floyd Smith.

In this presentation, recorded shortly after MemSQL introduced MemSQL Pipelines, two MemSQL engineers describe MemSQL’s underlying architecture and how it matches up perfectly to Kafka, including in the areas of scalability and exactly-once updates. The discussion includes specific SQL commands used to interface MemSQL to Kafka, unleashing a great deal of processing power from both technologies. In the video, the MemSQL presenters go on to describe how to try this on your own laptop, with free-to-use MemSQL software.

Introduction to MemSQL: Carl Sverre

I want to start with a question. What is MemSQL? It’s really important to understand the underpinnings of what makes Pipeline so great, which is our MemSQL distributed SQL engine.

There are three main areas of MemSQL that I want to talk about really briefly. The first area of MemSQL is that we’re a scalable SQL database. So if you’re familiar with MySQL, Postgres, Oracle, SQL Server, a lot of our really awesome competitors, we are really similar. If you’re used to their syntax, you can get up and running with MemSQL really easily, especially if you use MySQL. We actually have followed their syntax very similarly, and so if you already used MySQL, you can pretty much drop in MemSQL in place, and it just works.


So, familiar syntax, scalable SQL database. What makes us scalable, really briefly? Well, we’re a distributed system. We scale out on commodity hardware, which means you can run us in your favorite cloud provider, you can run us on-premises and it generally just works. It’s super, super fast, as you can see – really, really fast and it’s just fun.

So without further ado, I want to get into a little bit of the technical details behind what makes us such a great database. In MemSQL, we have two primary roles in the cluster. So if you think about a collection of Linux machines, we have some aggregators and some leaves. Aggregators are essentially responsible for the metadata-level concepts in the cluster. So we’re talking about the data definition layer – DDL, for people who are familiar with SQL. We’re responsible for CREATE TABLE statements, and for CREATE PIPELINE statements, which we’re going to get right into.

MemSQL stores data in leaf nodes and distributes operations across them.

In addition, master aggregators and child aggregators collectively handle things like failover, high availability, cluster management, and really importantly, query distribution. So when a SELECT query comes in, or an INSERT command comes in, you want to get some data, you want to insert some data. What we do is we take those queries and we shard those queries down onto leaf nodes. So leaf nodes are never connected to directly by your app. Instead you connect to the aggregators and we shard down those queries to the leaf nodes.

And what are the leaf nodes? Well, leaf nodes provide a couple of really, really powerful capabilities inside the engine. One is storage. If you have leaf nodes, you can store data. If you have more leaf nodes, you can store more data. That’s the general concept here. In addition, leaf nodes handle query execution. So the more leaf nodes you have, generally the faster your database goes. That’s a great property to have in a distributed system. You want to go faster, add more leaf nodes. It generally scales up, and we see some really amazing workloads that are satisfiable by simply increasing the size of the cluster. And that’s exciting.

Finally, you get a sort of natural parallelism. Because we shard these queries across all the leaf nodes, by scaling out you are taking advantage of many, many cores; everything you want performance-wise just works. And that’s a simple system to get behind, and I’m really excited to always talk about it because I’m a performance geek.

So that’s the general idea of MemSQL. This Meetup is going to be really focused on Pipelines, so I just wanted to give you some basic ideas at the MemSQL level.

Introduction to MemSQL Pipelines: John Bowler

Pipelines are MemSQL’s solution to real-time streaming. For those of you who are familiar with what an analytics pipeline looks like, you might have your production database going into a Kafka stream or writing to S3 or writing to a Hadoop data lake. You might have a computation framework like Spark or Storm, and then you might have an analytics database farther downstream, such as RedShift, for example. And you might have some business intelligence (BI) tools that are hitting those analytics databases.

So MemSQL Pipelines are our attempt at taking a step back and solving the core problem, which is: how do you easily and robustly and scalably create this sort of streaming analytics workload, end to end? And we are able to leverage some unique properties of our existing system.

For example, since MemSQL is already an ACID-compliant SQL database, we get things like exactly-once semantics out of the box. Every micro-batch that you’re streaming through your system happens within a transaction. So you’re not going to duplicate micro-batches and you’re not going to drop them.
These pipelines, these streaming workloads, are also automatically distributed using exactly the same underlying machinery that we use to distribute our tables. In our database, they’re just automatically sharded across your entire cluster.

And finally, for those of you who have made these sort of analytics workloads, there’s always going to be some sort of computation step, whether it’s Spark or any similar frameworks. We offer the ability to perform this computation, this transformation, written in whatever language you want, using whatever framework or whatever libraries you want. (This happens within the Pipeline, very fast. Accomplishing the same thing as an ETL step, but within a real-time streaming context. – Ed.) And we’ll explain in more detail how that works.

So this is how you create a pipeline – or, this is one of the ways that you create a pipeline. You’ll notice that it is very similar to a CREATE TABLE statement and you can also alter a pipeline and drop a pipeline. The fact is that pipelines exist as first-class entities within our database engine.

The MemSQL CREATE PIPELINE statement names a Kafka pipeline as its input source.

And underneath this CREATE PIPELINE line is a LOAD DATA statement that is familiar to anyone who’s used a SQL database, except instead of loading data from a file, you’re loading data from Kafka, specifically from a host name and the tweets topic. And then the destination of this stream is the tweets table. So in these three lines of SQL, you can declaratively describe the source of your stream and the sink of your stream, and everything related to managing it is automatically handled by the MemSQL engine.
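
Reconstructed from that description, the statement looks roughly like the following; the Kafka host name here is a placeholder:

    CREATE PIPELINE tweets_pipeline AS
    LOAD DATA KAFKA 'kafka-host.example.com:9092/tweets'
    INTO TABLE tweets;

    START PIPELINE tweets_pipeline;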

This is sort of a diagram of how it works. Kafka, for those of you who are unfamiliar, is a distributed message queue. When you’re building analytics pipelines, you very commonly have lots of bits of production code that are emitting events or emitting clicks or emitting sensor data, and they all have to get aggregated into some sort of buffer somewhere for whatever part of your analytics system is consuming them. Kafka is one of the most commonly used such queues, and it’s one that we support.

So you have all different parts of your production system emitting events. They arrive in Kafka, maybe a few days worth of buffer. And when you create your pipeline in MemSQL, it automatically streams data in. Now, Kafka is a distributed system. You have data sharded across your entire cluster. MemSQL is also a distributed system. You have data sharded across your entire set of leaf nodes. So when you create this Kafka consumer, it happens in parallel, automatically.

Now, this is sufficient if you have data in Kafka and you just want to load it straight into a table. If you additionally want to run some sort of transform, or MapReduce, or RDD-like operation on it, then you can have a transform. And the transform is just a binary or just a program.

You can write it in whatever language you want. All it does is it reads records from stdin and it writes records to stdout, which means that you could write it as a Python script. You could write it as a Bash script. You could write it as a C program if you want. Amazing performance. You can do any sort of machine learning or data science work. You can even hit an external server if you want to.

Kafka distributed data maps to MemSQL leaf nodes.

So every record that gets received from Kafka passes through this transform and is loaded into the leaf, and all of this happens automatically in parallel. You create the code of this transform and MemSQL takes care of automatically deploying it for you across this entire cluster.
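
As a sketch of how a transform is attached to a pipeline, assuming the WITH TRANSFORM clause and a placeholder URL for the transform script (the exact argument list should be checked against the MemSQL Pipelines documentation):

    CREATE PIPELINE tweets_transformed AS
    LOAD DATA KAFKA 'kafka-host.example.com:9092/tweets'
    WITH TRANSFORM ('http://example.com/transform.py', '', '')
    INTO TABLE tweets;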

Conclusion

The video shows all of the above, plus a demonstration you can try yourself, using MemSQL to process tweet streams. You can download and run MemSQL for free or contact MemSQL.

Looking Back on the First Year of MemSQL Studio

Feed: MemSQL Blog.
Author: David Gomes.

Today is the one-year anniversary of MemSQL Studio and sees the release of MemSQL Studio 1.8.1. To celebrate, we have put together a brief summary of where we’ve taken MemSQL Studio in its first year, a few thoughts on how we will extend the product going forward, and an invitation to you to contribute your feedback.

Just over a year ago, we launched the first release, MemSQL Studio 1.0.1. While the release notes detail everything that’s happened since, we want to reflect on the last year of development and give some additional context to these notes.

How Studio Started

The MemSQL Studio project started in early 2018 with the goal of creating a visual tool that our customers could use to manage and monitor their MemSQL clusters. During the first few months of development, we worked toward implementing a front-end architecture that would allow us to iterate on new features as quickly as possible.

Upon completing our first release, we began tweaking and improving the product as we got feedback from our customers and from the community. One comment, from a Fortune 100 company in financial services, sums up the enthusiasm: “I will keep digging into MemSQL Studio, and bringing up more feature requests, as I see this to be a gold mine.”

Our first public release was on August 6th, 2018. In this first release, MemSQL Studio had only a small handful of features:

  • SQL Editor
  • Resource Usage Profiler
  • Schema Explorer
  • Pipelines Table
  • Nodes Table
MemSQL Studio 1.0.1 launched 1 year ago.
Screenshot of MemSQL Studio 1.0.1.

What We’ve Added to Studio

A month later, we added a new feature called Visual Explain. This had been one of the most-requested features from our customers. At its core, it allowed anybody to understand where a query is spending its time, in order to illustrate how to optimize its running time.

Visual Explain was a breakthrough feature addition to MemSQL Studio.
The Visual Explain feature in Studio.

Over time, we’ve been adding other new features such as:

  • Real-time resource usage monitoring (disk, RAM, CPU, and network)
  • Logical monitoring of the MemSQL topology
  • Real-time monitoring of running queries
With MemSQL Studio, you can monitor resource usage in real time.
Real-time resource usage monitoring.

Over the past year we have significantly expanded the initial set of features. The Schema Explorer now contains much more information about the clusters’ schemas. Moreover, the SQL Editor has been completely reimagined to give users a more IDE-like experience:

  • Multiple result tabs for easy query result comparison
  • Schema tree search
  • Loading and exporting of SQL files, and exporting results to CSV
  • Easy database selection
  • Persistent SQL Editor buffer
  • Find/Replace and other text editor features
  • Performance improvements
  • Resizable panes
  • Visual improvements

Today, we’re shipping MemSQL Studio 1.8.1. The release notes for Studio 1.8.1 are:

  • Adds result tabs to the SQL Editor’s output pane
  • Disables the bottom panel from appearing when queries fail in the SQL Editor
  • Visual refactor to the table page in the Schema Explorer
  • Fixes a bug where tables with a date column would crash the Sample Data tab on the table page
Studio now lets you see active processes, live.
Active Processes in MemSQL Studio 1.8.1.

Where MemSQL Studio’s Going

We are continually receiving feedback from our customers, and Studio updates planned for later in 2019 will include a smoother onboarding experience, performance improvements for large clusters, and several other as-yet-unannounced features.

As usual, all feedback is welcome. The best place to provide us with feedback is on the MemSQL Forums. Make sure to select the “MemSQL Studio” category when creating a new topic.

If you’re interested in doing full stack engineering work on products like MemSQL Studio, make sure to check our currently open positions. We’re currently hiring in San Francisco, California and Lisbon, Portugal.

Fiserv on Machine Learning & Real-Time Analytics at Financial Institutions

Feed: MemSQL Blog.
Author: Manish Pandey.

In this webinar, which you can view here, Manish Pandey tells us about the transformative effect that real-time data, machine learning, and artificial intelligence will have in helping financial services institutions meet increasing demands from consumers. (You can also see the slides.) Pandey, who is Sr. Director of Business Development and Digital Strategy at MemSQL partner Fiserv, highlights the power of MemSQL, and the smarts that Fiserv can bring to helping companies make the most out of it. – Editor

As you know, financial services has always been an exciting world. And to add to it, it’s an interesting time as well, for several different reasons. Data analytics, and the speed at which we access, analyze, and make sense of data, have become such a critical part of our business today.

It’s not that data and speed were not important in the past, but the game is different today because our consumers understand what it means to have a profound experience. In fact, just to share a couple of data points to set the context: in a recent article by The Financial Brand, they talked about the four D’s of consumer attrition, which are dissatisfaction, debt, displacement, and divorce.

Actually, no prizes for guessing, but more than 50% of consumers leave their bank because of dissatisfaction. Dissatisfaction has several different aspects to it, but the key part is attrition: consumers are leaving the bank because they are not happy.

According to a recent survey from TD Bank, the risk of payment fraud is the number one concern for 44% of financial industry professionals this year. That’s a 14% increase in just 12 months.

So, on one hand we have consumers who are demanding a seamless, real-time experience. On the other hand, we have to deal with all these threats, which means friction. How do we balance it? I will touch upon three aspects of our world and discuss why it’s important to have real-time data computing capability, and why you should leverage analytics, machine learning, and AI in combination to drive the business forward and create delighted consumers.

Digital-Native Companies are Setting the Pace

With that, there are many reasons why companies like Netflix and Amazon are so successful, but one common aspect is that they both know their customers very well. Also, they know how to engage with them without having a customer care person talk to them. In fact, I don’t know about you guys, but in my last several years of experience with these companies, I’ve never had to call them for any issues, which is just the kind of profound customer experience that I talked about earlier when we were setting up the topic.

Netflix, Amazon, and others are setting consumer expectations, says Pandey of Fiserv.

How do they do it? They know their customers like no one else does. It’s not rocket science but, as I like to call it, it’s data science. They know how to use the data, not just have the data, but how to use the data in real time and make sense of it. They use the data to monetize as well.

Look at what Google or Uber or Facebook are doing with that. They are pivoting to several related services and offerings and expanding their business because of the way they use the data. Can we say the same thing for financial institutions? I have a different opinion on that.

Financial services, where MemSQL is popular, is undergoing transformation.

These innovations have changed the game in the industry. What’s happening closer to our home, in financial services? I end up talking to four or five financial services companies, banks or insurance companies or credit unions, in a week. There’s tons of engagement across so many different conversations and topics, but one key aspect that comes up in pretty much every conversation is that these financial institutions are either trying to become a digital business, going through a digital transformation, or looking at digital business transformation.

When we talk about digital transformation, it simply means that they’re trying to empower their customers. They’re empowering their customers on all different channels. It doesn’t matter how they are interacting with the banks, whether they are going online using a tablet, mobile, or a simple desktop, calling a customer care rep and talking, or walking into the branch.

It’s about helping your customers do a lot of things themselves and that gives them freedom to express themselves. At the same time, with this transformation, FIs are trying to drive revenue and optimize the revenue either by selling new products and services in that space or driving the wallet share. When you think about the business transformation, there’s a lot happening in financial services.

At the same time, when no one thought that any innovation was possible in insurance, companies like Lemonade came in with very little investment and funding, and they are threatening the bigger insurance giants because of the simple user experience they have created for their consumers, and the way they have reduced processing time using technologies like machine learning and artificial intelligence.

They are bringing innovation from non-banking industries and threatening the banking industry and financial services companies, simply because they have been able to help customers, empower customers, and change business models. And we are now seeing these conversations happening in banking and financial services too, with banks looking to become platform companies, and with talk of banking as a service being offered to non-banking companies.

Consumer Demands are Changing

There are discussions around Apple working with Goldman Sachs and talking about new cards in the market. With that, I want to shift our focus to the consumer side. So far we’ve talked about what’s happening at the organizational level in the financial services industry, but now, if you look closely at the customer side, customers are telling us every day, they are giving us feedback.

Pandey of Fiserv says customers want banks to do more.

The reason we need to understand the feedback they are giving us is because we’ve got to understand what’s shifting in that paradigm. I mean, there’s a lot of noise, but at the same time, there’s a lot of good feedback coming from our interaction with consumers. That’s the reason we need to understand the data, access the data in real time, and deliver the technologies that we talked about, analytics, machine learning, and AI; these have gained much significance now.

Consumers today have all different kinds of experiences in retail, online, and entertainment. They want to do what they want to, and they want to do it now. I will say this one more time: they want to do what they want to do, and they want to do it now. We can’t disagree with that because we are consumers ourselves, and we also see that we are getting different experiences when we interact with different touchpoints.

What you see on the screen is a representation of the same. They no longer see banks the way they used to. They’re demanding experiences consistent with what they see at other touchpoints. Consumers are saying, “Know me. You have all the data.” No one has more consumer data, in my opinion, than banks. We have all kinds of consumer data. They spend most of their time with us, and they have much of their emotional relationship with us. Money is not just money anymore. Money is very much emotional.

So what do we do with that data? What do we do with that information? I’ll give you two real-life experiences. The banks I deal with, if they were careful, would understand that I’m not a phone-banking guy. I’m not a snail mail kind of guy. I’m more of an online guy. I love to chat with my bank. I love to interact with the chatbots and clarify my questions. But I see those banks interacting with me by sending those offers to my mailbox every week. Tons and tons of paper. I have no clue why they do it.

Clearly, it means they don’t understand me. They don’t understand my preferences. Another example: I went to one of the leading banks in this country four or five years ago. I just wanted to explore refinancing options, because I thought I could take advantage of lower interest rates. They were nice enough to run an analysis on my interest rate. They came back and said no, I had a very good deal, and I should not refinance; it wouldn’t be cost effective for me. I listened to that.

Now, from that same bank, I have started getting mailers on refinancing. They are sending those mailers to my mailbox, again, every day. The same bank that advised me not to refinance is now asking me to refinance and sending me offers on refinancing. That’s ridiculous. I have no idea why they would do this to me, or to any consumer. It clearly tells us that we are not doing much with the data, or with the information that’s coming from consumers.

Consumers expect simplicity and ease, and MemSQL and Fiserv can help.

That’s been the biggest difference we are seeing in our industry compared to some others. If we go a little further into consumer expectations: consumers, obviously, expect hassle-free interactions no matter how they do business with an institution. To give you some data points, consumers are becoming increasingly comfortable with different ways to do any transaction, including payments.

Whether it’s the older generation embracing digital channels or more people saying they see a digital wallet as secure, it is clear that consumer perceptions are evolving. Several types of online transactions are showing a modest increase over past years, including banking and financial transactions. Boomers and seniors often show the largest increase; though they still trail the other generations, they are spending a lot of time online.

And they are doing a lot online, whether financial transactions, non-financial transactions, or social media. Those experiences differ, and, obviously, that’s why the conversation becomes, “Why am I not getting a consistent experience?” Although many consumers remain at least somewhat concerned about the security of receiving, paying, or conducting financial transactions, they are embracing new technology and experiences more with every passing year.

It doesn't have to be a bank for transactions, according to Fiserv.

With that, we also see that consumers are increasingly open to using non-banking companies, which is extremely important. When your consumers become more comfortable with non-banking players, especially when it comes to paying bills, taking a loan, managing money, tracking a budget, transferring money, anything financial, it’s not a good sign. It’s not a good sign because we always thought consumers would stay attached to their banks for several different reasons, but now, because of the experiences they are getting from non-banking companies in these areas, they are ready to switch.

Just to give you a data point: in 2017, 40% of consumers stated that they would be comfortable using a technology company to pay bills, compared to 55% in the 2018 survey, which is a big jump. So, today, consumers demand convenience, ease of use, faster services, and an enhanced user experience and interface.

...and people of all ages are paying online.

Not to spend too much time on this one, but we are also seeing that comfort with automation is increasing and online activities are increasing day by day. Not just among the younger generation; baby boomers and seniors are getting very comfortable as well, and that’s the point we are trying to convey.

So, it’s cutting across different generations now. It’s cutting across different customer segments, and within those segments as well. As we go along, we’ll talk about how one bank went beyond its traditional customer segmentation and developed some 15,000 micro customer segments based on several different parameters. We are seeing the trend happening across segments and within segments, and it’s just mind-blowing.

How Financial Services Companies are Responding

Now we are moving to the third dimension. I mentioned at the beginning of the webinar that I would touch upon three dimensions of our world in the context of access to real-time data, analytics, and machine learning. We talked about what’s happening at the broader financial institution level. We talked about consumer preferences. Now we’re going into the risk and fraud area. Risk and fraud have gained a lot of attention. It’s always been a sticky topic, but with the exposure that technology brings, we are also seeing cases of fraud increase year over year.

Real-time data, often enabled by MemSQL, fights fraud.

And millions, even billions, of dollars are lost. It’s not just the money lost; we’re also talking about the impact on your brand and your consumers. That’s the key reason we should be looking at real-time data computing, combining it with analytics and the other technologies we’ve been discussing, and not just detecting fraud but preventing it. That helps us not only cut our losses but also improve customer satisfaction. (See the YouTube video about PayPal and how they accomplish these goals.)

About a third of banking execs are using AI already.

While customers want a seamless experience, they are not going to compromise on the security of their data. So it’s a balancing act we have to play, and in that context we have to make sure we are preventing fraud using real-time analytics and the power of real-time data computing. It’s no surprise that many senior executives in financial services are already using AI technologies in areas such as personalization, in how they deal with consumers and how they think about their relationship with consumers.
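
To make “prevent, don’t just detect” a little more concrete, here is a minimal Python sketch of in-line transaction scoring. The risk signals, thresholds, and field names are illustrative assumptions, not part of any Fiserv or MemSQL product; a real system would use a trained model and far richer features.

```python
# Minimal sketch: score each incoming card transaction before it completes,
# so risky activity can be blocked (prevention) rather than flagged later.
from dataclasses import dataclass

@dataclass
class Transaction:
    account_id: str
    amount: float
    merchant_country: str
    seconds_since_last_txn: float

def fraud_score(txn: Transaction, home_country: str = "US") -> float:
    """Combine a few hand-written risk signals into a 0..1 score.
    A production system would use a trained model instead."""
    score = 0.0
    if txn.amount > 5_000:                    # unusually large purchase
        score += 0.4
    if txn.merchant_country != home_country:  # cross-border activity
        score += 0.3
    if txn.seconds_since_last_txn < 10:       # rapid-fire transactions
        score += 0.3
    return min(score, 1.0)

def handle(txn: Transaction) -> str:
    """Decide in line, while the transaction is still in flight."""
    score = fraud_score(txn)
    if score >= 0.7:
        return "BLOCK"     # stop the transaction and alert the customer
    if score >= 0.4:
        return "STEP_UP"   # ask for additional verification
    return "APPROVE"

if __name__ == "__main__":
    txn = Transaction("acct-42", 7_200.0, "RO", 4.0)
    print(handle(txn))     # -> BLOCK
```

The point of the sketch is the placement of the decision: it happens inside the transaction path, which is only practical when the supporting data can be read and written with very low latency.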

Productivity, fraud, wealth management, advisory services; the list keeps growing. Where are you using this? That’s a question we should be asking ourselves when we look at our strategy for how we want to do business. Why is this so complex? Why does it need strategic focus across various areas rather than a siloed approach? Well, to start with, there are several challenges with legacy data: it’s too big, too disparate, and too slow.

Just to give you some data points: every day, 3.5 billion Google searches are conducted, 300 million photos are uploaded to Facebook, and 2.5 quintillion bytes of data are created. IDC predicts global data will grow tenfold between 2016 and 2025, to a whopping 163 zettabytes. I don’t even know how to spell that, and I don’t even know how many zeros there are in a zettabyte. It’s just a mind-blowing number.

It’s just mind-blowing when we talk to banks and I hear the stories: they don’t even know where all their data is sitting or how old it is. They are living with data that they will probably never use.

In terms of velocity, data needs to move faster than legacy systems can handle. Even a 10-second lag in data delivery can pose a threat if you are dealing with, say, hyper-critical data. Again, IDC estimates that 10% of all data will be hyper-critical in nature by 2025, so we are dealing not just with a huge volume of data but also with very sensitive data. So why haven’t we moved faster, and why now? Great question. We have not moved faster for several different reasons, which we talked about earlier and on the last slide as well.

Framing the challenges met by real-time data.

A strong data infrastructure and capability is really needed, one that can help you solve real-time use cases. That will allow you to predict, optimize, and forecast. It should also help you stream data from multiple legacy systems or sources, and that’s where a more strategic approach to data capability, rather than a siloed approach, can help. At the same time, we’re talking about a much more robust data infrastructure, and that’s where in-memory data computing comes in, with MemSQL playing a much bigger role in working with clients like you.
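
As a rough illustration of what “streaming data from multiple sources into an in-memory store” can look like, here is a minimal sketch that creates a MemSQL pipeline pulling events from a Kafka topic. The host names, credentials, topic, database, and table layout are placeholders rather than details from the webinar; MemSQL speaks the MySQL wire protocol, so a standard client library such as pymysql can issue the statements.

```python
# Minimal sketch: stand up a MemSQL pipeline that streams Kafka events
# straight into a table, with no separate ETL job in between.
import pymysql  # MemSQL is MySQL wire-protocol compatible

conn = pymysql.connect(host="memsql-master", port=3306,
                       user="root", password="", database="bank")
try:
    with conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS card_transactions (
                account_id VARCHAR(32),
                amount DECIMAL(12, 2),
                merchant_country CHAR(2),
                event_time DATETIME
            )
        """)
        # The pipeline continuously pulls from the Kafka topic and loads
        # rows, so queries always see data that is seconds old.
        cur.execute("""
            CREATE PIPELINE txn_pipeline AS
            LOAD DATA KAFKA 'kafka-broker:9092/transactions'
            INTO TABLE card_transactions
            FIELDS TERMINATED BY ','
        """)
        cur.execute("START PIPELINE txn_pipeline")
    conn.commit()
finally:
    conn.close()
```

Once the pipeline is started, new events land in the table continuously, so the analytics described in the rest of this webinar can run over live data rather than waiting on a nightly batch load.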

The challenges are part of our business, there are many of them, and the way we have evolved, the challenges have compounded. Then, of course, there is always pressure on total cost of ownership and always pressure on the bottom line. So we have to move, for these reasons and many others we have talked about. I think it’s fair to close out these three areas with this quote: the only thing advancing faster than technology is consumer expectations. The fact is, the pace of change is accelerating. Your customers and consumers want what they want, when they want it. The name of the game is speed, ease, and convenience.

How Fiserv and MemSQL Help

Real-time data definitely moves your business forward. Now, let me talk about some of the success stories we have seen, not just from working with our clients but also from looking at the broader market. I want to talk about just a few of them, then discuss what Fiserv and MemSQL are doing in this space, and then we will take some questions. The first one, on personalization, is the story of a U.S. bank that used machine learning to study the discounts its private bankers were offering to customers.

Three case studies show how MemSQL and Fiserv can make a difference.

The private bankers claimed that they offered these discounts to very important, valuable customers, but the analysis showed otherwise. So the bank used analytics to determine who should actually be given those offers, and built a campaign around personalized offers based on analytics and real-time processing of consumer data. They saw an 8% increase in revenue in just a few months. That was a very profound ROI from driving revenue with real-time data and personalized offers powered by analytics and related technologies.

The next use case is the story of a European bank. They tried a number of approaches to forecast customer attrition, and many did not produce the desired outcomes, so they turned to machine learning and built a predictive model to identify customers who were still active but likely to reduce their business with the bank. That new understanding drove a targeted campaign that reduced churn by 15%, which was, again, a very significant result achieved with machine learning and analytics.
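
For readers who want to picture what such an attrition model might look like, here is a minimal, hedged sketch in Python using scikit-learn and synthetic data. The feature names, coefficients, and data are invented for illustration; they are not the European bank’s actual model.

```python
# Minimal sketch: train a churn (attrition) classifier on synthetic data,
# then rank still-active customers by predicted churn risk.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
# features: months_as_customer, monthly_logins, balance_trend, complaints
X = np.column_stack([
    rng.integers(1, 240, n),
    rng.poisson(6, n),
    rng.normal(0, 1, n),      # negative = shrinking balances
    rng.poisson(0.3, n),
])
# synthetic label: churn is likelier with few logins, falling balances,
# and recent complaints
p = 1 / (1 + np.exp(-(-1.0 - 0.25 * X[:, 1] - 1.2 * X[:, 2] + 0.9 * X[:, 3])))
y = rng.binomial(1, p)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)

# Customers above this risk cutoff would feed a targeted retention campaign.
risk = model.predict_proba(X_test)[:, 1]
print("top-decile churn risk cutoff:", np.quantile(risk, 0.9))
```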

The last one is an example of a leading bank in Asia. It used advanced analytics on data streaming from many different systems: customer demographics, key customer characteristics, products held, credit card statements, transactions and point-of-sale data, online and mobile transfers and payments, and credit bureau data. Looking at a complete 360-degree profile of its customers, the bank discovered that it was dealing not with four, five, or ten customer segments, but with something like 15,000 micro customer segments.
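
As a rough sketch of how thousands of micro-segments can fall out of a 360-degree profile, the following Python example clusters synthetic customer features with MiniBatchKMeans. The features, data sizes, and cluster count are illustrative assumptions; the bank’s actual approach was not described in the webinar.

```python
# Minimal sketch: micro-segmentation by clustering a customer feature vector.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_customers = 50_000
# columns: age, income, card_spend, online_txns_per_month, products_held
features = np.column_stack([
    rng.integers(18, 85, n_customers),
    rng.lognormal(10.5, 0.5, n_customers),
    rng.gamma(2.0, 800.0, n_customers),
    rng.poisson(12, n_customers),
    rng.integers(1, 8, n_customers),
])

X = StandardScaler().fit_transform(features)
# 1,000 clusters here keeps the demo fast; the bank in the example reportedly
# worked with roughly 15,000 micro-segments, and the right count is a
# business decision, not a modeling constant.
segmenter = MiniBatchKMeans(n_clusters=1_000, batch_size=10_000,
                            random_state=1).fit(X)
labels = segmenter.labels_
print("customers per segment (median):",
      int(np.median(np.bincount(labels))))
```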

That micro-segmentation helped them build a next-product-to-buy model that tripled the likelihood of purchase. That was a profound learning for them and helped them grow their business significantly. Fiserv, in collaboration with MemSQL, is working with its clients on several similar use cases. Some of them are listed here, but that’s not all; we are also looking at how we can help our clients understand their consumers by looking at their structured data, unstructured data, and social media data.

Demanding processing steps with strict SLAs are met by MemSQL.

Basically, we ingest data using MemSQL technology, then apply the power of Fiserv data analytics to develop a complete 360-degree customer profile and determine how to engage with those customers. That leads to the next best offer or next best action we can take with those consumers. They could be attrition candidates, they may be dissatisfied, or they may be good candidates for a cross-sell opportunity.

They may be moving, or they may be buying a new house, and we should always be there in front of them. This goes back to our theme that we should know them, because our customers expect us to know them. At the same time, we also have to look at fraud alerts. That’s where we have come together, applying the power of real-time data and analytics to use cases that can help us predict or alert on possible fraud. And we have also developed various dashboards and related portals that help you aggregate data in real time and make significant determinations as you go.
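
As a small illustration of the real-time aggregation behind such dashboards, here is a hedged sketch that queries the illustrative card_transactions table from the earlier pipeline example for the last 15 minutes of activity. Table, column, and host names remain placeholders.

```python
# Minimal sketch: a dashboard-style aggregate over data the pipeline is
# loading continuously, so the numbers reflect activity from moments ago.
import pymysql

conn = pymysql.connect(host="memsql-master", port=3306,
                       user="root", password="", database="bank")
try:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT merchant_country,
                   COUNT(*)    AS txn_count,
                   SUM(amount) AS total_spend
            FROM card_transactions
            WHERE event_time >= NOW() - INTERVAL 15 MINUTE
            GROUP BY merchant_country
            ORDER BY total_spend DESC
            LIMIT 10
        """)
        for country, txn_count, total_spend in cur.fetchall():
            print(country, txn_count, total_spend)
finally:
    conn.close()
```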

Reach out to Fiserv to learn more, including how they use MemSQL.

With that, I’m going to sum it up by saying this: if you are interested in learning how the combination of real-time data computing from MemSQL and Fiserv’s analytics capabilities can help you grow your business or drive customer delight with real-time, personalized services and offers, or simply in identifying real-time data use cases suited to your needs, we are very happy to talk one-on-one, and we are always willing to learn and share. Reach out to my colleague, Lois, or to me, and we will set up some time to learn from your experiences and share ours.

Questions and Answers

Q. You talk with customers every day. What have you seen as the biggest changes in terms of market dynamics or priorities?

I will answer this from two perspectives: the consumer point of view and the financial institutions’ perspective. On the consumer side, we talked about this already. Consumers are, of course, expecting a seamless experience when they are dealing with Netflix, Amazon, and other touchpoints, and they expect us to provide a similar experience when they deal with their banks.

At the same time, they are being very loud and clear, saying, “Hey, you’ve got to know me.” I was in a conversation with customers who said, “When I’m filling out a form on Facebook or other non-banking portals, I generally see that they fetch data about me from other sources,” pre-filling fields or offering type-ahead suggestions. That expectation is coming to our world too, which is very telling. It goes back to the point that knowing them is really important.

At the same time, they’re saying, “Know me now.” So real-time has huge significance; there is no room for lag when we deal with our customers. That’s the customer point of view. From the financial services perspective, the average human attention span is five to seven seconds, which is shorter than a goldfish’s. With that kind of limited window to engage with consumers, and since they’re interacting mostly through digital channels, we have to be very sharp about how we communicate with them so we don’t lose them in those five to seven seconds.

That’s the kind of significant change we are seeing, and it goes back to something I mentioned: how do you make your offers and interactions personalized rather than generic, rather than broad, group-level interactions? There is a pattern evolving around these needs that we have to understand if we want to interact with customers better and get their attention. Those are the changes I’m seeing right now from both perspectives.

Q. How should we start the journey if we want to move towards real-time data computing and analytics?

Sometimes it’s overwhelming, just because of the kind of challenges our ecosystem presents. We talked about legacy data and data sitting in silos. Data has always been, in my personal opinion, one of the lower-priority items in the financial services industry. Not that we didn’t value data; the challenges were just so big that we never gave it enough attention.

However, because of the changing dynamics, that has changed a lot. My suggestion to my clients has always been: let’s not try to solve many use cases in one shot. Pick a use case that is aligned with your strategic priorities, whether it’s a consumer use case, a fraud use case, or simply making data accessible for internal consumption so executives can make better decisions.

It could be any of those areas, as long as they align with your key goals, but the data strategy itself has to be much broader, and the infrastructure should be designed with the expectation that you will scale up and address many use cases. So start with one use case, but think about a much bigger data infrastructure game plan and a broader analytics strategy. That way you can handle one use case, show the value proposition and the ROI to your leadership team, and then move on to the next one, with the infrastructure already in place.

Q. Can Fiserv and MemSQL help us understand our current landscape and help solve some use cases?

Absolutely. That’s one of the reasons we partnered: we want to provide end-to-end services and solutions to our clients across the entire journey. Fiserv and MemSQL can start from the very beginning of your journey and walk the entire path with you. What I mean by that is we can come in and look at the current state of your data, your data infrastructure, and your business priorities.

From there, we can build a strategy for you and then execute it together. That’s what I mean by walking the path together. Fiserv and MemSQL are very happy to engage with you wherever you are in your journey. We don’t just solve use cases; we can help you identify the right use cases, align them with your business priorities, and solve them for you.

Q. What are the biggest roadblocks to getting projects to move forward, or to reinvigorating stalled opportunities? How have your customers and partners overcome them?

The biggest one, as I was sharing, is data access. Because of the way this industry has evolved, there are so many legacy systems, and those systems probably were never required to talk to each other in earlier days. Now we are required to look at data in new ways because of the experiences consumers are getting outside the banking and financial services world. It has forced all of us to change.

So that’s been the biggest challenge – how do we access the data? How do we make sense of it, and then how do we monetize it? We are working with our clients to navigate this path and not get bogged down along the way, by putting the right strategy in place and guiding them toward the right tools and technologies to keep the journey going.

Again, the key point is identifying the right use case, whether it’s related to consumers, to fraud and risk management, to compliance, or to helping your executive leadership make real-time decisions. It could be any of those areas. From there we work backward to understand how to navigate the data infrastructure and database challenges, how to facilitate real-time computing, and, at the same time, how to enable you to address future use cases. So I think the silos are the biggest opportunity here.

Conclusion

We invite you to learn more about MemSQL at memsql.com or give us a try for free at memsql.com/free.
