Quantcast
Viewing all articles
Browse latest Browse all 427

Introducing MemSQL Pipelines to Stored Procedures

Feed: MemSQL Blog.
Author: Bryan Offutt.

With the release of MemSQL 6.5, the MemSQL team is announcing our latest innovation in data ingestion: Pipelines to stored procedures. By marrying the speed and exactly once semantics of MemSQL Pipelines with the flexibility and transactional guarantees of stored procedures, we’ve eliminated the need for complex transforms and unlocked the doors to a variety of new, groundbreaking use cases. This blog post will walk you through how to use Pipelines to stored procedures to break data streams up into multiple tables, join them with existing data, and update existing rows, but this is just the tip of the iceberg of what is possible with this exciting new functionality.

How Does it Work?

Image may be NSFW.
Clik here to view.
Extract → Transform (Optional) → Stored Procedure OR Traditional Single Table Load

Pipelines to stored procedures augments the existing MemSQL Pipelines data flow by providing the option to replace the default Pipelines load phase with a stored procedure. The default Pipelines load phase only supports simple insertions into a single table, with the data being loaded either directly after extraction or following an optional transform. Replacing this default loading phase with a stored procedure opens up the possibility for much more complex processing, providing the ability to insert into multiple tables, enrich incoming streams using existing data, and leverage the full power of MemSQL Extensibility.

Pipelines Transforms vs. Stored Procedures

Those familiar with MemSQL Pipelines might be wondering, “I already have a transform in my pipeline….should I use a stored procedure instead?” Leveraging both a transform and a stored procedure in the same Pipeline allows you to combine the third-party library support of a traditional transform alongside the multi-insert and data-enrichment capabilities of a stored procedure.

Traditional Transform Stored Procedure
Pros
  • Can be written in any language (BASH, python, Go, etc.)
  • Can leverage third-party libraries
  • Easily portable to systems outside of MemSQL

Cons

  • More difficult to debug and manage
  • Less performant for most use cases
  • Only allows inserting into a single table
  • No transactional guarantees
Pros
  • Insert into multiple tables
  • MemSQL native
  • High performance querying of existing MemSQL tables
  • Transactional guarantees

Cons

  • MemSQL-specific language
  • No access to third-party libraries for more complex transformations

Example Use Cases

Insert Into Multiple Tables

Pipelines to stored procedures now enables you to insert data from a single stream into multiple tables in MemSQL. Consider the following stored procedure:

CREATE PROCEDURE proc(batch query(tweet json))
AS
BEGIN
    INSERT INTO tweets(tweet_id, user_id, text) 
      SELECT tweet::tweet_id, tweet::user_id, tweet::text
      FROM batch;
    INSERT INTO users(user_id, user_name)
      SELECT tweet::user_id, tweet::user_name
      FROM batch;
END

This procedure takes in tweet data in JSON format, separates out the tweet text from the user information, and inserts them into their respective tables in MemSQL. The entire stored procedure is wrapped in a single transaction, ensuring that data is never inserted into one table but not the other.

Load Into Table and Update Aggregation

There are many use cases where maintaining an aggregation table is a more performant and sensible alternative to running aggregation queries across raw data. Pipelines to stored procedures allows you to both insert raw data and update aggregation counters using data streamed from any Pipelines source. Consider the following stored procedure:

CREATE PROCEDURE proc(batch query(tweet json))
AS
BEGIN
    INSERT INTO tweets(tweet_id, user_id, text) 
      SELECT tweet::tweet_id, tweet::user_id, tweet::text
      FROM batch;
    INSERT INTO retweets_counter(user_id, num_retweets)
      SELECT tweet::retweeted_user_id, 1
      FROM batch
      ON DUPLICATE KEY UPDATE num_retweets = num_retweets + 1
      WHERE tweet::retweeted_user_id is not null;
END

This procedure takes in tweet data as JSON, inserts the raw tweets into a “tweets” table, and updates a second table which tracks the number of retweets per user. Again, the transactional boundaries of the stored procedure ensures that the aggregations in retweets_counter are always in sync with the raw data in the tweets table.

Use Existing Data to Enrich a Stream

It’s also possible to use a stored procedure to enrich an incoming stream using data that already exists in MemSQL. Consider the following stored procedure, which uses an existing MemSQL table to join an incoming IP address batch with existing geo data about its location:

CREATE PROCEDURE proc(batch query(ip varchar, ...))
AS
BEGIN
    INSERT INTO t
      SELECT batch.*, ip_to_point_table.geopoint
      FROM batch
      JOIN ip_to_point_table
      ON ip_prefix(ip) = ip_to_point_table.ip;
END

These use cases only scratch the surface of what is possible using Pipelines to stored procedures. By joining the speed of Pipelines with the flexibility of stored procedures, MemSQL 6.5 gives you total control over all of your streaming data. For more information on the full capabilities of MemSQL Extensibility, please see our documentation.


Viewing all articles
Browse latest Browse all 427

Trending Articles