Cloud Event Processing - Analyze, Sense, Respond

Colin Clark


Data Mining in Streaming Data – CEP & SAX

In the last couple of posts, I’ve outlined a method for reducing the dimensionality of continuous data and converting it into symbols to make further analysis easier. The method we’ve been using is referred to as Symbolic Aggregate Approximation, or SAX.

STREAMING SAX

The examples I’ve shown so far have been illustrated using Excel. But if we were serious about using SAX in a real-world scenario, we’d most probably be processing some type of streaming data. SAX has application anywhere there’s a lot of high-dimensional, continuous data being generated. But we’ll stick to stock market trade data for now.

I went out and purchased a month’s worth of IBM trade prices & volumes from the NYSE – it’s very easy to do, and you can do it here. Once I did that, I loaded the data into a MySQL database and prepared to process it within DarkStar, our distributed event processing system that uses components of streaming map/reduce and complex event processing.
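If you’d like to follow along without DarkStar, here’s a minimal sketch of replaying stored trades from MySQL as a time-ordered stream of events. The table and column names (trades, symbol, price, volume, trade_time) are assumptions for illustration, not the actual schema I used.

# Sketch: replay stored IBM trades from a MySQL table as a time-ordered event stream.
# The schema trades(symbol, price, volume, trade_time) is hypothetical.
import mysql.connector

def trade_events(symbol="IBM"):
    conn = mysql.connector.connect(host="localhost", user="user",
                                   password="password", database="marketdata")
    cursor = conn.cursor()
    cursor.execute("SELECT symbol, price, volume, trade_time FROM trades "
                   "WHERE symbol = %s ORDER BY trade_time", (symbol,))
    for sym, price, volume, trade_time in cursor:
        # Each row becomes one tradeEvent, delivered in time order.
        yield {"symbol": sym, "price": float(price),
               "volume": int(volume), "time": trade_time}
    cursor.close()
    conn.close()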

BEFORE WE GET STARTED

In the examples I outlined earlier, I took an entire day’s worth of data, normalized it, and then applied piecewise aggregate approximation to it, dividing the trading day into 7 roughly equal samples. Now that we’re going to process the data as it streams out of the exchange, how should we break things up? The answer depends on the question you’re asking. If there’s a pattern you think shows itself every 10 minutes and consists of 10 discrete values, then we should sample 10 minutes’ worth of data and break it down into intervals of 1 minute using the techniques shown earlier. So the first thing we’re going to do is create a named window. A named window will provide the data we need in a 10-minute sliding window. We describe the window like this:

CREATE WINDOW winTradeData.win:time(10 minutes) as select * from tradeEvent;
INSERT INTO winTradeData select * from tradeEvent;

These two statements 1) create a sliding window that contains the last 10 minutes of tradeEvent events, and 2) insert tradeEvent events into that window as they arrive. The first statement creates a named window that has all of the fields from the tradeEvent event; the second statement populates the window. So far so good.
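If your platform doesn’t have named windows, the same idea can be sketched in ordinary code: keep only the events whose timestamps fall within the last 10 minutes. This is just an illustration of what the engine is doing for us, not DarkStar code.

# Sketch: a 10-minute sliding window over tradeEvent-like dicts (illustration only).
from collections import deque
from datetime import timedelta

class SlidingWindow:
    def __init__(self, length=timedelta(minutes=10)):
        self.length = length
        self.events = deque()

    def insert(self, event):
        # The equivalent of "INSERT INTO winTradeData": add the new event...
        self.events.append(event)
        # ...and evict anything older than 10 minutes, which the
        # win:time(10 minutes) view handles for us automatically.
        cutoff = event["time"] - self.length
        while self.events and self.events[0]["time"] < cutoff:
            self.events.popleft()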

WE’VE GOT A WINDOW FULL OF EVENTS, NOW WHAT?

Well, we’d like to break the window down into 10 equal segments of 1 minute each, and then average and classify the 1-minute segments. But before we can classify the data, we need to normalize it. We want to do this every minute; we don’t want to wait and do it every 10 minutes, do we? If we did, we might miss a whole bunch of patterns that started in the previous window and ended in the current window. So we’re going to pull data from the window and normalize it every minute with this statement (I call this a ‘tumbling window’):

SELECT symbol, (price-avg(price))/std(price) as normalized_price FROM winTradeData output every 1 minute;
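In plain code, that normalization step looks roughly like this (a sketch of the arithmetic in the query above, not the engine’s implementation): every minute, take the prices currently in the window and z-normalize them.

# Sketch: z-normalize the prices currently in the window, as the query above does.
from statistics import mean, pstdev

def normalize(prices):
    mu = mean(prices)
    sigma = pstdev(prices)
    if sigma == 0:
        # A flat window: every price equals the mean, so every normalized value is 0.
        return [0.0] * len(prices)
    return [(p - mu) / sigma for p in prices]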

PICK A LETTER, ANY LETTER

We’re going to apply PAA to the resulting data set (see the earlier post). PAA gives us an average value for each time slice within the interval we’re analyzing; in this case, each slice is 1 minute long. So we want to average all the trades for a 1-minute period and then look up the corresponding SAX letter. We could write another query to accomplish this, or perhaps modify the one above. Once we have the averages, we can assign a letter, and then we’ll have a SAX word.
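As a sketch, that step might look like this: average the normalized prices in each of the 10 slices (that’s PAA), then map each average to a letter using the Gaussian breakpoints from the earlier posts. The alphabet size of 4 here is just an assumption for the example.

# Sketch: PAA over the window's slices, then SAX letter lookup.
from statistics import mean

# Breakpoints that cut the standard normal distribution into 4 equal-probability
# regions; an alphabet of 4 letters is an assumption for this example.
BREAKPOINTS = [-0.67, 0.0, 0.67]
ALPHABET = "abcd"

def paa(normalized_prices, segments=10):
    # Average the series into `segments` equal-length pieces
    # (assumes at least `segments` data points).
    n = len(normalized_prices)
    return [mean(normalized_prices[i * n // segments:(i + 1) * n // segments])
            for i in range(segments)]

def sax_letter(value):
    # Map a PAA average to its SAX letter.
    for letter, cut in zip(ALPHABET, BREAKPOINTS):
        if value < cut:
            return letter
    return ALPHABET[-1]

def sax_word(normalized_prices, segments=10):
    return "".join(sax_letter(v) for v in paa(normalized_prices, segments))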

SO WE’VE GOT A SAX WORD, NOW WHAT?

Now that we’re able to describe streaming data in a discrete way, with a lower-bounding function, we’re ready to do more with it. In an earlier post, I said that SAX could be used for clustering, classification, anomaly detection, and search. We’re going to focus on search in the next post.

UNTIL THEN

Think about how this algorithm lends itself to a map/reduce implementation (via Hadoop or via in-database map/reduce) and how we’d then use SAX to correlate streaming data to historical data. There’s a lot in this blog that talks about this, perhaps not in terms of SAX, but there’s work on map/reduce, inverted indexes, etc. We’ll need all of that, and a little more, to string it all together.
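To make the hint concrete, here’s a toy sketch of the inverted-index idea: treat each SAX word as a key and collect the windows in which it occurred. Building it is just a group-by on the word, which is exactly the shape of a map/reduce job.

# Sketch: a toy inverted index from SAX words to the windows that produced them.
from collections import defaultdict

def build_index(saxed_windows):
    # saxed_windows: iterable of (window_start_time, sax_word) pairs.
    index = defaultdict(list)
    for start_time, word in saxed_windows:
        index[word].append(start_time)
    return index

# Lookup: every window whose shape produced the (hypothetical) word "abdc".
# matches = build_index(stream_of_words)["abdc"]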

THE NEXT POST

Will happen sooner than the last, I promise.

THANKS FOR READING!



More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud-based event processing using map/reduce, complex event processing, and event-driven pattern matching agents. You can also send topic suggestions or questions to colin@cloudeventprocessing.com.