
Cloud Event Processing - Analyze, Sense, Respond

Colin Clark




Normalizing Streaming Data & Piecewise Aggregate Approximation

OK, so you’ve read the last post, downloaded and read the papers on SAX, and you’re ready to get going!  Wonderful.  First, you’ll need some data, which I’ve thoughtfully included for download here: SAX Prep (an Excel file with some trades in it).  Download the data, and then follow along below.

WHAT ARE WE DOING?

What we want to do is take a whole bunch of numeric data, reduce its dimensionality, and then convert it into some type of symbolic representation.  This is so we can do some other interesting things with it later that are much easier when the data is represented this way.  Currently, the data in the Excel spreadsheet that I’ve toiled for hours on just for you looks like this:

What we see in this chart is a day’s worth of trade prices for a make-believe symbol.  Actually, I know the symbol, but I can’t tell you that because you didn’t buy the data!

In the next step, we want to normalize the data to a mean of 0 and a standard deviation of 1.  So, compute the average and the standard deviation of the day’s prices, and then for each price, subtract the average and divide by the standard deviation.  Or just use some Excel functions, which I have done in the spreadsheet for you.
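For anyone following along outside Excel, the normalization step might look roughly like the sketch below. The function name `z_normalize` and the sample prices are made up for illustration, and the choice of population standard deviation is an assumption; the spreadsheet may use the sample version instead.

```python
import statistics


def z_normalize(prices):
    """Rescale prices so they have mean 0 and standard deviation 1."""
    mean = statistics.fmean(prices)
    # Assumption: population standard deviation. Swap in statistics.stdev
    # if the sample standard deviation is what you want.
    std = statistics.pstdev(prices)
    if std == 0:
        # A flat series has no spread, so every point sits at the mean.
        return [0.0 for _ in prices]
    return [(p - mean) / std for p in prices]


# Made-up prices, not the data from the spreadsheet.
sample_prices = [10.0, 10.5, 11.0, 10.5, 10.0]
normalized = z_normalize(sample_prices)
print(normalized)
```

After this step the series is centered on zero, which is what lets fixed cut points be applied to any symbol regardless of its price level.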

Piecewise Aggregate Approximation (PAA)

With Applied PAA

Once we’ve normalized the data, we can apply PAA.  I picked time divisions of an hour, and averaged the normalized price information within each one.  You can see the normalized price data and the resulting buckets, as computed via PAA, in the chart.  There’s something important to notice here: although I didn’t pick many divisions, and more of them might have given more specificity to the resulting PAA analysis, you can see that the shape of the PAA looks like the underlying data.  This is important when we then use symbols to describe the patterns; because we’re using PAA underneath, we can calculate the distance between observed SAX patterns.  Also, you can see some of the statistically irrelevant spikes have been ignored.  Super Good!
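Here is a minimal sketch of the hourly bucketing described above. It assumes each trade arrives as a `(seconds_since_midnight, normalized_price)` pair; that layout, the function name `paa_by_hour`, and the sample ticks are all hypothetical and not taken from the spreadsheet.

```python
from collections import defaultdict


def paa_by_hour(ticks):
    """Average the normalized prices that fall within each clock hour.

    ticks: iterable of (seconds_since_midnight, normalized_price) pairs.
    Returns one average per hour that had at least one trade, in time order.
    """
    buckets = defaultdict(list)
    for seconds, value in ticks:
        buckets[seconds // 3600].append(value)
    # Hours with no trades produce no bucket at all in this sketch.
    return [sum(vals) / len(vals) for _, vals in sorted(buckets.items())]


# Made-up ticks: two trades in the 9 o'clock hour, two in the 10 o'clock hour.
ticks = [
    (9 * 3600 + 1800, 1.5),
    (9 * 3600 + 2400, 1.9),
    (10 * 3600 + 60, 0.2),
    (10 * 3600 + 1200, 0.4),
]
hourly = paa_by_hour(ticks)
print(hourly)  # approximately [1.7, 0.3]
```

Bucketing by clock time rather than by a fixed number of points means a busy hour and a quiet hour each contribute exactly one value to the result.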

EVERYTHING’S A SYMBOL, MAN…

So, how do we go from normalized PAA to symbol?  Easy; if you look in the spreadsheet, you’ll see the values -1.28, -0.84, -0.52, -0.25, 0, 0.25, 0.52, 0.84, and 1.28.  And I’ve associated letters with those numbers.  So, the first PAA value is 1.68, which is greater than 1.28, so our word begins with I.
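The lookup can be sketched as follows. The exact letter-to-number pairing lives in the spreadsheet, so this is an assumption: each of the nine cut points gets one letter, A through I, a value takes the letter of the largest cut point at or below it, and anything below -1.28 is clamped to A. That reading at least reproduces the "I" for 1.68 above. Nine cut points also define ten regions, so a ten-letter alphabet with one letter per region is the other common reading; this sketch follows the post's "I" instead. The names `to_symbol` and `to_word` are illustrative.

```python
from bisect import bisect_right

# Cut points quoted in the post.
BREAKPOINTS = [-1.28, -0.84, -0.52, -0.25, 0.0, 0.25, 0.52, 0.84, 1.28]
# Assumption: one letter per cut point, A for -1.28 up to I for 1.28.
LETTERS = "ABCDEFGHI"


def to_symbol(value):
    """Return the letter of the largest cut point at or below value."""
    idx = bisect_right(BREAKPOINTS, value) - 1
    # Assumption: values below the lowest cut point are clamped to A.
    return LETTERS[max(idx, 0)]


def to_word(paa_values):
    """Map a sequence of PAA averages to a string of letters."""
    return "".join(to_symbol(v) for v in paa_values)


print(to_symbol(1.68))  # I
# Made-up PAA values, not the ones from the spreadsheet.
print(to_word([1.68, 0.3, -0.9]))  # IFA
```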

MAY I HAVE THE ENVELOPE PLEASE

So after all of this analysis, our SAX word that represents a whole lot of trade data is “IFGDBAB.”  How cool is that?  A whole day’s worth of data expressed as a few symbols.  Think of how much easier it would be to look up a nearest neighbor to this pattern, or maybe classify it given some cluster analysis, or detect something that we haven’t seen before using suffix trees.  All of these are much easier to do with symbolic data than with continuous numeric data.

TAKE ME TO THE ‘B’ SECTION

If you read the papers I recommended, and have paid attention, you might notice a potential problem with the methodology outlined and applied here so far.  What is it?  Also, this has been a lot of fun to do using Excel, but I think we could actually get this done easier and faster using some good old sliding windows and aggregation (CEP).

AND AS ALWAYS

Thanks for reading – I’ll be showing how to do this using DarkStar next.  Because chances are if we’re doing this in real time, we’re doing it for *a bunch* of data, and Excel, although wonderful, just ain’t going to cut it.



Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http:\\twitter.com\EventCloudPro to learn more about cloud based event processing using map/reduce, complex event processing, and event driven pattern matching agents. You can also send topic suggestions or questions to colin@cloudeventprocessing.com