Cloud Event Processing - Analyze, Sense, Respond

Colin Clark


Writing Bad Code in Map/Reduce

One of the maps we’re running breaks the text of a Tweet down into words

In the Twitter project we’ve been working on, one of the maps we’re running breaks the text of a Tweet down into words.

Because we can’t assume that any data will be available for access via a database, etc., we attach a couple of values that we’re interested in for later analysis to the word, append a 1, and emit the tuple.

This is an example of what the tuple looks like:

"TimeZone", "Location", "ScreenName", "Word", 1

Since a tweet contains many words, this map runs once per word, and crunching the Tweet down results in many of the above values being duplicated.

"TimeZone", "Location", "ScreenName", "anotherWord", 1
"TimeZone", "Location", "ScreenName", "yetAnotherWord", 1
"TimeZone", "Location", "ScreenName", "Word", 1
"TimeZone", "Location", "ScreenName", "Word", 1
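The map step described above might look something like this (a hypothetical Python sketch; the function and field names are illustrative, since the original code isn’t shown):

```python
def map_tweet(time_zone, location, screen_name, text):
    """Naive map: emit one (TimeZone, Location, ScreenName, Word, 1)
    tuple per word, repeating the metadata fields every time."""
    for word in text.split():
        yield (time_zone, location, screen_name, word, 1)

# Example: three words produce three tuples, each carrying
# the same TimeZone/Location/ScreenName metadata.
tuples = list(map_tweet("EST", "NYC", "someUser", "Word anotherWord Word"))
```

Note how the metadata is emitted once per word, which is exactly the duplication the tuples above illustrate.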

If you look closely, though, there’s a way to compress this a bit.

One of the problems with map/reduce (maybe not a problem, but a caveat) is that it’s quite easy to swamp the network.

In the example above, we emit a tuple (slightly longer in production) and repeat the same metadata for each word.

What if, instead, we used an associative array when crunching down the tweet? For example, after crunching the data above, our data structure would look something like:

"TimeZone", "Location", "ScreenName", {Word: 3, anotherWord: 1, yetAnotherWord: 1}
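A compacted map along these lines could be sketched as follows (again hypothetical; the names are assumptions). The word counts are crunched down locally with a `Counter` before anything is emitted:

```python
from collections import Counter

def map_tweet_compact(time_zone, location, screen_name, text):
    """Compacted map: emit the metadata once per tweet, with an
    associative array of word -> count attached instead of one
    tuple per word."""
    counts = Counter(text.split())  # crunch the tweet down locally
    return (time_zone, location, screen_name, dict(counts))
```

One tweet now produces one tuple regardless of how many words it contains, so the metadata fields cross the network only once.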

By attaching an associative array of words and their counts to the result in the map, we dramatically decrease the amount of data being moved around. This reduces the storage required (if you’re doing old-fashioned Hadoop) and, more importantly, the network traffic if you’re doing streaming map/reduce.

Of course, this means that the reduce function needs to be aware that it’s getting more than the simple string emitted in our first example, but we’ve saved a lot of bandwidth in the process.
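Such a dict-aware reduce might look like this (a minimal sketch under the same assumptions as above: each record is metadata plus a word-count map, and we group by the metadata key):

```python
from collections import Counter

def reduce_counts(records):
    """Reduce step that expects (TimeZone, Location, ScreenName, counts)
    tuples, where counts is a word -> count map, and merges the maps
    per metadata key."""
    merged = {}
    for time_zone, location, screen_name, counts in records:
        key = (time_zone, location, screen_name)
        merged.setdefault(key, Counter()).update(counts)
    return merged

# Two compacted records from the same user merge into one count map.
records = [
    ("EST", "NYC", "someUser", {"Word": 3, "anotherWord": 1}),
    ("EST", "NYC", "someUser", {"Word": 1, "yetAnotherWord": 2}),
]
totals = reduce_counts(records)
```

The extra logic is trivial compared with the bandwidth saved by not shipping one tuple per word.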


More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud-based event processing using map/reduce, complex event processing, and event-driven pattern matching agents. You can also send topic suggestions or questions to colin@cloudeventprocessing.com.