Cloud Event Processing - Analyze, Sense, Respond

Colin Clark




Building Inverted Indexes on Tweets Using Map/Reduce

In our last post, we looked at how to turn bad map/reduce code into better map/reduce code.  A natural byproduct of breaking tweets down into words is the ability to build an inverted index, which makes it possible to search tweets by keyword.

It’s All in the Tweet

Given the tweets:

  1. “@eventcloudpro I like the idea of tree maps, if you combine these with metrics trees in #PM Strategy Management tools u could align the two”
  2. “RT @jakewk RT @eventcloudpro: did streambase find a buyer? http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep”
  3. “I really like #erlang – it’s rocking technology!”

How do we compute an inverted index? First, let’s assign a unique ID to each tweet above – for our example, that’s tweet #1, tweet #2, and tweet #3.  Now we want to find which words are used where, so we’ll construct a table consisting of words and, for each word, the list of IDs of the tweets that contain it.

Word              Tweets
@eventcloudpro    1, 2
metrics           1
technology        2, 3

and so on…
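
To make the table concrete, here is a minimal Python sketch that builds the same index from the three tweets. The names `TWEETS`, `tokenize`, and `build_inverted_index` are ours, invented for illustration, and the tokenizer is deliberately crude: it lowercases everything, keeps a leading @ or #, and splits on anything else, so URLs and contractions get chopped into pieces. The tweet text is typed with plain ASCII punctuation.

```python
import re
from collections import defaultdict

# The three example tweets, keyed by the IDs we assigned above.
TWEETS = {
    1: "@eventcloudpro I like the idea of tree maps, if you combine these with "
       "metrics trees in #PM Strategy Management tools u could align the two",
    2: "RT @jakewk RT @eventcloudpro: did streambase find a buyer? "
       "http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep",
    3: "I really like #erlang - it's rocking technology!",
}


def tokenize(text):
    """Crude tokenizer (an assumption): lowercase, keep a leading @ or #."""
    return re.findall(r"[@#]?\w+", text.lower())


def build_inverted_index(tweets):
    """Map each word to the set of tweet IDs that contain it."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for word in tokenize(text):
            index[word].add(tweet_id)
    return index


index = build_inverted_index(TWEETS)
for word in ("@eventcloudpro", "metrics", "technology"):
    print(word, sorted(index[word]))
# @eventcloudpro [1, 2]
# metrics [1]
# technology [2, 3]
```

Storing the tweet IDs as sets rather than lists means a word that appears twice in one tweet is only recorded once.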

Using the Index

To find tweets with the words we’re interested in, we just query the inverted index.  For example, if I’m interested in finding tweets with the word ‘metrics’, using the index I see that tweet #1 has that word, so I look up tweet #1.  If I’m interested in the words ‘@eventcloudpro’ and ‘technology’, I get the sets (1,2) and (2,3) – and their intersection is tweet #2.
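
The lookup described above boils down to a set intersection. Here is a small sketch, assuming the index is held as a Python dict of sets; `INDEX` and `search` are hypothetical names of our own, and the index only contains the three rows from our table.

```python
# A hand-written slice of the inverted index: word -> set of tweet IDs.
INDEX = {
    "@eventcloudpro": {1, 2},
    "metrics": {1},
    "technology": {2, 3},
}


def search(index, *words):
    """Return the IDs of tweets that contain every one of the given words."""
    if not words:
        return set()
    # A word missing from the index matches no tweets at all.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


print(search(INDEX, "metrics"))                       # {1}
print(search(INDEX, "@eventcloudpro", "technology"))  # {2}
```

Swapping the intersection for a union would give an "any of these words" search instead.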

I’d Like Extra Sauce, Please

So now that we’ve figured out how to look up specific tweets based upon content, how could we look up tweets based upon other stuff; stuff like categorization, entity extraction, root words (stemming), or even sentiment?  By running the tweets through a system that calculates these value-added goodies, we could construct additional inverted indexes to further organize our tweets.
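
One way to picture these extra indexes: the index-building loop stays the same, and only the function that pulls keys out of a tweet changes. The sketch below is purely illustrative. `index_by`, `hashtags`, and `naive_stems` are hypothetical names, and `naive_stems` is a toy stand-in that merely strips a trailing "s"; a real stemmer, entity extractor, or sentiment scorer would slot in the same way.

```python
import re
from collections import defaultdict

TWEETS = {
    1: "@eventcloudpro I like the idea of tree maps, if you combine these with "
       "metrics trees in #PM Strategy Management tools u could align the two",
    2: "RT @jakewk RT @eventcloudpro: did streambase find a buyer? "
       "http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep",
    3: "I really like #erlang - it's rocking technology!",
}


def hashtags(text):
    """Extract hashtags as index keys."""
    return re.findall(r"#\w+", text.lower())


def naive_stems(text):
    """Toy stemmer (an assumption, not a real algorithm): drop a trailing 's'."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]


def index_by(tweets, extract):
    """Build an inverted index keyed by whatever `extract` returns."""
    index = defaultdict(set)
    for tweet_id, text in tweets.items():
        for key in extract(text):
            index[key].add(tweet_id)
    return index


by_hashtag = index_by(TWEETS, hashtags)
by_stem = index_by(TWEETS, naive_stems)

print(sorted(by_hashtag["#erlang"]))  # [3]
print(sorted(by_stem["like"]))        # [1, 2, 3]  ("likes" folds into "like")
```

With the stem index, a search for ‘like’ also finds tweet #2, which the plain word index would miss.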


Using map/reduce in the process of calculating inverted indexes is a natural fit.  And there’s a great introduction on how to do this using Erlang in Joe Armstrong’s book, “Programming Erlang.”
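
To show why the fit is natural, here is a single-process Python sketch of the map/reduce shape. It is not the Erlang code from the book, and nothing here runs in parallel; the function names (`map_tweet`, `shuffle`, `reduce_word`, `inverted_index`) and the tokenizing regex are our own assumptions. The map step emits (word, tweet ID) pairs, the shuffle step groups the pairs by word, and the reduce step collapses each group into a sorted list of IDs.

```python
import re
from collections import defaultdict
from itertools import chain

TWEETS = {
    1: "@eventcloudpro I like the idea of tree maps, if you combine these with "
       "metrics trees in #PM Strategy Management tools u could align the two",
    2: "RT @jakewk RT @eventcloudpro: did streambase find a buyer? "
       "http://bit.ly/brTJlx SunGard likes to buy their technology partners #cep",
    3: "I really like #erlang - it's rocking technology!",
}


def map_tweet(tweet_id, text):
    """Map: emit one (word, tweet_id) pair per distinct word in the tweet."""
    words = set(re.findall(r"[@#]?\w+", text.lower()))
    return [(word, tweet_id) for word in words]


def shuffle(pairs):
    """Shuffle: group the emitted tweet IDs by word."""
    grouped = defaultdict(list)
    for word, tweet_id in pairs:
        grouped[word].append(tweet_id)
    return grouped


def reduce_word(word, tweet_ids):
    """Reduce: collapse one word's tweet IDs into a sorted, de-duplicated list."""
    return word, sorted(set(tweet_ids))


def inverted_index(tweets):
    mapped = chain.from_iterable(map_tweet(i, t) for i, t in tweets.items())
    return dict(reduce_word(w, ids) for w, ids in shuffle(mapped).items())


idx = inverted_index(TWEETS)
print(idx["technology"])      # [2, 3]
print(idx["@eventcloudpro"])  # [1, 2]
```

Because each call to `map_tweet` looks at one tweet and each call to `reduce_word` looks at one word, both steps could be farmed out to separate workers without changing the result.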



Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud-based event processing using map/reduce, complex event processing, and event-driven pattern matching agents. You can also send topic suggestions or questions to colin@cloudeventprocessing.com