
Cloud Event Processing - Analyze, Sense, Respond

Colin Clark




Search, Themes, Entities, and Sentiment Analysis

I’ve been working on a project lately that involves finding and processing any and all data for a subset of the OTC equities market here in the US.  Listening to a description of what needed to be accomplished, someone piped up and said, “Well, this should be pretty easy.  All we have to do is search for the stock symbols, company names, etc., persist the data, and start calculating our baseline statistics.”

NOT SO PRONTO, TONTO

Try searching the web for a stock symbol, or for 10,000 stock symbols, and you’ll get a jumble of information.  For instance, searching for the symbol “ABLE” can provide hours of entertainment, and completely meaningless results.  So how do we narrow things down a bit?

ENTITY EXTRACTION

By using computational linguistics, we can scan source documents, be they Tweets, blogs, news, or whatever, and look for what are referred to as entities.  Such an approach lets us look for “ABLE” as above, but only where “ABLE” appears as a noun rather than a verb.  This gets us further down the line.  Combining these techniques with the company name and the like narrows the results down quite a bit.
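As a rough illustration, plain part-of-speech tagging already gets you much of the way there.  The sketch below uses NLTK (my choice for illustration; the post doesn’t name a toolkit) and only counts a document as a hit for a ticker when the symbol is tagged as a noun:

    # A minimal sketch, assuming NLTK with the 'punkt' tokenizer and
    # 'averaged_perceptron_tagger' data installed; the symbol and sample
    # sentences are illustrative.
    import nltk

    NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

    def mentions_symbol_as_entity(text, symbol="ABLE"):
        # Tag every token, then keep the document only if the symbol
        # appears as a noun, not as an ordinary adjective or verb.
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        return any(token.upper() == symbol and tag in NOUN_TAGS
                   for token, tag in tagged)

    print(mentions_symbol_as_entity("ABLE Industries beat estimates."))   # True
    print(mentions_symbol_as_entity("She was able to leave early."))      # False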

SUMMARIES GET YOU EXACTLY THAT

If you can’t extract the search entity from the source document, how do you know whether the sentiment you’ve calculated for the document actually has anything to do with what you were searching for?  That’s why being able to calculate sentiment for an entity identified within the document is as important as, if not more important than, summarizing the sentiment score at the document level.
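A minimal way to see the difference is to score only the sentences that actually mention the entity, rather than the document as a whole.  The sketch below uses NLTK’s VADER scorer purely for illustration (the post doesn’t say how sentiment is calculated), and assumes the ‘punkt’ and ‘vader_lexicon’ data are installed:

    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    def entity_sentiment(document, entity):
        # Score only the sentences that mention the entity.
        sia = SentimentIntensityAnalyzer()
        sentences = [s for s in nltk.sent_tokenize(document)
                     if entity.lower() in s.lower()]
        if not sentences:
            return None                            # entity never mentioned
        scores = [sia.polarity_scores(s)["compound"] for s in sentences]
        return sum(scores) / len(scores)           # average, roughly -1..+1

    doc = "ABLE beat estimates again. The rest of the sector looks terrible."
    print(entity_sentiment(doc, "ABLE"))           # positive, despite the gloomy document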

CONSTRUCTING INVERTED INDICES

Using the correct approach, we should be able to calculate a document’s summary and any relevant themes, extract the referenced entities, and calculate sentiment for each.  This is a great problem to solve with elastic resources: just keep adding VMs running the right stack until you can keep up with the inflow.  It’s also a great application for Map/Reduce, as I’ve discussed previously.  In this fashion, as elemental information is extracted and calculated for a source document, those keys are either created or updated with the document’s link id, in addition to storing the source document itself somewhere.
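As a sketch of that Map/Reduce step (the names and structure here are mine, not DarkStar’s), the map phase emits an (entity, document id) pair for every entity found in a document, and the reduce phase folds those pairs into an inverted index:

    from collections import defaultdict

    def map_phase(doc_id, text, watchlist):
        # Emit one (entity, doc_id) pair per entity found in the document.
        return [(entity, doc_id) for entity in watchlist
                if entity.lower() in text.lower()]

    def reduce_phase(pairs):
        # Fold the emitted pairs into entity -> set of document ids.
        index = defaultdict(set)
        for entity, doc_id in pairs:
            index[entity].add(doc_id)
        return index

    docs = {"d1": "ABLE rallies on heavy volume.",
            "d2": "ACME and ABLE both slipped at the open."}
    watchlist = ["ABLE", "ACME"]

    pairs = [pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text, watchlist)]
    print(dict(reduce_phase(pairs)))   # {'ABLE': {'d1', 'd2'}, 'ACME': {'d2'}}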

NOSQL = NOTOOLS?

There is a potential need for a NoSQL database for storing some of the information above.  Several come to mind – MongoDB and CouchDB for document storage, Cassandra for inverted indices, etc.  We’ve even got some of these running with DarkStar right now, consuming raw information from the web and running streaming Map/Reduce and CEP-based analytics and queries against it.  But while some of these open source databases are great for what they’re intended to do, there is a severe lack of tools available – tools that we typically take for granted when building out very large datasets that we actually want to do something with after we’ve stored all that data.  If you decide to go down the NoSQL route, be aware of the trade-offs you’re making, either consciously or not – assumptions can kill a project.
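For what it’s worth, the split described above might look something like the sketch below: raw documents into MongoDB, one inverted-index row per (entity, document) pair into Cassandra.  The connection details, keyspace, and table are assumptions for illustration, not a description of the actual DarkStar deployment:

    # Assumes local MongoDB and Cassandra instances, a keyspace named 'otc', and a
    # table: entity_index (entity text, doc_id text, PRIMARY KEY (entity, doc_id)).
    from pymongo import MongoClient
    from cassandra.cluster import Cluster

    documents = MongoClient("localhost", 27017)["otc"]["documents"]
    session = Cluster(["127.0.0.1"]).connect("otc")

    # Store the raw source document.
    doc_id = documents.insert_one({"source": "twitter",
                                   "text": "ABLE rallies on heavy volume."}).inserted_id

    # Store one inverted-index row per (entity, document) pair.
    session.execute("INSERT INTO entity_index (entity, doc_id) VALUES (%s, %s)",
                    ("ABLE", str(doc_id)))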

AND AS USUAL

Thanks for reading!


More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud based event processing using map/reduce, complex event processing, and event driven pattern matching agents. You can also send topic suggestions or questions to colin@cloudeventprocessing.com.