This blog is dedicated to the in-depth review, analysis and discussion of technologies related to the search and discovery of information. This blog represents my views only and does not reflect those of my employer, IBM.

Wednesday, May 17, 2006

The Text Analytics Pipeline or Pipe Dream

Text analytics promises a more efficient way to find information: rather than limiting your search to a handful of keywords, you can express yourself in more natural ways. It allows the search engine to understand the meaning behind your query and correctly match that with the meaning conveyed by the documents being searched. But what exactly is text analytics, and how is it used to improve your search?

First, we should not think of text analytics as one methodology but rather as a plurality of techniques used to solve a variety of text understanding problems. Each type of analysis produces additional information about the document (or query) and clarifies its intent. For example, one type of analysis might identify phone numbers and people’s names in the text so that queries like “what is Todd Leyba’s phone no.?” can be answered. Some analytics, such as language identification or natural language parsers that decompose a sentence into its grammatical parts, serve as building blocks for higher-level analytics. Research has found that the overall quality of some analytics improves when techniques are combined in a hybrid, rule-based and statistical methods for example, rather than used alone. It has also been found that different, often highly specialized, analytics are required depending on the corpus of documents (e.g., analyzing pharmaceutical vs. defense intelligence documents).
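To make the phone number example concrete, here is a minimal Python sketch of a rule-based analyzer. Everything in it is an assumption for illustration: the `Annotation` record, the `annotate_phone_numbers` function, the pattern (which only covers numbers shaped like 555-123-4567), and the sample sentence with its made-up number. A real analyzer would need far more robust rules or a statistical model.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    """A piece of deduced data attached to a span of the text (hypothetical type)."""
    kind: str
    start: int
    end: int
    text: str


# Assumed pattern: three digits, three digits, four digits,
# separated by a dash, a dot, or whitespace.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def annotate_phone_numbers(text: str) -> List[Annotation]:
    """Return one 'phone' annotation per match, with character offsets."""
    return [
        Annotation("phone", m.start(), m.end(), m.group())
        for m in PHONE_RE.finditer(text)
    ]


# Made-up sample sentence; the number is not a real one.
sample = "Call Todd Leyba at 555-123-4567 before noon."
phones = annotate_phone_numbers(sample)
```

The point is that the analyzer does not change the text. It records where the phone number sits, and a search engine could later index that annotation alongside the keywords.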

So it is evident that a large number of analytics can be applied to a single piece of text, and as the search domain changes, so can the number and type of analytics. All of this becomes problematic for the search engine. Is the search engine expected to provide all of the required analytics? This is highly unlikely. No one company can have the resources or expertise in all domains. It is more reasonable to expect a search engine to provide a framework that allows an analytic to be plugged in as needed. The search engine would provide some general-purpose analyzers out of the box but should be extensible with more specialized analyzers obtained from elsewhere. The search engine would provide a kind of pipeline where documents are fed past the right sequence of chosen text analyzers. Each analyzer would perform its analysis of the text and annotate the document with its deduced data and/or make that information available to the next analyzer.
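The pipeline idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `Document` class, the two toy analyzers, and `run_pipeline` are hypothetical names, and the heuristics are deliberately naive. What it shows is the shape of the thing: each analyzer annotates the document, and a later analyzer can read what an earlier one deduced.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Text plus the annotations accumulated as it moves down the pipeline."""
    text: str
    annotations: Dict[str, List[str]] = field(default_factory=dict)


# An analyzer is any callable that reads a document and annotates it in place.
Analyzer = Callable[[Document], None]


def language_identifier(doc: Document) -> None:
    """Toy heuristic: call the text English if it contains common English words."""
    words = set(doc.text.lower().split())
    is_english = bool(words & {"the", "is", "and"})
    doc.annotations["language"] = ["en"] if is_english else ["unknown"]


def capitalized_word_finder(doc: Document) -> None:
    """Depends on the earlier analyzer: only runs on text marked as English."""
    if doc.annotations.get("language") != ["en"]:
        return
    # Skip the first word, which is capitalized only because it starts the sentence.
    candidates = doc.text.split()[1:]
    doc.annotations["capitalized"] = [
        w.strip(".,?!") for w in candidates if w[:1].isupper()
    ]


def run_pipeline(doc: Document, analyzers: List[Analyzer]) -> Document:
    """Feed the document past each analyzer in the chosen order."""
    for analyze in analyzers:
        analyze(doc)
    return doc


doc = run_pipeline(
    Document("The report is from Todd Leyba."),
    [language_identifier, capitalized_word_finder],
)
```

Swapping in a more specialized analyzer means adding another callable to the list, which is the plug-in behavior I would expect a search engine framework to offer.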

Fortunately, there is a standard proposed by IBM for just such a text analytics pipeline. It is referred to as the Unstructured Information Management Architecture (UIMA). UIMA defines a framework in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work to analyze documents and record analysis results (for example, detecting person names). These algorithms are packaged within components that are called Annotators. AEs are the stackable containers for annotators and other analysis engines. To try out the UIMA software framework, download the free UIMA Software Development Kit (SDK) from IBM’s alphaWorks site. IBM’s enterprise search engine, OmniFind, allows UIMA annotators to be plugged into the OmniFind processing flow, enabling semantic search to be performed on the extracted concepts.
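The "stackable container" idea is easier to see in code. The sketch below is plain Python and is not the UIMA SDK API: the `Annotator` and `AnalysisEngine` classes, their `process` method, and the two toy rules are all hypothetical, written only to show how an engine can hold annotators and other engines alike.

```python
import re
from typing import Callable, Dict, List, Union


class Annotator:
    """Wraps one analysis algorithm (hypothetical class, not the UIMA interface)."""

    def __init__(self, name: str, algorithm: Callable[[str], List[str]]) -> None:
        self.name = name
        self.algorithm = algorithm

    def process(self, text: str, results: Dict[str, List[str]]) -> None:
        # Record this annotator's findings under its own name.
        results.setdefault(self.name, []).extend(self.algorithm(text))


class AnalysisEngine:
    """A container that holds annotators and other engines, run in order."""

    def __init__(self, name: str,
                 components: List[Union[Annotator, "AnalysisEngine"]]) -> None:
        self.name = name
        self.components = components

    def process(self, text: str, results: Dict[str, List[str]]) -> None:
        # Annotators and nested engines share the same process() call,
        # so the container does not care which one it is holding.
        for component in self.components:
            component.process(text, results)


# Two toy algorithms, chosen only for illustration.
numbers = Annotator("numbers", lambda t: re.findall(r"\d+", t))
acronyms = Annotator("acronyms", lambda t: re.findall(r"\b[A-Z]{2,}\b", t))

# An engine nested inside another engine.
inner = AnalysisEngine("numeric", [numbers])
outer = AnalysisEngine("top-level", [inner, acronyms])

results: Dict[str, List[str]] = {}
outer.process("UIMA was announced as open source in August of 2005.", results)
```

Because a nested engine looks the same to its container as a single annotator, a vendor could ship a whole stack of analyzers as one unit.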

It is important to remember that UIMA does not provide the text analytics itself but rather the standard to which the analytics are written. Once the analytics are written to this standard then UIMA makes it easy to plug-n-play these analyzers together to form your ultimate search solution (which may still be a pipe dream).

So why do I feel that achieving search nirvana via text analytics might still be a pipe dream? Actually, pipe dream may be too strong a phrase because it implies something that can never materialize. It is really all about time to adoption. UIMA was announced as open source in August of 2005. Since that time there have been thousands of downloads of the SDK, and a host of third-party vendors have announced their use of UIMA to wrap and deploy their analysis capabilities.

All of this is a good start and shows promise for the text analytics industry. But we are not yet at the point where you can effectively plug-n-play with your favorite text analytics (à la carte) without some customization and corresponding services work to fill in the gaps. But let’s give it a chance. UIMA is only a year old and, like any child, needs time to grow.

