The Text Analytics Pipeline or Pipe Dream

First, we should not think of text analytics as one methodology but rather as a plurality of techniques used to solve a variety of text understanding problems. Each type of analysis produces additional information about the document (or query) and clarifies its intent. For example, one type of analysis might identify phone numbers and people’s names in the text so that queries like “what is Todd Leyba’s phone no.?” can be answered. Some analytics, such as language identification or natural language parsers that decompose a sentence into its grammatical parts, serve as building blocks for higher-level analytics. Research has found that the overall quality of some analytics improves when a hybrid of techniques, rule-based and statistical methods for example, is used rather than any one technique alone. It has also been found that different, often highly specialized, analytics are required depending on the corpus of documents (e.g., analyzing pharmaceutical vs. defense intelligence documents).
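To make the phone number example concrete, here is a minimal sketch of a single rule-based analytic. The Annotation class, the annotate_phone_numbers function, and the regular expression are all my own assumptions for illustration (the pattern only covers North American style numbers), and the sample numbers in the usage note are made up.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record of one analysis result over a span of text."""
    kind: str
    begin: int
    end: int
    text: str


# Assumed pattern: numbers shaped like 555-123-4567 or (555) 123-4567.
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")


def annotate_phone_numbers(text):
    """Return one PhoneNumber annotation per regex match, in document order."""
    return [
        Annotation("PhoneNumber", m.start(), m.end(), m.group())
        for m in PHONE_RE.finditer(text)
    ]
```

Calling annotate_phone_numbers("Todd Leyba can be reached at 555-123-4567 or (555) 987-6543.") should yield two annotations whose offsets point back into the original text. A search engine could then index those spans as a concept rather than as plain words. A real analytic would need far more care with international formats and false positives.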
So it is evident that a large number of analytics can be applied to a single piece of text, and as the search domain changes, so can the number and type of analytics. All of this becomes

Fortunately, there is a standard proposed by IBM for just such a text analytics pipeline. It is referred to as the Unstructured Information Management Architecture (UIMA). UIMA defines a framework in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work of analyzing documents and recording analysis results (for example, detecting person names). These algorithms are packaged within components called Annotators. AEs are the stackable containers for annotators and other analysis engines. To try out the UIMA software framework, download the free UIMA Software Development Kit (SDK) from IBM’s alphaWorks site. IBM’s enterprise search engine, OmniFind, allows UIMA annotators to be plugged into the OmniFind processing flow, enabling semantic search to be performed on the extracted concepts.
It is important to remember that UIMA does not provide the text analytics itself but rather the standard to which the analytics are written. Once the analytics are written to this standard, UIMA makes it easy to plug-n-play these analyzers together to form your ultimate search solution (which may still be a pipe dream).
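The idea of stackable containers is easier to see in code. What follows is a conceptual sketch only: it is not the UIMA SDK API, and every name in it (Document, RegexAnnotator, ContactAnnotator, AggregateEngine) is hypothetical. It assumes that every component exposes the same process method and writes its results to a shared document object, which is what lets annotators and engines nest inside one another.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical shared container for the text and all analysis results."""
    text: str
    annotations: list = field(default_factory=list)  # (kind, begin, end) tuples


class RegexAnnotator:
    """Hypothetical annotator: records every regex match as an annotation."""

    def __init__(self, kind, pattern):
        self.kind = kind
        self.pattern = re.compile(pattern)

    def process(self, doc):
        for m in self.pattern.finditer(doc.text):
            doc.annotations.append((self.kind, m.start(), m.end()))


class ContactAnnotator:
    """Hypothetical higher-level annotator that builds on earlier results."""

    def process(self, doc):
        names = [a for a in doc.annotations if a[0] == "PersonName"]
        phones = [a for a in doc.annotations if a[0] == "PhoneNumber"]
        for name in names:
            # Pair each name with the first phone number that starts after it.
            following = [p for p in phones if p[1] >= name[2]]
            if following:
                doc.annotations.append(("Contact", name[1], following[0][2]))


class AggregateEngine:
    """Hypothetical container: runs annotators or nested engines in order."""

    def __init__(self, *components):
        self.components = components

    def process(self, doc):
        for component in self.components:
            component.process(doc)


# Two naive low-level annotators stacked inside one engine...
basic = AggregateEngine(
    RegexAnnotator("PersonName", r"\b[A-Z][a-z]+ [A-Z][a-z]+\b"),
    RegexAnnotator("PhoneNumber", r"\b\d{3}-\d{3}-\d{4}\b"),
)
# ...which is itself stacked, with a higher-level annotator, inside another.
pipeline = AggregateEngine(basic, ContactAnnotator())

doc = Document("Todd Leyba can be reached at 555-123-4567.")
pipeline.process(doc)
```

Because an engine and an annotator look the same from the outside, swapping in a better name detector means changing one line of the pipeline definition. The hard part, of course, is that the analytics themselves still have to be written.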
So why do I feel that achieving search nirvana via text analytics might still be a pipe dream? Actually, pipe dream may be too strong a phrase because it implies something that can never materialize. It is really a question of time to adoption. UIMA was announced as open source in August of 2005. Since that time there have been thousands of downloads of the SDK, and a host of third-party vendors have announced their use of UIMA to wrap and deploy their analysis capabilities.
All of this is a good start and shows promise for the text analytics industry. But we are not yet at the point where you can effectively plug-n-play with your favorite text analytics (à la carte) without some customization and corresponding services work to fill in the gaps. But let’s give it a chance. UIMA is only a year old and, like any child, needs time to grow.