This blog is dedicated to the in-depth review, analysis and discussion of technologies related to the search and discovery of information. This blog represents my views only and does not reflect those of my employer, IBM.


Wednesday, May 31, 2006

Enterprise Search Summit 2006 Review

I just returned from the 2006 Enterprise Search Summit in NYC and must say that it was encouraging to see the tremendous interest and growth in the enterprise search market. There were about 1,000 attendees, a threefold increase from last year. All of the search vendors were there, along with strong attendance by enterprise customers seeking search solutions. What follows are some of my observations, views, and experiences from the conference.

Keynote Speaker:

Peter Morville, President of Semantic Studios, gave an interesting speech about “Ambient Findability,” which coincidentally is the title of his new book. Schooled as a librarian, Peter offered a more holistic view of search, defining it not just as the search and retrieval of text documents but as the ability to find anyone or anything from anywhere at any time. He cited Google Maps, PodZinger, and even Cisco’s wireless location appliance as examples of alternative forms of enterprise search. He railed against Google’s suggestion that there is just one simple-to-use interface (think “OneBox”) with its persistent “flat” list of results. Peter feels that it’s more about navigation and wayfinding through a mashup of digital content in all shapes and forms. Peter also reminded us of the rapid adoption of mobile devices such as cell phones, BlackBerrys, and iPods, which are morphing into multi-informational devices (e.g., cell phones to find restaurants, iPods to watch video, etc.). Overall, I thought Peter’s talk was good, stimulating me to think outside of the conventional enterprise search box (probably worth reading his book).

Who’s Hot and Who’s Not

At the end of the conference Steve Arnold, an independent search analyst, gave his views of Who’s Hot in the enterprise search industry. First, he identified three major trends in the industry. They are:
  • Search Platforms – Steve identified IBM, Oracle, Microsoft, FAST, and Autonomy as vendors providing enterprise search platforms. His message was that if you are in the process of selecting an enterprise search engine and you choose a product from any one of these vendors, then you are locked in, which makes it extremely difficult and costly to switch later.
  • Appliances and APIs – While most in the audience could only name Google as providing a search appliance, Steve said that most of the vendors he talked to had plans for, or were on the verge of releasing, their own search appliances. Steve also said that although “appliance” implies dead-easy to use, he wasn’t so sure that could be achieved given the complexity of enterprise content. As for APIs, Steve said that everyone has one, but what to look for are APIs that are SOA (Web Services) or REST based.
  • Specialization – Steve also felt that more and more vendors are turning toward very domain-specific search solutions. He cited Oracle and Convera as leaders in specialization.

At the top of Steve’s list of who was hot was Vivisimo. He felt that Vivisimo, overall, had the most robust set of functionality. Besides Vivisimo’s well-known result-set clustering and other capabilities, Steve noted their move toward federation. Steve felt that there is no one search engine for the enterprise, primarily because of embedded search, and that Vivisimo can be used to help solve this multi-search-system problem. Other vendors he highly recommended looking at were Endeca, Coveo, and MondoSoft. Coveo and MondoSoft are partnered with Microsoft and have plug-ins for the SharePoint server.

Conference Sessions

Besides Peter Morville’s challenge to think outside of the box, I didn’t see or hear anything dramatically new in the enterprise search space. Taxonomies, multi-faceted search, metadata tagging, and search engine optimization were recurring themes throughout the conference.

User Interface Design

I was hoping to see some dramatic breakthroughs in user interface design but didn’t. Several speakers dismissed the use of fancy graphics (e.g., 3D interactive topic cluster maps) for navigation, which they felt were non-intuitive, too abstract, and eye candy for the programmer but useless for the end user. Almost all of the user interfaces I saw had some form of multi-faceted search to aid in the navigation of results. I came to realize that this is essential to the user search experience. Rather than relying on the relevance algorithm alone to put the right answer at the top of the list, it is better to add alternative ways to find the right answer with just one (or more) clicks. This is different from forcing the user to augment their query with new or additional terms (both thought- and time-consuming). Note that Google has not embraced multi-faceted search and still advocates the “flat” result set, probably because it is constrained by the type and amount of metadata available from the web.
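The mechanics behind this are worth sketching: build facets by counting the metadata values present in the current result set, and treat a facet click as a filter rather than a new query. A minimal illustration (the data and field names are hypothetical, not any vendor's engine):

```python
from collections import Counter

# Hypothetical result set: each hit carries metadata fields (facets).
results = [
    {"title": "Q3 sales report", "type": "pdf", "author": "Smith"},
    {"title": "Sales forecast", "type": "doc", "author": "Jones"},
    {"title": "Sales deck", "type": "pdf", "author": "Jones"},
]

def facet_counts(hits, field):
    """Count how many hits fall under each value of a metadata field."""
    return Counter(h[field] for h in hits)

def drill_down(hits, field, value):
    """One click on a facet narrows the result set -- no new query terms."""
    return [h for h in hits if h[field] == value]

print(facet_counts(results, "type"))                # Counter({'pdf': 2, 'doc': 1})
print(len(drill_down(results, "author", "Jones")))  # 2
```

The key design point is that the facet counts are recomputed from whatever result set is on screen, so each click keeps narrowing without the user having to think up new terms.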

Google

Google’s big play in the enterprise search market is the introduction of the Google OneBox. Google offers an API that lets a developer connect search queries to an external data system. Queries matching a developer-defined pattern are passed on to the external system, and its results are displayed at the top of the search results. Google already provides connectors to Oracle, Cognos, SAS, Cisco, and SalesForce.com, with more to come. The strategy is to use the Google search box not only as the single point of access to all web content but to your enterprise applications as well. This is quite a potential threat and has many search vendors nervous. For the non-search vendors, it is viewed as a way to expose their often complex functionality through an already accepted, dead-simple interface, resulting in more awareness of their products.
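The routing idea can be sketched roughly. Everything below is a hypothetical stand-in, not the actual OneBox API; it only illustrates how a query matching a developer-defined pattern gets handed off to an external system, with that system's answer surfaced above the regular results:

```python
import re

# Hypothetical registry mapping query patterns to external data systems.
providers = []

def register(pattern, handler):
    providers.append((re.compile(pattern, re.IGNORECASE), handler))

def lookup_phone(match):
    # Stand-in for a call out to a real directory backend.
    directory = {"todd leyba": "555-0142"}
    return directory.get(match.group("name").lower())

register(r"phone (for|of) (?P<name>.+)", lookup_phone)

def onebox_results(query):
    """Return results from the first provider whose pattern matches."""
    for pattern, handler in providers:
        m = pattern.search(query)
        if m:
            hit = handler(m)
            if hit:
                return [hit]   # displayed above the regular search results
    return []   # no match: fall through to ordinary search

print(onebox_results("phone for Todd Leyba"))  # ['555-0142']
```

Queries that match no pattern fall through untouched, which is why the search box can stay the single, simple entry point.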

Actionable results

We have all heard of the requirement to highlight search terms in a document after it is clicked on for display and to position the user at the place in the document where the search terms are highlighted (useful for very large documents). But Inxight took this concept several steps further. They provide a search extender for the Google Appliance and Desktop Search that extracts people’s names, companies, and 25 other entities from the result document and presents them as facets on the left for navigation into the document itself. The document is first clicked on and displayed from the Google view cache (search terms already highlighted). Then, by clicking on any of the facets on the left (say, people), the document is positioned at the page that contains the highlighted facet.

Best Bets

Surprisingly, a lot of time was devoted to the discussion of “Best Bets,” or “QuickLinks” as we know them. The message was that the overall goal of the search engine is to help users find what they are looking for, and that no matter how good the search engine is, it is not always going to produce the most relevant results. Speakers encouraged search administrators to closely examine their top queries and manually provide QuickLinks for those results to dramatically increase customer satisfaction. While admitting that this is a kind of crutch, they felt it was necessary to keep customer satisfaction high while you figure out why the search engine was not placing the right results at the top. They also pointed out that the right answer should always be number one, not number three, and that Best Bets is one way to ensure that it is at the top. In any case, what I got out of this is that our search products should have a robust search-quality reporting system that produces the information needed to perform this type of analysis. At a minimum there need to be the following reports:

  1. Top queries (by query term and submission frequency)
  2. Queries with no results
  3. Results with no click-through
  4. Next page of results requested
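Given a raw query log, all four reports reduce to simple aggregations. A minimal sketch, assuming a hypothetical log schema of (query, result count, clicked?, paged past first page?):

```python
from collections import Counter

# Hypothetical query log entries: (query, result_count, clicked, paged).
log = [
    ("vpn setup",    120, True,  False),
    ("vpn setup",    120, False, True),
    ("expense form",   0, False, False),
    ("vpn setup",    120, True,  False),
    ("org chart",     45, False, False),
]

def top_queries(log, n=10):
    """Report 1: queries ranked by submission frequency."""
    return Counter(q for q, *_ in log).most_common(n)

def no_results(log):
    """Report 2: queries that returned nothing."""
    return sorted({q for q, hits, *_ in log if hits == 0})

def no_clickthrough(log):
    """Report 3: queries whose results were never clicked."""
    clicked = {q for q, _, c, _ in log if c}
    return sorted({q for q, hits, _, _ in log if hits > 0} - clicked)

def paged_past_first(log):
    """Report 4: queries where users went to the next page of results."""
    return sorted({q for q, _, _, paged in log if paged})

print(top_queries(log, 1))    # [('vpn setup', 3)]
print(no_results(log))        # ['expense form']
print(no_clickthrough(log))   # ['org chart']
print(paged_past_first(log))  # ['vpn setup']
```

Each report points at a different failure mode: no results suggests vocabulary or coverage gaps, no click-through suggests relevance problems, and paging suggests the right answer is buried below the fold.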

Social Bookmarking and Folksonomies

A poll of the audience showed that the majority were working on content taxonomies in one form or another as a way of augmenting their search solutions. Several speakers addressed the challenges of taxonomy generation, which relies heavily on tagging documents with metadata. Automatic metadata generation (e.g., clustering) is still in its infancy and not heavily used, so most companies rely on either professional or author-created metadata. Using professionals can be expensive, especially in an enterprise with vast amounts of information. Author-generated metadata can be inadequate, inaccurate, or outright deceptive, so several speakers talked about leveraging the user community as a way to help solve the problem.

The idea is to let your users organize the content for their own use as they see fit - much the way we bookmark web pages with our own terms. The key is to then make these tags available to the rest of the user community. The result is an unpredictable but highly accurate folksonomy of documents. This initial tagging can then be used as the basis for building the ultimate taxonomy. Note that a taxonomy generally implies a hierarchy, whereas a folksonomy is one level deep (a flat list of words associated with the document). Many web-based folksonomies were cited - Del.icio.us, Flickr, and even IBM’s DogEar, to name a few. But speakers indicated that the model is working its way into the enterprise. More vendors are providing tools for the tagging of search results and server-based components to share those tags and/or assist in the overall taxonomy-generation process.
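The tagging model itself is simple to sketch: each user attaches free-form labels to a document, and the shared counts form the flat folksonomy. A small illustration (all names and data are hypothetical):

```python
from collections import defaultdict, Counter

# doc_id -> tag -> how many users applied it
tags = defaultdict(Counter)

def tag(user, doc_id, *labels):
    """A user bookmarks a document with their own terms."""
    for label in labels:
        tags[doc_id][label.lower()] += 1   # normalize casing when sharing

tag("alice", "d1", "benefits", "HR")
tag("bob",   "d1", "benefits", "vacation")

def folksonomy(doc_id, n=5):
    """The community's most common labels: a flat list, no hierarchy."""
    return [t for t, _ in tags[doc_id].most_common(n)]

print(folksonomy("d1"))   # 'benefits' ranks first with two votes
```

The agreement counts are what make the folksonomy useful as raw material for a taxonomy: terms many users converge on are good candidates for formal categories.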

Conference Grade (B+)

Overall I found the conference to be very informative and worthwhile, and I plan to participate again in the future. Don’t hesitate to contact me if you have any questions or would like to discuss a topic in more detail.




Wednesday, May 17, 2006

The Text Analytics Pipeline or Pipe Dream

Text analytics promises to help you find information more efficiently, extending your search beyond just keywords and letting you express yourself in more natural ways. It allows the search engine to understand the meaning behind your query and correctly match it with the meaning conveyed by the documents being searched. But what exactly is text analytics, and how is it used to improve your search?

First, we should not think of text analytics as one methodology but rather as a plurality of techniques used to solve a variety of text-understanding problems. Each type of analysis produces additional information about the document (or query) and clarifies its intent. For example, one type of analysis might be used to identify phone numbers and/or people’s names in the text so that queries like “What is Todd Leyba’s phone no.?” can be asked. Some analytics are used as building blocks for higher-level analytics, such as language identification or natural language parsers that decompose a sentence into its grammatical parts. Research has found that the overall quality of some analytics can be improved when a hybrid of techniques (rule-based and statistical methods, for example) is combined rather than either used alone. It has also been found that different, oftentimes highly specialized, analytics are required depending on the corpus of documents (e.g., analyzing pharmaceutical vs. defense intelligence documents).

So it is evident that a large number of analytics can be applied to a single piece of text, and as the search domain changes, so can the number and type of analytics. All of this becomes problematic for the search engine. Is the search engine expected to provide all of the required analytics? This is highly unlikely; no one company has the resources or expertise in all domains. It is more reasonable to expect a search engine to provide a framework that allows an analytic to be plugged in as needed. The search engine would provide some general-purpose analyzers out of the box but should be extendable with more specialized analyzers obtained elsewhere. The search engine would provide a kind of pipeline where documents are fed past the chosen sequence of text analyzers. Each analyzer would perform its analysis of the text, annotate the document with its deduced data, and/or make that information available to the next analyzer.
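Such a pipeline can be sketched in a few lines: each analyzer receives the document along with the annotations recorded so far, and adds its own. This is only a rough illustration of the idea, with hypothetical names, not any vendor's actual framework:

```python
import re

def language_identifier(doc):
    # A building-block analytic whose output later stages can rely on.
    # (Hard-coded here; a real identifier would examine the text.)
    doc["annotations"]["language"] = "en"
    return doc

def phone_number_annotator(doc):
    # A domain-specific analytic: mark up phone numbers in the text.
    doc["annotations"]["phone_numbers"] = re.findall(
        r"\b\d{3}-\d{4}\b", doc["text"])
    return doc

def run_pipeline(text, analyzers):
    """Feed the document past the chosen sequence of text analyzers."""
    doc = {"text": text, "annotations": {}}
    for analyze in analyzers:
        doc = analyze(doc)   # each stage sees prior stages' results
    return doc

doc = run_pipeline("Call Todd Leyba at 555-0142.",
                   [language_identifier, phone_number_annotator])
print(doc["annotations"])
# {'language': 'en', 'phone_numbers': ['555-0142']}
```

Because each stage only reads and writes the shared annotation store, a specialized analyzer from one vendor can in principle sit next to a general-purpose one from another, which is exactly the plug-in property the framework needs.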

Fortunately, there is a standard proposed by IBM for just such a text analytics pipeline: the Unstructured Information Management Architecture (UIMA). UIMA defines a framework in which basic building blocks called Analysis Engines (AEs) are composed in order to analyze a document. At the heart of AEs are the analysis algorithms that do all the work of analyzing documents and recording analysis results (for example, detecting person names). These algorithms are packaged within components called Annotators. AEs are the stackable containers for annotators and other analysis engines. To try out the UIMA framework, download the free UIMA Software Development Kit (SDK) from IBM’s alphaWorks site. IBM’s enterprise search engine, OmniFind, allows UIMA annotators to be plugged into the OmniFind processing flow, enabling semantic search over the extracted concepts.

It is important to remember that UIMA does not provide the text analytics itself but rather the standard to which the analytics are written. Once the analytics are written to this standard, UIMA makes it easy to plug and play these analyzers together to form your ultimate search solution (which may still be a pipe dream).

So why do I feel that achieving search nirvana via text analytics might still be a pipe dream? Actually, “pipe dream” may be too strong a phrase, because it implies something that can never materialize; really, it is all about time to adoption. UIMA was announced as open source in August of 2005. Since that time there have been thousands of downloads of the SDK, and a host of third-party vendors have announced their use of UIMA to wrap and deploy their analysis capabilities.

All of this is a good start and shows promise for the text analytics industry. But we are not yet at the point where you can effectively plug and play your favorite text analytics (à la carte) without some customization and corresponding services work to fill in the gaps. But let’s give it a chance. UIMA is only a year old and, like any child, needs time to grow.




Wednesday, May 03, 2006

How Secure Is Your Search?

If I had to choose, I would definitely pick search security as one of the most challenging requirements to fulfill when building an enterprise search product. By search security I mean that you, as an end user, will only be able to search and view those documents that you have been granted access to. I’d like to use this first post to present some of the problems you might encounter and possible solutions for supporting search security. I’ve tried to keep this posting brief; for a more in-depth discussion you can read Enterprise Search Security.

The information in an enterprise can exist in many shapes and forms and is managed by the software most appropriate for the task at hand. Controlling access to sensitive information contained within these repositories is typically enforced by the managing software. The extent to which the information is secured can vary from system to system, each enforcing its own security policies and requirements. For example, file systems generally control read, write, and execute operations on files. Contrast a file system’s security model with that of a relational database management system, which can control access to individual columns of data, or a document management system, which can limit access to a specified period of time.

The diversity in security models for the different types of enterprise content is problematic for enterprise search engines. The primary goal of an enterprise search engine is to provide quick and relevant responses to inquiries for documents that users are authorized to see. In order to meet the performance and relevance requirements, most search engines build an optimized index that represents the content to be searched. Rather than searching the original content, the user is actually posting queries to the index – much like searching a card catalog in a library. The index is therefore composed of documents extracted from the various backend data sources. These data sources were crawled with credentials of sufficient authority to access and extract all of the documents in each source. Consequently, the initial document access rights of an enterprise search index represent the access rights of the crawler. But how does the search engine enforce an individual user’s access rights rather than what the crawler was allowed to see?

One approach is for the search engine to provide its own security model: the administrator of the search engine defines individual access rights to the cataloged documents. This approach has several drawbacks. First, it attempts to normalize the documents’ native access controls into its own. This dictates a common security model, one that can represent all of the security models of the sources contributing to the index. As previously demonstrated, this may not be practical or possible as the number of different source types increases. Second, this approach requires the administrator to redefine controlled access to documents that has already been defined in the originating repositories – an unnecessary and duplicative task. And lastly, the approach implies that the administrator has enterprise-wide knowledge of the access controls for all enterprise content – an unlikely situation. Ideally, the search engine should honor the access rights of the documents as defined by their native software. This could be accomplished in two ways.

First, we could automatically copy a document’s native Access Control Lists (ACLs), as defined by its hosting software, into the index of the search engine. Although this approach reduces the burden on the administrator, it has several shortcomings. If the native ACLs are to retain their original security model, then the search engine would need to re-implement the corresponding security mechanisms used by the backend to interpret those ACLs. This could be a daunting task. Alternatively, the search engine could try to normalize these ACLs into a single model so that a single security-filtering mechanism could be used. But again, a truly normalized model may not be achievable; the result would be a security model representing the least common denominator of all the contributing repositories.

The second approach is not to maintain any security information in the index at all. In response to a query, and just before the result set is presented to the user, the search engine removes those documents the user is not allowed to see by consulting in real time with each document’s originating backend repository. The search engine would, in a sense, be impersonating the end user when interacting with the native repository. Through impersonation, the search engine asks the native repository whether the user can have access to one or more documents that were previously crawled and extracted from that source. This approach has several advantages. First, document access is controlled by the native security mechanisms of the originating repository, however complex they may be. Second, the filtering is done in real time, thus reflecting the latest native ACL changes for any given document. However, impersonation does require connectivity to all of the backend repositories that have contributed to the index. If a particular backend is not available, then the disposition of a document cannot be determined. This may not be so dire: if the backend is not available, then the document probably cannot be viewed anyway. Under this condition the document would simply be removed from the result set.
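The impersonation filter can be sketched as follows, with hypothetical stand-ins for the backend repositories and their access checks:

```python
# Sketch of result-set filtering by impersonation: before returning hits,
# ask each document's originating repository whether this user may see it.

class Repository:
    """Hypothetical stand-in for a backend repository's security check."""
    def __init__(self, acls):
        self._acls = acls          # doc_id -> set of allowed users
        self.available = True

    def can_read(self, user, doc_id):
        return user in self._acls.get(doc_id, set())

def filter_results(user, hits, repositories):
    """Keep only hits the user may see; drop hits whose backend is down."""
    allowed = []
    for doc_id, repo_name in hits:
        repo = repositories[repo_name]
        if repo.available and repo.can_read(user, doc_id):
            allowed.append(doc_id)
        # An unreachable backend means the user could not view the
        # document anyway, so it is silently removed from the results.
    return allowed

repos = {"hr":   Repository({"d1": {"alice"}}),
         "wiki": Repository({"d2": {"alice", "bob"}})}
hits = [("d1", "hr"), ("d2", "wiki")]   # (doc_id, originating repository)

print(filter_results("bob", hits, repos))   # ['d2']
repos["wiki"].available = False
print(filter_results("bob", hits, repos))   # []
```

Note the per-hit backend call inside the loop: that is exactly the cost the performance discussion below is concerned with.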

Of greater concern is the performance of the impersonation approach. Search indexes are optimized for speed, generally producing sub-second response times. With the impersonation approach described above, a considerable amount of time would be added to communicate with each backend to determine whether the documents should be included in the final result set. The more repositories represented in the result set, the greater the number of communications. The problem is compounded when a user is denied access to the majority of the results.

A more efficient approach is to combine the storage of native high-level ACLs in the index with real-time consultation of the originating repositories to determine which documents a user is allowed to see. Storing native high-level ACLs in the index is necessary to ensure adequate search performance, but alone it does not assure comprehensive document-level security. The host software of the document’s originating repository becomes the final arbiter of whether the user is allowed access, and thus guarantees enforcement of the document’s native ACLs.
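A rough sketch of this hybrid, again with hypothetical interfaces: the coarse ACLs captured at crawl time prune candidates cheaply inside the index, and the native repository gets the final word on each survivor.

```python
# Index entries: (doc_id, originating repository, coarse ACL at crawl time)
index = [
    ("d1", "hr",   {"alice"}),
    ("d2", "wiki", {"alice", "bob"}),
    ("d3", "wiki", {"bob"}),
]

def repository_allows(repo, user, doc_id):
    # Stand-in for the real-time backend check. Here, d3's ACL changed
    # after the crawl and bob's access was revoked at the source.
    revoked = {("wiki", "bob", "d3")}
    return (repo, user, doc_id) not in revoked

def search(user):
    # Step 1: cheap indexed-ACL filter; most unauthorized documents are
    # eliminated without any backend communication at all.
    candidates = [(d, r) for d, r, acl in index if user in acl]
    # Step 2: confirm each survivor with its native repository, which
    # remains the final arbiter of access.
    return [d for d, r in candidates if repository_allows(r, user, d)]

print(search("bob"))    # ['d2'] -- d3 pruned by the real-time check
print(search("alice"))  # ['d1', 'd2']
```

The pre-filter keeps the number of backend round trips proportional to the documents the user can plausibly see, while the real-time check catches ACL changes made since the last crawl.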
