This blog is dedicated to the in-depth review, analysis and discussion of technologies related to the search and discovery of information. This blog represents my views only and does not reflect those of my employer, IBM.

Wednesday, May 03, 2006

How Secure Is Your Search?

If I had to choose, I would pick search security as one of the most challenging requirements to fulfill when building an enterprise search product. By search security I mean that you, as an end user, can search and view only those documents that you have been granted access to. I’d like to use this first post to present some of the problems you might encounter in supporting search security, along with some possible solutions. I’ve tried to keep this posting brief. For a more in-depth discussion you can read Enterprise Search Security.

The information in an enterprise can exist in many shapes and forms and is managed by the most appropriate software for the task at hand. Access to sensitive information contained within these repositories is typically enforced by the managing software. The extent to which the information is secured can vary from system to system, with each system enforcing its own security policies and requirements. For example, file systems generally control read, write, and execute operations on files. Contrast that with a relational database management system, which can control access to individual columns of data, or a document management system, which can limit access to a specified period of time.

The diversity in security models for the different types of enterprise content is problematic for enterprise search engines. The primary goal of an enterprise search engine is to provide quick and relevant responses to inquiries for documents that users are authorized to see. To meet the performance and relevance requirements, most search engines build an optimized index that represents the content to be searched. Rather than searching the original content, the user is actually posting queries to the index, much like searching a card catalog in a library. The index is therefore composed of documents that were extracted from the various backend data sources. These backend data sources were crawled with credentials of sufficient authority to access and extract all of the documents for that data source. Consequently, the initial document access rights of an enterprise search index represent the access rights of the crawler. But how does the search engine restrict results to what each individual user is allowed to see, rather than what the crawler was allowed to see?

One approach is for the search engine to provide its own security model. The administrator of the search engine would define the individual access rights to the cataloged documents. This approach has several drawbacks. First, it attempts to replace the documents’ native access controls with its own. This dictates a common security model, one that can be used to represent all of the security models of the sources contributing to the index. As the examples above suggest, this may not be practical or even possible as the number of source types increases. Second, this approach requires the administrator to redefine access controls that have already been defined in the originating repositories, an unnecessary and duplicative task. And lastly, the approach implies that the administrator has enterprise-wide knowledge of the access controls for all enterprise content, an unlikely situation. Ideally the search engine should honor the access rights of each document as defined by its native software. This could be accomplished in two ways.

First, we could automatically copy each document’s native Access Control Lists (ACLs), as defined by its hosting software, into the index of the search engine. Although this approach reduces the burden on the administrator, it has several shortcomings. If the native ACLs are to retain their original security model, the search engine would need to re-implement the security mechanisms that each backend uses to interpret those ACLs. This could be a daunting task. Alternatively, the search engine could try to normalize these ACLs into a single model so that a single security filtering mechanism could be used. But again, a truly normalized model may not be achievable. The result would be a security model representing the least common denominator of all the contributing repositories.
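
To make the normalized variant concrete, here is a minimal sketch in Python. It assumes every native ACL has been flattened at crawl time into a plain set of user and group names stored with the document; the `IndexedDoc` type, the `filter_by_index_acl` function and the principal names are all hypothetical, not part of any product. Note how the flattening already discards anything a simple allow-list cannot express, such as column-level or time-limited access.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """A hypothetical index entry carrying a normalized copy of the native ACL."""
    doc_id: str
    source: str
    allowed: frozenset  # users/groups copied from the backend at crawl time


def filter_by_index_acl(hits, user_principals):
    """Keep only hits whose stored ACL shares at least one principal with the user."""
    principals = frozenset(user_principals)
    return [doc for doc in hits if doc.allowed & principals]


# Illustrative data only: three sources reduced to the same allow-list model.
hits = [
    IndexedDoc("fs:/hr/salaries.xls", "filesystem", frozenset({"group:hr"})),
    IndexedDoc("db:orders/42", "rdbms", frozenset({"group:sales", "user:alice"})),
    IndexedDoc("dm:policy-7", "docmgmt", frozenset({"group:everyone"})),
]

visible = filter_by_index_acl(hits, {"user:alice", "group:everyone"})
print([doc.doc_id for doc in visible])  # ['db:orders/42', 'dm:policy-7']
```

The filter is fast because it never leaves the index, but it is only as current as the last crawl and only as expressive as the common model.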

The second approach is not to maintain any security information in the index at all. In response to a query, and just before the result set is presented to the user, the search engine would remove those documents the user is not allowed to see by consulting in real time with each document’s originating backend repository. The search engine would, in a sense, be impersonating the end user when interacting with the native repository. Through impersonation, the search engine would be asking the native repository whether the user can have access to one or more documents that were previously crawled and extracted from it. This approach has several advantages. First, document access is controlled by the native security mechanisms of the originating repository, however complex they may be. Second, the filtering is done in real time and thus reflects the latest native ACL changes for any given document. However, impersonation does require connectivity to all of the backend repositories that have contributed to the index. If a particular backend is not available, then the disposition of a document cannot be determined. This may not be so dire. If the backend is not available, then the document probably cannot be viewed anyway. Under this condition the document would automatically be removed from the result set.
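
The post-filtering step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real connector API: each backend is represented by a hypothetical callable that answers "may this user see this document?", and a hypothetical `BackendUnavailable` exception stands in for a repository that cannot be reached.

```python
class BackendUnavailable(Exception):
    """Raised by a (hypothetical) repository connector that cannot be reached."""


def post_filter(hits, user, backends):
    """Ask each document's own repository whether `user` may see it.

    hits:     list of (source, doc_id) pairs returned by the index
    backends: dict mapping source name -> callable(user, doc_id) -> bool
    """
    visible = []
    for source, doc_id in hits:
        try:
            allowed = backends[source](user, doc_id)
        except BackendUnavailable:
            # Disposition unknown: drop the document, as described above.
            allowed = False
        if allowed:
            visible.append((source, doc_id))
    return visible


# Stand-in connectors for illustration only.
def filesystem_check(user, doc_id):
    return user == "alice"


def docmgmt_check(user, doc_id):
    raise BackendUnavailable("document management system is offline")


backends = {"filesystem": filesystem_check, "docmgmt": docmgmt_check}
hits = [("filesystem", "/hr/plan.doc"), ("docmgmt", "policy-7")]

print(post_filter(hits, "alice", backends))  # [('filesystem', '/hr/plan.doc')]
print(post_filter(hits, "bob", backends))    # []
```

Every hit costs one call to a backend here, which is exactly where the performance concern comes from.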

Of greater concern is the performance of the impersonation approach. Search indexes are optimized for speed, generally producing sub-second response times. With the impersonation approach described above, a considerable amount of time would be added to communicate with each backend to determine whether the documents should be included in the final result set. The more varied the sources represented in the result set, the greater the number of communications. The problem is compounded when a user is denied access to the majority of the results, because many more candidates must be checked to fill a single page of results.
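
A small simulation shows how quickly the number of checks grows when most results are denied. It is only a sketch: the `fill_page` function, the page size of 10 and the "1 in 20 visible" ratio are made-up assumptions, and each call to `can_view` stands in for one round trip to a backend repository.

```python
def fill_page(ranked_hits, can_view, page_size):
    """Walk the ranked hits until `page_size` visible ones are found.

    Returns the page and the number of access checks that were needed.
    """
    page, checks = [], 0
    for hit in ranked_hits:
        if len(page) == page_size:
            break
        checks += 1  # one simulated backend consultation
        if can_view(hit):
            page.append(hit)
    return page, checks


ranked = list(range(1000))  # stand-in for a ranked result list

# A user who may see everything needs one check per displayed result.
_, checks_open = fill_page(ranked, lambda hit: True, 10)

# A user who may see only 1 in 20 documents needs far more checks per page.
_, checks_restricted = fill_page(ranked, lambda hit: hit % 20 == 0, 10)

print(checks_open, checks_restricted)  # 10 181
```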

A more efficient approach would be to combine the storage of native high-level ACLs in the index with real-time consultation of the originating repositories to determine which documents a user is allowed to see. Storing native high-level ACLs in the index is necessary to ensure adequate search performance, but alone it does not assure comprehensive document-level security. The host software of the document’s originating repository becomes the final arbiter as to whether or not the user is allowed access, and thus guarantees enforcement of the document’s native ACL.
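
The combination can be sketched as a two-stage filter. Everything here is hypothetical and simplified: `high_level_acl` stands for whatever coarse-grained access information was copied into the index (for example, the groups allowed into a database or folder), and `backend_check` stands for the real-time consultation with the originating repository.

```python
def hybrid_filter(hits, user, user_principals, backend_check):
    """Stage 1: cheap pre-filter on ACLs stored in the index.
    Stage 2: authoritative check against the originating repository."""
    principals = frozenset(user_principals)
    candidates = [h for h in hits if h["high_level_acl"] & principals]
    visible = [h for h in candidates if backend_check(user, h)]
    return candidates, visible


# Illustrative index entries with coarse, high-level ACLs.
hits = [
    {"doc_id": "hr/salaries", "high_level_acl": frozenset({"group:hr"})},
    {"doc_id": "sales/forecast", "high_level_acl": frozenset({"group:sales"})},
    {"doc_id": "sales/draft", "high_level_acl": frozenset({"group:sales"})},
]

backend_calls = []


def backend_check(user, hit):
    """Stand-in for the repository, which knows the document-level rules."""
    backend_calls.append(hit["doc_id"])
    return hit["doc_id"] != "sales/draft"


candidates, visible = hybrid_filter(hits, "alice", {"group:sales"}, backend_check)
print([h["doc_id"] for h in visible])  # ['sales/forecast']
print(len(backend_calls))              # 2
```

The index-level filter removes the HR document without any backend traffic, and the repository still has the last word on the two that remain.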

