Dynamic Composition of Information Retrieval Techniques

 

Shlomo Zilberstein, Principal Investigator
Department of Computer Science
140 Governor's Drive
University of Massachusetts
Amherst, MA 01003-4610
Phone: (413) 545-4189
Fax: (413) 545-1249
Email: shlomo@cs.umass.edu
URL: http://www.cs.umass.edu/~shlomo/

James Allan, Co-Principal Investigator
Department of Computer Science
140 Governor's Drive
University of Massachusetts
Amherst, MA 01003-4610
Phone: (413) 545-3240
Fax: (413) 545-1789
Email: allan@cs.umass.edu
URL: http://www.cs.umass.edu/~allan/

 

WWW Page

URL: http://anytime.cs.umass.edu/shlomo/research/DCIR.html

Project Award Information

Keywords

Progressive processing, information retrieval, search engines, monitoring and control of computation, resource-bounded reasoning, Markov decision process

Project Summary

This project is a collaboration between the Information Retrieval Center and the Resource-Bounded Reasoning Lab at the University of Massachusetts. It is aimed at developing a dynamic approach for the construction of information retrieval search engines that will improve their flexibility, quality of service and robustness. Currently, these systems are built by integrating a fixed set of modules and techniques that perform tasks such as query formation, query optimization, query evaluation, precision improvement, and recall improvement. The new approach consists of context-dependent mechanisms for optimal selection of information retrieval techniques based on a probabilistic description of their performance. The approach addresses the high level of uncertainty regarding the run-time of complex retrieval techniques and their impact on the quality of the results. The current static approach to integration of information retrieval modules continues to produce small performance gains, but the systems are extremely specialized for each task, and it is not clear how well results generalize to new types of retrieval. This project offers an alternative approach with significant advantages because it allows a system to configure itself dynamically to the specific task at hand, to the person using the system, and to limited computational resources.

Project References

Collaborators

Dr. Abdel-Illah Mouaddib from centre de recherches en informatique de Lens (CRIL), France, is an international collaborator participating in this project.

Project Impact

We have completed the first year of the project and can report the following impact:

  1. Technology: using machine learning to predict the performance of IR techniques. We have analyzed a statistical query expansion technique (LCA) and found that we can predict more than half of the situations in which it will be useful. This result is critical to the success of the project, and is the first time this effect has been seen in IR.
  2. Technology: reusable meta-level control policies for progressive processing tasks. We have developed a technique to select the best subset of IR techniques for a given query by solving a corresponding Markov decision problem. In order to factor the effect of the load on the system, the reward for processing a query includes an opportunity cost -- the cost of taking computational resources away from the remaining queries. Several fast approximation schemes for the opportunity cost have been developed, providing an effective framework for managing computational resources in highly dynamic environments.
  3. Technology: IR system design. Partly funded by this project, the CIIR is re-architecting its core information retrieval system into a thread-based system of cooperating components. The new architecture, called Metamorph, is being designed to allow configuration of an IR system for a wide range of tasks. The architecture is influenced by the goals of this project--i.e., rapid and lightweight reconfiguration of a system based on the meta-level control policies.
  4. Graduate education: The project supports two doctoral students, one in the Resource-Bounded Reasoning Lab (Andrew Arnt) and one in the Information Retrieval Center (we are currently recruiting a student to fill this position). The students working on the project receive training in both AI and IR. Andrew Arnt is successfully completing the "portfolio" requirements of the Ph.D. program.

We anticipate that the completion of the project will have the following additional impact:

  1. Performance gains: We anticipate substantial gains in the quality of service of information retrieval search engines both under normal operation and particularly under increased workload.
  2. Robustness: The resulting system will be able to respond to a high workload by reducing the amount of computation needed to process each query and dynamically adjusting the quality of service.
  3. Open system design: The resulting system will facilitate sharing and integration of information retrieval techniques developed by others by establishing a standard for exchanging intermediate results within an open system architecture.

Goals, Objectives, and Targeted Activities

The objective of this project is to construct a new approach to building information retrieval search engines that addresses the following key challenges:

  1. Composability of information retrieval techniques: How to unify the specifications of a variety of existing information retrieval techniques so that they can be used as plug-in components of a complete system?
  2. Identifying computational tradeoffs in information retrieval: How to characterize the computational cost and the benefits of a particular IR technique? How to measure the interdependency between different techniques and the characteristics of the input query?
  3. Monitoring and control: How to measure and monitor the progress that the system makes with a specific task? How to handle the uncertainty associated with both quality and duration of each module? How to dynamically select the best IR techniques in a particular situation with minimal computational overhead?
  4. Scalability of resource-bounded reasoning techniques: By definition, information retrieval tasks involve large collections and a substantial amount of test data. One of the challenges of this project is to address the scalability of resource-bounded reasoning techniques to handle very large problem domains.
  5. Dynamic configuration of search engines: From an information retrieval perspective, this project will test the ability to move from a careful, design-time system configuration to a more flexible approach based on run-time analysis of the specific characteristics of the task.

To achieve these objectives, we are planning the following activities for the second year of the project: improve the machine learning techniques we have used to predict performance; use larger data sets for evaluation; modify the task model to reflect the lessons learned from the first year (for example, partial execution of a technique may be needed to better predict its performance); test the efficacy of the Metamorph architecture for lightweight reconfiguration of a system; develop a prototype of a search engine composed of several techniques that offer a time/quality tradeoff; and evaluate the search engine and the approach we developed to control it.

Area Background

The ability to dynamically adjust computational effort based on the availability of computational resources has been studied extensively by the AI community since the mid 1980's. These efforts have led to the development of a variety of techniques such as anytime algorithms [Dean & Boddy, 1988; Zilberstein & Russell, 1996], design-to-time [Garvey & Lesser, 1993], flexible computation [Horvitz, 1987], and progressive reasoning [Mouaddib & Zilberstein, 1998]. Similar models of computation have been studied within the systems community in the area of imprecise computation [Liu et al., 1991]. In this project, we adapt the progressive processing framework to build a resource-bounded reasoning system that is particularly suited for dynamic composition of information retrieval techniques [Jones & Willett]. The information retrieval techniques we currently use are based on the extensive infrastructure available at the Center for Information retrieval at UMass, and in particular on components of the INQUERY system [Callan et al., 1992]. The dynamic selection of modules is based on their marginal value, which is determined by several factors: the characteristics of the specific request, the quality of the intermediate result produced by the system, the execution time of the module under consideration, and its anticipated contribution to the overall quality of service. Additional factors that can be taken into account are the consumer of the results and the expected reward for the service.

The current static approach to integration of information retrieval modules continues to produce performance gains of about 10% each year [Allan et al., 1997], but the systems are extremely specialized for each task, and it is not clear how well results will generalize to new types of retrieval. We believe that our new approach will provide significant advantages because it allows a system to configure itself dynamically to the specific task at hand. Our approach will allow systems to take advantage of query processing techniques when it makes sense for them to do so, and to include or exclude key components of the system based on whether they are expected to work, and whether there is time for them to be applied.

Area References

  1. T.L. Dean and M. Boddy. An analysis of time-dependent planning. Proceedings of the 7th National Conference on Artificial Intelligence, 49-54, Minneapolis, Minnesota, 1988.
  2. J.P. Callan, W.B. Croft, and S.M. Harding. The INQUERY retrieval system. Proceedings of the 3rd International Conference on Database and Expert Systems Applications, 78-83, Valencia, Spain, 1992.
  3. J. Allan, J. Callan, W.B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. INQUERY Does Battle with TREC-6. The 6th Text Retrieval Conference (TREC-6), NIST Special Publication, 1997.
  4. Croft, W.B. Combining Approaches to Information Retrieval. To appear in Advances in Information Retrieval: Recent Research from the CIIR, W. Bruce Croft (ed.) Kluwer Academic Publishers, 2000.
  5. A. Garvey and V. Lesser. Design-to-time real-time scheduling. IEEE Transactions on Systems, Man and Cybernetics, 23(6):1491-1502, 1993.
  6. E.J. Horvitz. Reasoning about beliefs and actions under computational resource constraints. Proceedings of the 1987 Workshop on Uncertainty in Artificial Intelligence, Seattle, Washington, 1987.
  7. K.S. Jones and P. Willett. (eds.) Readings in Information Retrieval. Morgan Kaufmann Publishers, 1997.
  8. J. Liu, K. Lin, W. Shih, A. Yu, J. Chung, and W. Zao. Algorithms for scheduling imprecise computations. IEEE Transactions on Computers, 24(5):58-68, 1991.
  9. A.I. Mouaddib and S. Zilberstein. Optimal scheduling of dynamic progressive processing. Proceedings of the 13th Biennial European Conference on Artificial Intelligence, 449-503, Brighton, UK, 1998.
  10. S. Zilberstein and S. J. Russell. Optimal composition of real-time systems. Artificial Intelligence, 82(1-2):181-213, 1996.
  11. S. Zilberstein and A.I. Mouaddib. Reactive control of dynamic progressive processing. Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1268-1273, Stockholm, Sweden, 1999.

Potential Related Projects

The following web sites provide additional information and links to related projects.