Web Page Clustering using Heuristic Search in the Web Graph
Ron Bekkerman
Shlomo Zilberstein
James Allan
Abstract
Effective representation of Web search results remains
an open problem in the Information Retrieval
community. For ambiguous queries, a traditional
approach is to organize search results into
groups (clusters), one for each meaning of the
query. These groups are usually constructed according
to the topical similarity of the retrieved
documents, but it is possible for documents to be
totally dissimilar and still correspond to the same
meaning of the query. To overcome this problem,
we exploit the thematic locality of the
Web--relevant Web pages are often located close to each
other in the Web graph of hyperlinks. We estimate
the level of relevance between each pair of retrieved
pages by the length of a path between them. The
path is constructed using multi-agent beam search:
each agent starts with one Web page and attempts
to meet as many other agents as possible with some
bounded resources. We test the system on two types
of queries: ambiguous English words and people
names. The Web appears to be tightly connected;
about 70% of the agents meet with each other after
only three iterations of exhaustive breadth-first
search. However, when heuristics are applied, the
search becomes more focused and the obtained results
are substantially more accurate. Combined
with a content-driven Web page clustering technique,
our heuristic search system significantly improves
the clustering results.
Download
[pdf]