Passage Retrieval on Web Data

  • Ruihua Song
  • Shao-Ping Ma
  • Min Zhang

International Conference on Machine Learning and Cybernetics

Published by IEEE

On the Web, a single document often covers several independent subtopics, i.e., it is multi-topic. Dividing such a document into passages, each corresponding to only one topic, improves retrieval performance. In this paper, features embedded in the HTML structure are used as evidence for passage segmentation. Experimental results on the TREC-9 10-gigabyte Web dataset show that the 11-point average precision of passage retrieval is higher than that of ordinary document retrieval by about 9% on the collection of multi-topic documents and by about 1.6% on the whole document set. Further analysis indicates that the precision would actually be higher if all the documents returned by passage retrieval were assessed.
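The 11-point average precision cited above can be illustrated with a minimal sketch. It assumes the standard interpolated definition (the mean of the interpolated precision at recall levels 0.0, 0.1, ..., 1.0); the abstract does not say exactly how the authors computed it, and the function and parameter names here are hypothetical.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Interpolated 11-point average precision for one query.

    ranked_relevance: list of booleans in rank order (True = relevant).
    total_relevant:   number of relevant documents for the query.
    Both names are illustrative assumptions, not taken from the paper.
    """
    if total_relevant <= 0:
        raise ValueError("total_relevant must be positive")

    # Recall and precision after each rank position.
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    # Interpolated precision at a recall level is the maximum precision
    # observed at any recall greater than or equal to that level.
    interpolated = []
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(candidates, default=0.0))

    return sum(interpolated) / 11


if __name__ == "__main__":
    # Two relevant documents, retrieved at ranks 1 and 3.
    print(eleven_point_average_precision([True, False, True, False], 2))
```

Averaging this value over all queries gives a single figure per system, so two runs (for example, a passage-based run and a document-based run) can be compared on the same scale.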