Site abstraction for rare category classification in large-scale web directory

Tie-Yan Liu; Tao Qin; Zheng Chen; Wei-Ying Ma

WWW '05: Special interest tracks and posters of the 14th international conference on World Wide Web | May 2005

Published by ACM

Automatic classification for Web directories is an effective way to manage Web information. However, our experiments showed that state-of-the-art text classification technologies could not achieve acceptable performance on this task. According to our analysis, the main problem is the lack of effective training data in the rare categories of Web directories. To tackle this problem, we proposed a novel technology named Site Abstraction, which synthesizes new training examples from the website of an existing training document. The main idea is to propagate features through the parent-child relationships in the sitemap tree. Experiments showed that our method significantly improved classification performance.
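
The abstract does not give the propagation rule, so the following Python sketch is only an illustration of the general idea of propagating features along parent-child links in a sitemap tree. It assumes bag-of-words term counts per page and a simple one-step blend in which each page keeps its own counts and adds a weighted share of its parent's counts and of the average of its children's counts. The names `SiteNode`, `blend`, and `synthesize_examples`, and the weights `w_parent` and `w_child`, are hypothetical and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SiteNode:
    """One page in a (hypothetical) sitemap tree with bag-of-words term counts."""
    url: str
    terms: Counter
    children: list = field(default_factory=list)


def blend(target, source, weight):
    """Add `weight` times each term count in `source` to the `target` dict."""
    for term, count in source.items():
        target[term] = target.get(term, 0.0) + weight * count


def synthesize_examples(root, w_parent=0.5, w_child=0.5):
    """Return {url: feature dict}, one synthesized example per page.

    Assumed rule (not from the paper): own counts, plus w_parent times the
    parent's original counts, plus w_child times the mean of the children's
    original counts. Only original counts are propagated, so this is a
    single propagation step.
    """
    examples = {}
    stack = [(root, None)]  # (node, parent) pairs, iterative traversal
    while stack:
        node, parent = stack.pop()
        features = {term: float(count) for term, count in node.terms.items()}
        if parent is not None:
            blend(features, parent.terms, w_parent)
        if node.children:
            share = w_child / len(node.children)
            for child in node.children:
                blend(features, child.terms, share)
        examples[node.url] = features
        stack.extend((child, node) for child in node.children)
    return examples


# Toy site: a home page with two child pages.
events = SiteNode("/events", Counter({"jazz": 1, "concert": 2}))
shop = SiteNode("/shop", Counter({"tickets": 4}))
home = SiteNode("/", Counter({"jazz": 2}), [events, shop])

examples = synthesize_examples(home)
```

In this toy site, the `/shop` page mentions only "tickets", yet its synthesized example also carries a "jazz" feature inherited from the home page. This is the kind of enrichment that could help a category with very few training documents, under the assumptions stated above.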