Kable

Established: January 1, 2012

Goal: Kable extracts structured knowledge from semi-structured Web sites. Around 25% of Web pages are semi-structured: they are generated from templates and contain a large amount of knowledge. Kable learns the template shared across such pages and extracts the knowledge. For example, IMDB is a typical semi-structured Web site, containing knowledge such as movie name, director, and runtime.


Challenges: large scale, lack of training data, multiple languages, frequent updates

Solution: scalable unsupervised template learning across pages

Problem Definition:

  • Input: semi-structured site
  • Output: tabular data
  • Basic idea: learn the variant and invariant parts of page clusters [CIKM’12]
    • Compare pages generated by the same template
    • The invariant parts are treated as the schema
    • The variant parts are treated as the values
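
The invariant/variant idea above can be illustrated with a minimal Python sketch. It is not Kable's implementation: it assumes each page has already been reduced to a mapping from a tag path to the text found there, and the function names (split_invariant_variant, to_rows), the tag paths, and the sample values are all hypothetical.

```python
from collections import defaultdict


def split_invariant_variant(pages):
    """Split tag paths into invariant (schema) and variant (value) parts.

    `pages` is a list of dicts mapping a tag path to its text. This input
    shape is an assumption made for illustration only.
    """
    texts_by_path = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            texts_by_path[path].append(text)

    invariant, variant = {}, {}
    for path, texts in texts_by_path.items():
        # Invariant: present on every page with exactly the same text.
        if len(texts) == len(pages) and len(set(texts)) == 1:
            invariant[path] = texts[0]
        else:
            variant[path] = texts
    return invariant, variant


def to_rows(pages, variant):
    """Build one table row per page, with one column per variant path."""
    return [{path: page.get(path) for path in variant} for page in pages]


# Two hypothetical pages rendered from the same template.
pages = [
    {"table/tr[1]/th": "Director", "table/tr[1]/td": "Director A",
     "table/tr[2]/th": "Runtime", "table/tr[2]/td": "136 min"},
    {"table/tr[1]/th": "Director", "table/tr[1]/td": "Director B",
     "table/tr[2]/th": "Runtime", "table/tr[2]/td": "142 min"},
]
invariant, variant = split_invariant_variant(pages)
rows = to_rows(pages, variant)
```

In this toy version a value that happens to be identical on every sampled page would be misclassified as schema, so a real system would need many pages per cluster and a softer threshold. Mapping each value column to its schema label is also left out.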


Algorithm:

  • S1. Cluster pages using URL patterns and tag path trees
  • S2. Learn a template within each page cluster
  • S3. Post-process the results and match schemas
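
As a rough illustration of the URL-pattern half of S1, the sketch below groups URLs whose paths differ only in ID-like segments. The generalization rule (any path segment containing a digit becomes `*`), the function names, and the example.com URLs are assumptions made for illustration; the source does not describe how Kable derives URL patterns, and the tag path tree signal is omitted here.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def url_pattern(url):
    """Generalize a URL into a pattern (assumed rule, not Kable's)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    # Treat any segment containing a digit as a per-page identifier.
    generalized = ["*" if re.search(r"\d", s) else s for s in segments]
    return parts.netloc + "/" + "/".join(generalized)


def cluster_by_url_pattern(urls):
    """Group URLs that share the same generalized pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


# Hypothetical URLs for a movie site with title pages and name pages.
urls = [
    "https://example.com/title/tt001/",
    "https://example.com/title/tt002/",
    "https://example.com/name/nm042/",
]
clusters = cluster_by_url_pattern(urls)
```

Here the two title pages fall into one cluster and the name page into another. URL patterns alone can merge pages with different layouts or split pages that share a template, which is presumably why S1 also uses tag path trees.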

Achievements:

  • Shipped to Satori: Http://Kable
  • Processed all tier 0 and tier 1 sites in RetroIndex, encompassing 9 million+ sites and 16 billion+ pages. Of those, 16 billion+ pages were detected to contain semi-structured data, holding 409 billion+ facets