Kable

Established: January 1, 2012

Goal: Kable extracts structured knowledge from semi-structured Web sites. Around 25% of Web pages are semi-structured: they are generated from templates and contain a large amount of knowledge. Kable learns the template shared across such pages and uses it to extract that knowledge. For example, IMDB is a typical semi-structured Web site, containing knowledge such as movie name, director, and runtime.

Challenge: large scale, lack of training data, cross-language content, frequent updates

Solution: scalable unsupervised template learning across pages

Problem Definition:

Input: semi-structured site
Output: tabular data
Basic idea: learn the variant and invariant parts of page clusters [CIKM’12]
- Across pages generated by the same template
- The invariant parts are deemed as schema
- The variant parts are deemed as values
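
The idea above can be illustrated with a minimal sketch. It is not Kable's actual implementation: it assumes each page has already been reduced to a mapping from a tag path to the text found at that path, and the function name `split_invariant_variant` and the sample pages are hypothetical. Paths whose text is identical on every page are treated as schema, and paths whose text differs are treated as values.

```python
from collections import defaultdict


def split_invariant_variant(pages):
    """Split tag paths into invariant (schema) and variant (value) parts.

    pages: list of dicts mapping a tag path to the text found at that path.
    This page representation is an assumption made for illustration.
    """
    texts_by_path = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            texts_by_path[path].append(text)

    invariant, variant = {}, {}
    for path, texts in texts_by_path.items():
        if len(texts) < len(pages):
            # Path is missing on some page; ignored in this simple sketch.
            continue
        if len(set(texts)) == 1:
            invariant[path] = texts[0]   # same text everywhere: schema label
        else:
            variant[path] = texts        # text changes per page: values
    return invariant, variant


# Two hypothetical pages from the same template.
pages = [
    {"/html/body/div/span[1]": "Director:",
     "/html/body/div/span[2]": "Frank Darabont"},
    {"/html/body/div/span[1]": "Director:",
     "/html/body/div/span[2]": "Christopher Nolan"},
]
invariant, variant = split_invariant_variant(pages)
```

Here the label "Director:" ends up in `invariant` and the two names end up in `variant`. A real system would also need to handle optional fields, repeated blocks, and noisy markup, which this sketch leaves out.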

Algorithm:

S1. Page clustering using URL patterns and tag path trees
S2. Template learning in each page cluster
S3. Post-processing and schema matching
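
The URL-pattern half of S1 might look roughly like the sketch below. It is an assumption-laden illustration, not the method used by Kable: the rule "any path segment containing a digit is an identifier" is a simplification chosen here, the `{id}` placeholder and the function names are hypothetical, and the tag path tree comparison is not shown.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def url_pattern(url):
    """Generalize a URL by replacing digit-bearing path segments with {id}."""
    parsed = urlparse(url)
    segments = [seg for seg in parsed.path.split("/") if seg]
    generalized = ["{id}" if re.search(r"\d", seg) else seg for seg in segments]
    return parsed.netloc + "/" + "/".join(generalized)


def cluster_by_url_pattern(urls):
    """Group URLs that share the same generalized pattern."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[url_pattern(url)].append(url)
    return dict(clusters)


# Hypothetical URLs on a placeholder domain.
urls = [
    "https://www.example.com/title/tt0111161/",
    "https://www.example.com/title/tt0468569/",
    "https://www.example.com/name/nm0000151/",
]
clusters = cluster_by_url_pattern(urls)
```

With these inputs, the two `/title/` URLs fall into one cluster and the `/name/` URL into another. Each resulting cluster would then be a candidate input for template learning in S2.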

Achievements:

Shipped to Satori: Http://Kable
Processed all tier 0 and tier 1 sites in RetroIndex, encompassing 9Mil+ sites and 16Bil+ pages. Of these, 16Bil+ pages were detected to contain semi-structured data, yielding 409Bil+ facets.