{"id":293825,"date":"2016-09-18T22:57:22","date_gmt":"2016-09-19T05:57:22","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-project&#038;p=293825"},"modified":"2016-09-18T22:59:05","modified_gmt":"2016-09-19T05:59:05","slug":"kable","status":"publish","type":"msr-project","link":"https:\/\/www.microsoft.com\/en-us\/research\/project\/kable\/","title":{"rendered":"Kable"},"content":{"rendered":"<p><strong>Goal:<\/strong> Kable extracts structured knowledge from semi-structured websites. Around 25% of web pages are semi-structured: they are generated from templates and contain a large amount of knowledge. Kable learns the template shared across such pages and uses it to extract that knowledge. For example, IMDB is a typical semi-structured website, which contains knowledge such as movie name, director, and runtime.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-293828\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/Picture1-300x270.png\" alt=\"picture1\" width=\"336\" height=\"302\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/Picture1-300x270.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/Picture1.png 587w\" sizes=\"auto, (max-width: 336px) 100vw, 336px\" \/><\/p>\n<p><strong>Challenge:<\/strong> large scale, lack of training data, cross-language content, frequent updates<\/p>\n<p><strong>Solution:<\/strong> scalable, unsupervised template learning across pages<\/p>\n<p><strong>Problem Definition:<\/strong><\/p>\n<ul>\n<li>Input: a semi-structured site<\/li>\n<li>Output: tabular data<\/li>\n<li>Basic idea: learn the variant and invariant parts of each page cluster [CIKM\u201912]\n<ul>\n<li>The comparison runs across pages generated by the same template<\/li>\n<li>The invariant parts are treated as the schema<\/li>\n<li>The variant parts are treated as the values<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone 
wp-image-294053\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/1.png\" alt=\"1\" width=\"154\" height=\"177\" \/>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 <img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-294056\" src=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/2-300x115.png\" alt=\"2\" width=\"496\" height=\"190\" srcset=\"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/2-300x115.png 300w, https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2016\/09\/2.png 544w\" sizes=\"auto, (max-width: 496px) 100vw, 496px\" \/><\/p>\n<p><strong>Algorithm:<\/strong><\/p>\n<ul>\n<li>S1. Cluster pages using URL patterns and tag path trees<\/li>\n<li>S2. Learn a template within each page cluster<\/li>\n<li>S3. Post-process the results and match schemas<\/li>\n<\/ul>\n<p><b>Achievements:<\/b><\/p>\n<ul>\n<li>Shipped to Satori: <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/kable\/\">Http:\/\/Kable<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li>\n<li>Processed all tier 0 and tier 1 sites in RetroIndex, encompassing <b>9Mil<\/b>+ sites and <b>16Bil<\/b>+ pages. Of those, <b>16Bil<\/b>+ pages were detected to have semi-structured data, containing <b>409Bil<\/b>+ facets<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Goal: Kable extracts structured knowledge from semi-structured websites. Around 25% of web pages are semi-structured: they are generated from templates and contain a large amount of knowledge. Kable learns the template shared across such pages and uses it to extract that knowledge. For example, 
IMDB is a typical semi-structured website, which contains knowledge such as movie name, [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","footnotes":""},"research-area":[13556],"msr-locale":[268875],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-293825","msr-project","type-msr-project","status-publish","hentry","msr-research-area-artificial-intelligence","msr-locale-en_us","msr-archive-status-active"],"msr_project_start":"2012-01-01","related-publications":[],"related-downloads":[],"related-videos":[],"related-groups":[],"related-events":[],"related-opportunities":[],"related-posts":[],"related-articles":[],"tab-content":[],"slides":[],"related-researchers":[],"msr_research_lab":[],"msr_impact_theme":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/293825","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-project"}],"version-history":[{"count":0,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/293825\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=293825"}],"wp:term":[{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=293825"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=293825"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=293825"},{"taxonomy":"msr-pillar","embeddable":true,"hre
f":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=293825"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}