{"id":620757,"date":"2019-11-11T16:09:33","date_gmt":"2019-11-12T00:09:33","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=620757"},"modified":"2019-11-14T16:21:02","modified_gmt":"2019-11-15T00:21:02","slug":"web-data-extraction-using-hybrid-program-synthesis-a-combination-of-top-down-and-bottom-up-inference","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/web-data-extraction-using-hybrid-program-synthesis-a-combination-of-top-down-and-bottom-up-inference\/","title":{"rendered":"Web data extraction using hybrid program synthesis: a combination of top-down and bottom-up inference"},"content":{"rendered":"<p>Automatic synthesis of web data extraction programs has been explored in a variety of settings, but in practice there remain various usability challenges around robustness, the amount of training effort required, the complexity of programs synthesized, as well as the ease of interaction in limited UI environments. In this work we address these challenges based on a novel program synthesis approach which combines the benefits of deductive (top-down) and enumerative (bottom-up) synthesis strategies. This yields a semi-supervised technique with which concise web data extraction programs expressible in standard XPath\/CSS can be synthesized from a small number of user-provided examples. We demonstrate the effectiveness of our method in comparison to existing techniques in terms of overall accuracy, robust inference from a small number of examples, as well as inference of concise programs comparable to hand-written selectors. Our method has been deployed as a feature in the Microsoft Power BI product and released to millions of users.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Automatic synthesis of web data extraction programs has been explored in a variety of settings, but in practice there remain various usability challenges around robustness, the amount of training effort required, the complexity of programs synthesized, as well as the ease of interaction in limited UI environments. In this work we address these challenges based [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Mohammad Raza","user_id":"32997"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":"33755"}],"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"ACM","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"SIGMOD (Special Interest Group on Management of Data)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-6-1","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13556,13563,13560],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-620757","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-artificial-intelligence","msr-research-area-data-platform-analytics","msr-research-area-programming-languages-software-engineering","msr-locale-en_us"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-6-1","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"ACM","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/11\/websynth-5dcdee542b58d.pdf","id":"621360","title":"websynth-5dcdee542b58d","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/11\/websynth-techreport.pdf","id":"621342","title":"websynth-techreport","label_id":"243118","label":0}],"msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":621360,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/11\/websynth-5dcdee542b58d.pdf"},{"id":621357,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/11\/websynth.pdf"},{"id":621342,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2019\/11\/websynth-techreport.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Mohammad Raza","user_id":32997,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Mohammad Raza"},{"type":"user_nicename","value":"Sumit Gulwani","user_id":33755,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Sumit Gulwani"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[663303],"msr_project":[817234,654579,672951],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":817234,"post_title":"PROSE-Powered Data Ingestion (a.k.a. Table Extraction)","post_name":"prose-powered-data-ingestion","post_type":"msr-project","post_date":"2022-01-31 17:52:00","post_modified":"2022-04-19 13:39:04","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-powered-data-ingestion\/","post_excerpt":"\u200bWhen the COVID-19 pandemic was in its early stages, several agencies published infection and mortality data for different geographical regions in the public domain.&nbsp;This data appeared in web pages, CSV files, JSON files, and more.&nbsp;There was plenty of useful data out there, but before one could use this data to generate models and visualizations, one had to ingest the data into a tabular data frame and clean it.&nbsp;The task of extracting tables from the varied&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/817234"}]}},{"ID":654579,"post_title":"PROSE Framework","post_name":"prose-framework","post_type":"msr-project","post_date":"2020-07-22 11:23:08","post_modified":"2022-07-22 10:31:53","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/prose-framework\/","post_excerpt":"Microsoft PROSE SDK is a framework of technologies for programming by examples: automatic generation of programs from input-output examples.","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/654579"}]}},{"ID":672951,"post_title":"Predictive Program Synthesis","post_name":"predictive-program-synthesis","post_type":"msr-project","post_date":"2020-07-07 20:17:25","post_modified":"2022-04-19 15:47:55","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/predictive-program-synthesis\/","post_excerpt":"Program synthesis technologies help users to easily automate tasks that would otherwise require significant manual effort or programming skills. For instance, programming-by-example or natural language programming approaches allow the user to express intent by giving examples or natural language descriptions of the task, from which the system can synthesize a program in a formal programming language to complete the task. In this project, we are exploring the novel notion of predictive program synthesis, which is&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/672951"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/620757","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":1,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/620757\/revisions"}],"predecessor-version":[{"id":620760,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/620757\/revisions\/620760"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=620757"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=620757"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=620757"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=620757"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=620757"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=620757"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=620757"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=620757"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=620757"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=620757"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=620757"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=620757"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=620757"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}