{"id":654228,"date":"2020-04-27T16:49:10","date_gmt":"2020-04-27T23:49:10","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=654228"},"modified":"2023-11-16T16:57:37","modified_gmt":"2023-11-17T00:57:37","slug":"auto-suggest-learning-to-recommend-data-preparation-steps-using-data-science-notebooks","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/auto-suggest-learning-to-recommend-data-preparation-steps-using-data-science-notebooks\/","title":{"rendered":"Auto-Suggest: Learning-to-Recommend Data Preparation Steps Using Data Science Notebooks"},"content":{"rendered":"<p>Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.) holds the potential to greatly improve user productivity, and has therefore become a central focus of research.<\/p>\n<p>We propose a novel approach to \u201cauto-suggest\u201d contextualized data preparation steps, by \u201clearning\u201d from how data scientists would manipulate data, which are documented by data science notebooks widely available today. Specifically, we crawled over 4M Jupyter notebooks on GitHub, and replayed them step-by-step, to observe not only full input\/output tables (data-frames) at each step, but also the exact data-preparation choices data scientists make that they believe are best suited to the input data (e.g., how input tables are Joined\/Pivoted\/Unpivoted, etc.).<\/p>\n<p>By essentially \u201clogging\u201d how data scientists interact with diverse tables, and using the resulting logs as a proxy of \u201cground truth\u201d, we can learn-to-recommend data preparation steps best suited to given user data, just like how search engines (Google or Bing) leverage their click-through logs to learn-to-rank documents. This data-driven and log-driven approach leverages the \u201ccollective wisdom\u201d of data scientists embodied in the notebooks, and is shown to significantly outperform strong baselines including commercial systems in terms of accuracy.<\/p>\n<p>Datasets used in this project have been made available on GitHub <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" href=\"https:\/\/github.com\/congy\/AutoSuggest\" target=\"_blank\" rel=\"noopener noreferrer\">https:\/\/github.com\/congy\/AutoSuggest<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> to facilitate future research.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Data preparation is widely recognized as the most time-consuming process in modern business intelligence (BI) and machine learning (ML) projects. Automating complex data preparation steps (e.g., Pivot, Unpivot, Normalize-JSON, etc.) holds the potential to greatly improve user productivity, and has therefore become a central focus of research. We propose a novel approach to \u201cauto-suggest\u201d contextualized [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":[{"type":"user_nicename","value":"Cong Yan","user_id":"39441"},{"type":"user_nicename","value":"Yeye He","user_id":"34992"}],"msr_publishername":"ACM","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"1539","msr_page_range_end":"1554","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"International Conference on Management of Data (SIGMOD)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-6-26","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13563],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[251665,267834,253351,255226],"msr-conference":[260221],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-654228","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-locale-en_us","msr-field-of-study-business-intelligence","msr-field-of-study-data-preparation","msr-field-of-study-ground-truth","msr-field-of-study-leverage-statistics"],"msr_publishername":"ACM","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-6-26","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/auto-suggest.pdf","id":"654231","title":"auto-suggest","label_id":"243109","label":0},{"type":"doi","viewUrl":"false","id":"false","title":"10.1145\/3318464.3389738","label_id":"243109","label":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":654231,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/04\/auto-suggest.pdf"}],"msr-author-ordering":[{"type":"user_nicename","value":"Cong Yan","user_id":39441,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Cong Yan"},{"type":"user_nicename","value":"Yeye He","user_id":34992,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Yeye He"}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[957177],"msr_project":[967218],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":967218,"post_title":"Self-service Data Preparation","post_name":"self-service-data-preparation","post_type":"msr-project","post_date":"2023-11-08 14:36:00","post_modified":"2023-11-18 10:15:39","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/self-service-data-preparation\/","post_excerpt":"It is often cited that data scientists spend a significant portion of their time (up to 80%), cleaning and preparing data. For less-technical users, who may be less proficient in writing code (e.g., in Excel, Power-BI and Tableau), the tasks of preparing and cleaning data are not just time-consuming, but also technically challenging. In the \"Self-service Data Preparation\" project, our goal is to develop technologies that can automate common data-preparation tasks, in the context of&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/967218"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/654228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":5,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/654228\/revisions"}],"predecessor-version":[{"id":941268,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/654228\/revisions\/941268"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=654228"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=654228"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=654228"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=654228"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=654228"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=654228"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=654228"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=654228"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=654228"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=654228"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=654228"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=654228"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=654228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}