{"id":486965,"date":"2018-05-19T13:01:26","date_gmt":"2018-05-19T20:01:26","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=486965"},"modified":"2018-10-16T22:26:07","modified_gmt":"2018-10-17T05:26:07","slug":"mlbench-benchmarking-machine-learning-services-human-experts","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mlbench-benchmarking-machine-learning-services-human-experts\/","title":{"rendered":"MLBench: Benchmarking Machine Learning Services Against Human Experts"},"content":{"rendered":"<p><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">Modern machine learning services and systems are complicated <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">data systems \u2014 the process of designing such systems is an art <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">of compromising between functionality, performance, and quality. 
<\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">Providing different levels of system support for different functionalities, <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">such as automatic feature engineering, model selection and <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">ensemble, and hyperparameter tuning, could improve the quality, <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">but also introduce additional cost and system complexity. 
In this <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">paper, we try to facilitate the process of asking the following type <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">of questions: How much will the users lose if we remove the support <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">of functionality x from a machine learning service?<\/span><br style=\"margin: 0px;padding: 0px;color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;background-color: #ffffff\" \/><br style=\"margin: 0px;padding: 0px;color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;background-color: #ffffff\" \/><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">Answering this question using existing datasets, such as the UCI <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">datasets, is challenging. 
The main contribution of this work is <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">a novel dataset, mlbench, harvested from Kaggle competitions. <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">Unlike existing datasets, mlbench contains not only the raw features <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">for a machine learning task, but also those used by the winning <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">teams of Kaggle competitions. 
The winning features serve as a <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">baseline of best human effort that enables multiple ways to measure <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">the quality of machine learning services that cannot be supported <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">by existing datasets, such as relative ranking on Kaggle and relative <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">accuracy compared with best-effort systems.<\/span><br style=\"margin: 0px;padding: 0px;color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;background-color: #ffffff\" \/><br style=\"margin: 0px;padding: 0px;color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;background-color: #ffffff\" \/><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">We then conduct an empirical study using mlbench to understand <\/span><span style=\"color: 
#333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">two example machine learning services from Azure and Amazon, <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">and showcase how mlbench enables a comparative study revealing <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">the strengths and weaknesses of these existing machine learning <\/span><span style=\"color: #333333;text-indent: 0px;letter-spacing: normal;font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;font-size: 14px;font-style: normal;font-weight: 400;float: none;background-color: #ffffff\">services quantitatively and systematically.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Modern machine learning services and systems are complicated data systems \u2014 the process of designing such systems is an art of compromising between functionality, performance, and quality. 
Providing different levels of system supports for different functionalities, such as automatic feature engineering, model selection and ensemble, and hyperparameter tuning, could improve the quality, but also introduce [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"Proceedings of the VLDB Endowment (VLDB 2018)","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Proceedings of the VLDB Endowment (VLDB 2018)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2018-05-17","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"http:\/\/www.vldb.org\/pvldb\/vol11.html","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13563],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[26
8875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-486965","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-data-platform-analytics","msr-locale-en_us"],"msr_publishername":"","msr_edition":"Proceedings of the VLDB Endowment (VLDB 2018)","msr_affiliation":"","msr_published_date":"2018-05-17","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"http:\/\/www.vldb.org\/pvldb\/vol11.html","msr_doi":"","msr_publication_uploader":[{"type":"url","title":"http:\/\/www.vldb.org\/pvldb\/vol11.html","viewUrl":false,"id":false,"label_id":0}],"msr_related_uploader":"","msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":0,"url":"http:\/\/www.vldb.org\/pvldb\/vol11.html"}],"msr-author-ordering":[{"type":"text","value":"Yu Liu","user_id":0,"rest_url":false},{"type":"text","value":"Hantian Zhang","user_id":0,"rest_url":false},{"type":"text","value":"Luyuan Zeng","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Wentao Wu","user_id":34824,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Wentao Wu"},{"type":"text","value":"Ce 
Zhang","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[957177],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/486965","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":3,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/486965\/revisions"}],"predecessor-version":[{"id":486974,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/486965\/revisions\/486974"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=486965"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=486965"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=486965"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=486965"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=486965"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=486965"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=486965"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.micro
soft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=486965"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=486965"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=486965"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=486965"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=486965"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=486965"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}