{"id":1128357,"date":"2025-02-10T14:48:43","date_gmt":"2025-02-10T22:48:43","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=1128357"},"modified":"2025-04-09T15:21:50","modified_gmt":"2025-04-09T22:21:50","slug":"tuna-tuning-unstable-and-noisy-cloud-applications","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/tuna-tuning-unstable-and-noisy-cloud-applications\/","title":{"rendered":"TUNA: Tuning Unstable and Noisy Cloud Applications"},"content":{"rendered":"<p><span dir=\"ltr\" role=\"presentation\">Autotuning<\/span> <span dir=\"ltr\" role=\"presentation\">plays<\/span> <span dir=\"ltr\" role=\"presentation\">a<\/span> <span dir=\"ltr\" role=\"presentation\">pivotal<\/span> <span dir=\"ltr\" role=\"presentation\">role<\/span> <span dir=\"ltr\" role=\"presentation\">in<\/span> <span dir=\"ltr\" role=\"presentation\">optimizing<\/span> <span dir=\"ltr\" role=\"presentation\">the<\/span> <span dir=\"ltr\" role=\"presentation\">perfor<\/span><span dir=\"ltr\" role=\"presentation\">mance of systems, particularly in large-scale cloud deploy<\/span><span dir=\"ltr\" role=\"presentation\">ments, and has been used to improve the performance of <\/span><span dir=\"ltr\" role=\"presentation\">a number of systems including databases, key-value stores, <\/span><span dir=\"ltr\" role=\"presentation\">and operating systems. We find that one of the main chal<\/span><span dir=\"ltr\" role=\"presentation\">lenges in performing autotuning in the cloud arises from <\/span><span dir=\"ltr\" role=\"presentation\">performance variability or noise in system measurements. 
<\/span><span dir=\"ltr\" role=\"presentation\">We first investigate the extent to which noise slows down <\/span><span dir=\"ltr\" role=\"presentation\">autotuning and find that as little as<\/span> <span dir=\"ltr\" role=\"presentation\">5%<\/span> <span dir=\"ltr\" role=\"presentation\">noise can lead to a <\/span><span dir=\"ltr\" role=\"presentation\">2<\/span><span dir=\"ltr\" role=\"presentation\">.<\/span><span dir=\"ltr\" role=\"presentation\">5<\/span><span dir=\"ltr\" role=\"presentation\">x slowdown in converging to the best-performing con<\/span><span dir=\"ltr\" role=\"presentation\">figuration. We also measure the magnitude of noise in cloud <\/span><span dir=\"ltr\" role=\"presentation\">computing settings and find that, while some components <\/span><span dir=\"ltr\" role=\"presentation\">(CPU, disk) have almost no performance variability, there <\/span><span dir=\"ltr\" role=\"presentation\">are still sources of significant variability (caches, memory). <\/span><span dir=\"ltr\" role=\"presentation\">Additionally,<\/span> <span dir=\"ltr\" role=\"presentation\">we<\/span> <span dir=\"ltr\" role=\"presentation\">find<\/span> <span dir=\"ltr\" role=\"presentation\">that<\/span> <span dir=\"ltr\" role=\"presentation\">variability<\/span> <span dir=\"ltr\" role=\"presentation\">leads<\/span> <span dir=\"ltr\" role=\"presentation\">to<\/span> <span dir=\"ltr\" role=\"presentation\">autotuning <\/span><span dir=\"ltr\" role=\"presentation\">finding<\/span> <span dir=\"ltr\" role=\"presentation\">unstable<\/span> <span dir=\"ltr\" role=\"presentation\">configurations, where for some workloads <\/span><span dir=\"ltr\" role=\"presentation\">as many as<\/span> <span dir=\"ltr\" role=\"presentation\">63<\/span><span dir=\"ltr\" role=\"presentation\">.<\/span><span dir=\"ltr\" role=\"presentation\">3%<\/span> <span dir=\"ltr\" role=\"presentation\">of configurations selected as &#8220;best&#8221; during <\/span><span dir=\"ltr\" role=\"presentation\">tuning 
can degrade by<\/span> <span dir=\"ltr\" role=\"presentation\">30%<\/span> <span dir=\"ltr\" role=\"presentation\">or more when deployed. Using <\/span><span dir=\"ltr\" role=\"presentation\">this<\/span> <span dir=\"ltr\" role=\"presentation\">as<\/span> <span dir=\"ltr\" role=\"presentation\">motivation,<\/span> <span dir=\"ltr\" role=\"presentation\">this<\/span> <span dir=\"ltr\" role=\"presentation\">paper<\/span> <span dir=\"ltr\" role=\"presentation\">proposes<\/span> <span dir=\"ltr\" role=\"presentation\">a<\/span> <span dir=\"ltr\" role=\"presentation\">novel<\/span> <span dir=\"ltr\" role=\"presentation\">approach <\/span><span dir=\"ltr\" role=\"presentation\">to improve the efficiency of autotuning systems by (a) de<\/span><span dir=\"ltr\" role=\"presentation\">tecting and removing outlier configurations, and (b) using <\/span><span dir=\"ltr\" role=\"presentation\">ML-based approaches to provide a more stable<\/span> <span dir=\"ltr\" role=\"presentation\">true<\/span> <span dir=\"ltr\" role=\"presentation\">signal <\/span><span dir=\"ltr\" role=\"presentation\">of<\/span> <span dir=\"ltr\" role=\"presentation\">de-noised<\/span> <span dir=\"ltr\" role=\"presentation\">experiment<\/span> <span dir=\"ltr\" role=\"presentation\">results<\/span> <span dir=\"ltr\" role=\"presentation\">to<\/span> <span dir=\"ltr\" role=\"presentation\">the<\/span> <span dir=\"ltr\" role=\"presentation\">optimizer.<\/span> <span dir=\"ltr\" role=\"presentation\">The<\/span> <span dir=\"ltr\" role=\"presentation\">re<\/span><span dir=\"ltr\" role=\"presentation\">sulting system, <em>TUNA<\/em> (<span style=\"text-decoration: underline\">T<\/span><\/span><span dir=\"ltr\" role=\"presentation\">uning <span style=\"text-decoration: underline\">U<\/span><\/span><span dir=\"ltr\" role=\"presentation\">nstable and <span style=\"text-decoration: underline\">N<\/span><\/span><span dir=\"ltr\" role=\"presentation\">oisy Cloud <\/span><span style=\"text-decoration: underline\"><span 
dir=\"ltr\" role=\"presentation\">A<\/span><\/span><span dir=\"ltr\" role=\"presentation\">pplications), enables faster convergence and robust config<\/span><span dir=\"ltr\" role=\"presentation\">urations. We find that configurations learned using TUNA <\/span><span dir=\"ltr\" role=\"presentation\">perform better and with lower standard deviations during <\/span><span dir=\"ltr\" role=\"presentation\">deployment, as compared to traditional sampling methodolo<\/span><span dir=\"ltr\" role=\"presentation\">gies. Tuning<\/span> <span dir=\"ltr\" role=\"presentation\">PostgreSQL<\/span> <span dir=\"ltr\" role=\"presentation\">running an enterprise production <\/span><span dir=\"ltr\" role=\"presentation\">workload, we find that TUNA can lead to<\/span> <span dir=\"ltr\" role=\"presentation\">1<\/span><span dir=\"ltr\" role=\"presentation\">.<\/span><span dir=\"ltr\" role=\"presentation\">88<\/span><span dir=\"ltr\" role=\"presentation\">x lower run<\/span><span dir=\"ltr\" role=\"presentation\">ning time on average with<\/span> <span dir=\"ltr\" role=\"presentation\">2<\/span><span dir=\"ltr\" role=\"presentation\">.<\/span><span dir=\"ltr\" role=\"presentation\">58<\/span><span dir=\"ltr\" role=\"presentation\">x<\/span> <span dir=\"ltr\" role=\"presentation\">lower standard deviation <\/span><span dir=\"ltr\" role=\"presentation\">compared to traditional sampling methodologies. 
\u00a0<\/span>TUNA will be incorporated into the <a href=\"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mlos-in-action-bridging-the-gap-between-experimentation-and-auto-tuning-in-the-cloud\/\">MLOS<\/a> project and has both <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/mlos\/tuna-eurosys-artifacts\">artifacts<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/mlos\/tuna-eurosys-artifacts\">multiple<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/aka.ms\/mlos\/tuna-eurosys-dataset\">datasets<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> available.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Autotuning plays a pivotal role in optimizing the performance of systems, particularly in large-scale cloud deployments, and has been used to improve the performance of a number of systems including databases, key-value stores, and operating systems. 
We find that one of the main challenges in performing autotuning in the cloud arises from performance variability or [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"","msr_page_range_end":"","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Eurosys 2025","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2025-3-30","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"https:\/\/2025.eurosys.org\/","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":null,"footnotes":""},"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[269148,269142],"msr-field-of-study":[246691],"msr-conference":[267387],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-1128357","msr-resea
rch-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us","msr-post-option-approved-for-river","msr-post-option-include-in-river","msr-field-of-study-computer-science"],"msr_publishername":"","msr_edition":"","msr_affiliation":"","msr_published_date":"2025-3-30","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualproperty":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"doi","viewUrl":"false","id":"false","title":"https:\/\/doi.org\/10.1145\/3689031.3717480","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/mlos-in-action-bridging-the-gap-between-experimentation-and-auto-tuning-in-the-cloud\/","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/performance-roulette-how-cloud-weather-affects-ml-based-system-optimization\/","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/llamatune-sample-efficient-dbms-configuration-tuning\/","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/aka.ms\/mlos\/tuna-eurosys-artifacts","label_id":"243118","label":0},{"type":"url","viewUrl":"false","id":"false","title":"https:\/\/aka.ms\/mlos\/tuna-eurosys-datase
t","label_id":"243118","label":0}],"msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":1132755,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/TUNA_EuroSys_2025.pdf"},{"id":1130118,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/TUNA.pdf"},{"id":1128366,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2025\/02\/TUNA_EuroSys_2025_Submitted-1.pdf"}],"msr-author-ordering":[{"type":"text","value":"Johannes Freischuetz","user_id":0,"rest_url":false},{"type":"text","value":"Konstantinos Kanellis","user_id":0,"rest_url":false},{"type":"text","value":"Brian Kroth","user_id":0,"rest_url":false},{"type":"text","value":"Shivaram Venkataraman","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[],"msr_event":[],"msr_group":[684024],"msr_project":[],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":[],"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1128357","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":6,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1128357\/revisions"}],"predecessor-version":[{"id":1132758,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/1128357\/revisions\/1132758"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=1128357"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable
":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=1128357"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=1128357"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=1128357"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=1128357"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=1128357"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=1128357"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=1128357"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=1128357"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=1128357"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=1128357"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=1128357"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=1128357"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}