{"id":663438,"date":"2020-05-31T21:27:18","date_gmt":"2020-06-01T04:27:18","guid":{"rendered":"https:\/\/www.microsoft.com\/en-us\/research\/?post_type=msr-research-item&#038;p=663438"},"modified":"2020-12-17T01:18:45","modified_gmt":"2020-12-17T09:18:45","slug":"balancing-efficiency-and-fairness-in-heterogeneous-gpu-clusters-for-deep-learning","status":"publish","type":"msr-research-item","link":"https:\/\/www.microsoft.com\/en-us\/research\/publication\/balancing-efficiency-and-fairness-in-heterogeneous-gpu-clusters-for-deep-learning\/","title":{"rendered":"Balancing efficiency and fairness in heterogeneous GPU clusters for deep learning"},"content":{"rendered":"<p>We present Gandiva_fair, a distributed, fair share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandiva_fair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandiva_fair is the first scheduler that allocates cluster-wide GPU time fairly among active users.<\/p>\n<p>Gandiva_fair achieves efficiency and fairness despite cluster heterogeneity. Data centers host a mix of GPU generations because of the rapid pace at which newer and faster GPUs are released. As the newer generations face higher demand from users, older GPU generations suffer poor utilization, thus reducing cluster efficiency. Gandiva_fair profiles the variable marginal utility that different jobs gain from newer GPUs, and transparently incentivizes users to use older GPUs through a novel resource trading mechanism that maximizes cluster efficiency without affecting the fairness guarantees of any user. 
With a prototype implementation and evaluation in a heterogeneous 200-GPU cluster, we show that Gandiva_fair achieves both fairness and efficiency under realistic multi-user workloads.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We present Gandiva_fair, a distributed, fair share scheduler that balances the conflicting goals of efficiency and fairness in GPU clusters for deep learning training (DLT). Gandiva_fair provides performance isolation between users, enabling multiple users to share a single cluster and thus maximizing cluster efficiency. Gandiva_fair is the first scheduler that allocates cluster-wide GPU time fairly among active [&hellip;]<\/p>\n","protected":false},"featured_media":0,"template":"","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_publishername":"ACM","msr_publisher_other":"","msr_booktitle":"","msr_chapter":"","msr_edition":"","msr_editors":"","msr_how_published":"","msr_isbn":"","msr_issue":"","msr_journal":"","msr_number":"","msr_organization":"","msr_pages_string":"","msr_page_range_start":"1","msr_page_range_end":"16","msr_series":"","msr_volume":"","msr_copyright":"","msr_conference_name":"Fifteenth European Conference on Computer Systems 
(EuroSys'20)","msr_doi":"","msr_arxiv_id":"","msr_s2_paper_id":"","msr_mag_id":"","msr_pubmed_id":"","msr_other_authors":"","msr_other_contributors":"","msr_speaker":"","msr_award":"","msr_affiliation":"","msr_institution":"","msr_host":"","msr_version":"","msr_duration":"","msr_original_fields_of_study":"","msr_release_tracker_id":"","msr_s2_match_type":"","msr_citation_count_updated":"","msr_published_date":"2020-4-15","msr_highlight_text":"","msr_notes":"","msr_longbiography":"","msr_publicationurl":"","msr_external_url":"","msr_secondary_video_url":"","msr_conference_url":"https:\/\/www.eurosys2020.org\/program\/","msr_journal_url":"","msr_s2_pdf_url":"","msr_year":0,"msr_citation_count":0,"msr_influential_citations":0,"msr_reference_count":0,"msr_s2_match_confidence":0,"msr_microsoftintellectualproperty":true,"msr_s2_open_access":false,"msr_s2_author_ids":[],"msr_pub_ids":[],"msr_hide_image_in_river":0,"footnotes":""},"msr-research-highlight":[],"research-area":[13547],"msr-publication-type":[193716],"msr-publisher":[],"msr-focus-area":[],"msr-locale":[268875],"msr-post-option":[],"msr-field-of-study":[],"msr-conference":[],"msr-journal":[],"msr-impact-theme":[],"msr-pillar":[],"class_list":["post-663438","msr-research-item","type-msr-research-item","status-publish","hentry","msr-research-area-systems-and-networking","msr-locale-en_us"],"msr_publishername":"ACM","msr_edition":"","msr_affiliation":"","msr_published_date":"2020-4-15","msr_host":"","msr_duration":"","msr_version":"","msr_speaker":"","msr_other_contributors":"","msr_booktitle":"","msr_pages_string":"","msr_chapter":"","msr_isbn":"","msr_journal":"","msr_volume":"","msr_number":"","msr_editors":"","msr_series":"","msr_issue":"","msr_organization":"","msr_how_published":"","msr_notes":"","msr_highlight_text":"","msr_release_tracker_id":"","msr_original_fields_of_study":"","msr_download_urls":"","msr_external_url":"","msr_secondary_video_url":"","msr_longbiography":"","msr_microsoftintellectualpropert
y":1,"msr_main_download":"","msr_publicationurl":"","msr_doi":"","msr_publication_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/gandiva-fair-eurosys20.pdf","id":"663444","title":"gandiva-fair-eurosys20","label_id":"243109","label":0}],"msr_related_uploader":[{"type":"file","viewUrl":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/gandiva-eurosys20.pdf","id":"713356","title":"gandiva-eurosys20","label_id":"243118","label":0}],"msr_citation_count":0,"msr_citation_count_updated":"","msr_s2_paper_id":"","msr_influential_citations":0,"msr_reference_count":0,"msr_arxiv_id":"","msr_s2_author_ids":[],"msr_s2_open_access":false,"msr_s2_pdf_url":null,"msr_attachments":[{"id":713356,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/12\/gandiva-eurosys20.pdf"},{"id":663444,"url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-content\/uploads\/2020\/05\/gandiva-fair-eurosys20.pdf"}],"msr-author-ordering":[{"type":"text","value":"Shubham Chaudhary","user_id":0,"rest_url":false},{"type":"user_nicename","value":"Ramachandran Ramjee","user_id":33337,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Ramachandran Ramjee"},{"type":"user_nicename","value":"Muthian Sivathanu","user_id":36320,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Muthian Sivathanu"},{"type":"user_nicename","value":"Nipun Kwatra","user_id":37634,"rest_url":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/microsoft-research\/v1\/researchers?person=Nipun Kwatra"},{"type":"text","value":"Srinidhi 
Viswanatha","user_id":0,"rest_url":false}],"msr_impact_theme":[],"msr_research_lab":[199562],"msr_event":[],"msr_group":[],"msr_project":[968667],"publication":[],"video":[],"msr-tool":[],"msr_publication_type":"inproceedings","related_content":{"projects":[{"ID":968667,"post_title":"AI Infrastructure","post_name":"ai-infrastructure","post_type":"msr-project","post_date":"2023-10-04 00:44:27","post_modified":"2025-08-20 21:38:01","post_status":"publish","permalink":"https:\/\/www.microsoft.com\/en-us\/research\/project\/ai-infrastructure\/","post_excerpt":"Towards efficient AI\/ML deployment The AI Infrastructure team at Microsoft Research India&nbsp;works on cutting-edge systems optimizations for improving the efficiency of a variety of AI\/ML workloads, including an emerging class of workloads, namely, serving large language models (LLMs). AI\/ML models are expensive to train and serve at scale and therefore, systems optimizations are crucial for unlocking the true potential of AI-powered applications. 
The key principle behind many of our projects is co-design of Systems and&hellip;","_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-project\/968667"}]}}]},"_links":{"self":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/663438","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item"}],"about":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/types\/msr-research-item"}],"version-history":[{"count":2,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/663438\/revisions"}],"predecessor-version":[{"id":663447,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-item\/663438\/revisions\/663447"}],"wp:attachment":[{"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/media?parent=663438"}],"wp:term":[{"taxonomy":"msr-research-highlight","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-research-highlight?post=663438"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=663438"},{"taxonomy":"msr-publication-type","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publication-type?post=663438"},{"taxonomy":"msr-publisher","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-publisher?post=663438"},{"taxonomy":"msr-focus-area","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-focus-area?post=663438"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=663438"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/w
p\/v2\/msr-post-option?post=663438"},{"taxonomy":"msr-field-of-study","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-field-of-study?post=663438"},{"taxonomy":"msr-conference","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-conference?post=663438"},{"taxonomy":"msr-journal","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-journal?post=663438"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=663438"},{"taxonomy":"msr-pillar","embeddable":true,"href":"https:\/\/www.microsoft.com\/en-us\/research\/wp-json\/wp\/v2\/msr-pillar?post=663438"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}