Tuning Data Center Performance with Machine Learning (Past Project – Completed)

Established: February 4, 2014




Machine learning and artificial intelligence (AI) techniques provide a tremendous new opportunity to increase data center performance, as recent efforts elsewhere attest. One specific problem that we have considered in this project is to utilize machine-learning techniques to identify hardware configurations in data centers that optimize energy efficiency and speed, and to vary those configurations at run time based on performance metrics collected via careful instrumentation.


Project Overview

Today, hardware specifications of commodity servers are chosen based on their ability to cater to a large range of applications. However, workloads exhibit variation in their characteristics that could be exploited to improve performance and energy efficiency. Unfortunately, customizing hardware specifications to workloads complicates resource management and increases cost. It thus makes sense to tweak hardware components via software configuration. In addition to the many potential software knobs to control hardware configurations (e.g., through virtual machine managers and operating system parameters), we observe that modern servers offer a number of firmware settings (e.g., through the BIOS or UEFI) that can be tuned with significant impact on run time and power consumption. In fact, some parameters, like those related to memory and storage, can only be tuned through the firmware configuration. This process thus leads to data centers with fungible settings, or soft heterogeneity (see figure).

Two issues arise when we change firmware settings. First, there are complex dependencies between workloads and firmware settings that are hard to interpret. Second, relevant hardware performance metrics have to be measured accurately before modeling these complex dependencies. We aim to address both of these issues in this project.


Project Details

We have built tools that help us accurately measure hardware performance metrics. We have also developed machine-learning models that have enabled us to unravel the relationships between workloads and firmware settings. Together, these tools and models have helped us set optimal firmware configurations in our data center servers.

X-mem: Tool to Measure DRAM Performance

In addition to OS-exposed performance counters (PMCs), we have utilized widely available tools like CPU-Z, Roadkil, AS SSD and MLC to measure hardware performance metrics. However, we have observed that existing DRAM-sensing tools lack several important capabilities and do not report the metrics that we care about. Thus, we have developed a better tool called X-mem to sense DRAM performance. It allows us to characterize the memory hierarchy in extreme detail, and it is modular, portable and extensible, while surpassing most capabilities of comparable software. Through flexible stimuli, X-mem directly senses and measures parameters like throughput, loaded and unloaded latency, and power consumption at each level of cache and DRAM.
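To illustrate the kind of sweep that X-mem automates, the following sketch measures sequential-copy throughput across working-set sizes spanning typical cache and DRAM capacities. This is a hypothetical simplification, not X-mem's actual code: interpreter overhead makes Python unsuitable for accurate hardware measurement, which is precisely why a native tool like X-mem is needed.

```python
import time

def copy_throughput_mb_s(working_set_bytes, iters=5):
    """Rough sequential-copy throughput for one working-set size.

    Interpreter overhead dominates in Python, so this only sketches the
    methodology; a native tool like X-mem is required for real numbers.
    """
    src = bytearray(working_set_bytes)
    start = time.perf_counter()
    for _ in range(iters):
        dst = bytes(src)  # sequential read + write of the whole buffer
    elapsed = time.perf_counter() - start
    total_mb = working_set_bytes * iters / 1e6
    return total_mb / elapsed

# Sweep working-set sizes roughly spanning L1, L2, LLC and DRAM.
for size in (32 * 1024, 256 * 1024, 8 * 1024 * 1024, 32 * 1024 * 1024):
    print(f"{size // 1024:>6} KiB: {copy_throughput_mb_s(size):10.1f} MB/s")
```

A real characterization additionally varies access pattern, chunk size, thread count and NUMA placement, which is the design space X-mem exposes.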

We have published our findings on X-mem at ISPASS 16 and released the source code under the MIT license on Microsoft's GitHub organization. We have ported X-mem to several platforms including Intel Atom micro-servers and ARM-based evaluation boards. The table below provides a comparison of X-mem and other relevant tools for DRAM characterization.

NUMA and cache-size settings: The figure below on the left illustrates how an understanding of the aggregate memory throughput or latency as a function of the working set size per thread, number of threads and chunk sizes helps make substantial design decisions on cache configurations. The figure on the right (also obtained using X-mem) reveals how significant main memory performance asymmetry may arise from the interaction of NUMA and page size.

Error-management settings: We have utilized X-mem to study hard and correctable faults that occur in DRAM. Using an internal, tail-latency-sensitive, web-search workload, we have found that SMI-based error handling (~133 ms per interrupt) degrades both the average and 95th-percentile performance far more than CMCI-based handling (~775 us) because of its higher handling latency (see figure). Our experiments with X-mem have thus helped make internal decisions on picking the correct error-handling settings for specific workloads.
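To see why a ~133 ms SMI hurts tail latency so much more than a ~775 us CMCI, consider a back-of-the-envelope model (illustrative numbers only, not the measurements from the study): a request that overlaps an error-handling interrupt is stalled for roughly the handler's duration, so a longer handler both hits more requests and pushes the affected ones far past the 95th percentile.

```python
def delayed_latency_ms(base_ms, handler_ms, interrupts_per_s):
    """Simple model: interrupts arrive uniformly at random and stall an
    overlapping request for the full handler duration. Returns the
    probability a request is hit and its latency when hit."""
    # A request is 'hit' if an interrupt starts during it or is already
    # in progress when it begins, i.e., within a window of base + handler.
    window_ms = base_ms + handler_ms
    p_hit = min(1.0, interrupts_per_s * window_ms / 1000.0)
    return p_hit, base_ms + handler_ms

# Illustrative numbers (not from the study): 10 ms requests, 1 interrupt/s.
for name, handler_ms in (("CMCI", 0.775), ("SMI", 133.0)):
    p, lat = delayed_latency_ms(10.0, handler_ms, 1.0)
    print(f"{name}: hit probability {p:.2%}, affected-request latency {lat:.1f} ms")
```

Under these assumed numbers, an SMI hits over 5% of requests, so the 95th-percentile latency itself jumps to roughly the handler duration, while a CMCI hit remains a rare, small perturbation.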


DRAM timing and frequency settings: Characterizing DRAM performance in detail with X-mem also enables new capabilities, such as scaling memory frequency through firmware settings on emerging server platforms.

Machine-learning Models for Soft Heterogeneity

Armed with X-mem and other tools that allow us to accurately sense hardware performance metrics, we have developed machine-learning models that help us understand the relationship between workloads and firmware settings. Our methodology, called FXplore, utilizes performance metrics together with graph algorithms to efficiently search the design space of firmware configurations for a range of workloads. In addition to DRAM settings, it helps determine other key firmware configurations like hardware and adjacent cache-line prefetching, CPU and DRAM turbo boosting, as well as Hyper-Threading. See our CCGrid 16 paper for more details on FXplore.

The figure below summarizes the methodology. Utilizing feature vectors based on hardware performance metrics and OS PMCs, we have trained machine-learning (ML) models that relate workloads to firmware settings. Once trained, we have employed these models to perform cluster assignments at run time for workloads coming into the data center. Our approach has thus helped attain the best data center performance achievable through firmware configuration.
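The run-time step can be sketched as follows. This is a minimal illustration with hypothetical feature names, centroids and configurations (FXplore's actual models are described in the CCGrid 16 paper): each trained cluster carries a centroid over normalized performance metrics and the firmware configuration found best for that cluster, and an incoming workload is assigned to the nearest centroid.

```python
import math

# Hypothetical trained clusters: a centroid over normalized features
# (e.g., cache miss rate, memory bandwidth utilization, IPC) and the
# firmware configuration found best for that cluster during training.
CLUSTERS = [
    {"centroid": (0.9, 0.8, 0.3),
     "config": {"prefetch": "off", "dram_turbo": "on"}},
    {"centroid": (0.2, 0.1, 0.9),
     "config": {"prefetch": "on", "dram_turbo": "off"}},
]

def assign_config(features):
    """Assign an incoming workload to the nearest cluster centroid
    (Euclidean distance) and return that cluster's firmware config."""
    best = min(CLUSTERS, key=lambda c: math.dist(c["centroid"], features))
    return best["config"]

# A memory-bound workload (high miss rate, high bandwidth, low IPC).
print(assign_config((0.85, 0.75, 0.25)))
```

In practice the feature vectors come from PMCs and tools like X-mem, and applying the chosen configuration may require a firmware update and reboot, which is why cluster assignment rather than per-request tuning is the natural granularity.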

With FXplore, we have shown exponential speedups in exploration time. We have also presented methods to find optimal configurations for both on-line operation and co-location of workloads. Overall, we have demonstrated that soft heterogeneity in data centers can improve average run time and energy consumption by 11% and 15%, respectively.



Going Forward

Machine learning has the potential to impact data centers in several ways, and configuring hardware is just one of them. Even in this approach there are different paths we could take (UEFI, BIOS, VMs, OS), each relying on different performance metrics. This project has broken new ground by tuning servers via firmware settings based on PMCs and basic hardware performance metrics. We are continuing to develop more tools like X-mem that enable us to sense useful parameters from other pieces of hardware, like SSDs and network switches. Based on these signals, we intend to develop new AI algorithms to dynamically configure VM settings for maximum concurrency, low interference and high performance. We are building machine-learning algorithms that help us adapt servers to environmental variations (such as humidity, temperature and altitude), enabling us to achieve high performance under stressful conditions. We are also exploring methodologies that allow us to run applications on unreliable hardware by modeling the most likely program and data states based on supervised and unsupervised learning.

Project Timeline

  1. 2016
    • Project title: Modeling performance impact of DRAM error correction in Intel MCA
    • People involved:
      • Mark Gottscho, MSR intern from UCLA
      • Puneet Gupta, remote collaborating professor from UCLA
      • Sriram Govindan, Bikash Sharma and Mike Andrewartha from Azure Cloud Server Infrastructure Product Team
      • Di Wang and Shuayb Zarar from Microsoft Research
  2. 2015
    • Project title: Achieving soft heterogeneity in data centers through firmware reconfiguration
    • People involved:
      • Xin Zhang and Prof. Sherief Reda, remote collaborators from Brown University
      • Shuayb Zarar from Microsoft Research
  3. 2014
    • Project title: Characterizing DRAM performance and variability in warehouse-scale computers
    • People involved:
      • Mark Gottscho, MSR intern from UCLA
      • Sriram Govindan, Bikash Sharma, Mark Santaniello, Badriddine Khessib and Kushagra Vaid from Azure Cloud Server Infrastructure Product Team
      • Jie Liu and Shuayb Zarar from Microsoft Research

Impact Summary

  1. Created novel characterization algorithms and firmware-configuration tools for the Azure Cloud Server Infrastructure product team
  2. Released X-mem source under the MIT license on Microsoft's GitHub organization, along with a Docker image for easy adoption
  3. Developed new business partnership with a DRAM vendor to streamline the memory sourcing and testing process
  4. Disseminated research results through conference papers at ISPASS 16, CAL 16 and CCGrid 16
  5. Filed US and international patents protecting our intellectual property on utilizing AI to tune data center performance