Microsoft Research Lab – Asia

On Engineering the Next Generation of Systems: Learning-Augmented Systems

“The world runs on software.” Our modern life relies on diverse software systems, which provide services including search, shopping, chatting, recommendation, and more. As user requirements and scenarios multiply, these systems continuously grow in scale and complexity. Interestingly, this complexity is reflected not only in the size of the code base, but also in the significant amount of human effort required to design, develop, and operate these systems.

In particular, systems have traditionally been built with explicit rules or heuristics based on human experience and comprehension. However, as modern software systems grow, their complexity exceeds what humans can correctly reason about. Fortunately, recent advances in machine learning (ML) and deep learning (DL) have driven a paradigm shift in how the industry approaches modern software systems. We coin the term learning-augmented systems to describe modern systems whose design and logic sit at the intersection of traditional heuristics and ML/DL techniques. Through the AutoSys project, Microsoft Research Asia has been researching methodologies and tools for systematically building and deploying learning-augmented systems.

The Emergence of Learning-Augmented Systems

Decision-making runs through each stage of system design, development, and operation. These decisions govern how systems should handle application requests under a particular operating environment to satisfy user requirements. Examples of these decisions include system configurations, query plan formulation in databases, routing decisions in networking, job scheduling in data processing clusters, document ranking in search engines, and so on.

Most decisions have traditionally been made with explicit rules or heuristics based on human experience and comprehension. For systems small in scale and complexity, it is relatively easy for engineers to benchmark and understand system behavior. However, modern systems tend to exhibit a complexity beyond what humans can correctly reason about. First, modern systems are designed to be elastic and scalable, especially with frameworks such as Docker and Kubernetes. This results in a degree of system component interaction and dependency higher than previously seen. Second, modern systems tend to have a large number of configuration knobs and environmental variables, which can relate non-linearly to system behavior. This results in a non-linear decision space in which humans cannot be confident that their choices are optimal.

Recently, machine learning (ML) and deep learning (DL) have driven a paradigm shift towards learning-augmented system design, development, and operation. As ML and DL excel at learning complex data patterns, they hold the potential to fill the gap in comprehending the complexity of modern systems operating in highly dynamic environments. In addition, since most software systems already extensively log performance and behavioral metrics, these logs can be readily consumed by ML/DL techniques and models.

Over the last few years, the Systems Research Group at Microsoft Research Asia has collaborated with various product groups (e.g., Azure, Bing, Bing Ads, and Office) on designing, building, and deploying learning-augmented systems. The rest of this article introduces our insights and results so far.

The Challenges for Learning-Augmented Systems

From our experience in production, we observe that learning-augmented systems have long been considered difficult to build, requiring a team of engineers and data scientists. In fact, implementing existing ML/DL algorithms is only one step in the process. Challenges arise from the fundamental problem that current software systems are neither inherently learning-aware nor learnable.

First, being stochastic in nature, ML/DL inference carries some degree of uncertainty. This uncertainty can lead to learning-induced system failures or suboptimality, for example when models are not sufficiently trained to match the complexity of system component interactions and non-trivial system knobs. While failures are not unique to learning-augmented systems, handling them requires a different approach, for two reasons. On one hand, while ensuring correctness has traditionally been a process of reviewing a readable code base written by humans, most ML/DL models are not easily interpretable. On the other hand, because models autonomously learn from training data, it is difficult for system operators to determine the exact changes to the model structure or training dataset needed to correct a suboptimal prediction.
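
One common pattern for bounding this uncertainty is to gate the learned model behind a confidence check and fall back to the existing human-written heuristic otherwise. The sketch below illustrates the idea in Python; all names here (`toy_model`, `heuristic_config`, the confidence field) are illustrative assumptions, not part of AutoSys's actual API.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    config: dict
    confidence: float  # model's self-reported certainty in [0, 1]

def heuristic_config(workload: dict) -> dict:
    # The pre-existing hand-written rule: scale threads with request rate.
    return {"threads": max(4, workload["rps"] // 1000)}

def choose_config(predict, workload: dict, threshold: float = 0.8) -> dict:
    """Trust the learned model only when it is confident enough; otherwise
    fall back to the heuristic to bound learning-induced failures."""
    p = predict(workload)
    return p.config if p.confidence >= threshold else heuristic_config(workload)

# A toy "model" that is only confident on workloads resembling its training data.
def toy_model(workload: dict) -> Prediction:
    familiar = workload["rps"] <= 10_000
    return Prediction({"threads": 16}, 0.95 if familiar else 0.3)

print(choose_config(toy_model, {"rps": 5_000}))   # learned configuration wins
print(choose_config(toy_model, {"rps": 50_000}))  # falls back to the heuristic
```

The design choice is that the heuristic, however suboptimal, remains the well-understood safety net; the model only replaces it when its own uncertainty estimate is low.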

Second, modern systems can be highly elastic and dynamic over time, but continuously and efficiently adapting learned knowledge remains a challenge. For simplicity, system engineers typically view an entire system deployment as one black box and model it with one monolithic model. However, this approach is fragile because deployment changes can invalidate modeling assumptions. Interestingly, such changes not only imply model re-training to cope with data distribution drift, but can also necessitate a complete model re-design in some cases.
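
One lightweight way to notice that deployment changes have invalidated a model is to monitor live measurements for distribution drift and trigger re-training when it appears. The toy sketch below uses a simple z-score on the feature mean; the names and threshold are illustrative assumptions, and a production system would use a proper statistical test (e.g., Kolmogorov–Smirnov) over many features.

```python
import statistics

def drift_detected(train_sample, live_sample, z_threshold=3.0):
    """Flag model re-training when the live feature mean drifts far
    from the training distribution (simple z-score check)."""
    mu = statistics.mean(train_sample)
    sigma = statistics.pstdev(train_sample) or 1e-9  # avoid division by zero
    z = abs(statistics.mean(live_sample) - mu) / sigma
    return z > z_threshold

# Training-time latencies (ms) clustered around ~104 ms.
train = [100 + (i % 10) for i in range(1000)]

print(drift_detected(train, [105, 103, 106, 104]))  # same regime: False
print(drift_detected(train, [400, 420, 390, 410]))  # deployment changed: True
```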

Third, modern systems expose programming interfaces for functional needs, rather than for abstracting system measurements and logs in a way that facilitates learning. This hinders the development of learning-augmented systems in two respects. First, many types of system feedback are not natively learnable, e.g., stack traces and core dumps. Second, while model training requires well-structured datasets of controlled actuations and observed feedback, such datasets are typically difficult to construct for modern systems. One example is how caches need to be properly warmed up before system behavior stabilizes: starting to log prematurely can render measurements irrelevant, while starting over-conservatively wastes time.
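
The warm-up example can be handled by gating logging on a stability check rather than a fixed timer: start recording only once recent measurements stop moving. The sketch below is a hypothetical illustration (the window and tolerance values are assumptions), not a prescription from the AutoSys work.

```python
def warmed_up(latencies, window=5, tolerance=0.05):
    """Consider the system warmed up once the last `window` measurements
    stay within `tolerance` relative spread."""
    if len(latencies) < window:
        return False
    recent = latencies[-window:]
    lo, hi = min(recent), max(recent)
    return (hi - lo) / hi <= tolerance

# Latency samples (ms): cache warm-up transient, then steady state.
samples = [900, 400, 200, 120, 101, 100, 102, 101, 100]
for i in range(1, len(samples) + 1):
    if warmed_up(samples[:i]):
        print(f"start logging after sample {i}")
        break
```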

AutoSys: Systematically Designing and Building Learning-Augmented Systems

In light of the challenges described above, the Systems Research Group at Microsoft Research Asia has been researching methodologies and tools for systematically building and deploying learning-augmented systems [1]. One of the research results is the AutoSys framework [2], illustrated in the figure below. On top of off-the-shelf ML/DL tools and infrastructures, AutoSys organizes components along three dimensions: techniques for exploring system behavior, system abstractions for continuous learning, and tools for handling learning-induced system failures.

In addition to fundamental research efforts, we have been collaborating with various product groups to deploy learning-augmented systems. An example of such long-term collaborations is auto-tuning system configuration parameters. Specifically, in collaboration with Bing Ads [3], we reduced key-value store cluster latencies by more than 20%; compared to manual tuning by senior engineers, AutoSys also reduced the typical tuning time from weeks to days. Furthermore, in collaboration with Bing [2], compared to manual tuning by senior engineers, AutoSys reduced the search engine’s index lookup latencies by 33.5% and improved ranking relevance by 3.4%. The benefit of auto-tuning was likewise reflected in the reduced tuning time.
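
At its core, configuration auto-tuning searches a space of knob settings for the one that minimizes a measured objective. The toy sketch below uses exhaustive grid search against a fake benchmark; it is an illustrative stand-in only, since production tuners such as those cited above use far smarter search strategies (e.g., Bayesian optimization) because every real benchmark run is expensive.

```python
from itertools import product

def auto_tune(measure, search_space):
    """Exhaustively benchmark every knob combination and keep the best.
    Feasible only for toy spaces; real tuners sample the space instead."""
    keys = list(search_space)
    best_cfg, best_latency = None, float("inf")
    for values in product(*(search_space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        latency = measure(cfg)
        if latency < best_latency:
            best_cfg, best_latency = cfg, latency
    return best_cfg, best_latency

# Fake benchmark with a non-linear optimum at cache=256 MB, threads=16.
def fake_benchmark(cfg):
    return abs(cfg["cache_mb"] - 256) / 32 + abs(cfg["threads"] - 16)

space = {"cache_mb": [64, 128, 256, 512], "threads": [4, 8, 16, 32]}
best, latency = auto_tune(fake_benchmark, space)
print(best, latency)
```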

Looking forward, Microsoft Research Asia will continue to invest research efforts in learning-augmented systems. We will also work to establish stronger ties with industry and research communities.

References

[1] The Case for Learning-and-System Co-design. Chieh-Jan Mike Liang, Hui Xue, Mao Yang, Lidong Zhou. ACM SIGOPS Operating Systems Review ’19.

https://www.microsoft.com/en-us/research/publication/the-case-for-learning-and-system-co-design/

[2] AutoSys: The Design and Operation of Learning-Augmented Systems. Chieh-Jan Mike Liang, Hui Xue, Mao Yang, Lidong Zhou, Lifei Zhu, Zhao Lucis Li, Zibo Wang, Qi Chen, Quanlu Zhang, Chuanjie Liu, Wenjun Dai. USENIX ATC ’20.

https://www.microsoft.com/en-us/research/publication/autosys-the-design-and-operation-of-learning-augmented-systems/

[3] Metis: Robustly Optimizing Tail Latencies of Cloud Systems. Zhao Lucis Li, Chieh-Jan Mike Liang, Wenjia He, Lianjie Zhu, Wenjun Dai, Jin Jiang, Guangzhong Sun. USENIX ATC ’18.

https://www.microsoft.com/en-us/research/publication/metis-robustly-tuning-tail-latencies-cloud-systems/

[4] Accelerating Rule-matching Systems with Learned Rankers. Zhao Lucis Li, Chieh-Jan Mike Liang, Wei Bai, Qiming Zheng, Yongqiang Xiong, Guangzhong Sun. USENIX ATC ’19.

https://www.microsoft.com/en-us/research/publication/accelerating-rule-matching-systems-with-learned-rankers/