Analyzing Complex Systems via Machine Learning

Published May 9, 2006


By Rob Knies, Managing Editor, Microsoft Research

Moises Goldszmidt, well known for his research in machine learning, joined Microsoft Research’s Silicon Valley lab as a principal researcher in January 2006 after spending four years with the Utility Infrastructure Management Department of Hewlett-Packard Labs. A graduate of Universidad Simón Bolívar in Caracas, Venezuela, he received his master’s degree in electrical engineering from the University of California, Santa Barbara, and his Ph.D. in computer science from UCLA. Earlier in his career, he worked for the Rockwell Science Center, SRI International, Stanford University, and Peakstone Corp.

Roy Levin, Microsoft distinguished engineer and director of Microsoft Research Silicon Valley, believes Goldszmidt, 45, will provide singular benefit to his lab and the company as a whole.

“Moises has some great ideas,” Levin said, “for applying machine-learning techniques and technology to understand how large, complex computer systems perform, which I think will be very beneficial for Microsoft.”

A few months into his Microsoft career, Goldszmidt, a tenor-saxophone enthusiast (jazz, Latin, salsa), took time to field a few questions about his career and his research interests:

Q: Let’s discuss your career thus far.

Goldszmidt: I’ve always been interested in how to manage uncertainty and make optimal decisions. And that has led me to the study of efficient representations of uncertainty and to the exploration of effective inference procedures. My objective during my doctoral studies was to try to understand how we make decisions in everyday life. I wanted to find an empirical connection between the way we make decisions and the data we’re recording about our own activities in real life. And I became interested in automated diagnosis. So I started at Rockwell Science Center to study how to characterize the diagnostic process and how to improve the automated diagnostic process of any system.

Soon, I confirmed that one of the most important problems was how to represent and manage uncertainty in order to model the environment in which you’re making these decisions. So I started to investigate automatic means of discovering these models, all based on trying to find a probability distribution that would characterize that uncertainty. I went on that path for some time at SRI International, where I did a lot of research on pattern recognition and machine learning.

During the startup craze, a colleague and I started to investigate how to apply methods based on pattern recognition and machine learning to model the performance of computer systems and networks. At that time, we were trying to guarantee quality of service on the Web. I suddenly discovered that there’s virgin territory in trying to apply the discipline of optimal decision-making in the area of distributed systems. At the same time, distributed systems were growing both in scale and complexity, which resulted in the services and service-oriented architectures that we have today. I discovered that there were a lot of things that were missing from being able to manage these services optimally. One of the most important was, if possible, a closed-form model of the behavior of our systems and the behavior of the users. We needed those models in order to make decisions. The question we posed was: Can we automatically and unobtrusively discover those models? The approach was based on pattern-recognition and machine-learning techniques. Within a year, we had a prototype for estimating the capacity of e-commerce Web sites and then using these estimates for making quality-of-service decisions.

The startup experience gave me valuable knowledge about taking ideas into products and produced a set of research questions I was eager to investigate. There are many trade-offs in how to embed these techniques at the lowest level of abstraction in the machinery, perhaps as a service themselves: What should the data-gathering process be? Can it be distributed? How much data would one need? How invasive are these methods in our systems? What’s the trade-off between the accuracy I need and how long these models would live? Can I adapt these models fast enough? I wanted to explore those trade-offs deeper, and that’s what led me to HP, to do research on these trade-offs and the possibility of tech transfer to the business units that produce management tools.

What attracted me to Microsoft was the fact that Microsoft is one of very few companies that have those distributed and networked services “alive and running”: Hotmail®, MSN®, and Windows Live™. This would enable me to increase the scale and impact of this line of research and to validate the results.

Q: It sounds like your approach is across-the-board. Do you have specific computer-science interests that you pursue?

Goldszmidt: This line of research touches on various subdisciplines of computer science. It has the potential to influence the way we characterize performance in distributed systems. It also has the potential to enhance the set of tools that we use to understand their behavior and the behavior of users. There are many questions from the algorithmic and theoretical points of view: Can we achieve a consistent belief state from a distributed collection of data? What form and shape does a distributed decision-making algorithm take, especially one that performs diagnosis and forecasting? Can we characterize its online behavior? What are the “best” models to induce, and how do we characterize “best”?

There are system-design issues: If wildly successful, should we embed some of this machine learning and pattern recognition at the operating-systems level? How do we change the design of the architectures to facilitate the data collection and decision-making?

Even artificial intelligence may benefit, because this line of inquiry has the potential to establish a lush playground of autonomous decision-makers—clusters of servers—with different levels of intelligence and power.

Finally, the application of machine learning and pattern recognition to the domain of large distributed networked systems is unlike any other domain. A lot of things happen in a very short time. Thus, there is an abundance of data for inducing even fairly complicated models. Yet this is a double-edged sword. Do patterns change fast enough to challenge current assumptions about stationarity? We will see.

Q: When did you first encounter Microsoft Research?

Goldszmidt: Since my grad-student days at UCLA, I’ve known three pillars of Microsoft Research in the areas of decision-making, uncertainty, and machine learning: Jack Breese, David Heckerman, and Eric Horvitz. Jack and Eric actually hired me at Rockwell for my first job out of grad school. After they then went to Microsoft Research, I kept in touch with both of them. Because of my interest in pattern recognition and machine learning and my grounding in probability, I have also had many interactions with David. I’ve been visiting Microsoft Research to give talks throughout my career.

Q: How have you found Microsoft Research since joining, and how is it different from your HP experience?

Goldszmidt: Microsoft is providing me with an opportunity to experience, firsthand, large-scale distributed systems. At HP, I was one step removed from that. This firsthand experience is fundamental to advancing my research objectives. Also, and this is a testament to the great relationship between Microsoft Research and the rest of Microsoft, I have had no problems in getting access to the systems and the transactional data that these systems produce. There is an innate trust that if you are working for Microsoft, you will do the right thing in terms of both advancing the state of the art and helping Microsoft make the best products possible. Again, I am thankful for the successful transfers that are enabling this fruitful relationship. Colleagues have been fantastic. For any question, anything outside my domain, I find within arm’s reach somebody who has the answer and is willing to give it to me. My only problem is time. I need a couple more hours a day and two more days a week!

Q: Let’s talk about taking machine-learning approaches to management of large, complex systems. What sort of approaches do you take? What are you doing that’s new and different, and how is your work developing?

Goldszmidt: The approach I am taking is grounded in sound probabilistic models. As we are still in an exploratory phase, these are valuable in gaining understanding and insight. In particular, I rely on Bayesian networks to effectively and efficiently represent these models. Also, as these models may be the basis for automated decision-making, I try very hard not to ignore the interface to the human experts. The models must be interpretable, modifiable, and verifiable by human beings. For a long time, Bayesian networks have been known to offer these qualities.

I am also focused on managing and engineering the trade-off between the accuracy of these models and their complexity, both in terms of the amount of data needed for induction and in terms of the qualities mentioned above. Thus, I am aware that, depending on the task at hand, such as diagnostics or forecasting, one may only need to find a rank ordering between different alternatives and not actual, precise probabilities. Other trade-off factors that my research takes into consideration with respect to these models are the ability to adapt, their impact on performance, and how intrusive the mechanisms for data collection and inference are. For example, in previous work, I have modeled quality of service as a two-state proposition: either the system is in compliance with a given performance objective or not. This strategy enabled me to apply a relatively simple but robust set of pattern-recognition models called classifiers, which were then used to find which system metrics were correlated with states that violated the quality-of-service objectives. The price paid for applying these models was that they could not differentiate between “severe” and “benign” violation states. Yet the benefits of simplicity and efficiency paid off.
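The two-state idea described above can be illustrated with a minimal sketch. It is not the method used in the work Goldszmidt mentions, which the interview does not specify beyond "classifiers". As an assumption made here for illustration, a single-threshold classifier (a decision stump) is fit to each metric separately, and metrics are ranked by how well they separate "compliant" from "violation" samples. The metric names and values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def balanced_accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Mean of per-class recall, so a rare violation state still counts."""
    pos = sum(1 for a in actual if a)
    neg = len(actual) - pos
    if pos == 0 or neg == 0:
        return 0.5  # only one state observed: no better than chance
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return 0.5 * (tp / pos + tn / neg)


def best_stump(values: Sequence[float], violated: Sequence[bool]) -> Tuple[float, float]:
    """Return (score, threshold) of the best single-threshold classifier."""
    best = (0.5, float("nan"))
    for threshold in sorted(set(values)):
        above = [v >= threshold for v in values]
        score = balanced_accuracy(above, violated)
        # A metric may indicate a violation when it is low rather than high;
        # flipping the prediction turns a score s into 1 - s.
        score = max(score, 1.0 - score)
        if score > best[0]:
            best = (score, threshold)
    return best


def rank_metrics(metrics: Dict[str, List[float]],
                 violated: List[bool]) -> List[Tuple[str, float]]:
    """Rank metrics by how well each one alone predicts the violation state."""
    scored = [(name, best_stump(vals, violated)[0]) for name, vals in metrics.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical samples: True means the performance objective was violated.
violated = [False, False, True, True, False, True]
metrics = {
    "cpu_utilization": [20.0, 35.0, 90.0, 95.0, 30.0, 88.0],
    "disk_queue": [50.0, 52.0, 49.0, 51.0, 50.0, 48.0],
}
ranking = rank_metrics(metrics, violated)
```

Here `ranking` lists the hypothetical `cpu_utilization` metric first, because a single threshold on it separates the two states perfectly. As the interview notes, a binary label of this kind cannot distinguish a severe violation from a benign one; the output is a rank ordering of candidate metrics, not a calibrated probability.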

A number of colleagues and I have had several initial successes, both in producing tools and in having publications accepted for events such as SOSP [the Association for Computing Machinery (ACM) Symposium on Operating Systems Principles] and OSDI [Symposium on Operating Systems Design and Implementation]. It’s still early to say whether these are the methods we should be using, but we’re getting some results, and they’re attracting attention. It’s a worthy endeavor. I think it’s the only way to manage these large systems. It is, nevertheless, a win-win situation. Suppose they don’t work; well, now Microsoft knows, and knows before anybody else, and we have to look for better alternatives, including deep changes in design and implementation. And that’s the worst-case scenario. In the best-case scenario, our services are going to be the best in the world, and everybody is going to come to us.

Q: You are the co-chair of the upcoming SysML conference, the First Workshop on Tackling Computer Systems Problems with Machine Learning Techniques. Talk about that event and what you hope to accomplish.

Goldszmidt: The agenda I am pushing is interdisciplinary in nature. It requires at least researchers from machine learning and systems. I firmly believe that unless researchers in these fields learn about each other’s “crafts” and “aesthetics,” no deep progress will be made. The objective of the workshop is to get researchers from both fields to engage in a dialog that will establish a solid basis for impactful research. We have a strong program committee with recognized experts from both fields and a promising set of papers. My co-chair, Emre Kiciman, also from Microsoft Research, and I have designed a program with lots of time for discussions and interactions, hoping to facilitate a meaningful exchange of ideas. The workshop is under the umbrella of the ACM SIGMETRICS conference, the premier conference on performance-evaluation methods.

In 2003, I co-chaired an ACM workshop on self-managing systems that evolved into an annual conference, the International Conference on Autonomic Computing. Let’s see where this one goes.

Q: What would a successful career at Microsoft Research look like at its conclusion?

Goldszmidt: I, of course, would like to advance the state of the art in computer science and, in the process, ensure that Microsoft products and services are No. 1 in the world. Having said that, I think I would be really happy if my research can at least offer some new insights and solid foundations to build on. On a more selfish plane, I get a great amount of joy from learning and exploring new paths to increase my understanding of computer science in particular and science in general. In Microsoft Research, I can find the top researchers in every single area of computer science and related disciplines, including mathematics, engineering, and even cognitive science. It seems to me that I came to the right place.
