Auto ML and the future of self-managing networks with Dr. Behnaz Arzani

Published March 18, 2020

Episode 111 | March 18, 2020

Dr. Behnaz Arzani is a senior researcher in the Mobility and Networking group at MSR, and she feels your pain. At least, that is, if you’re a network operator trying to troubleshoot an incident in a datacenter. Her research is all about getting networks to manage themselves, so your life is as pain-free as possible.

On today’s podcast, Dr. Arzani tells us why it’s so hard to identify and resolve networking problems and then explains how content-aware, or domain-customized, auto ML frameworks might help. She also tells us what she means when she says she wants to get humans out of the loop, and reveals how a competitive streak and a comment from her high school principal set her on the path to a career in high tech research.

Transcript

Behnaz Arzani: Humans are great at innovating and building stuff, but when it comes to figuring out what went wrong, and how it went wrong, and fixing things, it’s much better to have automation do that than humans do that because we take our sweet time with things. And we also don’t have the mental power to process so much data that’s out there all at once. Machines are much better at doing that.

Host: You’re listening to the Microsoft Research Podcast, a show that brings you closer to the cutting-edge of technology research and the scientists behind it. I’m your host, Gretchen Huizinga.

Host: Dr. Behnaz Arzani is a senior researcher in the Mobility and Networking group at MSR, and she feels your pain. At least, that is, if you’re a network operator trying to troubleshoot an incident in a datacenter. Her research is all about getting networks to manage themselves, so your life is as pain-free as possible.

On today’s podcast, Dr. Arzani tells us why it’s so hard to identify and resolve networking problems and then explains how content-aware, or domain-customized, auto ML frameworks might help. She also tells us what she means when she says she wants to get humans out of the loop, and reveals how a competitive streak and a comment from her high school principal set her on the path to a career in high tech research. That and much more on this episode of the Microsoft Research Podcast.

Host: Behnaz Arzani, welcome to the podcast.

Behnaz Arzani: Hey, thanks for having me.

Host: I like to start situating.

Behnaz Arzani: Mm-hmm.

Host: And you’re a senior researcher at MSR and you work in Mobility and Networking.

Behnaz Arzani: Right.

Host: But that falls under the bigger umbrella of Systems and Networking.

Behnaz Arzani: Mm-hmm.

Host: So to kick off our conversation, give us the elevator pitch version of what you all are up to. What’s the big goal of Mobility and Networking and how does it fit into the broader ecosystem of Systems and Networking?

Behnaz Arzani: Right. So I guess Mobility and Networking, as the name suggests, goes into two parts of mobility and networking. So half of our groups are doing things like IoT research, edge research, things that have to do with mobile phones and things like that, or, like, devices. And what people like me do are more of the networking aspects. So every distributed system that is out there has an underlying network, and so our job is to try to figure out how to operate those networks properly and how to make those networks work the best way possible.

Host: So what would you say the big, audacious goal of Systems and Networking or Mobility and Networking is, sort of, writ large?

Behnaz Arzani: So I think every person you ask is going to give you a different answer on that. My particular take on this is that we’re the infrastructure behind a lot of the systems that you see out there, from the web that you access to the storage systems that you use to everything else, and so our job is to make this as seamless as possible. So you shouldn’t even know that these networks are there. You should just use it and expect that they work properly.

Host: Well let’s talk about what gets you up in the morning…

Behnaz Arzani: Mm-hmm.

Host: …and where you situate yourself as a researcher. What’s your big goal, Behnaz, as a scientist, and what do you want to be known for at the end of your career?

Behnaz Arzani: Yeah, so, when I think about this – and this was mainly after doing many, many internships in Azure Networking and seeing what operators have to deal with every day – and to me it seems like the worthy goal, or what I really want to achieve, is something where the life of an operator, a network operator, is as painless as possible because it can get painful on days. Especially if there is, you know, something broke and they have to figure out what happened. It can be a nightmare and what I would like to see is that they don’t have to do that.

Host: And we will get into how you’re going about that shortly. Let’s start with one of the things that is kind of interesting to me. Many people I’ve talked to on this podcast emphasize the importance of keeping humans in the loop.

Behnaz Arzani: Mm-hmm.

Host: But you suggest, in some ways, and for some problems, we actually need to get humans out of the loop or at least you question why, after so many years, you still have so many humans trying to figure so much of this all out. So when you say, get humans out of the loop, what do you mean, and then how does it play out in the work you’re doing?

Behnaz Arzani: Right. So I think, depends on what do you get the humans out of the loop of. I think, to me, humans are great at innovating and building stuff, but when it comes to figuring out what went wrong, and how it went wrong, and fixing things, it’s much better to have automation do that than humans do that because we take our sweet time with things. And we also don’t have the mental power to process so much data that’s out there all at once. Machines are much better at doing that. And what I keep seeing is that we as humans are very, very inefficient and what that causes is that our customers, often, are in pain while humans are trying to figure things out like that, and so why do we have to do that after so many years of having networks out there and I think it’s because this particular problem is just a really, really hard problem to solve. And so I find that both exciting and hard and so it’s a challenge that’s worth pursuing.

Host: Hasn’t it gotten more complicated? Rather than saying, you know, we should have had this figured out by now it’s like well, the internet threw us a lot more problems. The cloud has thrown us a bigger problem. How would you answer that?

Behnaz Arzani: A lot of this has to do with scale. So we just have more and more of things.

Host: Right.

Behnaz Arzani: And the bigger things are, the harder they are to handle, but also our processing capabilities have also increased so that’s one piece of good news. The other thing is, like, if you think of things like cloud, yes, they did like throw us a curve ball in introducing something new, but also they added a little bit of structure. So when you think about like a cloud network, it’s much more symmetric and much easier to reason about compared to something like the internet, which is just a hodge-podge of devices connected to each other with arbitrary topologies. Like, if you look at the stuff we did in 007, for example, what we used, really, was the fixed structure and the nice structure that cloud networks actually have.

Host: So that actually helped.

Behnaz Arzani: That actually helps, yeah.

Host: Interesting. Okay, because, you know, you just think the bigger it is, the more messy it is, but you’re actually saying it’s added a layer of structure…

Behnaz Arzani: Yeah.

Host: …to help iron out some of the problems. All right, well let’s talk a little bit more in detail about those kinds of problems. Data center diagnosis is hard. There’s lots of incidents, lots of different kinds of incidents, incidents with whole life cycles… Why is this so hard and what are some specific research projects you have going to make it not so hard?

Behnaz Arzani: Yeah, so if you go back to what I was saying earlier, like, the network is really the underlying infrastructure of a lot of distributed systems. So there’s a lot of dependency on a well-functioning network, but the problem is also that when something goes wrong, how do you know if it’s the network that’s problematic or if there’s other layers of the infrastructure that may be problematic?

Host: Mm-hmm.

Behnaz Arzani: A very simple example of this is, all of the VMs we operate in Azure are dependent on our storage systems, for example, because they have a virtual hard drive that has access to those storage systems. So they have to go over the network, but there’s also that storage system itself that can fail. It’s also the VM virtual hard disk that might fail. Like there’s a lot of different failure scenarios. The host might fail. The server might fail. Everything. And so what we see often is that, well the first step is to just locate who the expert is that needs to look at this problem. And often it’s the case that because there’s so many different levels of expertise, like the storage people, the storage operator knows really well how storage works, but he may not know anything about how a network works, right?

Host: Right.

Behnaz Arzani: And so he doesn’t know how to look at this network and to determine whether this network is healthy or not.

Host: Mm-hmm.

Behnaz Arzani: Right? So you really need a network operator to get engaged at that point, but at the same time, you need to first know that you need the network operator to engage at that point. It’s kind of like a chicken and egg problem.

Host: Right. You don’t know what you don’t know.

Behnaz Arzani: Yeah, so the projects we’re working on right now, in the case of the storage example that I gave, I think in 2016, we had a NetPoirot project that dealt with that. Right now, we are looking at a project called Scouts, which, its goal is basically to say, if each individual team provides an abstraction that basically says, is it likely that my team is probably going to be responsible for this problem? So my expertise is needed, or not?

Host: Yeah.

Behnaz Arzani: That way, at least, a storage operator, when it sees that the storage system is failing can know, oh, I need a networking expert…

Host: Mm-hmm.

Behnaz Arzani: …or I need a host expert, an SDN expert. What type of expert do I need to help me…

Host: Right.

Behnaz Arzani: …with figuring this out?

Host: So, the problem upstream is diagnosing where the problem is, and you want to do that quickly.

Behnaz Arzani: Right.

Host: So that you can address the issue and the problem has been that with a human, it takes way too long to even figure out who to blame.

Behnaz Arzani: Right.

Host: So what is it that machines can do to help us out here?

Behnaz Arzani: I think the observation, at least we had, and there’s a lot of work still remaining to be done, but the observation is well, we see enough examples as, like, if I’m the networking team in Microsoft, we’ve seen examples of failures happen in the past and we collect a lot of data from our own infrastructure. So the idea is, can we learn from past failures whether this is probably going to be caused by a networking problem or a physical networking problem, for example, and basically use machine learning to identify whether this problem is likely due to this team’s infrastructure failing.

Host: So do these failures present themselves in a certain way that would be a pattern detection thing that would be really good for machines to work on?

Behnaz Arzani: In certain cases, yes. So, for example, in the case of physical networking that turns out to be really true. It’s more complicated when you have… for example, something like a software load balancer is a lot more complicated because it has a lot more dependencies and its failures are also more complex.

Host: Mm-hmm.

Behnaz Arzani: So for certain teams this is easier, but the nice thing is also that for these teams these are often the first ones to get blamed anyway because all of the teams depend on them.

Host: Right.

Behnaz Arzani: So it’s kind of like a win-win situation. You might want to build similar things for the teams that you can build this for…

Host: Right.

Behnaz Arzani: …and then hope that this would simplify the problem to an extent that it makes the life of operators easier.
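The per-team idea described above, learning from past incidents whether a new one is likely to be a given team's responsibility, can be sketched in a few lines. This is only an illustration under stated assumptions: the transcript does not say which model or features the team used, so a simple nearest-centroid rule stands in for the learned model, and the two features (packet drop rate and link flaps) and all names such as `train_scout` are hypothetical.

```python
from math import dist


def centroid(rows):
    """Average each feature column across equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_scout(past_incidents):
    """past_incidents: list of (features, team_was_responsible) pairs."""
    ours = [f for f, responsible in past_incidents if responsible]
    theirs = [f for f, responsible in past_incidents if not responsible]
    return {"ours": centroid(ours), "theirs": centroid(theirs)}


def scout_says_ours(model, features):
    """True if the incident looks more like past incidents this team caused."""
    return dist(features, model["ours"]) < dist(features, model["theirs"])


# Hypothetical features: [packet_drop_rate_pct, link_flaps_last_hour]
history = [
    ([4.0, 6.0], True),   # past incidents caused by physical networking
    ([5.0, 8.0], True),
    ([0.1, 0.0], False),  # past incidents caused by some other team
    ([0.3, 1.0], False),
]
physnet_scout = train_scout(history)

print(scout_says_ours(physnet_scout, [3.5, 5.0]))  # resembles past network failures
print(scout_says_ours(physnet_scout, [0.2, 0.0]))  # network looks healthy
```

A storage operator could, in principle, ask each team's scout the same question and call only the teams whose scout answers yes.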

Host: Okay. Well, talk a little bit more about the lifecycle of a problem.

Behnaz Arzani: Mm-hmm.

Host: Or an incident we’ll call it, because we all recognize there are incidents that are going to happen and there will be a lot of them…

Behnaz Arzani: Right.

Host: …when you have this giant scale. What do you mean when you say lifecycle of an incident?

Behnaz Arzani: Well, so an incident starts when some monitoring infrastructure picked up that some anomaly is happening, that something is not operating as it should.

Host: Okay.

Behnaz Arzani: And so an incident is created. A lot of the times we also have automation that goes and checks and knows how to fix it, so that’s the good case.

Host: Sure.

Behnaz Arzani: That’s like the best-case scenario. But in some cases, when automation also fails to solve the problem, we have humans that are called, basically, to try to resolve it. And basically, the first step that that human takes is to figure out who to call to help, and also they get together and try to figure out, okay, which part of the system went wrong, how do we fix it? And the first step is actually mitigating the problem, meaning, for example, if I have a software load balancer that’s problematic, I’ll redirect all of my traffic to a different software load balancer while I figure out what’s going on with this load balancer, right? And then they go proceed to fix it and resolve the issue.

Host: I’m having a visual of an ER doc… You know, you triage and you say, you know, is he breathing? Is he bleeding? Start one, stop the other.

Behnaz Arzani: Right.

Host: And then we can move on to what’s really the problem.

Behnaz Arzani: Exactly. Yeah, that’s basically what happens.
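The incident lifecycle walked through above can be written down as a small sequence of stages. This is a rough sketch, not a description of any real incident-management system: the stage names are hypothetical labels for the steps mentioned in the conversation (detection, automated remediation, human triage, mitigation, resolution).

```python
from enum import Enum, auto


class Stage(Enum):
    DETECTED = auto()          # monitoring picks up an anomaly, incident is created
    AUTO_REMEDIATION = auto()  # automation checks and tries to fix it
    TRIAGE = auto()            # humans work out which team needs to look
    MITIGATED = auto()         # e.g. traffic redirected to a healthy load balancer
    RESOLVED = auto()          # the underlying issue is fixed


def lifecycle(automation_fixed_it):
    """Return the stages an incident passes through, in order."""
    stages = [Stage.DETECTED, Stage.AUTO_REMEDIATION]
    if not automation_fixed_it:
        # Only when automation fails do humans get called in.
        stages += [Stage.TRIAGE, Stage.MITIGATED]
    stages.append(Stage.RESOLVED)
    return stages


print([stage.name for stage in lifecycle(automation_fixed_it=True)])
print([stage.name for stage in lifecycle(automation_fixed_it=False)])
```

The triage stage is the one the Scout work targets, since that is where time is lost deciding who to call.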

(music plays)

Host: Well, let’s talk a little bit more about automation for a second and, and this trend towards auto ML, or automated machine learning. And it’s one line of research that seems really promising, and there’s some specific branches of it. You refer to them as content-aware ML, or domain-customized auto ML frameworks.

Behnaz Arzani: Right.

Host: So talk somewhat generally about the work that’s going on in ML and then tell us how you’re instantiating it in the world of networks and distributed systems.

Behnaz Arzani: Right. I mean, I think that this came up when I was the only one in our group which knew a little bit about networking and machine learning and I had thirty different teams in Azure asking me to build the machine learning model that does something, whatever that was. And it felt like the pattern that I was going through each time was very, very similar. And so it felt like I should be able to replicate my brain somehow so that like I’m not needed in that process. And I didn’t know at the time, when I researched it, I found that auto ML is actually a thing in the machine learning communities. I didn’t know that! And then when I looked at those, what I found is that a lot of them try to do anything and everything, or they’re customized to domains that are very, very popular. Things like video analytics, like natural language processing, things like that are always needed, not necessarily something for networking. So my friend and I, Bita Rouhani from Doug Burger’s group, started to look at well, what happens if you just dump networking data into these systems? Like, just let’s see how well they do. And they did it abysmally bad. The state-of-the-art was like terrible. And so we looked at it and said okay, why is that the case? And what we found was that, well, there’s simple domain customizations that we could do, even on the input. Not anything to the machine learning, but just how we present the data that would significantly boost their accuracy. And so the idea was well, actually, operators are really good at that part. Like they really know their data. They really know things about the data that the auto ML frameworks don’t know. So is there a way to bridge this gap? Is there a way to provide that domain knowledge without him knowing anything about ML? Maybe like somehow the auto ML framework knows what information it needs and queries for that information from the user and the user provides that information and then we use that to generate a more customized ML model as part of those auto ML frameworks.

Host: So this sounds a lot like Patrice Simard’s work in machine teaching, which is similar to this domain-specific ML, right?

Behnaz Arzani: Right. I mean, it’s similar and yet different. I think the nice thing about networking is actually, even though the types of problems we tackle are very, very diverse, they fall into a very limited set of categories. Things like congestion control, like diagnosis, like traffic engineering, like I can count them on my hand how many broad problem topics…

Host: Sure.

Behnaz Arzani: …we tackle. And so because of that, it’s much easier to provide a networking-specific abstraction for these systems than it is for any generic problem. And again, like for example, a network has a specific structure. You always have underlying topology. Like there are things we know, right? Where, for generic problems, we might not know those specific topics and I think that’s where, like, our take on the problem is different in the sense that we want to exploit network-specific domains that you can quantify almost, right? Like you can use a structure for them as opposed to like a generic problem.

Host: So you’re providing a domain expert in networking with machine learning tools and they don’t necessarily have to be a machine learning expert to be able to use these tools.

Behnaz Arzani: Right.

Host: To make the whole thing happen.

Behnaz Arzani: Yeah. And I might, I should preface that that we don’t know how to do this. Like I’m just like giving you the idea of, this is what we want to do. We don’t know how to do it yet!

Host: Okay. Go in there a little bit. You don’t know how to do this, so this is a like…

Behnaz Arzani: An idea.

Host: Okay… Where are you with the idea? How far have you pushed on it?

Behnaz Arzani: So what we did initially was just to verify this hypothesis that domain knowledge actually helps auto ML systems and we were successfully able to demonstrate that. What we’re doing now is take one specific area in networking that’s very, very well-structured, but yet rich in problems, specifically congestion control. So within congestion control, you might have a lot of different problems. What is the best congestion control protocol for me to use at any given point in time, given different objective functions that I have? Or, like, can I design an ML-based congestion control protocol? And a lot of different other questions that we have a whole list of. And our idea is well, how do we build a domain-customized auto ML framework for congestion control specifically? So it’s not even for networking, just for this very, very tiny domain within networking. And we’re exploring whether we can do that.

Host: Okay. Thank you for the word hypothesis. It was the one I was searching for and couldn’t find five minutes ago. You have a paper that just got accepted at the conference for Networked Systems Design and Implementation, NSDI, this year and you call it PrivateEye.

Behnaz Arzani: Right.

Host: And it deals with scalable and privacy-preserving compromise detection in the cloud. What problem is this work addressing and what’s promising about your approach to it?

Behnaz Arzani: So the problem was, really, when we talked to Azure operators, one of the things they mentioned is, we have these really good compromise detection systems that are very, very effective that customers can use, but they don’t want to use.

Host: Why?

Behnaz Arzani: Or I don’t know if like “don’t want to use” might be a strong word, it might be that they are hesitant to use. And the reason for that seems to be that they’re concerned about their privacy, how much data they want to share with Microsoft, and also taking on third-party code. So basically, Microsoft will have to maintain that compromise detection system for them and a lot of customers are uncomfortable with that.

Host: Right.

Behnaz Arzani: So we looked at this, and the idea was, well, we still need to protect all of our customers, even though they don’t necessarily want to use these systems. So how do we do this without needing our customers’ permission to do so? And the observation was – and this is not a new observation, a lot of researchers have made this observation in the past, which is – well, network behavior changes when a VM is compromised. So can we use that change to basically say whether a VM is likely to be compromised or not, and then go from there? The other observation, which is unique to this paper was, though, that we do have these compromise detection systems that are very, very effective and they’re running on at least our first-party VMs. And these are VMs that run things like Bing, like SQL, like services that we have, and some of our customers are also opting in to use them. So what they do is provide a constant stream of detections of compromised VMs that they’ve seen, and we can use those as sort of quote-unquote “labels” to learn, okay, this is what compromise looks like and this is what changes it induces in the network behavior of these VMs. Putting these two on top of each other, we were like okay, maybe we can do something that’s privacy-preserving compromise detection that operates at data center scale.

Host: Wow.

Behnaz Arzani: And then scale is also a hard thing here so the paper goes into a lot of trouble of explaining, for example, how do I ensure that I can run at this massive scale without sacrificing too much on accuracy, without having to use things like IP addresses…

Host: Right.

Behnaz Arzani: …which, right now, with GDPR, are difficult to use because GDPR says that if a customer wants to, they can contact you and say that you have to delete this and you have twenty-four hours to do so and so on…
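The two observations above, that network behavior changes when a VM is compromised and that existing detectors on some VMs can supply labels, suggest a training set built roughly as follows. This is a minimal sketch under assumptions: the flow record fields, the three summary features, and the function names are hypothetical, and the transcript does not describe the features or model the paper actually uses. The one grounded constraint shown is that no IP addresses appear in the features.

```python
from collections import defaultdict


def summarize_flows(flows):
    """Collapse flow records into per-VM features that contain no IP addresses."""
    per_vm = defaultdict(lambda: {"flows": 0, "bytes": 0, "ports": set()})
    for flow in flows:
        stats = per_vm[flow["vm_id"]]
        stats["flows"] += 1
        stats["bytes"] += flow["bytes"]
        stats["ports"].add(flow["dst_port"])
    # Feature vector: [flow count, total bytes, distinct destination ports]
    return {vm: [s["flows"], s["bytes"], len(s["ports"])] for vm, s in per_vm.items()}


def label_examples(features_by_vm, monitored_vms, detected_vms):
    """Only VMs already covered by an existing detector can be labeled."""
    return [
        (features_by_vm[vm], vm in detected_vms)
        for vm in sorted(monitored_vms)
        if vm in features_by_vm
    ]


flows = [
    {"vm_id": "vm-a", "dst_port": 443, "bytes": 1200},
    {"vm_id": "vm-a", "dst_port": 443, "bytes": 800},
    {"vm_id": "vm-b", "dst_port": 22, "bytes": 90},
    {"vm_id": "vm-b", "dst_port": 23, "bytes": 60},
    {"vm_id": "vm-b", "dst_port": 3389, "bytes": 75},
    {"vm_id": "vm-c", "dst_port": 443, "bytes": 500},  # not covered by a detector
]
features = summarize_flows(flows)
training_set = label_examples(
    features, monitored_vms={"vm-a", "vm-b"}, detected_vms={"vm-b"}
)
print(training_set)
```

A model trained on the labeled VMs could then be applied to VMs such as `vm-c` that have no detector installed, using only their network summaries.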

Host: Wow. So this sounds like it is also in the sort of early stages of, how might we do this?

Behnaz Arzani: Yes. I mean, the paper basically goes and demonstrates that we can theoretically do this and my experience with the other Scout project kind of says that there’s a whole nine yards between “we think we can do this” and we deploy it and we say, oh, this came up…

Host: Right.

Behnaz Arzani: …so we have to handle this, and this other thing… so like what I found actually very interesting is, from research paper to actual deployment, things can change a hundred and eighty degrees. Like you might just completely change the approach you use just because new constraints come up in deployment that you hadn’t thought about when you were doing the prototype version.

Host: Right.

Behnaz Arzani: Which is basically what the paper usually is.

Host: But you’re moving forward.

Behnaz Arzani: Mm-hmm.

Host: And this paper is kind of the beginning of the exploration and then you’re going to try to scale it up.

Behnaz Arzani: Right.

Host: And see where it breaks.

Behnaz Arzani: Yeah. It probably will break. I’m going to be honest about that, but yeah.

Host: Yeah, but that’s what research is about, right?

(music plays)

Host: Well, you just referred to Scout again and let’s preface this by saying collaboration is so essential to research now that Microsoft Research even has an award for it!

Behnaz Arzani: Right.

Host: And you recently won this, you and your team recently won it, for this project called Scout.

Behnaz Arzani: Yeah.

Host: So tell us about the project. What is Scout and why did it win MSR’s collaboration award?

Behnaz Arzani: Ahh, that’s interesting. So, I started Scout two years ago, with an intern of mine who was from Princeton and we basically first started to think about okay, is this even a problem? Like the first step is, like, how bad is this problem and is this really a problem?

Host: Define this problem.

Behnaz Arzani: Meaning, is it really the case that people find it hard to blame a team for a problem? We did that investigation and said, yeah, apparently it is hard, so let’s try to solve this problem.

Host: Wait, wait, wait. So, for me it’s easy to blame. Okay. Let’s just like level-set here…

Behnaz Arzani: Right.

Host: You’re saying it’s hard to blame a team, or it’s hard to prove the blame?

Behnaz Arzani: Well, everybody points the finger at the other one.

Host: Right.

Behnaz Arzani: So that’s basically what happens. Because also like, you know, people have limited time, so they do a superficial check and if everything seems healthy, it’s like nope, not me…

Host: Right.

Behnaz Arzani: …your turn. And so they keep passing the ball around until somebody figures out what’s going on and that is very, very inefficient…

Host: Okay.

Behnaz Arzani: …and we basically just demonstrated that that’s the case.

Host: All right.

Behnaz Arzani: Then we were like okay, how do we solve this problem? And so we went about at least doing a prototype version of the Scout, which is basically a paper we submitted and fine. But then we’re like, okay, can we actually deploy this? And there’s this really cool project in the Systems group going on called Resource Central, which has to do with a framework to deploy machine learning models in production. So that’s where Ricardo Bianchini came in and said, well, we have this really cool framework, why don’t you guys take advantage of this and use this to deploy the system? So they helped us to basically deploy the first version of a Scout and then the physical networking team in Azure was the first team that we targeted to build a Scout for and they helped us with insights and what they knew about the network. The data they collected helped us figure out okay, you did a great job here, you did a sucky job here, we hate you for it, and all of those different things. So like they provided us with really good feedback.

Host: Right.

Behnaz Arzani: And this is an ongoing collaboration, so we found that Scouts do really well in certain cases, but they suck at cases where operators actually get angry about like…

Host: Right.

Behnaz Arzani: …we can classify really, really hard problems. When it comes to the easy stuff, we sometimes make mistakes. And it turns out, operators don’t like it when you get things wrong that they would have gotten right.

Host: You can’t have everything though. Come on guys. Well, here’s a question though: is the “easy stuff problem” like the person who’s proofreading a paper and gets all of the small print right, but the headline has a massive error in it?

Behnaz Arzani: Yeah, pretty much. Yeah. So like for example like one example we saw is, there was an incident where the title of the incident said, “Arista switch. This is experiencing problem.” So a switch is basically the purview of the physical networking team. It’s actually saying this is what the problem is. Our very cool Scout said, this is not a physical networking issue. And I was like, okay. Why? Turns out that that particular incident was a transient problem, so that meant that there was a blip. And that blip really didn’t register in the monitoring data that we had. The machine learning model thought, things are fine! Nothing’s bad!

Host: Interesting.

Behnaz Arzani: But because we didn’t have the contextual information – and this goes back to the need for context, right?

Host: Yeah.

Behnaz Arzani: Like we didn’t have the contextual information, and so we got that one wrong. And what we learned from that is, well, we need to have some form of contextual features as part of our feature set. Now if you look at our prototype version, this really didn’t register to us as an important problem because our accuracy was so high. We had like 98% true positive, 97% true negative, but in that 3% we had these very, very simple mistakes that operators are very unforgiving about because it’s like, it’s saying it in the title!

Host: So how do you fix that?

Behnaz Arzani: Well, so it actually ends up being a relatively simple fix because again, like it’s in the title. You just use information from the title as part of the features that you’re using. For us, the original hypothesis was, use the monitoring data as God. Basically, what does the data show? But also, it’s a fact that, you know, there are some incidents that turn out to be non-problems, but there’s still an incident and somebody has to still go and look, so it’s important to basically have the context from the incident itself as well, as part of your feature set.
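The fix described above, adding context from the incident title to the feature set, might look something like this. It is a sketch only: the keyword list, feature names, and threshold are hypothetical, and a hand-written rule stands in for the trained model, which the transcript does not detail. The example replays the transient-blip case, where monitoring data alone looks healthy but the title names a switch.

```python
# Hypothetical keywords that would point at the physical networking team.
PHYSNET_KEYWORDS = ("switch", "link", "router")


def build_features(monitoring, title):
    """Combine monitoring signals with a contextual flag taken from the incident title."""
    lowered = title.lower()
    return {
        "max_drop_rate_pct": max(monitoring["drop_rate_pct"], default=0.0),
        "title_mentions_physnet": any(word in lowered for word in PHYSNET_KEYWORDS),
    }


def looks_like_physnet(features, drop_threshold_pct=1.0):
    """Stand-in for a trained model: either signal is enough to flag the incident."""
    return (
        features["max_drop_rate_pct"] >= drop_threshold_pct
        or features["title_mentions_physnet"]
    )


# A transient blip: nothing notable registers in the monitoring data.
quiet_monitoring = {"drop_rate_pct": [0.0, 0.1, 0.0]}
title = "Arista switch. This is experiencing problem."

monitoring_only = build_features(quiet_monitoring, title="")
with_context = build_features(quiet_monitoring, title)

print(looks_like_physnet(monitoring_only))  # monitoring data alone
print(looks_like_physnet(with_context))     # monitoring data plus title context
```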

Host: Okay. So I want to sort of weave back in some of the things we’ve talked about. My understanding of what you want to accomplish here is a self-managing network using auto ML frameworks and having as little human slowdown in the process as possible.

Behnaz Arzani: Right.

Host: You don’t want humans completely out of the loop.

Behnaz Arzani: Not yet. I don’t think that’s possible. I mean, ideally you would want to, and I think that’s like kind of the Holy Grail.

Host: Do you foresee a future where that would be possible?

Behnaz Arzani: As somebody that I really admire once told me, when you build systems you have to ask yourself, can it take me to Mars? And I think that’s pretty much what we failed to do when building networks.

Host: Okay.

Behnaz Arzani: At least recently, because there is, maybe… I don’t know, but when I look at a lot of the work I have done, and a lot of the work my peers have done, I think we never really asked ourselves that question, which is why we’re in the mess we’re in.

Host: Right.

Behnaz Arzani: Maybe, over time, as we start to ask that question more, will it take us to Mars?

Host: All right.

Behnaz Arzani: Then… because you know, we are the same people that built actual things that took us to the moon.

Host: Right.

Behnaz Arzani: And those did not need operators to manage them. Yeah…

Host: Well, okay, but so having the contextual input for these systems to identify the stupid errors, can you do that and make that happen with a machine without having human context provided?

Behnaz Arzani: The right answer to that question is, I don’t know. These are things that we’re experimenting with, that we’re trying, but who knows?

Host: All right. Let’s bring this all together. Your big goal is to get to networks that can manage themselves and we’re not there yet, so what would you say are the big open problems in the field that, if solved, would get us closer to the network equivalent of self-driving cars?

Behnaz Arzani: So I think there’s a couple of things. One is, what data do we actually need from the network to be able to do this? I think that’s still an open problem. But the problem is not that we don’t know what exact data we need, it’s like, what data we need and how to efficiently collect it? Like how do we collect it without actually breaking the network while doing so? I think that’s like… There’s a lot of work going on, and we see paper after paper on this topic, but we really don’t know what is the necessary and sufficient data set to be able to do this.

Host: Okay.

Behnaz Arzani: That’s one. The control loops that we need to be able to then use this data to do self-driving networks and self-driving – the equivalent of self-driving cars is, I don’t think, there in place yet. And we don’t even have the mechanisms to then implement that control loop yet. I also think that we’ve been bogged down by just how to get the network to work in the first place, and a lot of the papers that we see like, for example, the traffic engineering papers we see, have to do with that. And so I think it’s just, we haven’t had time yet to fully explore the other side of things.

Host: And networks themselves… like you started talking about Internet of Things, and Donald Kossmann was recently on the podcast and talked about nine billion “things,” as it were. And trying to think of how you even wrap your brain around how you would manage that kind of a network.

Behnaz Arzani: Right. Luckily, that’s out of my area of expertise. I work on data center networks. If I get that to work, I’m happy. That I would talk to somebody else about!

Host: There’s the finger pointing over there! I’m data centers. That’s your business.

Behnaz Arzani: Yup!

Host: Well, we’ve talked about what gets you up in the morning, and it’s a lot that gets you up in the morning! Now I want to know what keeps you up at night. And I often joke with some researchers that their entire career is about what keeps them up at night, technically speaking. That said, is there anything about your work, outside the fact that it’s important to get your work right, that keeps you up at night, metaphorically, and if so, how are you dealing with it?

Behnaz Arzani: So I guess – and this is more recent… I don’t think it’s been the case for the past, like, whatever years is – like after we started actually deploying the Scout and getting it to be used in production, like, my worry is again, will it get us to Mars, for lack of a better word? In the sense of how trustworthy is it? When is it going to break again? How long is it going to last as is? When is the next time that somebody’s going to yell at me because I got a simple thing wrong? So I think reliability of machine learning systems for networking, and how hands-free can they actually be, is something that keeps me up at night because it seems to me, at least with our experience, that there’s some level of hand holding that’s needed over time. And that worries me because, what does that actually mean? Does it mean that you always need somebody babysitting these types of systems? And that’s not necessarily the best thing that you would want.

Host: Yeah. You got me thinking so deeply right now about the preferred future, and “humans out/humans in” and if we could ever really get to a full representation of data center problems.

Behnaz Arzani: Right. Who knows?

Host: That’s why you’re working here.

Behnaz Arzani: Yeah.

Host: Well it’s story time, Behnaz. Tell us about yourself. How did you get started in the high-tech life and how did you end up at Microsoft Research? I heard the word internships, plural, earlier on…!

Behnaz Arzani: Yeah. Well. I had a very messy way to getting where I’m at. In high school I loved physics and I liked circuits and electrical systems so I went into electrical engineering as my bachelor’s degree. And still, I like electrical engineering a lot. Like, I was the circuits person, the analog circuits person… how do you analyze… at that time, they would teach us things about BJTs and such, and then that was where I wanted to go, and a friend of mine said, you’ll never find a job in electrical engineering.

Host: Really?

Behnaz Arzani: Yeah, especially in analog circuits, which I was particularly good at. And so I was like okay, what’s the next thing I’m good at? And that was probability and networking. And I’m like, okay, that’s what I’m going to do. And actually, the first few classes that I sat in – because there was an electrical engineering analog circuits class that was in parallel to the digital signal processing class that we had to take if we were in the communications major – and I would sit in that class and I was like, I have to be in that other class! And then I started to fall in love with it. I was like, I really, really like networking. So I applied to a networking PhD, again in electrical engineering, and then my advisor just left my school. So I had to find a new advisor and that was in the computer science department and that’s how I became a computer scientist…

Host: Wow.

Behnaz Arzani: …completely by accident!

Host: Circuitous route!

Behnaz Arzani: Yes. And then I loved it. So like most of it is like, I accidentally stumbled into where I’m at, and then I ended up falling in love with it.

Host: Where did it all start? I mean, where was this taking place and who were you working with?

Behnaz Arzani: So I was at the University of Pennsylvania. I was working with Roch Guerin when I started, and then Roch left to become the chair of computer science at Washington [University in] St. Louis. So then I moved to computer science to work with my new advisor, Boon Thau Loo, who is still there, and I did networking. And then I think my first internship was in 2015.

Host: Here?

Behnaz Arzani: In Azure networking.

Host: Oh, Azure, okay.

Behnaz Arzani: So not Microsoft Research, no. And then again, I loved it so I came back for a second time and then I applied for a postdoc here. I did a postdoc and then I applied for full-time jobs, and then the rest is history.

Host: So postdoc in Azure, or postdoc in Microsoft Research?

Behnaz Arzani: Microsoft Research.

Host: Okay.

Behnaz Arzani: I don’t think Azure has postdocs. I actually fought for a postdoc in Azure and they said that we don’t have such a thing!

Host: Right? Tell me a little bit, though, about the back and forth between Azure and Microsoft Research.

Behnaz Arzani: So the way it happens in MSR is very different than the way it happened in Azure. So when I was an Azure intern, I talked to Azure people every day, twenty-four hours a day. So I knew about all the problems that were going on. I knew what the people’s pain points are because they were sitting next to me. Here at MSR, people come to us and say, I need this problem solved. Or we solve a problem like, hey, we solved this problem, do you actually need this that we did? And so it’s very, very different, I would say, the dynamic of going from an idea at MSR to actually deploying it in production.

Host: Mm-hmm.

Behnaz Arzani: And it’s much, much harder than if you come up with the idea when you’re sitting in Azure and deploying it in Azure, but it’s amazing how easily it gets done. It’s amazing, like, how fun the collaborations are and so on.

Host: Right. From your position now, where do you see yourself in the future? Staying in research?

Behnaz Arzani: I prefer to not think about that type of thing! I like to be the person who does things while they’re fun and once they’re not fun, stop doing them and move on to the next thing so I have no idea how to answer that question!

Host: All right. Well tell us something that we might not know about you. Maybe it impacted your life or career, a life defining moment or some personal characteristic… but maybe it’s just something interesting that would give us some context about you outside the lab.

Behnaz Arzani: Uh, okay. Not something I’m really proud of, but I’m a very, very competitive person. So I always attribute me getting to where I am to a friend of mine in high school, where our principal would come and say, learn from this person. This person is great. And I was like, I can do better. And it’s sad, but true, that the reason I’m here is because of a competition with another person in high school. Otherwise, I would not get into college, I think, I would not get to where I am.

Host: Okay, so let me clarify. There was an actual person that your principal said, be like that person?

Behnaz Arzani: And I was like no, I’m going to be better than that person.

Host: Oh, my gosh. I’d like to meet that principal. Well, before we go, I want to give you the opportunity to talk to some version of your grad school self. Assuming you’d listen to you, what advice would you give yourself if you could go back and give yourself advice?

Behnaz Arzani: Hmm. The advice I would think is, it’s okay to be nitpicky. Like, I think one thing that I found frustrating, as a PhD student, was how much one of my advisors, Roch, wanted us to be very, very meticulous about making sure about every single detail about something before we made a conclusion. And it took a long time to do. It was a lot of pain. And I’ve now learned to appreciate that. And so, what I would say is, it’s hard now, but it’s such good advice.

Host: Behnaz Arzani, thank you for joining us today.

Behnaz Arzani: Thank you.

Host: It’s been so much fun!

Behnaz Arzani: Yeah, I know. Thanks.

(music plays)

To learn more about Dr. Behnaz Arzani and the latest in networking research, visit Microsoft.com/research.