
Research Forum Brief | January 2024

Panel Discussion: AI Frontiers



“The sparks that we are seeing [are] really about having building blocks that give us the initial technologies … to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today.”

Ece Kamar, Managing Director, AI Frontiers

Transcript

Ashley Llorens, VP and Distinguished Scientist, Microsoft
Ece Kamar, Managing Director, Microsoft Research AI Frontiers 
Sébastien Bubeck, VP, Microsoft GenAI 
Ahmed Awadallah, Senior Principal Research Manager, Microsoft Research AI Frontiers 

Microsoft AI researchers discuss frontiers in small language models and where AI research and capabilities are headed next. 

Microsoft Research Forum, January 30, 2024

I’m Ashley Llorens, with Microsoft Research. My team works across research and product to incubate emerging technologies and runs programs that connect our research at Microsoft to the broader research community. I sat down with research leaders Ece Kamar, Ahmed Awadallah, and Sébastien Bubeck to explore some of the most exciting new frontiers in AI. We discussed their aspirations for AI, the research directions they’re betting on to get us there, and how their team is working differently to meet this moment.

ASHLEY LLORENS: So let’s dive in. We’re experiencing an inflection point in human technology where machines, broadly speaking, are starting to exhibit the sparks of general intelligence, and it’s hard to avoid the enthusiasm. Even if you wanted to. And I think it’s fair to say that there’s no shortage of that enthusiasm here among us. But as researchers, we’re also skeptics. You know, we go right in and try to understand the limitations of the technology as well as the capabilities, because it’s really those limitations that expose and define the frontiers that we want to push forward on. And so what I want to start here with is to sketch those frontiers here with you a little bit. I’d like to hear about an aspiration you have for AI and why the technology cannot do that today. Then we’ll come back around to the research directions that you’re betting on to close those gaps. And, so, I don’t know. Ahmed, what do you think? What aspiration do you have for AI, and why can’t the tech do it today?

AHMED AWADALLAH: I have a lot of aspirations. I think … you just mentioned we saw the sparks of AGI, so naturally, we’re looking forward to actually seeing AGI. But beyond that, more realistically, I think two of the things I’m really looking forward to are having AI that can actually perceive and operate in the real world. We have made significant advances with language models. We are seeing a lot of advances with multimodality. It looks like an AI that can perceive and operate in the real world is not that far off from where we are. But there are a lot of challenges, as well. And I’m really excited to see how we can get to that.

LLORENS: What does that look like for you, when AI operates in the real world? What is it doing? 

AWADALLAH: To me, it means that, first, we go beyond language, and we are getting a lot into multimodal models right now that can perceive images and language. However, a big part of what we do is that we take actions in the world in different ways. We have a lot of behavior that we exhibit as we do tasks, and it’s not clear that we can do that right now with AI. So imagine that we have an AI system that we can ask to do things on our behalf, both in the digital and in the physical world. Imagine that we have guarantees that they will accomplish these tasks in a way that aligns with our original intent.

LLORENS: Yeah, it’s compelling. Ece, what do you think?

ECE KAMAR: My dream for AI systems is that they become our helpers, companions, longer-term collaborators, rather than just, like, prompting something and it gives me an answer. And we are, actually, still quite far from having AI systems that can really help us through our life for the different purposes that we have and also really understand our goals, intentions, and also preferences. So right now, the sparks that we are seeing are really about having building blocks that give us the initial technologies to build on, to get to those AI systems that have a memory, that have a history, that have a deep understanding of human concepts, and that can carry out tasks that are a lot broader, a lot more complex than what we can do today. And our task right now is using these blocks to really imagine what those future systems are going to look like and discover those new innovations that will push the capabilities forward so that we can really build systems that create a difference in our lives, not only the systems that we want to play with or, you know, do small tasks for us—that are already changing how I work, by the way. These things are not minor, but they can really be a part of my daily life and help me with everything I do.

LLORENS: Seb, what do you think? 

SÉBASTIEN BUBECK: Yeah, my aspiration for AI, actually, has nothing to do with the technology itself. I hope that AI will illuminate how the human mind works. That’s really my real aspiration. You know, I think what’s going on in our minds and the way we reason is extremely mysterious. And anything that is mysterious, it looks kind of magical. We have no idea what are the basic elements for it. And with AI, we’re seeing that, at the very least, it’s mimicking the type of reasoning that’s going on in human beings. So I’m hoping that we’re going to be able to really uncover those building blocks of reasoning. That’s my dream for the next decade, I guess. 

LLORENS: How good of an analogy do you think, I’ll say, transformers or, you know, today’s machine learning models are for how we think and reason? 

BUBECK: It’s a terrible analogy. [LAUGHS] So it really … the transformer is absolutely not, in my mind, trying to mimic what the human brain is doing. It’s more like the emergent properties are similar. So, you know, it’s … the substrate is going to be obviously different. I mean, one is a machine and one is wetware, and the concrete algorithm that is running will be different. But it’s plausible that the emergent property will be similar. That’s what I’m hoping. 

LLORENS: No, yeah. Super interesting. And now I want to understand a little bit about the research directions that you are most excited about to get there. I don’t think you’re going to tell me about your neuroscience research. [LAUGHS] 

BUBECK: [LAUGHS] I wish. I wish. 

LLORENS: That’s an interesting place to start … 

KAMAR: Not yet. Maybe in the next episode. [LAUGHS] 

BUBECK: Exactly. 

LLORENS: But what are you betting on right now to get us closer to that? 

BUBECK: Yeah. No, it’s actually connected, the two things. So what we are experimenting with right now is the following. So to us, I think to all of us here, GPT-4 showed the sparks of AGI, early signs of humanlike reasoning. And to us, we see this as a, kind of, proof of concept. OK, it means you can get this type of intelligence—quote, unquote—if you scale up a ton, if you have a very, very large neural network trained on a lot of data with a lot of compute for a very long time. OK, great. But exactly which one of those elements was needed? Is it the big data that’s necessary? Is it the large neural network? Is it a lot of compute? And what is a lot, by the way? What is large? You know, is 1 billion large? Is 10 billion large? You know, questions like this. So to me, this comes from a scientific inquiry perspective. But at the end of the day, it has enormous economic impact, because when you answer these questions, you can make everything smaller. And this is what we’ve been doing with the Phi series of models, trying to build those small language models. Again, we come at it from the scientific perspective, but it has very, very concrete impact for the future of Microsoft.

LLORENS: So I think Phi is on a lot of minds right now. Let’s actually stick with Phi for a minute. What is the secret? [LAUGHS] What—let’s stick with that—what is the secret? What’s enabling you to get to the reasoning capabilities that you’re demonstrating with models of that size? 

BUBECK: Yes, yes, yeah. There is … 

LLORENS: What size is Phi, by the way? 

BUBECK: Yeah, so the latest, Phi-2, is 2.7 billion parameters. Phi-1.5 was 1.3 billion. So we have doubled the size. So the secret is actually very simple. The secret is in the title of the first paper that we wrote in the Phi series, which is “Textbooks Are All You Need.” So “Textbooks Are All You Need,” this is, of course, a play on the most famous paper of all time in machine learning, “Attention Is All You Need,” that introduced the attention mechanism for the transformer architecture. So in “Textbooks Are All You Need,” what we say is if you play with the data and you come up with data which is of “textbook quality”—so the meaning of this is a little bit fuzzy, and this is where part of the secret lies—but if you come up with this textbook-quality data, we’re able to get a 1,000x gain if you look at the total compute that you need to spend to reach a certain level in terms of benchmarks, intelligence, etc. So now what is this textbook quality, this mysterious textbook quality? Well, the way I want to put it is as follows. What matters, when you give text to these transformers to try to teach them a concept, is how much reasoning is going on in the text. What kind of concept can you extract if you are to predict the next word in that text? So what we want is text which is reasoning dense, and, you know, like, novels, they are not really reasoning dense. Sometimes you need to reason a little bit to understand, OK, how all the characters are related, you know, why are they thinking or doing what they are doing. But where do you have really reasoning-dense text? Well, it’s in textbooks. So this is the secret, basically.
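
To make the “textbook quality” idea concrete, here is a toy sketch of one general recipe: score each document for how reasoning-dense and instructional it looks, and keep only the high-scoring ones for pretraining. The scoring heuristic and function names below are illustrative assumptions, not the actual Phi data pipeline, which is considerably more involved.

```python
# Toy sketch only: select "textbook-like," reasoning-dense documents for
# pretraining. The scoring heuristic is a stand-in; in practice this could be
# a small classifier trained on examples labeled by a strong model.

def educational_score(doc: str) -> float:
    """Return a rough quality score in [0, 1] for how instructional a document looks."""
    markers = ("therefore", "for example", "step", "definition", "exercise")
    hits = sum(doc.lower().count(m) for m in markers)
    return min(1.0, hits / 10)

def filter_corpus(docs, threshold=0.5):
    """Keep only documents that look sufficiently textbook-like."""
    return [d for d in docs if educational_score(d) >= threshold]

corpus = [
    "Definition: a prime number has exactly two divisors. For example, 7 is "
    "prime. Exercise: list the primes below 20, step by step.",
    "u won't BELIEVE what happened at the mall yesterday lol",
]
print(filter_corpus(corpus))  # keeps the first document, drops the second
```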

LLORENS: And, Ahmed, recently you and I have had conversations about a universe of different pretraining methods, textbook-like reasoning tokens, you know, being one, and then also the whole universe of post-training methods and how there’s a whole space to explore there. So maybe you can get into your research interests, you know, where are you pushing on that frontier? And, you know, what haven’t we talked about yet in terms of pretraining versus post-training?

AWADALLAH: Yeah, that’s a very good question. And, actually, it was very interesting that many, many similar insights apply to what Sébastien was just describing. But if you look at how we have been pretraining models recently, we start with the pretraining stage, where we basically show the model a lot of text—the textbooks—and we have it learn to predict the next word. And with a lot of data and a lot of size, the big size, a lot of emergent properties were showing up in the models that we didn’t really even try to teach to the model. But we have also been seeing that there are other stages of training—some people refer to them as post-training—where after we pretrain the model, we actually start teaching it specific skills, and that comes in the form of input-output samples or sometimes an input and two different outputs, where we are trying to teach the model that the first output is preferred to the second output. We can do that to teach the model a particular style or a skillset or even for alignment, to teach it to act in a safer way.

But what we have found out is that now that we have these large models, as well—and they are actually very powerful engines that can enable us to create all sorts of data—many of these properties, we don’t have to wait for them to emerge with size. We can, actually, go back and create synthetic, tailored data to try to teach a smaller model that particular skill. We started with reasoning, as well, because reasoning is a pretty hard property, and we haven’t really seen reasoning emerge to the level we have in models like GPT-4 right now, except after scaling to very large model and data sizes. So the question was, now that it has emerged in those models, can we actually create data that teaches the model that particular skill? And we were not trying to teach the model any new knowledge, really. We were just trying to teach the small model how to behave, how to solve a task. So, for example, with a model like GPT-4, we are seeing that you can ask it to solve a task that requires breaking it up into steps and going step by step to solve it. We have never seen that with a small model, but what we have found out is that you can, actually, use a powerful model to demonstrate the solution strategy to the small model, and you can actually demonstrate so many solution strategies for so many tasks. And the small models are able, actually, to learn that, and the reasoning ability is significantly improved based on that.
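
To make the two post-training recipes Ahmed describes a bit more concrete, here is a minimal, hypothetical sketch of what individual training records might look like: a preference pair (one input with a preferred and a rejected output) and an Orca-style demonstration in which a stronger teacher model writes out a step-by-step solution for the smaller model to imitate. The field names and the query_teacher helper are illustrative assumptions, not the actual Orca or Phi pipelines.

```python
# Illustrative sketch only: record formats for the two post-training strategies
# described above. Field names and query_teacher() are assumptions, not the
# actual Orca/Phi data pipeline.

# 1) Preference data: one input, two outputs, with a label for which output is
#    preferred (used to teach style, skills, or safer behavior).
preference_example = {
    "prompt": "Explain why the sky is blue to a 10-year-old.",
    "chosen": "Sunlight is a mix of colors. Tiny air molecules scatter blue "
              "light the most, so blue reaches your eyes from every direction.",
    "rejected": "Rayleigh scattering has a cross-section proportional to "
                "1/wavelength^4.",  # accurate, but not the style being taught
}

# 2) Teacher demonstration: a stronger model writes a step-by-step solution,
#    and the small model is fine-tuned to imitate the full reasoning trace.
def query_teacher(prompt: str) -> str:
    """Placeholder for a call to a strong teacher model (e.g., through an API)."""
    raise NotImplementedError

task = "A train travels 120 km in 1.5 hours. What is its average speed?"
demonstration_example = {
    "prompt": task,
    "response": (
        "Step 1: Average speed = distance / time.\n"
        "Step 2: 120 km / 1.5 h = 80 km/h.\n"
        "Answer: 80 km/h."
    ),
    # In a real pipeline the response would come from the teacher:
    # "response": query_teacher(task),
}
```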

LLORENS: I find the word reasoning pretty loaded. 

AWADALLAH: It is.

LLORENS: I think a lot of people mean a lot of different things by reasoning. Actually, I found some clarity. I had a nice discussion with two of our colleagues, Emre Kiciman and Amit Sharma, and, you know, they wrote a recent paper on reasoning. Sometimes we mean symbolic-style reasoning; sometimes we mean more commonsense reasoning. You talked about, kind of, more symbolic-style-reasoning tokens perhaps, or how do I think about the difference between those kinds of training data versus world knowledge that I might want a model to reason about? 

BUBECK: Yeah, very good question. So if you take the perspective that you start with a neural network, which is a completely blank slate, you know, just purely random weights, then you need to teach it everything. So going for the reasoning, the high-level reasoning that we do as human beings, this is like, you know, step No. 10. You have many, many steps that you need to satisfy before, including, as you said, commonsense reasoning. So, in fact, in our approach for the pretraining stage, we need to spend a lot of effort on the commonsense reasoning. And there, the textbooks approach is perhaps a little bit weird because there’s no textbook to teach you commonsense reasoning. You know, you acquire commonsense reasoning by going outside, you know, seeing nature, talking to people, you know, interacting, etc. So you have to think a little bit outside the box to come up with textbooks that will teach commonsense reasoning. But this is, actually, a huge part of what we did. In fact, everything that we did for Phi-1.5 was focused on commonsense reasoning. And then when we got to Phi-2, we got a little bit closer to the Orca model, and we tried to teach also slightly higher-level reasoning, but we’re not there yet. There are still, you know, a few more layers. We’re not yet at step No. 10.

LLORENS: Yeah, fair enough. Ece, geek out with us a little bit now on research directions. I’m sure you have a lot of interest in everything we’ve just talked about. Anything you want to add from your perspective? 

KAMAR: There is, actually, a lot to add, because one of the biggest things that we are trying to do in our new organization is understand the connections between these different works that are going on, because our purpose is not exploring independent research directions and making progress on each. But we have a very focused mission. Our focused mission is expanding the frontiers of AI capabilities, expanding the frontiers of what intelligence can be in these machines. And to be able to get there, we have to have a coordinated understanding of how Phi connects to Orca and how these two model families connect to other future-looking ideas that can push those boundaries forward. So think about this as, like, an intelligence pyramid. That’s how I have been, kind of, thinking about this in my mind.

At the base of it, we have the building blocks of these models, base models. Phi is a beautiful example. And in the future, we are going to have other models. Phi is going to go and do other things, and other places can do other things. Phi and GPT-4 and these models are going to coexist in a model library. The layer above that is all of the work that the Orca team is doing with fine-tuning and specialization. Taking a capability, taking a domain, taking some constraints and trying to see, like, I have these base models, but how do I make them work even better for the different domains and capabilities that I really, really care about, and how do I have more control over what those models generate for me. So that’s like the second step of that intelligence pyramid that we are building. But then we have been doing some really interesting demonstrations and building in our teams to, kind of, look at how orchestration plays a role in that intelligence pyramid. Because when you think about it, the simplest way we can get things done with either the base models or the specialized models today is I just tell it to do something by prompting and it does something for me. But is that the end of the way we are going to be building with these models to be able to expand those frontiers? The answer is no. And in fact, one piece of work that our teams have been doing collectively is called AutoGen. And that library, which became very popular with the developer community—and we love seeing the responses we are getting. Correct me, Ahmed, I think we got to 15,000 stars in under a month on GitHub …

AWADALLAH: Yeah, we did.

KAMAR: … with this library, with this very experimental library. And we are learning a lot from the developer community about how they are using it. But what we are seeing is that the kind of things people want to do with these models, when they want to expand those capability boundaries, when they want to have a more robust execution, when they want to really overcome the brittleness of the prompting strategy, they actually go to orchestration, and in fact, they go to multi-agent orchestration. What we mean by multi-agent orchestration is this: imagine you have a complex task that you cannot reliably do by just prompting even the best model we have in our family. But what you can do is something very similar to how humans work, actually. We take a complex problem. We divide it into smaller pieces and then assign the smaller pieces to different people that have different capabilities. That’s exactly how the AutoGen framework works. It takes a complex task, divides it into smaller pieces, and assigns different pieces to different “agents,” which means intelligences that can prompt different models with different strategies and personas and get them working together. And what we are seeing is that this very simple idea of multi-agent orchestration, on top of all of the great work that’s happening on the modeling side, is another layer in that intelligence pyramid that can really push the frontiers forward. So one of the things we are doing in our organization is really understanding these connections—how does Phi relate to Orca, and to AutoGen?—as we are building this pyramid. But there is something else we are betting on right now, which I believe is going to become very, very important as these systems become a part of the real world, as Ahmed was suggesting.
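
As a reference point, the two-agent pattern Ece describes looks roughly like the minimal sketch below, written against the open-source pyautogen (v0.2-era) API; the class names, configuration format, and model name are assumptions to verify against the current AutoGen documentation.

```python
# Minimal two-agent sketch in the style of the open-source AutoGen library
# (pyautogen, v0.2-era API). Class names, config format, and model name are
# assumptions to check against the current AutoGen documentation.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

# An assistant agent that plans and writes code, backed by an LLM.
assistant = autogen.AssistantAgent(name="assistant", llm_config=llm_config)

# A user-proxy agent that executes the code the assistant proposes and reports
# results back, so the two agents can iterate without a human in the loop.
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config={"work_dir": "workdir", "use_docker": False},
)

# The complex task is handled as a conversation: the assistant breaks it into
# steps, the proxy runs them, and they continue until the task is done.
user_proxy.initiate_chat(
    assistant,
    message="Plot the NVDA and TSLA stock price change year to date.",
)
```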

So when we were doing the “sparks of AGI” work, there is actually something we say in the introduction when we are talking about intelligence, the core of intelligence. Any intelligent system needs to be learning—learning from its environment, learning from the interactions it is having. And this is not something we currently have even in the best models or even in the best AI systems we have in the world. They are static. They may be interacting with millions of people every day and getting feedback from them or seeing how people respond to them, but it does not make any of those systems better or more intelligent or understand their users any better. So I feel like this is one of the areas that we have to push forward very strongly. How do we incorporate a learning feedback loop into this intelligence pyramid—every layer of it—in a transparent, understandable, and reliable way, so that the systems we are building are not only getting better because experts like Sébastien and Ahmed are putting a lot of time into data collection? And, of course, that work needs to happen, as well, and, you know, coming up with new ideas to make the models better. But we are, actually, creating this virtuous loop for our systems so that they get better over time.

The last research idea we are pushing forward is something, actually, very unifying across the stack I’m talking about. One of the biggest questions is, what does progress in AI look like today, right? Like, we are doing all of this great work, but how are the capabilities of the AI systems, all the models we are building, evolving as the models scale up and we have more data? So this is really becoming a question about evaluation and understanding. So think about this as we are doing a lot of agile work in a very fast-changing environment. What we need is headlights to be able to see where we are going and how much progress we have made. So this is why another area we are really pushing for as a research direction in our organization is not only relying on existing benchmarks and existing evaluation strategies, but really reinventing how we think about evaluation overall. We talked about this intelligence stack. How can the innovations in the intelligence stack enable researchers to come up with new approaches to understand the models, evaluate the models, such that we can have a much better understanding of where we are and where we are headed as we are building this intelligence pyramid?

LLORENS: A quick follow-up question on evaluation. This is one that I think a lot about. There’s the idea of benchmarks that try to maybe test the, you know, the generality of the intelligence of a model. And then there’s, all the way, the end-to-end evaluation in the context of use. And how much do we think about the end-to-end story there when we talk about evaluation? 

KAMAR: It’s a spectrum. I would also like to hear from Sébastien and Ahmed, but it is really a spectrum, and there are different questions that motivate the work on evaluation. So when we ask a question like what does that capability curve look like for AI models, there we have to focus on the models themselves and understand how the models are progressing. But then if you are asking a question of, I want to build reliable, capable AI systems of the future—what does that curve look like? That requires a different way of thinking about the evaluation, where we are not only evaluating the models, but we are evaluating the whole stack. We are actually saying, OK, let’s think about prompting. Let’s think about orchestration and understanding the complementarity of the stack and looking into how the capabilities improve as we put the pieces together, to be able to light our way forward, both in terms of understanding how well we do in models and how well we do in building systems. We have to do the work in both. There is really no shortcut for that.

LLORENS: Microsoft Research is over 30 now, over 30 years old. And suffice it to say, I think we’re, you know, we’ve been going strong for over 30 years, but we’re in new territory. And I think we are organizing differently in some ways, you know, to meet the moment. And along those lines—and you, kind of, alluded to this before—but you’ve recently taken on a new leadership role. 

KAMAR: With Sébastien and Ahmed, as well. 

LLORENS: Of course. So maybe you can say a little bit more about how we’re organizing differently, what this looks like from your perspective.

KAMAR: As you said, this is really about the moment that we are in right now. Of course, I haven’t been at Microsoft Research for the whole 30 years [LAUGHTER], but I’ve been here for at least half of it, and personally, for me, there has never been a moment as exciting as now to be an AI researcher and to be an AI researcher inside Microsoft. Think about it. This is the company that is putting cutting-edge AI technologies in the hands of millions of people and doing it at an unbelievable speed that surprises me, although I have been an employee of this company for the last 13 years. So think about the speed of innovation that we are seeing here. Think about where the ambition level is in this company when it comes to doing great AI work.

Of course, by doing research inside Microsoft, we are also able to see where the gaps are. We are able to get a lot of feedback about what is working and what is not working. And that’s giving us a lot of really strong signals about where we need to push. And, in fact, these research directions we are talking about, they are not coming from thin air. This is really coming from working with different product groups, learning from their experiences, trying things ourselves, as well. So these are all motivating us to rethink what AI research means in this new AI age. So if you are setting an ambition level as high as what the current situation requires—we are going to be at the cutting edge of the AI world, we are going to be impacting real-world AI systems, and we are going to be pushing forward on this intelligence pyramid—that really requires that we coordinate ourselves very well on a very well-defined mission and go at it with conviction and go at it with speed and agility. So that’s what we are doing in our new organization that’s called AI Frontiers. This is a mission-focused AI lab, and our mission is expanding the frontiers of AI capabilities, and we are doing it by being very focused on a number of key directions, which we kind of covered, but also having the agility and the teamwork to always re-evaluate ourselves and ask the question of, are these the most important problems to work on right now? Or, as the world is changing, should we rethink? Should we create new directions? Should we end directions? This is, I think, one of the most important things about where we are in the AI world right now. We are not working on hypothetical ideas. Of course, we are dreaming big; we are taking risks. We are not only doing incremental things. But even for the ideas that are long-term and riskier, we are only going to learn if we are building those ideas, sharing them with the community, and learning from that feedback. So those are the building blocks of our new organization.

LLORENS: One of the things that’s exciting about doing research, I find, in an industrial environment like Microsoft is the ability to essentially affect the population through translating things into products, right. On the other hand, there is a big difference between what comes out at the end of a research pipeline, a research asset, you know, a model like Phi or Orca, and a thing that powers a product. One of the things I think we’ll do with AI Frontiers is provide more of a channel, a more coherent channel, of research artifacts like this into product. But can you talk a little bit about that? What is that difference? What goes into getting something from, you know, what we might put on GitHub to something we might give to our colleagues in Azure, for example? 

BUBECK: I think the timelines have really shortened recently. Overall, research has accelerated so dramatically that the distance between a real product and something that comes at the end of a research, you know, project is very small, I would say. And this is really, you know, to Ece’s point about having an organization which is mission focused and about building things, this is, to me, the essence of what’s going on right now. We cannot have horizons which are 10 years into the future. The truth is, nobody knows where AI is going to be 10 years from now, so it’s meaningless to plan at the time horizons that we are used to in research. If you were in research, you know, 10 years ago and you were planning with a 10-year horizon, then, of course, there would be an immense gap between whatever you produce and, you know, a real product. This is not the case anymore. So even something like Phi, you know, it could be in product very soon.

AWADALLAH: Yeah. When I first joined Microsoft Research, actually, we would think about whether the research that we’re doing is two, three, or five years away, and we’d categorize research that way in terms of making it into product. That spectrum’s collapsing.

BUBECK: Completely. 

AWADALLAH: Things are happening so fast. Taking something from research results to a product is still a lot of work. And that’s why I have been amazed by how fast we have been moving as a company, putting these things safely and reliably into the hands of our customers. However, that spectrum is not measured in years anymore. Things are moving very, very fast, and some of our findings make their way into impact in a matter of weeks or months.

KAMAR: And there’s one more point to make here, which is about doing AI Frontiers inside MSR. We are choosing to build a mission-focused organization that’s going really fast on some of these problems, getting our hands dirty, and working with different parties in the company. And at the same time, we are inside a very strong organization that has researchers studying many different problems at different time horizons and sometimes being able to, you know, pursue directions that we may not be able to afford in this mission-focused organization. So one of the things we very much care about is also building bridges, not only with the company, not only with the academic world, but also with the different groups inside the Microsoft Research umbrella, and really benefiting from the riskier bets that, you know, the traditional MSR labs are taking, collaborating with them, and enabling all of us to try those ideas. So we are really hoping that by being inside this MSR family, we are gaining a lot and we are able to scale our ideas and experimentation a lot more.

LLORENS: You alluded to the, you know, the work it takes to go from a research artifact to something in a product, and part of that work pertains to responsible AI, as we might say inside Microsoft, or just AI safety more broadly. I think that’s true for translating something to product, but even for releasing something, you know, a paper with a GitHub artifact that we put out there. Let’s go back, let’s say, to the Orca work. How are you thinking about safety in the context of open sourcing something like Orca? What are the tests you’re running? And, you know, what does that frontier look like?

AWADALLAH: Yeah, that’s a very good question. And, actually, we put a lot of emphasis on safety even for research assets, and we put a lot of our research assets through a process as rigorous as we would for products before we are able to release them. And this is definitely the right thing to do. And, as you mentioned Orca—we did Orca fairly early on, and we weren’t yet sure at that stage what the process should be, so we, actually, never released it, because … like, once we wrote the paper and found out that we had something interesting, we wanted to release it, because we wanted to share it with the research community and we wanted the research community to be able to build on top of it, but we didn’t have a story for what it would mean to actually release it safely. So we took a step back and worked with the rest of the company and came up with a very rigorous process. And before we are able to put anything out, it has to go through that process. That said, I think we are still learning how to evaluate and how to measure and what it even means to measure safety. So it’s not like a checkbox where we figured it out, and that’s what we are doing, and we feel good about it, and we put it out there. There is a continuous effort from a very large number of teams throughout the company, in both products and research, to always refine these processes so that we make sure we advance our understanding of what safe release of these models is and also make sure that we have the right processes and systems so that everything we put out there goes through that process.

LLORENS: And there are frontiers here that are super interesting. I think multimodality is a really interesting frontier relative to evaluation and safety. And we started earlier in the conversation even talking about AI in the real world that we interact with maybe not even just as a chatbot, but as an agent of some kind that can take action in the real world. So it’s great to see us taking this so seriously at this phase, because I think it’s going to get even more complicated, and more important, as we move forward. Why don’t we talk about AI and society for a minute. One of the things that I find important for me as I reflect on my own research, my own journey here, is remaining grounded by perspectives outside of this laboratory, outside of the spheres that we’re in. We get some of that at our dinner tables, right. I do have the opportunity, for me personally, to engage with communities, community organizations, even politicians. But I’m really interested in how you all stay grounded in perspectives outside of this world here in Microsoft Research. Ece, why don’t we start with you?

KAMAR: Yeah, talking about AI and society and responsible AI, one of the things that’s very important is that a significant portion of our organization, our researchers and engineers, have significantly contributed to the work that Microsoft has done in the responsible AI space over the last decade. And, in fact, one of the things I’m most proud of in terms of my personal time in MSR is how much MSR contributed to where Microsoft is in doing AI responsibly. And that all happened because we, actually, got to see the realities of AI development and have the passion to drive innovation in terms of building AI responsibly. Now I think this is an opportunity for us to do this at larger scales as we have more coordinated efforts in terms of pushing the frontiers of AI in this new organization and MSR more broadly. So there are a few ways we are doing this right now. And then I’ll come to your point about the community. One of the things that we very much care about is sharing our work with the academic community and with the developer community through open sourcing. So all of this work—Phi, Orca, AutoGen, and the other things we are going to be doing—we release it. And, in fact, what is so significant about the small-language-model space is that these models enable a lot of hands-on research work that may not be possible without this family of models, because when you think about it, a lot of the other models that have reasoning capabilities that may compare with Phi and Orca were much larger, and they were black boxes to the research community. Now that we are putting these models out there under an MIT License, we really welcome the academic community to take these models, to look into how they are actually getting better at reasoning, and ask the question of how. Ask the question of, how do we have better controls in Phi and Orca? How do we improve the training data such that we can mitigate some of the biases, reliability issues, and toxicity in it?

One of the things I personally very much believe in is that there cannot be any camps about stopping AI versus going as fast as possible. This is really about building AI responsibly and making sure that the innovation happening takes responsibility as a core part of it. So with that in mind, we think it is so important to enable the whole academic community with models, with architectures, with agent libraries, such that the innovation in how we make AI responsible comes from the whole world instead of just those who have access to such models.

BUBECK: And if I may, like, for the Phi model on Hugging Face, we are approaching a million downloads. So, you know, it’s very real. Like, this is really getting into the hands of, well, a million people, so … 
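
For anyone who wants to try the released checkpoints Sébastien mentions, loading Phi-2 looks roughly like the minimal sketch below, using the Hugging Face transformers library; the model ID matches the published checkpoint, but the prompt format and generation settings here are illustrative rather than a recommended configuration.

```python
# Minimal sketch: loading the released Phi-2 checkpoint from Hugging Face with
# the transformers library. Prompt format and generation settings are
# illustrative only, not a recommended configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # Phi-1.5 is published as "microsoft/phi-1_5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    trust_remote_code=True,  # needed by early releases; newer transformers versions may not require it
)

prompt = "Instruct: Explain what a prime number is.\nOutput:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```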

LLORENS: Yeah, for sure. 

AWADALLAH: Yeah, and to add to that, we are seeing this a lot with AutoGen, as well, because AutoGen, it’s not a model. You can use a lot of models with it. And it created a big developer community around it, and we have been learning a ton from them, not just in how they are using it, but actually in so many innovative ideas about how to use it to make applications safer or more reliable, because the framework enables you to define different roles. People are coming up with very interesting ideas, like adding a safeguard agent to make sure that whatever the team of agents is doing actually fits particular safety criteria, or adding other agents that try to make sure that the completion of the task aligns with the initial human intent. So we are going early with enabling the community to use what we are doing and open sourcing it, and it is helping us collectively come up with ways of building these things that are much better and safer.
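
The safeguard-agent pattern Ahmed mentions can be sketched with AutoGen’s group-chat machinery: alongside the worker agent, add an agent whose only job is to review each step against stated safety criteria. This is a hypothetical illustration written against the pyautogen v0.2-era API, with invented system messages, not a vetted safety mechanism.

```python
# Hypothetical sketch of a "safeguard agent" pattern using AutoGen group chat
# (pyautogen, v0.2-era API). Illustrates the idea only; not a vetted safety
# mechanism, and the system messages are invented for this example.
import autogen

llm_config = {"config_list": [{"model": "gpt-4", "api_key": "YOUR_API_KEY"}]}

worker = autogen.AssistantAgent(
    name="worker",
    system_message="You solve the user's task step by step.",
    llm_config=llm_config,
)
safeguard = autogen.AssistantAgent(
    name="safeguard",
    system_message=(
        "You review every proposed step. If it violates the stated safety "
        "criteria or drifts from the user's original intent, reply REJECTED "
        "with a reason; otherwise reply APPROVED."
    ),
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy", human_input_mode="NEVER", code_execution_config=False
)

# The group chat routes messages among the agents, so every worker step can be
# checked by the safeguard before the task moves forward.
group = autogen.GroupChat(agents=[user_proxy, worker, safeguard], messages=[], max_round=8)
manager = autogen.GroupChatManager(groupchat=group, llm_config=llm_config)
user_proxy.initiate_chat(manager, message="Draft a summary of this quarter's support tickets.")
```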

KAMAR: And then on top of the work we are doing to, hopefully, enable the academic community, there is also something about working inside a company like Microsoft and learning from real-world use cases. And responsible AI is really about the real world, and we want to make sure that we, over time, think about ways—possibly even collaborating with you, Ashley, and your team—of really, like, sharing our learnings about what the real world looks like, what the real-world considerations are, with a much larger community so that we can think about all of these considerations together and innovate together in terms of building AI responsibly.

LLORENS: And the global research community—we talk a lot about that—is more expansive, I think, than it’s ever been, at least as it pertains to computing research and the number of different disciplines now involved in what we’ve considered computing research. On the one hand, there are the computer scientists that are playing with Phi right now, that are playing with AutoGen. On the other hand, there are legal scholars, there are policy researchers, there are medical practitioners, and so the global research community is just more expansive than ever, and it’s just been great to be able to use Microsoft as a platform to engage more broadly, as well. So, look, I’ve had a lot of fun, you know, talking to you all on a daily basis, but today in particular. Thanks for a fascinating discussion.

KAMAR: Thank you, Ashley. 

BUBECK: Thanks, Ashley.  

AWADALLAH: Thank you.