
Research Forum Brief | January 2024

Augmenting Human Cognition and Decision Making with AI



“How can we use AI to help people make better decisions, reason about information, be more productive, and ultimately even improve themselves?”

Jake Hofman, Senior Principal Researcher

Transcript

Jake Hofman, Senior Principal Researcher, Microsoft Research NYC 

Jake Hofman discusses recent research in building and evaluating AI tools for helping people make better decisions and improve their own capabilities. 

Microsoft Research Forum, January 30, 2024

JAKE HOFMAN: Hi, my name is Jake, and I’m excited to share some recent work that we’ve been up to at Microsoft Research New York City called augmenting human cognition and decision making with AI. And what I mean by that is very simple: how can we use AI to help people make better decisions, reason about information, be more productive, and ultimately even improve themselves?

And we’ve come up with a little sports analogy for thinking about the spectrum of ways that people might interact with AI tools. On the left, we have, sort of, the least desirable outcomes, with an analogy to steroids: something that gives you a superhuman ability in the moment but can leave you worse off than you were before, leading to long-term deskilling. An example is forgetting how to spell if you over-rely on spell check.

In the middle, we have tools that act like a good pair of running sneakers. They give you a temporary boost in the moment, but there are no long-term consequences when you take them away. Here, you might think of something like saving time on cumbersome syntax by using autocomplete. And on the right, we have perhaps the Holy Grail, a coach that not only helps you in the moment but helps you improve yourself in a long-lasting and sustainable way. And it may seem that these are discrete options, but in fact, we can make choices in how we design and use AI tools that can substantially impact how they affect people. And so I want to go through very quickly just two examples of studies we’ve done to think about the design and use of AI tools and how we can optimize them: first an LLM-based search study and second an LLM-based tutor study.

So in the first study, we looked at a, sort of, sneaker scenario: how does LLM-based search affect decision making? We did this by asking people to research and choose between pairs of cars given certain criteria, and we randomized whether they had access to traditional search or LLM-based search. So some people saw the usual set of blue links, which was provided on the backend by the Bing Search API, while other people saw natural language responses generated by GPT-3.5.
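
To make the setup a bit more concrete, here is a minimal sketch of that kind of randomized assignment. The helpers `fetch_bing_links` and `fetch_llm_answer` are hypothetical stand-ins for the Bing Search API backend and the GPT-3.5 call; this is an illustration of the design, not the study’s actual code.

```python
# Minimal sketch of randomizing participants between traditional search results
# and LLM-generated answers. The fetch_* helpers are hypothetical stand-ins.
import random


def fetch_bing_links(query: str) -> list[str]:
    """Hypothetical stand-in for a Bing Search API call returning 'blue links'."""
    return [f"https://example.com/result-{i}?q={query}" for i in range(1, 6)]


def fetch_llm_answer(query: str) -> str:
    """Hypothetical stand-in for a GPT-3.5 call returning a natural language answer."""
    return f"(model-generated answer to: {query})"


def assign_condition(participant_id: str) -> str:
    # Seed on the participant ID so repeat visits see the same interface.
    rng = random.Random(participant_id)
    return rng.choice(["traditional_search", "llm_search"])


def run_search(participant_id: str, query: str):
    condition = assign_condition(participant_id)
    if condition == "traditional_search":
        return condition, fetch_bing_links(query)
    return condition, fetch_llm_answer(query)


if __name__ == "__main__":
    print(run_search("p-042", "2023 sedan fuel economy comparison"))
```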

And here’s what we learned from this experiment. For routine tasks where the LLM provided accurate information, people were about twice as fast using the LLM-based search as they were using traditional search with comparable levels of accuracy. But when the LLM made a mistake—as it did here, indicated by the X over an incorrect number in the response—people basically didn’t notice, and they often made incorrect decisions themselves as a result. Thankfully, though, we found a simple fix. We added confidence-based highlighting similar to what you would see in a spelling or grammar check, and that greatly reduced overreliance on this incorrect information and improved people’s performance in the task, leaving all other measures unaffected. So this is one of those key design choices that can make a real difference. And experimentation was key for prototyping and validating it.
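
As a rough illustration of the highlighting idea, here is a minimal sketch that flags low-confidence tokens, assuming per-token confidence scores derived from model log probabilities; the threshold, the marker format, and the use of log probabilities are illustrative assumptions, not necessarily the exact mechanism used in the study.

```python
# Minimal sketch of confidence-based highlighting: tokens whose (assumed)
# confidence falls below a threshold are wrapped in a marker so the UI can
# render them as "double-check this".
import math


def highlight_low_confidence(tokens: list[str], logprobs: list[float],
                             threshold: float = 0.7) -> str:
    pieces = []
    for token, lp in zip(tokens, logprobs):
        confidence = math.exp(lp)  # convert log probability to probability
        if confidence < threshold:
            pieces.append(f"[[{token}]]")  # flagged for the reader to verify
        else:
            pieces.append(token)
    return " ".join(pieces)


if __name__ == "__main__":
    tokens = ["The", "sedan", "gets", "38", "mpg", "on", "the", "highway"]
    logprobs = [-0.01, -0.05, -0.02, -1.2, -0.03, -0.01, -0.01, -0.04]
    print(highlight_low_confidence(tokens, logprobs))
    # -> The sedan gets [[38]] mpg on the highway
```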

In our second study, we looked at more of a coach scenario for how LLM-based tutoring affects learning. So what we did is we randomized people into seeing different types of assistance at different times when they were practicing standardized math problems like the one you see here. Then we looked at their performance on a separate test where, very importantly, no one had any assistance so we could assess how much they themselves had learned. 

So in one condition—the answer-only condition—people tried a problem, and then they were just told whether they were right or wrong, and if they were wrong, they were shown the correct answer. In contrast, in a stock-LLM condition, people were given the explanation that vanilla GPT-4 out of the box provided. In this case, GPT-4 gives a correct but rather esoteric formula for the person to try to learn and memorize to solve the problem. And in a third and final condition, we had a customized LLM that was given a pre-prompt to emulate a human tutor, and it suggested more cognitively friendly strategies, which in this case involves choosing a value for an unspecified number in the problem to make it easier to solve. 
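
To illustrate how the three conditions differ, here is a minimal sketch of the feedback logic. The `call_llm` helper and the tutor pre-prompt text are hypothetical illustrations of the setup, not the study’s actual prompts or infrastructure.

```python
# Minimal sketch of the three feedback conditions: answer only, stock LLM
# explanation, and an LLM customized with a tutor-style pre-prompt.

ANSWER_ONLY = "answer_only"
STOCK_LLM = "stock_llm"
TUTOR_LLM = "tutor_llm"

# Hypothetical pre-prompt asking the model to behave like a human tutor.
TUTOR_PREPROMPT = (
    "You are a patient math tutor. Explain a cognitively friendly strategy the "
    "student could use, such as plugging in a concrete value for an unspecified "
    "number in the problem, rather than quoting a formula to memorize."
)


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 chat completion call."""
    return f"(model explanation for: {user_prompt})"


def feedback(condition: str, problem: str, correct_answer: str) -> str:
    if condition == ANSWER_ONLY:
        return f"The correct answer is {correct_answer}."
    if condition == STOCK_LLM:
        # Vanilla model with no special instructions.
        return call_llm("You are a helpful assistant.",
                        f"Explain how to solve: {problem}")
    # Customized tutor condition.
    return call_llm(TUTOR_PREPROMPT, f"Explain how to solve: {problem}")
```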

And the findings here are pretty straightforward. From this experiment, we saw that LLM explanations really boosted learning relative to seeing only answers, as shown by these two points on the right. But there were substantial benefits to having people use the tutor after trying on their own first, as opposed to consulting the tutor before attempting the problem. We also saw directional evidence that the customized pre-prompt provided a small boost over the stock explanations.

And so to wrap up, I hope these two studies have provided useful examples of just how much the choices we make in designing and deploying AI tools can matter and how important rigorous measurement and experimentation are to making sure that we maximize the benefits and minimize the risks of the tools that we build. With that, I’ll close. I have some links here to the papers I’ve discussed, and I’m looking forward to any comments and questions that folks might have.

Thank you.