Microsoft Research Blog

Artificial intelligence

Gaps Between Research and Practice When Measuring Representational Harms Caused by LLM-Based Systems

November 22, 2024

To facilitate the measurement of representational harms caused by large language model (LLM)-based systems, the NLP research community has produced and made publicly available numerous measurement instruments, including tools, datasets, metrics, benchmarks, annotation instructions, and other techniques. However, the research community lacks clarity about whether…
Dimensions of Generative AI Evaluation Design

November 18, 2024

There are few principles or guidelines to ensure evaluations of generative AI (GenAI) models and systems are effective. To help address this gap, we propose a set of general dimensions that capture critical choices involved in GenAI evaluation design. These dimensions include the evaluation setting,…
Dukawalla: Voice Interfaces for Small Businesses in Africa

November 16, 2024

Small and medium-sized businesses (SMBs) often struggle with data-driven decision-making due to a lack of advanced analytics tools, especially in African countries where they make up the majority of the workforce. Though many tools exist, they are not designed to fit into the ways of working…
Evaluating Generative AI Systems is a Social Science Measurement Challenge

November 16, 2024

Across academia, industry, and government, there is an increasing awareness that the measurement tasks involved in evaluating generative AI (GenAI) systems are especially difficult. We argue that these measurement tasks are highly reminiscent of measurement tasks found throughout the social sciences. With this in mind,…
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

November 1, 2024

Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover…
Large Language Models Can Provide Accurate and Interpretable Incident Triage

October 28, 2024

Large-scale cloud services frequently experience incidents that can have a significant impact on their stability. Incident triage is a critical process that assigns incidents to dedicated teams for resolution. However, traditional rule-based methods, commonly employed in various systems, have limitations due to a finite set…
Farmer.Chat: Scaling AI-Powered Agricultural Services for Smallholder Farmers

October 8, 2024

Small and medium-sized agricultural holders face challenges like limited access to localized, timely information, impacting productivity and sustainability. Traditional extension services, which rely on in-person agents, struggle with scalability and timely delivery, especially in remote areas. We introduce FarmerChat, a generative AI-powered chatbot designed to…
Differential Transformer

October 7, 2024

Transformers tend to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels…
i-Code Studio: A Configurable and Composable Framework for Integrative AI

October 6, 2024

Artificial General Intelligence (AGI) requires comprehensive understanding and generation capabilities for a variety of tasks spanning different modalities and functionalities. Integrative AI is one important direction to approach AGI, through combining multiple models to tackle complex multimodal tasks. However, there is a lack of a…
On Evaluating LLMs’ Capabilities as Functional Approximators: A Bayesian Perspective

October 6, 2024

Recent works have successfully applied Large Language Models (LLMs) to function modeling tasks. However, the reasons behind this success remain unclear. In this work, we propose a new evaluation framework to comprehensively assess LLMs' function modeling abilities. By adopting a Bayesian perspective of function modeling,…
Predictability of identifier naming with Copilot: A case study for mixed-initiative programming tools

October 1, 2024 | Michael Jing Long Lee, Advait Sarkar, and Alan F. Blackwell

Studies show that predictive text entry systems make writing faster, but written content more predictable. We consider whether these trade-offs extend to code synthesis tools such as GitHub Copilot. While Copilot can make developers produce code faster, it may also affect how they choose identifiers…
Maia-2: A Unified Model for Human-AI Alignment in Chess

September 29, 2024

There are an increasing number of domains in which artificial intelligence (AI) systems both surpass human ability and accurately model human behavior. This introduces the possibility of algorithmically-informed teaching in these domains through more relatable AI partners and deeper insights into human decision-making. Critical to…
