Sales Qualification Agent: How we evaluated and improved AI quality with benchmarks


The Sales Qualification Agent (SQA) in Dynamics 365 Sales introduces a new class of autonomous sales AI, one that does far more than assist with drafting or summarization. SQA performs multi-step reasoning, conducts live web research, generates personalized outreach, and engages prospects in multi-turn qualification conversations. These capabilities directly shape pipeline quality, seller productivity, and customer relationships. 

As agentic AI becomes deeply embedded in revenue-critical workflows, trust must be earned through transparent, repeatable, and rigorous evaluation—not anecdotal wins or point demos.

Today, we’re announcing the Microsoft Sales Bench, a collection of evaluation benchmarks designed to assess the performance of AI-powered sales agents across real-world scenarios. The collection already includes the Sales Research Bench, published earlier to evaluate the Sales Research Agent. We are now adding the Sales Qualification Bench to evaluate the Sales Qualification Agent in Dynamics 365 Sales.

This post presents the detailed evaluation methodology and results for the agent, including a head-to-head comparison against ChatGPT using identical data, tasks, and scoring rubrics. These efforts establish the first benchmark purpose-built to measure end-to-end sales agent workflows, from research to outreach to live qualification.

SQA Architecture  

The Dynamics 365 Sales Qualification Agent (SQA) architecture is designed as an end-to-end, enterprise-grade AI system that autonomously researches leads, synthesizes insights, and generates seller-ready outreach. It combines an intelligence engine powered by large language models with iterative web and enterprise data research, tightly integrated with Dynamics 365 Sales and Microsoft Copilot Studio for orchestration. Built on secure enterprise foundations, the architecture enforces governance, compliance, and data protection while enabling scalable, trustworthy AI-driven sales workflows. 

Evaluation Metrics and Methodology 

To understand how well the Sales Qualification Agent (SQA) performs in real-world sales qualification workflows, we designed the Sales Qualification Bench, a comprehensive evaluation that mirrors how sellers actually research leads, personalize outreach, and engage with prospects. Our goal was straightforward: measure whether SQA can help reps qualify faster, personalize more effectively, and carry higher-quality customer conversations—using the same signals and information they rely on every day. 

To ensure that the evaluations accurately represent real-world conditions, we developed a testbed that closely mirrors the complexity and ambiguity found in contemporary sales environments. This allowed us to evaluate SQA end to end, from autonomous research and reasoning to grounded, actionable research briefs, outreach messages, and multi-turn qualification conversations. 

Evaluation Setup

To ensure real-world fidelity, we constructed a production-like lead evaluation environment that mirrors how SQA operates in Dynamics 365 Sales. 

Lead and Data Corpus 
  • Three synthetic but realistic seller companies (C1) across distinct industries, with unique: 
    • Product offerings 
    • Knowledge sources 
    • Ideal customer profiles 
  • 300+ lead dataset (C2) expanded into a scenario-rich corpus: 
    • Companies across 6 global regions (North America, Europe, Asia, South America, Australia, Africa) 
    • 33 industries 
    • Mixed clarity (well-known brands and long-tail companies) 
    • Structured attributes (name, role, email) 
  • CRM roles represented
    • Sales representatives 
    • Digital specialists 
    • Customer success managers 
    • Each linked to relevant accounts, opportunities, and cases 
  • Company segment coverage
    • Enterprise 
    • Mid-Market 
    • Small Business 
    • Government 
    • Education 
  • 500+ email exchanges simulating real sales interactions: 
    • Technical product questions 
    • Meeting requests 
    • Ambiguous or low-intent inquiries 
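
The corpus dimensions above can be pictured as a simple record type. The following Python sketch is illustrative only: the `Lead` and `Segment` names, the field layout, and the sample rows are assumptions made for this example, not the benchmark’s actual schema. It shows one way to check that every company segment is represented in a lead set.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, List


class Segment(Enum):
    """Company segments listed in the post."""
    ENTERPRISE = "Enterprise"
    MID_MARKET = "Mid-Market"
    SMALL_BUSINESS = "Small Business"
    GOVERNMENT = "Government"
    EDUCATION = "Education"


@dataclass(frozen=True)
class Lead:
    """Hypothetical lead record; the field names are assumptions."""
    name: str
    role: str
    email: str
    company: str
    region: str
    industry: str
    segment: Segment


def segment_counts(leads: Iterable[Lead]) -> Dict[Segment, int]:
    """Count leads per segment, including segments with zero leads."""
    counts = Counter(lead.segment for lead in leads)
    return {segment: counts[segment] for segment in Segment}


def missing_segments(leads: Iterable[Lead]) -> List[Segment]:
    """Return the segments that have no leads, in declaration order."""
    return [seg for seg, n in segment_counts(leads).items() if n == 0]


# Made-up rows for illustration; these are not benchmark data.
sample = [
    Lead("Lead A", "IT Director", "a@example.com", "Example Co",
         "Europe", "Retail", Segment.ENTERPRISE),
    Lead("Lead B", "Procurement Lead", "b@example.com", "Example Agency",
         "Asia", "Public Sector", Segment.GOVERNMENT),
]
print([seg.value for seg in missing_segments(sample)])
# prints ['Mid-Market', 'Small Business', 'Education']
```

A check like this is useful before running an evaluation, because a segment with no leads would silently drop out of any per-segment score.
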
Simulated Agent Workflows 

All evaluations reflected real SQA behavior: 

  • Autonomous web-based research 
  • Role-aware outreach generation 
  • Multi-turn qualification conversation handling 
Tasks Evaluated and Evaluation Metrics 
1. Company Research 

For each lead, the agent generates a structured research brief including: 

  • Business overview, strategy and priorities 
  • Financial signals 
  • Recent news relevant to the seller 

| Metric | Definition |
| --- | --- |
| Recency | Measure of how recent time-sensitive insights are relative to the current date (older insights are less useful to sellers). |
| Relevance and Solution Fit | Measure of how well the insights are tied back to sellers’ offerings (relevant insights are more actionable than a regurgitation of facts) and articulate the lead company’s need for or interest in them. |
| Completeness | Measure of how well the insights capture all the facts that are useful to a seller. |
| Reliability | Measure of how consistently the agent finds useful insights for the seller (e.g., strategic priorities return current strategic priorities and not generic mission statements; news returns news articles and not generic evergreen statements about a company). |
| Credibility | Measure of how reputable the sources referenced by the agent are. |
2. Lead Outreach 

Based on its research, the agent generates a personalized email aligned to:

  • The lead’s role 
  • The seller’s value proposition 
  • The company’s business context 
  • Value-based positioning 
     
| Metric | Definition |
| --- | --- |
| Clarity | Assesses how clear, precise, and jargon-free the message is, ensuring every sentence adds value. |
| Personalization | Measures how well the email is tailored to the specific target company, using concrete company-level details rather than generic industry language. |
| News-anchored opening | Checks whether the email references recent company events or updates, ensuring the outreach feels timely and current. |
| Relevance and Solution Fit | Measure of how well the insights are tied back to sellers’ offerings/solutions (relevant insights are more actionable than a regurgitation of facts) and articulate the lead company’s need for or interest in them. |
| Structure | Evaluates whether the email has a clear logical flow from opening hook to problem, solution, and call to action. |
3. Qualification Conversations (Engage) 

The agent then autonomously engages back and forth with the lead, progressively asking questions about customer-configured qualification criteria (such as budget, need, and timeline) and answering the lead’s questions, such as:

  • “What does your solution do?” 
  • “How are you priced?” 
  • “How do you compare to competitors?” 
  • “Who else uses this?” 

| Metric | Definition |
| --- | --- |
| Answer Quality | Assesses whether the agent provides clear, relevant, and complete answers that directly address the customer’s intent. |
| Agent Comprehension | Evaluates how well the agent understands customer intent, prioritizes requests, and adapts tone and strategy based on the user’s response. |
| Answer Readability | Checks that responses are natural, professional, easy to read, and fully compliant with formatting and content rules. |
| Human handoff accuracy | Ensures the agent correctly flags when human intervention is required, such as for unanswered technical questions, legal/billing requests, meeting requests, or explicit requests for a human. |
| Discovery question coverage | Measures how effectively the agent qualifies leads using indirect, strategic discovery questions across Need, Budget, Authority, and Timeline. |

Each metric is scored independently on a 0–10 scale, where higher scores indicate stronger performance. We used an LLM-as-a-judge approach to score outputs against the ground truth and rubric and manually reviewed a sampled subset of evaluations to calibrate the judges and validate scoring consistency. To reduce judge variance and mitigate hallucination risk, each sample was evaluated five times, and the mean across runs was recorded as the final score. 
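
The aggregation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark’s implementation: the function name `final_score`, the strict check on the number of runs, and the example scores are assumptions. Only the 0–10 scale, the five repeated judge runs, and the use of the mean come from the description above.

```python
from statistics import mean
from typing import Sequence

RUNS_PER_SAMPLE = 5               # each sample is judged five times
SCORE_MIN, SCORE_MAX = 0.0, 10.0  # rubric scale: 0 to 10


def final_score(judge_runs: Sequence[float]) -> float:
    """Average the repeated judge scores for one sample and one metric."""
    if len(judge_runs) != RUNS_PER_SAMPLE:
        raise ValueError(
            f"expected {RUNS_PER_SAMPLE} runs, got {len(judge_runs)}"
        )
    for score in judge_runs:
        if not SCORE_MIN <= score <= SCORE_MAX:
            raise ValueError(f"score {score} is outside the 0-10 scale")
    return mean(judge_runs)


# Illustrative numbers only; these are not results from the benchmark.
example_runs = [7.0, 8.0, 8.0, 7.0, 8.0]
print(final_score(example_runs))
```

Averaging over repeated runs smooths out run-to-run disagreement in the judge, so a single unusually high or low judgment has less influence on the recorded score.
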

Benchmarking Strategy with ChatGPT 

To ensure an objective and fair comparison, we replicated a standard seller workflow in the ChatGPT UI using GPT-4.1 with a Pro license, a more advanced model than the GPT-4.1-mini variant currently used by SQA.

Standard Prompting 

This setup simulates how a seller naturally interacts with a general-purpose LLM: 

  • High-level contextual instructions only 
  • Mirrors SQA’s autonomous research-to-outreach flow 

This ensures: 

  • Workflows remain representative and unbiased 
  • Comparisons reflect real-world usability, not prompt-engineering skill 
Identical Knowledge Sources and Context 

ChatGPT was given the exact same knowledge sources as SQA, including: 

  • Full lead information and seller value proposition 
  • Seller Q&A documentation via the SharePoint connector 
  • Historical conversation context for reply generation 

This isolates differences in agent reasoning and orchestration, not data access. 

Evaluation Results  

Microsoft evaluated the Sales Qualification Agent (SQA) and ChatGPT on more than 300 leads, covering research, outreach, and qualification tasks with identical knowledge sources. Evaluations completed on December 4, 2025, showed that SQA consistently outperformed ChatGPT (GPT-4.1).

  • Research: SQA was 6% more effective at relevant, thorough company research. 
  • Outreach: SQA was 20% better at personalized communication and timely event references. 
  • Engagement: SQA scored 16% higher for precise responses and targeted qualifying questions. 
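
One way to read percentages like these is as a relative difference between mean rubric scores. The sketch below works under that assumption; the exact formula behind the reported figures is not specified here, and the function name `relative_lift` and the input values are hypothetical.

```python
def relative_lift(agent_mean: float, baseline_mean: float) -> float:
    """Percentage by which agent_mean exceeds baseline_mean."""
    if baseline_mean <= 0:
        raise ValueError("baseline mean must be positive")
    return (agent_mean - baseline_mean) * 100.0 / baseline_mean


# Hypothetical mean scores on the 0-10 scale, not benchmark results.
print(relative_lift(6.0, 5.0))  # prints 20.0
```

Under this reading, a one-point gap counts for more when the baseline score is low than when it is high, which is worth keeping in mind when comparing lifts across task categories.
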

SQA also operates autonomously, reducing overhead and boosting pipeline quality for sales teams. 

Results by Task Category 

1. Company Research 

SQA was 6% better than ChatGPT, with more relevant and complete research that highlighted the lead company’s interest in the seller’s offerings:

  • SQA provided more relevant results: To spend their time on the most important leads, sellers need to determine whether a lead is a good fit for their offerings. While both SQA and ChatGPT were given the same context (seller company and value proposition of the offerings), SQA consistently did better at tying its research back to this context, helping sellers determine fit. Appendix A shows an example where SQA was able to tie the company’s strategic priorities to its need for a collaboration platform and infer strong purchase ability from its robust operational health and minimal leverage burden.
  • SQA synthesized results with a higher level of fidelity and completeness: The agent’s value is directly tied to its ability to eliminate tedious work for the seller. SQA produced a more detailed research synthesis (as demonstrated in Appendix A), giving the seller a single, trusted source for any insights they may need.

These results stem from numerous experiments aimed at optimizing web research for the best outcomes at minimal cost, rather than relying on costly advanced models. Sellers get deeper insights from SQA’s agentic RAG, which reasons in real time over iterative web search results, combined with unique capabilities that increase data coverage, such as auto-linking CRM records and extracting the company name from lead emails.

2. Personalized Outreach 

SQA was 20% better than ChatGPT, notably ahead in the level of personalization and mentions of relevant recent events that will resonate with the lead. 

  • More personalization and customer-centricity: A lead is more likely to respond to a cold outreach email that directly explains how the seller’s offering can address their needs. SQA did so effectively by starting with the lead’s situation and recent events, while ChatGPT often focused on the seller and used heavier technical jargon. A clear, actionable call to action closes the email and guides the conversation forward. Appendix B shows an example of how SQA was able to tie a recent acquisition the lead’s company made to the value proposition of the seller’s offering.

These results are based on direct engagement with sellers – every sales team that deploys SQA gives us precious feedback that all other customers benefit from.   

3. Qualification Conversations (Engage) 

SQA was 16% better than ChatGPT. It responded with greater precision to the lead’s questions to develop purchase interest and asked pointed discovery questions to better qualify the lead before handing off to a seller. 

  • Answers accurately by correctly understanding the lead’s intent and maintaining conversation context effectively. To drive deeper buyer consideration, SQA independently answered even the most technical questions that leads had about the seller’s offerings while maintaining the context from earlier messages in the simulated conversation, delivering clear, direct, and well-structured responses. Appendix C demonstrates SQA’s ability to pull the most relevant information from provided knowledge sources (in this case, files with technical specifications) during an ongoing conversation with a lead. 
  • Handles uncertainty responsibly, handing off to a supervisor/seller when appropriate. Both SQA and ChatGPT were instructed to hand off a lead to a supervising seller when a suitable response could not be generated or when the lead was considered qualified per predefined criteria. SQA handed off accurately and at the right moment in more tests than ChatGPT.
  • Demonstrates strong discovery coverage. To maximize the value exchange from each follow-up conversation with the lead, SQA and ChatGPT were instructed to include discovery questions in their responses to assess preconfigured qualification criteria (covering the lead’s need, budget, buying authority, and purchase timeline). SQA asked pointed discovery questions that covered more of these criteria than ChatGPT did in our simulated conversations. As a result, SQA identified and handed off better-qualified leads through its engagement.
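
To make the idea of discovery coverage concrete, the following Python sketch counts how many of the four qualification criteria received at least one discovery question in a conversation. It is a simplified stand-in for illustration only: the benchmark scores this metric with an LLM judge on a 0–10 rubric, and the function name `discovery_coverage`, the per-turn labels, and the example conversation are assumptions.

```python
from typing import Iterable, Set

# The four qualification criteria named in the post.
CRITERIA = ("need", "budget", "authority", "timeline")


def discovery_coverage(asked_per_turn: Iterable[Set[str]]) -> float:
    """Fraction of criteria that received at least one discovery question.

    asked_per_turn holds, for each agent turn, the set of criteria that the
    turn's discovery questions target. Assigning those labels (by a human
    reviewer or a judge model) is assumed to happen upstream.
    """
    covered: Set[str] = set()
    for asked in asked_per_turn:
        # Ignore labels that are not one of the four known criteria.
        covered.update(c for c in asked if c in CRITERIA)
    return len(covered) / len(CRITERIA)


# Hypothetical three-turn conversation; not taken from the benchmark.
conversation = [{"need"}, {"budget"}, {"need", "timeline"}]
print(discovery_coverage(conversation))  # prints 0.75
```

In this example, "authority" is never asked about, so three of the four criteria are covered.
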

These gains are attributable to the hard lessons we have learnt through close collaboration with customers to understand the diversity of needs on intent detection and knowledge retrieval across AI agents for Dynamics 365 Sales, Service and Contact Center applications.  

Conclusion 

Sales Qualification Agent (SQA) is not a simple productivity tool—it is a complex multi-step agent directly influencing revenue outcomes. The Sales Qualification Bench represents a foundational step toward enterprise-grade trust, transparency, and continuous quality improvement for agentic AI in sales. 

Through rigorous, scenario-driven benchmarking, SQA demonstrated clear leadership across research, outreach, and qualification—while also surfacing concrete architectural and rule-quality investments for the next phase of growth. 

Upcoming plans for the Sales Qualification Bench include using the benchmark for continuous improvement of the Sales Qualification Agent, running further comparisons against a wider range of competitive offerings, and publishing the eval package so customers can reproduce it on their own data. Evaluation is not a one-time event. It is now embedded as a first-class engineering discipline in the SQA development lifecycle—ensuring the agent evolves safely, measurably, and in lockstep with seller needs. 

Appendix

Sample inputs and responses for each category  

Appendix A. Sample of research synthesis generated by SQA
Task 
Find and summarize business overview, strategic priorities, financial summary, and recent news.
C1: Microsoft Teams
C2: Agência Nacional de Águas e Saneamento Básico
Research Response 
Overview
Agência Nacional de Águas e Saneamento Básico (ANA) is a Brazilian federal regulatory agency in the public sector, focused on water resources management and basic sanitation. It operates nationwide with headquarters linked to the Ministry of Integration and Development. ANA is responsible for regulating the use of water resources, managing the security of dams, overseeing public irrigation, and establishing national standards for basic sanitation services. The agency serves primarily government entities and industry—a B2B context, as it sets legislation and frameworks for critical infrastructure projects. Its mission centers on sustainable water resource management for development and environmental security. This background is highly relevant to Microsoft Teams’ value proposition, as ANA’s nationwide role and collaboration among stakeholders (internal teams, other agencies, private partners) demand robust, secure digital communication and collaboration platforms like Teams, which would support ANA’s operations, regulatory projects, and document governance requirements.

Strategic priorities
ANA’s current strategic priorities include strengthening climate resilience, advancing national water governance, and digitalizing monitoring infrastructure to improve the management of Brazil’s water resources. The agency has prioritized participation in international programs, digital transformation in water resource management, and enhancing collaboration tools for more efficient regulatory processes. It seeks integrative technology for better data sharing, persistent project documentation, and cross-functional coordination among regulatory, legal, and operational teams. These needs align closely with Microsoft Teams’ capabilities: integrated communication, persistent chat channels, seamless meetings, and data governance tools are well-suited to ANA’s initiatives and pain points around digital collaboration, transparency, and security in government operations.

Financial summary
Agência Nacional de Águas e Saneamento Básico (ANA) demonstrates stable financial health as a federally funded autarchy, with annual revenues sourced exclusively from government allocations. Its updated budget stands at R$458.63 million, with R$416.56 million in current revenues and R$277.8 million in expenses paid. The agency has executed approximately 89% of its forecasted budget with minimal reliance on outside debts or capital markets, indicating low financial risk and cost-effective operations. For a seller like Microsoft Teams, ANA shows consistent funding and robust operational execution, supporting strong purchasing capacity for communication and collaboration products and services.

News 
Agência Nacional de Águas e Saneamento Básico (ANA) launches new Digital Transformation Plan for 2025–2027: Published on 16/09/2025, ANA’s new Digital Transformation Plan prioritizes digital advancement for the next three years. This aligns with the value proposition of Microsoft Teams, as modernization initiatives may involve upgrading communication and collaboration infrastructure, making ANA potentially receptive to integrated solutions like Teams. 

ANA publishes revised Information and Communication Technology Master Plan (PDTIC) for 2024–2026: On 24 February 2025, ANA approved updates to its Master Plan for IT and Communication, emphasizing ongoing activities by its IT department. For Microsoft Teams, this is relevant as the procurement and deployment of communication platforms may form part of the plan’s priorities and budget.  

ANA releases 2025–2026 regulatory agenda with focus on digital processes and operational standards in water and sanitation: The new regulatory agenda, published 12/12/2024, includes digital governance, public-private partnerships, and operational standards for water and sanitization. Growing focus on digital solutions to support these agendas could signal interest in platforms that enable cross-team alignment, document management, and virtual collaboration, matching Microsoft Teams capabilities.  

Appendix B. Sample of an initial outreach email generated by SQA 

Task 
Write a cold sales outreach email to a lead. Don’t include a sign-off or signature in the email body. 
Lead ID: `d3f74f84-7e2b-4c16-9c37-919c2b8c2c6e` 
Responses 
Hi Akira, 

I noticed the Japan Meteorological Agency’s recent procurement notice for a cloud-based meteorological communication system, highlighting your focus on secure and efficient collaboration.  

Microsoft Teams can support JMA’s mission by consolidating chat, video calls, and file sharing into a single platform, enabling seamless cross-team collaboration on forecasts and critical data. This integrated approach aligns with your strategic priorities of disaster prevention and timely information sharing across multiple locations.  

Would you be available for a quick call in the next few days to discuss further?  
 

Appendix C. Sample of an engagement email generated by SQA 

Task 
Last Customer Reply
“We’re constantly battling fragmented conversations—Slack channels get out of control, and switching to Teams for meetings is just more friction. Documents get lost between email threads and random cloud folders. Sales and marketing are rarely on the same page, especially when we’re chasing deals. Honestly, half the team just ends up texting when things get urgent. If you can solve that, I’m listening”
Response 
Hi Liam, 

Thanks for sharing the specific challenges your team faces with fragmented conversations, switching between Slack and Teams, and document management. 

Microsoft Teams offers several features that directly address these issues: 
+ Unified chat and channels keep conversations organized by project or department, reducing fragmentation. 
+ Integrated meetings and calling allow seamless transitions from chat to video, minimizing friction between platforms. 
+ Secure file sharing and real-time co-authoring help prevent documents from getting lost across email threads and cloud folders. 
+ Deep integration with Microsoft 365 apps ensures sales and marketing teams stay aligned, with shared access to files and collaborative tools. 
+ External collaboration features allow you to work securely with guests and partners without switching accounts.  

To help tailor recommendations, could you share what budget range you have considered for improving your communication tools? Let me know if you’d like more details or have other questions about streamlining collaboration at CSU. 
