Microsoft Copilot scales AI workflows to hundreds of millions with Durable Task Scheduler

Copilot’s rapid growth introduced complex, long-running AI workflows at massive scale, but fragmented retry logic and custom recovery made it difficult to ensure reliability and maintain engineering velocity.

Microsoft Copilot standardized on Durable Task Scheduler in Azure Functions to provide a unified orchestration layer that automatically handles state, failure recovery, and replay for complex, multi-step AI workflows.

Durable Task Scheduler enabled Copilot to reliably run hundreds of millions of weekly executions, standardizing resilience across 25+ orchestrations and 10+ microservices while accelerating feature delivery at global scale.

Microsoft Copilot serves hundreds of millions of people. On the surface, the experience feels simple: answer a question, summarize content, draft an email, manage a schedule. But Copilot is no longer just a conversational interface. It is a platform with increasingly diverse capabilities, each operating at massive scale.

Some capabilities are user-initiated, for example, "search my past conversations." Others are autonomous, like "generate my weekly digest." Some are one-time, such as "run this research task." Others are perpetual, for instance, "keep my profile recommendations fresh." All of them depend on background execution, meaning coordinated, multi-step workflows that must track state, survive failures, and complete correctly even as load spikes, infrastructure shifts, or unexpected errors occur.

The workload categories are diverse. Scheduled AI execution, personalized content pipelines, memory indexing, account management, data enrichment, and deletion workflows each have different timing characteristics, failure modes, and correctness requirements. Yet they all share a common need: they must run reliably at Copilot's scale, serving hundreds of millions of users.

“Every new capability needed its own retry logic and failure recovery. That was not sustainable. When we standardized on Durable Task Scheduler, every team got the same resilience guarantee without building it themselves. That is what unlocked our shipping speed,” said Ryan Nowak, Member of Technical Staff, Microsoft Copilot. 

Why retry loops were not enough 

As these background workloads grew, so did the engineering challenge. A traditional queue and retry approach can move work through a system, but it leaves teams responsible for correctness over time. If a workflow fails halfway through, after validating a task but before scheduling the next recurrence, or after fetching user preferences, but before ranking results, teams need custom logic to resume from that exact point. They need idempotency, state checkpoints, and decisions about what data persists, where the failure occurred, and how to resume without duplicating work. 
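To make that burden concrete, here is a minimal sketch of the hand-rolled checkpointing a single two-step workflow would need under a queue-and-retry model. Everything in it is hypothetical: the in-memory CHECKPOINTS dictionary stands in for a durable store, and the step functions are placeholders rather than Copilot code.

```python
# Illustrative only: hand-rolled checkpointing for one two-step workflow.
# All names (CHECKPOINTS, run_task, the step functions) are hypothetical.

CHECKPOINTS = {}  # stands in for a durable store keyed by workflow id


def validate_task(task_id):
    return f"validated:{task_id}"


def schedule_next(task_id):
    return f"scheduled:{task_id}"


def run_task(task_id, fail_before_schedule=False):
    """Run validate -> schedule, recording a checkpoint after each step."""
    state = CHECKPOINTS.setdefault(task_id, {"step": 0, "results": []})

    if state["step"] < 1:
        state["results"].append(validate_task(task_id))
        state["step"] = 1  # checkpoint: validation is done

    if fail_before_schedule:
        raise RuntimeError("worker crashed before scheduling")

    if state["step"] < 2:
        state["results"].append(schedule_next(task_id))
        state["step"] = 2  # checkpoint: recurrence is scheduled

    return state["results"]


try:
    run_task("t1", fail_before_schedule=True)
except RuntimeError:
    pass  # a queue would redeliver the message at this point

results = run_task("t1")  # the retry skips validation and resumes at scheduling
print(results)
```

Even in this toy form, the step counter, the guards, and the storage format are all code the feature team has to write, test, and keep consistent for every workflow.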

For Copilot's workload mix, where workflows run for seconds to minutes and include multiple dependent steps, this model multiplied complexity. Building custom recovery logic for each of the 25+ orchestrations Copilot needed would have been operationally untenable. And building custom checkpointing for every new workflow would have slowed development and fragmented reliability patterns across teams. 

The Copilot engineering team reframed the problem. Instead of letting each team build its own recovery and retry logic, they standardized on Durable Task Scheduler (DTS) in Azure Functions as the unified execution engine for all long-running, multi-step work. This moved durability from a per-feature concern to a platform-level capability. Workflows became durable orchestrations by design, with state persistence, failure recovery, and replay handled automatically.
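The idea behind replay-based recovery can be sketched in a few lines. The toy model below is not the Durable Task SDK API. It assumes a deterministic orchestrator written as a Python generator, a hypothetical HISTORY list standing in for persisted state, and placeholder activities. On restart, the driver feeds recorded results back into the orchestrator instead of re-running completed steps.

```python
# Toy model of replay-based recovery. This is NOT the Durable Task SDK API;
# run_orchestration, HISTORY, and the activity names are hypothetical.

HISTORY = []  # stands in for the persisted history of activity results
CALLS = []    # records which activities actually executed


def fetch_preferences(user):
    CALLS.append("fetch_preferences")
    return {"user": user, "topics": ["news"]}


def rank_results(prefs):
    CALLS.append("rank_results")
    return [f"{topic}-top" for topic in prefs["topics"]]


def orchestrator(user):
    """Deterministic workflow: each yield requests one activity."""
    prefs = yield (fetch_preferences, user)
    ranked = yield (rank_results, prefs)
    return ranked


def run_orchestration(user, crash_after=None):
    """Drive the generator, replaying recorded results before doing new work."""
    gen = orchestrator(user)
    step = 0
    try:
        request = next(gen)
        while True:
            if step < len(HISTORY):
                result = HISTORY[step]  # replay: reuse the stored result
            else:
                if crash_after is not None and step >= crash_after:
                    raise RuntimeError("worker restarted")
                func, arg = request
                result = func(arg)
                HISTORY.append(result)  # record the result before moving on
            step += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value


try:
    run_orchestration("u1", crash_after=1)  # fails after the first activity
except RuntimeError:
    pass

final = run_orchestration("u1")  # resumes: only rank_results runs this time
print(final, CALLS)
```

In this sketch the workflow author writes only the orchestrator function. The history, the replay loop, and the resume logic live in the driver, which is the role the article attributes to the platform.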

“A user schedules a task that needs to run every morning for months. If a server restarts or a region fails, the workflow picks up right where it left off and runs to completion. That's the durable execution guarantee our users depend on,” said Ryan Nowak, Member of Technical Staff, Microsoft Copilot.

One reliability model, 25+ orchestrations 

The power of standardizing on DTS became clear not in any single orchestration, but in the breadth of workloads it could handle. Copilot now uses DTS across 25+ orchestrations in 10+ microservices, with workflows authored using the Durable Task SDKs in services running on Azure Kubernetes Service, all following the same reliability pattern.

What makes this significant is the diversity of problems solved with one resilience model. 

Copilot Tasks (scheduled AI execution): A user schedules a task. The system validates state, invokes AI execution on schedule, then for recurring tasks uses durable timers and sub-orchestrations to chain the next execution, indefinitely and reliably.

Personalized discovery: A pipeline orchestrates profile retrieval, queries multiple grounding sources, ranks millions of candidates via AI, hydrates images, and composes the final feed. If any step gets interrupted, the orchestration resumes from that exact point.

Memory and search indexing: Conversation histories continuously flow into batch workflows that process embeddings, update search indexes, and maintain freshness. Progress persists at each step, so infrastructure shifts never cause data loss.

Deep research: When a user asks Copilot for in-depth research, a multi-minute agentic session coordinates web searches, synthesis, and report building. DTS ensures the session continues reliably even if underlying nodes shift or recycle mid-execution.

Shopping recommendations and account lifecycle: Multi-step operations including ranking, enrichment, account deletion, data linking, and entitlement synchronization all coordinate through orchestrations with strict completion semantics.

The same pattern also powers notification delivery, AI podcast generation, video and 3D content pipelines, healthcare profile suggestions, and Edge browser journeys. It set the pattern other teams now follow. 
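The recurring-task pattern described for Copilot Tasks, where each execution chains the next through a durable timer, can be illustrated with a small simulation. This is a sketch under stated assumptions, not DTS code: TIMERS is a hypothetical in-memory stand-in for persisted timers, tick is a hypothetical scheduler loop, and catching up on runs missed during an outage is a design choice of this example rather than documented DTS behavior.

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for persisted timers: task_id -> next fire time.
TIMERS = {}
RUN_LOG = []


def execute_task(task_id, fire_time):
    RUN_LOG.append((task_id, fire_time))


def run_once_and_chain(task_id, fire_time, interval):
    """One execution: run the task, then record the timer for the next one."""
    execute_task(task_id, fire_time)
    TIMERS[task_id] = fire_time + interval  # chain the next recurrence


def tick(now, interval):
    """Fire every timer that is due at or before `now`."""
    for task_id, due in list(TIMERS.items()):
        while due <= now:
            run_once_and_chain(task_id, due, interval)
            due = TIMERS[task_id]


start = datetime(2026, 1, 1, 8, 0)
daily = timedelta(days=1)
TIMERS["digest"] = start

tick(start, daily)  # first morning runs on time
# Simulated outage: nothing calls tick for the next two mornings,
# but the stored timer survives.
tick(start + timedelta(days=3), daily)  # recovery catches up on missed runs

print(len(RUN_LOG), TIMERS["digest"])
```

Because the next fire time is stored rather than held in a sleeping process, a restart loses nothing: the scheduler only has to compare the stored time against the clock.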

Deliberate rollout, compounding results 

The rollout was deliberate. Copilot applied the same DTS-powered resilience model across different workload shapes as new needs emerged, rather than adopting a separate reliability approach for each one. Each additional use case reinforced the same lesson: once durability lived in the orchestration layer, teams could keep the programming model consistent while the platform handled persistence, recovery, and replay underneath.

By April 2026, DTS had become foundational infrastructure in Copilot, supporting both user-facing experiences and behind-the-scenes systems at sustained production scale. 

“The same orchestration pattern powers scheduled AI tasks, personalized feeds, memory indexing, deep research, and shopping recommendations. We built one reliability model, and it scaled across 25 orchestrations and hundreds of millions of weekly executions,” said Ryan Nowak, Member of Technical Staff, Microsoft Copilot.

Impact at scale 

The real shift was operational. Durable orchestration became a repeatable standard across teams and services, allowing new capabilities to plug into an existing resilience model instead of rebuilding one. 

Metric	Volume
Total DTS activity invocations (weekly)	Hundreds of millions
Memory indexing activities (weekly)	~300 million
Discovery feed generation activities (weekly)	~85 million
Shopping recommendation activities (weekly)	~27 million
Distinct orchestrations	25+
Microservices using DTS	10+
Copilot Tasks scheduled executions (daily)	~40,000
Unique users with active tasks	~15,000
Scheduled task execution growth	3x in one month

Snapshot of current usage. These volumes are still growing, and DTS is provisioned to scale beyond them. 
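For a rough sense of what these weekly volumes mean as sustained throughput, the figures in the table can be converted to average per-second rates. This is back-of-the-envelope arithmetic only: it assumes load is spread evenly across the week, which real traffic is not, and it uses the approximate values from the table as exact inputs.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

# Approximate weekly activity counts taken from the table above.
weekly = {
    "memory indexing": 300_000_000,
    "discovery feed": 85_000_000,
    "shopping recommendations": 27_000_000,
}

rates = {name: count / SECONDS_PER_WEEK for name, count in weekly.items()}
for name, per_second in rates.items():
    print(f"{name}: ~{per_second:.0f} activities/second on average")

# Copilot Tasks: ~40,000 scheduled executions per day across ~15,000 users.
runs_per_user_per_day = 40_000 / 15_000
print(f"scheduled executions per active user per day: ~{runs_per_user_per_day:.1f}")
```

Under those assumptions, the averages come to roughly 496, 141, and 45 activities per second for the three categories, and about 2.7 scheduled executions per active user per day. Peak rates would be higher than these averages.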

A blueprint for AI-scale orchestration

For engineering and platform leaders, the Copilot story reflects a clear systems design choice: place durability in the orchestration layer rather than in individual features. Durable timers, replay-based recovery, and sub-orchestration composition reduce custom reliability work while supporting high throughput across heterogeneous AI workloads.

The result is a platform that scales both execution volume and feature velocity without fragmenting reliability patterns across teams. As Copilot continues to grow the breadth and depth of its AI capabilities, the DTS-powered orchestration layer provides the foundation to do so reliably, at global scale.

Discover more about Microsoft Copilot on Instagram, LinkedIn, X/Twitter, and YouTube.