Thousands of businesses rely on Microsoft Azure to host mission-critical apps quickly and seamlessly. To ensure flawless infrastructure for more than 200 Azure data centers globally, Microsoft engineers and data scientists leverage Gandalf, an internal machine learning service powered by Azure Data Explorer (ADX). This solution enables internal teams to deploy frequent and secure updates of Azure components while ensuring no interruptions to Azure’s customers. By processing huge volumes of data in near real time, ADX empowers developers and data scientists to work faster and innovate at speed.
Keeping Azure infrastructure stable can be difficult; frequent updates are needed to tighten security, tune performance, and align services with customers' evolving expectations. "To keep pace with technological and market changes, we needed a big data analytics solution that could handle huge volumes of data faster," says Nick Swanson, Principal Software Engineering Manager at Microsoft.
Embedding resilience into cloud-scale infrastructure
The Azure Compute Insights team deployed Azure Data Explorer (ADX), a fast, fully managed big data analytics service, to replace a legacy solution. With the shift, the software engineering team sought to improve issue detection and to prevent bad rollouts from affecting customers' services. "For virtual machines (VMs), we strive to ensure the highest availability possible, as customers are very sensitive to interruptions," explains Swanson. "So, it's critical not to deploy builds that interrupt virtual machines, and should that happen, to stop them immediately."
To ensure flawless deployment of infrastructure components, the Azure Compute Insights team built Gandalf Safe Deployment, which monitors more than 20 TB of data from ADX every day for over 2,000 fault types. When anomalies occur, it immediately pauses the rollout and notifies the development team. "We wanted to get faster results. Because Gandalf is powered by ADX, it leverages the very flexible and intuitive Kusto Query Language (KQL) as well as the unparalleled performance of Azure Data Explorer," Swanson says.
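The article does not describe how Gandalf decides that a fault count is anomalous, so the following Python sketch is only an illustration of the general "detect, then pause" idea. The function name, the z-score rule, and the threshold are all assumptions made for this example, not Gandalf's actual logic.

```python
from statistics import mean, stdev
from typing import Sequence


def should_pause_rollout(
    baseline_counts: Sequence[float],
    current_count: float,
    z_threshold: float = 3.0,  # assumed cutoff, chosen only for illustration
) -> bool:
    """Return True when the current fault count looks anomalous.

    Hypothetical rule: compare the current count against the mean and
    sample standard deviation of counts seen before the rollout.
    """
    if len(baseline_counts) < 2:
        # Not enough history to estimate variability; do not flag.
        return False
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        # Perfectly flat baseline: any increase counts as a deviation.
        return current_count > mu
    return (current_count - mu) / sigma > z_threshold


# Example: a quiet baseline followed by a sudden spike in one fault type.
if should_pause_rollout([2, 3, 2, 3, 2], 40):
    print("pause rollout and notify the development team")
```

A real system would evaluate a check like this per fault type and per deployment batch; the single-number comparison here is kept deliberately simple.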
Frequent checks, faster response times
A significant enhancement over the legacy solution is faster and more granular anomaly detection. The team built a data pipeline that splits big queries into smaller time windows so that data is processed every minute. Evaluating deployments in small batches greatly reduces detection latency, which helps keep Microsoft's cloud infrastructure highly available. "We can process data quickly, without risking a timeout," says Ze Li, Principal Data and Applied Scientist at Microsoft. "If an issue occurs right after deployment, we can detect it within minutes."
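The windowing idea can be sketched in a few lines of Python. This is a minimal illustration, not the team's pipeline: the table name `FaultEvents` and the columns `Timestamp` and `FaultType` are hypothetical, and all datetimes are assumed to be in UTC. The sketch only builds KQL query text for each one-minute window; sending the queries to a cluster is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterator, Tuple


def split_into_windows(
    start: datetime,
    end: datetime,
    step: timedelta = timedelta(minutes=1),
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive half-open windows [window_start, window_end) covering [start, end)."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)  # last window may be shorter
        yield cursor, window_end
        cursor = window_end


def build_window_query(table: str, window_start: datetime, window_end: datetime) -> str:
    """Build a KQL query restricted to one window.

    Table and column names are hypothetical; datetimes are assumed to be UTC.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        f"{table}\n"
        f"| where Timestamp >= datetime({window_start.strftime(fmt)})"
        f" and Timestamp < datetime({window_end.strftime(fmt)})\n"
        f"| summarize FaultCount = count() by FaultType"
    )


# Example: two and a half minutes of data become three small queries.
start = datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 0, 2, 30, tzinfo=timezone.utc)
queries = [build_window_query("FaultEvents", s, e) for s, e in split_into_windows(start, end)]
```

Because each window is half-open, no event is counted twice at a boundary, and each query touches only a small slice of data, which is what keeps individual queries well below any timeout.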
Additionally, the user-friendly, high-performing KQL helps speed up innovation when the team works on new services. "KQL lets developers iterate hundreds of times faster than other languages, so we can implement and improve our service rapidly," Swanson says. And because ADX comes with visual dashboards, rollout data is more accessible and easier for development teams to read. "Our product teams get a live view of their rollouts to check their status and see if they are interrupting VMs," he adds.
Letting data scientists enjoy their work more
From the data scientist's point of view, working with the enhanced system means more time to spend on meaningful work. "You don't need to plan your query; you just do it. And if an error occurs, fixing it is a matter of seconds," shares Li.
"We can run queries effectively against huge volumes of data, and because KQL is so powerful and flexible, we really enjoy working with it," Li adds.
Swanson confirms that the evolving ADX ecosystem aligns with the changing demands of the Azure Compute Insights team and supports increasingly complex use cases. "As ADX has grown, we have grown along with it. Today, it's a core piece of technology we use more than any other solution in our day-to-day lives and our organization," he says.