Monitoring end-to-end enterprise health with Azure

Aug 11, 2020

Microsoft Digital has created a unified telemetry platform to provide end-to-end enterprise health monitoring and data-driven decision-making capabilities at all relevant business levels. This platform helps Microsoft Digital gain key insights about features, customer experience, usage patterns, and effectiveness across the entire business.

Microsoft Digital has created a unified telemetry platform to maintain a robust, scalable, and reliable set of services. As our digital transformation has progressed, we’ve enhanced the way we monitor our service and business process health. By building on the monitoring and data analytics capabilities in Microsoft Azure, our unified telemetry platform (UTP) provides telemetry, monitoring, logging, and reporting across our entire business landscape, connecting previously disconnected services and business processes. UTP helps us gain key insights about features, customer experience, usage patterns, and effectiveness across our entire business. We’ve built UTP into an end-to-end, enterprise-scale solution that provides data-driven decision-making capabilities at all relevant business levels.

Understanding the service and business process environment

At Microsoft Digital, we build and operate the IT infrastructure on which Microsoft runs. The Microsoft Digital infrastructure consists of more than 900 individual services that combine to provide Microsoft technical operations for our employees and customers worldwide. These services include wire payment processing, vendor payment, purchasing, Xbox support, and payroll. These services span multiple business groups and technical teams. We host these services in cloud platforms distributed throughout the world. These critical services support and run alongside the business processes that define how our business operates.

Managing and maintaining the health of these services and business processes is critical to Microsoft business operations. As we digitally transform, our teams are delivering compelling and efficient services quickly and with increased expectations of quality and user experience.

Challenges in service and business process health monitoring

Historically, our service monitoring and management capabilities were developed by teams focused on specific services or groups of services. Within this environment, individual business groups could obtain relevant monitoring data to help them improve their services and business processes. However, the monitoring solutions and associated taxonomy often didn’t provide business-level context or reporting. Many of our business processes depended on multiple services, and obtaining a high-level perspective or understanding dependencies across services or business processes was difficult. The most significant challenges included:

  • Multiple, isolated monitoring platforms.
  • Inconsistent data sources.
  • Lack of monitoring continuity across multiple services.
  • No end-to-end observability for business groups or executive-level stakeholders.
  • No single interface for stakeholders to observe the health of services, business operations, customer experience, and enterprise health across the organization.

Focusing on enterprise health

Our vision for UTP is to provide a modern, reliable, scalable, and cost-effective telemetry platform that Microsoft Digital builds to run and analyze its infrastructure, application, and business processes. Our goal for UTP is to give our teams the ability to understand and improve their platforms, features, and processes regardless of whether they own them end-to-end. We want end-to-end enterprise health to be the focus of our monitoring platform, providing a holistic health perspective, which includes:

  • Enterprise health. Enterprise health defines how our organization runs, from a broad level. This level is informed by the underlying operational and service health. It summarizes, aggregates, and often simplifies the data to provide a comprehensive, high-level perspective.
  • Operational health and customer experience. Operational health focuses on the health of business processes that define how Microsoft Digital operates. We base operational health on underlying service health data, but it also includes logical business process and customer experience data that doesn’t derive directly from service health metrics.
  • Service health. Service health defines technical reliability and availability of our service infrastructure and provides a comprehensive overview of our technology platforms along with individual metrics for individual services.
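
As an illustration only, the way each level is informed by the one beneath it can be pictured as a simple roll-up. The status names, the worst-of aggregation rule, the sample services, and the process names below are all assumptions made for this sketch; they don't describe how UTP actually computes health.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical health states, ordered from best to worst."""
    HEALTHY = 0
    DEGRADED = 1
    UNHEALTHY = 2


def roll_up(statuses):
    """Summarize child statuses as the worst one (an assumed rule)."""
    return max(statuses, default=Health.HEALTHY)


# Service health: one status per service (sample data, not real telemetry).
service_health = {
    "vendor-payment": Health.HEALTHY,
    "purchasing": Health.DEGRADED,
    "payroll": Health.HEALTHY,
}

# Operational health: each business process depends on one or more services.
process_dependencies = {
    "procure-to-pay": ["vendor-payment", "purchasing"],
    "hire-to-retire": ["payroll"],
}
operational_health = {
    process: roll_up(service_health[s] for s in services)
    for process, services in process_dependencies.items()
}

# Enterprise health: a single summary across all business processes.
enterprise_health = roll_up(operational_health.values())
print(operational_health, enterprise_health)
```

A fuller model would also blend in the business process and customer experience signals that don't derive directly from service health metrics.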

We designed UTP to fulfill the monitoring and reporting requirements across these levels, from engineer to business operations manager to business leader. We established metrics, observability standards, and standard measurements by designing our platform to meet the needs of the following levels of organization:

  • Business leader. Our business leaders need to understand the health of the enterprise, which includes how their respective business groups are operating and how business operations affect them. This level provides the broadest overview of enterprise health by relying on the metrics exposed at lower levels.
  • Business operations manager. Our business operations managers need to understand the health of their business processes. These metrics focus on the core operations of our business and the general health of the services that support them. They inform the broad level metrics for the business leader and are informed by the core service metrics provided to the engineer.
  • Engineer. At the most fundamental level, our engineers need to understand core service health. This reflects the “how” of our operations, including how each service is running, based on the health of its components and related dependencies.

The three levels of metrics and requirements for end-to-end enterprise monitoring, including what for business leaders, why for business operations managers, and how for engineers.
Figure 1. Designing UTP for relevant metrics and requirements at every level of the organization

Our design decisions for UTP are informed by these three levels of organization. For example, we adopted the HEART (Happiness, Engagement, Adoption, Retention, and Task success) framework to measure key user-focused metrics throughout the service and business process environment. We use HEART’s data-driven approach and combine it with direct user communication to create a comprehensive measurement of user sentiment. This measurement enables stakeholders to accurately understand the user experience and identify the correct investment areas for the product. To enable HEART across Microsoft Digital, we leveraged UTP to ingest, onboard, store, analyze, and report on HEART-specific metrics.

There are two important aspects of UTP design that define how UTP is built. Guiding design principles call out our functional goals for UTP, and standard design elements define how we achieve those goals using architecture and data components.

Guiding design principles

Our design outline defines how UTP is built. It describes how we create our architecture and expose metrics for consumption by UTP users.

We aligned our guiding design principles with the broader vision for digital transformation at Microsoft Digital. Mapping out an end-to-end overview of the business, defining a vision, prioritizing scenarios, creating a release roadmap, building the product, and defining metrics based on business outcomes are core to digital transformation across Microsoft. Being vision-led is the primary driver of our digital transformation. This means having a clear overview of where we want to take things and what we need to get there.

Implementing our vision involves adhering to guiding principles for our design. These principles keep the design process focused on our vision while ensuring that our approach meets the needs of the business. Our three design principles are:

  • Speed to value. UTP is designed with an intuitive onboarding process that allows monitoring and reporting within minutes, giving value back to teams immediately.
  • Employee experience. We believe that adhering to a common schema and a common platform is an important investment in the Microsoft Digital business. By enforcing commonality where it is necessary, but being otherwise flexible, we provide value for the entire organization, from engineers to business operations managers to business leaders.
  • Modern architecture. We align our architecture and design with industry-accepted initiatives and engineering principles for big data and monitoring. We want to leverage our platform as a paradigm for telemetry at an enterprise scale for any organization.

Standard design elements

UTP is built upon standard design elements that follow our guiding design principles. These design elements define how UTP stores data, how it exposes metrics, and how we design tools and architecture. Our standard design elements include:

  • Single data store and common schema. We use a common data store and schema to standardize both incoming telemetry and outgoing metrics. This enables the use of common tools, data capture processes, and visualizations across all services and business groups. It creates consistent data about our services and users and enables simple correlations across data structures. Our common schema defines the following:
    • Common fields for all events
    • Domain-specific fields unique to each event type
    • Custom fields defined by the application owner

    We also use our existing service management database to populate core service information from across the organization.

  • Minimum standards. Our minimum standards define a set of key performance indicators that span the Microsoft Digital environment to address the most critical areas for monitoring:
    • Actionable service data. Near real-time, actionable observations of user experience data for issue monitoring, response, and resolution for service and business process users.
    • Service insight. Visual insights that provide consistent data across all services and applications within a portfolio.
    • Cross-service correlation. A common taxonomy for collected data with correlation vectors that enable end-to-end joining and tracking of data based on user, device, and service information.
    • Customer insight. A collection of correlated user scenarios and workflow information.
  • Standard tools and architecture. Our standard tools and architecture ensure that the instrumentation, collection, monitoring, and analysis of services adhere to Microsoft Digital and industry standards. We’ve built UTP on Microsoft Azure to provide enterprise cloud scalability, agility, and reusability.
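
To make the three tiers of the common schema concrete, here is a minimal sketch of an event with common, domain-specific, and custom fields. The class name, the field names, and the flattening convention are hypothetical and chosen for illustration; the actual UTP schema isn't published here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class TelemetryEvent:
    # Common fields shared by all events (field names are assumptions).
    timestamp: datetime
    service_id: str
    event_type: str
    correlation_vector: str
    # Domain-specific fields unique to each event type.
    domain: dict[str, Any] = field(default_factory=dict)
    # Custom fields defined by the application owner.
    custom: dict[str, Any] = field(default_factory=dict)

    def to_row(self) -> dict[str, Any]:
        """Flatten the event into one row for a common data store."""
        row = {
            "timestamp": self.timestamp.isoformat(),
            "service_id": self.service_id,
            "event_type": self.event_type,
            "correlation_vector": self.correlation_vector,
        }
        row.update({f"domain.{k}": v for k, v in self.domain.items()})
        row.update({f"custom.{k}": v for k, v in self.custom.items()})
        return row


event = TelemetryEvent(
    timestamp=datetime(2020, 8, 11, 12, 0, tzinfo=timezone.utc),
    service_id="vendor-payment",
    event_type="request",
    correlation_vector="cv-001",
    domain={"duration_ms": 125, "success": True},
    custom={"region": "emea"},
)
row = event.to_row()
print(row)
```

Keeping the common fields mandatory while leaving the other two tiers open mirrors the idea of enforcing commonality only where it is necessary.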

Implementing UTP

UTP fulfills two purposes and performs three functions for our business. These purposes and functions represent the practical implementation of our vision for UTP, and they inform all aspects of UTP design and capability. The design, architecture, and onboarding processes are built on the core functionality we’ve built into UTP.

Purposes

The two primary purposes for UTP directly affect how we designed UTP and what we require from each of the UTP functions. These purposes are business driven. They define the key information we need and why we need it.

Service health monitoring

Service health monitoring occurs at the engineering level. It involves telemetry from the more than 900 services on which our business depends. Service health monitoring exposes the health of our services from an engineering perspective and a user perspective. The combination of these two perspectives helps us to:

  • Identify and fix issues with services at the engineering level before they become a problem.
  • Create service health metrics that translate well to the business operations level, where our business operations managers can identify how service-level issues impact users and customers.

Business process monitoring

A business transaction can span multiple services in different business units. Each component has business process telemetry that UTP needs to ingest. By explicitly defining business process definitions using configuration files, the UTP business analytics engine can generate the business process monitoring analytics data model based on:

  • All events that belong to the same business transaction instance, aggregated and correlated based on cross-correlation vectors. Cross-correlation vectors link tables across data sources to maintain data relationships for end-to-end reporting.
  • Metrics for each instance of the transaction or all transactions combined.
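
The two outputs listed above can be sketched as follows: events are grouped by a shared correlation vector, then metrics are computed per transaction instance and across all instances. The event fields (`cv`, `service`, `step`, `time`, `ok`) and the sample values are hypothetical, and the chosen metrics are only examples of what a business analytics engine might produce.

```python
from collections import defaultdict
from datetime import datetime

# Sample events from two services (hypothetical field names and values).
events = [
    {"cv": "txn-1", "service": "purchasing", "step": "order_created",
     "time": "2020-08-11T09:00:00", "ok": True},
    {"cv": "txn-1", "service": "vendor-payment", "step": "payment_sent",
     "time": "2020-08-11T09:05:00", "ok": True},
    {"cv": "txn-2", "service": "purchasing", "step": "order_created",
     "time": "2020-08-11T10:00:00", "ok": True},
    {"cv": "txn-2", "service": "vendor-payment", "step": "payment_sent",
     "time": "2020-08-11T10:20:00", "ok": False},
]


def correlate(events):
    """Group events by correlation vector, ordered by time within each group."""
    instances = defaultdict(list)
    for event in events:
        instances[event["cv"]].append(event)
    for steps in instances.values():
        steps.sort(key=lambda e: e["time"])
    return dict(instances)


def instance_metrics(steps):
    """Compute metrics for a single transaction instance."""
    start = datetime.fromisoformat(steps[0]["time"])
    end = datetime.fromisoformat(steps[-1]["time"])
    return {
        "duration_s": (end - start).total_seconds(),
        "succeeded": all(e["ok"] for e in steps),
        "services": sorted({e["service"] for e in steps}),
    }


instances = correlate(events)
per_instance = {cv: instance_metrics(steps) for cv, steps in instances.items()}

# Metrics for all transactions combined.
count = len(per_instance)
success_rate = sum(m["succeeded"] for m in per_instance.values()) / count
mean_duration_s = sum(m["duration_s"] for m in per_instance.values()) / count
print(per_instance, success_rate, mean_duration_s)
```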

Functions

The primary functions define what UTP does, including what we monitor, how we manage the telemetry data, and how the monitoring result manifests. The three functions combine into data pipelines, which are organized by service and business function.

Data ingestion

Our data ingestion practices affect every aspect of the UTP environment. UTP receives 10 million queries per month and hosts more than 150 terabytes of raw and processed data. Our data ingestion systems provide methods for teams to capture and push their telemetry data into UTP and store the data in Azure Blob Storage for transformation and transportation into warm or cold data paths.

UTP uses Application Insights as the default and preferred ingestion method for incoming data. We provide build-time onboarding with a custom software development kit (SDK) built on the Application Insights SDK. For applications that can’t incorporate build-time onboarding, we offer run-time onboarding using Application Insights. UTP allows bulk ingestion for externally developed applications or applications that can’t use build-time or run-time onboarding to ingest their application telemetry data. For an example, refer to End-to-end telemetry for SAP on Azure. UTP’s bulk ingestion tool supports ingestion and transformation using Application Insights across multiple channels, including file-based data sources, Azure Event Hubs, or Structured Query Language (SQL) database stores.
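
The order of preference among the three onboarding paths can be summarized as a small decision function. This is a sketch of the selection logic only: the `App` type, its two capability flags, and the returned labels are hypothetical, and no Application Insights API is called.

```python
from dataclasses import dataclass


@dataclass
class App:
    name: str
    can_change_build: bool    # team can add an SDK at build time (assumed flag)
    can_attach_runtime: bool  # run-time instrumentation can be attached (assumed flag)


def choose_onboarding(app: App) -> str:
    """Pick an onboarding path in the order of preference described above."""
    if app.can_change_build:
        return "build-time"
    if app.can_attach_runtime:
        return "run-time"
    # Externally developed apps, or apps that support neither option.
    return "bulk"


apps = [
    App("internal-portal", can_change_build=True, can_attach_runtime=True),
    App("legacy-service", can_change_build=False, can_attach_runtime=True),
    App("external-erp", can_change_build=False, can_attach_runtime=False),
]
choices = {app.name: choose_onboarding(app) for app in apps}
print(choices)
```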

Data transformation

Once ingested, data is available in Azure Blob Storage as raw files. Using Azure Functions, the data is transformed to the common schema and pushed into an Azure Data Explorer cluster that contains the common data store for reporting and visualizing. Transformation occurs on a pipeline-by-pipeline basis, and we customize it to meet the telemetry needs of the application, service, and business process owners. High-level steps for data transformation include:

  1. Read from raw data folders.
  2. Convert the raw data into the common schema structure.
  3. Add service or business-specific attributes based on the folder structure where UTP stores raw data.
  4. Store the data in structured Azure Data Explorer tables for user consumption.
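
The four steps above can be sketched in a few lines. This sketch runs against a local temporary folder rather than Azure Blob Storage, and it returns rows instead of writing to Azure Data Explorer. The folder layout (`<business>/<service>/*.json`), the raw field names, and the output column names are assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def transform_folder(raw_root: Path) -> list[dict]:
    """Apply the four steps to every raw file under raw_root."""
    rows = []
    # Step 1: read from raw data folders (assumed layout: <business>/<service>/*.json).
    for path in sorted(raw_root.glob("*/*/*.json")):
        business, service = path.parts[-3], path.parts[-2]
        for raw in json.loads(path.read_text(encoding="utf-8")):
            # Step 2: convert the raw record into the common schema structure.
            row = {
                "timestamp": raw.get("time"),
                "event_type": raw.get("type", "unknown"),
                "custom": {k: v for k, v in raw.items() if k not in ("time", "type")},
            }
            # Step 3: add attributes derived from the folder structure.
            row["business"] = business
            row["service"] = service
            rows.append(row)
    # Step 4: hand back rows ready to load into a structured table.
    return rows


# Usage with a temporary folder standing in for raw storage.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "finance" / "vendor-payment"
    folder.mkdir(parents=True)
    (folder / "batch1.json").write_text(
        json.dumps([{"time": "2020-08-11T09:00:00", "type": "request", "ms": 42}]),
        encoding="utf-8",
    )
    rows = transform_folder(Path(tmp))
print(rows)
```

Deriving the service and business attributes from the folder path keeps the raw payload untouched, which is one reason a folder convention can be convenient for step 3.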

Data visualization

UTP users manage data visualization through Azure Data Explorer and Power BI. Data Explorer queries, filters, and functions provide optimized reporting capabilities for engineering, business, or executive-level monitoring and reporting. Owners access the query engine and visualizations directly from the Application Insights portal to perform all monitoring and reporting tasks and access reports and visualizations relevant to their level of the business. Raw telemetry data is also available to individual teams for further analysis, insights, and machine learning.

The UTP architecture. Application Insights supplies data to Azure Storage Queue and Azure Storage Blob Containers. An Azure Function contains data transformation and routing logic which is monitored by Azure Monitor and made available to the end-user via Azure Data Explorer.
Figure 2. UTP architecture

Moving forward

In the era of digital transformation, modernizing applications empowers organizations to be more efficient in creating business value. However, as applications change, they introduce attack vectors and potential exposures. Microservices, cloud-native applications, APIs, and mobile applications need continual review and application of security best practices. UTP provides the foundation for capturing security telemetry, and we’re strengthening UTP’s security monitoring capabilities to capture unauthorized access attempts, suspicious admin actions, and security validation failures in critical business process scenarios.

We’re also developing onboarding for applications and services with existing Log Analytics workspaces, where owners can use an Azure PowerShell script to deploy the necessary resources in their Azure subscription to copy Log Analytics data into UTP automatically.

Conclusion

UTP provides logging, monitoring, alerting, and reporting across our entire business landscape, connecting previously disconnected components. Built on Azure, UTP helps us gain critical insights about features, customer experience, usage patterns, and effectiveness across services and business processes. We’ve built UTP into an end-to-end, enterprise-scale solution that provides data-driven decision-making capabilities at all relevant business levels. In line with our ongoing digital transformation, we’re growing and adapting UTP to accommodate and enable our continually changing business needs. With UTP, our entire organization can understand the end-to-end health of our entire enterprise, from engineer to executive.