In Core Services Engineering (CSE, formerly Microsoft IT), we’ve implemented Microsoft Operations Management Suite (OMS) to provide cloud-based monitoring that can encompass our entire IT operations environment, from the cloud to the datacenter. We’ve created a management model for OMS across the enterprise that helps us deploy and manage our OMS environment effectively at each level of our enterprise structure, while still allowing our business groups to administer their monitoring and management environments in the way that best suits their business needs.
Moving the enterprise to the cloud
The cloud-first, mobile-first culture at Microsoft is designed to give our business groups the most effective IT environment possible. For our IT teams, this means quickly creating that environment, to the required scale, and in a cost-effective manner. Microsoft has championed a move to the cloud because it gives us the infrastructure we need to power the next generation of business applications. It also elevates collaboration and productivity, thereby enabling our employees to be more successful.
Microsoft Azure is at the core of our cloud infrastructure. We’re continually moving applications to the cloud, and Azure is the first choice for new IT solutions that we implement. We currently support the largest public cloud–based corporate IT infrastructure in the world using Azure. Our Azure environment includes:
- More than 1,700 active Azure subscriptions.
- More than 250 cloud-based applications.
- More than 16,000 Azure virtual machines.
- More than 18 billion Azure Active Directory authentications per week.
- More than 30 trillion objects stored on Microsoft Azure.
Our business runs on Azure, and we’re dedicated to increasing our footprint in the cloud by migrating our on‑premises resources to Azure. By the end of the 2017 fiscal year, we plan to have over 90 percent of our IT infrastructure hosted in Azure.
Managing the cloud from the cloud
With so many of our IT resources hosted in the cloud, we’re looking for more effective and agile ways to manage them. We’re using OMS in the enterprise to integrate our cloud resources into our enterprise monitoring solution and to provide new ways of monitoring and managing those resources. OMS provides several benefits when monitoring Azure resources:
- It’s cloud-based. We don’t need to add on-premises infrastructure to our environment, and we’re remaining in the cloud network space to perform our monitoring and management tasks.
- It collects and analyzes logs and reporting across multiple platforms, including Microsoft Azure and Windows Server, thereby enabling comprehensive monitoring and management.
- It automates tasks by using familiar operations components and automation tools, like PowerShell scripts and runbooks.
- It provides an environment that can be quickly created, customized, and implemented. It doesn’t require complicated setup or significant administration of the OMS infrastructure because it’s cloud-based and user-directed.
- It provides direct integration with Azure services, and it can be scaled to fit our Azure environment without procuring infrastructure resources or large capital expenditures.
Adopting an enterprise view of OMS
Our strategy for OMS uses a model that provides a view of our entire cloud operation, but it puts monitoring and managing control in the hands of our business units that manage their own apps and infrastructure. We’ve created an enterprise environment in which Microsoft customers and partners can tailor this implementation strategy to meet their requirements and standards. Our environment doesn’t require extra IT infrastructure management or a hierarchical reporting and control structure.
Providing an enterprise-wide OMS infrastructure
Our OMS implementation is designed for all of CSE. We’re in the early stages of rollout to our organization, and we currently have more than 25,000 systems reporting into our enterprise OMS implementation using the OMS agent. We collect data from three primary sources:
- On-premises operating system instances. These are physical or virtual servers that are still located in our datacenters. Some of our applications leverage on-premises components, so it’s important to capture on-premises data to provide a complete monitoring picture.
- Azure assets that are connected back to our on-premises network using ExpressRoute. Resources that are connected to the corporate network with ExpressRoute have separate monitoring considerations from those Azure resources that aren’t connected to the corporate network.
- Azure assets that aren’t connected via ExpressRoute. These are Azure resources that maintain network connectivity only within the Azure environment.
Our underlying structure for Azure resources is based heavily on Azure subscriptions and Azure Resource Manager. Resource Manager provides specific levels of granularity and manageability with its resource groups, and we leverage the Resource Manager model throughout our OMS environment. We encourage our business groups to use and share Resource Manager templates for deployment and management of Resource Manager resources.
Our OMS implementation is intentionally simple, using standard best practices and architectural design. The OMS workspaces we create act as the central collection point for agent data, and they’re available to different levels of our OMS model. Figure 1 illustrates our typical OMS architecture.
Establishing a model for administration
We’ve adopted a federal, state, city, neighborhood model for OMS administration. Each of the four levels corresponds to a Microsoft organizational unit or an Azure management boundary. The levels give us clear lines between the groups involved with OMS throughout our organization. They also set the stage for the separation of OMS responsibilities and scope of management. We chose this model to ensure that we could capture the business data we needed to properly evaluate our enterprise infrastructure performance. We didn’t choose it to give us centralized, enterprise control over the OMS implementation at CSE. Instead, it gives control of OMS primarily to the business groups that support the environment that OMS is monitoring. Our teams know their infrastructure and environments, and we want them to make the important functional decisions about how OMS integrates with their solutions. Figure 2 illustrates our administration model, using a federal, state, city, neighborhood structure to represent the different areas of scope within our cloud environment.
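As a minimal sketch of how the four levels nest, the Python below models each level and the boundary it corresponds to in this article. The names `Level`, `BOUNDARY`, `parent_of`, and `path_to_top` are hypothetical and exist only for illustration. They are not part of OMS or of our toolkit.

```python
from enum import Enum


class Level(Enum):
    FEDERAL = "federal"
    STATE = "state"
    CITY = "city"
    NEIGHBORHOOD = "neighborhood"


# Boundary each level corresponds to, as described in this article.
BOUNDARY = {
    Level.FEDERAL: "CSE enterprise",
    Level.STATE: "business group / Azure tenant",
    Level.CITY: "Azure subscription",
    Level.NEIGHBORHOOD: "Azure resource group",
}

# Levels ordered from widest scope to narrowest.
ORDER = [Level.FEDERAL, Level.STATE, Level.CITY, Level.NEIGHBORHOOD]


def parent_of(level):
    """Return the next-wider level, or None for the federal level."""
    index = ORDER.index(level)
    return ORDER[index - 1] if index > 0 else None


def path_to_top(level):
    """List the levels from the given one up to the federal level."""
    path = [level]
    while parent_of(path[-1]) is not None:
        path.append(parent_of(path[-1]))
    return path


if __name__ == "__main__":
    for level in path_to_top(Level.NEIGHBORHOOD):
        print(f"{level.value}: {BOUNDARY[level]}")
```

The hierarchy here describes scope only. It doesn't imply top-down control, which matches the intent of the model.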
The federal level represents the CSE enterprise. The enterprise level is our highest view of IT operations. The purpose of this level is to perform aggregated and combined monitoring and reporting. We use this level to ensure that our executive teams have up-to-date and relevant information about our entire organization.
The federal level is designed primarily as a scoping level. For OMS implementation, there aren’t any hard rule sets enforced at the federal level. Business groups are permitted and encouraged to implement their OMS workspaces in a way that best suits their business needs. Our primary administrative activities at this level are to:
- Provide guidance for OMS setup and implementation in the form of Quick Start guides.
- Provide security and compliance governance.
- Act as an on-ramp for OMS, encouraging and assisting business groups with adoption. This includes providing a toolkit that creates functional examples in OMS such as:
  - Creating a workspace.
  - Adding resources to the workspace.
  - Adding solutions and doing preliminary configuration.
Monitoring is the biggest role at the federal level, even though it’s the smallest part of overall OMS activity. Most monitoring is done at the state level (business group) and passed up to the federal OMS workspace. Our primary monitoring considerations at the federal level are:
- Performance. We want to ensure proper performance of the overall system and our combined business units.
- Compliance. Legal compliance is an enterprise concern, so it’s monitored at this level.
- Inventory. We keep track of overall inventory and enterprise-relevant numbers.
- Integration. Feeds to business intelligence systems integrate other data for a more comprehensive view into external systems.
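The roll-up of inventory from lower levels to the federal view can be sketched as a simple aggregation. This is an illustration under stated assumptions, not how OMS computes its views: the report shape, the state and city names, and the counts are all made up, and `roll_up` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical city-level reports: (state, city, counts by resource type).
city_reports = [
    ("finance", "sub-a", {"vm": 120, "sql_db": 14}),
    ("finance", "sub-b", {"vm": 40, "sql_db": 3}),
    ("hr", "sub-c", {"vm": 25}),
]


def roll_up(reports):
    """Aggregate city-level counts into per-state and federal totals."""
    by_state = {}
    federal = Counter()
    for state, _city, counts in reports:
        # Counter.update adds the counts to any existing totals.
        by_state.setdefault(state, Counter()).update(counts)
        federal.update(counts)
    return by_state, federal


if __name__ == "__main__":
    by_state, federal = roll_up(city_reports)
    print(dict(federal))
```

Each state keeps its own totals, and the federal totals are the sum across states. That mirrors the idea that monitoring happens lower down and only aggregated numbers move up.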
State—business groups and Azure tenants
The state level is where the bulk of the control in OMS implementation lies. Business groups are encouraged to create and manage their OMS workspaces in the way that best fulfills their needs. The bulk of actual OMS activity occurs at the city/subscription level, and the state level provides logical views of combined data for cities and neighborhoods that reside within the scope of the state. State activities include:
- Using the OMS workspace to manage resources and OMS subscriptions.
- Enabling or disabling OMS capabilities based on state-level or federal-level policies.
- App portfolio monitoring, telemetry, and automation of runbooks or scripts.
The city level is where the bulk of OMS activity occurs. The city level uses the Azure subscription as its boundary, which means that permissions, control, and usage are configured at this level. Our implementation toolkit is designed to operate at the city level. It provides access to manageability tasks such as:
- Automated setup of an OMS instance.
- Automated integration with the Azure Virtual Machine management platform.
- Automated discovery of all the assets in the subscription.
- Enabling integration with enterprise update solutions.
- Enabling log and performance collection for state and federal solutions and reporting.
The IT teams that are managing the subscriptions at each city level are encouraged to adopt the position of systems integrators, using OMS in a way that best fits their solutions.
Neighborhood—Azure resource groups
Within each city-level subscription are the neighborhood levels, which correspond to Resource Manager resource groups. We use resource groups to group together services that are associated with specific applications, services, or service lines. Resource groups provide the flexibility to use role-based access control (RBAC) to manage each resource group independently or as part of a greater group. OMS leverages RBAC and enables a parallel management structure. Typically, the neighborhood level is used to provide unique monitoring views or automation actions for specific components of a solution.
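Because Azure role assignments made at a wider scope are inherited by the scopes beneath them, the relationship between the city (subscription) and neighborhood (resource group) levels can be sketched as a comparison of resource ID segments. The helper names below are hypothetical, and the subscription ID and resource group names in the usage lines are placeholders. This is a simplified model of scope inheritance, not a call into any Azure API.

```python
def resource_group_scope(subscription_id, resource_group):
    """Build the Azure resource ID of a resource group."""
    return f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"


def _segments(resource_id):
    # Compare segment by segment, ignoring case and empty segments.
    return [part.lower() for part in resource_id.split("/") if part]


def scope_covers(assignment_scope, resource_id):
    """True if a role assigned at assignment_scope is inherited by resource_id."""
    scope = _segments(assignment_scope)
    target = _segments(resource_id)
    return target[: len(scope)] == scope


if __name__ == "__main__":
    sub = "00000000-0000-0000-0000-000000000000"  # placeholder subscription ID
    city = f"/subscriptions/{sub}"
    neighborhood = resource_group_scope(sub, "rg-app")  # placeholder name
    print(scope_covers(city, neighborhood))          # city covers neighborhood
    print(scope_covers(neighborhood, city))          # not the other way round
```

Comparing whole segments rather than raw string prefixes keeps a resource group named `rg-app` from appearing to cover one named `rg-app2`.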
At this level, one common use of OMS is Application Insights. It monitors user experience and application performance. Using telemetry across applications provides a user-centric view that allows us to better understand user behavior. These insights enable us to make intelligent decisions based on business processes for future improvements to the entire application portfolio. For more information, see Understanding our business with app telemetry.
Onboarding to OMS—DataMall case study
CSE built DataMall as a solution to centrally store data used by apps in the CSE environment. By using DataMall, Microsoft developers can access data from a central repository that collects and stores corporate data. DataMall provides two primary functions at Microsoft:
- It addresses the challenges of monitoring an app in a hybrid environment.
- It provides a more robust monitoring environment.
DataMall uses both on-premises and Azure components to provide its functionality.
Addressing the challenges of monitoring an app in a hybrid environment
Due to the broad set of data sources referenced and stored by DataMall and the hybrid nature of its architecture and infrastructure, we monitored DataMall using a combination of technologies that provide operations monitoring and management for most components of the application. However, many monitoring and remediation efforts were isolated from each other. This isolation resulted in incomplete data monitoring and a tedious process for finding and resolving issues with DataMall functionality.
We wanted a monitoring and management solution that could mitigate some of the most important shortcomings of our pre-existing solution, including:
- We had no direct visibility of end-to-end app health. We understood how certain pieces of DataMall were functioning, but it was difficult to see how those pieces interacted with each other and what the overall health of the app was.
- We wanted detailed monitoring of the collect and consume component and how it relates to the rest of the DataMall environment.
- We hoped to host the monitoring and management solution in the cloud.
We onboarded DataMall to OMS to provide more comprehensive monitoring and management. The DataMall onboarding process was relatively simple. We used the following steps:
- Create the OMS workspace.
- Associate OMS with the Azure subscription that hosts DataMall. This put DataMall at the city level within our federated architecture.
- Select the functional solutions we wanted, such as change tracking and automation.
- Install and configure the OMS agent on the on-premises computers and Microsoft Azure virtual machines that make up DataMall.
- Select the event log and performance counters to monitor.
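The ordering of these steps can be sketched as a small checklist. The step identifiers and the `next_step` helper are hypothetical names chosen for this illustration. They don't correspond to OMS commands or to our toolkit.

```python
# Onboarding steps in the order listed above (identifiers are illustrative).
ONBOARDING_STEPS = [
    "create_workspace",
    "associate_subscription",
    "select_solutions",
    "install_agents",
    "select_logs_and_counters",
]


def next_step(completed):
    """Return the first step not yet completed, or None when onboarding is done."""
    for step in ONBOARDING_STEPS:
        if step not in completed:
            return step
    return None


if __name__ == "__main__":
    done = {"create_workspace", "associate_subscription"}
    print(next_step(done))
```

Returning the earliest missing step, rather than the step after the latest completed one, keeps the sequence intact if a step is skipped by mistake.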
After the initial configuration, the OMS dashboard gave us usable and actionable information that spanned the entire DataMall infrastructure. Because OMS is cloud-based, the administration and monitoring consoles are available from anywhere. This offers a more accessible environment for our team. We don’t need to sign in or remotely connect to a specific server to use OMS. We can simply open a browser window and navigate to the OMS portal.
We’re currently using OMS to monitor and manage the following features and functions of DataMall:
- On-premises and virtual machine server operations:
  - Memory usage
  - CPU usage
  - Disk usage
- IIS functionality:
  - Site response time, failures, and general availability
  - Site content validity and SSL certificates
- SQL monitoring:
  - Database health
  - Job failures
- Event log monitoring
OMS runbooks automate how exceptions or certain alerts are handled. For example, if an IIS website is unavailable beyond a certain threshold time, we run a PowerShell script to recycle the application pool for the involved components. Runbooks also provide similar automation for SQL databases and back-end processes to automatically remediate issues that arise with DataMall. Figure 3 illustrates the logic diagram for OMS runbooks for DataMall.
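The threshold rule in the IIS example can be sketched as follows. This is an illustration of the decision logic only, written in Python rather than the PowerShell our runbooks use. The five-minute threshold, the probe data shape, and the function and action names are assumptions, because the article doesn't state the actual values.

```python
from datetime import datetime, timedelta

# Assumed threshold; the article does not state the actual value.
UNAVAILABLE_THRESHOLD = timedelta(minutes=5)


def unavailable_since(probes):
    """Return the start of the current outage, or None if the last probe succeeded.

    probes is a list of (timestamp, is_available) tuples sorted by timestamp.
    """
    start = None
    for timestamp, is_available in probes:
        if is_available:
            start = None          # any success ends the outage
        elif start is None:
            start = timestamp     # first failure of a new outage
    return start


def choose_action(probes, now, threshold=UNAVAILABLE_THRESHOLD):
    """Recommend recycling the app pool once the outage exceeds the threshold."""
    start = unavailable_since(probes)
    if start is not None and now - start > threshold:
        return "recycle_app_pool"
    return "none"


if __name__ == "__main__":
    t0 = datetime(2017, 1, 1, 10, 0)
    probes = [(t0, False), (t0 + timedelta(minutes=3), False)]
    print(choose_action(probes, now=t0 + timedelta(minutes=6)))
```

Measuring from the first failure of the current outage, and resetting on any successful probe, keeps a brief interruption from triggering a recycle.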
Providing a more robust monitoring environment
Using OMS as the monitoring and management solution for DataMall has provided a comprehensive, highly available, and accessible solution for DataMall. Specifically, joining the federated OMS implementation has enabled us to take advantage of the following benefits:
- Easy implementation. In pre-production, we delivered a working and complete OMS environment in one day. During implementation, we received usable results 30 minutes after installation and configuration of our OMS workspace.
- Scalability and resiliency. The OMS Azure platform makes it instantly scalable and resilient to failure and outages. We provided a solution that consistently performs well and that’s available to the DataMall monitoring team from anywhere at any time.
- Clear visibility into app functionality. OMS gives us a very clear view of DataMall functionality. Within our OMS workspace, we can view metrics and graphs for the entire application, or we can drill down into specific performance and log details for individual components from a single workspace. Many of the default OMS alerts and views provide the functionality and visibility that we require without customization.
We’re currently in the middle of our enterprise-wide implementation of OMS. As we continue to bring new business groups onboard and expand the scope of OMS in our organization, we continue to refine our model and implementation methods. We’ll continue to develop more specific compliance and governance standards as we grow OMS at CSE. We’re also continually developing our OMS toolkit for business groups and deploying it to the most mature business groups, thereby allowing them to quickly integrate into OMS. We plan to continue integrating business groups on a regular basis into our federated architecture.
For more information
Microsoft IT Showcase
© 2019 Microsoft Corporation. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.