Transforming modern engineering at Microsoft

Mar 18, 2024

Our Microsoft Digital Employee Experience (MDEE) team is implementing a modern engineering vision that creates a culture, tools, and practices focused on developing high-quality, secure, and feature-rich services to enable digital transformation across the company. Our Modern Engineering initiative has helped us be customer-obsessed, accelerate the delivery of new capabilities, and improve our engineering productivity.

Our journey

Our move to the cloud increased the overall agility of the development process and accelerated value delivery. We moved approximately 600 services, comprising about 1,400 components, to new cloud technologies that provide quicker access to additional infrastructure. This enables spinning up environments and resources on demand, which allows engineers to respond more quickly to evolving business needs.

However, we still needed to address several structural issues, including inconsistency between teams in basic engineering fundamentals like coding standards, automated testing, security scans, compliance, release methodology, gated builds, and releases.

We lacked a centralized common engineering system and related practices. Recognizing that we could not continue to evolve our engineering system in a federated way, we invested in a central team. The team was chartered to develop a common engineering system based on Microsoft Azure DevOps, while driving consistency across the organization in how teams design, code, instrument, test, build, and deploy services. We brought a product engineering mindset to our services by defining a vision for each service area and establishing priorities based on objectives and key results (OKRs), which we define, track, and report using Viva Goals. The OKRs scope what we want to achieve each planning period, and we execute on them via a defined cadence of sprints. The resulting engineering processes have promoted business alignment, developer efficiency, and cross-team mobility.

We incorporated industry-leading development practices for accessibility, security, and compliance. Achieving compliance has been very challenging, forcing us to change from legacy processes and tooling and requiring us to actively respond to our technical debt in these areas. We also lacked a consistent level of telemetry and monitoring that allowed us to obtain key insights about service health, features, customer experience, and usage patterns. We have moved towards a Live Site culture so that we can comprehensively drive sustained improvements in service quality. We improved our telemetry capabilities by adding synthetic monitoring and by ingesting data from a wide variety of sources using services such as Azure Monitor.

Our vision for modern engineering

Microsoft’s digital transformation requires us to deliver high-quality capabilities and solutions at a faster pace and with reliability and security. To achieve this, we’re modernizing how we build, deploy, and manage our services to get new functionality in our users’ hands as rapidly as possible. We’re re-examining every part of our engineering process and instituting modern engineering practices. Satya Nadella, our Chief Executive Officer, summarized this well.

“In order to deliver the experiences our customers need for the mobile-first, cloud-first world, we will modernize our engineering processes to be customer-obsessed, data-driven, speed-oriented and quality focused.”

Our ongoing investments in modern engineering practices and technology build on the foundation that we’ve already established, and they reflect our vision and support our cultural changes. We have three key pillars on which we’re basing these investments along with a commitment to infuse AI into each pillar wherever appropriate.

  • Customer obsession
  • Engineering productivity
  • Rapid delivery

Customer obsession

We want to ensure our engineers keep customers front and center in their thoughts, so we’re capturing feedback to provide our engineers with a deep understanding of the customer experience. Our service monitoring has enabled us to be alerted to problems and fix them before our customers are even aware of them.

We are the first customers of Microsoft’s commercial offerings, which enables us to identify and address the engineering needs of the enterprise operating in a cloud-centric architecture. We constantly work with our product engineering groups across the company, creating a virtuous cycle that makes our products such as Azure DevOps and Azure services even more enterprise ready.

Using customer feedback to drive development

We’re keeping the customer experience at the center of the engineering process via feedback loop mechanisms. Feedback loops serve as a foundation for hypothesis-driven product improvements based on actual sentiment and usage data. We’re making feedback submission as easy as possible with the same tool that the Microsoft Office product suite uses. The Send a Smile feature automatically and consistently gathers feedback across multiple channels and key user touchpoints. We use this tool as a centralized data system for storing, triaging, and analyzing feedback, then aggregating it into actionable insights.

We encourage adoption of feedback loops and experimentation methods, such as feature flighting and ring deployment, to help measure the impact of product changes. With these foundational components in place, we’re now correlating feedback data with related telemetry so that we can better understand product usability issues and the impact of service issues on customers. Our use of controlled rollouts eliminates the need for UAT environments, which accelerates overall delivery.
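A ring deployment can be sketched as a loop that exposes a change to progressively larger audiences and stops as soon as one ring looks unhealthy. The Python below is an illustrative sketch only, not the implementation used at Microsoft; the ring names and the `is_healthy` callback are hypothetical.

```python
from typing import Callable, List

# Hypothetical rings, ordered from smallest to largest audience.
RINGS = ("ring0_team", "ring1_early_adopters", "ring2_everyone")


def rings_reached(is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy ring by ring and stop at the first ring that reports unhealthy.

    `is_healthy(ring)` stands in for whatever feedback and telemetry
    checks a team runs after a ring has received the change.
    """
    reached: List[str] = []
    for ring in RINGS:
        reached.append(ring)       # the change is now live in this ring
        if not is_healthy(ring):   # bad signal: do not widen exposure
            break
    return reached
```

With this shape, a problem found in an early ring never reaches the broader audience, which is the property that makes controlled rollouts a substitute for a separate acceptance environment.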

Telemetry

We unified the telemetry from disparate systems by building on Azure Monitor to help us implement continuous improvements in the quality of our services. This platform integrates with heterogeneous data sources such as Kusto, Azure Cosmos DB, Azure Application Insights, and Log Analytics to collect, process, and publish data from applications, infrastructure, and business processes. This helps us obtain end-to-end views and generate more actionable insights about our service management.

We’re working toward delivering highly connected insights that aggregate the health of component services, customer experience, and business processes. This produces contextual data that not only identifies events but also identifies root causes and recommended next actions. We’re using business process monitoring (BPM) to monitor true availability and performance by tracking successful transactions and customer impact across multiple services and business groups.
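One way to picture "true availability" in business process monitoring is to count a business transaction as successful only when every step across the participating services succeeded. The following Python sketch makes that assumption explicit; `StepEvent` and `business_availability` are hypothetical names, and the event shape is invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class StepEvent:
    transaction_id: str  # correlates steps that belong to one business transaction
    service: str         # service that executed the step
    succeeded: bool


def business_availability(events: Sequence[StepEvent]) -> float:
    """Return the fraction of business transactions whose every step succeeded."""
    outcome: Dict[str, bool] = {}
    for event in events:
        previous = outcome.get(event.transaction_id, True)
        outcome[event.transaction_id] = previous and event.succeeded
    if not outcome:
        raise ValueError("no transactions observed")
    return sum(outcome.values()) / len(outcome)
```

Measured this way, a transaction that fails in one downstream service counts against availability even if every other service involved reported itself healthy.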

To achieve a sustained level of quality, we’re leveraging synthetic monitoring for all critical services, especially those with a relatively low volume of business transactions. Data-enhanced incident tickets provide a view of issues prioritized by business impact, supplemented with potential causes, including those identified through machine learning. These data-enhanced tickets allow teams to focus on the most important tickets and reduce mitigation time.
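A synthetic monitor can be thought of as a scheduled probe that runs a scripted transaction and records whether it succeeded and how long it took, so that a low-traffic service still produces a regular health signal. The Python sketch below is illustrative only; `run_probe`, `ProbeResult`, and the default latency budget are hypothetical and are not part of Azure Monitor.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    name: str
    ok: bool
    latency_ms: float
    error: str = ""


def run_probe(name: str, transaction: Callable[[], None],
              latency_budget_ms: float = 2000.0) -> ProbeResult:
    """Run one scripted transaction and classify the outcome.

    A probe is unhealthy if the transaction raises, or if it completes
    but takes longer than the latency budget.
    """
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:  # any failure marks the probe unhealthy
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        return ProbeResult(name, False, elapsed_ms, repr(exc))
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return ProbeResult(name, elapsed_ms <= latency_budget_ms, elapsed_ms)
```

In practice the `transaction` callable would exercise a real user path, and a scheduler would run the probe at a fixed interval and forward each `ProbeResult` to the telemetry platform.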

We are investing in AI technologies to proactively detect anomalies and automatically remediate them wherever possible. Being able to intelligently respond to incidents reduces support costs and improves service reliability and the overall user experience.

Service health

We have focused on increasing our effectiveness in service and live site incident management. We rolled out a standard incident management process and measured continual improvements against key incident management metrics. We monitor service health metrics and key performance indicators (KPIs) across the organization to understand customer sentiment and ensure services are reliable, compliant, and performing well. We’re using consistent standards, which helps ensure that we can aggregate data at any level in the service hierarchy and compare it across different team groups. We built a more integrated experience on top of Azure Monitor, enriched with contextual data from the unified telemetry platform, and created a set of defined service health measures and an analyzer to track events that can affect service reliability, such as upcoming planned maintenance or compliance related changes. This enables us to detect and resolve issues proactively and quickly. Defined service health measures make it easier to enable service health reporting across various services.

We knew that we had to connect service health to business process health, and to how we prioritize issues, so that engineers could address issues in a way that reduces negative business impact. The experience we’re building enables visualization of end-to-end business process health and the health of the underlying services by analyzing their telemetry.

We also simplified the flow of service health and engineering fundamentals data to the engineer and reduced the number of dashboards and tools they use. An internal tool is now the key repository for all service owners to view service health and other relevant KPIs. The tool’s integrated notification workflow informs service owners when a service reaches a defined threshold, making it more convenient to prioritize any needed remediation into their backlogs.

Embracing a Live Site culture

Increasing scale and agility in our services and processes required us to focus on making customers’ experiences better. We’re establishing a Live Site culture and pursuing excellence via customer-obsessed, data-driven, multidisciplinary teams. These teams embrace potential failure with honest observation, continuous learning, and measurable improvement targets.

We host an organization-wide live site review that includes postmortem reviews of incidents, examines long-term remediation plans, and guides service teams through modern engineering standards that will help them perform robust reviews at a local level. We base these reviews on standard and actionable reports that contain leading indicators for outages or failures based on the analysis of telemetry, synthetic monitoring, and other data.

Engineering productivity

We’re providing our engineers with best-in-class unified standards and practices in a common engineering system, based on the latest Azure tools, such as Azure DevOps. A consistent development environment allows our engineers to transition smoothly between projects and teams. Improved automation, consistency, and centralized engineering systems enable engineers to better focus on the core role of developing. This also reduces onboarding time and allows our engineers to be more flexible across projects.

Integrating developer tooling

We made organizationally mandated code analysis and compliance tools accessible directly within the development environment, thereby supporting our shift-left goal. We built self-service capabilities to manage access, set policies, and make changes to Azure DevOps artifacts such as area paths, work items, and repositories. This has made it easy for engineers to create, update, or retire services, components, and subscriptions, minimizing the time spent managing such resources. We want to extend our shift-left goal to the design of our Azure services by surfacing configuration recommendations early in the deployment cycle, so that we can rightsize our configurations and avoid unnecessary Azure costs.

Enabling code reuse

While at a low volume, we’re still supporting a few applications (fewer than five percent) that use on-premises servers and domain-joined Azure virtual machines. This results in ongoing effort to patch servers, upgrade software, and perform basic infrastructure maintenance tasks. It also impedes our ability to scale apps to accommodate growth. We’ve transformed these applications to Microsoft Azure platform-as-a-service (PaaS) and software-as-a-service (SaaS) based solutions, thereby leveraging the scale and availability of Azure. We enabled this by providing architectural guidance and tools to migrate data, refactoring existing functionality as APIs, and building lightweight applications by reusing APIs that others have already published.

Promoting data and code reuse to build solutions more rapidly and align with a service-oriented architecture requires that developers have the ability to publish and discover APIs easily. We built an API economy by creating a common set of guidelines for developing coherent APIs, and a central catalog and search experience for discovery. We integrated validation against API guidelines and enabled our teams to integrate API publishing into their Azure DevOps pipelines. We created a set of common API health analytics. We also enabled the growth of inner source, in which teams share code directly rather than only through APIs.

Workforce strategies

To address our previous high level of dependency on suppliers, we implemented a new workforce strategy, hiring more full-time employees and bringing more work in-house. This allowed us to transform and modernize how we deliver services. Furthermore, this workforce strategy makes it imperative that there is full-time employee oversight of any supplier deliveries, ensuring they adhere to processes, standards, and regulatory requirements, including security, accessibility, and privacy. We implemented a common bar for hiring across all teams and a common onboarding program to ensure all new hires receive a consistent level of training on all key tools and technologies. As we ramp up our use of AI technologies to further transform our engineering, we are investing in re-skilling and training initiatives to expand the engineering capacity available to work on AI-related projects.

Universal design system

We leveraged Microsoft’s product design system to engineer solutions that look and behave like other Microsoft products. Every product should meet the quality expectations of today’s consumers, meaning that every piece of the user interface (UI) and user experience (UX) should be engineered with accessibility, responsiveness, and familiar behaviors, states, motion, and visual styling. On complex but common components like headers, navigation menus, and data grids this can mean weeks of engineering time multiplied across every MDEE team that requires the same components. This is considerably reduced by adopting a universal design system.

Rapid delivery

To be customer-obsessed, we’re acquiring and protecting customer trust in every aspect of our relationship. We are tracking delivery metrics so that we can shorten lead times from ingestion of customer requirements to the time the solution is in the customer’s hands and then on to measuring customer usability and feedback, while still ensuring service reliability. We’re helping engineers achieve this objective by checking for issues earlier in the pipeline and providing a way to rapidly experiment and mitigate risk. We are building feedback-loop mechanisms to ensure that we can understand the user experience as new functionality gets deployed, and we perform automated rollbacks if customer reaction or service-health signals are less favorable than we anticipated.

Integrating security, accessibility, and fundamentals

Delivering secure, compliant, accessible, dependable, and high-quality services is critical to building trust with our customers. Our engineers are checking for issues earlier in the pipeline, and we’re enabling them to experiment rapidly while limiting potential negative effect on the release process.

We moved to a shift-left process, in which work happens as early in the development process as possible. This enabled us to avoid carrying debt from sprint to sprint. We also implemented gates in the developer workflow that help build in security in a streamlined way, along with auto-onboarding of services to ensure continuous compliance.

We scan code for security issues and log bugs in Azure DevOps that we discover during the scanning process, so developers can fix them directly in the same engineering system they use for other functional bugs rather than having to triage separately from security tools.

We previously assessed accessibility within our applications late in the development process. To move this further upstream, we adopted accessibility insights tooling during development and now expose accessibility-related bugs as part of the pipeline workflow.

We are adopting AI technologies for providing accessibility guidance and conducting accessibility assessments to ensure that our applications conform to accessibility requirements.

Additionally, we enabled engineering teams to utilize the guardrails we’re implementing by integrating policy fundamentals into the pipeline, and we’re implementing continuous integration practices. This ensures that all production releases, including hot fixes, come from builds of the main branch of source code and all have appropriate compliance steps applied consistently. Each pull request must have a successful build to ensure that the main branch is golden and always production ready. Maintaining high-quality code in the main branch minimizes build failures that ultimately slow our time to production.

Deploying safely to customers

We created an environment where teams test ideas and prototypes before building them. The goal is to drive customer outcomes in a way that encourages risk-taking with a fail-fast, fail-safe mentality. Central to increasing the velocity of service updates to customers is a consistent, simple, and streamlined way to implement safe deployments. Progressive exposure and feature flags are key in deploying new capabilities to users via controlled rollouts, so we can quickly start receiving customer feedback.
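Progressive exposure behind a feature flag is commonly implemented by hashing each user into a stable bucket and enabling the feature for buckets below the current rollout percentage. The Python sketch below shows that general technique under stated assumptions; `bucket` and `is_enabled` are hypothetical helpers, not an API of any Microsoft product.

```python
import hashlib


def bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Return True if the feature is on for this user at the given exposure level."""
    return bucket(user_id, feature) < rollout_percent
```

Because the bucket depends only on the feature name and user ID, a user who sees the feature at 10 percent exposure keeps seeing it as the rollout widens, and setting the percentage back to 0 turns the feature off for everyone without a redeployment.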

We implemented checks and balances in the process by leveraging service indicators such as latency and faults within the pipeline, thereby catching regressions and allowing initiation of automated rollbacks when predefined thresholds are exceeded. Safe deployment practices and a streamlined, well-managed pipeline are two of the key elements for achieving a continuous integration, continuous deployment (CI/CD) model.
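A pipeline gate of this kind can be sketched as a function that compares post-deployment latency and fault indicators with predefined thresholds and signals a rollback when either is exceeded. The Python below is an illustrative sketch; the `Thresholds` fields, the choice of 95th-percentile latency, and the function names are assumptions rather than details from our pipeline.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Thresholds:
    max_p95_latency_ms: float
    max_fault_rate: float  # faults divided by requests, 0.0 to 1.0


def p95(samples: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_roll_back(latencies_ms: Sequence[float], faults: int,
                     requests: int, thresholds: Thresholds) -> bool:
    """Return True when latency or fault rate exceeds its threshold."""
    if requests == 0 or not latencies_ms:
        return False  # assumption: no traffic means no signal to act on
    fault_rate = faults / requests
    return (p95(latencies_ms) > thresholds.max_p95_latency_ms
            or fault_rate > thresholds.max_fault_rate)
```

A deployment stage would call `should_roll_back` after a soak period and, on `True`, trigger the rollback step instead of promoting the build.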

Reliability and efficiency

We are enhancing our DevOps engineering pipeline across services by identifying and removing bottlenecks and improving our services’ reliability. We’ll use DevOps Research and Assessment (DORA) metrics to measure our execution and monitor our progress against industry benchmarks.

We’re focusing on deployment frequency, lead time for changes, change failure rate, and mean time to recover in order to gain a comprehensive view of our software or service delivery capabilities. Based on this data, we’ll increase productivity, speed up time-to-market, and enhance user satisfaction.
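The four metrics can be computed from a simple log of deployments. The Python sketch below shows one plausible calculation; the `Deployment` record, the use of the median for lead time, and the mean for recovery time are assumptions made for illustration, not a description of our reporting.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean, median
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime               # when the change was committed
    deployed_at: datetime                # when it reached production
    failed: bool = False                 # did it cause a failure in production?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def dora_metrics(deployments: Sequence[Deployment],
                 period_days: float) -> Dict[str, Optional[float]]:
    """Compute the four DORA metrics over one reporting period."""
    if not deployments:
        raise ValueError("no deployments in the period")
    lead_hours = [(d.deployed_at - d.committed_at).total_seconds() / 3600
                  for d in deployments]
    failures = [d for d in deployments if d.failed]
    recovery_hours = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                      for d in failures if d.restored_at is not None]
    return {
        "deployment_frequency_per_day": len(deployments) / period_days,
        "median_lead_time_hours": median(lead_hours),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_recover_hours": mean(recovery_hours) if recovery_hours else None,
    }
```

Tracking these values per service and per period makes it possible to see whether pipeline changes are actually shortening lead time without raising the change failure rate.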

Key Takeaways

  • We’re making our vision for modern engineering a reality at Microsoft by promoting a Live Site first culture, using data to provide service and business process health signals to inform the rapid iteration on new ideas and capabilities with customers.
  • We’re supporting this by moving to an Azure DevOps model of continuous integration and continuous deployment governed by a standardized engineering pipeline with automatic policy enforcement.
  • The Live Site first culture and the tools and ceremonies that support it have increased visibility into engineering processes, improved the quality and delivery of our services and improved our insight into our customer experiences, all of which ensure we are continually improving and adapting our set of services and processes to support digital transformation now and into the future.
