Learning from engineering Zero Trust networking at Microsoft

Our Microsoft Digital (MSD) team is deploying Zero Trust networking internally at Microsoft as part of our Zero Trust initiative, our comprehensive approach to verification and identity management.

Powered by Microsoft’s internal security team, our Zero Trust model centers on strong identity, least-privilege access, device-health verification, and service-level control and telemetry across the entire IT infrastructure. Our networking leadership and engineering teams are building a network to support the Zero Trust model. It includes fully integrated authentication across all network devices, effective segmentation of our global network, end-to-end encrypted connectivity, and intelligent monitoring.

Figure: The primary functions of Zero Trust networking: authentication, segmentation, connectivity, and monitoring.

Zero Trust networking is a journey; we’ve come a long way, and we’ve learned valuable lessons. In this article, we share these lessons with you to help you plan and deploy Zero Trust networking effectively and efficiently in your environment.

Primary goals

Our engineering goals for Zero Trust followed the general scope of the primary functions of Zero Trust, and they established how we approached the implementation of Zero Trust networking.

Understand devices and environment. Accurate information is critical to effective implementation. We had to understand the state of devices on our network before, during, and after deployment.
Design for inherent security. Zero Trust networking is about a security posture. Our planning and design always included security as an intrinsic priority.
Deploy and manage with automation. We didn’t have the time or resources to reconfigure our entire network manually. Our deployment and management used automation wherever possible, relying heavily on virtual networking and network as code.
Optimize costs. Refactoring the entire network involves a massive amount of infrastructure. We focused on optimizing costs as we implemented, reusing infrastructure when we could.
Maintain a consistent user experience. We wanted the transition to Zero Trust networking to be as noninvasive to the user as possible, while placing all devices in a more secure and controlled environment.

These goals directly influenced our implementation and molded our approach to specific inventory, design, deployment, and monitoring efforts throughout Zero Trust networking.

Implementation considerations

For something as critical as a Zero Trust networking implementation, we needed to use our own network and security experts; we couldn’t outsource that intellectual property. We allocated these resources early and dedicated our best and brightest minds to critical decisions and tasks.

Although high-level goals established at the leadership level drove the entire Zero Trust implementation, we didn’t expect our leadership to make every decision. As objectives and decisions became more defined, we found it best to address issues and make decisions at the feature-team level. This model helped us react quickly to issues that arose and maintain project timelines despite obstacles.

Understanding the environment

Zero Trust networking forced us to comprehensively change our network infrastructure, from the edge to the wide-area network (WAN), and from remote users to in-building wired and wireless experiences. We had to make sure that we fully understood our existing infrastructure on several different levels: what equipment was in the field, how our network was supporting critical business processes, and what changes were required to support Zero Trust networking properly.

“Zero Trust networking requires a reassessment of any organization’s network operations. At Microsoft, we’re making fundamental changes to a network that hosts more than 1 million devices,” says David Lef, a principal IT enterprise architect in MSD.

Creating a framework for network inventory

Zero Trust networking has a vast scope, affecting more than 1 million devices using our network. Building a framework and a strategy for inventorying those devices was critical to making informed decisions for planning and deployment.

End-user devices play an essential role in Zero Trust networking, but so do the infrastructure devices that support them, and the networking switches and routers that manage connectivity. Across all these devices, we established a solid inventory framework to ensure that we could collect relevant data from our devices, including status, device details, capabilities, and requirements for connectivity to corporate resources. We used asset-inventory tools, device-configuration backups, and data pulled from live devices to collect and assemble our network device inventory. This data was critical for our reporting and dashboards so that we could track progress as we deployed new network-configuration standards, segments, and policies.
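To make the idea of assembling one inventory from several sources concrete, here is a minimal Python sketch. It is an illustration only, not the tooling described in this article: the source names (`asset_db`, `config_backup`, `live_poll`), the attribute names, and the rule that later sources override earlier ones are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DeviceRecord:
    device_id: str
    attributes: dict = field(default_factory=dict)
    sources: list = field(default_factory=list)


def merge_inventory(*named_sources):
    """Merge (source_name, {device_id: attrs}) pairs into one inventory.

    Later sources override earlier ones for the same attribute, so pass
    them in order of increasing trust (an assumed precedence rule).
    """
    inventory = {}
    for source_name, records in named_sources:
        for device_id, attrs in records.items():
            record = inventory.setdefault(device_id, DeviceRecord(device_id))
            record.attributes.update(attrs)
            record.sources.append(source_name)
    return inventory


# Hypothetical sample data for the three kinds of source named in the text.
asset_db = {"sw-01": {"model": "switch-x", "site": "building-a"}}
config_backups = {"sw-01": {"supports_8021x": False}}
live_poll = {"sw-01": {"supports_8021x": True}, "ahu-17": {"dhcp": False}}

inventory = merge_inventory(
    ("asset_db", asset_db),
    ("config_backup", config_backups),
    ("live_poll", live_poll),
)
```

Recording which sources contributed to each record makes it easy to spot devices that appear only in live data, which are likely candidates for follow-up.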

Cataloging and assessing devices

After we collected device data, we had to decide what to do with the devices. Identifying devices that were incompatible with Zero Trust networking policies, configurations, protocols, and management techniques was a high-priority task. While many devices contained modern networking capability, we identified a large device population that required special attention.

We had devices on our network that supported only basic networking capabilities. For example, the air-handling units for many of our buildings in the Puget Sound area connected to the network with a TCP/IP address, but they didn’t support Dynamic Host Configuration Protocol (DHCP) or remote configuration. Changing the units’ addresses meant traveling to the buildings, connecting a network cable to the unit, and accessing the management console by using a laptop. It was a simple task for one air handler, but a massive project to address the thousands of air handlers spread across an entire campus.

Simple devices like these air handlers would never support Zero Trust networking’s controls and configuration. Likewise, many devices across our network had similar issues. Replacing these devices wasn’t an option, so we needed to understand how to deal with them in place.
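A compatibility check of this kind can be sketched as a comparison between what a device supports and what the new controls require. The capability names and the required set below are hypothetical, chosen only to mirror the air-handler example; a real assessment would use whatever the inventory actually records.

```python
# Assumed list of capabilities a device needs; not taken from the article.
REQUIRED_CAPABILITIES = {"dhcp", "remote_config", "dot1x"}


def assess(capabilities):
    """Return (status, missing) for a device's set of capabilities."""
    missing = REQUIRED_CAPABILITIES - set(capabilities)
    if not missing:
        return "compatible", missing
    return "needs_special_handling", missing


# A simple device that only has a fixed address, like the air handlers.
air_handler_status, air_handler_missing = assess({"static_ip"})
```

Keeping the missing capabilities alongside the status gives device owners a concrete list of what a replacement would need to support.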

We also provided guidance for these devices and recommendations for eventual replacements. The guidance was essential to moving the network toward compliance. However, the bigger job was turning that guidance into governance to ensure that newly purchased devices and infrastructure supported our Zero Trust networking implementation requirements. As part of ongoing efforts to modernize our in-building experiences, we feed device guidance into our broader Digital Transformation initiative to ensure that new devices in the network ecosystem meet the basic requirements for Zero Trust compatibility.

Onsite connectivity

Zero Trust network connectivity needs to be inherently secure, flexible, and universal. To build effective connectivity across Microsoft, we aligned our security and segmentation strategies with Zero Trust model goals. We ensured that our connectivity methods could support and enforce the controls necessary for Zero Trust networking.

“Zero Trust networking is about security posture. How our devices connect to, authenticate to, and traverse the network is under our control and management from end to end,” says Sean Adams, a lead engineer for wired infrastructure in MSD.

Establishing inherent security

Security is inherent in our Zero Trust networking design, from end to end. We designed our implementation to create secure experiences for devices and users across our entire network. We involved our security experts in design and recommendations from the beginning of the project. Risk and vulnerability assessments helped us determine prioritization for deployment.

Segmenting and connecting devices

Emphasis on network perimeter security and defense-in-depth concepts is no longer useful or relevant in a Zero Trust networking environment. Network segmentation limits lateral movement and is foundational to our Zero Trust strategy. We created our segmentation strategy to support the greatest level of network flexibility with the fewest segments. Segmentation provided absolute control over network access. We implemented our segmentation controls over six different segments: corporate network, internet, guest connectivity, isolated IoT, modern IoT, and infrastructure administration. We connected our users to the closest possible internet egress point to facilitate an internet-first approach and provide the best performance and highest bandwidth. Our network environment was already virtualized, so we were able to implement segmentation with relative ease.

Implementing Zero Trust networking controls will disconnect incompatible devices from the network. In cases where simple IoT devices were present, as with the air handlers mentioned earlier, we moved them to the dedicated IoT segment to isolate those devices from the rest of the general network population but still allow them network connectivity.

Coming from a primarily flat corporate network meant a restructuring of standard connectivity. With segmentation, network ports were no longer linear. For wired devices, we dynamically assigned devices to segments based on port and geographic region. This gave us full control over the connection, right down to the individual port, and massive scalability across all regions.
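The six segments and the port-and-region assignment described above can be modeled as a small lookup table. The sketch below is a simplified illustration: the segment names come from the text, but the region names, the port roles, the policy entries, and the choice of the internet segment as the fallback are assumptions.

```python
from enum import Enum


class Segment(Enum):
    CORPORATE = "corporate network"
    INTERNET = "internet"
    GUEST = "guest connectivity"
    ISOLATED_IOT = "isolated IoT"
    MODERN_IOT = "modern IoT"
    INFRA_ADMIN = "infrastructure administration"


# Hypothetical policy table: (region, port role) -> segment.
PORT_POLICY = {
    ("puget-sound", "building-systems"): Segment.ISOLATED_IOT,
    ("puget-sound", "office"): Segment.CORPORATE,
    ("emea", "lobby"): Segment.GUEST,
}


def segment_for_port(region, port_role, default=Segment.INTERNET):
    """Look up the segment for a wired port; fall back to an assumed default."""
    return PORT_POLICY.get((region, port_role), default)
```

Because the table is data rather than device configuration, adding a region or changing a port's segment is a one-line change that can be reviewed like any other code change.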

While we have maintained our multi-protocol label switching (MPLS) network, we also maintain more than 250 carrier-dependent WAN circuits. Implementing consistent segmentation across these circuits required effective planning and testing. Testing carrier QoS measurement was important. In some instances, implementing segmentation across previously unsegmented circuits caused incorrect QoS calculations that directly affected available bandwidth.

Managing connectivity methods

Wired and wireless connectivity are both built on the same system of network segmentation and routing. The internet is our default network wherever possible. We operate most of our infrastructure in the cloud, and we get devices to an internet edge in as few hops as possible.

We’ve consolidated our wireless networks across our regions. We’re moving toward a single default service set identifier (SSID), combining our corporate and internet wireless networks into one network with a default internet posture and least-required privilege on the network. Through 802.1X and network policy, we can move devices into segments that provide corporate resource access. This makes network posture flexible, monitorable, and fully enforced across all connectivity methods.
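The policy decision behind a single SSID with a default internet posture can be sketched as a function that grants corporate access only when every condition is met. This models the decision only, not the 802.1X exchange itself. The inputs (`identity_verified`, `device_healthy`, `needs_corporate_access`) and the choice to give an unverified device no segment at all are assumptions made for the example.

```python
def wireless_segment(identity_verified, device_healthy, needs_corporate_access):
    """Pick a segment for a wireless client under an assumed policy."""
    # Assumption: without proof of identity, the client gets no segment.
    if not identity_verified:
        return None
    # Least privilege: corporate access only when it is needed and the
    # device passes the (hypothetical) health check.
    if device_healthy and needs_corporate_access:
        return "corporate network"
    # Default posture for everything else.
    return "internet"
```

Writing the rule so that the internet segment is the fall-through case keeps the least-privileged outcome as the default when a condition is missed.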

Consistent segmentation and a consolidated wireless SSID provide several advantages for Zero Trust networking: the internet as the network of choice, wireless as the connectivity method of choice, and required proof of identity across all devices and segments. After a device connects to wireless, it’s easy to transparently move that device across segments and implement other Zero Trust networking controls.

Offsite and remote connectivity

Most of our workforce expects to be able to access the resources required to perform their duties when they’re not actually on-premises in a Microsoft building.

“Particularly in the current circumstance with the COVID-19 pandemic, it’s crucial that the majority of our workforce can perform their job duties without being onsite,” Lef says. “We already had a robust remote access infrastructure for mobile workers and off-hours use, but we’ve augmented our services and scaled them up to support every user at all times.”

The majority of our productivity resources are available through the internet and Microsoft public services. For those that remain on our private networks, two primary services are available today to provide seamless and secure client connectivity to our users:

A virtual private network (VPN) infrastructure accessible by Microsoft employees and vendors with managed corporate devices and identities.
A centralized Windows Virtual Desktop (WVD) service running in Microsoft Azure, which supplies a managed Windows 10 desktop experience to employees and vendors from devices that support the Remote Desktop Protocol (RDP).

Deploying and automating functionality

We’ve deployed Zero Trust networking across our global network. Considerations for individual regions, business needs, and technical requirements all influenced deployment methods and cadence. Throughout the deployment landscape, we’ve integrated automation and configuration validation by default to ensure a consistent, repeatable, and scalable deployment experience.

“Investments in automation tooling and education are significant, but it would have been impossible to deploy Zero Trust networking at Microsoft without effective automation and a network as code approach,” says Sajith Balan, a lead engineer for network routing and transport in MSD.

Planning and deploying

We established the scope of our Zero Trust networking deployment early. This helped us develop design principles and set standards to ensure our designs remained consistent throughout the project. We based deployment priority primarily on business impact, technical requirements, and potential for vulnerability. Deploying to five devices was quicker and less complex than deploying to five hundred devices, so we deployed to smaller environments first.
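One way to turn these prioritization criteria into a deployment order is a simple sort. The sketch below is only an illustration: the 1-to-5 scoring scales, the sample sites, and the tie-breaking order (smallest environment first, then higher vulnerability, then lower business impact) are assumptions, not the weighting used in the deployment described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    device_count: int
    vulnerability: int    # 1 (low) to 5 (high), hypothetical scale
    business_impact: int  # 1 (low) to 5 (high), hypothetical scale


def deployment_order(sites):
    """Smaller sites first; ties go to higher vulnerability, then lower impact."""
    return sorted(
        sites,
        key=lambda s: (s.device_count, -s.vulnerability, s.business_impact),
    )


sites = [
    Site("campus", 500, 3, 5),
    Site("lab", 5, 2, 1),
    Site("branch", 5, 4, 2),
]
ordered_names = [s.name for s in deployment_order(sites)]
```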

We prioritized our infrastructure deployment over the user experience deployment to minimize disruption and ensure quick learning with the least impact. Decoupling these two elements allowed us to implement and test infrastructure early and address bugs and issues without the pressure of the user experience being affected by this process. When infrastructure was ready, we deployed the software components that brought Zero Trust networking to the user.

We optimized costs wherever it was relevant and effective. Zero Trust networking affected every device at Microsoft, and we didn’t have the scope or budget to replace every device.

Automating with network as code

Network as code, the concept that the definitive configuration for a network is defined by code in a centralized repository and not by the current state of a device, was critical to the overall implementation of Zero Trust networking and our ability to deploy at scale. Our network environment already had well-defined engineering standards, so we implemented network as code with relative ease. We used network as code to standardize our network’s configuration management and reduce configuration drift by using software development processes. One of network as code’s outcomes is the ability to reconstruct the network repeatedly from nothing more than a source code repository and bare-metal resources.

Network as code provided a source of truth across our environment. By modeling device configuration into structured data, we used network as code to store and catalog network device configuration data centrally and decouple it from the physical device. This supported more efficient management of configuration and created new scenarios for disaster recovery and rapid deployment. Without network as code, deployment at scale would have been impossible to accomplish at Microsoft. We also use network as code to validate the health of deployed services.
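The core of the network-as-code idea can be sketched in a few lines of Python: the intended state lives as structured data, device configuration is rendered from it, and drift is whatever differs between the intended state and what a device reports. The data layout, the function names, and the configuration syntax below are invented for illustration and do not correspond to any real vendor format or to the tooling used in this deployment.

```python
def render_config(device):
    """Render a made-up, vendor-neutral config from structured data."""
    lines = [f"hostname {device['hostname']}"]
    for port in sorted(device["ports"]):
        lines.append(f"interface {port}")
        lines.append(f" segment {device['ports'][port]}")
    return "\n".join(lines)


def find_drift(intended_ports, running_ports):
    """Return {port: (intended, running)} for every port that differs."""
    drift = {}
    for port in sorted(set(intended_ports) | set(running_ports)):
        want = intended_ports.get(port)
        have = running_ports.get(port)
        if want != have:
            drift[port] = (want, have)
    return drift


# Intended state, as it might be stored in a repository (hypothetical).
intended = {
    "hostname": "sw-01",
    "ports": {"eth2": "isolated-iot", "eth1": "corporate"},
}
# State reported by the live device (hypothetical).
running_ports = {"eth1": "corporate", "eth2": "corporate", "eth3": "guest"}

config_text = render_config(intended)
drift = find_drift(intended["ports"], running_ports)
```

In this example, `drift` flags `eth2` as assigned to the wrong segment and `eth3` as configured on the device but absent from the repository. Comparing structured data rather than raw text keeps each difference tied to a specific port.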

Deploying iteratively

Starting small and gradually deploying to a broader scope was our standard approach with Zero Trust networking. Using this ring-based approach helped us test deployment models on small groups before releasing functionality more broadly. Flighting with a small cohort helped us grow to larger deployments without fear of time-consuming rollbacks or sweeping changes to the configuration.

Incremental deployment made it easier to deploy to an actively used environment. Throughout the Zero Trust networking implementation, we worked with live networks that hosted users and business processes happening in real time. In situations where we couldn’t gather the appropriate data to guarantee success, the ability to deploy on a small scale helped us test, assess, and quickly refine our deployment approach. For example, when we deployed our new internet-first wireless network to shift the default client posture off the corporate intranet, we started with an individual floor in a building that contained active users. This initial deployment supplied quick feedback with little risk. From there, to minimize disruption, we gradually expanded to entire buildings and then multiple buildings per day.

Flighting and iteration also helped us identify solutions that wouldn’t work and find alternatives early in the deployment before too many users were affected. If we found that a solution only worked for 10 percent of the devices or users in a location or region, we knew that we had to reassess the solution and refactor to involve the broadest device population while still maintaining our standards.
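The ring-based progression can be expressed as a small gate function that decides whether to expand, reassess, or stop. In the sketch below, the ring names follow the wireless example in this section, while the 90 percent success threshold and the `hold` state for rings with no data yet are assumptions made for illustration.

```python
RINGS = ["single floor", "entire building", "multiple buildings per day"]


def next_step(ring_index, successes, attempts, min_success_rate=0.9):
    """Decide what to do after flighting a change in RINGS[ring_index]."""
    if attempts == 0:
        return "hold"  # no data yet, so do not expand
    if successes / attempts < min_success_rate:
        return "reassess"  # the solution does not cover enough devices
    if ring_index + 1 < len(RINGS):
        return f"expand to {RINGS[ring_index + 1]}"
    return "complete"
```

A solution that works for only 10 percent of devices, as in the example above, would return `reassess` at the first ring, before a broader population is affected.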

Ensuring consistent user experiences

Our users are the consumers of our Zero Trust networking environment. For that reason, it’s critical that we continually examine their needs and how Zero Trust networking affects their experience. How users interact with the network affects their acceptance level for Zero Trust networking. Immature deployments and mismatched pilot groups create dissatisfaction that can lead to low adoption and acceptance rates. We focused on effectively monitoring and incorporating user feedback throughout implementation.

“Zero Trust networking shouldn’t impinge on an employee’s ability to use the network. We want the transition to Zero Trust to be as friction free as possible for our employees while ensuring secure and monitored infrastructure,” says Mark Bryan, a lead engineer for wireless infrastructure in MSD.

Educating users

Educating users and device owners on Zero Trust networking helped increase adoption and user satisfaction. Deployment of Zero Trust networking immediately identified incompatible devices. If a device didn’t support the controls in place, the device was disconnected from the network. Informing end users of these behaviors was critical to smooth deployment. For example, if we planned to enforce authentication on a network where it hadn’t been enforced before, we could initially enable the authentication method in a silent/soft mode and identify which devices weren’t successfully authenticating. Owners of those devices were notified so that they could adapt their configurations to meet requirements or make plans to move to a more suitable network type.
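The soft-mode step can be sketched as a report built from authentication events collected while enforcement is still off. The event format, the device names, and the owner lookup below are hypothetical. The sketch treats a device as failing only if it never authenticated successfully during the observation window, which is also an assumption.

```python
from collections import defaultdict


def owners_to_notify(auth_events, device_owners):
    """Group devices that never authenticated by their owner."""
    seen = {event["device"] for event in auth_events}
    succeeded = {event["device"] for event in auth_events if event["authenticated"]}
    notify = defaultdict(list)
    for device in sorted(seen - succeeded):
        notify[device_owners.get(device, "unknown-owner")].append(device)
    return dict(notify)


# Hypothetical events logged while authentication runs in soft mode.
events = [
    {"device": "printer-3", "authenticated": False},
    {"device": "laptop-9", "authenticated": True},
    {"device": "ahu-17", "authenticated": False},
    {"device": "laptop-9", "authenticated": False},
]
owners = {"printer-3": "facilities", "ahu-17": "facilities"}

to_notify = owners_to_notify(events, owners)
```

Here `laptop-9` is not reported because it authenticated at least once, while both facilities devices are grouped into a single notification.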

Working with users and deployment regions

Through our data analysis and pilot testing, we encountered a diverse set of business and technical needs across our global network environment. Each region or location had specific technical capabilities and considerations. Device availability, telecom capabilities, data-residency regulations, and many other regional considerations contributed to how we approached, designed, and implemented Zero Trust networking for each location.

Monitoring the user experience

We had to understand the use cases and the personas on our network. Deploying Zero Trust networking wasn’t typically disruptive. However, many systems that developers and software engineers used were designed for the corporate network and didn’t scale well to a Zero Trust networking environment. In these areas, we considered test groups and early adoption carefully. These users were potential Zero Trust networking advocates, especially in situations where implementation could have been disruptive. We monitored user experience with data insights and Microsoft Power BI to gather actionable data and modify our implementation approach accordingly.

Zero Trust networking provides a model that effectively adapts to the complexity of and constant change within the corporate environment. It supports the mobile workforce and protects people, devices, apps, and data regardless of location. In sharing the lessons that we’ve learned so far, we hope to help other enterprises to adopt Zero Trust networking effectively and efficiently. As we continue to deploy the Zero Trust model across the Microsoft enterprise, we’re learning from our experience and adapting our approach to achieve our goals.

Stay tuned for more articles and case studies that provide additional details about our Zero Trust network implementation.

Inside Track Blog