At Microsoft Core Services Engineering and Operations (CSEO), we’re building our systems in the cloud to be agile, resilient, cost effective, and scalable—so we’ll be proactive and innovative as we transform our IT and business operations. Microsoft Azure resides at the core of our architecture, and we’re using the platform to automate our processes, unify our tools, and improve our engineering productivity. We’re working toward a process driven by user experience, which changes the way we provision and manage our IT infrastructure.
A modern cloud-centric architecture is foundational to our digital transformation, and we’re building integrated, reliable systems, instrumented for telemetry, to gather data and enable experimentation. To learn more about how we’re transforming, refer to Inside the transformation of IT and operations at Microsoft.
Building a foundation for digital transformation
Azure is now the default platform that our IT infrastructure is built upon. Several years ago, CSEO created a vision for moving from on-premises datacenters to Azure as the “first and best customer” of our cloud services. We examined our infrastructure to understand usage practices and how we could best support application teams via Azure subscriptions and connectivity options. We reviewed on-premises datacenter assets and developed schedules to migrate or retire the assets and to close multiple datacenters. Our leadership established plans at the strategic level to move applications, which trickled down to individual cloud migration and adoption plans for each part of the organization. Our cloud-centric approach thus created a functional and flexible platform for our services and processes. We’re using Azure to enable a self-service model for users of the platform—providing robust telemetry and reporting capabilities via Azure Monitor and Application Insights and using Azure ExpressRoute to facilitate enterprise-level connectivity to the cloud from our facilities and networks.
Establishing a vision for cloud-centric architecture
We’ve moved more than 93 percent of our on-premises infrastructure to the cloud, and we’re assessing our strategic initiatives around our cloud efforts. We’ve fulfilled our goal of moving out of the datacenter. However, many services moved from virtual machines (VMs) running in a datacenter to infrastructure as a service (IaaS) VMs running in Azure with very little change to those services. We thus recognize the opportunity to further optimize our presence in the cloud by creating more-refined and targeted strategic initiatives both for the company itself and as examples for external customers.
We need to modernize our application and service portfolio to take advantage of capabilities that were previously unattainable because of datacenter and support constraints. We need to examine how we manage our data and work toward a strategy that separates data from compute resources. We need to examine open-source big data platforms, event processing, and other modern services that we can more effectively scale. Policies should enforce required controls for all configurations to improve safety regardless of the network involved. We also need to continue embracing modern engineering practices and pipelines and DevOps methods of managing services. We’re capturing the transformation of cloud-centric architecture in the following investment areas:
- Transitioning from on-premises to cloud offerings to enable dynamic elastic compute, georedundancy, a unified data strategy (that uses Azure Data Lake), and flexible software-defined infrastructures.
- Moving to cloud-centered IT operations, including automation for provisioning, patching, monitoring, and backing up our cloud and on-premises environments through Azure-based offerings. In this way, software engineers can manage their DevOps environments with a minimal number of manual operations.
- Facilitating continued company growth and the improvement of our platform services while staying flat on the running cost of our services.
- Developing deeper and richer insights into our service reliability via the standardization of monitoring solutions through Application Insights, incident-management tooling, and automatic alerting. At the same time, we’re increasingly modeling our critical business processes and helping ensure end-to-end integrity through the monitoring and alerting of complex processes spanning multiple systems.
- Supplying a powerful feedback loop to our product-group partners (such as those for Azure, Microsoft Dynamics 365, and Windows) to showcase Microsoft running on Microsoft. This is resulting in an improved enterprise-customer experience, including running one of the largest SAP instances entirely on Azure and helping ensure that Azure is SAP-ready for our customers.
Designing for the future
As our services move to these modern designs, our architectures need to evolve. We need to build our solutions to adopt the advantages of Azure and to adapt as those advantages change and grow. We need to clearly understand that Zero Trust efforts will change how users access our solutions. Our network postures and zonal controls need to adapt as well. “Internet first” should be the goal of all solutions. We need to implement the governance of all corporate resources—regardless of their network environments—and recognize that user identity and data are the critical resources to keep under the proper controls. Through this continued transition to a more cloud-centric architecture, we need to remain cost effective and create clear guidance on how to transform from VMs and on-premises solutions to modern solutions.
Enabling the cloud-centric architecture
Deploying workloads to the cloud introduces the need to develop and maintain trust in the cloud to the same degree that we have in our existing datacenters. In this model, we can apply isolation policies to help achieve the required levels of security and trust. To use the cloud as our trusted platform for our new cloud-centric architecture, we need to invest in plans for multiple areas:
- Administering the Azure fabric
- Using Infrastructure as Code (IaC) and DevOps
- Using identity management and governance
- Using modern apps and data solutions
- Using modern networks
The following sections detail the specific investments that combine to fulfill these requirements.
Administering the Azure fabric
The Azure fabric is a collection of programming interfaces that allows application engineering to interact with the underlying services and infrastructure. On one end of the spectrum is an application engineer connecting to the fabric and running a script to provision a VM. On the other end of the spectrum is automation connecting to the fabric, pushing data into a service, merging this data with external data sources, performing an analysis, and then publishing this data to a user interface for consumption.
The role of the IT infrastructure provider will be to supply security-enhanced, flexible, and reliable hosting in our corporate fabric for applications and data (whether in our private or our public cloud). From the perspective of an application engineering team, provisioning infrastructure will appear a lot like updating templates and running scripts that land code and data in VMs; in containers; or in purpose-built, platform as a service (PaaS) solutions, like Azure SQL Database. The role of the core hosting provider will be to present a flexible, reliable, and safer fabric to these teams for interaction with their templates and scripts.
The role of the infrastructure team will be to enable frictionless and security-enhanced access to a fabric of APIs. A subscription will enable access to the scope of computing capacity that the subscriber can use. Subscriptions will connect to on-premises environments for hybrid scenarios, to added subscriptions for scaling out, and to third-party services for specialized processing. Our infrastructure team will need to do all of this in a security-enhanced manner, use standardized methods and building blocks, and maintain fiscal effectiveness. The team will need to conduct these interactions in a way that Microsoft deems appropriate.
The role of the fabric administrator will be to provision this fabric through subscriptions and to help ensure that each subscription has the required capacity and connectivity to meet the demands of the application in a security-enhanced and fiscally responsible manner. The fabric administrator will:
- Build subscriptions and help ensure that enough capacity exists to accommodate the demands of each application.
- Connect subscriptions to our corporate network zones where appropriate and help ensure that the required connectivity exists for the application to adequately perform and to reliably and securely communicate with integration points.
- Help ensure that our corporate standards for configuration and security are applied to the subscription.
- Work to continuously grow and expand our fabric. That is, CSEO will continuously release new capabilities and expand our cloud presence.
- Continuously monitor and troubleshoot fabric-related issues.
In many ways, our IT organization will function like a managed service provider or Azure service broker. The Azure product group recognizes that a necessary gap exists between corporate application engineering and Azure services. We refer to this addressable gap as the corporate context. The corporate context consists of the specific company’s policies, standards, identity scenarios, and network connectivity scenarios. It’s the role of the service broker or fabric administrator to apply the corporate context to the fabric to enable loosely moderated consumption by application engineering teams.
Using IaC and DevOps
Within the IaC and DevOps area, we’re building a more agile and flexible process for developing and deploying critical pieces of the cloud-centric architecture. Self-service and automation are paramount, driving the goal of empowering our engineers to quickly create and configure their solutions in an unencumbered manner.
Infrastructure as Code (IaC) is the process of managing and provisioning cloud infrastructure and its configuration through definition files that machines can process—rather than through the configuration of physical hardware or the use of interactive configuration tools. IaC is about using scripts and templates to build or configure a connected landing place for applications and business data.
IaC doesn’t involve building user-interactive portals or creating tickets for others to run automation. IaC instead involves supplying standardized, robust APIs to application engineering teams to integrate into their deployment automation. Beyond supplying APIs, the infrastructure team supplies standard, curated configuration templates and software images for application engineering teams to consume.
Within the Azure Resource Manager framework, Azure provides built-in support for IaC, which allows engineering teams to rapidly provision the underlying hosting platform for their applications.
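As a minimal sketch of what "definition files that machines can process" can look like, the following Python builds an Azure Resource Manager template as plain data and serializes it to JSON. The function names, the storage account name, the default location, the API version, and the tag are illustrative assumptions, not CSEO standards.

```python
import json

ARM_SCHEMA = "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#"


def storage_account_template(name: str, location: str = "westus2") -> dict:
    """Return an ARM template (as a dict) that declares one storage account.

    The name, location, SKU, API version, and tags are hypothetical example values.
    """
    return {
        "$schema": ARM_SCHEMA,
        "contentVersion": "1.0.0.0",
        "resources": [
            {
                "type": "Microsoft.Storage/storageAccounts",
                "apiVersion": "2022-09-01",
                "name": name,
                "location": location,
                "sku": {"name": "Standard_LRS"},
                "kind": "StorageV2",
                "properties": {"supportsHttpsTrafficOnly": True},
                "tags": {"managedBy": "iac-example"},
            }
        ],
    }


def render(template: dict) -> str:
    """Serialize with sorted keys so the file diffs cleanly in source control."""
    return json.dumps(template, indent=2, sort_keys=True)
```

A team could commit the rendered file and deploy it from its own pipeline, for example with `az deployment group create --template-file`, rather than filing a ticket or using a portal.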
We need to continue the push from fully centralized operations to a DevOps model. Specific efforts from infrastructure teams in partnership with business units need to continue and improve in the following ways:
- Continuing to decentralize operations that involve governance and auditing, while the centralized team remains responsible for the security and compliance posture
- Investing in management groups and Azure Policy to supply guardrails for DevOps environments
- Decentralizing services, including those for patching, configuring, monitoring, backup, and managing alerts and events
- Gaining clarity on who's responsible for responding to incidents by using the proper tools and processes for DevOps
- Improving automation by investing in tools such as Chef, creating a runbook library strategy, and specifying how teams should use Azure Automation
- Ensuring that DevOps processes can properly deal with accessibility and privacy
Using identity management and governance
Identity management and governance supply the guardrails that help protect our cloud-centric architecture. Identity is the new perimeter in modern networking and architecture, so it deserves high-priority consideration within the architecture to help ensure the security of our environment. Governance is also critical in the modern architecture, helping to guide and safeguard a largely self-service environment.
We need to simplify provisioning, entitlements, and access management. We also need to streamline account provisioning and management, helping ensure that all access is auditable and linked to an approved business justification. Finally, we need to ensure that all credentials will expire or be revoked when no longer required while maintaining the principle of least privilege for administrators and users. Our two primary efforts in identity management are:
- Eliminating passwords through Azure Multi-Factor Authentication. We want to remove the use of passwords in favor of strong authentication mechanisms.
- Helping protect administrators. We want to help ensure that all users with elevated privileges use those privileges in compliance with our access control standard.
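The access requirements above (auditable access, an approved business justification, and credentials that expire) can be sketched as a simple audit check. The `AccessGrant` record and its fields are hypothetical and used only for illustration; they don't correspond to a specific Azure API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class AccessGrant:
    """Hypothetical record of one access assignment."""
    principal: str
    role: str
    justification: Optional[str]    # approved business justification, if any
    expires_at: Optional[datetime]  # None means the grant never expires


def audit_grant(grant: AccessGrant, now: datetime) -> List[str]:
    """Return the findings for a single grant (an empty list means compliant)."""
    findings = []
    if not grant.justification:
        findings.append("missing business justification")
    if grant.expires_at is None:
        findings.append("no expiry set")
    elif grant.expires_at <= now:
        findings.append("expired; should be revoked")
    return findings
```

Running a check like this on a schedule, and revoking or escalating any grant that returns findings, is one way to keep standing access from accumulating.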
Cloud-focused architectures still require proper guardrails and governance for two reasons: to help protect corporate data and assets from internal and external threats and to help ensure that the data and assets adhere to corporate and compliance standards. Much of our current governance is manual in nature, and some is our own intellectual property created to fill product gaps in Azure. As Azure continues to add features, we need to embrace those native features that will help ensure we’re properly governing the cloud:
- Azure Policy will take a forefront position in supplying the right guardrails to help ensure that application teams can operate day to day within subscriptions that will keep the data in those subscriptions safer and more secure.
- With Azure Blueprints, we want to create appropriate sets of controls for bundling policies, networking, role-based access control, runbooks, and templates into full workspace packages that complement the DevOps environments we’re pushing teams to use for their day-to-day operations.
- We need to invest efforts into the lifecycle workflow around governing subscriptions, exception management, and scaling to the enterprise via management groups.
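To illustrate how a policy guardrail works in principle, the sketch below evaluates a rule written in the `if`/`then` shape that Azure Policy definitions use. It's a toy evaluator covering a small subset of conditions with simplified semantics (for instance, comparisons here are case-sensitive), not the Azure Policy engine. The rule, the approved regions, and the `"allow"` return value are assumptions for illustration.

```python
from typing import Any, Dict


def matches(condition: Dict[str, Any], resource: Dict[str, Any]) -> bool:
    """Evaluate a small subset of policy-style conditions against a resource."""
    if "allOf" in condition:
        return all(matches(c, resource) for c in condition["allOf"])
    if "anyOf" in condition:
        return any(matches(c, resource) for c in condition["anyOf"])
    if "not" in condition:
        return not matches(condition["not"], resource)
    value = resource.get(condition["field"])
    if "equals" in condition:
        return value == condition["equals"]
    if "in" in condition:
        return value in condition["in"]
    if "notIn" in condition:
        return value not in condition["notIn"]
    raise ValueError(f"unsupported condition: {condition}")


def evaluate(rule: Dict[str, Any], resource: Dict[str, Any]) -> str:
    """Return the rule's effect if the resource matches, otherwise 'allow'."""
    return rule["then"]["effect"] if matches(rule["if"], resource) else "allow"


# Hypothetical guardrail: deny storage accounts outside two approved regions.
ALLOWED_LOCATIONS_RULE = {
    "if": {
        "allOf": [
            {"field": "type", "equals": "Microsoft.Storage/storageAccounts"},
            {"field": "location", "notIn": ["westus2", "eastus"]},
        ]
    },
    "then": {"effect": "deny"},
}
```

Because the rule is data rather than a manual review step, the same definition can be assigned at a management group and inherited by every subscription beneath it.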
Using modern apps and data solutions
The way we treat our apps and data has changed in cloud-centric architecture. With more user-design models becoming available, engineers no longer function as the only developers in our organization. Users are taking advantage of platforms and tools that offer no-code or low-code development methods to create business solutions. Through all of this and within our more traditionally developed apps, we need to drive consistent development and data usage and protection methods.
As more teams use containers and Azure Service Fabric, the infrastructure and security teams need to invest in creating the right guardrails for these new paradigms. This means that even more than previously, we need to track the Azure subscription, make the correct policies and templates available, and then apply those policies and templates so that the more-transitory resources belonging to modern solutions immediately use the correct controls. Our priorities are as follows:
- We need to supply design patterns and templates to help ensure that teams build resources according to a standard. Automatic configuration should occur during deployment, and desired state should be automatically enforced on a continual basis.
- Developers need to create containers by using default images containing the correct settings and policies.
- For microservices, teams need to build Service Fabric clusters that use standardized settings and policies right from creation time.
- We need to assess the connectivity that modern apps require. We want users to primarily access these modern systems only via the internet, but private virtual networks might also have some use in data-focused, segmented environments. The hybrid model will benefit some teams and certain types of data.
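The idea of configuring resources at deployment and then continually enforcing desired state can be sketched as a drift check followed by a correction. The baseline settings and their names below are hypothetical; real standards would come from the curated templates and images described above.

```python
from typing import Any, Dict

# Hypothetical baseline; setting names and values are for illustration only.
DESIRED_STATE = {
    "base_image": "corp/base:approved",
    "run_as_root": False,
    "log_forwarding": True,
}


def find_drift(actual: Dict[str, Any], desired: Dict[str, Any]) -> Dict[str, Any]:
    """Return {setting: desired_value} for every setting that differs or is missing."""
    return {
        key: want
        for key, want in desired.items()
        if key not in actual or actual[key] != want
    }


def enforce(actual: Dict[str, Any], desired: Dict[str, Any]) -> Dict[str, Any]:
    """Return a corrected copy of `actual`; settings outside the baseline are kept."""
    corrected = dict(actual)
    corrected.update(find_drift(actual, desired))
    return corrected
```

Running `find_drift` on a schedule reports deviations, while `enforce` models the automatic correction step. Short-lived resources benefit most, because there's little time for a manual review to catch a misconfiguration.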
Modern data solutions
Managing our most-critical data assets will continue to be a top priority going forward. With more modern architectures, an increased ability to separate the compute and storage resources will exist, so managing the storage data will become a critical priority:
- We need to continue examining solutions based on VMs and Microsoft SQL Server and transition them to more modern architectures.
- We need to accelerate data deduplication by moving commonly used data sources into Azure Data Lake Storage.
- Our DevOps teams need to manage cloud storage in a security-enhanced and efficient manner. This includes having centralized standards for using encryption at rest whenever possible and helping ensure that all solutions use the proper business continuity and disaster recovery options.
- We need to ensure that we classify, label, and protect all Microsoft data.
Using modern networks
Our investment in the modern networks area involves all aspects of our networking environment. That is, we’re investing in modern deployment and configuration practices to create and support a networking environment that supplies a solid foundation upon which the cloud-centric architecture rests. This includes adopting an internet first network model, increasing support for Software-Defined Networking, making more efficient use of ExpressRoute connections, creating more intentional network segmentation, migrating to Internet Protocol version 6 (IPv6), and increasing Network Function Virtualization (NFV).
All clients have been moving to an internet first model over time: first, by enrolling mobile devices with Microsoft Intune and, eventually, by connecting branch offices and some corporate offices primarily through the internet instead of through traditional on-premises network connectivity. Requiring clients to traverse a virtual private network (VPN) or similar solution to reach corporate applications won’t be the best model going forward. To become an internet first organization, we’re focusing on the following:
- We need to make line-of-business applications accessible from the internet by either providing a hybrid connection to the application’s presentation layer, making the presentation layer entirely internet facing, or making the full application internet based versus traditionally on-premises network based.
- Infrastructure teams need to help secure the solutions in a standardized manner and always use verified intended access.
- Infrastructure teams need a way to correctly and efficiently handle edge traffic. They need to know how to accurately audit, respond to, and report that traffic.
- We need to find ways to supply hybrid access for data anchors that stay in a more-restricted zone, which won’t be the on-premises network.
- We need to invest in resiliency and security for these internet-facing solutions to help prevent unwanted impacts.
With most clients moving to an internet first model over the next few years, CSEO needs to examine where line-of-business applications place services going forward. With most clients moving outside the on-premises network boundary, it makes the most sense for the applications they use day to day to have a presence on the internet versus continuing to require a special network connection back to an on-premises network-based solution. To improve services placement, we’re examining the following:
- Making user-facing services and the presentation layer externally reachable.
- Placing data or the backend in an administrator channel or private zone if appropriate to help safeguard access.
- Resolving impact to clients and applications, such as when they send data to an on-premises printer.
Software-Defined Networking and ExpressRoute
Within CSEO, the Zero Trust and internet first efforts will encourage teams to examine their on-premises, network-bound solutions by using ExpressRoute. Additionally, the ExpressRoute service will continue to grow, because a plethora of product teams are just starting to move their lab and build solutions to Azure. Over time, we want teams to examine hosting their solutions outside the traditional corporate network more and more—that is, in a fully internet-based posture, in an appropriate Software-Defined Networking environment, and with defense-in-depth security controls applied.
To further embrace Software-Defined Networking and ExpressRoute, we’re focusing as follows:
- Teams should modernize their solutions as their first choice versus migrating them directly to Azure IaaS services. This needs to become a strategic goal across the organization.
- For CSEO applications, we need to prioritize deployment governance over ExpressRoute usage. This will encourage the transition to modern applications that assume an internet posture versus continued dependence on the on-premises network.
- Even with a pure internet first design, these modern solutions should use Software-Defined Networking and the security features of Azure that supply controlled access to solutions.
- We’ll simplify our current ExpressRoute architecture, which uses significant physical resources. We’ll redesign the architecture to use more of the Software-Defined Networking components of Azure. The goals are to reduce costs, increase the deployment speed, make the service easier to consume, and make the service even less reliant on on-premises hardware.
- We need to engineer differentiated zonal-stratification offerings for production and mission-critical solutions versus those for research and development.
- We need to revisit the hybrid design options and revise them based on both new features and proper governance within all zones.
For CSEO, network segmentation is one of the largest components of the cloud-centric architecture. The corporate extranet network and the security zones that define it have existed for decades. In the modern cloud-environment era, we need to revise network segmentation by:
- Dismantling the on-premises network and its security. This should result in the creation of multiple new zones that have improved controls and management.
- Noting that the traditional, on-premises-focused perimeter network is deprecated and that we’ve created a modern perimeter network having the proper controls and less blanket access both vertically and horizontally.
- Using Software-Defined Networking to enable better horizontal network controls for larger zones and for individual zones created for specific solutions.
- Creating a new and different space for virtual local area networks that includes internal zones and an administrator network specifically segmented to manage devices and the Internet of Things.
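One way to picture "less blanket access both vertically and horizontally" is a default-deny check on traffic between zones, where even traffic within a single zone must be explicitly allowed. The zone names, address ranges, and allowed flows below are hypothetical and chosen only to make the sketch runnable.

```python
import ipaddress
from typing import Optional

# Hypothetical zones and private address ranges.
ZONES = {
    "admin": ipaddress.ip_network("10.10.0.0/24"),
    "app": ipaddress.ip_network("10.20.0.0/16"),
    "iot": ipaddress.ip_network("10.30.0.0/16"),
}

# Explicitly allowed (source zone, destination zone) pairs; everything else is denied.
ALLOWED_FLOWS = {("admin", "app"), ("admin", "iot")}


def zone_of(address: str) -> Optional[str]:
    """Return the name of the zone that contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, network in ZONES.items():
        if ip in network:
            return name
    return None


def is_allowed(source: str, destination: str) -> bool:
    """Default deny: both ends must be in known zones and the pair must be listed."""
    src, dst = zone_of(source), zone_of(destination)
    if src is None or dst is None:
        return False
    return (src, dst) in ALLOWED_FLOWS
```

In this sketch, two hosts in the same "app" zone can't reach each other unless `("app", "app")` is added to the allowed flows, which models the tighter horizontal controls.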
Migration to IPv6
Internet Protocol version 4 (IPv4) address ranges continue to be challenging to manage because of the dwindling number of available addresses versus the growth of the environment. We need to accelerate IPv6 deployment to help ensure continued network capacity. IPv6 removes complications from network address translation and simplifies acquisitions. We’re addressing the migration to IPv6 as follows:
- Our network team has deployed IPv6 to multiple environments and plans to have some areas use only IPv6 (where possible) to remove the dependency on the limited number of IPv4 addresses.
- Application teams will need to bind their applications to IPv6 in addition to IPv4. Older applications that understand only IPv4 will need to modernize. Security optimization also needs to occur.
- The research and engineering teams will need to ensure that all policies are correct. They’ll pay special attention to the boundaries between IPv4 and IPv6.
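For application teams, binding to IPv6 in addition to IPv4 can be as small a change as opening a dual-stack listener. The sketch below uses Python's standard `socket` and `ipaddress` modules. The helper names are hypothetical, and the IPv4-only fallback is an assumption about how a team might handle platforms without dual-stack support.

```python
import ipaddress
import socket
from typing import Union

IPAddress = Union[ipaddress.IPv4Address, ipaddress.IPv6Address]


def open_dual_stack_listener(port: int) -> socket.socket:
    """Listen on IPv6 and, where the platform allows it, IPv4 on one socket."""
    if socket.has_dualstack_ipv6():
        return socket.create_server(
            ("::", port), family=socket.AF_INET6, dualstack_ipv6=True
        )
    # Fallback for platforms without dual-stack support: IPv4 only.
    return socket.create_server(("0.0.0.0", port), family=socket.AF_INET)


def normalize_peer(address: str) -> IPAddress:
    """Map IPv4-mapped IPv6 peers (::ffff:a.b.c.d) back to plain IPv4.

    On a dual-stack socket, IPv4 clients appear with mapped addresses, so
    allow lists and logs should compare the normalized form.
    """
    parsed = ipaddress.ip_address(address)
    if isinstance(parsed, ipaddress.IPv6Address) and parsed.ipv4_mapped is not None:
        return parsed.ipv4_mapped
    return parsed
```

The normalization step matters at the boundary between IPv4 and IPv6: a policy written against `192.0.2.10` would otherwise miss the same client arriving as `::ffff:192.0.2.10`.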
Going forward, CSEO needs to heavily invest in Software-Defined Networking, including Network Function Virtualization (NFV). NFV has substantially improved and will continue to do so. By moving older network zones to the internet, we can increase the internet first mentality while still supplying adequate controls. Making applications self-contained within a specialized zone can help lock down both vertical and horizontal access, which makes solutions more secure. The NFV-related actions include:
- Creating best practices for zones and disaster recovery.
- Helping ensure the proper governance of Software-Defined Networking devices and zones.
- Defining methods for monitoring Software-Defined Networking environments, including those for security, telemetry, and outages.
- Defining best practices for teams managing multiple perimeter networks versus the single, flat model used on-premises.
Azure is adding the ability to use service tunneling to access resources via VPNs, including ExpressRoute virtual networks. With this new model, teams might be able to use PaaS resources within a more-limited network and security boundary. To improve service tunneling, we’re examining the following:
- Doing more work to understand this model and how we should use it within CSEO. This will potentially function as an interim step before getting to a full internet first posture. Service tunneling helps secure connections to Azure from on-premises solutions, but we need to examine the benefits of this model and decide if it’s the right model to use going forward.
- Creating an internet first tie-in. We’re examining the potential for external presentation, where data is kept within a more private network that can use PaaS resources via a service tunnel. We need to work through the proper times and places to use hybrid cloud environments.
We’re continually assessing our approaches to cloud-centric architecture to help ensure continued growth and reliable and optimized services. We have over 40 years of IT history and technical debt that we can’t transform overnight. Our success will be determined by the fluidity of our users’ experience and the level to which we can create an abstraction of our IT infrastructure via cloud-based platforms. This abstraction will create flexibility, usability, scalability, and resiliency for the entire business, which our cloud-centric architecture will support. We’re exploring further transformation while staying dedicated to the effective operation of our entire service portfolio. We’re finding common scenarios where we can optimize services and applications for the cloud, and we’re automating and abstracting as many manual processes and tasks as possible. We’re using the metadata across all our systems to digitally document our cloud infrastructure, creating software-defined templates for the deployment and configuration of infrastructure resources.
© 2020 Microsoft Corporation. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.