In the era of digital transformation, Microsoft is moving all business assets to the cloud to increase efficiency and reduce infrastructure costs. Microsoft Core Services Engineering (CSE, formerly Microsoft IT) helped Microsoft Learning and Readiness (L&R) move their content to the cloud. We encountered challenges and developed best practices for two major components of this cloud migration journey:
- Migrating a massive amount of content to the cloud. We moved 1 PB of learning content to Azure Blob Storage and enabled easy access, sharing, and reuse of this content across teams.
- Migrating the MSCert platform to PaaS. We moved the entire Microsoft certification and credentialing platform (MSCert), including legacy databases, web applications, and services, to Azure PaaS (platform as a service) while maintaining business continuity.
Migrating massive content stores to the cloud
As in many other businesses, content is a core asset of our learning organization. The complexity and cost of managing local data centers, however, scale exponentially with the continuously increasing quantity of content. The Microsoft Azure cloud storage service significantly reduces the time and cost of managing content repositories while providing greater data reliability, automatic hardware and software updates, convenient access, and scalability.
As part of the goal of retiring all on-premises servers, L&R needed to migrate about 1 PB of content, which was spread across more than 300 servers, to the cloud. The fast and reliable migration of this unusually large volume of data was made possible by a novel tool we developed. Without this first-of-its-kind tool, the migration would have taken more than a year to complete or would have required shipping hundreds of hard drives to Azure data centers. This tool is publicly available on GitHub.
With the tool, we can move content over the Microsoft corporate network at speeds of up to 5 TB per day. In less than three months, we successfully moved 450 TB of content. Already, the team has saved $500,000 this year by decommissioning old servers.
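As a rough check on these figures, the arithmetic below estimates transfer time at a sustained daily rate. It is a back-of-the-envelope sketch, not a model of the migration tool: the function name is hypothetical, 1 PB is assumed to be 1,000 TB, and 5 TB per day is the peak rate quoted above, so real transfers would likely take longer.

```python
import math

TB_PER_PB = 1000  # assumption: decimal units (1 PB = 1,000 TB)


def days_to_transfer(total_tb: float, tb_per_day: float) -> int:
    """Return the whole number of days needed to move total_tb at a sustained daily rate."""
    if tb_per_day <= 0:
        raise ValueError("tb_per_day must be positive")
    return math.ceil(total_tb / tb_per_day)


# 450 TB at the quoted peak rate of 5 TB per day
days_for_450_tb = days_to_transfer(450, 5)

# The full 1 PB at the same peak rate
days_for_1_pb = days_to_transfer(1 * TB_PER_PB, 5)

print(days_for_450_tb, days_for_1_pb)
```

At the peak rate, 450 TB works out to 90 days and the full petabyte to 200 days, which is in line with the three-month figure reported above.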
Managing L&R content in the past
The content created in L&R serves many learning programs and platforms. Some examples of content include exams, courses, lab setup documents, virtual machine files, Microsoft Press files, webinar recordings, and videos. Before the migration, the content was hosted in multiple Microsoft Puget Sound labs, which included more than 300 servers and more than 700 virtual machines.
Figure 1 illustrates some limitations of the local content storage experience and how Azure storage helped overcome these shortcomings.
The content migration journey
We explored multiple Azure storage offerings before selecting Blob Storage as the least expensive and most flexible option for the quantity and diversity of content that was hosted on our on-premises servers. The options we considered were:
- Azure Blob Storage. Most cost-effective for large amounts of data. Capable of handling a diverse mix of file formats and of providing high read/write throughput rates.
- Azure File storage. Extensive tools for managing file sharing. Blob storage was chosen over file storage because cost and read/write throughput were high-priority items.
- Azure tables. Did not fit our storage purposes because we do not need content to be accessed by any application.
- Azure queues. Did not fit our primary purpose as we do not need content to be stored in a specific order.
Implementing a hybrid solution
The process of moving content to Azure was complicated by the fact that our local servers had multiple functions in addition to storage—functions like hosting web apps and creating and exporting virtual machine templates for learning lab setups. To provide business continuity, we wanted a hybrid approach that would allow us to first migrate the learning content, while we continued to identify and test cloud solutions for other dependencies on the servers.
Some challenges remain unresolved. We are working with Azure teams to identify equivalent features and to run proofs of concept for features that are still under development.
Table 1 summarizes our approach, challenges, and solutions to migrating each server while minimizing disruption to business operations.
Migrate the content from a file share to a storage account containing multiple blobs, which correspond to top-level folders in the file share.
Connection to Microsoft corporate network
Challenge. We needed to connect to the Microsoft corporate network to access some internal tools used in course creation and packaging. However, Azure is a public network.
Solution. ExpressRoute creates a private pathway to access Azure as required by corporate organizations.
Hosting web applications
Approach. Migrate to Azure PaaS in phases. Keep on‑premises servers in the interim.
Challenge. Applications and the databases they depend on were hosted on several on‑premises servers and virtual machines. Moving all applications at once is risky, especially since applications running in Azure can require complex networking configurations to connect with private database servers.
Solution. A proof-of-concept test for Azure Migration Assistant is in progress.
Exporting and importing nested VMs for lab setup
Approach. Implemented Azure nested virtualization features.
Challenge. Azure did not support direct installation or use of Hyper-V on Azure VMs, so the multi-tiered VM creation used by some Microsoft teams was not supported. We attempted to create similar VM files and extract VHDs using Azure IaaS VMs, Azure DevTest Labs, and Azure Service Fabric (hosting Hyper-V in a Docker container). None of these options could support nested or multi-tiered virtualization.
Solution. Following the scheduled release of Nested Virtualization (M series) by the Microsoft Azure team, we have successfully migrated the process of VM file creation. For more information, see Tuesdays with Corey: M-Series and Nested Virtualization on Azure.
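The content row of Table 1 maps top-level folders in a file share to separate units in a storage account. The sketch below shows one way such a mapping could be computed. It assumes that each top-level folder becomes a blob container and that the rest of the path becomes the blob name. The function names and the padding rule for short folder names are hypothetical, not taken from the migration tool.

```python
import re
from pathlib import PureWindowsPath
from typing import Tuple


def container_name_for(folder: str) -> str:
    """Turn a top-level folder name into a container-style name.

    Azure container names are 3 to 63 characters of lowercase letters,
    digits, and single hyphens, starting and ending with a letter or digit.
    """
    # Collapse every run of other characters into a single hyphen
    name = re.sub(r"[^a-z0-9]+", "-", folder.lower()).strip("-")
    if len(name) < 3:
        # Hypothetical padding rule for very short folder names
        name = (name + "-container").strip("-")
    return name[:63].rstrip("-")


def blob_target(relative_path: str) -> Tuple[str, str]:
    """Map 'TopFolder\\sub\\file.ext' to (container, 'sub/file.ext')."""
    parts = PureWindowsPath(relative_path).parts
    if len(parts) < 2:
        raise ValueError("expected a file inside a top-level folder")
    return container_name_for(parts[0]), "/".join(parts[1:])


print(blob_target("Exam Videos\\2017\\intro.mp4"))
```

Keeping the subfolder path inside the blob name preserves the familiar folder layout for users who browse the content, even though blob storage itself is flat within a container.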
Limitations of Azure Blob Storage
We encountered two limitations in Azure Blob Storage that affect how our stakeholders use and access the content:
- Files are not directly accessible from the cloud. For example, videos must be downloaded onto local machines before they can be played.
- There is limited built-in control to share files and manage permissions to access Azure blobs. Permission to both read and write is granted by distributing access keys to specific users. The security of the system from unauthorized users is solely dependent on strict and proper control of access keys. Because access is not tracked and cannot be restricted to read-only, there is a risk for anonymous content modifications.
The security risks stemming from access management were addressed by integrating our existing content publishing platform with Azure Blob Storage using Azure Active Directory (Azure AD) authentication. We used Azure REST APIs to build file sharing capabilities similar to the file shares that teams are familiar with.
During our migration of L&R programs to the cloud, we learned some important best practices. They include:
- Prepare operations teams for the migration. The process of moving content to Azure Blob Storage has the greatest impact on operations. It is, therefore, critical to involve the operations team early in the planning process to ensure a mutual understanding of how they will be affected and to minimize interruptions to their workflow.
- Assess security risks of content access. Permission to both view and edit the content of blobs is granted by having the account owner share access keys with users. In cases where users need different levels of access, Azure AD authentication may be required. Azure AD must be connected through REST APIs.
- Conduct performance testing early in the release cycle. Running performance and stress tests with Azure AD authentication and blob storage access features (such as bulk upload/download) can present unexpected challenges. In our case, doing performance tests early in the release cycle allowed us to diagnose and resolve issues before they affected project timelines.
PaaS migration of MSCert platform
IT departments and organizations can use Azure PaaS to simplify maintenance, reduce costs, expand software development and data analytics capabilities, and create opportunities to optimize performance.
To unlock these benefits, we partnered with business teams to migrate MSCert to Azure PaaS. The complexity, scale, and database-driven code of services, applications, and databases within MSCert presented migration challenges, especially for compatibility with the Azure PaaS environment. To allow for business continuity and to optimize Azure resources, we adopted a hybrid migration approach and used agile methods for time-effective completion of the project.
As of October 2017, we have successfully moved all our applications to Azure PaaS and databases to Azure SQL databases. Several core business functions that are executed as SQL jobs remain on the virtual machines. We are working to analyze the dependencies of each of these operations and to convert them to APIs that can be migrated to PaaS while preserving their complete functionality. In the meantime, we are already realizing the benefits of Azure PaaS. The current hybrid solution has increased our data processing speeds, reduced the size of our databases, enhanced performance, and given us faster response rates.
MSCert is the platform on which all Microsoft certification and credentialing programs are managed. It includes web applications, services, and databases that are used by Microsoft learning customers like IT professionals, students, learning partners, Microsoft Certified Trainers, and lab administrators. The MSCert system contains the master data for all exams and certification requirements and the user profiles of all certification candidates.
Benefits of Azure PaaS
Before the PaaS migration, MSCert was hosted on 57 virtual machines spread across on-premises servers of multiple Microsoft partners. By moving to PaaS, we sought to support the growth of MSCert and to capture opportunities for enhancing its performance. Compared to on-premises servers, PaaS allows our platform to be more scalable and adaptable, reduces maintenance, and enables our users to be more agile.
Reduced system and infrastructure maintenance
Using PaaS reduces maintenance labor and costs, and it also:
- Reduces the physical footprint of servers and allows flexible procurement of services.
- Deploys automatic hardware and software updates.
- Contains built-in security features. Azure AD authentication supports industry-standard protocols and simplifies authentication needs for developers.
- Issues security tokens upon successful authentication. Azure AD verifies the identity of users and applications that are in an organization’s directory.
- Offers preconfigured tools such as development and testing environments, business intelligence and analytics, and database management systems.
The new PaaS platform enhances performance by:
- Using telemetry and automatically collecting performance data with the cloud service Application Insights.
- Processing data faster through parallelism.
- Supporting flexible and agile development methods with a clean and efficient system architecture.
Challenges—and solutions—to Azure PaaS migration
Many of the challenges stemmed from the fact that MSCert had database-heavy code, which is incompatible with Azure PaaS. Specifically, some issues and their solutions were:
Public to private connectivity issues. L&R applications, services, and databases were hosted on domains belonging to our learning partners. Migrating these applications and services in one shot would not have been safe or manageable.
Solution. First, we moved the MSCert databases from learning partners’ servers to Azure public IaaS with SQL authentication to maintain security. This step avoided the public-to-private connectivity issue by having applications connect to the database in the public cloud as they were migrated to Azure web/worker roles in batches.
Accessing file shares on private, on-premises servers. MSCert is dependent on feed files for data processing. In Azure, it is not possible to access private file shares.
Solution. We configured an application to upload feed files from a shared location to blob storage. An L&R job could then pick up the feed and process the records to a database. Upon completion of the job, the application would also download the outfeed from blob storage to the shared location.
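The feed-file round trip described in this solution can be sketched as follows. This is a minimal illustration under stated assumptions, not the L&R implementation: `InMemoryBlobStore` is a hypothetical stand-in for a blob container, the `infeed/` and `outfeed/` prefixes and the `.feed` extension are invented for the example, and the `process` callback stands in for the job that writes records to a database.

```python
from pathlib import Path
from typing import Callable, Dict, List


class InMemoryBlobStore:
    """Hypothetical stand-in for a blob container: maps blob names to bytes."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def upload(self, name: str, data: bytes) -> None:
        self._blobs[name] = data

    def download(self, name: str) -> bytes:
        return self._blobs[name]


def upload_feeds(shared_dir: Path, store: InMemoryBlobStore) -> List[str]:
    """Copy every *.feed file from the shared location into the store."""
    names = []
    for path in sorted(shared_dir.glob("*.feed")):
        store.upload(f"infeed/{path.name}", path.read_bytes())
        names.append(path.name)
    return names


def run_job(store: InMemoryBlobStore, name: str,
            process: Callable[[bytes], bytes]) -> None:
    """Read one infeed blob, process it, and store the result as an outfeed blob."""
    result = process(store.download(f"infeed/{name}"))
    store.upload(f"outfeed/{name}", result)


def download_outfeed(store: InMemoryBlobStore, name: str, shared_dir: Path) -> Path:
    """Write the outfeed blob back to the shared location."""
    target = shared_dir / f"{name}.out"
    target.write_bytes(store.download(f"outfeed/{name}"))
    return target


def round_trip(shared_dir: Path, process: Callable[[bytes], bytes]) -> List[Path]:
    """Upload all feeds, run the job on each, and bring the outfeeds back."""
    store = InMemoryBlobStore()
    outputs = []
    for name in upload_feeds(shared_dir, store):
        run_job(store, name, process)
        outputs.append(download_outfeed(store, name, shared_dir))
    return outputs
```

In this arrangement the cloud-hosted job only ever touches the blob store, so it never needs a direct connection to the private file share.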
Managed service account authentication not supported in Azure. L&R applications run on Passport and Windows authentication, which are not available for websites hosted on Azure.
Solution. Azure AD is used for authentication and authorization across the platform.
Limitations on the size and type of databases in PaaS. MSCert originally held about 1.7 TB of data and was heavily dependent on temporary databases and tables. Azure SQL Database limited databases to 1 TB and did not support temporary tables or cross-database joins.
Solution. We reduced database sizes by limiting data to what was strictly necessary for operations. For example, all L&R applications logged to database tables, which increased the size of the database by 150 to 200 MB each day. We migrated only 90 days of data from history tables, deleted unused tables and unwanted data from transactional tables, and removed unwanted indexes. As a result, we reduced the database from 1.7 TB to 340 GB.
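A 90-day retention rule like the one described here can be expressed as a simple cutoff filter. The sketch below uses an in-memory SQLite database purely as a stand-in for the real SQL Server tables. The table name `app_log`, the column `logged_at`, and the choice to delete rows (rather than filter them during the copy) are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime, timedelta
from typing import Tuple

RETENTION_DAYS = 90  # from the text: only 90 days of history were migrated


def purge_old_rows(conn: sqlite3.Connection, table: str, now: datetime) -> int:
    """Delete rows older than the retention window and return how many were removed.

    The table name is interpolated directly, so it must be a trusted identifier.
    """
    cutoff = (now - timedelta(days=RETENTION_DAYS)).isoformat(timespec="seconds")
    cur = conn.execute(f"DELETE FROM {table} WHERE logged_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


def demo() -> Tuple[int, int]:
    """Insert rows of various ages, purge, and return (removed, kept)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE app_log (logged_at TEXT, message TEXT)")
    now = datetime(2017, 10, 1)
    rows = [
        ((now - timedelta(days=age)).isoformat(timespec="seconds"), f"event {age}")
        for age in (1, 89, 90, 91, 400)
    ]
    conn.executemany("INSERT INTO app_log VALUES (?, ?)", rows)
    removed = purge_old_rows(conn, "app_log", now)
    kept = conn.execute("SELECT COUNT(*) FROM app_log").fetchone()[0]
    return removed, kept


print(demo())
```

The same cutoff could equally be applied as a `WHERE` filter when copying history tables to the new database, which is closer to migrating only the most recent 90 days.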
100 percent resource utilization in Azure SQL PaaS. Due to the massive use of temporary tables and bulk transactions in legacy code, Azure SQL was hitting 100 percent of its Database Transaction Units (DTUs), causing deadlock and timeout issues.
Solution. We used SQL Server 2016 features, such as in-memory tables and columnstore indexes, to fine-tune the code.
Data migration required significant downtime. Migrating 1.7 TB of data to PaaS using SQL Server Integration Services packages or Azure migration agents would require at least 24 hours of downtime.
Solution. To enable faster transactional replication of databases to Azure PaaS, we upgraded our database architecture from SQL Server 2012 to SQL Server 2016. This allowed us to execute the database migration with only 30 minutes of downtime instead of 24 hours.
Canonical name (CNAME) changes required partners to update connections. With geo-replication, whenever there was a primary failover, partners needed to change their connectivity to a new failover server.
Solution. We unmapped the existing CNAME from IaaS and mapped the same CNAME to a newly created failover group name. The same failover group provides high availability with business continuity and disaster recovery. Therefore, partners no longer need to update their connections during a failover. Azure’s new Auto-failover groups feature provides a powerful abstraction of active geo-replication by supporting group-level replication and automatic failover. In addition, it removes the need to change the SQL connection string after a failover by providing an additional listener endpoint. To achieve automatic failover of databases with zero downtime, we configured geo-replication with secondary databases in different data center locations.
Performance monitoring in Azure PaaS
With Azure, telemetry from the MSCert system allows L&R to respond to issues more quickly and to identify opportunities to improve performance. The telemetry we monitor includes:
- Request rates, response times, and failure rates. We can better understand how users navigate applications by identifying which pages are most popular and at what times of day. If response times and failure rates are high when there are more requests, then there is likely a resourcing problem.
- Dependency rates, response times, and failure rates. We can identify if external services reduce service speed.
- Exceptions. We can analyze aggregated statistics or pick specific instances to investigate the stack trace and related requests. Both server and browser exceptions are reported.
- Page views and load performance. We can monitor page views and load performance that is reported by browsers.
- AJAX calls from web pages. We can track AJAX call rates, response times, and failure rates.
- User and session counts. We can find out when and where people use a web app, what pages they visit most, and what browsers and operating systems they are using. This helps us analyze business and usage.
- Performance counters of Windows or Linux servers. We can track CPU, memory, and network usage.
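The first item in the list above suggests a simple heuristic: if response times and failure rates both rise with request volume, resourcing is a likely cause. The sketch below shows one way to check that on exported telemetry samples. It is an illustration only: the function names, the use of Pearson correlation, and the 0.8 threshold are assumptions, and nothing here calls the Application Insights API.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def likely_resourcing_problem(
    requests: Sequence[int],
    response_ms: Sequence[float],
    failures: Sequence[int],
    threshold: float = 0.8,  # assumed cutoff for a "strong" correlation
) -> bool:
    """Flag intervals where latency and failure rate both track request volume."""
    failure_rates = [f / r if r else 0.0 for f, r in zip(failures, requests)]
    return (
        pearson(requests, response_ms) >= threshold
        and pearson(requests, failure_rates) >= threshold
    )


# Hypothetical hourly samples: load, latency, and failures all climb together
print(likely_resourcing_problem([100, 200, 400, 800], [120, 150, 300, 900], [1, 2, 12, 80]))
```

A check like this only points to where to look. Dependency timings and performance counters from the same period would still be needed to confirm that capacity, rather than an external service, is the bottleneck.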
Increased data visibility
Application Insights helps a development team better understand how the app is performing and being used. We can monitor performance using the Application Insights dashboards that graphically illustrate important performance indicators, like those shown in Figure 2. In addition to an overview screen, other dashboards tell us whether pages are loading slowly, whether traffic has changed, and whether script errors occur. They also show page view properties such as page URL, browser version, country or region, client IP, operating system, and page load time.
L&R connects Microsoft customers and partners with the wealth of knowledge and learning tools created by our content developers. Through this Azure migration project, we increased the availability, scalability, and performance of our learning system using simple and easy-to-learn options provided by Azure Blob Storage and PaaS solutions. This cloud migration has already resulted in significant cost savings for Microsoft corporate functions and engineering. As the migration continues, and as Microsoft Azure develops, the benefits will only increase.
© 2019 Microsoft Corporation. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.