When Microsoft began preparing for its employees to work remotely in response to COVID-19, it was the job of Ludo Hauduc, corporate vice president of Core Platform Engineering in Microsoft Core Services Engineering and Operations (CSEO), and his team to make sure that the company’s internal network would hold up.
They were cautiously optimistic—the team had just rebuilt the entire network, including the virtual private network (VPN). This network supports access to key internal servers with protected data, personnel information, and other critical assets that must be on lockdown.
“Our network has done very well since we asked our employees to work remotely,” Hauduc says. “So far, we’ve seen a really strong performance from our network and VPN, specifically.”
The strong response has been fueled by an earlier decision the team made to reduce the workload that the company pushes through its VPN pipes. The team did that by implementing split tunneling at most of its locations worldwide, which funnels the majority of the company’s mobile workload to the internet.
Split tunneling became possible because Microsoft is nearly 100 percent in the cloud, which allows its remote workers to access core applications and experiences over the internet via Microsoft Azure and Office 365. Before the company migrated to the cloud, everything would have been routed through VPN.
“It really helps us that most of our mobile workload—including traffic to high volume and performance sensitive Office 365 and Azure applications—is securely routed directly over the internet,” Hauduc says.
In retrospect, adopting split tunneling was a pivotal decision.
“It is allowing our employees to maintain their normal level of productivity even as they all work remotely,” he says.
He pointed to how employees are now using Microsoft Teams as an example.
“Our employees have significantly increased their usage of voice and video conferencing on Teams,” he says. “We’ve been able to sustain this massive spike in Teams usage without major issues because it’s being routed over the internet—leaving our VPN capacity for just necessary connections between users and our internal resources.”
There have been challenges, however, which began when the company’s employees in China started working from home.
“Unlike here at our headquarters and other worldwide locations, when our employees in China work remotely, everything they do goes exclusively through our VPN pipe,” Hauduc says.
That meant 100 percent of the workload of employees in Shanghai and Beijing was suddenly going through already heavily used VPN gateways.
“It was almost an overnight phenomenon,” Hauduc says. “We were suddenly seeing usage of 85 to 95 percent of our network bandwidth and our VPN capacity.”
VPN capacity in China, already tight before COVID-19 began to spread, was quickly becoming a bottleneck.
“We started asking ourselves a lot of questions,” Hauduc says. “Can we handle the expected number of concurrent VPN sessions? How is bandwidth holding up for employees? What’s their experience like? Are they all being successful?”
Quick action was needed.
“We had data to answer all the questions, but what we didn’t have was a single pane of glass where we could quickly look at everything to see what was happening across the company’s infrastructure,” Hauduc says. “And company leaders were trying to figure out how to respond to the crisis—they needed data from us, and they needed it quickly.”
The answer was to identify the data that mattered the most and aggregate it into a Microsoft Power BI dashboard, which the company now uses to track all its VPN systems as the COVID-19 situation evolves.
As for the offices in Shanghai and Beijing, Hauduc’s team worked with local internet providers to increase VPN capacity by 50 percent so they had enough headroom to handle the new usage safely.
“That was a budget decision,” Hauduc says. All they had to do was sign some contracts—no new hardware was needed. “Once we agreed that it was the right thing to do, we were able to remove that bottleneck in less than a day.”
[Learn how CSEO is using a modern network infrastructure to drive transformation at Microsoft. Read how Microsoft is approaching Zero Trust Networking. Learn how Microsoft is modernizing its internal network using automation. Check out how Microsoft uses Azure ExpressRoute hybrid networking technology to help secure the enterprise.]
Investments in VPN infrastructure paying off
When the announcement came that all but a few of Microsoft’s employees and vendors would work remotely, Hauduc was confident that the company’s VPN infrastructure would support that sudden spike in demand.
Three years ago, he would not have been so confident.
“We were in a tough spot a few years ago,” Hauduc says. “We had multiple and complex reasons for why our employees’ end-to-end VPN experience wasn’t very strong—it was a complicated stack that had multiple potential failure points.”
The team ran into issues on the Windows side, there were challenges with the network, and the company was using several different VPN clients at once, which created confusion and complexity for employees. Hauduc’s team worked closely with the Windows team, and through direct partnership and engagement, helped drive significant stability improvements in the Windows native VPN client.
“We saw a connectivity success rate in the 60 to 65 percent range, which is very low,” Hauduc says. “That meant that a third of people would run into an issue every time they tried to work remotely.”
A fix was needed.
“We knew this could become a problem if we had a situation where we needed our employees to work remotely,” Hauduc says. “So, we invested heavily in strengthening our VPN service by focusing on the user experience and partnering closely with internal teams.”
The team worked to replace its VPN infrastructure from the ground up, says Steve Means, a senior service engineer on the CSEO team that manages VPN for the company.
“We built the new system so it could support over 200,000 concurrent sessions,” Means says. “In an extreme situation, we could support that many people on VPN at the same time.”
Microsoft has 151,000 employees and a large contingent of vendors who work on the company’s network. They don’t all work at the same time, but the goal was to cover the worst-case scenario and to future-proof the solution.
“Across the world, we normally have about 55,000 employees connect via VPN on a given day,” Means says. “With everyone working remotely, that has climbed as high as 128,000 employees and vendors per day, including about 45,000 per day at our headquarters in Redmond.”
Previously, employees used a large number of gateways to access the company’s internal network, but many of those gateways provided poor connectivity.
“We consolidated the gateways to data centers and locations with reliable and plentiful bandwidth,” Means says. “This shrunk the number of gateway sites, but increased overall reliability and made it so we could handle more concurrent connections.”
The hybrid design that the team put together uses Microsoft Azure Traffic Manager to geolocate VPN users. “That allowed us to send them to their nearest gateway and to meet scale demands,” he says. “We used Azure Active Directory (AAD) to authenticate our users and to validate the status of their device before allowing them on VPN.”
The team also began using servers that can handle 30,000 or 60,000 users each, much more than the old servers that could only handle 750 to 2,000 users. “Theoretically, we could now handle 500,000 concurrent VPN connections worldwide,” Means says.
Hauduc says the improvement in the company’s VPN service was substantial, so much so that employees forgot it was working behind the scenes when they worked remotely.
Despite being worked harder than ever before, the company’s VPN infrastructure is performing at a high level. “Knock on wood, there have been no major incidents since the crisis started,” Hauduc says.
Importantly, VPN is allowing employees to get their work done.
“Today, even as we ask almost all our employees to work remotely, our success rate is at 92 percent,” Hauduc says. “That’s one of the highest rates we’ve ever recorded—the only reason it isn’t at 99 percent is because that number includes drops because of reboots during patch updates, getting disconnected from Wi-Fi, and home network or internet service provider issues.”
Employee productivity also has held strong.
“We measure employee productivity, and the productivity of our software engineers in particular,” Hauduc says. “We look at pull requests, commits per day, and other indicators—so far, we haven’t seen any measurable drop in work performance. Our focus has been to keep our entire global workforce safe, connected, and productive through this crisis.”
Hauduc says the situation is creating a learning moment for his team.
“One thing that we’re learning is it’s really about the data,” he says. “There are so many things we can measure—finding the right things to measure so we can take the right actions is critical.”
The team’s data-centric approach to VPN and networking also has allowed it to make smart investments, like provisioning capacity only when required. It also helps the team respond quickly when needed—which happened recently when Italy tightened its remote working restrictions.
“We doubled capacity in London, which is where we run the VPN connection for our employees in Italy,” Hauduc says. “Having good data allows us to quickly take proactive action when needed and to stay ahead of the crisis as it unfolds.”
The team also recently saw the potential for a bottleneck at its headquarters in Redmond, Washington, where the number of concurrent sessions that VPN needed to support was climbing close to capacity. The company addressed this concern by adding another VPN gateway. The next test will be in India, where work-from-home recommendations are now being implemented.
“This has caused us to reflect on our readiness efforts overall,” Hauduc says. “We’ve used this as an opportunity to improve how we do things.”
The team expects to keep learning as the COVID-19 response unfolds.
Hauduc says one of the most uplifting things about the response is that all the employees working to strengthen the company’s VPN infrastructure have been able to work from home themselves.
“We want everyone to be as safe as possible,” he says. “Stay safe.”
Tips for retooling VPN at your company
For enterprises and organizations looking to optimize and scale out their VPN capabilities, some of the best practices shown above and recommended by Microsoft are:
- Reduce the load on your VPN infrastructure by using split tunnel VPN. Send traffic for “known good,” well-defined SaaS services, like Teams and other Office 365 services, directly to the internet or, optimally, send all non-corporate traffic to the internet if your security rules allow.
- Collect user connection and traffic data for your VPN infrastructure in a central location, and use modern visualization services, like Power BI, to identify hot spots before they happen and to plan for growth.
- If possible, use a dynamic and scalable authentication mechanism, like Azure Active Directory, to avoid the trouble of certificates. If your VPN client is Active Directory aware, like the Azure OpenVPN client, you can also improve security by using multi-factor authentication (MFA).
- Geographically distribute your VPN sites to match major user populations, and use a geo-load balancing solution, such as Azure Traffic Manager, to direct users to the closest VPN site and distribute traffic between your VPN sites.
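To make the split tunneling tip concrete, here is a minimal sketch of the routing decision in its simplest form: corporate destinations go through VPN, and everything else goes directly to the internet. It is not Microsoft’s implementation. The address ranges and the function name route_for are assumptions made for illustration, and a real deployment would configure its own ranges in its VPN client or gateway.

```python
import ipaddress

# Hypothetical corporate address ranges, used only for illustration.
CORPORATE_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
]


def route_for(destination: str) -> str:
    """Return "vpn" for corporate destinations and "internet" otherwise."""
    address = ipaddress.ip_address(destination)
    # Only traffic bound for an internal range uses the VPN tunnel.
    if any(address in network for network in CORPORATE_NETWORKS):
        return "vpn"
    # All other traffic (for example, SaaS services) bypasses the tunnel.
    return "internet"
```

With these assumed ranges, a call such as route_for("10.1.2.3") selects the VPN tunnel, while a public address is sent directly to the internet, which keeps VPN capacity free for internal resources.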
Finally, and probably most important, know the limits of your VPN connection infrastructure and how to scale out in times of need. Limits such as total available bandwidth and maximum concurrent user connections per device will determine when you’ll need to add more VPN devices.
If your devices are physical hardware, having additional supply on hand or a rapid supply chain source will be critical. For cloud solutions, knowing ahead of time how and when to scale will make the difference.
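As a rough illustration of this kind of capacity planning, the sketch below estimates how many VPN devices are needed for a given peak of concurrent sessions while leaving some headroom. The function name gateways_needed and the 80 percent target utilization are assumptions chosen for the example, not figures from Microsoft.

```python
import math


def gateways_needed(peak_sessions: int,
                    sessions_per_gateway: int,
                    target_utilization: float = 0.8) -> int:
    """Estimate the number of VPN devices required for a peak load.

    target_utilization is the assumed fraction of each device's session
    limit that you are willing to use, so the rest remains as headroom.
    """
    if peak_sessions < 0:
        raise ValueError("peak_sessions must not be negative")
    if sessions_per_gateway <= 0:
        raise ValueError("sessions_per_gateway must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_sessions = sessions_per_gateway * target_utilization
    # Round up: a partially used device is still a whole device.
    return math.ceil(peak_sessions / usable_sessions)
```

For example, with a hypothetical peak of 128,000 concurrent sessions and devices that each handle 30,000 sessions, gateways_needed(128000, 30000) suggests six devices at the assumed 80 percent target. The same calculation can be repeated for bandwidth to see which limit is reached first.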
Azure offers a native, highly scalable VPN gateway, as well as the most common third-party VPN and SD-WAN network virtual appliances in the Azure Marketplace.
For more information on these and other Azure and Office network optimization practices, see:
- Office Connectivity Principles
- Network considerations for Teams
- Azure VPN Gateways
- Azure Point-to-site VPN
- VPN Network Virtual Appliances in the Azure Marketplace
- Updates for improving work-from-home employee access
Here are additional resources to learn more about how Microsoft applies other networking best practices: