How Microsoft used SQL Azure and Azure Service Fabric to rebuild a key internal app

Oct 16, 2019

When Raja Narayan took over supporting the Payee Management Application that Microsoft Finance uses to onboard new suppliers and partners, the experience was broken.

“Our application’s infrastructure was on-premises,” Narayan says. “It was a big, old-school architecture monolith and, although we had database-based logging in place, there was no alerting setup at any level. Bugs and infrastructure failures were bringing the application down, but we didn’t know when this happened.”

And it went down a lot.

When it did, the team wouldn’t know until a user filed a ticket. Then it would take four to six hours before the ticket reached Narayan’s team.

“We would undertake root-cause investigation and it sometimes could take a solid two to three hours, if not more in rare cases, until we managed to eventually identify and resolve the problem,” says Narayan, a principal software engineer in the Microsoft Core Services Engineering and Operations (CSEO) group that supports the Microsoft Finance team.

[Take a look at how Narayan’s broader team is modernizing applications.]

All told, it would take at least 10 to 12 hours to bring the system back online.

And it wasn’t only the reliability challenges that the team was hit with daily. Updates and fixes required taking the system down. Engineering teams didn’t have insight into work that other teams were doing. Cross-discipline collaboration was minimal. Continuous repetitive manual work was required. And telemetry was severely limited.

“There was no reliability at all,” Narayan says. “The user experience was very, very bad.”

That was four years ago, before the team moved its payee management system and its 95,000 active supplier and partner accounts to the cloud.

“When I joined our team, it was obvious that we needed a change. And going to Azure was a big part of it,” Narayan says. “Going to the cloud was going to open up new opportunities for us.”

He was right. After the nine-month migration was finished, things got much better right away. The benefits included:

  • The team was empowered to adopt modern DevOps engineering practices, something it had long wanted. The benefits showed up in many ways, including reduced cross-team friction and faster response times.
  • Failures were reported to a Directly Responsible Individual (DRI) immediately. They would fix the problem right away or queue it up for the engineering team to do deeper-level work.
  • The time to fix major production issues dropped to as little as 15 minutes, with a maximum of four hours.
  • The team no longer needed to shut down the system to make production fixes (thanks to the availability of staging and production slots, and hosting frameworks like Azure Service Fabric).
  • Application reliability shot up from around 95 percent to 99 percent. Availability stayed high because of redundancy.
  • Scaling the application up and out became just a configuration away. The team was able to scale the services based on memory and processor utilization.
  • The application’s telemetry data became instantly available to analyze and learn from.
  • The team could start taking advantage of automation and governance capabilities.
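
The utilization-based scaling mentioned in the list above can be pictured with a small sketch. This is an illustration only, not the team's configuration or an Azure API: the `ScaleRule` class, the `desired_instances` function, and every threshold and instance limit are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScaleRule:
    """Hypothetical scaling rule; all numbers are illustrative assumptions."""
    scale_out_above: float = 75.0  # add an instance above this utilization (%)
    scale_in_below: float = 25.0   # remove an instance below this utilization (%)
    min_instances: int = 2
    max_instances: int = 10


def desired_instances(current: int, cpu_percent: float, memory_percent: float,
                      rule: ScaleRule = ScaleRule()) -> int:
    """Return the instance count the rule asks for, one step at a time."""
    # Scale on whichever resource is under the most pressure.
    load = max(cpu_percent, memory_percent)
    if load > rule.scale_out_above:
        target = current + 1
    elif load < rule.scale_in_below:
        target = current - 1
    else:
        target = current
    # Never go outside the configured instance limits.
    return max(rule.min_instances, min(rule.max_instances, target))


print(desired_instances(current=3, cpu_percent=90.0, memory_percent=40.0))
```

In a managed platform the equivalent rule is declared as configuration rather than written as code, which is what makes scaling "a configuration away."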

The shift to Azure is having a lasting impact.

“If someone asked me to go back, I don’t think I could happily do it,” Narayan says. “I don’t know how we survived in those old days. It’s so much faster and more powerful to be on Azure.”

Instead of spending all his time fighting to reduce technical debt, building and maintaining too many services, and buying and installing technical infrastructure, he’s now focused on what his internal business customers need.

“Now we’re building a program,” Narayan says. “Now we’re taking care of our customers. Application-hosting infrastructure is not our concern now. Azure takes care of it.”

Opening doors with SQL Azure

Moving to the cloud also meant the team got to move on from an on-premises SQL Server database that needed continuous investment in optimization and maintenance to avoid problems with performance.

“We’ve never had an incident where our SQL Azure database has gone down,” Narayan says. “When we were on-prem, our work was often interrupted by accidental server restarts and patch installations.”

The team no longer needs to shut the application down and reboot the server when it wants to fix something or make an upgrade. “Every time we want to do something new, we make a couple of clicks, and boom, we’re done,” he says.

Azure SQL made it much easier to scale up and down when user loads changed. “My resources are so elastic now,” Narayan says. “I can shrink and expand based on my need—it’s a matter of sliding the scrollbar.”

Moving the application’s database to SQL Azure has given the team access to several new tools.

“With our move to cloud, the team can experiment on any databases, something that wasn’t possible before,” Narayan says. “Before we could only use SQL Server. Now we have an array of options such as Cosmos DB, table storage, MySQL, and PostgreSQL. New features from these products are available automatically to us. We don’t have to install feature updates and patches—it’s all managed by Azure.”

Living in the cloud also gives the team new access to the application’s data.

“We now live in this new big-data world,” Narayan says. “We can now get a lot of insights about our application, especially with machine learning and AI.”

For example, SQL Azure learns from the incoming load and accordingly tunes itself. Indexes are created or dropped based on how it learns. “This is one of the most sought-after features by our team,” he says. “This feature does what a database administrator used to have to do by hand.”

And processing the many tiny transactions that come through Narayan’s application? Those all happen much faster now as well.

“For Online Analytical Processing (OLAP), we need big processing machines,” he says. “We need big resources.”

Azure provides him with choices, including Azure SQL Data Warehouse, Azure Databricks, and Azure HDInsight. “If I was still on-prem, this kind of data processing would just be a dream for me,” he says. “Now they are a click away for me.”

Going forward, the plan is to use AI and machine learning to analyze Payee Management Application’s data at greater depth. “There is a lot more we can do with our data,” Narayan says. “We’re just getting started.”

Narayan’s journey toward more reliable and agile service is a typical example of how off-loading the work of managing complex on-premises infrastructure can help the company’s internal and external customers focus on their core businesses, says Eli Birova, a site-reliability engineer on the Azure SQL SRE Team.

“And one of the biggest values Azure SQL DB brings is a database in the Azure cloud that scales in and out together with your business need and adapts to your workload,” Birova says.

That provides customers like Narayan and his team with a database as a service shaped by the deep Relational Database Management Systems (RDBMS) engineering expertise that comes from many years of developing Microsoft SQL Server, she says. The service incorporates best practices for designing and implementing large-scale distributed systems, and it natively uses the scalability and resiliency mechanisms of the Azure stack itself.

“We in the Azure SQL DB team are continuously monitoring and analyzing the behavior of our services and the experience our customers have with us,” Birova says. “We’re very focused on identifying and implementing improvements to our feature set, reliability, and performance. We want to make sure that every customer can rely on their data when and as they need it, and that they can count on their server being up to date and secure without needing to invest their own engineering resources into managing on-premises database infrastructure.”

Harnessing the power of Azure Service Fabric

Once Narayan’s team finished migrating the Payee Management Application to the cloud, it got the breathing room it needed to start thinking bigger.

“We started asking ourselves, ‘How can we get more out of being in the cloud?’” Narayan says. “It didn’t take us long to realize that the best way to take advantage of everything Azure had to offer would be to modify our application from the ground up to be cloud-native.”

That shift in thinking meant that his days of running a massive, clunky, monolithic application were numbered.

“We realized we could use Azure Service Fabric to rebuild the application as a suite of microservices,” Narayan says. “We could get an entirely fresh start.”

Azure Service Fabric is part of an evolving set of tools that the Azure product group is using to help customers, including power users inside Microsoft, build and operate always-on, scalable, distributed apps like the one Narayan’s team manages, says Spencer Schwab, a software engineering manager on the Microsoft Azure Site Reliability Engineering (SRE) team.

“We’re learning from the experience Raja and his team are having with Service Fabric,” Schwab says. “We’re pumping those learnings back into the product so that our customers have the best experience possible when they choose to bet their businesses on us.”

Narayan’s team is using Azure Service Fabric to gradually rebuild the Payee Management Application without interrupting service to customers. That’s something possible only in the cloud.

“We lifted and shifted all of the old, existing monolith components into Azure Service Fabric,” he says. “Containerizing it like that has allowed us to gradually strangle the older application.”

Each component of the old application is docked in a container. Each is purposefully placed next to the microservice that will replace it.

“Putting each microservice next to the component that it’s replacing allows us to smoothly move that bit of workload to the new microservice without shutting down the larger application,” Narayan says. “This is making our journey to microservices pleasant.”
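
The gradual replacement Narayan describes resembles what is commonly called the strangler pattern: a routing layer sends each capability either to the old component or to the microservice that has taken it over. The sketch below is a minimal illustration under that assumption, not the team's code. The `StranglerRouter` class, the handler functions, and the capability names are all hypothetical.

```python
from typing import Callable, Dict

Handler = Callable[[dict], str]


class StranglerRouter:
    """Routes each capability to a new microservice once it has been
    migrated, and to the legacy component otherwise (hypothetical sketch)."""

    def __init__(self, legacy: Handler) -> None:
        self._legacy = legacy
        self._migrated: Dict[str, Handler] = {}

    def migrate(self, capability: str, handler: Handler) -> None:
        # Cut one capability over to its replacement microservice.
        self._migrated[capability] = handler

    def rollback(self, capability: str) -> None:
        # Send the capability back to the legacy component if needed.
        self._migrated.pop(capability, None)

    def handle(self, capability: str, request: dict) -> str:
        handler = self._migrated.get(capability, self._legacy)
        return handler(request)


def legacy_monolith(request: dict) -> str:
    return f"legacy:{request['payee']}"


def onboarding_service(request: dict) -> str:
    return f"onboarding-service:{request['payee']}"


router = StranglerRouter(legacy_monolith)
before = router.handle("onboarding", {"payee": "contoso"})
router.migrate("onboarding", onboarding_service)
after = router.handle("onboarding", {"payee": "contoso"})
other = router.handle("payments", {"payee": "contoso"})
print(before, after, other)
```

Because callers only ever talk to the router, a capability can move to its new service, or move back, without the larger application going offline.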

The team is halfway finished.

“So far we have 12 microservices, and we’re planning to expand up to 25,” he says.

Once the team is done, it can truly take advantage of being in the cloud.

“We’ll be ready to reap the benefits of cloud-native development,” Narayan says. “Anything becomes possible at that point.”

Read more about Azure reliability improvements.

Learn about Project Tardigrade and improving Azure resiliency.
