How we manage incidents at Microsoft

July 6, 2023

Microsoft Australia

Learn about Microsoft’s Cyber Defense Operations Center

Microsoft has over 13,000 people working in a diverse set of teams in the cyber security arena. We adapt and change our approach as we face new threats.

Below, we’ll talk about the things we’ve learnt from managing incidents at Microsoft, giving you tips that can be applied to businesses of all sizes.

Our Cyber Defense Operations Center (CDOC) is a fusion centre: incident responders, threat intelligence, legal, investigators, communications and marketing teams can all work together on a single incident.

How the CDOC responds to incidents

At Microsoft, we see 600 billion events per month, but only hundreds of these need to be looked at by a person. That’s because our tiered incident response structure incorporates automation.

First, we run the telemetry from the use of our products around the world through cloud-powered AI and machine learning engines. These engines identify anomalous and malicious behaviour. Then, automation responds, remediating these incidents where possible.

An incident that can’t be resolved by automation is escalated through a tiered structure:

Triage tier: Once alerted, the CDOC’s triage team has roughly 15 minutes to respond to and troubleshoot the incident.
Investigation tier: Incidents not resolved in the 15-minute triage stage are passed to experts in the relevant product, service, or area. They dig into the problem and have a little more time to resolve it.
Tier three: 96% of incidents are resolved in the first two tiers. The rest are escalated. Time-sensitive incidents go to the reactive team. Less urgent alerts go to proactive hunters, who investigate and tune the automation.
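
As a rough illustration of the escalation path described above, here is a minimal Python sketch. It is not Microsoft's implementation: the Incident fields, the Tier names and the final_tier function are all hypothetical, chosen only to mirror the tiers in this post.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    AUTOMATION = "automation"
    TRIAGE = "triage tier"
    INVESTIGATION = "investigation tier"
    REACTIVE = "tier three: reactive team"
    PROACTIVE = "tier three: proactive hunters"


@dataclass
class Incident:
    # Hypothetical flags describing how far an incident got before it was resolved.
    auto_remediated: bool = False
    resolved_in_triage: bool = False
    resolved_in_investigation: bool = False
    time_sensitive: bool = False


def final_tier(incident: Incident) -> Tier:
    """Return the tier at which the incident ends up being handled."""
    if incident.auto_remediated:
        return Tier.AUTOMATION
    if incident.resolved_in_triage:
        return Tier.TRIAGE
    if incident.resolved_in_investigation:
        return Tier.INVESTIGATION
    # Anything still open is escalated to tier three and split by urgency.
    return Tier.REACTIVE if incident.time_sensitive else Tier.PROACTIVE
```

The point of the sketch is the ordering: each tier only sees what the previous one could not close, and urgency only matters once an incident reaches tier three.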

If you’d like to learn more about our approach, you can find our best practices in the Microsoft Cybersecurity Reference Architecture here: aka.ms/MCRA.

Continuously improving our security operations

The CDOC includes experts in getting the right signals from our automated tools and filtering out noise.

To make sure our people are focussed on the most important things, we track:

The number of alerts automation identifies
The number of alerts automation remediates on its own
How fast remediation happens
The different types of alerts automation answers
How long it takes for people to acknowledge, respond to, and resolve an alert
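
If you want to track similar measures in your own environment, the list above could be computed along these lines. This is a sketch under assumptions: the Alert record and its field names are hypothetical, and only acknowledge and resolve times are shown (a respond time would follow the same pattern with one more timestamp).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import Optional


@dataclass
class Alert:
    kind: str
    raised: datetime
    resolved: datetime
    auto_remediated: bool
    acknowledged: Optional[datetime] = None  # when a person picked it up, if one did


def _mean_seconds(deltas):
    """Average a series of timedeltas in seconds, or None if there are none."""
    values = [d.total_seconds() for d in deltas]
    return mean(values) if values else None


def summarise(alerts):
    auto = [a for a in alerts if a.auto_remediated]
    human = [a for a in alerts if a.acknowledged is not None]
    return {
        "alerts_identified": len(alerts),
        "alerts_auto_remediated": len(auto),
        "mean_auto_remediation_seconds": _mean_seconds(a.resolved - a.raised for a in auto),
        "auto_remediated_by_type": dict(Counter(a.kind for a in auto)),
        "mean_time_to_acknowledge_seconds": _mean_seconds(a.acknowledged - a.raised for a in human),
        "mean_time_to_resolve_seconds": _mean_seconds(a.resolved - a.raised for a in human),
    }
```

Feeding summarise a month of alerts gives one number per item in the list, which makes it easier to see whether tuning the automation is actually reducing the load on people.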

Automation in Microsoft Sentinel and Microsoft 365 Defender

Microsoft Sentinel and Microsoft 365 Defender both rely on automation, but there are differences.

Microsoft Sentinel

Microsoft Sentinel is built on the concept of playbooks, so we can flexibly orchestrate different sources. That way we make sure end-to-end automation works across all the data sources connected to Microsoft Sentinel. This model offers a high degree of flexibility in the role of human input and the scope of automation.
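
To make the playbook idea concrete, here is a generic Python sketch of a playbook as an ordered list of steps, some of which wait for a person to approve them. It does not use or represent the Microsoft Sentinel API; Step, run_playbook and the approve callback are hypothetical names used only to show how human input and automation scope can be adjusted per step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]  # mutates the incident record in place
    needs_approval: bool = False    # True means a person must agree first


def run_playbook(steps, incident, approve):
    """Run steps in order and stop at the first step a person declines.

    `approve` is a callback taking (step_name, incident) and returning a bool.
    Returns the names of the steps that actually ran.
    """
    completed = []
    for step in steps:
        if step.needs_approval and not approve(step.name, incident):
            break
        step.action(incident)
        completed.append(step.name)
    return completed
```

Moving the needs_approval flag from one step to another is the kind of adjustment the flexibility described above refers to: low-risk steps run unattended, while disruptive ones wait for a person.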

Microsoft 365 Defender

Microsoft 365 Defender is an XDR product that brings together all the alerts from endpoints, identities, apps, and data. The alerts are grouped into incidents that trigger auto investigations. Then the system surfaces the decisions the security operations people need to make.
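
One simple way to picture alerts being grouped into incidents is to merge alerts that share an entity such as a user or a device. The sketch below does that with a small union-find. It is an illustrative assumption, not a description of how Microsoft 365 Defender correlates alerts, and the alert IDs and entity labels are made up.

```python
from collections import defaultdict


def group_alerts(alerts):
    """Group alert IDs that share at least one entity, directly or transitively.

    `alerts` maps an alert ID to the set of entities it mentions.
    Returns a list of sets of alert IDs, one set per incident.
    """
    parent = {alert_id: alert_id for alert_id in alerts}

    def find(x):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # entity -> first alert ID that mentioned it
    for alert_id, entities in alerts.items():
        for entity in entities:
            if entity in first_seen:
                parent[find(alert_id)] = find(first_seen[entity])
            else:
                first_seen[entity] = alert_id

    incidents = defaultdict(set)
    for alert_id in alerts:
        incidents[find(alert_id)].add(alert_id)
    return list(incidents.values())
```

For example, an endpoint alert on a device and an identity alert for the user signed in to that device would land in the same incident, so an analyst sees one story instead of two unrelated alerts.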

Will automation solve our security problems?

Automation takes the complexity out of situations. However, it also brings in more points of data for analysis. And sometimes it can even point you in the wrong direction.

There’s a lot of nuance to this information, which means people are still crucial to security operations functions. People are needed to take in information from unorthodox sources, to piece things together and to make deductions.

It’s becoming clear that we’re not going to be able to get away with fewer people. What you need is to find the right people and keep them! Compensate and reward them well, and provide challenging work to keep them interested and engaged.

Integrate security operations with the rest of your business

You need a great relationship between your business units and your security operations teams. The security operations teams need to understand the business units’ priorities. That way they can consider the effects of security actions. Should they shut down a compromised computer in the middle of what might be a critical negotiation, or should they find another way around the problem?

Equally, the business units need to know what’s being done to protect their infrastructure.

Using threat intelligence

At Microsoft, we have product-specific threat intelligence teams doing a variety of things. They take threat intelligence from all manner of sources: internal sources, dark web research, third parties, and government and industry partnerships.

What we find is used to write detections for new vulnerabilities and alerts. There’s also strategic work, like studying threat actors and doing malware analysis.

Our findings help strategic leadership understand the vulnerabilities and attacks out there. We discuss with leadership what these threats mean to the business, so they’re aware of the kind of impact that might be expected.

At Microsoft, we have teams of reactive and proactive threat hunters.

Reactive threat hunters

Our reactive hunters are standing by to work with investigators after we get indicators of an incident. They’ll hunt through data for signs that we’ve been compromised.

Proactive threat hunters

Proactive threat hunters aren’t dragged into incidents. Instead, they take a strategic view of the threat landscape. They use threat intelligence and work with our red teams to identify hunting grounds.

Ensuring knowledge is shared between the teams

The reactive and the proactive threat hunting teams are separate, but they are co-located and sit in the same part of the organisation.

Being co-located means we can close the feedback loop better than ever. The teams:

work from the same playbooks
follow established processes to get their results integrated into our products and services
hold war games and debrief afterwards to share findings

We make sure remediations have been implemented and lessons learned have been recorded before closing off any incident or war game.

Operationalising the MITRE ATT&CK framework

There’s a lot of information in the MITRE ATT&CK framework, which can make it daunting to figure out the best and most agile way of using it. However, not all of it will apply to everyone. It depends on where you sit in the incident response or security spectrum.

To get the most from the framework, you need to take the relevant parts and operationalise them. The key idea is to make those parts digestible.
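
One way to make a slice of the framework digestible is to pick the techniques that matter to your environment and check which of them you can already detect. The sketch below assumes exactly that. The technique IDs and names come from MITRE ATT&CK, but the choice of techniques, the detection rule names and the coverage_gaps function are hypothetical.

```python
# Techniques judged relevant to this (imaginary) organisation.
RELEVANT_TECHNIQUES = {
    "T1566": "Phishing",
    "T1078": "Valid Accounts",
    "T1059": "Command and Scripting Interpreter",
    "T1486": "Data Encrypted for Impact",
}

# Hypothetical detection rules, each mapped to the techniques it covers.
DETECTIONS = {
    "suspicious-attachment-rule": ["T1566"],
    "impossible-travel-rule": ["T1078"],
    "encoded-powershell-rule": ["T1059"],
}


def coverage_gaps(techniques, detections):
    """Return the relevant technique IDs that no detection rule covers."""
    covered = {t for mapped in detections.values() for t in mapped}
    return sorted(t for t in techniques if t not in covered)


for technique_id in coverage_gaps(RELEVANT_TECHNIQUES, DETECTIONS):
    print(technique_id, RELEVANT_TECHNIQUES[technique_id])
```

A short gap list like this is easier to act on than the full matrix, and it gives threat hunters and detection authors a shared starting point.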

You won’t be surprised to find that Microsoft has a broad but integrated approach to incidents, involving thousands of people and dozens of teams. However, the scalable lesson for everyone is that clarity is key:

How much can be handled by automation?
Where can we most effectively use our people?
What roles will each of those people play?
Where are we going to need help?

And of course, after every incident, asking yourself: What have we learned?


This post was written by Microsoft Australia