Improving reliability with Windows Analytics Device Health
Microsoft Core Services Engineering and Operations (CSEO) manages the performance and health of roughly 300,000 Windows 10 devices. We deployed the newest addition to the Windows Analytics services suite, Device Health. Device Health works within the Microsoft Operations Management Suite (OMS) and complements existing Upgrade Readiness and Update Compliance services. It reports on some common problems that our Windows 10 users might experience and helps us to proactively remediate issues.
In the first month of using Device Health, we were able to enhance worldwide productivity and improve the user experience by eliminating thousands of blue-screen crashes. Device Health showed which third-party device driver was triggering crashes, showed which devices were affected, and revealed a better-performing version we could deploy—saving us about 3,000 hours of employee productivity per month. (Details are in the Device Reliability section.)
Our inventory and management tools were very good at collecting data about devices, drivers, and applications, but until we started using Device Health it was difficult and time-consuming to weed through the data and learn what was going on with the devices in our environment. Our reaction times were constrained by the amount of time it took to find out if an issue was part of a trend and to identify commonalities of the devices experiencing the issue.
We looked at issues that were coming in through help-desk support calls, but we did not have all the information we needed to recognize the scope or cause of many issues, including the number of people experiencing a problem. Too often, only a small percentage of users contact the help desk to report that they are having recurrent crashes; most users simply reboot and don’t call the help desk. We can now see crashes before employees contact the help desk (and even when they never inform the help desk). We can see when a driver started causing crashes, which version of the driver caused the crashes, and we can identify the root error code. We can see which of our core applications are running well and which versions are causing crashes and hangs for our employees. We use that information to address classes of issues for all our users instead of responding to each issue.
Device Health also gives us visibility into Windows Information Protection App Learning. It helps us refine our app rules and fine-tune how we use Windows Information Protection to protect company data. And it shows us the success rates of our login authentication solutions, and what failure codes employees are seeing when logins fail. Device Health uses diagnostic data that is built into all Windows 10 devices. We didn’t need to install any agents, and cloud-connected access using Windows 10 telemetry through OMS made it easy to enable Device Health analytics for proactive monitoring of device health and performance—without the need to invest in new, complex, or customized solutions or infrastructure.
(Note that the OMS portal will be officially retired on January 15, 2019. We are excited to move to the Azure portal and expect the transition to be easy. For more information, see the article on the OMS portal moving to Azure.)
Enabling Device Health analytics
To get started with Device Health analytics, we simply needed to add the solution to our OMS workspace through the OMS Solutions Gallery.
On the client side, most of our devices were already running Windows 10 and had Enhanced telemetry enabled, as required for Device Health analytics. If we had not already enabled other Windows Analytics services, we would have needed to use Group Policies in System Center Configuration Manager and Microsoft Intune to set telemetry levels and deploy our unique Microsoft commercial ID to employee computers.
Benefits of enabling telemetry
When you allow Windows telemetry to be sent from your managed devices to Microsoft, the data collected helps Microsoft improve products and solve problems that may be affecting your enterprise. That same data drives your Windows Analytics experience to provide proactive insights and optimization opportunities for your environment. For more information about Windows telemetry, see Configure Windows telemetry in your organization.
Device Health solution workflow
As Figure 1 shows, telemetry collected from employee devices is sent to a secure Microsoft data center using the Microsoft Data Management Service. The telemetry data is analyzed by Windows Analytics and is then pushed into our OMS workspace, where it can be accessed through the Device Health solution.
Figure 1. Device Health solution workflow.
Using Device Health analytics
Using Device Health, we can easily see Device Reliability information, including devices that crash frequently, device drivers that are causing device crashes, in-use applications that are reporting failures, and login-type success rates and failure errors. We can also see Windows Information Protection App Learning activity that can indicate that a policy is misconfigured and is causing excessive or unnecessary user prompts.
Device Reliability provides overviews for devices that crash frequently, drivers that are inducing crashes, applications reporting failure events, and login-type success rates and failure errors. Figure 2 shows how we can click through into more details and sort, group, and filter to understand different sub-populations.
Figure 2. Device Health panes in OMS.
Driver-induced OS crashes
We have focused most of our proactive monitoring efforts around tracking, resolving, and mitigating device driver crashes. For any listed driver, we can open the driver perspective view, which shows details for the responsible driver including crash rates for each version of the driver. We can also view trends and compare our environment against the broader enterprise Windows ecosystem.
Device Health showed that a particular networking device driver had caused about 6,000 crashes across about 3,500 devices during a two-week period. Our testing team had earlier ruled out deploying a newer version of that driver because they experienced blue-screen crashes on some machines during their manual testing of the driver. Device Health, however, showed that in aggregate the newer version of that driver was actually performing much better in our environment than small-scale manual testing had indicated.
We were able to use this data to prove that, while some devices had indeed crashed with the new driver, the crash rate was significantly lower than it was with the older driver in broad deployment. We used Device Health to make a data-driven decision on the best solution to the problem.
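The comparison described above, crashes per device for each driver version, can be illustrated with a small sketch. Device Health computes these rates itself; the function name, the device IDs, and the "old"/"new" version labels below are hypothetical placeholders, not data from our environment.

```python
from collections import defaultdict


def crash_rate_by_version(installed, crashes):
    """Return crashes per device for each driver version.

    installed: dict mapping device ID -> driver version installed on it
    crashes:   iterable of device IDs, one entry per crash event
    """
    # Count how many devices run each version.
    devices = defaultdict(int)
    for version in installed.values():
        devices[version] += 1

    # Attribute each crash event to the version on the crashing device.
    events = defaultdict(int)
    for device_id in crashes:
        events[installed[device_id]] += 1

    return {version: events[version] / count for version, count in devices.items()}


# Hypothetical sample: the newer version still crashes on one device,
# but its overall rate is far lower than the older version's.
installed = {
    "pc1": "old", "pc2": "old",
    "pc3": "new", "pc4": "new", "pc5": "new", "pc6": "new",
}
crashes = ["pc1", "pc1", "pc2", "pc3"]
rates = crash_rate_by_version(installed, crashes)
print(rates)
```

Normalizing by the number of devices on each version is what makes the two versions comparable: a handful of crashes seen in manual testing says little until it is divided by the size of the population running that version.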
Two weeks later, after deploying the preferred version to most machines, the crash rate for that driver was approaching zero. We estimate about 15 minutes of lost productivity per blue-screen crash (reboot, sign in, recover documents, rejoin meetings, and so on). At about 6,000 crashes every two weeks, this driver update recovered about 1,500 hours of company time every two weeks!
The driver version table helps us determine which version of a driver might help reduce the crash rate. In the case discussed here, we used Device Health along with other internal signals (such as help-desk calls and other reports) to prioritize deployment of the 220.127.116.11 version. Figure 3 illustrates our view of the driver after it reached broad deployment.
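The productivity estimate is simple arithmetic on figures already given: about 6,000 crashes in two weeks, at about 15 minutes lost per crash. The short sketch below reproduces it. The function name is illustrative, and treating a month as two two-week periods is a simplifying assumption.

```python
def hours_lost(crashes, minutes_per_crash=15):
    """Convert a crash count into hours of lost productivity."""
    return crashes * minutes_per_crash / 60


# About 6,000 driver-induced crashes in a two-week period.
two_week_hours = hours_lost(6000)

# Assumption: approximate a month as two two-week periods.
monthly_hours = two_week_hours * 2

print(two_week_hours, monthly_hours)  # 1500.0 3000.0
```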
Figure 3. Driver-Induced OS Crashes driver version table
Using Device Health, along with internal reports and tools, we exported the list of affected devices into System Center Configuration Manager and deployed the preferred driver.
The trending graphs, shown in Figure 4, have been useful in helping us track and report on our overall progress resolving driver issues. The trending graph on the left shows a trailing 14-day count of unique devices that were affected. It peaked at around 3,500 unique devices. It fell dramatically as we deployed the updated driver.
Figure 4. Driver-induced OS Crashes trend data
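A trailing 14-day count of unique affected devices, like the one in the left-hand trend graph, can be sketched as follows. This is only an illustration of the metric, not how Device Health computes it. The event format, the device IDs, and the choice of an inclusive 14-calendar-day window are assumptions.

```python
from datetime import date, timedelta


def trailing_unique_devices(events, as_of, window_days=14):
    """Count distinct devices with at least one crash in the window.

    events: iterable of (crash_date, device_id) pairs
    as_of:  last day of the window (inclusive)
    """
    start = as_of - timedelta(days=window_days - 1)
    return len({device for day, device in events if start <= day <= as_of})


# Hypothetical crash events; repeated crashes on one device count once.
events = [
    (date(2018, 3, 1), "pc1"),
    (date(2018, 3, 1), "pc1"),
    (date(2018, 3, 5), "pc2"),
    (date(2018, 3, 14), "pc3"),
    (date(2018, 3, 20), "pc3"),
]

during_outbreak = trailing_unique_devices(events, date(2018, 3, 14))
after_fix = trailing_unique_devices(events, date(2018, 3, 28))
print(during_outbreak, after_fix)  # 3 1
```

Counting unique devices rather than raw crash events keeps one badly affected machine from dominating the trend, which is why the graph falls as the updated driver reaches more devices.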
We also use trend data to help us determine whether driver-induced crashes for a specific driver are trending up or down after a regular update release. If we see an increase in crashes, we can quickly identify the trend and start investigating the issue.
Many driver-induced crashes don’t have an immediately available path for mitigation or resolution, and we must go back to the product group or the vendor and ask for help in resolving issues. Device Health analytics has vastly improved the quality of information we can give them. We can tell them what version of the driver is crashing and when it began crashing, and we can provide data specific to the machine configuration that will help them resolve the issue.
Our app reliability feature (shown in Figure 5) gives us insights into the reliability of our employees’ applications so we can take action to improve productivity. It lets us see which versions of our employees’ apps are encountering crashes (the app stops unexpectedly) or hangs (the app stops responding), and which versions are providing a better experience.
Figure 5. App Reliability trend data
We track login success rates (shown in Figure 6) so we can see when errors occur, learn from them, and quickly deploy fixes. We use this data to track how readily our users adopt the new authentication options we deploy to them (such as Windows Hello facial recognition) and how often those logins succeed. It also helps us promote our Windows Hello solutions (which are quite good) so we can work on moving past passwords.
Figure 6. Login Health trend data
Frequently crashing devices
This view can be used to identify problem devices that may need to be wiped and reloaded or, perhaps, replaced. Clicking through any device in the list shows details such as what kinds of crashes occurred and when.
The device-level data can also be used to keep a closer eye on critical devices, such as kiosks and devices used by senior executives. We are evaluating help desk and IT engineering procedures (such as potential proactive outreach to users with unhealthy devices) to decide how we will operationalize the Frequently Crashing Devices information.
Windows Information Protection
User productivity can be disrupted if Windows Information Protection rules are not aligned with real work behavior. Windows Information Protection App Learning in Device Health makes it easy for us to see which apps—on which devices—would receive user prompts when they attempt to cross policy boundaries. That visibility makes it much easier to determine whether we have the correct policies in place for the right apps. We are currently running Windows Information Protection in Silent Mode, and we are fine-tuning our Windows Information Protection policies to protect work data from accidental sharing.
Before Device Health, we would have needed to collect Windows Information Protection audit logs from every Windows 10 device in the environment, and have a data analyst create custom queries to provide usable views of the data. As shown in Figure 7, we now have a chart view that displays the apps that are hitting Windows Information Protection policy boundaries.
Figure 7. Device Health Windows Information Protection App Learning
We can click through the chart view to view details that we use to identify apps that are triggering Windows Information Protection incidents.
Device Health is a key solution
Device Health has quickly become a key solution that we use daily to monitor and track the health of devices and drivers in the organization. It allows us to discover issues directly through Windows 10 telemetry, which reduces reliance on client-side agents and scripts to track health. The information can also be exported from OMS into Power BI, so we are looking at opportunities to integrate information from Device Health with data from System Center Configuration Manager, help desk, and other systems to help us make faster decisions and improve employee productivity and satisfaction.
For more information
Microsoft IT Showcase
© 2018 Microsoft Corporation. This document is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, IN THIS SUMMARY. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.