| Introduction | |
| What are MCA Events? | |
| Dispatch Flow for MCA Events | |
| MCA Error Record Format | |
| Windows 64-bit MCA Features | |
| Post-Windows XP/Server 2003 64-bit MCA Features | |
| Call to Action and Resources |
The 64-bit versions of Microsoft Windows XP Professional and Windows Server 2003 support the Machine Check Architecture (MCA) defined by Intel for the Itanium processor. This paper describes the MCA support model for these releases of the Windows operating system. The information in this paper is intended to help manufacturers plan support for MCA on systems running 64-bit Windows.
As computer systems grow larger and the number and density of components increase, hardware failure rates for each system are also likely to increase. At the same time, it is becoming increasingly important for large enterprise-class computer systems to have higher levels of reliability, availability, and serviceability (RAS). To address these requirements, Intel has defined a Machine Check Architecture (MCA) for the Itanium processor. MCA is a hardware and software architecture for Itanium that is intended to improve the RAS features of the system.
Machine Check Architecture (MCA) on Itanium-based computer systems refers to:
| • | Hardware components responsible for identifying and delivering hardware error events to software. |
| • | Software components responsible for parsing, identifying, analyzing, and (if possible) resolving these events. |
This white paper describes changes to the delivery mechanisms for hardware error events on Itanium-based platforms from the delivery mechanisms used on IA-32. It also describes how Windows will implement a set of features to support these changes and further enhance the overall reliability and availability of the system.
This paper makes references to the following documents:
| • | The System Abstraction Layer (SAL) implementation is based on Itanium Processor Family System Abstraction Layer Specification, January 2001, published by Intel Corporation and available for download from http://developer.intel.com |
| • | The processor abstraction layer (PAL) implementation discussed in this paper refers to the Processor Abstraction Layer firmware produced by Intel Corporation. Details of the PAL can be found in Volume 2 of the Intel IA64 Architecture Software Developer’s Manual, which is available for download from http://developer.intel.com |
| • | The MCA Implementation Guide for 64-bit Windows is an implementation guide for MCA on the Windows XP and Windows Server 2003 64-bit operating system releases. This guide is for system manufacturers and developers of management software. |
References to Windows XP and Server 2003 64-bit in this document refer to the Windows XP Professional 64-bit and Windows Server 2003 Enterprise and Datacenter 64-bit editions.
Almost all MCA events are hardware errors, but not every hardware error is reported as an MCA event. For example, a failure in a memory device in system memory would probably be reported as an MCA event, but a component failure on a hard disk drive would probably not be reported to the operating system as an MCA event. The difference between these two events is the hardware mechanism that is used to alert the CPU that the event occurred.
All MCA events fall within two categories based on the hardware reporting mechanism used. These are described in the SAL 3.0 Specification as follows:
| • | CPU errors. Errors that either occur within components of the CPU itself, such as the caches, or that are detected on the CPU front side bus (FSB) during an external transaction. |
| • | Platform errors. Errors that are delivered to the CPU through external pins that are provided for reporting MCA events. Further detail on these pins is provided in section four of the SAL 3.0 Specification. |
Any hardware error event that does not fall within these two categories is not delivered or processed as an MCA event. There is some flexibility in this model. System designers can connect hardware error event signals to the pins provided on the CPU to ensure that the errors are processed as MCA events. However, an MCA event must successfully identify the component that failed, or the component that is likely to fail in the future. For peripherals on an I/O bus such as Peripheral Component Interconnect (PCI), there is no standardized direct link from peripheral devices to the system chipset that could be used to raise an MCA event that identifies the failing component. MCA does, however, provide full error handling for the standard PCI fatal error types such as PERR and SERR.
MCA events can be further categorized based on the severity of the error. Each error delivered to the operating system falls within one of the following categories:
| • | Non-corrected or fatal errors. Errors that could lead to non-recoverable data loss or corruption. This type of error will cause the system to be restarted. Examples might be a parity error on the PCI bus or a double-bit error correction code (ECC) error in system memory. Note: Non-corrected or fatal errors are referred to as machine check abort (MCA) events. This term can be confused with Machine Check Architecture (MCA). In this paper, MCA always refers to Machine Check Architecture. |
| • | Corrected errors. Errors that can be corrected either by the hardware or by some level of software. The occurrence of a corrected error can indicate instability in the hardware and can be used to predict future fatal errors. The more sources of corrected errors within the system, the better the quality of the system’s error prediction. Examples of this type of error might be a double-bit ECC error in a clean cache block or a single-bit ECC error in system memory. There are two types of corrected error: corrected machine check (CMC) events, which are corrected CPU errors, and corrected platform error (CPE) events. |
Different reporting mechanisms are used for fatal errors versus corrected errors:
| • | Fatal errors require the operating system to deal with a potentially catastrophic hardware problem immediately. |
| • | Corrected errors do not have to be dealt with immediately because they pose no serious risk to the integrity of the system. |
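The two classifications above, by reporting mechanism and by severity, can be sketched as a small data model. This is an illustrative sketch only: the names `ErrorSource`, `Severity`, `McaEvent`, and `needs_immediate_handling` are hypothetical and do not come from the SAL specification or from Windows.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorSource(Enum):
    CPU = "cpu"            # within the CPU, or detected on the front side bus
    PLATFORM = "platform"  # delivered through the external MCA pins


class Severity(Enum):
    FATAL = "fatal"          # non-corrected; the system will be restarted
    CORRECTED = "corrected"  # corrected by hardware or by software


@dataclass(frozen=True)
class McaEvent:
    source: ErrorSource
    severity: Severity


def needs_immediate_handling(event: McaEvent) -> bool:
    """Fatal errors must be handled at once; corrected errors can wait."""
    return event.severity is Severity.FATAL
```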
The most significant difference between the IA-32 MCA model and the Itanium MCA model is that with the Itanium MCA model, the operating system depends on external software to provide the hardware context for MCA events. As a result, it is critical that all Itanium-based Windows systems support a generic definition for the format of this hardware context data structure, as described in “MCA Error Record Format” later in this paper.
Dispatch mechanisms. Itanium-based systems use different mechanisms to dispatch MCA events to software than IA-32-based systems. On IA-32-based systems, the CPU detects the error and dispatches directly into the operating system. The operating system is responsible for gathering the error context and identifying the error source. On Itanium-based systems, the dispatch flow has changed and the action taken depends on the category of the error event. The dispatch flows for Itanium are described in chapter four of the SAL 3.0 Specification. Figure 1 later in this paper shows the Windows XP and Server 2003 64-bit specific dispatch flows.
Dispatch flow on Itanium-based systems. All fatal MCA events are handled first by firmware and then dispatched into the operating system using the operating system MCA handler address. During the boot process, the operating system MCA code registers this fatal error (machine check abort) handler address with the SAL firmware. The firmware uses this dispatch point for all fatal MCA events.
Corrected events are initially handled by firmware too. These events can be dispatched to the operating system using interrupts, or can be polled for by the operating system. The SAL has the option to use ACPI to provide interrupt information for dispatching platform-corrected error events. If no platform interrupt information is provided, the operating system enters polled mode for corrected events. In this mode, the operating system polls the SAL at regular intervals for any corrected MCA events that might have occurred.
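The choice between interrupt-driven and polled delivery of corrected events might be modeled as follows. This is a minimal sketch under stated assumptions: `corrected_event_mode`, `poll_sal_once`, and the `fetch_records` callback are hypothetical names, and a real implementation would call into the SAL rather than into a Python callback.

```python
from typing import Callable, List, Optional


def corrected_event_mode(platform_interrupt: Optional[int]) -> str:
    """Pick the delivery mode for corrected platform errors."""
    # Assumption: None means the firmware supplied no interrupt information.
    return "interrupt" if platform_interrupt is not None else "polled"


def poll_sal_once(fetch_records: Callable[[], List[bytes]],
                  handle: Callable[[bytes], None]) -> int:
    """One polling pass: hand each pending corrected record to `handle`."""
    records = fetch_records()
    for record in records:
        handle(record)
    return len(records)
```

In polled mode, the operating system would run a pass like `poll_sal_once` at a regular interval; the interval itself is not specified in this paper.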
When the CPU detects an error during normal operation, one of two things can happen:
| • | If the error is a CPU error that has been corrected by the hardware, the CPU saves error context within the CPU hardware and returns to the interrupted flow. If CMC interrupts are enabled, a CMC interrupt is dispatched to the operating system at some point after the event. |
| • | In all other cases, the CPU dispatches into the processor abstraction layer (PAL) code. PAL is the firmware layer that abstracts processor implementation-specific features. |
If the CPU dispatches into the PAL, the dispatch flow follows these steps:
1. | If the error is a CPU error, the PAL attempts to recover from the error. |
2. | Whether or not the PAL succeeds in recovering from the error, it saves the CPU context to memory and dispatches into the system abstraction layer (SAL) firmware. |
3. | If the error is a platform error, the SAL attempts to recover from the error. |
4. | If either the PAL or the SAL recovers the error, the SAL returns control to the PAL, which then returns control to the interrupted flow. |
5. | At some point after the event, a CPE or CMC interrupt is dispatched. |
6. | If the error is not recoverable, the SAL dispatches into the operating system MCA code. |
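The dispatch steps above can be traced with a short simulation. This is a sketch of the flow as described in this paper, not firmware code: the function name, its parameters, and the step strings are hypothetical, and the mapping of CPU errors to a CMC interrupt and platform errors to a CPE interrupt is an assumption based on the names of those events.

```python
from typing import List


def dispatch_flow(is_cpu_error: bool, hw_corrected: bool = False,
                  pal_recovers: bool = False, sal_recovers: bool = False,
                  cmc_enabled: bool = True) -> List[str]:
    """Return the ordered steps taken for one detected error."""
    if is_cpu_error and hw_corrected:
        # Hardware-corrected CPU error: no firmware involvement.
        steps = ["CPU saves context", "resume interrupted flow"]
        if cmc_enabled:
            steps.append("CMC interrupt")
        return steps

    steps = ["enter PAL"]
    recovered = False
    if is_cpu_error:
        steps.append("PAL attempts recovery")
        recovered = pal_recovers
    steps.append("PAL saves CPU context and enters SAL")
    if not is_cpu_error:
        steps.append("SAL attempts recovery")
        recovered = sal_recovers

    if recovered:
        steps.append("resume interrupted flow")
        # Assumption: CMC for CPU errors, CPE for platform errors.
        steps.append("CMC interrupt" if is_cpu_error else "CPE interrupt")
    else:
        steps.append("OS MCA handler")
    return steps
```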
To ensure the operating system can parse the error and analyze the error severity and root cause, the error record must be in a generic format. The minimum error record format required by the 64-bit versions of Windows XP and Windows Server 2003 is the format defined in Appendix B of the SAL 3.0 Specification. This format will become a logo requirement for the Windows Server 2003 Enterprise and Datacenter 64-bit releases.
This error record structure defines seven classes of error: one CPU class and six platform classes. The CPU class and the following two platform classes are generic enough that the operating system can parse the error record and perform an analysis to identify the source of the error:
| • | System memory |
| • | PCI bus |
The remaining four platform classes are more platform-specific and the operating system can only identify the class of the error:
| • | PCI component |
| • | SMBIOS |
| • | System event log (SEL) |
| • | Platform specific |
To fully analyze and identify the cause of these more platform-specific classes of error, the operating system needs assistance from the platform OEM, as described in “Windows XP/Server 2003 64-bit MCA features” later in this paper.
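The split between the generic and the platform-specific error classes can be expressed as a simple lookup. This is an illustrative sketch: the class names are taken from the lists above, but the string labels, the constant names, and `analysis_level` are hypothetical and are not identifiers from the SAL error record format.

```python
# Classes the operating system can parse well enough to identify the source.
GENERIC_CLASSES = frozenset({"CPU", "System memory", "PCI bus"})

# Classes for which the operating system can identify only the class.
PLATFORM_SPECIFIC_CLASSES = frozenset(
    {"PCI component", "SMBIOS", "System event log (SEL)", "Platform specific"}
)


def analysis_level(error_class: str) -> str:
    """Return how far the operating system alone can analyze this class."""
    if error_class in GENERIC_CLASSES:
        return "source"
    if error_class in PLATFORM_SPECIFIC_CLASSES:
        return "class-only"
    raise ValueError(f"unknown error class: {error_class!r}")
```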
Operating systems can provide three basic features for MCA:
| • | Error logging. The operating system can consistently save the error context on every hardware error to ensure that the source of the error can be identified when the system is restarted. This is critical for fatal error events. |
| • | Error prediction. The operating system can use corrected error information to provide error analysis and fatal error predictions to the system administrator or to system support staff. |
| • | Error recovery. In some cases, the operating system can take corrective action that would work around the failing hardware and allow the system to continue without crashing. |
The first implementation of Windows 64-bit operating system MCA code will be in the Windows XP 64-bit Professional and the Windows Server 2003 Enterprise and Datacenter 64-bit releases. This implementation will consist primarily of error logging and error prediction features.
Figure 1 shows the MCA flow within the operating system for Windows Server 2003 64-bit editions.

Figure 1. Windows XP/Server 2003 64-bit MCA flow.
The Kernel/HAL provides a handler for each of the three MCA events that can be delivered from the hardware. In addition, it supports polled mode for corrected errors, as described in “Dispatch Flow for MCA Events.”
Microsoft Windows Management Instrumentation (WMI) has built-in classes and events for 64-bit MCA in the Windows XP and Windows Server 2003 64-bit releases.
The rest of this section describes the MCA features in the Windows XP and Server 2003 64-bit releases.
Error records for both CMC and CPE events are delivered to Windows Management Instrumentation (WMI). WMI executes the following steps:
1. | Writes the error record and an analysis string into the event log. The analysis string is a high-level description of the error that is produced by parsing the error record fields that the operating system understands. Analysis strings for the more platform-specific error classes will not provide information on the source of the error. A local dialog box will be provided to alert the user to the event log entry. |
2. | Delivers the MCA event to any MCA consumers that have registered for delivery. An MCA consumer might be a management application such as Microsoft Operations Manager (MOM). The event consumer can parse the error record and notify an administrator or support person that the error occurred. The management software can provide a full analysis of the hardware error and provide information on the replaceable unit. Analysis of more platform-specific errors would require management software from the system OEM. |
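The two steps above follow a log-then-notify pattern, which can be sketched as follows. This is a toy model only: `McaEventHub` and its methods are hypothetical names and do not represent the WMI programming interface, for which the MCA Implementation Guide for 64-bit Windows is the reference.

```python
from typing import Callable, List, Tuple

Consumer = Callable[[bytes], None]


class McaEventHub:
    """Toy stand-in for the logging and delivery steps described above."""

    def __init__(self) -> None:
        self.event_log: List[Tuple[bytes, str]] = []
        self._consumers: List[Consumer] = []

    def register(self, consumer: Consumer) -> None:
        """Register a consumer, such as a management application."""
        self._consumers.append(consumer)

    def deliver(self, record: bytes, analysis: str) -> None:
        # Step 1: write the raw record and its analysis string to the log.
        self.event_log.append((record, analysis))
        # Step 2: notify every registered consumer.
        for consumer in self._consumers:
            consumer(record)
```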
These capabilities can be used to support features like remote health monitoring of the system, failure prediction, and component swap out before a fatal error occurs. The more sources of corrected errors there are in the hardware, the better the error prediction will be. These features can go a long way to improve the reliability and availability of the system.
When a fatal error event is delivered to the operating system, the handling is different from that performed on corrected errors. For example:
| • | There is no fatal error recovery support in the 64-bit MCA code in Windows XP and Server 2003. The operating system will attempt a controlled shutdown to ensure that the error is contained. Information will be provided on the stop error screen (blue screen) to indicate the class of the MCA error. |
| • | On each reboot, the operating system MCA code polls the SAL for any unprocessed error records. The SAL must maintain error records for fatal events in NVRAM to ensure that they are available after a reboot or power cycle of the system. If an error record is returned by the SAL, the operating system delivers it to WMI, which in turn writes it to the event log and delivers it to any MCA event consumers that have registered, as described earlier. |
| • | To allow for a scenario in which the operating system cannot reboot after a fatal error event, the SAL should provide an error analysis screen that parses the error record stored in NVRAM, identifies the cause of the error, and presents the information to the user. |
Maintaining error records for fatal errors in NVRAM ensures that an error record that describes the cause of the error is automatically saved for every fatal event that occurs on every system running 64-bit Windows XP and Windows Server 2003. This provides the following benefits:
| • | The error record is available on the next reboot to identify the failing hardware component. |
| • | Saving the error record requires no manual input and ensures that the record is persistently saved across system resets and power cycles until it has been processed. |
These features help ensure that no system crashes more than once as the result of any fatal hardware error. After the system is rebooted, notification of the fatal event can be delivered remotely to the system administrator or support staff using management software. This is a powerful feature that enhances availability and serviceability.
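The reboot-time handling of saved fatal error records might look like the following. This is a minimal sketch that models NVRAM as a list of dictionaries with a `processed` flag; that layout, and the name `process_saved_records`, are assumptions made for illustration and do not reflect how the SAL actually stores or marks records.

```python
from typing import Callable, Dict, List


def process_saved_records(nvram: List[Dict],
                          deliver: Callable[[bytes], None]) -> int:
    """Deliver each unprocessed record once and mark it as processed."""
    delivered = 0
    for entry in nvram:
        if not entry["processed"]:
            deliver(entry["record"])
            # The record stays in storage until it has been processed.
            entry["processed"] = True
            delivered += 1
    return delivered
```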
For information about access to error records from the event log or through WMI, see the MCA Implementation Guide for 64-bit Windows.
There is no fatal error recovery support in Windows Server 2003. However, the operating system MCA code monitors corrected errors and checks for related events. Related events are single-bit ECC errors to the same physical page or corrected errors on the same CPU. This capability allows the operating system MCA code to keep a count of the number of related events and take action to avoid the occurrence of non-correctable events, such as multi-bit ECC errors.
The operating system MCA code automatically attempts to remove any physical page of memory that experiences more than a preset limit of corrected error events. The data from the physical page is copied to another physical page and the failing physical page is no longer used by the operating system. When a page is removed in this way, an event log entry and a WMI MCA event will be created.
After a page has been mapped out, it will not be used again as long as the operating system continues to execute. However, if the system is rebooted, the operating system will not remember which physical pages were removed in a previous session and will use all physical memory that is made available by the system firmware.
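The page removal policy can be sketched as a per-page counter with a threshold. This is an illustrative sketch only: `PageErrorMonitor` is a hypothetical name, and the value of the preset limit is an assumption because this paper does not state it.

```python
from collections import Counter
from typing import Set


class PageErrorMonitor:
    """Counts corrected errors per physical page for one boot session."""

    def __init__(self, limit: int) -> None:
        self.limit = limit  # assumed value; the real limit is not given
        self.counts: Counter = Counter()
        self.removed_pages: Set[int] = set()

    def record_corrected_error(self, page: int) -> bool:
        """Count one error; return True if this call removes the page."""
        if page in self.removed_pages:
            return False
        self.counts[page] += 1
        # A page is removed once it exceeds the preset limit.
        if self.counts[page] > self.limit:
            self.removed_pages.add(page)
            return True
        return False
```

Because the state lives only in memory, creating a new `PageErrorMonitor` models a reboot: the set of removed pages starts out empty again.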
Note: This section discusses the operating system error recovery features that might be possible in the future. The information in this section does not represent a commitment to provide any of the features discussed.
The MCA support in the Windows XP and Windows Server 2003 64-bit operating system releases can be expanded to provide operating system recovery from MCA events that were previously considered fatal.
Error recovery is performed at different levels within the system. Some errors can be corrected by the hardware, some by the PAL, and yet more can be corrected by the SAL. Some categories of error require access to operating system context for correction. The use of a standard error record to pass the error information to the operating system MCA code is a key enabler for future operating system MCA error recovery.
For certain types of platform-specific errors, the operating system will not have enough information to attempt recovery, as described in “Windows XP/Server 2003 64-bit MCA Features” earlier in this paper. The SAL would have an opportunity to recover from these classes of errors.
The error recovery techniques to be implemented in post-Windows XP and Server 2003 64-bit releases are still in the planning stages. Work is ongoing to investigate and prioritize the error scenarios from which the operating system could recover.
Error containment is essential to future operating system MCA error recovery work. OEMs can use hardware error containment techniques to restrict error propagation. The Windows XP and Server 2003 64-bit error record allows the SAL to communicate the error containment status to the operating system. The better the error containment, the better chance the operating system MCA code has of recovering from the error.
The most obvious recovery scenario would be a fatal ECC error in memory. With each error type, there are levels of correction that can be applied. Generally, cruder recovery techniques are easier to implement, but results are not optimal. For the fatal-memory error scenario, if the failing address is in user space, the process owning the memory can be terminated and the page containing the failing location can be removed from the operating system working set. This is a crude solution, but is better than a system crash.
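The crude recovery policy for the fatal memory error scenario can be written as a short decision function. This is a hypothetical sketch of a possible future behavior, not a description of any Windows release: the function name and the action strings are invented for illustration.

```python
from typing import List


def handle_fatal_memory_error(address_in_user_space: bool) -> List[str]:
    """Return the actions for a fatal ECC error at a failing address."""
    if address_in_user_space:
        # Crude but better than a crash: sacrifice the owning process.
        return ["terminate owning process", "remove failing page"]
    # No recovery path is described for addresses outside user space.
    return ["stop system"]
```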
The Windows XP and Windows Server 2003 64-bit MCA code provides the interface to the operating system for all hardware errors within the system that are reported as machine checks. As new reliability features such as multi-path I/O or dynamic system partitioning are added to future releases of the operating system, MCA can play a significant role in enabling these features to recover from what would previously be fatal hardware errors.
Manufacturers of Itanium-based systems should plan to support MCA in 64-bit Windows XP and Windows Server 2003 by doing the following:
| • | Read and understand the PAL and SAL specifications and the MCA Implementation Guide for 64-bit Windows. |
| • | Plan your MCA implementation to produce error records in the generic format defined in Appendix B of the SAL 3.0 Specification. |
Itanium Processor Family System Abstraction Layer Specification, January 2001, http://developer.intel.com
Intel IA64 Architecture Software Developer’s Manual, Volume 2, Section 11, IA64 Processor Abstraction Layer, http://developer.intel.com