Application Software Considerations for NUMA-Based Systems

Updated: March 5, 2003

Introduction

The majority of Microsoft Windows compliant, high-end server platforms that will be developed over the next three years--that is, server platforms that can run a single instance of Windows on eight or more processors--will have a cache-coherent non-uniform memory architecture (NUMA).

This white paper is intended to help software vendors understand the NUMA architecture and learn how it might affect execution and performance of applications that will be supported on the Windows Server 2003 operating system. The specific goals of this paper are to:

Introduce the NUMA architecture.

Describe the NUMA features supported in the Windows Server 2003 operating system.

Identify ways that the performance of software applications might be affected when running on a NUMA-based system.

Identify application features that might warrant some level of NUMA investigation to ensure that they perform optimally under Windows Server 2003 on a NUMA system.


NUMA Architecture

As processor clock rates continue to increase, it becomes more difficult and therefore more expensive to provide the bandwidth and latency needed to support large numbers of processors on a single system bus. As a result, the trend has been to support an optimal number of processors per system bus. Supporting an optimal number of processors helps to ensure that the system buses do not create a performance bottleneck and that the development cost is acceptable. Most current Windows–compliant systems support four processors on each system bus.

High-end servers are designed to support more than one system bus. One design approach is to create a number of nodes where each node contains some processors, some memory, and, in some cases, an I/O subsystem. Figure 1 shows a typical node architecture.

Figure 1. Typical Four-Processor NUMA Node Architecture

Note: Although it is possible to create a node that has more or fewer than four processors, this paper assumes the typical case of a four-processor node.

All resources within a node are considered to be local to that node, and access to local memory from within the node is considered to be uniform. It should be noted that I/O might reside within the same node as the processors and memory, or it might reside in dedicated I/O nodes. The architecture of this node is similar to a classic four-way symmetric multi-processor (SMP) system. The main difference between a four-processor NUMA node and an SMP system is the cache-coherent system interconnect that is accessible outside the NUMA node.

To increase system capacity, additional nodes are connected using the high-speed cache-coherent system interconnect, as shown in Figure 2.

Figure 2. Two Four-Processor NUMA Nodes Connected as an Eight-Processor NUMA System


In Figure 2, all eight processors can access memory in both nodes coherently. For example:

A processor in Node 1 can access memory within Node 1 (that is, local or "near" memory) using a direct path through the memory controller in Node 1.

For the same processor to access memory in Node 2 (that is, "remote" or "far" memory), the path taken is through the memory controller in Node 1, out through the system interconnect, and then through the memory controller in Node 2.

It takes more time to access memory in another node than it takes to access local memory. This difference in memory access times is the origin of the name for these systems: non-uniform memory architecture (NUMA).

The ratio of the time taken to access far memory to the time taken to access near memory is referred to as the NUMA ratio. The higher the NUMA ratio -- that is, the greater the disparity between far and near memory access times -- the greater the effect that NUMA characteristics may have on software performance. To ensure that optimal performance can be achieved on this type of system, the Hardware Design Guide for Microsoft Windows 2000 Server Version 3.0 recommends a far-to-near ratio no greater than 3:1.
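
As a worked example of the far-to-near ratio, the following C sketch computes the ratio from two latencies and compares it with the 3:1 recommendation cited above. The function names and the sample latencies used in the note that follows are assumptions chosen for illustration; they are not measurements from any particular system.

```c
#include <assert.h>

/* Recommended upper bound on the far-to-near ratio (3:1). */
#define RECOMMENDED_MAX_NUMA_RATIO 3.0

/* Far-to-near ratio: remote access time divided by local access time. */
double numa_ratio(double far_ns, double near_ns)
{
    assert(near_ns > 0.0);   /* a zero or negative latency is meaningless */
    return far_ns / near_ns;
}

/* Returns 1 if the ratio is within the recommended bound, otherwise 0. */
int ratio_within_recommendation(double far_ns, double near_ns)
{
    return numa_ratio(far_ns, near_ns) <= RECOMMENDED_MAX_NUMA_RATIO;
}
```

With an assumed local latency of 100 ns, a remote latency of 250 ns gives a ratio of 2.5, which is within the recommendation. A remote latency of 400 ns gives a ratio of 4.0, which is not.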

To achieve the best performance on a single operating-system image that is running across multiple NUMA nodes, accesses over the system interconnect must be kept to a minimum. This can only be achieved by adding NUMA support features to the operating system itself, and in some cases to individual applications, as described in this paper.


NUMA Support in Windows Server 2003

The NUMA features described in this section are available in the 32-bit and 64-bit versions of Windows Server 2003, Enterprise Edition and Windows Server 2003, Datacenter Edition. To enable the operating system to provide NUMA enhancements, the hardware must pass a description of the physical topology of the system to the operating system.

In Windows Server 2003, the topology description is passed using a static Advanced Configuration and Power Interface (ACPI) Specification table called the Static Resource Affinity Table (SRAT). The SRAT is constructed by system firmware and is passed to the operating system at boot time as part of the ACPI data structures. Although this table is referenced in Version 2.0 of the ACPI Specification, it can be implemented in ACPI Specification Version 1.0b data structures. For more information on the SRAT, see the "Resources" section of this paper.

Note: On a NUMA system that is divided into several hardware partitions -- that is, where an instance of the operating system is running on each partition -- the SRAT table passed by the firmware describes the hardware resources owned by the particular partition on which the operating system image is running.

The SRAT uses the concept of proximity domains, introduced in ACPI 2.0. Processor and memory resources that are physically located in the same NUMA node can be grouped into the same proximity domain by using the SRAT. This feature allows the firmware to pass the NUMA node topology to the operating system generically. Although the SRAT is a static table, it can be used to indicate where memory might be hot-added in the future to support the Windows Server 2003 hot-add memory feature.

The operating system uses the information provided by the SRAT to support the following NUMA features:

Provide default processor affinity settings for processes and threads

Ensure threads are rescheduled to a processor within the same NUMA node whenever possible

Ensure that memory is allocated locally wherever possible

Provide an API that application software can use to obtain the NUMA topology of the system

Windows Server 2003 Use of the SRAT Information

The operating system uses information from the SRAT to ensure that, wherever possible, threads of execution and the memory that the threads use most heavily are physically located in the same node. This capability keeps most Windows applications running locally within a node and minimizes accesses across the system interconnect, which should provide optimal performance for these applications on NUMA-based systems.

Based on information from the SRAT, the operating system supports the following capabilities:

Threads from the same process are assigned to the same NUMA node by default. During boot on a NUMA system, the operating system, by default, assigns each process to the next NUMA node in the system using a "round-robin" algorithm. Soft processor affinity is used to assign the processes. Each thread created inherits the node affinity of the process -- that is, it is assigned soft affinity to a processor in the same NUMA node that the process was assigned to. This ensures that, wherever possible, the threads for any given process run within the same NUMA node by default.

Threads for a process remain within the same NUMA node. Wherever possible, the scheduler uses the system topology information made available by the SRAT to ensure that threads are rescheduled to a processor in the same NUMA node as the processor on which they previously ran. This helps ensure that threads for a given process remain within the NUMA node to which they were originally assigned.

Memory is allocated in the same NUMA node as the requesting processor. The memory management code creates a paged pool and a non-paged pool for each proximity domain described in the SRAT -- that is, for each NUMA node. When a thread requests memory, the memory manager ensures that, whenever possible, the physical memory is allocated from the same NUMA node as the processor that is running the thread that requested the memory.

At the time the requested memory is committed -- that is, when the memory is paged in -- the memory manager assigns physical memory from the node containing the processor that is running the thread that touched the memory, wherever possible. Physical Address Extension (PAE) memory requested by using the Address Window Extension (AWE) API is also allocated locally to the processor that requested it, wherever possible.

The system topology is made available to application software. The information supplied by the SRAT can be retrieved by applications by using a set of Windows APIs.

In some scenarios, an excess number of threads or memory requests exist within a NUMA node. In these cases, the thread or requested memory may be scheduled or allocated to resources in a remote node.

Application management techniques can help prevent threads or memory from being scheduled or allocated to another node. For more information, see "Thread and Process Affinity and Scheduling" later in this paper.
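
The default round-robin assignment and thread inheritance described above can be modelled in a few lines of C. This is only an illustrative model of the behavior as this paper describes it, not the kernel's implementation. The type node_assigner and all function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical model state for default node assignment. */
typedef struct {
    unsigned node_count;   /* number of NUMA nodes described by the SRAT */
    unsigned next_node;    /* node that the next new process will receive */
} node_assigner;

void assigner_init(node_assigner *a, unsigned node_count)
{
    assert(node_count > 0);
    a->node_count = node_count;
    a->next_node = 0;
}

/* Assign a new process to the next node in round-robin order. */
unsigned assign_process_node(node_assigner *a)
{
    unsigned node = a->next_node;
    a->next_node = (a->next_node + 1) % a->node_count;
    return node;
}

/* A new thread inherits the (soft) node affinity of its process. */
unsigned assign_thread_node(unsigned process_node)
{
    return process_node;
}
```

On a two-node system, successive calls to assign_process_node return 0, 1, 0, and so on. Because the affinity is soft, a real thread may still be scheduled to another node when its own node is oversubscribed; the model does not attempt to show that case.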

NUMA Topology API

The NUMA features in the Windows Server 2003 operating system are automatically enabled when the operating system detects an SRAT describing a NUMA topology. Certain application features, such as setting processor affinity for a thread, require knowledge of the system topology. On a NUMA system, applications should call the NUMA topology API to ensure that a NUMA-aware choice of processors is made. This helps to ensure optimal performance.

The NUMA topology API includes the following functions:

GetNumaHighestNodeNumber returns the highest-numbered NUMA node.

GetNumaProcessorNode returns the node number for the specified processor.

GetNumaNodeProcessorMask returns a bit mask for the processors in a specified node.

GetNumaAvailableMemoryNode returns the amount of memory currently available within a specified node.

Process and thread affinity are discussed in more detail in "Thread and Process Affinity and Scheduling" later in this paper. For more information on the NUMA topology API and on process and thread affinity, see the "Resources" section of this paper.
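
The following C sketch shows one way an application might walk the topology with the functions listed above. The helper count_processors and the function print_numa_topology are hypothetical names. The Windows-specific part is compiled only when _WIN32 is defined, and error handling is deliberately minimal.

```c
#include <assert.h>
#include <stdio.h>

/* Count the processors (set bits) in a node's processor mask. */
unsigned count_processors(unsigned long long mask)
{
    unsigned count = 0;
    while (mask != 0) {
        mask &= mask - 1;   /* clear the lowest set bit */
        ++count;
    }
    return count;
}

#ifdef _WIN32
#include <windows.h>

/* Print each node's processor mask and available memory.
   Returns 0 on success, -1 if the highest node number cannot be read. */
int print_numa_topology(void)
{
    ULONG highest = 0;

    if (!GetNumaHighestNodeNumber(&highest))
        return -1;
    assert(highest <= 0xFF);   /* node numbers are passed as UCHAR below */

    for (ULONG node = 0; node <= highest; ++node) {
        ULONGLONG mask = 0;
        ULONGLONG available = 0;

        if (!GetNumaNodeProcessorMask((UCHAR)node, &mask) ||
            !GetNumaAvailableMemoryNode((UCHAR)node, &available))
            continue;   /* skip nodes that cannot be queried */

        printf("node %lu: %u processors, mask 0x%llx, %llu bytes available\n",
               (unsigned long)node, count_processors(mask),
               (unsigned long long)mask, (unsigned long long)available);
    }
    return 0;
}
#endif
```

An application would typically call a routine like print_numa_topology once at startup and cache the per-node masks, because the SRAT that supplies them is a static table.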


NUMA Effects on Windows Applications

Existing applications should run without modification on NUMA systems. Most applications perform optimally without modification on NUMA systems running Windows Server 2003 because of the automated NUMA features in the operating system. The following describes how the NUMA features in Windows Server 2003 might affect performance on different application architectures:

Single-threaded applications. For applications that are single-threaded, the kernel NUMA features ensure that, wherever possible, all memory requested by the application is allocated in the same node as the requesting processor. This helps minimize remote memory accesses and provide optimal performance.

Single-process multi-threaded applications. For single-process multi-threaded applications that do not require more than the four processors assumed to exist in a NUMA node, the kernel NUMA features ensure that, wherever possible, all memory requested by the application is allocated in the same node as the requesting processor. By default, a process has its affinity set to a NUMA node, and the threads that the process creates inherit that NUMA node affinity. If an application sets processor affinity for its threads, take care to ensure that the threads are assigned to processors within the same node. When affinities for all the application threads are set to processors in the same node, the kernel NUMA features ensure that, wherever possible, all memory requested by the application is allocated in the same node.

Applications that create more than one process. Most applications that create more than one process (where each process requires four or fewer processors) should perform optimally on NUMA systems without any modification. The processes created by the application would be assigned to different NUMA nodes. However, because each process has its own address space, use of common data between the processes is likely to be limited. As a result, remote memory accesses are kept to a minimum.


NUMA Effects on Large, Enterprise-Class Applications

You may need to investigate and possibly modify some application characteristics to achieve optimal performance on NUMA systems. Examples of these application characteristics are:

Setting processor affinity for its threads.

Creating multiple processes that have a high usage rate of shared data.

Creating any process that requires more than four processors.

To improve the performance of applications with these characteristics, follow these general guidelines:

Ensure that threads with heavy dependencies on the same memory structures and data caches have their affinities set to processors in the same node.

Partition data caches and control structures wherever possible to provide node-specific data, and ensure that the threads that most heavily use this node-specific data allocate the required memory on each node.

Many attributes of hardware and software on a NUMA system can affect the performance of software applications, from the hardware NUMA ratio to the thread affinity used. Covering all possible scenarios and providing accurate estimates of the effects of each is beyond the scope of this paper.

The following sections attempt to provide guidelines and suggest remedies for application features that may cause performance issues on NUMA systems. Three areas are discussed:

Thread and process affinity and scheduling

Memory partitioning and allocation

I/O configuration for optimal performance

Thread and Process Affinity and Scheduling

After the default NUMA-aware process and thread processor affinities have been assigned, the NUMA features of the Windows Server 2003 operating system attempt to provide the best performance. Take care to ensure that threads from the same process don't have their processor affinity settings reset to processors in different NUMA nodes. Components that could reset a thread's processor affinity are:

Applications -- using the process and thread processor affinity APIs

Administrators -- using a resource management tool

Application Use of Process and Thread Processor Affinity APIs

Applications can override the default process affinity assignment by reassigning process or thread affinities:

Applications can set hard affinity, using SetThreadAffinityMask, or soft affinity, using SetThreadIdealProcessor, to assign each thread to a particular processor. Hard affinity is more restrictive on thread rescheduling. If thread affinity is assigned, the application should use the NUMA topology API described earlier in this paper to ensure that the processors used are located in the same node.

Applications should keep memory allocation to a minimum before setting application processor affinity because the operating system does not support automated migration of memory allocated from the default assigned node to the node assigned by the application.

An application that creates multiple processes and assigns the threads for each process to processors within the same node should perform optimally on a NUMA system.

Applications that spawn a large number of threads from a single process may not perform optimally when run on a NUMA system. This type of application might create a situation where threads with a shared memory space run across more than one NUMA node. Under this scenario, common data access from these threads is likely to cause increased remote memory accesses, which can degrade application performance.

A performance investigation into such applications running on a NUMA system is warranted. If performance issues are identified, you can optimize the performance of the application with one of the following NUMA optimizations:

Set processor affinities to group threads that have heavy usage of the same data components into the same NUMA node.

Set the affinity of multiple threads to the same processor so that all of the threads can be supported within the same NUMA node.

The goal of both of these optimizations is to maximize the amount of data that resides locally to the threads that most frequently use it.
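
The following C sketch illustrates the guidance in this section: obtain a node's processor mask from the NUMA topology API and then apply either hard or soft affinity to the calling thread. The names mask_within_node, lowest_processor, and bind_current_thread_to_node are hypothetical. The Windows-specific part is compiled only when _WIN32 is defined. On 32-bit Windows the affinity mask is 32 bits wide, so the cast to DWORD_PTR assumes that the node's processors fit in that width.

```c
#include <assert.h>

/* Returns 1 if every processor in thread_mask belongs to node_mask. */
int mask_within_node(unsigned long long thread_mask,
                     unsigned long long node_mask)
{
    return thread_mask != 0 && (thread_mask & ~node_mask) == 0;
}

/* Index of the lowest-numbered processor in a non-empty mask. */
unsigned lowest_processor(unsigned long long mask)
{
    unsigned index = 0;
    assert(mask != 0);
    while ((mask & 1ULL) == 0) {
        mask >>= 1;
        ++index;
    }
    return index;
}

#ifdef _WIN32
#include <windows.h>

/* Restrict the calling thread to the processors of one node.
   use_hard_affinity != 0: SetThreadAffinityMask (hard affinity).
   use_hard_affinity == 0: SetThreadIdealProcessor (soft affinity).
   Returns 1 on success, 0 on failure. */
int bind_current_thread_to_node(UCHAR node, int use_hard_affinity)
{
    ULONGLONG node_mask = 0;

    if (!GetNumaNodeProcessorMask(node, &node_mask) || node_mask == 0)
        return 0;

    if (use_hard_affinity)
        return SetThreadAffinityMask(GetCurrentThread(),
                                     (DWORD_PTR)node_mask) != 0;

    return SetThreadIdealProcessor(GetCurrentThread(),
                                   (DWORD)lowest_processor(node_mask))
           != (DWORD)-1;
}
#endif
```

An application that builds its own affinity masks could use mask_within_node to verify, before calling SetThreadAffinityMask, that a mask does not include processors from a second node. As noted earlier in this section, the binding is best done before the thread allocates much memory.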

Administrator Use of Resource Management Tools

The default NUMA affinity assignment might assign multiple processes, each creating multiple threads, to a single NUMA node. This scenario could cause more than four threads to compete for the processor resources within the NUMA node. Because soft affinity is used with the default assignments, this scenario could lead to a situation where threads cannot be rescheduled within the assigned NUMA node and must be scheduled to processors in other nodes.

To help avoid this problem, you can use a resource management tool, such as Windows System Resource Manager (WSRM), to balance the application load across the processors on the system. You can also use a resource management tool to assign additional resources to a high-priority application.

One mechanism used by resource management tools is to set processor affinities for application processes and threads. Administrators must take the system topology into account when performing such actions on a NUMA system. The affinity settings used must ensure that data and the threads using the data are kept local to a NUMA node wherever possible. If multiple processors are being assigned to threads from the same process, those processors should be in the same NUMA node wherever possible. Other consequences that you should be aware of when using resource management tools to assign processor affinity on NUMA systems are:

The new settings might override affinities set up by the applications. Applications typically set up processor affinity once, usually to configure for optimal performance, and assume these affinities for the rest of the session. No mechanism is available in the operating system to communicate to the applications that the affinities have been reset.

Windows Server 2003 does not include a mechanism to migrate data that was allocated by a thread in the original NUMA node to the new node. Accessing data from the original node results in remote memory accesses. The NUMA support in the operating system might migrate memory to the new node over time, as data is paged out and later paged back in. This migration happens because each time a page of memory is paged in, the operating system attempts to allocate the physical memory for the page in the NUMA node that contains the processor running the thread that touched the memory.
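
Before applying a processor affinity mask to a process, an administrator or tool can check how many NUMA nodes the mask spans. The following C sketch shows such a check. The function name nodes_spanned is hypothetical, and the per-node masks are assumed to have been obtained beforehand, for example from GetNumaNodeProcessorMask.

```c
#include <assert.h>
#include <stddef.h>

/* Number of NUMA nodes that contain at least one processor in 'affinity'.
   node_masks[i] is the processor mask of node i. */
unsigned nodes_spanned(unsigned long long affinity,
                       const unsigned long long *node_masks,
                       size_t node_count)
{
    unsigned spanned = 0;

    assert(node_masks != NULL);
    for (size_t i = 0; i < node_count; ++i) {
        if ((affinity & node_masks[i]) != 0)
            ++spanned;
    }
    return spanned;
}
```

On the two-node system assumed in this paper, with node masks 0x0F and 0xF0, the mask 0x03 spans one node and the mask 0x18 spans two. A result greater than 1 for a single process suggests that the assignment is worth reviewing.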

Memory Partitioning and Allocation

In most software applications, the default NUMA assignment of processor affinities for all processes and threads minimizes the need to access memory across the system interconnect and optimizes application performance. In the few cases where the thread usage for a single process crosses multiple NUMA nodes and thread grouping is used as described earlier in this paper, you should investigate memory usage for potential effects on performance.

If threads are grouped based on memory usage, the operating system ensures that memory requested by the threads is allocated locally wherever possible. However, performance could degrade if all threads need access to global memory components, such as a control structure, data cache, or global lock.

You should investigate the effect of such a data access problem to understand how it affects performance before considering a change. If change is needed, you could update the application to partition any data components to provide, for example, a per-node data cache or control structure with distributed locks. Having the data cache local to the node reduces the requirement to access data over the system interconnect and improves application performance.

Where global application data locks are required, they should be distributed or queued to ensure that they don't lock out requests from remote nodes.
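
As a minimal sketch of partitioning a control structure by node, the following C code replaces one global counter with one slot per node. It assumes a C11 compiler with <stdatomic.h>; the limits MAX_NODES and CACHE_LINE_BYTES and all type and function names are hypothetical. The sketch shows only the partitioning. It does not show how to place each slot in memory that is physically local to its node, which depends on which thread allocates and first touches that memory.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAX_NODES 8          /* assumed upper bound on NUMA nodes */
#define CACHE_LINE_BYTES 64  /* assumed cache-line size */

/* One counter per node, each on its own cache line so that updates
   from one node do not contend for the line used by another node. */
typedef struct {
    _Alignas(CACHE_LINE_BYTES) atomic_long value;
} node_counter;

typedef struct {
    node_counter per_node[MAX_NODES];
} partitioned_counter;

void counter_init(partitioned_counter *c)
{
    for (int i = 0; i < MAX_NODES; ++i)
        atomic_init(&c->per_node[i].value, 0);
}

/* Called by a thread running on 'node': touches only that node's slot. */
void counter_add(partitioned_counter *c, unsigned node, long delta)
{
    assert(node < MAX_NODES);
    atomic_fetch_add(&c->per_node[node].value, delta);
}

/* Infrequent global read: sums the per-node slots (crosses nodes). */
long counter_total(partitioned_counter *c)
{
    long total = 0;
    for (int i = 0; i < MAX_NODES; ++i)
        total += atomic_load(&c->per_node[i].value);
    return total;
}
```

The design trades a more expensive global read for updates that stay within a node, which suits data that is written often and aggregated rarely.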

Another potential problem is excess memory allocation requests within a node, saturating the available per-node memory pool. In this scenario, further memory requests result in memory being allocated from other nodes, which would increase system interconnect accesses and degrade performance.

A system administrator might be able to avoid this scenario by using a resource management tool to balance the application load across the nodes in the system, as described in the "Administrator Use of Resource Management Tools" section of this paper.
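
The spill-over behavior described above can be modelled as follows. This is an illustrative model only, not the memory manager's algorithm. The function name allocate_pages and the choice of the first other node with enough free pages are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Model: try the preferred (local) node first. If its pool cannot
   satisfy the request, fall back to the first other node that can.
   Returns the node used, or -1 if no node has enough free pages. */
int allocate_pages(size_t *free_pages, size_t node_count,
                   size_t preferred_node, size_t pages)
{
    assert(free_pages != NULL && preferred_node < node_count);

    if (free_pages[preferred_node] >= pages) {
        free_pages[preferred_node] -= pages;
        return (int)preferred_node;          /* local allocation */
    }
    for (size_t i = 0; i < node_count; ++i) {
        if (i != preferred_node && free_pages[i] >= pages) {
            free_pages[i] -= pages;
            return (int)i;                   /* remote allocation */
        }
    }
    return -1;                               /* no node can satisfy it */
}
```

With pools of 10 and 100 free pages, a first request for 8 pages from node 0 is satisfied locally, but a second identical request is satisfied from node 1, after which every access to those pages from node 0 crosses the system interconnect.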

I/O Configuration for Optimal Performance

The NUMA features implemented in the Windows Server 2003 operating system address processor and memory locality issues. Features to enhance I/O performance on NUMA systems will be investigated for inclusion in a future release of the Windows operating system. In a NUMA system, the I/O subsystems might be within the processor/memory nodes, or they might be in separate I/O nodes that are connected to the system interconnect. This section suggests ways to configure I/O subsystems for optimal performance on NUMA systems:

If the I/O subsystem resides in the same node as the processor and memory, the goal is to initiate as much of the access to that I/O subsystem as possible from within the node, to minimize I/O access across the system interconnect. Performance would improve most from having Direct Memory Access (DMA) buffers for a given adapter allocated within the same node as the adapter. Because DMA buffers are usually created by the application code, this can best be achieved by ensuring that applications that produce heavy I/O to a given adapter execute locally to that adapter by assigning processor affinity to the application processes and threads as discussed earlier in this paper.

Another way to improve I/O performance on a NUMA system that has an I/O subsystem within the NUMA node is to configure multiple adapters to have access to common media. The optimal solution is to have one or more adapters in each node, a feature that is commonly referred to as Multipath I/O. For a Multipath I/O solution to perform optimally on a NUMA system, the software needs to determine the node in which the memory described in each I/O request resides. The software could then identify an adapter in that node and assign the I/O request to that adapter. This process ensures that the DMA occurs within a node. Microsoft has developed a generic Multipath solution for Windows. This solution requires OEM-specific files, and storage vendors will be distributing it independently of the operating system. It is not clear at this point what level of NUMA support is possible in these solutions.
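
The adapter selection step described above can be sketched in C as follows. The adapter type, its fields, and the function select_adapter are hypothetical. How the software determines the node that holds the I/O buffer is not shown, because this paper does not name a mechanism for it.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int id;          /* hypothetical adapter identifier */
    unsigned node;   /* NUMA node in which the adapter resides */
} adapter;

/* Prefer an adapter in the same node as the I/O buffer so that the
   DMA stays within the node; otherwise fall back to the first adapter.
   Returns NULL only when no adapters exist. */
const adapter *select_adapter(const adapter *adapters, size_t count,
                              unsigned buffer_node)
{
    assert(adapters != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (adapters[i].node == buffer_node)
            return &adapters[i];
    }
    return count > 0 ? &adapters[0] : NULL;
}
```

With one adapter per node, a request whose buffer resides in node 1 is routed to the adapter in node 1. A request for a node that has no adapter falls back to the first path, which still completes the I/O but incurs interconnect traffic.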


Call to Action

Server application developers should determine whether their applications use any of the features that could cause performance issues when executed on a NUMA system.

Administrators of large NUMA systems should read this paper before using a resource management tool to manage application resource use across the system.

System and storage vendors should implement the correct Multipath I/O solution to get the best performance on a NUMA system.


Resources

Hardware Design Guide for Windows Server, Version 3.0

Static Resource Affinity Table

Information on the Windows Server 2003 NUMA application APIs and on process and thread affinity is available in the MSDN Library.

