
Windows HPC Server 2008

Performance Benchmarking

Windows HPC Server 2008 R2 Benchmarks

Windows HPC breaks Petaflop barrier and grabs #3 and #5 on Little Green500 List

  • Executive Summary

    Windows HPC Server 2008 R2 broke the petaflop barrier on Tsubame 2.0, the first petascale cluster in Japan, housed at the Tokyo Institute of Technology in Tokyo. The Windows run reached 1.127 PF on 1296 nodes and would have ranked #4 on the TOP500 list. Breaking the petaflop barrier is an important milestone for Windows HPC in cutting-edge performance and scalability. In addition, two supercomputing clusters running Windows HPC Server captured the #3 and #5 spots on the Little Green500 List. The Green500 list reflects the trend in the supercomputing world toward consciously building eco-friendly data centers. Placing two of the top 5 systems on this list illustrates how customers can use commodity hardware running Windows HPC Server to build industry-leading, power-efficient computing platforms.

  • Heralding the Petaflop Era on both performance and power efficiency with Windows HPC

    By crossing the petaflop barrier on both Windows and Linux, Tsubame 2.0 joins a select group of computer installations that have achieved this status. In fact, only 7 systems on the TOP500 list have cracked the petaflop barrier, and Tsubame 2.0 captured the #4 spot on that list. Systems on the TOP500 list are ranked by Rmax, the sustained performance of a computer system while running High Performance Linpack (HPL). Rpeak measures the raw processing capacity of the computer system. Both Rmax and Rpeak values are given in Teraflops. The efficiency of the computer system is defined as:

    $$\text{Efficiency (\%)} = \frac{R_{\mathrm{max}}}{R_{\mathrm{peak}}} \times 100$$

    Referring to the table below, the efficiency of over 50% attained by Tsubame 2.0, running Windows, competes very well with the efficiency of other GP/GPU based clusters in the TOP500 list.

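    GP/GPU based Systems - TOP500 List - November 2010**

    Rank | System Description               | Vendor           | OS      | Rmax (TF) | Rpeak (TF) | Efficiency
    -----|----------------------------------|------------------|---------|-----------|------------|-----------
    1    | Tianhe-1A - NUDT TH GP/GPU       | NUDT             | LINUX   | 2566      | 4701.00    | 54.58%
    3    | Nebulae - Dawning TC3600 GP/GPU  | Dawning          | LINUX   | 1271      | 2984.30    | 42.59%
    4    | TSUBAME 2.0 - HP ProLiant GP/GPU | NEC/HP           | LINUX   | 1192      | 2287.63    | 52.10%
    4*   | TSUBAME 2.0 - HP ProLiant GP/GPU | NEC/HP           | Windows | 1127      | 2185.36    | 51.57%
    22   | LOEWE-CSC - Supermicro GP/GPU    | Clustervision/HP | LINUX   | 285.20    | 469.73     | 60.71%
    145  | CSIRO - Supermicro GP/GPU        | Xenon            | LINUX   | 52.55     | 143.30     | 36.67%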

    * TSUBAME 2.0/Windows results were not submitted and would have retained the #4 ranking
    ** Source: http://www.top500.org/list/2010/11/100
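
    As a quick check of the efficiency column, the sketch below recomputes Rmax/Rpeak for each row of the table above. The row values are copied from the table; the helper name `efficiency_pct` and the tuple layout are illustrative choices, not part of any TOP500 tooling. Differences in the last digit are expected because the listed Rmax values are rounded.

    ```python
    # Rows copied from the GP/GPU table above:
    # (system, OS, Rmax in TF, Rpeak in TF, efficiency as listed in %).
    ROWS = [
        ("Tianhe-1A", "LINUX", 2566.0, 4701.00, 54.58),
        ("Nebulae", "LINUX", 1271.0, 2984.30, 42.59),
        ("TSUBAME 2.0", "LINUX", 1192.0, 2287.63, 52.10),
        ("TSUBAME 2.0", "Windows", 1127.0, 2185.36, 51.57),
        ("LOEWE-CSC", "LINUX", 285.20, 469.73, 60.71),
        ("CSIRO", "LINUX", 52.55, 143.30, 36.67),
    ]


    def efficiency_pct(rmax, rpeak):
        """Return HPL efficiency as a percentage: 100 * Rmax / Rpeak."""
        return 100.0 * rmax / rpeak


    for name, os_name, rmax, rpeak, listed in ROWS:
        computed = efficiency_pct(rmax, rpeak)
        print(f"{name:12s} {os_name:8s} computed={computed:6.2f}%  listed={listed:6.2f}%")
    ```

    The Windows and Linux runs on Tsubame 2.0 come out within about half a percentage point of each other, which is the comparison the text above relies on.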

    On an equally significant note, two supercomputing clusters running Windows HPC Server 2008 R2 captured the #3 and #5 spots on the Little Green500 List. The Green500 list reflects the trend in the supercomputing world toward consciously building eco-friendly data centers. Placing two of the top 5 systems on this list illustrates how customers can use commodity hardware running Windows HPC Server to build industry-leading, power-efficient computing platforms. It is also worth mentioning that Windows HPC represents 50% of the x86/x64-based systems in the top 10 of the list.

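    Little Green500 List - November 2010*

    Rank | System Description               | Vendor           | OS          | MFLOPS/Watt | Total Power (kW)
    -----|----------------------------------|------------------|-------------|-------------|-----------------
    1    | NNSA/SC Blue Gene/Q Prototype    | IBM              | LINUX       | 1684.20     | 38.8
    2    | GRAPE-DR accelerator Cluster     | NAOJ             | LINUX       | 1448.03     | 24.59
    3    | TSUBAME 2.0 - HP ProLiant GP/GPU | NEC/HP           | Windows HPC | 1031.92     | 26
    4    | EcoG                             | NCSA             | LINUX       | 933.06      | 37
    5    | CASPUR-Jazz Cluster GP/GPU       | Clustervision/HP | Windows HPC | 886.07      | 26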

    * Source: http://www.green500.org/lists/2010/11/little/list.php
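
    To relate the two numeric columns of the table above, the sketch below multiplies power efficiency by total power to get the sustained performance each entry implies. This assumes the listed power is the power drawn during the measured run; the function name `implied_tflops` is illustrative.

    ```python
    # (rank, system, MFLOPS/Watt, total power in kW) copied from the table above.
    LITTLE_GREEN500 = [
        (1, "NNSA/SC Blue Gene/Q Prototype", 1684.20, 38.8),
        (2, "GRAPE-DR accelerator Cluster", 1448.03, 24.59),
        (3, "TSUBAME 2.0 (Windows HPC)", 1031.92, 26.0),
        (4, "EcoG", 933.06, 37.0),
        (5, "CASPUR-Jazz Cluster (Windows HPC)", 886.07, 26.0),
    ]


    def implied_tflops(mflops_per_watt, power_kw):
        """MFLOPS/W times W gives MFLOPS; divide by 1e6 to express it in TFLOPS."""
        watts = power_kw * 1000.0
        return mflops_per_watt * watts / 1e6


    for rank, system, eff, power_kw in LITTLE_GREEN500:
        print(f"#{rank} {system}: {implied_tflops(eff, power_kw):.2f} TFLOPS at {power_kw} kW")
    ```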

    Performance tuning of the PetaFlops run on Windows HPC

    Surpassing the petaflop barrier took significant tuning effort and was achieved in three incremental steps. The table below gives the performance of the three jobs, the number of nodes used, and the HPL parameter that most affected performance. Changing the HPL BCAST parameter from 1 to 3 in Job 3 means the panel broadcast uses 2 rings instead of 1. With 2 rings, each broadcast uses less bandwidth and involves less blocking communication, which further improved performance.

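    Job Description | Nodes | HPL BCAST Parameter | Petaflop Rating
    ----------------|-------|---------------------|----------------
    Job 1 (Red)     | 1280  | BCAST=1             | 1.103 PF
    Job 2 (Blue)    | 1296  | BCAST=1             | 1.118 PF
    Job 3 (Green)   | 1296  | BCAST=3             | 1.127 PF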
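
    The effect of each tuning step can be separated with a little arithmetic on the table above. The sketch below computes Gflops per node and the gain over the previous job; the values are copied from the table and the helper names are illustrative. For reference, in the stock HPL input file BCAST=1 selects the 1-ring modified topology and BCAST=3 the 2-ring modified topology.

    ```python
    # (job, nodes, HPL BCAST value, sustained performance in PF) from the table above.
    JOBS = [
        ("Job 1", 1280, 1, 1.103),
        ("Job 2", 1296, 1, 1.118),
        ("Job 3", 1296, 3, 1.127),
    ]


    def gflops_per_node(petaflops, nodes):
        """Convert a whole-cluster PF rating into Gflops per node."""
        return petaflops * 1e6 / nodes


    def gain_pct(new, old):
        """Relative improvement of new over old, in percent."""
        return 100.0 * (new - old) / old


    previous_pf = None
    for job, nodes, bcast, pf in JOBS:
        line = f"{job}: {nodes} nodes, BCAST={bcast}, {gflops_per_node(pf, nodes):.1f} Gflops/node"
        if previous_pf is not None:
            line += f", {gain_pct(pf, previous_pf):+.2f}% vs previous job"
        print(line)
        previous_pf = pf
    ```

    Per-node throughput is nearly unchanged between Job 1 and Job 2, which suggests that step gained mostly from the 16 extra nodes, while Job 3 improves per-node throughput on the same node count.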

    The general pattern of progress for each of these jobs is a quick climb to a peak, followed by a plateau and a small drop through the first third of the run. Performance then declines gradually through the end of the last third. The first third is highly computation intensive (in our case GPU only), and the last third is dominated by communication.

    HPL Progression

    Cluster Scalability under a power budget

    The chart below summarizes the scalability tests run on 1 to 32 nodes for two node configurations: power capped and non-power capped. The Tsubame 2.0 nodes have a power capping mechanism that limits how much power a chassis can consume. Once the limit is reached, the CPU frequency is lowered for all nodes sharing that chassis, which costs performance. The chart shows the performance of a highly tuned HPL implementation on Windows, first for a set of non-power capped machines (green bars) and then for power capped machines (blue bars). Single-node performance is above 1 TF, and the difference between the power capped and non-power capped machines is roughly 3%. The per-node performance loss going from 1 to 32 nodes is about 5% for the non-power capped machines and 7% for the power capped machines.

    Gflops/Node graph

    Cluster Configuration

    The Tsubame 2.0 cluster consists of more than 1408 compute nodes interconnected by low-latency, high-bandwidth, full bisection-bandwidth InfiniBand networks. All compute nodes share a scalable storage system that provides 7 PB of capacity.

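    Node Name | Purpose                         | #Node | Product Name          | CPU                                                        | GPU                    | Memory             | Local Storage (SSD)  | Network
    ----------|---------------------------------|-------|-----------------------|------------------------------------------------------------|------------------------|--------------------|----------------------|----------------------------
    Thin Node | For jobs needing less than 54GB | 1408  | HP Proliant SL390s G7 | Intel Xeon 2.93 GHz (6 cores) x 2 (Hyperthreading enabled) | NVIDIA Tesla M2050 x 3 | 54GB (partly 96GB) | 120GB (partly 240GB) | QDR InfiniBand x 2 (80Gbps)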

    The single node performance is presented in the table below. The main CPU can deliver 140.64 Gflops peak, while each GPU can deliver 515.20 Gflops peak, for a total of 1545.6 Gflops peak across the three GPUs. Because the GPU peak performance is 11 times the CPU peak performance, the main CPU was not used for the DGEMM computation in HPL, but it was used for panel broadcast and communication.
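
    The peak figures above can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; treating 140.64 Gflops as the combined peak of both CPU sockets, and summing CPU and GPU peaks to obtain a per-node Rpeak, are assumptions made for illustration.

    ```python
    CPU_PEAK_GFLOPS = 140.64  # CPU peak per node, as quoted in the text
    GPU_PEAK_GFLOPS = 515.20  # peak of one NVIDIA Tesla M2050, as quoted in the text
    GPUS_PER_NODE = 3         # from the cluster configuration table


    def gpu_total_gflops():
        """Combined peak of all GPUs in one node."""
        return GPU_PEAK_GFLOPS * GPUS_PER_NODE


    def node_peak_gflops():
        """Assumed per-node Rpeak: CPU peak plus combined GPU peak."""
        return CPU_PEAK_GFLOPS + gpu_total_gflops()


    def cluster_rpeak_tflops(nodes):
        """Assumed cluster Rpeak in TF for a given node count."""
        return node_peak_gflops() * nodes / 1000.0


    print(f"GPU total per node : {gpu_total_gflops():.1f} Gflops")
    print(f"GPU/CPU peak ratio : {gpu_total_gflops() / CPU_PEAK_GFLOPS:.2f}")
    print(f"Node peak          : {node_peak_gflops():.2f} Gflops")
    print(f"Rpeak on 1296 nodes: {cluster_rpeak_tflops(1296):.2f} TF")
    ```

    Under these assumptions the 1296-node figure lands within 0.01 TF of the 2185.36 TF Rpeak listed for the Windows run in the TOP500 table.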

    Step 1 Cluster Configuration

    HPL Application

    High Performance Linpack (HPL) is an industry standard benchmark commonly used to measure and rate the performance of leading high performance computer architectures. HPL is a software package that solves a (random) dense linear system in double precision (64 bits) arithmetic on distributed-memory computers. The HPL package provides a testing and timing program to quantify the accuracy of the obtained solution as well as the time it took to compute it. For more information about HPL please see http://www.netlib.org/benchmark/hpl/.
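
    To make the description above concrete, here is a toy, single-process sketch of what an HPL-style run measures: it builds a random dense system, solves it in double precision, checks the residual, and converts the elapsed time into a flop rate. This is not HPL itself, which uses a blocked, distributed LU factorization; the function names and the problem size are illustrative, and the operation count 2/3 n^3 + 3/2 n^2 is the conventional figure used to report HPL performance.

    ```python
    import random
    import time


    def solve_dense(a, b):
        """Solve a x = b by Gaussian elimination with partial pivoting (works on copies)."""
        n = len(b)
        a = [row[:] for row in a]
        b = b[:]
        for k in range(n):
            # Partial pivoting: move the largest entry of column k onto the diagonal.
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if a[p][k] == 0.0:
                raise ValueError("matrix is singular")
            a[k], a[p] = a[p], a[k]
            b[k], b[p] = b[p], b[k]
            for i in range(k + 1, n):
                m = a[i][k] / a[k][k]
                for j in range(k, n):
                    a[i][j] -= m * a[k][j]
                b[i] -= m * b[k]
        # Back substitution on the resulting upper-triangular system.
        x = [0.0] * n
        for i in range(n - 1, -1, -1):
            s = sum(a[i][j] * x[j] for j in range(i + 1, n))
            x[i] = (b[i] - s) / a[i][i]
        return x


    def max_residual(a, x, b):
        """Largest absolute entry of a x - b, a simple accuracy check."""
        return max(abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
                   for row, bi in zip(a, b))


    def hpl_flop_count(n):
        """Conventional operation count for solving an n x n dense system."""
        return (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2


    def run(n=60, seed=0):
        """Generate, solve, and time one random system; return (residual, Gflops)."""
        rng = random.Random(seed)
        a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
        b = [rng.uniform(-0.5, 0.5) for _ in range(n)]
        start = time.perf_counter()
        x = solve_dense(a, b)
        elapsed = time.perf_counter() - start
        return max_residual(a, x, b), hpl_flop_count(n) / elapsed / 1e9


    if __name__ == "__main__":
        residual, gflops = run()
        print(f"max residual = {residual:.3e}, rate = {gflops:.6f} Gflops")
    ```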

    http://blogs.msdn.com/b/dan_fay/archive/2010/11/16/sc-2010-winhpc-server-links-to-the-cloud-amp-breaks-petaflop-barrier.aspx

    * Note: We gratefully acknowledge Tokyo Institute of Technology for their permission to use HPL benchmark results.

Manufacturing Applications

  • ANSYS FLUENT 12.0

    ANSYS FLUENT 12.0 Benchmark

    Application

    ANSYS FLUENT® is the CFD solver of choice for the broad physical modeling capabilities needed to model flow, turbulence, heat transfer, and reactions for industrial applications ranging from air flow over an aircraft wing to semiconductor manufacturing, to mention just a few.

    Benchmark Description

    The ANSYS FLUENT benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few thousand cells to more than 100 million cells. Both the segregated and coupled implicit solvers are included, as well as hexahedral, mixed and polyhedral cell cases. This broad coverage is expected to demonstrate the breadth of ANSYS FLUENT performance on a variety of hardware platforms.

    Benchmark test case used in this study

    The test case used in the results presented below is called truck_14m. This test case models external flow over a truck body. This case contains about 14 million cells of mixed type and uses the DES turbulence model with the segregated implicit solver.

    Results

    The results for one of the test cases, truck_14m, presented in the graphic demonstrate excellent scalability of ANSYS FLUENT 12.0 running on Windows HPC Server 2008 R2.

    The results for the test case truck_14m running on Windows HPC Server 2008 R2 are published on the ANSYS FLUENT website and are available at: http://www.fluent.com/software/fluent/fl6bench/fl6bench_12.0/problems/truck_14m.htm

    For Windows performance:

    (Look for Platform=HP C7000 (INTEL_X5650,2670,WINHPC,IB))

    More results are available at the Micro(soft)site set up by Ansys: http://www.ansys.com/ansysonwindows/ansys-performance-on-windows.pdf

    The results for the remaining test cases in the benchmark suite are also available on the ANSYS FLUENT benchmark website at: http://www.fluent.com/software/fluent/fl6bench/fl6bench_12.0/index.htm

    Configuration

    ANSYS FLUENT 12.0.16 was used for the Windows benchmark runs. The hardware configuration used to benchmark ANSYS FLUENT on Windows HPC Server 2008 R2 was as follows:

    System:

    HP Blade System BL2x220G7 ™

    Compute Node:

    Intel Xeon X5650 (Westmere) 2.67 GHz (Dual-Processor, Six Core)

    Memory:

    24GB (1333 MHz)

    Interconnect:

    Mellanox Infiniband

    Number of Nodes:

    64

    System Name:

    HP/Westmere (Microsoft USA)

    Note: We gratefully acknowledge Ansys Inc for their permission to use the truck_14m image, the model, and generated results.

  • ANSYS CFX 13.0

    ANSYS CFX 13.0 Benchmark

    Application

    ANSYS CFX is a high-performance, general purpose CFD program that has been applied to solve wide-ranging fluid flow problems.

    Benchmark Description

    The ANSYS CFX benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few million cells to more than 20 million cells. This broad coverage is expected to demonstrate the breadth of ANSYS CFX performance on a variety of hardware platforms and test cases.

    Benchmark test case used in this study

    The test case used in the results presented below is a channel flow KE model of internal flow type and contains around 20 million cells.

    Results

    The results for one of the test cases, channel flow KE model of internal flow, presented in the graphic demonstrate excellent scalability of ANSYS CFX running on Windows HPC Server 2008 R2.

    More results are also available at the Micro(soft)site set up by Ansys at: http://www.ansys.com/ansysonwindows/ansys-performance-on-windows.pdf

    Configurations

    ANSYS CFX 13.0 was used to benchmark both the Linux and Windows runs.

    The following hardware configuration was used to benchmark ANSYS CFX on Linux as well as Windows HPC Server 2008 R2:

    System:

    ACER/Gateway Model GW2000HQ

    CPU:

    Intel Xeon X5570 @ 2.93 GHz (Dual-Processor, Quad Core)

    Memory:

    24GB

    Interconnect:

    Infiniband

    Number of nodes:

    16

    Hyper Threading:

    OFF

    TURBO Mode:

    ON

    System Name:

    GAMBRINUS (Fraunhofer SCAI, Germany)

    Note: We gratefully acknowledge Ansys Inc for their permission to use the model and generated results.

  • LS-DYNA 971 3.2.1

    LS-DYNA 971 3.2.1 Benchmark

    Application

    LS-DYNA® is a leading simulation solution for finite element analysis produced by Livermore Software Technology Corporation. LS-DYNA is used in a number of industries such as automotive, metal forming, and aerospace.

    Benchmark Description

    The LS-DYNA benchmark suite comprises a set of test cases covering a large range of mesh sizes, physical models and solvers representing typical industry usage. The cases range in size from a few thousand elements to more than 2 million elements. Benchmark data sets and cluster performance results are available from the independent web site www.topcrunch.org.

    Benchmark test case used in this study

    The test case used in the results presented below is called car2car. It simulates a head-on crash of two minivans (Car2Car) and contains about 2 million elements.

    Results

    The results for one of the test cases, car2car, presented in the graphic demonstrate excellent scalability of LSTC LS-DYNA running on Windows HPC Server 2008 R2.

    The results for the test case car2car running on Windows HPC Server 2008 and Linux systems are published by the independent website topcrunch and are available at: http://topcrunch.org/benchmark_results.sfe

    For Windows performance:

    (Select File Name=car2car; Year=2011; In the search results, look for Vendor/Submitter=HP/Microsoft, Processor=Intel® Xeon® Six Core X5650 2.66 GHz)

    For LINUX performance:

    (Select File Name=car2car; Year=2010; In the search results look for Vendor/Submitter=Intel/SSG/ASE and Processor= Intel Xeon Six Core X5670)

    Configurations

    LS-DYNA 971 3.2.1 was used to benchmark both the Linux and Windows runs.

    System Configuration used for Windows HPC Server 2008 R2 benchmarks:

    System Name:

    WESTMERE - HP Blade System BL2x220G6™

    Interconnect Manufacturer:

    Mellanox

    Interconnect Type:

    QDR Infiniband – Mellanox

    Operating System:

    Windows HPC Server 2008 R2

    MPI:

    MSMPI

    Processor Type:

    2.67 GHz, Xeon X5650 (SMT OFF, Turbo ON)

    Memory:

    24 GB (DDR3-1333)

    System Configuration used for LINUX benchmarks:

    System Name:

    Intel® SR1600UR system

    Interconnect Manufacturer:

    Mellanox

    Interconnect Type:

    QDR Infiniband

    Operating System:

    SuSe 11

    MPI:

    Unknown

    Processor Type:

    2.93 GHz, Xeon X5670

    Memory:

    Unknown

    LINUX numbers obtained from the public website: http://www.topcrunch.org

    Note: We gratefully acknowledge Livermore Software Technology Corporation for providing the two car collision image. The car models were developed by the FHWA/NHTSA National Crash Analysis Center of the George Washington University and are available through www.topcrunch.org. We would also like to acknowledge that the performance numbers for Linux were obtained from the www.topcrunch.org site and were posted by Intel.

Financial Applications

  • MG-ALFA 7.1

    MG-ALFA 7.1 Benchmark

    Application

    Milliman’s MG-ALFA is a Windows-based actuarial system that generates financial projections to support decision and risk analysis. Insurance and financial firms use these projections for product development, financial reporting, risk management, and decision analysis.

    Benchmark Description

    The MG-ALFA benchmark suite comprises several test cases covering a large range of problem sizes and real-life scenarios.

    Benchmark test case used in this study

    The test case used in this study consists of 809 cells and 1000 scenarios; tasks are distributed by scenario.

    Results

    The results for the benchmark test case, presented in the graphic, demonstrate excellent, near-linear scalability of MG-ALFA on an IBM iDataplex cluster running Windows HPC Server 2008 R2. For earlier results, please see: http://www-03.ibm.com/systems/resources/systems_deepcomputing_mg_alfa_white_paper_2009_final.pdf

    Configuration

    MG-ALFA Version 7.1 was used for the benchmark runs. The following hardware configuration was used to benchmark MG-ALFA on Windows HPC Server 2008 R2:

    System:

    IBM System x3650 M2

    Compute Node:

    Intel Nehalem CPU 5500 2.67 GHz (Dual-Processor, Quad Core)

    Memory:

    24GB (1333 MHz)

    Interconnect:

    Mellanox Infiniband

    Number of nodes:

    48

    System Name:

    RHIANNON (Microsoft, USA)

    Note: We gratefully acknowledge Milliman for their permission to use MG-ALFA benchmark test cases to conduct the benchmark experiments.

Microkernel Benchmarks

  • SPEC MPI®2007 2.0

    SPEC MPI®2007 2.0

    Application

    SPEC MPI2007 is SPEC's benchmark suite for evaluating MPI-parallel, floating point, compute intensive performance across a wide range of cluster and SMP hardware. MPI2007 continues the SPEC tradition of giving users the most objective and representative benchmark suite for measuring and comparing high-performance computer systems.

    Benchmark Description

    SPEC MPI2007 focuses on performance of compute intensive applications using the Message-Passing Interface (MPI), which means these benchmarks emphasize the performance of:

    • the type of computer processor (CPU),

    • the number of computer processors,

    • the MPI Library,

    • the communication interconnect,

    • the memory architecture,

    • the compilers, and

    • the shared file system.

    For more detailed information about the SPEC MPI2007 benchmark suite, please refer to: http://www.spec.org/mpi2007/

    Results

    The SPEC MPI2007 results presented below will be posted on the public website http://www.spec.org/mpi2007/results in the near future.

    The comparison presented in the graphic is based on the performance (SPECmpiM_base2007) of Linux and Windows HPC Server 2008 R2, each running on 16-, 32-, 64-, and 128-core configurations of the ACER/Gateway Model GW2000HQ. The results demonstrate that the Windows version is very competitive.

    Some previous results on Windows HPC Server 2008 are published at http://www.spec.org/mpi2007/results/res2010q1/#SPECmpiM as of April 13, 2010 and July 19, 2010 for Linux and Windows HPC Server 2008 respectively.

    Configuration

    SPEC MPI2007 version 2.0 was used to benchmark both Windows and Linux systems. The following hardware configuration was used to benchmark SPEC MPI2007 on Linux as well as Windows HPC Server 2008 R2:

    System:

    ACER/Gateway Model GW2000HQ

    CPU:

    Intel Xeon X5570 @ 2.93 GHz (Dual-Processor, Quad Core)

    Memory:

    24GB

    Interconnect:

    Infiniband

    Number of nodes:

    16

    Hyper Threading:

    OFF

    TURBO Mode:

    ON

    System Name:

    GAMBRINUS (Fraunhofer SCAI, Germany)

    SPEC®, the benchmark name SPEC MPI®2007 and the metric SPECmpiM_base2007 are registered trademarks of the Standard Performance Evaluation Corporation. For the latest SPEC MPI2007 benchmark results, visit http://www.spec.org/mpi2007.

Customer and Partner Testimonials

  • Mellanox is working closely with Microsoft to bring optimized performance on Windows HPC Server 2008 by supporting NetworkDirect, a software interface that enables the most efficient RDMA and MPI solution for HPC applications. Powered by Mellanox ConnectX 40Gb/s InfiniBand, the Windows HPC benchmarks obtained on the joint benchmarking cluster (“Rhiannon”) demonstrate performance parity with Linux on major CAE applications (including ANSYS Fluent 12 and LSTC LS-DYNA®) and superior scalability for Milliman MG-ALFA®.

    Gilad Shainer, Director of HPC, Mellanox Technologies
  • Tests of Windows HPC Server 2008 have shown that it delivers accurate results with no compromise on speed compared with the previous Linux implementation used by the centre.

    Prof. Andy Keene, Head, Rolls-Royce University Technology Centre (UTC) for Computational Engineering, Southampton University
  • Our performance tests were so conclusive that we’re now converting our Linux server to run on Windows HPC Server 2008. We’re never going back to Linux.

    Dr. Marco Derksen, Manager of R&D, Stork Thermeq
  • We've done benchmarks with up to 256 cores that showed performance that meets, and in some cases exceeds, the Linux tests done on the same hardware.

    Greg Keller, Technical Principal, R Systems
  • We saw outstanding performance from Windows HPC Server during our Linpack benchmarking run on Tsubame 2.0, it broke the Petaflop barrier and was on par with Linux at this scale. In a power-optimized configuration, it recorded over a Gigaflop/Watt, showing it is nearly three times more energy efficient than an average laptop. We were very excited to see this level of performance given Windows applications will be an important part of our work with industry partners.*

    Satoshi Matsuoka, professor at the Global Scientific Information and Computing Center, Tokyo Institute of Technology