A Performance Guide
For HPC Applications
On the IBM System x iDataPlex
dx360 M4 System

Release 1.0.2
June 19, 2012
IBM Systems and Technology Group

Copyright 2012 IBM Corporation

Contents
Contributors...................................................................................................... 8
Introduction...................................................................................................... 9
1 iDataPlex dx360 M4 ................................................................................. 11
1.1 Processor .....................................................................................................11
1.1.1 Supported Processor Models ................................................................................. 12
1.1.2 Turbo Boost 2.0 ...................................................................................................... 13
1.2 System .........................................................................................................16
1.2.1 I/O and Locality Considerations.............................................................................. 17
1.2.2 Memory Subsystem................................................................................................ 17
1.2.3 UEFI......................................................................................................................... 20
1.3 Mellanox InfiniBand Interconnect ................................................................23
1.3.1 References .............................................................................................................. 27

2 Performance Optimization on the CPU ..................................................... 28


2.1 Compilers.....................................................................................................28
2.1.1 GNU compiler ......................................................................................................... 28
2.1.2 Intel Compiler ......................................................................................................... 29
2.1.3 Recommended compiler options ........................................................................... 29
2.1.4 Alternatives............................................................................................................. 46
2.2 Libraries .......................................................................................................47
2.2.1 Intel Math Kernel Library (MKL) ........................................................................... 47
2.2.2 Alternative Libraries ............................................................................................... 48
2.3 References ...................................................................................................49
3 Linux......................................................................................................... 51
3.1 Core Frequency Modes.................................................................................51
3.2 Memory Page Sizes ......................................................................................51
3.3 Memory Affinity...........................................................................................52
3.3.1 Introduction............................................................................................................ 52
3.3.2 Using numactl ......................................................................................................... 52
3.4 Process and Thread Binding .........................................................................53
3.4.1 taskset..................................................................................................................... 53
3.4.2 numactl................................................................................................................... 54
3.4.3 Environment Variables for OpenMP Threads......................................................... 55
3.4.4 LoadLeveler............................................................................................................. 56
3.5 Hyper-Threading (HT) Management .............................................................57
3.6 Hardware Prefetch Control...........................................................................57
3.7 Monitoring Tools for Linux ...........................................................................58
3.7.1 Top.......................................................................................................................... 58
3.7.2 vmstat ..................................................................................................................... 59
3.7.3 iostat....................................................................................................................... 59
3.7.4 mpstat..................................................................................................................... 59


4 MPI........................................................................................................... 61
4.1 Intel MPI ......................................................................................................61
4.1.1 Compiling................................................................................................................ 61
4.1.2 Running Parallel Applications ................................................................................. 62
4.1.3 Processor Binding ................................................................................................... 63
4.2 IBM Parallel Environment ............................................................................64
4.2.1 Building an MPI program ........................................................................................ 64
4.2.2 Selecting MPICH2 libraries...................................................................................... 65
4.2.3 Optimizing for Short Messages............................................................................... 65
4.2.4 Optimizing for Intranode Communications ............................................................ 65
4.2.5 Optimizing for Large Messages............................................................................... 65
4.2.6 Optimizing for Intermediate-Size Messages........................................................... 66
4.2.7 Optimizing for Memory Usage ............................................................................... 66
4.2.8 Collective Offload in MPICH2.................................................................................. 66
4.2.9 MPICH2 and PEMPI Environment Variables ........................................................... 67
4.2.10 IBM PE Standalone POE Affinity ............................................................................. 69
4.2.11 OpenMP Support .................................................................................................... 70
4.3 Using LoadLeveler with IBM PE ....................................................................70
4.3.1 Requesting Island Topology for a LoadLeveler Job................................................. 70
4.3.2 How to run OpenMPI and INTEL MPI jobs with LoadLeveler ................................. 71
4.3.3 LoadLeveler JCF (Job Command File) Affinity Settings ........................................... 71
4.3.4 Affinity Support in LoadLeveler .............................................................................. 72

5 Performance Analysis Tools on Linux ........................................................ 73


5.1 Runtime Environment Control......................................................................73
5.1.1 Ulimit ...................................................................................................................... 73
5.1.2 Memory Pages ........................................................................................................ 73
5.2 Hardware Performance Counters and Tools .................................................73
5.2.1 Hardware Event Counts using perf......................................................................... 73
5.2.2 Instrumenting Hardware Performance with PAPI .................................................. 79
5.3 Profiling Tools ..............................................................................................80
5.3.1 Profiling with gprof ................................................................................................. 81
5.3.2 Microprofiling ......................................................................................................... 82
5.3.3 Profiling with oprofile ............................................................................................. 84
5.4 High Performance Computing Toolkit (HPCT) ...............................................84
5.4.1 Hardware Performance Counter Collection ........................................................... 85
5.4.2 MPI Profiling and Tracing........................................................................................ 85
5.4.3 I/O Profiling and Tracing......................................................................................... 86
5.4.4 Other Information .................................................................................................. 86

6 Performance Results for HPC Benchmarks ................................................ 88


6.1 Linpack (HPL) ...............................................................................................88
6.1.1 Hardware and Software Stack Used ....................................................................... 88
6.1.2 Code Version........................................................................................................... 88
6.1.3 Test Case Configuration.......................................................................................... 88
6.1.4 Petaflop HPL Performance...................................................................................... 91
6.2 STREAM .......................................................................................................92
6.2.1 Single Core Version - GCC compiler ....................................................... 93


6.2.2 Single Core Version - Intel compiler ....................................................................... 94


6.2.3 Frequency Dependency .......................................................................................... 98
6.2.4 Saturating Memory Bandwidth .............................................................................. 99
6.2.5 Beyond Stream ..................................................................................................... 104
6.3 HPCC .......................................................................................................... 108
6.3.1 Hardware and Software Stack Used ..................................................................... 109
6.3.2 Build Options ........................................................................................................ 109
6.3.3 Runtime Configuration ......................................................................................... 109
6.3.4 Results .................................................................................................................. 110
6.4 NAS Parallel Benchmarks Class D................................................................ 111
6.4.1 Hardware and Building ......................................................................................... 111
6.4.2 Results .................................................................................................................. 112

7 AVX and SIMD Programming.................................................................. 113


7.1 AVX/SSE SIMD Architecture ....................................................................... 113
7.1.1 Note on Terminology:........................................................................................... 113
7.2 A Short Vector Processing History .............................................................. 114
7.3 Intel SIMD Microarchitecture (Sandy Bridge) Overview ............................ 116
7.4 Vectorization Overview.............................................................................. 117
7.5 Auto-vectorization ..................................................................................... 118
7.6 Inhibitors of Auto-vectorization ................................................................. 118
7.6.1 Loop-carried Data Dependencies ......................................................................... 118
7.6.2 Memory Aliasing................................................................................................... 119
7.6.3 Non-stride-1 Accesses .......................................................................................... 119
7.6.4 Other Vectorization Inhibitors in Loops................................................................ 120
7.6.5 Data Alignment..................................................................................................... 120
7.7 Additional References ................................................................................ 122
8 Hardware Accelerators ........................................................................... 123
8.1 GPGPUs...................................................................................................... 123
8.1.1 NVIDIA Tesla 2090 Hardware description............................................................. 123
8.1.2 A CUDA Programming Example ............................................................................ 126
8.1.3 Memory Hierarchy in GPU Computations ............................................................ 128
8.1.4 CUDA Best practices ............................................................................................. 129
8.1.5 Running HPL with GPUs ........................................................................................ 130
8.1.6 CUDA Toolkit......................................................................................................... 131
8.1.7 OpenACC............................................................................................................... 131
8.1.8 Checking for GPUs in a system ............................................................................. 131
8.1.9 References ............................................................................................................ 133

9 Power Consumption ............................................................................... 134


9.1 Power consumption measurements ........................................................... 134
9.2 Performance versus Power Consumption ................................................... 135
9.3 System Power states .................................................................................. 135
9.3.1 G-States ................................................................................................................ 137
9.3.2 S-States ................................................................................................................. 138


9.3.3 C-States................................................................................................................. 138


9.3.4 P-States................................................................................................................. 141
9.3.5 D-States ................................................................................................................ 142
9.3.6 M-States ............................................................................................................... 143
9.4 Relative Influence of Power Features ......................................................... 143
9.5 Efficiency Definitions Reference ................................................................. 144
9.6 Power and Energy Management ................................................................ 144
9.6.1 xCAT renergy......................................................................................................... 144
9.6.2 Power and Energy Aware LoadLeveler ................................................................. 145

Appendix A: Stream Benchmark Performance - Intel Compiler v. 11 ....... 147


Appendix B: Acknowledgements ............................................................. 148
Appendix C: Some Useful Abbreviations .................................................. 149
Appendix D: Notices ................................................................................ 150
Appendix E: Trademarks ......................................................................... 152

Figures
Figure 1-1 Processor Ring Diagram ................................................................................................ 12
Figure 1-2 dx360 M4 Block Diagram with Data Buses ................................................................... 16
Figure 1-3 Relative Memory Latency by Clock Speed ..................................................................... 19
Figure 1-4 Relative Memory Throughput by Clock Speed............................................................... 20
Figure 6-1 Comparing actual Linpack and system peak performance (GFlops) for different
numbers of nodes ........................................................................................................................... 91
Figure 6-2 Comparing measured Linpack and system peak performance (PFlops) for large
numbers of nodes ........................................................................................................................... 92
Figure 6-3 Measured Bandwidth (MB/s) for single-core STREAM tests using GCC ........................ 94
Figure 6-4 Measured Bandwidth (MB/s) for single-core STREAM tests using Intel icc................... 96
Figure 6-5 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc and GCC
........................................................................................................................................................ 96
Figure 6-6 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc without
streaming stores and GCC .............................................................................................................. 98
Figure 6-7 Single core memory bandwidth as a function of core frequency .................................. 99
Figure 6-8 Memory Bandwidth (MB/s) over 16 cores - GCC throughput benchmark .................. 100
Figure 6-9 Memory bandwidth (MB/s) - minimum number of sockets - 16-way OpenMP benchmark ... 101
Figure 6-10 Memory bandwidth (MB/s) - performance of 8 threads on 1 or 2 sockets ............... 102
Figure 6-11 Memory bandwidth (MB/s) - split threads between two sockets ............................. 103
Figure 6-12 Memory bandwidth (MB/s) vs stride length for 1 to 16 threads .............................. 106
Figure 7-1 Using the low 128-bits of the YMMn registers for XMMn........................................... 113
Figure 7-2 Scalar and vector operations....................................................................................... 114
Figure 7-3 Sandy Bridge block diagram emphasizing SIMD AVX functional units ........................ 116
Figure 8-1 Functional block diagram of the Tesla Fermi GPU ...................................................... 124
Figure 8-2 Tesla Fermi SM block diagram .................................................................................... 125
Figure 8-3 Cuda core..................................................................................................................... 126
Figure 8-4 NVIDIA GPU memory hierarchy ................................................................................... 128
Figure 9-1 Server Power States..................................................................................................... 136
Figure 9-2 The effect of VRD voltage ............................................................................................ 142
Figure 9-3 Relative influence of power saving features................................................................ 143


Tables
Table 1-1 Sandy Bridge Feature Overview Compared to Xeon E5600 ............................................ 11
Table 1-2 Supported Sandy Bridge Processor Models .................................................................... 13
Table 1-3 Maximum Turbo Upside by Sandy Bridge CPU model ................................................. 14
Table 1-4 Supported DIMM types................................................................................................... 18
Table 1-5 Common UEFI Performance Tunings .............................................................................. 22
Table 2-1 GNU compiler processor-specific optimization options .................................................. 30
Table 2-2 A mapping between GCC and Intel compiler options for processor architectures.......... 30
Table 2-3 General GNU compiler optimization options.................................................................. 31
Table 2-4 General Intel compiler optimization options .................................................................. 34
Table 2-5 Intel compiler options that control vectorization ........................................................... 38
Table 2-6 Intel compiler options that enhance vectorization ........................................................ 38
Table 2-7 Intel compiler options for reporting on optimization...................................................... 39
Table 2-8 Global (inter-procedural) optimization options for the GNU compiler suite.................. 40
Table 2-9 Global (inter-procedural) optimization options for the Intel compiler suite................... 41
Table 2-10 Automatic parallelization for the Intel compiler........................................................... 43
Table 2-11 OpenMP options for the Intel compiler suite................................................................ 44
Table 2-12 GNU OpenMP runtime environment variables recognized by the Intel compiler
toolchain......................................................................................................................................... 45
Table 2-13 GNU compiler options for CAF ...................................................................................... 46
Table 2-14 Intel compiler options for CAF ...................................................................................... 46
Table 3-1 OpenMP binding options ................................................................................................ 56
Table 4-1 Intel MPI wrappers for GNU and Intel compiler ............................................................. 61
Table 4-2 Intel MPI settings for I_MPI_FABRICS............................................................................. 62
Table 5-1 Event modifiers for perf -e <event>:<mod> ................................................... 74
Table 6-1 LINPACK Job Parameters ................................................................................................ 88
Table 6-2 HPL performance on up to 18 iDataPlex dx360 M4 islands ........................................... 92
Table 6-3 Single core memory bandwidth as a function of core frequency.................................... 99
Table 6-4 Memory Bandwidth (MB/s) over 16 cores - throughput benchmark ........................... 100
Table 6-5 Memory bandwidth (MB/s) over 16 cores - OpenMP benchmark with icc - 20M ....... 101
Table 6-6 Memory bandwidth (MB/s) over 16 cores - OpenMP benchmark with icc - 200M ..... 101
Table 6-7 Memory bandwidth (MB/s) - minimum number of sockets - OpenMP benchmark with icc ... 101
Table 6-8 Memory bandwidth (MB/s) - split threads between two sockets - OpenMP benchmark with icc ... 102
Table 6-9 Memory bandwidth (MB/s) - minimum number of sockets - OpenMP benchmark with gcc ... 103
Table 6-10 Memory bandwidth (MB/s) - divide threads between two sockets - OpenMP benchmark with gcc ... 103
Table 6-11 Strided memory bandwidth (MB/s) - 16 threads ....................................................... 104
Table 6-12 Strided memory bandwidth (MB/s) - 8 threads ......................................................... 104
Table 6-13 Strided memory bandwidth (MB/s) - 4 threads ......................................................... 105
Table 6-14 Strided memory bandwidth (MB/s) - 2 threads ......................................................... 105
Table 6-15 Strided memory bandwidth (MB/s) - 1 thread ........................................................... 105
Table 6-16 Reverse order (stride=-1) memory bandwidth (MB/s) - 1 to 16 threads .................... 106
Table 6-17 Stride 1 memory bandwidth (MB/s) - 1 to 16 threads ............................................... 106
Table 6-18 Strided memory bandwidth (MB/s) with indexed loads - 1 thread ............................ 107
Table 6-19 Strided memory bandwidth (MB/s) with indexed loads - 16 threads ........................ 107
Table 6-20 Strided memory bandwidth (MB/s) with indexed stores - 1 thread ........................... 107
Table 6-21 Strided memory bandwidth (MB/s) with indexed stores - 16 threads ....................... 108


Table 6-22 Best values of HPL N,P,Q for different numbers of total available cores................ 109
Table 6-23 HPCC performance on 1 to 32 nodes ......................................................................... 110
Table 6-24 NAS PB Class D performance on 1 to 32 nodes........................................................... 112
Table 8-1 HPL performance on GPUs............................................................................................ 130
Table 9-1 Global server states ...................................................................................................... 137
Table 9-2 Sleep states................................................................................................................... 138
Table 9-3 CPU idle power-saving states ....................................................................................... 139
Table 9-4 CPU idle states for each core and socket ...................................................................... 140
Table 9-5 CPU Performance states ............................................................................................... 141
Table 9-6 Subsystem power states ............................................................................................... 142
Table 9-7 Memory power states................................................................................................... 143


Contributors
Authors
Charles Archer
Mark Atkins
Torsten Bloth
Achim Boemelburg
George Chochia
Don DeSota
Brad Elkin
Dustin Fredrickson
Julian Hammer
Jarrod B. Johnson
Swamy Kandadai
Peter Mayes
Eric Michel
Raj Panda
Karl Rister
Ananthanarayanan Sugavanam
Nicolas Tallet
Francois Thomas
Robert Wolford
Dave Wootton


Introduction
In March of 2012, IBM introduced a petaflop-class supercomputer, the iDataPlex dx360
M4. Supercomputers are used for simulations, design and for solving very large, complex
problems in various domains including science, engineering and economics.
Supercomputing data centers like the Leibniz Rechenzentrum (LRZ) in Germany are
looking for petaflop-class systems with two important qualities:
1. systems that are highly dense, to save on data center space
2. systems that are power efficient, to save on energy costs, which can run into millions of dollars over the lifetime of a supercomputer.
IBM designed the latest generation of its iDataPlex-class systems to meet the
performance, density, power and cooling requirements of a supercomputing data center
such as LRZ.

[Figure: the iDataPlex solution stack from IBM - iDataPlex servers, interconnect, and storage, with the OS, management software, and GPFS layered on top]

The true benefit of a supercomputer is realized only when the user community acquires and uses the special skills needed to maximize the performance of their applications. With this singular objective in mind, this document has been created to help application specialists wring out the last FLOP from their applications. The document is structured as a guide that provides pointers and references to more detailed sources of information on a given topic, rather than as a collection of self-contained recipes for performance optimization.

In the iDataPlex system, the dx360 M4 is a 2-socket SMP node that serves as the computational nucleus of the supercomputer. Chapter 1 provides a high-level description of the dx360 M4 node as well as the InfiniBand interconnect. Intel's latest-generation Sandy Bridge server processor is used in the dx360 M4. This chapter provides the processor- and system-level information that is essential for tuning.
Two different types of compilers, GNU and Intel, are covered as part of processor-level
performance tuning in chapter 2. Various compile options including a set of
recommended options, vectorization, shared memory parallelization and the use of math
libraries are some of the topics covered in this chapter.


The iDataPlex supports Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). Various aspects of operating system tuning that can benefit application performance are covered in chapter 3. Memory affinitization, process and thread binding, as well as tools for monitoring performance are some of the key topics in this chapter.

A majority of supercomputing applications are parallelized using the message passing interface (MPI). Performance tuning with different MPI libraries is the main topic for chapter 4.

A carefully conducted performance analysis of an application is often a prerequisite for squeezing out additional performance. Profiling is a key aspect of performance analysis.
Profiling can be conducted to understand different aspects of a parallel application.
Compute-level profiling can be done with gprof while an MPI profiling and tracing tool is
needed for analyzing communication in an application. Both types of tools are covered in
chapter 5. A deeper level of analysis can be carried out with a profiling tool that monitors
hardware performance counters. oprofile is one such tool covered in this chapter. The
chapter also covers I/O profiling.

Chapter 6 provides performance results on some of the standard benchmarks that are
frequently used in supercomputing, namely LINPACK and STREAM. Additionally, results
on HPCC and the NAS Parallel Benchmarks on the iDataPlex system are reported.

These first six chapters cover the essentials of performance tuning on the iDataPlex
dx360 M4. However, for those readers who want to go the extra mile in tuning on this
system, a few additional topics are covered in the remaining chapters.

A 256-bit SIMD unit called AVX is provided in the Sandy Bridge class of microprocessors.
SIMD programming is covered in chapter 7.

The dx360 M4 node can also accommodate NVIDIA GPGPUs. How to compile and run on NVIDIA GPGPUs is covered in chapter 8.

Power consumption in supercomputers has become a serious concern for data center
operators because of the high operating expenses. Consequently, supercomputing
application developers and users have become sensitive to the power consumption
behavior of their applications. Therefore, a document of this nature is not complete without a discussion of power consumption, which is treated in the last chapter.


1 iDataPlex dx360 M4
The dx360 M4 is the latest rack-dense compute node offering in the iDataPlex product line, providing numerous hardware features to optimize system and cluster performance:
- Up to two Intel Xeon E5-2600 series processors, each providing up to 8 cores and 16 threads, core speeds up to 2.7 GHz, up to 20 MB of L3 cache, and QPI interconnect links of up to 8 GT/s
- Optimized support of Intel Turbo Boost Technology 2.0, allowing CPU cores to run above rated speeds during peak workloads
- 4 DIMM channels per processor, offering sixteen DIMMs of registered DDR3 ECC memory, able to operate at 1600 MHz and with up to 256 GB per node
- Support for solid-state drives (SSDs), enabling improved I/O performance for many workloads
- PCI Express 3.0 I/O capability, enabling high-bandwidth interconnect support
- 10 Gb Ethernet and FDR10 mezzanine cards, offering high interconnect performance without consuming a PCIe slot
- Support for high-performance GPGPU adapters
Additional details on the dx360 M4, including supported configurations of hardware and operating systems, can be found in the dx360 M4 Product Guide, located here.

1.1 Processor
The dx360 M4 is built around the high performance capabilities of the Intel E5-2600 family of processors, code-named Sandy Bridge. As a major microarchitecture update from the previous generation E5600 series of CPUs, Sandy Bridge provides many key specification improvements, as noted in the following table:

Table 1-1 Sandy Bridge Feature Overview Compared to Xeon E5600

Feature                        Xeon E5600        Sandy Bridge-EP
Number of Cores                Up to 6 cores     Up to 8 cores
L1 Cache Size                  32K               32K
L2 Cache Size                  256K              256K
Last Level Cache (LLC) Size    12 MB             Up to 20 MB
Memory Channels per CPU        3                 4
Max Memory Speed Supported     1333 MHz          1600 MHz
Max QPI Frequency              6.4 GT/s          8.0 GT/s
Inter-Socket QPI Links         1                 2
Max PCIe Speed                 Gen 2 (5 GT/s)    Gen 3 (8 GT/s)

In addition, Sandy Bridge introduces support for AVX extensions within an updated execution stack, enabling 256-bit floating-point (FP) operations to be decoded and executed as a single micro-operation (uOp). The effect of this is a doubling in peak FP capability, sustaining 8 double-precision FLOPs/cycle.
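
As a worked example of what this means for per-node peak (using the E5-2680 at its rated 2.7 GHz; Turbo Boost raises the clock term accordingly):

    8 FLOPs/cycle x 2.7 GHz = 21.6 GFLOPS per core
    21.6 GFLOPS/core x 8 cores x 2 sockets = 345.6 GFLOPS peak per node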

In order to provide sufficient data bandwidth to efficiently utilize the additional processing
capability, the Sandy Bridge processor integrates a high performance, bidirectional ring
architecture similar to that used in the E7 family of CPUs. This high performance ring
interconnects the CPU cores, Last Level Cache (LLC, or L3), PCIe, QPI, and memory controller on the CPU, as depicted in Figure 1-1.

Figure 1-1 Processor Ring Diagram

[Block diagram: a bidirectional ring connecting the memory controller, QPI, and PCIe stops with eight CPU cores, each core paired with its L1/L2 caches and a 2.5 MB LLC slice]

While each physical LLC segment is loosely associated with a corresponding core, this
cache is shared as a logical unit, and any core can access any part of this cache.
Though access latency around the ring is dependent on the number of 1-cycle hops that
must be traversed, the routing architecture guarantees the shortest path will be taken.
With 32B of data able to be returned on each cycle, and with the ring and LLC clocked
with the CPU core, cache and memory latencies have dropped as compared to the
previous generation architecture, while cache and memory bandwidths are significantly
improved. Since the ring is clocked at the core frequency, however, it's important to note
that sustainable memory and cache performance is directly dependent on the speed of
the CPU cores.

Another key performance improvement in the Sandy Bridge family of CPUs is the
migration of the I/O controller into the CPU itself. While I/O adapters were previously
connected via PCIe to an I/O Hub external to the processor, Sandy Bridge has moved the
controller inside the CPU and has made it a stop on the high bandwidth ring. This feature
not only enables extremely high I/O bandwidth supporting the fastest Gen3 PCIe speeds,
but also enables I/O latency reductions of up to 30% as compared to Xeon E5600 based
architectures.

1.1.1 Supported Processor Models


The E5-2600 is available in many clock frequency and core count combinations to suit
the needs of a variety of compute environments. The dx360 M4 supports the following
Sandy Bridge processor models:


Table 1-2 Supported Sandy Bridge Processor Models

Processor    Core       L3       Cores   TDP      QPI link    Max memory    Hyper-       Turbo
model        speed      cache                     speed       speed         Threading    Boost

Advanced
E5-2680      2.7 GHz    20 MB    8       130 W    8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2670      2.6 GHz    20 MB    8       115 W    8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2665      2.4 GHz    20 MB    8       115 W    8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2660      2.2 GHz    20 MB    8       95 W     8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2650      2.0 GHz    20 MB    8       95 W     8.0 GT/s    1600 MHz      Yes          Ver 2.0

Standard
E5-2640      2.5 GHz    15 MB    6       95 W     7.2 GT/s    1333 MHz      Yes          Ver 2.0
E5-2630      2.3 GHz    15 MB    6       95 W     7.2 GT/s    1333 MHz      Yes          Ver 2.0
E5-2620      2.0 GHz    15 MB    6       95 W     7.2 GT/s    1333 MHz      Yes          Ver 2.0

Basic
E5-2609      2.4 GHz    10 MB    4       80 W     6.4 GT/s    1066 MHz      No           No
E5-2603      1.8 GHz    10 MB    4       80 W     6.4 GT/s    1066 MHz      No           No

Special Purpose / High Frequency
E5-2667      2.9 GHz    15 MB    6       130 W    8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2637      3.0 GHz    5 MB     2       80 W     8.0 GT/s    1600 MHz      Yes          Ver 2.0

Low Power
E5-2650L     1.8 GHz    20 MB    8       70 W     8.0 GT/s    1600 MHz      Yes          Ver 2.0
E5-2630L     2.0 GHz    15 MB    6       60 W     7.2 GT/s    1333 MHz      Yes          Ver 2.0

The following observations may be made from this table:
- As CPU frequency and core count are reduced, so are the supported memory speed and QPI frequency. Therefore, environments which need maximum memory performance will generally favor the Advanced CPU models.
- The E5-2667 and E5-2637 CPUs can be of particular interest in those (lightly threaded) compute environments where maximum core speed is more critical than the number of cores.
- The two Basic processor models support neither Turbo Boost nor Hyper-Threading, Intel's implementation of SMT technology.

1.1.2 Turbo Boost 2.0


For those processor models which support it, Turbo Boost enables one or more
processor cores to run above their rated frequency if certain conditions are met.
Introduced in Sandy Bridge, Turbo Boost 2.0 provides greater performance
improvements than the Turbo Boost of prior generation processors. In addition, while
Turbo Boost in the E5600 series generally reduced Performance/Watt, Turbo Boost 2.0 in
Sandy Bridge processors can actually increase performance/watt in many applications.

Activated when the operating system transitions to the highest performance state (P0),
and using the integrated power and thermal monitoring capabilities of the Sandy Bridge
processor, Turbo Boost exploits the available power and thermal headroom of the CPU to
increase the operating frequency on one or more cores. The maximum Turbo frequency that a core is able to run at is limited by the processor model and dependent on the
number of cores that are actively running on a processor socket. When more cores are
inactive, and are therefore able to be put to sleep, more power and thermal headroom
becomes available in the processor and higher frequencies are possible for the remaining
cores. Thus, the maximum turbo frequency is possible when just 1 core is active, and the
remaining cores are able to sleep. When all cores are active, as is common in many
cluster applications, a frequency falling between the Max Turbo 1-Core Active and the processor's Rated Core Speed is achieved, as illustrated in the following table:

Table 1-3 Maximum Turbo Upside by Sandy Bridge CPU model

Processor    Rated Core    Cores    TDP      Max Turbo        Max Turbo
Model        Speed                           1 Core Active    All Cores Active

E5-2680      2.7 GHz       8        130 W    3.5 GHz          3.1 GHz
E5-2670      2.6 GHz       8        115 W    3.3 GHz          3.0 GHz
E5-2665      2.4 GHz       8        115 W    3.1 GHz          2.8 GHz
E5-2660      2.2 GHz       8        95 W     3.0 GHz          2.7 GHz
E5-2650      2.0 GHz       8        95 W     2.8 GHz          2.4 GHz
E5-2640      2.5 GHz       6        95 W     3.0 GHz          2.8 GHz
E5-2630      2.3 GHz       6        95 W     2.8 GHz          2.6 GHz
E5-2620      2.0 GHz       6        95 W     2.5 GHz          2.3 GHz
E5-2609      2.4 GHz       4        80 W     N/A              N/A
E5-2603      1.8 GHz       4        80 W     N/A              N/A
E5-2667      2.9 GHz       6        130 W    3.5 GHz          3.2 GHz
E5-2637      3.0 GHz       2        80 W     3.5 GHz          3.5 GHz
E5-2650L     1.8 GHz       8        70 W     2.3 GHz          2.0 GHz
E5-2630L     2.0 GHz       6        60 W     2.5 GHz          2.3 GHz

With Sandy Bridge and Turbo Boost 2.0, the core is permitted to operate above the
processor's Thermal Design Power (TDP) for brief intervals, provided the CPU has
thermal headroom, is operating within current limits, and is not already operating at its
Max Turbo frequency. The amount of time the core is allowed to operate above TDP is
dependent on the application-specific power consumption measured before and during
the above TDP interval, where energy credits are allowed to build up when operating
below TDP, then get exhausted when operating above TDP. In practice, only highly
optimized floating-point-intensive routines, often exploiting AVX optimization, can stress
the core enough to push above TDP, and the duration it can run above TDP generally
lasts only a couple of seconds, but this depends largely on the workload characteristics.

For compute workloads like Linpack, which operate at sustained levels for extended
periods, the brief period of increased frequency while running above TDP returns a
minimal net performance gain. This is because the brief duration of increased frequency
is only a small part of the overall workload time, and the high steady-state loading never
drops significantly below TDP during the measurement interval. Since this sustained high (near-TDP) loading prevents energy credits from building back up, the processor is
unable to exceed TDP throughout the remainder of the benchmark measurement interval.

For more bursty, real-world applications, the ability to operate above TDP for brief intervals can return an incremental performance boost. In this bursty application scenario, the processor spends short intervals below TDP where energy credits are able
to build up, then exhausts those energy credits when operating above TDP. Because
more time is spent above TDP for this case, the performance gains realized for Turbo
Boost are greater.

It is important to note that the maximum Turbo Boost upsides listed in Table 1-3 are not
guaranteed for all workloads. For workloads with heavy power and thermal
characteristics, specifically AVX-optimized routines like Linpack, a processor may run at
a frequency lower than its listed Max Turbo frequency. In these specific high-load
workload cases, the core will run as fast as it can while staying at or under its TDP. The
only frequency guaranteed for all workloads is the processor's rated frequency, though
in practice some portion of the Turbo Boost capability is still possible even with highly
optimized AVX codes.

Finally, since any level of Turbo Boost above the All Cores Active frequency is dependent on at least some of the cores being in ACPI C2 or C3 sleep states, these C-States must remain enabled in system setup.
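
A quick way to confirm that Turbo Boost is actually engaging under load is to compare the rated clock with the instantaneous clock while the workload runs. A minimal sketch, assuming a Linux system that exposes the cpufreq interface in sysfs (depending on the driver, the reported maximum may include a single nominal turbo bin):

    # Maximum frequency known to the cpufreq driver (kHz):
    cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
    # Instantaneous clock of each core while the workload is running (MHz):
    grep "cpu MHz" /proc/cpuinfo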


1.2 System
The dx360 M4 introduces some key system-level features enabling maximum performance levels to be sustained. By subsystem, these include:

Memory:
- 4x 1600 MHz capable DDR3 memory channels per processor
- 2 DIMMs per memory channel
- 16 total DIMMs, supporting a total capacity of up to 256 GB

Processor Interconnect:
- Dual QPI interconnects, operating at up to 8 GT/s

Expansion Cards:
- Each processor supplies 24 lanes of Gen3 PCIe to a PCIe riser card
- Each PCIe Gen3 riser provides one x16 slot (1U riser), or one x16 slot and one x8 slot (2U riser)

Communications:
- Integrated dual-port Intel 1 gigabit Ethernet controller for basic connectivity needs
- Mezzanine card options of either 10 Gbit Ethernet or FDR InfiniBand, without consuming a PCIe slot
The physical topology of these parts is depicted in the following block diagram. Note that
the interconnecting buses are also shown, since understanding which CPU these buses
connect to can be the key to tuning the locality of system resources.

Figure 1-2 dx360 M4 Block Diagram with Data Buses


[Block diagram: two Sandy Bridge (SDB) CPUs, CPU0 and CPU1, joined by dual QPI interconnects. Each CPU drives 4 DDR3 memory channels (DIMM slots 1-8 on CPU0, slots 9-16 on CPU1) and supplies x24 of PCIe Gen3 to a riser slot (1U riser: one FHHL x16; 2U riser: one FHHL x8 plus one FHFL x16). The PCH, the 1 Gb Ethernet controller, and the 10 Gb / InfiniBand mezzanine card attach to CPU0. The front of the system is at the bottom of the diagram.]
Not explicitly shown in this diagram, but key to many workloads, is the storage connectivity. Depending on requirements, each compute node can be configured with one 3.5" SATA drive, up to two 2.5" SAS/SATA disks, or up to four 1.8" SSDs.
Connection to these drives is via two 6 Gbps SATA ports provided by the Intel C600
Chipset (PCH), or via an optional RAID card. More detail on these options can be found
in the dx360 M4 Product Guide.

Note also that the dx360 M4 uses dual coherent QPI links to interconnect the CPUs.
Data traffic across these links is automatically load balanced to ensure maximum
performance. Combined with up to 8 GT/s speeds, this capability enables significantly
higher remote node data bandwidths than prior generation platforms.

1.2.1 I/O and Locality Considerations


While Non-Uniform Memory Architecture (NUMA) is not a new characteristic for CPU and
Memory resources, the integration of the I/O bridge into the Sandy Bridge processor has
introduced an added complexity for those looking to extract optimal node performance.
As can be seen from Figure 1-2, the integrated 1 Gigabit Ethernet ports, high speed
Mezzanine card, and the PCIe slots of the Right Side riser card are local to only CPU0,
while the balance of the PCIe slots of the Left Side riser card are connected to CPU1.

This Non-Uniform I/O architecture enables very high performance, low latency I/O accesses to a given processor's I/O resources, but the possibility does exist for I/O access to a remote processor's resources, requiring traversal of the QPI links. While this worst-case remote I/O access is still generally faster than the best-case performance of the E5600, it is important to understand that some I/O accesses can be faster than others with this architecture. With that in mind, the end user may choose to implement I/O tuning techniques to pin the system software to local I/O resources if maximum I/O performance is required. This may be especially important for those environments implementing GPU solutions.
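
For example, a process whose traffic should stay local to the InfiniBand mezzanine card (attached to CPU0) can be pinned to that socket's cores and memory. A minimal sketch using numactl, assuming NUMA node 0 corresponds to CPU0 (verify with the first command) and that ./app stands in for your application:

    numactl --hardware                         # confirm node/CPU/memory topology
    numactl --cpunodebind=0 --membind=0 ./app  # run on CPU0 cores and memory only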

Additional detail covering supported I/O adapters and GPUs is available in the dx360 M4
Product Guide

1.2.2 Memory Subsystem


Multiple DIMM options are available to fit most application requirements, including both
1.5V 1600 MHz capable DIMMs, as well as 1.35V low power options. Unbuffered and
Registered DIMMs are available to fit the reliability and performance objectives of the
deployment, and capacities from 2 GB to 16 GB are available at product launch.

The speed at which the entire memory subsystem is clocked is determined by the lower of:
1) the CPU's maximum supported memory speed, as indicated in Table 1-2
2) the speed of the slowest DIMM channel on the system.
The maximum operating speed of each DIMM channel is dependent on the capability of the DIMMs used, the speed and voltage that the DIMM is configured to run at, and the number of DIMMs on the memory channel.
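
As a worked example (values from Table 1-2 and Table 1-4): an E5-2670 supports memory at up to 1600 MHz, but if even one channel is populated with 1333 MHz rated RDIMMs, the entire memory subsystem is clocked down to 1333 MHz.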

The dx360 M4's supported DIMMs, and the maximum frequencies at which these DIMMs can operate under various configurations and power settings, are listed below.

Copyright 2012 IBM Corporation Page 17 of 153


Performance Guide for HPC Applications on iDataPlex dx360 M4 systems

Table 1-4 Supported DIMM types

Part      Feature  Size  Volt  Type    DIMM    Max DIMM     Max Speed  Max Speed  Max Speed  Max Speed
number    code     (GB)  (V)           Layout  Speed (MHz)  1 DPC      2 DPC      1 DPC      2 DPC
                                                            1.35V      1.35V      1.5V       1.5V

49Y1403   A0QS     2     1.35  UDIMM   1R x8   1333         1333       1333       1333       1333
49Y1404   8648     4     1.35  UDIMM   2R x8   1333         1333       1333       1333       1333
49Y1405   8940     2     1.35  RDIMM   1R x8   1333         1333       1333       1333       1333
49Y1406   8941     4     1.35  RDIMM   1R x4   1333         1333       1333       1333       1333
90Y3178   A24L     4     1.5   RDIMM   2R x8   1600         N/A        N/A        1600       1600
49Y1397   8923     8     1.35  RDIMM   2R x4   1333         1333       1333       1333       1333
*         *        8     1.5   RDIMM   2R x4   1600         N/A        N/A        1600       1600
*         *        16    1.35  RDIMM   2R x4   1333         1333       1333       1333       1333

(DPC = DIMMs per channel)

* These part details were not available during the writing of this paper, but are expected
to be announced by the time this paper is published. See the product pages for further
DIMM details.

Note that 1.35 V DIMMs are able to operate at 1.5 V, and this will occur in configurations which mix 1.35 V and 1.5 V DIMMs. Memory speed and power settings are available from within the UEFI configuration menus, under System Settings -> Memory, or via ASU as discussed in section 1.2.3.1.
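
For scripted cluster deployments, the same memory settings can be inspected through ASU (see section 1.2.3.1). A sketch; exact setting names vary by firmware level, so list them first rather than assuming:

    asu show all | grep -i memory      # find the memory-related setting names
    asu showvalues <setting>           # then inspect the permitted values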

1.2.2.1 Memory Population for Performance


Optimal memory performance is highly dependent on DIMM choice and placement within
the system. Since the memory subsystem is involved in nearly all aspects of compute
and data movement in a system or cluster, it is one of the single most important focus
areas for a high performance solution.

Optimizing performance of the memory subsystem requires adherence to a few simple rules:
1) Ensure that all memory channels are populated. Memory performance scales almost linearly with the number of populated channels. Each Sandy Bridge-EP processor has 4 memory channels, so a typical 2-processor system should populate a minimum of 8 DIMMs. If only 4 DIMMs are populated, over 40% of the platform's memory performance will be lost.
2) Populate each memory channel with the same capacity. This ensures that, on average, each channel will get the same loading. Unbalanced memory channels cause memory hot spots, and total bandwidth can be reduced by 30% or more, depending on the configurations used.
3) Where possible, use dual-rank DIMMs (2R) over single-rank DIMMs (1R). Though the gains are only a couple of percentage points at peak memory bandwidth, this can be significant enough to consider for some high performance environments.
4) Balance memory between CPU sockets. In most environments, this will help to ensure a balance of memory requests in the Non-Uniform Memory Architecture.
Combining these rules, we can assert the following general performance rule:

Eight DIMMs of identical size and type should be installed at a time, one per memory channel, in order to achieve maximum memory performance.


Using the DIMM slot numbering as indicated in Figure 1-2 above, identical DIMMs should be installed within each of the following 8-DIMM groups:
1) DIMMs 1, 3, 6, 8 (CPU0), and DIMMs 9, 11, 14, 16 (CPU1)
2) DIMMs 2, 4, 5, 7 (CPU0), and DIMMs 10, 12, 13, 15 (CPU1)
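
Once installed, the resulting population and per-socket balance can be sanity-checked from Linux. A minimal sketch (output formats vary by distribution and firmware):

    numactl --hardware                             # per-NUMA-node memory sizes should match
    dmidecode -t memory | grep -E 'Size|Locator'   # lists each DIMM slot and its capacity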

1.2.2.2 Memory Performance by Processor


As mentioned in section 1.1, memory performance is also impacted directly by the core
speed, since the cache and ring bus of the CPU are clocked in the same domain as the
core. Figure 1-3 shows unloaded memory latencies for different core clock speeds,
relative to a baseline of a Sandy Bridge processor running at 2.7 GHz.

Figure 1-3 Relative Memory Latency by Clock Speed


180 168
160 151
133 137
140
122
R elative L aten cy

115 115
120 108
100 103
100
80
60
40
20
0
X5670 - SDB - 2.7 SDB - 2.6 SDB - 2.4 SDB - 2.2 SDB - 2.0 SDB - 1.8 SDB - 1.6 SDB - 1.4 SDB - 1.2
2.93 GHz GHz GHz GHz GHz GHz GHz GHz GHz GHz

Processor Frequency (GHz)

As shown, the top Sandy Bridge processor frequencies have up to 15% lower latency than a prior generation Xeon X5670. With Turbo Boost enabled, this same Sandy Bridge processor is able to reach even higher clock speeds, reducing latencies by a further 5% or more. However, when clock frequencies are reduced, this has a direct impact on the memory subsystem, and latencies can increase drastically.

Memory throughput is also impacted by core clock frequency, though to a somewhat lower degree than latency, as shown in Figure 1-4.


Figure 1-4 Relative Memory Throughput by Clock Speed

[Bar chart: relative memory throughput versus SDB processor frequency, from 2.7 GHz down to 1.2 GHz; throughput falls as the core clock is reduced, though less steeply than latency.]

While processor ratings below 1.8 GHz are not supported on the dx360 M4, the lower frequencies shown are possible when processor P-states are enabled, which enable a power-saving down-clocking of the processor. While active usage of a low-frequency P-state by the OS will generally occur only during periods which lack performance sensitivity, there are specific cases where this can become an issue in real workloads.

Consider the case where an application may only exercise one processor socket (NUMA node) at a time. In this case, the 2nd processor socket may be allowed to downclock, or even sleep, assuming these capabilities are enabled in the System Settings and the OS. However, cache coherency operations and remote memory requests may still take place to the 2nd processor, which now has a critical component of its cache and memory subsystem, the ring bus, clocked at a reduced speed. For this reason, environments which may see this sort of unbalanced processor loading, specifically while demanding peak memory performance, should consider disabling processor P-states within System Settings, or setting the minimum processor frequency within the OS. This latter method is explained for Linux here.
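
A minimal sketch of the latter approach, assuming the Linux cpufreq sysfs interface is present and a governor that honors the minimum-frequency limit (run as root; 2700000 kHz corresponds to a 2.7 GHz rated part and should be replaced with your CPU's rated frequency):

    # Pin the minimum frequency of every core to the rated clock:
    for c in /sys/devices/system/cpu/cpu*/cpufreq; do
        echo 2700000 > $c/scaling_min_freq
    done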

1.2.3 UEFI
The platform UEFI, working in conjunction with the Integrated Management Module, is
responsible for control of the low level hardware system settings. Many of the tunings
used to optimize performance are available within the F1 System Setup menu,
presented during boot time. These settings are also available from a command line
interface using the IBM Advanced Settings Utility, or ASU.

Since the dx360 M4 is used in cluster deployments, this section will first introduce the ASU scripting tool, then provide UEFI tunings as implemented in ASU.

1.2.3.1 ASU
The Advanced Settings Utility is a key platform tuning and management mechanism to
read and write system configuration settings without having to manually enter F1 Setup
menus at boot time. Though changes to ASU settings still generally require a system
reboot to apply, this utility allows a consistent tuning of platforms and clusters using either
manual command line execution, or automated scripting.


Basic ASU commands

To capture all ASU supported system settings:
asu show all
To show permitted values of a specific setting:
asu showvalues <setting>
To set a setting to one of the permitted values:
asu set <setting> <value>
To apply a batch of settings from a file:
asu batch <file name>
batch file format, one entry per line:
set <setting> <value>
If <value> contains spaces, be sure to enclose the value string in quotation marks, e.g.:
asu set OperatingModes.ChooseOperatingMode "Maximum Performance"
All the above commands can be issued from a remote host, by appending the following:
--host <IMM IP Address> --user <IMM userid> --password <IMM password>
Note that for 64-bit OSs, the asu binary name is asu64
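
For example, a cluster-wide tuning could be captured in a batch file and pushed to each node's IMM remotely. The file name and settings below are illustrative only:

# hpc_tunings.txt -- example batch file
set OperatingModes.ChooseOperatingMode "Maximum Performance"
set Processors.Hyper-Threading Disable

asu64 batch hpc_tunings.txt --host <IMM IP Address> --user <IMM userid> --password <IMM password>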

Further information on ASU, including links to download Linux and Windows versions of
the tool are located here.

The ASU Users Guide can be found here.

1.2.3.2 UEFI Tunings


While the F1 System Setup menu and the asu show all command provide numerous
configuration parameters, there are a few key parameters that are of particular interest for
high performance computing.

Many environments will not need explicit control over individual platform settings with
Sandy Bridge, as the dx360 M4 can be optimized with just a couple of simple settings.
The UEFI provides four Operating Modes which cover the tuning requirements of the
most common usage scenarios. These are:
1) Minimal Power
This mode reduces memory and QPI speeds, and activates the most aggressive
power management features, the combination of which will have steep
performance implications. This setting would be recommended for those
environments which must minimize power consumption at all costs.
2) Efficiency Favor Power
This mode also limits QPI and memory speeds, but not as aggressively as
Minimal Power mode. Turbo Mode remains disabled in this operating mode.
3) Efficiency Favor Performance (System Default)
This is the default operating mode. Memory and QPI buses are run at their
maximum hardware supported speeds, though most power management
features remain enabled. Turbo mode is enabled, and processor sleep states
(C-States) are enabled and allowed to enter the deepest sleep levels. This mode
generally enables an optimal balance of performance/watt on Sandy Bridge.
4) Maximum Performance
This mode generally allows the highest performance levels for most applications,
though exceptions do occur. Processor C-States continue to remain enabled, as
these are necessary for optimal Turbo Mode performance. The C-State Limit is
reduced to ACPI C2 in this mode to minimize the latency associated with waking
the sleeping cores, and C1 Enhanced Mode is disabled. Processor Performance
States are disabled in this mode, ensuring the CPU is always operating at or
above rated frequency.
There is also a fifth mode, Custom, which enables all of the UEFI features to be set
individually for specific workload conditions. The following table covers the most
common performance-specific parameters, listed in ASU setting format:

Table 1-5 Common UEFI Performance Tunings

OperatingModes.ChooseOperatingMode
    Custom is required for most of the parameters below to be enabled
    (i.e. those indicating this mode as a dependency).
    Maximum Performance sets the Operating Mode as per above.
    Efficiency Favor Performance sets the Operating Mode as per above.
    Efficiency Favor Power sets the Operating Mode as per above.
    Minimal Power sets the Operating Mode as per above.

Processors.Hyper-Threading
    Enable for general purpose applications.
    Disable for some highly core-efficient HPC workloads.

Processors.HardwarePrefetcher
    Enable for most environments.
    Disable is used rarely, only with aggressive software prefetching.

Processors.AdjacentCachePrefetch
    Enable for most environments.
    Disable is used rarely, only with aggressive software prefetching.

Processors.DCUStreamerPrefetcher
    Enable for most environments.
    Disable is used rarely, only with aggressive software prefetching.

Processors.DCUIPPrefetcher
    Enable for most environments.
    Disable is used rarely, only with aggressive software prefetching.

Processors.TurboMode
(requires OperatingModes.ChooseOperatingMode "Custom")
    Enable for performance and performance/watt centric environments.
    Disable where minimum power consumption is required.

Processors.ProcessorPerformanceStates
(requires OperatingModes.ChooseOperatingMode "Custom")
    Disable prevents processor down-clocking, improving performance in some
    environments at the expense of increased power consumption.
    Enable allows the processor to down-clock, saving power.

Processors.C-States
(requires OperatingModes.ChooseOperatingMode "Custom")
    Enable allows the processor to enter deep sleep states, saving power and
    allowing maximum Turbo Mode frequency.
    Disable prevents deep sleep states and any potential wake up latency, but
    will also prevent Turbo Boost from going over the All Cores Active frequency.

Processors.PackageACPIC-StateLimit
(requires Processors.C-States Enable and OperatingModes.ChooseOperatingMode "Custom")
    ACPI C2 is the first deep sleep state, mapped to Intel's C3 state: the core
    PLLs are turned off and caches are flushed.
    ACPI C3 is the deepest sleep state, mapped to Intel's C6 state: it saves the
    core state to the LLC and uses power gating to significantly reduce core
    power consumption. This state has a higher wake up latency than ACPI C2.

Processors.C1EnhancedMode
(requires OperatingModes.ChooseOperatingMode "Custom")
    Disable to eliminate the possibility of wake up latencies affecting an
    application environment, at the cost of additional power usage.
    Enable allows the cores to enter the halt state. Even in this state, Turbo
    Boost considers this an active core, though some power savings are realized.
    Some compute environments may see small performance impacts from enabling
    this setting.

Processors.QPILinkFrequency
(requires OperatingModes.ChooseOperatingMode "Custom")
    Max Performance sets QPI to the maximum supported frequency.
    Balanced reduces QPI frequency by one stepping.
    Minimal Power reduces QPI frequency to the lowest supported speed.

Memory.MemorySpeed
(requires OperatingModes.ChooseOperatingMode "Custom")
    Max Performance sets memory to the fastest hardware supported speed.
    Balanced reduces memory speed for some DIMMs, but enables some power savings.
    Minimal Power runs memory in the lowest power mode, at the expense of
    performance.

Memory.MemoryPowerManagement
(requires OperatingModes.ChooseOperatingMode "Custom")
    Disable for best performance.
    Automatic to utilize the Sandy Bridge memory power savings logic. Memory
    latencies may increase in this mode.

Memory.SocketInterleave
    NUMA when using most applications in most OSs.
    non-NUMA for specific environments where NUMA isn't enabled in the OS or
    application. Proof of Concept testing is recommended before setting this to
    non-NUMA.

Memory.PatrolScrub
    Disable for environments where background memory scrubbing is unnecessary
    and for maximum performance.
    Enable for mission critical, high availability environments, though some
    memory performance loss is possible.

Memory.PagePolicy
    Adaptive provides optimal performance for most environments.
    Open forces the Open Page policy.
    Closed forces the Closed Page policy.

Power.ActiveEnergyManager
    Disable for environments not managed by AEM.
    Enable for AEM managed environments.

Power.PowerPerformanceBias
(requires OperatingModes.ChooseOperatingMode "Custom")
    Platform Controlled allows the system to control how aggressive the CPU
    power management and Turbo Boost are.
    OS Controlled allows the OS to control power management aggressiveness,
    though only select OSs support this.

Power.PlatformControlledType
(requires OperatingModes.ChooseOperatingMode "Custom" and
Power.PowerPerformanceBias "Platform Controlled")
    Controls Turbo Mode and Power Management aggressiveness:
    Maximum Performance           Turbo Mode: Highest    Power Management: Low
    Efficiency Favor Performance  Turbo Mode: High       Power Management: Moderate
    Efficiency Favor Power        Turbo Mode: Moderate   Power Management: High
    Minimal Power                 Turbo Mode: Low        Power Management: Highest

Power.WorkloadConfiguration
    Balanced for most workloads.
    I/O sensitive for specific cases where high I/O bandwidth is needed when the
    CPU cores are idle, allowing sufficient frequency for the workload.

Each of these parameters can be set from the ASU as well as from the corresponding
menu in F1 System Setup.

Fundamentally, setting the Operating Mode to either the default of Efficiency Favor
Performance or Maximum Performance, depending on the performance and power
sensitivity of the environment, combined with application-driven optimization of the
Hyper-Threading setting, will provide excellent performance results for a given application
environment.
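
As a minimal illustration, those two tunings could be applied from the command line as follows (the values shown are examples, not universal recommendations):

asu64 set OperatingModes.ChooseOperatingMode "Maximum Performance"
asu64 set Processors.Hyper-Threading Disable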

1.3 Mellanox InfiniBand Interconnect


The iDataPlex solution using Mellanox technology is achieved using several
components:
- Expansion InfiniBand Host Channel Adapters (HCAs), based on Mellanox ConnectX-3
  technology, operating at 40 Gbps per direction per link (FDR10) or at 56 Gbps per
  direction per link (FDR)
- An external switch used to provide fabric connections for the Expansion HCAs
The Expansion HCAs are mezzanine cards that do not take up a PCIe slot. They each
have 2 ports.

As noted above, HCAs are available at two InfiniBand technology speeds: FDR10 and
FDR. InfiniBand speed terminology is described first, followed by FDR10 and then FDR.

The base data rate for InfiniBand technology has been the single data rate (SDR), which
is 2.5Gbps per lane or bit in a link. Previous interconnect technologies, up to FDR10,
have used this SDR reference speed. The standard width of the interface is 4X or 4 bits
wide. Therefore, the standard SDR bandwidth or speed of a link is 2.5Gbps times 4 bit
lanes, or 10Gbps. DDR, or double data rate, is 20Gbps per 4x link. QDR, or quad data
rate, is 40Gbps per link.

FDR10 is based on FDR technology, with the 10 appended to indicate the bit lane
speed of 10Gbps. An important difference between FDR10 and QDR is that FDR10
transfers data more efficiently than QDR, because it inherits the FDR encoding
scheme, as described below.

With FDR, the nomenclature deviates from basing the speed on a multiple of
2.5Gbps. FDR stands for fourteen data rate, or a bit speed of 14Gbps, which translates
into a 4x link speed of 56Gbps.

More information on InfiniBand architecture is available on the InfiniBand Trade
Association (IBTA) website at http://infinibandta.org

While FDR10 is nominally the same speed (40Gbps per 4x link) as the previous
generation (QDR), there is a different encoding of the data on the link that allows for
more efficient use of the link while still providing data protection. The QDR technology
used an 8b/10b encoding, which yields 80% efficiency for every bit of data payload sent
across the link. FDR10 uses a 64b/66b encoding, which yields 97% efficiency. In other
words, the effective rate of a QDR link is 32Gbps, whereas FDR10 has an effective data
rate of 38.8 Gbps. In both cases, a modest number of bits within the effective data rate is
consumed by packet overhead.

By using the same nominal speed as QDR, FDR10 can use the same basic cable
technology as QDR, which has helped with getting the improved FDR link efficiency to
market quicker.

To achieve FDR10 efficiencies, the HCAs must be attached to switches that support FDR
bit encoding. If they are attached to QDR switches, the HCAs will operate, but at QDR
rates and efficiencies.

FDR operates at 56Gbps per 4x link. It maintains the same bit encoding as FDR10 and
therefore the same 97% efficiency of the link. This yields an effective data rate of 54.3
Gbps. To achieve full FDR rates, the HCAs must be attached to switches that support full
FDR rates. If they are attached to switches that support a maximum of QDR or FDR10
rates, the HCAs will operate, but at the lower speeds.
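
Summarizing the arithmetic behind these effective rates:

QDR:   40 Gbps x 8/10  = 32.0 Gbps effective
FDR10: 40 Gbps x 64/66 = 38.8 Gbps effective
FDR:   56 Gbps x 64/66 = 54.3 Gbps effective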

The Mellanox model numbers for currently supported FDR10/FDR switches are:
SX6036 = a 36 port Edge switch.
SX6536 = a 648 port Director switch, which scales from 8.64 Tbps up to 72.52 Tbps
of bandwidth in a single enclosure.
Both switch models are non-blocking. Both switch models can support any speed up to
FDR (including FDR10).

Edge switches are typically used for small clusters of servers or as top-of-rack switches
that provide edge or leaf connectivity to one or more Director switches implemented as
core switches. This allows for scaling beyond 648 nodes.

It is also possible to use the SX6536 to connect up to 648 HCAs in a single InfiniBand
subnet, or plane.

The typical large scale solution is implemented as a fat-tree to maintain 100% bi-
sectional bandwidth for any to any node communication. For example, for a cluster of
1296 nodes, each with one connection into a plane, the typical topology would be to have
72 SX6036 Edge switches distributed amongst the frames of dx360 M4 servers. This
has 18 servers connected to each Edge switch. The Edge switches will then connect to
two SX6536 Director switches with 9 cables from each of the Edge switches connecting
to each of the Director switches.
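
The port arithmetic behind this topology works out as follows:

72 Edge switches x 18 node links     = 1296 node connections
2 Director switches x 9 cables each  = 18 uplinks per Edge switch
18 down + 18 up per 36-port SX6036   = 1:1 (non-blocking) at each Edge switch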

It is possible to over-subscribe switch connectivity to reduce the number of required
switches in very large fabrics, and thus reduce cost. However, care must be taken in
doing this for solutions that require any node to communicate with any other node in a
random fashion, because the oversubscribed networks can cause data traffic congestion.
However, if data patterns are well understood from the onset, or if applications run on
only portions of the cluster, then it may be possible to design an oversubscribed fabric
that does not lead to excessive congestion.

Another consideration in implementing the InfiniBand interconnect is configuring the
subnet manager. Some considerations are alternative routing algorithms, alternative MTU
settings, and Quality of Service functions.

Various subnet managers typically have several possible routing algorithms such as
Minimum number of Hops (default), up-down, fat tree, and so on. It is recommended that
the various options be discussed with a routing expert before deviating from the default
algorithm. Parameters like the types of applications, the chosen topology and the
experiences of a particular algorithm in the field should be considered.

Some current options for Mellanox routing methods are:

MINHOP or shortest path optimizes routing to achieve the shortest path between two
nodes. It balances routing based on the number of paths using each port in the fabric.
This is the default algorithm.

UPDN or up-down provides a shortest path optimization, but also considers a set of
ranking rules. This is designed for a topology that is not a pure Fat tree, and has potential
deadlock loops that must be avoided.

FAT TREE can be used for various fat-tree topologies as it optimizes for various
congestion-free communication patterns. It is similar to UPDN in that it also employs
ranking rules.

LASH or layered shortest path uses InfiniBand virtual layers (SL) to provide dead-lock
free shortest path routing.

DOR or dimension ordered routing provides deadlock free routes for hypercube and
mesh topologies.

FILE or file-based loads the routing information directly from a file. This would be for very
specialized applications and has the disadvantage that it restricts the subnet manager's
ability to dynamically react to changes in the topology.

UFM TARA or traffic aware routing is unique to Mellanox's Unified Fabric Manager
(UFM) and combines UPDN with traffic-aware balancing that includes application
topology and weighted patterns. This requires UFM to work in concert with applications
and job managers to maintain awareness of traffic patterns so that it can dynamically
optimize routing. Therefore, it may not be possible to use this algorithm with all solutions.


The Mellanox subnet manager also includes support for adaptive routing. Adaptive
routing allows the switch to choose how to route a packet based on the availability of
alternative ports along the path from one node to another. This works as a packet
traverses the fabric from the source node to halfway out in the fabric. Once the halfway
point is reached, the remainder of the path is predestined and no more choices are
available. If the congestion pattern tends to be in the first half of the route, this can be an
effective tool. If the congestion pattern is in the back half of the route, adaptive routing is
less effective; for example, for many-to-one patterns, the congestion starts at the
destination node end and backs up into the fabric.

The Mellanox subnet manager also includes support for Quality of Service (QoS). It uses
service lanes (or virtual lanes) and a weight factor for each lane to ensure that higher
priority data traffic is separated from and takes precedence over lower priority data traffic.
In this way, the higher priority traffic avoids being delayed by lower priority traffic. To take
advantage of QoS, the applications, MPI and RDMA stack must be implemented in a way
that uses service lanes. As this is not always the case, some solutions are limited to
separating IP over InfiniBand (IPoIB) traffic from RDMA traffic by taking advantage of the
ability to assign a non-default service lane for IPoIB.

For certain applications it may be possible to improve performance by using a non-default
InfiniBand link MTU or multicast MTU. The default is 2K, but some success has been
seen in the past with 4K. Typically, the starting MTU should be 2K and only adjusted to
4K if performance targets are not being achieved. Also, a performance expert should be
consulted about the MPI solution before attempting a non-default MTU setting.

Note: The InfiniBand MTU is different from an IP MTU. It is a maximum transmission unit
at the physical layer, or the size of packets in the fabric itself. Larger IP packets bound by
the IP MTU are broken down into smaller packets in the physical layer bound by the
InfiniBand MTU.

While it is not often used in the industry, smaller fabric solutions can sometimes benefit
from LMC (LID mask control) being set to 1 or 2. A non-zero LMC causes the SM to
assign multiple LIDs to each device, and then generate a different path to each LID.
While originally envisioned as a failover mechanism, this also allows for upper layer
protocols to scatter traffic over several paths with the intention of reducing congestion.
This requires an RDMA stack that is aware of the multiple paths provided by a non-zero
LMC so that the path can be periodically switched according to some algorithm (like
round-robin, or least recently used). There is a cost associated with LMC > 0, in that each
port is assigned multiple LIDs (local identifiers) and this will take up more buffer and
memory space. It will also affect start-up time for RC (reliable) connections. As a cluster
is scaled up, the impact becomes more noticeable. In fact, if a cluster gets large enough,
the hardware may run out of space to support the number of buffers required for LMC >
0. Typically, a performance expert should be consulted on the MPI solution to see if there
is any benefit to LMC > 0.

Finally, a pro-active method for monitoring the health of the fabric can help maintain
performance. With the current InfiniBand architecture, errors in the fabric require a
retransmission from source to destination. This can be costly in terms of latency and lost
effective bandwidth as packets are retransmitted. Again, the recommendation is to
consult with an expert in fabric monitoring to understand how best to determine that the
network fabric is healthy. Some considerations are:
Monitoring error counters and choosing thresholds that are appropriate for
the expected bit error rate (BER) of the fabric technology.
Typically, default thresholds are only adequate for links that greatly exceed the
acceptable BER. For links that are noisy and can impact performance, but are
barely over the acceptable BER, default thresholds are likely to be inadequate. A
time-based threshold is preferred. However, many basic monitoring tools only
have count-based thresholds (ignoring the bit error rate), which leads to the need
to develop a local strategy for regularly clearing the error counters to impose a
rough time base on the count threshold. An expert should be consulted for the
appropriate bit error rate. In many cases, this is currently in the 10^-15 to 10^-14
range.
Monitoring for lost links in the fabric that can lead to imbalanced routes
and congestion.
Monitoring link speeds and widths.
When a link is established it is trained to the highest speed at which it can
operate based on both the inherent limitations of the technology (FDR10, FDR,
QDR, and so on) as well as the particular instance of hardware on the link. A
noisy cable or port may be tuned to a lower speed or smaller width to allow it to
operate. Quite often the switch technology will allow a system administrator to
override this and force a link to operate only at the maximum. However, the
default case is to tune to whatever the link can handle. Therefore, unless the
default is overridden, the system administrator will want to monitor for
speed or width changes (particularly across node or switch reboots). For FDR10, it
is particularly important to observe whether or not the link has
come up at FDR10 versus QDR. It is possible that the link can handle 10Gbps
per bit lane, but not tune to the 64b/66b encoding. Tools vary in
reporting whether 10Gbps is QDR or FDR10; some will only report if it is sub-optimal.

1.3.1 References
[1] InfiniBand Linux SW Stack

[2] Routing Algorithm Info (Requires a userid with Mellanox support access).


2 Performance Optimization on the CPU


Compilers are the basic tools applied to HPC applications to get good performance from
the target hardware. Usually, several compilers are available for each architecture. The
first section of this chapter discusses uses of the widely adopted open source GNU
compiler suite and the proprietary Intel compiler suite. Other alternatives are listed but not
described.

Along with compilers, mathematical libraries are heavily used in HPC codes and are
critical for high performance. Section 2.2 introduces the Intel MKL library optimized for
Intel processors and briefly mentions the other alternatives.

The most important languages for High Performance Computing are FORTRAN, C, and
C++ because they are the ones used by the vast majority of codes. This chapter focuses
on those languages, though compiler providers also support others.

All of the descriptions of the GNU and Intel compiler options included in the tables in this
chapter are taken from the GNU Optimization Options Guide and the Intel Fortran
Compiler User and Reference Guide.

2.1 Compilers

2.1.1 GNU compiler


The GNU compiler tool chain contains front ends for C, C++, Objective-C, FORTRAN,
Java, Ada, and Go, as well as libraries for these languages. These compilers are 100%
free software and can be used on any system, independently of the type of hardware.
GCC is the default compiler for the Linux operating system and is shipped with it.

Support for the AVX vector units introduced with the Sandy Bridge processor was
announced with the 4.6.3 release of the GCC compilers. GCC version 4.7.0 implements
several new features, detailed in the compiler options section below.

The following focuses on gfortran version 4.7.0.

GCC 4.7.0 has been successfully built on a Sandy Bridge system by using the following
configure options:
export BASEDIR=/path/to/GCC-4.7.0
export LD_LIBRARY_PATH=${BASEDIR}/dlibs/lib:$LD_LIBRARY_PATH
${BASEDIR}/gcc-4.7.0-RC-20120314/configure \
--prefix=${BASEDIR}/install \
--with-gmp=${BASEDIR}/dlibs \
--with-gmp-lib=${BASEDIR}/dlibs/lib \
--with-mpfr-lib=${BASEDIR}/dlibs/lib \
--with-mpfr=${BASEDIR}/dlibs \
--with-mpc-lib=${BASEDIR}/dlibs/lib \
--with-mpc=${BASEDIR}/dlibs \
--with-ppl=${BASEDIR}/dlibs \
--with-ppl-lib=${BASEDIR}/dlibs/lib \
--with-libelf=${BASEDIR}/dlibs \
--enable-languages=c,c++,fortran --enable-shared --enable-threads=posix \
--enable-checking=release --with-system-zlib --enable-__cxa_atexit \
--disable-libunwind-exceptions --enable-libgcj-multifile


2.1.2 Intel Compiler


The Intel compiler suite is traditionally the most efficient compiler for Intel processors,
as it is designed by Intel. The latest features of the Intel Sandy Bridge processor are
implemented in the current version of the Intel Composer XE product, version 12.1. It
includes FORTRAN, C, and C++ compilers, the MKL library, and other components (Cilk,
IPP, TBB, etc.).

Several additional tools are available to help analyze and optimize HPC codes on
Intel systems. These tools are described in Chapter 5.

2.1.3 Recommended compiler options


This section presents the compiler options which have the greatest impact on improving
the performance of HPC applications. The compiler reference documents can
be consulted for a full list of options.

The options can be separated into categories: architecture, general optimization,
vectorization, inter-procedural optimization, shared memory parallelization, and
distributed memory parallelization. These are the most important options for improving
the performance of an HPC application.

2.1.3.1 Architecture
In order to efficiently optimize a program for a specific processor, the compiler needs
information on the architecture's details. It is then able to use adapted parameters for its
internal optimization engines and generate optimized assembly code matching the
identified hardware as closely as possible: cache sizes, vector unit details, prefetching
engines, etc.

When compiling on the same architecture as the one used for the computation, the
compiler (through appropriate compiler options) can automatically detect the processor
details and generate optimal settings without user interaction.

For cross compiling, the user has to specify the target architecture to the compiler.

Another case is when a user wants a binary that can run on all architectures in the
processor family (for instance, the x86 family). This is best suited to creating binaries
for pre- or post-processing programs that don't require significant computing power but
can be conveniently run on various systems within the same processor family without
recompilation.

2.1.3.1.1 GNU
The following options are used by GNU compiler to specify the hardware architecture to
be used for code generation:
-march= Generate code for given CPU
-mtune= Schedule code for given CPU
The corei7-avx argument tells the compiler to generate code for Sandy Bridge and use
AVX instructions:
-march=corei7-avx
By default, it also enables the -mavx compiler flag for autovectorization.
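
For example, a typical compile line for a Sandy Bridge system might look like the following (the source file name is illustrative):

gfortran -O3 -march=corei7-avx -c kernel.f90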


Table 2-1 summarizes the other options that are available:

Table 2-1 GNU compiler processor-specific optimization options


-mavx                            Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1,
                                 SSE4.2 and AVX built-in functions and code
                                 generation

-mavx256-split-unaligned-load    Split 32-byte AVX unaligned loads

-mavx256-split-unaligned-store   Split 32-byte AVX unaligned stores

-mfma                            Support MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1,
                                 SSE4.2, AVX and FMA3 built-in functions and
                                 code generation

-mfma4                           Support FMA4 built-in functions and code
                                 generation

-mprefer-avx128                  Use 128-bit AVX instructions instead of
                                 256-bit

2.1.3.1.2 Intel

For local or cross compilation, the -x<code> option specifies the hardware architecture
(<code> in this example) to be used for code generation. For the Sandy Bridge
architecture, using
-xAVX
may generate SIMD instructions for Intel processors.

If the code is being compiled on the same processor that will be used for computation
(local compilation), the following option produces optimized code:
-xhost
It tells the compiler to generate instructions for the most complete instruction set available
on the host processor.

For compatibility with GCC, the Intel compiler accepts GNU syntax for some options; the
-mavx and -march=corei7-avx flags are equivalent to -xAVX

Options -x and -m are mutually exclusive. If both are specified, the compiler uses the last
one specified and generates a warning.
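
As an illustration, either of the following compile lines targets the Sandy Bridge AVX unit (the source file name is hypothetical):

ifort -O3 -xAVX -c kernel.f90     # explicit AVX target, e.g. for cross compilation
ifort -O3 -xhost -c kernel.f90    # match the instruction set of the build host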

The Intel compiler ignores the options in Table 2-2. These options only generate a
warning message. The suggested replacement options should be used instead.

Table 2-2 A mapping between GCC and Intel compiler options for processor architectures
GCC Compatibility Option Suggested Replacement Option

-mfma -march=core-avx2

-mbmi, -mavx2, -mlzcnt -march=core-avx2

-mmovbe -march=atom -minstruction=movbe

-mcrc32, -maes, -mpclmul, -mpopcnt -march=corei7


-mvzeroupper -march=corei7-avx

-mfsgsbase, -mrdrnd, -mf16c -march=core-avx-i

2.1.3.2 General optimization


All compilers support several levels of optimization that can be applied to source code.
The more aggressive the optimization level, the faster the code runs, but with a higher risk
of generating significant numerical differences in the results. Balancing performance
against numerical consistency is the code owner's responsibility.

The Intel compiler includes many compiler options that can affect the optimization and
the subsequent performance of the code, but this section only touches on the most
common options.

2.1.3.2.1 GNU
The GCC/GFortran compiler has to be configured and compiled on a specific target
system, so it may not support some features and compiler technologies depending on the
configure arguments. The compiler explicitly reports on the exact set of optimizations that
are enabled at each level by including the Q help=optimizers option:
$ gfortran -Q --help=optimizers
The general optimization levels are listed in Table 2-3

Table 2-3 General GNU compiler optimization options


GCC compiler option Explanation

-O0 Reduce compilation time and keep execution in the same


order as the source line listing so that debugging
produces the expected results. This is the default.


-O      The first optimization level. Optimizing compilation
-O1     takes more time and a lot more memory for a large
        function.
        With -O / -O1, the compiler tries to reduce code size
        and execution time, avoiding any optimizations that
        would significantly increase compilation time.
        -O turns on the following optimization flags:
-fauto-inc-dec
-fcompare-elim
-fcprop-registers
-fdce
-fdefer-pop
-fdelayed-branch
-fdse
-fguess-branch-probability
-fif-conversion2
-fif-conversion
-fipa-pure-const
-fipa-profile
-fipa-reference
-fmerge-constants
-fsplit-wide-types
-ftree-bit-ccp
-ftree-builtin-call-dce
-ftree-ccp
-ftree-ch
-ftree-copyrename
-ftree-dce
-ftree-dominator-opts
-ftree-dse
-ftree-forwprop
-ftree-fre
-ftree-phiprop
-ftree-sra
-ftree-pta
-ftree-ter
-funit-at-a-time


-O2     The second optimization level. GCC performs nearly
        all supported optimizations that do not involve a
        space-speed tradeoff. As compared to -O, this option
        increases both compilation time and the performance
        of the generated code.
        -O2 turns on all optimization flags specified by -O.
        It also turns on the following optimization flags:
-fthread-jumps
-falign-functions -falign-jumps
-falign-loops -falign-labels
-fcaller-saves
-fcrossjumping
-fcse-follow-jumps -fcse-skip-blocks
-fdelete-null-pointer-checks
-fdevirtualize
-fexpensive-optimizations
-fgcse -fgcse-lm
-finline-small-functions
-findirect-inlining
-fipa-sra
-foptimize-sibling-calls
-fpartial-inlining
-fpeephole2
-fregmove
-freorder-blocks -freorder-functions
-frerun-cse-after-loop
-fsched-interblock -fsched-spec
-fschedule-insns -fschedule-insns2
-fstrict-aliasing -fstrict-overflow
-ftree-switch-conversion -ftree-tail-merge
-ftree-pre
-ftree-vrp

-Os     Optimize for size. -Os enables all -O2 optimizations
        that do not typically increase code size. It also
        performs further optimizations designed to reduce
        code size.
        -Os disables the following (-O2) optimization flags:
-falign-functions
-falign-jumps
-falign-loops
-falign-labels
-freorder-blocks
-freorder-blocks-and-partition
-fprefetch-loop-arrays
-ftree-vect-loop-version

-O3     The third optimization level. -O3 turns on all
        optimizations specified by -O2 and also turns on the
        following options:
-finline-functions
-funswitch-loops
-fpredictive-commoning
-fgcse-after-reload
-ftree-vectorize
-fipa-cp-clone


-freciprocal-math
        Allow the reciprocal of a value to be used instead of
        dividing by the value if this enables optimizations.
        For example x / y can be replaced with x * (1/y),
        which is useful if (1/y) is subject to common
        subexpression elimination. Note that this loses
        precision and increases the number of flops operating
        on the value.
        The default is -fno-reciprocal-math.

-Ofast  Disregard strict standards compliance. -Ofast enables
        all -O3 optimizations. It also enables optimizations
        that are not valid for all standards-compliant
        programs. It turns on -ffast-math and the Fortran-
        specific options -fno-protect-parens and
        -fstack-arrays.

The Optimization Options guide [15] on the GNU web site provides more details.

2.1.3.2.2 Intel
Table 2-4 lists the general optimization levels that are available for the Intel compiler.

Table 2-4 General Intel compiler optimization options


Intel compiler option Explanation

-O0     Disables all optimizations.
        This option may set other options. This is determined
        by the compiler, depending on which operating system
        and architecture are being used. The options that are
        set may change from release to release.
        This option causes certain warning options to be
        ignored. This is the default if -debug (with no
        keyword) is specified.

-O      Enables optimizations for speed and disables some
-O1     optimizations that increase code size and affect
        speed.
        To limit code size, this option enables global
        optimization. This includes data-flow analysis, code
        motion, strength reduction and test replacement,
        split-lifetime analysis, and instruction scheduling.
        This option may set other options. This is determined
        by the compiler, depending on which operating system
        and architecture are being used. The options that are
        set may change from release to release.
        The -O1 option may improve performance for
        applications with a very large code size, many
        branches, and where the execution time is not
        dominated by the code executed inside loops.


-O2 Enables optimizations for speed. This is the default,


and generally recommended, optimization level.
Vectorization is enabled at -O2 and higher levels.

Some basic loop optimizations such as distribution,


predicate optimization, interchange, multi-versioning,
and scalar replacements are performed.

This option also enables:


- Inlining of intrinsic functions
- Intra-file interprocedural optimization, which
includes inlining, constant propagation, etc.
- dead-code elimination
- global register allocation
- global instruction scheduling and control
speculation
- loop unrolling
- optimized code selection
- partial redundancy elimination

This option may set other options, especially options
that optimize for code speed. This is determined by
the compiler, depending on which operating system and
architecture are being used. The options that are set
may change from release to release.

If -g is specified, -O2 is turned off and -O0 is the


default unless -O2 (or -O1 or -O3) is explicitly
specified in the command line together with -g.

Many routines in the shared libraries are more highly


optimized for Intel microprocessors than for non-
Intel microprocessors.

-Os This option enables optimizations that do not increase


code size and produces a smaller code size than -O2.
It disables some optimizations that increase code size
for a small speed benefit.

This option tells the compiler to favor


transformations that reduce code size over
transformations that produce maximum performance.


-O3     Performs -O2 optimizations and enables more
        aggressive loop transformations such as fusion,
        block-unroll-and-jam, and collapsing IF statements.

This option may set other options. This is determined
by the compiler, depending on which operating system
and architecture are being used. The options that are
set may change from release to release.

When -O3 is used with options -ax or -x, the compiler


performs a more aggressive data dependency analysis
than for -O2, which may result in longer compilation
times.

The -O3 optimizations may not result in higher


execution performance unless loop and memory-access
transformations take place. The optimizations may slow
down code in some cases compared to -O2 optimizations.

The -O3 option is recommended for applications that


have loops that heavily use floating-point
calculations and process large data sets.

Many routines in the shared libraries are more highly


optimized for Intel microprocessors than for non-
Intel microprocessors.

-fast This option maximizes speed across the entire program.

It sets the following options:


-ipo, -O3, -no-prec-div, -static, and -xHost

-prec-div This option improves the precision of floating-point


divides. It has a slight impact on speed.

With some optimizations, such as -xAVX, the compiler


may change floating-point division computations into
multiplication by the reciprocal of the denominator.
For example, A/B is computed as A * (1/B) to improve
the speed of the computation.

However, sometimes the value produced by this


transformation is not as accurate as full IEEE
division. When it is important to have fully precise
IEEE division, use this option to disable the
floating-point division-to-multiplication
optimization. The result is more accurate, with some
loss of performance.

If you specify -no-prec-div, it enables optimizations


that give slightly less precise results than full IEEE
division.

The Intel Fortran Compiler User and Reference guide [2] provides more information and
a complete list of optimization flags.

2.1.3.3 Vectorization
All processor manufacturers have introduced SIMD (vector) units to improve the
computing capabilities of their processors. Intel introduced the AVX (Advanced Vector
Extensions) unit working on 256 bit wide data with the Sandy Bridge processor. More
information is available in Chapter 7.

In order to enable access to this additional compute power, compilers must produce
instructions specific to this hardware. The most important compiler options to enable
SIMD instructions are presented next.

2.1.3.3.1 GNU
The GCC compiler enables autovectorization with
-ftree-vectorize
It is enabled by default with -O3, -mavx or -march=corei7-avx.

The architecture flags select the type of SIMD instructions used and also enable
autovectorization. For the Sandy Bridge processor, the recommended options are:
-mavx
or
-march=corei7-avx
to use AVX (for 256-bit data) instructions and enable autovectorization.

The following switches:
-ffast-math -fassociative-math
enable the automatic vectorization of floating point reductions.

So the recommended compiler options for the GNU compiler for Sandy Bridge
processors are:
-O3 -march=corei7-avx (or -O3 -mavx)
Information on which loops were or were not vectorized and why, can be obtained using
the flag -ftree-vectorizer-verbose=<level>

There are 7 reporting levels:


1. Level 0: No output at all.
2. Level 1: Report vectorized loops.
3. Level 2: Also report unvectorized "well-formed" loops and respective reason.
4. Level 3: Also report alignment information (for "well-formed" loops).
5. Level 4: Like level 3 + report for non-well-formed inner-loops.
6. Level 5: Like level 3 + report for all loops.
7. Level 6: Print all vectorizer dump information.
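
For instance, to see which loops were vectorized and why others were not, level 2 reporting can be added to the compile line (the source file name is illustrative):

gfortran -O3 -march=corei7-avx -ftree-vectorizer-verbose=2 -c kernel.f90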

2.1.3.3.2 Intel
Vectorization is automatically enabled with -O2.

The architecture flags select the type of SIMD instructions used and also enable
autovectorization. For the Sandy Bridge processor, the recommended options are:
-xAVX
to allow for 256-bit vector instructions and enable autovectorization or
-xSSE4.2
to select the latest instruction set for 128-bit vector data and enable autovectorization.

The default behavior can be controlled through command line switches:

Table 2-5 Intel compiler options that control vectorization


-vec enable vectorization (the default at -O2 and above)

-no-vec disable vectorization

-simd This option enables the SIMD vectorization feature of


the compiler using SIMD directives.

-no-simd Disable SIMD transformations for vectorization using


SIMD directives.
To disable vectorization completely, specify the
option -no-vec

Note:
User-mandated SIMD vectorization directives supplement automatic vectorization just as
OpenMP parallelization supplements automatic parallelization. SIMD vectorization uses
the !DIR$ SIMD directive to effect loop vectorization. The directive must be added before
a loop and the loop must be recompiled to become vectorized (the option -simd is
enabled by default).
To disable the SIMD transformations for vectorization, specify option -no-simd.
To disable transformations that enable more vectorization, specify options
-no-vec -no-simd.
Complete information is available in [2].
Additional compiler options allow the compiler to perform a more comprehensive analysis
and better vectorization:

Table 2-6 Intel compiler options that enhance vectorization


-ipo Inter-procedural optimization is enabled across source
files. This may give the compiler additional
information about a loop, such as trip counts,
alignment or data dependencies. It may also allow the
function calls to be inlined.

-fno-alias Disambiguation of pointers and arrays. The switch


-fno-alias may be used to assert there is no aliasing
of memory references, that is, that the same memory
location is not accessed via different arrays or
pointers.

-O3 High level loop optimizations (HLO) may be enabled


with -O3. These additional loop optimizations may make
it easier for the compiler to vectorize the
transformed loops.
The HLO report, obtained with -opt-report-phase hlo
may tell whether some of these additional
transformations occurred.

Sometimes code blocks are not optimized or vectorized. Compiler reports provide
diagnostic information and hints on how to tune the source code further. They are
produced using the flags in Table 2-7.


Table 2-7 Intel compiler options for reporting on optimization


-opt-report n Tells the compiler to generate an optimization report
to stderr.
n: Is the level of detail in the report. On Linux OS
and Mac OS X systems, a space must appear before the
n.
Possible values of n are:
0: Tells the compiler to generate no optimization
report.
1: Tells the compiler to generate a report with the
minimum level of detail.
2: Tells the compiler to generate a report with the
medium level of detail. This is the default
3: Tells the compiler to generate a report with the
maximum level of detail.

-vec-report n Controls the diagnostic information reported by the


vectorizer
n: Is a value denoting which diagnostic messages to
report.
Possible values of n are:
0: Tells the vectorizer to report no diagnostic
information.
1: Tells the vectorizer to report on vectorized
loops. This is the default
2: Tells the vectorizer to report on vectorized and
non-vectorized loops.
3: Tells the vectorizer to report on vectorized and
non-vectorized loops and any proven or assumed data
dependences.
4: Tells the vectorizer to report on non-vectorized
loops.
5: Tells the vectorizer to report on non-vectorized
loops and the reason why they were not vectorized.

-opt-report-phase hlo    Specifies an optimizer phase to use when
                         optimization reports are generated.
                         hlo: The High Level Optimizer phase

So the recommended compiler options for the Intel compiler are:
-O3 -xAVX
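
A representative compile line combining these options with a vectorization report might be (the file name is illustrative):

ifort -O3 -xAVX -vec-report2 -c kernel.f90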

2.1.3.4 Inter-procedural optimization


Inter-procedural optimization is an automatic, multi-step process that allows the compiler
to analyze a code globally to determine where it can benefit from specific optimizations.
Depending on the compiler implementation, several optimization approaches are
attempted on the source code.

2.1.3.4.1 GNU
A complete inter-procedural analysis has only been part of the GNU compiler since
version 4.5. It was previously limited to inlining functions into a single file using the
-finline-functions compiler flag. Now there is a more sophisticated process
available by using the flto option, which performs link time optimization across multiple
files.
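
A minimal sketch of link-time optimization across two hypothetical source files; note that -flto appears on both the compile and link commands:

gcc -O3 -flto -c a.c
gcc -O3 -flto -c b.c
gcc -O3 -flto a.o b.o -o app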

A summary of useful information is in Table 2-8:


Table 2-8 Global (inter-procedural) optimization options for the GNU compiler suite
-finline-functions    Consider all functions for inlining, even if they
                      are not declared inline. The compiler
                      heuristically decides which functions are worth
                      integrating in this way.
                      If all calls to a given function are integrated,
                      and the function is declared static, then the
                      function is normally not output as assembler code
                      in its own right.
                      Enabled at level -O3.

-flto[=n] This option runs the standard link-time optimizer. The


only important thing to keep in mind is that to enable
link-time optimizations the -flto flag needs to be
passed to both the compile and the link commands.

To make whole program optimization effective, it is


necessary to make certain whole program assumptions.
The compiler needs to know what functions and
variables can be accessed by libraries and runtime
outside of the link-time optimized unit. When
supported by the linker, the linker plugin (see -fuse-
linker-plugin) passes information to the compiler
about used and externally visible symbols. When the
linker plugin is not available, -fwhole-program should
be used to allow the compiler to make these
assumptions, which leads to more aggressive
optimization decisions.
Link-time optimization does not work well with
generation of debugging information. Combining -flto
with -g is currently experimental and expected to
produce wrong results.
If you specify the optional n, the optimization and
code generation done at link time is executed in
parallel using n parallel jobs by utilizing an
installed make program. The environment variable MAKE
may be used to override the program used. The default
value for n is 1.
This option is disabled by default.

-fwhole-program Assume that the current compilation unit represents


the whole program being compiled. All public functions
and variables with the exception of main and those
merged by attribute externally_visible become static
functions and in effect are optimized more
aggressively by interprocedural optimizers. If gold is
used as the linker plugin, externally_visible
attributes are automatically added to functions (not
variable yet due to a current gold issue) that are
accessed outside of LTO objects according to
resolution file produced by gold. For other linkers
that cannot generate resolution file, explicit
externally_visible attributes are still necessary.
While this option is equivalent to proper use of the
static keyword for programs consisting of a single
file, in combination with option -flto this flag can
be used to compile many smaller scale programs since
the functions and variables become local for the whole
combined compilation unit, not for the single source
file itself.


For details and related compiler options, please refer to [15].

2.1.3.4.2 Intel
The Intel implementation of inter-procedural optimization supports 2 models: single-file
compilation using the -ip compiler option, and multi-file compilation using the -ipo
compiler flag.

Optimizations that can be done by the Intel compiler when using inter-procedural analysis
include: inlining, constant propagation, mod/ref analysis, alias analysis, forward
substitution, routine key-attribute propagation, address-taken analysis, partial dead call
elimination, symbol table data promotion, common block variable coalescing, dead
function elimination, unreferenced variable removal, whole program analysis, array
dimension padding, common block splitting, stack frame alignment, structure splitting
and field reordering, formal parameter alignment analysis, indirect call conversion, and
specialization.

Table 2-9 Global (inter-procedural) optimization options for the Intel compiler suite
-ip This option determines whether additional
interprocedural optimizations for single-file
compilation are enabled.

Options -ip (Linux and Mac OS X) and /Qip (Windows)


enable additional interprocedural optimizations for
single-file compilation.

Option -no-ip may not disable inlining. To ensure


that inlining of user-defined functions is disabled,
specify -inline-level=0 or -fno-inline.

-ipo This option enables interprocedural optimization


between files. This is also called multifile
interprocedural optimization (multifile IPO) or Whole
Program Optimization (WPO).
When you specify this option, the compiler performs
inline function expansion for calls to functions
defined in separate files.

-finline-functions    Enables the compiler to perform inline function
                      expansion for calls to functions defined within
                      the current source file.

For details and complete information, please refer to [2].

2.1.3.5 Shared memory parallelization


With the rise of multi-socket systems and multi-core processors, many physical
cores are available for a single process to compute its workload. Shared memory
parallelization is the easiest way to enable a process to access these multiple
computing resources and increase its overall performance.

The following techniques allow doing so:


Automatic parallelization:
Using a specific compiler option, the user tells the compiler to automatically parallelize
the sections that will support it. This parallelization is implemented through shared
memory mechanisms and is very similar to OpenMP threading. Parallel execution is
managed through environment variables, often the same as are used explicitly with
OpenMP.

Explicit parallelization with OpenMP:


In this case, the user inserts directives (or pragmas in C/C++) into the source code
telling the compiler which regions need to be parallelized. The user manages the data
scope, i.e. which data is shared and which data is private in a thread. Then using
compiler options, the compiler transforms the serial code into a multithreaded code.
Parallel execution is managed through environment variables from the compiler runtime
or the OpenMP standard.

Thread Affinity:
Thread affinity is a critical concept when using threads for computing: the locality of the
data used by the threads must be managed. The performance of a core will be higher if
the data used by the thread running on that core is located in the nearest hardware
memory DIMMs. To ensure this locality, one has to bind the thread to run on a particular
core, using the data located in the DIMM physically attached to the processor chip
containing this core. Environment variables and tools can control this binding. More
details are available in Section 3.4.

Compatibility between Intel and GNU:


Intel provides a compatibility library to allow codes compiled with Intel compilers to use
the GNU threading mechanisms.

2.1.3.5.1 GNU

Automatic parallelization
The -ftree-parallelize-loops compiler flag creates multithreading automatically:

-ftree-parallelize-loops=n
        Parallelize loops, i.e., split the iteration space to
        run in n threads. This is only possible for loops
        whose iterations are independent and can be
        arbitrarily reordered. The optimization is only
        profitable on multiprocessor machines, for loops that
        are CPU-intensive, rather than constrained e.g. by
        memory bandwidth. This option implies -pthread.

This flag must be passed to both the compile and link steps.
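
For example, to auto-parallelize eligible loops across 8 threads (the file names are illustrative):

gfortran -O3 -ftree-parallelize-loops=8 -o app app.f90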

OpenMP
The OpenMP directives are processed with the -fopenmp compiler flag.

-fopenmp    Enables the OpenMP directives to be recognized by the
            compiler and also arranges for automatic linking of
            the OpenMP runtime library


This flag must be passed to both the compile and link steps.

When using this flag, all local arrays will be made automatic and then allocated on the
stack. This can be a source of segmentation faults during execution, because of too small
a limit on the stack size. The stack size can be changed by using the following
environment variables:
OMP_STACKSIZE from the OpenMP standard
GOMP_STACKSIZE from the GNU implementation
Both variables change the default size of the stack allocated by each thread.

The size is limited by the value of the user's stack limit, reported by
ulimit -s
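
For instance, before launching an OpenMP binary that allocates large automatic arrays, the limits might be raised as follows (the 256 MB value is only an example):

ulimit -s unlimited          # remove the user stack limit for the master thread
export OMP_STACKSIZE=256M    # per-thread stack size for the OpenMP runtime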

Thread affinity
GOMP_CPU_AFFINITY : Binds threads to specific CPUs.

Syntax: GOMP_CPU_AFFINITY="0 3 1-2 4-15:2" will bind the initial thread to CPU 0, the
second to CPU 3, the third to CPU 1, the fourth to CPU 2, the fifth to CPU 4, the sixth
through tenth to CPUs 6, 8, 10, 12, and 14 respectively and then start assigning back
from the beginning of the list.

GOMP_CPU_AFFINITY=0 binds all threads to CPU 0.

OMP_PROC_BIND: Specifies whether threads may be moved between processors. If set
to true, OpenMP threads should not be moved; if set to false, they may be moved.

2.1.3.5.2 Intel

Automatic parallelization

The -parallel compiler flag automatically enables loops to use multithreading. This flag
must be passed to both the compile and link steps.
This option must be used with optimization level -O2 or -O3 (the -O3 option also sets
the -opt-matmul flag).

Table 2-10 Automatic parallelization for the Intel compiler


-parallel Tells the auto-parallelizer to generate multithreaded
code for loops that can be safely executed in
parallel.


-par-report Controls the diagnostic information reported by the


auto-parallelizer.
n: Is a value denoting which diagnostic messages to
report. Possible values are:
0: Tells the auto-parallelizer to report no
diagnostic information.
1: Tells the auto-parallelizer to report diagnostic
messages for loops successfully auto-parallelized. The
compiler also issues a "LOOP AUTO-PARALLELIZED"
message for parallel loops. This is the default.
2: Tells the auto-parallelizer to report diagnostic
messages for loops successfully and unsuccessfully
auto-parallelized.
3: Tells the auto-parallelizer to report the same
diagnostic messages specified by 2 plus additional
information about any proven or assumed dependencies
inhibiting auto-parallelization (reasons for not
parallelizing).
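A minimal compile-and-link sketch (the source file name is illustrative):

ifort -O3 -parallel -par-report2 -o myprog myprog.f90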

OpenMP
The -openmp flag allows the compiler to recognize OpenMP directives in the source
code.
This flag must be passed to both the compile and link steps.

Table 2-11 OpenMP options for the Intel compiler suite

-openmp   Enables the parallelizer to generate multithreaded
          code based on the OpenMP* directives. The code can be
          executed in parallel on both uniprocessor and
          multiprocessor systems.
          If you use this option, multithreaded libraries are
          used, but option -fpp is not automatically invoked.
          This option sets option -automatic.

-openmp-report=n   n: Is the level of diagnostic messages to display.
                   Possible values are:
                   0: No diagnostic messages are displayed.
                   1: Diagnostic messages are displayed indicating
                   loops, regions, and sections successfully
                   parallelized. This is the default.
                   2: The same diagnostic messages are displayed as
                   specified by value 1, plus diagnostic messages
                   indicating successful handling of MASTER constructs,
                   SINGLE constructs, CRITICAL constructs, ORDERED
                   constructs, ATOMIC directives, and so forth.
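A minimal sketch for an OpenMP build (the source file name is illustrative):

ifort -O3 -openmp -openmp-report=1 -o myprog myprog.f90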

Thread affinity
The Intel runtime library has the ability to bind OpenMP threads to physical processing
units. The interface is controlled using the KMP_AFFINITY environment variable.
The syntax for this variable is very flexible and covers all practical cases. Reference
[2] has all of the details.
One way to explicitly control the way the threads are assigned to physical or virtual cores
in a system is to use the explicit type in conjunction with the proclist modifier.

For instance, to bind 16 OpenMP threads to 16 physical cores (numbered from 0 to 15)
on a Hyper-Threaded system with 16 cores and 32 logical CPUs, the following
environment settings are equivalent:

export OMP_NUM_THREADS=16

export KMP_AFFINITY="proclist=[0-15:1],granularity=fine,explicit"

export KMP_AFFINITY= \
"proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15],granularity=fine,explicit"

Compatibility between Intel and GNU


Intel provides a way to recognize the GNU syntax for the OpenMP runtime environment
with Intel compilers. These environment variables are GNU extensions. They are
recognized by the Intel OpenMP compatibility library.

Table 2-12 GNU OpenMP runtime environment variables recognized by the Intel compiler toolchain

GOMP_STACKSIZE      GNU extension recognized by the Intel OpenMP
                    compatibility library. Same as OMP_STACKSIZE.
                    KMP_STACKSIZE overrides GOMP_STACKSIZE, which
                    overrides OMP_STACKSIZE.

GOMP_CPU_AFFINITY   GNU extension recognized by the Intel OpenMP
                    compatibility library. Specifies a list of OS
                    processor IDs.
                    Default: Affinity is disabled.

Add the compiler option
-openmp-lib compat
to use these environment variables to manage an OpenMP job.

GOMP_CPU_AFFINITY and KMP_AFFINITY with the explicit type have the same syntax.

The following 3 statements are equivalent:

export GOMP_CPU_AFFINITY="0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15"

export GOMP_CPU_AFFINITY="0-15:1"

export KMP_AFFINITY="proclist=[0-15:1],granularity=fine,explicit"

2.1.3.6 Distributed memory parallelization


In order to use a cluster of systems, one has to choose a method that can address the
issue of synchronizing the distributed memory used by a single program. The most popular
method is to use the Message Passing Interface (MPI) API, which enables processes to
share information by exchanging messages.

Recently, Coarrays have emerged as an alternative method integrated into the Fortran
compiler and may be a future contender to address distributed parallelism. Coarrays
(also called CAF = Co Array Fortran) allow parallel programs to use a Partitioned Global
Address Space (PGAS) following the SPMD (single program, multiple data)
parallelization paradigm. Each process (called an image) has its own private variables.
Variables which have a so-called codimension are addressable from other images. This
extension is part of the Fortran 2008 standard.


Various compilers implement Coarrays but do not have the same functionality.

2.1.3.6.1 GNU
The implementation of CAF in the GNU Fortran compiler is very new and not yet usable
for production work. The latest information and status are available at
http://gcc.gnu.org/wiki/Coarray and http://gcc.gnu.org/wiki/CoarrayLib.

Reported from the Current Implementation Status in GCC Fortran on the GCC Trunk
[4.7 (experimental)]
GCC 4.6: Only single-image support (i.e. num_images() == 1) but many
features do not work.
GCC 4.7: Includes multi-image support via a communication library. There is
comprehensive support for a single image, but most features do not yet work with
num_images() > 1.
To enable a Fortran code to use CAF with the GNU compiler, the user has to specify the
-fcoarray switch:

Table 2-13 GNU compiler options for CAF

-fcoarray=<keyword>   none: Disable coarray support; using coarray
                      declarations and image-control statements will produce
                      a compile-time error. (Default)
                      single: Single-image mode, i.e. num_images() is always
                      one.
                      lib: Library-based coarray parallelization; a suitable
                      GNU Fortran coarray library needs to be linked.
                      Single-image, MPI, and ARMCI communication libraries are
                      under development.
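A minimal sketch (the source file name is illustrative); with current GCC releases only single-image mode is practical:

gfortran -fcoarray=single -o caf_test caf_test.f90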

2.1.3.6.2 Intel
The CAF implementation in the Intel compiler is more mature and allows compiling and
running with coarrays on local and remote nodes. It uses shared memory transfers for
intra-node accesses/transfers and Intel MPI for inter-node exchanges. No comparison of
the CAF implementation and the native MPI implementation has been done.

The compiler option -coarray must be included to enable coarrays in a program.

Table 2-14 Intel compiler options for CAF

-coarray[=shared|distributed]   Enable coarray syntax for data parallel
                                programming. The default is shared memory;
                                distributed memory is only valid with the Intel
                                Cluster Toolkit license.

-coarray-num-images=n           Set the default number of coarray images.
                                Note that when a setting is specified in the
                                environment variable FOR_COARRAY_NUM_IMAGES, it
                                overrides the compiler option setting.
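A minimal sketch for shared-memory coarrays (the file name and image counts are illustrative):

ifort -coarray=shared -coarray-num-images=8 -o caf_test caf_test.f90
FOR_COARRAY_NUM_IMAGES=16 ./caf_test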

2.1.4 Alternatives
The following compiler suites include FORTRAN, C, and C++ compilers supporting the
Intel Sandy Bridge features and also provide tools for debugging, optimization, automatic
parallelization, etc. They have not been assessed recently enough to be included in this
document.
The PGI Workstation compilers from the Portland Group [11]
The PathScale EKOPath 4 compilers from PathScale [10]

2.2 Libraries
HPC libraries are fundamental tools for scientists during code development: They provide
standardized interfaces to tested implementations of algorithms, methods, and solvers.
They are easy to use and more efficient than manually coding the equivalent
functionality. Frequently, they are already vectorized and parallelized to take advantage
of modern HPC architectures.

HPC codes are usually developed following open standards for the libraries, but used in
production with highly optimized math libraries like MKL for Intel processors.

2.2.1 Intel Math Kernel Library (MKL)


The MKL library is provided by Intel and contains various mathematical routines highly
optimized for Intel Sandy Bridge processors.

2.2.1.1 Linear Algebra and Solvers


Function calls from the BLAS, LAPACK, and ScaLAPACK libraries are automatically
replaced by functions from MKL if you are linking with the MKL library (see Section 2.2.1.5 for
the correct linking syntax). Serial, multithreaded (OpenMP), and distributed (MPI)
versions of these routines are available where possible.

Sparse BLAS and solvers are also available in the MKL library. It supports CSR, CSC, BSR,
DIA, and Skyline data storage, as well as NIST- and SparseKit-style interfaces.

2.2.1.2 Fast Fourier Transform


Codes that already implement Fast Fourier Transforms using the FFTW
implementation can quickly benefit from Intel MKL performance by using the wrappers
included in the MKL package [7]. Serial, multithreaded (OpenMP), and distributed (MPI)
versions of these routines are available where possible.

2.2.1.3 Vector Math Library (VML)


The Vector Math Library (VML) [8] is designed to compute elementary functions on vector
arguments. VML is an integral part of the Intel MKL library and includes a set of highly
optimized implementations of certain computationally expensive core mathematical
functions (power, trigonometric, exponential, hyperbolic, etc.) that operate on vectors.
VML may improve performance for such applications as nonlinear software,
computations of integrals, and many others.

2.2.1.4 Vector Statistical Library (VSL)


Vector Statistical Library (VSL) [9] performs pseudorandom and quasi-random vector
generation as well as convolution and correlation mathematical operations. VSL is an
integral part of the MKL library and provides a number of generator subroutines implementing
commonly used continuous and discrete distributions to help improve application performance.
All these distributions are based on the highly optimized Basic Random Number
Generators (BRNGs) and VML.

2.2.1.5 How is the MKL library used?


A very useful tool is the Intel Math Kernel Library Link Line Advisor [6]. Given the
configuration of the libraries to be used, it provides the exact syntax to be passed to
compiler and linker in order to correctly link the MKL libraries and their dependencies.

For instance, for the following configuration: Linux + Intel Fortran compiler + SMP version
of MKL + ScaLAPACK + BLACS + Fortran95 interfaces for BLAS and LAPACK, this tool
provides the following information:

Compiler options:
-I$(MKLROOT)/include/intel64/lp64 -I$(MKLROOT)/include
For the link line:
-L$(MKLROOT)/lib/intel64
$(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a
$(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a
-lmkl_scalapack_lp64 -lmkl_intel_lp64 -lmkl_intel_thread
-lmkl_core -lmkl_blacs_intelmpi_lp64 -openmp -lpthread -lm
The GNU compiler, MPICH, 32-bit and 64-bit are some of the other possibilities for use with the
MKL library.
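As a simpler illustration, a minimal sketch that links the threaded MKL BLAS/LAPACK from the Intel Fortran compiler (the source file name is illustrative; always confirm the exact line with the Link Line Advisor [6]):

ifort -O3 myprog.f90 -L${MKLROOT}/lib/intel64 \
  -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_core -openmp -lpthread -lm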

2.2.2 Alternative Libraries


Many alternative libraries to MKL exist, most of them being open source or freely
available. Some of the libraries widely adopted by the HPC community are introduced below.

2.2.2.1 BLAS
The BLAS (Basic Linear Algebra Subprograms) [16] are routines that provide standard
building blocks for performing basic vector and matrix operations. The Level 1 BLAS
perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-
vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the
BLAS are efficient, portable, and widely available, they are commonly used in the
development of high quality linear algebra software, LAPACK for example.

2.2.2.2 LAPACK
LAPACK [17] is written in Fortran 90 and provides routines for solving systems of
simultaneous linear equations, least-squares solutions of linear systems of equations,
eigenvalue problems, and singular value problems. The associated matrix factorizations
(LU, Cholesky, QR, SVD, Schur, and generalized Schur) are also provided, as are
related computations such as reordering of the Schur factorizations and estimating
condition numbers. Dense and banded matrices are handled, but not general sparse
matrices. In all areas, similar functionality is provided for real and complex matrices, in
both single and double precision.


2.2.2.3 SCALAPACK
The ScaLAPACK [18] (or Scalable LAPACK) library includes a subset of LAPACK
routines redesigned for distributed memory MIMD parallel computers. It is currently
written in a Single-Program-Multiple-Data style using explicit message passing for inter
processor communication. It assumes matrices are laid out in a two-dimensional block
cyclic decomposition.

ScaLAPACK is designed for heterogeneous computing and is portable to any computer
that supports MPI.

2.2.2.4 ATLAS
The ATLAS [19] (Automatically Tuned Linear Algebra Software) project is an ongoing
research effort focusing on applying empirical techniques in order to provide portable
performance. At present, it provides C and Fortran77 interfaces to a portably efficient
BLAS implementation, as well as a few routines from LAPACK.

2.2.2.5 FFTW
FFTW [20] is a C subroutine library for computing the discrete Fourier transform (DFT) in
one or more dimensions, of arbitrary input size, and of both real and complex data (as
well as of even/odd data, i.e. the discrete cosine/sine transforms or DCT/DST).

The latest official release of FFTW is version 3.3.1 and introduces support for the AVX
x86 extensions.

2.2.2.6 GSL
The GNU Scientific Library (GSL) [21] is a numerical library for C and C++ programmers.
It is free software under the GNU General Public License.

The library provides a wide range of mathematical routines such as random number
generators, special functions and least-squares fitting. There are over 1000 functions in
total with an extensive test suite.

2.3 References
All of the descriptions of the GNU and Intel compiler options included in the tables in this
chapter are taken from the GNU Optimization Options Guide and the Intel Fortran
Compiler User and Reference Guide.
[1] Intel Composer XE web page: http://software.intel.com/en-us/articles/intel-composer-xe/
[2] Intel Fortran Compiler User and Reference Guide: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/2011Update/fortran/lin/index.htm
[3] Intel AVX web page: http://software.intel.com/en-us/avx/
[4] Intel MKL web page: http://software.intel.com/en-us/articles/intel-mkl/
[5] Intel MKL in depth: http://www.cs-software.com/software/fortran/intel/mkl_indepth.pdf
[6] Intel MKL Link Line Advisor: http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/
[7] Intel MKL, FFTW to MKL: http://software.intel.com/en-us/articles/the-intel-math-kernel-library-and-its-fast-fourier-transform-routines/
[8] Intel MKL VML library: http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm
[9] Intel MKL VSL library: http://software.intel.com/sites/products/documentation/hpc/mkl/vslnotes/index.htm and http://software.intel.com/sites/products/documentation/hpc/mkl/sslnotes/index.htm
[10] PathScale Compilers web site: http://www.pathscale.com/ekopath.html
[11] PGI Compilers web site: http://www.pgroup.com/
[12] GNU Compilers web site: http://www.gnu.org/software/gcc/
[13] GNU gfortran web page: http://gcc.gnu.org/fortran/
[14] GNU Online Documentation: http://gcc.gnu.org/onlinedocs/
[15] GNU Optimization Options guide: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
[16] BLAS library web site: http://www.netlib.org/blas/index.html
[17] LAPACK library web site: http://www.netlib.org/lapack/
[18] ScaLAPACK library web site: http://www.netlib.org/scalapack/
[19] ATLAS library web site: http://math-atlas.sourceforge.net/
[20] FFTW library web site: http://www.fftw.org/
[21] GSL library web site: http://www.gnu.org/software/gsl/


3 Linux
The iDataPlex dx360 M4 supports the following versions of 64-bit (x86-64) Red Hat
Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES):

SLES 10.4 with Xen
SLES 11.2 with KVM or Xen
RHEL 5.7 with KVM or Xen
RHEL 6.2 with KVM

3.1 Core Frequency Modes


Linux fully supports the clock frequency scaling capabilities of the Intel processors
available in the dx360 M4. Both Enhanced Intel Speedstep Technology (EIST) and
Turbo Boost are fully supported. These features allow the core clock frequency to be
dynamically adjusted to achieve the desired blend of performance and energy
management.

The processor frequency settings can be controlled through either hardware or software.
The hardware configuration is controllable through the system's UEFI interface which is
available during system initialization. It is also possible to adjust the UEFI configuration
using the IBM Advanced Settings Utility (ASU). Detailed information on ASU is available
in section 1.2.3.1.

In addition to the available hardware controls, Linux provides its own clock frequency
management. Linux uses what are referred to as CPU governors to manage clock
frequency. The two most common governors are performance and ondemand. The
performance governor always runs the processor at its nominal clock frequency. In
contrast, the ondemand governor will vary the clock frequency depending on the
processor utilization levels of the system. The method of controlling the clock frequency
management in Linux varies from one distribution to the next, so it is best to consult the
distribution documentation for details (for example, the RHEL 6 and SLES 11 documentation).
More information is included in the cpufrequtils packages available on RHEL and SLES.
To find the exact package needed on RHEL, try
$ yum search cpufreq
On SLES, try
$ zypper search cpufreq
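As an illustrative sketch using the generic cpufreq sysfs interface (the paths are standard, but their availability depends on the kernel and driver configuration):

$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
# echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

The first command queries the current governor for CPU 0; the second (run as root) selects the performance governor, and would be repeated for each CPU.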

3.2 Memory Page Sizes


The Linux memory page size support varies from one hardware architecture to the next.
On x86-64 there are only two supported page sizes: 4 KB and 2 MB. The 4 KB page size
is considered the default, while the 2 MB page size is what is referred to as a huge page.

Using the 2 MB huge page can improve performance by reducing pressure on the
processor's translation lookaside buffer (TLB) which typically has a fixed number of
elements that it can cache. By increasing the page size, the TLB is capable of caching
entries that address larger amounts of memory than when small pages are used.
Historically, access to the 2 MB page size has been restricted to applications specifically
coded to use it, which has limited the ability to make use of this feature.

Two recent additions to Linux have increased the viability of using huge pages for
applications without specifically modifying them to do so. The libhugetlbfs project
enables applications to explicitly make use of huge pages when the user requests them.
There are usability concerns with libhugetlbfs since the huge pages must be allocated
ahead of time by the system administrator. This is done using the following command
(allocating 30 huge pages in this case):
# echo 30 > /proc/sys/vm/nr_hugepages
In order for huge pages to be allocated in this manner, the operating system must be able
to find appropriately sized regions of contiguous free memory (2 MB in this case). This
can be problematic on systems which have been running for awhile, where memory
fragmentation has occurred.

The number of allocated pages (and current usage) can be checked by running the
following command:
# grep HugePage /proc/meminfo
AnonHugePages: 4237312 kB
HugePages_Total: 30
HugePages_Free: 30
HugePages_Rsvd: 0
HugePages_Surp: 0
As shown in this output (line AnonHugePages), recent x86-64 Linux distributions
(including RHEL 6.2 and SLES 11.2) support a new kernel feature called transparent
huge pages (THP). This example shows over 4 GB of memory backed by huge pages
allocated via THP. THP allows for applications to automatically be backed by huge
pages without any special effort by the user. To enable THP, the Linux memory
management subsystem has been enhanced with a memory defragmenter. The
defragmenter increases the likelihood of large contiguous memory regions being
available after the system has been running for awhile. The presence of THP does not
preclude the use of explicit huge pages or libhugetlbfs. The presence of the new memory
defragmenter should make their use easier.
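The THP mode can be inspected and changed at run time through sysfs. A sketch (on RHEL 6 the path is /sys/kernel/mm/redhat_transparent_hugepage/enabled; mainline kernels use the path shown below):

$ cat /sys/kernel/mm/transparent_hugepage/enabled
# echo never > /sys/kernel/mm/transparent_hugepage/enabled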

3.3 Memory Affinity

3.3.1 Introduction
Linux optimizes for memory access performance by attempting to make memory
allocations in a NUMA (Non Uniform Memory Access) aware fashion. That is, Linux will
attempt to allocate local memory when possible and only resort to performing a remote
allocation if the local allocation fails.

While the allocation path attempts to behave in an optimal fashion, this behavior can be
offset by the kernel's task scheduler, which is not NUMA aware. It is possible (likely
even) that the task scheduler can move a process / thread to a core which is remote to
the already allocated memory. This increases the memory access latency and may
decrease the achievable memory bandwidth. This is one reason why it is recommended
that most HPC applications perform explicit process or thread binding.

3.3.2 Using numactl


While Linux attempts to optimally allocate memory by default, it may be useful to explicitly
control the allocation behavior in some scenarios. The numactl command is one method
of doing so.

To display the NUMA topology (memory and processor topology) on the dx360 M4:
% numactl --hardware
The output should be similar to the following, depending on the installed processors and
memory:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 65514 MB
node 0 free: 61196 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 65536 MB
node 1 free: 56252 MB
node distances:
node 0 1
0: 10 11
1: 11 10

The numactl utility can be used to modify memory allocation in a variety of ways, such as:
Require allocation on a specific NUMA node (or nodes)
Prefer allocation on a specific NUMA node (or nodes)
Interleave allocation across specific NUMA nodes
Interleaving is particularly useful when memory allocation is performed by a master
thread and application source code modification is not possible.
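For instance, to interleave a process's allocations across all NUMA nodes of the system (the binary name is a placeholder):

% numactl --interleave=all <binary> <arguments>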

For further information and detailed command argument documentation, see the numactl man page.

3.4 Process and Thread Binding

3.4.1 taskset
taskset is the Linux system command that:
sets the processor affinity of a running process
sets the processor affinity when launching a new command
retrieves the processor affinity of a running process
As such, taskset is a low-level mechanism for managing processor affinity.

Practically speaking, the typical usages of the taskset command are the following:

At process startup: set processor affinity while launching a new command


% taskset -c <cpu list> <command>
During execution: retrieve processor affinity of a running process

% taskset -cp <process ID>

In the context of an MPI parallel execution, the taskset command must be integrated
into a user-defined script that will be responsible for performing an automatic mapping
between a given MPI rank and a unique processor ID for each process instance.

A typical example of such a user-defined script is provided below for reference:


#!/bin/bash
# User-defined parameters for process affinity configuration
STRIDE=1
OFFSET=0
# Retrieve number of processors on the node
export PROCESSORS=$(grep "^processor" /proc/cpuinfo | wc -l)
# Retrieve MPI rank for given process depending on selected MPI library
# Open MPI
if [ -n "$OMPI_COMM_WORLD_RANK" ]; then
MPI_RANK=$OMPI_COMM_WORLD_RANK
# Intel MPI
elif [ -n "$PMI_RANK" ]; then
MPI_RANK=$PMI_RANK
else
echo "Error getting MPI rank for process - Aborting"; exit 1
fi
# Compute processor ID to bind the process to
CPU=$MPI_RANK
CPU=$(( $CPU * $STRIDE ))
CPU=$(( $CPU + $OFFSET ))
CPU=$(( $CPU % $PROCESSORS ))
# Launch the command under taskset
exec taskset -c $CPU "$@"
The affinity management script above must be used as a prefix for the application binary
in the mpirun submission command:
% mpirun -np <# tasks> <affinity management> <binary> <arguments>
A comprehensive reference for the taskset command can be found in the taskset(1) man page.

3.4.2 numactl
numactl is the Linux system command that allows processes to run with a specific
NUMA scheduling or memory placement policy. Its coverage is broader than that of
the taskset command, as it also manages the memory placement for a process.

The typical usages of the numactl command are the following:


Display NUMA configuration of the node, including socket / core / memory
components:
% numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 12276 MB
node 0 free: 10686 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 12288 MB
node 1 free: 10896 MB
node distances:
node 0 1
0: 10 21
1: 21 10

Set explicit processor affinity while launching a new process:


% numactl --physcpubind=0-3,6-11 <binary> <arguments>


Set explicit memory affinity while launching a new process:
% numactl --membind=0 <binary> <arguments>
A comprehensive reference for the numactl command can be found in the numactl(8) man page.

In the context of an MPI parallel execution, as is the case with the taskset command,
the numactl command must be integrated into a user-defined script that is to be
responsible for performing an automatic mapping between a given MPI rank and a
unique processor ID for each process instance.
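A hedged sketch of such a wrapper, binding each MPI rank round-robin to one NUMA node (CPUs and memory); the rank environment variables match those used in the taskset script in Section 3.4.1:

#!/bin/bash
# Determine the number of NUMA nodes on this host
NODES=$(numactl --hardware | awk '/available:/ {print $2}')
# Retrieve the MPI rank (Open MPI or Intel MPI)
RANK=${OMPI_COMM_WORLD_RANK:-$PMI_RANK}
if [ -z "$RANK" ]; then
    echo "Error getting MPI rank for process - Aborting"; exit 1
fi
# Map the rank onto a NUMA node and launch the real command
NODE=$(( RANK % NODES ))
exec numactl --cpunodebind=$NODE --membind=$NODE "$@"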

3.4.3 Environment Variables for OpenMP Threads


Processor affinity management for OpenMP threads can be controlled by different
environment variables depending on the compilation/runtime environment.

GNU
The environment variable GOMP_CPU_AFFINITY is used to specify an explicit binding
for OpenMP threads.

Intel
The Intel compilers OpenMP runtime library provides the Intel OpenMP Thread Affinity
Interface, which is made up of three levels:
1. High-level affinity interface
this interface is entirely controlled by one single environment variable
(KMP_AFFINITY), which is used to determine the machine topology and to assign
OpenMP threads to the processors based upon their physical location in the
machine.

2. Mid-level affinity interface


this interface provides compatibility with the GNU GOMP_CPU_AFFINITY
environment variable, but it can be invoked as well by using the KMP_AFFINITY
environment variable.
GOMP_CPU_AFFINITY
Using the GOMP_CPU_AFFINITY environment variable requires specifying the
following compile option: -openmp-lib compat
KMP_AFFINITY
The explicit binding of OpenMP threads can be specified.

3. Low-level affinity interface (not discussed in the present document)


This interface uses APIs to enable OpenMP threads to make calls into the OpenMP
runtime library to explicitly specify the set of processors on which they are to be run.
The GOMP_CPU_AFFINITY environment variable expects a comma-separated list
composed of the following elements:
single processor ID
a range of processor IDs (M-N)
a range of processor IDs with some stride (M-N:S)
The KMP_AFFINITY environment variable expects the following syntax:

KMP_AFFINITY=[<modifier>,]<type>


where:
<modifier>
o proclist= {<proc-list>}
Specify a list of processor IDs for explicit binding.
o granularity= {core [default] | thread}
Specify the lowest levels that OpenMP threads are allowed to float within a
topology map.
o verbose | noverbose

<type>
o none [default]
Do not bind OpenMP threads to particular thread contexts. Specify
KMP_AFFINITY= [verbose, none] to list a machine topology map.
o compact
Assign the OpenMP thread <n>+1 to a free thread context as close as
possible to the thread context where the <n> OpenMP thread was placed.
o disabled
Completely disable the thread affinity interfaces.
o explicit
Assign OpenMP threads to a list of processor IDs that have been explicitly
specified by using the proclist modifier.
o scatter
Distribute the threads as evenly as possible across the entire system
(opposite of compact).
The following table summarizes the OpenMP thread binding options, according to the
compiler used to build the executable.

Table 3-1 OpenMP binding options

Runtime   Variable            Typical value                                         Remark
GNU       GOMP_CPU_AFFINITY   <processor list>
Intel     GOMP_CPU_AFFINITY   <processor list>                                      -openmp-lib compat
Intel     KMP_AFFINITY        granularity=thread,proclist=[<proc. list>],explicit
Intel     KMP_AFFINITY        granularity=thread,compact
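For example, to bind 8 OpenMP threads to logical processors 0 to 7 (the processor IDs are illustrative):

export OMP_NUM_THREADS=8
# GNU (or Intel compiled with -openmp-lib compat):
export GOMP_CPU_AFFINITY="0-7"
# Intel:
export KMP_AFFINITY="granularity=thread,proclist=[0-7],explicit"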

3.4.4 LoadLeveler
When using both LoadLeveler as a workload scheduler and Parallel Environment as MPI
library, processor affinity can be requested directly at LoadLeveler level through the
task_affinity keyword.

This keyword has the same syntax as the Parallel Environment MP_TASK_AFFINITY
environment variable:

task_affinity = {core[(number)] | cpu[(number)]}


core [default]
Specify that each MPI task runs on a single physical processor core
core(n)
Specify the number of physical processor cores to which an MPI task (and any
threads it spawns) is constrained (one thread per physical core)
cpu
Specify that each MPI task runs on a single logical processor core
cpu(n)
Specify the number of logical processor cores to which an MPI task (and any
threads it spawns) is constrained (one thread per logical core)


The following two additional keywords complement the task_affinity keyword:


cpus_per_core = <number>
Specify the number of logical cores per physical processor core that needs to be
allocated to each task of a job with the processor-core affinity requirement. This
keyword can only be used along with the task_affinity keyword.
parallel_threads = <number>
Request OpenMP thread-level binding by assigning separate logical cores to
individual threads of an OpenMP task.
LoadLeveler uses the parallel_threads value to set the value for the
OMP_NUM_THREADS runtime environment variable for all job types. For serial
jobs, LoadLeveler also uses the parallel_threads value to set the
GOMP_CPU_AFFINITY / KMP_AFFINITY environment variables.
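A hedged sketch of the corresponding keywords in a LoadLeveler job command file (all values are illustrative):

# @ job_type = parallel
# @ node = 2
# @ tasks_per_node = 2
# @ task_affinity = core(8)
# @ parallel_threads = 8
# @ queue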

3.5 Hyper-Threading (HT) Management


Hyper-Threading (HT) or simultaneous multithreading (SMT) is a processor feature that
allows multiple threads of execution to exist on a processor core at the same time. The
individual execution threads share access to the processor's functional units, allowing
multiple units to be used simultaneously which often increases efficiency. When running
traditional business logic applications, the effects of HT / SMT are almost always positive.
However, this is not always the case for HPC applications.

In order to experiment with the effects of HT / SMT, it must be enabled or disabled
through a hardware configuration change. This can be accomplished during system
initialization by entering the UEFI control interface or by using the ASU utility presented in
Section 1.2.3.1. To enable or disable HT / SMT using the ASU, run the following
command:
# asu64 set Processors.Hyper-Threading <Enable|Disable>
Then, a system reboot will activate the changes to the HT / SMT configuration.
Depending on whether HT / SMT is being enabled or disabled, the number of processors
visible to Linux should either double or be cut in half.

The CPU Hotplug method

An alternative approach (which is more flexible, but also more elaborate) is to use the
CPU hotplug method to turn individual CPUs on or off. This method uses commands like
echo 0 > /sys/devices/system/cpu/cpu16/online
to disable, and
echo 1 > /sys/devices/system/cpu/cpu16/online
to enable specific CPUs. To be truly effective, the mapping between logical CPUs
and physical cores in the system needs to be known in advance (see the sketch below);
the finer points of implementing this robustly are outside the scope of this document.
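A hedged sketch (run as root) that takes the second hardware thread of each core offline by consulting the sysfs topology files; a production version would need more careful error handling:

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
    # The first entry in thread_siblings_list (e.g. "0,16") is the core's primary thread
    primary=$(sed 's/[,-].*//' $cpu/topology/thread_siblings_list)
    if [ "$cpu" != "/sys/devices/system/cpu/cpu$primary" ]; then
        echo 0 > $cpu/online
    fi
done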

3.6 Hardware Prefetch Control


Modifying the hardware prefetch controls of the dx360 M4 changes the behavior of the
processor with respect to pulling data from main memory into the processor's cache. For
memory access sensitive applications, which many HPC applications are, these settings
may provide valuable performance gains with the proper configuration.

These settings are low level hardware configuration details that are not normally visible
from within Linux. Modification of these controls can be accomplished during system
initialization by entering the UEFI control interface or by using the ASU utility presented in
Section 1.2.3.1. To see what prefetch controls can be modified, run the following
command:
# asu64 show all | grep Prefetch
Processors.HardwarePrefetcher=Enable
Processors.AdjacentCachePrefetch=Enable
Processors.DCUStreamerPrefetcher=Disable
Processors.DCUIPPrefetcher=Enable
Each of these controls can be modified using the following ASU syntax:
# asu64 set <property> <Enable|Disable>
For example:
# asu64 set Processors.HardwarePrefetcher Disable
To activate changes to the prefetch controls a system reboot is required. Since these
properties are not normally visible from Linux, verifying the current settings requires the
use of ASU.

3.7 Monitoring Tools for Linux


Below is a brief overview of select monitoring tools that exist on systems running Linux.
Unless otherwise noted, the output provided is exemplary of the default operating modes
of these utilities. The man pages for these utilities give in depth descriptions of the
programs and of the many optional parameters that each has.

It is recommended to ignore the first sample of data from many of these utilities since that
data point represents an average of all data collected since the system was booted
instead of the previous interval.

3.7.1 Top
The top utility is a universally available Linux utility for monitoring the state of individual
processes and the system as a whole. It is primarily a tool used to focus on CPU
utilization and memory consumption but it does expose additional information such as
scheduler priorities and page fault statistics.

Typical output:
# top
top - 15:09:46 up 55 min, 6 users, load average: 0.04, 0.01, 0.00
Tasks: 397 total, 1 running, 396 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 24596592k total, 602736k used, 23993856k free, 24112k buffers
Swap: 26836984k total, 0k used, 26836984k free, 135456k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND


1 root 20 0 19396 1572 1264 S 0.0 0.0 0:01.00 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0


3.7.2 vmstat
vmstat is a tool that is widely available across various Unix like operating systems. On
Linux it can display disk and memory statistics, CPU utilization, interrupt rate, and
process scheduler information in compact form.

The first sample of data from vmstat should be ignored since it represents data
collected since the system booted, not the previous interval.

Typical output:
# vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 718196 209928 443204 0 0 0 1 0 5 1 0 99 0 0
0 0 0 718188 209928 443204 0 0 0 2 179 172 3 0 97 0 0
0 0 0 718684 209928 443204 0 0 0 51 185 161 4 0 96 0 0

3.7.3 iostat
The iostat tool provides detailed input/output statistics for block devices as well as
system level CPU utilization. Using the extended statistics option (-x) and displaying the
data in kilobytes per second (-k) instead of blocks per second are recommended options
for this utility.

The first sample of data from iostat should be ignored since it represents data
collected since the system booted, not the previous interval.

Typical output:
# iostat -xk 5
Linux 2.6.32-220.el6.x86_64 (host) 03/19/2012 _x86_64_ (2 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle


0.88 0.00 0.13 0.00 0.00 98.98

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.01 0.19 0.01 0.48 0.14 2.38 10.34 0.00 0.95 0.54 0.03
dm-0 0.00 0.00 0.01 0.60 0.14 2.38 8.28 0.00 1.37 0.44 0.03
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 1.03 0.84 0.00

avg-cpu: %user %nice %system %iowait %steal %idle


0.20 0.00 0.20 0.00 0.00 99.60

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.60 0.00 0.20 0.00 3.20 32.00 0.00 3.00 3.00 0.06
dm-0 0.00 0.00 0.00 0.80 0.00 3.20 8.00 0.00 3.00 0.75 0.06
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

3.7.4 mpstat
mpstat is a utility for displaying detailed CPU utilization statistics for the entire system or
individual processors. It is also capable of displaying detailed interrupt statistics if
requested. Monitoring the CPU utilization of individual processors is done by specifying
the "-P ALL" parameter and can be useful when processor affinity is in use.

The mpstat utility waits for the specified interval before printing a sample rather than
initially presenting data since the system was booted. This means that no samples need
to be ignored when using mpstat to monitor the system.

Typical output:
# mpstat -P ALL 5
Linux 2.6.32-220.el6.x86_64 (host) 03/19/2012 _x86_64_ (2 CPU)


03:56:22 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
03:56:27 PM all 0.10 0.00 0.10 0.00 0.00 0.00 0.00 0.00 99.80
03:56:27 PM 0 0.20 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.60
03:56:27 PM 1 0.00 0.00 0.20 0.00 0.00 0.00 0.00 0.00 99.80

03:56:27 PM CPU %usr %nice %sys %iowait %irq %soft %steal %guest %idle
03:56:32 PM all 2.34 0.00 0.71 0.00 0.00 0.00 0.00 0.00 96.95
03:56:32 PM 0 4.21 0.00 0.60 0.00 0.00 0.00 0.00 0.00 95.19
03:56:32 PM 1 0.41 0.00 0.62 0.00 0.00 0.00 0.00 0.00 98.97


4 MPI

4.1 Intel MPI


Intel MPI is used to run parallel applications either in pure MPI mode or in a hybrid
(OpenMP+MPI) mode. It has support for communications through shared memory (shm),
a combination of RDMA and shared memory (rdssm), or a TCP/IP interface (sock).
It can use several networks and several network interfaces (dapl, verbs, psm, etc.).

This section is not a replacement for the Intel documentation. The Intel documents:
GettingStarted.pdf
Reference_Manual.pdf
can be found in the doc/ directory included with Intel MPI (/opt/intel/impi/<version>/doc).

The following steps are involved in using Intel MPI to run parallel applications:
Step 1: Compile and Link
Step 2: Selecting a network interface
Step 3: Running the application

4.1.1 Compiling
To compile a parallel application using Intel MPI, one needs to make sure that Intel MPI
is in the path. Intel provides scripts in the bin/ and bin64/ directories to accomplish this
task (mpivars.sh/mpivars.csh depending on the shell being used). In addition, Intel MPI
provides wrappers for C, C++ and Fortran compilers. Table 4-1 lists some of the
wrappers:

Table 4-1 Intel MPI wrappers for GNU and Intel compiler
mpicc Wrappers for GNU C compilers
mpicxx Wrappers for GNU C++ compiler
mpif77 Wrappers for g77 compiler
mpif90 Wrappers for gfortran compiler
mpiicc Wrappers for Intel C compiler (icc)
mpiicpc Wrappers for Intel C++ compiler (icpc)
mpiifort Wrappers for Intel Fortran compiler
(fortran77/fortran95)
To compile a C program using the Intel C compiler:
mpiicc -o myprog -O3 test.c
Before compiling using Intel compilers, make sure that the Intel compilers are in your
path. Intel provides scripts to accomplish this (bin/compilervars.sh or
bin/compilervars.csh).

To compile a Fortran application using the Intel Fortran compiler:

mpiifort -o myprog -O3 test.f (test.f90)
for serial applications

or
mpiifort -o myprog -O3 -openmp test.f (test.f90)
for hybrid applications


4.1.2 Running Parallel Applications


Running parallel applications is a 2-step process. The first step is
mpdboot
This step creates supervisory processes called mpds on the nodes where the parallel
application is run. The syntax is:
mpdboot -n xx -f hostfile -r ssh
where xx is the number of nodes (only one mpd instance runs on a node, independent of
the number of MPI tasks targeted for each node) and the hostfile contains the node
names (one host per line).

To make sure that mpdboot ran successfully, one can query the nodes in the parallel
partition using
mpdtrace
which will list all of the nodes in the parallel partition. To force mpdboot to maintain the
same order as in the hostfile, one can add the additional flag --ordered to the mpdboot
command.

After mpdboot, one can run the parallel application using:


mpiexec -perhost xx -np yy myprog
or
mpiexec -np yy -machinefile hf myprog
or
mpiexec -np yy -machinefile hf ./myprog < stdin > stdout
where xx is the number of tasks on a node and yy is the total number of MPI tasks. The
machinefile contains the list of the nodes where the MPI tasks will be run (one per line).

One can combine the functions of mpdboot and mpiexec by using mpirun instead.

After the parallel job has completed, issue
mpdallexit
to close all of the mpd (Python) processes.
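Putting these steps together, a minimal end-to-end sketch for a 64-task run on 4 nodes with 16 tasks per node (the host file and binary names are illustrative):

mpdboot -n 4 -f hostfile -r ssh
mpdtrace
mpiexec -perhost 16 -np 64 ./myprog
mpdallexit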

Intel MPI dynamically selects the appropriate fabric for communication among the MPI
processes. To select a specific fabric for communication, set the environmental variable
I_MPI_FABRICS. Table 4-2 provides some communication fabric settings.

Table 4-2 Intel MPI settings for I_MPI_FABRICS

I_MPI_FABRICS   Network fabric
shm:dapl        Default setting
shm             Shared memory
dapl            DAPL-capable network fabrics, e.g. InfiniBand, iWARP, etc.
tcp             TCP/IP-capable networks
tmi             Tag Matching Interface (TMI), e.g. QLogic, Myrinet, etc.
ofa             InfiniBand over OFED verbs provided by the Open Fabrics Alliance (OFA)

On a single node, use
I_MPI_FABRICS=shm
for communication through shared memory. The I_MPI_FABRICS environmental
variable can also be passed in the run command using
mpiexec -genv I_MPI_FABRICS ofa -np yy -machinefile hf ./myprog

4.1.3 Processor Binding


To find the way the binding is done on a node, one can use the cpuinfo command. The
command output is:
$ cpuinfo

Intel(R) Xeon(R) CPU E5-2680 0


===== Processor composition =====
Processors(CPUs) : 32
Packages(sockets) : 2
Cores per package : 8
Threads per core : 2
===== Processor identification =====
Processor Thread Id. Core Id. Package Id.
0 0 0 0
1 0 1 0
2 0 2 0
3 0 3 0
4 0 4 0
5 0 5 0
6 0 6 0
7 0 7 0
8 0 0 1
9 0 1 1
10 0 2 1
11 0 3 1
12 0 4 1
13 0 5 1
14 0 6 1
15 0 7 1
16 1 0 0
17 1 1 0
18 1 2 0
19 1 3 0
20 1 4 0
21 1 5 0
22 1 6 0
23 1 7 0
24 1 0 1
25 1 1 1
26 1 2 1
27 1 3 1
28 1 4 1
29 1 5 1
30 1 6 1
31 1 7 1
===== Placement on packages =====
Package Id. Core Id. Processors
0 0,1,2,3,4,5,6,7 (0,16)(1,17)(2,18)(3,19)(4,20)(5,21)(6,22)(7,23)


1 0,1,2,3,4,5,6,7 (8,24)(9,25)(10,26)(11,27)(12,28)(13,29)(14,30)(15,31)

===== Cache sharing =====


Cache Size Processors
L1 32 KB
(0,16)(1,17)(2,18)(3,19)(4,20)(5,21)(6,22)(7,23)(8,24)(9,25)(10,26)
(11,27)(12,28)(13,29)(14,30)(15,31)
L2 256 KB
(0,16)(1,17)(2,18)(3,19)(4,20)(5,21)(6,22)(7,23)(8,24)(9,25)(10,26)
(11,27)(12,28)(13,29)(14,30)(15,31)
L3 20 MB (0,1,2,3,4,5,6,7,16,17,18,19,20,21,22,23)
(8,9,10,11,12,13,14,15,24,25,26,27,28,29,30,31)
This shows that this particular node has 2 sockets (Package Id), and each socket has 8
physical cores. The system is running in Hyper-Threading (HT) mode, and the
physical-to-logical processor mapping is provided under the column Processors.

The processor binding can be enforced with the environmental variable


I_MPI_PIN_PROCESSOR_LIST.

For example:
I_MPI_PIN_PROCESSOR_LIST='0,1,2,3'
pins 4 MPI tasks to logical processors 0,1,2 and 3. Setting the environmental variable
I_MPI_DEBUG=10
or higher gives additional information about binding done by Intel MPI.

For hybrid applications (OpenMP+MPI), the binding of MPI tasks is done to a domain as
opposed to a processor. This is done using I_MPI_PIN_DOMAIN so that all child threads
from a given MPI task will run in the same domain.

The reference manual describes several ways of specifying I_MPI_PIN_DOMAIN


depending on the user's preference. Setting
I_MPI_PIN_DOMAIN=socket
for a 2 MPI-task job running on a 2 socket system will run each MPI task on its own
socket and all of the child threads from each MPI task will be confined to the same socket
(domain) as the parent thread.
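For example, a sketch of such a 2-task hybrid run with 8 threads per task (the machine file and binary names are illustrative):

export OMP_NUM_THREADS=8
export I_MPI_PIN_DOMAIN=socket
mpiexec -np 2 -machinefile hf ./hybrid_prog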

4.2 IBM Parallel Environment


The IBM Parallel Environment (PE) offers two MPI implementations: MPICH2 over PAMI
and PEMPI over PAMI. Two kinds of libraries are built for MPICH2, which differ in
performance. The standard library is recommended for developing and testing new MPI
codes. The optimized library skips checking MPI arguments at run time and is
recommended for performance runs. Both libraries are 64-bit.

Libraries built with the GCC and Intel compilers are provided for each implementation.

4.2.1 Building an MPI program


Once the IBM PE rpms are installed, several scripts appear in the
/opt/ibmhpc/pecurrent/base/bin subdirectory to assist with building an MPI program. The
scripts recognize the options -compiler and -mpilib, which can be used to specify the
compiler and MPI library being used. The mpicc script, by default, selects the GNU
compiler and the MPICH2 library. The mpiicc script, by default, selects the Intel compiler
and the MPICH2 library. The PEMPI library is selected if the -mpilib ibm_pempi option
is used.

4.2.2 Selecting MPICH2 libraries


The selection between standard and optimized libraries is done at run time. The standard
library is chosen by default. If the MP_EUIDEVELOP=min environment variable setting is
specified, the optimized MPICH2 library will be selected. A selection can be verified by
checking library paths in the /proc/<pid>/smaps file for a process with process ID
<pid>.

4.2.3 Optimizing for Short Messages


The best latency is achieved when the MPICH2 stack selects lockless mode where no
locking takes place in MPI and PAMI. The MPICH2 library enters this mode if either
MPI_Init() is called at initialization time or MPI_Init_thread() is called with an argument
MPI_THREAD_SINGLE, which is the preferred way. Lockless mode is valid only when
MPI calls are made solely by the main MPI thread. Violations of this rule may cause data corruption.

To select lockless mode in a PEMPI program initialized by MPI_Init(), one has to set
export MP_SINGLE_THREAD=yes
export MP_LOCKLESS=yes
in the environment. Note, these settings would not work with jobs which use one-sided
MPI or MPI/IO functions.

MPICH2 implements a Reliable Connection (RC) FIFO mode on top of the RC transport to
cut the PAMI overhead associated with checking whether data sent unreliably actually
arrived, and with packet retransmission in cases where data is lost. This mode uses RC
connections, spanning a job's lifetime, between any pair of MPI tasks. The environment
setting MP_RELIABLE_HW=yes will enable RC FIFO mode.

All of these optimizations are valid for any message size, but they have the greatest
impact on short messages.

The best latency is achieved with the environment setting MP_INSTANCES=1 (default).

Short messages are defined as messages below the lower bound of
MP_BULK_MIN_MSG_SIZE (4 KB). The Maximum Transmission Unit (MTU) may impact
latency; setting MP_FIFO_MTU=2K is typically better for messages under 4 KB.
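Collecting the settings discussed in this section, a sketch of a latency-oriented environment (values taken from the text above; note that MP_LOCKLESS applies to PEMPI, while MP_RELIABLE_HW applies to MPICH2):

# PEMPI lockless mode (single-threaded MPI programs only)
export MP_SINGLE_THREAD=yes
export MP_LOCKLESS=yes
# MPICH2 RC FIFO mode
export MP_RELIABLE_HW=yes
# One instance gives the best latency (the default)
export MP_INSTANCES=1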

4.2.4 Optimizing for Intranode Communications


Shared memory is the default for intranode communications. It usually gives the best
performance but can be disabled by choosing MP_SHARED_MEMORY=no which will
force messages to go over the IB fabric. In the context of collective operations in PEMPI,
the explicit use of shared memory (as compared to the implicit use of shared memory via
PAMI) does not always provide the best performance. The
MP_SHMCC_EXCLUDE_LIST environment variable is provided for PEMPI to turn off
explicit use of shared memory for a selected list of collective operations.

4.2.5 Optimizing for Large Messages


Large messages, by default, are transmitted via RDMA, which uses RC connections. At
large scale, a job may run out of RC connections, which consume memory proportionally
to the number of RC connections. The default number of connections can be overridden
by setting MP_RC_MAX_QP.

When the number of MPI tasks sharing a node is small, setting MP_INSTANCES to a
value larger than one may help to improve bandwidth. Increasing the number of
MP_INSTANCES will increase the number of RC connections accordingly.

By default, messages above 16KB are transmitted in RDMA mode (qualifying them as
large messages). This is the crossover point between FIFO and RDMA modes. It can be
overridden by setting MP_BULK_MIN_MSG_SIZE.

4.2.6 Optimizing for Intermediate-Size Messages


Short messages are transmitted via the Eager protocol where incoming messages are
buffered on the receive side if no matching receive message is posted. This requires
special buffers to be allocated on each node. Setting MP_EAGER_LIMIT will override the
default message size at which the eager protocol is replaced by a Rendezvous protocol.
MPI will honor the MP_EAGER_LIMIT value only if the early arrival buffer is big enough.
The default buffer size can be overridden by setting MP_BUFFER_MEM .

On the send side, short messages are copied to a retransmission buffer. This allows MPI
to return to user program while a message is in transit. The default buffer size can be
overridden by setting MP_REXMIT_BUF_SIZE.

As implied by sections 4.2.3 and 4.2.5, intermediate-size messages are between 4KB
and the RDMA crossover point (default 16KB, as set by MP_BULK_MIN_MSG_SIZE).

4.2.7 Optimizing for Memory Usage


For large task counts, a significant portion of memory can be consumed by RC RDMA
connections. Should that become a scalability issue, a program may choose to run in
FIFO mode.

This can be achieved by setting MP_USE_BULK_XFER=no. The buffer tuning environment
variables MP_BUFFER_MEM and MP_RFIFO_SIZE can be set to reduce the MPI
memory footprint.

For a small task count per node, FIFO mode is less efficient than RDMA.

A significant improvement in network bandwidth can be achieved by setting
MP_FIFO_MTU=4K. Note, a 4K MTU must be enabled on the switch for that to work.

4.2.8 Collective Offload in MPICH2


The ConnectX-2 adapter for QDR networks and the ConnectX-3 adapter for FDR10 and
FDR networks, offer an interface called Fabric Collective Accelerator (FCA). FCA
combines CORE-Direct (Collective Offload Resource Engine) on the adapter with
hardware assistance on the switch to speed up selected collective operations. Using FCA
may significantly improve the performance of an application. Only MPICH2 supports
collective offloading. The following MPI collectives are available with FCA:
MPI_Reduce
MPI_Allreduce
MPI_Barrier


MPI_Bcast
MPI_Allgather
MPI_Allgatherv
The list of supported data types includes:
All data types for C language bindings, except MPI_LONG_DOUBLE
All data types for C reduction functions (C reduction types).
The following data types for FORTRAN language bindings: MPI_INTEGER,
MPI_INTEGER2, MPI_INTEGER4, MPI_INTEGER8, MPI_REAL, MPI_REAL4
and MPI_REAL8
FCA does not support data types for FORTRAN reduction functions (FORTRAN
reduction types).
By default, collective offload is turned off. To enable it, the environment variable
MP_COLLECTIVE_OFFLOAD=[all | yes] must be set. Setting
MP_COLLECTIVE_OFFLOAD=[none | no] disables collective offload. Once enabled, the
FCA collective algorithm is the first one MPICH2 will try. If the FCA algorithm cannot
run at that time, a default MPICH2 algorithm is executed. The FCA algorithm may not
be available if a node has no FCA software installed, does not have a valid license, the
FCA daemon is not running on the network, etc. FCA support is limited to 2K MPI
communicators per network.

To enable a subset of the supported FCA algorithms, two environment variables must be set:
MP_COLLECTIVE_OFFLOAD=[all | yes] and
MP_MPI_PAMI_FOR=[<collective1>,...,<collectiveN>], where collectives in the list are
identified as Reduce, Allreduce, Barrier, Bcast, Allgather and AllGatherV. All
collectives outside the list will use the default MPICH2 algorithm.
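For example, to offload only a subset of the supported collectives (the list is illustrative):

export MP_COLLECTIVE_OFFLOAD=yes
export MP_MPI_PAMI_FOR=Allreduce,Barrier,Bcast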

4.2.9 MPICH2 and PEMPI Environment Variables

MP_EUIDEVELOP=min
When set to min, selects the optimized MPICH2 library. The optimized library helps to
reduce latency. The standard MPICH2 library is selected by default. (Note that the
PEMPI library will skip some parameter checking when MP_EUIDEVELOP=min is used.)

MP_SINGLE_THREAD=[yes|no]
Avoids some PAMI locking overhead and improves short message latency when set to
yes. MP_SINGLE_THREAD=yes is valid only for user programs which make MPI calls
from a single thread. For multithreaded processes with threads making concurrent MPI
calls, setting MP_SINGLE_THREAD=yes will cause inconsistent results. The default
value is no.

MP_SHARED_MEMORY=[yes|no]
To specify the use of shared memory for intranode communications rather than network.
The default value is yes. In a few cases disabling shared memory improves
performance.

MP_SHMEM_PT2PT=[yes|no]
Specifies whether intranode point-to-point MPICH2 communication should use optimized,
shared-memory protocols. Allowable values are yes and no. The default value is yes.

MP_EAGER_LIMIT=<INTEGER>
Changes the message size threshold above which rendezvous protocol is used. This
environment variable may be useful in reducing the latency for medium-size messages.
Larger values increase the memory overhead.

MP_REXMIT_BUF_SIZE=<INTEGER>
Specifies the size of a local retransmission buffer (send side). The recommended value is
the size of MP_EAGER_LIMIT plus 1K. It may help to reduce the latency of medium-size
messages. Larger values increase memory overhead.

MP_FIFO_MTU=[2K|4K]
If a chassis MTU on the InfiniBand switch is 4K, the environment variable can be set to
4K. This will improve bandwidth for medium and large messages if a job is running in
FIFO mode (MP_USE_BULK_XFER=no). It may have a negative impact on the latency
of messages below 4K. The default value is 2K.

MP_RDMA_MTU=[2K|4K]
If a chassis MTU on the InfiniBand switch is 4K, the environment variable can be set to
4K. This may improve bandwidth for medium and large messages if a job is running in
RDMA mode (MP_USE_BULK_XFER=yes). The default value is 2K.

MP_PULSE=<INTEGER>
The interval (in seconds) at which POE checks the remote nodes to ensure that they are
communicating with the home node. Setting to 0 reduces jitter. The default value is
600.

MP_INSTANCES=<INTEGER>
The number of instances corresponds to the number of IB Queue Pairs (QP) over which
a single MPI task can stripe. Striping over multiple QPs improves network bandwidth in
RDMA mode when a single instance does not saturate the link bandwidth. The default is
one, which is usually sufficient when there are multiple MPI tasks per node.

MP_USE_BULK_XFER=[yes|no]
Enables bulk message transfer (RDMA mode). RDMA mode requires RC connections
between each pair of communicating tasks which takes memory resources. The value
no will turn on FIFO mode which is scalable. In some cases FIFO bandwidth
outperforms RDMA bandwidth due to reduced contention in the switch. The default value
is yes.

MP_BULK_MIN_MSG_SIZE=<INTEGER>
Sets the minimum message length for bulk transfer (RDMA mode). A valid range is from
4096 to 2147483647 (INT_MAX). Note that for PEMPI, the MP_EAGER_LIMIT value
takes precedence if it is larger. MPICH2 ignores the value of
MP_BULK_MIN_MSG_SIZE. This environment variable can help optimize the crossover
point between FIFO and RDMA modes.

MP_SHMCC_EXCLUDE_LIST=[ all | none | <list of collectives> ]
Use this environment variable to specify the collective communication routines for which
the MPI-level shared memory optimization should be disabled. A list of selected collective
routines is specified with colon separators: <list of collectives> =
<collective1>:...:<collectiveN>.

The full list of collectives includes: "Barrier", "Bcast", "Reduce", "Allreduce",
"Reduce_scatter", "Gather", "Gatherv", "Scatter", "Scatterv", "Allgather", "Allgatherv",
"Alltoall", "Alltoallv", "Alltoallw", "Scan", "Exscan". The default is none. It applies only to
PEMPI.
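
For example, to disable the shared memory optimization for Barrier and Allreduce only
(an illustrative setting):

export MP_SHMCC_EXCLUDE_LIST=Barrier:Allreduce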

MP_RC_MAX_QP=<INTEGER>
Specifies the maximum number of Reliable Connected Queue Pairs (RC QPs) that can
be created. The purpose of MP_RC_MAX_QP is to limit the amount of memory that is
consumed by RC QPs. This is recommended for applications which are close to or
exceed the memory limit.

MP_RFIFO_SIZE=<INTEGER>
The default size of the receive FIFO used by each MPI task is 4MB. For larger jobs it is
recommended to use the maximum receive FIFO size (16MB) by setting
MP_RFIFO_SIZE=16777216.

MP_BUFFER_MEM=<INTEGER>
Specifies the size of the Early Arrival buffer that is used by the communication subsystem
to buffer eager messages arriving before a matching receive is posted. Setting
MP_BUFFER_MEM can address the message "MPCI_MSG: ATTENTION: Due to
memory limitation eager limit is reduced to X". MP_BUFFER_MEM applies to PEMPI only.

MP_POLLING_INTERVAL=<INTEGER>
This defines the interval in microseconds at which the LAPI timer thread runs. Setting the
polling interval equal to 800000 defines an 800 millisecond timer. The default is 400
milliseconds.

MP_RETRANSMIT_INTERVAL=<INTEGER>
PAMI will retransmit packets if an acknowledgement is not received in time.
Retransmissions are costly and often unnecessary, generating duplicate packets. Setting
a higher value will allow PAMI to tolerate larger delays before the retransmission logic
kicks in.

4.2.10 IBM PE Standalone POE Affinity


When PE is used without a resource manager, POE provides standalone task affinity.
This involves:
For OpenMP jobs, the MP_TASK_AFFINITY=cpu:n/core:n/primary:n options are
available for both the Intel and GNU OpenMP implementations. Since PE does not
know which OpenMP implementation is in use (Intel or GNU), PE has to set both
GOMP_CPU_AFFINITY and KMP_AFFINITY for each task.
For non-OpenMP jobs - using MP_TASK_AFFINITY=cpu, core, primary, or
mcm - POE will examine the x86 device tree and determine the cpusets to which
the tasks will be attached, using the sched_setaffinity system-level affinity
API call.
Adapter affinity (MP_TASK_AFFINITY=sni) is not supported on x86 systems.

4.2.11 OpenMP Support


On the x86 Linux platform, PE provides affinity support for the Intel and GNU x86
platform OpenMP implementations, based on the KMP_AFFINITY and
GOMP_CPU_AFFINITY variables. First POE will determine the number of parallel
threads: for jobs involving a resource manager, POE expects the OMP_NUM_THREADS
variable to be exported and set to the number of parallel threads.

When OMP_NUM_THREADS is not exported, POE will use the value of n in the
MP_TASK_AFFINITY = core:n, cpu:n, or primary:n as the number of parallel threads.

For Intel, PE will set the KMP_AFFINITY variable with a list of CPUs in the proclist
sub-option value. Note that POE has to allow for a user-specified KMP_AFFINITY
variable, appending the list of CPUs to any existing options. If the user has already
specified a proclist sub-option value, POE will override the user-specified value, while
displaying an appropriate message. If MP_INFOLEVEL is set to 4 or higher, POE will also
add the verbose option to KMP_AFFINITY. An example of the KMP_AFFINITY format
POE will set (for MP_TASK_AFFINITY=cpu:4) is: KMP_AFFINITY=proclist=[3,2,1,0],explicit.

For GNU, PE will set the GOMP_CPU_AFFINITY variable correctly, to a comma-separated
list of CPUs, one per thread, for each task. A sample format of the
GOMP_CPU_AFFINITY value (for MP_TASK_AFFINITY=cpu:4) is:
GOMP_CPU_AFFINITY="0,2,4,6".

Although GOMP_CPU_AFFINITY allows its values to be expressed as ranges or strides,
POE will use a comma-separated list of single CPU numbers, representing logical CPUs
attached to the task. If the user has already specified a GOMP_CPU_AFFINITY value,
POE will override the user-specified value, while displaying an appropriate message.

When MP_BINDPROC = yes is specified, POE will bind/attach the tasks based on the
list of CPUs in the KMP_AFFINITY and GOMP_CPU_AFFINITY values.

4.3 Using LoadLeveler with IBM PE

4.3.1 Requesting Island Topology for a LoadLeveler Job


An island can be defined as a set of nodes connected to the same switch. Those
switches can be interconnected via spine switches to form a larger network. Running a
job within an island typically provides better latency and bisection bandwidth than
spreading it over multiple islands. Island topology can be requested by the
node_topology keyword
# @ node_topology = none | island
A second JCF keyword, island_count
# @ island_count=minimum[,maximum]


specifies the minimum and maximum number of islands to select for this job step. A value
of -1 represents all islands in the cluster.

If island_count is not specified, all machines will be selected from a common island.
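
For example, a job step that must run entirely within a single island could include the
following JCF fragment (illustrative, using the keywords above):

# @ node_topology = island
# @ island_count = 1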

The llstatus and llrstatus commands will be enhanced to show which island contains
the machine or machine_group.

4.3.2 How to run OpenMPI and Intel MPI jobs with LoadLeveler

To run OpenMPI jobs under LoadLeveler, a job command file must specify a job type of
MPICH. A script that will launch a job built with OpenMPI version 1.5.4 is given below
# ! /bin/ksh
# LoadLeveler JCF file for running an Open MPI job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = ompi_test.$(cluster).$(process).out
# @ error = ompi_test.$(cluster).$(process).err
# @ queue
mpirun openmpi_test_program
For OpenMPI versions prior to 1.5.4, a user must specify the LoadLeveler run time
environment variables LOADL_TOTAL_TASKS and LOADL_HOSTFILE, as arguments
to the OpenMPI executable to indicate the number of tasks to start and where to start
them. The user also specifies the LoadLeveler llspawn.stdio executable as the remote
command for the OpenMPI executable to use when launching MPI tasks, and must
specify the mpirun option --leave-session-attached, which will keep the spawned MPI
tasks descendants of LoadLeveler processes.

A script that will launch a job built with OpenMPI version older than 1.5.4 is given below
# ! /bin/ksh
# LoadLeveler JCF file for running an Open MPI job
# @ job_type = MPICH
# @ node = 4
# @ tasks_per_node = 2
# @ output = ompi_test.$(cluster).$(process).out
# @ error = ompi_test.$(cluster).$(process).err
# @ queue
export LD_LIBRARY_PATH=/opt/openmpi/lib
/opt/openmpi/bin/mpirun --leave-session-attached --mca plm_rsh_agent "llspawn.stdio :
ssh" -n $LOADL_TOTAL_TASKS -machinefile $LOADL_HOSTFILE
mpi_hello_sleep_openmpi

4.3.3 LoadLeveler JCF (Job Command File) Affinity Settings

LoadLeveler must be configured to enable affinity support, which includes the following
steps:
$ mkdir /dev/cpuset
$ mount -t cpuset none /dev/cpuset


The LoadLeveler configuration file must contain the keyword


RSET_SUPPORT = RSET_MCM_AFFINITY
The keyword SCHEDULE_BY_RESOURCES must have ConsumableCpus as one of the
arguments. Without the above settings a job requesting affinity will not be dispatched by
the central manager.

4.3.4 Affinity Support in LoadLeveler


The following JCF keywords support affinity in LoadLeveler on x86
1 # @ rset = RSET_MCM_AFFINITY
2 # @ mcm_affinity_options = mcm_mem_[req | pref]
3 # @ mcm_affinity_options =[mcm_distribute|mcm_accumulate]
4 # @ task_affinity = CORE(n)
5 # @ task_affinity = CPU(n)
6 # @ cpus_per_core = n
7 # @ parallel_threads = n
A job containing the mcm_affinity_options or task_affinity keywords will not be
submitted to LoadLeveler unless the rset keyword is set to RSET_MCM_AFFINITY.

When mcm_mem_pref is specified, the job requests memory affinity as a preference.


When mcm_mem_req is specified, the job requests memory affinity as a requirement.
mcm_accumulate tells the central manager to accumulate tasks on the same MCM
whenever possible.

mcm_distribute tells the central manager to distribute tasks across all available MCMs
on a machine.

When CORE(n) or CPU(n) is specified, the central manager will assign n physical cores
or n logical CPUs to each job task. (Note that a physical core can have multiple logical
CPUs.)

cpus_per_core specifies the number of logical CPUs per processor core that should
be allocated to each task of a job with the processor-core affinity requirement
(#@task_affinity = CORE). This requirement can only be satisfied by nodes configured in
SMT mode.

parallel_threads=m will bind m OpenMP threads to the CPUs selected for the task by
task_affinity = CPU(n) keyword, where m <= n. If task_affinity = CORE(n) is specified, m
OpenMP threads will be bound to m CPUs, one CPU per core.
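
For example, a hypothetical JCF fragment that binds four OpenMP threads per task, one
per physical core, would combine the keywords above as follows:

# @ rset = RSET_MCM_AFFINITY
# @ task_affinity = CORE(4)
# @ parallel_threads = 4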

The SMT setting cannot be changed per job launch.


5 Performance Analysis Tools on Linux

5.1 Runtime Environment Control

5.1.1 Ulimit
The ulimit command controls user limits on jobs run interactively. Large HPC jobs often
use more hardware resources than are given to users by default. These limits are put in
place to make sure that program bugs and user errors don't make the system unusable
or crash. However, for benchmarking and workload characterization, it is best to set all
resource limits to unlimited (or the highest allowed value) before running. This is
especially true for resources like stack, memory and CPU time.
ulimit -s unlimited
ulimit -m unlimited
ulimit -t unlimited

5.1.2 Memory Pages


The RHEL 6.2 and SLES 11.2 Linux distributions support a new kernel feature called
transparent huge pages (THP). For HPC applications, using huge pages can reduce
execution delays from missing in the TLB. As long as huge pages are available to the
system, the user doesn't need to do anything explicit to take advantage of them. The
details are presented in section 3.2.

5.2 Hardware Performance Counters and Tools


The hardware in x86 processors provides performance counters for monitoring events to
understand performance and identify performance limiters. The Sandy Bridge processor
provides 4 programmable counters and 3 fixed event counters. There are also uncore
event counters that can be used to monitor L3, QPI and memory events. There are a
number of tools that can be used to gather counter data. The use of perf for collecting
counter data at the application level and the use of PAPI for instrumenting specific areas
in the code will be covered here. perf ships with Linux but may need to be installed (it
may not be installed with the default set of packages). Other tools that can be used are
vtune and likwid, but they will not be covered here. vtune requires a license from Intel to
use. likwid can be found here; it does have some nice features.

For a description of the events available to count see chapter 19 in volume 3b of the Intel
architecture manual.

For a guide on how to use these counters for analysis see appendix B.3 of the Intel
Software Optimization guide. The general optimization material in the rest of the
document is also recommended.

There is a good paper on bottleneck analysis using performance counters on the x86
5500 series processors, much of which is applicable to Sandy Bridge processors. It can
be found here.

5.2.1 Hardware Event Counts using perf


perf is a suite of Linux tools that can be used for both collecting counter data and
profiling. To check if perf is installed, enter the perf command. If it is available, the
command returns its help message; if not, it has to be installed. The installation varies by
distribution, but to install on RHEL (as root) enter
$ yum install perf
Install on SLES (as root) with the command
$ zypper install perf
Also make sure, if a new kernel is installed, that perf gets updated to match the kernel.

To use perf for collecting performance counters, the perf list and perf stat
subcommands are used. See 'perf help' for more information on the subcommands
available for perf. A tutorial is available here. For more information on any command
enter perf help COMMAND.

Using the latest available version of perf is strongly recommended. Later versions
provide more features.

The profiling aspect of perf will be covered in a later section. perf list is used to show the
events available for the hardware being used. Any of the events can be appended with a
colon followed by one or more modifiers that further qualify what will be counted. The
modifiers are shown in the table below. Raw events can be modified with the same
syntax as the symbolic events provided by perf list.

Table 5-1 Event modifiers for perf -e <event>:<mod>

Modifier   Action
u          Count events in user mode
k          Count events in kernel mode
h          Count events in hypervisor mode
Default    Count all events regardless of mode
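
For example, to count cycles separately in user mode and in kernel mode in a single run
(an illustrative command using the modifier syntax above):

$ perf stat -e cycles:u,cycles:k ./stream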
perf list also shows software and tracepoint events, but the focus of this document is
hardware counter events. These include the hardware, cache, and raw events. To see a
list of the symbolic hardware events supported by perf for the current processor, use the
command
$ perf list hardware
For the cache events supported, use the command
$ perf list cache
perf also supports raw event codes of the format rNNN where NNN represents the
umask appended with the event code. The format is described further by the command
$ perf list --help

5.2.1.1 Example 1
The first example demonstrates how to collect counts on built-in events with the
command
$ perf stat
Events will be counted for the standard benchmark program stream. Download stream.c
from here.

Modify the code to set the array size:


# define N 80000000


Compile the code with


gcc -O3 -fopenmp stream.c -o stream
Set the executable to run on 1 thread with
export OMP_NUM_THREADS=1
For repeatable results, force it to bind to a core (in this case, core 2):
numactl --physcpubind 2 ./stream
Collect counter data with default events: perf stat numactl --physcpubind 2 ./stream

A typical output is
Performance counter stats for 'numactl --physcpubind 2 ./stream':

4742.460148 task-clock # 0.999 CPUs utilized


70 context-switches # 0.000 M/sec
0 CPU-migrations # 0.000 M/sec
2,078 page-faults # 0.000 M/sec
18,080,352,006 cycles # 3.812 GHz
11,142,262,796 stalled-cycles-frontend # 61.63% frontend cycles idle
8,014,245,810 stalled-cycles-backend # 44.33% backend cycles idle
20,102,536,445 instructions # 1.11 insns per cycle
# 0.55 stalled cycles per insn
2,579,174,124 branches # 543.847 M/sec
39,238 branch-misses # 0.00% of all branches

4.746073552 seconds time elapsed

There are a couple of reasons that would cause problems with getting valid counter data
from perf.

The first is if the oprofile daemon is running. If this is true, running perf will give the
following error:
Error: open_counter returned with 16 (Device or resource busy).
/bin/dmesg may provide additional information.

Fatal: Not all events could be opened.


To disable the oprofile daemon, log in as root and enter
$ opcontrol --deinit
The second reason is if the nmi_watchdog_timer is enabled. This will use one of the
event counters. To disable the nmi_watchdog_timer, as root enter
$ echo 0 > /proc/sys/kernel/nmi_watchdog
To collect event counts for L1-dcache-loads, L1-dcache-load-misses, cycles, and
instructions:
perf stat -e L1-dcache-loads,L1-dcache-load-misses,cycles,instructions numactl \
--physcpubind 2 ./stream
The output is:
Performance counter stats for 'numactl --physcpubind 2 ./stream':

5,906,710,542 L1-dcache-loads
1,107,616,593 L1-dcache-load-misses # 18.75% of all L1-dcache hits
18,073,237,009 cycles # 0.000 GHz
20,100,714,663 instructions # 1.11 insns per cycle


5.2.1.2 Example 2
The next example includes the syntax for a raw event. UOPS_ISSUED.ANY is collected in
addition to the counters above. Section 19.3 of the architecture manual 3b provides the
umask 01 and the event number 0x0e. The raw code concatenates the two - 010e.
An alternative way to get the mask and event code is to use libpfm4. Install libpfm4 and
go into the examples directory. The utility showevtinfo will give the event codes and
umasks for the current processor.

Enter the command:


perf stat -e r10e,L1-dcache-loads,L1-dcache-load-misses,cycles,instructions numactl \
--physcpubind 2 ./stream
This should produce something like:
Performance counter stats for 'numactl --physcpubind 2 ./stream':

24,228,890,952 r10e
5,906,666,765 L1-dcache-loads
1,107,547,809 L1-dcache-load-misses # 18.75% of all L1-dcache hits
18,063,959,547 cycles # 0.000 GHz
20,100,581,873 instructions # 1.11 insns per cycle

4.742253967 seconds time elapsed

Notice the output gives the raw code instead of the event name. Using libpfm4 and a
couple of scripts provides the translation from a raw code to a symbolic name. The two
scripts are

Code Listing 1 get_event_dict.awk


# get_event_dict.awk
BEGIN {
    mfound = 1;
    printf ("event_names = {\n");
}

{
    if ( $1 == "PMU") {
        if (mfound == 0) {
            if ((type == "wsm_dp") || (type == "ix86arch"))
                printf(" \'0x%s\':\'%s\',\n", code1, name);
        }
        mfound = 0;
        type = $4;
    }
    if ( $1 == "Name") {
        name = $3;
    }
    if ( $1 == "Code") {
        # strip the leading "0x"; left-pad single-digit codes with "0"
        code1 = substr($3,3);
        code = code1;
        if (length(code1) == 1) {
            code = "0" code1;
        }
    }
    if ( substr($1,1,5) == "Umask" ) {
        mfound = 1;
        mask = substr($3,3);
        if (index(mask,"0") == 1)
            mask = substr(mask,2);
        qual = substr($7,2);
        sub(/\]/, "", qual);
        if ((type == "wsm_dp") || (type == "ix86arch"))
            printf " \'0x%s%s\':\'%s.%s\',\n", mask, code, name, qual;
    }
}

# close the Python dictionary so the generated evt_dict file is valid
END {
    printf ("}\n");
}
Code Listing 2 get_events.py
# get_events.py
import sys

# Load the event_names dictionary produced by get_event_dict.awk
# (the evt_dict file generated in the step below).
execfile('evt_dict')

def loadData(infile):
    file = open(infile, 'r')
    for line in file:
        if not line.strip():
            continue
        line_data = line.split()
        if line_data[0].isdigit():
            if (len(line_data) > 1):
                if line_data[1] == "raw":
                    if line_data[2] in event_names:
                        name = event_names[line_data[2]]
                    else:
                        print "could not find", line_data[2]
                        name = line_data[2]
                else:
                    name = line_data[1]
            else:
                name = "Group"
            print line_data[0], ", ", name
    file.close()

def main():
    if (len(sys.argv)) < 2:
        print "Usage:\npython get_events.py infile"
        exit()
    data = loadData(sys.argv[1])
    return

main()
To translate the raw code, the counter output must be saved to a file, e.g.
$HOME/counter_output_filename. Then the scripts should be copied (as root) to the
examples/ directory in libpfm4. From the examples/ directory issue:
./showevtinfo | awk -f get_event_dict.awk > evt_dict (as root)
python get_events.py $HOME/counter_output_filename > $HOME/counters.csv
counters.csv can now be loaded into a spreadsheet and will provide the counts and
symbolic names.

5.2.1.3 Options for perf stat


perf stat --help produces the following output:
NAME
perf-stat - Run a command and gather performance counter statistics


SYNOPSIS
perf stat [-e <EVENT> | --event=EVENT] [-S] [-a] <command>
perf stat [-e <EVENT> | --event=EVENT] [-S] [-a] <command> [<options>]

DESCRIPTION
This command runs a command and gathers performance counter statistics from it.

OPTIONS
<command>...
Any command you can specify in a shell.

-e, --event=
Select the PMU event. Selection can be a symbolic event name (use perf list to list
all events) or a raw PMU event (eventsel+umask) in the form of rNNN where NNN is a
hexadecimal event descriptor.

-i, --inherit
child tasks inherit counters

-p, --pid=<pid>
stat events on existing pid

-a
system-wide collection

-c
scale counter values
The -e option is used to specify the events to count. The Sandy Bridge processor
supports 4 programmable and 3 fixed counters. The fixed events are
UNHALTED_CORE_CYCLES (cycles), INSTRUCTION_RETIRED (instructions) and
UNHALTED_REFERENCE_CYCLES. If more than 4 events requiring the programmable
counters are specified, perf will multiplex the counters. The -c (--scale) option will
normalize the multiplexed counts to the entire collection period and will provide an
indication of how much of this time period was spent collecting each of the counters.

Counters need to be collected for a long enough period to get good samples to represent
the entire application. The time required to get a good sample will vary depending on how
steady the application is. The best way to ensure that the samples are large enough is to
run with two different collection periods. (For example, run the benchmark twice as long
the second time through.) The sample period is long enough if the counts are
proportionately similar. Keep in mind that multiplexing influences the total sample time for
each event and it must be taken into account in future collections for that application.

Specifying multiple -e options controls how perf multiplexes the counters. A
recommendation is to also collect instructions and cycles with any group of counter data.
Comparing the instructions and cycles for the different runs provides a way to check the
variation from run to run.

5.2.1.4 Example 3
An example using multiplexed event counters is:
numactl --physcpubind 2 perf stat -e cycles,instructions,L1-dcache-loads,L1-dcache-load-
misses,L1-dcache-stores,L1-dcache-store-misses,L1-dcache-prefetches,L1-dcache-
prefetch-misses ./stream_gcc


The output is:


Performance counter stats for './stream_gcc':

17,917,691,998 cycles # 0.000 GHz [57.14%]


17,138,124,280 instructions # 0.96 insns per cycle [71.43%]
5,106,659,829 L1-dcache-loads [71.43%]
1,108,292,001 L1-dcache-load-misses # 21.70% of all L1-dcache hits [71.43%]
1,901,538,315 L1-dcache-stores [71.43%]
473,207,328 L1-dcache-store-misses [71.45%]
821,116,783 L1-dcache-prefetch-misses [57.16%]

4.739070694 seconds time elapsed

The output reports a total runtime of 4.74 seconds. The percentage in the square
brackets shows what percentage of 4.74 seconds was used while collecting for that
particular counter. The perf output automatically adjusts for multiplexing so that the count
displayed represents the actual count divided by the percentage collected.

To collect events for a job that is already running (under numactl binding control) or for a
certain process of a job, perf can be used with the -p and -a options while running the
sleep command. If the job is multitasked or multithreaded, using the -i flag collects event
counts for all child processes.
perf stat -p <pid> -a -i sleep <data collection interval>
Attaching to an already running job is the technique to use if
the job runs for a long time and event data can be collected for a shorter time
(the length of the sleep)
the job has a warm-up period that needs to be excluded from the event counts.
a specific job phase has to be reached before collecting event counts.
These are the basics of collecting counter data. The question now becomes which events
should be collected for analysis? Start out with the basic events needed for CPI,
instruction mix, cache miss rates, ROB and RAT stalls, and branch prediction metrics.
For programs that use large arrays, data on L1TLB and L2TLB misses is also useful.
Some other things to consider are the use of vector instructions and the effectiveness of
prefetching. What is needed will depend on what performance issues need to be
understood.

5.2.2 Instrumenting Hardware Performance with PAPI


perf is good for collecting performance data on an entire application. perf also has an
API for collecting counter data, but the perf interface is not documented.

To collect data on a specific portion of a program, the PAPI (Performance Application


Programming Interface) library is recommended for instrumenting the source code. The
PAPI source tarball, installation instructions and other documentation are available from
here. Some of the Linux distributions include a prepackaged version of PAPI, but
downloading and installing PAPI is recommended. PAPI supports collecting performance
event counts on a variety of systems including x86 Linux, Power Linux, and Power AIX.

A high level overview of PAPI is here. The examples/ and ctests/ subdirectories in the
PAPI tree have additional useful information. The instrumentation examples in this
document use the low-level API which is documented here. There is no support for this
library, so, as mentioned in the previous section, if there are problems collecting
performance data, look first at conflicts with oprofile or the nmi_watchdog_timer.

The functionality needed to use PAPI to collect counter data is:


1) select the events to collect
2) initialize PAPI


3) set up PAPI to collect the events selected for each thread


4) start counting at the appropriate time
5) stop counting when the section of code that is instrumented is done
6) print the results
Good examples using basic PAPI calls in serial code are the C program low-level.c in
the src/ctests/ directory and the FORTRAN program fmatrixlow.F in the src/ftest/
directory. The program zero_omp.c in src/ctests demonstrates the use of PAPI to
instrument OpenMP code.

The above examples directly call the code to be tested. However, the recommended
method to use is to create a custom library for local use that has the functions
papi_init(), thread_init(), start_counters(), stop_counters(), restart_counters(), and
print_counters(). An associated header file (called papi_interface.h below) including the
declarations for these custom functions is also needed. Using these functions will
minimize the changes needed to instrument the code. Include the .h file in the
instrumented code and add the above functions as needed. Using #ifdefs isolates the
PAPI changes from the rest of the code to make testing more convenient.
The example below illustrates how to alter code so that it is instrumented for collecting
performance counts for my_func2(), not my_func1() or my_func3().
#ifdef PAPI_LOCAL
#include "papi_interface.h"
#endif

int main ( ) {

#ifdef PAPI_LOCAL
papi_init();
thread_init();
#endif
my_func1();
#ifdef PAPI_LOCAL
start_counters();
#endif

my_func2();

#ifdef PAPI_LOCAL
stop_counters();
print_counters();
#endif
my_func3();
}
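
A minimal sketch of what the custom papi_interface library itself might contain is shown
below. The event selection, output format and error handling are illustrative only; a real
implementation would make the event list configurable and add thread support (e.g. via
PAPI_thread_init).

/* papi_interface.c - illustrative sketch of the custom wrapper library */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static int event_set = PAPI_NULL;
static long long counts[2];

void papi_init(void)
{
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI_library_init failed\n");
        exit(1);
    }
    PAPI_create_eventset(&event_set);
    PAPI_add_event(event_set, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(event_set, PAPI_TOT_INS);   /* instructions completed */
}

void start_counters(void) { PAPI_start(event_set); }

void stop_counters(void)  { PAPI_stop(event_set, counts); }

void print_counters(void)
{
    printf("cycles: %lld  instructions: %lld\n", counts[0], counts[1]);
}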

5.3 Profiling Tools


Profiling refers to charging CPU time to subroutines and micro-profiling refers to charging
CPU time to source program lines. Profiling is frequently used in benchmarking and
tuning activities to find out where the "hotspots" in a code, the regions that accumulate
significant amounts of CPU time, are located.

Several tools are available for profiling. The most frequently used is gprof, but perf and
oprofile are also used.


5.3.1 Profiling with gprof


gprof is the standard Linux tool used for profiling applications. Be sure to read the
chapter on inaccuracies.

To get gprof-compatible output, first binaries need to be compiled and created with the
added -pg option (additional options like optimization level, -On, can also be added):
$ gcc -pg -o myprog.exe myprog.c
or
$ gfortran -pg -o myprog.exe myprog.f
When the program is executed, a gmon.out file is generated (or, for a parallel job,
several gmon.<#>.out files are generated, one per task). To get the human-readable
profile, run:
$ gprof myprog.exe gmon.out > myprog.gprof
or
$ gprof myprog.exe gmon.*.out > myprog.gprof
The first part of an example output from gprof is:
Flat profile:

Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
68.72 2.17 2.17 1 2.17 2.17 rand_read
15.83 2.67 0.50 1 0.50 0.66 gen_indices
10.45 3.00 0.33 1 0.33 2.50 run_concurrent
5.07 3.16 0.16 16777216 0.00 0.00 get_block_index
0.00 3.16 0.00 1 0.00 0.00 access_setup
0.00 3.16 0.00 1 0.00 0.00 get_args
0.00 3.16 0.00 1 0.00 0.00 init_run_params
0.00 3.16 0.00 1 0.00 0.00 parse_num
0.00 3.16 0.00 1 0.00 0.00 print_config

In the above profile, the function rand_read accounts for 69% of the time, even though it
is only called once. The function get_block_index is called almost 17 million times, but
only accounts for 5% of the time. The obvious routine to focus on for optimization is the
function rand_read.

gprof then displays a call tree.


Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.32% of 3.16 seconds

index % time self children called name


<spontaneous>
[1] 100.0 0.00 3.16 main [1]
0.33 2.17 1/1 run_concurrent [2]
0.50 0.16 1/1 gen_indices [4]
0.00 0.00 1/1 init_run_params [8]
0.00 0.00 1/1 get_args [7]
0.00 0.00 1/1 print_config [10]
0.00 0.00 1/1 access_setup [6]
-----------------------------------------------
0.33 2.17 1/1 main [1]
[2] 79.1 0.33 2.17 1 run_concurrent [2]
2.17 0.00 1/1 rand_read [3]


-----------------------------------------------
2.17 0.00 1/1 run_concurrent [2]
[3] 68.7 2.17 0.00 1 rand_read [3]
-----------------------------------------------
0.50 0.16 1/1 main [1]
[4] 20.9 0.50 0.16 1 gen_indices [4]
0.16 0.00 16777216/16777216 get_block_index [5]
-----------------------------------------------

0.16 0.00 16777216/16777216 gen_indices [4]


[5] 5.1 0.16 0.00 16777216 get_block_index [5]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[6] 0.0 0.00 0.00 1 access_setup [6]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[7] 0.0 0.00 0.00 1 get_args [7]
0.00 0.00 1/1 parse_num [9]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[8] 0.0 0.00 0.00 1 init_run_params [8]
-----------------------------------------------
0.00 0.00 1/1 get_args [7]
[9] 0.0 0.00 0.00 1 parse_num [9]
-----------------------------------------------
0.00 0.00 1/1 main [1]
[10] 0.0 0.00 0.00 1 print_config [10]
-----------------------------------------------
The call tree shows the caller for each of the functions.

gprof can also be used to tell the number of times a line of code is executed. This is
done using the gcov tool. See the documentation here for more details.

5.3.2 Microprofiling
Microprofiling is defined as charging counter events to instructions (in contrast to event
counts for an entire program as discussed in Section 5.2.1, or by function as discussed in
Section 5.3.1). This is typically done with a sampling-based profile. Sampling-based
profiling uses the overflow bit out of a counter to generate an interrupt and capture an
instruction address. The profiling tool can be set up to interrupt after a specified number
of cycles. Based on the number of times an instruction address shows up versus the total
number of samples, the instruction is assigned that percentage of the total number of
event occurrences.

There are three main tools used for microprofiling: vtune, perf and oprofile. All can use
cycles (time) or another counter event to do profiling. Only perf and oprofile will be
covered in this document since vtune requires a license to run.

5.3.2.1 Microprofiling with perf

perf uses cycles as its trigger event by default. This provides a list of instructions where
the program is spending time. To sample the program with the cycles event, enter
perf record [prog_name] [prog_args]
perf outputs some statistics and a file called perf.data. One key point is that the perf
command has to be bound to a CPU to get reproducible results.


An example output from perf using binding is:


# Events: 3K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ..........................
#
66.03% rand_rd rand_rd [.] rand_read
14.81% rand_rd rand_rd [.] gen_indices
5.37% rand_rd rand_rd [.] run_concurrent
4.47% rand_rd rand_rd [.] get_block_index
3.79% rand_rd [kernel.kallsyms] [k] clear_page_c
1.83% rand_rd libc-2.14.so [.] __random_r
1.34% rand_rd libc-2.14.so [.] __random
0.70% rand_rd libc-2.14.so [.] rand
0.27% rand_rd rand_rd [.] rand@plt
0.15% rand_rd [kernel.kallsyms] [k] get_page_from_freelist
0.12% rand_rd [kernel.kallsyms] [k] __kunmap_atomic
0.09% rand_rd [kernel.kallsyms] [k] run_posix_cpu_timers
0.09% rand_rd [kernel.kallsyms] [k] __kmap_atomic
0.06% rand_rd [kernel.kallsyms] [k] free_pages_prepare
0.06% rand_rd [kernel.kallsyms] [k] run_timer_softirq
0.06% rand_rd [kernel.kallsyms] [k] smp_apic_timer_interrupt
0.06% rand_rd [kernel.kallsyms] [k] update_cpu_load
0.06% rand_rd [kernel.kallsyms] [k] __might_sleep
0.03% rand_rd [kernel.kallsyms] [k] native_write_msr_safe

Notice this output includes system call and kernel functions.

perf annotate can be used to see where the program is spending its time. To get more
detail on the rand_read function, enter
perf annotate rand_read
The output is:
: /* j gets set to a random index in rarray */
: j = indices[i];
0.05 : 401209: 8b 45 f8 mov -0x8(%rbp),%eax
0.00 : 40120c: 48 98 cltq
0.00 : 40120e: 48 c1 e0 02 shl $0x2,%rax
1.75 : 401212: 48 03 45 e0 add -0x20(%rbp),%rax
0.00 : 401216: 8b 00 mov (%rax),%eax
0.09 : 401218: 89 45 f0 mov %eax,-0x10(%rbp)
: k += rarray[j];
0.05 : 40121b: 8b 45 f0 mov -0x10(%rbp),%eax
1.85 : 40121e: 48 98 cltq
0.00 : 401220: 48 c1 e0 03 shl $0x3,%rax
0.00 : 401224: 48 03 45 e8 add -0x18(%rbp),%rax
0.00 : 401228: 48 8b 00 mov (%rax),%rax
87.86 : 40122b: 89 c2 mov %eax,%edx
2.03 : 40122d: 8b 45 f4 mov -0xc(%rbp),%eax
0.09 : 401230: 01 d0 add %edx,%eax
1.57 : 401232: 89 45 f4 mov %eax,-0xc(%rbp)

The line
k += rarray[j];
is taking most of the time, with the assembly instruction
40122b: 89 c2 mov %eax,%edx
getting assigned 88% of the total time.
The -e option on perf record allows other events besides cycles to be used. This is
useful to figure out which specific lines of code are strongly associated with events like
cache-misses. Call chain data is output by using the perf record -g option followed by
perf report.

With higher levels of compiler optimization, the compiler can inline functions and reorder
instructions.

5.3.3 Profiling with oprofile


Using oprofile requires root access or sudo privilege levels on the opcontrol command.
Because of this restriction, it's easier to use perf, except for Java codes.
See here for details on how to use oprofile.
The opcontrol command is used to get a list of events to profile. Currently (as of April
2012), not many events are available for Sandy Bridge because oprofile has not yet
been updated to include Sandy Bridge.

A typical sequence to use for gathering an event profile with opcontrol/opreport is:
$ opcontrol --deinit
$ opcontrol --init
$ opcontrol --reset
$ opcontrol --image all
$ opcontrol --separate none
$ opcontrol --start-daemon --event=CPU_CLK_UNHALTED:100000
--event=INST_RETIRED:100000
$ opcontrol --start
$ [command_name] [command_args]
$ opcontrol --dump
$ opcontrol -h

# opreport can then be run to generate a report.


$ opreport -l command_name
opannotate can be used to annotate code.
$ opannotate --source prog_name
will produce an annotated source listing.
$ opannotate --assembly prog_name
will produce an annotated assembly listing.
$ opannotate --source --assembly prog_name
will produce an annotated combined source and assembly listing.
With higher levels of compiler optimization, the compiler can inline functions and reorder
instructions.

5.4 High Performance Computing Toolkit (HPCT)


The IBM High Performance Computing Toolkit can be used to analyze the performance
of applications on the iDataPlex system. It contains three tools:
Hardware performance counter profiling
MPI profiling and trace
I/O profiling and trace


5.4.1 Hardware Performance Counter Collection


The hardware performance counter tool reads the hardware performance counters built
into the system CPUs and accumulates performance measurements based on those
counters. There are three ways to obtain hardware performance counter-based
measurements:
System level reporting where system-wide hardware performance counters are
queried and reported on a periodic basis
Overall hardware performance counter values for the entire application
Hardware performance counter values for specific blocks of code, obtained by
editing the application source, adding calls to obtain hardware performance
counter values and then recompiling and re-linking the application.
The hardware performance counter tool supports Nehalem [1], Westmere [2], and Sandy
Bridge [3] x86-based processors.

The hpcstat command collects hardware performance counter measurements on a
system-wide basis. For instance, to query hardware performance counter group 2 every 5
seconds for 10 iterations, issue the command
$ hpcstat -g 2 -C 10 -I 5
The hpccount command collects hardware performance counter measurements over an
entire run of an application. For instance, to get a total event count for hardware
performance counter group 2 for the program shallow, issue the command
$ hpccount -g 2 shallow
To get hardware performance counter measurements for specific regions of code inside
an application, the application source must be modified to call the hpmStart and
hpmStop functions at appropriate points. Then the application should be recompiled with
the -g flag and linked with the libhpc library.

For instance, when compiling the program petest, the commands to use are
$ gcc -c -g -I/opt/ibmhpc/ppedev.hpct/include petest.c
$ gcc -o petest -g petest.o -L/opt/ibmhpc/ppedev.hpct/lib64 -lhpc
Before running the application, the HPM_EVENT_SET environment variable has to be
set to the correct hardware counter group. The hpccount -l command lists available
groups.

5.4.2 MPI Profiling and Tracing


The MPI profiling and trace tool provides summary performance information for each MPI
function call in an application. This summary contains a count of how many times an MPI
function call was executed, the time consumed by that MPI function call and, for
communications calls, the total number of bytes transferred by that function call.

It also provides a time-based trace view which shows a trace of MPI function calls in the
application. This trace can be used to examine MPI communication patterns. Individual
trace events provide information about the time spent in that MPI function call and, for
communication calls, the number of bytes transferred for that trace event.

[1] The product list of Nehalem processors is found here (EP and EX).
[2] The product list of Westmere processors is found here (EP and EX).
[3] The product list of Sandy Bridge processors is found here and here.


To use the MPI profiling and trace tool, the application must be relinked with the profiling
and trace library. A programming API is provided so that an application can be
instrumented to selectively trace a subset of MPI function calls.

When compiling an application, it should be compiled with the -g flag. When linking the
application, link with the libmpitrace library.

For instance, when compiling the program petest, the commands to use are
$ gcc -c petest.c
$ gcc -o petest -g petest.o -L/opt/ibmhpc/ppedev.hpct/lib64 -lmpitrace

5.4.3 I/O Profiling and Tracing


The I/O profiling and trace tool provides summary information about each system I/O call
in your application. This information includes the number of times the I/O call was issued,
the total time consumed by that function call, and for reads and writes, the number of
bytes transferred by that system call.

It also provides a time-based trace view which shows more detailed information about the
I/O system calls.

To use the I/O profiling and trace tool, you must re-link your application with the profiling
and trace library.

When you compile your application, it should be compiled with the -g flag. When you link
your application, you must link with the libtkio library.

For instance, to compile the program petest, use the commands

$ gcc -c petest.c
$ gcc -o petest -g petest.o -L/opt/ibmhpc/ppedev.hpct/lib64 -ltkio

5.4.4 Other Information


The HPC Toolkit provides an X11-based viewer which can be used to examine
performance data generated by these tools.

Current documentation for the HPC Toolkit can be found on the HPC Central website.
The documentation web page is here. Click the Attachments tab and download the
latest version of the documentation.

The HPC Toolkit is part of the IBM Parallel Environment (PE) Developer Edition product.
PE Developer Edition is an Eclipse-based IDE that you can use to edit, compile, debug
and run your application. PE Developer Edition also contains a plug-in for HPC Toolkit
that is integrated with the rest of the developer environment and provides the same
viewing capabilities as the X11-based viewer that is part of HPC Toolkit. Since the plug-in
for HPC Toolkit is integrated with Eclipse, an instrumented application can be run from
within the Eclipse IDE to obtain performance measurements.

Current documentation for the IDE contained in PE Developer Edition can be found here.

The HPC Toolkit is installed if the ppedev_runtime RPM is present (rpm -qi
ppedev_runtime) on all of the nodes in the cluster and the ppedev_hpct RPM is
installed on the login nodes in the cluster.

When using the HPC Toolkit to analyze the performance of parallel applications, the IBM
Parallel Environment Runtime Edition product must be installed on all of the nodes of the
cluster.

PE Developer Edition, including HPC Toolkit, is supported on Red Hat 6 and SLES 11
SP1 x86-based Linux systems.

The Eclipse IDE environment provided by PE Developer Edition requires that either
Eclipse 3.7 SR2 (Indigo) is already installed or that the version of Eclipse 3.7 SR2
packaged with PE Developer Edition is installed. Also, for Windows- and Linux-based
systems, the IBM Java version 6 packaged with PE Developer Edition must be installed.
For Mac users, the version of Java provided by default is all that is required.


6 Performance Results for HPC Benchmarks

6.1 Linpack (HPL)

6.1.1 Hardware and Software Stack Used


The base system for section 6.1.3 used nodes featuring Intel Xeon E5-2670 ("Sandy
Bridge-EP") 2.60 GHz processors, 64 GB (8x8 GB) DDR3 1333 MHz memory, and a
Mellanox InfiniBand FDR interconnect. The software stack featured RedHat Enterprise
Linux Server 6.1 64-bit, Intel Compilers 12.1, Intel MPI 4.0.3 and Intel MKL 10.3.7.256.

6.1.2 Code Version


Rather than using the original HPL source code (available for download from the Netlib
Repository), it is strongly suggested to opt for the Intel Optimized MP LINPACK
Benchmark for Clusters, which is provided for free as part of the Intel MKL library. This
LINPACK version is fully compliant with TOP500 submissions.

This version brings several improvements to the original version, including in particular:
The possibility to run the benchmark in hybrid mode (MPI + OpenMP).
The "As You Go" feature which reports information on achieved performance
level throughout the whole run. The feature evaluates the intrinsic quality of an
execution configuration without the need to wait for the end of the execution.

6.1.3 Test Case Configuration


In order to reach the optimal level of performance, it is crucial to select a proper
LINPACK configuration.

The main execution parameters are the following:


1. Matrix Size (N)
The matrix size constitutes the size of the problem to be solved (number of
elements). It has a direct impact on the memory usage on the computation
nodes. Each matrix element is a variable of type double.
2. Block Size (NB)
3. Grid Size & Shape (PxQ)
The following criteria must be taken into account when defining a LINPACK configuration:

Table 6-1 LINPACK Job Parameters

Parameter   Constraints
N           Matrix size must be sized to fit the available memory on the computation nodes:
            (N^2 * 8) / number of nodes < available memory per node
            Note: available memory per node must take operating system consumption into account
N / NB      The number of blocks per dimension must be an integer: N modulo NB = 0
PxQ         The number of grid elements must be equal to the number of MPI processes
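
As an illustrative sizing exercise (the node count and memory figures are hypothetical):
for 32 nodes with 64 GB each, reserving roughly 4 GB per node for the operating system
leaves about 60 GB per node, or 1.92 x 10^12 bytes in total. The constraint
N^2 * 8 < 1.92 x 10^12 gives N < about 489,897; rounding down to a multiple of NB = 168
yields N = 489,888 (= 168 * 2916).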


It is crucial to determine an optimal set of parameters in order to reach the best balance
between:
Computation to communication ratio.
Load unbalance between the computation cores.
The optimal configuration will basically be established by using a guess-and-check
methodology (in which the "As You Go" feature is extremely useful).
The following hints might help though:
The matrix size (N) must be the largest possible with respect to computation
nodes memory size.
The optimal block size (NB) is said to be 160 or 168 when running with Intel
MKL.
A slightly rectangular shape (P = Q x 0.6) might prove to be optimal, but this
highly depends on the platform architecture (including interconnect).
Other input settings are considered as having a very limited impact on the overall
performance. The following set of parameters can be taken as-is:
16.0 threshold
1 # of panel fact
0 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
2 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
0 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
256 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
0 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)

Compilation
Most of the LINPACK performance relies on the selected mathematical library. As such,
compilation options do not play a significant role in the performance field.

Suggested compilation options are:


-O3 -xavx
In terms of mathematical library choice, the Intel MKL is said to have the best
performance at this time; the GotoBLAS (from Texas Advanced Computing Center) has
been reported as faster in the past.

CPU Speed
CPU Speed is the internal system mechanism that allows the Turbo Mode capability to
be exploited. It needs to be properly configured to set the CPU frequency to the
maximum allowed and prevent this frequency from being reduced.

The CPU Speed configuration relies on the following file:


/etc/sysconfig/cpuspeed


The recommended CPU Speed configuration for a Linpack execution involves:


setting the governor to performance
setting both minimum and maximum frequency to the highest available processor
frequency specified in the following file:
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
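
A hypothetical /etc/sysconfig/cpuspeed fragment implementing this recommendation
(the frequency value is illustrative only and must be taken from
scaling_available_frequencies on the actual system):

GOVERNOR=performance
MIN_SPEED=2601000
MAX_SPEED=2601000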

Pure-MPI versus Hybrid Execution Configuration


The pure MPI execution configuration performs well for single-node runs. However, for
multiple-node runs - and especially full cluster runs - the hybrid execution configuration
makes it possible:
to reach an acceptable performance level with a limited matrix size (keeping a
moderate run duration). In pure MPI mode, memory consumption needs to be
pushed to its highest limit in order to obtain an acceptable Linpack efficiency. The
large problem size, in turn, would lead to unacceptably large run times.
to achieve a level of efficiency that would be clearly out of reach in pure MPI
mode, and which corresponds to a very limited degradation over the single-node
compute efficiency.
For hybrid runs, the natural execution configuration on Sandy Bridge nodes is 8
processes/node x 2 threads/process, though the performance provided by the alternate
configuration of 4 processes/node x 4 threads/process should also be evaluated.
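
For example, a hypothetical Intel MPI launch for the 8 x 2 configuration on 4 nodes
(option names and binary name assumed):

export OMP_NUM_THREADS=2
mpirun -ppn 8 -np 32 ./xhpl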

Measured Performance
Figure 6-1 presents the measured Linpack compute rate and the peak performance for
different numbers of nodes (1 to 32) using pure-MPI interprocess communication.


Figure 6-1 Comparing actual Linpack and system peak performance (GFlops) for different numbers of nodes

(The data underlying the bar chart is reproduced below.)

# Nodes    Performance (GFlops)    Peak (GFlops)
1          294.9                   332.8
2          584.3                   665.6
4          1167.0                  1331.2
8          2305.0                  2662.4
16         4584.0                  5324.8
32         8966.0                  10649.6

6.1.4 Petaflop HPL Performance


A large iDataPlex dx360 M4 system has been deployed at LRZ in the first half of 2012.
Based on the HPL performance results, it is 4th in the June 2012 list of TOP500
Supercomputers in the world. The LRZ supercomputer comprises an 18-island parallel
compute system. Each island has 512 dx360 M4 compute nodes with a Mellanox
InfiniBand FDR10 interconnect. The switch fabric is architected to have 1:1 blocking
within an island and 1:4 blocking between any two islands. Each dx360 M4 node has two
E5-2680 (2.7 GHz) processors and 32 GB of RAM. HPL has been run on up to 18
islands of the LRZ installation. The results achieve over 80% of peak system
performance as shown in Figure 6-2 and Table 6-2. Note that the scale has changed
from gigaflops in Figure 6-1 to petaflops in Figure 6-2.


Figure 6-2 Comparing measured Linpack and system peak performance (PFlops) for large numbers of nodes

(Bar chart, "Petaflop Linpack performance for up to 9K nodes": actual versus peak
performance in PFlops for 4096, 7168 and 9216 nodes; the underlying data appears in
Table 6-2.)

Table 6-2 HPL performance on up to 18 iDataPlex dx360 M4 islands


No. of islands 8 14 18
No. of nodes (dx360 M4) 4096 7168 9216
Problem Size (N) 3624960 4464640 5386240
Time (s) 27122 28638 40350
Performance (Peta Flops) 1.17 2.07 2.58
With 512 nodes per island and 16 cores (2 sockets) per node, this shows the scalability
for nearly 150K cores.

6.2 STREAM
STREAM is a simple synthetic benchmark program that measures memory bandwidth in
MB/s and the computation rate for simple vector kernels. It was developed by John
McCalpin while he was a professor at the University of Delaware. The benchmark is
specifically designed to work with large data sets, larger than the Last Level Cache
(LLC) on the target system, so that the results are indicative of very large vector-
oriented applications. It has emerged as a de-facto industry standard benchmark. It is
available in both FORTRAN and C, in single processor and multi-processor versions,
OpenMP- and MPI-parallel.

The STREAM results are presented below by running the application in serial mode on a
single processor and in an OpenMP mode.

The benchmark measures the 4 following loops:

COPY: a(i) = b(i)

with memory access to 2 double precision words (16 bytes) and no floating point
operations per iteration

SCALE: a(i) = q * b(i)

with memory access to 2 double precision words (16 bytes) and one floating point
operation per iteration


SUM: a(i) = b(i) + c(i)

with memory access to 3 double precision words (24 bytes) and one floating point
operation per iteration

TRIAD: a(i) = b(i) + q * c(i)

with memory access to 3 double precision words (24 bytes) and two floating point
operations per iteration
The general rule for STREAM is that each array must be at
least 4x the size of the sum of all the last-level caches used
in the run, or 1 Million elements -- whichever is larger.

To avoid measurements from cache, a minimum dimension of the array is required.
Unfortunately the cited rule is ambiguous since the units are not given, but McCalpin
explains the meaning in an example. When the method is applied to the Sandy Bridge
CPU:
Each Sandy Bridge chip has a shared L3 cache of 20 MB, i.e.
the two chips on a standard server have 40 MB or the equivalent
of 5 million double precision words. The translation of the
example given by McCalpin means that the size of the arrays
must be at least four times 5 million double precision words or
160MB.

To improve performance one can test different offsets, which have not been considered
here.

6.2.1 Single Core Version GCC compiler


GCC version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) is used to build the
single core version of Stream. The makefile is:
CC = gcc
CFLAGS = -O3

all: stream.exe

stream.exe: stream.c
$(CC) $(CFLAGS) stream.c -o stream.exe

clean:
rm -f stream.exe *.o

The array dimensions have been set, as per McCalpin's rule, to the minimum required
size of 20 million elements, or 160MB per array.
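
That is, the array size macro in stream.c is set as follows (matching the
"Array size = 20000000" line in the output below):

# define N 20000000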

Running the produced executable yields the following output:


boemelb@n005:~/stream/orig_version> ./stream.exe
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
Each test is run 10 times, but only
the *best* time for each is used.


-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 17727 microseconds.
(= 17727 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11495.9682 0.0318 0.0278 0.0395
Scale: 11436.7041 0.0316 0.0280 0.0387
Add: 11238.3175 0.0472 0.0427 0.0556
Triad: 11429.6594 0.0478 0.0420 0.0552
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
So the measured bandwidth is greater than 11 GB/s for all tests.

Figure 6-3 Measured Bandwidth (MB/s) for single-core STREAM tests using GCC

(Bar chart: copy, scale, add and triad all measure between roughly 11,200 and
11,500 MB/s, as in the output above.)

6.2.2 Single Core Version - Intel compiler


There's a known issue with the stream performance with the Intel compiler (see e.g.
HPCC-stream performance loss with the 11.0 and 12.0 compilers).

Following the recommendations given there, the following compile options are added:
-opt-streaming-stores always
-ffreestanding


and using the Intel compiler version 12.1.1.256:


# icc -V
Intel(R) C Intel(R) 64 Compiler XE for applications running on Intel(R) 64, Version
12.1.1.256 Build 20111011
The following Makefile was used to create binaries from C and Fortran source code:
FC = ifort
CC = icc

FFLAGS = -O3 -opt-streaming-stores always


CFLAGS = -O2

all: stream.exe

stream.exe: stream.f mysecond.o


$(CC) $(CFLAGS) -c mysecond.c
$(FC) $(FFLAGS) -c stream.f
$(FC) $(FFLAGS) stream.o mysecond.o -o stream.exe

clean:
rm -f stream.exe *.o
Running the resulting executable produces:
-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
...
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11572.7884 0.0277 0.0277 0.0277
Scale: 6919.6523 0.0463 0.0462 0.0463
Add: 9040.9096 0.0531 0.0531 0.0531
Triad: 9121.3157 0.0527 0.0526 0.0527
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
The Intel Fortran compiler yields a result very similar to the Intel C compiler:
...
----------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11581.2764 0.0277 0.0276 0.0277
Scale: 6921.3294 0.0463 0.0462 0.0463
Add: 9045.3372 0.0531 0.0531 0.0531
Triad: 9122.6796 0.0527 0.0526 0.0527
----------------------------------------------------
Solution Validates!
----------------------------------------------------


Figure 6-4 Measured Bandwidth (MB/s) for single-core STREAM tests using Intel icc

[Bar chart "single core, Intel compiler": copy ~11.6 GB/s; scale ~6.9 GB/s; add and triad ~9.1 GB/s.]

Comparing the two sets of results shows that the binaries produced by the GCC
compiler are faster overall (only for copy is icc marginally ahead):

Figure 6-5 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc and GCC

[Grouped bar chart "single core" (MB/s): gcc vs icc for copy, scale, add and triad.]

From a private communication with Andrey Semin of Intel, changing the icc compiler
options to disable streaming stores alters this picture.


Using the options
-diag-disable 161 -O1 -ipo -opt-streaming-stores never -static

produces the result:


-------------------------------------------------------------
STREAM version $Revision: 5.9 $
-------------------------------------------------------------
This system uses 8 bytes per DOUBLE PRECISION word.
-------------------------------------------------------------
Array size = 20000000, Offset = 0
Total memory required = 457.8 MB.
Each test is run 20 times, but only
the *best* time for each is used.
-------------------------------------------------------------
Printing one line per active thread....
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 22251 microseconds.
(= 22251 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 11507.8948 0.0278 0.0278 0.0279
Scale: 11522.8132 0.0278 0.0278 0.0278
Add: 12349.4306 0.0391 0.0389 0.0415
Triad: 12393.5235 0.0388 0.0387 0.0390
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
This result with the Intel compiler is faster than the gcc result.


Figure 6-6 Measured Bandwidth (MB/s) for single-core STREAM tests comparing Intel icc without
streaming stores and GCC

[Grouped bar chart "single core" (MB/s): gcc, icc and icc* for copy, scale, add and triad,]

where icc* denotes the results without streaming stores.

6.2.3 Frequency Dependency


Both of the examples above were run at the nominal frequency (2.7 GHz) of the CPU
without turbo mode. The cpufreq-info tool outputs:
cpufrequtils 004: cpufreq-info (C) Dominik Brodowski 2004-2006
Report errors and bugs to http://bugs.opensuse.org, please.
analyzing CPU 0:
driver: acpi-cpufreq
CPUs which need to switch frequency at the same time: 0
hardware limits: 1.20 GHz - 2.70 GHz
available frequency steps: 2.70 GHz, 2.60 GHz, 2.50 GHz, 2.40 GHz, 2.30 GHz, 2.20
GHz, 2.10 GHz, 2.00 GHz, 1.90 GHz, 1.80 GHz, 1.70 GHz, 1.60 GHz, 1.50 GHz, 1.40
GHz, 1.30 GHz, 1.20 GHz
available cpufreq governors: conservative, userspace, powersave, ondemand,
performance
current policy: frequency should be within 1.20 GHz and 2.70 GHz.
The governor "userspace" may decide which speed to use
within this range.
current CPU frequency is 2.70 GHz.
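The frequency can be switched with cpufreq-set from the same cpufrequtils package (a
sketch, assuming the userspace governor shown above); for example, to set all 16 cores
to 2.5 GHz:

for c in $(seq 0 15); do cpufreq-set -c $c -f 2.5GHz; done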
Switching the frequency to 2.5 GHz (and using the gcc-compiled binary) degrades
performance by about 6%, while the frequency is 8% lower. A closer look at how
bandwidth scales with frequency therefore seems warranted.
...
-------------------------------------------------------------

Function Rate (MB/s) Avg time Min time Max time


Copy: 10854.8240 0.0381 0.0295 0.0405


Scale: 10730.3791 0.0383 0.0298 0.0398
Add: 10588.0561 0.0559 0.0453 0.0576
Triad: 10757.0391 0.0553 0.0446 0.0567
-------------------------------------------------------------
Solution Validates
-------------------------------------------------------------
Running at all 16 available frequencies (2.7 GHz, 2.6 GHz, ... 1.2 GHz) shows that the
bandwidth depends strongly on the core frequency; results are given here for five
representative frequencies:

Table 6-3 Single core memory bandwidth as a function of core frequency


freq [GHz]   copy [MB/s]   scale [MB/s]   add [MB/s]   triad [MB/s]
1.2          5984          5824           6118         6171
1.6          7668          6889           6490         6560
2.0          9192          8975           9015         9124
2.4          10527         8984           8281         8390
2.7          11478         11426          11234        11422

Figure 6-7 Single core memory bandwidth as a function of core frequency

[Chart "single core, gcc compiler": copy, scale, add and triad bandwidth (MB/s) at 1.2, 1.6, 2.0, 2.4 and 2.7 GHz.]

6.2.4 Saturating Memory Bandwidth


There are two approaches to saturating the available memory bandwidth:
- run several instances of stream (a throughput benchmark)
- run the OpenMP version


6.2.4.1 Throughput Benchmark Results with GCC


To avoid processes migrating from core to core, which produces high variance, each
process is bound to one of the 16 physical cores (Hyper-Threading is not used):
#!/bin/bash

for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
do
taskset -c $i ./stream.exe > thruput.$i &
done
The following table gives the average, minimum and maximum over all 16 processes:

Table 6-4 Memory Bandwidth (MB/s) over 16 cores throughput benchmark


average min max sum
[MB/s] [MB/s] [MB/s] [MB/s]
copy 3442 3386 3498 55077
scale 3370 3315 3430 53922
add 3749 3688 3861 59980
triad 4109 3775 4823 65753
In aggregate, the 16 tasks achieve roughly 54-66 GB/s, i.e. around 60 GB/s of throughput.

Figure 6-8 Memory Bandwidth (MB/s) over 16 cores GCC throughput benchmark

[Bar chart "throughput 16 x 1way": aggregate bandwidth (MB/s) for copy, scale, add and triad with gcc.]

6.2.4.2 OpenMP Runs with Intel icc


The executable is built with the Intel compiler using
icc -O3 -xAVX -fast -opt-streaming-stores always -openmp -i-static stream.c


For 16 cores, binding is set with


export OMP_NUM_THREADS=16
export KMP_AFFINITY=explicit,proclist=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
With the standard dimension of 20 million elements, the rates are:

Table 6-5 Memory bandwidth (MB/s) over 16 cores OpenMP benchmark with icc 20M
Function Rate (MB/s) Avg time Min time Max time
copy 74055.2461 0.0043 0.0043 0.0044
scale 73511.7362 0.0044 0.0044 0.0044
add 76419.2796 0.0063 0.0063 0.0064
triad 75852.0805 0.0064 0.0063 0.0064

With a dimension of 200 million:


Table 6-6 Memory bandwidth (MB/s) over 16 cores OpenMP benchmark with icc 200M
Function Rate (MB/s) Avg time Min time Max time
Copy 75186.1075 0.0427 0.0426 0.0428
Scale 74108.4026 0.0434 0.0432 0.0436
Add 77088.4819 0.0625 0.0623 0.0635
Triad 76260.6505 0.0632 0.0629 0.0634

Figure 6-9 Memory bandwidth (MB/s) minimum number of sockets 16-way OpenMP benchmark

[Bar chart "OpenMP 16-way": icc bandwidth (MB/s) for copy, scale, add and triad.]

Table 6-7 Memory bandwidth (MB/s) minimum number of sockets OpenMP benchmark with icc

Function \ threads   16     8      4      2      1
Copy                 75193  36546  26072  13628  6943
Scale                74052  35794  25675  13575  6962
Add                  77112  37716  32867  17701  9070
Triad                76229  37136  31558  17523  9034


Binding half of the threads to each socket improves the picture, especially for 8 cores:

Table 6-8 Memory bandwidth (MB/s) split threads between two sockets OpenMP benchmark with icc

Function \ threads   8      4      2
Copy                 55375  29288  13662
Scale                54543  29220  13649
Add                  69763  37665  17752
Triad                67197  37245  17712
Figure 6-10 Memory bandwidth (MB/s) performance of 8 threads on 1 or 2 sockets

[Bar chart "OpenMP 8-way": bandwidth (MB/s) for copy, scale, add and triad with 8 threads on 1 socket vs 2 sockets.]

So with STREAM:
- 16 cores get ~76 GB/s, or 4.75 GB/s per core
- 8 cores on a single socket get ~37 GB/s, or 4.63 GB/s per core
- 8 cores using both sockets get ~62 GB/s, or 7.7 GB/s per core

Figure 6-11 shows the saturation when splitting different numbers of OpenMP threads
between both sockets. It reveals that, in general, going from four to eight threads using
both sockets scales nicely whereas the memory bandwidth becomes saturated going
from 8 to 16 threads.


Figure 6-11 Memory bandwidth (MB/s) split threads between two sockets

[Bar chart "OpenMP 2 sockets, 1 ... 16 threads": bandwidth (MB/s) for copy, scale, add and triad with 1, 2, 4, 8 and 16 threads split across both sockets.]

6.2.4.3 OpenMP Runs with GCC


The corresponding GCC numbers (version 4.3.4) are very different.

For 16 cores, binding is set with
export GOMP_CPU_AFFINITY="0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15"
For 8 cores, binding is set with
export OMP_NUM_THREADS=8
export GOMP_CPU_AFFINITY="0 1 2 3 8 9 10 11"
Table 6-9 Memory bandwidth (MB/s) minimum number of sockets OpenMP benchmark with gcc

Function \ threads   16     8      4      2      1
Copy                 53102  26539  27309  21356  11723
Scale                53411  26756  27469  20827  11626
Add                  60062  30060  30772  23807  12397
Triad                60336  30416  30915  24298  12750

Table 6-10 Memory bandwidth (MB/s) split threads between two sockets OpenMP benchmark with gcc

Function \ threads   8      4      2
Copy                 54450  43233  24226
Scale                54803  42520  24139
Add                  61337  48836  26044
Triad                61721  49789  26501


6.2.4.4 Saturating Memory Bandwidth Conclusions


The relatively new OpenMP support in gcc (and perhaps the particular gcc version
used) may explain why its OpenMP performance trails the Intel implementation here.

In summary: memory performance is very dependent on the chosen frequency.

For single-core performance, gcc is clearly ahead of the Intel executable, but for the
OpenMP version this picture reverses. Furthermore, 8 cores distributed over both
sockets can nearly exhaust the memory bandwidth. Binding is absolutely necessary.

6.2.5 Beyond Stream


In the previous section, the simple case of stride-1 loops, where the Sandy Bridge chip
really excels, was discussed. In this section, strided operations, loops with a decrement
(running backwards through memory) and indexed operations (gathered loads and
scattered stores) are investigated.

6.2.5.1 Stride
Bandwidth for 16 threads as a function of the array stride (the measured kernel is sketched below):
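A sketch of the strided triad kernel, with illustrative names (the actual benchmark
source may differ):

for (j = 0; j < N; j += stride)
    a[j] = b[j] + scalar * c[j];   /* only every stride-th element is touched */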

Table 6-11 Strided memory bandwidth (MB/s) 16 threads


stride copy scale add triad
2 26195 25832 29003 29290
3 17475 17210 19268 19519
4 13053 12914 14470 14666
5 10505 10301 11558 11634
6 8746 8551 9623 9735
7 7464 7341 8286 8356
8 6558 6431 7246 7329
9 5996 5876 6619 6680
10 5573 5483 6188 6214
20 3641 3563 4095 4071
40 3259 3180 3659 3683
8 threads:

Table 6-12 Strided memory bandwidth (MB/s) 8 threads


stride copy scale add triad
2 13215 13167 14753 14808
3 8837 8784 9843 9921
4 6598 6549 7339 7387
5 5270 5238 5890 5926
6 4399 4381 4897 4916
7 3763 3737 4191 4217
8 3289 3271 3670 3691
9 3027 3000 3370 3380
10 2827 2781 3119 3121
20 1869 1830 2089 2077
40 1683 1650 1891 1884
4 threads:


Table 6-13 Strided memory bandwidth (MB/s) 4 threads


stride copy scale add triad
2 13557 13586 15042 15146
3 9033 8976 9989 10021
4 6757 6720 7445 7475
5 5399 5362 5943 5979
6 4475 4446 4947 4963
7 3842 3819 4257 4289
8 3350 3333 3720 3735
9 3098 3062 3421 3426
10 2874 2835 3170 3152
20 1878 1831 2069 2044
40 1637 1604 1790 1785
2 threads:

Table 6-14 Strided memory bandwidth (MB/s) 2 threads


stride copy scale add triad
2 10472 10715 11495 11932
3 6902 6979 7329 7421
4 5136 5142 5387 5400
5 4064 4092 4324 4307
6 3384 3389 3611 3591
7 2891 2884 3137 3116
8 2511 2522 2741 2745
9 2277 2266 2496 2486
10 2092 2077 2311 2280
20 1383 1377 1480 1482
40 1199 1189 1239 1259
1 thread:

Table 6-15 Strided memory bandwidth (MB/s) 1 thread


stride copy scale add triad
2 5736 5908 6015 6106
3 3737 3769 3845 3867
4 2795 2800 2837 2828
5 2203 2213 2275 2257
6 1830 1836 1909 1888
7 1560 1561 1659 1641
8 1351 1352 1448 1444
9 1212 1207 1320 1312
10 1111 1191 1227 1207
20 736 740 793 801
40 656 667 677 691
Figure 6-12 shows, for the triad case, the decrease in effective memory bandwidth as the
stride increases.


Figure 6-12 Memory bandwidth (MB/s) vs stride length for 1 to 16 threads

[Chart "strided triad w. OpenMP, 1-16 threads": triad bandwidth (MB/s) vs stride (2 to 40) for 1, 2, 4, 8 and 16 threads.]

6.2.5.2 Reverse Order


Stride through memory in reverse order:

Table 6-16 Reverse order (stride=-1) memory bandwidth (MB/s) 1 to 16 threads


Function 16 8 4 2 1
Copy 25653 26703 27324 20812 11562
Scale 25657 26676 27271 20930 11460
Add 28984 30156 30756 23944 12370
Triad 28993 30128 30816 24238 12689
This contrasts with the stride-1 results in Table 6-17:

Table 6-17 Stride 1 memory bandwidth (MB/s) 1 to 16 threads


Function 16 8 4 2 1
Copy 53102 26539 27309 21356 11723
Scale 53411 26756 27469 20827 11626
Add 60062 30060 30772 23807 12397
Triad 60336 30416 30915 24298 12750

6.2.5.3 Indexed
There are two cases: loads via an index array and stores via an index array.

To achieve this, the stream code is modified to use an additional index array, where
index[i] = (ix + iy * i) % N


and the offset variable ix and the stride variable iy are read in at runtime.
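As a sketch, the index array is initialized once with the runtime values of ix and iy:

for (i = 0; i < N; i++)
    index[i] = (ix + iy * i) % N;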

All results have been generated from binaries created with the Intel compiler.

The results showed that the initial offset ix does not change the performance, so ix = 0 is
used for all runs.

load case:
These runs measure the performance of indexed loads - for example, the triad case is
modified to become:
for (j=0; j<N; j++)
a[j] = b[index[j]]+scalar*c[index[j]];
OMP_NUM_THREADS=1

Table 6-18 Strided memory bandwidth (MB/s) with indexed loads 1 thread
stride copy scale add triad
1 5773 5776 10327 10137
2 5000 5011 6882 6735
3 4291 4284 5185 5049
4 3693 3694 4084 4061
5 3158 3158 3395 3408
6 2737 2738 2942 2955
7 2362 2363 2587 2581
8 2062 2062 2293 2280
9 1899 1898 2114 2091
10 1777 1777 1958 1937
OMP_NUM_THREADS=16

Table 6-19 Strided memory bandwidth (MB/s) with indexed loads 16 threads
stride copy scale add triad
1 52100 51908 49500 49438
2 44120 43960 36057 36097
3 31340 31368 28352 28313
4 33826 33868 31453 31356
5 23580 23583 19623 19613
6 16698 16712 14117 14104
7 16999 17006 13932 13933
8 26357 26378 24269 24554
9 12432 12431 10339 10345
10 11300 11302 9161 9193

store case:
These runs measure the performance of indexed stores - for example, the triad case becomes:
for (j=0; j<N; j++)
a[index[j]] = b[j]+scalar*c[j];
OMP_NUM_THREADS=1

Table 6-20 Strided memory bandwidth (MB/s) with indexed stores 1 thread
stride copy scale add triad
1 9340 9368 10148 10070
2 6107 5992 7486 7445


3 4709 4670 5990 6115
4 3759 3737 4911 4887
5 3139 3142 4213 4197
6 2680 2690 3662 3660
7 2360 2369 3263 3260
8 2091 2091 2920 2915
9 1923 1925 2715 2710
10 1759 1761 2503 2501
OMP_NUM_THREADS=16

Table 6-21 Strided memory bandwidth (MB/s) with indexed stores 16 threads
stride copy scale add triad
1 41807 41915 49940 49967
2 24076 24023 34037 34050
3 19998 19963 27335 27359
4 22477 22465 31790 31830
5 12538 12534 17957 17953
6 9908 9904 14470 14477
7 9050 9056 13174 13177
8 16971 17608 24840 24853
9 7412 7414 10865 10867
10 6691 6693 9838 9836

Observations (recall index[i] = (ix + iy * i) % N):
For ix = 0, iy = 1 the memory access pattern is the same as for standard stream, yet
the triad store case reaches only 49967 / 76229 = 65.5% of the standard performance.
Surprisingly, with OMP_NUM_THREADS=1 the indexed version performs slightly better
(comparing against the Intel compiler results).

6.3 HPCC
The HPC Challenge Benchmark is composed of seven individual benchmarks combined
into a single executable program, limiting the tuning possibilities for any specific part. It is
made up of the following tests:
1. HPL - High Performance Linpack - measures the floating-point execution rate by
   solving a linear system of equations.
2. PTRANS - Parallel Matrix Transpose - measures the network capacity by
   requiring paired communications across all processors in parallel.
3. DGEMM - (double-precision) General Matrix Multiply - measures the floating-point
   execution rate using the corresponding matrix multiplication kernel from the BLAS
   library.
4. STREAM - measures the sustainable memory bandwidth and corresponding
   computation rate for simple vector kernels.
5. RandomAccess - measures the integer operation rate on random memory
   locations.
6. FFT - measures the floating-point execution rate by executing a one-dimensional
   complex Discrete Fourier Transform (DFT), which may be implemented using an
   FFT library.
7. Communication - a combination of tests that measure network bandwidth and
   latency by simulating various parallel communication patterns.


In the case of DGEMM, RandomAccess and FFT, three types of jobs are run:
1. single - one single thread
2. star - also known as embarrassingly parallel: parallel execution without inter-
   processor communication (multiple serial runs)
3. MPI - parallel execution with inter-processor communication

6.3.1 Hardware and Software Stack Used


The base system used is the IBM System x iDataPlex dx360 M4, with FDR10 as the
network fabric, 16 Sandy Bridge cores at 2.7 GHz and 32 GB of DDR3-1600 memory
per node, MVAPICH2-1.8apl1, Intel compiler version 12.1.0 and Intel MKL 10.3.7.256.

6.3.2 Build Options


HPCC was compiled with MKL but without a standalone FFTW library (which leads to
higher memory consumption), using the following options:
-O3 -DMKL_INT=long -DLONG_IS_64BITS -DRA_SANDIA_OPT2
-DUSE_MULTIPLE_RECV -DUSING_FFTW
The HPCC Stream component was compiled with the following additional flags:
-xS -ansi-alias -ip -opt-streaming-stores always

6.3.3 Runtime Configuration


These are the differences from the default hpccinf.txt:
NB depends on the math library used; for MKL, the best value is NB = 160.
N, P and Q were chosen by the following formula:
N = sqrt(0.7*m*c*1024*1024*1024/8)
P and Q are chosen by minimizing P+Q while satisfying P*Q=c and P<=Q,
where m is the memory per node in GB (32 GB in this case) and c is the total number of
available cores.
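As an illustration (a sketch, not part of HPCC; compile with -lm), the following C
program applies the formula and the P/Q rule. It reproduces the P and Q values of
Table 6-22 and closely approximates the N values; rounding N down to a multiple of
NB, done here, is a common HPL convention not stated in the text.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double m = 32.0;                       /* memory per node in GB (formula above) */
    int nb = 160;                          /* MKL block size                        */
    int cores[] = {16, 32, 64, 128, 256, 512, 1024};
    for (int i = 0; i < 7; i++) {
        int c = cores[i];
        long n = (long)sqrt(0.7 * m * c * 1024.0 * 1024.0 * 1024.0 / 8.0);
        n -= n % nb;                       /* round N down to a multiple of NB      */
        int p = (int)sqrt((double)c);      /* P*Q = c, P <= Q, P+Q minimal          */
        while (c % p) p--;
        printf("cores=%5d  N=%8ld  P=%3d  Q=%3d\n", c, n, p, c / p);
    }
    return 0;
}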

Table 6-22 Best values of HPL N, P, Q for different numbers of total available cores

Number of cores   N         P    Q
16                219326    4    4
32                310173    4    8
64                438651    8    8
128               620346    8    16
256               877302    16   16
512               1240692   16   32
1024              1754604   32   32


6.3.4 Results
Table 6-23 HPCC performance on 1 to 16 nodes (16 cores per node)

Cores                  16      32      64      128     256
HPL [TFLOP/s]          0.299   0.591   1.166   2.312   4.480
PTRANS [GB/s]          5.173   9.745   12.093  23.296  42.947

DGEMM [GFLOP/s]
  Single               20.238  20.16   20.232  20.269  20.207
  Star                 19.783  18.64   19.275  19.362  19.605

Single Stream [GB/s]
  Copy                 7.584   7.606   7.603   7.539   7.535
  Scale                7.600   7.618   7.615   7.549   7.546
  Add                  9.820   9.843   9.834   9.745   9.739
  Triad                9.789   9.811   9.796   9.710   9.707

Star Stream [GB/s]
  Copy                 4.680   4.693   4.841   4.746   4.713
  Scale                4.608   4.618   4.793   4.703   4.632
  Add                  4.882   4.889   5.375   5.134   4.883
  Triad                4.817   4.818   5.369   5.098   4.819

RandomAccess [GUP/s]
  Single               0.035   0.035   0.035   0.035   0.035
  Star                 0.018   0.018   0.018   0.018   0.018
  MPI                  0.177   0.301   0.495   0.763   1.262

FFT [GFLOP/s]
  Single               3.500   3.552   3.534   3.530   3.561
  Star                 2.842   2.630   2.464   2.521   2.593
  MPI                  19.58   24.833  41.435  74.95   147.63

PingPong Latency [us]
  Min                  0.367   0.517   0.775   0.951   0.876
  Avg                  0.866   1.233   1.534   1.967   2.134
  Max                  2.517   2.293   2.724   3.838   2.962

PingPong Bandwidth [GB/s]
  Min                  3.696   3.41    2.895   2.886   3.081
  Avg                  5.545   4.856   4.589   4.578   4.549
  Max                  7.753   7.381   7.067   7.106   7.767

Ring Bandwidth [GB/s] (NOR: Natural Order, ROR: Random Order)
  NOR                  1.332   1.325   1.316   1.274   1.281
  ROR                  1.036   0.574   0.322   0.241   0.202

Ring Latency [us] (NOR: Natural Order, ROR: Random Order)
  NOR                  1.230   1.295   2.027   2.113   1.999
  ROR                  1.316   1.685   3.247   6.896   10.348

6.4 NAS Parallel Benchmarks Class D


The NAS Parallel Benchmark suite consists of five kernels and three pseudo-applications
derived from computational fluid dynamics (CFD) simulations. There are three versions
of the benchmark suite, based on the communication method used: MPI, OpenMP, and
hybrid (MPI/OpenMP); some kernels do not support all methods. Each part mimics the
computation and data-access patterns found in CFD applications, and no modifications
or external libraries may be used. Problem sizes are predefined by classes.

Kernels:
- CG - Conjugate Gradient: irregular memory access and communication
- MG - Multi-Grid V-Cycle: long- and short-distance communication, memory intensive
- FT - (Fast) Fourier Transform: solves a partial differential equation (PDE) in 3D
  space, all-to-all communication

Pseudo-applications solving nonlinear PDEs with the following algorithms:
- BT - Block Tri-diagonal solver: non-blocking communication
- SP - Scalar Penta-diagonal solver: non-blocking communication
- LU - Lower-Upper Gauss-Seidel solver

6.4.1 Hardware and Building


The base system used is the IBM System x iDataPlex dx360 M4, with FDR10 as the
network fabric, 16 Sandy Bridge cores at 2.7 GHz and 32 GB of DDR3-1600 memory
per node, MVAPICH2-1.8apl1, Intel compiler version 12.1.0 and Intel MKL 10.3.7.256.

Compilation was done directly on a Sandy Bridge node with the following flags:
-O3 -xAVX -xhost
For class E runs with small process counts, the following flags were added:
-mcmodel=medium -shared-intel


6.4.2 Results
Results are total giga-operations per second with the class D problem size
(a dash marks configurations that were not run):

Table 6-24 NAS PB Class D performance on 4 to 32 nodes

Number of cores   CG       MG      FT      LU      BT      SP
64                19.573   99.798  -       135.32  103.09  49.270
128               34.372   184.10  98.066  263.89  242.79  93.878
256               66.608   387.80  205.55  496.17  529.1   256.3
484               -        -       -       -       1003.4  509.1
512               117.03   829.17  342.83  868.1   -       -

7 AVX and SIMD Programming

7.1 AVX/SSE SIMD Architecture


Intel Advanced Vector eXtensions (AVX) is an extension to the venerable Intel64
architecture that supports Single Instruction Multiple Data (SIMD) operations, available
on Intel architectures starting with Sandy Bridge. The instruction set introduced with
AVX is the natural follow-on to the previous SIMD architecture, SSE4. AVX defines
additional registers and instructions for SIMD operations that accelerate data-intensive
tasks.

The Intel64 SIMD hardware and instruction set support for previous architectures has:
- 8 architected 64-bit (MMX) and 16 architected 128-bit (SSE) registers
- arithmetic, bit-shuffling and logical operations for
  1-, 2-, 4-, and 8-byte integers and
  4- and 8-byte floating point data

The AVX floating point architecture adds the following new capabilities:
- Wider vectors
  The 16 128-bit registers (named XMM0-15) have been extended to 32 bytes (256
  bits). The new architected registers are named YMM0-15. They can hold 8
  single-precision or 4 double-precision floating point values; the low 128 bits of each
  YMMn register serve as the corresponding XMMn register.

  Figure 7-1 Using the low 128 bits of the YMMn registers for XMMn

- Three- and four-operand operations
  Nondestructive source operands for AVX-128 and AVX-256 operations:
  A = B + C can be used instead of A = A + B.
- Increased memory bandwidth
  An additional 128-bit load port gives 2 load ports and 1 store port, which can read
  up to 32 bytes and write 16 bytes per cycle. Unaligned memory access is supported.
- New instructions
  Focusing on SIMD instructions for 32-bit and 64-bit floating point data. AVX can
  work on either 128-bit or 256-bit vector data. AVX-128 instructions replace the
  complete SSE instruction set; instructions using the XMMn registers use the lower
  128 bits and zero out the upper 128 bits.
Vector technology provides a software model that accelerates the performance of various
software applications and extends the instruction set architecture (ISA) of the Intel Core
CPU architecture. The instruction set is executed by separate vector/SIMD-style execution
units with a high degree of data parallelism, performing operations on multiple data
elements with a single instruction.

7.1.1 Note on Terminology


The term vector, as used in this chapter, refers to the spatially parallel processing of
short, fixed-length one-dimensional arrays by an execution unit. This is the classical
SIMD execution of multiple data streams with one instruction. It should not be confused
with the processing of long, variable-length vectors on classic vector machines, or with
the software pipelining done by optimizing compilers to eliminate delays due to
dependent operations in loops. The definition is discussed further in the next section.

7.2 A Short Vector Processing History


The basic concept behind vector processing is to enhance the performance of data-
intensive applications by providing hardware support for operations that can manipulate
an entire vector (or array) of data in a single operation. The number of data elements
operated upon at a time is called the vector length.

Scalar processors perform operations that manipulate single data elements such as
fixed-point or floating-point numbers. For example, scalar processors usually have an
instruction that adds two integers to produce a single-integer result.

Vector processors perform operations on multiple data elements arranged in groups


called vectors (or arrays). For example, a vector add operation to add two vectors
performs a pair-wise addition of each element of one source vector with the
corresponding element of the other source vector. It places the result in the
corresponding element of the destination vector. Typically a single vector operation on
vectors of length n is equivalent to performing n scalar operations.

Figure 7-2 illustrates the difference between scalar and vector operations.

Figure 7-2 Scalar and vector operations


[Diagram: a scalar add produces a single result per instruction (e.g. 4 + 7 = 11), while a vector add sums all corresponding elements of two source vectors in a single instruction.]

Processor designers are continually looking for ways to improve application performance.
The addition of vector operations to a processor's architecture is one method a designer
can use to raise the peak performance of a processor. However, the actual performance
improvement obtained for a specific application depends on how well the application can
exploit vector operations and avoid other system bottlenecks such as memory bandwidth.

The concept of vector processing has existed since the 1950s. Early implementations of
vector processing (known as array processing) were installed in the 1960s. They used
special purpose peripherals attached to general purpose computers. An example is the
IBM 2938 Array Processor, which could be attached to some models of the IBM
System/360. This was followed by the IBM 3838 Array Processor in later years.

By the mid-1970s, vector processing had become an integral part of the main processor in
large supercomputers manufactured by companies such as Cray Research. By the mid-
1980s, vector processing became available as an optional feature on large general-
purpose computers such as the IBM 3090.


In the 1990s, developers of microprocessors used in desktop computers adapted the
concept of vector processing to enhance the capability of their microprocessors when
running desktop multimedia applications. These capabilities were usually referred to as
Single Instruction Multiple Data (SIMD) extensions and operated on short vectors.
Examples of SIMD extensions in widespread use today include:
- Intel Multimedia Extensions (MMX)
- Intel Streaming SIMD Extensions (SSE)
- AMD 3DNow!
- Motorola AltiVec and IBM VMX/AltiVec
- IBM VSX

The SIMD extensions found in desktop microprocessors operate on short vectors of
length 2, 4, 8, or 16. This is in contrast to the classic vector supercomputers that can
often exploit long vectors of length 64 or more.


7.3 Intel SIMD Microarchitecture (Sandy Bridge)


Overview
The Sandy Bridge architecture supports AVX instructions operating on 128-bit and 256-
bit vector data. The architecture adds several new functional units to support AVX-256
operations; in Figure 7-3, the new capabilities and functional units that support AVX are
outlined in red. Zeroing is a new capability that allows 128-bit and 256-bit instructions to
coexist while minimizing the impact on performance.

Figure 7-3 Sandy Bridge block diagram emphasizing SIMD AVX functional units

[Block diagram: instruction fetch and allocate/rename (max 4 per cycle, with zeroing)
feed a scheduler issuing to six ports. Ports 0, 1 and 5 hold the ALUs and the SSE/AVX
arithmetic units (AVX FP multiply, divide and blend on Port 0; AVX FP add on Port 1;
AVX FP shuffle, boolean and blend on Port 5). Ports 2 and 3 each load 16 bytes and
generate store addresses; Port 4 stores 16 bytes. The 32 KB L1 data cache sustains
32 bytes read and 16 bytes written per cycle. VI = Vector Integer.]
Notes for the programmer:
1. The execution pipeline can sustain up to four instructions fetched, dispatched,
   executed and completed in any given cycle. Up to three AVX instructions can be
   issued in a given cycle.
2. The peak dispatching rate is 6 micro-ops per cycle, to increase the likelihood that
   the execution pipeline will not stall because no instructions are available to
   decode.
3. There is no fused multiply-add (FMA). Instead, the maximum performance of 16
   FP ops per cycle is reached by issuing an independent AVX FP multiply and an
   AVX FP add.
4. A 128-bit AVX load can take one cycle. An AVX store can complete in two cycles.
5. 256-bit AVX registers are architected as YMM0-YMM15. The registers can also
   handle 128-bit AVX vectors, using XMM0-XMM15.
6. AVX shuffles are different from AVX blends. AVX shuffles are byte-permuting
   operations; AVX blends mix bytes from two vectors but preserve order. AVX
   shuffles are only executed by Port 5, so minimize the number of shuffles needed.
7. Integer SSE instructions are only supported as 128-bit AVX instructions.
8. There is a 1 cycle penalty to move data from the INT stack to the FP stack.
9. AVX supports unaligned memory accesses, but performance is better when
   accessing 32-byte aligned data.
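As a small illustration of note 3 above (a sketch, not from the original text), pairing an
independent AVX multiply and AVX add lets both issue in the same cycle:

#include <immintrin.h>

/* 8 SP multiplies (AVX FP MUL port) plus 8 SP adds (AVX FP ADD port)
   issued together reach the peak of 16 FP operations per cycle. */
static inline __m256 mul_and_add(__m256 x, __m256 y,
                                 __m256 u, __m256 v, __m256 *sum)
{
    *sum = _mm256_add_ps(u, v);   /* independent add  */
    return _mm256_mul_ps(x, y);   /* independent mul  */
}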

7.4 Vectorization Overview


So, repeating the point of section 7.2, the reason to care about vector technology is
performance. Vector technology can provide dramatic performance gains, up to 8 times
the best scalar performance for some cases.

So how does SIMD code differ from scalar code? Compare the code fragments from
examples 7-1 and 7-2.

Example 7-1 Scalar vector addition


float *a, *b, *c;
int i, n;
...
for (i = 0; i < n; i++)
{
    a[i] = b[i] + c[i]; // scalar version
}
Example 7-2 Vectorization of scalar addition

#include <immintrin.h>
float *a, *b, *c;
__m256 va, vb, vc;
int i, n;
...
for (i = 0; i < n; i += 8)    // assumes n is a multiple of 8
{
    vb = _mm256_loadu_ps(&b[i]);
    vc = _mm256_loadu_ps(&c[i]);
    va = _mm256_add_ps(vb, vc);
    _mm256_storeu_ps(&a[i], va);
}
In example 7-2, operations on the 32-bit scalar data type have been replaced by
operations on the vector data type __m256. The vector data has to be explicitly loaded
from and stored back into the scalar arrays. Note that the loop no longer executes n
iterations but n divided by the vector length (8 floats per AVX register). An AVX vector
register holds 256 bits (32 bytes) of data, so the vector addition operation
va = _mm256_add_ps(vb, vc);
executes 8 add operations with a single instruction, as opposed to multiple scalar
instructions. The vectorized version can therefore be up to 8 times faster than the
scalar version.

Intel AVX functionality targets a diverse set of applications in the following areas:
- Video editing / post production
- Audio processing
- Image processing
- Animation
- Bioinformatics
- A broad range of scientific applications in physics and chemistry
- A broad range of engineering applications dedicated to solving the partial
  differential equations of their respective fields

7.5 Auto-vectorization
Translating scalar code into vector intrinsics is beyond the scope of this discussion.
However, it is relatively straightforward to get the Intel compilers to automatically
vectorize code and report on which loops have been vectorized. The recommended
options are:

For C/C++:
icc -O3 -xAVX -vec-report1 -vec-report2

For Fortran:
ifort -O3 -xAVX -vec-report1 -vec-report2

These options would be used in addition to other optimization options, such as -finline or
-opt-streaming-stores.
- -xAVX is the option that explicitly asks the compiler to auto-vectorize loops with
  AVX instructions. The compiler will try auto-vectorization by default at
  optimization levels -O2 and above.
- -vec-report1 reports when the compiler has vectorized a loop
- -vec-report2 provides reasons why the compiler failed to vectorize a loop

For example, the compiler reports on the vectorization status of the inner loop in code
like this:

#define ITER 10
void foo(int size)
{
    int i, j;
    float *x, *y, *a;
    int iter_count=1024;
    ...
    ...
    for (j = 0; j < ITER; j++){
        for (i = 0; i < iter_count; i+=1){
            x[i] = y[i] + a[i+1];
        }
    }
}
After building a program with auto-vectorization, program performance should be tested.
If the performance is not as expected, a programmer can refer to comments in the listing
provided by -vec-report2 to identify reasons for loops that failed to auto-vectorize
and give the programmer direction on how to correct code to auto-vectorize properly.

7.6 Inhibitors of Auto-vectorization


Here are some common conditions which can prevent the compiler from performing auto-
vectorization.

7.6.1 Loop-carried Data Dependencies


In the presence of loop-carried data dependences, where data is read after it is written
in a previous iteration, auto-vectorization cannot be carried out. In the simple code
snippet below, every iteration i of the loop reads c[i-1], which was written in iteration
i-1. If such a loop were transformed using AVX instructions, incorrect values would be
computed in the c array.

for (i = 0; i < N; i++)
    c[i] = c[i-1] + 1;
Certain compiler transformations may resolve some kinds of loop-carried data
dependences and enable auto-vectorization; loop distribution is one example, sketched
below.
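A sketch of loop distribution (hypothetical arrays; the compiler performs this
transformation automatically when it is legal):

/* original loop: mixes an independent statement with a recurrence */
for (i = 1; i < N; i++) {
    a[i] = b[i] + c[i];      /* no loop-carried dependence */
    d[i] = d[i-1] + a[i];    /* true recurrence            */
}

/* after distribution: the first loop vectorizes, the second stays serial */
for (i = 1; i < N; i++)
    a[i] = b[i] + c[i];
for (i = 1; i < N; i++)
    d[i] = d[i-1] + a[i];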

7.6.2 Memory Aliasing


When loops contain data which can potentially overlap, the compiler may refrain from
auto-vectorization, since vectorizing could produce incorrect results. For the code snippet
shown below, if the memory regions a[0..N-1], b[0..N-1] and c[0..N-1] do not overlap, and
the compiler can statically deduce this fact, the loop will be vectorized and AVX
instructions generated for it. In general, however, it may be non-trivial for the compiler
to deduce this at compile time, and the loop may remain serial even when the memory
regions do not actually overlap.

void foo(double *a, double *b, double *c)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

The user can help the compiler resolve memory aliasing by asserting that the memory
accesses in a loop are independent using #pragma ivdep, placed directly before the loop:

void foo(double *a, double *b, double *c)
{
    int i;
#pragma ivdep
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];
}

This enables the compiler to make safe assumptions about aliasing and subsequently
vectorize the loop.

The compiler may, in certain situations (where memory overlap pragmas are not
provided and the compiler cannot safely analyze overlaps), generate versioned code:
it inserts a runtime test for memory overlap and executes either the auto-vectorized or
the serial version depending on whether the test passes or fails.
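Another option, sketched here assuming a C99 compiler, is the restrict qualifier, which
asserts that each pointer is the only access path to the data it points to:

void foo(double * restrict a, double * restrict b, double * restrict c)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];   /* no aliasing possible: the compiler may vectorize */
}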

7.6.3 Non-stride-1 Accesses


For auto-vectorization, the compiler usually expects data to be accessed from contiguous
memory locations (what we refer to as stride-1 accesses), so that vector loads and stores
can feed the computation directly. With non-stride-1 accesses, data may not be loaded or
stored from contiguous locations, which complicates code generation. Even when the
non-contiguous access pattern is known, the compiler must generate extra instructions
to pack and unpack the data into vector registers before it can be operated upon. These
extra instructions (not present in the serial code) increase the cost of vectorization, and
the compiler may decide against auto-vectorizing such loops based on heuristic or
profile-driven cost analysis. In the code snippet shown below, the accesses to array b[]
are non-unit stride, since another array, idx[], is used to index into b[]. The compiler
does not auto-vectorize such a loop.

for (i = 0; i < N; i++)
    a[i] = b[idx[i]];

7.6.4 Other Vectorization Inhibitors in Loops


A loop which is a candidate for auto-vectorization may exhibit one or more of the
following characteristics which can inhibit vectorization:

- It contains flow control (only a restricted form of if/then/else is allowed):

for (i = 0; i < n; i++) {
    if (i < 8)
        c[i] = a[i] + b[i];
    else
        c[i] = a[i] - b[i];
}

This kind of flow control is handled by the compiler by splitting the loop into two
separate loops and vectorizing both:

for (i = 0; i < 8; i++) {
    c[i] = a[i] + b[i];
}

and

for (i = 8; i < n; i++) {
    c[i] = a[i] - b[i];
}

- The trip count is too small: short loops are not worth vectorizing.
- It contains a function call: embedded function calls may be eliminated (inlined)
  with the compiler option -finline.

7.6.5 Data Alignment


For auto-vectorization, the compiler tries to ensure that vector loads and stores use
aligned memory locations, to minimize the misalignment penalty as much as possible.
The most common alignment issues come from arrays allocated with malloc():

a = (float *) malloc(arraylen*sizeof(float));

The portable way to force alignment on 32-byte boundaries is posix_memalign():

posix_memalign((void **)&a, 32, arraylen*sizeof(float));

The Intel C compiler also provides its own method of aligning malloc()'d data, with an
associated free() function:

a = (float *) _mm_malloc(arraylen*sizeof(float), 32);
_mm_free(a);

These are less portable and can't be interchanged with malloc() and free().
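A minimal sketch combining the portable aligned allocation with a stride-1 loop that is a
good candidate for AVX auto-vectorization (error checking omitted):

#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#include <stdlib.h>

void vec_add(size_t n)
{
    float *a, *b, *c;
    posix_memalign((void **)&a, 32, n * sizeof(float));
    posix_memalign((void **)&b, 32, n * sizeof(float));
    posix_memalign((void **)&c, 32, n * sizeof(float));
    for (size_t i = 0; i < n; i++) { a[i] = (float)i; b[i] = 1.0f; }
    for (size_t i = 0; i < n; i++)   /* stride-1, 32-byte aligned data */
        c[i] = a[i] + b[i];
    free(a); free(b); free(c);
}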


7.7 Additional References


For additional information about the topics presented in this chapter, the interested
reader can refer to:
[1] How to Optimize for AVX
[2] AVX Software Development Tools and Libraries
[3] Introduction to Intel AVX
[4] Practical AVX Optimization
[5] Intel Math Kernel Library Documentation
[6] Intel Integrated Performance Primitives Documentation
[7] Intel Fortran Compiler XE 12.1 User and Reference Guide
[8] Introduction to Vectorization with Intel C++ Compiler
[9] Introduction to Vectorization with Intel Fortran
[10] Auto-vectorization with Intel C++ compilers
[11] A Guide to Auto-vectorization with Intel C++ Compilers
[12] Intel C++ auto-vectorization tutorial


8 Hardware Accelerators

8.1 GPGPUs
The Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded,
many-core processor with exceptional compute power and very high local memory
bandwidth. Originally, GPUs were designed for 3-D rendering, where large sets of pixels
and vertices map naturally to parallel threads. Many high performance computing (HPC)
applications are capable of taking advantage of these threads on GPUs. Both NVIDIA
and AMD (with Fusion/ATI) have been offering GPUs for high performance computing
applications for the last four years. GPUs that can be used for HPC applications are
commonly referred to as GPGPUs (General Purpose GPUs).

IBM has been selling NVIDIA GPUs in iDataPlex and blade server nodes since 2011,
and there has been a lot of customer interest in GPU-based servers.

The purpose of this chapter is to examine the role of GPUs on IBM iDataPlex systems,
details of the GPUs, the software available for GPUs, and programming the GPUs for
performance. This chapter discusses how to run HPL with GPUs, and the tools available
for GPU performance.

8.1.1 NVIDIA Tesla 2090 Hardware description


NVIDIA developed CUDA, a hardware and software architecture that enables NVIDIA
Graphics Processing Units (GPUs) to execute user-developed programs in C, C++,
FORTRAN, OpenCL, DirectCompute and other languages. A CUDA program invokes
parallel kernels, where a kernel executes in parallel across a set of threads. A detailed
description of CUDA hardware may be found in [1]. A brief description is included here
to explain some of the concepts.

A kernel is a set of computations that can be executed in parallel by a single core of a


GPGPU. Each instance of the kernel is executed in a single thread that has its own
private memory, thread ID, program counter, and registers. A thread block is a set of
concurrently executing threads that can cooperate among themselves through barrier
synchronization and shared memory. A thread block has a block ID within its grid.

A grid is an array of thread blocks that executes the same kernel, reads inputs from
global memory, writes results to global memory, and synchronizes between dependent
kernel calls.

Each thread block has a per-block shared memory space used for inter-thread
communication, data sharing, and result sharing in parallel algorithms. Grids of thread
blocks share results in global memory space after kernel-wide global synchronization.

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU: the GPU
executes one or more kernel grids; a streaming multiprocessor (SM) executes one
or more thread blocks; and the CUDA cores and other execution units in the SM execute
the threads. The SM executes threads in groups of 32 threads called a warp.


There are 512 CUDA cores, organized into 16 SMs of 32 cores each. The GPU has six
64-bit memory partitions, for a 384-bit memory interface supporting a total of 6 GB of
GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express.
The GigaThread global scheduler distributes thread blocks to the SM thread schedulers.
Figure 8-1 shows the layout of the Fermi GPU (the code name for the latest Tesla
product as of March 2012). One of the SMs is outlined in red.

Figure 8-1 Functional block diagram of the Tesla Fermi GPU

[Block diagram: the 16 SMs surround a shared L2 cache, with six DRAM interfaces, the host interface and the GigaThread scheduler at the periphery.]


Figure 8-2 shows a block diagram of the Fermi Streaming Multiprocessor (SM). Each SM
has 16 load/store units (it can load data for 16 threads at a time), 4 Special Function
Units (SFUs) for functions such as sine, cosine, reciprocal, and square root, and a
fundamental computational block of 32 CUDA cores. One CUDA core is outlined in red.

Figure 8-2 Tesla Fermi SM block diagram

[Block diagram: instruction cache, two warp schedulers with two dispatch units, the register file, a grid of 32 CUDA cores, 16 load/store units, 4 Special Function Units, the interconnect network, 64 KB of local shared memory/L1, and a uniform cache.]


As shown in Figure 8-3, each CUDA core has an integer arithmetic logic unit (ALU) and a
floating point unit (FPU) capable of providing a fused multiply-add (FMA) instruction for
both single and double precision arithmetic. Each FPU takes 1 clock to deliver
single-precision results, 2 clocks for double-precision results.

Figure 8-3 CUDA core

[Diagram: each CUDA core contains a dispatch port, an operand collector, an FPU and an integer ALU, and a result queue.]

As mentioned earlier, the SM schedules threads in groups of 32 parallel threads called
warps. Each SM has 2 warp schedulers and 2 instruction dispatch units, allowing 2 warps
to be issued and executed concurrently.

Each SM has 64 KB of configurable (16/48 KB) shared memory and L1 cache. This on-
chip shared memory enables threads in the same block to cooperate. On the Tesla 2090
GPU there is also a 768 KB L2 cache, shared by all SMs, that can be written by any
thread.

8.1.2 A CUDA Programming Example


CUDA provides a low level instruction set called PTX (Parallel Thread Execution),
through which higher level languages such as CUDA C can work with the hardware.

The basic programming approach is as follows. If part of a code will benefit from the
data-parallel model using multiple threads, that part is identified as a kernel and
dispatched to the GPUs. There is an overhead associated with transferring data to and
from the GPUs. From the architecture point of view, GPUs are much simpler, since all
the threads perform the same operations with no context switching. The GPU computing
flow (sketched after the list) is roughly as follows:
- Copy data from CPU memory to the GPU memory
- The CPU instructs the GPU (by calling a CUDA function) to perform the kernel
  computations
- The GPU executes the kernel instructions on its cores
- Resulting data is copied from the GPU memory back to the CPU memory
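A minimal sketch of this four-step flow using the standard CUDA runtime API
(my_kernel, h_x, n, nblocks and nthreads are illustrative names; error checking omitted):

float *d_x;
size_t bytes = n * sizeof(float);
cudaMalloc((void **)&d_x, bytes);                     // allocate device memory
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // step 1: host -> device
my_kernel<<<nblocks, nthreads>>>(n, d_x);             // steps 2 and 3: run the kernel
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // step 4: device -> host
cudaFree(d_x);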
CUDA C extends the C language by allowing the programmer to define C functions
known as kernels, which are executed N times in parallel by N different CUDA threads
[2]. An example SAXPY would be as follows:

Serial SAXPY in C:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// driver invocation of the serial saxpy routine
saxpy_serial(n, 5.0f, x, y);

Data-parallel SAXPY in CUDA C:

__global__ void saxpy_cuda(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

blockIdx, blockDim, and threadIdx are built-in variables in CUDA.

Driver invocation:
// driver invocation of the cuda saxpy kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_cuda<<<nblocks, 256>>>(n, 5.0f, x, y);

With this driver invocation of the CUDA kernel for SAXPY, the computations for all
elements are done in parallel.

Compiling, linking, and running CUDA codes:

The following environment variables need to be set so that the appropriate CUDA
compilers and libraries may be found:
setenv CUDA_HOME /usr/local/cuda-3.2
setenv PATH /usr/local/cuda-3.2/bin:$PATH
setenv LD_LIBRARY_PATH /usr/local/cuda-3.2/lib64:$LD_LIBRARY_PATH
The kernel code written in CUDA C has the extension .cu, like kernel.cu. The driver
code is written in regular GNU C, with the extension .c, like driver.c. The CUDA compiler
is nvcc. Compile and link all of the source code with:
gcc -c driver.c
nvcc -c kernel.cu
gcc kernel.o driver.o -o solver
and run the code with:
./solver
The CUDA C compiler nvcc is found in the /usr/local/cuda-3.2/bin directory. There is
also a CUDA debugger called cuda-gdb in the same directory; it is very much like the
GNU GDB debugger, except that debugging is done at the device level.

The CUBLAS and CUFFT libraries are located in /usr/local/cuda-3.2/lib64 as shared
objects. The CUBLAS library is optimized for the BLAS kernels.

FORTRAN support

PGI provides GPU support through multiple mechanisms: there are CUDA C and CUDA
FORTRAN compilers from PGI, and there is directive-based support in the PGI
Accelerator model, very similar to OpenMP directives for parallelization [3].

There is also a CAPS HMPP compiler that provides support for FORTRAN codes [4].

8.1.3 Memory Hierarchy in GPU Computations


Figure 8-4 NVIDIA GPU memory hierarchy

Figure 8-4 shows the basic memory hierarchy of the NVIDIA GPUs. To reiterate: each
SM has 64 KB of configurable L1 cache/shared memory on chip, and all SMs share a
768 KB L2 cache; there is 6 GB of global GPU memory; and the GPU connects to the
system over a PCI-Express interface. The data transfer rates vary widely in CUDA
computing:
- within the device (50-80 GB/sec)
- asynchronous host-to-device with pinned memory (10-20 GB/sec)
- PCIe transfer rate (4-6 GB/sec)
Because of the small size of the caches and the possibility of memory bank conflicts,
CUDA programming strongly discourages cache blocking when tuning for performance.
The preferred tuning technique is known as memory coalescing; the details may be
found in the NVIDIA tutorials. Here is an example of memory coalescing in CUDA for a
2-D transpose [5].

In the naive transpose, loads are coalesced but stores are not (the stride runs over the
column index). In addition, there are other read-only memories, known as texture
memory and constant memory, available in the GPUs.


__global__ void transposeNaive(float *odata, float *idata, int width, int height)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;  // convert thread indices to
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;  // coordinates in the matrix
    int index_in  = xIndex + width * yIndex;   // convert matrix coordinates
    int index_out = yIndex + height * xIndex;  // to flattened array indices
    odata[index_out] = idata[index_in];
}

In the coalesced method using shared memory, there are two steps:
1. Transpose the submatrix into shared memory.
2. Write rows of the transposed submatrix back to global memory.
And the resulting code looks like this:

__global__ void transposeCoalesced(float *odata, float *idata, int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM+1];
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex*width;
    tile[threadIdx.y][threadIdx.x] = idata[index_in];

    __syncthreads();

    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex*height;
    odata[index_out] = tile[threadIdx.x][threadIdx.y];
}

This requires thread synchronization, so that all columns are written to shared memory
before rows are written back to global memory. It also requires that the tile be a perfect
square, so that threadIdx.x and threadIdx.y have the same range (TILE_DIM).
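A sketch of a matching kernel launch (assuming TILE_DIM is 16, width and height are
multiples of TILE_DIM, and d_odata and d_idata are device pointers):

dim3 block(TILE_DIM, TILE_DIM);                  // one thread per tile element
dim3 grid(width / TILE_DIM, height / TILE_DIM);  // one thread block per tile
transposeCoalesced<<<grid, block>>>(d_odata, d_idata, width, height);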

In addition to memory coalescing, there are other memory optimization procedures like
using texture memory and execution configuration (determining the best thread block
dimension). In [6], implementation of the Himeno benchmark that solves the 3-D Poisson
equation is discussed with the details of optimizing for performance on GPUs. Using
finite-differences, the Poisson equation is discretized in space yielding a 19-point stencil.
The discretized equations are solved iteratively using Jacobi relaxation. This benchmark
is designed to run with various problem sizes to fit the system. On the Himeno code,
optimized for GPUs, memory coalescing gives a 57% performance improvement. Use of
Texture cache improves the performance by an additional 33%. Other optimizations
(removing logic, and branching) improve the performance by an additional 18% [6].

8.1.4 CUDA Best practices


From the tuning document for Fermi [7], the recommended best practices may be
summarized as:
- find ways to parallelize sequential code
- minimize data transfers between the host and the device
- adjust the kernel launch configuration to maximize device utilization
- ensure global memory accesses are coalesced
- replace global memory accesses with shared memory accesses whenever possible
- avoid different execution paths within the same warp

8.1.5 Running HPL with GPUs


HPL [8] on GPUs is one of the most requested benchmarks for systems with GPUs.
Certain requirements are important for performance, and those are discussed here. The
HPL-GPU code is provided by NVIDIA; the latest version is version 11. There is a
problem with Intel MPI that limits scaling, so it is recommended that HPL-GPU be built
with OpenMPI. It is preferable to use GNU compilers and Intel MKL libraries to get the
best performance.

This example illustrates a case where two Tesla M2070 GPUs are attached to each 12-
core Intel Westmere node. These are some of the relevant runtime environment variables:
CPU_CORES_PER_GPU=6
export MKL_NUM_THREADS=$CPU_CORES_PER_GPU
export OMP_NUM_THREADS=$CPU_CORES_PER_GPU

export CUDA_DGEMM_SPLIT=0.85
export CUDA_DTRSM_SPLIT=0.75
The last two environment variables distribute the workload between the CPUs and the
GPUs.

Processor binding has a significant impact on performance. With OpenMPI, a rankfile
is an effective binding tool.

The HPL-GPU code on the CPU side is a hybrid parallel code. It is best run with a
command like this:
$ mpirun -machinefile host.list -np 8 --mca btl_openib_flags 1 --rankfile rankfile \
  $HPL_DIR/bin/CUDA_pinned/xhpl | tee out_8
where the rankfile looks like this:
rank 0=i04n201 slot=0,1,2,3,4,5
rank 1=i04n201 slot=6,7,8,9,10,11
rank 2=i04n202 slot=0,1,2,3,4,5
rank 3=i04n202 slot=6,7,8,9,10,11
rank 4=i04n203 slot=0,1,2,3,4,5
rank 5=i04n203 slot=6,7,8,9,10,11
rank 6=i04n204 slot=0,1,2,3,4,5
rank 7=i04n204 slot=6,7,8,9,10,11
Also, it is important to choose the right problem size for peak performance. The problem
size is generally based on the available node memory, as recommended on the HPL
download site ([8]).
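
A common rule of thumb, suggested on the HPL site, is to size the matrix so that it fills
roughly 80% of the aggregate memory; the 80% fraction and the NB value in the sketch
below are illustrative assumptions, not fixed requirements.

#include <math.h>
#include <stdio.h>

/* Sketch: estimate the HPL matrix size N, assuming ~80% of the aggregate
   node memory holds the double-precision matrix (8 bytes per element),
   rounded down to a multiple of the block size NB. */
int main(void)
{
    double gb_per_node = 48.0;   /* as in the configuration below */
    int    nodes       = 32;
    int    nb          = 768;    /* illustrative block size */
    double bytes = gb_per_node * nodes * 1024.0 * 1024.0 * 1024.0;
    long   n = (long)sqrt(0.80 * bytes / 8.0);
    n -= n % nb;                 /* round down to a multiple of NB */
    printf("Suggested N: %ld\n", n);
    return 0;
}

For the 32-node, 48 GB-per-node configuration shown below, this yields an N close to
the 409168 used in Table 8-1.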

The following table shows GPU HPL performance on a customer system that has 2
GPUs per node on 32 nodes with 48 GB of memory per node.

Table 8-1 HPL performance on GPUs


# of nodes  # of GPUs  N (matrix size)  GFlop/s
1 2 70000 699
2 4 90000 1299
4 8 130000 2475
8 16 186000 4708
16 32 288168 9090
32 64 409168 18390


The scaling is roughly linear. The result is slightly above 50% of the peak performance of
the system, which is typically the expected efficiency for HPL on GPU-accelerated systems.

8.1.6 CUDA Toolkit


In the last couple of years, many development tools have been added to the CUDA
Toolkit [9]. These include libraries and other tools, and are mentioned here for
completeness.
- cuSPARSE: CUDA sparse matrix library
- cuRAND: CUDA random number generation library
- NPP: NVIDIA Performance Primitives, a collection of GPU-accelerated image, video,
  and signal processing functions
- Thrust: a library of parallel algorithms and data structures
- CUDA-GDB: a CUDA debugger that allows debugging of both the CPU and GPU parts
  of an application
- CUDA-MEMCHECK: a utility that identifies the source and cause of memory access
  errors
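
As a small illustration of one of these components, the following sketch uses Thrust to
sort a vector on the GPU; it assumes a CUDA Toolkit recent enough to ship Thrust (4.0
or later).

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <cstdlib>

// Sketch: sort one million integers on the GPU with Thrust.
int main(void)
{
    thrust::host_vector<int> h(1 << 20);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = rand();

    thrust::device_vector<int> d = h;            // copy to the device
    thrust::sort(d.begin(), d.end());            // parallel sort on the GPU
    thrust::copy(d.begin(), d.end(), h.begin()); // copy the result back
    return 0;
}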

8.1.7 OpenACC
OpenACC is a directive-based, OpenMP-like programming standard intended to speed
the development of GPU applications. It was developed by Cray, NVIDIA, PGI, and
CAPS. The details may be found in [10].
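
For flavor, a simple vector sum annotated with an OpenACC directive might look like the
following sketch; it assumes an OpenACC-capable compiler (for example from PGI or
CAPS) and is purely illustrative.

// Sketch: the pragma asks the compiler to generate a GPU kernel for the
// loop and to manage the data movement declared in the clauses.
void vecadd(const float *a, const float *b, float *c, int n)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}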

8.1.8 Checking for GPUs in a system


The simple command nvidia-smi -a provides a wealth of information about the GPUs in a
system. This is the output of the command when the GPUs are idle:
$ nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Tue Mar 20 16:17:09 2012

Driver Version : 270.41.19

Attached GPUs : 2

GPU 0:14:0
Product Name : Tesla M2090
Display Mode : Disabled
Persistence Mode : Disabled
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0322411030473
GPU UUID : GPU-891aa48aaa2cffee-966d8d4e-e0c73aa5-76464379-4a364676f8c384cd02804ab0
Inforom Version
OEM Object : 1.1
ECC Object : 2.0
Power Management Object : 4.0
PCI
Bus : 14


Device : 0
Domain : 0
Device Id : 109110DE
Bus Id : 0:14:0
Fan Speed : N/A
Memory Usage
Total : 6143 Mb
Used : 10 Mb
Free : 6132 Mb
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Double Bit
Device Memory : 0
Register File : 0
L1 Cache : 0
L2 Cache : 0
Total : 0
Temperature
Gpu : N/A
Power Readings
Power State : P12
Power Management : Supported
Power Draw : 32.58 W
Power Limit : 225 W
Clocks
Graphics : 50 MHz
SM : 100 MHz
Memory : 135 MHz
Similarly, it provides information for the second GPU with bus Id labeled as 0:15:0.

If the GPUs are being used, the Utilization section shows the GPU and GPU-memory
utilization in percentages, which change dynamically as the program runs on the GPUs.
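
The same basic inventory is also available programmatically through the CUDA runtime
API; a minimal sketch follows.

#include <stdio.h>
#include <cuda_runtime.h>

// Sketch: enumerate the GPUs in a system with the CUDA runtime API.
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("Attached GPUs: %d\n", count);
    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %d MB, compute capability %d.%d\n", i,
               prop.name, (int)(prop.totalGlobalMem >> 20),
               prop.major, prop.minor);
    }
    return 0;
}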


8.1.9 References

[1] NVIDIA Fermi Compute Architecture
[2] NVIDIA CUDA C Programming Guide
[3] PGI CUDA Fortran Compiler
[4] CAPS
[5] NVIDIA Tutorials: Fundamentals of Performance Optimization on Fermi, 2011
[6] Phillips, Everett, and Fatica, Massimiliano, Implementing the Himeno Benchmark
with CUDA on GPU Clusters, 2010 IEEE International Symposium on Parallel and
Distributed Processing, April 2010.
[7] Tuning CUDA Applications for Fermi, Version 1.2, NVIDIA, July 2010.
[8] HPL at Netlib.org
[9] NVIDIA CUDA Toolkit 4.0
[10] NVIDIA OpenACC
[11] The Tech Report: Nvidia's Fermi GPU architecture revealed


9 Power Consumption
The power consumption of a large HPC cluster can reach hundreds, if not thousands, of
kilowatts and customers can be limited in their deployments by the amount of electrical
power that is available in their data center.

The need for ever growing computing power means that the energy efficiency of HPC
clusters must improve for the power envelope to remain under control and not become
the dominant factor limiting the size of a deployment. This is well understood by all
manufacturers who have put energy efficiency at the top of their priorities, whether at the
chip level, the node level, the rack level or the data center level.

From one generation to the next, improvements in the chip manufacturing process have
allowed the power consumption of processors to stay roughly stable while the computing
power has increased, by adding more cores instead of increasing the processor clock
speed.

The Intel Sandy Bridge processor supports a variety of power-saving features like
multiple sleep states and DVFS (Dynamic Voltage and Frequency Scaling) which can be
leveraged by system and application software to reduce the energy consumption of HPC
workloads.

The iDataPlex dx360 M4 uses all these techniques and provides enhanced power
management functions like power capping and power trending.

9.1 Power consumption measurements


The first step in trying to save energy is to measure and report it. Simple measurements
can give valuable information about the power consumption profile of a cluster.

There are two simple ways of measuring the power consumption of an iDataPlex dx360
M4 server: using the rvitals command from xCAT (see the xCAT man page), or using a
simple Linux script to format the data collected by the ibmaem kernel module.

Here is the output of the rvitals command on a dx360 M4 node with hostname n010.
therm-stc:~ # rvitals n010 power
n010: AC Avg Power: 145 Watts (495 BTUs/hr)
n010: CPU Avg Power: 32 Watts (109 BTUs/hr)
n010: Domain A AvgPwr: 50 Watts (171 BTUs/hr)
n010: Domain B AvgPwr: 50 Watts (171 BTUs/hr)
n010: MEM Avg Power: 6 Watts (20 BTUs/hr)
n010: Power Status: on
n010: AC Energy Usage: 90.4085 kWh +/-5.0%
n010: DC Energy Usage: 39.6219 kWh +/-2.5%
To measure the energy consumption of a given workload on an iDataPlex dx360 M4
server, one would use:
$ rvitals n010 power
$ ./workload.sh
$ rvitals n010 power
and work out the difference between the AC Energy Usage readings before and after.

If rvitals is not available, a simple script to process the information provided by the
ibmaem Linux kernel module can yield similar, although less detailed, information.


Provided the ibmaem kernel module is loaded (modprobe ibmaem), one could use this
script:
$ cat nrg.sh
#!/bin/bash
v=1
if [ ! -r /sys/devices/platform/aem.1 ] ; then
    v=0
fi
# Energy counters are in microjoules: energy1 is DC, energy2 is AC
BDC=`cat /sys/devices/platform/aem.$v/energy1_input`
BAC=`cat /sys/devices/platform/aem.$v/energy2_input`
b=$(date "+%s%N")
"$@"
e=$(date "+%s%N")
ADC=`cat /sys/devices/platform/aem.$v/energy1_input`
AAC=`cat /sys/devices/platform/aem.$v/energy2_input`
RT=$(echo "($e-$b)/1000000"|bc)
DC=$((ADC - BDC))
AC=$((AAC - BAC))
DCP=$(echo "$DC/1000/$RT"|bc -l|cut -d'.' -f1)
ACP=$(echo "$AC/1000/$RT"|bc -l|cut -d'.' -f1)
echo "Energy: $(( DC / 1000000 ))J (DC) Time: $RT (ms) AvgPower: ${DCP}W (DC)"
echo "Energy: $(( AC / 1000000 ))J (AC) Time: $RT (ms) AvgPower: ${ACP}W (AC)"

9.2 Performance versus Power Consumption


The following conclusions can be drawn from measuring the power consumption of a
compute node:
- Partially loaded systems consume a high fraction of the maximum power. It is not
  uncommon for an idle node to consume half the power of a fully loaded node.
- The power consumption of a given system is a monotonic function of the CPU clock
  frequency.
- The power consumption of a given system depends on the application being run.
  Linpack HPL is generally considered a worst case; there are applications which
  consume far less power.
- The CPU is the biggest contributor to node power, but moving data to and from
  memory is also significant.
- The energy consumption of a given workload (the product of the average power
  consumption and the job's run time) is a complex function of the CPU clock
  frequency. In some situations it can be beneficial, energy-wise, to run a job at the
  highest possible clock speed: the instantaneous power consumption will be higher,
  but for a shorter period, resulting in an overall energy saving.
The above points show that there are opportunities to reduce the power consumption of
compute nodes while minimizing the impact on compute performance.

9.3 System Power states


Several power states exist in the system. These power states allow components of the
server or the entire server to reduce power consumption and optimize efficiency. The
power states are used both when the system is idle and when it is running.

With any of the power-saving states, there is a tradeoff between power savings and
latency. For example, enabling the CPU C6 state allows CPU cores to be completely
turned off, which saves power. But since the CPU cores are powered down, it takes
additional time to restore their state when they transition back to the C0 state. If
maximum overall performance is desired, all power-saving states can be disabled. This
will minimize the latencies of transitions into and out of the power states, but power
consumption will increase dramatically. At the other extreme, if power settings are
optimized for maximum power savings, performance can suffer due to long latencies.
For most applications, the default system settings offer a good balance between
performance and efficiency. If necessary, the defaults can be changed if increased
performance or power savings are desired.

The diagram below shows an overview of the various power states in the server.

Figure 9-1 Server Power States

There is a hierarchy among the power states. At the highest level, the G-states represent
the overall state of the server. The G-states map to the S-states (system sleep states).
Progressing to the right, there are subsystem power states that represent the current
state of the CPU, memory, and subsystem devices. As shown by the arrows, certain
power states cannot be entered if higher level power states are not active. For example,
for a CPU core to be in P1 state, the CPU core also has to be in C0 state, the system has
to be in S0 system sleep state, and the overall server has to be in G0 state.

For additional information on system power states, refer to the ACPI (Advanced
Configuration and Power Interface) specification.


9.3.1 G-States
G-states are global server states that define the operational state of the entire server. As
the number of the G-state increases, there is additional power saved. However, as
shown below, the latency to move back to G0 state also increases.

Table 9-1 Global server states

G-State  Can Applications Run?  OS Reboot Required?  Description
G0       Yes                    No                   Server is fully on, but some
                                                     individual components could be in
                                                     a power-savings state.
G1       No                     No                   Standby, suspend, or hibernate
                                                     modes. See S-states.
G2       No                     Yes                  System is in a soft-off state; for
                                                     example, the power switch was
                                                     pressed. The system draws power
                                                     from the auxiliary power rail. The
                                                     main power rail may or may not be
                                                     switched off.
G3       No                     Yes                  AC power is removed. The server is
                                                     powered only by the backup battery
                                                     for RTC, CMOS, and wake events.


9.3.2 S-States
S-states define the sleep state of the entire system. The table below describes the
various sleep states.

Table 9-2 Sleep states

S-State  G-State  BIOS Reboot  OS Reboot  Relative  Relative  Description
                  Required?    Required?  Power     Latency
S0       G0       No           No         6X        0         System is fully on, but some
                                                              components could be in a
                                                              power-savings state.
S1       G1       No           No         2.5X      1%        A.k.a. Idle, or Standby if S3
                                                              is not supported. Typically,
                                                              when the OS is idle, it halts
                                                              the CPU and blanks the monitor
                                                              to save power. No power rails
                                                              are switched off. This state
                                                              may go away on future servers.
S2       G1       No           No         -         -         CPU caches are powered down.
                                                              No known server or OS supports
                                                              this state.
S3       G1       No           No         1.1X      10%       A.k.a. Standby, or
                                                              Suspend-to-RAM. The state of
                                                              the chipset registers is saved
                                                              to system memory, and memory
                                                              is placed in a low-power
                                                              self-refresh state. To
                                                              preserve the memory contents,
                                                              power is supplied to the DRAMs
                                                              in S3 state.
S4       G1       Yes          No         X         90%       A.k.a. Hibernate, or
                                                              Suspend-to-disk. The state of
                                                              the operating system (all
                                                              memory contents and chip
                                                              registers) is saved to a file
                                                              on the HDD, and the server is
                                                              placed in a soft-off state.
S5       G2       Yes          Yes        X         100%      Server is in a soft-off state.
                                                              When turned back on, the
                                                              server must completely
                                                              reinitialize with POST and the
                                                              operating system.
Just as with G-states, higher numbered sleep states save more power but there is
additional latency when the system transitions back to S0 state. The middle state, S3,
offers a good compromise between power savings and latency.

9.3.3 C-States
C-states are CPU idle power-saving states. C-states higher than C0 only become active
when a CPU core is idle. If a process is running on a CPU core, the core is always in C0
state. If Hyper-Threading is enabled, the C-state resolves down to the physical core. For
example, if one Hyper-Thread is active and another Hyper-Thread is idle on the same
core, the core will remain in C0 state.

C-states can operate on each core separately or the entire CPU package. The CPU
package is the physical chip in which the CPU cores reside. It includes the CPU cores,
caches, memory controllers, PCI Express interfaces, and miscellaneous logic. The non-
CPU core hardware inside the package is commonly referred to as the uncore.

Core C-state transitions are driven by interrupts or by the operating system scheduler via
MWAIT instructions. The number of cores in C3 or C6 also affects the maximum turbo
frequency that is available. If maximum peak performance is desired, all CPU C-states
should be enabled.

Package C-state transitions are autonomous; no OS awareness or intervention is
required. The package C-state is equal to the lowest numbered C-state that any of the
CPU cores is in at that point in time. Additional logic inside the CPU package monitors all
of the CPU cores and places the package into the appropriate C-state.

Note that CPU C-states do not map directly to ACPI C-states. The reason for this is
historical. ACPI C-states range from C0 to C3. At the time when they were defined,
there were no CPUs that supported the C6 state, so the mapping was 1:1 (ACPI C0 =
CPU C0, ACPI C1 = CPU C1, etc.). Newer CPUs, however, support the C6 state. In
order to get the maximum power savings when going to the ACPI C3 state, the CPU C6
state is mapped to ACPI C3 and the CPU C3 state is mapped to ACPI C2.

Table 9-3 CPU idle power-saving states


ACPI C-State CPU C-State
C0 C0
C1 C1
C2 C3
C3 C6
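
On Linux, the C-states that the OS actually exposes can be inspected through the
cpuidle sysfs interface; the sketch below, which assumes the standard cpuidle layout,
simply lists the state names for CPU 0.

#include <stdio.h>

// Sketch: list the cpuidle states that Linux exposes for CPU 0.
int main(void)
{
    char path[128], name[64];
    for (int s = 0; ; s++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cpuidle/state%d/name", s);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                          // no more states
        if (fgets(name, sizeof(name), f))
            printf("state%d: %s", s, name); // name already ends in '\n'
        fclose(f);
    }
    return 0;
}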


Shown below is a description of each core and package C-state. The power and latency
figures are approximate (see note 4 below).

Table 9-4 CPU idle states for each core and package

C0
  Core: fully on and executing code; L1 cache coherent; core power is on.
    Power/latency: 100% at Pn / 0 ns.
  Package: at least one core is in C0 state. Power/latency: 100% / 0 ns.

C1
  Core: halted; L1 cache coherent; core power is on. Power/latency: 30% / 5 µs.
  Package: N/A (core-only state).

C1E
  Core: N/A (package-only state).
  Package: at least one core is in C1 state and all others are in a higher
    numbered C-state. All cores run at the lowest frequency; the VRD (note 5)
    switches to its minimal voltage state; the PLL is on; the CPU package will
    process bus snoops. Power/latency: 50% / ~5 µs.

C3
  Core: halted; L1 cache flushed to the last-level cache; all core clocks
    stopped; core power is on. Power/latency: 10% / 50 µs.
  Package: at least one core is in C3 state and all others are in higher
    numbered C-states. The VRD is in its minimal voltage state; the PLL is off;
    memory is placed in self-refresh; the L3 shared cache retains context but
    is inaccessible; the CPU package is not snoopable.
    Power/latency: 25% / ~50 µs.

C6
  Core: L1 flushed to the LLC; core power is off. Power/latency: 0% / 100 µs.
  Package: all cores are in C6 state. Same power-saving features as package
    C3, plus some additional uncore savings. Power/latency: 16% / ~100 µs.

Note 4: The number of C-states and the specific power savings associated with each
C-state depend on the specific type and SKU of CPU installed.
Note 5: VRD stands for voltage regulator device.

9.3.4 P-States
P-states are defined as the CPU performance states. Each CPU core supports multiple
P-states, and each P-state corresponds to a frequency. Note that P0 can run above the
rated frequency for short periods of time if Turbo mode is enabled. The exact turbo
frequency for P0 and the amount of time the core runs at the turbo frequency are
controlled autonomously in hardware.

Like core C-states, P-states are controlled by the operating system scheduler. The OS
scheduler places a CPU core in a specific P-state depending on the amount of
performance needed to complete the current task. For example, if a 2GHz CPU core
only needs to run at 1GHz to complete a task, the OS scheduler will place the CPU into a
higher numbered P-state.

Each CPU core can be placed in a different P-state. Multiple threads on one core (e.g.
Hyper-Threading) are resolved to a single P-state. P-states are only valid when the CPU
core is in the C0 state. P-states are sometimes referred to as DVFS (dynamic voltage
and frequency scaling) or EIST (Enhanced Intel Speedstep Technology).
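
From user space, the P-state machinery surfaces through the Linux cpufreq sysfs
interface; the following sketch, which assumes the standard cpufreq layout, reads the
current frequency and the active governor for CPU 0.

#include <stdio.h>

// Sketch: print a cpufreq attribute of CPU 0 from sysfs.
static void show(const char *file)
{
    char path[128], buf[64];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu0/cpufreq/%s", file);
    f = fopen(path, "r");
    if (f && fgets(buf, sizeof(buf), f))
        printf("%s: %s", file, buf);        // value ends in '\n'
    if (f)
        fclose(f);
}

int main(void)
{
    show("scaling_cur_freq");   // current core frequency, in kHz
    show("scaling_governor");   // e.g. ondemand or performance
    return 0;
}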

Table 9-5 CPU performance states

P-State  CPU Frequency (approximate)  Description
P0       100 to ~130% (with turbo)    CPU can run at the rated frequency indefinitely,
                                      or at a turbo frequency (greater than the rated
                                      frequency) for short periods of time.
P1       ~90 to 95%                   Intermediate P-state
...
Pn-1     ~85 to 95%                   Intermediate P-state
Pn       1.2 GHz                      Minimum frequency at which the CPU core can
                                      execute code.
The exact frequency breakdown for the P-states varies with the rated frequency and
power of the CPU used.

In addition to controlling the core frequency, P-states also indirectly control the voltage
level of the VRD (voltage regulator device) that is supplying power to the CPU cores. As
the core frequency is reduced from its maximum value, the VRD voltage is automatically
reduced down to a certain point. Eventually, the VRD will be operating at the minimum
voltage that the CPU cores can tolerate. If the core frequency is lowered beyond this
point, the VRD voltage will remain at the minimum voltage. This is illustrated in the
diagram below.


Figure 9-2 The effect of VRD voltage. (The figure contrasts the region where both
frequency and voltage scale with the region below the minimum CPU voltage, where
only frequency scaling is possible.)

Typically, the most efficient operating point is at the peak of the curve.

9.3.5 D-States
D-states are subsystem power-saving states. They are applicable to devices such as
LAN, SAS, and USB adapters. The operating system can transition a device to a
different D-state after a period of inactivity or when requested by a device driver. All
D-states occur while the server is in the S0 state.

Table 9-6 Subsystem power states

D-State            Device Power  Device Context  Description
D0                 On            Active          Device is fully on. All devices support
                                                 D0 by default, even if they don't
                                                 implement the PCI Power Management
                                                 specification.
D1                 On            Active          Intermediate power state. Lower power
                                                 consumption than D0. Exact power-saving
                                                 details are device specific.
D2                 On            Active          Intermediate power state. Lower power
                                                 consumption than D1. Exact power-saving
                                                 details are device specific.
D3 hot (ACPI D2)   On            Lost            Power to the device is left on, but the
                                                 device is placed in a low-power state
                                                 and is unresponsive to bus requests.
D3 cold (ACPI D3)  Off           Lost            Power to the device is completely
                                                 removed. All devices support D3 by
                                                 default, even if they don't implement
                                                 the PCI Power Management specification.


9.3.6 M-States
M-states control the memory power savings. The memory controller automatically
transitions memory to the M1 or M2 state when the memory is idle for a period of time.
M-states are only defined when the server is in S0 state.

Table 9-7 Memory power states

M-State  Power / Latency (approximate)  Description
M0       100% at idle / 0               Normal mode of operation.
M1       80% / 30 ns                    Lower-power CKE mode; rank power-down.
M2       30% / 10 µs                    Self-refresh. Operates on all DIMMs connected
                                        to a memory channel in a CPU package.

9.4 Relative Influence of Power Features


The power savings and efficiency seen at the system level is a combined effect of many
individual features. The benefit of the power saving features can vary depending on the
utilization level of the server.

Figure 9-3 Relative influence of power saving features

Figure 9-3 illustrates the relative influence of each power saving feature. The vertical
axis is the system utilization, ranging from 0% (idle) to 100% (maximum utilization). The
width of each polygon at any utilization level represents the relative benefit of each group
of power saving features. For example, at 50% utilization, the power supply and VRD
efficiency have a very large influence on overall system efficiency. This is because the
blue polygon is very wide at the 50% utilization point. In contrast to this, energy-efficient
Ethernet has little benefit at 50% utilization and the idle power-saving features have no
benefit at 50% utilization.

It is important to understand what portion of the utilization curve the server will be
operating in. In this manner, it is possible to understand which power-saving features are
influencing the overall performance/watt efficiency of the server for the target workload.
The composite effect can be measured with industry-standard efficiency benchmarks
such as SPECpower.

9.5 Efficiency Definitions Reference


There are three ways to measure the efficiency of a server:

Electrical conversion efficiency (ECE) measures how much power is lost to convert from
one power level to another (e.g. AC-to-DC or DC-to-DC conversion). If a power supply
converts 220V AC to 12V DC and it is 95% efficient for a 500W load, 5% of the input
power is converted to heat and is typically dissipated with a fan built into the power
supply. In this example, 526W AC is required, 500W is delivered to the load, and 26W is
dissipated as heat. Power supply and VRD efficiency has improved dramatically in
recent years but no electrical circuit is ideal and some power is always dissipated.
ECE = Power Out / Power In
Power usage effectiveness (PUE) measures how much power is lost in the datacenter
relative to the actual IT equipment power. The overall PUE depends on how close to the
true compute load the power-out measurement is taken, and also on which ancillary
loads are included in the calculation (e.g. lights, humidification, UPS, CRACs, chillers,
etc.).
PUE = Total Facility Power / IT Equipment Power = 1 / Datacenter Efficiency
For example, a datacenter that draws 1.5 MW in total to deliver 1 MW to its IT equipment
has a PUE of 1.5.
Performance/watt efficiency (P/W E) is defined as how much performance can be
achieved for every watt of power consumed.
P/W E = Performance / Power
P/W E focuses on the server, chassis, and rack efficiency. By comparison, ECE or PUE
can be extended to the datacenter level or power station level.

9.6 Power and Energy Management


This section presents two energy management techniques available on iDataPlex dx360
M4. One uses the renergy xCAT command. The second one is implemented at the job
scheduler level.

9.6.1 xCAT renergy


The renergy command is used to interact with the power management features of the
iDataPlex dx360 M4 node. With renergy, a system administrator can:
- enable, disable, query, or set power capping values (quotas) for a node
- put systems into and out of the sleep state
- report on current and historic AC wattage by querying the PSU (Power Supply Unit)
- read the inlet and exhaust air temperatures and fan speeds to check the current
  cooling and temperature headroom
To query the energy consumption of one or many iDataPlex servers:

# rvitals r31u39 energy
r31u39: AC Energy Usage: 400.8106 kWh +/-5.0%
r31u39: DC Energy Usage: 310.3434 kWh +/-2.5%
In order to apply a cap, it is first helpful to know what values the server will accept. To
query this information:
# renergy r31u39 cappingmaxmin
r31u39: cappingmax: 111.0 W
r31u39: cappingmin: 68.0 W
In this example, to enact a 70W power quota on the specific system the following
command may be used:
# renergy r31u39 cappingvalue=70 cappingstatus=on
r31u39: cappingvalue: 70.0 W
r31u39: cappingstatus: on
In order to remove the quota:
# renergy r31u39 cappingstatus=off
r31u39: cappingstatus: off
Similarly, the current settings may be queried without using the '=' operator:
# renergy r31u39 cappingstatus cappingvalue cappingmaxmin
r31u39: cappingstatus: off
r31u39: cappingvalue: 70.0 W
r31u39: cappingmax: 111.0 W
r31u39: cappingmin: 68.0 W
Keep in mind that, in the chassis configuration, the aggregate PSU load on facility power
is the sum of the two servers sharing the PSU, and that the quota does not include the
inefficiencies of the AC-to-DC conversion. If an administrator is using a cap to stay under
a PDU or cooling capacity, these inefficiencies must be accounted for.

See the xCAT pages on Energy Management and renergy for more details.

9.6.2 Power and Energy Aware LoadLeveler


The job scheduler is a natural candidate for implementing energy saving techniques: it
runs on the whole cluster, has control over the resources and knows about the
workloads in the job submission queue. For example, LoadLeveler can put nodes in
sleep mode if it can predict that the nodes will not be used for an extended period of time,
quickly readjusting if a new job comes in. PEA-LL (Power and Energy Aware
LoadLeveler) is the generic name of LoadLeveler extensions for energy management.
Version 5.1 adds energy-awareness to the job scheduling.

This works in two steps:
1. PEA-LL produces an energy report for every parallel (MPI) job (power, run time,
energy) and stores this information in the xCAT/LL database.
2. When the same job is submitted again, PEA-LL changes the clock frequency of the
nodes where the job runs to match an energy policy defined by the system
administrator.
As seen before, the energy consumption of a job is a complex function of the CPU clock
frequency. PEA-LL uses predictive models based on a characterization of the
applications being run. These models make it possible to implement energy policies
such as Max Performance, Min Energy, or Min Performance Degradation.


Appendix A: Stream Benchmark Performance


Intel Compiler v. 11
The Stream application was also built using Intel compiler version 11.1.073 with the
following compiler options:
icc -O3 -xsse4.1 -ansi-alias -ip -opt-streaming-stores always
Table A-1 summarizes the results. Threads are bound using KMP_AFFINITY.
Table A-1 Stream memory bandwidth (MB/s) for serial and OpenMP runs
Serial 8 Threads/ 8 Threads/ 16 Threads/
1 Socket 2 Sockets 2 Sockets
Copy 6884.2 35699.3 53815.5 73804.8
Scale 6950.1 36397.8 55591.7 75152.0
Add 9038.8 37188.4 66312.0 75904.3
Triad 9126.3 37978.4 69986.5 77563.8


Appendix B: Acknowledgements
Nagarajan Karthikesan (k.nagarajan@in.ibm.com) provided information on GCC 4.7.0
compilation.

Luigi Brochard supported this project by helping to gather the resources and people
needed.

Steve Stevens and Lisa Maurice provided the encouragement and managerial
sponsorship to complete this project.


Appendix C: Some Useful Abbreviations


API - Application Programming Interface; library calls to access specific functionality
ASU - Advanced Settings Utility
DDR - double data rate; a network hardware link speed
DIMM - Dual In-line Memory Module; a computer memory module
FIFO - First In, First Out
FPU - Floating-Point Unit
GPU - Graphics Processing Unit
GT/s - billions (giga) of transfers per second; a measure of bus bandwidth
  independent of the number of bits transferred
IPoIB - IP over InfiniBand
ISA - Instruction Set Architecture
LID - local identifier; a way to distinguish between network ports
LL - IBM LoadLeveler; a job scheduling application
LLC - Last-Level Cache; refers to the L3 caches in Sandy Bridge chips
NUMA - Non-Uniform Memory Architecture
PDE - Partial Differential Equation
PEA-LL - Power and Energy Aware LoadLeveler
PLL - Phase-Locked Loop; an electronic circuit
QDR - quad data rate (twice the DDR speed)
QoS - Quality of Service; a way to preferentially treat network packets
QPI - Intel's QuickPath Interconnect
RC - Reliable Connection
TDP - Thermal Design Power; the maximum power a chip can dissipate safely
TLB - Translation Lookaside Buffer
UEFI - Unified Extensible Firmware Interface
uOP - micro-operation; a basic machine instruction in the Intel64 ISA
VRD - Voltage Regulator Device


Appendix D: Notices

© IBM Corporation 2010


IBM Corporation
Marketing Communications, Systems Group
Route 100, Somers, New York 10589
Produced in the United States of America
March 2010, All Rights Reserved
This information was developed for products and services offered in the U.S.A.

IBM may not offer the products, services, or features discussed in this document in other
countries. Consult your local IBM representative for information on the products and
services currently available in your area. Any reference to an IBM product, program, or
service is not intended to state or imply that only that IBM product, program, or service
may be used. Any functionally equivalent product, program, or service that does not
infringe any IBM intellectual property right may be used instead. However, it is the user's
responsibility to evaluate and verify the operation of any non-IBM product, program, or
service.

IBM may have patents or pending patent applications covering subject matter described
in this document. The furnishing of this document does not give you any license to these
patents. You can send license inquiries, in writing, to:
IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-
1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law: INTERNATIONAL
BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS"
WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING,
BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT,
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do
not allow disclaimer of express or implied warranties in certain transactions, therefore,
this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes
are periodically made to the information herein; these changes will be incorporated in
new editions of the publication. IBM may make improvements and/or changes in the
product(s) and/or the program(s) described in this publication at any time without notice.

Any references in this information to non-IBM Web sites are provided for convenience
only and do not in any manner serve as an endorsement of those Web sites. The
materials at those Web sites are not part of the materials for this IBM product and use of
those Web sites is at your own risk.

IBM may use or distribute any of the information you supply in any way it believes
appropriate without incurring any obligation to you.

Information concerning non-IBM products was obtained from the suppliers of those
products, their published announcements or other publicly available sources. IBM has not
tested those products and cannot confirm the accuracy of performance, compatibility or
any other claims related to non-IBM products. Questions on the capabilities of non-IBM
products should be addressed to the suppliers of those products.


This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the names
of individuals, companies, brands, and products. All of these names are fictitious and
any similarity to the names and addresses used by an actual business enterprise is
entirely coincidental.

COPYRIGHT LICENSE:

This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy, modify,
and distribute these sample programs in any form without payment to IBM, for the
purposes of developing, using, marketing or distributing application programs conforming
to the application programming interface for the operating platform for which the sample
programs are written. These examples have not been thoroughly tested under all
conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function
of these programs.

More details can be found at the IBM Power Systems home page,
http://www.ibm.com/systems/p.


Appendix E: Trademarks
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International
Business Machines Corporation in the United States, other countries, or both. These and
other IBM trademarked terms are marked on their first occurrence in this information with
the appropriate symbol (® or ™), indicating US registered or common law trademarks
owned by IBM at the time this information was published. Such trademarks may also be
registered or common law trademarks in other countries. A current list of IBM trademarks
is available on the Web.

The following terms are trademarks of the International Business Machines Corporation
in the United States, other countries, or both:
1350, AIX, AIX 5L, alphaWorks, Ascendant, BetaWorks, BladeCenter, CICS, Cool Blue,
DB2, developerWorks, Domino, EnergyScale, Enterprise Storage Server, Enterprise
Workload Manager, eServer, Express Portfolio, FlashCopy, GDPS, General Parallel File
System, Geographically Dispersed Parallel Sysplex, Global Innovation Outlook, GPFS,
HACMP, HiperSockets, HyperSwap, i5/OS, IBM, IBM Process Reference Model for IT,
IBM Systems Director Active Energy Manager, iDataPlex, IntelliStation, Lotus, Lotus
Notes, MQSeries, MVS, Netfinity, Notes, OS/390, Parallel Sysplex, PartnerWorld,
POWER, POWER4, POWER5, POWER6, POWER7, PowerExecutive, Power Systems,
PowerPC, PowerVM, PR/SM, pSeries, QuickPlace, RACF, Rational, Rational Unified
Process, Redbooks, Redbooks (logo), RS/6000, RUP, S/390, Sametime, Summit,
Summit Ascendant, System i, System p, System Storage, System x, System z, System
z10, System/360, System/370, Tivoli, TotalStorage, VM/ESA, VSE/ESA, WebSphere,
Workplace, Workplace Messaging, X-Architecture, xSeries, z/OS, z/VM, z10, zSeries
The following terms are trademarks of other companies:
AMD, AMD Opteron, the AMD Arrow logo, and combinations thereof, are
trademarks of Advanced Micro Devices, Inc.
InfiniBand, and the InfiniBand design marks are trademarks and/or service
marks of the InfiniBand Trade Association.
ITIL is a registered trademark, and a registered community trademark of
the Office of Government Commerce, and is registered in the U.S.
Patent and Trademark Office.
IT Infrastructure Library is a registered trademark of the Central Computer
and Telecommunications Agency which is now part of the Office of
Government Commerce.
Novell, SUSE, the Novell logo, and the N logo are registered trademarks
of Novell, Inc. in the United States and other countries.
Oracle, JD Edwards, PeopleSoft, Siebel, and TopLink are registered
trademarks of Oracle Corporation and/or its affiliates.
SAP NetWeaver, SAP R/3, SAP, and SAP logos are trademarks or
registered trademarks of SAP AG in Germany and in several other
countries.
IQ, J2EE, Java, JDBC, Netra, Solaris, Sun, and all Java-
based trademarks are trademarks of Sun Microsystems, Inc. in the
United States, other countries, or both.
Microsoft, Windows, Windows NT, Outlook, SQL Server,
Windows Server, Windows, and the Windows logo are trademarks
of Microsoft Corporation in the United States, other countries, or both.
Intel Xeon, Intel, Itanium, Intel logo, Intel Inside logo, Intel
SpeedStep, and the Intel Centrino logo are trademarks or registered
trademarks of Intel Corporation or its subsidiaries in the United States,
other countries, or both. A current list of Intel trademarks is available on
the Web at http://www.intel.com/intel/legal/tmnouns2.htm.
QLogic, the QLogic logo are the trademarks or registered trademarks of
QLogic Corporation.
SilverStorm is a trademark of QLogic Corporation.
SPEC is a registered trademark of Standard Performance Evaluation
Corporation.
SPEC MPI is a registered trademark of Standard Performance Evaluation
Corporation.
UNIX is a registered trademark of The Open Group in the United States
and other countries.
Linux is a trademark of Linus Torvalds in the United States, other
countries, or both.
Other company, product, or service names may be trademarks or service marks of
others.
