END-TO-END
COMPUTING
Dell, the Dell logo, and PowerEdge are trademarks of Dell Inc.; Intel and Xeon are registered trademarks and Core is a trademark of Intel Corporation in the U.S. and other countries; ATI is a trademark of AMD; Microsoft and Windows are either trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries; Red Hat and Red Hat Enterprise Linux are registered trademarks of Red Hat, Inc.; SUSE is a registered trademark of Novell, Inc. in the United States and other countries.
Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others.
© Copyright 2008 Dell Inc. All rights reserved. Reproduction in any manner whatsoever without the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.
Table of Contents
Introduction .................................................................................................................................................. 5
An Overview of InfiniBand ............................................................................................................................ 5
The Basics of the InfiniBand Fabric ........................................................................................................... 5
The InfiniBand Power Advantage ............................................................................................................. 6
The InfiniBand Performance Advantage ................................................................................................... 7
InfiniBand Software Solutions................................................................................................................... 7
InfiniBand's Growing Role in the Data Center ............................................................................................. 7
PowerEdge M1000e Architecture ................................................................................................................. 8
Midplane Fabric Connections ................................................................................................................... 9
I/O Communication Paths in Dell PowerEdge M1000e .......................................................................... 10
Server Blades with InfiniBand ConnectX Mezzanine Cards .................................................................... 11
InfiniBand Configuration ......................................................................................................................... 13
Performance Study and Analysis ................................................................................................................ 14
Cluster Test Bed Configurations.............................................................................................................. 14
Hardware Configurations .................................................................................................................... 14
Software Configurations ..................................................................................................................... 14
InfiniBand Blocking Configurations ......................................................................................................... 14
Fully Non-blocking Configuration ....................................................................................................... 14
50% and 75% Blocking ........................................................................................................................ 15
Benchmarking and Analysis: NAS Parallel Benchmark ............................................................................... 16
Summary and Conclusion ........................................................................................................................... 17
References .................................................................................................................................................. 18
Introduction
With the launch of the Dell™ PowerEdge™ M-Series modular enclosure, Dell released a new set of
switches designed to deliver more value and flexibility than any before them. An increasingly important
part of Dell's modular switch lineup, Dell's InfiniBand switch provides a low-latency, high-throughput
option for many data centers and high-performance computing clusters.
As InfiniBand increases its market presence, the need for InfiniBand module flexibility increases.
Previously, on Dell's PowerEdge 1955 enclosure, Dell provided an InfiniBand pass-through module,
which gave each blade a dedicated one-to-one link to the external InfiniBand infrastructure. To
provide more flexibility for different types of InfiniBand environments, the Dell M-Series
supports an internal InfiniBand switch that has one external port for every two servers per module.
This whitepaper demonstrates how a one-to-one, non-blocking InfiniBand Architecture (IBA) fabric can
be built using Dell's new InfiniBand switch.
An Overview of InfiniBand
As the I/O technology with the largest installed base of 10, 20 and 40 Gb/s ports in the market (over 3
million ports by the end of 2007), InfiniBand has clearly delivered the real-world benefits defined and
envisioned by the InfiniBand Trade Association (www.InfiniBandta.org), an industry consortium formed
in 1999. Several factors have enabled InfiniBand adoption in data centers and technical compute
clusters to ramp quickly, and they explain why it will continue to be the performance computing and
storage fabric of choice:
• High-bandwidth, low-latency computing, storage, and management over a single fabric
• Cost-effective silicon and system implementations with an architecture that scales easily from
generation to generation
Figure 1: Typical InfiniBand Architecture
With a true cut-through forwarding architecture and a well-defined end-to-end congestion management
protocol, InfiniBand defines cost-effective and scalable I/O solutions. Switch silicon devices support from
twenty-four 20 Gb/s ports to thirty-six 40 Gb/s InfiniBand ports, which equates to nearly three terabits
per second of aggregate switching bandwidth.
Switches and adapters support up to 16 virtual lanes per link to enable granular segregation and
prioritization of traffic classes for delivering Quality of Service (QoS).
InfiniBand also defines an industry-standard implementation of Remote Direct Memory Access (RDMA)
protocols and kernel bypass to minimize CPU overhead, allowing computing resources to be fully used
for application processing rather than network communication.
InfiniBand is clearly driving the most aggressive performance roadmap of any I/O fabric, while remaining
affordable and robust for mass industry adoption.
The InfiniBand Performance Advantage
One of the key reasons that data centers are deploying industry-standard InfiniBand is the total
application-level performance the fabric enables. First, InfiniBand is the only shipping solution that
supports 20 Gb/s and 40 Gb/s host connectivity and 60 Gb/s and 120 Gb/s switch-to-switch links. Second,
InfiniBand has world-class application latency, with measured end-to-end delays of 1 μs. Third, InfiniBand
enables efficient use of all of the processors and memory in the network by offloading the data
transport mechanisms to the adapter card and reducing memory copies. These three metrics combine to
make InfiniBand one of the industry's most powerful interconnects.
The performance benefits are echoed in the trends of the Top500.org list that tracks the world’s most
powerful supercomputers. Published twice a year, this list is increasingly used as an indication of what
technologies are emerging in the clustered and supercomputing arena.
From an application point of view, InfiniBand has support for a plethora of applications in both
enterprise and high-performance computing environments. In the enterprise environment, InfiniBand is
being used for grid computing and clustered database applications driven by market leaders. In the
commercial high-performance computing field, InfiniBand provides the fabric connecting servers and
storage to address a wide range of applications including oil and gas exploration, automotive crash
simulations, digital media creation, fluid dynamics, drug research, weather forecasting and molecular
modeling just to name a few.
The emergence of virtual and grid computing solutions, in addition to robust software solutions, has set
the stage for mass deployment of InfiniBand in business and utility computing environments.
Industry-standard InfiniBand has the performance, proven reliability, manageability, and widely available
software solutions that make it ready for prime time.
PowerEdge M1000e Architecture
The Dell PowerEdge M1000e Modular Server Enclosure is a breakthrough in enterprise server
architecture. The enclosure and its components spring from a revolutionary, ground-up design
incorporating the latest advances in power, cooling, I/O, and management technologies. These
technologies are packed into a highly available, rack-dense package that integrates into standard Dell
and 3rd-party 19” racks.
The PowerEdge M1000e enclosure is 10U high and provides the following features:
• Up to 16 server modules.
• A high-speed passive midplane that connects the server modules in the front to the power, I/O,
and management infrastructure in the rear of the enclosure.
• Comprehensive I/O options that support dual links of 20 Gigabits per second today (with 4x DDR
InfiniBand), with future support for even higher-bandwidth I/O devices when those technologies
become available. This support provides high-speed server module connectivity to the network and
storage today and well into the future.
• Thorough power management capabilities, including shared power delivery that makes the full
capacity of the power supplies available to all server modules.
• Robust management capabilities, including private Ethernet, serial, USB, and low-level management
connectivity between the Chassis Management Controller (CMC), the keyboard/video/mouse
switch, and server modules.
• Up to two Chassis Management Controllers (one CMC is standard; a second provides optional
redundancy) and one optional integrated Keyboard/Video/Mouse (iKVM) switch.
• Up to 6 hot-pluggable, redundant power supplies and 9 hot-pluggable, N+1 redundant fan modules.
• A system front control panel featuring an LCD display, two keyboard/mouse USB connections, and
one video “crash cart” connection.
Figure 2: Dell PowerEdge M1000e Front View
Both Fabric B and Fabric C are supported through optional mezzanine cards on separate x8 PCI Express
lanes. Fabric B and C can each support 2 ports, and each port has 4 lanes (a lane consists of both transmit
and receive differential signals) connected from the mezzanine connector to the I/O module, as shown
in Figure 3 and Figure 4. The InfiniBand mezzanine card can be installed in either Fabric B or C on the
blades.
The Fabric B and C I/O modules receive 16 sets of signals, one set from each blade. The fabric TX and RX
differential pairs are the high-speed routing lanes, supporting 1.25 Gb/s to 10.3125 Gb/s.
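The per-port bandwidth behind these lane counts can be sketched in a few lines. This is an illustration using standard InfiniBand rates (a DDR lane signals at 5 Gb/s, and 8b/10b line encoding leaves 80% of the signaling rate as payload); the helper names below are ours, not Dell's or Mellanox's.

```python
# Per-port bandwidth arithmetic for InfiniBand lanes (standard rates;
# a sketch to accompany the midplane lane description above).

def port_rate_gbps(lanes: int, lane_signal_gbps: float) -> float:
    """Signaling rate of a port built from `lanes` differential pairs."""
    return lanes * lane_signal_gbps

def data_rate_gbps(signal_gbps: float) -> float:
    """Usable data rate after InfiniBand's 8b/10b line encoding."""
    return signal_gbps * 8 / 10

ddr_4x = port_rate_gbps(lanes=4, lane_signal_gbps=5.0)  # 4x DDR port
print(ddr_4x)                # 20.0 Gb/s signaling, as used by the mezzanine cards
print(data_rate_gbps(ddr_4x))  # 16.0 Gb/s of payload bandwidth
```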
Figure 3: Fabric B and C Midplane Connections
[Diagram: each blade (1 of 16) routes 4-lane transmit/receive differential pairs (16 lines total per fabric) through the midplane, from its Fabric B mezzanine to Fabric B I/O Modules 1 and 2 and from its Fabric C mezzanine to the Fabric C I/O modules.]
Figure 4: Fabric B and C Midplane Connections
[Diagram: block diagram of a half-height modular server (16 of 16). Each CPU connects through the MCH/IOH; the Fabric A LOM attaches over 4-8 PCIe lanes and the Fabric B and Fabric C mezzanines over x8 PCIe. From there, 1-2 lanes carry Fabric A to the Ethernet I/O modules (A1, A2), and 1-4 lanes carry each mezzanine port to the Fabric B (B1, B2) and Fabric C (C1, C2) I/O modules.]
Memory Slots: 8 fully buffered DIMMs (667 MHz)
Local Storage (Hard Disk Drive): 2 x 2.5-inch hot-pluggable SAS or SATA
Console: Virtual KVM through iDRAC
InfiniBand Configuration
Mellanox ConnectX IB MDI InfiniBand Host Channel Adapter (HCA) mezzanine cards are designed to
deliver low latency and high bandwidth to performance-driven server and storage clustering
applications in enterprise data center and high-performance computing environments.
The M2401G InfiniScale® III InfiniBand switch for the Dell M1000e is used to create reliable, scalable, and
easy-to-manage interconnect fabrics for compute, communication, storage, and embedded applications.
The switch has 24 ports: 16 internal 4x DDR downlinks and 8 external 4x DDR uplinks. The
M2401G InfiniScale III supports 20 Gb/s per 4X port and 60 Gb/s per 12X port, delivering 960 Gb/s of
aggregate bandwidth.
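The 960 Gb/s aggregate figure follows from counting both directions of every port, which is how switch vendors typically quote aggregate bandwidth. A minimal sketch of that arithmetic (the helper is ours, not from any vendor documentation):

```python
# Aggregate switching bandwidth: ports x per-port rate, counted in both
# directions (full duplex). 24 ports x 20 Gb/s x 2 = 960 Gb/s, matching
# the M2401G figure quoted above.

def aggregate_gbps(ports: int, per_port_gbps: float, full_duplex: bool = True) -> float:
    return ports * per_port_gbps * (2 if full_duplex else 1)

print(aggregate_gbps(24, 20.0))  # 960.0 -- the M2401G
print(aggregate_gbps(36, 40.0))  # 2880.0 -- the "nearly three terabits per second"
                                 # cited earlier for 36-port 40 Gb/s silicon
```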
Performance Study and Analysis
In this section we study the performance impact of the InfiniBand blocking factor on a high-performance
computing cluster of 32 M600 blades. The study consisted of running a synthetic cluster benchmark
suite known as the NAS Parallel Benchmarks (NPB) while varying the InfiniBand configuration from 0%
(non-blocking) to 75% blocking, as described below. Unless otherwise stated, the results are shown
normalized to the 50% blocking configuration, because the 50% configuration is the most natural one
that can be created using one switch per chassis with all 8 external 4x DDR ports.
Cluster Test Bed Configurations
Hardware Configurations
Each M600 blade has two PCI Express x8 mezzanine card slots (slot B and slot C). One of these slots was
populated with a dual-port Mellanox ConnectX mezzanine card operating at Double Data Rate (DDR)
with a 20 Gb/s signaling rate. The mezzanine cards ran firmware version 2.3.0.
The HCA ports on each blade connect to the internal InfiniBand switch through the chassis midplane.
The InfiniBand switch has 16 internal links, through which the blades connect, and 8 external links for
outside connection. Hence, when using a single switch within a chassis, InfiniBand traffic to a fabric
outside the chassis is 50% blocking, although traffic within the chassis is non-blocking.
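The 50% figure falls out of the port counts: 16 downlinks contend for 8 uplinks. A minimal sketch of that arithmetic (our own helper, not a formula from the switch documentation):

```python
# Blocking factor of a chassis switch: the share of host bandwidth that
# cannot leave the chassis simultaneously because uplinks are scarcer
# than the downlinks in use.

def blocking_percent(downlinks_in_use: int, uplinks: int) -> float:
    """0.0 means fully non-blocking (uplinks >= downlinks in use)."""
    if uplinks >= downlinks_in_use:
        return 0.0
    return 100.0 * (downlinks_in_use - uplinks) / downlinks_in_use

print(blocking_percent(16, 8))  # 50.0 -- one switch, all 16 blades, 8 uplinks
print(blocking_percent(16, 4))  # 75.0 -- only 4 uplinks cabled
print(blocking_percent(8, 8))   # 0.0  -- 8 blades per switch module
```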
Software Configurations
The cluster was deployed with Red Hat Enterprise Linux 4 Update 5 (errata kernel 2.6.9-55.0.12.ELsmp).
The driver stack used for this study was the Mellanox OpenFabrics Enterprise Distribution (OFED)
version 1.3.
Our studies were conducted using the NAS Parallel Benchmarks, a synthetic cluster benchmark suite.
The benchmarks were run with Open MPI 1.2.5, which comes precompiled and packaged with the
Mellanox OFED 1.3 stack.
Two external 24-port switches were used to create a non-blocking network between the 4 InfiniBand
switch modules in the two chassis. To ensure a non-blocking configuration from the blades to the I/O
modules, 8 of the blades in each chassis had their ConnectX HCAs in mezzanine slot B and the other 8
had ConnectX HCAs in mezzanine slot C. This configuration is illustrated in Figure 7.
Of the eight external connections on each switch, four go to each external 24-port switch. This helps
avoid network congestion caused by multiple hops or credit loop scenarios. A non-blocking configuration
can also be created by replacing the two 24-port switches with a single InfiniBand Large Port Count (LPC)
switch that supports 36 or more ports.
The 75% blocking configuration mirrors the 50% blocking configuration except that fewer cables connect
to the 24-port switch: it was achieved by using only 4 uplinks between the internal switches and the
external switch. These configurations are shown in Figure 8 and Figure 9.
Figure 8: Configuration of a 50% Blocking InfiniBand Cluster
[Diagram: an SFS 7000D external switch uplinked with 4 cables from the slot B switch module of each chassis.]
According to Ahmad Faraj and Xin Yuan [1], both CG and FT have a large volume of inter-node
communication, IS has a medium volume, and MG has relatively little inter-node communication. As
Figure 10 shows, the performance of the FT benchmark is significantly affected by variations in the
blocking factor. The FT benchmark comprises a number of messages sent in a collective communication
pattern; these messages are greatly affected by the reduced bandwidth across the bottleneck, as seen
in the graph.
The CG benchmark also involves a large volume of data communication, but it mainly calls point-to-point
routines. Hence, the graph suggests that for benchmarks that are both computation and communication
intensive, the performance impact can be tolerated, as seen between the 0% and 50% blocking factors.
The IS benchmark shows some change when going from 50% to 75% blocking. This benchmark has
significant collective communication and hence shows a greater effect of the blocking factor than the
MG benchmark, which has mainly point-to-point communication. Thus the MG benchmark shows no
significant impact or degradation in any configuration.
blocking factor. For benchmarks that are both communication and computation intensive and that have
a small volume of communication data across nodes, or that have a communication pattern that is
mainly point to point, the blocking factor may be of little importance.
However, it is possible that real-world commercial applications see much less impact of the blocking
factor on performance. The application's communication characteristics, as well as the distribution of
data between the nodes, govern the performance impact.
Thus it is recommended that application characteristics be used to design the appropriate IB fabric.
For certain bandwidth- and latency-sensitive applications it is imperative to use a completely non-
blocking configuration, as described in the “InfiniBand Blocking Configurations” section on page 14.
However, based on the results above, a 50% blocking configuration might provide the best
price/performance for commercial clusters or clusters with a mix of communication and computation.
This configuration also benefits from ease of design and management, as fewer modules, external
switches, and cables are needed.
References
1. Ahmad Faraj and Xin Yuan, “Communication Characteristics in the NAS Parallel Benchmarks,”
ACTA Press, 2002.
2. Jiuxing Liu, Balasubramanian Chandrasekaran, Jiesheng Wu, Weihang Jiang, Sushmitha Kini,
Weikuan Yu, Darius Buntinas, Peter Wyckoff, and D. K. Panda, “Performance Comparison of MPI
Implementations over InfiniBand, Myrinet and Quadrics,” in Proceedings of the 2003 ACM/IEEE
Conference on Supercomputing (SC ’03).