

Understanding the Performance Impact of Non-blocking vs. Blocking InfiniBand Cluster Configurations on Dell™ M-Series Blades

END-TO-END COMPUTING
By Munira Hussain, Vishvesh Sahasrabudhe, Bhavesh Patel (Dell)
and Gilad Shainer (Mellanox Technologies)

Dell │ Enterprise Solutions Engineering


www.dell.com/solutions
THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND
TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR IMPLIED WARRANTIES OF
ANY KIND.

Dell, the Dell logo, and PowerEdge are trademarks of Dell Inc.; Intel and Xeon are registered trademarks and Core is a trademark of Intel Corporation in the U.S. and other countries; ATI is a trademark of AMD; Microsoft and Windows are either trademarks or registered trademarks of Microsoft Corporation in the United States and/or other countries; Red Hat and Red Hat Enterprise Linux are registered trademarks of Red Hat, Inc.; SUSE is a registered trademark of Novell, Inc. in the United States and other countries.

Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks
and names or their products. Dell disclaims proprietary interest in the marks and names of others.

©Copyright 2008 Dell Inc. All rights reserved. Reproduction in any manner whatsoever without the express written
permission of Dell Inc. is strictly forbidden. For more information, contact Dell.

Information in this document is subject to change without notice.

Table of Contents
Introduction
An Overview of InfiniBand
    The Basics of the InfiniBand Fabric
    The InfiniBand Power Advantage
    The InfiniBand Performance Advantage
    InfiniBand Software Solutions
    InfiniBand's Growing Role in the Data Center
PowerEdge M1000e Architecture
    Midplane Fabric Connections
    I/O Communication Paths in Dell PowerEdge M1000e
    Server Blades with InfiniBand ConnectX Mezzanine Cards
    InfiniBand Configuration
Performance Study and Analysis
    Cluster Test Bed Configurations
        Hardware Configurations
        Software Configurations
    InfiniBand Blocking Configurations
        Fully Non-blocking Configuration
        50% and 75% Blocking
Benchmarking and Analysis: NAS Parallel Benchmark
Summary and Conclusion
References

Introduction
With the launch of the Dell™ PowerEdge™ M-Series modular enclosure, Dell released a new set of switches designed to provide more value and flexibility than previous generations. An increasingly important part of Dell's modular switch lineup, Dell's InfiniBand switch provides a low-latency, high-throughput option for many data centers and high performance computing clusters.

As InfiniBand increases its market presence, the need for InfiniBand module flexibility increases. On the earlier Dell PowerEdge 1955 enclosure, Dell provided an InfiniBand pass-through module that gave each blade a one-to-one path to the external InfiniBand infrastructure. To provide more flexibility and support different types of InfiniBand environments, the Dell M-Series supports an internal InfiniBand switch that provides one external port for every two servers per module.

This white paper demonstrates how a one-to-one, non-blocking InfiniBand architecture can still be built using Dell's new InfiniBand switch module.

An Overview of InfiniBand
As the I/O technology with the largest installed base of 10, 20, and 40 Gb/s ports in the market (over 3 million ports by the end of 2007), InfiniBand has clearly delivered the real-world benefits defined and envisioned by the InfiniBand Trade Association (www.InfiniBandta.org), an industry consortium formed in 1999. Several factors have enabled InfiniBand adoption in data centers and technical compute clusters to ramp quickly, and they explain why it will continue to be the performance computing and storage fabric of choice.

The Basics of the InfiniBand Fabric


InfiniBand fabrics are created with Host Channel Adapters (HCAs) and Target Channel Adapters (TCAs) that fit into servers and storage nodes and are interconnected by switches that tie all nodes together over a high-performance network fabric.

The InfiniBand Architecture is a fabric designed to meet the following needs:

• High-bandwidth, low-latency computing, storage, and management over a single fabric

• Cost-effective silicon and system implementations with an architecture that scales easily from generation to generation

• High reliability and availability, with scalability to tens of thousands of nodes

• Exceptionally efficient utilization of compute processing resources

• An industry-standard ecosystem of cost-effective hardware and software solutions

Figure 1: Typical InfiniBand Architecture

With a true cut-through forwarding architecture and a well-defined end-to-end congestion management protocol, InfiniBand defines cost-effective and scalable I/O solutions. Switch silicon devices support from twenty-four 20 Gb/s ports to thirty-six 40 Gb/s InfiniBand ports, which equates to nearly three terabits per second of aggregate switching bandwidth.
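As a quick sanity check of these figures, the sketch below (not from the original paper) computes aggregate switching bandwidth the way vendors conventionally quote it, counting both directions of every full-duplex port.

```python
# Back-of-the-envelope check of the aggregate switching bandwidth quoted above.
# Switch vendors typically count both directions of each full-duplex port.

def aggregate_bandwidth_gbps(ports: int, port_rate_gbps: float) -> float:
    return ports * port_rate_gbps * 2

print(aggregate_bandwidth_gbps(24, 20))  # 960 Gb/s for a 24-port 20 Gb/s (DDR) device
print(aggregate_bandwidth_gbps(36, 40))  # 2880 Gb/s, nearly 3 Tb/s, for a 36-port 40 Gb/s device
```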

Switches and adapters support up to 16 virtual lanes per link to enable granular segregation and
prioritization of traffic classes for delivering Quality of Service (QoS).

InfiniBand also defines an industry-standard implementation of Remote Direct Memory Access (RDMA) protocols and kernel bypass to minimize CPU overhead, allowing computing resources to be devoted to application processing rather than network communication.

InfiniBand is clearly driving the most aggressive performance roadmap of any I/O fabric, while remaining
affordable and robust for mass industry adoption.

The InfiniBand Power Advantage


InfiniBand technology provides not only a cost-effective, high-performance interconnect but also a very low-power solution, at 5 W or less per 20 Gb/s or 40 Gb/s InfiniBand port. Coupled with its high performance and its ability to consolidate clustering, networking, and storage traffic, a single InfiniBand adapter can replace multiple legacy clustering, Ethernet, and Fibre Channel adapters, providing significant power savings in the data center. These advantages make InfiniBand a vital interconnect for server blades.

The InfiniBand Performance Advantage
One of the key reasons that data centers are deploying industry-standard InfiniBand is the total application-level performance the fabric enables. First, InfiniBand is the only shipping solution that supports 20 Gb/s and 40 Gb/s host connectivity and 60 Gb/s and 120 Gb/s switch-to-switch links. Second, InfiniBand has world-class application latency, with measured delays of 1 μs end to end. Third, InfiniBand enables efficient use of all of the processors and memory in the network by offloading the data transport mechanisms to the adapter card and reducing memory copies. These three metrics combine to make InfiniBand one of the industry's most powerful interconnects.

The performance benefits are echoed in the trends of the Top500.org list that tracks the world’s most
powerful supercomputers. Published twice a year, this list is increasingly used as an indication of what
technologies are emerging in the clustered and supercomputing arena.

InfiniBand Software Solutions


Open source, community-wide development of interoperable, standards-based Linux and Microsoft® Windows® stacks is managed through the OpenFabrics Alliance. This alliance, consisting of solution providers, end users, and programmers interested in furthering development of the Linux and Windows stacks, has successfully driven InfiniBand support into the Linux kernel and gained WHQL qualification for Microsoft Windows Server®. The inclusion of InfiniBand drivers and upper-layer protocols in the Linux kernel ensures interoperability between different vendor solutions and eases the deployment of InfiniBand fabrics in heterogeneous environments.

From an application point of view, InfiniBand has support for a plethora of applications in both
enterprise and high-performance computing environments. In the enterprise environment, InfiniBand is
being used for grid computing and clustered database applications driven by market leaders. In the
commercial high-performance computing field, InfiniBand provides the fabric connecting servers and
storage to address a wide range of applications including oil and gas exploration, automotive crash
simulations, digital media creation, fluid dynamics, drug research, weather forecasting, and molecular modeling, to name a few.

InfiniBand's Growing Role in the Data Center


Data centers simultaneously run multiple applications and need to dynamically reallocate compute
resources between applications depending on end user workload. To meet these needs the network
fabric must seamlessly support compute, storage, inter-process communication, and management
traffic.

The emergence of virtual and grid computing solutions, in addition to robust software solutions, has set the stage for mass deployment of InfiniBand in business and utility computing environments.

Industry-standard InfiniBand has the performance, proven reliability, manageability, and widely available software solutions that make it ready for prime time.

PowerEdge M1000e Architecture
The Dell PowerEdge M1000e modular server enclosure is a breakthrough in enterprise server architecture. The enclosure and its components spring from a ground-up design incorporating the latest advances in power, cooling, I/O, and management technologies. These technologies are packed into a highly available, rack-dense package that integrates into standard Dell and third-party 19-inch racks.

The PowerEdge M1000e enclosure is 10U high and provides the following features:

• Up to 16 server modules.

• Up to 6 network and storage I/O interconnect modules.

• A high-speed passive midplane that connects the server modules in the front to the power, I/O, and management infrastructure in the rear of the enclosure.

• Comprehensive I/O options that support dual links of 20 Gb/s today (with 4x DDR InfiniBand), with future support for even higher-bandwidth I/O devices when those technologies become available. This support provides high-speed server module connectivity to the network and storage today and well into the future.

• Thorough power management capabilities, including shared power delivery that makes the full capacity of the power supplies available to all server modules.

• Robust management capabilities, including private Ethernet, serial, USB, and low-level management connectivity between the Chassis Management Controller (CMC), the keyboard/video/mouse switch, and the server modules.

• Up to two Chassis Management Controllers (the first CMC is standard; a second provides optional redundancy) and one optional integrated Keyboard/Video/Mouse (iKVM) switch.

• Up to 6 hot-pluggable, redundant power supplies and 9 hot-pluggable, N+1 redundant fan modules.

• A system front control panel featuring an LCD display, two keyboard/mouse USB connections, and one video "crash cart" connection.

Figure 2: Dell PowerEdge M1000e Front View

Midplane Fabric Connections


Dell M1000e blades support three fabrics: Fabric A, B, and C. Fabric A consists of the dual integrated 1 Gb Ethernet controllers connected directly from the blade to I/O modules A1 and A2 in the rear of the enclosure. Fabric A is always an Ethernet fabric and is not discussed further in this document.

Fabrics B and C are supported through optional mezzanine cards on separate x8 PCI Express connections. Each of Fabric B and C supports two ports, and each port has 4 lanes (a lane consists of both transmit and receive differential pairs) connected from the mezzanine connector to the I/O module, as shown in Figure 3 and Figure 4. The InfiniBand mezzanine card can be installed in either Fabric B or C on the blades.

The Fabric B and C I/O modules each receive 16 sets of signals, one set from each blade. The fabric TX and RX differential pairs are the high-speed routing lanes, supporting signaling rates from 1.25 Gb/s to 10.3125 Gb/s.

The fabrics internally support a bit error rate (BER) of 10^-12 or better.
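For reference, the per-port bandwidth of these 4x DDR links can be worked out from the lane count and signaling rate. The short sketch below assumes the standard InfiniBand DDR figures of 5 Gb/s signaling per lane with 8b/10b line encoding; it is illustrative, not taken from the paper.

```python
# Per-port bandwidth of a 4x DDR InfiniBand link routed across the midplane.
LANES_PER_PORT = 4
DDR_SIGNALING_GBPS_PER_LANE = 5.0   # DDR signaling rate per lane
ENCODING_EFFICIENCY = 8 / 10        # 8b/10b line encoding

signaling_rate = LANES_PER_PORT * DDR_SIGNALING_GBPS_PER_LANE  # 20 Gb/s, as quoted for 4x DDR
data_rate = signaling_rate * ENCODING_EFFICIENCY               # 16 Gb/s of usable data bandwidth
print(signaling_rate, data_rate)
```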

Figure 3: Fabric B and C Midplane Connections
[Diagram: for each of the 16 blades, the Fabric B mezzanine connects through the midplane over 4-lane transmit/receive differential pairs (16 lines total) to Fabric B I/O Modules 1 and 2, and the Fabric C mezzanine connects likewise to Fabric C I/O Modules 1 and 2.]

I/O Communication Paths in Dell PowerEdge M1000e


The InfiniBand mezzanine cards supported in the blades use the dual-port Mellanox® ConnectX® chipset. As shown in Figure 4, Port 1 of an InfiniBand mezzanine card in Fabric B communicates with the InfiniBand switch inserted in the B1 I/O slot in the rear of the chassis, and Port 2 of the same card communicates with the InfiniBand switch in the B2 I/O slot. A second InfiniBand switch is not required unless it is needed for additional performance or redundancy, as described in this paper.

An InfiniBand mezzanine card in Fabric C has a similar communication path.
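The wiring described above can be restated as a simple lookup table. The sketch below is purely illustrative and mirrors the mapping shown in Figure 4; the Fabric C entries assume the "similar communication path" noted above.

```python
# Mezzanine-port-to-I/O-slot mapping described in this section (illustrative only).
MEZZ_PORT_TO_IO_SLOT = {
    ("Fabric B", 1): "I/O slot B1",
    ("Fabric B", 2): "I/O slot B2",
    ("Fabric C", 1): "I/O slot C1",  # assumes Fabric C mirrors Fabric B
    ("Fabric C", 2): "I/O slot C2",
}

def io_slot(fabric: str, mezz_port: int) -> str:
    """Return the rear I/O slot that a given mezzanine port connects to."""
    return MEZZ_PORT_TO_IO_SLOT[(fabric, mezz_port)]

print(io_slot("Fabric B", 2))  # I/O slot B2
```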

Figure 4: Fabric B and C Midplane Connections
[Diagram: in each half-height modular server (1 of 16), the CPUs connect through the MCH/IOH to the Fabric A LOM (4-8 lane PCIe) and to the Fabric B and Fabric C mezzanine cards (8-lane PCIe each); the midplane routes Fabric A (1-2 lanes) to the A1/A2 Ethernet I/O modules and Fabrics B and C (1-4 lanes per port) to the B1/B2 and C1/C2 I/O modules, which provide the external fabric connections.]

Server Blades with InfiniBand ConnectX Mezzanine Cards


Dell PowerEdge M600 server blades were used for the performance testing described in this paper. The PowerEdge M600 is based on Intel® processors and chipsets, and its specifications are shown in Table 1.

Table 1: PowerEdge M600 Specifications

Server Module: PowerEdge M600
Processor: Intel Xeon® Dual and Quad Core™; 40 W, 65 W, 80 W, and 120 W options
Chipset: Intel 5000P
Memory Slots: 8 fully buffered DIMMs (667 MHz)
Memory Capacity: 64 GB (8 GB x 8)
Integrated Ethernet Controllers (Fabric A): 2 x Broadcom 5708S GbE with hardware TCP/IP offload engine and iSCSI firmware boot; optional upgrade to full iSCSI offload available with a license key
Fabric Expansion: Support for up to 2 x 8-lane PCIe mezzanine cards (Fabric B and C in Figure 3 and Figure 4): dual-port 4x DDR InfiniBand; dual-port 4 Gb and 8 Gb Fibre Channel; dual-port 10GbE (to be available); dual-port 1GbE
Baseboard Management: iDRAC with IPMI 2.0 + vMedia + vKVM
Local Storage Controller Options: Serial Advanced Technology Attachment (SATA), chipset based with no Redundant Array of Independent Disks (RAID) or hot-plug capability; Serial Attached SCSI (SAS) 6/iR (RAID 0/1); Cost Effective RAID Controller (CERC) 6/i (RAID 0/1 with cache)
Local Storage Hard Disk Drive (HDD): 2 x 2.5-inch hot-pluggable SAS or SATA
Video: ATI™ RN50
USB: 2 x USB 2.0 bootable ports on front panel for floppy/CD/DVD/memory
Console: Virtual KVM through iDRAC; IPMI Serial over LAN (SoL) through iDRAC; rear-mounted iKVM switch ports (tierable); front KVM ports on modular enclosure control panel
Operating Systems: Red Hat® Enterprise Linux® 4/5, SUSE® Linux Enterprise Server 9/10

InfiniBand Configuration
Mellanox ConnectX IB MDI InfiniBand Host Channel Adapter (HCA) mezzanine cards are designed to deliver the low-latency, high-bandwidth performance required by server and storage clustering applications in enterprise data center and high-performance computing environments.

Figure 5: Mellanox ConnectX HCA

The M2401G InfiniScale® III InfiniBand switch for the Dell M1000e is used to create reliable, scalable, and easy-to-manage interconnect fabrics for compute, communication, storage, and embedded applications. The switch has 24 ports: 16 internal 4x DDR downlinks and 8 external 4x DDR ports. The M2401G InfiniScale III supports 20 Gb/s per 4X port and 60 Gb/s per 12X port, delivering 960 Gb/s of aggregate bandwidth.
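The 16-internal/8-external port split is what drives the blocking behavior analyzed later in this paper. The sketch below (an illustration, not code from the paper) shows the resulting downlink versus uplink capacity of a single switch module at 4x DDR rates.

```python
# Downlink vs. uplink capacity of a single M2401G switch module at 4x DDR rates.
PORT_RATE_GBPS = 20   # 4x DDR
INTERNAL_PORTS = 16   # one downlink per blade
EXTERNAL_PORTS = 8    # uplinks to the external fabric

downlink_capacity = INTERNAL_PORTS * PORT_RATE_GBPS  # 320 Gb/s toward the blades
uplink_capacity = EXTERNAL_PORTS * PORT_RATE_GBPS    # 160 Gb/s toward the external fabric
print(downlink_capacity, uplink_capacity, downlink_capacity / uplink_capacity)  # 320 160 2.0
```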

Figure 6: InfiniBand Switch

Performance Study and Analysis
In this section we study the performance impact of the InfiniBand blocking factor on a high performance computing cluster of 32 M600 blade nodes. The study consists of running a synthetic cluster benchmark suite known as the NAS Parallel Benchmark (NPB) while varying the InfiniBand configuration from 0% (fully non-blocking) to 75% blocking, as described below. Unless otherwise stated, the results are shown normalized to the 50% blocking configuration, because the 50% configuration is the most natural one that can be created using one switch per chassis with all 8 external 4x DDR ports cabled.
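The blocking percentages used throughout this study follow directly from the ratio of uplinks to blade-facing downlinks on each switch module. The sketch below is an interpretation of the configurations described in the next sections, not code from the paper.

```python
# Blocking factor as the share of blade-side bandwidth that cannot leave the
# chassis at the same time (all ports run at the same 4x DDR rate).
def blocking_pct(downlinks_in_use: int, uplinks_in_use: int) -> float:
    return max(0.0, (1 - uplinks_in_use / downlinks_in_use) * 100)

print(blocking_pct(8, 8))   #  0.0 -> two switches per chassis, 8 blades and 8 uplinks each
print(blocking_pct(16, 8))  # 50.0 -> one switch per chassis, all 8 external ports cabled
print(blocking_pct(16, 4))  # 75.0 -> one switch per chassis, only 4 external ports cabled
```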

Cluster Test Bed Configurations


Hardware Configurations
The cluster consisted of 32 InfiniBand-connected nodes spread across two fully populated M1000e enclosures, with sixteen blades in each chassis. Each blade was configured with quad-core Intel® Xeon® E5450 processors running at 3.00 GHz and 16 GB of 667 MHz SDRAM (2 GB of memory per core within a node). The nodes were configured with the latest available BIOS, version A2.0.2.

Each M600 blade has two PCI Express x8 mezzanine card slots (slot B and slot C). One of these slots was populated with a dual-port Mellanox ConnectX mezzanine card operating at Double Data Rate (DDR) speed with a 20 Gb/s signaling rate. The mezzanine cards used firmware version 2.3.0.

The HCA ports on each blade were connected to the internal InfiniBand switch through the chassis midplane. The InfiniBand switch has 16 internal links through which the blades connect and 8 external links for outside connection. Hence, when a single switch is used within a chassis, InfiniBand traffic to a fabric outside the chassis is 50% blocking, while traffic within the chassis remains non-blocking.

Software Configurations
The cluster was deployed with Red Hat Enterprise Linux 4 Update 5 with errata kernel 2.6.9-55.0.12.ELsmp. The driver stack used for this study was the Mellanox OpenFabrics Enterprise Distribution (OFED) version 1.3.

Our studies were conducted using the NAS Parallel Benchmark, a synthetic cluster benchmark suite. The benchmarks were run with OpenMPI 1.2.5, which comes precompiled and packaged with the Mellanox OFED 1.3 stack.
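For readers who want to reproduce a similar run, a minimal launcher might look like the sketch below. The NPB binary names follow the standard <benchmark>.<class>.<nprocs> convention, but the problem class, binary path, and hostfile are placeholders rather than details taken from this study.

```python
import subprocess

# Hypothetical launcher for the NPB MPI binaries (paths, class, and hostfile are placeholders).
BENCHMARKS = ["cg", "ft", "is", "mg"]

def run_npb(bench: str, nprocs: int = 32, npb_class: str = "C",
            hostfile: str = "hosts.txt") -> None:
    binary = f"./bin/{bench}.{npb_class}.{nprocs}"
    cmd = ["mpirun", "-np", str(nprocs), "-hostfile", hostfile, binary]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    for bench in BENCHMARKS:
        run_npb(bench)
```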

InfiniBand Blocking Configurations


Fully Non-blocking Configuration
The first test was to create a fully non-blocking configuration using two InfiniBand switches in each chassis, one in I/O slot B1 and one in I/O slot C1 in the rear of the chassis. Two external 24-port switches were used to create a non-blocking network between the four InfiniBand switch modules in the two chassis. To ensure a non-blocking configuration from the blades to the I/O modules, 8 of the blades in each chassis had their ConnectX HCAs in mezzanine slot B and the other 8 had their ConnectX HCAs in mezzanine slot C. This configuration is illustrated in Figure 7.

Of the eight external connections on each switch module, four go to each external 24-port switch. This helps avoid network congestion caused by multiple hops or credit-loop scenarios. A non-blocking configuration can also be created by replacing the two 24-port switches with a single InfiniBand large port count (LPC) switch that supports 36 or more ports.

Figure 7: Configuration of a Fully Non-blocking InfiniBand Cluster

50% and 75% Blocking


The second test case was configured as a 50% blocking fabric. It was created by populating mezzanine slot B on all blades, using a single InfiniBand switch in I/O module slot B1 of each chassis, and using one external 24-port switch. We then connected all 8 external ports on each internal switch to the external 24-port switch. InfiniBand traffic is therefore restricted to half the blade-side bandwidth when communicating outside the chassis.

The 75% blocking configuration is similar to the 50% blocking configuration, except that fewer cables connect to the 24-port switch: only 4 uplinks were used between each internal switch and the external switch. These configurations are shown in Figure 8 and Figure 9.

Figure 8: Configuration of a 50% Blocking InfiniBand Cluster

Figure 9: Configuration of a 75% Blocking InfiniBand Cluster

[Diagram: four cables from the slot B switch module in each chassis connect to an external SFS 7000D switch.]

Benchmarking and Analysis: NAS Parallel Benchmark


The NAS Parallel Benchmark (NPB), developed at the NASA Ames Research Center, has been widely used to measure, compare, and understand the characteristics of HPC clusters from both a computational and a communication angle. It is a collection of programs derived from Computational Fluid Dynamics (CFD) codes. A detailed description of the benchmarks that make up NPB can be found at http://www.nas.nasa.gov/Resources/Software/npb.html. For the purposes of this study, NPB version 3.2 was used, and the Conjugate Gradient (CG), Fourier Transform (FT), Integer Sort (IS), and Multi Grid (MG) benchmarks were studied.

According to Ahmad Faraj and Xin Yuan [1], both CG and FT have a large volume of inter-node communication, IS has a medium volume, and MG has a relatively small volume of inter-node communication. As Figure 10 shows, the performance of the FT benchmark is significantly affected by variations in the blocking factor. The FT benchmark consists of a large number of messages sent in a collective communication pattern, and these messages are greatly affected by the reduced bandwidth across the bottleneck, as seen in the graph.

The CG benchmark also moves a large volume of data, but it mainly calls point-to-point routines. The graph suggests that for benchmarks that are both computation and communication intensive, the performance impact is tolerable between the 0% and 50% blocking configurations.

The IS benchmark shows some change when going from 50% blocking to 75% blocking. This benchmark has significant collective communication and hence is more sensitive to the blocking factor than the MG benchmark, which has mainly point-to-point communication. The MG benchmark shows no significant impact or degradation in any configuration.
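The sensitivity difference between collective-heavy and point-to-point-heavy benchmarks can be illustrated with a small MPI sketch (not part of the study; it assumes mpi4py and NumPy are available). An all-to-all exchange, like the transposes in FT, forces most traffic across the chassis uplinks, while a neighbour exchange, closer to MG's pattern, keeps most traffic local.

```python
# Illustrative contrast between an FT-like collective pattern and an MG-like
# point-to-point pattern. Run with: mpirun -np <N> python this_script.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1 << 16  # elements exchanged per peer

# FT-like pattern: dense collective communication (all-to-all).
send = np.full(n * size, rank, dtype=np.float64)
recv = np.empty(n * size, dtype=np.float64)
comm.Alltoall(send, recv)

# MG-like pattern: point-to-point exchange with a single neighbour in a ring.
peer = (rank + 1) % size
buf_out = np.full(n, rank, dtype=np.float64)
buf_in = np.empty(n, dtype=np.float64)
comm.Sendrecv(buf_out, dest=peer, recvbuf=buf_in, source=(rank - 1) % size)

if rank == 0:
    print("all-to-all and neighbour exchange completed on", size, "ranks")
```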

Figure 10: Effect of IB Blocking Factor on Various NAS Benchmarks

Summary and Conclusion


This study showed the impact of the blocking factor on the performance of various applications run on the Dell PowerEdge blade chassis. The results show a significant impact on the performance of applications that are highly communication intensive: the bandwidth-intensive NAS benchmarks with collective communication patterns are strongly affected by the restriction in cluster bandwidth caused by the blocking factor. For benchmarks that are both communication and computation intensive, that move a small volume of data between nodes, or that have a communication pattern that is mainly point to point, the blocking factor may be of little importance.

It is also possible that real-world commercial applications see much less impact from the blocking factor. The communication characteristics of the application, as well as the distribution of data between the nodes, govern the performance impact.

Thus it is recommended that the application characteristics be used to design the appropriate InfiniBand fabric. For certain bandwidth- and latency-sensitive applications it is imperative to use a fully non-blocking configuration, as described in the "InfiniBand Blocking Configurations" section above. However, based on the results above, a 50% blocking configuration might provide the best price/performance for commercial clusters or clusters with a mix of communication and computation. This configuration also benefits from ease of design and management, since fewer modules, external switches, and cables are used.

References
1. Ahmad Faraj and Xin Yuan, "Communication Characteristics in the NAS Parallel Benchmarks," ACTA Press, 2002.

2. Jiuxing Liu, Balasubramanian Chandrasekaran, Jiesheng Wu, Weihang Jiang, Sushmitha Kini, Weikuan Yu, Darius Buntinas, Peter Wyckoff, and D. K. Panda, "Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics," Proceedings of the 2003 ACM/IEEE Conference on Supercomputing (SC '03).

