
CAP3 - InfiniBand

InfiniBand Architecture Overview

Mission

Garner industry agreement on, and development of, a new channel-based, switched-fabric server I/O for the data center, clustering, and Internet computing environments.

Why InfiniBand?

Increased Application Performance
- High Bandwidth Fabric Interconnects
- Low Latency InfiniBand Protocol
- Channel I/O

Higher Data Center Reliability
- Multiple Levels of Redundancy

Enhanced Density
- Dense Rack-Mounted Servers
- Blade Server, Compute Brick Enablement

Why InfiniBand? Cont.

Customer Benefits

- Independent Scaling of Data Center Resources
- Decreased Cabling Requirements
- Increased Data Center Flexibility & Performance
- Sharing of I/O Resources
- Standards-Based Clustering Interconnect

A New Generation Computing, Communication and I/O Architecture


- Future I/O + Next Generation I/O => IBA Trade Organization
- Goal: to design a scalable, high-performance communication and I/O architecture by taking an integrated view of computing, networking, and storage technologies
- InfiniBand Architecture releases: 2000/2002/2006; last release (1.2.1): January 2008
- www.infinibandta.org

Infiniband Architecture - IBA


- IBA defines a System Area Network for connecting processing and I/O nodes
- Communications and management infrastructure for IPC and I/O
- Switched, channel-based interconnection fabric
- Consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and routers

A Typical IBA SAN

Topologies and components

- At a high level: an interconnect for endnodes
- Each node can be a processor node, an I/O unit, and/or a router to another network
- The IBA network is subdivided into subnets interconnected by routers
- Endnodes may attach to a single subnet or to multiple subnets
- Multiple links can exist between any two IBA devices


Components: Channel Adapter


- Used by processing and I/O units to connect to the fabric
- Programmable DMA engines with protection features
- One or more ports: LID, buffers
- Memory Translation and Protection: translates virtual addresses to physical addresses, validates access rights
- Host Channel Adapters (HCA), Target Channel Adapters (TCA)


Channel Adapters: Virtual Lanes


- VL = set of transmit & receive buffers in a port
- Multiple VLs within the same physical link
- Separate buffers and flow control per VL
- VL15 - management


Components: switches and routers, links


- Relay packets from one link to another
- Switches: intra-subnet
- Switches implement a fat-tree topology: at any given level of the switch topology, the amount of bandwidth connected to the downstream end nodes is identical to the amount of the upstream path toward the interconnect
- Routers: inter-subnet
- May support unicast/multicast
- Links: optical, copper, or printed-circuit wiring on a backplane
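The fat-tree property above can be sketched as a check that every switch level offers as much bandwidth upward as downward. This is an illustrative model, not IBA code; the `is_fat_tree` helper and the link counts are invented for the example.

```python
# Toy model: a fabric level is (downstream_links, upstream_links, link_gbps).
# In a fat tree, downstream and upstream bandwidth match at every level,
# so the fabric is non-blocking.

def is_fat_tree(levels):
    return all(down * gbps == up * gbps for down, up, gbps in levels)

# A 2-level fabric of 4X links (8 Gbps data rate each): balanced at both levels.
fabric = [(12, 12, 8), (24, 24, 8)]
assert is_fat_tree(fabric)

# An oversubscribed level (24 links down, only 12 up) breaks the property.
assert not is_fat_tree([(24, 12, 8)])
```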


Components: Management Entities

Subnet Manager (SM)
- configuration, management, ID assignment, link failures
- one master subnet manager per subnet
- Subnet Management Agent (SMA): one per node; communicates with the SM through the SMI
General Services (GS)
- chassis management
- General Service Agent (GSA): communicates with the GS through the GSI
Virtual Lane VL15


IBA Overview

- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services


Communication Stack


Communication Model

- Consumers queue up a set of instructions that the hardware executes => work queue (WQ)
- A work queue is associated with one or more preregistered buffers
- Work queues are created in pairs => queue pair (QP)
- Queue pair = send queue + receive queue
- Completions may be signaled or unsignaled => completion queue (CQ)


Communication Model

- The consumer submits a work request (WR) => an instruction called a Work Queue Element (WQE) is placed on the appropriate WQ
- The channel adapter executes WQEs in the order they were placed on the WQ
- When the channel adapter completes a WQE, a Completion Queue Element (CQE) is placed on a completion queue
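The WR -> WQE -> CQE flow can be sketched as a toy model. The class and method names below are invented for illustration; this is not the verbs API.

```python
from collections import deque

# Toy model: a queue pair executes work-queue elements in posting order
# and places a completion-queue element on the CQ for each one.

class QueuePair:
    def __init__(self, cq):
        self.send_queue = deque()   # WQEs posted by the consumer
        self.recv_queue = deque()
        self.cq = cq                # completion queue shared with the consumer

    def post_send(self, wr):
        self.send_queue.append(wr)  # the work request becomes a WQE on the WQ

    def process(self):
        # The channel adapter executes WQEs in the order they were posted.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.cq.append(("complete", wqe))  # CQE signals completion

cq = deque()
qp = QueuePair(cq)
qp.post_send("msg-1")
qp.post_send("msg-2")
qp.process()
assert [c[1] for c in cq] == ["msg-1", "msg-2"]  # completions arrive in order
```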


Communication Operations

Channel Semantics
- Send/Recv model
- one party pushes the data; the destination party determines the final destination of the data
Memory Semantics
- the initiating party reads/writes the virtual address space of the remote node
- RDMA Read
- RDMA Write
- RDMA Atomic Operations (e.g. Fetch & Add, Compare & Swap)
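The two atomic operations named above can be illustrated by their semantics on a remote 64-bit location: both return the original value, and Compare & Swap writes only when the comparison matches. A minimal sketch with plain functions, not real RDMA:

```python
# Toy semantics of the IBA atomic operations on a 64-bit memory word.

MASK64 = 0xFFFFFFFFFFFFFFFF

def fetch_and_add(memory, addr, add):
    orig = memory[addr]
    memory[addr] = (orig + add) & MASK64  # wraps like a 64-bit register
    return orig                           # original value is returned

def compare_and_swap(memory, addr, compare, swap):
    orig = memory[addr]
    if orig == compare:                   # write only on match
        memory[addr] = swap
    return orig

mem = {0x1000: 7}
assert fetch_and_add(mem, 0x1000, 3) == 7 and mem[0x1000] == 10
assert compare_and_swap(mem, 0x1000, 10, 99) == 10 and mem[0x1000] == 99
assert compare_and_swap(mem, 0x1000, 10, 0) == 99 and mem[0x1000] == 99  # no match, no write
```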


Case of RDMA (1)

Existing stacks
- excessive CPU utilization from copying data between kernel and application memory
RDMA - Remote Direct Memory Access
- data is transferred from the memory of one device to the memory of another device without passing through either device's CPU and without calls into the OS kernel

Case of RDMA (2)


Case of RDMA (3)


- Registering memory -> beginning virtual address, size, key
- Accessible locations are identified by address + key for protection
- RDMA Read/Write is zero-copy: the NIC transfers data directly to/from application memory
- Reduced demand on the CPU
- Kernel bypass: the application issues commands to the NIC without executing a kernel call
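The registration-and-key check above can be modeled minimally. The `register` and `rdma_allowed` helpers are hypothetical; they sketch only the range-and-key validation a channel adapter performs before touching memory.

```python
import itertools

# Toy model: a registered region is (base virtual address, size), named by
# a key. A remote access is allowed only if the key exists and the range
# [addr, addr+length) lies entirely inside the registered region.

_keys = itertools.count(1)
regions = {}  # key -> (base, size)

def register(base, size):
    key = next(_keys)
    regions[key] = (base, size)
    return key

def rdma_allowed(key, addr, length):
    if key not in regions:
        return False                      # unknown key: access denied
    base, size = regions[key]
    return base <= addr and addr + length <= base + size

rkey = register(0x10000, 4096)
assert rdma_allowed(rkey, 0x10000, 4096)
assert not rdma_allowed(rkey, 0x10000, 8192)    # runs past the region
assert not rdma_allowed(rkey + 1, 0x10000, 16)  # wrong key
```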


Layered Architecture

Physical Layer
- 3 link speeds: 1X (2 Gbps), 4X (8 Gbps), 12X (24 Gbps)
- link: 4-wire serial differential connection
- 1X link: 2.5 Gbps signaling -> 8b/10b encoding -> 2.0 Gbps data rate
- links can be aggregated in units of 4 or 12, called 4X or 12X
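The 8b/10b arithmetic works out as a quick sanity check: every 10 bits on the wire carry 8 data bits, so each 2.5 Gbps lane delivers 2.0 Gbps of data.

```python
# 1X signaling rate is 2.5 Gbps; 8b/10b encoding keeps 8 of every 10 bits.
SIGNAL_1X = 2.5  # Gbps on the wire per lane

def data_rate(width):
    return width * SIGNAL_1X * 8 / 10  # effective data rate in Gbps

assert data_rate(1) == 2.0    # 1X
assert data_rate(4) == 8.0    # 4X
assert data_rate(12) == 24.0  # 12X
```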


Layered Architecture (cont.)

Network Layer
- how packets are routed between subnets
Upper Layer
- allows support of protocols such as IP and SCSI
- defines messages/protocols for management functions


Layered Architecture (cont.)

Link Layer
- data integrity: 2 CRCs (link-level integrity between hops, end-to-end data integrity)
- credit-based flow control
Example: MTS14400 switch
- 12 4X or 4 12X, internal connection to the spine
- non-blocking, ultra-low switching latency (<200 nanosec)
- 2.88 Terabit switch
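Credit-based flow control can be sketched as follows. The `Link` class is an invented illustration of the idea that a sender transmits only while it holds receiver-advertised buffer credits, so packets are never dropped for lack of buffer space.

```python
# Toy model of credit-based flow control on one link.

class Link:
    def __init__(self, credits):
        self.credits = credits     # buffer credits the receiver advertised
        self.delivered = 0

    def send(self, packets):
        sent = min(packets, self.credits)  # sender stalls once credits run out
        self.credits -= sent
        self.delivered += sent
        return packets - sent              # packets still waiting to be sent

    def replenish(self, freed):
        self.credits += freed              # receiver freed buffers, new credits

link = Link(credits=4)
waiting = link.send(6)                     # only 4 credits: 2 packets wait
assert (link.delivered, waiting) == (4, 2)
link.replenish(2)                          # receiver drains its buffers
assert link.send(waiting) == 0 and link.delivered == 6
```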


IBA Overview

- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services


Transport Layer

Transport services
- reliable/unreliable connection
- reliable/unreliable datagram
Connection-oriented: each QP is associated with exactly one remote consumer; the QP context is configured with the identity of the remote consumer's QP
Datagram service: the QP is not tied to a single remote consumer; rather, information in the WQE identifies the destination
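The addressing contrast between the two service types can be sketched with invented class names (this is not the verbs API): a connected QP is bound to one peer at setup time, while a datagram QP takes the destination from each work request.

```python
# Toy contrast of connected vs. datagram addressing.

class ConnectedQP:
    def __init__(self, remote_qp):
        self.remote_qp = remote_qp          # fixed peer, held in the QP context

    def send(self, payload):
        return (self.remote_qp, payload)    # destination is implicit

class DatagramQP:
    def send(self, payload, dest_qp):
        return (dest_qp, payload)           # destination named in each WQE

rc = ConnectedQP(remote_qp=7)
assert rc.send("a") == (7, "a")

ud = DatagramQP()
assert ud.send("b", dest_qp=3) == (3, "b")
assert ud.send("c", dest_qp=9) == (9, "c")  # a different peer per WQE
```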


IBA Overview

- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services


Keys

- Assigned by administrative entities
- Used in messages
- Management key, baseboard management key, partition key, queue key, memory key


Virtual Memory Addressing


- Consumers use virtual addresses in work requests
- Channel adapters do virtual-to-physical address translation
- Requires consumers to register memory with the CA beforehand
- L_Key: used in each WQ that requires a memory access to the region
- R_Key: for RDMA, the consumer passes the key and a virtual address or buffer in that memory region to another consumer


Protection Domains
- Allow a consumer to control which set of QPs can access which registered memory regions
- QPs are allocated to a protection domain
- An L_Key and R_Key for a particular memory region are only valid on QPs created for the same protection domain
- Memory is registered to a protection domain
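The protection-domain rule can be modeled minimally with hypothetical helpers (not a real API): a key grants access only when the QP and the memory region were created in the same domain.

```python
# Toy model: keys are valid only within one protection domain.

class ProtectionDomain:
    def __init__(self, name):
        self.name = name

def create_qp(pd):
    return {"pd": pd}            # the QP remembers its protection domain

def register_mr(pd, key):
    return {"pd": pd, "key": key}  # so does the memory region

def access_ok(qp, mr):
    return qp["pd"] is mr["pd"]  # key check fails across domains

pd_a, pd_b = ProtectionDomain("A"), ProtectionDomain("B")
qp = create_qp(pd_a)
assert access_ok(qp, register_mr(pd_a, key=0x11))
assert not access_ok(qp, register_mr(pd_b, key=0x22))  # wrong domain
```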


IBA Overview

- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services


Virtual Lanes

- VL = set of transmit & receive buffers in a port
- Multiple VLs within the same physical link
- Separate buffers and flow control per VL
- VL15 - management


QoS Mechanisms, Multicast

Service Levels
- packets can operate at one of 16 different SLs
Multicast
- switches/routers replicate packets
- interface for a multicast group management protocol
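The relation between service levels and virtual lanes can be sketched with a small mapping table. The number of data VLs per port is an assumption here, and VL15 stays reserved for subnet management.

```python
# Toy SL-to-VL table: each packet carries one of 16 service levels; a port
# maps the SL onto one of its data VLs. VL15 is never assigned by this table.

DATA_VLS = 4  # assumed: this port implements 4 data VLs
sl_to_vl = {sl: sl % DATA_VLS for sl in range(16)}

assert all(0 <= vl < DATA_VLS for vl in sl_to_vl.values())
assert 15 not in sl_to_vl.values()  # management VL untouched
```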


IBA Overview

- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services


Subnet Management

- Subnet Manager (SM), Subnet Manager Agent (SMA)
- Subnet Management Packets operate on QP0
- Each node contains an SMA; each port has a QP0
- One SM per subnet; it can reside in any node
- General Services, General Service Agent
- General Management Packets operate on QP1


Rich set of Features

High-Performance Data Transfer
- IPC and I/O
- low latency, high bandwidth, low CPU utilization
- multiple transport services -> flexibility to develop upper layers
- multiple operations: Send/Recv, RDMA, Atomic
- range of network features and QoS mechanisms: service levels, VLs, partitioning, multicast


Rich Set of Features (cont.)


Protection operations
- keys, protection domains
Flexibility for supporting RAS in next-generation systems with IBA features
- multiple CRC fields => error detection (per-hop, end-to-end)
- reliable transport services (connection and datagram)
- fail-over (managed, unmanaged)
- path migration
- built-in management services


Agenda

- What is the InfiniBand Architecture and Why?
- Overview of InfiniBand and its Novel Features
- Differences from Other Technologies/Standards
- Conclusions and Final Q&A


Comparison with Other Technologies/Standards

Interfaces
- PCI, PCI-X, HyperTransport, PCI-Express (3GIO)
Interconnects
- Myrinet, Quadrics
- Ethernet: 1.0/10.0 GigEth
27.6% of TOP500 systems use InfiniBand as interconnect


IBA vs PCI-X, HyperTransport, PCI-Express
- complementary technologies
- the current focus of the IBA community is on data movement from node to node
- PCI-X, HyperTransport, and PCI-Express focus on moving data to the host processor
- IBA can be combined with any of these interface technologies to deliver high-performance data transfer

GigEth and Storage over IP


Major problem: the TCP/IP stack
BUT 1.0 GigEth with accelerators
Competition from the storage world (since Ethernet is a popular interconnect***)
- Direct Data Placement / RDMA over IP
- TCP Offload Engines (accelerators)
*** 56% of TOP500 systems use GigEth as interconnect


Myrinet

- 11/500 systems in the TOP500 use Myrinet
- High-end latency: 6.5 microsec
- Host CPU overhead: 0.15 microsec
- Data rates:
  - low-cost interfaces: ~248 MBytes/s unidirectional, ~489 MBytes/s bidirectional
  - dual-port interfaces (distribute and reassemble data across the two ports): ~490 MBytes/s unidirectional, ~875 MBytes/s bidirectional
- LANai-XP programmable processor at 225 MHz

Quadrics

- QsNetII with Elan4: ~900 MBytes/s unidirectional (sustainable transfer rate)
- Elite4 crossbar switch: 8 bidirectional links, 400 MB/s in each link direction


IBA vs Myrinet, Quadrics


Obtained on 11/16/03
System configuration
- 8 Supermicro P4DL6 nodes with ServerWorks chipset and dual Intel Xeon 2.40 GHz processors
Interconnects
- Mellanox IBA 4X (10.0 Gbps), PCI-X 133
- Myrinet LANai XP cards, PCI-X 133
- Quadrics Elan3 cards, PCI 64-bit/66 MHz slots
Results
- MPI-level


MPI-level Latency Comparison


MPI-level Bandwidth Comparison


MPI-level Bidirectional Bandwidth Comparison


Products on IBA

Adapters, Switches
Software
- lower level: VAPI, IBAL, OpenIB
- upper level: MPI, SDP, IPoIB, SRP, uDAPL, kDAPL
- subnet managers: MiniSM, VFM, InfiniView


Conclusions

- New architecture - System Area Network
- Rich set of features: SkewClear cables, switches, RDMA, protocols, transport in hardware
- OpenIB: InfiniBand support for Linux
- Storage over IP, Myrinet support for TCP/IP


Conclusions

- #1 in the TOP500 (today): Roadrunner - 12,960 IBM PowerXCell 8i CPUs, 6,480 AMD Opteron dual-core processors; interconnect: InfiniBand; memory: 103.6 TiB (1 tebibyte = 2^40 bytes); Rmax (speed): 1.7 petaflops
- Mozart (Stuttgart University): Intel Xeon 3.06 GHz, 64 dual-processor nodes, Rmax = 850.6 GFlops

