CAP3 - InfiniBand
Mission
Garner industry agreement on, and development of, a new channel-based, switched-fabric server I/O for the data center, clustering, and Internet computing environments.
Why InfiniBand?
- High-Bandwidth Fabric Interconnects
- Low-Latency InfiniBand Protocol
- Channel I/O
- Multiple Levels of Redundancy
- Dense Rack-Mounted Servers
- Blade Server / Compute Brick Enablement
- Enhanced Density
Customer Benefits
- Independent scaling of data center resources
- Decreased cabling requirements
- Increased data center flexibility & performance
- Sharing of I/O resources
- Standards-based clustering interconnect
IBA defines a System Area Network for connecting processing and I/O nodes:
- communications and management infrastructure for IPC and I/O
- switched, channel-based interconnection fabric
- consists of processor nodes and I/O units connected through an IBA fabric made up of cascaded switches and routers
At a high level: an interconnect for end nodes.
- Each node can be a processor node, an I/O unit, and/or a router to another network
- The IBA network is subdivided into subnets interconnected by routers
- End nodes may attach to a single subnet or to multiple subnets
- Multiple links can exist between any two IBA devices
Channel Adapters: Host Channel Adapters (HCAs) and Target Channel Adapters (TCAs).
- Used by processing and I/O units to connect to the fabric
- Programmable DMA engines with protection features
- One or more ports: LID, buffers
- Memory translation and protection: virtual address -> physical address, validates access rights
Relay packets from one link to another.
- Switches: intra-subnet. Switches implement a fat-tree topology (at any given level of the switch topology, the amount of bandwidth connected to the downstream end nodes is identical to the amount of bandwidth on the upstream paths of the interconnect)
- Routers: inter-subnet
- May support unicast/multicast
- Media: optical fiber, copper cable, printed-circuit wiring on a backplane
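The fat-tree property described above can be checked with a small helper. This is only a sketch: the tuple layout and the function name `full_bisection` are my own, not from any IBA tool.

```python
# Check the fat-tree property: at every switch level, the aggregate
# bandwidth toward the end nodes equals the aggregate bandwidth upward.
def full_bisection(levels):
    # levels: one (down_links, down_bw, up_links, up_bw) tuple per
    # switch level, with bandwidths in Gbps.
    return all(dn * dbw == un * ubw for dn, dbw, un, ubw in levels)

# Example: a level with 12 downstream 4X links (8 Gbps each) balanced
# by 4 upstream 12X links (24 Gbps each): 12*8 == 4*24.
print(full_bisection([(12, 8, 4, 24)]))   # balanced level
print(full_bisection([(12, 8, 2, 24)]))   # oversubscribed level
```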
Subnet Manager (SM):
- configuration, management, ID assignment, link failover
- one master subnet manager per subnet
- Subnet Management Agent (SMA): one per node; communicates with the SM through the SMI
General Services (GS):
- chassis management
- General Service Agent; communicates with the GS through the GSI
Management traffic uses Virtual Lane VL15.
IBA Overview
- Architecture and Basic Components
- Communication and I/O Operations
- Transport Layer
- Keys, Protection Domains
- Virtual Lanes, QoS Mechanisms, Multicast
- Management and Services
Communication Stack
Communication Model
Consumers queue up sets of instructions that the hardware executes => work queue (WQ)
- a work queue is associated with one or more preregistered buffers
- work queues are created in pairs => queue pair (QP)
- queue pair = send queue + receive queue
- work requests may be signaled or unsignaled on completion => completion queue (CQ)
Communication Model
- The consumer submits a work request (WR) => an instruction called a Work Queue Element (WQE) is placed on the appropriate WQ
- The channel adapter executes WQEs in the order they were placed on the WQ
- When the channel adapter completes a WQE, a Completion Queue Element (CQE) is placed on a completion queue
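The WR -> WQE -> CQE flow above can be sketched as a toy model. This is illustrative only: class and method names are invented here; real code would use a verbs library such as libibverbs.

```python
# Toy model of the IBA queue-pair flow: posting a WR places a WQE on a
# work queue; "hardware" executes WQEs in order and emits CQEs.
from collections import deque

class QueuePair:
    """A queue pair: a send queue and a receive queue of WQEs."""
    def __init__(self, cq):
        self.send_queue = deque()
        self.recv_queue = deque()
        self.cq = cq  # completion queue shared with the consumer

    def post_send(self, wr):
        # Posting a work request places a WQE on the send queue.
        self.send_queue.append(wr)

    def process(self):
        # The channel adapter executes WQEs in posting order and places
        # a CQE on the completion queue as each one completes.
        while self.send_queue:
            wqe = self.send_queue.popleft()
            self.cq.append(("CQE", wqe))

cq = []
qp = QueuePair(cq)
qp.post_send("msg-1")
qp.post_send("msg-2")
qp.process()
print(cq)  # CQEs appear in the order the WQEs were posted
```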
Communication Operations
Channel semantics:
- Send-Recv model
- one party pushes the data; the destination party determines the final destination of the data
Memory semantics:
- the initiating party reads/writes the virtual address space of the remote node
- RDMA Read
- RDMA Write
- RDMA Atomic operations (e.g. Fetch & Add, Compare & Swap)
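The two semantics can be contrasted in a minimal sketch. All names here (`Node`, `send`, `rdma_write`, `rdma_fetch_add`) are invented for illustration; they are not a real verbs API.

```python
# Channel vs. memory semantics, modeled on plain Python lists.
class Node:
    def __init__(self, size):
        self.memory = [0] * size
        self.recv_buffers = []   # posted by the receiver (channel semantics)

    def post_recv(self, addr):
        self.recv_buffers.append(addr)

def send(dst, value):
    # Channel semantics: the destination decides where the data lands,
    # by consuming one of its posted receive buffers.
    addr = dst.recv_buffers.pop(0)
    dst.memory[addr] = value

def rdma_write(dst, addr, value):
    # Memory semantics: the initiator names the remote virtual address.
    dst.memory[addr] = value

def rdma_fetch_add(dst, addr, delta):
    # Atomic Fetch & Add: returns the old value, stores old + delta.
    old = dst.memory[addr]
    dst.memory[addr] = old + delta
    return old
```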
Existing stacks: excessive CPU utilization from copying data between kernel and application memory.
RDMA (Remote Direct Memory Access): data is transferred from the memory of one device to the memory of another device without passing through either device's CPU and without calling into the OS kernel.
Registering memory -> beginning virtual address, size, key
- accessible locations are identified by address + key for protection
RDMA Read/Write:
- zero-copy: the NIC transfers data directly to/from application memory
- reduced demand on the CPU
- kernel bypass: the application issues commands to the NIC without executing a kernel call
Layered Architecture
Physical Layer
- 3 link speeds: 1X (2 Gbps), 4X (8 Gbps), 12X (24 Gbps)
- link: 4-wire serial differential connection
- 1X link: 2.5 Gbps signaling rate -> 8b/10b encoding -> 2.0 Gbps data rate
- links can be aggregated in units of 4 or 12, called 4X or 12X
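The quoted link rates follow from the 8b/10b arithmetic: every 8-bit byte is transmitted as a 10-bit symbol, so the data rate is the signaling rate times 8/10, scaled by the lane count.

```python
# Derive the 1X/4X/12X data rates from the 2.5 Gbps signaling rate
# and 8b/10b encoding (8 data bits per 10 line bits).
SIGNAL_RATE_1X = 2.5  # Gbps per lane

def data_rate(width):
    # width = number of aggregated lanes: 1, 4, or 12
    return SIGNAL_RATE_1X * width * 8 / 10

for w in (1, 4, 12):
    print(f"{w}X: {data_rate(w):.1f} Gbps")
# 1X: 2.0 Gbps, 4X: 8.0 Gbps, 12X: 24.0 Gbps
```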
Network layer
- how packets are routed between subnets
Upper layers
- allow support of protocols such as IP and SCSI
- define messages/protocols for management functions
Link layer
- data integrity: 2 CRCs (link-level integrity between hops, end-to-end data integrity)
- credit-based flow control
Example switch: MTS14400
- 12 4X or 4 12X internal connections to the spine
- non-blocking, ultra-low switching latency (< 200 ns)
- 2.88 Tbit/s switching capacity
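The credit-based flow control mentioned above can be sketched as a toy per-link model (class and method names are my own; in real IBA, credits are tracked per virtual lane, not per link).

```python
# Credit-based flow control: the sender may transmit only while it
# holds credits corresponding to free buffers at the receiver, so
# packets are never dropped for lack of buffer space.
from collections import deque

class CreditedLink:
    def __init__(self, receiver_buffers):
        self.credits = receiver_buffers   # advertised by the receiver
        self.in_flight = deque()

    def try_send(self, packet):
        if self.credits == 0:
            return False                  # out of credits: sender must wait
        self.credits -= 1
        self.in_flight.append(packet)
        return True

    def receiver_consume(self):
        # The receiver frees a buffer and returns a credit to the sender.
        pkt = self.in_flight.popleft()
        self.credits += 1
        return pkt
```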
Transport Layer
Transport services:
- reliable/unreliable connection
- reliable/unreliable datagram
Connection-oriented: each QP is associated with exactly one remote consumer; the QP context is configured with the identity of the remote consumer's QP.
Datagram service: a QP is not tied to a single remote consumer; rather, information in the WQE identifies the destination.
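The addressing difference between the two service types can be sketched as follows (illustrative classes, not a verbs API): a connected QP binds its peer once in the QP context, while a datagram QP carries the destination in each work request.

```python
# Connected vs. datagram QP addressing, reduced to WQE construction.
class ConnectedQP:
    def __init__(self, remote_qp):
        self.remote_qp = remote_qp        # fixed once, in the QP context

    def build_wqe(self, payload):
        # The destination is implicit: it comes from the QP context.
        return {"dst": self.remote_qp, "data": payload}

class DatagramQP:
    def build_wqe(self, payload, remote_qp):
        # The destination travels in the work request itself.
        return {"dst": remote_qp, "data": payload}
```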
Keys
- Assigned by administrative entities
- Carried in messages
- Management keys, baseboard management key, partition key, queue key, memory keys
- Consumers use virtual addresses in work requests
- Channel adapters do virtual-to-physical address translation
- Requires consumers to register memory with the CA beforehand
- L_Key: used in each WQ that requires a memory access to the region
- R_Key: for RDMA, the consumer passes the key and a virtual address or buffer in that memory region to another consumer
Protection Domains
- Allow a consumer to control which set of QPs can access which registered memory regions
- QPs are allocated to a protection domain
- Memory is registered to a protection domain
- The L_Key and R_Key for a particular memory region are only valid on QPs created in the same protection domain
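Putting the last two slides together, an RDMA access must pass three checks: the key matches the region, the QP and the region share a protection domain, and the address lies inside the registered range. A toy model (all names invented for illustration):

```python
# Memory registration, keys, and protection domains in miniature.
import itertools

_key_gen = itertools.count(1)

class Region:
    """A registered memory region: base address, size, keys, and PD."""
    def __init__(self, pd, addr, size):
        self.pd, self.addr, self.size = pd, addr, size
        self.l_key = next(_key_gen)   # for local work queue access
        self.r_key = next(_key_gen)   # handed to remote peers for RDMA

def rdma_access_allowed(qp_pd, region, r_key, addr):
    # All three checks must pass: key match, same protection domain,
    # and the target address within the registered region.
    return (r_key == region.r_key
            and qp_pd == region.pd
            and region.addr <= addr < region.addr + region.size)
```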
Virtual Lanes
- VL = a set of transmit & receive buffers in a port
- Multiple VLs within the same physical link
- Separate buffers and flow control per VL
- VL15 is reserved for management
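A sketch of the idea: independent queues multiplexed over one physical link, with VL15 set aside for management traffic. The simple priority rule below (VL15 first, then data VLs in order) is my own simplification; real IBA uses SL-to-VL mapping and arbitration tables.

```python
# Virtual lanes as independent buffer sets sharing one physical port.
from collections import deque

class Port:
    def __init__(self, data_vls=4):
        # Data VLs 0..n-1 plus the dedicated management lane VL15.
        self.vls = {vl: deque() for vl in list(range(data_vls)) + [15]}

    def enqueue(self, vl, packet):
        self.vls[vl].append(packet)   # each VL has its own buffering

    def arbitrate(self):
        # Simplified arbitration: service VL15 first, then data VLs.
        for vl in [15] + sorted(v for v in self.vls if v != 15):
            if self.vls[vl]:
                return vl, self.vls[vl].popleft()
        return None
```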
Subnet Management
Subnet Manager (SM), Subnet Management Agent (SMA):
- subnet management packets operate on QP0
- each node contains an SMA; each port has a QP0
- one master SM per subnet; it can reside in any node
General Services, General Service Agent:
- general management packets use QP1
High-performance data transfer for IPC and I/O:
- low latency, high bandwidth, low CPU utilization
- multiple transport services -> flexibility to develop upper layers
- multiple operations: Send/Recv, RDMA, Atomic
- a range of network features and QoS mechanisms: service levels, VLs, partitioning, multicast
Protection operations: keys, protection domains.
Flexibility for supporting RAS in next-generation systems with IBA features:
- multiple CRC fields => error detection (per-hop, end-to-end)
- reliable transport services (connection and datagram)
- fail-over (managed, unmanaged)
- path migration
- built-in management services
Agenda
- What Is the InfiniBand Architecture and Why?
- Overview of InfiniBand and Its Novel Features
- Differences from Other Technologies/Standards
- Conclusions and Final Q&A
Interfaces: PCI, PCI-X, HyperTransport, PCI Express (3GIO)
Interconnects: Myrinet, Quadrics, Ethernet (1.0/10.0 GigE)
27.6% of the TOP500 systems use InfiniBand as interconnect.
IBA vs. PCI-X, HyperTransport, PCI Express: complementary technologies.
- the current focus of the IBA community is on data movement from node to node
- PCI-X, HyperTransport, and PCI Express focus on moving data to the host processor
- IBA can be combined with any of these interface technologies to deliver high-performance data transfer
Myrinet
11/500 systems in the TOP500 use Myrinet.
- High-end latency: 6.5 microsec
- Host CPU overhead: 0.15 microsec
Data rates:
- low-cost interfaces: ~248 MBytes/s unidirectional, ~489 MBytes/s bidirectional
- dual-port interfaces (distribute and reassemble data across the two ports): ~490 MBytes/s unidirectional, ~875 MBytes/s bidirectional
- LANai-XP programmable processor at 225 MHz
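The dual-port trick (distribute and reassemble data across two ports) roughly doubles the single-port rate, which matches the quoted figures (~248 -> ~490 MBytes/s). A sketch of the striping idea, with invented helper names:

```python
# Stripe a message's chunks round-robin across two ports, then
# reassemble them in order at the receiver using sequence numbers.
def stripe(chunks, ports=2):
    lanes = [[] for _ in range(ports)]
    for i, chunk in enumerate(chunks):
        lanes[i % ports].append((i, chunk))   # tag with sequence number
    return lanes

def reassemble(lanes):
    # Chunks may arrive interleaved; sort by sequence number.
    pieces = sorted(p for lane in lanes for p in lane)
    return [chunk for _, chunk in pieces]

msg = list("abcdef")
print(reassemble(stripe(msg)) == msg)   # round-trip preserves order
```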
Quadrics
QsNetII with Elan4: ~900 MBytes/s unidirectional (sustainable transfer rate)
Elite4 crossbar switch:
- 8 bidirectional links
- 400 MB/s in each link direction
Obtained on 11/16/03. System configuration:
- 8 Supermicro P4DL6 nodes with ServerWorks chipset and dual Intel Xeon 2.40 GHz processors
Interconnects:
- Mellanox IBA 4X (10.0 Gbps), PCI-X 133
- Myrinet LANai XP cards, PCI-X 133
- Quadrics Elan3 cards, PCI 64-bit/66 MHz slots
Results: MPI-level
Products on IBA
Adapters, switches
Software:
- lower level: VAPI, IBAL, OpenIB
- upper level: MPI, SDP, IPoIB, SRP, uDAPL, kDAPL
- subnet managers: MiniSM, VFM, InfiniView
Conclusions
New architecture: a System Area Network.
Rich set of features: SkewClear cables, switches, RDMA, protocols, transport implemented in hardware.
OpenIB: InfiniBand support for Linux.
Storage over IP; Myrinet support for TCP/IP.
Conclusions
No. 1 in the TOP500 (today): Roadrunner, with 12,960 IBM PowerXCell 8i CPUs and 6,480 dual-core AMD Opteron processors; interconnect: InfiniBand; memory: 103.6 TiB (1 tebibyte = 2^40 bytes); Rmax (speed): 1.7 petaflops.
Mozart (Stuttgart University): Intel Xeon 3.06 GHz, 64 dual-processor nodes, Rmax = 850.6 GFlops.