Guillaume Chelius
Ecole Normale Superieure de Lyon
46 allée d'Italie, 69346 Lyon Cedex 07, France
Guillaume.Chelius@ens-lyon.fr
0-7695-1010-8/01 $10.00 © 2001 IEEE
[...] virtualization directly and subsumes the tasks of multiplexing, de-multiplexing, and data transfer scheduling normally performed by an OS kernel and device driver.

2.1. The Global Architecture

The VI architecture is composed of four basic components: Virtual Interfaces (VIs), Completion Queues (CQs), a VI Provider and a VI Consumer. The architecture is illustrated in Figure 1.

Figure 1. The VIA Architectural Model

The VI Consumer is generally composed of an application program and an operating system communication facility. The VI Provider is the set of hardware and software components responsible for instantiating a Virtual Interface. The VI Provider consists of a network interface controller and a Kernel Agent. The VI Network Interface Controller (NIC) implements the Virtual Interfaces and Completion Queues and directly performs data transfer functions. The Kernel Agent is a privileged part of the OS that performs the setup and management needed to maintain a Virtual Interface between VI Consumers and VI NICs. This includes creation/destruction of VIs and Completion Queues, management of system memory, interrupt management, VI connection setup and error handling.

A Virtual Interface is the mechanism that allows a VI Consumer to directly access a VI Provider to perform data transfer operations. Figure 2 illustrates a Virtual Interface. A VI consists of a pair of Work Queues: a send queue and a receive queue. VI Consumers post requests, in the form of Descriptors, in the Work Queues to send or receive data. VI Providers asynchronously process the posted Descriptors from the Work Queues and mark them with a status when completed. VI Consumers remove completed Descriptors and use them for subsequent requests. Each Work Queue has an associated Doorbell that is used to notify the VI network adapter that a Descriptor has been posted. This mechanism is implemented directly in the adapter and requires no OS intervention. A Completion Queue allows a VI Consumer to coalesce notification of Descriptor completions from the Work Queues of multiple VIs in a single location.

Figure 2. A Virtual Interface

[...] a NIC can access them. Pages are unlocked when the transfer is completed. Traditional network transports either perform these operations on every data transfer request or copy the data into a pre-registered buffer. These processes contribute significant overhead to the data transfer operation. The VI Architecture requires the VI Consumer to identify memory used for a data transfer prior to submitting the request. Only memory that has been registered with the VI Provider can be used for data transfer.

Memory registration consists of locking the pages of a virtually contiguous memory region into physical memory and providing the virtual to physical translation to the VI NIC. When registering, the VI Consumer gets an opaque handle which is used in subsequent calls to the VI Provider. VIA includes several memory protection facilities. When registered, a memory region is associated with a protection tag and several permission attributes that define the allowed memory operations. Whenever a user operation involves a memory block, the VI Consumer has to provide the correct protection tag for the operation to take effect.

The VI Provider is responsible for handling connection and error management. VIA only defines the connection scheme/model. The underlying protocol remains implementation dependent. Concerning errors, the specifications propose several behaviors. They define different levels of reliability: unreliable, reliable delivery and reliable reception.

2.3. Existing Implementations

Several implementations of VIA exist. Most of them have been developed to deal with proprietary media and little information is available about them. Examples are Giganet, Tandem, Fujitsu System Technologies and NEC. Two other implementations are the product of academic efforts. In both cases, the aim was to build a reference implementation of the protocol.

Berkeley VIA is a multi-platform (Solaris, Linux and Windows NT) native implementation of VIA for Myrinet boards. It has been developed at UC Berkeley (see [17] and [2] for more information about Berkeley VIA). It is an example of embedded VIA since the protocol is almost completely implemented directly in the Myrinet board.

M-VIA, standing for Modular VIA, is an implementation of VIA for the Linux operating system, developed at the National Energy Research Scientific Computing Center (NERSC) of the Lawrence Berkeley National Laboratory (see [16] and [5] for more information about M-VIA). It supports several physical media such as Ethernet cards and the GNIC-II Gigabit Ethernet device and allows an easy extension to new devices.

3. Integration of Myrinet in M-VIA

A first solution to support VIA on a Myrinet network is to port M-VIA to Myrinet boards. Indeed, M-VIA has been designed to allow an easy integration of new network interfaces.

3.1. M-VIA Architecture

A primary design goal of M-VIA is to support VIA on multiple network interfaces, including both VIA-aware and VIA-unaware ones. This is achieved by the modular implementation of the protocol. M-VIA provides a complete framework, but also allows a device to override some VIA functionalities in its own specific way. This allows VIA-aware devices to use their special abilities. Its architecture is composed of a user-level library and several Linux kernel modules. The core module is device-independent and implements the Kernel Agent of the VI Provider. The other modules are device modules and implement device-specific functionalities. Each consists of an Ethernet device driver to which several VIA abilities have been added.

The Core module is responsible for the management of resources and connections. All the functionalities provided by this module are default ones. A device module is free to override any of them by registering the new functions when loaded.

The VI Provider library implements the API defined in the VIA specifications. The main characteristic of this library lies in the implementation of the Kernel Agent calls. Indeed, a device driver can decide whether these calls are performed through classical system calls or, to improve efficiency, through fast-traps. A fast-trap enables the execution of privileged code with minimum overhead (no scheduling operations or signal processing, unlike classical system calls).

A device module provides the abstraction of a VI NIC. When a device registers itself with the core module, it provides its specific abilities (e.g. Maximum Transfer Unit, resource limitations, doorbell mechanisms) and its own implementation of some of the VI Kernel Agent calls (e.g. the send call with the DEC Tulip card). Basically (for VIA-unaware hardware) an M-VIA device module is an Ethernet driver enhanced with a (de-)multiplexing ability.

3.2. The Myrinet Device Module

Like classical M-VIA devices, the M-VIA Myrinet driver is based upon the Ethernet driver (the GM Ethernet driver in our study). Resources (buffer rings, interrupt handlers) are shared and a (de-)multiplexing of several events is required. For example, at the reception of a packet, the receive interrupt handler is responsible for handling the packet, checking its type and transmitting it to the correct upper-layer protocol. The send interrupt handling is also de-multiplexed. To achieve this, a permanent record of the send ring state and, more precisely, a record of the packets being processed is kept up-to-date. This information enables the send interrupt handler to decide with which protocol the sent packet was associated.

Compared to other M-VIA drivers, one distinguishing feature of the Ethernet driver based on Myricom's GM library is its ability to handle gather descriptors. This feature is used when a VIA packet is sent. Instead of fragmenting and emitting each of the VIA segments until the termination of the descriptor, the VIA gather list is mapped into one or several GM gather lists. The translation of VIA descriptors into GM ones is not trivial since the constraints of both descriptors are not the same (number of segments, Maximum Transfer Unit, etc.); but the fragmentation of the sent data is, in most cases, considerably reduced.

In addition to classical optimizations, two optimization schemes have been added to the driver. The first one concerns the handling of send interrupts on the host side and the second one is related to the Myrinet DMA engine and the
processing of gather lists.

An interrupt is a costly process. It requires the LANai, the processor on the Myrinet board, to access the [...]

Device                      M-VIA    GAMMA
GNIC-II Gigabit Ethernet    59.7     93.7
[...]                       11.9     12.1
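The mapping of a VIA gather list onto GM gather lists described in Section 3.2 can be illustrated with a minimal sketch. It is not the driver's code: the segment representation as (address, length) pairs, the function name map_gather_list and the two limits max_segments and mtu are assumptions made for illustration, standing in for the "number of segments" and "Maximum Transfer Unit" constraints mentioned in the text.

```python
from typing import List, Tuple

# Hypothetical representation: a segment is (address, length in bytes).
Segment = Tuple[int, int]


def map_gather_list(via_segments: List[Segment],
                    max_segments: int,
                    mtu: int) -> List[List[Segment]]:
    """Pack VIA segments into GM-style gather lists.

    Each output list holds at most `max_segments` segments and at most
    `mtu` bytes. A VIA segment is split only when it would overflow the
    gather list currently being filled.
    """
    if max_segments <= 0 or mtu <= 0:
        raise ValueError("max_segments and mtu must be positive")

    gm_lists: List[List[Segment]] = []
    current: List[Segment] = []
    current_bytes = 0

    for addr, length in via_segments:
        while length > 0:
            # Close the current gather list once either limit is reached.
            if len(current) == max_segments or current_bytes == mtu:
                gm_lists.append(current)
                current, current_bytes = [], 0
            take = min(length, mtu - current_bytes)
            current.append((addr, take))
            current_bytes += take
            addr += take
            length -= take

    if current:
        gm_lists.append(current)
    return gm_lists
```

With segments of 100, 3000 and 50 bytes, a limit of two segments per list and an MTU of 1500 bytes, this sketch produces three gather lists, whereas emitting each segment separately would take four packets.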
[...] significant overhead of approximately 9 usec for both devices. This overhead is OS overhead. Indeed, M-VIA does not provide user-level access to the network interface. The cost associated with fast-traps and system management, even if lowered, is still present. Secondly, M-VIA is not a 0-copy protocol. Although on the send side memory registration and address translation avoid a costly copy, this is not true on the receive side.

[Figure: Virtual Interface (Recv WQ, Send WQ) and Completion Queue]
[...] corrupted references to the VIA library. In most existing implementations of VIA, the common policy is to abstract a handle as a pointer to the object it references. The benefit is an efficient translation from handle to object in the library. However, it prevents proper validation of the handles. To remedy this, the architecture shown in Figure 3 is used. In every NIC, for each object type, the existing resources are listed in a reference array. In this scheme, handles are just indexes. Through the use of mutual exclusion resources, the architecture protects the library against almost any misbehavior from the VI Consumer.

4.2. A user layer

GM already implements some functionality required by VIA: memory registration, address translation and a Domain Name Service feature. In a classical VIA implementation, all these operations are performed by the Kernel Agent of the VI Provider. In GVIA, a Kernel Agent is unnecessary since the GM driver can handle most of its work. If its remaining tasks can be handled directly in user space, GVIA can be reduced to a single user library.

In addition to the tasks performed by the GM driver, the remaining responsibilities of the Kernel Agent are resource management (creation, destruction), protection validation (use of Protection Tags), error handling and connection management. Without entering into details, the architecture briefly described in the previous section makes it easy to handle resource management and protection validation from user space. To handle error and connection management, GM's flow control abilities are used intensively. Careful handling of GM flow control errors and the use of alarms allow the library to translate GM events into VIA situations that it can handle.

The user-level status of GVIA leads to one last particularity. In a classical VIA implementation, numerous calls are blocking: connection requests, connection wait, blocking receive, etc. Usually, the blocking mechanism lies in the kernel, as for classical data transfer protocols. However, GVIA only provides user-level access to the communication media, without any intervention from the kernel. As a consequence, the blocking mechanism has to be implemented directly in the library. This is achieved using mutex resources and by recording all blocked threads as well as the operations they are blocked on.

4.3. Data Transfer

VIA requires a high level of de-multiplexing from the network interface: an incoming message has to be dispatched among potentially thousands of VIs (the specification requires support for at least 1024 VIs per NIC). Basically, GM only performs a de-multiplexing between ports. Since ports cannot be used as VIs, this operation has to be implemented in the library. Another issue is the support for scatter and gather lists. Although they are supported by the Ethernet driver, the user GM API does not accept them. The proposed solution differentiates two cases depending on the size of the transferred data. For short messages, a copy operation is used. For long messages, a rendezvous protocol is set up.

The copy of a small amount of data does not introduce a high overhead in comparison to the transport cost. This fact justifies the use of shared buffers to receive and send short VIA packets (see Figure 3). When a short message is received in the Myrinet board, it is copied into the receive buffer ring of the targeted NIC, that is, the targeted Port. The library is responsible for copying the packet into the scatter list of the first free descriptor in the Receive Work Queue of the target VI. It is also responsible for notifying the VI of the completed operation. On the send side, the data is copied from the different segments of the gather list into the send ring buffer. This communication scheme also applies to all control messages. In this case, the copy is even avoided since messages are treated on-the-fly.

For large data transfers, a copy is not acceptable. To achieve a 0-copy transfer, a rendezvous protocol is established. It relies on one GM feature, the directed send. A directed send is a Put operation. It transfers data from a source buffer on the local host to a target buffer on the remote host, both buffers being specified by the initiator of the operation. At the termination of the transfer, no notification is performed on the remote host. A directed send requires an active resource acquisition only on the sender node. As the receive side needs to be notified of the reception completion, an acknowledgment message is sent by the sender to the receiver. As the sender also needs to know where to send the data, the receiver sends its list of gather segments to the sender at the latter's request. Finally, the complete protocol requires three control packets in addition to the data transfer. They use the short message scheme and are directly handled from the receive buffer ring.

5. Status of GVIA

GVIA is an implementation of the VIA specifications on top of the message passing interface GM. It implements the unreliable service but offers reliable communications. GVIA is a user library which includes no system-dependent code and very little hardware-dependent code (only little endian to big endian translation macros). It is compatible with any thread library that supports mutex operations. The result is a high portability which is only limited by the porting of GM to different systems.
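The long-message rendezvous of Section 4.3 can be modelled with a small in-memory sketch that counts the three control packets and performs the data transfer as Put operations. This is an illustration under stated assumptions, not GVIA's code and not the GM API: the Node class, the directed_send and rendezvous functions and the message names REQUEST, SEGMENTS and ACK are all hypothetical, and a bytearray stands in for registered memory.

```python
from collections import deque
from typing import List, Tuple


class Node:
    """Hypothetical host: registered memory plus a receive buffer ring."""

    def __init__(self, name: str, mem_size: int = 64) -> None:
        self.name = name
        self.memory = bytearray(mem_size)  # stands in for registered memory
        self.inbox = deque()               # stands in for the receive ring
        self.completed: List[int] = []     # byte counts of completed receives


def directed_send(src: Node, src_off: int,
                  dst: Node, dst_off: int, length: int) -> None:
    """Put operation: write into remote memory without notifying the remote host."""
    dst.memory[dst_off:dst_off + length] = src.memory[src_off:src_off + length]


def rendezvous(sender: Node, receiver: Node, src_off: int, length: int,
               recv_segments: List[Tuple[int, int]]) -> int:
    """Run one long-message transfer and return the number of control packets."""
    control_packets = 0

    # 1. The sender asks the receiver where the data should go.
    receiver.inbox.append(("REQUEST", length))
    control_packets += 1
    receiver.inbox.popleft()

    # 2. The receiver answers with its segment list (offset, length).
    sender.inbox.append(("SEGMENTS", recv_segments))
    control_packets += 1
    _, segments = sender.inbox.popleft()

    # 3. Data transfer: one directed send per target segment, no copy
    #    through the buffer rings and no notification on the receiver.
    sent = 0
    for dst_off, seg_len in segments:
        n = min(seg_len, length - sent)
        if n == 0:
            break
        directed_send(sender, src_off + sent, receiver, dst_off, n)
        sent += n

    # 4. The sender acknowledges, so the receiver learns of the completion.
    receiver.inbox.append(("ACK", sent))
    control_packets += 1
    _, nbytes = receiver.inbox.popleft()
    receiver.completed.append(nbytes)

    return control_packets
```

In this model the receiver only learns that the data has arrived when it handles the final acknowledgment, which matches the observation in the text that a directed send by itself produces no notification on the remote host.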
GVIA was tested with the Intel Virtual Interface Architecture Conformance Suite (see [1] and [14] for more information). Of the roughly 120 tests that apply to its implementation, it passed all but 5. Two of these 5 tests check the behavior of the VI Provider when confronted with non-aligned data. Since the DMA engine of the Myrinet board does not require any particular alignment, we chose to weaken the VIA constraint. The three other failed tests correspond to behaviors that are unspecified in the VIA specifications.

5.1. Performances

[...] presented in Figure 4 and a bandwidth comparison in Figure 5. All test machines have a 66 MHz/64-bit PCI bus and run Linux. For small packets (0 to 100 Bytes), the overhead introduced by the VIA protocol is less than 1 usec. For bigger packets, the overhead never exceeds 30 usec, which is the cost of the extra messages in the hand-shake protocol. This last overhead will be reduced in the near future by the introduction of a new hand-shake protocol.

Compared to existing implementations of VIA, a GVIA based on the new Myricom 2000 boards achieves the highest bandwidth (215 MBytes/s) and the shortest latency. For comparison, Giganet's latency is 8.5 usec whereas the latency of GVIA with the latest Myrinet cards (LANai 9, 200 MHz) is 7.5 usec. It is however important to note that the hardware conditions are not the same, since Giganet boards cannot take full advantage of 64-bit/66 MHz PCI buses.
[...] Interface for Parallelism, is a message passing interface for Myrinet boards (see [18] for a description).

[...] the aim is not to get a fully embedded VIA, but only to introduce some VIA facilities into GM.