
Kerrighed and Data Parallelism:
Cluster Computing on Single System Image Operating Systems*

Christine Morin¹, Renaud Lottiaux¹, Geoffroy Vallée¹, Pascal Gallard¹, David Margery¹,
Jean-Yves Berthou² and Isaac D. Scherson³

Abstract— A working Single System Image distributed operating system is presented. Dubbed Kerrighed, it provides a unified approach and support to both the MPI and the shared memory programming models. The system is operational in a 16-processor cluster at the Institut de Recherche en Informatique et Systèmes Aléatoires in Rennes, France. In this paper, the system is described with emphasis on its main contributing and distinguishing features, namely its DSM based on memory containers, its flexible handling of scheduling and checkpointing strategies, and its efficient and unified communications layer.

Because of the importance and popularity of data parallel applications in these systems, we present a brief discussion of the mapping of two well known and established data parallel algorithms. It is shown that Shearsort is remarkably well suited for the architecture/system pair, as is the ever so popular and important Two-Dimensional Fast Fourier Transform.
I. INTRODUCTION
implemented in Kerrighed to provide global scheduling.
Single System Image (SSI) operating systems for clusters are attractive as they ease cluster programming and use. Different groups currently target the development of SSI operating systems based on Linux, such as OpenMosix [1] resulting from Mosix [4], and OpenSSI [2] derived from previous systems [14], [13]. Bproc [6] also provides some of the properties of an SSI, providing a global view of processes executing in a cluster. Kerrighed is another such single system image (SSI) operating system (OS) whose main design goals are to provide high-level services to high performance parallel as well as sequential applications on clusters of computers. In Kerrighed, all cluster node resources (processors, memories, disks) are globally and dynamically managed. Global resource management enables transparent distribution of resources throughout the cluster nodes and good usage of the whole cluster resources for demanding applications. Dynamic resource management enables transparent cluster reconfigurations (node addition or eviction) for the applications, and high availability in the event of node failures. In addition, a checkpointing mechanism is provided to avoid restarting applications from the beginning when node failures happen.

Kerrighed is implemented as an extension to a standard single node operating system, and a prototype based on the Linux kernel has been developed. A patch of less than 1000 lines of code has been applied to the Linux kernel, extending it by using the standard module mechanism.

In this paper, we present the operating system mechanisms that are implemented in Kerrighed to provide global memory, process and data stream management. We focus on mechanisms that are used to support workloads composed of sequential and parallel applications based on either the shared memory or the message passing programming models.

In Kerrighed, processes and threads can start their execution on any node and may be migrated during their execution. Efficient process management mechanisms are implemented in Kerrighed to provide global scheduling. Moreover, the scheduling policy can easily be adapted to a particular workload due to the modular architecture of the global scheduler. Global memory management in Kerrighed relies on the concept of container. Based on containers, the high level services of a standard operating system can be extended elegantly to provide shared virtual memory segments, a cooperative file cache, remote paging and an efficient way of transferring the address space of a migrating process. The communications layer in Kerrighed is based on the novel idea of dynamic streams and implements the standard layered interface expected by most applications, while allowing communicating process migration.

The problem of parallelizing applications is considered in the last section of this paper. Shearsort and the two-dimensional FFT illustrate in general the use of a cluster to compute data parallel programs. Kerrighed is particularly suitable for these applications due to its efficient support of the DSM mechanism as well as the MPI interface.
*The work presented here has been partly financed by the Direction Générale de l'Armement (DGA) for the COCA project.
¹Project PARIS, IRISA/INRIA, Campus universitaire de Beaulieu, 35042 Rennes cedex, France.
²EDF R&D, F-92141 Clamart Cedex, France.
³Dept. of Comp. Science (Systems), The Donald Bren School of Information and Comp. Sciences, University of California, Irvine, Irvine, CA 92697-3425, USA. On sabbatical at the IRISA/INRIA and Université de Rennes I, March-July 2004.

II. KERRIGHED'S DESIGN

A. Design Principles

The goal behind the design of Kerrighed is to build an operating system that provides all the properties of a genuine Single System Image (SSI). A genuine SSI offers users and programmers the illusion that a cluster is a single high performance and highly available computer instead of a set of independent machines interconnected by a network.



An SSI should offer four properties: (1) resource distribution transparency, i.e. offering the illusion that each kind of resource is unique and shared, (2) high performance, (3) high availability, i.e. tolerating node failures and allowing applications to checkpoint and restart, and (4) scalability, i.e. dynamic system reconfiguration, node addition and eviction transparently to applications. Kerrighed performs global and dynamic resource management at the operating system level to achieve all the SSI properties. Resource distribution transparency and high performance are provided by a set of distributed services that perform global management of processors, memories and disks. To avoid mechanism redundancy and conflicting decisions in different distributed resource management services, and to decrease the software complexity of such services, Kerrighed resource management services are built in a unified and integrated way. The global memory and global process management services that are described later in this section illustrate this philosophy. High availability and scalability are achieved through dynamic resource management in Kerrighed.

A key advantage of the Kerrighed approach is that the standard interface of a standard single node operating system, which is familiar to programmers, is not modified. Legacy applications running on this standard operating system may be executed without modification on top of Kerrighed and further optimized if needed.

Kerrighed is not an entirely new operating system developed from scratch, as it has been designed to be implemented as an extension to an existing standard operating system. Kerrighed only takes in charge the distributed nature of the cluster, while the local operating system running on each node remains responsible for the management of local physical resources. Our current prototype extends Linux.

We focus here on services that provide process management, global memory management and global data stream management. We show how these services achieve the first two properties of an SSI, focusing on mechanisms, not on resource management policies. We also describe how these services can be integrated in an existing operating system. Mechanisms to achieve high availability and scalability are out of the scope of this paper.

Both shared memory and message passing applications can benefit from Kerrighed's features without suffering from performance penalties. For this reason, two data parallel applications, namely Shearsort and the Two-Dimensional FFT, are presented in a manner independent of the programming model of choice.

B. Global Process Management

B.1 Requirements

A process that makes use of Kerrighed's services is referred to as a K-process. For the sake of simplicity, we assume that the whole address space of a K-process is linked to containers. We explain in this section how Kerrighed manages K-processes in order to enable an efficient and simple use of a cluster.

A given workload can be efficiently executed on a cluster if available resources are efficiently used. To that end, the nodes on which processes execute have to be carefully chosen. A global scheduling policy (simply called scheduling policy thereafter) defines how to map processes onto nodes, basing this mapping on some goals to achieve (e.g. balance the CPU load), some properties of the running processes (e.g. CPU usage) and the cluster nodes' states (e.g. number of processes per CPU). We say that a state of the cluster that infringes the scheduling policy (e.g. CPU load not balanced) generates a scheduling problem. Hence, a scheduling policy uses algorithms to prevent and to solve scheduling problems. Therefore, the cluster OS needs a global scheduler that can efficiently create processes on any cluster node according to the scheduling policy, and that allows migration of processes during their execution in order to efficiently solve scheduling problems.

Moreover, the scheduler should be configurable to be adapted to the submitted workload. For example, on the one hand, workloads consisting of sequential applications need a scheduling policy which guarantees an efficient use of memories and processors. On the other hand, workloads containing shared memory applications would also need a scheduling policy that prevents ping-pong of memory pages between nodes. Therefore, it is important to have a mechanism allowing new scheduling policies to be specified and implemented. Kerrighed implements a global scheduler in which the scheduling policy can be easily changed.

Kerrighed's scheduler relies on a set of efficient process management mechanisms. Remote process creation, remote thread creation, thread migration and process migration are used to decrease contention on local resources, to evict running applications from a node that must be stopped (for maintenance, for example) or to enable use of remote resources. Process checkpointing [3] helps decrease global contention on resources by suspending some applications. These mechanisms are all based on a common mechanism that extracts the full state of a process from the local OS.

In a standard Linux system, processes and threads can be created using the fork and exec interface. The standard interface of POSIX threads allows the creation and management of threads in shared memory applications. In Kerrighed, these mechanisms are extended using Kerrighed's process management mechanisms so that legacy applications developed for SMP machines take advantage of clusters.

B.2 Modular and Customizable Global Scheduler

We propose a modular global scheduler, composed of three layers (see Figure 1): system probes to harvest system information and give a view of the cluster's state; local analyzers to detect all the local scheduling problems such as high resource contention or device failures; and a global scheduling manager to schedule new processes in the cluster and to solve scheduling problems. This modular architecture allows different issues of global scheduling to be dealt with separately, and thus eases modifications of global scheduling policies.

Fig. 1. Kerrighed's Architecture for Global Process Management.

Fig. 2. Global Scheduler Architecture.

B.2.a System Probes. System probes, measuring for example CPU or memory use, make up the first layer. There are two different kinds of probes: passive probes and active probes. Each probe can be linked with a set of local analyzers, to which it sends information. Active probes are regularly awakened by a system timer, whereas passive probes are awakened by a system event. There are two different kinds of system events: Linux kernel events, and global scheduler events, the latter being used to get local information. When a passive probe is awakened by a system event, the probe sends information about the probed entity to the local analyzers it is linked to.

For example, an active probe can be used to probe CPU use (the CPU is periodically probed), whereas we can use a passive probe to detect ping-pong of memory pages between two threads of a shared memory application (when a page arrives on a node, the probe is awakened).

To simplify implementation of the global scheduler, a set of system probes is provided within the Kerrighed OS: a memory probe, a CPU probe, and a probe to detect ping-pong of memory pages. Additional probes can be implemented by operating system programmers.

B.2.b Local Analyzers. Local analyzers get probe information, analyze and filter it, and detect abnormal local system states. This layer is also in charge of sending probe information to the global scheduling managers. A set of local analyzers runs on each node (see Figure 2).

Each local analyzer can be linked with a set of probes. For example, consider a probe for CPU usage and another for CPU temperature. A local analyzer linked with these two probes can detect high contention on local CPUs, as well as local thermal problems. If a CPU problem is detected, the local analyzer sends a scheduling request to the global scheduling manager (a local analyzer has no global vision of the cluster state and therefore can take no decision).

B.2.c Global Scheduling Managers. A global scheduling manager runs on each node, and is linked to a set of local analyzers. Global scheduling managers executing on different nodes communicate together to exchange information on the nodes' states (e.g. the nodes' CPU loads). This layer is the only one to have a global view of the cluster. This global view is constructed with the probe information (e.g. from the CPU probes) and enables detection of global scheduling problems (see Figure 2). To that end, each global scheduling manager implements a global scheduling policy (e.g. CPU load balancing). When a scheduling problem is detected (e.g. local CPUs more loaded than the average in the cluster), the global scheduling manager can decide to migrate some processes or to checkpoint an application, according to the scheduling policy, in order to make an efficient use of cluster resources. As a result, the global scheduler can execute any scheduling policy, performing batch as well as dynamic load balancing.

All these layers can be configured using XML configuration files. The different probes, local analyzers and global scheduling managers can be dynamically loaded and unloaded without stopping the OS or the applications. Moreover, each layer provides a development framework to simplify the programming of new components, allowing new global scheduling policies to be created simply. Finally, the scheduler of the Linux kernel is not modified. Kerrighed just adds or removes processes to or from the local scheduler.
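As an illustration of how these three layers cooperate, the short user-space sketch below mimics the probe / local analyzer / scheduling request chain. The function names, the load threshold and the use of /proc/loadavg are choices made for this example only; Kerrighed's actual probes are kernel-level components loaded as modules.

    /* Illustrative probe -> analyzer -> manager chain (not Kerrighed's real API). */
    #include <stdio.h>

    /* "Active probe": sample the local CPU load. */
    static double probe_cpu_load(void)
    {
        double load = 0.0;
        FILE *f = fopen("/proc/loadavg", "r");
        if (f) {
            if (fscanf(f, "%lf", &load) != 1)
                load = 0.0;
            fclose(f);
        }
        return load;
    }

    /* "Local analyzer": detect a local scheduling problem. */
    static int analyze_load(double load, double threshold)
    {
        return load > threshold;
    }

    int main(void)
    {
        double load = probe_cpu_load();
        if (analyze_load(load, 2.0))   /* arbitrary threshold for the example */
            printf("scheduling request to global manager: CPU overloaded (load %.2f)\n", load);
        else
            printf("no local scheduling problem (load %.2f)\n", load);
        return 0;
    }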
B.3 Process State Management: Placement, Migration and Checkpointing

Kerrighed's schedulers are based on three mechanisms: process placement, process migration and process checkpoint/restart. For process placement, Kerrighed supplies two mechanisms: remote process creation and remote process duplication. Remote process creation uses a dedicated interface semantically equivalent to a fork() immediately followed by an exec() in the child process. Process duplication is used during the application's execution when new processes (using fork()) or threads (using pthread_create())
are created that should inherit the application's context. To place such a new process, the system needs to extract an image of the creator process and transfer it to a remote node to create a new running clone. Similarly to remote process duplication, process migration needs to extract a process image and transfer it to a remote node to create a running clone, but the initial process is stopped. Process checkpointing also needs to extract a process image and store it on disk or in remote memory [9].

Note that remote process creation, duplication and checkpoint/restart all use the same underlying mechanism of process extraction (see Figure 1).

Process extraction consists in creating a ghost process (process virtualization) composed of several parts: the address space, the opened files, the process identifier (PID), the processor registers, and signal related data.
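Purely as an illustration, the pieces enumerated above could be grouped in a structure along the following lines; every type and field name here is invented for the sketch and does not correspond to Kerrighed's actual kernel data structures.

    /* Declaration-only sketch of a "ghost process" (invented names; Linux/x86 headers). */
    #include <signal.h>
    #include <sys/user.h>                    /* struct user_regs_struct */

    struct ghost_segment {
        unsigned long container_id;          /* the container backing this segment    */
        unsigned long start, length;
    };

    struct ghost_process {
        int pid;                             /* process identifier                     */
        struct ghost_segment *segments;      /* address space described by containers  */
        int nr_segments;
        unsigned long *file_container_ids;   /* regular open files (containers)        */
        unsigned long *stream_ids;           /* stream files (dynamic streams)         */
        struct user_regs_struct regs;        /* processor registers                    */
        sigset_t pending, blocked;           /* signal related data                    */
    };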
B.3.a Management of the Address Space and Opened Files of Processes. In a standard Linux kernel, all the memory information and opened files information of a process need to be extracted in order to create a coherent ghost process. In the Kerrighed OS, the address space and opened files of a K-process are globally managed (by containers for the memory space and regular opened files, and by dynamic stream mechanisms for stream files; there are different types of files: regular files, and streams for pipes, sockets, ...). For this reason, information about memory space and opened files does not need to be extracted for process migration. For example, containers allow virtual memory pages of a process to be accessed from anywhere in the cluster. New links with containers are just created after the process transfer, and then memory pages are migrated on demand by the container mechanism during the process execution (see Figure 3), and opened files can be accessed remotely through containers. As a result, for process extraction, the container mechanism eases the creation of the ghost process: instead of the whole process address space and information about opened files, only container information needs to be extracted. The same approach applies to stream files: information about dynamic streams is extracted before the migration and transferred to the remote node. After the transfer, a new link is created with dynamic streams, as for containers, and the process can then use stream files remotely.

Fig. 3. Containers linked to process segments: (a) before migration, (b) after migration.

B.3.b Management of the Process Identifier (PID). In a standard Linux OS, threads are implemented by processes and by the pthread library. Hence, processes are identified by a kernel-unique identifier: the Process IDentifier (PID). Threads are identified by internal identifiers in the pthread library.

The Kerrighed OS adds one layer to this system. Each process is assigned a Kerrighed Process IDentifier (KPID) as well as its PID. At the kernel level, the PID is used, whereas at the user level, only the KPID is seen. Therefore, provided this KPID is unique across the cluster, KPIDs can be used to uniquely designate a process whatever the node it is running on. To ensure uniqueness, the KPID of a K-process is composed of the PID of the initial Linux process hosting the K-process and the identifier of its creation node. The Kerrighed thread library, krgpthread, manages an additional internal thread identifier.
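A cluster-wide unique identifier of this kind can be obtained, for example, by packing the node identifier into the upper bits of a wider integer. The layout below is arbitrary and only meant to make the composition concrete; it is not Kerrighed's actual KPID format.

    /* Illustrative KPID = (node identifier, local Linux PID) packing. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t kpid_make(uint32_t node_id, uint32_t pid)
    {
        return ((uint64_t)node_id << 32) | pid;
    }

    static uint32_t kpid_node(uint64_t kpid) { return (uint32_t)(kpid >> 32); }
    static uint32_t kpid_pid(uint64_t kpid)  { return (uint32_t)kpid; }

    int main(void)
    {
        uint64_t kpid = kpid_make(3, 4242);      /* process 4242 created on node 3 */
        printf("kpid=%llu -> node %u, pid %u\n",
               (unsigned long long)kpid, kpid_node(kpid), kpid_pid(kpid));
        return 0;
    }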
C. Global Memory Management

C.1 Requirements

Global memory management in a cluster covers several services. First, in order to support the execution of multithreaded applications, and more generally the shared memory programming model, a DSM system is needed that allows processes or threads to share data segments whatever their execution node. Secondly, it is highly desirable in a cluster to exploit the memory distributed among the cluster nodes to increase the efficiency of operating system services. There are two areas for improvement. The first is remote paging mechanisms to efficiently support applications with huge memory requirements. As the high speed networks used in clusters have a lower latency than disks, it may be more advantageous to swap to memory available in a remote node than to the local swap disk. The second is a system of cooperative file caches to improve I/O efficiency.

A DSM system, a remote paging system and a system of cooperative file caches rely on several common mechanisms: locating a page copy in the cluster, transferring pages between nodes (either to serve page requests or to inject pages), and managing the coherency of replicated page copies.

Kerrighed implements the concept of container as a unique set of mechanisms to globally manage the cluster physical memory. All operating system services using memory pages access the physical memory through containers.

C.2 Containers

In a cluster, each node executes its own operating system kernel, which can be coarsely divided into two parts: (1) system services and (2) device managers. We propose a
generic service inserted between the system services and the device managers layers, called container [11]. Containers are integrated in the core kernel thanks to linkers, which are software pieces inserted between existing device managers, system services and containers. The key idea is that a container gives the illusion to system services that the cluster physical memory is shared as in an SMP machine.

A container is a software object that allows the cluster-wide storing and sharing of data. A container is a kernel level mechanism completely transparent to user level software. Data is stored in a container on host operating system demand and can be shared and accessed by the host kernel of other cluster nodes. Pages handled by a container are stored in page frames and can be used by the host kernel as any other page frame. Container pages can be mapped in a process address space, be used as a file cache entry, etc.

By integrating this generic sharing mechanism within the host system, it is possible to give the illusion to the kernel that it relies on top of a physically shared memory. On top of this virtual physically shared memory, it is possible to extend to the cluster the traditional services offered by a standard operating system (see Figure 4). This allows the OS interface, as known by users, to be kept, while taking advantage of the existing low level local resource management.

Fig. 4. Integration of containers and linkers within the host operating system.

The memory model offered by containers is sequential consistency, implemented with a write invalidation protocol. This model is the one offered by a physically shared memory. Moreover, an injection mechanism similar to [5] is used to balance memory usage and avoid (or delay) disk swapping.

C.3 Linkers

Many mechanisms in a kernel rely on the handling of physical pages. Linkers divert these mechanisms to ensure data sharing through containers. To each container are associated one or several high level linkers, called interface linkers, and a low level linker called the input/output linker. The role of interface linkers is to divert device accesses of system services to containers, while an I/O linker allows a container to access a device manager.

System services are connected to containers thanks to interface linkers. An interface linker changes the interface of a container to make it compatible with the high level system services interface. This interface must give the illusion to these services that they communicate with traditional device managers. Thus, it is possible to "trick" the kernel and to divert device accesses to containers. It is possible to connect several system services to the same container. For instance it is possible to map a container in the address space of a process P1 on a node A and to access it through a read/write interface within a process P2 on a node B.

During the creation of a new container, an input/output linker is associated to it. The container then stops being a generic object and becomes an object sharing data coming from the device it is linked with. The container is said to have been instantiated. For each semantically different piece of data to share, a new container is created. For instance, a new container is used for each file to share and a new container for each memory segment to share or to be made visible cluster wide.

Just after the creation of a container, it is completely empty, i.e. it does not contain any page and no page frame contains data from this container. Page frames are allocated on demand during the first access to a page. Similarly, data can be removed from a container when it is destroyed, or in order to release page frames when the physical memory of the cluster is saturated.
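The division of work between the two kinds of linkers can be pictured with the following declaration-only sketch; the structures and callback names are invented for this illustration and are not the interfaces used inside Kerrighed.

    /* Invented sketch of the container / linker split. */
    struct io_linker {                      /* attaches the container to a device manager   */
        int (*first_touch)(unsigned long page_index, void *page_frame);
        int (*flush)(unsigned long page_index, const void *page_frame);
    };

    struct interface_linker {               /* presents the container to a system service   */
        void *(*map)(void *container, void *addr, unsigned long len);   /* e.g. memory map  */
        long  (*read)(void *container, void *buf, unsigned long len);   /* e.g. read/write  */
        long  (*write)(void *container, const void *buf, unsigned long len);
    };

    struct container {
        unsigned long id;                   /* cluster-wide identifier                       */
        struct io_linker *io;               /* set when the container is instantiated       */
        struct interface_linker *ifaces[4]; /* several services may connect to one container */
    };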
C.4 Design of Distributed System Services

Containers and linkers are used to implement several cluster wide OS services. We detail in this section the implementation of the virtual memory sharing and file mapping services. More details on the cooperative file cache can be found in [11].

C.4.a Shared Virtual Memory. The virtual memory sharing service of an OS allows data to be shared between threads, or between processes through a System V segment. A shared virtual memory extends this service to a cluster by allowing several processes or threads running on different nodes to share data through their address space. Providing this service requires three properties: (1) data sharing between nodes, (2) coherence of replicated data and (3) simple access to shared data through processor read/write operations.

The container service ensures the first two properties. The third one is ensured by the mapping interface linker. Thus, mapping a memory container in the virtual address space of several processes via a mapping linker leads to a shared virtual memory.

When a process page fault occurs, the memory mapping interface linker diverts the fault to containers. The container mechanism places a copy of the page in local memory and ensures the coherence of data. Lastly, the mapping interface
linker maps the local copy in the address space of the faulting process and changes the virtual page access rights according to the rights of the page in the container.
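The fault path can be imitated in user space with mprotect() and a SIGSEGV handler: the handler plays the role of the mapping interface linker and of the container, bringing the data in and opening the access rights before the faulting instruction is retried. This is only an analogy and error handling is omitted; Kerrighed performs the equivalent steps inside the kernel.

    /* User-space analogy of a diverted page fault. */
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *page;
    static long  page_size;

    static void on_fault(int sig, siginfo_t *si, void *ctx)
    {
        (void)sig; (void)si; (void)ctx;
        /* "Container" side: make a coherent copy available and open the mapping. */
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        strcpy(page, "data brought in on first access");
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);

        page_size = sysconf(_SC_PAGESIZE);
        page = mmap(NULL, page_size, PROT_NONE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("%s\n", page);   /* the read faults; the handler fills and maps the page */
        return 0;
    }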
C.4.b File Mapping. The file mapping service of an OS allows a file to be mapped in the address space of one or several processes, or in the address space shared by a group of threads. Extending this service to a cluster supposes the mapping of a file in a process address space whatever its execution node, or in the shared memory of a group of threads. This is done by mapping a file container in the virtual address space of a process thanks to a mapping interface linker. Moreover, it is possible to map a file container in the address space of one or more processes running locally or on a remote node, as well as in shared virtual memory.

D. Dynamic Streams

D.1 Requirements

Kerrighed implements load balancing for workloads comprising parallel applications based on the shared memory or on the message passing programming models. One issue is to efficiently migrate processes communicating using message passing. Communicating processes using standard communication interfaces such as Unix sockets or pipes should be able to be transparently migrated in the cluster. Moreover, migrating a process should not alter the performance of its communications with other processes.

With standard communication interfaces such as pipes and sockets, two processes communicate through a binary stream. We propose the concept of dynamic streams, on which standard communication interfaces are built. We call the extremities of these streams "KerNet sockets", and these can be migrated inside the cluster. Dynamic streams and KerNet sockets are implemented on top of a portable high performance communication system providing a send/receive interface to transfer data between different nodes in a cluster.

The proposed architecture is depicted in Figure 5. The low-level point-to-point communication service is based either on device drivers (such as Myrinet), on the generic network device in the Linux kernel (netdevice) or on a high-level communication protocol (such as TCP/IP). On top of the low level point-to-point layer, we provide three kinds of dynamic streams: direct, FIFO and LIFO streams. Dynamic streams are specialized by interfaces. We use these dynamic streams, implemented by the KerNet layer, to offer dynamic versions of standard Unix stream interfaces (inet/unix sockets, pipes, ...).

Fig. 5. Kerrighed network stack.

The KerNet layer implements the abstraction of dynamic streams and KerNet sockets. It is a distributed service which provides global stream management cluster wide. In the remainder of this section, we focus on the design and implementation of the KerNet layer and of the Unix socket interface and pipes. The low-level point-to-point communication system is only briefly described.

D.2 Dynamic Stream Service

All systems for clusters rely on a low-level point-to-point communication system. Most of the time, such a system provides properties such as reliability, high performance and message ordering. So, when no migration occurs, applications should be able to take advantage of these properties.

In Kerrighed, we designed a low-level communication system which is reliable and keeps the message sending order between two nodes. This system is described in Section D.4.

We define a KerNet dynamic stream as an abstract stream with two or more defined KerNet sockets and with no node specified. When needed, a KerNet socket is temporarily attached to a node. For example, if two KerNet sockets are attached, send/receive operations can occur. A KerNet dynamic stream is mainly defined by several parameters (a possible descriptor structure is sketched after this list):
- Type of stream: it specifies how data is transferred using the dynamic stream. A stream can be:
  - DIRECT for one to one communication, such as socket based communication,
  - FIFO or LIFO for streams with several readers and writers.
- Number of sockets: it specifies the total number of available sockets for the stream. Depending on the stream type, this value may increase, decrease or be constant.
- Number of connected sockets: it specifies the current number of attached sockets.
- Data filter: it allows modification of all data transmitted with the stream (in order to have cryptography, backup, ...).
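A descriptor carrying these parameters might look as follows; the enum and field names are invented for this sketch and are not KerNet's real definitions.

    /* Invented sketch of a dynamic stream descriptor. */
    enum kernet_stream_type { KERNET_DIRECT, KERNET_FIFO, KERNET_LIFO };

    struct kernet_stream {
        enum kernet_stream_type type;   /* how data is transferred                    */
        int max_sockets;                /* total number of available sockets          */
        int connected_sockets;          /* sockets currently attached to a node       */
        /* optional filter applied to every transmitted buffer (cryptography, backup) */
        int (*filter)(void *data, unsigned long len);
    };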
Streams are managed by a set of stream managers, one executing on each cluster node. Kernel data structures related to dynamic streams are kept in a global directory which is distributed over the cluster nodes. Data such as the node location of a KerNet socket are updated on all the nodes acting on the stream.

The dynamic stream service is in charge of allocating KerNet sockets when they are needed, and of keeping track of these KerNet sockets. When the state of one KerNet socket changes, the stream's manager takes part in this change and updates the other KerNet sockets related to the stream. With this mechanism, each KerNet socket holds the address of each corresponding socket's node in a map. In this way, two sockets can always communicate in the most efficient way (using this map).

At the end of a connection, a process is detached from the stream. Depending on the stream type, the stream may be closed.

Fig. 6. Standard environment based on KerNet sockets.

D.3 Implementation of Standard Communication Interfaces using Dynamic Streams

Obviously, standard distributed applications do not use KerNet sockets. In order to create a standard environment based on dynamic streams and KerNet sockets, an interface layer is implemented at kernel level (see Figure 6). Each module of the interface layer implements a standard communication interface relying on the interface of the KerNet service. The main goal of each interface module is to manage the standard communication interface protocol (if needed).

KerNet interfaces are the links between the standard Linux operating system and the Kerrighed dynamic communication service.

The Kerrighed operating system is built on top of a lightly-modified Linux kernel. All the different services, including the communication layer, are implemented as Linux kernel modules. The communication layer is made of two parts: a static high-performance communication system that provides a node to node service (it is on top of this system that the dynamic stream service manages the migration of stream interfaces), and the interface service that replaces the standard functions for a given communication tool.

In the remainder of this section, we describe the implementation of the pipe interface on top of KerNet sockets. We aim at providing, transparently to applications, a distributed version of this communication tool.

D.3.a Pipe Communication. Pipes belong to the most common and oldest communication tools. Pipes are an easy way to connect a process to a child process. The pipe system call creates a pair of file descriptors, one for reading and the other one for writing.

In Figure 7, one process P1 creates a pipe and gets two file descriptors (fd1 and fd2). When P1 performs a fork, the child process P2 inherits the file descriptors from P1. P1 and P2 may close one file descriptor and use the other one to communicate.

Since the two extremities of a pipe are created at the same time, there is no special protocol to implement in Kerrighed. In order to implement pipes in our KerNet architecture, we just have to create a stream with two KerNet sockets. Each KerNet socket is attached to each file descriptor (as shown in Figure 8).

Fig. 8. Basic pipe implementation: (1) stream = create(DIRECT, 2); (1.1) fd1 = attach(stream); (1.2) fd2 = attach(stream); (2) fork(); (3) each process closes one descriptor; (4) write(fd1, ...) / read(fd2, ...).
D.4 Low-Level Communication Layer

The KerNet architecture is designed for a distributed operating system such as Kerrighed. For this reason, it was natural to build KerNet on top of the low-level communication layer provided by Kerrighed, called Gimli/Gloin. In the following, we highlight some features of this communication layer.
- Reliability: every sent message is delivered without any change in the message.
- Simple interface: this communication service is based on the channel idea, and provides a very basic send/receive interface.

Kerrighed's low level point-to-point communication system is divided in two layers: Gimli and Gloin. Gimli provides a kernel level API on which all Kerrighed services rely, in particular KerNet. Gloin is in charge of the communication reliability management (error control, packet retransmission). Finally, Gimli provides a way to classify the messages according to some profile, a message being characterized by a channel and a type. If a message cannot be delivered, an error is returned to the calling service (KerNet in our case).

Fig. 9. Low-Level Communications Service.

The interaction between Gimli/Gloin and the KerNet service is mainly done by the KerNet_Send function. Basically, this function keeps trying to send the message, passing the socket's map as node destination to the gimli_send function (see Figure 9, KerNet layer). When the receiver migrates, gimli_send fails. The gimli_send will succeed once the map has been updated at the end of the migration (with the new location). Until this event, the previous location discards the unwanted messages.
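The retry logic can be summarised with the following pseudocode-style sketch; the function names follow the text (KerNet_Send, gimli_send) but the signatures and helper routines are invented here and do not compile against the real Kerrighed sources.

    /* Sketch of KerNet_Send retrying until the socket location map is up to date. */
    int KerNet_Send(struct kernet_socket *sk, const void *msg, unsigned long len)
    {
        for (;;) {
            int node = socket_map_lookup(sk);            /* current node of the peer socket */
            if (gimli_send(node, KERNET_CHANNEL, msg, len) == 0)
                return 0;                                /* delivered                       */
            /* The peer has migrated: gimli_send failed and the stale destination
             * discards the message.  Wait for the stream manager to update the map,
             * then retry towards the new location.                                   */
            wait_for_map_update(sk);
        }
    }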
III. DATA PARALLEL COMPUTING IN A CLUSTER

Kerrighed provides efficient and reliable support to program parallel algorithms in a computing cluster. Sorting and the two-dimensional FFT were chosen as representative of an important class of data parallel algorithms that work on two-dimensional data structures. Other researchers have chosen these problems [10], [8], [15], and we report our preliminary, modest findings in this section. Shearsort seems remarkably suitable for implementation in computing clusters, while the 2D FFT has proven its importance and necessity in such high performance computing systems.

A. Matrix Transposition in a Cluster System

A common denominator to the programming of Shearsort and the 2D-FFT is matrix transposition. Hence, we consider first the problem of transposing an n x n matrix A stored in the cluster such that processor Pi, i = 0, ..., k-1, is given a band of n/k rows of the matrix starting with row i x (n/k). If at each processor we divide the bands into slices of n/k elements each, the allocation can be described as a k x k arrangement of (n/k) x (n/k) blocks Bij, where Bij is the j-th slice of the band held by processor Pi.

Transposing the original matrix A is equivalent to transposing the arrangement of individually transposed blocks above.

A typical sequential loop to transpose by blocks in the cluster is given in Figure 10(a). For simplicity, the loop does not optimize for a processor sending a main diagonal block to itself. Figure 10(b) shows all the processor to processor block transfers required to transpose the matrix. The complexity of the transfer is O(k^2) block transfers.

Fig. 10. Communications Patterns and Loops to Block-Transpose a Matrix.
  (a) For i = 0, k-1, do:
        For j = 0, k-1, do:
          Pi sends block Bij to processor Pj
        endfor
      endfor
  (c) For all i, do parallel:
        For j = 0, k-1, do:
          Pi sends block Bij to processor Pj
        endfor
      endfor
  (d) For i = 0, k-1, do:
        For all j, do parallel:
          Pi sends block Bij to processor Pj
        endfor
      endfor

Standard loop parallelizing techniques call for choosing an index, i or j, and unfolding the loop according to the chosen index. Thus, if one parallelizes on the index i, the loop is transformed into the one shown in Figure 10(c). Clearly, the result is that all processors send the same corresponding column block to a single destination processor. As processors have a single input port, the bottleneck causes a sequentialization of the parallelized loop and hence the same complexity of O(k^2) block transfers.

Alternatively, parallelizing on j results in the similar loop shown in Figure 10(d), which indicates that, at each iteration, a single processor should broadcast all its blocks simultaneously to all other processors. Again, this is an impossibility as processors have a single output port. The complexity remains at O(k^2) block transfers.

A clever way to overcome the limitations of the bottlenecks described above is to use a skewed transmission mechanism where processors send blocks starting with the block labeled with their own processor index:

  For all i, do parallel:
    For j = 1, k-1, do:
      Pi sends block Bi,(j+i) mod k to processor P(j+i) mod k
    endfor
  endfor

If the cluster network allows for simultaneous all-processor-pairs communications, the resulting complexity becomes O(k) block transfers. The skewed technique is currently in use in many readily available cluster 2D-FFT routines (see Martin Siegert's, for example).

On Kerrighed, using the DSM support and a double buffering technique, a good overlap between communications and computations can be achieved.
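A minimal MPI rendition of the skewed exchange is sketched below, assuming one process per band and a trivial block payload; block sizes, data placement and the local block transposition are simplified so that the communication schedule stays visible.

    /* Skewed block exchange: at step j, rank i sends block (i+j) mod k and
     * receives the block coming from rank (i-j+k) mod k, so all processor
     * pairs communicate simultaneously (O(k) block transfer steps). */
    #include <mpi.h>
    #include <stdio.h>

    #define B 4                        /* elements per block (illustrative)        */
    #define MAXK 64                    /* upper bound on cluster size for the demo */

    int main(int argc, char **argv)
    {
        int k, rank;
        double band[MAXK][B], recv[B];

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &k);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (k > MAXK)
            MPI_Abort(MPI_COMM_WORLD, 1);

        for (int j = 0; j < k; j++)
            for (int e = 0; e < B; e++)
                band[j][e] = rank * 100 + j;          /* block B(rank, j)          */

        for (int j = 1; j < k; j++) {
            int dest = (rank + j) % k;                /* skewed destination        */
            int src  = (rank - j + k) % k;            /* matching source           */
            MPI_Sendrecv(band[dest], B, MPI_DOUBLE, dest, 0,
                         recv,       B, MPI_DOUBLE, src,  0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* recv holds block B(src, rank); after a local block transpose it
             * becomes block B(rank, src) of the transposed allocation.            */
            for (int e = 0; e < B; e++)
                band[src][e] = recv[e];
        }

        if (rank == 0)
            printf("skewed block exchange completed on %d processes\n", k);
        MPI_Finalize();
        return 0;
    }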
B. Shearsort

Shearsort is a well known sorting technique that works naturally in two-dimensional models of computation [7], [12]. The basic idea is to sort the rows of a 2D array in alternating directions (odd rows in non-decreasing order from left to right, even rows in non-decreasing order from right to left) while sorting all columns in the same direction (top down). O(log n) iterations of row-column sort suffice to sort the entire array into a snakelike row-major indexing scheme.

For the case under study, we propose to apply Shearsort to sort an integer matrix A into a block-ordered snakelike sorted array: the array is sorted such that all blocks in a row are sorted with respect to one another and all rows are in turn sorted with respect to one another.

Following the basic Shearsort idea, processors would simultaneously sort their n/k rows of the array in the usual alternating opposite direction order. The transpose procedure would then be applied to align the data such that all processors now have the corresponding column blocks of the array. The Shearsort iteration would then complete with a set of n-element vector sorts followed by another block transpose. This row-column sort iteration is carried out O(log n) times. Using a standard O(n log n) n-element vector sort in each processor of the cluster, the speedup attained in a cluster using Shearsort becomes:

    speedup = Time in Single Processor / Time in Cluster

Normalizing the expression with respect to the ratio between communication and computation costs, the performance curves of Figure 11 are obtained. The speedup is plotted versus a varying problem size (for a fixed number of CPUs (32)), and versus a varying machine size (for a fixed problem size (4096 x 4096)). The vertical rectangle next to each plot shows the ratio Comm/Comp assumed for each curve.

Fig. 11. Performance of Shearsort.
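For reference, the row-column iteration itself is shown below on a small array held by a single process; in the cluster version each processor would sort its own band of rows and the column phase would be realised with the block transpose described earlier. The array size, the number of iterations and the use of qsort() are choices made for this sketch.

    /* Single-process Shearsort on an N x N integer array. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 8

    static int ascending(const void *a, const void *b)  { return *(const int *)a - *(const int *)b; }
    static int descending(const void *a, const void *b) { return *(const int *)b - *(const int *)a; }

    static void sort_columns(int m[N][N])               /* all columns top-down */
    {
        int col[N];
        for (int j = 0; j < N; j++) {
            for (int i = 0; i < N; i++) col[i] = m[i][j];
            qsort(col, N, sizeof col[0], ascending);
            for (int i = 0; i < N; i++) m[i][j] = col[i];
        }
    }

    int main(void)
    {
        int m[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = rand() % 100;

        int iterations = (int)ceil(log2(N)) + 1;        /* O(log n) iterations  */
        for (int it = 0; it < iterations; it++) {
            for (int i = 0; i < N; i++)                 /* rows, alternating    */
                qsort(m[i], N, sizeof m[0][0], (i % 2 == 0) ? ascending : descending);
            sort_columns(m);
        }
        for (int i = 0; i < N; i++)                     /* final row phase      */
            qsort(m[i], N, sizeof m[0][0], (i % 2 == 0) ? ascending : descending);

        for (int j = 0; j < N; j++)
            printf("%d ", m[0][j]);                     /* smallest (first) row */
        printf("\n");
        return 0;
    }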
C. 2D-FFT

The importance and application of the Fast Fourier Transform (FFT) needs no introduction. The FFT algorithm is a fine grain procedure suitable for large SIMD processor arrays. When mapping the FFT onto clusters, a coarser grain implementation is desired. Because for many applications (such as Radar Signal Processing or Image Reconstruction from Projections) the two dimensional transform is necessary, we focus here on the analysis of the performance of the 2D FFT implemented using the transpose mechanism shown above.

Recall that for a two-dimensional array, such as the one introduced in this section, the 2D FFT is defined as the transform of the rows of the array followed by the transform of the columns. To achieve this, with the data allocation chosen, processors are given n/k rows on which they locally perform a 1D FFT, the whole array is transposed, and the processors perform a second set of 1D FFTs on the column vectors now available in their local memories. A final matrix transpose reorganises the data in the original row-column manner.

Assuming the best possible algorithm for a single processor 1D FFT, the speedup for the procedure is given by:

    speedup = 2 n^2 log n / [ (2 n^2 / k) log n + O(k (n/k x n/k) block transfers) ]

As in the case of Shearsort, normalizing the expression with respect to the ratio between computation and communication costs, the performance curves of Figure 12 are obtained.
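The organisation of the 2D transform (row transforms, transpose, second set of transforms, final transpose) is summarised by the following single-process sketch; a naive O(n^2) DFT stands in for the 1D FFT so that the example is self-contained, whereas a real implementation would call an optimised 1D FFT routine on each row.

    /* 2D transform via row transforms and transposes. */
    #include <complex.h>
    #include <math.h>
    #include <stdio.h>

    #define N 4

    static void dft_row(double complex *row)            /* stand-in for a 1D FFT */
    {
        const double pi = acos(-1.0);
        double complex out[N];
        for (int k = 0; k < N; k++) {
            out[k] = 0;
            for (int j = 0; j < N; j++)
                out[k] += row[j] * cexp(-2.0 * I * pi * j * k / N);
        }
        for (int k = 0; k < N; k++)
            row[k] = out[k];
    }

    static void transpose(double complex a[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = i + 1; j < N; j++) {
                double complex t = a[i][j]; a[i][j] = a[j][i]; a[j][i] = t;
            }
    }

    int main(void)
    {
        double complex a[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i + j;

        for (int i = 0; i < N; i++) dft_row(a[i]);      /* row transforms           */
        transpose(a);
        for (int i = 0; i < N; i++) dft_row(a[i]);      /* column transforms        */
        transpose(a);                                   /* back to row-column order */

        printf("a[0][0] = %.1f%+.1fi\n", creal(a[0][0]), cimag(a[0][0]));
        return 0;
    }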
C.1 Data Parallel Performance Analysis

Shearsort and the 2D-FFT are representative of data parallel algorithms that may be needed in applications of cluster systems (such as Radar Data Processing). Experience shows that these algorithms need to be recast so as to take advantage of a coarser grain computing paradigm than
the one they were originally conceived for. In addition, the data parallelism needs to be combined with efficient single processor procedures to solve the smaller partitions of the problem.

Many DP algorithms were designed to take advantage of permutation/alignment networks that contributed data permuting functionality to the communications function. Many algorithms would bog down in a cluster if ported directly from their original fine grain instantiation.

These ideas were illustrated in this section, and we note that reasonable performance and scalability are obtained if the proper balance is struck between the data parallel algorithm, the single processor procedure used to solve the parallel subproblems, and the capabilities provided by the cluster interconnection network.

We are in the process of evaluating the effects of the DSM and MPI realizations on the performance of the solution of data parallel programs.

IV. CONCLUSION

Our contribution is the design and implementation of a set of fundamental operating system mechanisms for global process, memory and stream management which constitute a strong basis for building an efficient single system image operating system to execute workloads made up of sequential and parallel applications. Kerrighed is unique in providing a thread migration mechanism for multi-threaded applications. An interesting feature of Kerrighed is the concept of container, which provides an elegant way to integrate a remote paging system, a cooperative file cache and a shared virtual memory within the standard operating system running on each cluster node. Another interesting feature is the modular scheduler architecture, which allows the scheduling policy to be tailored to a particular type of workload.

A prototype of Kerrighed has been built as a set of modules extending the Linux kernel. The kernel itself has only been slightly modified. Kerrighed is available as open source software (http://www.kerrighed.org/).

Our current work aims at integrating additional basic mechanisms into Kerrighed and at studying global scheduling policies for different workloads. Another important research direction relates to the design of high availability mechanisms.

In the long term, we plan to integrate new mechanisms into Kerrighed to ease the use and programming of a federation of clusters, each of them running a single system image operating system such as Kerrighed.

REFERENCES

[1] http://openmosix.sourceforge.net/
[2] http://openssi.org/index.shtml
[3] Ramamurthy Badrinath and Christine Morin. Common mechanisms for supporting fault tolerance in DSM and message passing systems. Rapport de recherche 4613, INRIA, November 2002.
[4] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating System, Load Balancing for UNIX, volume 672 of Lecture Notes in Computer Science. Springer-Verlag, 1993.
[5] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, and H. M. Levy. Implementing global memory management in a workstation cluster. In Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP-15), pages 201-212, December 1995.
[6] Erik Hendriks. BProc: the Beowulf distributed process space. In Proceedings of the 16th International Conference on Supercomputing, pages 129-136. ACM Press, 2002.
[7] Sandeep Sen, Isaac D. Scherson, and Adi Shamir. Shear sort: A true two-dimensional sorting technique for VLSI networks. In Proceedings of the 1986 International Conference on Parallel Processing, pages 903-908, August 1986.
[8] Constantine Katsinis. Merging, sorting and matrix operations on the SOME-Bus multiprocessor architecture. Future Generation Computer Systems, Elsevier, September 2003.
[9] Anne-Marie Kermarrec and Christine Morin. Smooth and efficient integration of high-availability in a parallel single level store system. In Proceedings of Euro-Par 2001, August 2001.
[10] Francescomaria Marino and Earl E. Swartzlander, Jr. Parallel implementations of multidimensional transforms without interprocessor communication. IEEE Transactions on Computers, 48(9):951-961, September 1999.
[11] R. Lottiaux and C. Morin. Containers: A sound basis for a true single system image. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, pages 66-73, May 2001.
[12] Sandeep Sen and Isaac D. Scherson. Parallel sorting in two-dimensional VLSI models of computation. IEEE Transactions on Computers, 38(2):238-249, February 1989.
[13] Bruce Walker, Gerald Popek, Robert English, Charles Kline, and Greg Thiel. The LOCUS distributed operating system. In Proceedings of the Ninth ACM Symposium on Operating Systems Principles, pages 49-70. ACM Press, 1983.
[14] Bruce Walker and Douglas Steel. Implementing a full single system image UnixWare cluster: Middleware vs underware. In Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA'99, 1999.
[15] Limin Xiang and Kazuo Ushijima. On time bounds, the work-time scheduling principle, and optimality of BSR. IEEE Transactions on Parallel and Distributed Systems, 12(9):912-921, September 2001.

