Fig. 1. Kerrighed's architecture for global process management

B.2.a System Probes. System probes, measuring for example CPU or memory use, make up the first layer. There are two different kinds of probes: passive probes and active probes. Each probe can be linked with a set of local analyzers, to which it sends information. Active probes are regularly awakened by a system timer, whereas passive probes are awakened by a system event. There are two different kinds of system events: Linux kernel events, and global scheduler events, the latter used to get local information. When a passive probe is awakened by a system event, the probe sends information about the probed entity to the local analyzers it is linked to.

For example, an active probe can be used to probe CPU use (the CPU is periodically probed), whereas we can use a passive probe to detect ping-pong of memory pages between two threads of a shared memory application (when a page arrives on a node, the probe is awakened).

To simplify implementation of the global scheduler, a set of system probes is provided within the Kerrighed OS: a memory probe, a CPU probe, and a probe to detect ping-pong of memory pages. Additional probes can be implemented by operating system programmers.

B.2.b Local Analyzers. Local analyzers get probe information, analyze and filter it, and detect abnormal local system states. This layer is also in charge of sending probe information to the global scheduling managers. A set of local analyzers runs on each node (see Figure 2).

Each local analyzer can be linked with a set of probes. For example, consider a probe for CPU usage and another for CPU temperature. A local analyzer linked with these two probes can detect high contention on local CPUs, as well as local thermal problems. If a CPU problem is detected, the local analyzer sends a scheduling request to the global scheduling manager (a local analyzer has no global vision of the cluster state and therefore can take no decision).

B.2.c Global Scheduling Managers. A global scheduling manager runs on each node, and is linked to a set of local analyzers. Global scheduling managers executing on different nodes communicate with each other to exchange information on the nodes' states (e.g. the nodes' CPU loads). This layer is the only one to have a global view of the cluster. This global view is constructed from the probe information (e.g. from the CPU probes) and enables detection of global scheduling problems (see Figure 2). To that end, each global scheduling manager implements a global scheduling policy (e.g. CPU load balancing). When a scheduling problem is detected (e.g. local CPUs more loaded than the average in the cluster), the global scheduling manager can decide to migrate some processes or to checkpoint an application, according to the scheduling policy, in order to make efficient use of cluster resources. As a result, the global scheduler can execute any scheduling policy, performing batch as well as dynamic load balancing.

All these layers can be configured using XML configuration files. The different probes, local analyzers and global scheduling managers can be dynamically loaded and unloaded without stopping either the OS or the applications. Moreover, each layer provides a development framework to simplify the programming of new components, making it simple to create new global scheduling policies. Finally, the scheduler of the Linux kernel is not modified: Kerrighed simply adds processes to, or removes them from, the local scheduler.
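To make the layering concrete, the following user-space sketch wires an active probe to a local analyzer that raises a request toward the global scheduling manager. It is an illustration only: all names and types (struct probe, analyzer_fn, probe_tick) are hypothetical, since the text does not expose Kerrighed's actual in-kernel interfaces.

    /* Illustrative sketch of the three-layer scheduler design: an active
     * probe, normally woken by a system timer, feeds its linked local
     * analyzers, which may raise a request to the global scheduling
     * manager.  All names and types are hypothetical. */
    #include <stdio.h>

    #define MAX_ANALYZERS 4

    struct probe;
    typedef void (*analyzer_fn)(const struct probe *p, double value);

    struct probe {
        const char *name;
        double (*measure)(void);               /* e.g. read CPU load */
        analyzer_fn analyzers[MAX_ANALYZERS];  /* linked local analyzers */
        int n_analyzers;
    };

    /* Local analyzer: filters probe data and, on an abnormal local
     * state, sends a scheduling request to the global manager. */
    static void cpu_load_analyzer(const struct probe *p, double load)
    {
        if (load > 0.9)  /* local CPUs more loaded than a threshold */
            printf("[analyzer] %s high (%.2f): request global manager\n",
                   p->name, load);
    }

    static double measure_cpu_load(void) { return 0.95; /* stub */ }

    /* Active probe body: in Kerrighed this would run from a system
     * timer; here it is simply called once. */
    static void probe_tick(struct probe *p)
    {
        double v = p->measure();
        for (int i = 0; i < p->n_analyzers; i++)
            p->analyzers[i](p, v);  /* forward measurement to analyzers */
    }

    int main(void)
    {
        struct probe cpu = { "cpu-load", measure_cpu_load,
                             { cpu_load_analyzer }, 1 };
        probe_tick(&cpu);
        return 0;
    }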
B.3 Process State Management: Placement, Migration and Checkpointing

Kerrighed's schedulers are based on three mechanisms: process placement, process migration and process checkpoint/restart. For process placement, Kerrighed supplies two mechanisms: remote process creation and remote process duplication. Remote process creation uses a dedicated interface semantically equivalent to a fork() immediately followed by an exec() in the child process. Process duplication is used during the application's execution when new processes (using fork()) or threads (using pthread_create()) are created that should inherit the application's context. To place such a new process, the system needs to extract an image of the creator process and transfer it to a remote node to create a new running clone. Similarly to remote process duplication, process migration needs to extract a process image and transfer it to a remote node to create a running clone, but the initial process is stopped. Process checkpointing also needs to extract a process image and store it on disk or in remote memory [9].

Note that remote process creation, duplication and checkpoint/restart all use the same underlying mechanism of process extraction (see Figure 1).

Process extraction consists in creating a ghost process (process virtualization) composed of several parts: the address space, the opened files, the process identifier (PID), the processor registers, and signal-related data.
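The parts listed above can be summarized as a record; the sketch below mirrors them, but the field types are invented for illustration and do not correspond to Kerrighed's real kernel structures. In particular, the address space and opened files are recorded only as links to containers, as explained next in Section B.3.a.

    /* Hypothetical sketch of the ghost process built by process
     * extraction; the fields mirror the parts listed in the text. */
    #include <stdint.h>
    #include <signal.h>

    struct container_ref { uint64_t container_id; };  /* a link, not the data */

    struct ghost_process {
        struct container_ref address_space;  /* memory, via containers        */
        struct container_ref open_files;     /* regular files, via containers */
        int pid;                             /* process identifier (PID)      */
        uint64_t regs[32];                   /* processor registers           */
        sigset_t pending_signals;            /* signal-related data           */
    };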
B.3.a Management of the Address Space and Opened Files of Processes. In a standard Linux kernel, all the memory information and opened-files information of a process need to be extracted in order to create a coherent ghost process. In the Kerrighed OS, the address space and opened files of a K-process are globally managed (by containers for the memory space and regular opened files, and by dynamic stream mechanisms for stream files). For this reason, information about the memory space and opened files does not need to be extracted for process migration. For example, containers allow virtual memory pages of a process to be accessed from anywhere in the cluster. New links with containers are simply created after the process transfer, and then memory pages are migrated on demand by the container mechanism during the process execution (see Figure 3), and opened files can be accessed remotely through containers. As a result, for process extraction, the container mechanism eases the creation of the ghost process: instead of the whole process address space and information about opened files, only container information needs to be extracted. The same approach is used for stream files: there are different types of files, regular and streams (for pipes, sockets, ...). Information about dynamic streams is extracted before the migration and transferred to the remote node. After the transfer, a new link is created with the dynamic streams, as for containers, and the process can then use stream files remotely.

Fig. 3. Containers linked to process segments: (a) before migration, (b) after migration

B.3.b Management of the Process Identifier (PID). In a standard Linux OS, threads are implemented by processes and by the pthread library. Hence, processes are identified by a kernel-unique identifier, the Process IDentifier (PID), while threads are identified by internal identifiers in the pthread library.

The Kerrighed OS adds one layer to this system. Each process is assigned a Kerrighed Process IDentifier (KPID) as well as its PID. At the kernel level, the PID is used, whereas at the user level, only the KPID is seen. Therefore, provided this KPID is unique across the cluster, KPIDs can be used to uniquely designate a process whatever the node it is running on. To ensure uniqueness, the KPID of a K-process is composed of the PID of the initial Linux process that created the K-process and the identifier of the current node. The Kerrighed thread library manages an additional internal thread identifier.
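A minimal sketch of this composition follows, assuming (arbitrarily) a 64-bit KPID split into a 32-bit node identifier and the 32-bit PID of the creating Linux process; the actual encoding is not specified in the text.

    /* Sketch of building a cluster-unique KPID from the pieces named in
     * the text: the creating process's Linux PID and the node id.  The
     * 32/32-bit split is an assumption made for illustration. */
    #include <stdint.h>

    typedef uint64_t kpid_t;

    static inline kpid_t make_kpid(uint32_t node_id, uint32_t linux_pid)
    {
        return ((kpid_t)node_id << 32) | linux_pid;
    }

    static inline uint32_t kpid_node(kpid_t k) { return (uint32_t)(k >> 32); }
    static inline uint32_t kpid_pid(kpid_t k)  { return (uint32_t)k; }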
C. Global Memory Management

C.1 Requirements

Global memory management in a cluster covers several services. First, in order to support the execution of multithreaded applications, and more generally the shared memory programming model, a DSM system is needed that allows processes or threads to share data segments whatever their execution nodes. Secondly, it is highly desirable in a cluster to exploit the memory distributed among the cluster nodes to increase the efficiency of operating system services. There are two areas for improvement. The first is remote paging mechanisms to efficiently support applications with huge memory requirements. As the high speed networks used in clusters have a lower latency than disks, it may be more advantageous to swap to memory available in a remote node than to the local swap disk. The second is a system of cooperative file caches to improve I/O efficiency.

A DSM system, a remote paging system and a system of cooperative file caches rely on several common mechanisms: locating a page copy in the cluster, transferring pages between nodes (either to serve page requests or to inject pages), and managing the coherency of replicated page copies.

Kerrighed implements the concept of container as a unique set of mechanisms to globally manage the cluster physical memory. All operating system services using memory pages access the physical memory through containers.

C.2 Containers

In a cluster, each node executes its own operating system kernel, which can be coarsely divided into two parts: (1) system services and (2) device managers. We propose a generic service inserted between the system services and the device managers layers, called container [11]. Containers are integrated in the core kernel thanks to linkers, which are software pieces inserted between the existing device managers and system services on one hand, and containers on the other. The key idea is that containers give the illusion to system services that the cluster physical memory is shared, as in an SMP machine.

A container is a software object that allows the cluster-wide storing and sharing of data. A container is a kernel-level mechanism completely transparent to user-level software. Data is stored in a container on host operating system demand and can be shared and accessed by the host kernels of other cluster nodes. Pages handled by a container are stored in page frames and can be used by the host kernel as any other page frame. Container pages can be mapped in a process address space, be used as a file cache entry, etc.

By integrating this generic sharing mechanism within the host system, it is possible to give the illusion to the kernel that it relies on top of a physically shared memory. On top of this virtual physically shared memory, it is possible to extend traditional services offered by a standard operating system to the cluster (see Figure 4). This allows keeping the OS interface as known by users, while taking advantage of the existing low-level local resource management.

The memory model offered by containers is sequential consistency, implemented with a write-invalidation protocol. This model is the one offered by a physically shared memory. Moreover, an injection mechanism similar to [5] is used to balance memory usage and avoid (or delay) disk swapping.
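A toy, single-node sketch of a container as a page store may help fix the idea. A real container is a distributed kernel object cooperating across nodes; the names below (struct container, container_get_page) are illustrative only, and the on-demand frame allocation shown here is described further in Section C.3.

    /* Minimal single-node mock of a container: a cluster-wide page
     * store whose page frames are allocated on first access. */
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096
    #define MAX_PAGES 1024

    struct container {
        long id;                     /* cluster-wide container identifier */
        void *frames[MAX_PAGES];     /* page frames, NULL until first use */
    };

    /* Return the frame holding page `index`, allocating it on first access. */
    void *container_get_page(struct container *c, unsigned long index)
    {
        if (index >= MAX_PAGES)
            return NULL;
        if (!c->frames[index]) {     /* first access: allocate a frame */
            c->frames[index] = malloc(PAGE_SIZE);
            if (c->frames[index])
                memset(c->frames[index], 0, PAGE_SIZE);
        }
        return c->frames[index];
    }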
Fig. 4. Integration of containers and linkers within the host operating system

C.3 Linkers

Many mechanisms in a kernel rely on the handling of physical pages. Linkers divert these mechanisms to ensure data sharing through containers. To each container are associated one or several high-level linkers, called interface linkers, and a low-level linker, called the input/output linker. The role of interface linkers is to divert the device accesses of system services to containers, while an I/O linker allows a container to access a device manager.

System services are connected to containers thanks to interface linkers. An interface linker changes the interface of a container to make it compatible with the high-level system services interface. This interface must give these services the illusion that they communicate with traditional device managers. Thus, it is possible to "trick" the kernel and to divert device accesses to containers. It is possible to connect several system services to the same container. For instance, it is possible to map a container in the address space of a process P1 on a node A and to access it through a read/write interface within a process P2 on a node B.

During the creation of a new container, an input/output linker is associated to it. The container then stops being a generic object and becomes an object sharing data coming from the device it is linked with. The container is said to have been instantiated. For each semantically different piece of data to share, a new container is created. For instance, a new container is used for each file to share, and a new container for each memory segment to share or to make visible cluster-wide.

Just after its creation, a container is completely empty, i.e. it does not contain any page and no page frame contains data from this container. Page frames are allocated on demand during the first access to a page. Similarly, data can be removed from a container when it is destroyed, or in order to release page frames when the physical memory of the cluster is saturated.
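To make the role of an interface linker concrete, the sketch below (reusing struct container, PAGE_SIZE and container_get_page from the container sketch above) presents a conventional read interface to a system service while diverting the actual accesses to a container. The function name and signature are assumptions, not Kerrighed's API.

    /* Sketch of an interface linker offering a read interface over a
     * container: the service sees ordinary reads, the linker resolves
     * them page by page through the container. */
    #include <stddef.h>
    #include <string.h>

    size_t linker_read(struct container *c, unsigned long off,
                       void *buf, size_t len)
    {
        size_t done = 0;
        while (done < len) {
            unsigned long page = (off + done) / PAGE_SIZE;
            unsigned long in_page = (off + done) % PAGE_SIZE;
            size_t chunk = PAGE_SIZE - in_page;
            char *frame = container_get_page(c, page);
            if (!frame)
                break;                /* frame unavailable: stop early */
            if (chunk > len - done)
                chunk = len - done;
            memcpy((char *)buf + done, frame + in_page, chunk);
            done += chunk;
        }
        return done;
    }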
C.4 Design of Distributed System Services

Containers and linkers are used to implement several cluster-wide OS services. We detail in this section the implementation of the virtual memory sharing and file mapping services. More details on the cooperative file cache can be found in [11].

C.4.a Shared Virtual Memory. The virtual memory sharing service of an OS allows data to be shared between threads, or between processes through a System V segment. A shared virtual memory extends this service to a cluster by allowing several processes or threads running on different nodes to share data through their address spaces. Providing this service requires three properties: (1) data sharing between nodes, (2) coherence of replicated data, and (3) simple access to shared data through processor read/write operations.

The container service ensures the first two properties. The third one is ensured by the mapping interface linker. Thus, mapping a memory container in the virtual address space of several processes via a mapping linker leads to a shared virtual memory.

When a process page fault occurs, the memory map interface linker diverts the fault to containers. The container mechanism places a copy of the page in local memory and ensures the coherence of data. Lastly, the map interface linker maps the page copy in the address space of the faulting process.
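The fault path just described can be sketched as follows, again reusing the container sketch above; map_frame_at() is a stub standing in for the page-table update a real kernel would perform, and all names are hypothetical.

    /* Illustrative page-fault path: the mapping interface linker
     * diverts the fault to the container, which brings a coherent copy
     * of the page into a local frame; the linker then maps that frame
     * at the faulting address. */
    struct vm_fault { unsigned long address; int write; };

    static int map_frame_at(unsigned long addr, void *frame, int writable)
    {
        (void)addr; (void)frame; (void)writable;
        return 0;  /* stub: a real kernel would update the page table */
    }

    int map_linker_fault(struct container *c, struct vm_fault *f)
    {
        /* The container fetches the page from a remote node if needed
         * and, on a write access, invalidates remote copies (the
         * write-invalidation protocol ensuring sequential consistency). */
        void *frame = container_get_page(c, f->address / PAGE_SIZE);
        if (!frame)
            return -1;
        return map_frame_at(f->address, frame, f->write);
    }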
Fig. 5. Kerrighed network stack

The KerNet layer implements the abstractions of dynamic streams and KerNet sockets. It is a distributed service which provides global stream management cluster-wide. In the remainder of this section, we focus on the design and implementation of the KerNet layer and of the Unix socket interface and pipes. The low-level point-to-point communication system is only briefly described.
Fig. 6. Standard environment based on KerNet sockets: (1) stream = create(DIRECT, 2); fd[1] = attach(stream); fd[2] = attach(stream); (2) fork(); (3) each process closes the end it does not use (close(fd2) on one side, close(fd1) on the other; unused ends may be closed); (4) write(fd1, ...) in one process, read(fd2, ...) in the other.
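The scenario of Figure 6 can be transcribed as the pseudo-C below. create() and attach() are the KerNet operations named in the figure; their exact prototypes are not given in the text, so the declarations here are assumptions.

    /* The Fig. 6 scenario as pseudo-C: build a direct stream, attach
     * both ends, fork, and communicate through the stream with the
     * ordinary read/write file interface. */
    #include <unistd.h>

    typedef int kernet_stream_t;               /* hypothetical handle type */
    enum kernet_kind { DIRECT };                /* stream kind, from Fig. 6 */
    kernet_stream_t create(enum kernet_kind, int ends);     /* hypothetical */
    int attach(kernet_stream_t stream);                     /* hypothetical */

    void kernet_pipe_example(void)
    {
        /* (1) create a two-end direct stream and attach both ends */
        kernet_stream_t stream = create(DIRECT, 2);
        int fd1 = attach(stream);
        int fd2 = attach(stream);

        if (fork() == 0) {          /* (2) duplicate the process      */
            close(fd1);             /* (3) child keeps the read end   */
            char c;
            read(fd2, &c, 1);       /* (4) receive through the stream */
            _exit(0);
        }
        close(fd2);                 /* (3) parent keeps the write end */
        write(fd1, "x", 1);         /* (4) send through the stream    */
    }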
Fig. 9. Low-level communications service

Researchers have chosen these problems [10], [8], [15], and we report our preliminary, modest findings in this section. Shearsort seems remarkably suitable for implementation in computing clusters, while the 2D FFT has proven its importance and necessity in such high performance computing systems.

A. Matrix Transposition in a Cluster System

A common denominator to the programming of Shearsort and the 2D-FFT is matrix transposition. Hence, we first consider the problem of transposing an n x n matrix A stored in the cluster such that processor P_i, i = 0, ..., k-1, is given a band of f rows of the matrix starting with row i x f. If at each processor we divide the bands into slices ...

    For i = 0, ..., k-1, do:
        For j = 0, ..., k-1, do:
            P_i sends block B_ij to processor P_j
        endfor
    endfor

    For all i, do in parallel:
        For j = 0, ..., k-1, do:
            P_i sends block B_ij to processor P_j
        endfor
    endfor

    For i = 0, ..., k-1, do:
        For all j, do in parallel:
            P_i sends block B_ij to processor P_j
        endfor
    endfor

    For all i, do in parallel:
        For j = 1, ..., k-1, do:
            P_i sends block B_i,(j+i) mod k to processor P_(j+i) mod k
        endfor
    endfor

Fig. 10. Communications patterns and loops to block-transpose a matrix
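As one possible realization of the skewed pattern, here is a sketch in C with MPI (the MPI realization mentioned later in this section). The block size, datatype and the bookkeeping that packs and unpacks blocks are placeholders.

    /* Sketch of the skewed all-pairs pattern: at step j, processor i
     * sends its block destined for processor (i+j) mod k and receives
     * the block coming from (i-j+k) mod k, so every processor pair is
     * busy at every step and the transpose takes k-1 block exchanges. */
    #include <mpi.h>
    #include <stdlib.h>

    #define B 256                      /* assumed block edge (elements) */

    int main(int argc, char **argv)
    {
        int i, k;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &i);
        MPI_Comm_size(MPI_COMM_WORLD, &k);

        double *sbuf = malloc(B * B * sizeof *sbuf);
        double *rbuf = malloc(B * B * sizeof *rbuf);

        for (int j = 1; j < k; j++) {
            int to   = (i + j) % k;     /* destination of block B[i][to] */
            int from = (i - j + k) % k; /* owner of block B[from][i]     */
            /* ... fill sbuf with block B[i][to] here ... */
            MPI_Sendrecv(sbuf, B * B, MPI_DOUBLE, to,   0,
                         rbuf, B * B, MPI_DOUBLE, from, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            /* ... store rbuf as block B[i][from] of the transpose ... */
        }
        /* block B[i][i] is transposed locally, no communication needed */

        free(sbuf); free(rbuf);
        MPI_Finalize();
        return 0;
    }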
If the cluster network allows simultaneous all-processor-pairs communications, the resulting complexity becomes O(k) block transfers. The skewed technique is currently in use in many readily available cluster 2D-FFT routines (see Martin Siegert's, for example).

On Kerrighed, using the DSM support and a double-buffering technique, a good overlap between communications and computations can be achieved.
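One common way to obtain such an overlap is double buffering with nonblocking receives, sketched below with MPI. The matching nonblocking sends on the other side are elided, and the DSM-based variant actually used on Kerrighed is not shown.

    /* Double-buffering sketch: while the block in buf[cur] is being
     * computed on, the block for the next step is already arriving in
     * buf[1-cur].  The matching MPI_Isend calls are elided. */
    #include <mpi.h>

    #define B 256

    void skewed_recv_overlap(int i, int k, double buf[2][B * B])
    {
        MPI_Request rq;
        int cur = 0;

        /* prefetch the block for step j = 1 */
        MPI_Irecv(buf[cur], B * B, MPI_DOUBLE, (i - 1 + k) % k, 0,
                  MPI_COMM_WORLD, &rq);

        for (int j = 1; j < k; j++) {
            MPI_Wait(&rq, MPI_STATUS_IGNORE);   /* block j has arrived  */
            if (j + 1 < k)                      /* prefetch block j + 1 */
                MPI_Irecv(buf[1 - cur], B * B, MPI_DOUBLE,
                          (i - (j + 1) + k) % k, 0, MPI_COMM_WORLD, &rq);
            /* ... compute on buf[cur]: place it in the transposed matrix */
            cur = 1 - cur;
        }
    }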
B. Shearsort
the one they were originally conceived for. In addition, the data parallelism needs to be combined with efficient single-processor procedures to solve the smaller partitions of the problem.

Many DP algorithms were designed to take advantage of permutation/alignment networks that contributed data-permuting functionality to the communications function. Many algorithms would bog down in a cluster if ported directly from their original fine-grain instantiation.

These ideas were illustrated in this section, and we note that reasonable performance and scalability are obtained if the proper balance is struck between the data parallel algorithm, the single-processor procedure used to solve the parallel subproblems, and the capabilities provided by the cluster interconnection network. We are in the process of evaluating the effects of the DSM and MPI realizations on the performance of the solution of data parallel programs.

of workload.

A prototype of Kerrighed has been built as a set of modules extending the Linux kernel. The kernel itself has only been slightly modified. Kerrighed is available as open source software.

Our current work aims at integrating additional basic mechanisms into Kerrighed and at studying global scheduling policies for different workloads. Another important research direction relates to the design of high availability mechanisms.

In the long term, we plan to integrate new mechanisms into Kerrighed to ease the use and programming of a federation of clusters, each of them running a single system image operating system such as Kerrighed.

REFERENCES

[1] http://openmosix.sourceforge.net/
[2] http://openssi.org/index.shtml
[3] Ramamurthy Badrinath and Christine Morin. Common mechanisms for supporting fault tolerance in DSM and message passing systems. Rapport de recherche 4613, INRIA, November 2002.
[4] Amnon Barak, Shai Guday, and Richard G. Wheeler. The MOSIX Distributed Operating System, Load Balancing for UNIX, volume 672 of Lecture Notes in Computer Science. Springer-Verlag, 1993.
[5] M. J. Feeley, W. E. Morgan, F. H. Pighin, A. R. Karlin, and H. M. Levy. Implementing global memory management in a workstation cluster. In Proc. of the 15th ACM Symp. on Operating Systems Principles (SOSP-15), pages 201-212, December 1995.
[6] Erik Hendriks. BProc: the Beowulf distributed process space. In Proceedings of the 16th International Conference on Supercomputing, pages 129-136. ACM Press, 2002.
[7] Isaac D. Scherson, Sandeep Sen, and Adi Shamir. Shear sort: A true two-dimensional sorting technique for VLSI networks. In Proceedings of the 1986 International Conference on Parallel Processing, pages 903-908, August 1986.
[8] Constantine Katsinis. Merging, sorting and matrix operations on the SOMEbus multiprocessor architecture. Future Generation Computer Systems, Elsevier, September 2003.
[9] Anne-Marie Kermarrec and Christine Morin. Smooth and efficient integration of high-availability in a parallel single level store system. In Proceedings of Euro-Par 2001, August 2001.
[10] Francescomaria Marino and Earl E. Swartzlander, Jr. Parallel implementations of multidimensional transforms without interprocessor communication. IEEE Transactions on Computers, 48(9):951-961, September 1999.
[11] R. Lottiaux and C. Morin. Containers: A sound basis for a true single system image. In Proceedings of the IEEE International Symposium on Cluster Computing and the Grid, pages 66-73, May 2001.
286