
in: Parallel Computation, 4th International ACPC Conference, Salzburg, Austria,
LNCS 1557, pages 549–558, February 1999.


MPI-parallelized Radiance on
SGI CoW and SMP
Roland Koholka, Heinz Mayer, Alois Goller
Institute for Computer Graphics and Vision (ICG), University of Technology,
Münzgrabenstraße 11, A-8010 Graz, Austria
{koholka,mayer,goller}@icg.tu-graz.ac.at,
WWW home page: http://www.icg-tu.graz.ac.at/~Radiance
Abstract. For lighting simulations in architecture there is a need for
correct illumination calculation of virtual scenes. The Radiance Synthetic
Imaging System delivers an excellent solution to that problem. Unfortunately,
simulating complex scenes leads to long computation times even for a single
frame. This paper proposes a parallelization strategy suited for scenes of
medium to high complexity that decreases calculation time. For a set of scenes
the obtained speedup indicates the good performance of the chosen load-balancing
method. The use of MPI delivers a platform-independent solution for clusters of
workstations (CoWs) as well as for shared-memory multiprocessors (SMPs).
1 Introduction
Both ray-tracing and radiosity each solve only one of the two main global
lighting effects, specular and diffuse interreflection. Extending either model
to a complete solution for arbitrary global lighting suffers from long-lasting
calculations even for scenes of medium complexity.
The Radiance software package offers an accurate solution to that problem,
which is described in more detail in section 2. A project comparing indoor
lighting simulations with real photographs shows that for scenes with high
complexity and detailed surface and light source descriptions, simulation can
take hours or even days [7]. Parallelization appears to be an appropriate
method for faster rendering while keeping the level of quality.
In Radiance especially, there is a gap between rapid prototyping, which is
possible with the rview program, and the production of film sequences, which is
best done using queuing systems. Our parallel version of Radiance allows images
of realistic scenes to be generated within a few minutes without any loss of
quality.
Since massively parallel systems are still rare and expensive, we primarily
focus on parallelizing Radiance for architectures commonly used at universities
and in industry, namely CoWs and SMPs. Platform independence and portability
are major advantages of Radiance from the system's point of view. These
requirements are met by basing our code on the Message Passing Interface (MPI).
2 Structure of Radiance
As any rendering system, Radiance tries to offer a solution to the so-called
rendering equation [11] for the global lighting of a computer-generated scene.
The basic intention was to support a complete, accurate solution. One main
difference between Radiance and most other rendering systems is that it can
accurately calculate the luminance of the scene, which is achieved by a
physical definition of light sources and surface properties. For practical use,
the founder of the Radiance Synthetic Imaging System implemented a number of
CAD-file translators for easy import of different geometry data. This forms a
versatile framework for anyone who has to perform physically correct lighting
simulations [6]. Technically, Radiance uses a light-backwards ray-tracing
algorithm with a special extension for global diffuse interreflections. Global
specular lighting is the main contribution of this algorithm and is well
established in computer graphics [1]. The difficulty lies in extending this
algorithm to global diffuse illumination, which is briefly explained in the
next section.
2.1 Calculating Ambient Light
Normally a set of rays distributed over the hemisphere is used to calculate
diffuse interreflections; this is called distributed ray-tracing. One can
imagine that for scenes of considerable complexity, calculating multiple
reflections of such ray sets is very time consuming. Radiance solves this
problem by estimating the gradient of the global diffuse illumination for each
pixel; it then decides whether the current diffuse illumination can be
interpolated from points in the neighborhood or must be calculated from
scratch [5]. Exactly calculated points are cached in an octree for fast
neighborhood search. Performance is further increased because diffuse
interreflections are only calculated at points needed for a particular view,
in contrast to other radiosity algorithms. However, all these optimizations
are not enough to render a complex scene with adequate lighting conditions in
reasonable time on a single workstation. Parallelization seems appropriate for
further reducing execution time.
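
The following C sketch illustrates the lookup-or-compute decision described
above. It is a conceptual illustration under our reading of [5], not the actual
Radiance source; the AmbientSample fields and the helpers ambient_interpolate(),
sample_hemisphere() and ambient_store() are hypothetical.

    /* Conceptual sketch of the ambient cache (not the Radiance sources). */
    typedef struct {
        double pos[3];   /* sample position                                  */
        double dir[3];   /* surface normal at the sample                     */
        double value;    /* cached diffuse irradiance                        */
        double radius;   /* validity radius from the gradient estimate       */
    } AmbientSample;

    int    ambient_interpolate(const double p[3], const double n[3], double *out);
    double sample_hemisphere(const double p[3], const double n[3]);
    void   ambient_store(const double p[3], const double n[3], double value);

    double diffuse_irradiance(const double p[3], const double n[3])
    {
        double e;

        /* try to interpolate from cached samples whose validity covers p */
        if (ambient_interpolate(p, n, &e))
            return e;

        /* otherwise sample the hemisphere from scratch and cache the result */
        e = sample_hemisphere(p, n);
        ambient_store(p, n, e);        /* inserted into the octree cache */
        return e;
    }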
2.2 Existing Parallelization: rpiece
Since version 2.3, an extra program for the parallel execution of Radiance has
been available, called rpiece. To keep installation as simple as possible,
rpiece relies only on standard UNIX mechanisms such as pipes. All data are
exchanged via files, requiring that NFS runs on all computation nodes. While
the presence of NFS is usually no problem, some file lock managers are not
designed for fast and frequent changes of the writing process. Additionally,
remote processes are not initiated automatically but must be started by the
user.
In his release notes, Ward [4] claimed a speedup of 8.2 on 10 workstations. We
can confirm this result only for suitable scenes whose sequential computation
lasts more than one day. However, rpiece performs badly if the parallel
execution does not last at least several tens of minutes. Table 1 shows the
time rpiece needs on a CoW for smaller scenes. Obviously, there is an enormous
overhead in partitioning and splitting the scene as well as in communicating
these pieces via NFS. Consequently, rpiece helps to render very large scenes
in less time, but it does not deliver results quickly enough to wait for them
interactively.
file  rpict (1 proc.)  rpiece (1 proc.)  rpiece (3 proc.)  rpiece (10 proc.)
A         1:49.66          19:02.09          14:22.93           14:48.25
B         5:26.79          17:38.85          12:49.82           14:23.04
C        17:03.23          29:11.56          14:15.85           15:12.59
Table 1. Execution times (min:sec) of rpiece on a cluster of workstations
(CoW), compared to the sequential rpict.
3 Parallelization Strategies
Because of the nature of ray-tracing, these algorithms can easily be
parallelized by dividing the image space into equal blocks. But since the scene
complexity per block cannot be predicted, as it depends on the model itself and
the current viewpoint, we now discuss strategies that are general and deliver a
good speedup in almost all cases.
3.1 The Standard Interface: MPI
Using standards is beneficial in many respects, since they ensure portability
and platform independence. Moreover, standards are well documented and the
software is well maintained. PVM was a de-facto standard prior to the
definition of MPI. We use MPI (Message Passing Interface) since it is now the
standard for communication on parallel architectures and has been widely
accepted.
MPI only defines the interface (API); it is thus up to the implementation how
to handle a specific architecture. Many implementations of MPI exist; the most
popular free versions are MPICH from Argonne National Laboratory [10] and LAM,
which was developed at the Ohio Supercomputer Center and is currently housed at
the Laboratory for Scientific Computing at the University of Notre Dame,
Indiana [9]. There are also several commercial implementations, such as the one
from SGI, specially adapted and optimized to run on their PowerChallenge and
Origin shared-memory machines. We use MPICH for both the CoW and the SMP, and
additionally SGI's own version. MPICH proved easy to install, to support many
different architectures and communication devices, and to run stably on all
tested platforms.
Another possibility would have been to use shared-memory communication
directly. However, this is only applicable to SMPs, and no standard for virtual
shared memory exists that is as widely accepted as MPI. Since Radiance is a
software package designed to run on nearly any platform, not using a common
means of communication would be a drawback. Thus, we refrained from using
shared-memory calls, although they might have been beneficial for a more
efficient distribution of the calculated ambient light values and for holding
the scene description. The upcoming extension to MPI, as defined in the MPI-2
standard, seems to solve these insufficiencies. We look forward to a stable and
publicly available implementation of MPI-2.
3.2 Data Decomposition
There are two strategies to implement parallel ray-tracing. One way is to
divide the scene into parts, introducing virtual walls [8]. Due to difficulties
in load balancing and the need for very high-bandwidth communication channels,
virtual walls do not seem appropriate for the architectures on which we want to
run Radiance. The second strategy is to divide the frame, implementing the
manager/worker paradigm [2]. The manager coordinates the execution of all
workers and determines how to split the frame. The workers perform the actual
rendering and send the results to the manager, who combines the pieces and
writes the output frame to disk. Since a global shared memory cannot be assumed
on every architecture, each worker must hold its own copy of the scene. This
requires a lot of memory and also slows down initialization, since every single
worker has to read the octree file from disk.
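
As an illustration of this scheme, the sketch below shows how the worker side
might look with MPI. It is a simplified outline, not our actual sources: rank 0
is assumed to be the manager, and load_octree(), render_lines(),
send_scanlines() and manage_workers() are hypothetical helpers.

    #include <mpi.h>

    #define TAG_JOB 1                  /* start/end line of a block */

    void load_octree(const char *file);        /* hypothetical helpers */
    void render_lines(int first, int last);
    void send_scanlines(int first, int last);
    void manage_workers(void);

    int main(int argc, char **argv)
    {
        int rank, lines[2];
        MPI_Status st;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            manage_workers();                  /* manager: see section 3.3 */
        } else {
            load_octree("scene.oct");          /* every worker reads the scene */
            for (;;) {
                MPI_Recv(lines, 2, MPI_INT, 0, TAG_JOB, MPI_COMM_WORLD, &st);
                if (lines[0] > lines[1])       /* empty block: terminate */
                    break;
                render_lines(lines[0], lines[1]);    /* the actual ray tracing */
                send_scanlines(lines[0], lines[1]);  /* results back to rank 0 */
            }
        }
        MPI_Finalize();
        return 0;
    }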
3.3 Concurrency Control
Since Radiance calculates a picture line by line, the frame is divided into
blocks of successive lines. The manager only has to transmit the start and end
line of each block. Most of the time the manager waits for data. To avoid
busy-waiting, the manager is suspended for one second each time the test for
finished blocks is negative, as illustrated in figure 1. This way the manager
needs less than one percent of the CPU time, and it is therefore possible to
run the manager and one worker on the same processor. Suspending the manager
for one second causes a nondeterministic behavior that results in slight
variations of the execution time.
One feature of Radiance is that it can handle diffuse interreflection. These
values must be distributed to every worker to avoid needless recalculation.
Broadcasting every newly calculated value would cause too much communication
(O(n²), n being the number of workers). To reduce communication traffic, only
blocks of 50 values are delivered. This communication must be non-blocking,
because these blocks are only received by a worker when it enters the
subroutine for calculating diffuse interreflection.
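
A minimal sketch of this behavior is given below. It assumes worker ranks
1..nworkers; the tag names, block size constant and the helpers
all_blocks_done(), receive_block() and assign_next_block() are illustrative,
not the actual implementation.

    #include <mpi.h>
    #include <unistd.h>

    #define TAG_RESULT  2        /* finished scanlines from a worker */
    #define TAG_AMBIENT 3        /* block of cached ambient values   */
    #define AMB_BLOCK   50       /* ambient values per block         */

    int  all_blocks_done(void);          /* hypothetical bookkeeping helpers */
    void receive_block(int worker);
    void assign_next_block(int worker);

    /* Manager side: probe for finished blocks and sleep when idle. */
    void manager_loop(void)
    {
        int flag;
        MPI_Status st;

        while (!all_blocks_done()) {
            MPI_Iprobe(MPI_ANY_SOURCE, TAG_RESULT, MPI_COMM_WORLD, &flag, &st);
            if (!flag) {
                sleep(1);                /* suspend instead of busy-waiting */
                continue;
            }
            receive_block(st.MPI_SOURCE);      /* collect the finished lines   */
            assign_next_block(st.MPI_SOURCE);  /* hand out a new, smaller block */
        }
    }

    /* Worker side: once 50 new ambient values have accumulated, push them to
     * all other workers with non-blocking sends, so the sender can continue
     * tracing rays immediately. */
    void flush_ambient(double *vals, int me, int nworkers, MPI_Request *req)
    {
        int w, r = 0;
        for (w = 1; w <= nworkers; w++)        /* worker ranks are 1..nworkers */
            if (w != me)
                MPI_Isend(vals, AMB_BLOCK, MPI_DOUBLE, w, TAG_AMBIENT,
                          MPI_COMM_WORLD, &req[r++]);
    }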
3.4 Load Balancing
The time it takes to calculate a pixel differs for every cast ray, depending on
the complexity of the intersections that occur. Therefore, load-balancing
problems arise when distributing large parts. Radiance uses an advanced ray-tracing
algorithm that uses information from neighboring rays to improve speed. Thus,
rays are no longer independent, and scattering the frame into small parts would
solve the load-balancing problem but would also reduce the profit gained from
the advanced algorithm.

[Fig. 1. Dynamic load balancing strategy (flowchart): initially, blocks of
length BL = EL/(3*N) are sent to all workers, leaving RL = EL*2/3 remaining
lines (N = number of workers, BL = block length, RL = remaining lines,
EL = end line). Whenever a task finishes, a new job of length BL = RL/(3*(N+1))
is assigned and RL is reduced by BL; when RL = 0, the manager waits for all
jobs, saves the image and exits.]
A compromise that meets these two conflicting requirements is to combine
dynamic partitioning and dynamic load balancing as stated in [3]. At the
beginning of the calculation, relatively large parts of the frame (see
figure 1) are sent to the workers to take advantage of the fast algorithm. The
distributed portions of the frame become smaller as the calculation progresses.
Near the end the tasks are so small that they can be executed quickly enough to
achieve a good load balance.
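
The block-length rule of figure 1 can be written down directly. The following
sketch illustrates that rule; Schedule, initial_block() and next_block() are
our own names, not the code actually used.

    /* Shrinking block-length rule from figure 1: EL is the last image line,
     * N the number of workers. Initial blocks are large to profit from ray
     * coherence; later blocks shrink so that all workers finish at roughly
     * the same time. */
    typedef struct {
        int RL;      /* remaining lines   */
        int N;       /* number of workers */
    } Schedule;

    /* called once: every worker gets an initial block of EL/(3*N) lines */
    int initial_block(Schedule *s, int EL, int N)
    {
        s->N  = N;
        s->RL = EL - N * (EL / (3 * N));   /* roughly 2/3 of the image remains */
        return EL / (3 * N);
    }

    /* called whenever a worker reports a finished block */
    int next_block(Schedule *s)
    {
        int BL = s->RL / (3 * (s->N + 1));
        if (BL < 1)                        /* hand out the last lines at once */
            BL = s->RL;
        s->RL -= BL;
        return BL;                         /* 0 means: no work left */
    }

With EL = 1352 and N = 10, for example, each worker initially receives 45
lines, and the first dynamically assigned block contains 27 lines.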
4 Computing Platforms
We evaluated performance on computers accessible at our department and at
partner institutes. We first concentrated on two different but common
architectures.
The first architecture is a cluster of workstations (CoW). It was built at our
institute by connecting 11 SGI O2 workstations (see table 2) via a 10 Mbit
Ethernet network. All workstations are equally powerful except machines sg50
and sg51, which are about 10% faster. They are only used for experiments with
10 and 11 computing nodes.
The second platform is an SGI PowerChallenge containing 20 processors. Each
R10000 processor is clocked at 194 MHz and equipped with primary data and
instruction caches of 32 Kbytes each and a 2 Mbyte secondary cache
(p:I32k+D32k, s:2M). All 20 processors share an 8-way interleaved, global main
memory of 2.5 GByte.
Name CPU-clock RAM Processor Cache
[MHz] [Mbyte] primary, secondary
sg41 150 128 R10000 p:I32k+D32k, s:1M
sg42 . . . sg49 180 128 R5000 p:I32k+D32k, s:512k
sg50, sg51 200 128 R5000 p:I32k+D32k, s:1M
Table 2. Workstations used for the cluster experiments at ICG.
5 Test Environment
We selected a set of scenes with completely different requirements for the
rendering algorithm and its parallelization: one very simple, unrealistic scene
(spheres) with a poor distribution of scene complexity over the image space, an
indoor (hall) and an outdoor lighting simulation (townhouse). These three
scenes come with the Radiance distribution. The most important test scene is a
detailed model of an office at our department with a physical description of
its light sources (bureau). All scenes and the corresponding parameter settings
are listed in table 3; the corresponding pictures are shown in figure 2.
name octree size image size options comment
hall0 2212157 -x 2000 -y 1352 -ab 0 large hall (indoor simulation)
spheres0 251 -x 1280 -y 960 -ab 0 3 spheres, pure ray-tracing
spheres2 251 -x 1280 -y 960 -ab 2 3 spheres, ambient lighting
bureau0 5221329 -x 512 -y 512 -ab 0 bureau room at ICG
bureau1 5221329 -x 512 -y 512 -ab 1 + ambient lighting
bureau2 5221329 -x 512 -y 512 -ab 2 + increased ambient lighting
townhouse0 3327443 -x 2000 -y 1352 -ab 0 townhouse (outdoor simulation)
townhouse1 3327443 -x 2000 -y 1352 -ab 1 + ambient lighting
Table 3. Used scenes for evaluation of the MPI-parallel version of Radiance.
This test set covers a wide spectrum in octree size, scene complexity,
rendering time and visual impression. All scenes were computed with and without
diffuse interreflection. Table 4 summarizes the times measured for the
sequential version.
Due to the manager/worker paradigm, x processes result in x-1 workers.
Consequently, whenever we mention n processors (nodes), we start n+1 processes,
one of which is the manager. Since the manager is idle nearly all the time, we
do not count it for the speedup on the SMP. This is acceptable since we know
from tests on the CoW, where the manager and one worker run on the same
workstation, that the result is distorted by less than 1%.
[Fig. 2. The 4 scenes used for evaluation. A: Hall (clipped); B: 3 spheres;
C: ICG room; D: Townhouse.]

name         time on sg48   time on SMP
hall0          1:19:13.48      33:50.34
spheres0            40.89         19.27
spheres2          8:27:00       3:10.11
bureau0        2:32:55.84    1:14:42.73
bureau1        3:56:03.65    1:54:23.19
bureau2        5:13:30.72    2:29:16.35
townhouse0        5:17.98       2:40.64
townhouse1        9:21.82       3:55.07
Table 4. Execution times of the sequential Radiance for the test scenes.

On very small scenes no big advantage will be seen. However, an (almost) linear
speedup should be observed for larger scenes if no diffuse interreflections are
to be calculated (-ab 0). Due to the heavy all-to-all communication needed to
distribute newly calculated ambient lighting values, performance decreases on
such runs (-ab 2). This effect mainly influences the performance on the CoW,
since network bandwidth is very limited; it should show nearly no effect on the
SMP.
6 Results
The times for sequential execution T(1) are already shown in table 4. For a
good comparison of the platforms and to visualize the performance, we use the
speedup S(n) := T(1)/T(n) and the efficiency ε(n) := S(n)/n, where n is the
number of processors.
[Fig. 3. Performance on the cluster of workstations (CoW). A: Speedup;
B: Efficiency; both plotted over 1-11 workers for the scenes hall0, sphere0,
sphere2, bureau0, bureau1, bureau2, town0 and town1.]
Although one node of the cluster is only half as fast as one SMP processor, all
larger scenes show good parallel performance. Utilizing 11 processors, the time
for rendering the bureau scene without ambient lighting decreases from
2:32:55.84 to 15:14.35, giving a speedup of 10. Even more processors could be
utilized, since the efficiency always exceeds 85%, as shown in figure 3.
The four smaller scenes illustrate that the speedup cannot be increased further
once the time for parallel execution falls below about one minute. This is
mainly caused by the remote shells MPICH has to open, in combination with the
(slow) 10-Mbit Ethernet network. However, MPICH ran stably during all tests.
Furthermore, the chosen load-balancing strategy appears to be well suited even
for the CoW, since no communication bottlenecks occurred during the all-to-all
communication when calculating diffuse interreflection.
Comparing figures 3 and 4, one sees no big difference. The four smaller scenes
are rendered a little more efficiently on the SMP, but above 13 workers,
effects of the all-to-all communication decrease performance. Changing the
communication pattern to a log(n) broadcast, instead of communicating
sequentially with every other node, would reduce this effect; a possible
pattern is sketched below. Improving this communication pattern will be a
near-future goal.
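
One possible sketch of such a pattern is the binary forwarding tree below. It
assumes worker indices 0..nworkers-1 with the manager excluded, and the
function and tag names are illustrative; in the real scheme non-blocking sends
would again be used.

    #include <mpi.h>

    #define TAG_AMBIENT 3
    #define AMB_BLOCK   50

    /* Worker ranks are renumbered relative to the originating worker; each
     * worker then forwards a received block to at most two children, so a
     * block reaches all n workers in O(log n) steps instead of n-1
     * sequential sends. */
    void forward_ambient(double *vals, int me, int origin, int nworkers)
    {
        int rel = (me - origin + nworkers) % nworkers;  /* position in the tree */
        int k;

        for (k = 1; k <= 2; k++) {
            int child = 2 * rel + k;                    /* children in the tree */
            if (child < nworkers) {
                int dest = (child + origin) % nworkers; /* back to a real rank  */
                MPI_Send(vals, AMB_BLOCK, MPI_DOUBLE, dest,
                         TAG_AMBIENT, MPI_COMM_WORLD);
            }
        }
    }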
[Fig. 4. Performance on the PowerChallenge using MPICH. A: Speedup;
B: Efficiency; both plotted over 1-20 workers for the same scenes as in
figure 3.]

Most surprisingly, execution time could not be pushed much below two minutes,
even when enough processors were available. In our experience, the sequential
overhead of starting the parallel tasks and minor load imbalances prevent
scaling execution time down to seconds, even on a shared-memory machine. As on
the cluster, MPICH also worked well on this architecture.
[Fig. 5. Performance on the PowerChallenge using SGI's native MPI. A: Speedup;
B: Efficiency; both plotted over 1-20 workers for the same scenes as in
figure 3.]
Initially we expected the native MPI version from SGI to be much faster than
MPICH. However, as figure 5 illustrates, only a few scenes could be rendered
with an arbitrary number of processors; the number of open communication
channels seems to be the problem. Moreover, we found no performance advantage
when using the native MPI.
7 Conclusion and Future Work
Our main goal was to provide a parallelized version of the Radiance Synthetic
Imaging System for scenes with medium to high complexity, closing the gap
between the rview and rpict programs. Another prerequisite was the platform
independence of the overall system. In section 6 we show that our
implementation works well for most of the selected scenes, running on an SMP as
well as on a CoW.
The near-linear speedup indicates good load balancing in these cases. MPI helps
us remain platform independent, but this was only tested on SGI CoWs and SMPs.
Testing on other platforms will be a topic for future work, as will improving
data I/O during the initialization phase. Broadcasting the octree over the
network will probably eliminate the NFS bottleneck during the scene-reading
phase of all workers. Another important point for further investigation is to
evaluate the time spent on diffuse interreflection calculation compared to the
communication traffic it causes. We expect that these investigations will lead
to a more sophisticated load balancing.
Acknowledgments
This work is financed by the Austrian Research Funds "Fonds zur Förderung der
wissenschaftlichen Forschung" (FWF), Project 7001, Task 1.4 "Mathematical and
Algorithmic Tools for Digital Image Processing". We also thank the Research
Institute for Software Technology (RIST++), University of Salzburg, for
allowing us to run our code on their 20-node PowerChallenge.
References
1. Watt A. and Watt M. Advanced Animation and Rendering Techniques, Theory and
   Practice. ACM Press New York, Addison-Wesley, 1992.
2. Zomaya A. Y. H., editor. Parallel and Distributed Computing Handbook, chapter
   9, Partitioning and Scheduling. McGraw-Hill, 1996.
3. Pandzic I.-S. and Magnenat-Thalmann N. Parallel raytracing on the IBM SP2 and
   T3D. In Supercomputing Review, volume 7, November 1995.
4. Ward G. J. Parallel rendering on the ICSD SPARC-10s. Radiance Reference
   Notes, http://radsite.lbl.gov/radiance/refer/Notes/parallel.html.
5. Ward G. J. A ray tracing solution for diffuse interreflection. In Computer
   Graphics (SIGGRAPH '88 Proceedings), volume 22, pages 85–92, August 1988.
6. Ward G. J. The RADIANCE lighting simulation and rendering system. In Computer
   Graphics (SIGGRAPH '94 Proceedings), volume 28, pages 459–472, July 1994.
7. Karner K. Assessing the Realism of Local and Global Illumination Models. PhD
   thesis, Computer Graphics and Vision (ICG), Graz University of Technology,
   1996.
8. Menzel K. Parallel Rendering Techniques for Multiprocessor Systems. In
   Proceedings of the Spring School on Computer Graphics (SSCG '94), pages
   91–103. Comenius University Press, 1994.
9. LAM/MPI parallel computing. Home page of LAM (Local Area Multicomputer):
   http://www.lsc.nd.edu/lam/.
10. MPICH: a portable implementation of MPI. Home page of MPICH:
    http://www.mcs.anl.gov/mpi/mpich/.
11. Kajiya J. T. The rendering equation. In Computer Graphics (SIGGRAPH '86
    Proceedings), volume 20, pages 143–150, August 1986.
