Abstract. Converting a problem to run on a distributed system is not trivial and often involves many trade-offs. The Java
API provides cross-platform access to four distributed-computing communication mechanisms: RMI with the Serializable or
Externalizable interface, the socket, and the datagram socket. Matrix multiplication is used as an experimental framework to
present a performance analysis and insights toward an understanding of the trade-offs encountered when adapting an
algorithm to a distributed system.
Keywords: Java, Distributed Computing, RMI, Socket, Datagram, Serializable, Externalizable.
1. Introduction
The purpose of this paper is to provide a detailed performance and programming complexity analysis of four Java network
communication mechanisms. These four mechanisms are: RMI using the Serializable interface and the Externalizable
interface [1][5], the socket, and datagram socket [2]. To help illuminate the trade-offs involved in finding the most efficient
communication mechanism, two parallel matrix multiplication algorithms are implemented: a simple parallel algorithm
(SPA), and a modified Cannon’s algorithm (MCA) [14].
Matrix multiplication was utilized as the test application because it is commonly used and suitable for evaluating the
performance of each communication mechanism. Instead of trying to increase Java’s distributed performance through
external packages, implementations, etc. [7], this paper considers the performance of the existing Java communication
mechanisms. Included is an examination of the Externalizable interface, which eliminates the issue of generic marshalling
and demarshalling. Understanding these trade-offs is important to programmers who wish to obtain the best performance
without giving up the portability, ease of deployment, and other advantages offered by the Java environment.
Section 2 continues with an explanation of the experimental setup. Results and analysis are given in section 3, followed
by a discussion and conclusion in section 4.
2. Experiment Implementation
Experiments were performed using five identical computers connected to a 100 Mbps LAN (the specifications are given in
the following subsection). One computer acted as the manager process, while the remaining four computers performed as
worker processes (k = 4). The manager process was responsible for partitioning the matrices, transmission of the
submatrices, reception of the results, and timekeeping duties. The worker processes performed all of the actual
calculations. The data matrices used in this experiment contained randomly generated integers with values ranging from 0
to 20. The size of the matrices multiplied ranged from 100 × 100 to 1000 × 1000 in increments of 100. The data reported
in this paper represent an average from five trial runs.
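As an illustrative sketch (the paper does not show this code), the manager's partitioning step for the simple parallel algorithm might divide the rows of one input matrix among the workers; the row-block scheme, the variable names, and the Worker.send() helper below are assumptions.

// Hypothetical sketch of the manager's partitioning step for SPA.
void partitionAndSend(int[][] a, int[][] b, Worker[] workers) {
    int n = a.length;                  // a and b are the n x n input matrices
    int k = workers.length;            // k = 4 in this experiment
    int rowsPerWorker = n / k;
    for (int w = 0; w < k; w++) {
        int firstRow = w * rowsPerWorker;
        int lastRow = (w == k - 1) ? n : firstRow + rowsPerWorker;
        // worker w multiplies its row block of A by all of B and returns
        // the corresponding rows of the product C
        workers[w].send(firstRow, lastRow, a, b);
    }
}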
2.1 Settings
For the socket implementation, buffered input and output streams were used. For the datagram socket implementation, a
packet size had to be determined. Packet size has an important impact on system efficiency: smaller packet sizes
allow for improved pipelining, while larger packet sizes incur lower header overhead. Because the maximum number
of hops was limited to one due to our LAN topology, a larger packet size was more advantageous. Since the maximum
packet size allowed by the Internet Protocol [6] is 64 KB, a data packet size of 63 KB (allowing for overhead) was chosen
for this experiment. The 63 KB value specifies only an upper bound: if the size of the data to be transmitted is 5 KB, then the
packet size will be 5 KB; if the size of the data is 75 KB, then a 63 KB and a 12 KB packet will be sent.
In terms of the computer specifications, each computer had an Intel Pentium 4 clocked at 1.7 GHz, 512 MB of RAM,
Microsoft Windows XP Professional, and Java version 1.4.2. In addition, like many computers connected to a LAN, these
five computers had typical programs running in the background, such as antivirus and client software; these background
programs were not disabled during the experiment. The manager process computer used an initial heap size of 64 MB and a
maximum heap size of 96 MB. The four worker process computers had their initial and maximum heap sizes set to 64 MB.
RMI using the Serializable interface was the simplest of the four mechanisms to implement. As Figure 1 shows, one line
of code is all that is required to send the necessary data to a worker process.

theManager.dataNodeArray[index].getWorkerInterface().requestMUL(theMULNode);

Figure 1: Invoking a Remote Method via RMI

The first portion, theManager.dataNodeArray[index].getWorkerInterface(), specifies the particular worker process to
communicate with. Next, requestMUL(theMULNode) identifies the specific remote method to be invoked. In this case, the
remote method is a multiplication request with one argument of type MultiplyDataNode.
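For context, the remote interface behind this call would be declared roughly as follows; only the names WorkerInterface, requestMUL, and MultiplyDataNode appear in the paper, so the exact signature is an assumption. With the Serializable interface, MultiplyDataNode simply declares implements java.io.Serializable and the Java runtime performs the marshalling automatically.

import java.rmi.Remote;
import java.rmi.RemoteException;

// Sketch of the remote interface implied by Figure 1; the exact
// signature is an assumption based on the call shown above.
public interface WorkerInterface extends Remote {
    void requestMUL(MultiplyDataNode theMULNode) throws RemoteException;
}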
RMI using the Externalizable interface is only slightly more difficult to implement than the Serializable interface. The
added complexity results from the burden placed on the programmer to implement the readExternal() and writeExternal()
methods. Indeed, Figure 1 is also the code used to invoke a remote method with externalization; the only difference lies in
the programmer's responsibility to code the marshalling and demarshalling algorithms within the readExternal() and
writeExternal() methods.
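As a minimal sketch of that responsibility, assuming a MultiplyDataNode that carries a process number and a flattened submatrix (the paper does not list the actual fields):

import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;

// Hand-coded marshalling and demarshalling; the fields are assumptions.
public class MultiplyDataNode implements Externalizable {
    private int processNum;    // worker record-keeping id
    private int[] subMatrix;   // flattened submatrix data

    public MultiplyDataNode() { }    // public no-arg constructor is required

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeInt(processNum);
        out.writeInt(subMatrix.length);
        for (int i = 0; i < subMatrix.length; i++) {
            out.writeInt(subMatrix[i]);   // marshal each element by hand
        }
    }

    public void readExternal(ObjectInput in) throws IOException {
        processNum = in.readInt();
        subMatrix = new int[in.readInt()];
        for (int i = 0; i < subMatrix.length; i++) {
            subMatrix[i] = in.readInt();  // demarshal in the same order
        }
    }
}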
Epitomizing low-level communication, the socket was the third most difficult mechanism to work with. Dealing with
sockets requires multiple buffered I/O (input/output) streams, IP addresses, and port numbers. As with RMI using the
Externalizable interface, the programmer must convert object instances to and from a stream of bytes. However, sockets do
not directly support remote method invocation, requiring extra work to encode and decode remote method requests. Figure 2
illustrates the socket implementation.
The first line of the code creates a stream socket and connects it to the specified port number at the specified IP address.
The next four lines of code create buffered input and output streams. After the I/O streams have been initialized, the
process number and job command are written to the socket. The process number is used for record keeping purposes. The
job command substitutes for the lack of RMI and provides a code directing the worker process to execute a particular
method. The last line of code writes the object, of type MultiplyDataNode, to the socket. This write method is nearly
identical to the writeExternal() method discussed in the previous section; the only distinctions are a different argument type
and the inclusion of a flush command.
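A plausible reconstruction of the manager-side steps just described is sketched below; the host name, port number, job-command encoding, and the write() helper on MultiplyDataNode are assumptions rather than code from the paper.

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.net.Socket;

// Manager-side socket transmission, following the description of Figure 2.
public class SocketSender {
    static final int JOB_MULTIPLY = 1;   // assumed job-command encoding

    void sendJob(String workerHost, int port, int processNum,
                 MultiplyDataNode theMULNode) throws IOException {
        Socket socket = new Socket(workerHost, port);  // stream socket to worker
        DataInputStream in = new DataInputStream(
                new BufferedInputStream(socket.getInputStream()));
        DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(socket.getOutputStream()));
        out.writeInt(processNum);     // process number, for record keeping
        out.writeInt(JOB_MULTIPLY);   // job command replaces the RMI dispatch
        theMULNode.write(out);        // hand-coded marshalling, as in writeExternal()
        out.flush();                  // push the buffered bytes onto the wire
    }
}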
The datagram socket was the most difficult communication mechanism to implement, since object instances needed to
be converted to and from a stream of bytes and also assembled into packets. At the receiving end, packets must be
reassembled in the proper order to maintain data integrity.
The datagram socket uses a connectionless protocol, which by definition is unreliable. Thus to ensure reliability, some
form of packet synchronization is required. Introducing packet synchronization involves the design and coordination of
multithreaded processes to read packets as they become available and deal with timeout issues. Figure 3 highlights the
difficulty involved with the datagram socket implementation.
setupDataTransmission();
theMULNode.writeExternal(theManager.out[clientIndex]);
sendData();

Figure 3: Transmitting Data over the Datagram Socket
The first method, setupDataTransmission(), creates an output stream in which data may be written into a byte array;
the byte array automatically grows as data is written to it. The next line of code writes the object, of type
MultiplyDataNode, to the newly created byte array. As before, this method is very similar to the previous write methods.
The challenge of implementing the datagram socket lies in the sendData() method, shown in Figure 4.
limit = totalPackets;
offset = 0;
try {
    while (limit > 0) {
        theManager.packetNum = getNextPacketNumber(theManager.packetNum);
        bufferSize = getBufferSize(limit);
        packetOutBuffer = createPacketBuffer(offset, bufferSize,
                theManager.oStream[clientIndex].toByteArray(),
                theManager.packetNum);
        // ... the packet is then transmitted, a PacketTimer thread is
        // started, and controlPacketReceived.p() blocks until the worker's
        // acknowledgment arrives (see the discussion below) ...
        offset += bufferSize;
        limit--;
    }
} catch (Exception e) {
    e.printStackTrace();
}

Figure 4: The sendData() Method
The first step is to calculate the total number of packets needed to transfer the required data. Once inside the while loop,
a single packet is created and sent during each iteration. When a worker process receives a data packet, it sends an
acknowledgment. Note that, at this point, another thread is running whose sole purpose is to read acknowledgments from
the receiving worker process. After an individual packet is transmitted, a new thread of type PacketTimer is created. This
thread acts as a countdown timer: each time it reaches zero, a duplicate packet is transmitted, the timer is reset, and the
countdown starts again.
The variable controlPacketReceived is a semaphore with an initial value of zero. The line controlPacketReceived.p()
is a wait operation; therefore, this process halts and waits until the listening thread receives an acknowledgment. Once an
acknowledgment is received, a signal operation is performed on the semaphore, and the halted process continues execution.
At this point the timer thread is destroyed and, if there are more packets to send, execution of the next iteration begins.
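Java 1.4 predates java.util.concurrent, so the semaphore was presumably hand-rolled; the sketch below shows a classic wait/notify implementation of the p() and v() operations and is an assumption about the actual class used.

// Minimal counting semaphore with p() (wait) and v() (signal) operations.
class CountingSemaphore {
    private int value;

    CountingSemaphore(int initial) {
        value = initial;
    }

    synchronized void p() throws InterruptedException {
        while (value == 0) {
            wait();       // block until v() signals an acknowledgment
        }
        value--;
    }

    synchronized void v() {
        value++;
        notify();         // wake one thread blocked in p()
    }
}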
Looking at Figure 4 again, the method getTotalPacketNum(int) examines the data to be transmitted and calculates the
total number of packets needed. The method getBufferSize(int) determines the correct packet size from the number of
packets remaining. If only one packet is needed, the packet size matches the size of the data to be sent; if more than one
packet is needed, a 63 KB packet size is chosen for the current iteration.
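These two helpers can be sketched as follows; the 63 KB limit and the described behavior come from the paper, while the method bodies and the extra bytesRemaining parameter are assumptions made to keep the sketch self-contained.

// Packet-count and packet-size helpers matching the rule described above.
static final int MAX_PACKET_DATA = 63 * 1024;   // 63 KB payload upper bound

static int getTotalPacketNum(int totalBytes) {
    // ceiling division: e.g. 75 KB of data -> a 63 KB and a 12 KB packet
    return (totalBytes + MAX_PACKET_DATA - 1) / MAX_PACKET_DATA;
}

static int getBufferSize(int packetsRemaining, int bytesRemaining) {
    if (packetsRemaining > 1) {
        return MAX_PACKET_DATA;   // full 63 KB packet this iteration
    }
    return bytesRemaining;        // final packet matches the remaining data
}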
[Figure: execution time (seconds, 0 to 12) versus matrix dimension, 100 × 100 to 1000 × 1000.]
3.5 Speedup
As mentioned earlier, one advantage of distributed computing is computational speedup. Here, to help gauge the relative
performance of the four Java communication mechanisms, the sequential algorithm used for speedup calculations is the
basic matrix multiplication algorithm with complexity Θ(n³).
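Speedup is computed in the conventional way; the paper does not spell the formula out, so the standard definition is assumed:

S(n) = T_seq(n) / T_par(n)

where T_seq(n) is the run time of the sequential Θ(n³) algorithm and T_par(n) is the run time of the distributed implementation. With k = 4 worker processes, the ideal speedup is 4.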
Figure 7 illustrates the speedup results achieved for the two parallel matrix multiplication algorithms using sockets.
Only the socket speedups were compared, since sockets achieved the best performance. For the simple parallel algorithm, the
computation speedup approaches 4 as the size of the matrices increases. This is expected, since there were 4 worker
processes performing the calculations. Unfortunately, ideal computation speedup is rarely obtained due to several factors:
data dependencies, input/output bottlenecks, synchronization overhead, and inter-processor communication overhead [3],
[4]. However, the modified Cannon's algorithm achieves a computation speedup actually exceeding the ideal case. This
phenomenon is referred to as superlinear speedup [14] and occurs when the sequential algorithm has overheads greater than
those of its parallel implementation.
Recall that the modified Cannon's algorithm had an interesting performance characteristic due to its more memory-efficient
design. At matrix sizes of 600 × 600 and smaller, all the SPA implementations outperform their corresponding MCA
implementations (refer to Table 1), but at 700 × 700 and larger the roles are reversed, with the MCA implementations
offering better performance. As the size of the matrices approaches and passes 700 × 700, the SPA implementations appear
to suffer from memory-related performance problems. Whether these problems are a result of page faults, garbage
collection, or some other cause is not clear. The surprising observation of superlinear speedup highlights another advantage
of distributed computing: access to a larger pool of memory.
[Figure 7: Speedup (0 to 12) versus matrix dimension, 100 × 100 to 1000 × 1000.]
REFERENCES
[1] J. Maassen, R. V. Nieuwpoort, R. Veldema, H. Bal, and A. Plaat, “An Efficient Implementation of Java's Remote
Method Invocation,” Proceedings of the Seventh ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, Atlanta, GA, 1999, pp. 173-182.
[2] F. Breg, and C. D. Polychronopoulos, “Java Virtual Machine Support for Object Serialization,” Proceedings of the 2001
joint ACM-ISCOPE Conference on Java Grande, Palo Alto, CA, 2001, pp. 173-180.
[3] M. R. Zargham, Computer Architecture: Single and Parallel Systems, Prentice Hall, Upper Saddle River, NJ, 1995, pp.
300-303.
[4] H. Sheil, “Distributed Scientific Computing in Java: Observations and Recommendations,” Proceedings of the 2nd
International Conference on Principles and Practice of Programming in Java, Kilkenny City, Ireland, 2003, pp. 219-222.
[5] J. Maassen, R. V. Nieuwpoort, R. Veldema, H. Bal, T. Kielmann, C. Jacobs, and R. Hofman, “Efficient Java RMI for
Parallel Programming,” ACM Transactions on Programming Languages and Systems, New York, NY, 2001, pp. 747-775.
[6] W. Grosso, Java RMI, O’Reilly & Associates, Sebastopol, CA, 2002, pp. 179-198.
[7] J. Maassen, R. V. Nieuwpoort, R. Veldema, H. Bal, and T. Kielmann, “Wide-Area Parallel Computing in Java,”
Proceedings of the ACM 1999 Conference on Java Grande, San Francisco, CA, 1999, pp. 8-14.
[8] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing (Second Edition), Pearson
Education Limited, Harlow, England, 2003, pp. 345-349.
[9] D. Flanagan, Java in a Nutshell (Third Edition), O’Reilly & Associates Inc., Sebastopol, CA, 1999, pp. 74-93.
[10] D. A. Curry, Unix Systems Programming for SVR4, O’Reilly & Associates, Sebastopol, CA, 1996, pp. 391-396.