
Parallel Collision Search

Making money the old-fashioned way: the NOW as a cash cow

David Wagner and Ian Goldberg
University of California, Berkeley
{daw,iang}@cs.berkeley.edu

CS 267 Final Project
May 23, 1997

Abstract

Parallel collision search is the hard problem underlying a number of interesting applications in electronic commerce, graph theory, and codebreaking. Several algorithms, including one by Rivest and Shamir and another by van Oorschot and Wiener, have been proposed to solve this problem. In this paper, we present enhancements to these algorithms to achieve practical efficiency and near-perfect scalability.

1 Introduction

An electronic payment system provides the ability for banks to produce "tokens" of data, each of which represents a small amount of money (typically on the order of cents or fractions of cents). These tokens can be sold to customers, who can spend them at shops over the Internet. The shops then redeem these tokens at the bank to receive their payment. One key factor in any micropayment system is how to prevent customers from "minting" their own tokens.

Whereas a number of electronic payment systems [1, 3, 9] solve this problem by using (somewhat heavyweight) digital signatures on tokens, the MicroMint electronic payment system proposed by Rivest and Shamir in [8] utilizes the collision search problem to allow the bank to mint tokens while making it infeasible for customers to do the same.

A high-performance solution to the collision search problem is extremely important because, instead of relying on a "trapdoor" function, in which some kind of data known only to the bank makes the problem easier to solve, this problem requires that the bank have more time and computing power to devote to solving it than would-be forgers do. No single machine can provide the performance these applications demand. Because of the need for speed, parallel architectures are the only way we can hope to amass enough computing power to solve useful instances of the collision search problem. We therefore require a solution to this problem that performs well in parallel environments, as well as scaling well in both absolute and incremental fashions. But parallel computing brings with it new challenges, with innovations required in both algorithms and implementations. Therefore, this paper focuses on parallel collision search: challenges and solutions.

The goals of this project were twofold. First, we aimed to study previous and recent work on collision-finding algorithms. Specifically, we were interested in implementing and evaluating real-world performance on the NOW, attempting to gain some insight into likely bottlenecks with this practical experience. Second, we hoped to extend that experience to handle a generalized ("k-way") collision search problem, which is used by MicroMint. Most of the previous work did not address this problem, but we hoped to be able to extend the algorithms to efficiently solve these problems too, with performance requirements motivated by MicroMint.

Collision search also has a number of other applications; for example, it is closely related to the graph-theoretic problem of finding cycles on a large sparse directed graph, as well as a number of other applications throughout codebreaking, a field with ever-increasing demands for high performance.

Background on the collision search problem is given in Section 2. Section 3 outlines previous work on this problem. Section 4 describes our enhancements to these previous algorithms. We describe our implementation of our algorithm and performance measurements in Sections 5 and 6, and Section 7 concludes.

2 Background

2.1 Problem statement and mathematical model

The collision search problem is easy to state: given a hash function h, we are to find a pair of inputs x ≠ y with h(x) = h(y). In some applications, we may wish to find a great number of collisions efficiently. The hash function is viewed as a black box; we assume that h has no mathematical structure we can conveniently analyze.

Collision search is tied to the birthday paradox, which (in its best-known formulation) states that a party of 23 people will likely contain a pair of people who were born on the same day of the year. In general, on a planet with m days in a year, one needs about √m people to find such a pair. Viewed in terms of collision search, the birthday paradox says that we will have a good chance of finding a collision in an n-bit hash function after hashing about 2^{n/2} = √(2^n) different values.

The collision search problem can also be generalized to the problem of finding k-way collisions: we are to find a set of k different inputs x_1, ..., x_k with h(x_1) = ... = h(x_k).

The mathematics can be tightened up a bit: we formally model h : {0,1}* -> {0,1}^n as a random function with n-bit outputs; therefore, if we apply the hash function to any j different inputs, we will see j independent outputs, each uniformly distributed over {0,1}^n. The probability that no collision is observed is (1 - 1/2^n)(1 - 2/2^n) ··· (1 - (j-1)/2^n) ≈ e^{-j^2/2^{n+1}}. We see that after hashing 1.177 · 2^{n/2} values, the probability of observing a collision is about 1/2. Furthermore, the expected number of values needed before observing a collision is √(π · 2^n / 2) ≈ 1.253 · 2^{n/2} [10]. We see that the rough estimates given above are very close to the truth.

The analysis can be extended to estimate the case where we are searching for a number of collisions. After hashing j values, for j up to about 2^{n/2}, the expected number of collisions found is approximately j^2 / 2^{n+1}. In other words, if one is searching for m collisions, a set of √m · 2^{n/2} hashed values will suffice. This means that later collisions come faster and faster than the first one; if many collisions are wanted, it would be a very bad idea to restart the search after each new collision. In other words, finding many collisions is less costly, per collision, than finding just a few.

The problem exhibits a "soft" threshold at roughly 2^{n/2} hash evaluations: significantly before that point, collisions are quite unlikely, but a ways after the threshold, collisions come increasingly quickly. In the more general case of k-way collisions, about 2^{(k-1)n/k} values are needed to find the first k-way collision, and m^{1/k} · 2^{(k-1)n/k} hashed values suffice to find about m k-way collisions. In particular, k-way collision search exhibits an even sharper threshold for k > 2: one has very little chance of observing even one collision much before the threshold, but after the threshold, collisions abound. This interesting behavior is critical to the design of the MicroMint electronic payment system.
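These estimates are easy to check empirically. The short sketch below (our own illustration, not code from this project) models an n-bit random function by truncating SHA-1 and counts repeated outputs on either side of the 2^{n/2} threshold; the choice of hash, the width n = 24, and the sample sizes are arbitrary and made only for the experiment.

```python
# Empirical check of the birthday-bound estimates above, using a truncated
# hash as a stand-in for an n-bit random function. All parameters are
# illustrative choices, not values taken from the paper.
import hashlib
from collections import Counter
from math import pi, sqrt

def h(x: int, n_bits: int) -> int:
    """Model an n-bit random function by truncating SHA-1 of the input."""
    digest = hashlib.sha1(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (160 - n_bits)

def count_collisions(num_inputs: int, n_bits: int, k: int = 2) -> int:
    """Count output values hit by at least k of the first num_inputs inputs."""
    counts = Counter(h(x, n_bits) for x in range(num_inputs))
    return sum(1 for c in counts.values() if c >= k)

if __name__ == "__main__":
    n = 24                                    # small enough to experiment with
    for j in (2 ** (n // 2 - 2), 2 ** (n // 2), 2 ** (n // 2 + 2)):
        found = count_collisions(j, n)
        expected = j * j / 2 ** (n + 1)       # j^2 / 2^{n+1} from the analysis above
        print(f"j = {j:6d}: {found:4d} collisions found, about {expected:.2f} expected")
    print("1.177 * 2^(n/2) =", 1.177 * sqrt(2 ** n),
          " expected wait =", sqrt(pi * 2 ** n / 2))
```
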
2.2 MicroMint

The MicroMint electronic payment system relies heavily on properties of the collision search problem, and as such, motivates our exploration of parallel collision search. This is a three-party payment system: a broker (typically a bank) generates and redeems coins, customers purchase coins from the broker, and vendors offer electronic goods in exchange for such coins. A coin is nothing more than a bit string with a special structure: in particular, a MicroMint coin is just a k-way collision (x_1, ..., x_k); the authors suggest k = 4 as a convenient choice.
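To make the coin format concrete, the sketch below checks only the basic validity condition implied by this definition, namely k distinct values whose n-bit hashes all agree. The hash, the parameters, and the function names are our own illustrative choices; the actual MicroMint scheme imposes further validity criteria on top of this check.

```python
# Minimal sketch of the "a coin is a k-way collision" condition described
# above. The truncated-SHA-1 hash, N_BITS, and K are illustrative choices,
# not parameters taken from the MicroMint paper.
import hashlib

N_BITS = 32        # hypothetical coin hash width
K = 4              # the authors suggest k = 4

def h(x: bytes) -> int:
    """Model an n-bit hash by truncating SHA-1."""
    return int.from_bytes(hashlib.sha1(x).digest(), "big") >> (160 - N_BITS)

def is_valid_coin(coin: list) -> bool:
    """A coin (x_1, ..., x_k) is valid iff the x_i are distinct and collide."""
    return (len(coin) == K
            and len(set(coin)) == K
            and len({h(x) for x in coin}) == 1)
```
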
This leads us to an interesting conundrum. Anyone who can generate k-way collisions efficiently can defraud the system and obtain goods for free; thus, all of the security from forgery lies in the difficulty of collision search. At the same time, the broker mints coins en masse by finding a great many collisions on high-performance platforms; therefore the business model of the broker rests on his ability to find collisions efficiently.

Now the alert reader might ask, "Aren't these conflicting requirements?" The answer is a resounding no, and the resolution of this apparent paradox provides interesting insights into MicroMint. Note that we may typically assume that the broker possesses more computing power than would-be forgers have access to; at the same time, to be profitable, the broker needs to mint many more coins than a typical forger would. But the collision search problem is perfectly suited to this scenario: it provides a fundamental economy of scale for the broker. Generating a great many collisions is cheaper (per collision) than finding just a few. After a large initial investment, the broker can mint coins efficiently, but without comparable capital, would-be counterfeiters cannot create forged coins as economically. (Here's another way to look at this situation: the collision search problem provides a natural threshold between "collisions extremely rare" and "collisions abound", and when the broker has more computing power than counterfeiters, the broker can tune the problem to place the threshold firmly inside that performance gap.) Thus we see why a high-performance solution to the collision search problem is critically important to the MicroMint broker.

2.3 Other applications

2.3.1 Graph theory

Graph theory is intimately involved in the collision search problem. As we will see later in Section 3.2, collision search can be related to the cycle-finding problem on a certain large sparse random directed graph determined in a natural way from the hash function. This insight is fundamental to a number of the most powerful algorithms for collision search; collision search algorithms benefit greatly from graph theory. The converse is also true: a number of the algorithms we describe can be used to identify cycles on large sparse random graphs.

2.3.2 Codebreaking

Collision search also finds a long list of applications in cryptanalysis, or codebreaking. First of all, the ability to perform efficient collision search would give cryptanalysts the ability to break cryptographic-strength hash functions. Digital signatures, authentication, timestamping, and a number of other important cryptographic applications all rely on the infeasibility of finding collisions in cryptographic hash functions, so this problem is of great interest to cryptographers and system designers. Of course, blind search is not likely to find a collision in MD5 (which has 128-bit outputs) anytime soon, but cryptographers remain vitally interested in tracking the state of the art, so they can retire MD5 before Moore's law and collision search technology catch up with it.

Second, van Oorschot and Wiener have shown how to calculate discrete logarithms over an arbitrary general finite group by using a collision search algorithm. For a number of groups in widespread use (especially elliptic curve groups), the collision-based algorithm is the best known. A number of important public-key systems rely on the security of discrete logarithms
over these groups, so users of such systems watch collision search technology closely to understand the security of their system.

Third, collision search can be used to optimize certain types of attacks known as "meet-in-the-middle" attacks; typically they would require enormous amounts of memory, but efficient collision search algorithms remove much of that need.

In the end, collision search finds widespread application to codebreaking because of a fundamental phenomenon: the powerful birthday paradox has been used to break many cryptographic systems, and collision search algorithms primarily focus on speeding up birthday paradox computations on high-performance parallel architectures.

3 Related work

3.1 MicroMint's naive algorithm

The simplest algorithm one could imagine for solving the collision search problem would have a single machine continually picking x's at random, calculating h(x), and comparing this output to all previous outputs. Note that this requires storing each hash output generated; however, in practice the costs of I/O are much greater than those of computation, so the time spent accessing long-term storage will dominate. Of course, the computation costs are unavoidable (because of the birthday paradox), but the I/O costs form a considerable amount of waste which we would very much like to avoid.
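For reference, the naive store-and-compare approach is sketched below (our own illustration, with a truncated hash standing in for h); it makes the storage burden explicit, since every output ever produced must be kept and consulted.

```python
# Sketch of the naive collision search described above: hash random inputs
# and compare each output against everything seen so far. The truncated-SHA-1
# h and the width n_bits are illustrative stand-ins.
import hashlib
import os

def h(x: bytes, n_bits: int = 32) -> int:
    return int.from_bytes(hashlib.sha1(x).digest(), "big") >> (160 - n_bits)

def naive_collision_search(n_bits: int = 32):
    seen = {}                           # every hash output must be stored
    while True:
        x = os.urandom(8)               # pick x at random
        y = h(x, n_bits)
        if y in seen and seen[y] != x:  # compare against all previous outputs
            return seen[y], x           # x != x' with h(x) = h(x')
        seen[y] = x
```
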
The original algorithm described in the MicroMint paper [8] attempted to solve the collision search problem using a naive parallelization of the above algorithm: p processors each continually pick x's at random, calculate h(x), and stream them to a central server. This central server then stores and compares the expected 2^{n/2} hash values as before.

Needless to say, MicroMint's naive parallel algorithm is not the paragon of scalability. All of the I/O costs found in the serial algorithm remain. And there will be an extreme amount of communication (one message for every hash evaluation) sent to the central server. The server (and in particular the server's I/O subsystem, including both the network and disk interfaces) forms a central bottleneck, which inherently limits scalability. In short, communications costs will totally overwhelm the computational costs, thus preventing any hope of scalability.

3.2 Other serial algorithms

Pollard is credited with identifying the strong link between collision search and graph theory [4, 5]. Consider the (very large and sparse) graph with 2^n nodes (labelled 0 through 2^n - 1), and a directed edge from x to y iff h(x) = y. Note that each node in this graph has outdegree exactly 1. Now pick a random starting point and perform a random walk on this graph; this amounts to choosing x_0 at random and iterating the recurrence x_i = h(x_{i-1}). Note that the resulting trail will eventually circle back and cross over itself; when it does, it will thereafter retrace the same steps (because each node has outdegree 1) and continue to cycle forever. The resulting path will look something like the Greek letter "rho" (ρ), as shown in Figure 1.

Figure 1: Pollard's Rho: viewing collision search as cycle-finding in a graph

The critical observation is that we can identify a hash collision from such a trail: if y is the node where the leader meets the cycle, x is its immediate predecessor on the leader, and x' is its immediate
predecessor on the cycle, we see that h(x) = y = h(x'). Therefore, as Pollard noted, we can use any method for detecting cycles in a random walk to build a collision search algorithm. This key insight into the connection between collision search and graph theory is central to a number of efficient collision search algorithms. Algorithms based on this approach are known as "Pollard's rho" methods. Note that, by the birthday paradox, we expect to walk for about 2^{n/2} steps before cycling; therefore, Pollard's rho methods tend to detect a collision with about the same number of hash computations as more naive collision search algorithms.

One cycle-detection technique, called "distinguished points" [6, 7] (attributed to Rivest [10]), is worth mentioning in particular. The idea is to single out a small subset of the graph nodes as "distinguished" based on some simple recognizable property; a convenient choice is to identify a node x as distinguished if the first d bits of x are zero, for some d. The collision search algorithm would record only the distinguished points it encounters on its random walk; the algorithm can detect a cycle by recognizing when a distinguished point is encountered for the second time. Note that memory requirements are reduced by a factor of 2^d, while computation costs remain roughly the same.

Another clever technique, Floyd's cycle-detecting algorithm, avoids all memory costs. Think of it as two concurrent processes: one does an ordinary random walk, and another follows in its footsteps but at half the rate. At each step, one checks to see if the two processes are at the same node in the graph; if they are, a cycle has been detected. It is an amazing and beautiful fact that this simple algorithm will always correctly detect cycling [2]. (Floyd's algorithm potentially imposes up to a factor of three increase in computational cost, but that can be virtually eliminated with some additional algorithmic design [10].)
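This rho picture translates directly into a short serial procedure: run Floyd's two-speed walk until the two walkers coincide, then back up to extract the actual colliding pair. The sketch below is our own illustration with an arbitrary truncated hash, not code from the project.

```python
# Sketch of serial Pollard's-rho collision search using Floyd's two-speed
# cycle detection, as described above. The truncated-SHA-1 h and the width N
# are illustrative stand-ins for a generic n-bit random function.
import hashlib

N = 32

def h(x: int) -> int:
    digest = hashlib.sha1(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (160 - N)

def rho_collision(x0: int):
    # Phase 1 (Floyd): advance a slow walker one step and a fast walker two
    # steps per iteration; they must eventually land on the same node.
    slow, fast = h(x0), h(h(x0))
    while slow != fast:
        slow, fast = h(slow), h(h(fast))
    # Phase 2: restart one walker at x0 and step both one node at a time.
    # They meet where the tail joins the cycle; the two predecessors at that
    # moment are distinct inputs with the same hash value.
    a, b = x0, slow
    while h(a) != h(b):
        a, b = h(a), h(b)
    return a, b

if __name__ == "__main__":
    x, y = rho_collision(12345)
    assert x != y and h(x) == h(y)
    print(f"collision: h({x}) == h({y}) == {h(x)}")
```
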
With these and other techniques, one can almost entirely eliminate the memory requirements and the expensive I/O operations that dominated performance of the naive algorithm. In particular, serial collision search algorithms run near the theoretical upper bounds: there simply is very little room for improvement. Serial collision search is a solved problem.

3.3 Parallel algorithms

Parallel algorithms for collision search have tended to mirror the serial ones; until recently, however, no good approach to parallel collision search was known.

One simple approach is to have each of p processors independently execute a random walk, and do Pollard's rho cycling checks independently on each; this approach is attributed to Brent [10].

Figure 2: Running separate, independent Pollard's Rhos: only intra-processor collisions are detected

This avoids the need for any communication, but one pays for the lack of sophistication with a significant performance hit: the speedup is only √p when using p processors. The reason for this is that after each processor has walked 2^{n/2}/√p steps, the expected number of collisions at each processor is about 1/p; by linearity of expectations, we get that the expected total number of collisions across all processors is about 1 after a total of 2^{n/2} · √p hash computations. See Figure 2.

There is a much better approach, as discovered and related by van Oorschot and Wiener in recent work [10]. Note that one of the major causes of inefficiency in the previous algorithm is the independence between processors: if two processors' trails converge (in a "lambda", or λ, shape, Figure 3) the algorithm will
not notice, even though a useful collision can be obtained from such a convergence.

Figure 3: van Oorschot and Wiener's lambda: taking advantage of inter-processor collisions

In van Oorschot and Wiener's algorithm, processors periodically communicate the current state of their random walks to detect λ's, which ensures that we can take advantage of collisions between processors. In particular, the method of distinguished points is applied, and whenever a processor encounters a distinguished point, it communicates that to a central server; if that distinguished point has already been encountered once before (usually by some other processor, a λ event, although occasionally ρ's do happen too), the central server identifies that a collision has occurred and recovers it. With this innovation, processors spend most of the time in local computation, with occasional communication, thus improving efficiency.

Finally, we describe the particulars of how collisions are recovered from the incoming reports of distinguished points. A processor performing a random walk encounters points x_1, x_2 = h(x_1), x_3 = h(x_2), and so on. If 1 out of every 2^d points in the overall graph is distinguished, we also expect 1 out of every 2^d points in this sequence to be distinguished. Each time the processor encounters a distinguished point (say x_j), it recalls the previous distinguished point it encountered (say x_i), and reports the triplet (x_i, j - i, x_j) to the central server.
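The per-processor loop implied by this description is sketched below. The distinguished-point test (top d bits equal to zero) and the reported triplet follow the text directly; the constants, the truncated hash, and the report callback are our own illustrative stand-ins for the real implementation's message send.

```python
# Sketch of a worker's random walk with distinguished-point reporting, as
# described above. h, N and D are illustrative choices; report() is a
# hypothetical hook standing in for the message sent to the server.
import hashlib
import os

N, D = 40, 12                          # hash width and distinguished bits (arbitrary)

def h(x: int) -> int:
    digest = hashlib.sha1(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (160 - N)

def is_distinguished(x: int) -> bool:
    return x >> (N - D) == 0           # the first d bits of x are zero

def walk_and_report(report, max_reports: int = 100):
    x = int.from_bytes(os.urandom(5), "big") % (1 << N)   # random starting point
    prev, steps, sent = x, 0, 0        # last distinguished point, distance, count
    while sent < max_reports:
        x = h(x)
        steps += 1
        if is_distinguished(x):
            report((prev, steps, x))   # the triplet (x_i, j - i, x_j)
            prev, steps, sent = x, 0, sent + 1
```

In the real system the report is an asynchronous message to the holder of the point's bucket (see Section 4.1); here it can be any callable, for example a list's append method.
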
Once the central server receives two reports (x, c, y) and (x', c', y') that satisfy y = y' but x ≠ x', it realizes a λ or ρ event has occurred, and begins a second phase of computation. The goal of this second phase is to recover the collision implied by the event; that is, to find z ≠ z' with h(z) = h(z'). We note that, by the way the triplets were reported, h^c(x) = y = h^{c'}(x'), where superscripts refer to iteration. Without loss of generality, c' ≥ c, and we let c'' = c' - c and x'' = h^{c''}(x'). Now we note that h^c(x'') = h^{c'}(x') = h^c(x). It is also important to note that x'' ≠ x, because x is a distinguished point, and so could not be on the path between x' and y' (= y). We now have x ≠ x'' but h^c(x) = h^c(x''), so we simply calculate h^i(x) and h^i(x'') for each i from 1 until the first such time as the two values are equal (which will happen at latest when i = c). The values from the previous iteration are then the desired collision.
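This recovery step translates almost line for line into code. The sketch below (our own illustration, reusing any hash h such as the stand-ins above) follows the alignment argument just given: advance the longer trail by c'' = c' - c, then step both trails in lockstep until their images first agree.

```python
# Sketch of the second phase described above: given two triplets (x, c, y) and
# (x2, c2, y) with the same distinguished endpoint y but x != x2, recover a
# pair z != z' with h(z) = h(z').
def recover_collision(h, x, c, x2, c2):
    if c2 < c:                      # without loss of generality, c2 >= c
        x, c, x2, c2 = x2, c2, x, c
    for _ in range(c2 - c):         # advance the longer trail by c'' = c2 - c,
        x2 = h(x2)                  # so both points are now c steps from y
    if x == x2:
        return None                 # trails merged before x; no collision here
    while h(x) != h(x2):            # step in lockstep until the images agree
        x, x2 = h(x), h(x2)
    return x, x2                    # h(x) == h(x2) and x != x2
```
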
For more details on this algorithm of van Oorschot and Wiener, we refer the interested reader to their paper [10]. We decided to investigate their algorithm in more detail, implement it on a real-world parallel processing platform, and examine how well it behaves in practice. We report on that investigation in further detail in the following sections.

It is worth pointing out that almost all of the previous work in this field concentrates on pairwise collisions, and ignores the problem of finding k-way collisions for k > 2. We remind the reader that MicroMint depends crucially on the k-way collision search problem. Therefore, we also set out to investigate the problem of k-way collision search. Can the previous work on 2-way collisions be generalized to apply to the k-way collision problem? Can one develop efficient, high-performance solutions for natural problem instances needed by applications such as MicroMint? We answer these questions in the affirmative in the following sections.
4 Enhancing the algorithm

We present two important improvements to the parallel collision search algorithm. The first gives better scalability by improving the data partitioning; the second extends the algorithm to handle k-way collisions for k > 2. These modifications make up a substantial part of this paper's contribution.

4.1 Data partitioning

Note that van Oorschot and Wiener's algorithm has a central bottleneck. It relies on a central server to maintain a central list of all distinguished points encountered to date; all processors periodically communicate with the central server. This has a number of disadvantages. First of all, the amount of communication required clearly grows as we increase the number of processors, and communication with the central server is likely to soon become a major bottleneck. Second, the amount of data processed by the central server also grows in the same fashion, so:

- the storage capacity required will increase, which means that soon main memory will be exhausted and we will be required to store the data on disk, which is substantially slower;

- the I/O bandwidth needed will increase, which means that it will soon become a bottleneck (and in fact that point will come even faster than one might otherwise expect because of the need to use slower storage such as hard disks); and

- the amount of computation needed to manage the data will increase, which means that it may eventually also become a bottleneck.

Third, the placement of all distinguished point data at the central server means that (unless special efforts are taken) all of the phase 2 calculations will occur at that one node, so the second phase will not be parallelized. If we are attempting to find a great many collisions, the poor performance of the second phase will become a serious problem. In short, the centralized data partitioning will severely limit the scalability of the algorithm.

In fact, we started to observe effects attributable to these problems in early experiments with relatively small numbers of nodes, which is what motivated us to examine the data partitioning issue. We suggest distributing the list of distinguished points among all of the processors to avoid these central bottlenecks. In particular, we suggest using a hash table (keyed on the reported distinguished point, but not on the previous distinguished point or the trail length between them), with buckets distributed among all the processors. When a processor has a distinguished point to report, instead of sending it to the central processor, it identifies its proper location in the distributed data structure, and sends it directly to the appropriate destination. The destination incorporates it directly into the hash table, checks to see whether this distinguished point has been seen before (note that this check is entirely local, due to the choice of partitioning), and if not, continues with its computation. (These communications are not synchronized, but occur at random times as distinguished points are encountered.) The random nature of h implies that it is appropriate to distribute the hash buckets among the processors in a uniform, static manner.
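A sketch of the partitioning rule follows. Because the owner of a bucket is a pure function of the distinguished point itself, duplicate reports of the same point always arrive at the same node and the duplicate check stays local; the table layout and the send primitive are hypothetical placeholders for the Active Messages handlers in the actual implementation.

```python
# Sketch of the static, uniform partitioning of distinguished points described
# above. owner() maps a point to the processor holding its bucket; send() is a
# hypothetical stand-in for the asynchronous message used in the real system.
def owner(point: int, num_procs: int) -> int:
    """Every report of the same distinguished point goes to the same node."""
    return point % num_procs               # uniform because h behaves randomly

def report_distinguished(point, prev, length, my_rank, num_procs, table, send):
    dest = owner(point, num_procs)
    if dest == my_rank:
        handle_report(table, point, prev, length)   # bucket is already local
    else:
        send(dest, (point, prev, length))           # straight to the bucket owner

def handle_report(table, point, prev, length):
    """Runs on the bucket owner: the duplicate check is purely local."""
    entries = table.setdefault(point, [])
    if any(p != prev for (p, _) in entries):
        pass    # a lambda/rho event: schedule the local second-phase recovery
    entries.append((prev, length))
```
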
This improvement allows us to avoid all of the drawbacks of a central server and associated central bottleneck; in return, we expect to see significantly better scaling behavior. For instance, while the amount of data generated still grows linearly in the number of processors p, now the storage resources available also grow linearly in p. It turns out that we can keep the entire hash table in RAM, which speeds up accesses to the data structure significantly. Furthermore, I/O bandwidth and network bandwidth are now split among all of the processors, eliminating those bottlenecks. Finally, we note that a careful choice of partitioning ensures that data is distributed in such a way that each processor does its second-phase computation entirely on local data, and we get the desired parallelization and scalability there for free. This is an especially big advantage when generating collisions en masse, and even more so for k-way collisions (k > 2).
In short, we recommend our improved data partitioning scheme for anyone implementing van Oorschot and Wiener's parallel algorithm: there's no reason not to use it, and it greatly improves scalability.

4.2 k-way collisions

In this section, we generalize the parallel collision search algorithm to handle k-way collisions, for k > 2. We modify only the second phase; the first phase proceeds exactly as before (except that it needs to continue for far longer, to ensure that we have performed at least 2^{(k-1)n/k} hash calculations). The second phase examines each hash bucket, looking for clusters of entries which all share the same distinguished point value (but for which the previous distinguished point values are all distinct). We ignore clusters of size less than k, as they cannot possibly lead to a k-way collision. Also note that, unlike 2-way collisions, having a cluster of size k is not sufficient to ensure a k-way hash collision for k > 2. For example, we could have k distinguished points x_1, ..., x_k for which h(h(x_i)) = y for all i (y is the common distinguished point). Then the k triplets (x_i, 2, y) would be reported, but it may be that h(x_i) = z for 1 ≤ i ≤ k-1 and h(x_k) = z' for two points z ≠ z' that also happen to satisfy h(z) = y = h(z').

Suppose we are given a single cluster of size j. List the triplets in this cluster as (x_1, c_1, y), (x_2, c_2, y), ..., (x_j, c_j, y), so that c_1 ≥ c_2 ≥ ... ≥ c_j. We search for the possible k-way collision resulting from this cluster by repeatedly performing the following steps 1-3 (a sketch in code follows the list):

1. For each i such that c_i = c_1 (note that this will be all i ≤ i* for some i*, because of the ordering of the c_i), replace (x_i, c_i, y) with (h(x_i), c_i - 1, y). Note that this replacement maintains the invariants that the c_i are in non-increasing order, and that h^{c_i}(x_i) = y for each i.

2. After this replacement, you may find multiple entries in the list that have become the same. If there are k or more entries that are now the same, their previous values form (at least) a k-way collision, which you record.

3. Remove all duplicates from the list (thus shrinking j), and go back to step 1 until there are fewer than k triplets in the list.
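The following sketch (our own rendering of steps 1-3 above, not the project's code) processes one cluster, given as (x, c) pairs that share a common distinguished endpoint y, and yields any k-way collisions it uncovers.

```python
# Sketch of the k-way second phase (steps 1-3 above) for a single cluster.
# Each entry is (x, c) with h^c(x) = y for the cluster's common distinguished
# point y; h is any hash function, e.g. one of the illustrative ones above.
from collections import defaultdict

def kway_collisions_in_cluster(h, cluster, k):
    """Yield the k-way (or larger) collisions implied by one cluster."""
    entries = sorted(cluster, key=lambda e: e[1], reverse=True)  # c_1 >= c_2 >= ...
    while len(entries) >= k:
        c_max = entries[0][1]
        if c_max == 0:
            break
        # Step 1: advance every entry whose counter equals c_1 by one hash step.
        stepped = defaultdict(list)        # new value -> the previous values
        rest = []
        for x, c in entries:
            if c == c_max:
                stepped[h(x)].append(x)
            else:
                rest.append((x, c))
        # Step 2: k or more entries landing on the same value means their
        # previous values are k distinct inputs with equal hashes; record them.
        for prevs in stepped.values():
            if len(set(prevs)) >= k:
                yield tuple(set(prevs))
        # Step 3: remove duplicates and continue until fewer than k remain.
        entries = sorted(set(rest + [(nx, c_max - 1) for nx in stepped]),
                         key=lambda e: e[1], reverse=True)
```

For k = 2 and a two-entry cluster, this reduces to the pairwise recovery sketched in Section 3.3.
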
This algorithm for the second phase is very efficient. Note that it performs (at most) the number of hash calculations that would be performed in the second phase for a k = 2 collision search (given the same set of triplets produced by the first phase). Furthermore, the clusters may be analyzed in parallel (and in fact, they will be, due to the data partitioning scheme described above), and such computations are purely local, requiring no communications. Therefore, we see that this is very scalable; in practice, the cost of the first phase will dominate the total cost of the k-way collision search effort.

This demonstrates that the successful parallel collision search algorithm of van Oorschot and Wiener extends very cleanly to the k > 2 case; moreover, much of our experience with the k = 2 algorithm applies directly to this extended algorithm as well.

5 Implementation

We implemented this algorithm on the NOW, a high-performance parallel platform, using GLUnix to select idle workstations. The NOW was well-suited to this project: it had some convenient (though far from perfect) software development tools, it supports incremental scalability as well as absolute scalability (which helps avoid any need for the dreaded "forklift upgrade"), and it was readily available for our use. We targeted the Sun UltraSPARC workstations, as they provide extremely high performance and a fast network interface, and there are a large number of them (80+) available through the NOW.

We coded our initial implementation with the MPI library. At the time this seemed reasonably well-suited to our application. However, at a late point in development (we had nearly completed
implementation, or so we thought), we ran into one serious stumbling block: a bug in MPI. To achieve maximum performance, we desperately needed asynchronous, non-blocking primitives to both send and receive messages; however, after some investigation, we learned that the non-blocking network-polling primitive simply didn't work on the NOW due to some bug in the implementation of the MPI library. The local MPI expert was out of town for several weeks, and we had no hope of tracking down and fixing the bug ourselves, being relatively new to MPI programming. We were blocked on the MPI bug; we could have waited for the expert to return, but we decided the wisest thing to do was to re-implement on another communications layer with better support for the primitives we needed. In hindsight, that was for the best: the fundamental problem was that we were using primitives that are part of the periphery of MPI, not the core functionality that everyone uses (and thus is presumably more stable, more correct, and higher-performance).

With that experience, we investigated the available tools a bit further, and decided that AM-2 (Active Messages 2) was a good choice, and would have made a much better fit for our application in the first place. One would be hard-pressed to imagine a pre-existing tool better suited to our needs:

- Our algorithm is totally message-driven; implementing with a shared-memory abstraction would have been very painful.

- AM-2 operates close to the metal, and it has a simple interface with low overhead for short messages, which fits the profile of our application's communication needs. (No fancy features, but we didn't really need bells and whistles; therefore, the simplicity and high performance of AM-2 were ideal.)

- The AM-2 library provides a convenient, easy way to get event-driven semantics without needing to write threaded code. The algorithm is very event-driven, and so this ease of development was well appreciated.

Furthermore, AM-2 was widely deployed on all the NOW UltraSPARC nodes; it was written here at Berkeley, so support seemed likely to be easier to find; good documentation and example code were readily available; several other parallel languages seemed to be implemented on top of AM-2, which suggested that AM-2 would be at least as stable as they are; and AM-2 did not have any artificial restrictions on the number of nodes it could run on (such as requiring that they must be a power of two, for instance). Best of all, local AM-2 gurus were rumored to be highly available.

The MPI bug that prompted the move to AM-2 was eventually fixed. However, we do not regret the time it took to re-implement for AM-2. The MPI bug forced us to do a better job selecting our tools; as we found out, our initial choice had been sub-optimal. We learned a valuable lesson: a little time spent finding the best tool before coding pays off in the end.

We learned another lesson from the experience: some of the parallel development tools on the NOW are woefully inadequate. GLUnix was a mess, suffering from persistent stability problems (it would often get into a funny state and need to be restarted) and from occasional configuration errors that violated the abstraction of a homogeneous cluster. The NOW hardware also had occasional connectivity problems (the poor cross-cluster communications performance was notable), even including one notable instance (following some wiring upgrades, no surprise there) where the entire network became partitioned into two halves. Fortunately, Alan Mainwaring stepped in to fix the hardware problems each time those arose. In the end, the recurring GLUnix stability problems were the most noticeable.

However, we also observed a very promising phenomenon: these tools seemed to improve significantly during the last two months of the semester, when the NOW saw a lot of usage from CS267 projects. Several students involved in the NOW project evidently put in a lot of hard work to make the NOW as usable as possible for CS267 students, and that paid off. We hope those trends will continue.
Figure 4: Performance of our enhanced collision search algorithm, as compared to the naive MicroMint algorithm (bottom; almost flat), perfect linear scaling (second from top), and the theoretical maximum (top). [Plot of Performance versus Nodes (0-120), with series "Naive-Algorithm", "Parallel-Performance", "Scaling", and "Theoretical-Bound".]

6 Performance

This study would not be complete without a careful look at the performance of our application. We conducted extensive measurements on the NOW UltraSPARCs, running on up to 84 nodes, and running a large number of measurements when we had reserved all the nodes to ensure other users' jobs did not interfere with our application. To ensure that a comparison between performance figures made sense, we scaled up the problem size proportionately as we increased the degree of parallelism.

The primary figure of interest is performance graphed against the number of processors. We have plotted our results in Figure 4. The bottommost (nearly flat) line in the graph shows the performance of our implementation of the naive MicroMint parallel collision search algorithm from [8]; Figure 5 shows a closeup of the plot near the origin, which is where all the interesting behavior happens for this algorithm. The next line above that gives the performance of our implementation of the parallel collision search algorithm (based on van Oorschot and Wiener's work, with our improvements as described in Section 4). The diagonal line just above that is the straight line passing through the origin and our data point for one processor: it depicts the performance that we would see if we had achieved perfect linear scaling; this lets us compare scaling behavior readily. Finally, the topmost line gives the theoretical upper bound on the maximum performance possible with any algorithm; this is the performance one would achieve if everything except Phase 1 calculations of h were free.

Figures 4 and 5 show us that the naive MicroMint parallel algorithm has very poor scaling behavior. After about 3 nodes, the performance curve is flat, and there is no benefit to adding additional processors; even before then, it is noticeably less than ideal. We attribute this to both the cost of I/O and communications. If the bottleneck were solely disk performance, the curve would flatten after 1 processor; if the bottleneck were solely network performance, we would expect the single-processor performance to be roughly comparable to the theoretical bound.

In contrast, we get excellent performance from the more sophisticated parallel collision search algorithm. We observe very close to perfect linear scaling all the
way up to 84 nodes, which is a very positive result, as both absolute and incremental scalability were important to us. Furthermore, we see that we are not far away from the theoretical maximum. That is well-suited to MicroMint (where real money is potentially at stake), because brokers want to be sure that would-be forgers, with less powerful hardware, cannot get better performance than they do simply by upgrading their algorithms. Instead, we know that one simply cannot do much better (at least at these ranges of scale) than our implementation.

Figure 5: Closeup of Figure 4, displaying the performance of the naive MicroMint algorithm in more detail. [Plot of Performance versus Nodes (0-3), with the same four series as Figure 4.]

7 Conclusion

In this project, we studied the parallel collision search problem. We pointed out serious shortcomings of MicroMint's naive parallel algorithm, and were thus motivated to look for a better approach. We identified and eliminated a central bottleneck in van Oorschot and Wiener's parallel algorithm; with clever data partitioning, we were able to achieve excellent scaling behavior, including both absolute and incremental scalability. We also extended the algorithm to handle k-way collisions efficiently and cleanly. Finally, we implemented the algorithm on the NOW to gain practical experience with it on a real parallel architecture. Measurements on up to 84 processors indicate that our implementation performs extremely well, with efficiency near the theoretical maximum, and displays near-perfect scaling behavior.

References

[1] D. Chaum, "Blind Signatures for Untraceable Payments," Proc. of CRYPTO'82, Plenum, D. Chaum, R.L. Rivest, and A.T. Sherman (Eds.).
[2] D.E. Knuth, The Art of Computer Programming, vol. 2, Addison-Wesley, 1981.
[3] T. Okamoto and K. Ohta, "Universal Electronic Cash," Proc. of CRYPTO'91, Springer.
[4] J.M. Pollard, "A Monte Carlo method for factorization," BIT, vol. 15 (1975), pp. 331-334.
[5] J.M. Pollard, "Monte Carlo Methods for Index Computation (mod p)," Math. Comp., vol. 32, no. 143, July 1978, pp. 918-924.
[6] J.-J. Quisquater and J.-P. Delescaille, "How easy is collision search? Applications to DES," Proc. of EUROCRYPT'89, Springer-Verlag.
[7] J.-J. Quisquater and J.-P. Delescaille, "How easy is collision search? New results and applications to DES," Proc. of CRYPTO'89, Springer-Verlag.
[8] R.L. Rivest and A. Shamir, "PayWord and MicroMint: two simple micropayment schemes," presented at the 1996 Security Protocols Workshop, Cambridge, UK.
[9] M. Sirbu and J.D. Tygar, "NetBill: An Internet Commerce System Optimized for Network Delivered Services," IEEE COMPCON'95, 1995.
[10] P.C. van Oorschot and M.J. Wiener, "Parallel Collision Search with Cryptanalytic Applications," to appear, Sept. 23, 1996.
