ations throughout codebreaking, a field with ever-increasing demands for high performance.

Background on the collision search problem is given in Section 2. Section 3 outlines previous work on this problem. Section 4 describes our enhancements to these previous algorithms. We describe our implementation of our algorithm and performance measurements in Sections 5 and 6, and Section 7 concludes.

2 Background

2.1 Problem statement and Mathematical Model

The collision search problem is easy to state: given a hash function $h$, we are to find a pair of inputs $x \neq y$ with $h(x) = h(y)$. In some applications, we may wish to find a great number of collisions efficiently. The hash function is viewed as a black box; we assume that $h$ has no mathematical structure we can conveniently analyze.

Collision search is tied to the birthday paradox, which (in its best-known formulation) states that a party of 23 people will likely contain a pair of people who were born on the same day of the year. In general, on a planet with $m$ days in a year, one needs about $\sqrt{m}$ people to find such a pair. Viewed in terms of collision search, the birthday paradox says that we will have a good chance of finding a collision in an $n$-bit hash function after hashing about $2^{n/2} = \sqrt{2^n}$ different values.

We see that after hashing $1.177 \cdot 2^{n/2}$ values, the probability of observing a collision is about $1/2$. Furthermore, the expected number of values needed before observing a collision is $\sqrt{(\pi/2) \cdot 2^n} \approx 1.253 \cdot 2^{n/2}$ [10]. We see that the rough estimates given above are very close to the truth.

The analysis can be extended to estimate the case where we are searching for a number of collisions. After hashing $j$ values for $j \geq 2^{n/2}$, the expected number of collisions found is approximately $j^2 / 2^{n+1}$. In other words, if one is searching for $m$ collisions, a set of roughly $\sqrt{m} \cdot 2^{n/2}$ hashed values will suffice. This means that later collisions arrive faster and faster than the first one; if many collisions are wanted, it would be a very bad idea to restart the search after each new collision. In other words, finding many collisions is less costly, per collision, than finding just a few.

The problem exhibits a "soft" threshold at roughly $2^{n/2}$ hash evaluations: significantly before that point, collisions are quite unlikely, but well after the threshold collisions come increasingly quickly. In the more general case of $k$-way collisions, about $2^{(k-1)n/k}$ values are needed to find the first $k$-way collision, and $m^{1/k} \cdot 2^{(k-1)n/k}$ hashed values suffice to find about $m$ $k$-way collisions. In particular, $k$-way collision search exhibits an even sharper threshold for $k > 2$: one has very little chance of observing even one collision much before the threshold, but after the threshold, collisions abound. This interesting behavior is critical to the design of the MicroMint electronic payment system.
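The estimates of Section 2.1 are easy to check numerically. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and `collision_probability` uses the standard approximation $1 - e^{-j^2/2^{n+1}}$ for an ideal $n$-bit hash, which is an assumption brought in here rather than a formula stated in the text.

```python
import math

def collision_probability(j: int, n: int) -> float:
    """Approximate chance of at least one collision after hashing j values
    with an ideal n-bit hash (standard approximation, assumed here)."""
    return 1.0 - math.exp(-(j * j) / 2.0 ** (n + 1))

def expected_collisions(j: int, n: int) -> float:
    """Expected number of 2-way collisions after j hashes: j^2 / 2^(n+1)."""
    return (j * j) / 2.0 ** (n + 1)

def expected_values_until_first_collision(n: int) -> float:
    """sqrt((pi/2) * 2^n), i.e. about 1.253 * 2^(n/2)."""
    return math.sqrt(math.pi / 2.0 * 2.0 ** n)

def values_for_k_way(m: float, k: int, n: int) -> float:
    """m^(1/k) * 2^((k-1)n/k) hashed values for about m k-way collisions."""
    return m ** (1.0 / k) * 2.0 ** ((k - 1) * n / k)

if __name__ == "__main__":
    n = 32
    j = round(1.177 * 2 ** (n // 2))
    print("P(collision) after 1.177 * 2^(n/2) hashes:", collision_probability(j, n))
    print("expected hashes until first collision:", expected_values_until_first_collision(n))
    print("hashes for about 1000 4-way collisions:", values_for_k_way(1000, 4, n))
```

For a 32-bit hash, the first figure comes out very close to 1/2, matching the constant 1.177 quoted above.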
2.2 MicroMint

The MicroMint electronic payment system relies heavily on properties of the collision search problem, and as such, motivates our exploration of parallel collision search. This is a three-party payment system: a broker (typically a bank) generates and redeems

Thus we see why a high-performance solution to the collision search problem is critically important to the MicroMint broker.

2.3 Other applications

2.3.1 Graph theory
over these groups, so users of such systems watch collision search technology closely to understand the security of their system.

Third, collision search can be used to optimize certain types of attacks known as "meet-in-the-middle" attacks; typically they would require enormous amounts of memory, but efficient collision search algorithms remove much of that need.

In the end, collision search finds widespread application to codebreaking because of a fundamental phenomenon: the powerful birthday paradox has been used to break many cryptographic systems, and collision search algorithms primarily focus on speeding up birthday paradox computations on high-performance parallel architectures.

3 Related work

3.1 MicroMint's naive algorithm

The simplest algorithm one could imagine for solving the collision search problem would have a single machine continually picking $x$'s at random, calculating $h(x)$, and comparing this output to all previous outputs. Note that this requires storing each hash output generated; however, in practice the costs of I/O are much greater than those of computation, so the time spent accessing long-term storage will dominate. Of course, the computation costs are unavoidable (because of the birthday paradox), but the I/O costs form a considerable amount of waste which we would very much like to avoid.

The original algorithm described in the MicroMint paper [8] attempted to solve the collision search problem using a naive parallelization of the above algorithm. $p$ processors each continually pick $x$'s at random, calculate $h(x)$, and stream them to a central server. This central server then stores and compares the expected $2^{n/2}$ hash values as before.

Needless to say, MicroMint's naive parallel algorithm is not the paragon of scalability. All of the I/O costs found in the serial algorithm remain. And there will be an extreme amount of communication (one message for every hash evaluation) sent to the central server. The server, and in particular the server's I/O subsystem, including both the network and disk interfaces, forms a central bottleneck, which inherently limits scalability. In short, communications costs will totally overwhelm the computational costs, thus preventing any hope of scalability.

3.2 Other serial algorithms

Pollard is credited with identifying the strong link between collision search and graph theory [4, 5]. Consider the (very large and sparse) graph with $2^n$ nodes (labelled 0 through $2^n - 1$), and a directed edge from $x$ to $y$ iff $h(x) = y$. Note that each node in this graph has outdegree exactly 1. Now pick a random starting point and perform a random walk on this graph; this amounts to choosing $x_0$ at random and iterating the recurrence $x_i = h(x_{i-1})$. Note that the resulting trail will eventually circle back and cross over itself; when it does, it will thereafter retrace the same steps (because each node has outdegree 1) and continue to cycle forever. The resulting path will look something like the Greek letter "rho" ($\rho$), as shown in Figure 1.

Figure 1: Pollard's Rho: viewing collision search as cycle-finding in a graph

The critical observation is that we can identify a hash collision from such a trail: if $y$ is the node where the leader meets the cycle, $x$ is its immediate predecessor on the leader, and $x'$ is its immediate predecessor on the cycle, we see that $h(x) = y = h(x')$. Therefore, as Pollard noted, we can use any method for detecting cycles in a random walk to build a collision search algorithm. This key insight into the connection between collision search and graph theory is central to a number of efficient collision search algorithms. Algorithms based on this approach are known as "Pollard's rho" methods. Note that, by the birthday paradox, we expect to walk for about $2^{n/2}$ steps before cycling; therefore, Pollard's rho methods tend to detect a collision with about the same number of hash computations as more naive collision search algorithms.

One cycle-detection technique, called "distinguished points" [6, 7] (attributed to Rivest [10]), is worth mentioning in particular. The idea is to single out a small subset of the graph nodes as "distinguished" based on some simple recognizable property; a convenient choice is to identify a node $x$ as distinguished if the first $d$ bits of $x$ are zero, for some $d$. The collision search algorithm would record only the distinguished points it encounters on its random walk; the algorithm can detect a cycle by recognizing when a distinguished point is encountered for the second time. Note that memory requirements are reduced by a factor of $2^d$, while computation costs remain roughly the same.

Another clever technique, Floyd's cycle-detecting algorithm, avoids all memory costs. Think of it as two concurrent processes: one does an ordinary random walk, and another follows in its footsteps but at half the rate. At each step, one checks to see if the two processes are at the same node in the graph; if they are, a cycle has been detected. It is an amazing and beautiful fact that this simple algorithm will always correctly detect cycling [2]. (Floyd's algorithm potentially imposes up to a factor of three increase in computational cost, but that can be virtually eliminated with some additional algorithmic design [10].)

With these and other techniques, one can almost entirely eliminate the memory requirements and the expensive I/O operations that dominated performance of the naive algorithm. In particular, serial collision search algorithms run near the theoretical upper bounds: there simply is very little room for improvement. Serial collision search is a solved problem.

3.3 Parallel algorithms

Parallel algorithms for collision search have tended to mirror the serial ones; until recently, however, no good approach to parallel collision search was known.

One simple approach is to have each of $p$ processors independently execute a random walk, and do Pollard's rho cycling checks independently on each; this approach is attributed to Brent [10].

This avoids the need for any communication, but one pays for the lack of sophistication with a significant performance hit: the speedup is only $\sqrt{p}$ when using $p$ processors. The reason for this is that after each processor has walked $2^{n/2}/\sqrt{p}$ steps, the expected number of collisions at each processor is about $1/p$; by linearity of expectations, we get that the expected total number of collisions across all processors is about 1 after a total of $2^{n/2}\sqrt{p}$ hash computations. See Figure 2.

Figure 2: Running separate, independent Pollard's Rhos: only intra-processor collisions are detected (processors labelled P1, P2, ..., Pn)

There is a much better approach, as discovered and related by van Oorschot and Wiener in recent work [10]. Note that one of the major causes of inefficiency in the previous algorithm is the independence between processors: if two processors' trails converge (in a "lambda", or $\lambda$, shape, Figure 3) the algorithm will
not notice, even though a useful collision can be obtained from such a convergence.

Figure 3: van Oorschot and Wiener's lambda: taking advantage of inter-processor collisions

Once the central server receives two reports $(x, c, y)$ and $(x', c', y')$ that satisfy $y = y'$ but $x \neq x'$, it realizes a $\rho$ or $\lambda$ event has occurred, and begins a second phase of computation. The goal of this second phase is to recover the collision implied by the event; that is, to find $z \neq z'$ with $h(z) = h(z')$. We note that, by the way the triplets were reported, $h^c(x) = y = h^{c'}(x')$, where superscripts refer to it-
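Floyd's cycle-detecting algorithm from Section 3.2 can be turned into a memoryless collision finder. The sketch below is illustrative only: the toy hash `h` (SHA-256 truncated to `N_BITS` bits), the constant `N_BITS`, and the function name `floyd_collision` are assumptions made for this example, and the code follows the common formulation in which the second walker moves at twice the rate of the first.

```python
import hashlib

N_BITS = 20  # toy hash width, small enough to run instantly (assumption)

def h(x: int) -> int:
    """Toy n-bit hash: SHA-256 truncated to N_BITS bits (stand-in for a real h)."""
    digest = hashlib.sha256(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - N_BITS)

def floyd_collision(x0: int):
    """Return (a, b) with a != b and h(a) == h(b), found on the rho-shaped
    trail starting at x0, or None if x0 itself lies on the cycle."""
    # Detect the cycle: one walker steps once, the other twice, per round.
    slow, fast = h(x0), h(h(x0))
    while slow != fast:
        slow = h(slow)
        fast = h(h(fast))
    # Restart one walker at x0; stepping both at the same rate, they first
    # agree at the node where the leader meets the cycle.
    slow = x0
    if slow == fast:
        return None  # no leader, hence no collision on this trail
    while True:
        next_slow, next_fast = h(slow), h(fast)
        if next_slow == next_fast:
            return slow, fast  # predecessors on the leader and on the cycle
        slow, fast = next_slow, next_fast

if __name__ == "__main__":
    a, b = floyd_collision(12345)
    print(a, b, h(a), h(b))
```

Only two values are stored at any time, at the price of extra hash evaluations, which is the trade-off described in the text.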
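The two phases of the distinguished-point approach can be sketched as follows. This is a simplified, serial simulation under stated assumptions, not the authors' implementation: the toy hash `h`, the constants `N_BITS` and `D_BITS`, and all function names are hypothetical; each trail starts from a fresh value rather than from the previous distinguished point; and a round-robin loop stands in for real processors reporting to a central table.

```python
import hashlib

N_BITS = 24  # toy hash width (assumption)
D_BITS = 6   # a point is distinguished if its first D_BITS bits are zero

def h(x: int) -> int:
    """Toy n-bit hash: SHA-256 truncated to N_BITS bits."""
    digest = hashlib.sha256(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - N_BITS)

def is_distinguished(x: int) -> bool:
    return x >> (N_BITS - D_BITS) == 0

def walk_to_distinguished(start: int, max_steps: int):
    """Phase 1: iterate h from start; return (c, y) with h^c(start) == y
    distinguished, or None if no distinguished point shows up in time."""
    x, c = start, 0
    while c < max_steps:
        x, c = h(x), c + 1
        if is_distinguished(x):
            return c, x
    return None

def recover_collision(x: int, c: int, x2: int, c2: int):
    """Phase 2: given h^c(x) == h^c2(x2), find z != z2 with h(z) == h(z2)."""
    while c > c2:              # bring both trails to the same distance from y
        x, c = h(x), c - 1
    while c2 > c:
        x2, c2 = h(x2), c2 - 1
    if x == x2:
        return None            # one trail is a suffix of the other: no collision
    while True:
        nx, nx2 = h(x), h(x2)
        if nx == nx2:
            return x, x2
        x, x2 = nx, nx2

def find_collision(num_procs: int = 4, first_start: int = 1):
    table = {}                 # central list: y -> (start, c)
    next_start = first_start
    while True:
        for _ in range(num_procs):     # stand-in for p concurrent processors
            start, next_start = next_start, next_start + 1
            result = walk_to_distinguished(start, 20 * 2 ** D_BITS)
            if result is None:
                continue
            c, y = result
            if y in table:             # two reports with the same y
                collision = recover_collision(start, c, *table[y])
                if collision is not None:
                    return collision
            table[y] = (start, c)

if __name__ == "__main__":
    z, z2 = find_collision()
    print(z, z2, h(z), h(z2))
```

Only distinguished points are stored, so the table holds roughly a fraction $2^{-d}$ of the values hashed, while converging trails from different processors are still caught when they report the same distinguished point.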
4 Enhancing the algorithm

We present two important improvements to the parallel collision search algorithm. The first gives better scalability by improving the data partitioning; the second extends the algorithm to handle $k$-way collisions for $k > 2$. These modifications make up a substantial part of this paper's contribution.

4.1 Data partitioning

Note that van Oorschot and Wiener's algorithm has a central bottleneck. It relies on a central server to maintain a central list of all distinguished points encountered to date; all processors periodically communicate with the central server. This has a number of disadvantages. First of all, the amount of communication required clearly grows as we increase the number of processors, and communication with the central server is likely to soon become a major bottleneck. Second, the amount of data processed by the central server also grows in the same fashion, so:

- the storage capacity required will increase, which means that soon main memory will be exhausted and we will be required to store the data on disk, which is substantially slower;

- the I/O bandwidth needed will increase, which means that it will soon become a bottleneck (and in fact that point will come even faster than one might otherwise expect because of the need to use slower storage such as hard disks); and

- the amount of computation needed to manage the data will increase, which means that it may eventually also become a bottleneck.

Third, the placement of all distinguished point data at the central server means that (unless special efforts are taken) all of the phase 2 calculations will occur at that one node, so the second phase will not be parallelized. If we are attempting to find a great many collisions, the poor performance of the second phase will become a serious problem. In short, the centralized data partitioning will severely limit the scalability of the algorithm.

In fact, we started to observe effects attributable to these problems in early experiments with relatively small numbers of nodes, which is what motivated us to examine the data partitioning issue. We suggest distributing the list of distinguished points among all of the processors to avoid these central bottlenecks. In particular, we suggest using a hash table (keyed on the reported distinguished point, but not on the previous distinguished point or the trail length between them), with buckets distributed among all the processors. When a processor has a distinguished point to report, instead of sending it to the central processor, it identifies its proper location in the distributed data structure, and sends it directly to the appropriate destination. The destination incorporates it directly into the hash table, checks to see whether this distinguished point has been seen before (note that this check is entirely local, due to the choice of partitioning), and if not, continues with its computation. (These communications are not synchronized, but occur at random times as distinguished points are encountered.) The random nature of $h$ implies that it is appropriate to distribute the hash buckets among the processors in a uniform, static manner.

This improvement allows us to avoid all of the drawbacks of a central server and associated central bottleneck; in return, we expect to see significantly better scaling behavior. For instance, while the amount of data generated still grows linearly in the number of processors $p$, now the storage resources available also grow linearly in $p$. It turns out that we can keep the entire hash table in RAM, which speeds up accesses to the data structure significantly. Furthermore, I/O bandwidth and network bandwidth are now split among all of the processors, eliminating those bottlenecks. Finally, we note that a careful choice of partitioning ensures that data is distributed in such a way that each processor does its second-phase computation entirely on local data, and we get the desired parallelization and scalability there for free. This is an especially big advantage when generating collisions en masse, and even more so for $k$-way collisions ($k > 2$).
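The partitioning idea can be illustrated with a small in-process model. This is a sketch under assumptions, not the code used in this project: the `Node` class, the `owner` rule (distinguished point modulo the number of processors, one possible uniform static assignment), and the direct method call standing in for a network message are all hypothetical.

```python
from collections import defaultdict

class Node:
    """One processor's share of the distributed hash table."""

    def __init__(self, rank: int):
        self.rank = rank
        self.buckets = defaultdict(list)  # distinguished point y -> [(x, c), ...]

    def on_report(self, x: int, c: int, y: int):
        """Handler run on the owning node when a triplet (x, c, y) arrives.
        The duplicate check touches only this node's local buckets."""
        earlier = [entry for entry in self.buckets[y] if entry[0] != x]
        self.buckets[y].append((x, c))
        return earlier  # non-empty means a cluster to hand to the second phase

def owner(y: int, p: int) -> int:
    """Static, uniform assignment of distinguished points to p processors."""
    return y % p

def report(nodes, x: int, c: int, y: int):
    """Route a triplet straight to its owner (stand-in for sending a message)."""
    return nodes[owner(y, len(nodes))].on_report(x, c, y)

if __name__ == "__main__":
    nodes = [Node(rank) for rank in range(4)]
    print(report(nodes, 10, 3, 42))  # first sighting of y = 42
    print(report(nodes, 11, 5, 42))  # second trail ending at the same y
```

Because the table is keyed only on $y$, every triplet ending at the same distinguished point lands on the same node, which is why the second-phase work for a cluster can stay local.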
In short, we recommend our improved data partitioning scheme for anyone implementing van Oorschot and Wiener's parallel algorithm: there's no reason not to use it, and it greatly improves scalability.

4.2 k-way collisions

In this section, we generalize the parallel collision search algorithm to handle $k$-way collisions, for $k > 2$. We modify only the second phase; the first phase proceeds exactly as before (except that it needs to continue for far longer, to ensure that we have performed at least $2^{(k-1)n/k}$ hash calculations). The second phase examines each hash bucket, looking for clusters of entries which all share the same distinguished point value (but for which the previous distinguished point values are all distinct). We ignore clusters of size less than $k$, as they cannot possibly lead to a $k$-way collision. Also note that, unlike 2-way collisions, having a cluster of size $k$ is not sufficient to ensure a $k$-way hash collision for $k > 2$. For example, we could have $k$ distinguished points $x_1, \ldots, x_k$ for which $h(h(x_i)) = y$ for all $i$ ($y$ is the common distinguished point). Then the $k$ triplets $(x_i, 2, y)$ would be reported, but it may be that $h(x_i) = z$ for $1 \leq i \leq k-1$ and $h(x_k) = z'$ for two points $z \neq z'$ that also happen to satisfy $h(z) = y = h(z')$.

Suppose we are given a single cluster of size $j$. List the triplets in this cluster as $(x_1, c_1, y), (x_2, c_2, y), \ldots, (x_j, c_j, y)$, so that $c_1 \geq c_2 \geq \cdots \geq c_j$. We search for the possible $k$-way collision resulting from this cluster by repeatedly performing the following steps 1-3:

1. For each $i$ such that $c_i = c_1$ (note that this will be all $i \leq i^*$ for some $i^*$, because of the ordering of the $c_i$), replace $(x_i, c_i, y)$ with $(h(x_i), c_i - 1, y)$. Note that this replacement maintains the invariants that the $c_i$ are in non-increasing order, and that $h^{c_i}(x_i) = y$ for each $i$.

2. After this replacement, you may find multiple entries in the list that have become the same. If there are $k$ or more entries that are now the same, their previous values form (at least) a $k$-way collision, which you record.

3. Remove all duplicates from the list (thus shrinking $j$), and go back to step 1 until there are fewer than $k$ triplets in the list.

This algorithm for the second phase is very efficient. Note that it performs (at most) the number of hash calculations that would be performed in the second phase for a $k = 2$ collision search (given the same set of triplets produced by the first phase). Furthermore, the clusters may be analyzed in parallel (and in fact, they will be, due to the data partitioning scheme described above), and such computations are purely local, requiring no communications. Therefore, we see that this is very scalable; in practice, the cost of the first phase will dominate the total cost of the $k$-way collision search effort.

This demonstrates that the successful parallel collision search algorithm of van Oorschot and Wiener extends very cleanly to the $k > 2$ case; moreover, much of our experience with the $k = 2$ algorithm applies directly to this extended algorithm as well.

5 Implementation

We implemented this algorithm on the NOW, a high-performance parallel platform, using GLUnix to select idle workstations. The NOW was well-suited to this project: it had some convenient (though far from perfect) software development tools, it supports incremental scalability as well as absolute scalability (which helps avoid any need for the dreaded "forklift upgrade"), and it was readily available for our use. We targeted the Sun UltraSPARC workstations, as they provide extremely high performance and a fast network interface, and there are a large number of them (80+) available through the NOW.

We coded our initial implementation with the MPI library. At the time this seemed reasonably well-suited to our application. However, at a late point in development (we had nearly completed implementation, or so we thought), we ran into one serious stumbling block: a bug in MPI. To achieve maximum performance, we desperately needed asynchronous, non-blocking primitives to both send and receive messages; however, after some investigation, we learned that the non-blocking network-polling primitive simply didn't work on the NOW due to some bug in the implementation of the MPI library. The local MPI expert was out of town for several weeks, and we had no hope of tracking down and fixing the bug ourselves, being relatively new to MPI programming. We were blocked on the MPI bug; we could have waited for the expert to return, but we decided the wisest thing to do was to re-implement with another tool that had better support for the primitives we needed. In hindsight, that was for the best: the fundamental problem was that we were using primitives that are part of the periphery of MPI, not the core functionality that everyone uses (and thus is presumably more stable and correct and higher-performance).

With that experience, we investigated the available tools a bit further, and decided that AM-2 (Active Messages 2) was a good choice, and would have made a much better fit for our application in the first place. One would be hard-pressed to imagine a pre-existing tool better suited to our needs:

- Our algorithm is totally message-driven; implementing with a shared-memory abstraction would have been very painful.

- AM-2 operates close to the metal, and it has a simple interface with low overhead for short messages, which fits the profile of our application's communication needs. (No fancy features, but we didn't really need bells and whistles; therefore, the simplicity and high performance of AM-2 were ideal.)

- The AM-2 library provides a convenient, easy way to get event-driven semantics without needing to write threaded code. The algorithm is very event-driven, and so this ease of development was well appreciated.

Furthermore, AM-2 was widely deployed on all the NOW UltraSPARC nodes; it was written here at Berkeley, so support seemed likely to be easier to find; good documentation and example code were readily available; several other parallel languages seemed to be implemented in AM-2, which suggested that AM-2 would be at least as stable as they are; and AM-2 did not have any artificial restrictions on the number of nodes it could run on (such as requiring that the number be a power of two, for instance). Best of all, local AM-2 gurus were rumored to be highly available.

The MPI bug that prompted the move to AM-2 was eventually fixed. However, we do not regret the time it took to re-implement for AM-2. The MPI bug forced us to do a better job selecting our tools; as we found out, our initial choice had been sub-optimal. We learned a valuable lesson: a little time spent finding the best tool before coding pays off in the end.

We learned another lesson from the experience: some of the parallel development tools on the NOW are woefully inadequate. GLUnix was a mess, suffering from persistent stability problems (it would often get into a funny state, and need to be restarted) and from occasional configuration errors that violated the abstraction of a homogeneous cluster. The NOW hardware also had occasional connectivity problems (the poor cross-cluster communications performance was notable), including one memorable instance (following some wiring upgrades, no surprise there) where the entire network became partitioned into two halves. Fortunately, Alan Mainwaring stepped in to fix the hardware problems each time those arose. In the end, the recurring GLUnix stability problems were the most noticeable.

However, we also observed a very promising phenomenon: these tools seemed to improve significantly during the last two months of the semester when the NOW saw a lot of usage from CS267 projects. Several students involved in the NOW project evidently put in a lot of hard work to make the NOW as usable as possible for CS267 students, and that paid off. We hope those trends will continue.
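For readers who want to experiment with the second-phase procedure of Section 4.2 (steps 1-3), the following sketch processes a single cluster. It is an illustration under assumptions rather than the code run on the NOW: the function name `kway_from_cluster` is hypothetical, the hash `h` is passed in as a parameter, and the guard that stops once every trail length reaches zero is an added safety check not spelled out in the text.

```python
from collections import defaultdict

def kway_from_cluster(cluster, k, h):
    """cluster: list of (x, c) pairs with h^c(x) == y for one common
    distinguished point y. Returns a list of (value, preimages) pairs in
    which at least k distinct preimages all hash to value."""
    # Keep entries ordered by non-increasing trail length c.
    entries = sorted(set(cluster), key=lambda e: -e[1])
    found = []
    while len(entries) >= k and entries[0][1] > 0:
        c_max = entries[0][1]
        advanced = defaultdict(set)  # new value -> previous values
        untouched = []
        # Step 1: advance every entry whose trail length equals the maximum.
        for x, c in entries:
            if c == c_max:
                advanced[h(x)].add(x)
            else:
                untouched.append((x, c))
        # Step 2: k or more entries that became equal give a k-way collision.
        for value, previous in advanced.items():
            if len(previous) >= k:
                found.append((value, sorted(previous)))
        # Step 3: drop duplicates and restore the ordering.
        merged = {(value, c_max - 1) for value in advanced} | set(untouched)
        entries = sorted(merged, key=lambda e: -e[1])
    return found

if __name__ == "__main__":
    # Tiny hand-made function: 1, 2, 3 -> 10; 4 -> 11; 10, 11 -> 100 (= y).
    table = {1: 10, 2: 10, 3: 10, 4: 11, 10: 100, 11: 100}
    cluster = [(1, 2), (2, 2), (3, 2), (4, 2)]
    print(kway_from_cluster(cluster, 3, table.__getitem__))
```

With the hand-made table above, the cluster of four triplets yields one 3-way collision at the value 10. The counterexample from Section 4.2, in which $k$ triplets share $y$ but split across two intermediate points, produces no $k$-way collision.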
Figure 4: Performance of our enhanced collision search algorithm, as compared to the naive MicroMint algorithm (bottom; almost flat), perfect linear scaling (second from top), and the theoretical maximum (top)

[Plot: Performance (0 to 9e+06) versus Nodes (0 to 120); series: "Naive-Algorithm", "Parallel-Performance", "Scaling", "Theoretical-Bound"]

6 Performance

This study would not be complete without a careful look at the performance of our application. We conducted extensive measurements on the NOW UltraSPARCs, running on up to 84 nodes, and running a large number of measurements when we had reserved all the nodes to ensure other users' jobs did not interfere with our application. To ensure that a comparison between performance figures made sense, we scaled up the problem size proportionately as we increased the degree of parallelism.

The primary figure of interest is performance graphed against the number of processors. We have plotted our results in Figure 4. The bottommost (nearly flat) line in the graph shows the performance of our implementation of the naive MicroMint parallel collision search algorithm from [8]; Figure 5 shows a closeup of the plot near the origin, which is where all the interesting behavior happens for this algorithm. The next line above that gives the performance of our implementation of the parallel collision search algorithm (based on van Oorschot and Wiener's work, with our improvements as described in Section 4). The diagonal line just above that is the straight line passing through the origin and our data point for one processor: it depicts the performance that we would see if we had achieved perfect linear scaling; this lets us compare scaling behavior readily. Finally, the topmost line gives the theoretical upper bound on the maximum performance possible with any algorithm; this is the performance one would achieve if everything except Phase 1 calculations of $h$ were free.

Figures 4 and 5 show us that the naive MicroMint parallel algorithm has very poor scaling behavior. After about 3 nodes, the performance curve is flat, and there is no benefit to adding additional processors; even before then, it is noticeably less than ideal. We attribute this to both the cost of I/O and communications. If the bottleneck were solely disk performance, the curve would flatten after 1 processor; if the bottleneck were solely network performance, we would expect the single-processor performance to be roughly comparable to the theoretical bound.

Figure 5: Closeup of Figure 4, displaying the performance of the naive MicroMint algorithm in more detail

[Plot: Performance (0 to 300000) versus Nodes (0 to 3); series: "Naive-Algorithm", "Parallel-Performance", "Scaling", "Theoretical-Bound"]

In contrast, we get excellent performance from the more sophisticated parallel collision search algorithm. We observe very close to perfect linear scaling all the way up to 84 nodes, which is a very positive result, as both absolute and incremental scalability were important to us. Furthermore, we see that we are not far away from the theoretical maximum. That is well-suited to MicroMint (where real money is potentially at stake), because brokers want to be sure that would-be forgers with less powerful hardware can't get better performance than the brokers do just by upgrading their algorithms. Instead, we know that one simply cannot do much better (at least at these ranges of scale) than our implementation.

7 Conclusion

In this project, we studied the parallel collision search problem. We pointed out serious shortcomings of MicroMint's naive parallel algorithm, and were thus motivated to look for a better approach. We identified and eliminated a central bottleneck in van Oorschot and Wiener's parallel algorithm; with clever data partitioning, we were able to achieve excellent scaling behavior, including both absolute and incremental scalability. We also extended the algorithm to handle $k$-way collisions efficiently and cleanly. Finally, we implemented the algorithm on the NOW to gain practical experience with it on a real parallel architecture. Measurements on up to 84 processors indicate that our implementation performs extremely well, with efficiency near the theoretical maximum, and displays near-perfect scaling behavior.

References

[1] D. Chaum, "Blind Signatures for Untraceable Payments," Proc. of CRYPTO'82, Plenum, D. Chaum, R.L. Rivest, & A.T. Sherman (Eds.).
[2] D.E. Knuth, The Art of Computer Programming, vol. 2, Addison-Wesley, 1981.
[3] T. Okamoto and K. Ohta, "Universal Electronic Cash," Proc. of CRYPTO'91, Springer.
[4] J.M. Pollard, "A Monte Carlo method for factorization," BIT, vol. 15 (1975), pp. 331-334.
[5] J.M. Pollard, "Monte Carlo Methods for Index Computation (mod p)," Math. Comp., vol. 32, no. 143, July 1978, pp. 918-924.
[6] J.-J. Quisquater and J.-P. Delescaille, "How easy is collision search? Applications to DES," Proc. of EUROCRYPT'89, Springer-Verlag.
[7] J.-J. Quisquater and J.-P. Delescaille, "How easy is collision search? New results and applications to DES," Proc. of CRYPTO'89, Springer-Verlag.
[8] R.L. Rivest and A. Shamir, "PayWord and MicroMint: two simple micropayments schemes," presented at the 1996 Security Protocols Workshop, Cambridge, UK.
[9] M. Sirbu and J.D. Tygar, "NetBill: An Internet Commerce System Optimized for Network Delivered Services," IEEE COMPCON'95, 1995.
[10] P.C. van Oorschot and M.J. Wiener, "Parallel Collision Search with Cryptanalytic Applications," to appear, Sept 23 1996.