

Center for Advanced Computation Reed College Portland, OR 97202

Richard E. Crandall

Dept. of Mech. & Aerospace Engineering Case Western Reserve University Cleveland, OH 44106

Ernst W. Mayer

Dept. of Elec. & Comp. Engineering University of Maryland College Park, MD 20742

Jason S. Papadopoulos

† To whom correspondence should be addressed. ‡ Submitted to Mathematics of Computation. § 1991 Mathematics Subject Classification. Primary 11Y11, 11Y16, 68Q25, 11A51.

Present address: 10190 Parkwood Dr. Apt. 1, Cupertino, CA 95014

We have shown by machine proof that $F_{24} = 2^{2^{24}} + 1$ is composite. The rigorous Pépin primality test was performed using independently developed programs running simultaneously on two different, physically separated processors. Each program employed a floating-point, FFT-based discrete weighted transform (DWT) to effect multiplication modulo $F_{24}$. The final, respective Pépin residues obtained by these two machines were in complete agreement. Using intermediate residues stored periodically during one of the floating-point runs, a separate algorithm for pure-integer negacyclic convolution verified the result in a "wavefront" paradigm, by running simultaneously on numerous additional machines, to effect piecewise verification of a saturating set of deterministic links for the Pépin chain. We deposited a final Pépin residue for possible use by future investigators in the event that a proper factor of $F_{24}$ be discovered; herein we report the more compact, traditional Selfridge-Hurwitz residues.

1 Computational history of Fermat numbers

It is well known that P. Fermat did, in the early part of the 17th century, describe the numbers

$$F_n = 2^{2^n} + 1,$$
noting that for $n = 0, 1, 2, 3, 4$ these are all primes, and claiming that the primality property surely must hold for all subsequent $n > 4$. In a remarkable oversight, Fermat did not go on to test the status of $F_5$, even though he could have done so quite easily using the compositeness test that now bears his name. Later modification, due to Euler, of this celebrated compositeness test led in fact to the rigorous Pépin test for the $F_n$. After the discovery of certain small factors of various $F_n$ through the ensuing centuries, and after the machine-aided work of Selfridge and Hurwitz [26] with the spectacular resolution of $F_{14}$ as composite, it was known by the early 1980s that $F_n$ is composite for all $5 \le n \le 30$, except for the four cases $n = 20, 22, 24$ and $28$, for which the character of $F_n$ remained unresolved [24]. Many of the compositeness proofs have simply involved direct sieving (to find small factors) and there seems to be no end to such discoveries. For example, in 1997 Taura found a small factor of $F_{28}$ [16]. Then began a series of three massive computations, showing

that $F_{20}$ [29], $F_{22}$ [10, 27], and in the present work, $F_{24}$ are all composite. Thus every $F_n$ with $5 \le n \le 30$ is now known to be composite. Thanks to an informal but dedicated band of Fermat-number enthusiasts (there seems to have been such a group in most every era since Fermat), the current status of known factors and the like is available on the Internet [16].

The culturally interesting term "genuine composite" refers to proven composites that nevertheless enjoy not a single known proper factor. The only known genuine composite $F_n$ are now for $n = 14, 20, 22, 24$. It is amusing that the discovery rate exhibited by these four indices $n$ appears to follow a rule of thumb: to get from the resolution of $F_m$ to the resolution of $F_n$ takes us about $3(n - m)$ years. Amusement and heuristics aside, it is important to observe that the resolution of each of these genuine composites was carried out, in its own era, near the edge of technical feasibility. In our proof for $F_{24}$ we were able to use relatively small (say "uniprocessor") machines for both the floating-point "wavefront" phase and the pure-integer proof phase, as explained later. In view of the substantial leap in size between $F_{24}$ (5050446 decimal digits, something like a "book" of digits) and $F_{31}$ (646456994 digits, more like a "bookshelf"), it is difficult to guess what manner of algorithms and machinery will succeed on $F_{31}$. The rule of thumb of $3(n - m)$, if taken seriously, would imply a resolution of $F_{31}$ by about 2020 A.D. We are saying that for such resolution to occur that soon, machinery and algorithms must undergo considerable change (as they arguably have had to do so far to establish every genuine composite).

Regarding our computation, we paraphrase the claim of Young and Buell [29] regarding their calculation for $F_{20}$, that this is the deepest ever performed for a "one-bit answer."
And the bit can be made explicit: we shall see that the high bit of the final Pépin residue can be interpreted as a Boolean bit rigorously signaling prime or composite. This is not to say that $F_{24}$ is the largest number ever subjected to a direct (non-factorial) compositeness test per se: it has recently come to the authors' attention [28] that at least one Mersenne number, namely $2^{20295631} - 1$, has been subjected to a Lucas-Lehmer test (with the conclusion of composite character) by G. Spence, but that computation cannot be considered a rigorous result since it has not been double-checked, a crucial expedient for any number of such size, and even more so for one tested on relatively error-prone PC-class hardware, as the aforementioned record Mersenne was. Thus, $F_{24}$ is also the largest verified genuine composite.¹

It is also the case that at the time of proof for $F_{22}$, some of the best machinery available was pressed hard for a good fraction of a year to achieve the proof. Incidentally, in the case of $F_{22}$, an entirely independent team of J. Carvalho and V. Trevisan [27] surprised the authors of [10] by announcing, shortly after the latter group's run, a machine proof of the same result, together with identical Selfridge-Hurwitz residues. The two proofs used different software and machinery, and so there can be no reasonable doubt that $F_{22}$ is composite. When we say "no reasonable doubt" here, we mean certainty up to, say, what might be called "unthinkable" happenstance, such as space-time-separated coincidence of cosmic rays impingent on machines causing accidental false proof and accidentally verified false proof. For example, the probability of two final Pépin residues in the $F_{22}$ case agreeing because of random bit-flips at both sites independently is say $2^{-2^{22}}$, or about $1/10^{1000000}$.
It would seem that the probability that both sites are running incorrect algorithms (perhaps even genuinely different algorithms that accidentally agree on an erroneous conclusion) just has to be greater than this. We assume that human minds too are susceptible to physical phenomena and one asks, what is the probability that every investigator has missed a flaw in the proof of Fermat's "last theorem," or for that matter in any theorem at all no matter how trivial on the face of it? Certainly such probability is not identically zero, even though we tend to assert with total confidence that such-and-such a theorem has been proven. Perhaps, after all this nonrigorous language, it is best merely to circumvent the issue of human conceptual error and guarantee in machine proofs per se that error probabilities fall below some reasonable minimum, such as the $10^{-50}$ which was touted at the inception of the field of thermodynamics in the 19th century as a kind of bound on the probability of "impossible" events.

¹ G. Woltman of GIMPS cites an error (false-final-residue) rate on the order of 1% for the million-digit Mersennes, whose runs occur primarily on PCs. Assuming for simplicity's sake an error rate linear in the number of machine operations, i.e. roughly quadratic in the number of digits of the number under test, it is clear why one cannot claim a high degree of confidence in the results of such a large single computation performed on a PC, until at least a second floating-point run has been performed.

During the preparations for our runs of $F_{24}$ we again reproduced the previous Selfridge-Hurwitz residues for $F_{22}$. Because we were unaware of any independent team also testing $F_{24}$, we took extreme care to verify our proof, as explained below. Again (in 1999) the machinery was hard-pressed for months, algorithm refinement required substantial labor (but was well worth the effort), and so on.

To convey an idea of scale, we note that $F_{24}$ is a number of nearly 17 million binary bits. Thus it is larger even than the square of the currently largest known explicit prime $2^{6972593} - 1$, found in June of 1999 by N. Hajratwala, G. Woltman, and S. Kurowski, and verified by one of the authors (EWM). If $F_{24}$ had turned out to be prime, it would therefore have dwarfed all other known primes. But such was not the case. In fact what appears to be a reasonable estimate of the "probability" that $F_n$ be prime is attainable via a straightforward sieve-based argument [18]. Since for $n > 1$ every prime factor $p$ of $F_n$ must be of the form

$$p = 1 + k\,2^{n+2} \qquad (1)$$
one may deduce approximate formulae such as:
$$\mathrm{Prob}\{F_n \text{ is prime}\} \approx \frac{e^{\gamma}\log_2(B)}{\log_2(F_n)} \approx \frac{e^{\gamma}\log_2(B)}{2^n}, \qquad (2)$$
where $\gamma = 0.577\ldots$ is the Euler constant and $B$ is the current lower bound (e.g. for $n > 20$ or so, the upper bound for sieve-based factoring) on prime factors of $F_n$. Since $F_{24}$ has been sieved by various investigators, in regard to the constraint (1), up to roughly $B \approx 10^{20}$ [17], we obtain a probability less than $10^{-5}$ for $F_{24}$ to be prime. Incidentally, no matter how one twists and turns, any manner of rough summation of the given probability on $F_n$ leads us to the conclusion that there are probably not any more prime Fermat numbers beyond $F_4$. We note that Mersenne candidates $M_q = 2^q - 1$ with $q$ itself prime also have a special form $p = 1 + 2kq$ for possible factors, but even though this leads to a qualitatively similar estimate for the probability of an individual $M_q$ being prime, we expect to find Mersenne primes with some regularity, since in a given exponent range there are simply so many more primality candidates among the numbers for which no explicit factor has been found (all primes in the given exponent range for Mersennes, versus only integer powers of 2 for the Fermats).
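As a rough numerical illustration of estimate (2) and the factor form (1), the following Python sketch evaluates the heuristic for a given $n$. The function names are illustrative only, and the default sieve bound $B = 10^{20}$ is simply the figure quoted above for $F_{24}$; nothing here is part of the original analysis.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, gamma = 0.577...

def prob_fermat_prime(n, sieve_bound=1e20):
    """Heuristic (2): e^gamma * log2(B) / 2^n, capped at 1 for small n."""
    estimate = math.exp(EULER_GAMMA) * math.log2(sieve_bound) / 2**n
    return min(1.0, estimate)

def has_fermat_factor_form(p, n):
    """Constraint (1): p = 1 + k * 2^(n+2) for some integer k >= 1."""
    return p > 1 and (p - 1) % 2**(n + 2) == 0

print(f"estimated Prob(F_24 is prime) ~ {prob_fermat_prime(24):.2e}")

# Euler's factor 641 of F_5 has the required form: 641 = 1 + 5 * 2^7.
print(has_fermat_factor_form(641, 5), (2**32 + 1) % 641 == 0)
```

With $B = 10^{20}$ the estimate for $n = 24$ falls below the $10^{-5}$ quoted in the text.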

2 Method of proof
The classical Pépin test for primality asserts that if $F = F_n > 5$ is a quadratic nonresidue of an odd prime $q$, then $F$ is prime if and only if
$$q^{(F-1)/2} \equiv -1 \pmod{F}.$$


This is really just an Euler pseudoprime test; Pépin's contribution was to show that such a test to base 3 in fact constitutes a rigorous primality test for the $F_n$. It is most typical to choose $q = 3$ (and in fact the reporting of the Selfridge-Hurwitz residues assumes this is the case), and compute the final Pépin residue $R_n$, which we define:
$$R_n = 3^{(F_n-1)/2} \bmod F_n,$$

where by this notation we mean $R_n$ is the least nonnegative residue modulo $F_n$. Now the Pépin criterion says that $F_n$ is prime if and only if $R_n = F_n - 1$. In the case of primality the final Pépin residue $R_n$ is the largest possible binary value $10\ldots0$ modulo $F_n$, so the highest bit position for residues, namely the $2^{2^n}$ position, has a '1' if and only if $F_n$ is prime. We continue herein the tradition of Selfridge and Hurwitz by reporting the numbers
$$R_n \pmod{2^{35} - 1,\; 2^{36},\; 2^{36} - 1}$$

for use by future investigators in matters of verification. Note that as a quick check, $F_n$, $n > 5$, is composite if the second Selfridge-Hurwitz residue is nonvanishing. We also stored the complete residue, which will be of use whenever a new factor of $F_{24}$ is discovered. The value of such permanent storage is based on the Suyama method for checking compositeness of cofactors. If we can decompose $F_n = fG$ where $f$ is a (not necessarily prime) known factor, then the cofactor $G$ is prime only if
$$(R_n^2 \bmod F_n) \bmod G \equiv (3^{f-1} \bmod F_n) \bmod G;$$
otherwise, $G$ is composite. The point of having the nested mod operations is that when the product of known prime factors $f$ is a relatively small part of $F_n$, the number of modular multiplications needed to calculate $3^{f-1} \bmod F_n$ is much less than would be required to perform a direct Fermat or Euler pseudoprime test of the large cofactor $G$. Additionally, due to the availability of fast discrete weighted transform (DWT) arithmetic [11], each Fermat-mod multiplication with respect to $F_n$ is much more efficient than an operation modulo $G$. It was in just this way that the hardest cofactors to date, namely of $F_{19}$, $F_{21}$, were established as composite [10].

Incidentally there is yet another practical use for a permanently stored residue $R_n$. First observe that no Fermat number can be a prime power $p^k$, $k \ge 2$, because, as is easily proven, the Diophantine equation
$$p^k - 4^n = 1$$
for $k > 1$ has no solutions [10]. When a new factor $f$ of an $F_n = fG$ is discovered, one can further use the stored final Pépin residue $R_n$ to determine whether the cofactor $G$ is a prime power. As explained in [10], one approach is to calculate
$$\gcd(3^{fG} - 3^f,\; G) = \gcd(R_n^2 - 3^{f-1},\; G),$$
for if this should be 1, then $G$ is neither prime nor a prime power, i.e. one neatly combines both the Suyama and the prime-power tests into evaluation of a single gcd.
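The Pépin test, the Selfridge-Hurwitz residues and the Suyama cofactor test described in this section can be tried on small Fermat numbers with the Python sketch below. It relies on Python's built-in arbitrary-precision `pow` rather than on any DWT arithmetic, and all function names are illustrative assumptions. The known factors 641 of $F_5$, 274177 of $F_6$ and 2424833 of $F_9$ serve as test data.

```python
def fermat(n):
    """F_n = 2^(2^n) + 1."""
    return 2**(2**n) + 1

def pepin_residue(n):
    """R_n = 3^((F_n - 1)/2) mod F_n, the least nonnegative residue."""
    F = fermat(n)
    return pow(3, (F - 1) // 2, F)

def is_fermat_prime(n):
    """Pepin criterion with q = 3 (for n >= 1): F_n is prime iff R_n = F_n - 1."""
    return pepin_residue(n) == fermat(n) - 1

def selfridge_hurwitz(n):
    """R_n reduced modulo 2^35 - 1, 2^36 and 2^36 - 1."""
    R = pepin_residue(n)
    return (R % (2**35 - 1), R % 2**36, R % (2**36 - 1))

def suyama_cofactor_may_be_prime(n, f):
    """Suyama test on G = F_n / f: a False result proves G composite.

    A True result is only a necessary condition for primality of G.
    """
    F = fermat(n)
    if F % f != 0:
        raise ValueError("f does not divide F_n")
    G = F // f
    R = pepin_residue(n)
    lhs = pow(R, 2, F) % G        # (R_n^2 mod F_n) mod G
    rhs = pow(3, f - 1, F) % G    # (3^(f-1) mod F_n) mod G
    return lhs == rhs

print([n for n in range(1, 9) if is_fermat_prime(n)])
print(selfridge_hurwitz(7))
print(suyama_cofactor_may_be_prime(5, 641))       # cofactor 6700417 is prime
print(suyama_cofactor_may_be_prime(9, 2424833))   # cofactor of F_9 is composite
```

In a real run the stored $R_n$ would be read back from disk rather than recomputed, which is the whole point of depositing the full residue.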

3 Algorithms
Now we turn to the algorithmic issues, the main idea being to calculate (and store some subset of) the Pépin residues $3^{2^k} \bmod F_{24}$. It is evident that we need to square essentially random residues of about $2^{24}$ bits each, a total of $2^{24} - 1$ times, in the sense that 3 is the 0-th square and the final Pépin residue $R_n$ is the result of the $(2^{24} - 1)$-th squaring.
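The idea of storing a subset of the residues $3^{2^k}$ so that each stretch of the squaring chain can be rechecked independently can be sketched as follows. This is a toy model on $F_8$ using Python's built-in integer arithmetic; the checkpoint interval and the function names are hypothetical and are not taken from the actual runs.

```python
def pepin_chain_with_checkpoints(n, interval):
    """Square 3 repeatedly mod F_n, saving the residue every `interval` steps.

    Returns a dict {k: 3^(2^k) mod F_n}; the last key is 2^n - 1.
    """
    F = 2**(2**n) + 1
    total = 2**n - 1
    x = 3                      # the 0-th square
    checkpoints = {0: x}
    for k in range(1, total + 1):
        x = x * x % F
        if k % interval == 0 or k == total:
            checkpoints[k] = x
    return checkpoints

def verify_link(n, k_start, x_start, k_end, x_end):
    """Recompute one link of the chain from a stored residue; such checks
    could be farmed out to separate machines, one link each."""
    F = 2**(2**n) + 1
    x = x_start
    for _ in range(k_end - k_start):
        x = x * x % F
    return x == x_end

cps = pepin_chain_with_checkpoints(8, 64)
ks = sorted(cps)
print(ks)
print(all(verify_link(8, a, cps[a], b, cps[b]) for a, b in zip(ks, ks[1:])))
```

Because every link starts from a stored residue, the links are independent of one another, so the verification parallelizes even though the chain itself is strictly sequential.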

Let us adopt some nomenclature. Denote a residue $x$ modulo $F_n$ by its digits (in some treatments called a "signal" [11]) $\{x_0, x_1, \ldots, x_{N-1}\}$ in the sense that we have a base-$W$ expansion
$$x = \sum_{j=0}^{N-1} x_j W^j.$$

Ignoring for the moment the possibility of $x \equiv -1$, which case can easily be handled separately in actual implementation, we see that (for the particular case of convolution modulo $F_n$)
$$W^N = 2^{2^n}, \qquad 0 \le x_j < W$$

may be assumed. (We should note right off that modern floating-point-based implementations have better controlled error behavior if balanced digits are used, i.e. one forces $-W/2 \le x_j < W/2$, this being just one of many enhancements discovered in recent times [11].)

Now it is an important fact that integer multiplication modulo a Fermat number is equivalent to negacyclic convolution with proper carry [9], [11]. What this means is that for two digit decompositions of residues $x, y$, the value of $xy \bmod F_n$ can be obtained by adjusting the carries in the negacyclic convolution $(x \times_{-} y)$ defined elementwise as:
$$(x \times_{-} y)_m = \sum_{j+k=m} x_j y_k \;-\; \sum_{j+k=N+m} x_j y_k.$$
In the case of generating Pépin residues, we only need an "autonegacyclic" convolution, meaning $(x \times_{-} x)$, and this leads to important simplifications. The basic method of calculating an autonegacyclic via DFT (and therefore employing FFT methods) is this: for digit expansion $\{x_j\}$ one can form a DWT:
$$\hat{x}_k = \sum_{j=0}^{N-1} x_j\, g^{j}\, g^{-2jk},$$
where $g$ is a primitive $N$-th root of $(-1)$, and use this to construct the autonegacyclic convolution as:
$$(x \times_{-} x)_m = \frac{g^{-m}}{N} \sum_{j=0}^{N-1} \hat{x}_j^{\,2}\, g^{+2jm}.$$
Of course the DWT and its inverse are calculated via FFTs, and various enhancements abound for reducing run length and so on. It is crucial to some of the possible calculation schemes that the right-hand sum involves squarings per se of the $\hat{x}$ terms, and as is well known squaring in some cases has half the complexity of general multiplication. This observation may not be an issue for standard floating-point implementations, but if arbitrary-precision arithmetic is to be used (for example some day for the stultifying $F_{31}$), or alternative convolution schemes are employed, the opportunity for pure dyadic squaring can be telling.

Perhaps surprisingly, there are a good many available methods for effecting a negacyclic convolution. We mention various methods below, although after assessing performance issues we only applied a subset of these in the compositeness proof. Though obvious, it should be said right off that all these methods have the same computational goal, and only the pure-integer methods are devoid of round-off problems, so the floating-point methods, while currently being the fastest of the lot, are not rigorous until rendered so. Note that the following discussion assumes DWT-based arithmetic is used, which requires (for Fermat-mod DWT) the base $W$ to be a power of two with an exponent that is itself a power of two, e.g. $W = 2^{16}$.
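The equivalence between squaring modulo $F_n$ and an autonegacyclic convolution computed by a weighted transform can be checked on a toy case. The sketch below uses $F_5$ with $W = 2^4$ and $N = 8$, and evaluates the DWT formulas above by direct $O(N^2)$ summation in complex floating point instead of by an FFT. All names are illustrative, and the case $x \equiv -1$ is ignored, as in the text.

```python
import cmath

def digits(x, W, N):
    """Base-W digits x_0 .. x_{N-1} of a nonnegative integer x < W^N."""
    return [(x // W**j) % W for j in range(N)]

def negacyclic_direct(x, y):
    """Negacyclic convolution straight from the elementwise definition."""
    N = len(x)
    out = [0] * N
    for j in range(N):
        for k in range(N):
            if j + k < N:
                out[j + k] += x[j] * y[k]
            else:
                out[j + k - N] -= x[j] * y[k]   # wrapped terms enter negatively
    return out

def autonegacyclic_dwt(x):
    """(x x_ x)_m via the weighted transform, with g = exp(i*pi/N), g^N = -1."""
    N = len(x)
    g = cmath.exp(1j * cmath.pi / N)
    xhat = [sum(x[j] * g**j * g**(-2 * j * k) for j in range(N))
            for k in range(N)]
    out = []
    for m in range(N):
        s = sum(xhat[k]**2 * g**(2 * k * m) for k in range(N))
        out.append(round((g**(-m) * s / N).real))   # round away float error
    return out

def square_mod_fermat(x, n, W, N):
    """x^2 mod F_n from the autonegacyclic convolution of the digits of x."""
    assert W**N == 2**(2**n)
    conv = autonegacyclic_dwt(digits(x, W, N))
    # Evaluating the digit polynomial at W and reducing plays the role of
    # the carry-adjustment step.
    return sum(c * W**m for m, c in enumerate(conv)) % (2**(2**n) + 1)

x = 0xDEADBEEF
print(square_mod_fermat(x, 5, 16, 8) == x * x % (2**32 + 1))
```

The rounding step in `autonegacyclic_dwt` is exactly where floating-point error has to stay well below 0.5, which is the concern of the error analysis that follows.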

1) Floating-point discrete weighted transform (DWT). This is essentially Schönhage-Strassen multiplication via floating-point FFT [25], but with the signal "weighted" (some authors say "twisted") with an appropriate root of $(-1)$ so that the intended convolution is negacyclic. It is important to note that there are various ways to implement such a DWT, such as real-signal and folded-complex techniques that effectively halve complex signal lengths [9], [11].

For this floating-point scenario, it is useful to embark on a rough derivation of expected convolution error. There have been over the years interesting attempts at rigorous error bounding, but such bounds tend to be over-conservative in practice, as one might expect. So for the present we shall just sketch some relevant heuristics pertaining not to rigorous bounds, rather to expectations that seem, in fact, to be in agreement with numerical experiment.

To estimate how the maximum allowable word size (and hence minimum transform length) depends on the algorithm and the underlying precision, we can make use of random-walk statistics to quantify the floating-point errors. Think of a residue modulo $F_{24}$ (say) to be a signal of $N$ digits, each digit (word) of size $W$, so that $N \log_2 W \approx 2^{24}$. Let a unit-stepsize ($\pm 1$) random walk have position $x(n)$ after $n$ steps. Then, by the Fisher theorem [14],
$$\limsup_{n \to \infty} \frac{x(n)}{\sqrt{\tfrac{1}{2}\, n \log\log n}} = 1$$
with probability one. An FFT-based autoconvolution (squaring) for such a signal might be modeled as a random walk of $O(N \log N)$ steps (each $\pm cW^2$), and thus having a most-probable maximum displacement of order
$$cW^2\, (N \log N \log\log N)^{1/2}. \qquad (4)$$
Here $c$ is a prefactor whose precise value depends on many details of the implementation, such as whether a standard- or balanced-digit normalization is used for the digit representation [11], and on the properties of the actual machine arithmetic, including the rounding mode used. (It is possible that $c$ itself depends on $W$, especially in the instance of balanced representation, in which scenario the bipolar digits evidently allow some favorable and subtle cancellation, but our data indicate that if $c$ does depend on $W$ this dependence is weak.) In the absence of a good a priori estimate for $c$, its magnitude can be effectively estimated by fitting (4) to empirical roundoff error data.

The name of the game is, of course, to minimize the computation time (which is proportional to $N \log N$ at fixed precision) while not allowing the maximum convolution error estimated in (4) to be out of bounds. As the product $N \log_2 W$ is naturally constrained, we therefore want the largest possible $W$. We require unnormalized output digits, upon rounding, to have all of their whole-number bits be significant, i.e.

(number of exact output bits) + (number of error bits) $\le$ (number of mantissa bits).²

In words, the accumulated round-off error must remain small enough to permit one, during the round-and-propagate-carries phase of each multiprecision multiply, to confidently round each output digit to the nearest integer. For example, if a typical output digit $x_j$ has fractional part (which we define as $\mathrm{frac}(x_j) = |x_j - \mathrm{nint}(x_j)|$, which by definition is in the interval $[0, 0.5]$) no larger than 0.1 we are safe, but when fractional parts approach 0.3-0.4 we are dangerously close to an incorrect rounding, and a fractional part of 0.5 (especially if it occurs repeatedly) virtually guarantees that a catastrophic loss of precision has occurred.

² Since the largest digit of the autoconvolution will (similarly to the largest error in our random-walk model) have magnitude $O(W^2 \sqrt{N \log N \log\log N})$, we can lump the output magnitude part of this relation into the constant $c$ in (4).

Thus, letting $B_{\mathrm{mant}}$ be the number of mantissa bits of our floating-point representation, we have, upon taking the base-2 logarithm of (4) (with exact output and error lumped together into the as-yet-undetermined $c$) and substituting into the above inequality,
$$C + 2\log_2(W) + 0.5\,[\log_2 N + \log_2(\log\log N) - 1] \le B_{\mathrm{mant}}, \qquad \text{where } C = \log_2(c).$$
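Solving this inequality for $\log_2(W)$ gives a quick way to tabulate the largest admissible word size. The Python sketch below does so under three assumptions that are not fixed by the inequality itself: the inner logarithms are natural logarithms, $C = 1$ (the text reports that $C$ is close to unity for IEEE 64-bit arithmetic), and the quantity entering the logarithms is the random-walk step count, for which the text quotes roughly $100 \cdot 2^{20}$ at vector length $2^{20}$. The function name and parameters are illustrative.

```python
import math

def max_word_bits(steps, mantissa_bits=53, C=1.0):
    """Largest log2(W) allowed by the inequality above.

    `steps` stands in for N inside the logarithms (an assumed reading);
    `mantissa_bits` is B_mant and `C` is log2 of the empirical prefactor c.
    """
    bracket = math.log2(steps) + math.log2(math.log(math.log(steps))) - 1.0
    return 0.5 * (mantissa_bits - C - 0.5 * bracket)

# Step count of about 100 * 2^20 for a length-2^20 transform
print(round(max_word_bits(100 * 2**20), 2))

# Using the bare vector length instead loosens the bound by under two bits
print(round(max_word_bits(2**20), 2))
```

Under these assumptions the first figure comes out a little above 19 bits, in line with the word size quoted in the text; since the Fermat-mod DWT needs a word size of the form $2^{2^m}$ bits, the usable choice is then 16 bits.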

Thus the maximum allowable number of bits in an input word is constrained by
$$\log_2(W) \le \frac{1}{2}\left\{ B_{\mathrm{mant}} - C - \frac{1}{2}\,[\log_2 N + \log_2(\log\log N) - 1] \right\}. \qquad (5)$$
Clearly this model has shortcomings: one obvious one is that an FFT does not do its operations on a single operand, whose error would then obviously correspond to the deviations of a classical random walk. Rather, one makes $O(\log_2 N)$ passes through the data, and on each pass does relatively few operations on each datum. Since the number of total operations per input datum is $O(\log_2 N)$ rather than $O(N \log_2 N)$, the naïve random-walk analysis based on $O(N \log_2 N)$ steps will tend (for large $N$) to greatly overestimate the error. On the other hand, at each level of the FFT the average errors tend to get larger, i.e. in reality the stepsize of our random walk is increasing with the number of steps, and modeling the error accumulation as a fixed-stepsize random walk will tend to underestimate the error accumulation. These two effects will be offsetting to some unknown degree, but the fact remains: the simple model appears qualitatively accurate, and using empirical error data to make good our ignorance regarding the details (via the constant $C$) works very well in practice. In our tests using IEEE 64-bit precision ($B_{\mathrm{mant}} = 53$), calibrating $C$ using actual error data for large vector lengths gives a relation which fits our observed data extremely well over a wide range of $N$. Better still, the above relation can easily be recalibrated for different architectures and rounding modes to yield highly accurate machine-specific error estimates.³

We can also use the properties of random walks to quantify the most-probable distributions of floating-point errors (cf. [14], p. 76): for an ensemble of unit-stepsize random walks, the distribution of locations $S_n$ after $n$ steps tends asymptotically to normal (Gaussian), i.e. the probability that the $n$th-step location exceeds some multiple $x$ of the ensemble average $\sqrt{n}$ is given by
$$\mathrm{Prob}\{S_n > x\sqrt{n}\} \to \mathrm{erfc}(x) = \frac{1}{\sqrt{2\pi}} \int_x^{\infty} e^{-\frac{1}{2}t^2}\, dt \quad \text{as } n \to \infty. \qquad (6)$$
For our floating-point convolution with wordsize $W$ everything carries through when properly scaled by the stepsize $cW^2$. This allows us to answer questions such as, "given the mean fractional error, what is the likelihood that the maximum fractional error is (say) five times that mean, for a particular value of $n$?"⁴ Using IEEE 64-bit precision, we find that $C$ is close to unity, and hence that a vector length $2^{20}$ (yielding $n \approx 100 \cdot 2^{20}$) permits a wordsize of around 19 bits in a careful implementation. Since the Fermat-mod DWT requires a wordsize of form $2^{2^m}$, for $F_{24}$ work a convenient word size is $W = 2^{16}$, yielding a real signal length $N = 2^{20}$, or slightly more than one million signal elements.

³ There are occasional subtleties: e.g. for the Intel x86 family with its 64-bit-mantissa floating-point registers, one might be tempted to set $B_{\mathrm{mant}} = 64$, but one should only do this if one is using the 80-bit floating data type throughout the computation, i.e. not just when the data are in registers; if one is using 64-bit floating loads and stores, then the effect of the 80-bit register format appears via a slight decrease in the constant $C$, and one should set $B_{\mathrm{mant}} = 53$, not 64.

⁴ Even a fractional part of, say, 0.3 may seem very bad to those accustomed to doing things via the all-integer route, but vast experience (especially via the GIMPS project) has shown that floating-point errors in large-integer arithmetic tend to be smoothly distributed, i.e. if the largest detected fractional part is 0.3, it is extremely unlikely that the corresponding fractional error was in fact a 0.7 which was aliased to 0.3. The random-walk analysis provides a detailed qualitative and quantitative framework which supports and extends these empirical observations.

The average error for the code used for wavefront 1 (see §3.1) was $3.55 \times 10^{-4}$ (and after climbing from zero to this value during the first few dozen squarings, deviated from this mean value by less than 1% for the rest of the run), so a fatal fractional error $\ge 0.5$ should occur (in the absence of hardware error) with a probability corresponding to an event lying 1400 standard deviations outside the mean. For large deviations $x$, we can use the leading term of the asymptotic expansion of the complementary error function (6),
$$\mathrm{Prob}\{S_n > x\sqrt{n}\} \approx \frac{1}{x\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2} \quad \text{as } x \to \infty,$$


to estimate the likelihood of a fatal error occurring during any of the roughly $2^{24}$ squarings needed to resolve $F_{24}$ as less than $e^{-1000000}$, indicating that our earlier target of $10^{-50}$ would be wildly conservative if hardware error were not a consideration. In fact this estimate demonstrates that if there is to be a fatal error in such a computation, it is overwhelmingly more likely to be due to a hardware (or software) error than to actual floating-point convolution errors.

2) Chinese remainder theorem (CRT) negacyclic convolution. In this algorithm we obtain the negacyclic convolution elements modulo a set of small distinct primes $q$, each admitting an $N$-th root of $(-1)$, and hence also number-theoretical transforms relevant to negacyclic convolution of length $N$. The digit size $W$ and therefore the signal length $N$ are flexible here, depending on the nature of the CRT prime set. This approach lends itself to parallel implementation, but not to a distributed or heterogeneous-network approach, owing to the large interprocessor bandwidth needed during the CRT (carry propagation) phase, and to the fact that for efficiency's sake we need all of the processors involved to be dispatching data at similar rates. It is however ideally suited to massively parallel implementation, and is the most promising near-term approach for something like a direct (non-factorial) resolution of $F_{31}$. One could imagine a large number of relatively cheap commodity microprocessors (either of the dedicated integer variety or also possessing a floating-point capability), each performing convolution modulo its own small prime having the desired algebraic properties. The connectivity needed during the CRT phase appears to be the most difficult hurdle to be overcome, and of course even using many processors, to achieve a reasonable squaring time these processors need to be very fast. One also has to deal with issues of error correction, i.e.
the ability to prevent an error occurring in one node from contaminating the entire computation.

3) Discrete Galois transform (DGT). This scheme is so named because all operations occur in the field $GF(q^2)$ for a Mersenne prime $q = 2^p - 1$. The DGT approach allows massive run lengths (all divisors of the multiplicative group order $q^2 - 1 = 2^{p+1}(2^{p-1} - 1)$), which automatically allows a large power-of-two signal length component, and generally a decent variety of small odd primes. (3, for instance, is always a factor of $2^{p-1} - 1$.) For $F_{24}$, convenient Mersenne primes (the precise choice depends on many issues, probably the most important being the integer-arithmetic capabilities of the underlying hardware) are $2^{61} - 1$ and $2^{89} - 1$, which correspond to word sizes $W = 2^{16}$ and $2^{32}$ and signal lengths $N = 2^{20}$ and $2^{19}$ for mod-$F_{24}$ DWT. By interpreting the elements of $GF(q^2)$ as Gaussian integers $\{a + bi\}$, the run length can be effectively cut in half at the expense of complex integer arithmetic modulo $q$. However, the use of a Mersenne number allows for a fast modular reduction, so all the integer multiplications (which are generally much more expensive from a hardware perspective than floating-point multiplies, a fact which may be surprising to many) can be used for the Gaussian integer multiplication, and one may use the Karatsuba expedient to effect internal complex multiplies in 3 rather than 4 scalar multiplies. For

Mersenne-mod DWT, one has the added advantage that the modular weight factors are themselves powers of two.⁵ Further details of this convolution option appear in [13].

4) Direct number-theoretical transform (NTT). The idea is to use not a CRT prime set as in option (2), but a single prime well-matched to the underlying processor's capabilities, such as $p = 2^{64} - 2^{32} + 1$ [21]. This particular prime admits roots of high power-of-two orders, as well as permitting a fast mod operation, and works well on hardware with fast 32-bit capabilities, but perhaps lacking good 64-bit integer support.

5) Nussbaumer convolution. This method involves no actual numerical FFTs, only "symbolic" ones [9]. A negacyclic convolution is built recursively, using smaller negacyclic ones. The digit size $W$ is entirely flexible, for example $W = 2^{512}$ yields a signal length of only $N = 2^{15}$, at the expense of size-$W'$ multiplications and additions where $W'$ is slightly larger than $W$.

6) Schönhage multiplication, modulo any $2^M + 1$, and therefore negacyclic in nature. This method has the lowest known bit complexity for negacyclic multiplication, in a theoretical tie with that of optimal Nussbaumer convolution [25], [4], [12]. It might be expected to behave in practice similarly to Nussbaumer, although each of these two methods has its own special difficulties and advantages upon implementation. What can be said is that each of the two ultimately involves size-$W'$ multiplications, and so the ultimate performance issues involve memory handling and moderate-size multiprecision integer multiplications.

7) Floating-integer hybrid schemes. Since most modern general-purpose microprocessor hardware manufacturers have put a preponderance of money and silicon into the floating-point functional units (FPU) as opposed to the integer units, schemes (2)-(6) above are wasteful in the sense that they make no use whatsoever of the FPU.
One can contemplate doing certain integer operations in the FPU to speed the computation, but there is perhaps a more attractive alternative, which appears to combine the best features of both the floating-point and integer worlds. Since this approach has not been described elsewhere, we will examine it in some detail here. Consider combining an FFT-based convolution with the DGT described in (3), on hardware that permits fast operations on both floating-point and integer data 64 bits in length. (The operand length is entirely flexible, and in fact the floating and integer operands need not be of the same length, but 64 bits for both is representative of good currently available hardware.) Basically, at the same time as we do a floating FFT, we do an all-integer convolution, in the 64-bit case over the ring of complex (Gaussian) integers modulo the Mersenne prime $M_{61}$, a convenient modulus for integer operations on 64-bit hardware. When the code gets to the round-and-propagate-carries stage, if our input residue digits were, say, 50 bits (as opposed to the roughly 20 we can use in an all-floating code), our floating-point outputs (and associated errors) are of course huge, but we now have redundant information: an all-integer version of the digit in question, which tells us (exactly) what the output is modulo $M_{61}$. Thus we can incur rounding errors roughly $2^{61}$ times as large as before and still reconstruct an exact result. Since rounding errors scale as the square of the word size $W$, and $\sqrt{2^{61} - 1}$ has roughly 30.5 bits, our inputs can be 30.5 bits larger than in an all-floating implementation, i.e. we may reduce the convolution length by more than half.$^{6}$ In fact, since the reduction in vector length also reduces the number of floating-point operations (and accumulated error), we can in practice use a word size roughly 31 bits larger than with a purely floating-point transform.

$^{5}$ One can in fact view the above DGT as an interesting recursive use of Mersenne primes: one uses a small one to help speed the search for large ones.

$^{6}$ There is debate in some quarters about whether the scaling is truly quadratic in input wordsize, especially when balanced-digit representation is used. However, our accuracy tests with a fully functional (but as yet relatively slow, at least on the hardware available to us) prototype hybrid code allow us to test this assumption directly by comparing the error distributions of an all-floating and a hybrid implementation. These empirical error data strongly support quadraticity, or something very close to it.

From a coding perspective, it is desirable to have the floating and modular parts of the code mirror each other insofar as possible, thus making an FGT over GF($M_{61}^2$) a particularly attractive choice in parallel with a (64-bit floating) complex FFT. $M_{61}$ allows for a huge power-of-two component of the transform length (cf. (3) above), as well as a wide variety of small odd radices, in particular the crucial 3, 5 and 7, which permit every interval between adjacent powers of two to be neatly split into four equal-sized subintervals (e.g. the interval $[4, 8)$ gets split into $[4, 5, 3\cdot 2, 7]$). Of the Mersenne primes up to and including $M_{127}$, $M_{61}$ is the only one that allows the three smallest odd primes (and any combinations thereof, in fact any combination of radices whose product divides the surprisingly smooth $q^2 - 1 = 2^{62}\cdot 3^2\cdot 5^2\cdot 7\cdot 11\cdot 13\cdot 31\cdot 41\cdot 61\cdot 151\cdot 331\cdot 1321$) to be used as radices for the transform.

Another requirement for this scheme to pay off in terms of overall performance is that the floating and modular transforms execute at roughly the same speed. This requires both the capability to issue at least as many integer as floating instructions each clock cycle and good pipelining of all predominant arithmetic operations (floating and integer adds and multiplies). This rules out most currently available commodity microprocessors, but on leading-edge platforms like the Alpha 21264 and Intel IA-64, which have excellent integer capabilities (especially a fast $64 \times 64 \to 128$-bit integer multiply, which is crucial for speedily effecting transform arithmetic modulo $M_{61}$), we expect the modular transform will run about as fast as a floating transform alone, producing a big gain.

To complete our compositeness proof for $F_{24}$ we used essentially two variants of (1) for the floating-point machines, and two variants of (5) for the integer-based "drones."
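The error-correction idea behind the hybrid scheme can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `mulmod_m61` and `reconstruct_exact` are hypothetical, Python's arbitrary-precision integers stand in for the $64 \times 64 \to 128$-bit hardware multiply, and the sketch assumes the floating-point error is smaller than $M_{61}/2$ in magnitude.

```python
M61 = (1 << 61) - 1  # the Mersenne prime 2^61 - 1


def mulmod_m61(a: int, b: int) -> int:
    """Multiply two residues modulo M61 using the identity 2^61 = 1 (mod M61).

    The product is the analogue of a 64x64->128-bit multiply; the reduction
    needs only shifts, masks and adds (no division).
    """
    t = a * b
    t = (t & M61) + (t >> 61)  # fold the high part onto the low part
    t = (t & M61) + (t >> 61)  # second fold absorbs the carry of the first
    return 0 if t == M61 else t


def reconstruct_exact(approx: float, residue: int) -> int:
    """Recover the exact integer x from a noisy float and x mod M61.

    Assumption: |approx - x| < M61 / 2, so the correction is unique.
    """
    base = round(approx)               # nearest integer to the float output
    diff = (residue - base) % M61      # error modulo M61, in [0, M61)
    if diff > M61 // 2:                # lift to the centered range
        diff -= M61
    return base + diff
```

For example, an exact output of about 100 bits whose floating-point image is off by roughly $10^{17}$ is still recovered exactly, because the residue modulo $M_{61}$ pins down the error.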
Our implementation of option (5), Nussbaumer convolution, is an adaptation of established software for general integer convolution [9]. Such software has been used in many disparate domains, ranging from number theory to signal processing, but was in fact pioneered by J. P. Buhler for the purpose of numerical investigations on Fermat's "last theorem" (FLT) [5, 6, 7]. It has been known to Buhler and colleagues for about a decade that integer convolutions of lengths into the millions are indeed possible in error-free fashion, on conventional (even by 1980s standards) machinery, as no floating-point arithmetic is involved. Thus the scheme was a natural choice for checking the Pepin residues via "drone" machines, as described in Section 4.

Many optimizations are possible beyond the original description of Nussbaumer, and even beyond the variant so successful for FLT computations. Let us name a few optimizations, to convey the kind of thinking that improved the speed of proof. First, the small negacyclics at the bottom recursive level of Nussbaumer can be done especially fast because they turn out to be autonegacyclics. For example, a length-4 autonegacyclic can be done, amazingly enough, in 3 multiplies and 4 squares, for an equivalent count of just 5 multiplies, which is very much faster than the naive 16 multiplies. Second, special FFT structure as discussed next, but as applied to the symbolic FFTs within Nussbaumer recursion, results in substantial improvement. Third, there are special ways to combine (or even remove) many transposition operations and memory motion, to exploit the known cache behavior on certain machinery. All of these together resulted in an implementation of option (5) that required around 5 CPU-seconds per squaring on a 500 MHz Apple G4 processor, and comparable but somewhat longer times on similar-frequency Pentium II machinery. Note that either manifestation on one "wavefront" machine would have required several years to complete the proof.
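As a point of reference for what the integer "drones" compute, the following Python sketch squares a number modulo a Fermat number through an exact autonegacyclic convolution of its digits. It uses the naive quadratic-time definition, not Nussbaumer's algorithm, and the names `negacyclic_square` and `square_mod_fermat` are illustrative assumptions rather than anything taken from the software cited above.

```python
def negacyclic_square(x):
    """Naive length-N autonegacyclic convolution of the integer list x.

    z[k] = sum_{i+j=k} x[i]*x[j] - sum_{i+j=k+N} x[i]*x[j]
    """
    n = len(x)
    z = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                z[k] += x[i] * x[j]
            else:
                z[k - n] -= x[i] * x[j]  # wrapped terms enter with a minus sign
    return z


def square_mod_fermat(value: int, n_exp: int, digit_bits: int) -> int:
    """Square value modulo F = 2^(2^n_exp) + 1, assuming 0 <= value < 2^(2^n_exp)
    and that digit_bits divides 2^n_exp."""
    total_bits = 1 << n_exp
    num_digits = total_bits // digit_bits
    mask = (1 << digit_bits) - 1
    digits = [(value >> (digit_bits * i)) & mask for i in range(num_digits)]
    z = negacyclic_square(digits)
    # Reassemble the convolution outputs; big-integer arithmetic does the carries.
    total = sum(zk << (digit_bits * k) for k, zk in enumerate(z))
    return total % ((1 << total_bits) + 1)
```

The minus sign on the wrapped terms is what makes the convolution equivalent to reduction modulo $2^{2^n} + 1$, since the digit base raised to the number of digits is congruent to $-1$.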
As it was, the floating-point option (1) implementation we detail next needed just 6 CPU-months on hardware significantly less than cutting-edge. Owing to this order-of-magnitude speed advantage over the best available all-integer algorithms (and the as-yet nascent nature of the hybrid algorithm), the floating-point machinery emerged as the natural "wavefront" candidate, to which consideration we next turn.

In deciding on one's particular implementation of option (1), one must consider the important discoveries that have accrued in the 1990s in regard to large-signal FFTs. In particular, the "parallel" or "four-step" FFT [1, 2, 3, 12], which essentially allows the mapping of a one-dimensional FFT to a row-column matrix FFT, is a genuine boon to the large-integer industry. For various reasons, neither of the two independently developed floating-point codes used herein does a standard four-step FFT, although they do have the same aim, namely to allow for excellent performance of the fundamentally data-nonlocal FFT algorithm on modern, multilevel-cache-based microprocessor architectures. In this regard, the most crucial levels of the memory hierarchy are the two closest to the FPU itself, namely the registers ("L0 cache") and the on-chip data (L1) cache. It is desirable to do as many operations as possible on the data while they are in registers, which naturally leads to higher-radix FFTs, and to move data, especially ones that are widely separated in terms of their indices, in and out of memory with as few L1 cache conflicts as possible, i.e. to minimize thrashing. The two floating-point codes accomplish both tasks admirably well, but in strikingly different ways, which will be described next. Nomenclaturally, we shall be considering the two major approaches to the FFT: the Cooley-Tukey or decimation-in-time (DIT) transform, which begins with bit-reversal-reordered input data and outputs an ordered transform vector, and the Gentleman-Sande or decimation-in-frequency (DIF) transform, which begins with ordered input data and outputs a bit-reversed transform vector [8].

3.1 FFT Algorithm 1

FFT Algorithm 1 (used by EWM in his proof$^{7}$) begins with two simple observations. First, for a length-$N$ ($N$ real signal elements) FFT of standard design (radix-2, with bit-reversal data preordering in the DIT case), the index strides between data pairs range from 1 to $N/2$ as one performs the $\log_2(N/2)$ passes of the FFT ($N/2$ to 1, in that order, for DIF). Assuming $N$ to be a power of 2, the large strides are a well-known problem, since L1 cache lines are usually allocated using the bottom dozen or so bits of the data locations in memory, and large power-of-two strides are thus guaranteed to cause massive cache thrashing. One way to avoid this while still retaining the nice algorithmic properties of a power-of-two FFT is to reorder the data in the form of a two-dimensional array (with power-of-two row and column dimension) and add a small number of padding rows to this matrix, so elements in the same row and different columns will no longer be separated by a power-of-two stride in memory (this assumes the Fortran storage convention, namely columnwise storage of 2-D arrays; in a C implementation one would add padding columns instead). This was in fact the strategy used in the first, flawed, run of $F_{24}$ (cf. §4; note that the failure of that run was unrelated to any of these algorithmic issues).

In preparing for the second run, and with a view toward eventual implementation of a hybrid floating/integer scheme (where each integer operation is rather precious), it was decided to eschew the extra integer computations needed for conversion of 2-D array addresses to linear memory locations and instead to do the array padding directly within the original 1-D array. One particularly simple way to do this is to divide the array into $R_2$ contiguous data blocks, where $R_2$ is the largest power-of-two radix used during the computation (we chose $R_2 = 16$, which is well-suited to architectures with 32 to 64 floating registers) and to insert padding elements between these blocks. If the number of (unused) data elements in each padding block is itself a small power of two, $N_{\mathrm{pad}} = 2^{\mathrm{pad\_bits}}$, then an index $j$ relative to the unpadded array is easily converted to a padded-array value via the three-operation sequence

$$ j_{\mathrm{pad}} = j + \left[ (j \gg \mathrm{nblock\_bits}) \ll \mathrm{pad\_bits} \right] \qquad (8) $$

where $\gg$ and $\ll$ denote rightward and leftward logical shifts and nblock_bits is the base-two logarithm of $N/R_2$, the number of elements in each block of contiguous data. We also define $N_{\mathrm{pad}}$ as the dimension of the padded array, obtained by replacing $j$ by $N$ in the above formula. For FFT lengths of the form $p \cdot 2^k$, where $p$ is an odd prime, most of the above discussion carries through, except that it is convenient to break the array into $p \cdot R_2$ contiguous data blocks.

$^{7}$ The source code for this implementation is available at oat.em

The second key insight is that, while one can never do an FFT in completely data-local fashion, one can replace the highly variable strides of a standard implementation with uniform strides, thus ensuring that at each level of the multipass FFT algorithm, data access patterns are the same. Then, optimizing the padding parameters for one pass optimizes them for all. In our implementation, in a radix-$R$ pass of a DIT transform, we gather $R$ complex data separated by strides $N_{\mathrm{pad}}/R$ in an array $A$, process them together with any needed "twiddle factors" (complex roots of unity; cf. [23]), and dispatch the resulting data to contiguous locations in a receiving ($B$) array, as shown schematically in Fig. 1. For a radix-$R$ pass of a DIF transform, we gather $R$ contiguous data from the $B$-array, process them, and dispatch the result to $N_{\mathrm{pad}}/R$-separated locations in the $A$-array, again as in Fig. 1 but with data flowing from bottom to top. Since the larger stride $N_{\mathrm{pad}}/R$ is constant, it can be defined as a parameter, and thus one only needs to explicitly compute a padded-array index using (8) for the first element of each set of $R$ data.$^{8}$ For transform lengths of the form $p \cdot 2^k$ one is faced not with a single radix but rather a mixture of, e.g., three different radices $p$, $R_1$ and $R_2$, where $R_1$ and $R_2$ are two different power-of-two radices. In that case one simply defines three strides, $s_1 = N_{\mathrm{pad}}/p$, $s_2 = N_{\mathrm{pad}}/R_1$ and $s_3 = N_{\mathrm{pad}}/R_2$, and proceeds as above.
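The padded-index conversion described above amounts to one right shift, one left shift and one add. A small Python sketch, with a deliberately tiny array so the numbers can be checked by hand (the function name `padded_index` and the example sizes are illustrative assumptions, not values from the actual run):

```python
def padded_index(j: int, nblock_bits: int, pad_bits: int) -> int:
    """Map an index j in the unpadded array to its location in the padded array.

    j >> nblock_bits is the number of complete data blocks before element j;
    each of those blocks is followed by 2**pad_bits unused padding elements.
    """
    return j + ((j >> nblock_bits) << pad_bits)


# Toy example: N = 64 elements split into R2 = 16 blocks of 4 elements each
# (nblock_bits = 2), with 2 padding elements after each block (pad_bits = 1).
N, NBLOCK_BITS, PAD_BITS = 64, 2, 1
padded_dimension = padded_index(N, NBLOCK_BITS, PAD_BITS)  # replace j by N
```

In this toy layout the first block is untouched, element 4 moves to location 6, and the padded array has dimension 96.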
As long as the strides are not too disparate, the choice of padding parameters that yields optimal performance for one generally is nearly optimal for the others. To give an idea of the potential impact of array padding, in timing tests on an Alpha 21264 processor (unfortunately not available to us for more than small tests; such hardware would have reduced the proof time by more than a factor of two), the runtime for an $F_{24}$-length transform was cut by nearly a factor of three through use of an optimal padding.

$^{8}$ Note that, although independently developed by the author of the program in question, this data movement strategy is not new: [20] describe a similar one, and such was also used by Colquitt and Welsh [19] in their program which discovered the Mersenne prime $M_{110503}$. However, such out-of-place transforms seem to have fallen into general disfavor, especially among users of small commodity microprocessors (as opposed to supercomputers). We do not claim that out-of-place is suitable for all or even most architectures, but have found it to be well-suited to systems which are not bandwidth-limited, i.e. where the speeds of the system bus and caches are well-matched to the capabilities of the underlying processor. The large-but-uniform strides strategy also benefits greatly on architectures with good prefetch capabilities.

In our test of $F_{24}$, the complex transform length was $N = 2^{19}$, and each squaring began with ordered input data, did a forward DIF transform using the set of radices $\{8, 16, 16, 16, 16\}$, a dyadic squaring of the (now bit-reversed) output data, followed by an inverse DIT transform using radices $\{16, 16, 16, 16, 8\}$ and a rounding-and-carry-propagation step. The combination of DIF and DIT transforms avoids the need for any explicit bit-reversal reordering, which, although not overwhelming, can consume 10-20% of the execution time in a single-FFT (i.e. DIF or DIT used for both the forward and inverse transform) implementation. Moreover, this time fraction tends to increase rather than decrease with $N$ (even though the relative arithmetic operation count decreases), as bit-reversal is difficult to do in a cache-friendly way, even with array padding. We used the right-angle-transform idea described in [11], because it eliminates the need for real-complex wrapper passes at the end of the forward and prior to the inverse transform, and thus makes the dyadic squaring step trivial, even in the presence of bit-reversed data.

The reason we reverse the order of the radices (which is necessary for coprime radices, but not for power-of-two transform lengths) in doing the inverse transform is related to one final optimization strategy permitted by the above-described data movement scheme. Whenever a radix-$R$ DIT pass is preceded by a radix-$R$ DIF pass (or vice versa) with few or no intervening operations, one can eschew the gather-scatter phases of the two passes (which do impose a cost in terms of loads and stores, and thus potential cache misses) and instead roll the two passes and the intervening operations into a single in-place pass through the data. (The reader should place a mirror either above the $A$-matrix or below the $B$-matrix in Fig. 1 for a quick visualization of this.) During each Pepin squaring, there are two opportunities for such a streamlining: the two passes (final DIF pass, initial DIT pass) surrounding the dyadic squaring, and the two (final DIT pass, initial DIF pass) surrounding the carry propagation step. For this to work, however, the radices of the two adjacent passes must be identical, hence our reversal of the order of radices during the inverse transform. By exploiting both of these optimization opportunities, the number of passes through the large data arrays required for each squaring was reduced from 22 to 14, and the per-squaring machine time on Wavefront machine 1 (a Silicon Graphics Octane workstation with a 250 MHz MIPS R10000 microprocessor) was reduced from 1.1 to 0.85 seconds, i.e. a reduction in overall runtime of nearly two months was achieved. We note that this timing was achieved using all high-level compiled (Fortran-90) code.$^{9}$

There is one more potential gain related to the hybrid floating-integer algorithm (7) described in §3 which bears mentioning, although our implementation thereof came too late to help with $F_{24}$. In an all-floating implementation, the above FFT strategy is easy to code, debug and optimize, but is out-of-place, which hurts its performance on many systems, especially those which are relatively bandwidth-limited or have poor data prefetch. If we now add a parallel modular transform (say over GF($M_{61}^2$)), it would seem that we are doubling the array storage and adding yet another out-of-place transform.
But we can actually combine the two out-of-place transforms in a way that allows us to do both the floating and modular transforms in an effectively in-place fashion, i.e. retaining the simple data access patterns we like so much, and requiring no more array storage (except for the additional modular twiddle factors) than the all-floating transform does. This is possible because floating and modular data occupy the same number of bytes in memory, and (at the risk of making software engineers feel rather squeamish) can be freely swapped between the two data arrays already defined for the floating transform. We then do the FFT just as before, but at the start of each squaring, put a bit-reversed copy of the residue vector in all-integer form into the second storage array. We then do a modular forward DIT FGT in parallel with the DIF FFT, on each pass exchanging data between the two arrays (i.e. on one pass, the $A$-array will begin with floating and end up holding modular data; on the next, the reverse will occur). During the dyadic squaring step, the floating data will be bit-reversed and the modular data ordered. Then comes an inverse DIF FGT in parallel with the DIT FFT, followed by an error-correction-and-carry-propagation step during which the modular data will be bit-reversed and the floating data ordered. This scheme does require us to reintroduce an explicit bit-reversal index array (but needs no actual data reorderings, an important distinction), but the benefit of doing the transforms in-place appears to greatly outweigh the small added cost of tracking bit-reversed modular data locations during the carry propagation step.
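The way a forward DIF transform, a dyadic squaring and an inverse DIT transform fit together without any bit-reversal reordering can be seen in a small Python sketch. This is a plain radix-2, in-place, unweighted cyclic version written for illustration only; it is not the higher-radix, padded, out-of-place scheme of Algorithm 1, and the function names are hypothetical.

```python
import cmath


def fft_dif(a, sign):
    """In-place radix-2 Gentleman-Sande (DIF) pass structure:
    ordered input, bit-reversed output."""
    n = len(a)
    half = n // 2
    while half >= 1:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(sign * 2j * cmath.pi * k / (2 * half))
                u, v = a[start + k], a[start + k + half]
                a[start + k] = u + v
                a[start + k + half] = (u - v) * w
        half //= 2


def fft_dit(a, sign):
    """In-place radix-2 Cooley-Tukey (DIT) pass structure:
    bit-reversed input, ordered output."""
    n = len(a)
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(sign * 2j * cmath.pi * k / (2 * half))
                u, v = a[start + k], a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
        half *= 2


def cyclic_square(x):
    """Cyclic autoconvolution of an integer list whose length is a power of two."""
    n = len(x)
    a = [complex(v) for v in x]
    fft_dif(a, -1)               # forward transform, output left bit-reversed
    a = [v * v for v in a]       # dyadic squaring does not care about ordering
    fft_dit(a, +1)               # inverse transform consumes bit-reversed input
    return [round(v.real / n) for v in a]
```

For the input `[1, 2, 3, 4]` the sketch returns `[26, 28, 26, 20]`, the cyclic autoconvolution, with no permutation step anywhere in the pipeline.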
$^{9}$ Interestingly, the initial version, compiled using the MIPSpro V6 F90 compiler, needed nearly 2 seconds per square, but when the very same source code was recompiled a month later with the V7 compiler, the runtime was cut in half. The initial suspicion was that the speedup was due to exploitation of the fused multiply/add instruction on the R10000, but examination of the assembly code generated by the two compilers revealed that the major difference was extensive use of data prefetch by the V7 compiler.


3.2 FFT Algorithm 2

The third author's (JSP's) floating-point program$^{10}$ also uses a discrete weighted transform to square numbers modulo $F_{24}$, and also uses the fully complex right-angle variant to keep the pointwise squaring simple and avoid complicated real-valued FFTs. The primary goal here is to use efficiently the limited amount of high-speed storage available to a modern microprocessor. Although both wavefronts use the same style of DWT to effect Fermat-mod squaring, FFT Algorithm 1 is fundamentally different from FFT Algorithm 2. The former emphasizes uniform but highly nonlocal data access, whereas the latter tries to load blocks of data into high-speed memory and keep them there as long as possible. Whereas Algorithm 1, in its final form, performs 14 passes through its data and requires an eight-megabyte array for scratch space, Algorithm 2 performs the convolution in-place, makes exactly three passes through the residue per squaring and needs little storage beyond the residue itself. The target machine was an aged UltraSPARC running in a busy daytime environment, and low memory use was mandatory.

$^{10}$ The source code for this implementation is available at

Algorithm 2 begins with Bailey's 4-step FFT [3, 8], simplified for in-place convolution. Let $x$ be a signal of size $N = n_1 n_2$, let $\omega$ be an $N$th complex root of unity, and let the DWT weights be complex $N$th roots of $-1$. In what follows, "DIF" and "DIT" have the same meanings as previously. If $X$ is an $n_1$-by-$n_2$ complex matrix consisting of the data in $x$ arranged in row-major order (as is customary for the C language, and the opposite of the Fortran convention), then the squaring algorithm is as follows:

1. Multiply $x_i$ by the $i$th DWT weight, $0 \le i < N$.
2. Perform a DIF FFT of size $n_1$ on each column of $X$.
3. Multiply $X_{ij}$ by $\omega^{Ij}$, $0 \le i < n_1$, $0 \le j < n_2$ ($I$ is the bit-reversal of $i$).
4. Perform a DIF FFT of size $n_2$ on each row of $X$.
5. Square each element of $X$.
6. Perform a DIT FFT of size $n_2$ on each row of $X$.
7. Multiply $X_{ij}$ by $\omega^{-Ij}$, $0 \le i < n_1$, $0 \le j < n_2$ ($I$ is the bit-reversal of $i$).
8. Perform a DIT FFT of size $n_1$ on each column of $X$.
9. Multiply $x_i$ by the $i$th conjugate DWT weight, $0 \le i < N$.
10. Scale $x$ by $1/N$ and propagate carries.

Since both DIF and DIT transforms are used, the data are processed in-place and in order. Even better, the matrix transposes that Bailey's algorithm needs are unnecessary, since they amount to reordering intermediate results. A modified "6-step" FFT squaring is also possible, and the six matrix transposes it would need can be reduced to two if the array $X$ starts off and ends up in column-major order, i.e. is pre-transposed.

The key to making this 4-step-like algorithm efficient is to collapse several operations into each pass through the array $X$. For example, one does not have to wait for every row to be transformed in step (4) to begin the squaring in step (5). A better approach would be to perform steps (4) through (6) together on one row of $X$ at a time, and to make the row size $n_2$ small enough for all the relevant data and precomputed constants to fit in fast memory. In this way data are loaded from slower memory only once, in the initial stages of step (4). Given that a load from main memory can stall a microprocessor for the equivalent of dozens or even hundreds of arithmetic operations, the more often such thinking is applied the better. In this way, the ten steps listed can collapse into four passes through the data: one pass for every three steps, and a final pass to propagate carries (it will be shown later how this last pass can be folded into one of the others).

However, memory performance may still be poor if we precompute too much information. Large tables of DWT weights and powers of $\omega$ (twiddle factors) may increase accuracy and reduce the operation count, but they also force four times as many data to stream through fast memory caches every squaring. Use of recurrences solves the memory problem, but naive recurrences (including the "stable" variety advocated in references such as [22]; one might think from the authors' wording that such a recurrence has rounding errors which magically do not grow, but such is definitely not the case!) build up rounding errors that may not be acceptable on machines which do not have "guard bits" in floating-point registers, like many RISC workstations. The solution adopted here uses another idea of Bailey's, again modified for microprocessors: instead of a single large array of size $N$, keep two small arrays. In the twiddle factor case, one array contains $\omega^m$, $0 \le m < r$, with $r$ a power of two, and the other array contains $\omega^{kr}$, $0 \le k < r$. Then to form $\omega^{ij}$ (for $ij < r^2$), the number $ij$ is represented as $kr + m$ (by masking and shifting), then $k$ and $m$ are used as lookups into their respective tables, and the two table entries are multiplied together. Although somewhat complicated, the storage needed by this scheme increases only as the square root of the problem size, and the tables needed are small enough to stay in cache most of the time. Since each result operand needs exactly one complex multiply of two precomputed roots of unity, the accuracy is also excellent, being only marginally less than that of a fully precomputed implementation and (unlike recurrences) suffering no deterioration with increasing runlength. The savings can be surprising: for $F_{24}$, the memory bandwidth saved speeds up the entire squaring by 15% on the target machine.
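The two-table twiddle lookup can be sketched in a few lines of Python. The sign convention $\omega = e^{-2\pi i/N}$, the table names and the function names are assumptions made for this illustration; the text above does not fix them.

```python
import cmath


def build_tables(n: int, r: int):
    """Return two length-r tables for omega = exp(-2*pi*i/n):
    low[m] = omega**m and high[k] = omega**(k*r), for 0 <= m, k < r."""
    low = [cmath.exp(-2j * cmath.pi * m / n) for m in range(r)]
    high = [cmath.exp(-2j * cmath.pi * (k * r) / n) for k in range(r)]
    return low, high


def twiddle(e: int, low, high, r_bits: int) -> complex:
    """Form omega**e for 0 <= e < r*r with one complex multiply.

    e is split as e = k*r + m by shifting and masking (r = 2**r_bits).
    """
    k = e >> r_bits
    m = e & ((1 << r_bits) - 1)
    return high[k] * low[m]
```

With `n = 1024` and `r = 32`, the two tables hold 64 entries in total rather than 1024, and every power of $\omega$ costs a single multiply of two directly computed roots, so no error accumulates along a recurrence.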
This table scheme also illustrates a recurring theme with this algorithm, namely that doing more arithmetic in the name of improving memory use is a bargain when runlengths become large.

Memory efficiency also drives the choice of $n_1$ and $n_2$. In a supercomputer environment, a 4-step FFT benefits the most from $X$ being square or nearly so, i.e. from $n_1$ being large. On workstations the opposite is true: desktop machines maintain a translation lookaside buffer (TLB) that sharply limits the number of widely separated memory locations that can be quickly accessed. Since this is a C language implementation, data in the same column of $X$ are far apart in physical memory, and so to avoid yet more memory stalls the number of rows $n_1$ must be kept quite small. Another (standard) optimization is to put "padding" after each row of $X$, to break up power-of-two memory strides during the column operations.

Finally, there is the matter of combining the carry propagation with the other passes through the data. This is actually rather simple: instead of releasing one carry at a time as intuition would suggest, it is better to keep an array of $n_1$ separate carries, one for each row in $X$. This array can then sweep through $X$ on the heels of the computations in steps (7)-(9) above, while the relevant data are still in cache. However, there may still be unpropagated carries after a single sweep through $X$; thus a second, much shorter, carry propagation step is necessary, with each of the $n_1$ carries "wrapping around" to the next row in $X$. (We note that FFT Algorithm 1 similarly does a parallel carry step.) The 5-10% overall speedup on the target machine justifies the extra complexity, as even a small gain like this translates to weeks of machine work saved. After all of these considerations, the component FFTs themselves still have to be performed.
$F_{24}$-mod squaring requires an FFT multiply of size $2^{19}$; wavefront 2 uses a single radix-32 pass to break the problem into chunks somewhat smaller than the UltraSPARC's external cache, then tackles each chunk in turn. Wavefront 2 was a 167-MHz Sun UltraSPARC-1. The chief advantage of such a machine is that the large (512 kB) off-chip memory cache is non-blocking, i.e. loads and stores may take a while but will not actually stall the CPU unless computations require those data. On the other hand, the Ultra-1 has a slow clock by modern standards and does not have a memory prefetch instruction. Worse, the Gnu gcc and native Sun C compilers ignore half of the 32 floating-point registers and will only schedule code for the 16 kB on-chip cache (if at all!).$^{11}$ These limitations meant that none of the time-critical portions of the code could be entrusted to a C compiler, and the final version includes large amounts of hand-written assembly code. The program used the FFTW library [15] for the first half of the Pepin test, switching finally to hand-coded FFTs which overcame the aforementioned limitations in the compilers available for the UltraSPARC. All of the row and column operations in Algorithm 2 use radix-4 FFTs internally; this offers a good mix of adds with multiplies (which can execute concurrently on this machine), and uses comparatively few registers. The registers freed up by such a low radix allow all the data for the next FFT butterfly to be fetched from non-blocking cache while the present butterfly is being processed, so that FFTs much larger than the on-chip cache are dispatched at full CPU speed. With FFTW the program managed a square in 1.1 seconds, and replacing FFTW with highly optimized assembly code brought that down to 0.885 seconds per square.$^{12}$
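The role of the DWT weights in the ten-step recipe of Algorithm 2 can be illustrated with a short Python sketch that squares a number modulo a small Fermat number. Several simplifications are assumptions of this sketch and not features of the actual program: a plain recursive FFT stands in for the row-column DIF/DIT decomposition, real digits are fed to a full-length complex transform rather than the right-angle variant, and Python big integers perform the carry propagation.

```python
import cmath


def fft(a, sign):
    """Plain recursive radix-2 FFT (ordered in, ordered out); len(a) a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], sign)
    odd = fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def square_mod_fermat_dwt(value: int, n_exp: int, digit_bits: int) -> int:
    """Square value modulo F = 2^(2^n_exp) + 1 with a weighted floating-point FFT.

    Assumes 0 <= value < 2^(2^n_exp) and digits small enough for exact rounding.
    """
    total_bits = 1 << n_exp
    n = total_bits // digit_bits
    mask = (1 << digit_bits) - 1
    digits = [(value >> (digit_bits * i)) & mask for i in range(n)]
    # Weights a**i with a**n = -1 turn the cyclic convolution into a negacyclic one.
    weights = [cmath.exp(1j * cmath.pi * i / n) for i in range(n)]
    spectrum = fft([d * w for d, w in zip(digits, weights)], -1)   # weight, forward FFT
    spectrum = [s * s for s in spectrum]                            # dyadic squaring
    raw = fft(spectrum, +1)                                         # inverse FFT
    conv = [round((v * w.conjugate()).real / n)                     # unweight, scale, round
            for v, w in zip(raw, weights)]
    total = sum(c << (digit_bits * k) for k, c in enumerate(conv))  # carries via big ints
    return total % ((1 << total_bits) + 1)
```

The weighting is what allows a cyclic transform to deliver the negacyclic result needed for reduction modulo $2^{2^n} + 1$, with no zero-padding of the digit vector.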

4 "Wavefront" generation
In our proof of compositeness we used a method pioneered by C. Norrie [10] on $F_{22}$. Elegant in its simplicity, this method is to use fast machines to deposit, at convenient intervals, the Pepin residues $3^{2^k} \bmod F_{24}$. Having these in storage for various $k$, one may check the deterministic link between, say, the $k_1$-th square and the $k_2$-th square, the point being that this checking can be done on a relatively slow (for whatever reason) machine. Thus, a fast "wavefront" machine is expected to deposit Pepin residues at the highest possible rate, with a host of slower machines, or "drones," acting in parallel fashion to test each link in the resulting Pepin chain. We chose to divide the labors cleanly into: two wavefront machines running the independently developed floating-point DWT squarers described in the preceding section and squaring at similar speeds (thus making fairly frequent cross-checks convenient to do), and a set of drones each of which ran an integer squarer via Nussbaumer convolution, each drone beginning with a selected squaring residue (call it the $a$-th square) previously deposited by one of the wavefront machines, then performing a total of $(b - a)$ squarings modulo $F_{24}$, eventually comparing the resulting residue with a previously deposited $b$-th square.$^{13}$

Incidentally, we cannot resist here an anecdote which will likely be horrifying to computational number theorists, but which fortunately has a happy ending. The very first complete Pepin test of $F_{24}$ was actually carried out by one of us (EWM) in 1998-99, with a residue that ultimately was shown erroneous by both an alternative floating-point run and an integer-convolution run. In fact this faulty run was a primary motive for continued optimization and deployment of our pure-integer algorithm. This first run was begun in June 1998 and finished in early February 1999, and (not surprisingly) indicated that $F_{24}$ was composite.
The reason a second run was not performed in parallel is that this author had access to one fast dedicated machine, but not to another which could even come close to keeping up with the first. There was reason to expect that a second fast engine might become available during the course of the first test, but this did not come to pass, and so a second run (using an improved version of the code) on the same machine was planned.
$^{11}$ Version 5.0 of Sun's compiler does use all 32 registers, but this version was not available to us.

$^{12}$ There seems to be a widely held belief that modern RISC processors are too complicated for humans to program directly anymore, and that only a compiler can hope to produce fast code. Neither of these is true: certain routines ran over five times faster after translation to assembly language, and the radix-4 FFT used as a building block in the Fermat-mod squaring ended up 30% faster than FFTW's more complex radix-16 and radix-32 transforms.

$^{13}$ The source code for this implementation is available at


During the initial run, the bottom 64 residue bits were saved every 2000th squaring, and when the other two authors of the present joint paper (whose own floating-point run was then about one-eighth of the way to completion) got the rather surprising news that someone unknown to them had completed an initial test of $F_{24}$, they naturally wanted to do some cross-checking against their early residues. This cross-checking turned up a discrepancy around the 150000-th squaring, which was eventually traced to a faulty residue file that had been used to restart the initial run following a power outage. The precise source of the error was a flawed hardware conversion of eight-byte floating residue data to the 2-byte signed integer form used for the save file. A previous test run of $F_{22}$ had been deliberately interrupted and restarted several times, yielding a correct final result, but that run had used default 4-byte signed integers for the save file data. During preparations for the assault on $F_{24}$, the 4-byte save format was changed to 2-byte almost as an afterthought, in order to halve the size of the (then) four-megabyte save files. The resulting code passed all the 1000-squaring self-tests used for program validation but, in a crucial oversight, was not subjected to a validation test involving a restart-from-interrupt. The save files also lacked the simple expedient of a checksum, and the kind of error that occurred (an incorrect 2-byte integer resulting from conversion of an 8-byte float) was not detectable by the extensive round-off-error-related consistency checking built into the program.

In any event, as soon as it became clear that the first result was undoubtedly erroneous, it was decided to join forces: two floating runs using the separate codes would execute independently, but with frequent cross-checking of residues mod $2^{64}$, and with all-integer verification hopefully keeping pace after sufficient hardware was assembled for that purpose.
A second run by EWM using a revised version of his code was launched in late February, and caught up with JSP's test in early summer 1999. Further improvements by JSP to his code meant that, as of late June 1999, the two floating runs were dispatching squarings at roughly the same rate, and in fact they finished within days of each other, on 27 and 31 August 1999, respectively, with exactly matching final residues. The third and final link, the integer verification, was completed about a month after the second floating run.
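The wavefront-and-drone division of labor described at the start of this section can be sketched in Python on a Fermat number small enough to run instantly. The function names, the checkpoint dictionary and the checkpoint interval are illustrative assumptions; the real runs used the floating-point and Nussbaumer squarers described earlier, not Python's built-in arithmetic.

```python
def pepin_wavefront(n: int, interval: int):
    """Run the Pepin squaring chain for F_n = 2^(2^n) + 1, depositing the
    residue 3^(2^k) mod F_n every `interval` squarings (and at the end)."""
    f = (1 << (1 << n)) + 1
    total = (1 << n) - 1          # 3^((F_n - 1)/2) needs 2^n - 1 squarings of 3
    residue = 3
    checkpoints = {0: residue}
    for k in range(1, total + 1):
        residue = residue * residue % f
        if k % interval == 0 or k == total:
            checkpoints[k] = residue
    return f, checkpoints


def drone_verify(f: int, checkpoints: dict, a: int, b: int) -> bool:
    """Check one link of the chain: start from the deposited a-th square,
    perform b - a squarings, and compare with the deposited b-th square."""
    r = checkpoints[a]
    for _ in range(b - a):
        r = r * r % f
    return r == checkpoints[b]


def is_composite_by_pepin(f: int, checkpoints: dict) -> bool:
    """F_n is prime exactly when the final Pepin residue equals -1 mod F_n."""
    return checkpoints[max(checkpoints)] != f - 1
```

Each link can be checked by a different machine in any order, which is what lets many slow drones keep pace with one fast wavefront; a single corrupted deposit shows up as a failed link on either side of it.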

5 Conclusions, and comments on the future

We hereby report that $F_{24}$ is composite, the Selfridge-Hurwitz residues for $n = 24$ being, in decimal:

$$ R_n \equiv \begin{cases} 32898231088 \pmod{2^{35} - 1}, \\ 60450279532 \pmod{2^{36}}, \\ 68627742455 \pmod{2^{36} - 1}. \end{cases} $$

Furthermore, the lower 64-bit word of the final residue is, in hexadecimal and decimal respectively:

$$ R_n \bmod 2^{64} = \mathrm{17311B7E131E106C}_{16} = 1671147165031731308_{10}. $$
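The reported values can be cross-checked against one another with a few lines of Python: the low 36 bits of the 64-bit word must reproduce the residue modulo $2^{36}$, and the hexadecimal and decimal forms must agree. The helper name `selfridge_hurwitz` is an assumption of this sketch; the other two Selfridge-Hurwitz residues cannot be checked this way, since they depend on the full residue.

```python
def selfridge_hurwitz(residue: int):
    """Return the three Selfridge-Hurwitz residues of a full Pepin residue:
    modulo 2^36, modulo 2^36 - 1 and modulo 2^35 - 1."""
    return (residue % 2**36, residue % (2**36 - 1), residue % (2**35 - 1))


R_LOW64 = 0x17311B7E131E106C   # reported R_24 mod 2^64, hexadecimal form
R_LOW64_DECIMAL = 1671147165031731308
R_MOD_2_36 = 60450279532       # reported R_24 mod 2^36


def reported_values_consistent() -> bool:
    """True when the hex and decimal forms agree and the low 36 bits of the
    64-bit word match the reported residue modulo 2^36."""
    return R_LOW64 == R_LOW64_DECIMAL and R_LOW64 % 2**36 == R_MOD_2_36
```

Anyone holding a full final residue from an independent run could pass it to `selfridge_hurwitz` and compare all three values with those reported above.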

The total elapsed time for the floating-point wavefront machines was about 200 days (either implementation), while, as expected, about 5 to 10 drone machines running the integer autonegacyclic convolver kept up essentially with the wavefront speed and, as we imply, established integrity for every link in the Pepin chain. We submit that there can be no doubt that F24 is composite. Thus F24 is now atop the list of genuine composites. What about the next Fermat number of unknown character? At the time of this writing that would be F31. It is fair to say that the only hope for resolving the character of F31 in the near term would be sieving with respect to the form (1) (and of course that has been tried roughly up to 20 decimal digits for possible factors), or perhaps some limited factoring via the $p - 1$ method [24] which, even in a highly memory-optimized implementation, would need a machine with a quite large physical memory. The fact is, a single FFT-based convolution modulo F31 would, on a state-of-the-art workstation, consume around two CPU-minutes, so that the complete Pepin test on our current brands of machinery would require some 10000 years, taking us well into the next ice age. But we have already remarked on the effective rate of resolution for the Fn, and noted that one cannot estimate such achievements based on current resources. On the basis of vague heuristics, but more in keeping with the way that computational history has actually evolved, one might expect F31 to be resolved within the next two decades, or at worst, well before the year 2100 A.D., on the good grace, if you will, of future algorists and new technology.
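The order of magnitude of the F31 estimate can be reproduced with a few lines of arithmetic. The Pepin test evaluates $3^{(F_n-1)/2} \bmod F_n$, which amounts to $2^n - 1$ successive modular squarings. The sketch below multiplies that count by the quoted cost per squaring. The function name and the 365.25-day year are assumptions, and the per-squaring cost is simply the "around two CPU-minutes" figure from the text.

```python
def pepin_years(n, minutes_per_squaring):
    """Rough wall-clock estimate for a Pepin test of F_n, in years."""
    squarings = 2**n - 1                  # (F_n - 1)/2 = 2^(2^n - 1)
    minutes_per_year = 60 * 24 * 365.25   # assumed year length
    return squarings * minutes_per_squaring / minutes_per_year

# F31 at roughly two CPU-minutes per squaring, as quoted in the text.
estimate_f31 = pepin_years(31, 2.0)
print(f"F31: about {estimate_f31:.0f} years")
```

Under these assumptions the result is a little over 8000 years, in line with the "some 10000 years" quoted above.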

6 Acknowledgements
We heartily thank J. Buhler (Mathematical Sciences Research Institute, Berkeley) for his original (1990) implementation of Nussbaumer convolution; P. Montgomery (CWI, Amsterdam), H. Lenstra (University of California, Berkeley), J. Selfridge (Northern Illinois University), and C. Pomerance (Bell Laboratories) for their theoretical and heuristic insights; J. Klivington (Apple Computer) for his fast Apple G4 convolution; and G. Woltman for his engineering expertise. We are grateful to R. Knapp, P. Wellin, and S. Wolfram (Wolfram Research, Inc.) for their algorithm support of various kinds. We thank C. Curry (University of Southern Mississippi), K. Dilcher and R. Milson (Dalhousie University), and especially A. Kruppa and the staff of the Infohalle der Fakultät für Informatik an der Technischen Universität München for their tireless integer convolution runs. EWM thanks J. Alexander of Case Western Reserve University for generous access to the SGI Octane used for wavefront 1. JSP thanks the management at 3S Group Incorporated for the extended use of the UltraSPARC fileserver used for wavefront 2. We also acknowledge a grant from the Number Theory Foundation in the later stages of our work, which significantly enhanced the all-integer effort. We are indebted to Apple Computer for resources pertinent to the new G4 processor. A Reed College team of staff and students, N. Essy, B. Hanson, C. Chen, J. Dodson, R. Richter, W. Cooley, J. Heilman, D. Turner and (from the University of Georgia) C. Gunning, finished in glorious and selfless fashion the last stages of the all-integer proof.

References

[1] R. C. Agarwal and J. W. Cooley, "Fourier Transform and Convolution Subroutines for the IBM 3090 Vector Facility," IBM Journal of Research and Development 30 (1986), 45-162.
[2] M. Ashworth and A. G. Lyne, "A Segmented FFT Algorithm for Vector Computers," Parallel Computing 6 (1988), 217-224.
[3] D. Bailey, "FFTs in External or Hierarchical Memory," manuscript (1989).
[4] D. Bernstein, "Multidigit multiplication for mathematicians," manuscript.
[5] J. Buhler, R. Crandall, R. Ernvall, T. Metsänkylä, "Irregular Primes to Four Million," Math. Comp. 61 (1993), 151-153.
[6] J. Buhler, R. Crandall, R. W. Sompolski, "Irregular Primes to One Million," Math. Comp. 59 (1992), 717-722.
[7] J. Buhler, R. Crandall, R. Ernvall, T. Metsänkylä, A. Shokrollahi, "Irregular Primes and Cyclotomic Invariants to Eight Million," manuscript (1996).
[8] C. Burrus, DFT/FFT and Convolution Algorithms: Theory and Implementation, Wiley, New York, 1985.
[9] R. E. Crandall, Topics in Advanced Scientific Computation, Springer, New York, 1996.
[10] R. Crandall, J. Doenias, C. Norrie, and J. Young, "The Twenty-Second Fermat Number is Composite," Math. Comp. 64 (1995), 863-868.
[11] R. Crandall and B. Fagin, "Discrete Weighted Transforms and Large-Integer Arithmetic," Math. Comp. 62 (1994), 305-324.
[12] R. Crandall and C. Pomerance, Prime Numbers: A Computational Perspective, Springer, New York, manuscript (1999).
[13] R. Crandall, "Integer convolution via split-radix fast Galois transform," manuscript (Feb 1999).
[14] W. Feller, An Introduction to Probability Theory and Its Applications, Wiley, New York, 1971.
[15] M. Frigo and S. Johnson, "The fastest Fourier transform in the west," tw.
[16] W. Keller, Fermat-number website data: http://vamri.xray.ufl.edu/proths/fermat.html.
[17] W. Keller, private communication (1999).
[18] H. Lenstra, private communication.
[19] L. Welsh, private communication.
[20] J. H. McClellan and C. M. Rader, Number Theory in Digital Signal Processing, Prentice-Hall, 1979.
[21] N. Craig-Wood, private communication (1999).
[22] W. H. Press et al., Numerical Recipes in FORTRAN: The Art of Scientific Computation, 2nd ed., Cambridge Univ. Press, 1992.
[23] H. J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms, 2nd ed., Volume 2 of Springer Series in Information Sciences, Springer, New York, 1982.
[24] H. Riesel, Prime Numbers and Computer Methods for Factorization, Birkhäuser, Boston, 1994.
[25] A. Schönhage, "Schnelle Multiplikation grosser Zahlen," Computing 7 (1971), 282-292.
[26] J. L. Selfridge and A. Hurwitz, "Fermat numbers and Mersenne numbers," Math. Comp. 18 (1964), 146-148.
[27] V. Trevisan and J. B. Carvalho, "The composite character of the twenty-second Fermat number," J. Supercomputing 9 (1994), 179-182.
[28] G. Woltman, private communication (1999).
[29] J. Young and D. Buell, "The Twentieth Fermat Number is Composite," Math. Comp. 50 (1988), 261-263.

Figure 1: Data movement during a radix-R pass of FFT algorithm 1. For a decimation-in-frequency transform, data move from A to B; for a decimation-in-time transform, data move from B to A. N is the complex vector length (including padding elements, if array padding is used).