Abstract
We present MuMax, a general-purpose micromagnetic simulation tool running on Graphical Processing Units (GPUs). MuMax is designed for high-performance computations and specifically targets large simulations. In that case speedups of over a factor of 100× can easily be obtained compared to the CPU-based OOMMF program developed at NIST. MuMax aims to be general and broadly applicable. It solves the classical Landau-Lifshitz equation taking into account the magnetostatic, exchange and anisotropy interactions, thermal effects and spin-transfer torque. Periodic boundary conditions can optionally be imposed. A spatial discretization using finite differences in 2 or 3 dimensions can be employed. MuMax is publicly available as open source software. It can thus be freely used and extended by the community. Due to its high computational performance, MuMax should open up the possibility of running extensive simulations that would be nearly inaccessible with typical CPU-based simulators.
Keywords: micromagnetism, simulation, GPU
PACS: 75.78.Cd, 02.70.Bf
[Figure 1 plot: <Mx> and <My> (in units of Ms) versus d/lex.]
Figure 1: Standard problem #2. Remanent magnetization along the x-axis (left axis) and along the y-axis (right axis) as a function of the scaling parameter d. The full line represents the simulation results from MuMax, while the circles represent simulation points obtained from McMichael et al. [17] and from Donahue et al. [18].
for the first time, for field 1 (µ0Hx = -24.6 mT, µ0Hy = 4.3 mT, µ0Hz = 0.0 mT) and field 2 (µ0Hx = -35.5 mT, µ0Hy = -6.3 mT, µ0Hz = 0.0 mT). A discretization using 128 × 32 × 1 FD cells and the RK23 time stepping scheme with a time step around 600 fs (dynamically adapted during the simulation) was used. The relatively large time steps used to solve this standard problem demonstrate that MuMax incorporates robust time stepping schemes with accurate adaptive step mechanisms. The adaptive step algorithms ensure optimal time step lengths and thus reduce the number of field evaluations, speeding up the simulation. Here, a total time of only 2.5 seconds was needed to finish this simulation on the GPU, compared to 16 seconds on the CPU. In this case the speedup on the GPU is limited due to the small number of FD cells. When the simulation is repeated with a finer discretization of 256 × 64 × 2 cells, on the other hand, the GPU speedup becomes more pronounced: the simulation finishes in 46 seconds on the GPU but takes 20 minutes 32 seconds on the CPU.
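The adaptive mechanism referred to here can be illustrated with a short sketch. The following Python fragment shows one embedded RK23 (Bogacki-Shampine) step with error-based step-size control, applied to a single macrospin obeying the Landau-Lifshitz equation. This is our own minimal illustration, not MuMax's implementation; the function names (rk23_step, ll_macrospin), the tolerance, and all parameter values are chosen purely for demonstration.

    import numpy as np

    def rk23_step(f, t, y, h, tol=1e-5):
        """One Bogacki-Shampine RK23 step. Returns the 3rd-order solution,
        a suggested next step size, and whether the embedded 2nd-order
        error estimate met the tolerance."""
        k1 = f(t, y)
        k2 = f(t + 0.5 * h, y + 0.5 * h * k1)
        k3 = f(t + 0.75 * h, y + 0.75 * h * k2)
        y3 = y + h * (2.0 * k1 + 3.0 * k2 + 4.0 * k3) / 9.0        # 3rd order
        k4 = f(t + h, y3)
        y2 = y + h * (7.0 * k1 / 24 + k2 / 4 + k3 / 3 + k4 / 8)    # 2nd order
        err = np.max(np.abs(y3 - y2))                  # local error estimate
        # Step-size controller: aim the local error at the tolerance.
        h_next = h * min(2.0, max(0.1, 0.9 * (tol / max(err, 1e-30)) ** (1/3)))
        return y3, h_next, err <= tol

    def ll_macrospin(t, m, h_eff=np.array([0.0, 0.0, 1.0]), alpha=0.1):
        """Landau-Lifshitz right-hand side for a single macrospin
        (dimensionless units, gamma = 1)."""
        mxh = np.cross(m, h_eff)
        return -mxh - alpha * np.cross(m, mxh)

    t, h, t_end = 0.0, 1e-3, 50.0
    m = np.array([1.0, 0.0, 0.0])
    while t < t_end:
        h_try = min(h, t_end - t)
        m_new, h, accepted = rk23_step(ll_macrospin, t, m, h_try)
        if accepted:                    # rejected steps are retried smaller
            t += h_try
            m = m_new / np.linalg.norm(m_new)   # keep |m| = 1

The controller grows the step when the estimated error is well below the tolerance and shrinks it otherwise; this is the generic mechanism that permits the comparatively large time steps of around 600 fs quoted above.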
From Figs. 2 and 3 it is clear that the results obtained with MuMax are well within the spread of the curves obtained by other authors. Also the magnetization plots are in close agreement with those available at the µMag website [16].

[Figure 3 plot: Mx, My, Mz versus time (ns).]

Figure 3: (top) Time evolution of the average magnetization during the reversal considered in µMag standard problem #4, field 2. This field was chosen to cause a bifurcation point to make the different solutions diverge. (bottom) Magnetization configuration when <Mx> crosses the zero magnetization point for the first time.
[Figure 4 plot: <Mx> and <My> (10⁵ A/m) versus time (ns), 0 to 8 ns.]
Figure 4: Time evolution of the average in-plane magnetization during the first 8 ns of the spin-transfer torque standard problem. To facilitate the visual comparison of our results with [19], the average magnetization is expressed in A/m and the same axes ratios are chosen.

Figure 5: Magnetization configuration at t = 0.73 ns as found during the simulation of a standard problem incorporating spin-transfer torque [19]. The vortex core evolves towards a new equilibrium state under the influence of a spin-polarized dc current along the horizontal direction. This figure is rendered with the built-in graphics features present in MuMax.
4.3. Spin-transfer torque standard problem [19]
µMag does not propose any standard problems that include spin-transfer torque. Therefore we rely on a standard problem proposed by M. Najafi et al. [19] to check the validity of the spin-transfer torque description implemented in MuMax. The standard problem considers a permalloy sample (A = 1.3 × 10⁻¹¹ J/m, Ms = 8.0 × 10⁵ Am⁻¹) with dimensions 100 nm × 100 nm × 10 nm. The initial equilibrium magnetization state is a predefined vortex, positioned in the center of the sample and relaxed without any spin-transfer torque interaction (α = 1.0). Once relaxed, a homogeneous spin-polarized dc current j = 10¹² Am⁻² along the x-axis is applied to the sample. Now α is 0.1 and the degree of non-adiabaticity ξ is 0.05; see expression (7). Under these circumstances, the vortex center moves towards a new equilibrium position. The time evolution of the average in-plane magnetization and the magnetization configuration at t = 0.73 ns are shown in Fig. 4 and Fig. 5, respectively. The results are in good agreement with those presented in reference [19]. With a discretization of 128 × 128 FD cells, 10 minutes of simulation time were needed to obtain the presented data.
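Expression (7) is referenced but not reproduced in this excerpt. As a hedged illustration of the two ingredients named here, the sketch below evaluates the widely used Zhang-Li/Thiaville torque, which adds -(u·∇)m + ξ m×[(u·∇)m] to the equation of motion, on a 2D finite-difference grid. The function names, the grid layout, and the conversion from current density j to spin-drift velocity u follow one common convention and are our assumptions, not necessarily MuMax's.

    import numpy as np

    MU_B = 9.274e-24      # Bohr magneton (J/T)
    E_CHARGE = 1.602e-19  # elementary charge (C)

    def spin_drift_velocity(j, Ms, P=1.0):
        """Spin-drift velocity u (m/s) for current density j (A/m^2).
        One common convention, u = P * mu_B * j / (e * Ms); prefactors
        differ between papers. For j = 1e12, Ms = 8.0e5: u ~ 72 m/s."""
        return P * MU_B * j / (E_CHARGE * Ms)

    def zhang_li_torque(m, u, cell, xi=0.05):
        """Adiabatic plus non-adiabatic spin-transfer torque terms,
        -(u.grad)m + xi * m x [(u.grad)m], as added to dm/dt in the
        Gilbert form of the equation of motion. m has shape (3, nx, ny);
        u = (ux, uy) in m/s; cell is the FD cell size in m."""
        dm_dx = np.gradient(m, cell, axis=1)   # central differences
        dm_dy = np.gradient(m, cell, axis=2)
        ugm = u[0] * dm_dx + u[1] * dm_dy      # (u . grad) m
        return -ugm + xi * np.cross(m, ugm, axis=0)

For a uniform current along x, as in the problem above, u = (spin_drift_velocity(1e12, 8.0e5), 0.0); the torque vanishes wherever the magnetization is uniform and is largest near the vortex core, which is why the core is what moves.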
5. Performance

The performance of MuMax on the CPU is roughly comparable to OOMMF. The CPU performance is thus good, but our main focus is optimizing the GPU code. Special attention went to fully exploiting the numerical power of the GPU while focussing on the time- and memory-efficient simulation of large micromagnetic problems. Figure 6 shows the time required to take one time step with the Euler method (i.e. effective field evaluation, evaluation of the LL-equation and magnetization update) on CPU (1 core) and on GPU for 2D and 3D simulations.

In both the 2D and 3D case, speedups of up to two orders of magnitude are obtained for large dimensions. For smaller geometries, the speedups decrease but remain significant. This can be understood by the fact that in these simulations not enough FD cells are considered to have all O(10⁴) available threads at work at the same time. Hence, the computational power is not fully exploited. Furthermore, Fig. 6 shows that neither the CPU performance nor the GPU performance follows a smooth curve. This is a consequence of the FFTs, which are most efficient for powers of two, possibly multiplied with one small prime (in the benchmarks shown in Fig. 6, the default rounding to these optimal sizes is not performed). In the GPU implementation this effect is even more pronounced than in the CPU implementation. E.g., the 2D simulation with dimensions 992 × 992 is five times slower than the 2D simulation with dimensions 1024 × 1024. This shows that not only for accuracy reasons, but also for time efficiency reasons, it is most advantageous to restrict the simulation domain to the optimal dimensions defined by 2ⁿ × {1, 3, 5 or 7}. Because of the extreme impact on the performance of MuMax, we opted to rescale the size of the FD cells by default such that the dimensions are rounded off to one of these optimal sizes.

Due to the typical architecture of GPUs and the nature of the FFT algorithm, simulations of geometries with power-of-two sizes run extremely fast on the GPU. Table 1 gives an overview of the speedups for these sizes for the 2D and 3D case, and Fig. 7 shows the speedup obtained for these power-of-two sizes by MuMax compared to the OOMMF code. Both comparisons result in speedups larger than a factor of 100. This means that simulations that used to take several hours can now be performed in minutes. The comparison between the speedups shown in Table 1 and Fig. 7 further shows that our CPU implementation indeed has an efficiency comparable to OOMMF. The immense speedups evidence that MuMax can indeed open completely new research opportunities in micromagnetic modelling.
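The rounding rule is easy to state explicitly. The sketch below is our own illustration of it, not MuMax's code: it rounds a grid dimension up to the smallest size of the form 2ⁿ × {1, 3, 5, 7} (MuMax itself rescales the FD cell size so that the chosen dimensions land on such values, and may round to the nearest rather than up).

    def next_fft_friendly(n):
        """Smallest integer >= n of the form 2**k * f with f in
        {1, 3, 5, 7}, the FFT-optimal sizes identified in the text."""
        candidates = []
        for f in (1, 3, 5, 7):
            size = f
            while size < n:
                size *= 2          # keep doubling until we reach n
            candidates.append(size)
        return min(candidates)

    # Reproduces the example from the text: 992 rounds up to 1024.
    assert next_fft_friendly(992) == 1024
    assert next_fft_friendly(100) == 112   # 7 * 16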
[Figure 6 plots: time per time step (s) versus N, for CPU and GPU, 2D and 3D geometries, N = 16 to 2048. Figure 7 plot: speedup (20× to 140×) versus N.]
Figure 7: Speedup obtained with MuMax running on a GTX580 GPU compared to OOMMF on a 2.8 GHz Core i7-930 CPU. The 2D and 3D geometries have sizes N×N and N×N×N, respectively. The lowest speedup, for the 16 × 16 × 16 case (an unusually small simulation), is still a factor of 4.
[19] M. Najafi, …, and G. Meier. Proposal for a standard problem for micromagnetic simulations including spin-transfer torque. Journal of Applied Physics, 105(11), 2009.