
AUTHOR QUERY FORM

Journal: J. Chem. Phys. Please provide your responses and any corrections by annotating this PDF and uploading it to AIP's eProof website as detailed in the Welcome email.
Article Number: 021417JCP
Dear Author,
Below are the queries associated with your article. Please answer all of these queries before sending the proof back to AIP.
Article checklist: In order to ensure greater accuracy, please check the following and make all necessary corrections before
returning your proof.
1. Is the title of your article accurate and spelled correctly?
2. Are the author names in the proper order and spelled correctly?
3. Please check affiliations including spelling, completeness, and correct linking to authors.
4. Did you remember to include acknowledgment of funding, if required, and is it accurate?
Location in article / Query / Remark: click on the Q link to navigate to the appropriate spot in the proof. There, insert your comments as a PDF annotation.
Q1 AU: In the sentence beginning "Next Section recalls the", please verify that "Next section" and the following one refer to Sec. II and Sec. III, respectively.
Q2 AU: In the sentence beginning "The section compares the", please verify that "The section" refers to Sec. IV.
Q3 AU: In the sentence beginning "In order to test the performance", please verify that "previous section" refers to Sec. III.
Q4 AU: In Refs. 1 and 67, please provide a brief description of the information available at the website. For example, See XX for
information about XXX.
Q5 AU: Please verify the changes made in the journal name in Ref. 3, 4, and 57.
Q6 AU: Please provide publishers name in Ref. 5.
Q7 AU: Please verify the changes made in the page no. in Ref. 7a.
Q8 AU: Please verify the changes made in year in Ref. 7f.
Q9 AU: Please verify the changes made in year in Ref. 7g
Q10 AU: Please provide volume number of series in Ref. 13.
Q11 AU: Please provide book title in Refs. 18 and 31.
Q12 AU: Please check volume number and page range in Ref. 24, as we have inserted the required information.
Q13 AU: Please verify the changes made in year in Ref. 24.
Q14 AU: Refs. 7g and 32b contain identical information. Please check and provide the correct reference or delete the duplicate
reference. If the duplicate is deleted, renumber the reference list as needed and update all citations in the text.
Q15 AU: Please provide a digital object identifier (doi) for Ref(s) 43c, 43d, and 44. For additional information on dois please select this link: http://www.doi.org/. If a doi is not available, no other information is needed from you.
Q16 AU: Please verify the changes made in the page no. in Ref. 52.
Thank you for your assistance.
OK
OK
OK
done
OK
done
OK
OK
OK
OK
OK
done
done
Just remove Ref. 7g; it is not necessary to renumber the reference list.
The changes are not correct and the correct page is reported.
THE JOURNAL OF CHEMICAL PHYSICS 140, 000000 (2014)

Graphics processing units accelerated semiclassical initial value representation molecular dynamics

Dario Tamascelli,1 Francesco Saverio Dambrosio,1 Riccardo Conte,2 and Michele Ceotto3,a)
1Dipartimento di Fisica, Università degli Studi di Milano, via Celoria 16, 20133 Milano, Italy
2Department of Chemistry and Cherry L. Emerson Center for Scientific Computation, Emory University, Atlanta, Georgia 30322, USA
3Dipartimento di Chimica, Università degli Studi di Milano, via Golgi 19, 20133 Milano, Italy
a)Electronic mail: michele.ceotto@unimi.it

(Received 11 February 2014; accepted 14 April 2014; published online XX XX XXXX)

This paper presents a Graphics Processing Units (GPUs) implementation of the Semiclassical Initial Value Representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the GPU implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered, and the GPU-calculated vibrational frequencies perfectly match the benchmark values. The computational time scaling of two GPUs (NVIDIA Tesla C2075 and Kepler K20), respectively, versus two CPUs (Intel Core i5 and Intel Xeon E5-2687W), and the critical issues related to the GPU implementation are discussed. The resulting reduction in computational time and power consumption is significant, and semiclassical GPU calculations are shown to be environmentally friendly. © 2014 AIP Publishing LLC. [http://dx.doi.org/10.1063/1.4873137]
I. INTRODUCTION
The exponentially increasing demand for advanced graphics solutions in many software applications, including entertainment, visual simulation, computer-aided design, and scientific visualization, has boosted architectural innovation in high-performance graphics systems.1 Nowadays, Graphics Processing Units (GPUs) are ubiquitous, affordable, and designed to exploit the tremendous amount of data parallelism of graphics algorithms.
In recent years, GPUs have evolved into fully programmable devices and they are now ideal resources for accelerating several scientific applications. GPUs are designed with a philosophy which is very different from that of CPUs. On one hand, CPUs are more flexible than GPUs and able to provide a fast response to a single task instruction. On the other hand, GPUs perform best for highly parallelized processes. CPUs provide caches, and this hardware tool has been developed in a way to better assist programmers: caches are transparent to programmers, and the recent, larger ones can capture most of the frequently used data. GPUs, instead, achieve high performance by means of hundreds of cores which are fed by multiple independent parallel memory systems. A single GPU is composed of groups of single-instruction multiple-data (SIMD) processing units, and each unit is made of multiple smaller processing parts called threads. These are set to execute the same instructions concurrently. The advantages of this type of architecture consist in a reduced power consumption and an increased number of floating point arithmetic units per unit area. In other words, a reduced amount of space, power, and cooling is necessary to operate. However, parallelization efficiency depends critically on thread synchronization. In fact, accidental or forced inter-thread synchronization can turn out to be very costly, because it involves a kernel termination. Generation of a new kernel implies overhead from the host. Another GPU drawback is represented by its better efficiency for single precision arithmetic. Unfortunately, single precision is not enough for most scientific calculations. In general, then, there may not be a single stable GPU programming model, and CPU codes usually need to be extensively changed in order to fit the GPU hardware.
GPUs are becoming more and more popular among the scientific community, mainly thanks to the release of NVIDIA's Compute Unified Device Architecture (CUDA) toolkit.2 This is a programming model based on a user-friendly decomposition of the code into grids and threads which significantly simplifies code development. It allows the programmer to exploit all of the key hardware capabilities, such as scatter/gather and thread synchronization.
Applications of GPU programming in theoretical chemistry include implementations for classical molecular dynamics (MD),3-6 quantum chemistry,7-24 protein folding,25 quantum dynamics,26-31 and quantum mechanics/molecular mechanics (QM/MM)32 simulations. For instance, classical MD can be sped up by using GPUs for the calculation of long-range electrostatics and non-bonded forces.3,4 The direct Coulomb summation algorithm accesses the shared memory area only at the very beginning and the very end of the processing for each thread block, so MD takes full advantage of the GPU architecture by eliminating any use of thread synchronization. For instance, the popular Not (just) Another Molecular Dynamics (NAMD) program4,5 is accelerated several times by directing the electrostatics and implicit solvent model calculations to GPUs while the remaining tasks are handled by CPUs. Significant progress by Friedrichs et al.33
has determined an MD speed-up of about 500 times over an 8-core CPU by using the OpenMM library. The adoption of GPUs for quantum chemistry has been more difficult. The first full electronic-structure implementation on GPUs was Ufimtsev and Martinez's TeraChem.7 Currently, there are several electronic structure codes that have implemented GPU acceleration to some extent.8,11,17 For example, Aspuru's group successfully accelerated real-space DFT calculations, making this approach interesting and competitive.34 A simple GPU implementation of the CUBLAS SGEMM subroutine in quantum chemistry has been shown to be about 17 times faster than the parent DGEMM subroutine on CPU.11 Recently, CPU/GPU-implemented time-independent quantum scattering calculations achieved a 7-fold acceleration by employing 3 GPU and 3 CPU cores.
In a time-dependent quantum propagation, instead, almost all of the computational resources are spent on the time propagation of the wavepacket. Only the initial wavepacket is calculated on the CPU; data are then copied from the CPU memory to the GPU memory for the wavepacket propagation. Furthermore, it has been shown that quantum time-dependent approaches can be boosted by up to two orders of magnitude by taking advantage of the matrix-matrix multiplication for the time evolution, which maps well onto GPU architectures.26,27 Laganà's group demonstrated that quantum reactive scattering calculations of reaction probabilities can be accelerated as much as 20 times.28-30,35
The main goal of this paper is to speed up our semiclassical dynamics CPU code by exploiting GPU hardware. We show how and when it is convenient to employ GPU devices to perform semiclassical simulations. The GPU approach is also demonstrated to require a largely reduced amount of power supply. Unfortunately, the GPU programming experience gained for quantum propagation matrix-matrix multiplications or for classical Coulombic MD force fields is not helpful in the case of semiclassical simulations, due to the need to compute quantum delocalization and classical localization concurrently. Given the mixed classical and quantum nature of the semiclassical propagator, a general purpose (GP) GPU approach is taken. With this approach, host code runs on the CPUs and kernel code on the GPUs. GPGPU programming is principally aimed at minimizing data transfer between the host and the kernel, since this communication is made via a bus with relatively low speed.
The paper is organized as follows. Section II recalls the semiclassical initial value representation quantum propagator and Sec. III describes the GPGPU programming approach adopted here. Section IV compares the performances of the CPU and GPGPU codes and discusses them. Section V reports our conclusions.
II. SEMICLASSICAL INITIAL VALUE REPRESENTATION OF THE QUANTUM PROPAGATOR
The semiclassical propagator can be derived from the Feynman path integral formulation of the quantum evolution operator36 from point q to q',

\langle q' | e^{-i\hat{H}t/\hbar} | q \rangle = \left(\frac{m}{2\pi i\hbar t}\right)^{1/2} \int D[q(t)]\, e^{iS_t(q,q')/\hbar},   (1)
where S_t(q, q') is the path action for time t and D[q(t)] indicates the differential over all paths. Stationary phase approximation of Eq. (1) (see, for instance, Refs. 37 and 38) yields the semiclassical van Vleck-Gutzwiller propagator39,40

\langle q' | e^{-i\hat{H}t/\hbar} | q \rangle \simeq \sum_{\mathrm{roots}} \left|\frac{1}{(2\pi i\hbar)^{F}}\,\frac{\partial^{2} S_t}{\partial q\,\partial q'}\right|^{1/2} e^{iS_t(q,q')/\hbar - i\nu\pi/2},   (2)
where the sum is over all classical trajectories going from q to q' in an amount of time t, F is the number of degrees of freedom, and ν is the Maslov or Morse index, i.e., the number of points along the trajectory where the determinant in Eq. (2) diverges.41,42 To apply Eq. (2) as written, one needs to solve a nonlinear boundary value problem: the classical trajectory evolved from the initial phase space point (p(0), q(0)) must be such that q_t(p(0), q(0)) = q'. In general, there will be multiple roots to this equation and the summation of Eq. (2) is over all such roots. Finding these roots is a formidable task that has hindered the use and diffusion of semiclassical dynamics. The issue was overcome by Miller's Semiclassical Initial Value Representation (SC-IVR), whereby the boundary condition summation is replaced by an initial phase space integration amenable to Monte Carlo (MC) implementation.43-49 By representing the van Vleck-Gutzwiller propagator by a direct product of one-dimensional γ_i-width coherent states45,50-54 defined by

\langle q\,|\,p(t),q(t)\rangle = \prod_{i=1}^{F}\left(\frac{\gamma_i}{\pi}\right)^{1/4} \exp\left[-\frac{\gamma_i}{2}\,(q_i - q_i(t))^{2} + \frac{i}{\hbar}\,p_i(t)\,(q_i - q_i(t))\right]   (3)
and using Miller's IVR trick, the semiclassical propagator becomes

e^{-i\hat{H}t/\hbar} = \frac{1}{(2\pi\hbar)^{F}}\int dp(0)\int dq(0)\; C_t(p(0),q(0))\, e^{iS_t(p(0),q(0))/\hbar}\, |p(t),q(t)\rangle\langle p(0),q(0)|.   (4)
(p(t), q(t)) represents the set of classically evolved phase space coordinates and C_t is a pre-exponential factor. In the Herman-Kluk frozen Gaussian version of SC-IVR, the pre-exponential factor is written as45,50,51

C_t(p(0),q(0)) = \sqrt{\left|\frac{1}{2}\left(\frac{\partial q(t)}{\partial q(0)} + \Gamma^{-1}\frac{\partial p(t)}{\partial p(0)}\Gamma - i\hbar\,\frac{\partial q(t)}{\partial p(0)}\Gamma + \frac{i}{\hbar}\,\Gamma^{-1}\frac{\partial p(t)}{\partial q(0)}\right)\right|},   (5)
where Γ = diag(γ_1, ..., γ_F) is the coherent state matrix which defines the Gaussian width of the coherent states. The calculation of C_t is conveniently performed from blocks of size F × F by introducing the 2F × 2F symplectic (monodromy or stability) matrix M(t) ≡ ∂(p_t, q_t)/∂(p_0, q_0). The accuracy of time-evolved classical trajectories is monitored by calculating the deviation of the determinant of the positive-definite matrix M^T M from unity.55 In this work, a trajectory
is discarded when its deviation is greater than 10^-6. For semiclassical dynamics of bound systems, a reasonable choice for the γ_i width parameters is provided by the harmonic oscillator approximation to the wave function at the global minimum.
In this paper, we employ the SC-IVR propagator to calculate the spectral density

I(E) \equiv \langle\chi|\,\delta(\hat{H} - E)\,|\chi\rangle = \sum_{n} |\langle\chi|\psi_n\rangle|^{2}\,\delta(E - E_n),   (6)
where |χ⟩ is some reference state, {|ψ_n⟩} are the exact eigenfunctions, and {E_n} the corresponding eigenvalues of the Hamiltonian H. A more practical dynamical representation of Eq. (6) is given by the following time-dependent expression:56
I(E) = \frac{1}{2\pi\hbar}\int_{-\infty}^{+\infty}\langle\chi|e^{-i\hat{H}t/\hbar}|\chi\rangle\, e^{iEt/\hbar}\,dt = \frac{1}{\pi\hbar}\,\mathrm{Re}\int_{0}^{+\infty}\langle\chi|e^{-i\hat{H}t/\hbar}|\chi\rangle\, e^{iEt/\hbar}\,dt,   (7)
which is obtained by replacing the Dirac delta function in Eq. (6) by its Fourier representation. According to Eqs. (4) and (7), the SC-IVR spectral density representation becomes57
I(E) = \frac{1}{2\pi\hbar}\int_{-\infty}^{+\infty} dt\; e^{iEt/\hbar}\,\frac{1}{(2\pi\hbar)^{F}}\int dp(0)\int dq(0)\; C_t(p(0),q(0))\, e^{iS_t(p(0),q(0))/\hbar}\, \langle\chi|p(t),q(t)\rangle\langle p(0),q(0)|\chi\rangle,   (8)
where the reference state |χ⟩ = |p_eq, q_eq⟩ is represented in phase space coordinates. The Monte Carlo phase space integration is made easier to treat by introducing a time averaging (TA) filter, at the cost of a longer simulation time. This implementation was introduced by Kaledin and Miller,58 resulting in the following TA-SC-IVR formulation for the spectral density:
I(E) = \frac{1}{(2\pi\hbar)^{F}}\int dp(0)\int dq(0)\;\frac{\mathrm{Re}}{\pi\hbar T}\int_{0}^{T} dt_1\int_{t_1}^{T} dt_2\; C_{t_2}(p(t_1),q(t_1))\,\langle\chi|p(t_2),q(t_2)\rangle\, e^{i(S_{t_2}(p(0),q(0))+Et_2)/\hbar}\,\left[\langle\chi|p(t_1),q(t_1)\rangle\, e^{i(S_{t_1}(p(0),q(0))+Et_1)/\hbar}\right]^{*}.   (9)
Equation (9) presents two time variables. The integration over t_2 takes care of the Fourier transform of Eq. (7) (limited to the simulation time T), while the one over t_1 does the filtering job. The positions (p(t_1), q(t_1)) and (p(t_2), q(t_2)) refer to the same trajectories but at different times. By adopting a reasonable approximation for the pre-exponential factor, C_{t_2}(p(t_1), q(t_1)) ≈ exp[i(φ(t_2) − φ(t_1))/ℏ],58 where φ(t) = phase[C_t(p(0), q(0))], the double-time integration of Eq. (9) is reduced to a single one and the spectral density becomes
I(E) = \frac{1}{(2\pi\hbar)^{F}}\,\frac{1}{2\pi\hbar T}\int dp(0)\int dq(0)\left|\int_{0}^{T} dt\,\langle\chi|p(t),q(t)\rangle\, e^{i(S_t(p(0),q(0))+Et+\phi_t(p(0),q(0)))/\hbar}\right|^{2}.   (10)
Equation (10) offers the advantage that the integrand is now positive-definite and the integration is less computationally demanding. Several applications58-65 have demonstrated that this approximation is quite accurate.
III. GPU IMPLEMENTATION OF THE SC-IVR SPECTRAL DENSITY

A. Monte Carlo SC-IVR algorithm

To point out the degree of parallelism available in the SC-IVR procedure, we describe the main steps that lead to the computation of the semiclassical power spectrum.
The spectrum is conveniently represented as a k-dimensional vector (E_1 = E_min, E_2, ..., E_k = E_max) of equally spaced points in the range [E_min, E_max]. To evaluate each element of the discretized spectrum, we need to calculate I(E_i), i = 1, 2, ..., k, from Eq. (10). To this end we use a MC method. The phase-space integral of Eq. (10) is approximated by means of the following MC sum over n_traj classical trajectories:
I(E_i) = \frac{1}{(2\pi)^{F+1}}\,\frac{1}{n_{\mathrm{good}}\,T}\sum_{j=1}^{n_{\mathrm{traj}}} w_j\left|\sum_{c=0}^{n_{\mathrm{steps}}}\langle\chi|p_j(c\Delta t),q_j(c\Delta t)\rangle\, e^{i\left(S_{c\Delta t}(p_j(0),q_j(0))+E_i\,c\Delta t+\phi_{c\Delta t}(p_j(0),q_j(0))\right)}\right|^{2},   (11)
where atomic units have been employed, and n_good is the actual number of trajectories (n_good ≤ n_traj) over which the sum is averaged, as discussed at the end of the next paragraph. The MC phase space sampling is performed according to the Husimi distribution, which determines the weight w_j of each trajectory.58 The classical trajectories (p_j(t), q_j(t)) and the actions are then evolved, through n_steps discrete time steps of length Δt from time 0 to time T, by means of a fourth-order symplectic algorithm.66
The structure of the sequential (CPU) code is shown in Fig. 1. First, all the relevant information about the molecule under investigation is read from the configuration files, namely the masses and the equilibrium positions of the atoms. Then, normal mode coordinates are generated together with the conversion matrix from normal modes to Cartesian coordinates. This matrix is necessary, since the simulations are performed in normal mode coordinates, while the potential subroutines are written in Cartesian or internal coordinates. Once all the simulation and molecule-configuration parameters have been loaded into the program, the sequential generation of MC trajectories starts. In order to check the stability of the symplectic evolution of each trajectory, the determinant of the monodromy matrix |M(t)| is evaluated. As soon as it deviates from unity by an amount greater than 10^-6, the evolution of the trajectory is interrupted and its contribution to the MC integration discarded. The intermediate results produced by the trajectory are stored in a buffer (TempSpectrum). The buffered results contribute to the computation of the spectrum (which is done in the Spectrum array) only if the trajectory has completed its evolution over the whole [0, T] time interval. We use a counter n_good to count the number of good trajectories. When all the trajectories have been generated, the spectrum is normalized over n_good and copied into a file.

FIG. 1. The structure of the sequential code.
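A minimal host-side sketch of the sequential structure of Fig. 1 is given below; the sampling, evolution, monodromy, and accumulation routines are hypothetical stand-ins (a single harmonic degree of freedom with unit weight), kept only to make the bookkeeping of the TempSpectrum buffer, the stability check, and the n_good counter explicit.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Host-side stand-ins for the real routines (Husimi sampling, symplectic
// evolution, monodromy determinant, SC-IVR accumulation).
struct Traj { double p, q, weight; };
static Traj sample_initial() {
    auto u = [] { return 2.0 * std::rand() / RAND_MAX - 1.0; };
    return { u(), u(), 1.0 };
}
static void evolve_one_step(Traj &t, double dt) {          // velocity Verlet, V = q^2/2
    t.p -= 0.5 * dt * t.q; t.q += dt * t.p; t.p -= 0.5 * dt * t.q;
}
static double monodromy_det(const Traj &) { return 1.0; }  // placeholder: always stable
static void accumulate(const Traj &t, double time, std::vector<double> &buf) {
    for (std::size_t i = 0; i < buf.size(); ++i) buf[i] += std::cos(0.1 * i * time + t.q);
}

// Sequential reference loop mirroring Fig. 1: TempSpectrum buffer,
// |M(t)| stability check, n_good counter, final normalization.
void run_sequential(int n_traj, int n_steps, double dt, std::vector<double> &spectrum) {
    std::vector<double> temp(spectrum.size());
    int n_good = 0;
    for (int j = 0; j < n_traj; ++j) {
        Traj tr = sample_initial();
        std::fill(temp.begin(), temp.end(), 0.0);
        bool good = true;
        for (int c = 0; c <= n_steps && good; ++c) {
            accumulate(tr, c * dt, temp);                     // contribution at time c*dt
            evolve_one_step(tr, dt);
            if (std::fabs(monodromy_det(tr) - 1.0) > 1.0e-6)  // stability criterion
                good = false;                                 // discard this trajectory
        }
        if (good) {
            ++n_good;
            for (std::size_t i = 0; i < spectrum.size(); ++i) spectrum[i] += tr.weight * temp[i];
        }
    }
    if (n_good > 0) for (double &s : spectrum) s /= n_good;   // normalize over n_good
}
```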
B. GP-GPU implementation

Since each trajectory evolves independently, the design of a parallel version of the MC algorithm described in Eq. (11) is rather straightforward. In the most direct implementation, an n_traj simulation can be performed by means of n_traj independent computational units, each one working on a private memory space. Once all the trajectories have been run, the results can be summed up to obtain the final spectrum. Here, we describe the implementation of the MC SC-IVR algorithm for NVIDIA GP-GPUs with compute capability 2.0, which support double precision floating point operations; we therefore adopt the terminology of NVIDIA CUDA. The design guidelines that we are going to introduce can also be implemented on other GP-GPUs and, more generally, on any Single Instruction Multiple Data (SIMD) architecture, e.g., through OpenCL.67
The present parallel implementation of the SC-IVR algorithm uses two kernels: the Evolution Kernel (KernelEvolution in the code and in Fig. 2) and the Spectrum Kernel (KernelSpectrum in the code and in Fig. 2). In the Evolution Kernel, the cycle over the trajectories (see Fig. 1) is distributed over n_traj threads. The jth thread evolves a given initial condition (p_j(0), q_j(0)) from time 0 to time T and works on its private copy of the working variables used in the sequential code. Details about the memory usage will be presented below. In order to avoid instruction branching, which is highly detrimental in the SIMD setup, all the trajectories are evolved up to time T. The trajectory status is monitored by means of a flag variable which is initialized to good and switched to bad as soon as the determinant of the monodromy matrix associated with the trajectory deviates beyond the allowed tolerance. Information about the spectrum contributed by each trajectory during its evolution is stored in its private copy of the buffer array. We organize these private copies into the n_traj × k buffer matrix (Traj-WaveLength in Fig. 2). When the Evolution Kernel terminates, the Spectrum Kernel is launched with k threads. The jth thread computes the weighted sum of the elements of the jth column of the buffer matrix; the results of bad trajectories are excluded by setting their weight to zero. The structure of the CUDA code just described is shown in Fig. 2. The main advantage of using two separate kernels is that we are able to take into account the different dimensions of the problem, that is, the number n_traj of MC trajectories and the number k of sampled energies. Another advantage is that the threads always work on separate memory locations, which makes unnecessary the use of atomic operations or any other kind of thread synchronization mechanism that would otherwise slash the performance of the parallel code.
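A reduced sketch of this two-kernel structure is shown below; the kernel and array names follow Fig. 2, while the trajectory physics inside KernelEvolution is omitted and replaced by placeholders.

```cuda
#include <cuda_runtime.h>

// KernelEvolution: one thread per trajectory. Each thread fills its own
// row of the n_traj x k buffer matrix TW (Traj-WaveLength in Fig. 2)
// and records a weight (0 for "bad" trajectories).
__global__ void KernelEvolution(double *TW, double *weight,
                                int n_traj, int k, int n_steps)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_traj) return;

    bool good = true;
    for (int c = 0; c <= n_steps; ++c) {
        // ... evolve trajectory j by one time step (omitted) ...
        // ... set good = false if |M(t)| drifts beyond tolerance ...
        for (int i = 0; i < k; ++i)
            TW[j * k + i] += 0.0;   // placeholder for the SC-IVR contribution
    }
    weight[j] = good ? 1.0 : 0.0;   // Husimi weight in the real code
}

// KernelSpectrum: one thread per sampled energy. Thread i sums column i
// of TW, weighting each row; bad trajectories enter with zero weight.
__global__ void KernelSpectrum(const double *TW, const double *weight,
                               double *spectrum, int n_traj, int k)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= k) return;

    double s = 0.0;
    for (int j = 0; j < n_traj; ++j)
        s += weight[j] * TW[j * k + i];
    spectrum[i] = s;
}
```

With such a structure, a host launch of the form KernelEvolution<<<(n_traj + 127) / 128, 128>>>(...) followed by KernelSpectrum<<<(k + 127) / 128, 128>>>(...) reproduces the two stages; no atomic operations or inter-thread synchronization are required, because the two kernels never write to the same memory locations.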
A central task in the development of CUDA code concerns optimizing the use of the different types of GPU memory. As a matter of fact, memory bandwidth can be the real bottleneck in a GPGPU computation. As mentioned above, in order to reproduce in the code the independence of the MC trajectories, each thread works on a private copy of the variables. On one hand, this could be easily accomplished by reserving for each thread a portion of consecutive Global Memory words large enough to contain all of its working variables. On the other hand, this naive approach would lead to highly misaligned memory accesses and to a large amount of unnecessary and costly memory traffic. In order to allow for coalesced read/write operations in global memory, we store the n_traj copies of the same variable in contiguous positions. For instance, the momenta of the different MC trajectories are stored as (p^1_1, p^1_2, ..., p^1_{n_traj}, ..., p^F_1, p^F_2, ..., p^F_{n_traj}), where F is the number of degrees of freedom (the normal modes) of the molecule. In this way, neighboring threads issue read/write memory requests to neighboring memory locations that can be served simultaneously.

FIG. 2. (a) CUDA code structure. The jth row (TW(j, 1), TW(j, 2), ..., TW(j, k)) of the Traj-WaveLength matrix contains the private copy of the TempSpectrum array of the jth trajectory. After the threads of KernelEvolution have filled the matrix, KernelSpectrum is launched. Each thread of KernelSpectrum performs the sum of the elements over each column of Traj-WaveLength. (b) Use of the memory hierarchy while executing the Evolution Kernel.
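In code, this structure-of-arrays layout amounts to indexing every global array by trajectory first, as in the minimal sketch below (variable and kernel names are illustrative):

```cuda
// Global-memory layout for the momenta: all n_traj copies of mode 0 first,
// then all copies of mode 1, and so on. Thread j of a warp touches
// p[i * n_traj + j], so consecutive threads hit consecutive addresses
// and the accesses coalesce.
__device__ __forceinline__
double &P(double *p, int i /* mode 0..F-1 */, int j /* trajectory */, int n_traj)
{
    return p[i * n_traj + j];
}

__global__ void kick(double *p, const double *grad, double dt, int F, int n_traj)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;        // one thread per trajectory
    if (j >= n_traj) return;
    for (int i = 0; i < F; ++i)
        P(p, i, j, n_traj) -= dt * grad[i * n_traj + j];  // coalesced read/write
}
```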
We recall that the atom masses, the equilibrium positions, the conversion matrix, and other potential-structure matrices are constant and trajectory-independent. Thus, we store these parameters in the Constant Memory. In this way, when all the threads in a half-warp issue a read of the same constant memory address, i.e., for the same parameter, a single read request is generated and the result is broadcast to all the requesting threads. Moreover, since constant memory is cached, the subsequent requests for the same parameter by other threads do not generate memory traffic.
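A minimal sketch of how such trajectory-independent parameters can be placed in constant memory is shown below; the array names and the MAX_ATOMS bound are illustrative assumptions, not those of the actual code.

```cuda
#include <cuda_runtime.h>

#define MAX_ATOMS 16

// Trajectory-independent data: a single read per half-warp is broadcast
// to all requesting threads, and later reads hit the constant cache.
__constant__ double d_mass[3 * MAX_ATOMS];                 // atomic masses
__constant__ double d_qeq[3 * MAX_ATOMS];                  // equilibrium geometry
__constant__ double d_nm2cart[9 * MAX_ATOMS * MAX_ATOMS];  // normal mode -> Cartesian matrix

void upload_parameters(const double *mass, const double *qeq,
                       const double *nm2cart, int natoms)
{
    cudaMemcpyToSymbol(d_mass, mass, 3 * natoms * sizeof(double));
    cudaMemcpyToSymbol(d_qeq, qeq, 3 * natoms * sizeof(double));
    cudaMemcpyToSymbol(d_nm2cart, nm2cart,
                       9 * natoms * natoms * sizeof(double));
}
```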
Finally, we discuss the matter of L1-cache/shared memory usage. This 64 kB memory is located close (on chip) to the processing units (the CUDA cores) and provides the lowest latency times. By default, 16 kB of this memory are used as L1 cache, which is automatically managed by the device, whereas the remaining 48 kB can be used either to share information between the threads in a block or as a programmable cache. Since there is no flow of information between threads, we use the shared memory as a programmable cache. Due to the limited size of this memory, we employ it to store only the position vectors of the trajectories in a block and some intensively used parameters. Fig. 2(b) shows where the main data structures employed by the code are allocated.
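The programmable-cache idea can be sketched as follows: each block stages the position vectors of its own trajectories in dynamically allocated shared memory, updates them there, and writes them back. Sizes, names, and the placeholder update are illustrative.

```cuda
// Each block caches the F-dimensional position vector of each of its
// trajectories in shared memory (dynamic allocation: blockDim.x * F doubles),
// works on the cached copy, and writes it back to global memory.
__global__ void step_positions(double *q_global, double dt, int F, int n_traj)
{
    extern __shared__ double q_sh[];                 // blockDim.x * F doubles
    int j  = blockIdx.x * blockDim.x + threadIdx.x;  // global trajectory index
    int lj = threadIdx.x;                            // index within the block
    if (j >= n_traj) return;

    for (int i = 0; i < F; ++i)                      // stage: global -> shared
        q_sh[i * blockDim.x + lj] = q_global[i * n_traj + j];

    for (int i = 0; i < F; ++i)                      // work on the cached copy
        q_sh[i * blockDim.x + lj] += 0.0;            // placeholder update

    for (int i = 0; i < F; ++i)                      // write back
        q_global[i * n_traj + j] = q_sh[i * blockDim.x + lj];
}
```

Launched as step_positions<<<blocks, threads, threads * F * sizeof(double)>>>(...); no __syncthreads() is needed because each thread only touches its own shared-memory slots.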
We conclude this section with a remark about the block-versus-thread structure. The threads that are used to generate the MC trajectories or to compute the components of the spectrum are evenly distributed among the blocks. The number of blocks, therefore, determines the threads-per-block ratio. Taking into account the dimension of the scheduling unit (the warp), we constrain the number of threads per block to be an integer multiple of 32. We then choose the configuration that provides the best performance, i.e., the shortest computing time. This procedure allows us to slash most of the memory latency times and guarantees a sufficient number of registers for each thread as well.
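The timing-based selection of the threads-per-block configuration can be sketched as below, assuming the KernelEvolution kernel of the earlier sketch is in scope; the scan simply tries every multiple of the warp size and keeps the fastest configuration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Try threads-per-block values that are multiples of the warp size (32)
// and keep the configuration with the shortest measured kernel time.
int best_threads_per_block(int n_traj, double *TW, double *weight,
                           int k, int n_steps)
{
    int best = 32; float best_ms = 1e30f;
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    for (int tpb = 32; tpb <= 1024; tpb += 32) {
        int blocks = (n_traj + tpb - 1) / tpb;
        cudaEventRecord(t0);
        KernelEvolution<<<blocks, tpb>>>(TW, weight, n_traj, k, n_steps);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        if (ms < best_ms) { best_ms = ms; best = tpb; }
    }
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    printf("best threads/block: %d (%.2f ms)\n", best, best_ms);
    return best;
}
```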
IV. RESULTS AND DISCUSSION
Initially, debugging calculations are performed with an NVIDIA Tesla C2075 GPU and an Intel Core i5-3550 CPU (6M cache, 3.3 GHz). Then, performance calculations are carried out employing an NVIDIA Kepler K20 GPU and an Intel Xeon E5-2687W CPU (20M cache, 3.10 GHz) at the Eurora cluster of the Italian supercomputing center CINECA.
In order to avoid any accidental over-estimation of the GPU code performance, we stress that the CPU code uses a single thread and does not make any use of SIMD instruction sets, such as Intel SSE. This means that the CPU code is not designed to fully exploit the computational power of multi-core or SSE-enabled processors. Considering the parallel nature of the described MC algorithm, a multi-thread version of the code would require, in the best case, 1/k of the single-thread CPU time, where k is the number of available cores.
As for the GPUs, we use the same code on both the Tesla C2075 and the K20, with the exception of the block-versus-thread configuration, which is set to maximize the performance on each device. The new functionalities introduced by the Kepler architecture (such as dynamic parallelism and the 48 kB Read-Only Data Cache) are not exploited.
In order to test the performance of the CUDA SC-IVR code described in Sec. III, we look at four molecules with an increasing number of degrees of freedom. It is important to study the time scaling not only for an increasing number of trajectories, but also for increasing complexity of the molecular system. The chosen molecules are H2, H2O, H2CO, and CH2D2, and the number of their vibrational degrees of freedom is, respectively, 1, 3, 6, and 9. So, one should keep in mind that the number of vibrational modes grows at the fast pace of three times the number of atoms. Another aspect to take into account for a proper time-scaling evaluation is represented by the potential energy subroutine adopted. For the H2 molecule, a simple Morse oscillator is employed, while for the other molecules we use analytical potential energy surfaces fitted to ab initio quantum electronic energies.68-70
We are aware that about 1000 trajectories per vibrational degree of freedom58 are necessary in order to reach convergence in the Monte Carlo integration of Eq. (10). However, we report calculations performed with up to 65 536 trajectories. This allows for a study of the computing capability saturation of the two GPUs under consideration (see below) as well as a better description of the different computational simulation time trends of CPUs and GPUs. Fig. 3 shows the computational time at different numbers of classical trajectories and for the different molecules. Semiclassical CPU calculations show a linear scaling up to the maximum number of 65 536 trajectories tested. Instead, the computational time of the SC-IVR GPU CUDA code described above is roughly constant up to n_traj = 2048 for the K20 and n_traj = 4096 for the C2075, independently of the molecule under investigation. While the serial operation modality enforced by the CPU architecture is clearly at the origin of the linear scaling, the GPU behavior is a more sophisticated one. As a matter of fact, the execution time for a number of trajectories smaller than the indicated thresholds (2048/4096) is very close to the time required by the GPU to complete the evolution of a single trajectory.
FIG. 3. Elapsed computational time (in minutes) for CPU-SCIVR and GPU-SCIVR calculations of H2, H2O, H2CO, and CH2D2 as a function of the number of trajectories (devices: Intel Core i5 and Intel Xeon CPUs, Tesla C2075 and K20 GPUs). For a small number of trajectories (<2048), GPU times are roughly constant: the GPU's computational capabilities are not fully exploited. For a large number of trajectories (≥4096) GPU times scale linearly with respect to the number of trajectories. CPU times grow linearly with the number of trajectories over the whole range [128, 65 536].
By accurately profiling the execution of the code, we find that this behavior is largely due to the high memory traffic generated by the code, since a single trajectory requires the manipulation of a large number of data. Thanks to an accurate memory mapping of the information needed by the code (see below), we are able to minimize the on-chip/off-chip data transfer. We find that the latency time (i.e., the amount of time required for data to become available to a thread) plays a central role. However, when the number of threads is larger than the number of Streaming Multiprocessors (the computational units in CUDA), part of the latencies is hidden by the thread scheduler: when a thread is inactive while waiting for data to arrive, another one, which is ready for execution, is run. The execution time ceases to remain constant as soon as the number of threads grows beyond what the scheduler can use to hide the latencies. This occurs when the computational power of the GPU is saturated. Interestingly enough, the more powerful K20 gets saturated sooner (2048 threads) than the older C2075. We will address this issue later in this section.
For every molecular system, once the number of trajectories is large enough for the Monte Carlo integral to converge, the resulting power spectrum is compared to the one reported by Kaledin and Miller.58

FIG. 4. The deuterated methane CH2D2 power spectrum obtained using the K20 GPU. The MC integration was converged with 8192 trajectories and the spectrum has been projected onto the four irreducible representations for peak attribution. Red and black lines are different coherent state combinations for the same irreducible representation.
We find our eigenvalues to be in agreement within 0.1%. This negligible discrepancy is due to the slightly different number of trajectories used in the GPU calculations. As an example, we report in Fig. 4 the power spectrum of di-deuterated methane. We stress once more that our main goal is to test the accuracy and efficiency of the GPU-implemented SC-IVR code and not just the determination of the spectrum, a problem which has already been solved. In Fig. 4, the power spectrum of the deuterated methane CH2D2 is projected onto the irreducible representations of the relevant molecular point group. This procedure helps the reader, assists the authors in assigning peaks more easily, and permits a stricter comparison with previous calculations on the same systems.58,63
After verifying that the GPU-implemented code indeed preserves the same accuracy as the CPU one, we turn to the computational performance difference between the two NVIDIA graphics units, i.e., the C2075 and the K20. Clearly, for each molecule tested, one would expect a better performance of the more recent K20 with respect to the C2075, as reported in Fig. 5. The acceleration shown in Fig. 5 increases with the number of trajectories. The upper left panel of the figure reports the speed-up for the Morse oscillator power spectrum calculation. The acceleration is comparable in magnitude for the water molecule presented in the upper right panel, and the two graphics units' performances are quite similar. Instead, for the more complex systems reported in the lower panels, the acceleration of the K20 graphics card is larger, as pointed out by the log scale.
The trends of computational time (see Fig. 3) and acceleration beyond n_traj = 2048 for the K20 and n_traj = 4096 for the C2075 deserve some further discussion. As mentioned above, for a given number n_traj of MC trajectories, the threads-per-block configuration is always chosen to maximize the performance. This number is usually kept high enough to make it possible for the warp scheduler to hide memory access latency. We find that for 256 ≤ n_traj ≤ 8192 the best results are obtained with 128 threads in each block, independently of the device we run the code on. The real occupancy, i.e., the number of warps (execution units of 32 threads) running concurrently on a multiprocessor, is however determined by the registers needed by each thread. This is a key issue when a code uses a high number of variables, as in the present case. The dimension of the SMX register file is twice the size of the register memory of the C2075 (256 kB vs. 128 kB), so the occupancy of the K20 can be higher than that of the C2075. On one side, this contributes to speeding up the calculations, as shown by the K20 device performance. On the other side, the size of the L1 cache memory we use (48 kB) is the same on both devices. This means that the same amount of L1 cache is shared among more truly concurrent threads on the K20 than on the C2075. This results in higher on-chip/off-chip memory traffic, and it is likely a reason for the earlier reduction of the GPU acceleration growth rate of the Kepler K20 with respect to the Tesla C2075. Table I reports the computational time for the fully converged TA-SC-IVR spectra calculations for the devices employed.
FIG. 5. GPU acceleration as a function of the number of trajectories for the power spectrum calculations of H2, H2O, H2CO, and CH2D2: NVIDIA graphics units C2075 and K20 performance, respectively, versus the Intel Core i5 and the Intel Xeon E5-2687W.
This table shows that the K20 computational time is always smaller than that of the older C2075 and of any other CPU device. If we consider the computational time for the better performing K20 GPU and look at the time scaling for all the molecules under examination, we obtain the plot reported by the black filled circles in Fig. 6.
Contrary to what one would expect, the ratio of the CPU computational time over the GPU one is not monotonically decreasing with the number of vibrational degrees of freedom for the molecule calculations. To find the source of such an irregular behavior, we treat a set of uncoupled Morse oscillators. We calculate the ratio between the CPU and GPU computational time for an increasing number of oscillators while keeping the same number of trajectories used for the molecules considered.
FIG. 6. Power spectrum GPU computational speed-up as a function of the number of degrees of freedom for H2, H2O, H2CO, and CH2D2 (filled black circles) compared to uncoupled Morse oscillators (filled red squares) with the same number of degrees of freedom and for a 65 536 trajectory Monte Carlo integration.
The red filled squares in Fig. 6 report these values. In this case, the GPU acceleration contribution decreases slightly with the number of degrees of freedom. In the case of a single oscillator, the GPU speed-up factor is exactly the same as that found for the hydrogen molecule, because the H2 potential is a Morse potential. Also the speed-up for the molecules with six and nine degrees of freedom is similar to that for the corresponding oscillators. Conversely, the GPU acceleration for the water molecule strongly deviates from its Morse oscillator reference. We ascribe this speed-up discrepancy to the potential subroutine, which is called several times, i.e., at each time step and for each trajectory. For a typical 65 536 trajectory simulation with a fourth-order symplectic algorithm iterated for 4000 time steps, the potential subroutine is called about 2.9 × 10^11 times, and we estimate it to take, for all the molecules except H2, approximately 70% of the overall running time. Different analytical expressions for the fitted potential surfaces lead to different performances after GPU implementation. For instance, an additional square root calculation can significantly change the computational time, considering the number of times the potential subroutine is called. Thus, Fig. 6 eloquently shows how important it is to write the potential energy surface in an analytical form as simple as possible. We actually think that this consideration is valid beyond the employment of GPU hardware. These limitations related to the fitted analytical potential are not present in a direct on-the-fly semiclassical dynamics simulation.
TABLE I. Performance of each computing device.a

Molecule   i5-3550    C2075   Ratio CPU/GPU   E5-2687W   K20     Ratio CPU/GPU
H2         28.20b     0.25    113             55.03      0.15    367
H2O        73.63      1.87    39              91.60      1.62    57
H2CO       798.07     14.42   55              936.73     9.18    102
CH2D2      1428.35    22.35   64              1715.73    16.63   103

a The number of trajectories is 65 536.
b The computational time is measured in minutes.
However, in this last case, the bottleneck is represented by the cost of the ab initio electronic energy calculations, especially when high-level electronic theory and large basis sets are employed. A viable and convenient future perspective would be to combine the present SC-IVR Monte Carlo GPU parallelization with the available GPU ab initio codes7 to investigate if and to what extent GPUs can slash direct dynamics times and allow accurate calculations for sizeable molecules.
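To make the cost argument above concrete, the sketch below shows a Morse-type potential and gradient of the kind used here for H2; the well depth, width, and equilibrium distance are illustrative placeholders rather than the parameters employed in the paper, and the single exponential is the transcendental operation whose cost dominates when such a routine is called on the order of 10^11 times.

```cuda
#include <math.h>

// Illustrative Morse oscillator V(r) = D*(1 - exp(-a*(r - r0)))^2 and its
// derivative; parameter values are placeholders (atomic units).
__host__ __device__ inline
double morse(double r, double *dVdr)
{
    const double D  = 0.17;   // well depth (placeholder)
    const double a  = 1.0;    // width parameter (placeholder)
    const double r0 = 1.4;    // equilibrium distance (placeholder)

    double e = exp(-a * (r - r0));      // the expensive transcendental call
    double one_m_e = 1.0 - e;
    *dVdr = 2.0 * D * a * one_m_e * e;  // dV/dr
    return D * one_m_e * one_m_e;       // V(r)
}
```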
Finally, we discuss the power consumption advantage of using GPU devices for semiclassical calculations. Thanks to the support of the Italian supercomputing center CINECA, we have been able to measure the amount of energy dissipated by each job. As an example, we focus on the 65 536 trajectory runs for deuterated methane, which is the largest molecule considered in this work. We found that the energy dissipated by the K20 GPU computation is 0.33 kW h, whereas that of the single-thread CPU computation on the Xeon is 24.50 kW h. Even assuming that a simulation with eight concurrent threads consumes 24.50/8 kW h, the GPU run is still about ten times more convenient, in terms of power consumption, than the CPU one.
V. CONCLUSIONS

This paper describes the implementation of the SC-IVR algorithm for CUDA GPUs. Through a careful usage of the memory hierarchy, it is possible to use a GPU as if it were a cluster of CPUs, each working on an independent memory space. We find a significant speed-up with respect to CPU simulations. Taking a multi-thread simulation over eight cores as the reference, the GPU speed-up is lowered to about 12 for most of the molecules considered here. Interestingly enough, the performance delivered by the GPU is strongly dependent on the kind of operations required by the potential energy surface subroutine. We benchmarked the code on molecules with up to nine degrees of freedom. Our future work will be mainly focused on the development of new implementations able to offer a viable alternative route to the use of multiple parallel GPUs in applications where a large number of trajectories is necessary.
ACKNOWLEDGMENTS

The authors thank NVIDIA's Academic Research Team for the grant MuPP@UniMi and for providing the Tesla C2075 graphics card under the Hardware Donation Program. We acknowledge CINECA and the Regione Lombardia award under the LISA initiative (grant MATGREEN) for the availability of high performance computing resources and support. Professor A. Laganà, Professor E. Pollak, and Professor A. Aspuru-Guzik are warmly thanked for useful advice and fruitful discussions. X. Andrade and S. Blau are thanked for revising the paper.
1. See http://www.nvidia.com for information about CUDA.
2. J. Sanders and E. Kandrot, CUDA by Example: An Introduction to General-Purpose GPU Programming (Addison-Wesley, Boston, MA, 2010); CUDA Community Showcase, available at http://www.nvidia.com/object/cuda_apps_flash_new.html.
3. J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy, L. G. Trabuco, and K. J. Schulten, J. Comput. Chem. 28, 2618 (2007).
4. J. E. Stone, D. J. Hardy, I. S. Ufimtsev, and K. J. Schulten, J. Mol. Graphics Modell. 29, 116 (2010).
5. J. C. Phillips, J. E. Stone, and K. Schulten, "Adapting a message-driven parallel application to GPU-accelerated clusters," in Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '08), Austin, TX, November 2008 (IEEE Press, Piscataway, NJ, 2008), Article No. 8.
6. A. P. Ruymgaart and R. Elber, J. Chem. Theory Comput. 8, 4624 (2012).
7. I. S. Ufimtsev and T. J. Martinez, Comput. Sci. Eng. 10, 26 (2008); J. Chem. Theory Comput. 4, 222 (2008); 5, 1004 (2009); 5, 2619 (2009); N. Luehr, I. S. Ufimtsev, and T. J. Martinez, ibid. 7, 949 (2011); C. M. Isborn, N. Luehr, I. S. Ufimtsev, and T. J. Martinez, ibid. 7, 1814 (2011); I. S. Ufimtsev, N. Luehr, and T. J. Martinez, J. Phys. Chem. Lett. 2, 1789 (2011); A. V. Titov, I. S. Ufimtsev, N. Luehr, and T. J. Martinez, J. Chem. Theory Comput. 9, 213 (2013).
8. K. Yasuda, J. Chem. Theory Comput. 4, 1230 (2008).
9. L. Vogt, R. Olivares-Amaya, S. Kermes, Y. Shao, C. Amador-Bedolla, and A. Aspuru-Guzik, J. Phys. Chem. A 112, 2049 (2008).
10. L. Genovese, M. Ospici, T. Deutsch, J.-F. Méhaut, A. Neelov, and S. Goedecker, J. Chem. Phys. 131, 034103 (2009).
11. M. Watson, R. Olivares-Amaya, R. G. Edgar, and A. Aspuru-Guzik, Comput. Sci. Eng. 12, 40 (2010).
12. H. Tomono, M. Aoki, T. Iitaka, and K. Tsumuraya, J. Phys.: Conf. Ser. 215, 012121 (2010).
13. X. Andrade and L. Genovese, in Fundamentals of Time-Dependent Density Functional Theory, Lecture Notes in Physics Vol. 837, edited by M. A. Marques, N. T. Maitra, F. M. Nogueira, E. Gross, and A. Rubio (Springer, Berlin, 2012), p. 401.
14. X. Andrade, J. Alberdi-Rodriguez, D. A. Strubbe, M. J. Oliveira, F. Nogueira, A. Castro, J. Muguerza, A. Arruabarrena, S. G. Louie, A. Aspuru-Guzik, A. Rubio, and M. A. L. Marques, J. Phys.: Condens. Matter 24, 233202 (2012).
15. S. Maintz, B. Eck, and R. Dronskowski, Comput. Phys. Commun. 182, 1421 (2011).
16. A. E. DePrince and J. R. Hammond, J. Chem. Theory Comput. 7, 1287 (2011).
17. F. Spiga and I. Girotto, "phiGEMM: A CPU-GPU library for porting Quantum ESPRESSO on hybrid systems," in Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), Garching, Germany, 15-17 February 2012.
18. Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), edited by R. Stotzka, M. Schiffers, and Y. Cotronis (The Institute of Electrical and Electronics Engineers, Inc., New York, 2012).
19. J. D. C. Maia, G. A. Urquiza Carvalho, C. P. Mangueira, S. R. Santana, L. A. F. Cabral, and G. B. Rocha, J. Chem. Theory Comput. 8, 3072 (2012).
20. M. Hacene, A. Anciaux-Sedrakian, X. Rozanska, D. Klahr, T. Guignon, and P. Fleurat-Lessard, J. Comput. Chem. 33, 2581 (2012).
21. K. Esler, J. Kim, D. M. Ceperley, and L. Shulenburger, Comput. Sci. Eng. 14, 40 (2012).
22. S. Hakala, V. Havu, J. Enkovaara, and R. Nieminen, in Applied Parallel and Scientific Computing, Lecture Notes in Computer Science Vol. 7782, edited by P. Manninen and P. Öster (Springer, Berlin, 2013), p. 63.
23. W. Jia, Z. Cao, L. Wang, J. Fu, X. Chi, W. Gao, and L.-W. Wang, Comput. Phys. Commun. 184, 9 (2013); W. Jia, J. Fu, Z. Cao, L. Wang, X. Chi, W. Gao, and L.-W. Wang, J. Comput. Phys. 251, 102 (2013).
24. J. Hutter, M. Iannuzzi, F. Schiffmann, and J. VandeVondele, WIREs Comput. Mol. Sci. 4, 15-25 (2014).
25. J. C. Sweet, J. N. Ronald, T. Cickovski, C. R. Sweet, V. S. Pande, and J. A. Izaguirre, J. Chem. Theory Comput. 9, 3267 (2013).
26. S. Hoefinger, A. Acocella, S. C. Pop, T. Narumi, K. Yasuoka, T. Beu, and F. Zerbetto, J. Comput. Chem. 33, 2351 (2012).
27. P.-Y. Zhang and K.-L. Han, J. Phys. Chem. A 117, 8512 (2013).
28. R. Baraglia, M. Bravi, G. Capannini, A. Laganà, and E. Zambonini, Lect. Notes Comput. Sci. 6784, 412 (2011).
29. L. Pacifici, D. Nalli, D. Skouteris, and A. Laganà, Lect. Notes Comput. Sci. 6784, 428 (2011).
30. L. Pacifici, D. Nalli, and A. Laganà, Lect. Notes Comput. Sci. 7333, 292 (2012).
31. Computational Science and Its Applications - ICCSA 2011, Lecture Notes in Computer Science Vol. 6784, edited by B. Murgante, O. Gervasi, A. Iglesias, D. Taniar, and B. O. Apduhan (Springer-Verlag, Berlin, 2011), p. 412.
32. Y. Uejima, T. Terashima, and R. Maezono, J. Comput. Chem. 32, 2264 (2011); I. S. Ufimtsev, N. Luehr, and T. J. Martinez, J. Phys. Chem. Lett. 2, 1789 (2011); C. M. Isborn, A. W. Gotz, M. A. Clark, R. C. Walker, and T. J. Martinez, J. Chem. Theory Comput. 8, 5092 (2012); C. M. Isborn, B. D. Mar, B. F. E. Curchod, I. Tavernelli, and T. J. Martinez, J. Phys. Chem. B 117, 12189 (2013); H. J. Kulik, N. Luehr, I. S. Ufimtsev, and T. J. Martinez, ibid. 116, 12501 (2012).
33. M. Friedrichs, P. Eastman, V. Vaidyanathan, M. Houston, S. Legrand, A. Beberg, D. Ensign, C. Bruns, and V. S. Pande, J. Comput. Chem. 30, 864 (2009); P. Eastman and V. S. Pande, Comput. Sci. Eng. 12, 34 (2010).
34. X. Andrade and A. Aspuru-Guzik, J. Chem. Theory Comput. 9, 4360 (2013).
35. L. Pacifici, A. Nalli, and A. Laganà, Comput. Phys. Commun. 184, 1372 (2013).
36. R. P. Feynman and A. R. Hibbs, Quantum Mechanics and Path Integrals (McGraw-Hill, New York, 1965).
37. M. V. Berry and K. E. Mount, Rep. Prog. Phys. 35, 315 (1972).
38. D. J. Tannor, Introduction to Quantum Mechanics: A Time-Dependent Perspective (University Science Books, Sausalito, CA, 2007).
39. J. H. van Vleck, Proc. Natl. Acad. Sci. U.S.A. 14, 178 (1928).
40. M. C. Gutzwiller, J. Math. Phys. 8, 1979 (1967).
41. M. Morse, Variational Analysis (Wiley, New York, 1973).
42. V. P. Maslov, Théorie des Perturbations et Méthodes Asymptotiques (Dunod, Paris, 1972).
43. W. H. Miller, J. Chem. Phys. 53, 3578 (1970); 53, 1949 (1970); Adv. Chem. Phys. 30, 77 (1975); 25, 69 (1974).
44. M. A. Sepulveda and F. Grossmann, Adv. Chem. Phys. 96, 191 (1996).
45. K. G. Kay, J. Chem. Phys. 100, 4377 (1994); 100, 4432 (1994); 101, 2250 (1994).
46. S. Zhang and E. Pollak, J. Chem. Phys. 119, 11058 (2003); 121, 3384 (2004); J. Chem. Theory Comput. 1, 345 (2005); J. Tatchen, E. Pollak, G. Tao, and W. H. Miller, J. Chem. Phys. 134, 134104 (2011); J. Tatchen and E. Pollak, ibid. 130, 041103 (2009); R. Ianconescu, J. Tatchen, and E. Pollak, ibid. 139, 154311 (2013).
47. R. Conte and E. Pollak, Phys. Rev. E 81, 036704 (2010).
48. R. Conte and E. Pollak, J. Chem. Phys. 136, 094101 (2012).
49. J. Vanicek and E. J. Heller, Phys. Rev. E 67, 016211 (2003); 64, 026215 (2001); C. Mollica and J. Vanicek, Phys. Rev. Lett. 107, 214101 (2011); M. Sulc and J. Vanicek, Mol. Phys. 110, 945 (2012).
50. E. J. Heller, J. Chem. Phys. 62, 1544 (1975); 75, 2923 (1981).
51. M. F. Herman and E. Kluk, Chem. Phys. 91, 27 (1984); M. F. Herman, J. Chem. Phys. 85, 2069 (1986); E. Kluk, M. F. Herman, and H. L. Davis, ibid. 84, 326 (1986).
52. X. Sun and W. H. Miller, J. Chem. Phys. 108, 8870 (1998).
53. W. H. Miller, J. Phys. Chem. A 105, 2942 (2001).
54. W. H. Miller, Proc. Natl. Acad. Sci. U.S.A. 102, 6660 (2005).
55. H. Wang, D. E. Manolopoulos, and W. H. Miller, J. Chem. Phys. 115, 6317 (2001).
56. E. J. Heller, Acc. Chem. Res. 14, 368 (1981).
57. W. H. Miller, Faraday Discuss. 110, 1 (1998).
58. A. L. Kaledin and W. H. Miller, J. Chem. Phys. 118, 7174 (2003); 119, 3078 (2003).
59. M. Ceotto, S. Atahan, S. Shim, G. F. Tantardini, and A. Aspuru-Guzik, Phys. Chem. Chem. Phys. 11, 3861 (2009).
60. M. Ceotto, S. Atahan, G. F. Tantardini, and A. Aspuru-Guzik, J. Chem. Phys. 130, 234113 (2009).
61. M. Ceotto, D. dell'Angelo, and G. F. Tantardini, J. Chem. Phys. 133, 054701 (2010).
62. M. Ceotto, S. Valleau, G. F. Tantardini, and A. Aspuru-Guzik, J. Chem. Phys. 134, 234103 (2011).
63. M. Ceotto, G. F. Tantardini, and A. Aspuru-Guzik, J. Chem. Phys. 135, 214108 (2011).
64. M. Ceotto, Y. Zhuang, and W. L. Hase, J. Chem. Phys. 138, 054116 (2013).
65. R. Conte, A. Aspuru-Guzik, and M. Ceotto, J. Phys. Chem. Lett. 4, 3407 (2013).
66. Y. Zhuang, M. R. Siebert, W. L. Hase, K. G. Kay, and M. Ceotto, J. Chem. Theory Comput. 9, 54 (2013).
67. See http://www.khronos.org/opencl/ for information about OpenCL.
68. J. M. Bowman, A. Wierzbicki, and J. Zuniga, Chem. Phys. Lett. 150, 269 (1988).
69. J. M. L. Martin, T. J. Lee, and P. R. Taylor, J. Mol. Spectrosc. 160, 105 (1993).
70. T. J. Lee, J. M. L. Martin, and P. R. Taylor, J. Chem. Phys. 102, 254 (1995).