Introduction.— Machine learning gathers considerable attention in a wide range of fields, and much effort is devoted to developing effective algorithms. Variational Bayes (VB) inference [1–6] is one of the most fundamental methods in machine learning and is widely used for parameter estimation and model selection. In particular, VB compensates for some disadvantages of the expectation-maximization (EM) algorithm [5–7], which is a widely used approach for maximum likelihood estimation. For example, overfitting, which often occurs in EM, is greatly moderated in VB. Furthermore, a variant of VB based on classical statistical mechanics, which we call simulated annealing variational Bayes (SAVB) inference in this paper, was proposed [8] and has become popular in many fields due to its effectiveness. However, it is also known that VB and SAVB often fail to estimate appropriate parameters of an assumed model, depending on prior distributions and initial conditions.

In the field of physics, the study of quantum computation and of how to exploit it for machine learning is gaining popularity. For example, while experimentalists are intensively developing quantum machines [9–13], theorists are developing quantum error correction schemes [14–18] and quantum algorithms [19–29]. In particular, the study of quantum annealing (QA) has a history of more than two decades [22–25] and is still progressing [26].

In this Letter, by focusing on QA and VB, we devise a quantum-mechanically inspired algorithm that works on a classical computer in practical time and achieves a considerable improvement over VB and SAVB. More specifically, we introduce the mathematical mechanism of quantum fluctuations into VB and propose a new algorithm, which we call quantum annealing variational Bayes (QAVB) inference. To demonstrate the performance of QAVB, we consider a clustering problem and employ a Gaussian mixture model, which is one of the important applications of VB. We then see that QAVB succeeds in estimation with high probability while VB and SAVB do not. This fact is noteworthy because our algorithm is one of the few that can obtain a global optimum of a non-convex optimization in practical computational time without using random numbers.

Problem setting of VB.— In preparation for a quantum extension of VB, we briefly review the problem setting of VB [1–6]. First, we summarize the definitions of variables. Suppose that we have $N$ data points $Y^{\mathrm{obs}} = \{y_i^{\mathrm{obs}}\}_{i=1}^N$, which are independent and identically distributed according to the conditional distribution $p^{y,\sigma|\theta}(y_i, \sigma_i|\theta)$, where $y_i$, $\sigma_i$, and $\theta$ are an observable variable, a hidden variable, and a parameter, respectively. Thus, we have $p^{Y,\Sigma|\theta}(Y, \Sigma|\theta) = \prod_{i=1}^N p^{y,\sigma|\theta}(y_i, \sigma_i|\theta)$, where $Y = \{y_i\}_{i=1}^N$ and $\Sigma = \{\sigma_i\}_{i=1}^N$. The joint distribution is also given by $p^{Y,\Sigma,\theta}(Y, \Sigma, \theta) = p^{Y,\Sigma|\theta}(Y, \Sigma|\theta)\, p^{\theta}_{\mathrm{pr}}(\theta)$, where $p^{\theta}_{\mathrm{pr}}(\theta)$ denotes the prior distribution of $\theta$. Furthermore, we define the domains of $\Sigma$ and $\theta$ as $S^{\Sigma} := \bigotimes_{i=1}^{N} S^{\sigma}$ and $S^{\theta}$, respectively.

The goal of VB is to approximate the posterior distribution given by $p^{\Sigma,\theta|Y}(\Sigma, \theta|Y^{\mathrm{obs}}) = p^{Y,\Sigma,\theta}(Y^{\mathrm{obs}}, \Sigma, \theta)/p^{Y}(Y^{\mathrm{obs}})$, with $p^{Y}(Y^{\mathrm{obs}}) = \sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, p^{Y,\Sigma,\theta}(Y^{\mathrm{obs}}, \Sigma, \theta)$, in the mean field approximation. Here, we have used Bayes' theorem for the derivation of the posterior distribution. Using a variational function $q^{\Sigma,\theta}(\Sigma, \theta)$ that satisfies $\sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, q^{\Sigma,\theta}(\Sigma, \theta) = 1$, the objective function of VB is given by

$$\mathrm{KL}\Big(q^{\Sigma,\theta}(\cdot,\cdot) \,\Big\|\, p^{\Sigma,\theta|Y}(\cdot,\cdot|Y^{\mathrm{obs}})\Big) := -\sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, q^{\Sigma,\theta}(\Sigma, \theta) \ln \frac{p^{\Sigma,\theta|Y}(\Sigma, \theta|Y^{\mathrm{obs}})}{q^{\Sigma,\theta}(\Sigma, \theta)}, \qquad (1)$$

which is the KL divergence [30, 31]. In VB, we minimize Eq. (1) in the mean field approximation given by

$$q^{\Sigma,\theta}(\Sigma, \theta) = q^{\Sigma}(\Sigma)\, q^{\theta}(\theta). \qquad (2)$$
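The mean-field minimization of Eq. (1) under the factorization of Eq. (2) can be illustrated with a minimal numerical sketch. Everything below is an illustrative assumption, not part of the Letter's setup: a fully discrete $4 \times 3$ toy posterior `p` (so that a sum stands in for the integral over $\theta$) and uniform initial factors. Alternating the standard closed-form coordinate updates of the two factors never increases the KL divergence.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical discrete toy posterior p(sigma, theta) on a 4 x 3 grid
# (theta discretized, so a sum stands in for the integral over theta)
p = rng.random((4, 3))
p /= p.sum()

# Mean-field factors of Eq. (2): q(sigma, theta) = q_sigma(sigma) * q_theta(theta)
q_sigma = np.full(4, 1 / 4)
q_theta = np.full(3, 1 / 3)

def kl(q_s, q_t):
    """Eq. (1) evaluated for the factorized trial distribution."""
    q = np.outer(q_s, q_t)
    return float(np.sum(q * np.log(q / p)))

history = [kl(q_sigma, q_theta)]
for _ in range(20):
    # Coordinate update of q_theta: proportional to exp(sum_sigma q_sigma ln p)
    q_theta = np.exp(q_sigma @ np.log(p))
    q_theta /= q_theta.sum()
    # Coordinate update of q_sigma: proportional to exp(sum_theta q_theta ln p)
    q_sigma = np.exp(np.log(p) @ q_theta)
    q_sigma /= q_sigma.sum()
    history.append(kl(q_sigma, q_theta))

# Each sweep is exact coordinate descent, so the KL divergence never increases.
assert all(b <= a + 1e-12 for a, b in zip(history, history[1:]))
```

The fixed point of these alternating updates is exactly the mean-field solution; with several random restarts one can also observe the dependence on initial conditions discussed below.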
Thus, by setting the functional derivatives of Eq. (1) under Eq. (2) with respect to $q^{\Sigma}(\Sigma)$ and $q^{\theta}(\theta)$ equal to 0 and solving for $q^{\Sigma}(\Sigma)$ and $q^{\theta}(\theta)$, we obtain the update equations for $\Sigma$ and $\theta$:

$$q^{\Sigma}_{t+1}(\Sigma) \propto \exp\left(\int_{\theta \in S^{\theta}} d\theta\, q^{\theta}_{t+1}(\theta) \ln p^{Y,\Sigma,\theta}(Y^{\mathrm{obs}}, \Sigma, \theta)\right), \qquad (3)$$

$$q^{\theta}_{t+1}(\theta) \propto \exp\left(\sum_{\Sigma \in S^{\Sigma}} q^{\Sigma}_{t}(\Sigma) \ln p^{Y,\Sigma,\theta}(Y^{\mathrm{obs}}, \Sigma, \theta)\right), \qquad (4)$$

where $q^{\Sigma}_{t}(\Sigma)$ and $q^{\theta}_{t}(\theta)$ are the distributions of $\Sigma$ and $\theta$ at the $t$-th iteration [5, 6].

VB is widely used due to its effectiveness. In some cases, the performance of VB is much better than that of EM [5–7], and VB can be directly used for model selection [1–6]. However, it is also known that the performance of VB heavily depends on initial conditions. To relax this problem, we introduce quantum fluctuations into VB in the rest of this Letter.

Quantum annealing variational Bayes inference.— Here, we formulate a quantum extension of VB. We first define the classical Hamiltonians via $p^{Y,\Sigma|\theta}(Y^{\mathrm{obs}}, \Sigma|\theta)$ and $p^{\theta}_{\mathrm{pr}}(\theta)$:

$$H^{\Sigma|\theta}_{\mathrm{cl}} := -\ln p^{Y,\Sigma|\theta}(Y^{\mathrm{obs}}, \Sigma|\theta), \qquad (5)$$
$$H^{\theta}_{\mathrm{pr}} := -\ln p^{\theta}_{\mathrm{pr}}(\theta). \qquad (6)$$

Next, we define operators $\hat\sigma_i$ and $\hat\theta$ whose eigenvalues are $\sigma_i$ and $\theta$, respectively; that is, $\hat\sigma_i$ and $\hat\theta$ satisfy $\hat\sigma_i |\sigma_i\rangle = \sigma_i |\sigma_i\rangle$ and $\hat\theta |\theta\rangle = \theta |\theta\rangle$, where $|\sigma_i\rangle$ and $|\theta\rangle$ are eigenstates of $\hat\sigma_i$ and $\hat\theta$, respectively. In this paper, we assume that $\hat\sigma_i$ and $\hat\theta$ commute with each other. Using the above definition of $|\sigma_i\rangle$, we also define $|\Sigma\rangle := \bigotimes_{i=1}^{N} |\sigma_i\rangle$. Then, we replace $\Sigma = \{\sigma_i\}_{i=1}^N$ and $\theta$ in Eqs. (5) and (6) by $\bigotimes_{j=1}^{i-1} \hat I^{\sigma_j} \otimes \hat\sigma_i \otimes \bigotimes_{j=i+1}^{N} \hat I^{\sigma_j}$ and $\hat\theta$, respectively, where $\hat I^{\sigma_i}$ denotes the identity operator for the space spanned by $|\sigma_i\rangle$. That is, we define

$$\hat H^{\Sigma|\theta}_{\mathrm{cl}} := \sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, H^{\Sigma|\theta}_{\mathrm{cl}}\, \hat P^{\Sigma,\theta}, \qquad (7)$$
$$\hat H^{\theta}_{\mathrm{pr}} := \sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, H^{\theta}_{\mathrm{pr}}\, \hat P^{\Sigma,\theta}, \qquad (8)$$

where $\hat P^{\Sigma,\theta} := \hat P^{\Sigma} \otimes \hat P^{\theta}$, $\hat P^{\Sigma} := \bigotimes_{i=1}^{N} \hat P^{\sigma_i}$, $\hat P^{\sigma_i} := |\sigma_i\rangle\langle\sigma_i|$, and $\hat P^{\theta} := |\theta\rangle\langle\theta|$. To introduce quantum fluctuations into VB, we define a Gibbs operator that involves a non-commutative term:

$$\hat f(\beta, s) := \exp\Big(-\hat H^{\theta}_{\mathrm{pr}} - \beta(1-s)\hat H^{\Sigma|\theta}_{\mathrm{cl}} - \beta s\, \hat H^{\Sigma}_{\mathrm{qu}}\Big), \qquad (9)$$

where $\hat H^{\Sigma}_{\mathrm{qu}}$ is a non-commutative term, defined as $\hat H^{\Sigma}_{\mathrm{qu}} := \sum_{i=1}^{N} \hat H^{\sigma_i}_{\mathrm{qu}}$, and $\hat H^{\sigma_i}_{\mathrm{qu}}$ is defined such that

$$\Big[\hat H^{\sigma_i}_{\mathrm{qu}},\; \bigotimes_{j=1}^{i-1} \hat I^{\sigma_j} \otimes \hat\sigma_i \otimes \bigotimes_{j=i+1}^{N} \hat I^{\sigma_j} \otimes \hat I^{\theta}\Big] \neq 0, \qquad (10)$$

for any $i$ [32]. Here, $\hat I^{\theta}$ represents the identity operator for the space spanned by $|\theta\rangle$. This Gibbs operator, Eq. (9), involves two annealing parameters $\beta$ and $s$, where, in terms of physics, $\beta$ is regarded as the inverse temperature and $s$ represents the strength of quantum fluctuations. Thus, when $s = 0$ and $\beta = 1$, we recover $\langle\Sigma, \theta|\hat f(\beta = 1, s = 0)|\Sigma, \theta\rangle = p^{Y,\Sigma,\theta}(Y^{\mathrm{obs}}, \Sigma, \theta)$. Although we consider only the quantization of $\Sigma$, the quantization of $\theta$ is almost straightforward [33].

Using Eq. (9), we define a quantum extension of the KL divergence [34] by

$$S\left(\hat\rho^{\Sigma,\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right) := -\mathrm{Tr}_{\Sigma,\theta}\left[\hat\rho^{\Sigma,\theta}\left(\ln \frac{\hat f(\beta, s)}{Z(\beta, s)} - \ln \hat\rho^{\Sigma,\theta}\right)\right], \qquad (11)$$

where $Z(\beta, s) := \mathrm{Tr}_{\Sigma,\theta}\big[\hat f(\beta, s)\big]$ and $\mathrm{Tr}_{\Sigma,\theta}[\cdot] := \sum_{\Sigma \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta\, \langle\Sigma, \theta| \cdot |\Sigma, \theta\rangle$. Also, $\hat\rho^{\Sigma,\theta}$ denotes a density operator over $\Sigma$ and $\theta$ that satisfies $\mathrm{Tr}_{\Sigma,\theta}\big[\hat\rho^{\Sigma,\theta}\big] = 1$. In particular, when $\beta = 1$, $s = 0$, and $\hat\rho^{\Sigma,\theta}$ is diagonal, the quantum relative entropy, Eq. (11), reduces to the classical KL divergence, Eq. (1).

To derive the update equations, we repeat almost the same procedure as in VB; that is, we employ the mean field approximation $\hat\rho^{\Sigma,\theta} = \hat\rho^{\Sigma} \otimes \hat\rho^{\theta}$, where $\hat\rho^{\Sigma}$ and $\hat\rho^{\theta}$ represent the density operators for $\Sigma$ and $\theta$, respectively; then Eq. (11) can be reduced to [35]

$$\begin{aligned}
S\left(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right)
=\;& -\sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle \langle\theta|\hat\rho^{\theta}|\theta'\rangle \big[\langle\Sigma'| \otimes \langle\theta'|\big] \ln \hat f(\beta, s) \big[|\Sigma\rangle \otimes |\theta\rangle\big] \\
& + \sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle \langle\Sigma'|\ln \hat\rho^{\Sigma}|\Sigma\rangle + \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \langle\theta|\hat\rho^{\theta}|\theta'\rangle \langle\theta'|\ln \hat\rho^{\theta}|\theta\rangle \\
& + \ln Z(\beta, s). \qquad (12)
\end{aligned}$$

Next, by setting the functional derivatives of Eq. (12) with respect to $\langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle$ and $\langle\theta|\hat\rho^{\theta}|\theta'\rangle$ equal to 0 and solving for $\hat\rho^{\Sigma}$ and $\hat\rho^{\theta}$, we obtain the update equations [36]:

$$\hat\rho^{\Sigma}_{t+1} \propto \exp\Big(\mathrm{Tr}_{\theta}\big[\hat\rho^{\theta}_{t+1} \ln \hat f(\beta, s)\big]\Big), \qquad (13)$$
$$\hat\rho^{\theta}_{t+1} \propto \exp\Big(\mathrm{Tr}_{\Sigma}\big[\hat\rho^{\Sigma}_{t} \ln \hat f(\beta, s)\big]\Big), \qquad (14)$$

where $\mathrm{Tr}_{\Sigma}[\cdot] := \sum_{\Sigma \in S^{\Sigma}} \langle\Sigma| \cdot |\Sigma\rangle$, $\mathrm{Tr}_{\theta}[\cdot] := \int_{\theta \in S^{\theta}} d\theta\, \langle\theta| \cdot |\theta\rangle$, and $t$ stands for the number of iterations. We mention that $\mathrm{Tr}_{\Sigma}[\cdot]$ and $\mathrm{Tr}_{\theta}[\cdot]$ represent partial traces, and they yield operators on the spaces spanned by $|\theta\rangle$ and $|\Sigma\rangle$, respectively. We also note that the subscripts $t$ and $t+1$ on the right-hand sides of Eqs. (13) and (14) depend on the implementation of QAVB, and the normalization factors of Eqs. (13) and (14) are determined by the conditions on density operators $\mathrm{Tr}_{\Sigma}\big[\hat\rho^{\Sigma}\big] = 1$ and $\mathrm{Tr}_{\theta}\big[\hat\rho^{\theta}\big] = 1$. In QAVB, we iterate these two update equations, changing the annealing parameters $\beta$ and $s$, until a termination condition is satisfied. In this algorithm, we obtain the density operators $\hat\rho^{\Sigma}_{t}$ and $\hat\rho^{\theta}_{t}$ at each step, and their diagonal elements $\langle\Sigma|\hat\rho^{\Sigma}_{t}|\Sigma\rangle$ and $\langle\theta|\hat\rho^{\theta}_{t}|\theta\rangle$ represent the distributions of $\Sigma$ and $\theta$, respectively. In practical applications, we may use the mean $\mathrm{Tr}_{\theta}\big[\hat\theta \hat\rho^{\theta}\big]$ or the mode $\arg\max_{\theta} \langle\theta|\hat\rho^{\theta}|\theta\rangle$. Note that, when $\beta = 1$ and $s = 0$, Eqs. (13) and (14) exactly reduce to the update equations of VB, Eqs. (3) and (4). Finally, we summarize this algorithm in Algorithm 1.

ALGORITHM 1: Quantum annealing variational Bayes (QAVB) inference
1: set $\hat\rho^{\theta}_{\mathrm{pr}}$ and $t \leftarrow 0$, and initialize $\hat\rho^{\Sigma}_{0}$
2: set $\beta \leftarrow \beta_0$ and $s \leftarrow s_0$
3: while convergence criterion is not satisfied do
4:   compute $\hat\rho^{\theta}_{t+1}$ in Eq. (14)
5:   compute $\hat\rho^{\Sigma}_{t+1}$ in Eq. (13) for $i = 1, 2, \ldots, N$
6:   change $\beta$ and $s$
7:   $t \leftarrow t + 1$
8: end while
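One sweep of Eqs. (13) and (14), as in lines 4–5 of Algorithm 1, can be sketched numerically on a toy model. The sketch assumes a fully discretized joint space of dimension $3 \times 2$ with arbitrary illustrative Hamiltonians (not the Letter's GMM setup); since the exponent of Eq. (9) is Hermitian, $\ln \hat f(\beta, s)$ is available in closed form.

```python
import numpy as np

def expm_sym(A):
    """Matrix exponential of a symmetric matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.T

rng = np.random.default_rng(3)
dS, dT = 3, 2  # hypothetical dimensions of the discretized Sigma and theta spaces

# Diagonal classical Hamiltonians on the joint space, cf. Eqs. (5)-(8)
H_pr = np.kron(np.eye(dS), np.diag(-np.log(rng.random(dT))))  # prior term (theta part)
H_cl = np.diag(-np.log(rng.random(dS * dT)))                  # likelihood term
# Off-diagonal quantum term acting on the Sigma part, cf. Eq. (10)
H_qu = np.kron(np.ones((dS, dS)) - np.eye(dS), np.eye(dT))

def log_f(beta, s):
    """ln f(beta, s) from Eq. (9); exact because the exponent is Hermitian."""
    return -(H_pr + beta * (1 - s) * H_cl + beta * s * H_qu)

def qavb_step(rho_sigma, beta, s):
    """One sweep of Eqs. (14) then (13), as in lines 4-5 of Algorithm 1."""
    L = log_f(beta, s)
    # Eq. (14): rho_theta ∝ exp(Tr_Sigma[rho_Sigma ln f]); einsum takes the partial trace
    M = (np.kron(rho_sigma, np.eye(dT)) @ L).reshape(dS, dT, dS, dT)
    rho_theta = expm_sym(np.einsum('abad->bd', M))
    rho_theta /= np.trace(rho_theta)
    # Eq. (13): rho_Sigma ∝ exp(Tr_theta[rho_theta ln f])
    M = (np.kron(np.eye(dS), rho_theta) @ L).reshape(dS, dT, dS, dT)
    rho_sigma = expm_sym(np.einsum('abcb->ac', M))
    rho_sigma /= np.trace(rho_sigma)
    return rho_sigma, rho_theta

rho_sigma, rho_theta = qavb_step(np.eye(dS) / dS, beta=1.0, s=0.5)
assert np.isclose(np.trace(rho_sigma), 1.0) and np.isclose(np.trace(rho_theta), 1.0)
# At s = 0 every operator here is diagonal, so the update reduces to classical VB:
rho_sigma0, _ = qavb_step(np.eye(dS) / dS, beta=1.0, s=0.0)
assert np.allclose(rho_sigma0, np.diag(np.diag(rho_sigma0)))
```

The final assertion makes the limit explicit: switching off the quantum term ($s = 0$) leaves diagonal density operators, i.e., the classical VB updates of Eqs. (3) and (4).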
Gaussian mixture models.— To see the performance of QAVB, we consider the problem of estimating the parameters and the number of clusters of a GMM, as studied in Refs. [2, 5, 6]. The joint probability distribution of the GMM over an observable variable $y_i$ and a hidden variable $\sigma_i$, conditioned on a set of parameters $\theta$, is given by

$$p^{y,\sigma|\theta}(y_i, \sigma_i|\theta) = \sum_{k=1}^{K} \pi^k\, \mathcal{N}\big(y_i \,\big|\, \mu^k, (\Lambda^k)^{-1}\big)\, \delta_{k,\sigma_i}, \qquad (15)$$

where $\delta_{k,\sigma_i}$ is the Kronecker delta, $\{\pi^k\}_{k=1}^K$ are the mixing coefficients of the GMM, and $\mathcal{N}(y_i|\mu^k, (\Lambda^k)^{-1})$ is a Gaussian distribution whose mean and precision (the inverse of the covariance) are $\mu^k$ and $\Lambda^k$, respectively [37]. Here, we have assumed that each hidden variable $\sigma_i$ takes values $1, \ldots, K$; that is, $S^{\sigma} = \{k\}_{k=1}^K$ [38]. To simplify the notation, we denote $\{\pi^k\}_{k=1}^K$, $\{\mu^k\}_{k=1}^K$, and $\{\Lambda^k\}_{k=1}^K$ by $\pi$, $\mu$, and $\Lambda$, respectively, and we refer to $\{\pi, \mu, \Lambda\}$ collectively as $\theta$.

Taking the logarithm of Eq. (15), we define the Hamiltonian of the GMM for $\sigma_i$ with $y = y_i^{\mathrm{obs}}$ as

$$H^{\sigma_i|\theta}_{\mathrm{cl}} = -\ln p^{y,\sigma|\theta}(y_i^{\mathrm{obs}}, \sigma_i|\theta). \qquad (16)$$

Then the Hamiltonian of the GMM for $\Sigma = \{\sigma_i\}_{i=1}^N$ with $Y = Y^{\mathrm{obs}}$ is given by $H^{\Sigma|\theta}_{\mathrm{cl}} = \sum_{i=1}^{N} H^{\sigma_i|\theta}_{\mathrm{cl}}$. Using Eq. (7), we can also define the quantum representation of $H^{\Sigma|\theta}_{\mathrm{cl}}$ as $\hat H^{\Sigma|\theta}_{\mathrm{cl}}$.

To introduce quantum fluctuations into $\hat H^{\Sigma|\theta}_{\mathrm{cl}}$, a non-commutative term $\hat H^{\Sigma}_{\mathrm{qu}} = \sum_{i=1}^{N} \hat H^{\sigma_i}_{\mathrm{qu}}$ that satisfies Eq. (10) should be added. In this Letter, we adopt

$$\hat H^{\sigma_i}_{\mathrm{qu}} = \bigotimes_{j=1}^{i-1} \hat I^{\sigma_j} \otimes \Bigg(\sum_{\substack{k = 1, \ldots, K \\ l = k \pm 1}} |\sigma_i = l\rangle \langle\sigma_i = k|\Bigg) \otimes \bigotimes_{j=i+1}^{N} \hat I^{\sigma_j} \otimes \hat I^{\theta}, \qquad (17)$$

where $|\sigma_i = 0\rangle = |\sigma_i = K\rangle$ and $|\sigma_i = K+1\rangle = |\sigma_i = 1\rangle$. We note that the form of $\hat H^{\sigma_i}_{\mathrm{qu}}$ is not limited to the above definition and is in general arbitrary.

Numerical setup and results.— We assess the performance of three algorithms: QAVB, VB, and SAVB. In this numerical simulation, we use the data set shown in Fig. 1(a). The number of Gaussian mixtures of the generating model is $K_{\mathrm{gen}} = 10$. The means and covariances of the Gaussians are depicted by green crosses and blue lines in Fig. 1(a), respectively.

There are many candidates for annealing schedules, so we limit ourselves to the following. Let $\beta_t$ and $s_t$ be $\beta$ and $s$ at the $t$-th iteration, respectively. For QAVB, we vary $s_t$ and $\beta_t$ as $s_t = s_0 \times \max(1 - t/\tau_{\mathrm{QA1}}, 0.0)$ and

$$\beta_t = \begin{cases} \beta_0 & (t \le \tau_{\mathrm{QA1}}) \\ 1 + \dfrac{(\beta_0 - 1)(\tau_{\mathrm{QA2}} - t)}{\tau_{\mathrm{QA2}} - \tau_{\mathrm{QA1}}} & (\tau_{\mathrm{QA1}} \le t \le \tau_{\mathrm{QA2}}) \\ 1.0 & (t \ge \tau_{\mathrm{QA2}}) \end{cases}, \qquad (18)$$

respectively, where $s_0$ and $\beta_0$ are the initial values of the annealing schedules, $\tau_{\mathrm{QA1}}$ and $\tau_{\mathrm{QA2}}$ specify the time scales of the annealing schedules, and $\max(x, y)$ gives the maximum of $x$ and $y$. To visualize how $T_t := 1/\beta_t$ and $s_t$ behave in the above annealing schedules, we illustrate them in Fig. 1(b). The reason why we adopt the above annealing schedules will be discussed later. Note that QAVB with $s_0 = 0$ corresponds to SAVB, and SAVB with $\beta_0 = 1$ is identical to VB.

We show the numerical results of the three algorithms [39]. We set $K = 15$ hereafter. In Fig. 2(a), we first compare QAVB and VB by plotting the estimated
number of clusters and the posterior log-likelihood, which is given by

$$\mathcal{L}\big[q^{\Sigma}(\cdot)\, q^{\theta}(\cdot)\big] = -\mathrm{KL}\Big(q^{\Sigma}(\cdot)\, q^{\theta}(\cdot) \,\Big\|\, p^{\Sigma,\theta|Y}(\cdot, \cdot|Y^{\mathrm{obs}})\Big).$$

FIG. 1: (a) Data set generated by 10 Gaussian functions ($K_{\mathrm{gen}} = 10$). The means and covariances of the Gaussians are depicted by green crosses and blue lines, respectively. (b) Annealing schedules of QAVB. The red line represents $T_t = 1/\beta_t$ with $\beta_0 = 30.0$, and the green line depicts $s_t$ with $s_0 = 1.0$. We set $\tau_{\mathrm{QA1}} = 450$ and $\tau_{\mathrm{QA2}} = 500$.

FIG. 2: (a) Relation between the number of estimated clusters and the posterior log-likelihood for QAVB and VB, and (b) that for QAVB and SAVB. We set $s_0 = 1.0$ and $\beta_0 = 30.0$ for QAVB, and $\beta_0 = 0.9$ for SAVB. The horizontal axis represents the number of estimated clusters, while the vertical axis depicts the posterior log-likelihood. The error bars along the horizontal axis represent frequency, normalized to ten for VB and SAVB and to unity for QAVB.
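The annealing schedule of Eq. (18), together with $s_t = s_0 \max(1 - t/\tau_{\mathrm{QA1}}, 0)$, can be sketched directly; the parameter values below are those quoted in the caption of Fig. 1(b). The endpoint checks confirm that the quantum fluctuations vanish at $\tau_{\mathrm{QA1}}$ and that the final iterations run at $\beta = 1$, $s = 0$, i.e., as ordinary VB updates.

```python
# Schedule parameters taken from the caption of Fig. 1(b)
s0, beta0 = 1.0, 30.0
tau1, tau2 = 450, 500  # tau_QA1, tau_QA2

def schedule(t):
    """Return (beta_t, s_t): s_t = s0 * max(1 - t/tau_QA1, 0) and beta_t from Eq. (18)."""
    s_t = s0 * max(1.0 - t / tau1, 0.0)
    if t <= tau1:
        beta_t = beta0
    elif t <= tau2:
        beta_t = 1.0 + (beta0 - 1.0) * (tau2 - t) / (tau2 - tau1)
    else:
        beta_t = 1.0
    return beta_t, s_t

# Endpoint checks: s reaches 0 at tau_QA1, beta reaches 1 at tau_QA2
assert schedule(0) == (beta0, s0)
assert schedule(tau1) == (beta0, 0.0)
assert schedule(tau2) == (1.0, 0.0)
assert schedule(600) == (1.0, 0.0)
```

Setting `s0 = 0.0` here reproduces the SAVB schedule, and additionally fixing `beta0 = 1.0` reproduces plain VB, matching the correspondences stated above.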
A. Functional derivatives of G with respect to $\hat\rho^{\Sigma}$ and $\hat\rho^{\theta}$

Here, we derive Eq. (12) in the main text. By substituting the mean field approximation $\hat\rho^{\Sigma,\theta} = \hat\rho^{\Sigma} \otimes \hat\rho^{\theta}$ into Eq. (11), we obtain

$$\begin{aligned}
S\left(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right)
=\;& -\sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \big[\langle\Sigma| \otimes \langle\theta|\big] \hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \big[|\Sigma'\rangle \otimes |\theta'\rangle\big] \big[\langle\Sigma'| \otimes \langle\theta'|\big] \ln \hat f(\beta, s) \big[|\Sigma\rangle \otimes |\theta\rangle\big] \\
& + \sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \big[\langle\Sigma| \otimes \langle\theta|\big] \hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \big[|\Sigma'\rangle \otimes |\theta'\rangle\big] \big[\langle\Sigma'| \otimes \langle\theta'|\big] \ln\big(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta}\big) \big[|\Sigma\rangle \otimes |\theta\rangle\big] \\
& + \ln Z(\beta, s), \qquad (22)
\end{aligned}$$

where $|\Sigma, \theta\rangle = |\Sigma\rangle \otimes |\theta\rangle$. This expression can be simplified further using the following identities:

$$\ln\big(\hat\rho^{a} \otimes \hat\rho^{b}\big) = \ln \hat\rho^{a} \otimes \hat I^{b} + \hat I^{a} \otimes \ln \hat\rho^{b}, \qquad (23)$$
$$\big[\langle a| \otimes \langle b|\big] \hat\rho^{a} \otimes \hat\rho^{b} \big[|a'\rangle \otimes |b'\rangle\big] = \langle a|\hat\rho^{a}|a'\rangle \langle b|\hat\rho^{b}|b'\rangle, \qquad (24)$$

where $\hat I^{a}$ and $\hat I^{b}$ are identity operators in the Hilbert spaces spanned by $|a\rangle$ and $|b\rangle$, respectively. Then we obtain

$$\begin{aligned}
S\left(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right)
=\;& -\sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle \langle\theta|\hat\rho^{\theta}|\theta'\rangle \big[\langle\Sigma'| \otimes \langle\theta'|\big] \ln \hat f(\beta, s) \big[|\Sigma\rangle \otimes |\theta\rangle\big] \\
& + \sum_{\Sigma \in S^{\Sigma}} \sum_{\Sigma' \in S^{\Sigma}} \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle \langle\Sigma'|\ln \hat\rho^{\Sigma}|\Sigma\rangle + \int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \langle\theta|\hat\rho^{\theta}|\theta'\rangle \langle\theta'|\ln \hat\rho^{\theta}|\theta\rangle \\
& + \ln Z(\beta, s), \qquad (25)
\end{aligned}$$

which is identical to Eq. (12).

B. Derivation of update equations

We derive the update equations of QAVB, Eqs. (13) and (14), from Eq. (12). In preparation for the derivation, we prove the following equality:

$$\frac{\delta}{\delta \langle\Sigma|\hat\rho|\Sigma'\rangle} \mathrm{Tr}\big[\hat X \ln \hat\rho\big] = \big\langle\Sigma'\big|\hat X \hat\rho^{-1}\big|\Sigma\big\rangle, \qquad (26)$$

for any density operator $\hat\rho$ and any $\hat X$ that commutes with $\hat\rho$. The proof is as follows.

Proof. When $\hat 0 \prec \hat\rho \prec 2\hat I$, the definitions of the logarithm and the inverse are given by

$$\ln \hat\rho := \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \big(\hat\rho - \hat I\big)^{n}, \qquad (27)$$
$$\hat\rho^{-1} := \sum_{n=1}^{\infty} (-1)^{n+1} \big(\hat\rho - \hat I\big)^{n-1}. \qquad (28)$$

By substituting Eq. (27) into the left-hand side of Eq. (26), we get

$$\frac{\delta}{\delta \langle\Sigma|\hat\rho|\Sigma'\rangle} \mathrm{Tr}\big[\hat X \ln \hat\rho\big] = \sum_{n=1}^{\infty} \frac{(-1)^{n+1}}{n} \frac{\delta}{\delta \langle\Sigma|\hat\rho|\Sigma'\rangle} \mathrm{Tr}\Big[\hat X \big(\hat\rho - \hat I\big)^{n}\Big]. \qquad (29)$$

Each term in the summation in Eq. (29) can be calculated as

$$\frac{(-1)^{n+1}}{n} \frac{\delta}{\delta \langle\Sigma|\hat\rho|\Sigma'\rangle} \mathrm{Tr}\Big[\hat X \big(\hat\rho - \hat I\big)^{n}\Big] = \frac{(-1)^{n+1}}{n} \sum_{i=1}^{n} \Big\langle\Sigma'\Big| \big(\hat\rho - \hat I\big)^{n-i} \hat X \big(\hat\rho - \hat I\big)^{i-1} \Big|\Sigma\Big\rangle \qquad (30)$$
$$= (-1)^{n+1} \Big\langle\Sigma'\Big| \hat X \big(\hat\rho - \hat I\big)^{n-1} \Big|\Sigma\Big\rangle. \qquad (31)$$

We have used $[\hat X, \hat\rho] = 0$ in Eq. (31). By summing Eq. (31) over $n$, we have

$$\frac{\delta}{\delta \langle\Sigma|\hat\rho|\Sigma'\rangle} \mathrm{Tr}\big[\hat X \ln \hat\rho\big] = \sum_{n=1}^{\infty} (-1)^{n+1} \Big\langle\Sigma'\Big| \hat X \big(\hat\rho - \hat I\big)^{n-1} \Big|\Sigma\Big\rangle \qquad (32)$$
$$= \big\langle\Sigma'\big|\hat X \hat\rho^{-1}\big|\Sigma\big\rangle. \qquad (33)$$

Here, we note the definition of $\hat\rho^{-1}$, Eq. (28).
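The series definitions (27) and (28) can be checked numerically. The sketch below uses an arbitrary illustrative symmetric matrix with spectrum inside $(0, 2)$ (the convergence condition of the proof); the truncated series reproduce the eigendecomposition logarithm and the exact inverse.

```python
import numpy as np

rng = np.random.default_rng(2)
# Random symmetric density-like matrix with eigenvalues in (0, 2), so that the
# series (27) and (28) converge; the spectrum is an arbitrary illustrative choice.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
rho = Q @ np.diag([0.4, 0.3, 0.2, 0.1]) @ Q.T

def series_log(rho, n_terms=400):
    """Eq. (27): ln rho = sum_n (-1)^{n+1}/n (rho - I)^n."""
    out, P = np.zeros_like(rho), np.eye(len(rho))
    X = rho - np.eye(len(rho))
    for n in range(1, n_terms + 1):
        P = P @ X                          # (rho - I)^n
        out += ((-1) ** (n + 1) / n) * P
    return out

def series_inv(rho, n_terms=400):
    """Eq. (28): rho^{-1} = sum_n (-1)^{n+1} (rho - I)^{n-1}."""
    out, P = np.zeros_like(rho), np.eye(len(rho))
    X = rho - np.eye(len(rho))
    for n in range(1, n_terms + 1):
        out += ((-1) ** (n + 1)) * P       # (rho - I)^{n-1}
        P = P @ X
    return out

# Reference logarithm via eigendecomposition, reference inverse via np.linalg.inv
w, V = np.linalg.eigh(rho)
log_ref = (V * np.log(w)) @ V.T
assert np.allclose(series_log(rho), log_ref)
assert np.allclose(series_inv(rho), np.linalg.inv(rho))
```

With the largest eigenvalue of $\hat\rho - \hat I$ at $0.9$ in magnitude, 400 terms put the truncation error far below floating-point tolerance.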
Next, by using Eq. (26), we derive the update equations of QAVB, Eqs. (13) and (14). The functional derivative of Eq. (12) with respect to $\langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle$ under the constraint $\mathrm{Tr}_{\Sigma}\big[\hat\rho^{\Sigma}\big] = 1$ is given by

$$\begin{aligned}
& \frac{\delta}{\delta \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle} \left[ S\left(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right) - \alpha\Big(\mathrm{Tr}_{\Sigma}\big[\hat\rho^{\Sigma}\big] - 1\Big) \right] \\
&= -\int_{\theta \in S^{\theta}} d\theta \int_{\theta' \in S^{\theta}} d\theta'\, \langle\theta|\hat\rho^{\theta}|\theta'\rangle \big[\langle\Sigma'| \otimes \langle\theta'|\big] \ln \hat f(\beta, s) \big[|\Sigma\rangle \otimes |\theta\rangle\big] + \langle\Sigma'|\ln \hat\rho^{\Sigma}|\Sigma\rangle - (\alpha - 1)\langle\Sigma'|\hat I^{\Sigma}|\Sigma\rangle \qquad (34) \\
&= -\Big\langle\Sigma'\Big| \mathrm{Tr}_{\theta}\big[\hat\rho^{\theta} \ln \hat f(\beta, s)\big] \Big|\Sigma\Big\rangle + \langle\Sigma'|\ln \hat\rho^{\Sigma}|\Sigma\rangle - (\alpha - 1)\langle\Sigma'|\hat I^{\Sigma}|\Sigma\rangle, \qquad (35)
\end{aligned}$$

where $\alpha$ is a Lagrange multiplier. By solving

$$\frac{\delta}{\delta \langle\Sigma|\hat\rho^{\Sigma}|\Sigma'\rangle} \left[ S\left(\hat\rho^{\Sigma} \otimes \hat\rho^{\theta} \,\middle\|\, \frac{\hat f(\beta, s)}{Z(\beta, s)}\right) - \alpha\Big(\mathrm{Tr}_{\Sigma}\big[\hat\rho^{\Sigma}\big] - 1\Big) \right] = 0, \qquad (36)$$

we obtain

$$\langle\Sigma'|\ln \hat\rho^{\Sigma}|\Sigma\rangle = \Big\langle\Sigma'\Big| \mathrm{Tr}_{\theta}\big[\hat\rho^{\theta} \ln \hat f(\beta, s)\big] \Big|\Sigma\Big\rangle + (\alpha - 1)\langle\Sigma'|\hat I^{\Sigma}|\Sigma\rangle. \qquad (37)$$

Taking into account that $|\Sigma\rangle$ and $\langle\Sigma'|$ are arbitrary vectors, we obtain

$$\ln \hat\rho^{\Sigma} = \mathrm{Tr}_{\theta}\big[\hat\rho^{\theta} \ln \hat f(\beta, s)\big] + (\alpha - 1)\hat I^{\Sigma}. \qquad (38)$$

Hence we have the update equation for $\Sigma$, Eq. (13), where $\alpha$ contributes as a normalization factor. On the other hand, by using the same procedure, we obtain the update equation for $\theta$, Eq. (14).
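The key identity Eq. (26) itself admits a direct finite-difference check. In the sketch below, $\hat\rho$ and $\hat X = \hat\rho^2 + \hat\rho$ are arbitrary commuting test matrices (illustrative choices, not from the Letter), and the logarithm of the slightly non-symmetric perturbed matrix is evaluated with the series of Eq. (27), which still converges for a small perturbation.

```python
import numpy as np

rng = np.random.default_rng(4)
# Arbitrary symmetric test density matrix with spectrum inside (0, 2)
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
rho = Q @ np.diag([0.4, 0.3, 0.2, 0.1]) @ Q.T
X = rho @ rho + rho  # commutes with rho by construction

def series_log(M, n_terms=500):
    """ln M via the series of Eq. (27); converges while the spectrum of M - I stays in the unit disk."""
    out, P = np.zeros_like(M), np.eye(len(M))
    D = M - np.eye(len(M))
    for n in range(1, n_terms + 1):
        P = P @ D
        out += ((-1) ** (n + 1) / n) * P
    return out

# Left-hand side of Eq. (26): numerical derivative of Tr[X ln rho] with respect
# to the single matrix element <Sigma| rho |Sigma'> = rho[i, j], at fixed X
i, j, eps = 0, 2, 1e-5
E = np.zeros((4, 4))
E[i, j] = 1.0
num = (np.trace(X @ series_log(rho + eps * E))
       - np.trace(X @ series_log(rho - eps * E))) / (2 * eps)
# Right-hand side of Eq. (26): <Sigma'| X rho^{-1} |Sigma>
ana = (X @ np.linalg.inv(rho))[j, i]
assert np.isclose(num, ana)
```

Note that the derivative is taken at fixed $\hat X$, exactly as in the functional derivatives leading to Eqs. (34)–(38); the agreement illustrates why the commutator condition $[\hat X, \hat\rho] = 0$ is needed in the proof above.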