Abstract-The Gauss-Newton algorithm is often used to minimize a nonlinear least-squares loss function instead of the original Newton-Raphson algorithm. The main reason is that only first-order derivatives are needed to construct the Jacobian matrix. Some applications, such as multivariable system identification, give rise to weighted nonlinear least-squares problems for which it can become quite hard to obtain an analytical expression of the Jacobian matrix. To overcome that difficulty, a pseudo-Jacobian matrix is introduced, which leaves the stationary points untouched and can be calculated analytically. Moreover, by slightly changing the pseudo-Jacobian matrix, a better approximation of the Hessian can be obtained, resulting in faster convergence.
I. INTRODUCTION

THIS paper is devoted to optimization problems in which the loss function is of the form

$$Z(p) = \sum_{k=1}^{K} e_k^H C_{e_k}^{-1} e_k = f^H C_f^{-1} f \tag{1}$$

where $e_k = e_k(p) \in \mathbb{C}^{R}$ and $C_{e_k} = C_{e_k}(p) \in \mathbb{C}^{R \times R}$ are analytically known complex-valued functions of the real-valued parameter vector $p \in \mathbb{R}^{P}$. Moreover, the weighting matrices $C_{e_k}$ are assumed to be positive-definite Hermitian-symmetric matrices. Hence, $f = f(p) \in \mathbb{C}^{M}$ and $C_f = C_f(p) \in \mathbb{C}^{M \times M}$ with $M = RK$ are given by

$$f = [e_1^T, e_2^T, \ldots, e_K^T]^T \tag{2}$$

$$C_f = \mathrm{block\,diag}(C_{e_1}, C_{e_2}, \ldots, C_{e_K}). \tag{3}$$

This unconstrained optimization problem can be solved by means of methods that deal with general problems of this class [1], [2]. However, the general form of (1) suggests that a superior approach could be possible. One such method that takes advantage of the form of the loss function is the Gauss-Newton (GN) method. To apply the GN method, we have to rewrite (1) in the classical form of a so-called nonlinear least-squares problem

$$Z(p) = \sum_{m=1}^{M} |r_m|^2 = r^H r \tag{4}$$

with

$$r = C_f^{-1/2} f \tag{5}$$

where $r = r(p) \in \mathbb{C}^{M}$ is the residual vector. Notice that $C_f^{-1/2}$ in (5) is not unique. Because the weighting matrices $\{C_{e_k}\}$ are Hermitian-symmetric positive-definite matrices, Cholesky decompositions can be used, for instance, to calculate $C_f^{-1/2}$. If $g = g(p) \in \mathbb{R}^{P}$ is the gradient of $Z(p)$ and $H = H(p) \in \mathbb{R}^{P \times P}$ its Hessian matrix, then we have

$$g = \frac{\partial Z}{\partial p} = 2\,\mathrm{Re}(J^H r) \tag{6}$$

and

$$H = \frac{\partial^2 Z}{\partial p^2} = 2\,\mathrm{Re}(J^H J) + 2\,\mathrm{Re}\Big(\sum_{m=1}^{M} \bar{r}_m \frac{\partial^2 r_m}{\partial p^2}\Big) \tag{7}$$

where

$$J = \frac{\partial r}{\partial p} \tag{8}$$

is the $M \times P$ Jacobian matrix. The superscripts $T$ and $H$ stand for, respectively, the transpose and the Hermitian-transpose operator, whereas an overline is used to denote the complex conjugate. Since $r$ is being minimized in the least-squares sense, it is often the case that the components $r_m$ are small. This suggests that a good approximation of the Hessian in the neighborhood of the minimum might be obtained by neglecting the final term in (7) to give

$$H \approx 2\,\mathrm{Re}(J^H J). \tag{9}$$

It is in this way that the structure of (4) can be taken into account to improve the performance of first-order derivative methods. Whereas a quasi-Newton method might take $P$ iterations to estimate $H$ satisfactorily, here, an approximation is immediately available. Thus, there is the possibility of a more rapid convergence [2]. Using the derivative expressions (6) and (9), the $n$th iteration of the basic GN method is therefore

$$\text{(a) solve } 2\,\mathrm{Re}(J_n^H J_n)\, d_n = -2\,\mathrm{Re}(J_n^H r_n) \text{ for } d_n, \qquad \text{(b) set } p_{n+1} = p_n + d_n \tag{10}$$

where $J_n = J(p_n)$ and $r_n = r(p_n)$. The previous two steps are repeated until $p_{n+1}$ converges to a stationary point, which will be denoted by $p_\infty$ in the sequel. When the residuals are zero in the optimum (i.e., $r_\infty = r(p_\infty) = 0$), as in well-determined problems for instance (i.e., $M = P$), then the terms $\bar{r}_m\, \partial^2 r_m / \partial p^2$ in (7) are zero, resulting in second-order convergence of (10). For overdetermined problems, however, it is more usual that $r_\infty$ is nonzero, in which case the order of convergence is usually no better than linear. One area in which such problems arise is in system identification, where $M$ is often much larger than $P$.

The present paper is organized as follows. In Section II, the GN algorithm will be applied to (1). However, in the multivariable case (i.e., when $R > 1$), the analytical calculation of the Jacobian matrix will turn out to be too difficult in general. To avoid that problem, the notion of a pseudo-Jacobian matrix will be introduced. The pseudo-Jacobian matrix is not only much easier to compute analytically, but, what is more, it still yields the same stationary points. Therefore, replacing the Jacobian matrix by its pseudo one simplifies the problem without affecting the final results. This algorithm will be called the pseudo-Gauss-Newton (PGN) method. By taking a close look at the resulting pseudo Hessian, it will be clear that by simply multiplying one of the pseudo-Jacobian terms by two, a better approximation of the Hessian is possible than with the classical GN method. However, this improved PGN (IPGN) method will require two Jacobian matrices: one to compute the gradient exactly and another to obtain a better approximation of the Hessian. Eventually, when $f$ is a linear function of $p$, it will be shown that only a small additional effort is required to implement the full Newton-Raphson (NR) method. In Section IV, a well-conditioned implementation of the PGN and the IPGN method is given. By way of illustration, the different schemes will be applied to a typical application of optimization: system identification. Finally, in Section VI, conclusions will be drawn.

Manuscript received December 16, 1994; revised March 20, 1996. This work was supported by the Belgian National Fund for Scientific Research (NFWO), the Flemish government (GOA-IMMI), and the Belgian government as a part of the Belgian programme on Inter-University Poles of Attraction (IUAP-50) initiated by the Belgian State, Prime Minister's Office, Science Policy Programming. The associate editor coordinating the review of this paper and approving it for publication was Dr. Andreas Spanias. The authors are with Vrije Universiteit Brussel, Department of Fundamental Electricity and Instrumentation (ELEC), B-1050 Brussel, Belgium (e-mail: paguilla@vnet3.vub.ac.be). Publisher Item Identifier S 1053-587X(96)06665-2.

II. THE PSEUDO-JACOBIAN MATRIX

The Jacobian (8) can be written as

$$J = [j_1, j_2, \ldots, j_P] \tag{11}$$

with

$$j_i = \frac{\partial r}{\partial p_i} = C_f^{-1/2} \frac{\partial f}{\partial p_i} + \frac{\partial C_f^{-1/2}}{\partial p_i} f. \tag{12}$$

For multivariable systems, the analytical calculation of $\partial C_f^{-1/2} / \partial p_i$ is difficult in general. An alternative is to use finite differences, but this approach is time consuming and not always very accurate. To avoid that problem, we will substitute the Jacobian $J$ with a pseudo-Jacobian matrix $J_+$. This pseudo-Jacobian matrix should have, if possible, the following properties:
1) It should have invariance of the solution, i.e., replacing the Jacobian in the GN equations by the pseudo-Jacobian matrix should yield the same minimum.
2) The pseudo-Jacobian matrix should be easier to compute analytically than the original Jacobian.

The stationary point is completely determined by

$$2\,\mathrm{Re}(J^H r)\big|_{p = p_\infty} = 0 \tag{13}$$

while the path that is followed during the optimization as well as the convergence rate depend on the quality of the Hessian approximation. Examining the $i$th entry of the vector $\mathrm{Re}(J^H r)$

$$\mathrm{Re}(j_i^H r) = \mathrm{Re}\Big( \frac{\partial f^H}{\partial p_i} C_f^{-1} f + f^H \frac{\partial C_f^{-H/2}}{\partial p_i} C_f^{-1/2} f \Big) \tag{14}$$

we observe that $(\partial C_f^{-H/2} / \partial p_i)\, C_f^{-1/2}$ appears in a Hermitian quadratic form. Hence, only its Hermitian-symmetric part gives a nonzero contribution to the real part

$$\mathrm{Re}(x^H S x) = x^H\, \mathrm{herm}(S)\, x \tag{15}$$

where $\mathrm{herm}(S) = (S + S^H)/2$.

Lemma 1: Let $S = S(p) \in \mathbb{C}^{R \times R}$ be a Hermitian-symmetric positive-definite matrix and $p \in \mathbb{R}^{P}$ a real-valued vector; then

$$\mathrm{herm}\Big( \frac{\partial S^{-H/2}}{\partial p_i} S^{-1/2} \Big) = \frac{1}{2} \frac{\partial S^{-1}}{\partial p_i} = -\frac{1}{2} S^{-1} \frac{\partial S}{\partial p_i} S^{-1}. \tag{16}$$

Proof: Taking the derivative with respect to $p_i$ of both sides of $S^{-H/2}(p)\, S^{-1/2}(p) = S^{-1}(p)$ gives

$$\frac{\partial S^{-H/2}}{\partial p_i} S^{-1/2} + S^{-H/2} \frac{\partial S^{-1/2}}{\partial p_i} = \frac{\partial S^{-1}}{\partial p_i} \tag{17}$$

which proves Lemma 1. □

Lemma 1 and (15) are just what we need to simplify (14). After substitution, one obtains

$$\mathrm{Re}(j_i^H r) = \mathrm{Re}\Big( \frac{\partial f^H}{\partial p_i} C_f^{-1} f \Big) - \frac{1}{2} f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} f. \tag{18}$$

Definition 1: The pseudo-Jacobian matrix is denoted by $J_+$ and defined as

$$J_+ = [j_{+1}, j_{+2}, \ldots, j_{+P}] \tag{19}$$

with

$$j_{+i} = C_f^{-1/2} \frac{\partial f}{\partial p_i} - \frac{1}{2} C_f^{-1/2} \frac{\partial C_f}{\partial p_i} C_f^{-1} f. \tag{20}$$
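As a numerical sanity check of the pseudo-Jacobian idea, the following minimal Python sketch uses a hypothetical two-dimensional toy problem with a single real parameter; the functions `model`, `loss`, and `pseudo_gradient` and all numerical values are illustrative assumptions, not taken from the paper. It assumes the pseudo-Jacobian column has the form $j_{+} = C_f^{-1/2}(\partial f/\partial p - \tfrac{1}{2}(\partial C_f/\partial p)\, C_f^{-1} f)$, so that $j_{+}^H r = u^H C_f^{-1} f$ with $u = \partial f/\partial p - \tfrac{1}{2}(\partial C_f/\partial p)\, C_f^{-1} f$. This avoids forming $C_f^{-1/2}$ explicitly. The resulting gradient is compared with a central finite difference of $Z(p)$.

```python
def solve2(C, b):
    """Solve the 2x2 system C x = b by Cramer's rule (entries may be complex)."""
    (a, c), (d, e) = C
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def hdot(x, y):
    """Hermitian inner product x^H y."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))


def model(p):
    """Hypothetical toy problem: f(p), df/dp, C_f(p) and dC_f/dp."""
    f = [(1 + 1j) * p - 1.0, 2j * p + 0.5]
    df = [1 + 1j, 2j]
    C = [[2.0 + p * p, 0.5j * p], [-0.5j * p, 3.0]]  # Hermitian, positive definite
    dC = [[2.0 * p, 0.5j], [-0.5j, 0.0]]
    return f, df, C, dC


def loss(p):
    """Weighted least-squares loss Z(p) = f^H C_f^{-1} f."""
    f, _, C, _ = model(p)
    return hdot(f, solve2(C, f)).real


def pseudo_gradient(p):
    """Gradient 2 Re(j_+^H r), evaluated as 2 Re(u^H C_f^{-1} f)."""
    f, df, C, dC = model(p)
    w = solve2(C, f)  # w = C_f^{-1} f
    Cw = [dC[0][0] * w[0] + dC[0][1] * w[1], dC[1][0] * w[0] + dC[1][1] * w[1]]
    u = [df[i] - 0.5 * Cw[i] for i in range(2)]
    return 2.0 * hdot(u, w).real


p, h = 0.3, 1e-6
g_fd = (loss(p + h) - loss(p - h)) / (2.0 * h)  # central finite difference
print(pseudo_gradient(p), g_fd)
```

The two printed values should agree up to the finite-difference error, which illustrates that only the derivatives of $f$ and $C_f$ are needed to obtain the gradient.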
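Theorem 1: The gradient $g$ of the loss function can be calculated by replacing the Jacobian matrix $J$ in (6) by the pseudo-Jacobian matrix $J_+$

$$g = 2\,\mathrm{Re}(J_+^H r). \tag{21}$$

Proof: Using (5) and (20) and noticing that $C_f^{-H/2} C_f^{-1/2} = C_f^{-1}$ (i.e., the Cholesky decomposition of $C_f^{-1}$), one obtains

$$\mathrm{Re}(j_{+i}^H r) = \mathrm{Re}\Big( \frac{\partial f^H}{\partial p_i} C_f^{-1} f \Big) - \frac{1}{2} f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} f \tag{22}$$

which equals (18). Hence, $g = 2\,\mathrm{Re}(J^H r) = 2\,\mathrm{Re}(J_+^H r)$. □

In (20), only the derivatives of $f$ and $C_f$ occur, which are both analytically known functions of $p$. Therefore, these derivatives can be easily determined without the need of finite-difference approximations. Thus, the pseudo-Jacobian matrix $J_+$ satisfies our two primary objectives: First, the pseudo-Jacobian matrix can be computed analytically, which in the case of the original Jacobian matrix is a very difficult (or even impossible) task, and second, the same stationary points are still obtained.

In general (i.e., $R > 1$), the matrix $C_f$ is block diagonal ($K$ blocks of dimension $R \times R$) and $J \neq J_+$. However, when the equation error $f$ is small, $J \approx J_+$.

Lemma 3: When the equation error vector $f$ is linear in the parameters $p$, the Hessian $H = [H_{ij}]$ becomes

$$H_{ij} = 2\,\mathrm{Re}\Big( \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} - \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f - f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} + f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f \Big) - f^H C_f^{-1} \frac{\partial^2 C_f}{\partial p_i \partial p_j} C_f^{-1} f. \tag{23}$$

Proof: Result (23) is easily obtained starting from (1) and noticing that $\mathrm{Re}(x^H (S + S^H) x) = 2\,\mathrm{Re}(x^H S x)$ when $S$ is Hermitian symmetric. □

On the other hand, the $(i,j)$th entry of $J_+^H J_+$ equals

$$j_{+i}^H j_{+j} = \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} - \frac{1}{2} \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f - \frac{1}{2} f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} + \frac{1}{4} f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f. \tag{24}$$

Comparing (24) with (23), one notices that both expressions are quite similar and that only one term is missing in (24), i.e., the term containing the second-order derivative of $C_f$. The first terms of (23) and (24) are the same, the second and third ones are equal up to a factor 2, and the fourth one is equal up to a factor 4. Remember that for scalar systems, the pseudo-Jacobian matrix is exactly equal to the Jacobian. Hence, the GN method does not correctly model all terms of the Hessian proportional to $f$ and $f^H f$. An improved approximation of the Hessian is readily obtained by slightly modifying the pseudo-Jacobian matrix.

Definition 2: The improved pseudo-Jacobian matrix is denoted by $J_{++}$ and defined as $J_{++} = [j_{++1}, j_{++2}, \ldots, j_{++P}]$ with

$$j_{++i} = C_f^{-1/2} \frac{\partial f}{\partial p_i} - C_f^{-1/2} \frac{\partial C_f}{\partial p_i} C_f^{-1} f. \tag{25}$$

Notice that the modification consists only in a multiplication by 2 of the last term of (20).

Theorem 2: All terms of the Hessian (23) except the last one (containing the second-order derivative of $C_f$) are given by $2\,\mathrm{Re}(J_{++}^H J_{++})$ when the equation error $f$ is linear in the parameters.

Proof: The $(i,j)$th entry of $J_{++}^H J_{++}$ equals

$$j_{++i}^H j_{++j} = \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} - \frac{\partial f^H}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f - f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial f}{\partial p_j} + f^H C_f^{-1} \frac{\partial C_f}{\partial p_i} C_f^{-1} \frac{\partial C_f}{\partial p_j} C_f^{-1} f. \tag{26}$$

□

Consequently, to implement the full Newton-Raphson (NR) method, one term has to be added to (26), requiring the second-order derivatives of $C_f$ (where $C_f$ is an analytically known function of $p$). Thus, only a relatively small extra effort is needed to implement the full NR method. However, this requires the explicit formation of the normal equations, which can give rise to a badly conditioned problem. Contrary to the NR method, the IPGN method is still based on the use of Jacobian matrices, allowing for an effective implementation, as will be demonstrated in the next section. Moreover, even when $f$ is a nonlinear function of $p$, (26) will usually still be a better approximation of the Hessian than (24), as a larger number of terms are included.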
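The relation between the different Hessian approximations can be illustrated numerically. The sketch below again assumes a hypothetical toy problem with one real parameter and an equation error that is linear in $p$; `model`, `loss`, `hessians`, and all numbers are illustrative assumptions. With $w = C_f^{-1} f$, $u = \partial f/\partial p - \tfrac{1}{2}(\partial C_f/\partial p)\, w$ and $v = \partial f/\partial p - (\partial C_f/\partial p)\, w$, the PGN approximation is $2\,\mathrm{Re}(u^H C_f^{-1} u)$, the IPGN approximation is $2\,\mathrm{Re}(v^H C_f^{-1} v)$, and the full Hessian subtracts $w^H (\partial^2 C_f/\partial p^2)\, w$ from the IPGN value. The full Hessian is checked against a second-order finite difference of the loss.

```python
def solve2(C, b):
    """Solve the 2x2 system C x = b by Cramer's rule (entries may be complex)."""
    (a, c), (d, e) = C
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def matvec(A, x):
    """2x2 matrix-vector product."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]


def hdot(x, y):
    """Hermitian inner product x^H y."""
    return sum(xi.conjugate() * yi for xi, yi in zip(x, y))


def model(p):
    """Hypothetical toy problem with f linear in p and C_f quadratic in p."""
    f = [(1 + 1j) * p - 1.0, 2j * p + 0.5]
    df = [1 + 1j, 2j]
    C = [[2.0 + p * p, 0.5j * p], [-0.5j * p, 3.0]]
    dC = [[2.0 * p, 0.5j], [-0.5j, 0.0]]
    d2C = [[2.0, 0.0], [0.0, 0.0]]
    return f, df, C, dC, d2C


def loss(p):
    """Weighted least-squares loss Z(p) = f^H C_f^{-1} f."""
    f, _, C, _, _ = model(p)
    return hdot(f, solve2(C, f)).real


def hessians(p):
    """Return the PGN, IPGN and full (Newton) Hessians for scalar p."""
    f, df, C, dC, d2C = model(p)
    w = solve2(C, f)
    Cw = matvec(dC, w)
    u = [df[i] - 0.5 * Cw[i] for i in range(2)]  # pseudo-Jacobian direction
    v = [df[i] - Cw[i] for i in range(2)]        # improved pseudo-Jacobian direction
    h_pgn = 2.0 * hdot(u, solve2(C, u)).real
    h_ipgn = 2.0 * hdot(v, solve2(C, v)).real
    h_newton = h_ipgn - hdot(w, matvec(d2C, w)).real
    return h_pgn, h_ipgn, h_newton


p, h = 0.3, 1e-4
h_fd = (loss(p + h) - 2.0 * loss(p) + loss(p - h)) / (h * h)
h_pgn, h_ipgn, h_newton = hessians(p)
print(h_pgn, h_ipgn, h_newton, h_fd)
```

In this sketch the IPGN value differs from the full Hessian only by the term containing the second-order derivative of $C_f$, whereas the PGN value also mis-scales the cross terms.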
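The explicit solution of the normal equations

$$d = -[\mathrm{Re}(J_+^H J_+)]^{-1}\, \mathrm{Re}(J_+^H r) \tag{27}$$

should be avoided for conditioning reasons (see [5]). This can be done by using the singular value decomposition (SVD)

$$\begin{bmatrix} \mathrm{Re}\, J_+ \\ \mathrm{Im}\, J_+ \end{bmatrix} = U \Sigma V^T \tag{28}$$

where $U \in \mathbb{R}^{2M \times P}$ has orthonormal columns, $V \in \mathbb{R}^{P \times P}$ is orthogonal, and $\Sigma \in \mathbb{R}^{P \times P}$ is diagonal [5]. Hence, (27) becomes

$$d = -V \Sigma^{-1} U^T \begin{bmatrix} \mathrm{Re}\, r \\ \mathrm{Im}\, r \end{bmatrix}. \tag{29}$$

This approach can be applied to the GN and the PGN method but not to the IPGN method because two different Jacobian matrices are present. One possible approach consists in using the generalized SVD (GSVD). The GSVD of $[\mathrm{Re}\, J_+^T, \mathrm{Im}\, J_+^T]^T$ and $[\mathrm{Re}\, J_{++}^T, \mathrm{Im}\, J_{++}^T]^T$ becomes [5]-[7]

$$\begin{bmatrix} \mathrm{Re}\, J_+ \\ \mathrm{Im}\, J_+ \end{bmatrix} = U_+ \Sigma_+ X^{-1}, \qquad \begin{bmatrix} \mathrm{Re}\, J_{++} \\ \mathrm{Im}\, J_{++} \end{bmatrix} = U_{++} \Sigma_{++} X^{-1}. \tag{30}$$

Consequently, $d = -[\mathrm{Re}(J_{++}^H J_{++})]^{-1}\, \mathrm{Re}(J_+^H r)$ reduces to

$$d = -X \Sigma_{++}^{-2} \Sigma_+ U_+^T \begin{bmatrix} \mathrm{Re}\, r \\ \mathrm{Im}\, r \end{bmatrix}. \tag{31}$$

V. APPLICATION

The equation error vector is given by

$$e(p, \omega_k) = B(p, \omega_k) X(\omega_k) - A(p, \omega_k) Y(\omega_k). \tag{32}$$

The complex-valued vectors $X(\omega_k) \in \mathbb{C}^{S}$ and $Y(\omega_k) \in \mathbb{C}^{R}$ stand for the Fourier coefficients of the $S$ stimulus and $R$ response signals observed. The polynomial matrices $B(p, \omega_k) \in \mathbb{C}^{R \times S}$ and $A(p, \omega_k) \in \mathbb{C}^{R \times R}$ represent, respectively, the numerator and the denominator matrix of a left matrix-fraction description (LMFD) of the transfer function matrix $G(p, \omega_k) \in \mathbb{C}^{R \times S}$ to be identified

$$G(p, \omega_k) = A^{-1}(p, \omega_k)\, B(p, \omega_k) \tag{33}$$

$$B(p, \omega_k) = \sum_{n=0}^{n_b} B_n\, \Omega_n(\omega_k) \tag{34}$$

$$A(p, \omega_k) = \sum_{n=0}^{n_a} A_n\, \Omega_n(\omega_k) \tag{35}$$

with

$$\Omega_n(\omega_k) = \begin{cases} (i\omega_k)^n & \text{(continuous-time model)} \\ (e^{-i\omega_k})^n & \text{(discrete-time model).} \end{cases} \tag{36}$$

The parameters to be estimated $p$ are the unknown real-valued coefficients $b_{rsn}$ and $a_{rsn}$ occurring in (34) and (35). Notice that the equation error vector (32) depends linearly on the parameter vector $p$ and on the input/output Fourier coefficients $\{X(\omega_k), Y(\omega_k)\}_{k=1}^{K}$. The matrix $C_{e_k} = C_e(p, \omega_k) \in \mathbb{C}^{R \times R}$ in (1) equals the covariance matrix of the equation error vector $e(p, \omega_k)$

$$C_e(p, \omega_k) = B C_{XX} B^H + A C_{YY} A^H - 2\,\mathrm{herm}(B C_{XY} A^H) \tag{37}$$

with $C_{XY}(\omega_k) = E\{(X - E X)(Y - E Y)^H\}$ ($E$ stands for the expectation operator). The Fourier coefficients $\{X(\omega_k), Y(\omega_k)\}_{k=1}^{K}$ as well as the covariance matrices $\{C_{XX}(\omega_k), C_{YY}(\omega_k), C_{XY}(\omega_k)\}_{k=1}^{K}$ are derived from measurements [11], [12].

B. An Illustrative Example

The performance of the different optimization schemes will be compared by means of a simple example. Consider, therefore, the following single-input multiple-output system with one input and two outputs having as transfer function matrix

$$G(p) = \begin{bmatrix} p \\ 1 \end{bmatrix}. \tag{38}$$

The real-valued constant $p$ is the unknown parameter to be estimated given the following input/output Fourier coefficients:

$$X = 1, \qquad Y = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \tag{39}$$

and variances

$$C_{XX} = 1, \qquad C_{YY} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \qquad C_{XY} = \begin{bmatrix} 0 & 0 \end{bmatrix}. \tag{40}$$

The equation error and its covariance matrix are

$$f = BX - Y = \begin{bmatrix} p - 1 \\ -1 \end{bmatrix} \tag{41}$$

$$C_f = B C_{XX} B^H + C_{YY} = \begin{bmatrix} p^2 + 1 & p \\ p & 2 \end{bmatrix} \tag{42}$$

so that the loss function becomes

$$Z(p) = \frac{5p^2 - 6p + 3}{p^2 + 2} \tag{43}$$

and attains its minimum for $p = p_\infty = 2/3$ (see Fig. 1). The residual vector follows from the Cholesky decomposition of $C_f$, and the Jacobian, pseudo-Jacobian, and improved pseudo-Jacobian are $2 \times 1$ vectors.

Fig. 1. (Horizontal axis: parameter value.)

Fig. 2. Hessian approximations. NR (solid line), GN (dashed line), PGN (dotted line), and IPGN (dashdot line). (Horizontal axis: parameter value.)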
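The behavior of the different schemes on the illustrative example can be reproduced with a few lines of Python. The sketch below is a minimal illustration under stated assumptions: it takes $f = [p - 1, -1]^T$ and $C_f = [[p^2 + 1, p], [p, 2]]$ (which give the loss $(5p^2 - 6p + 3)/(p^2 + 2)$), avoids the Cholesky factor by working with $C_f^{-1}$ directly, and uses hypothetical helper names (`step`, `run`). The classical GN method is omitted because it requires the true Jacobian.

```python
def solve2(C, b):
    """Solve the real 2x2 system C x = b by Cramer's rule."""
    (a, c), (d, e) = C
    det = a * e - c * d
    return [(e * b[0] - c * b[1]) / det, (a * b[1] - d * b[0]) / det]


def matvec(A, x):
    """2x2 matrix-vector product."""
    return [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]


def dot(x, y):
    """Real inner product."""
    return x[0] * y[0] + x[1] * y[1]


DF = [1.0, 0.0]                   # df/dp (f is linear in p)
D2C = [[2.0, 0.0], [0.0, 0.0]]    # second-order derivative of C_f


def step(p, method):
    """One iteration p -> p + d for method in {"PGN", "IPGN", "NR"}."""
    f = [p - 1.0, -1.0]
    C = [[p * p + 1.0, p], [p, 2.0]]
    dC = [[2.0 * p, 1.0], [1.0, 0.0]]
    w = solve2(C, f)                              # C_f^{-1} f
    Cw = matvec(dC, w)
    u = [DF[i] - 0.5 * Cw[i] for i in range(2)]   # pseudo-Jacobian direction
    v = [DF[i] - Cw[i] for i in range(2)]         # improved pseudo-Jacobian direction
    g = 2.0 * dot(u, w)                           # exact gradient
    if method == "PGN":
        hess = 2.0 * dot(u, solve2(C, u))
    elif method == "IPGN":
        hess = 2.0 * dot(v, solve2(C, v))
    elif method == "NR":
        hess = 2.0 * dot(v, solve2(C, v)) - dot(w, matvec(D2C, w))
    else:
        raise ValueError("unknown method: " + method)
    return p - g / hess


def run(method, p0=0.0, iterations=50):
    """Iterate from the starting value p0 and return the final estimate."""
    p = p0
    for _ in range(iterations):
        p = step(p, method)
    return p


for method in ("PGN", "IPGN", "NR"):
    print(method, run(method))
```

Starting from $p = 0$, all three schemes in this sketch approach $p = 2/3$; printing the error per iteration instead of only the final value shows the slower, oscillating progress of the PGN variant.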
Notwithstanding the difference between the Jacobian and the pseudo-Jacobian, they yield the same gradient

$$g = \frac{dZ}{dp} = \frac{2(3p - 2)(p + 3)}{(p^2 + 2)^2}.$$

In Fig. 2, the Hessian approximations are compared with the exact one. We observe that the Jacobian and the pseudo-Jacobian yield very similar results, whereas the improved pseudo-Jacobian produces a better approximation of the Hessian in the neighborhood of the minimum $p_\infty = 2/3$. Moreover, taking into account the second-order derivative

$$\frac{\partial^2 C_f}{\partial p^2} = \begin{bmatrix} 2 & 0 \\ 0 & 0 \end{bmatrix}$$

the "full" Hessian is readily obtained. In Fig. 3, the absolute parameter error $|p_n - p_\infty|$ is depicted as a function of the iteration index $n$. As a starting value, $p = 0$ has been chosen. One notices that the convergence rate of the IPGN and the NR method is superlinear, whereas the convergence rate of the GN and the PGN algorithm is no better than linear when the residuals are large. Further, it is worthwhile to notice that the Hessian approximations are always positive semidefinite by construction for the different GN implementations, whereas the Hessian used by the NR algorithm can become negative definite. In this example, the GN, the PGN, and the IPGN algorithm converge to the ML solution $p_\infty = 2/3$ for any positive starting value; the NR algorithm diverges to infinity when the starting value is larger than 1.5 (see Fig. 2). Thus, the NR method has a smaller convergence region than the three other methods.

C. A Real Application

In this section, the performances of the different optimization schemes will be compared by means of a real application, i.e., the parametric identification of the two-by-two admittance matrix of a Brüel and Kjær octave filter (type 1613). A detailed description of the measurement setup can be found in [9]. This filter is passive and, thus, reciprocal. In other words, its admittance matrix has to be symmetric. The reciprocity can easily be taken into account by using a scalar matrix-fraction description

$$G(p, \omega_k) = \frac{1}{a(p, \omega_k)} \begin{bmatrix} b_{11}(p, \omega_k) & b_{12}(p, \omega_k) \\ b_{21}(p, \omega_k) & b_{22}(p, \omega_k) \end{bmatrix}$$

with $b_{21}(p, \omega_k) = b_{12}(p, \omega_k)$. By examining the cost function for different model orders as well as the uncertainty on the parameter estimates, the following model structure has been proposed in [9] to describe the dynamical behavior of the Brüel and Kjær octave filter

$$a(p, \omega_k) = (i\omega_k)^5 + a_4 (i\omega_k)^4 + \cdots + a_1 (i\omega_k) + a_0$$
$$b_{11}(p, \omega_k) = b_{115} (i\omega_k)^5 + b_{114} (i\omega_k)^4 + \cdots + b_{111} (i\omega_k)$$
$$b_{12}(p, \omega_k) = b_{124} (i\omega_k)^4 + b_{123} (i\omega_k)^3 + b_{122} (i\omega_k)^2$$
$$b_{21}(p, \omega_k) = b_{12}(p, \omega_k)$$
$$b_{22}(p, \omega_k) = b_{226} (i\omega_k)^6 + b_{225} (i\omega_k)^5 + \cdots + b_{221} (i\omega_k) + b_{220}$$

with $p = [a_4, \ldots, a_0, b_{115}, \ldots, b_{111}, b_{124}, \ldots, b_{220}]^T$. Starting values for $p$ are obtained by solving the linear least-squares problem that results by putting the matrices $\{C_e(p, \omega_k)\}$ in (37) equal to the identity matrix. The convergence rate of the different optimization schemes (with the exception of the Gauss-Newton method, because the Jacobian matrix is now much too complex to be computed analytically) is given in Fig. 4. As there are no modeling errors and the signal-to-noise ratios of the Fourier coefficients are quite large (approximately 60 dB), only a few iterations are required to converge. In Fig. 5, modeling errors are introduced by putting the coefficient $b_{226}$ equal to a wrong value, i.e., $10^{\ldots}$ (its correct value equals $1.266 \times \ldots$). To improve the convergence region of the NR method, we first apply a few iterations of the PGN method (four in this example) before switching to the NR (and the IPGN) method. Fig. 5 shows that the convergence rate of the PGN and the IPGN method is not better than linear. This is due to the large modeling errors: the contribution of the second-order derivatives to the Hessian matrix is no longer negligible. Including these derivatives (i.e., the NR method) results in a quadratic convergence rate. Notice that $C_e(p, \omega_k)$ is quadratic in the model parameters $p$. Consequently, its second-order derivative $\partial^2 C_e(p, \omega_k) / \partial p_i \partial p_j$ is independent of $p$ and, thus, has to be calculated only once during the first iteration.

In general, certainly when dealing with large problems, there is no guarantee at all that the IPGN method will need fewer iterations than the PGN method to converge. It has been shown that by means of a small change of the pseudo-Jacobian matrix, a better approximation of the Hessian is obtained. This results in faster local convergence, i.e., in a close neighborhood of the minimum. Moreover, when the Levenberg-Marquardt approach [1], [2] is used, for instance, the benefit resulting from the additional computations and memory needed to obtain a better approximation can be questioned. Furthermore, using the improved pseudo-Jacobian matrix, it has been shown that by means of a small additional effort, it is possible to implement the full NR method. To improve the convergence region, it is advised to first run the PGN algorithm. Once the parameter updates are sufficiently small (e.g., below 1% of the parameter value), one can apply the full NR method to accelerate the convergence.

Fig. 3. Convergence rate. NR (solid line), GN (dashed line), PGN (dotted line), and IPGN (dashdot line). (Horizontal axis: iteration.)

Fig. 4. Convergence rate in absence of modeling errors. NR (solid line), PGN (dotted line), and IPGN (dashdot line). (Horizontal axis: iteration.)

Fig. 5. Convergence rate with $b_{226}$ fixed to $10^{\ldots}$. NR (solid line), PGN (dotted line), and IPGN (dashdot line). (Horizontal axis: iteration.)

VI. CONCLUSION

In the present paper, we have examined the applicability of the Gauss-Newton method to weighted nonlinear least-squares problems. Such problems often occur in system identification. For multivariable systems, the analytical calculation of the Jacobian matrix can be quite involved. Therefore, a pseudo-Jacobian matrix has been introduced. This matrix is easier to compute but still yields the correct gradient and not an approximation of it.

Next, it has been shown that by means of a small change of the pseudo-Jacobian matrix, a better approximation of the Hessian is readily obtained (in the multivariable as well as scalar case), resulting in faster convergence once we are close to the (local) minimum as long as the second-order derivatives can be neglected. Finally, it has been shown that if the second-order derivatives are not negligible, the full Newton-Raphson method can readily be implemented starting from the improved pseudo-Jacobian matrix, resulting in second-order convergence.

REFERENCES

[1] D. Jacobs, Ed., The State of the Art in Numerical Analysis. London: Academic, 1977.
[2] R. Fletcher, Practical Methods of Optimization, 2nd ed. New York: Wiley, 1991.
[3] P. E. Gill and W. Murray, "Nonlinear least squares and nonlinearly constrained optimization," in Numerical Analysis, Dundee, 1975, A. Dold and B. Eckmann, Eds. Berlin: Springer-Verlag, Lecture Notes in Mathematics, vol. 506, 1976, pp. 134-147.
[4] J. E. Dennis Jr., D. M. Gay, and R. E. Welsch, "An adaptive nonlinear least-squares algorithm," ACM Trans. Math. Software, vol. 7, pp. 348-368, 1981.
[5] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. Baltimore, MD: Johns Hopkins University Press, 1990.
[6] C. C. Paige, "Computing the generalized singular value decomposition," SIAM J. Sci. Stat. Comput., vol. 7, no. 4, pp. 1126-1146, 1986.
[7] Z. Bai and J. Demmel, "Computing the generalized singular value decomposition," SIAM J. Sci. Stat. Comput., vol. 14, no. 6, pp. 1464-1486, 1993.
Rik Pintelon was born in Gent, Belgium, on December 4, 1959. He received the degree of civil electrotechnical-mechanical engineer (burgerlijk ingenieur) in July 1982, the degree of doctor in applied science in January 1988, and the qualification to teach at university level (geaggregeerde voor het hoger onderwijs) in April 1994, all from the Vrije Universiteit Brussel (VUB), Brussels, Belgium. He is presently a Senior Research Associate of the National Fund for Scientific Research (NFWO) and part-time lecturer at the VUB in the Electrical Measurement Department (ELEC). His main research interests are in the fields of parameter estimation/system identification and signal processing.