
SYSTEM IDENTIFICATION

Michele TARAGNA
Dipartimento di Automatica e Informatica
Politecnico di Torino
michele.taragna@polito.it
II level Specializing Master in Automatica and Control Technologies
Class: System Identification, Estimation and Filtering
Academic Year 2010/2011
System identification

System identification is aimed at constructing or selecting mathematical models M of dynamical data generating systems S to serve certain purposes (forecast, diagnostic, control, etc.)

A first step is to determine a class M of models within which the search for the most suitable model is to be conducted

Classes of parametric models M(θ) are often considered, where the parameter vector θ belongs to some set Θ, i.e., M = {M(θ) : θ ∈ Θ}
⇒ the choice problem is tackled as a parametric estimation problem

We start by discussing two model classes for linear time-invariant (LTI) systems:
- transfer-function models
- state-space models
Transfer-function models
The transfer-function models, known also as black-box or Box-Jenkins models,
involve external variables only (i.e., input and output variables) and do not require
any auxiliary variable
Different structures of transfer-function models are available:
- equation error or ARX model structure
- ARMAX model structure
- output error (OE) model structure
Equation error or ARX model structure
The input-output relationship is a linear difference equation:
y(t) + a_1 y(t-1) + a_2 y(t-2) + ... + a_{n_a} y(t-n_a) = b_1 u(t-1) + ... + b_{n_b} u(t-n_b) + e(t)
where the white-noise term e(t) enters as a direct error
Let us denote by z^-1 the unitary delay operator, such that z^-1 y(t) = y(t-1), z^-2 y(t) = y(t-2), etc., and introduce the polynomials:
A(z) = 1 + a_1 z^-1 + a_2 z^-2 + ... + a_{n_a} z^-n_a
B(z) = b_1 z^-1 + b_2 z^-2 + ... + b_{n_b} z^-n_b
then, the above input-output relationship can be written as:
A(z) y(t) = B(z) u(t) + e(t)
⇔ y(t) = (B(z)/A(z)) u(t) + (1/A(z)) e(t) = G(z) u(t) + H(z) e(t)
where G(z) = B(z)/A(z), H(z) = 1/A(z)
If the input u(·), also known as exogenous variable, is present, then the model:
A(z) y(t) = B(z) u(t) + e(t)
contains the autoregressive (AR) part A(z) y(t) and the exogenous (X) part B(z) u(t).
The integers n_a and n_b are the orders of these two parts of the ARX model, denoted as ARX(n_a, n_b)

If n_a = 0, then A(z) = 1 and y(t) is modeled as a finite impulse response (FIR)

If the input u(·) is missing, then the model:
A(z) y(t) = e(t)
contains only the autoregressive (AR) part A(z) y(t).
The integer n_a is the order of the resulting AR model, denoted as AR(n_a)
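As a small illustration (not part of the original slides), the following Python sketch simulates data from a hypothetical ARX(2,1) model, showing how A(z) and B(z) act as a difference equation; the coefficient values and noise level are arbitrary assumptions.

```python
import numpy as np

# Hypothetical ARX(2,1) model, coefficients chosen only for illustration
a = np.array([-1.2, 0.5])   # a_1, a_2 of A(z) = 1 + a_1 z^-1 + a_2 z^-2
b = np.array([0.8])         # b_1 of B(z) = b_1 z^-1
sigma_e = 0.1               # standard deviation of the white noise e(t)

N = 500
rng = np.random.default_rng(0)
u = rng.standard_normal(N)              # exciting input (white noise)
e = sigma_e * rng.standard_normal(N)    # equation error
y = np.zeros(N)

# Difference equation: y(t) = -a_1 y(t-1) - a_2 y(t-2) + b_1 u(t-1) + e(t)
for t in range(2, N):
    y[t] = -a[0]*y[t-1] - a[1]*y[t-2] + b[0]*u[t-1] + e[t]
```

The same u, y pair is reused in the later least-squares sketches.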
ARMAX model structure
The input-output relationship is a linear difference equation:
y(t) + a_1 y(t-1) + a_2 y(t-2) + ... + a_{n_a} y(t-n_a) =
    = b_1 u(t-1) + ... + b_{n_b} u(t-n_b) + e(t) + c_1 e(t-1) + ... + c_{n_c} e(t-n_c)
where the white-noise term e(t) enters as a linear combination of n_c + 1 samples
By introducing the polynomials:
A(z) = 1 + a_1 z^-1 + a_2 z^-2 + ... + a_{n_a} z^-n_a
B(z) = b_1 z^-1 + b_2 z^-2 + ... + b_{n_b} z^-n_b
C(z) = 1 + c_1 z^-1 + c_2 z^-2 + ... + c_{n_c} z^-n_c
the above input-output relationship can be written as:
A(z) y(t) = B(z) u(t) + C(z) e(t)
⇔ y(t) = (B(z)/A(z)) u(t) + (C(z)/A(z)) e(t) = G(z) u(t) + H(z) e(t)
where G(z) = B(z)/A(z), H(z) = C(z)/A(z)
If the exogenous variable u(·) is present, then the model:
A(z) y(t) = B(z) u(t) + C(z) e(t)
contains the autoregressive (AR) part A(z) y(t), the exogenous (X) part B(z) u(t) and the moving average (MA) part C(z) e(t), which is a colored noise instead of a white one.
The integers n_a, n_b and n_c are the orders of these three parts of the ARMAX model, denoted as ARMAX(n_a, n_b, n_c)

If the input u(·) is missing, then the model:
A(z) y(t) = C(z) e(t)
contains only the autoregressive (AR) part A(z) y(t) and the moving average (MA) part C(z) e(t).
The integers n_a and n_c are the orders of the resulting ARMA model, denoted as ARMA(n_a, n_c)
Output error or OE model structures
The relationship between input and undisturbed output is a linear difference equation:
w(t) + f_1 w(t-1) + ... + f_{n_f} w(t-n_f) = b_1 u(t-1) + ... + b_{n_b} u(t-n_b)
and the model output is corrupted by white measurement noise:
y(t) = w(t) + e(t)
By introducing the polynomials:
F(z) = 1 + f_1 z^-1 + f_2 z^-2 + ... + f_{n_f} z^-n_f
B(z) = b_1 z^-1 + b_2 z^-2 + ... + b_{n_b} z^-n_b
the above input-undisturbed output relationship can be written as:
F(z) w(t) = B(z) u(t)
⇔ y(t) = w(t) + e(t) = (B(z)/F(z)) u(t) + e(t) = G(z) u(t) + e(t)
where G(z) = B(z)/F(z)
The integers n_b and n_f are the orders of the OE model, denoted as OE(n_b, n_f)
State-space models
The discrete-time, linear time-invariant model M is described by:
M:  x(t+1) = A x(t) + B u(t) + v_1(t)
    y(t) = C x(t) + v_2(t),    t = 1, 2, ...
where:
- x(t) ∈ R^n, y(t) ∈ R^q, u(t) ∈ R^p, v_1(t) ∈ R^n, v_2(t) ∈ R^q
- the process noise v_1(t) and the measurement noise v_2(t) are uncorrelated white noises with zero mean value, i.e.:
  v_1(t) ~ WN(0, V_1) with V_1 ∈ R^{n×n},  v_2(t) ~ WN(0, V_2) with V_2 ∈ R^{q×q}
- A ∈ R^{n×n} is the state matrix, B ∈ R^{n×p} is the input matrix, C ∈ R^{q×n} is the output matrix
The transfer matrix between the exogenous input u and the output y is:
G(z) = C (z I_n - A)^-1 B
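A small numerical sketch (an illustration, not from the slides) of the transfer-matrix formula: it evaluates G(z) = C (z I_n - A)^-1 B at a few points on the unit circle for an arbitrary example system; the matrices are assumptions chosen only to make the code runnable.

```python
import numpy as np

def transfer_matrix(A, B, C, z):
    """Evaluate G(z) = C (z*I - A)^(-1) B at a complex frequency z."""
    n = A.shape[0]
    return C @ np.linalg.solve(z * np.eye(n) - A, B)

# Arbitrary example system: n = 2 states, p = 1 input, q = 1 output
A = np.array([[0.9, 0.1],
              [0.0, 0.7]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])

# Frequency response samples on the unit circle, z = e^{j*omega}
for omega in (0.0, 0.5, 1.0):
    print(omega, transfer_matrix(A, B, C, np.exp(1j * omega)))
```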
The system identification procedure
The system identification problem may be solved using an iterative approach:
1. Collect the data set
   - If possible, design the experiment so that the data become maximally informative
   - If useful and/or necessary, apply some prefiltering technique to the data
2. Choose the model set or the model structure, so that it is suitable for the aims
   - A physical model with some unknown parameters may be constructed by exploiting the possible a priori knowledge and insight
   - Otherwise, a black-box model may be employed, whose parameters are simply tuned to fit the data, without reference to the physical background
   - Otherwise, a gray-box model may be used, with adjustable parameters having physical interpretation
3. Determine the suitable complexity level of the model set or model structure
4. Tune the parameters to pick the best model in the set, guided by the data
5. Perform a model validation test: if the model is OK, then use it, otherwise revise it
The predictive approach
Let us consider a class M of parametric models M(θ):
M = {M(θ) : θ ∈ Θ}
where the parameter vector θ belongs to some set Θ
The data are the measurements collected at the time instants t from 1 to N:
- of the variable y(t), in the case of time series
- of the input u(t) and the output y(t), in the case of input-output systems
Given a model M(θ), a corresponding predictor M̂(θ) can be associated that provides the optimal one-step prediction ŷ(t+1|t) of y(t+1) on the basis of the data, i.e.:
- in the case of time series, the predictor is given by:
  M̂(θ): ŷ(t+1) = ŷ(t+1|t) = f(y^t, θ)
- in the case of input-output systems, the predictor is given by:
  M̂(θ): ŷ(t+1) = ŷ(t+1|t) = f(u^t, y^t, θ)
with y^t = {y(t), y(t-1), y(t-2), ..., y(1)}, u^t = {u(t), u(t-1), u(t-2), ..., u(1)}
Given a model M(θ) with a fixed value of the parameter vector θ, the prediction error at the time instant t+1 is given by:
ε(t+1) = y(t+1) - ŷ(t+1|t)
and the overall mean-square error (MSE) is defined as:
J_N(θ) = (1/N) Σ_{t=τ}^N ε(t)^2
where τ is the first time instant at which the prediction ŷ(τ|τ-1) of y(τ) can be computed from the data (τ = 1 is often assumed)
In the predictive approach to system identification, the parameters of the model M(θ) in the class M are tuned to minimize the criterion J_N(θ) over all θ ∈ Θ, i.e.:
θ̂_N = arg min_{θ∈Θ} J_N(θ)
If the model quality is high, the prediction error has to be white, i.e., without its own dynamics, since the dynamics contained in the data has to be explained by the model
⇒ many different whiteness tests can be performed on the sequence ε(t)
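To make the criterion concrete, here is a minimal sketch (an assumption, not part of the slides) that evaluates J_N(θ) for an ARX(1,1) predictor on simulated data and minimizes it numerically; for ARX models the same minimum is obtained in closed form by least squares, as shown later.

```python
import numpy as np
from scipy.optimize import minimize

def J_N(theta, u, y):
    """Mean-square one-step prediction error for an ARX(1,1) predictor
    yhat(t) = -a1*y(t-1) + b1*u(t-1), with theta = [a1, b1]."""
    a1, b1 = theta
    yhat = -a1 * y[:-1] + b1 * u[:-1]   # predictions for t = 2..N
    eps = y[1:] - yhat                  # prediction errors
    return np.mean(eps**2)

# Simulated data from y(t) = 0.8 y(t-1) + 0.5 u(t-1) + e(t), i.e. a1 = -0.8
rng = np.random.default_rng(1)
u = rng.standard_normal(300)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = 0.8*y[t-1] + 0.5*u[t-1] + 0.1*rng.standard_normal()

theta_N = minimize(J_N, x0=[0.0, 0.0], args=(u, y)).x
print(theta_N)   # expected to be close to [-0.8, 0.5]
```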
Models in predictor form
Let us consider the transfer-function model
M(θ): y(t) = G(z) u(t) + H(z) e(t)
where e(t) is a white noise with zero mean value
The term v(t) = H(z) e(t) is called residual and has to be small, so that the model M(θ) could satisfactorily describe the input-output relationship of a given system S
It is typically assumed that v(t) is a stationary process, i.e., a sequence of random variables whose joint probability distribution does not change over time
⇒ the following assumptions can be made, leading to the canonical representation of v(t):
1. H(z) is the ratio of two polynomials with the same degree that are:
   - monic, i.e., such that the coefficients of the highest order terms are equal to 1
   - coprime, i.e., without common roots
2. both the numerator and the denominator of H(z) are asymptotically stable, i.e., the magnitude of all the zeros and poles of H(z) is less than 1
The predictor associated to M(θ) can be derived from the model equation as follows:
1. subtract y(t) from both sides:  0 = -y(t) + G(z) u(t) + H(z) e(t)
2. divide by H(z):  0 = -(1/H(z)) y(t) + (G(z)/H(z)) u(t) + e(t)
3. add y(t) to both sides:  y(t) = [1 - 1/H(z)] y(t) + (G(z)/H(z)) u(t) + e(t)
Since H(z) is the ratio of two monic polynomials with the same degree, then:
1/H(z) = 1 + α_1 z^-1 + α_2 z^-2 + ...  ⇒  1 - 1/H(z) = -α_1 z^-1 - α_2 z^-2 - ...
⇒ [1 - 1/H(z)] y(t) = -α_1 y(t-1) - α_2 y(t-2) - ... = f_y(y^{t-1})
with y^{t-1} = {y(t-1), y(t-2), ...}. Analogously, since G(z) is strictly proper:
G(z)/H(z) = G(z) [1 + α_1 z^-1 + α_2 z^-2 + ...] = β_1 z^-1 + β_2 z^-2 + ...
⇒ (G(z)/H(z)) u(t) = β_1 u(t-1) + β_2 u(t-2) + ... = f_u(u^{t-1})
with u^{t-1} = {u(t-1), u(t-2), ...} and then:
y(t) = f_y(y^{t-1}) + f_u(u^{t-1}) + e(t)
In the model equation
y(t) = f_y(y^{t-1}) + f_u(u^{t-1}) + e(t)
the output y(t) depends on past values u^{t-1} and y^{t-1} of the input and the output, while the white-noise term e(t) is unpredictable and independent of u^{t-1} and y^{t-1}
⇒ the best prediction of e(t) is provided by its mean value, which is equal to 0
⇒ the optimal one-step predictor of the model M(θ) is given by:
M̂(θ): ŷ(t) = ŷ(t|t-1) = [1 - 1/H(z)] y(t) + (G(z)/H(z)) u(t)
ARX, AR and FIR models in predictor form
In the case of the ARX transfer-function model:
M(θ): y(t) = G(z) u(t) + H(z) e(t),  with G(z) = B(z)/A(z), H(z) = 1/A(z)
the optimal predictor is given by:
M̂(θ): ŷ(t) = [1 - A(z)] y(t) + B(z) u(t)
- ŷ(t) is a linear combination of past values of the input and the output, independent of past predictions
- ŷ(t) is linear in the parameters a_i and b_i of the polynomials A(z) and B(z)
- the predictor is stable for any value of the parameters that define A(z) and B(z)
In the case of the AR transfer-function model, where B(z) = 0, then:
M̂(θ): ŷ(t) = [1 - A(z)] y(t)
In the case of the FIR transfer-function model, where A(z) = 1, then:
M̂(θ): ŷ(t) = B(z) u(t)
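As an illustrative sketch (an assumption, not from the slides), the ARX one-step predictor ŷ(t) = [1 - A(z)] y(t) + B(z) u(t) can be evaluated directly on measured data, e.g. on the ARX(2,1) sequences simulated earlier:

```python
import numpy as np

def arx_predictor(theta, u, y, na, nb):
    """One-step ARX predictor yhat(t) = [1 - A(z)] y(t) + B(z) u(t),
    with theta = [a_1, ..., a_na, b_1, ..., b_nb]."""
    a, b = theta[:na], theta[na:]
    yhat = np.zeros(len(y))
    for t in range(max(na, nb), len(y)):
        past_y = y[t-na:t][::-1]   # [y(t-1), ..., y(t-na)]
        past_u = u[t-nb:t][::-1]   # [u(t-1), ..., u(t-nb)]
        yhat[t] = -a @ past_y + b @ past_u
    return yhat
```

Note that the prediction at time t uses only measured past outputs and inputs, never past predictions, which is why the ARX predictor is stable for any parameter values.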
ARMAX, ARMA and MA models in predictor form
In the case of the ARMAX transfer-function model:
M(θ): y(t) = G(z) u(t) + H(z) e(t),  with G(z) = B(z)/A(z), H(z) = C(z)/A(z)
the optimal predictor is given by:
M̂(θ): ŷ(t) = [1 - A(z)/C(z)] y(t) + (B(z)/C(z)) u(t)
- ŷ(t) is nonlinear in the parameters a_i, b_i, c_i of the polynomials A(z), B(z), C(z)
- the predictor stability depends on the values of the parameters that define C(z)
In the case of the ARMA transfer-function model, where B(z) = 0, then:
M̂(θ): ŷ(t) = [1 - A(z)/C(z)] y(t)
In the case of the MA transfer-function model, where B(z) = 0 and A(z) = 1, then:
M̂(θ): ŷ(t) = [1 - 1/C(z)] y(t)
OE models in predictor form
In the case of the OE transfer-function model:
M(θ): y(t) = G(z) u(t) + H(z) e(t),  with G(z) = B(z)/F(z), H(z) = 1
the optimal predictor is given by:
M̂(θ): ŷ(t) = (B(z)/F(z)) u(t)
- ŷ(t) is a linear combination of past values of the exogenous input only, independent of past predictions
- ŷ(t) is nonlinear in the parameters b_i, f_i of the polynomials B(z) and F(z)
- the predictor stability depends on the values of the parameters that define F(z)
Asymptotic analysis of prediction-error identification methods
Using the prediction-error identification methods (PEM), the optimal model in the parametric class M = {M(θ) : θ ∈ Θ} is obtained by minimizing the size of the prediction-error sequence ε(·), i.e.:
J_N(θ) = (1/N) Σ_{t=τ}^N ε(t)^2   or, in general,   J_N(θ) = (1/N) Σ_{t=τ}^N ℓ(ε(t))
where ℓ(·) is a scalar-valued (typically positive) function
Goal: analyze the asymptotic (i.e., as N → ∞) characteristics of the optimal estimate
θ̂_N = arg min_{θ∈Θ} J_N(θ)
Assumptions: the predictor form M̂(θ) of the model M(θ) is stable and the sequences u(·) and y(·) are stationary processes ⇒ the one-step prediction ŷ(·) and the prediction error ε(·) = y(·) - ŷ(·) are stationary processes as well
⇒ J_N(θ) = (1/N) Σ_{t=τ}^N ε(t)^2  →  J̄(θ) = E[ε(t)^2]  as N → ∞
Let us denote by D_Θ the set of minimum points of J̄(θ), i.e.:
D_Θ = {θ* ∈ Θ : J̄(θ*) ≤ J̄(θ), ∀θ ∈ Θ}
Result #1:
if the data generating system S ∈ M, i.e., ∃ θ_o ∈ Θ : S = M(θ_o), then θ_o ∈ D_Θ
Result #2:
1. if S ∈ M and D_Θ = {θ_o} (i.e., D_Θ is a singleton) ⇒ θ̂_N → θ_o as N → ∞
2. if S ∈ M and ∃ θ* ∈ D_Θ : θ* ≠ θ_o (i.e., D_Θ is not a singleton) ⇒ asymptotically, either θ̂_N tends to a point in D_Θ (not necessarily θ_o) or it does not converge to any particular point of D_Θ but wanders around in D_Θ
3. if S ∉ M and D_Θ = {θ*} (i.e., D_Θ is a singleton) ⇒ θ̂_N → θ* as N → ∞ and M(θ*) is the best approximation of S within M
4. if S ∉ M and D_Θ is not a singleton ⇒ asymptotically, either θ̂_N tends to a point in D_Θ or it wanders around in D_Θ
To measure the uncertainty and the convergence rate of the estimate θ̂_N, we have to study the random variable θ̂_N - θ*, being θ* the limit of θ̂_N as N → ∞
Result #3:
if S ∈ M and D_Θ = {θ_o}, with θ_o ∈ R^n, then:
- θ̂_N - θ_o decays as 1/√N for N → ∞
- the random variable √N (θ̂_N - θ_o) is asymptotically normally distributed:
  √N (θ̂_N - θ_o) ~ As N(0, P̄)
  where
  P̄ = Var[ε(t, θ_o)] · R̄^-1 ∈ R^{n×n}   (asymptotic variance matrix)
  R̄ = E[ψ(t, θ_o) ψ(t, θ_o)^T] ∈ R^{n×n}
  ψ(t, θ) = -[dε(t, θ)/dθ]^T = [dŷ(t, θ)/dθ]^T ∈ R^n
⇒ θ̂_N ~ As N(θ_o, P̄/N)
Note that the asymptotic variance matrix P̄ can be directly estimated from data as follows, having processed N data points and determined θ̂_N:
P̄ = Var[ε(t, θ_o)] · R̄^-1  ≅  P̂_N = σ̂²_N · R̂_N^-1
σ̂²_N = (1/N) Σ_{t=1}^N ε(t, θ̂_N)^2 ∈ R
R̂_N = (1/N) Σ_{t=1}^N ψ(t, θ̂_N) ψ(t, θ̂_N)^T ∈ R^{n×n}
⇒ the estimate uncertainty intervals can be derived from data
Linear regressions and least-squares method
In the case of equation error or ARX models, the optimal predictor is given by:
M̂(θ): ŷ(t) = [1 - A(z)] y(t) + B(z) u(t)
with A(z) = 1 + a_1 z^-1 + ... + a_{n_a} z^-n_a,  B(z) = b_1 z^-1 + ... + b_{n_b} z^-n_b
⇒ ŷ(t) = [-a_1 z^-1 - ... - a_{n_a} z^-n_a] y(t) + [b_1 z^-1 + ... + b_{n_b} z^-n_b] u(t) =
        = -a_1 y(t-1) - ... - a_{n_a} y(t-n_a) + b_1 u(t-1) + ... + b_{n_b} u(t-n_b) =
        = φ(t)^T θ = ŷ(t, θ)
where
φ(t) = [-y(t-1) ... -y(t-n_a)  u(t-1) ... u(t-n_b)]^T ∈ R^{n_a+n_b}
θ = [a_1 ... a_{n_a}  b_1 ... b_{n_b}]^T ∈ R^{n_a+n_b}
i.e., it defines a linear regression ⇒ the vector φ(t) is known as the regression vector
Since the prediction error at the time instant t is given by:
ε(t, θ) = y(t) - ŷ(t, θ) = y(t) - φ(t)^T θ,   t = 1, ..., N
and the optimality criterion (assuming τ = 1, for the sake of simplicity) is quadratic:
J_N(θ) = (1/N) Σ_{t=1}^N ε(t, θ)^2
the optimal parameter vector θ̂_N that minimizes J_N(θ) over all θ ∈ Θ = R^{n_a+n_b} is obtained by solving the normal equation system:
[Σ_{t=1}^N φ(t) φ(t)^T] θ = Σ_{t=1}^N φ(t) y(t)
- if the matrix Σ_{t=1}^N φ(t) φ(t)^T is nonsingular (known as identifiability condition), then there exists a single unique solution given by the least-squares (LS) estimate:
  θ̂_N = [Σ_{t=1}^N φ(t) φ(t)^T]^-1 Σ_{t=1}^N φ(t) y(t)
- otherwise, there are infinite solutions
Remark: the least-squares method can be applied to any model (not necessarily ARX) such that the corresponding predictor is a linear or affine function of θ:
ŷ(t, θ) = φ(t)^T θ + μ(t)
where μ(t) ∈ R is a known data-dependent term. In fact, if the identifiability condition is satisfied, then:
θ̂_N = [Σ_{t=1}^N φ(t) φ(t)^T]^-1 Σ_{t=1}^N φ(t) (y(t) - μ(t))
Such a situation may occur in many different cases:
- when some coefficients of the polynomials A(z), B(z) of an ARX model are known
- when the predictor (even of a nonlinear model) can be written as a linear function of θ, by suitably choosing φ(t)
Example: given the nonlinear dynamic model
y(t) = a·y(t-1)^2 + b_1 u(t-3) + b_2 u(t-5)^3 + e(t),   e(·) ~ WN(0, σ^2)
the corresponding predictor is linear in the unknown parameters:
M̂(θ): ŷ(t) = a·y(t-1)^2 + b_1 u(t-3) + b_2 u(t-5)^3 = φ(t)^T θ
with φ(t) = [y(t-1)^2  u(t-3)  u(t-5)^3]^T and θ = [a  b_1  b_2]^T
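A brief sketch (with simulated data and arbitrary "true" parameter values, both assumptions) of how the nonlinear model above still leads to an ordinary linear least-squares problem once the regression vector φ(t) is built from the data:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
u = rng.uniform(-1.0, 1.0, N)
e = 0.05 * rng.standard_normal(N)
a, b1, b2 = 0.2, 0.5, 0.3                  # arbitrary "true" parameters
y = np.zeros(N)
for t in range(5, N):
    y[t] = a*y[t-1]**2 + b1*u[t-3] + b2*u[t-5]**3 + e[t]

# Regression vector phi(t) = [y(t-1)^2, u(t-3), u(t-5)^3]^T
t = np.arange(5, N)
Phi = np.column_stack([y[t-1]**2, u[t-3], u[t-5]**3])
theta_N, *_ = np.linalg.lstsq(Phi, y[t], rcond=None)
print(theta_N)                             # expected close to [0.2, 0.5, 0.3]
```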
Probabilistic analysis of the least-squares method
Let the predictor M̂(θ) of M(θ) be stable and u(·), y(·) be stationary processes.
The least-squares method is a PEM method ⇒ the previous asymptotic results hold ⇒ asymptotically, either θ̂_N tends to a point in D_Θ or wanders around in D_Θ, where
D_Θ = {θ* ∈ Θ : J̄(θ*) ≤ J̄(θ), ∀θ ∈ Θ}
is the set of minimum points of J̄(θ) = E[ε(t, θ)^2].
If S ∈ M ⇒ ∃ θ_o ∈ D_Θ : S = M(θ_o) ⇒ y(t) = φ(t)^T θ_o + e(t), e(t) ~ WN(0, σ^2)
If S ∈ M and D_Θ = {θ_o}, then θ̂_N ~ As N(θ_o, P̄/N), where:
P̄ = Var[ε(t, θ_o)] · R̄^-1 = σ^2 R̄^-1
R̄ = E[ψ(t, θ_o) ψ(t, θ_o)^T] = E[φ(t) φ(t)^T]
ψ(t, θ) = -[dε(t, θ)/dθ]^T = [dŷ(t, θ)/dθ]^T = φ(t)
since ŷ(t, θ) = φ(t)^T θ,  ε(t, θ) = y(t) - ŷ(t, θ) = φ(t)^T (θ_o - θ) + e(t)
P̄ can be directly estimated from N data as:
P̄ = σ^2 R̄^-1  ≅  P̂_N = σ̂²_N R̂_N^-1,  with
σ̂²_N = (1/N) Σ_{t=1}^N ε(t, θ̂_N)^2 = (1/N) Σ_{t=1}^N [y(t) - φ(t)^T θ̂_N]^2
R̂_N = (1/N) Σ_{t=1}^N ψ(t, θ̂_N) ψ(t, θ̂_N)^T = (1/N) Σ_{t=1}^N φ(t) φ(t)^T
Note that, under the assumption that S ∈ M, the set D_Θ is a singleton that contains the true parameter vector θ_o only ⇔ the matrix R̄ = E[φ(t) φ(t)^T] is nonsingular
In the case of an ARX(n_a, n_b) model,
φ(t) = [-y(t-1) ... -y(t-n_a)  u(t-1) ... u(t-n_b)]^T = [φ_y(t)^T  φ_u(t)^T]^T
with φ_y(t) = [-y(t-1) ... -y(t-n_a)]^T ∈ R^{n_a},  φ_u(t) = [u(t-1) ... u(t-n_b)]^T ∈ R^{n_b}

R̄ = E[φ(t) φ(t)^T] = [ E[φ_y(t) φ_y(t)^T]   E[φ_y(t) φ_u(t)^T] ]   [ R̄_yy^(n_a)   R̄_yu       ]
                      [ E[φ_u(t) φ_y(t)^T]   E[φ_u(t) φ_u(t)^T] ] = [ R̄_yu^T       R̄_uu^(n_b) ]

where R̄_yy^(n_a) = (R̄_yy^(n_a))^T, R̄_uu^(n_b) = (R̄_uu^(n_b))^T and R̄_uy = R̄_yu^T
For structural reasons, R̄ is symmetric and positive semidefinite, since ∀x ∈ R^{n_a+n_b}:
x^T R̄ x = x^T E[φ(t) φ(t)^T] x = E[x^T φ(t) φ(t)^T x] = E[(x^T φ(t))^2] ≥ 0
⇒ R̄ is nonsingular ⇔ R̄ is positive definite (denoted as: R̄ > 0)
Schur's Lemma: given a symmetric matrix M partitioned as:
M = [ M_11    M_12 ]
    [ M_12^T  M_22 ]
(where obviously M_11 and M_22 are symmetric), M is positive definite if and only if:
M_22 > 0 and M_11 - M_12 M_22^-1 M_12^T > 0
⇒ A necessary condition for the invertibility of R̄ is that R̄_uu^(n_b) > 0, i.e., that R̄_uu^(n_b) is nonsingular, since R̄_uu^(n_b) is symmetric and positive semidefinite; in fact, ∀x ∈ R^{n_b}:
x^T R̄_uu^(n_b) x = x^T E[φ_u(t) φ_u(t)^T] x = E[x^T φ_u(t) φ_u(t)^T x] = E[(x^T φ_u(t))^2] ≥ 0
R̄_uu^(n_b) = E[φ_u(t) φ_u(t)^T] = E[ [u(t-1) ... u(t-n_b)]^T [u(t-1) ... u(t-n_b)] ] =

= [ E[u(t-1)^2]         E[u(t-1)u(t-2)]     ...  E[u(t-1)u(t-n_b)] ]
  [ E[u(t-2)u(t-1)]     E[u(t-2)^2]         ...  E[u(t-2)u(t-n_b)] ]
  [ ...                 ...                 ...  ...               ]
  [ E[u(t-n_b)u(t-1)]   E[u(t-n_b)u(t-2)]   ...  E[u(t-n_b)^2]     ]

= [ r_u(t-1, 0)       r_u(t-1, 1)       ...  r_u(t-1, n_b-1) ]
  [ r_u(t-1, 1)       r_u(t-2, 0)       ...  r_u(t-2, n_b-2) ]
  [ ...               ...               ...  ...             ]
  [ r_u(t-1, n_b-1)   r_u(t-2, n_b-2)   ...  r_u(t-n_b, 0)   ]

= [ r_u(0)      r_u(1)      ...  r_u(n_b-1) ]
  [ r_u(1)      r_u(0)      ...  r_u(n_b-2) ]
  [ ...         ...         ...  ...        ]
  [ r_u(n_b-1)  r_u(n_b-2)  ...  r_u(0)     ]

where r_u(t, τ) = E[u(t) u(t-τ)] is the correlation function of the input u(·), which is independent of t for any stationary process u(·): r_u(t_1, τ) = r_u(t_2, τ) = r_u(τ), ∀t_1, t_2, τ
A stationary signal u(·) is persistently exciting of order n ⇔ R̄_uu^(n) is nonsingular
Examples:
- the discrete-time unitary impulse u(t) = δ(t) = 1 if t = 1, 0 if t ≠ 1, is not persistently exciting of any order, since r_u(τ) = 0 ∀τ ⇒ R̄_uu^(n) = 0_{n×n}
- the discrete-time unitary step u(t) = 1 if t = 1, 2, ..., 0 if t = ..., -1, 0, is persistently exciting of order 1 only, since r_u(τ) = 1 ∀τ ⇒ R̄_uu^(n) = 1_{n×n}
- the discrete-time signal u(t) consisting of m different sinusoids:
  u(t) = Σ_{k=1}^m α_k cos(ω_k t + φ_k), where 0 ≤ ω_1 < ω_2 < ... < ω_m ≤ π,
  is persistently exciting of order
  n = 2m   if 0 < ω_1 and ω_m < π,
  n = 2m-1 if 0 = ω_1 or ω_m = π,
  n = 2m-2 if 0 = ω_1 and ω_m = π
- the white noise u(t) ~ WN(0, σ^2) is persistently exciting of all orders, since r_u(0) = σ^2 and r_u(τ) = 0 ∀τ ≠ 0 ⇒ R̄_uu^(n) = σ^2 I_n
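A small sketch (an illustration, not from the slides) that estimates R̂_uu^(n) from samples of u and checks its numerical rank, a practical proxy for persistent excitation of order n; the test signals, the number of samples and the rank tolerance are assumptions.

```python
import numpy as np

def excitation_matrix(u, n):
    """Sample estimate of R_uu^(n): Toeplitz matrix built from the
    empirical correlations r_u(0), ..., r_u(n-1)."""
    N = len(u)
    r = np.array([np.dot(u[:N - tau], u[tau:]) / N for tau in range(n)])
    return np.array([[r[abs(i - j)] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(3)
N, n = 10_000, 4
signals = {
    "step":        np.ones(N),
    "sinusoid":    np.cos(0.7 * np.arange(N)),
    "white noise": rng.standard_normal(N),
}
for name, u in signals.items():
    rank = np.linalg.matrix_rank(excitation_matrix(u, n), tol=1e-3)
    print(f"{name:12s} rank of R_uu^({n}) = {rank}")   # expected: 1, 2, 4
```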
As a consequence, a necessary condition for the invertibility of R̄ is that the signal u(·) is persistently exciting of order n_b at least
⇒ A necessary condition to univocally estimate the parameters of an ARX(n_a, n_b) model (i.e., to prevent any problem of experimental identifiability related to the choice of u) is that the signal u(·) is persistently exciting of order n_b at least
The matrix R̄ may however be singular also for problems of structural identifiability related to the choice of the model class M: in fact, even in the case S ∈ M, if M is redundant or overparametrized (i.e., its orders are greater than necessary), then an infinite number of models may represent S by means of suitable pole-zero cancellations in the denominator and numerator of the involved transfer functions
⇒ To summarize, only in the case that S = M(θ_o) is an ARX(n_a, n_b) (without any pole-zero cancellation in the transfer function) and M is the class of ARX(n_a, n_b) models, if the input signal u(·) is persistently exciting of order n_b at least, then the least-squares estimate θ̂_N asymptotically converges to the true parameter vector θ_o
Least-squares method: practical procedure
1. Starting from N data points of u(·) and y(·), build the regression vector φ(t) and the matrix R̂_N = (1/N) Σ_{t=1}^N φ(t) φ(t)^T ≅ R̄ if φ(·) is stationary; in compact matrix form, R̂_N = (1/N) Φ^T Φ, where Φ = [φ(1)^T; ...; φ(N)^T]
2. Check if R̂_N is nonsingular, i.e., if det R̂_N ≠ 0: if the matrix R̂_N^-1 exists, then the estimate is unique and it is given by:
   θ̂_N = R̂_N^-1 (1/N) Σ_{t=1}^N φ(t) y(t);
   in matrix form, θ̂_N = R̂_N^-1 (1/N) Φ^T y = (Φ^T Φ)^-1 Φ^T y, where y = [y(1); ...; y(N)]
3. Evaluate the prediction error of the estimated model ε(t, θ̂_N) = y(t) - φ(t)^T θ̂_N and approximate the covariance of the estimate as:
   Var[θ̂_N] ≅ R̂_N^-1 (1/N^2) Σ_{t=1}^N ε(t, θ̂_N)^2,
   where the elements on the diagonal are the variances of each parameter [θ̂_N]_i
4. Check the whiteness of ε(t, θ̂_N) by means of a suitable test (steps 1-3 are sketched in code right after this list)
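The following sketch (an illustrative implementation of steps 1-3 under the formulas above, assuming ARX data such as the simulated sequences used earlier) is one way to carry out the procedure:

```python
import numpy as np

def arx_least_squares(u, y, na, nb):
    """Batch least-squares ARX(na, nb) estimate with approximate
    standard deviations of the parameters."""
    t0 = max(na, nb)
    N = len(y) - t0
    # Step 1: regression matrix Phi, one row phi(t)^T per time instant
    Phi = np.column_stack(
        [-y[t0 - i: len(y) - i] for i in range(1, na + 1)] +
        [ u[t0 - i: len(u) - i] for i in range(1, nb + 1)])
    Y = y[t0:]
    R_N = Phi.T @ Phi / N
    # Step 2: identifiability check and LS estimate
    if np.linalg.matrix_rank(R_N) < R_N.shape[0]:
        raise ValueError("R_N is singular: input not exciting enough")
    theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ Y)
    # Step 3: prediction errors and approximate parameter variances
    eps = Y - Phi @ theta
    cov_theta = np.linalg.inv(R_N) * np.sum(eps**2) / N**2
    return theta, np.sqrt(np.diag(cov_theta)), eps
```

On the ARX(2,1) data simulated earlier, arx_least_squares(u, y, 2, 1) would be expected to return estimates close to (-1.2, 0.5, 0.8) with small standard deviations.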
Anderson's whiteness test
Let ε(·) be the signal under test and N be the (sufficiently large) number of samples
1. Compute the sample correlation function r_ε(τ) = (1/N) Σ_{t=τ+1}^N ε(t) ε(t-τ), 0 ≤ τ ≤ τ̄ (τ̄ = 25 or 30), and the normalized sample correlation function ρ̂(τ) = r_ε(τ)/r_ε(0)
   - if ε(·) is white with zero mean, then ρ̂(τ) is asymptotically normally distributed:
     ρ̂(τ) ~ As N(0, 1/N), ∀τ > 0
   - moreover, ρ̂(τ_1) and ρ̂(τ_2) are asymptotically uncorrelated ∀τ_1 ≠ τ_2
2. Fix a confidence level α, i.e., the probability that asymptotically |ρ̂(τ)| ≤ β, and evaluate β; in particular, it turns out that:
   β = 1/√N for α = 68.3%,  β = 2/√N for α = 95.4%,  β = 3/√N for α = 99.7%
3. The test is failed if the number of values τ such that |ρ̂(τ)| ≤ β is less than ⌊α·τ̄⌋, where ⌊x⌋ denotes the biggest integer less than or equal to x; otherwise it is passed
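A compact sketch of the test as described above (the default maximum lag and band multiplier are assumptions):

```python
import math
import numpy as np

def anderson_whiteness_test(eps, tau_max=25, beta_mult=2.0):
    """Anderson whiteness test; beta_mult = 1, 2, 3 corresponds to the
    confidence levels alpha = 68.3%, 95.4%, 99.7% of the slides."""
    N = len(eps)
    r0 = np.dot(eps, eps) / N
    rho = np.array([np.dot(eps[tau:], eps[:N - tau]) / (N * r0)
                    for tau in range(1, tau_max + 1)])
    beta = beta_mult / math.sqrt(N)              # confidence band
    alpha = math.erf(beta_mult / math.sqrt(2))   # corresponding level
    inside = int(np.sum(np.abs(rho) <= beta))
    passed = inside >= math.floor(alpha * tau_max)
    return passed, rho, beta
```

Applied to the residuals ε(t, θ̂_N) of a well-fitted model, the test should pass; a failure suggests that dynamics remain unexplained by the model.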
Recursive least-squares methods
The least-squares estimate referred to a generic time instant t is given by:
θ̂_t = [Σ_{i=1}^t φ(i) φ(i)^T]^-1 Σ_{i=1}^t φ(i) y(i) = S(t)^-1 Σ_{i=1}^t φ(i) y(i)
where
S(t) = Σ_{i=1}^t φ(i) φ(i)^T = Σ_{i=1}^{t-1} φ(i) φ(i)^T + φ(t) φ(t)^T = S(t-1) + φ(t) φ(t)^T
The least-squares estimate referred to the time instant t-1 is given by:
θ̂_{t-1} = [Σ_{i=1}^{t-1} φ(i) φ(i)^T]^-1 Σ_{i=1}^{t-1} φ(i) y(i) = S(t-1)^-1 Σ_{i=1}^{t-1} φ(i) y(i)
and then:
θ̂_t = S(t)^-1 Σ_{i=1}^t φ(i) y(i) = S(t)^-1 [Σ_{i=1}^{t-1} φ(i) y(i) + φ(t) y(t)] =
    = S(t)^-1 [S(t-1) θ̂_{t-1} + φ(t) y(t)] =
    = S(t)^-1 {[S(t) - φ(t) φ(t)^T] θ̂_{t-1} + φ(t) y(t)} =
    = θ̂_{t-1} - S(t)^-1 φ(t) φ(t)^T θ̂_{t-1} + S(t)^-1 φ(t) y(t) =
    = θ̂_{t-1} + S(t)^-1 φ(t) [y(t) - φ(t)^T θ̂_{t-1}]
Since the estimate can be computed as: θ̂_t = θ̂_{t-1} + S(t)^-1 φ(t) [y(t) - φ(t)^T θ̂_{t-1}], a first recursive least-squares (RLS) algorithm (denoted as RLS-1) is the following one:
S(t) = S(t-1) + φ(t) φ(t)^T          (time update)
K(t) = S(t)^-1 φ(t)                  (algorithm gain)
ε(t) = y(t) - φ(t)^T θ̂_{t-1}         (prediction error)
θ̂_t = θ̂_{t-1} + K(t) ε(t)           (estimate update)
An alternative algorithm is derived by considering the matrix R(t) = (1/t) Σ_{i=1}^t φ(i) φ(i)^T:
R(t) = (1/t) S(t) = (1/t) S(t-1) + (1/t) φ(t) φ(t)^T =
     = [1/t + 1/(t-1) - 1/(t-1)] S(t-1) + (1/t) φ(t) φ(t)^T =
     = (1/(t-1)) S(t-1) + [1/t - 1/(t-1)] S(t-1) + (1/t) φ(t) φ(t)^T =
     = R(t-1) + ((t-1-t)/(t(t-1))) S(t-1) + (1/t) φ(t) φ(t)^T =
     = R(t-1) - (1/t) R(t-1) + (1/t) φ(t) φ(t)^T =
     = [1 - 1/t] R(t-1) + (1/t) φ(t) φ(t)^T
A second recursive least-squares algorithm (denoted as RLS-2) is then the following one:
R(t) = [1 - 1/t] R(t-1) + (1/t) φ(t) φ(t)^T   (time update)
K(t) = (1/t) R(t)^-1 φ(t)                     (algorithm gain)
ε(t) = y(t) - φ(t)^T θ̂_{t-1}                  (prediction error)
θ̂_t = θ̂_{t-1} + K(t) ε(t)                    (estimate update)
The main drawback of the RLS-1 and RLS-2 algorithms is the inversion at each step of the square matrices S(t) and R(t), respectively, whose dimensions are equal to the number of estimated parameters ⇒ by applying the Matrix Inversion Lemma:
(A + BCD)^-1 = A^-1 - A^-1 B (C^-1 + D A^-1 B)^-1 D A^-1
taking A = S(t-1), B = D^T = φ(t), C = 1 and introducing V(t) = S(t)^-1 gives:
V(t) = S(t)^-1 = [S(t-1) + φ(t) φ(t)^T]^-1 =
     = S(t-1)^-1 - S(t-1)^-1 φ(t) [1 + φ(t)^T S(t-1)^-1 φ(t)]^-1 φ(t)^T S(t-1)^-1 =
     = V(t-1) - [1 + φ(t)^T V(t-1) φ(t)]^-1 V(t-1) φ(t) φ(t)^T V(t-1)
where the bracketed term 1 + φ(t)^T S(t-1)^-1 φ(t) is a scalar
Since V(t) = S(t)^-1 = V(t-1) - [1 + φ(t)^T V(t-1) φ(t)]^-1 V(t-1) φ(t) φ(t)^T V(t-1), a third recursive least-squares algorithm (denoted as RLS-3) is then the following one:
β_{t-1} = 1 + φ(t)^T V(t-1) φ(t)                          (scalar weight)
V(t) = V(t-1) - (1/β_{t-1}) V(t-1) φ(t) φ(t)^T V(t-1)     (time update)
K(t) = V(t) φ(t)                                          (algorithm gain)
ε(t) = y(t) - φ(t)^T θ̂_{t-1}                              (prediction error)
θ̂_t = θ̂_{t-1} + K(t) ε(t)                                (estimate update)
To use the recursive algorithms, initial values for their start-up are obviously required; in the case of the RLS-3 algorithm:
- the correct initial conditions, at a time instant t_o when S(t_o) becomes invertible, are:
  V(t_o) = S(t_o)^-1 = [Σ_{i=1}^{t_o} φ(i) φ(i)^T]^-1,   θ̂_{t_o} = V(t_o) Σ_{i=1}^{t_o} φ(i) y(i)
- assuming n = dim(θ), a much simpler alternative is to use:
  V(0) = γ I_n, γ > 0, and θ̂_0 = 0_{n×1}
  ⇒ θ̂_t rapidly changes from θ̂_0 if γ ≫ 1, while θ̂_t slowly changes from θ̂_0 if γ ≪ 1
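A minimal sketch of the RLS-3 recursion (an illustration under the simpler V(0) = γ I_n, θ̂_0 = 0 initialization; the data layout is an assumption):

```python
import numpy as np

def rls3(Phi, Y, gamma=1e6):
    """RLS-3: process the regression rows phi(t)^T (rows of Phi) and the
    outputs y(t) one at a time; V(0) = gamma*I_n, theta_0 = 0, gamma >> 1."""
    n = Phi.shape[1]
    V = gamma * np.eye(n)
    theta = np.zeros(n)
    for phi, y in zip(Phi, Y):
        beta = 1.0 + phi @ V @ phi                  # scalar weight
        V = V - np.outer(V @ phi, V @ phi) / beta   # time update (no inversion)
        K = V @ phi                                 # algorithm gain
        eps = y - phi @ theta                       # prediction error
        theta = theta + K * eps                     # estimate update
    return theta, V
```

Fed with the Φ and y built in the batch procedure above, the final θ̂_t essentially coincides with the batch least-squares estimate, up to the effect of the γ I_n initialization.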
Model structure selection and validation
A most natural approach to search for a suitable model structure M is simply to test a number of different ones and to compare the resulting models
Given a model M(θ) ∈ M with complexity n = dim(θ), the cost function
J(θ)^(n) = (1/N) Σ_{t=1}^N ε(t)^2 = (1/N) Σ_{t=1}^N (y(t) - ŷ(t, θ))^2
provides a measure of the fitting of the data set y provided by M(θ)
⇒ if θ̂_N = arg min J(θ)^(n), then J(θ̂_N)^(n) measures the best fitting of the data y provided by M and represents a subjective (and very optimistic) evaluation of the quality of M
In order to perform a more objective evaluation, it would be necessary to measure the model class accuracy on data different from those used in the identification
⇒ to this purpose, there are different criteria:
- Cross-Validation
- Akaike's Final Prediction-Error Criterion (FPE)
- Model Structure Selection Criteria: AIC and MDL (or BIC)
Cross-Validation
If the overall data set is sufficiently large, it can be partitioned into two subsets:
- the estimation data are the ones used to estimate the model M(θ̂_N) ∈ M
- the validation data are the ones that have not been used to build any of the models we would like to compare
For any given model class M, first the model M(θ̂_N) that best reproduces the estimation data is identified, and then its performance is evaluated by computing the mean square error on the validation data only: the model that minimizes such a criterion among the different classes M is chosen as the most suitable one
It can be noted that, within any model class, higher order models usually suffer from overfitting, i.e., they fit the estimation data so closely that they also fit the noise term, and then their predictive capability on a fresh data set (corrupted by a different noise realization) is smaller with respect to lower order models
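A compact sketch of cross-validation across candidate ARX orders (the 70/30 split is an assumption; it reuses the arx_least_squares and arx_predictor sketches introduced earlier):

```python
import numpy as np

def cross_validate_arx(u, y, orders, split=0.7):
    """Fit each candidate ARX(na, nb) on the estimation data and score it by
    the mean-square one-step prediction error on the validation data."""
    Ne = int(split * len(y))
    scores = {}
    for na, nb in orders:
        theta, _, _ = arx_least_squares(u[:Ne], y[:Ne], na, nb)
        yhat = arx_predictor(theta, u[Ne:], y[Ne:], na, nb)
        t0 = max(na, nb)
        scores[(na, nb)] = np.mean((y[Ne + t0:] - yhat[t0:])**2)
    best = min(scores, key=scores.get)
    return best, scores
```

Typically the validation score first decreases with the model orders and then increases again once overfitting sets in, which is exactly the behaviour described above.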
Akaike's Final Prediction-Error Criterion (FPE)
In order to consider any possible realization of data y(t, s) that depends on the outcome s of the random experiment, let us consider as objective criterion:
J̄(θ) = E[(y(t, s) - ŷ(t, s, θ))^2]
Since θ̂_N = θ̂_N(s) depends on a particular data set y(t, s) generated by a particular outcome s, the Final Prediction Error (FPE) criterion is defined as the mean over any possible outcome s:
FPE = E[J̄(θ̂_N(s))]
In the case of the AR model class, it can be proved that:
FPE = (N + n)/(N - n) · J(θ̂_N)^(n)
where J(θ̂_N)^(n) is a monotonically decreasing function of n, while (N + n)/(N - n) → ∞ as n → N
⇒ FPE is decreasing for lower values of n and increasing for higher values of n
⇒ the optimal model complexity corresponds to the minimum of FPE
The same formula is usually used also in the case of other model classes (ARX, ARMAX)
Akaike's Information Criterion (AIC)
Such a criterion is derived on the basis of statistical considerations and aims at minimizing the so-called Kullback distance between the true probability density function of the data and the p.d.f. produced by a given model M(θ̂_N):
AIC = 2n/N + ln J(θ̂_N)^(n)
The optimum model order n* minimizes the AIC criterion: n* = arg min_n AIC
For large values of N, the FPE and AIC criteria lead to the same result:
ln FPE = ln[(N + n)/(N - n) · J(θ̂_N)^(n)] = ln[(1 + n/N)/(1 - n/N) · J(θ̂_N)^(n)] =
       = ln(1 + n/N) - ln(1 - n/N) + ln J(θ̂_N)^(n) ≅
       ≅ n/N - (-n/N) + ln J(θ̂_N)^(n) = 2n/N + ln J(θ̂_N)^(n) = AIC
The AIC criterion is directed to find system descriptions that give the smallest mean-square error: a model that apparently gives a smaller mean-square (prediction) error fit will be chosen even if it is quite complex
Rissanen's Minimum Description Length Criterion (MDL)
In practice, one may want to add an extra penalty for the model complexity, in order to reflect the cost of using it
What is meant by a complex model and what penalty should be associated with it are usually subjective issues; an approach that is conceptually related to coding theory and information measures has been taken by Rissanen, who stated that a model should be sought that allows the shortest possible code or description of the observed data, leading to the Minimum Description Length (MDL) criterion:
MDL = n·(ln N)/N + ln J(θ̂_N)^(n)
As in the AIC criterion, the model complexity penalty is proportional to n; however, while in AIC the constant is 2/N, in MDL the constant is (ln N)/N > 2/N for any N ≥ 8
⇒ the MDL criterion leads to much more parsimonious models than those selected by the AIC criterion, especially for large values of N
Such a criterion has also been termed BIC by Akaike
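A short sketch (an illustration, not from the slides) that evaluates the three criteria from the minimized costs J(θ̂_N)^(n) of a family of candidate complexities and reports the order each criterion selects:

```python
import numpy as np

def model_order_criteria(J_n, N):
    """J_n: dict mapping model complexity n to the minimized cost J(theta_N)^(n).
    Returns the FPE, AIC and MDL values and the order selected by each."""
    crit = {n: {"FPE": (N + n) / (N - n) * J,
                "AIC": 2 * n / N + np.log(J),
                "MDL": n * np.log(N) / N + np.log(J)}
            for n, J in J_n.items()}
    best = {name: min(crit, key=lambda n: crit[n][name])
            for name in ("FPE", "AIC", "MDL")}
    return crit, best
```

With the same costs J(θ̂_N)^(n), MDL penalizes complexity more heavily than AIC for N ≥ 8, so it tends not to select a larger order than AIC does.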