Data Pre Processing

4/CY/O8 Advanced System Identification Lecture 4
Data Pre-processing
When data have been collected from the identification experiment, it is often necessary to do some processing on the data set prior to using it for identification.
There are several possible deficiencies in the data that should be attended to:
1. High-frequency disturbances in the data record, above the frequencies of interest for the system dynamics.
2. Occasional outliers and missing data.
3. Drift and offset, low frequency disturbances.
Dr. V.M. Becerra 1

Drifts and detrending
There are two different approaches to dealing with these problems:
1. Removing the disturbances by explicit pre-treatment of the data.
2. Letting the noise model take care of the disturbances.
The first approach involves removing trends and offsets by direct subtraction.
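Direct subtraction of a linear trend can be sketched as follows. This is only an illustration using base MATLAB functions: the signal ym and all numerical values are made up, and a straight-line trend is an assumption that should be checked against the data.

```matlab
% Hypothetical measured signal: offset plus slow linear drift plus a
% zero-mean component (all values made up for illustration)
N  = 200;
t  = (1:N)';
ym = 2 + 0.05*t + sin(0.4*t);

% Fit a straight line in the least-squares sense and subtract it
p     = polyfit(t, ym, 1);      % p(1) = slope, p(2) = intercept
trend = polyval(p, t);
y     = ym - trend;             % detrended signal used for identification

% Removing only the offset corresponds to subtracting the sample mean
y0 = ym - mean(ym);
```

The same trend removal would be applied to every input and output channel before estimation.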

Signal offsets
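There are several approaches to dealing with signal offsets or non-zero mean values:
1. Let y(t) and u(t) be deviations of the measured signals $y^m(t)$, $u^m(t)$ from a physical equilibrium $(\bar{y}, \bar{u})$:
$$y(t) = y^m(t) - \bar{y}, \qquad u(t) = u^m(t) - \bar{u}$$
2. Subtract the mean values from the data, that is, form the same deviations with
$$\bar{y} = \frac{1}{N}\sum_{t=1}^{N} y^m(t), \qquad \bar{u} = \frac{1}{N}\sum_{t=1}^{N} u^m(t)$$
3. Estimate the offset explicitly. For example, include a constant term α in the model:
$$A(q^{-1})\, y^m(t) = B(q^{-1})\, u^m(t) + \alpha + v(t)$$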
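Approaches 2 and 3 can be compared on a small example. The sketch below assumes a hypothetical noise-free first-order model with A(q^-1) = 1 + a q^-1 and B(q^-1) = b q^-1; the names a, b, alpha and all numerical values are illustrative and do not come from the lecture.

```matlab
% Hypothetical model: ym(t) + a*ym(t-1) = b*um(t-1) + alpha (noise-free)
a = -0.8;  b = 0.5;  alpha = 1.0;          % made-up "true" values
N  = 100;
um = 30 + 10*sign(sin(0.37*(1:N)'));       % input switching between 20 and 40
ym = zeros(N,1);
ym(1) = 80;                                % start at the equilibrium for um = 30
for t = 2:N
    ym(t) = -a*ym(t-1) + b*um(t-1) + alpha;
end

% Approach 3: add a constant regressor so alpha is estimated with a and b
Phi   = [-ym(1:N-1), um(1:N-1), ones(N-1,1)];
theta = Phi \ ym(2:N);                     % least-squares estimate [a; b; alpha]

% Approach 2: subtract the sample means, then fit the model without alpha
y = ym - mean(ym);
u = um - mean(um);
theta_dev = [-y(1:N-1), u(1:N-1)] \ y(2:N);   % estimate of [a; b] only
```

In this noise-free case theta reproduces a, b and alpha up to rounding, while theta_dev is only approximately equal to [a; b], because the sample means are not exactly an equilibrium pair.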

Drift, trends and seasonal variations
Methods to cope with other slow disturbances in the data are analogous to the previous approaches for dealing with offsets.
Drifts and trends can be seen as time-varying equilibrium points, or time-varying mean values.
With some knowledge about the frequencies of the slow variations, an alternative is to high-pass filter the data:
$$y(t) = F(q)\, y^m(t), \qquad u(t) = F(q)\, u^m(t)$$
where $y^m(t)$ and $u^m(t)$ are the measured signals and F(q) is the high-pass filter.
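As an illustration of the high-pass filtering step, the sketch below uses a first-order filter F(q) = g(1 - q^-1)/(1 - p q^-1) and the base MATLAB function filter. The filter structure, the pole p and the test signals are assumptions made for this example; in practice the cut-off has to be chosen from knowledge of the slow variations.

```matlab
% First-order discrete high-pass filter F(q) = g*(1 - q^-1)/(1 - p*q^-1).
% The pole p is an assumed tuning value: closer to 1 gives a lower cut-off.
p  = 0.95;
g  = (1 + p)/2;                % scales the gain at the Nyquist frequency to 1
bF = g*[1 -1];
aF = [1 -p];

% Hypothetical measured data: output with offset and slow drift,
% input with a constant offset
N  = 400;
t  = (1:N)';
um = 30 + 10*sign(sin(0.5*t));
ym = 50 + 5*sin(2*pi*t/N) + 3*sin(0.5*t);

% Apply the SAME filter to both signals so that the input-output relation
% of the linear system is unchanged. Subtracting the first sample avoids
% a large start-up transient caused by the offset.
y = filter(bF, aF, ym - ym(1));
u = filter(bF, aF, um - um(1));

% Checks of the filter itself: a constant is rejected, and the fastest
% oscillation (-1)^t passes with unit amplitude
yconst = filter(bF, aF, 5*ones(N,1));
ynyq   = filter(bF, aF, (-1).^t);
```

A first-order filter only attenuates slow variations rather than removing them completely, so the pole location is a trade-off between drift rejection and distortion of the low-frequency dynamics of interest.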

Outliers and missing data
In practice, data acquisition equipment is not perfect. It may be that single values of the input-output data are missing or corrupt due to malfunctioning of sensors or communication links.
A practical method to deal with this problem is to replace outliers or missing measurements by smoothed estimates prior to parameter estimation.
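One simple way to carry out such a replacement is sketched below. It is not the method of the lecture: linear interpolation between good neighbours stands in for the smoothed estimate, outliers are flagged with a median-based rule, and the threshold factor 5 and the test signal are assumptions.

```matlab
% Hypothetical record: a slowly varying signal with one missing sample
% (stored as NaN) and one corrupted sample
t = (1:20)';
y = 0.5*t;
y(7)  = NaN;      % missing measurement
y(13) = 100;      % outlier

% Flag missing samples and samples far from the median, using the median
% absolute deviation (MAD) as a robust estimate of the spread.
% The factor 5 is an assumed threshold, not a universal choice.
valid = ~isnan(y);
med   = median(y(valid));
sigma = 1.4826*median(abs(y(valid) - med));
bad   = ~valid | abs(y - med) > 5*sigma;

% Replace flagged samples by linear interpolation between good neighbours
yc = y;
yc(bad) = interp1(t(~bad), y(~bad), t(bad), 'linear');
```

A global median test like this suits records without strong trends; for drifting data the test would be applied to the detrended signal or over a moving window. Flagged samples at the very start or end of the record are not filled by interp1 and need separate treatment.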

Model Validation – Residual Analysis
Once a model has been identified, it is important to validate the model using a data set that should be independent of the data used to calculate the model parameters.
The ‘leftovers’ of the modeling process, the part of the data that the model could not reproduce, are the residuals:
$$\varepsilon(t) = \varepsilon(t, \hat{\theta}_N) = y(t) - \hat{y}(t, \hat{\theta}_N)$$
These residuals carry information about the quality of the model.
When doing model validation, one typically computes the Mean Square Error, the autocorrelation of the residuals, and the cross-correlation between the residuals and the input. The values of the autocorrelation and cross-correlation should be small and lie within certain confidence limits.

The Mean Square Error
This is the average of the squared error:
$$\mathrm{MSE} = \frac{1}{N}\sum_{t=1}^{N} \varepsilon^2(t)$$
This is a measure, in a single positive number, of how well the model output fits the measured data.
Auto-correlation of the residuals.
As the residuals are assumed to be a white noise sequence, it is good to check their auto-correlation:
$$R_\varepsilon^N(\tau) = \frac{1}{N}\sum_{t=1}^{N} \varepsilon(t)\,\varepsilon(t-\tau)$$
for different values of τ = 1, 2, 3, 4, …
If these numbers are not small for τ ≠ 0 then part of ε(t) could have been predicted from past data, and so this is a sign of deficiency in the model.
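The Mean Square Error and the auto-correlation test can be computed with a few lines of base MATLAB. In the sketch below the residual vector e is a random stand-in (in practice it would be the difference between measured and predicted output on validation data), and the limit 1.96/sqrt(N) is the usual approximate 95% bound for the normalised auto-correlation of a white sequence.

```matlab
N = 500;
e = randn(N,1);                 % stand-in for residuals e(t) = y(t) - yhat(t)

MSE = mean(e.^2);               % mean square error, equal to R_e(0)

maxlag = 25;
Re = zeros(maxlag+1,1);         % Re(tau+1) holds R_e(tau), tau = 0..maxlag
for tau = 0:maxlag
    % terms with t - tau < 1 are omitted from the sum
    Re(tau+1) = sum(e(tau+1:N).*e(1:N-tau))/N;
end

rho   = Re/Re(1);               % normalised auto-correlation, rho(1) = 1
bound = 1.96/sqrt(N);           % approximate 95% limit for white residuals
nOut  = sum(abs(rho(2:end)) > bound);   % number of lags outside the limits
```

For white residuals only about one lag in twenty is expected to fall outside the limits, so a large nOut, or a clear pattern in rho, points to a deficient model.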

Cross-correlation between the residuals and the input
Similarly, the residuals should not be correlated with the input, so it is also good to check the cross-correlation of the residuals and the input:
$$R_{\varepsilon u}^N(\tau) = \frac{1}{N}\sum_{t=1}^{N} \varepsilon(t)\,u(t-\tau)$$
If there are traces of past inputs in the residuals, then there is a part of y(t) that originates from the past input and that has not been properly picked up by the model. Hence, the model could be improved.
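The following sketch shows how a trace of a past input shows up in the cross-correlation. The signals are synthetic: a contribution from u(t-3) is deliberately left in the residuals, and the confidence limit 1.96/sqrt(N) for the normalised cross-correlation is an approximation that assumes the residuals are otherwise white.

```matlab
N = 500;
u = sign(randn(N,1));                     % hypothetical binary input
e = 0.1*randn(N,1);                       % white part of the residuals
e(4:N) = e(4:N) + 0.5*u(1:N-3);           % deliberately leave a trace of u(t-3)

maxlag = 10;
Reu = zeros(maxlag+1,1);                  % Reu(tau+1) holds R_eu(tau)
for tau = 0:maxlag
    Reu(tau+1) = sum(e(tau+1:N).*u(1:N-tau))/N;
end

r     = Reu/sqrt(mean(e.^2)*mean(u.^2));  % normalised cross-correlation
bound = 1.96/sqrt(N);                     % approximate 95% limit

[~, imax] = max(abs(r));
worstLag  = imax - 1;                     % lag with the largest correlation
```

Here the largest value is found at lag 3 and lies well outside the limits, which would suggest that the dependence of the output on u(t-3) is not captured by the model.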

Subspace Identification – An introduction
A linear system can always be represented in state space form:
$$x(k+1) = A\,x(k) + B\,u(k) + w(k)$$
$$y(k) = C\,x(k) + D\,u(k) + v(k)$$
where:
x is an n-dimensional state vector
u is an nu-dimensional input vector
y is an ny-dimensional output vector
v is an ny-dimensional measurement noise vector
w is an n-dimensional process noise vector
A, B, C and D are parameter matrices of the appropriate dimensions.

The main idea behind subspace identification techniques is that, given the input-output data sequences u(t), y(t), t = 1…N, the state sequence x(t), t = 1…N, is estimated first, and then the state space matrices A, B, C, D are found using a least squares procedure.
Assume for a moment that not only y and u are measured, but also the state vector x. Now, with known u, y and x we can form a linear regression from the state space model above:
$$Y(t) = \begin{bmatrix} x(t+1) \\ y(t) \end{bmatrix}, \qquad \Theta = \begin{bmatrix} A & B \\ C & D \end{bmatrix}$$
$$\Phi(t) = \begin{bmatrix} x(t) \\ u(t) \end{bmatrix}, \qquad E(t) = \begin{bmatrix} w(t) \\ v(t) \end{bmatrix}$$
Then the state space model above may be written as:
$$Y(t) = \Theta\, \Phi(t) + E(t)$$
From this equation, the matrix elements in Θ can be estimated by the simple least squares method.
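The least squares step can be illustrated on simulated data where the state is known. The system matrices, the input signal and the initial state below are hypothetical, and noise is left out so that the estimate can be compared directly with the true matrices.

```matlab
% Hypothetical second-order system with one input and one output
A = [0.8 0.1; -0.2 0.5];  B = [1; 0.5];  C = [1 0];  D = 0;
N = 200;
k = 1:N;
u = sin(0.3*k) + 0.5*cos(1.7*k);          % input containing several frequencies
x = zeros(2, N+1);
x(:,1) = [1; -1];
y = zeros(1, N);
for t = 1:N
    y(t)     = C*x(:,t) + D*u(t);
    x(:,t+1) = A*x(:,t) + B*u(t);
end

% Stack the regression Y(t) = Theta*Phi(t) + E(t) column by column
Y   = [x(:,2:N+1); y];                    % rows: x(t+1) and y(t)
Phi = [x(:,1:N);   u];                    % rows: x(t) and u(t)
Theta = Y/Phi;                            % least-squares solution of Theta*Phi = Y

% Read the state space matrices out of Theta = [A B; C D]
Ahat = Theta(1:2,1:2);  Bhat = Theta(1:2,3);
Chat = Theta(3,1:2);    Dhat = Theta(3,3);
```

With noisy data the same expression gives the least squares estimate, and the residual Y - Theta*Phi gives an estimate of the noise terms E(t).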

How do we obtain the state sequence x(t), t = 1…N, from the input-output data?
All state vectors that can be reconstructed from input-output data are linear combinations of the n k-step ahead output predictors (see Ljung 1999):
$$\hat{y}(t+k \mid t), \qquad k = 1, \ldots, n$$
where n is the model order (the dimension of x).
We can form these predictors and select an algebraic basis among their components:
$$x(t) = L \begin{bmatrix} \hat{y}(t+1 \mid t) \\ \vdots \\ \hat{y}(t+n \mid t) \end{bmatrix}$$
The choice of L will determine the basis of the state space realization. The predictor $\hat{y}(t+k \mid t)$ is a linear function of u(s), y(s), s = 1, …, t.
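The role of L can be illustrated in a simplified setting. The sketch below assumes a hypothetical noise-free system and sets the future inputs to zero, so that the n stacked predictions are the observability matrix times the state at the first predicted instant. The time index attached to the state depends on the convention used and is not the point of the example.

```matlab
% Hypothetical noise-free system (values made up for illustration)
A = [0.8 0.1; -0.2 0.5];  C = [1 0];
x = [1; -1];                               % "true" state, unknown in practice

% With noise and future inputs set to zero, the outputs at the instant of
% x and at the following instant are C*x and C*A*x
Obs = [C; C*A];                            % observability matrix for n = 2
P   = Obs*x;                               % stacked predictors

% Choosing L = inv(Obs) recovers the state in the original basis
La = inv(Obs);
xa = La*P;

% Any other invertible L gives the state in another basis, z = T*x, which
% belongs to the equivalent realization (T*A/T, T*B, C/T)
T  = [2 1; 0 1];
Lb = T/Obs;
z  = Lb*P;

% The transformed realization produces the same predictions from z
At = T*A/T;  Ct = C/T;
Pz = [Ct; Ct*At]*z;
```

Both choices of L describe the same input-output behaviour; they differ only in the coordinates used for the state.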
The method is called subspace identification because it is based on subspace projections, a concept from linear algebra.
A well known algorithm for subspace identification is the so-called N4SID, originally developed by Peter Van Overschee at the University of Leuven, Belgium.

EXAMPLE
An identification experiment has been carried out on a two tank mixing process. This process was located at the Control Engineering Centre Laboratory at City University, London, and a schematic diagram is given in Figure 1.
[Figure 1: The two tank mixing process. Schematic of Tank 1 and Tank 2 with inputs ucold and uhot and measured variables L1, L2, T1, T2]

The process has two manipulated inputs, which are the openings (in percentage) of the cold (ucold) and hot (uhot) water valves.

The four measured variables are the levels in tanks 1 and 2, L1 and L2 (in cm), and the temperatures in tanks 1 and 2, T1 and T2 (in degrees C).
The experiment was carried out in open loop by manually specifying the values of the valve openings.
The input sequences to the process were chosen to be binary pseudo-random signals shifting between 10% and 40% for the cold water valve, and between 20% and 40% for the hot water valve.
The sampling time used was 20 s. The data set consists of 190 samples.
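Input sequences of this kind can be generated with a linear feedback shift register. The sketch below is not the signal used in the experiment: the register length, the number of samples per bit (nhold) and the shift of 40 bits between the two inputs are assumptions chosen for illustration; only the signal levels and the number of samples come from the description above.

```matlab
% 7-bit linear feedback shift register, period 2^7 - 1 = 127
nbits = 7;
reg   = ones(1, nbits);                    % any non-zero start state works
Nper  = 2^nbits - 1;
bits  = zeros(2*Nper, 1);                  % two periods of the sequence
for k = 1:2*Nper
    bits(k) = reg(nbits);                  % output the last register stage
    newbit  = xor(reg(7), reg(6));         % feedback from stages 7 and 6
    reg     = [newbit, reg(1:nbits-1)];    % shift by one stage
end

% Hold each bit for several samples and map the bits to valve openings
N     = 190;                               % number of samples
nhold = 5;                                 % samples per bit (assumed)
idx   = floor((0:N-1)'/nhold) + 1;         % bit index for each sample
ucold = 10 + 30*bits(idx);                 % cold valve: 10% or 40%
uhot  = 20 + 20*bits(idx + 40);            % hot valve: 20% or 40%, shifted copy
```

Holding each bit for several samples moves the signal energy towards lower frequencies, which is usually appropriate when the process is slow compared with the sampling time.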

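MATLAB code to estimate a state space model for the level in Tank 1 (uses the System Identification Toolbox):
load ex090398;                      % data file; must define L1, ucold and uhot
mix = iddata(L1, [ucold uhot]);     % output L1, inputs ucold and uhot
                                    % (sampling time left at its default of 1)
mixd = detrend(mix, 0);             % remove the mean values
mixe = mixd(1:95);                  % estimation data: first 95 samples
mixv = mixd(96:190);                % validation data: remaining 95 samples
ssmodel = n4sid(mixe, 2);           % second-order state space model
compare(mixv, ssmodel);             % measured against simulated output
figure; resid(mixv, ssmodel);       % residual tests in a separate figure
present(ssmodel)                    % display the estimated model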
[Figure: experimental data plotted against sample number (0 to 200) in three panels: L1 (cm), ucold (%) and uhot (%)]

[Figure: measured (mixv) and simulated (ssmodel) output y1 against sample number; fit: 93.25%]
[Figure: correlation function of residuals, output y1, against lag]
[Figure: cross-correlation function between input u1 and residuals from output y1, against lag]

These are the model parameters returned by N4SID:

A =
              x1           x2
   x1    0.90194     0.053728
   x2   -0.48226     -0.15668

B =
              u1           u2
   x1  0.0027808    0.0010754
   x2  0.0085512   -0.0001901

C =
              x1           x2
   y1     41.407      -2.2392

D =
              u1           u2
   y1          0            0
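A few properties of the identified model can be read from these matrices with base MATLAB. The sketch below is an added illustration, not part of the original example: it checks stability, converts the positive real pole to a time constant using the 20 s sampling time, and computes the steady-state gains. Since the data were detrended, the gains relate deviations from the mean values.

```matlab
% Matrices as reported by N4SID in the example
A = [ 0.90194    0.053728;
     -0.48226   -0.15668 ];
B = [ 0.0027808  0.0010754;
      0.0085512 -0.0001901 ];
C = [ 41.407    -2.2392 ];
D = [ 0 0 ];
Ts = 20;                                   % sampling time in seconds

poles  = eig(A);                           % discrete-time poles
stable = all(abs(poles) < 1);              % all inside the unit circle?

% Time constant of each positive real pole, from pole = exp(-Ts/tau)
isPosReal = (imag(poles) == 0) & (real(poles) > 0);
tau = -Ts./log(real(poles(isPosReal)));

% Steady-state gain from each valve opening (%) to the level (cm)
K = C*((eye(2) - A)\B) + D;
```

The sign and size of the gains in K, and the time constant tau, can be compared with physical expectations for the tank as a further plausibility check of the model.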