
Intelligent Systems

Dag Björklund

September 8, 2010
Contents

1 Introduction 3
1.1 Intelligent Systems and Soft computing . . . . . . . . . . . . . 3
1.2 System Identification . . . . . . . . . . . . . . . . . . . . . . . 4

2 Neural Networks 5
2.1 The Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Activation functions . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Step Function (Threshold Function) . . . . . . . . . . . 8
2.2.2 Piecewise Linear . . . . . . . . . . . . . . . . . . . . . 9
2.2.3 Sigmoid Function . . . . . . . . . . . . . . . . . . . . . 9
2.2.4 Other Activation Functions . . . . . . . . . . . . . . . 9
2.2.5 Why Use Activation Functions . . . . . . . . . . . . . . 9
2.3 Network Architectures . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Knowledge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Designing and using a Neural Network . . . . . . . . . . . . . 11
2.6 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Learning Processes . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7.1 Learning with a teacher (supervised learning) . . . . . 12
2.7.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . 15
2.7.3 Learning tasks . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.9 Adaline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Function Approximation . . . . . . . . . . . . . . . . . . . . . 16
2.10.1 Linear Regression and Multiple Regression . . . . . . . 16
2.10.2 Nonlinear Models . . . . . . . . . . . . . . . . . . . . . 18
2.10.3 Neural Networks for Function Approximation . . . . . 18
2.11 Dynamic Systems and Neural Networks . . . . . . . . . . . . . 18
2.12 Neural Control . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.12.1 Direct Neural Control . . . . . . . . . . . . . . . . . . 21
2.12.2 Indirect Neural Control . . . . . . . . . . . . . . . . . . 22
2.12.3 Example: Temperature Controller Design . . . . . . . . 23

2.13 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3 Fuzzy Systems 30
3.1 Fuzzy Sets and Linguistic Variables . . . . . . . . . . . . . . . 30
3.1.1 Linguistic variables . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Fuzzification . . . . . . . . . . . . . . . . . . . . . . . . 34
3.1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2 Fuzzy If-Then Rules . . . . . . . . . . . . . . . . . . . . . . . 35
3.2.1 Evaluating a Single rule, with one Premise . . . . . . . 35
3.2.2 Evaluating a Single rule, with Several Premises . . . . 37
3.2.3 Evaluating Several Rules . . . . . . . . . . . . . . . . . 38
3.3 Set-Theoretic Operations . . . . . . . . . . . . . . . . . . . . . 39
3.4 Defuzzification . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.1 The Centroid Method . . . . . . . . . . . . . . . . . . 40
3.4.2 Mean of Maximum (middle of maxima) . . . . . . . . . 40
3.4.3 First of Maxima Method . . . . . . . . . . . . . . . . . 40
3.5 Fuzzy Control . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.1 Example: Inverted Pendulum on a Cart . . . . . . . . 41
3.5.2 Stability of Fuzzy Control Systems . . . . . . . . . . . 47
3.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

4 Neuro-Fuzzy and Fuzzy-Neural Control 49

5 Reinforcement Learning 50
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.1.1 Elements of Reinforcement Learning . . . . . . . . . . 51
5.2 Example: Riding a Bicycle . . . . . . . . . . . . . . . . . . . . 51
5.3 Example: Jackpot Journey . . . . . . . . . . . . . . . . . . . . 52
5.4 Credit Assignment Problem . . . . . . . . . . . . . . . . . . . 53
5.5 Temporal-Difference Learning . . . . . . . . . . . . . . . . . . 54
5.5.1 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . 54
5.6 Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

Chapter 1

Introduction

1.1 Intelligent Systems and Soft computing


Soft computing is a collection of methodologies for intelligent systems aimed
at dealing with the imprecision of the real world. The guiding principle
of soft computing is to tolerate imprecision, uncertainty and partial truth.
Soft computing methods can be used in many different fields, e.g. system
identification and function approximation, control, autonomous systems and
artificial intelligence, signal processing, image processing, medical applica-
tions, statistics, etc. Traditional methods work well for mathematically well
defined problems, while soft computing can be used for complex and un-
certain problems that might not be completely mathematically understood.
Soft computing methods can be robust and noise tolerant. The collection
of methodologies comprising soft computing include e.g. fuzzy logic, neural
networks (neurocomputing) and genetic algorithms. These are often comple-
mentary rather than competing methodologies and can often be combined in
order to create intelligent systems. The strengths of the different method-
ologies are summarized in the table below.

Methodology              Strength
Neural network           Learning and adaptation
Fuzzy set theory         Knowledge representation via fuzzy if-then rules
Genetic algorithms       Systematic random search
Reinforcement learning   Learning sequences of actions based on previous failure and success

1.2 System Identification
The problem of determining a mathematical model for an unknown system by
observing its input-output data pairs is referred to as system identification.
The purposes of system identification are multiple:

• To predict future outputs

• To understand input-output relationships

• To design controllers

• To do simulations of the system

System identification usually involves two steps:

• Structure identification
In this step we apply a priori knowledge about the target system to
determine a class of models that we think could be fitted to the system.
We usually know a great deal about industrial processes, for instance,
so that we can choose a model that can be fitted by tuning parameters.

• Parameter identification
In this step we apply optimization techniques to determine the parameters
of the model chosen in step one.

After these steps we usually also do validation tests to check if the identified
model responds correctly to an unseen data set. The first two steps are
repeated if not. In statistics one talks about linear regression and curve fitting,
which is really the same thing: system identification.

Chapter 2

Neural Networks

The brain processes incomplete information obtained by perception at an


incredibly rapid rate. Nerve cells function about 10^6 times slower than elec-
tronic gates, but human brains process visual and auditory information much
faster than modern computers. This is because of the massively parallel de-
vice the brain is. Neural networks (NN), or rather artificial neural networks
(the brain is a natural neural network), try to mimic this structure. Neural
networks consist of large amounts of simple processing units interconnected
via weighted links, synapses. Commonly neural networks are adjusted, or
trained, so that a particular input leads to a specific target output. This is
illustrated in Figure 2.1. Here the whole neural network is shown as a box, a
black box. We don’t know what the neural network looks like inside, we just
present it with example training sets. This picture is a rather embellished
one. There is software available that we could use with little knowledge about
the inner workings of neural networks, but we need more information and
will take a closer look.
A NN is a massively parallel distributed processor. A NN consists of:

Figure 2.1: Training a neural network

• Simple processing units that can store experience

• A learning process

Some features of neural networks are:

• Input-output mapping.
By presenting random examples together with the desired output, a
neural network can be trained to generalize and solve problems it has
not previously encountered. Neural networks provide a generic method
of mapping or representing input-output relationships.

• Adaptive
A NN can adapt to changes in its surroundings (remember that adap-
tive often means problems. If the system reacts too fast, it reacts
strongly to noise and can become unstable. If it reacts too slowly,
noise is filtered out, but the system might become too slow to react to
changes in the environment.)

• Confidence measure of decisions


A NN can tell how certain it is about its decisions. We can thus avoid
interpreting pure noise as some meaningful input. This can sometimes
be a difficult problem for traditional methods.

• VLSI implementation.
NN:s are as mentioned massively parallel. Unfortunately instruction
processors are sequential. In order to take advantage of the parallelism,
NN:s can be implemented in hardware, i.e.. ASIC chips.

2.1 The Neuron


A neural network consists of small interconnected processing units, neurons.
These mimic the structure of the neurons in the brain (Figure 2.2). A model
of an artificial neuron is shown in Figure 2.3. The adder produces a linear
combination of the inputs multiplied by weights w, which the activation
function usually squashes/saturates to some limits. The weights are the
parameters that we adjust when we actually train a neural network. The
index k in the figure denotes that this is the k th neuron in some neural
network. The neuron has m inputs. The bias bk can be used to adjust the
output to a suitable range before the activation function. Mathematically

Figure 2.2: A neuron in the brain

Figure 2.3: An artificial neuron

the output of the linear combiner (summation element) is given by:

v_k = Σ_{j=1}^{m} w_kj x_j + b_k

and the output of the neuron with the activation function applied becomes

y_k = φ(v_k)

(sources [Hay99])
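As a minimal numerical sketch of these two equations, the output of a single
neuron can be computed in a few lines of MATLAB. The weights, bias and
input values below are arbitrary illustrative choices, not values from the text:

% Minimal sketch of a single artificial neuron (illustrative values)
x = [0.5; -1.2; 0.3];           % m = 3 inputs x_1..x_3
w = [0.8; 0.1; -0.4];           % synaptic weights w_k1..w_k3
b = 0.2;                        % bias b_k
v = w'*x + b;                   % linear combiner output v_k
phi = @(v) 1./(1 + exp(-v));    % sigmoid activation function
y = phi(v)                      % neuron output y_k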

2.2 Activation functions


The activation function ϕ(·) seen in the model of the neuron is usually a step
function, a linear or piecewise linear function, or a sigmoid function. These
are summarized in Figure 2.4, and a brief explanation follows.

a) Threshold (step) function:

φ(v) = 1 if v ≥ 0;  0 if v < 0

b) Piecewise linear (saturation):

φ(v) = 1 if v ≥ +1/2;  v if −1/2 < v < +1/2;  0 if v ≤ −1/2

c) Sigmoid:

φ(v) = 1 / (1 + e^(−av))

Figure 2.4: a) Threshold b) piecewise linear (saturation) c) sigmoid

2.2.1 Step Function (Threshold Function)


These kinds of step activation functions are useful for binary classification
schemes. In other words, when we want to classify an input pattern into one
of two groups, we can use a binary classifier with a step activation function.
Another use for this would be to create a set of small feature identifiers. Each
identifier would be a small network that would output a 1 if a particular
input feature is present, and a 0 otherwise. Combining multiple feature
detectors into a single network would allow a very complicated clustering or
classification problem to be solved.

2.2.2 Piecewise Linear
A piecewise linear activation function can work as a step function, if the
linear region is made very small, or it can work as a linear combiner, if the
linear region is used without running into saturation. It can naturally also
be used as an output-limiting saturation device.

2.2.3 Sigmoid Function


The sigmoid has the property of being similar to the step function, or the
piecewise linear for that matter, but with the addition of a region of uncer-
tainty. The sigmoid function is also continuous and differentiable, which is
a requirement for gradient based learning algorithms. Sigmoid functions in
this respect are very similar to the input-output relationships of biological
neurons, although not exactly the same.

2.2.4 Other Activation Functions


The three types above are often the only ones listed in literature. However, if
you e.g. look in the Matlab Neural networks toolbox you find several more.

A Pure Linear Function


One that often feels to be missing from the list above is a pure linear function:

ϕ(v) = v

In order to use the piecewise linear activation function as a linear combiner,


you would need to keep the induced field v between -0.5 and 0.5, which can
of course be done by tuning the weights in order to scale v, but where do you
unscale?

2.2.5 Why Use Activation Functions


Without activation functions a neural network would simply be a linear func-
tion of linear functions, that is, again a linear function. However, it is the
nonlinearity that makes multilayer networks so powerful.
For hidden units, sigmoid activation functions are usually preferable to
threshold functions. Networks with threshold units are difficult to train be-
cause the error function is stepwise constant, hence the gradient does not
exist or is zero. The backpropagation algorithm, for example, is gradient
based and thus requires the activation functions to be differentiable. With

sigmoid units, a small change in the weights will usually produce a change in
the output, which makes it possible to tell whether that change in the weights
is good or bad. With threshold units, a small change in the weights will often
produce no change in the outputs.
For the output units, you should choose an activation function suited to
the distribution of the target values.
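A quick way to convince yourself that the nonlinearity matters is to compose
two layers without activation functions. The sketch below, with arbitrary
illustrative weight matrices, shows that the result collapses into a single
linear map:

% Two linear layers without activation functions collapse into one linear map
W1 = [1 -2; 0.5 1; 2 0];        % hidden layer weights (3 neurons, 2 inputs)
W2 = [1 0.5 -1];                % output layer weights (1 neuron, 3 hidden inputs)
x  = [0.7; -0.3];               % input vector
y_two_layers = W2*(W1*x);       % "network" with identity activations
y_one_layer  = (W2*W1)*x;       % equivalent single linear layer
% The two outputs are identical, so without nonlinear activation functions
% the extra layer adds no representational power.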

2.3 Network Architectures


A single neuron does not possess much intelligence. Neurons are connected
in different ways, i.e. network architectures. The neurons, or nodes, are orga-
nized in one or more layers. We will limit ourselves to feedforward networks
where the neurons in each layer feed their output forward to the next layer
until we get the final output from the neural network, as opposed to feedback
networks where nodes can feed their output back to nodes in previous layers.
A simple feedforward network with one layer is shown in Figure 2.5.

Figure 2.5: A single layer feedforward fully connected neural network (a layer
of source nodes feeding an output layer of neurons)

Notice that the weights, biases and activation functions are omitted from
this figure, as they usually are when network structures are drawn. Notice
also that this single layer network has a layer of source nodes and a layer of
neurons, the output layer. It only has one layer of neurons, thus it is a single
layer network.
A two layer network is shown in Figure 2.6. Each input is sent to every
neuron in the hidden layer, and each hidden neuron's output is then
connected to every neuron in the next layer, i.e. the network is fully
connected. The hidden layer is not directly connected to the output, hence its name.
There can be any number of hidden layers within a feedforward network, but
one usually suffices for most problems you will tackle.

Figure 2.6: Network with one hidden layer and one output layer. Fully
connected feedforward (input layer of source nodes, hidden layer of neurons,
output layer of neurons)

2.4 Knowledge
”Stored information used by a person or machine to interpret, predict and
respond to the outside world.” (Fischler et al.) A NN shall learn a model of
its surroundings and maintain it.

2.5 Designing and using a Neural Network


We look at how a neural network is constructed and used through an incomplete
example (we will see a full implementation later).

Example design: character (digit) recognition. The construction of a neural
network for recognizing handwritten digits 0-9 proceeds through the following steps:

1. A network architecture is chosen


We choose the number of inputs = number of pixels in the images. We
choose the number of outputs to be 10, i.e. the number of different
characters to be recognized. Ten outputs also means ten neurons in
the output layer. So in this case we will have one output neuron rep-
resenting one character (we could also have a single output that would
be a code for each character). We use x nodes in y hidden layers?

2. Example learning sets are presented to the network, which is trained by


some learning algorithm. Both positive and negative examples (counterexamples)
can be given.

3. We test the network with examples it has not previously seen. The
network should generalize to be able to recognize these examples.

2.6 Transformations
Usually objects need to be identified although they are transformed, e.g.
rotated, moved or scaled, in different ways compared to the original training
set. This can be achieved in a couple of different ways, like:
1. Invariance by structure
For example, invariance to in-plane rotations around a center point can
be achieved by forcing all synaptic weights at equal distances from the
center to be equal. But what happens in the training process if we e.g. try to
recognize some complex patterns in an image? We tune the weights to
better match an input image, but do we then pick one weight at a given
distance from the origin and overwrite all other weights at the same
distance with it? The solution lies in duplicating the synapses for
each pixel in the image, which makes the network computationally
intensive even for moderate problem sizes.

2. Invariance by training
In principle a network can be trained to recognize transformations of
the same object, which requires presenting it with those transforma-
tions. The question is whether or not the network's learning capacity
will be enough for learning to map transformed versions of patterns into
the same output. This method also requires large amounts of training
data.
Not only objects in a shape recognition system are subject to transformations,
of course. In voice recognition, as another example, a voice pattern
arrives at different time instants and at different pitch depending on the
speaker.

2.7 Learning Processes


2.7.1 Learning with a teacher (supervised learning)
Error-correction learning
To illustrate our first learning rule, consider the simple case of one neuron k
constituting the only computational node in the output layer of a network.

As error-correction learning is a supervised learning algorithm, we present
the neuron with inputs x1 , . . . , xm and a corresponding desired output dk . We
do this over and over again during the training process, with different input
vector → output pairs. We denote the nth input data set as x1(n), . . . , xm(n)
and the nth output yk(n), which should be close to the nth desired output
dk(n) after training.
The error from our neuron k during the nth round is given by:

ek (n) = dk (n) − yk (n)

The aim is to tune the weights in order to reduce this error. The error-
correction learning is illustrated in Figure 2.7.

Figure 2.7: Illustrating error-correction learning

At each step we will update

the weights to neuron k according to:

∆wkj (n) = µek (n) · xj (n)

where µ is a positive constant that determines the rate of learning as we


proceed from one input-data→reference-answer pair of learning samples to the next.
So when we take the next set of training data, we will use the new
weights:

wkj(n + 1) = wkj(n) + ∆wkj(n)
Try to think about how µ will affect the learning. Too large a value will
overfit the weights to each set of data. This could perhaps be ok if the data
contained no noise at all. In reality, we have noisy data, and the network
should learn to approximate the original function based on the noisy samples.
With µ too large the training will never converge, but will oscillate.
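A minimal sketch of the error-correction rule in MATLAB is given below; the
target function, the learning rate µ and the number of iterations are
illustrative assumptions, not values from the text:

% Error-correction (delta rule) training of a single linear neuron
rng(1);
w  = zeros(2,1); b = 0;          % weights and bias of neuron k
mu = 0.05;                       % learning rate
for n = 1:2000
    x = randn(2,1);              % nth input vector
    d = 2*x(1) - 3*x(2) + 1;     % desired output (unknown to the neuron)
    y = w'*x + b;                % neuron output (linear activation)
    e = d - y;                   % error e_k(n) = d_k(n) - y_k(n)
    w = w + mu*e*x;              % delta w_kj(n) = mu*e_k(n)*x_j(n)
    b = b + mu*e;                % the bias is updated in the same way
end
% w should now be close to [2; -3] and b close to 1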

The Backpropagation Algorithm
The error-correction learning presented above only works for single layer
networks. When we have hidden nodes in our neural network, we will face a
credit-assignment problem. We only see the output of the network, which we
can compare with a desired output in order to tune the output neurons, but
we do not directly know how the hidden layer nodes can be held accountable
for the error e = d − y. This can however be solved by, rather complicated,
backpropagation algorithms.
I will briefly outline a gradient-descent based backpropagation algorithm.
Gradient descent means we use derivatives of the activation functions to see
in which direction weights should be tuned in order to reduce the error.

Case 1: Neuron j is an output node For an output node j we can


calculate the error as before as:

ej (n) = dj (n) − yj (n)

(only subscripts changed from error-correction version). When we have the


error, we can calculate the weight update as (skipping 2.5 pages of mathe-
matical derivations):
∆wji = µδj (n)yi (n)
Where the local gradient σj (n) is defined by

δj = ej (n)ϕ0j (vj (n))y (2.1)

That is, we take the derivative φ' of the activation function (the activation
function needs to be differentiable) and adjust the weights proportionally to it,
to the error magnitude, and to the magnitude of the inputs to the neuron. Note,
for instance, that the sigmoid function has zero derivative in the saturated
regions, so there the weights will not be changed (∆wji = 0). The derivative is
roughly constant in the linear region of the sigmoid, and decreases when approaching
saturation.

Case 2: Neuron j is a hidden node When neuron j is located in a


hidden layer of the network, there is no specified desired responce for that
neuron. Instead, the error signal for a hidden neuron is determined recur-
sively in terms of the error signals of all the neurons to which that hidden
neuron is directly connected. So the ej (n) in Equation 2.1 needs to be re-
placed with something. That something, as below, is a sum of all the local
gradients of the output nodes k that j is connected to, times the weight
between j and k:

δj(n) = φ'j(vj(n)) Σ_k δk(n) wkj(n)

2.7.2 Unsupervised Learning


Competitive Learning

∆wkj = µ(xj − wkj)   if neuron k wins the competition
∆wkj = 0             otherwise
This learning rule moves the synaptic weight vector wk toward the input
vector x.

Hebbian learning
Self-organizing Maps
Reinforcement Learning
Genetic algorithm based learning

2.7.3 Learning tasks


• Pattern recognition
Humans are extremely good at pattern recognition.
Pattern recognition is about mapping a pattern or a signal to a cate-
gory. A neural network for pattern recognition can e.g. be constructed
with two layers: One layer for feature extraction, which can be trained
without a teacher, and a layer for classification, which can be trained
by a teacher.
• Function approximation
• Control
Humans are again a very good example of how neural networks can
control thousands of actuators (muscle fibers) depending on noisy input
from many different senses. A regulator has to model the system under
control. A neural network can be trained to mimic the system.
• Filtering: the cocktail party problem.
• Beamforming: the bat localizes and follows its prey with its radar.

2.8 Perceptron
You will encounter the word perceptron when you use different NN tools. A
perceptron is just a certain kind of neuron and a certain training algorithm.
The activation function of a perceptron is a hard limiter (threshold), i.e. it
has a binary output (0/1 or −1/+1). It can thus only classify the input into two
classes. In order to classify the input into several classes, several neurons
are required. Training of a single layer perceptron is done using an error-
correction algorithm:

1. If the correct classification was achieved, no update of weights is per-


formed.

2. Otherwise the weights are updated as

w(n + 1) = w(n) − µ(n)x(n) if w^T(n)x(n) > 0 and x(n) belongs to class 2.

w(n + 1) = w(n) + µ(n)x(n) if w^T(n)x(n) ≤ 0 and x(n) belongs to class 1.
Where µ again is the learning-rate parameter. If µ(n) is constant, we have a
fixed increment adaptation rule. It can be proven that the learning converges
if the classes are linearly separable.
A multilayer Perceptron is trained using the backpropagation algorithm.
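The sketch below illustrates the fixed increment rule on a small, linearly
separable toy problem; the data, learning rate and number of epochs are
illustrative assumptions. It uses the equivalent "update only on error"
formulation of the two rules above:

% Perceptron training on a linearly separable toy problem
rng(2);
X = [randn(2,20)-1, randn(2,20)+1];      % 40 two-dimensional points
t = [zeros(1,20), ones(1,20)];           % class labels 0 (class 1) and 1 (class 2)
w = zeros(2,1); b = 0; mu = 0.1;         % weights, bias, learning rate
for epoch = 1:50
    for n = 1:size(X,2)
        y = double(w'*X(:,n) + b >= 0);  % hard limiter (threshold) output
        e = t(n) - y;                    % -1, 0 or +1
        w = w + mu*e*X(:,n);             % no update when classification is correct
        b = b + mu*e;
    end
end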

2.9 Adaline
2.10 Function Approximation
Function approximation or function fitting is pretty much the same thing as
system identification, where the systems are rather simple, i.e. non-dynamical.
A system is functional if, given a certain input, it always gives the same
output, regardless of previous inputs. The output of a dynamical system
(see Section 2.11) varies depending on the history of inputs. A stationary
dynamic system gives the same output given a history (of some length) of
inputs, while a time varying system can have changing dynamics and behavior
over time.

2.10.1 Linear Regression and Multiple Regression


Before we look at how neural networks can be used for function approxi-
mation, we briefly discuss the standard statistical methods, i.e. regression.
The simplest form of function approximation is fitting a single input single

output linear function to some measured data. That is, a function with one
independent and one dependent variable.

ŷ = a + bx
Figure 2.8 depicts a plot of measured unemployment rates versus economic
growth. It is suspected that the latter depends on the former (could
it be the other way around?). It is further suspected that the dependence
is linear so that a straight line could be fitted to the scattered points, i.e.
parameters a and b below should be tuned:

∆GDP (∆unemployment) = a + b · ∆unemployment

Figure 2.8: Fitting a linear function

When we have a set of measured values for xi and yi, we wish to find a
function ŷ = a + bx such that the sum of squared errors between the measured
y and the ŷ calculated from x using our model is minimized:

min_{a,b} Σ_{i=1}^{n} e_i^2 = min_{a,b} Σ_{i=1}^{n} (y_i − ŷ_i)^2 = min_{a,b} Σ_{i=1}^{n} (y_i − a − b·x_i)^2

It turns out the parameters a and b satisfying the above can be obtained by:

b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)^2

and
a = ȳ − bx̄
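The closed-form expressions for a and b are easy to check numerically; the
sketch below, with invented noisy data, computes them directly and compares
the result with Matlab's polyfit:

% Least-squares fit of y = a + b*x to noisy data (illustrative data)
x = (0:0.5:10)';
y = 3 + 1.5*x + 0.4*randn(size(x));    % "measured" data, true a = 3, b = 1.5
b = sum((x - mean(x)).*(y - mean(y))) / sum((x - mean(x)).^2);
a = mean(y) - b*mean(x);
coeffs = polyfit(x, y, 1);             % coeffs = [b a], same result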
A probably more realistic situation than the single-input/single-output
model above is one where we have more than one independent variable. In
multiple regression we have several independent variables x1, x2, . . . that
affect a dependent variable y as:

ŷ = a + b1 x1 + b2 x2 + . . . + bp xp

Again we then wish to minimize the error between the predicted and mea-
sured outputs ŷ and y:
min_{a,b1,b2,...} Σ_{i=1}^{n} (y_i − ŷ_i)^2

That is, we minimize the squared error. Tools like Excel and Matlab have features
for finding the optimal parameters a, b1, b2, . . .

2.10.2 Nonlinear Models


In the above we assume a linear dependence between one or several inde-
pendent variables and a dependent variable. Many economical and technical
problems can be modeled by such linear models, but often though the re-
lation between input-output is more complex, i.e. nonlinear. Instead of a
linear model explained by a set of constants multiplied by variables we then
have some general function:

ŷ = f (x1 , x2 , . . .)

2.10.3 Neural Networks for Function Approximation


Neural networks are good at fitting functions. In fact, the universal approximation
theorem states that a fairly simple neural network (a single hidden layer with
enough neurons) can approximate any practical function. The simplest
case is a linear function, but the power of neural networks is their ability to
learn nonlinear functions.
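As a small sketch, using the same older toolbox functions newff/train/sim
that appear in the control examples later in this chapter, a one-hidden-layer
network can be fitted to a nonlinear function such as a sine:

% Approximating y = sin(x) with a small feedforward network
x = -3:0.1:3;
y = sin(x);
net = newff(x, y, 10);          % one hidden layer with 10 neurons
net = train(net, x, y);         % backpropagation-based training
yhat = sim(net, x);             % network output over the training range
plot(x, y, x, yhat, '--')       % note: extrapolation outside [-3,3] will be poor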

2.11 Dynamic Systems and Neural Networks


The output of a dynamic system given a certain input depends on previous in-
puts. A dynamic system can mathematically be represented by a differential
equation (analog) or a difference equation (digital). It can also be represented
by a transfer function in the Laplace plane (analog) or the z plane (digital). We
constantly deal with dynamical systems in e.g. control theory and signal
processing (two closely related fields).
Feedforward neural networks have no memory or state. The output of the
system for a given input is the same regardless of previous inputs (the system
is functional). A neural network thus cannot model a dynamical system, or
can it?
We can use several samples in time of the input signal as inputs to the
neural network. The network will then still have the same output given a set
of input samples regardless of older samples, that is it will be time invariant,
but dynamic. Figure 2.9 shows a single input single output (SISO) system.
However the neural network has several inputs, but these are time shifted
samples of the one actual input.

Figure 2.9: Representation of a SISO dynamic system by a feedforward neural
network

This dynamic system can be represented by

y(k) = f(x(k), x(k − 1), x(k − 2))
Where the function f (·) is the neural network. A reader familiar with digital
signal processing might notice that this NN resembles a digital filter. If we
remove the hidden layer, we get a neural network that is structurally the same
as a finite impulse response filter (Figure 2.10). The only difference is in the
tuning of the weights, the weights in the NN are tuned by learning algorithms,
while the weights in a FIR filter are tuned by filter design methods.
We can also have a recursive dynamic system as:

y(k) = f (x(k), x(k − 1), y(k − 1))

That is, where the output y(k) also depends on the previous output y(k − 1). This
difference equation can be modeled by the neural network in Figure 2.11.

Figure 2.10: Comparison of a dynamic neural network and a finite impulse
response (FIR) filter

Figure 2.11: Representation of a recursive SISO dynamic system by a feedforward
neural network

A single layer network with y(k−1), y(k−2), . . . as inputs is again structurally the
same as an infinite impulse response (IIR) filter.
(sources [SX09] Chapter 2.4.1)
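In practice the "memory" is created simply by stacking delayed samples of the
input signal into the training matrix. Below is a sketch, with an arbitrary
random input signal, of how such a tapped-delay-line input matrix could be built:

% Building a tapped-delay-line input matrix so that y(k) = f(x(k), x(k-1), x(k-2))
x = randn(1, 100);                        % input signal samples x(1)..x(100)
P = [x(3:end); x(2:end-1); x(1:end-2)];   % rows: x(k), x(k-1), x(k-2)
% Each column of P is one input vector for the feedforward network; the
% corresponding training targets would be the measured outputs y(3:end).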

2.12 Neural Control


Neural control can be used when we have data available in the form of measurements
of input-output values of a plant. Neural control is specifically useful
when mathematical models of the plant dynamics are not available, which is
often the case in complex systems. A neural network can be used as a ”black
box” model for a plant.
Controllers based on neural networks are also suitable for adaptive control,
where the controller needs to adapt to a changing environment. Neural networks
have proven to be most useful for such time-varying systems. This section
on neural control is mainly based on [SX09] Chapter 2.4.1, [NPWW03] and
[Cal03].

2.12.1 Direct Neural Control


Direct design means that a neural network directly implements the controller,
that is, the controller is a neural network. The network must be trained as
the controller using numerical input-output data.

Supervised Control
The training data could e.g. come from a human controller, that is, the
neural network would learn to act as the human (the human should solely
base his control actions upon the input signal available also to the neural
controller). The training data can also be obtained from an existing con-
troller. This is also called supervised control. You could actually call this
system identification, where the system is the human controller and we try
to identify what he has eaten. The teaching of a neural network to mimic
a human operator is depicted in Figure 2.12. When the network is tuned,
it replaces the human operator as the controller of the plant.

Figure 2.12: Training a) and using b) a neural network to act as a controller
mimicking a human one

It is understandable that copying, and replacing, a human controller is worthwhile,
but why would we want to copy an existing controller that already does the
job? Most traditional controllers are designed around an operating point. This
means that the controller can operate correctly if the plant/process operates
around a certain point. These controllers will fail if there is any sort of
uncertainty or change in the unknown plant. The advantage of neuro-control
is that if an uncertainty in the plant occurs, the ANN will be able to adapt its
parameters and keep controlling the plant when other robust controllers
would fail.

Direct Inverse Control


The next type of control technique researched is direct inverse control. The
advantage of using inverse control over supervised control is that inverse
control does not require an existing controller in training. Inverse control
utilizes the inverse of the system model as shown in Figure 2.13.

Figure 2.13: Direct Inverse Control

When using neural networks for inverse control, a neural network is


trained to model the inverse of the process. When the inverse controller
is cascaded with the process the output of the combined system will be equal
to the setpoint. From the figure, as an example, we get:

Output = Setpoint · Gc(s) · Gp(s)
       = Setpoint · Gp^{-1}(s) · Gp(s)
       = Setpoint

The inverse nonlinearities in the controller cancel out the nonlinearities


in the process. For the nonlinearities to be effectively canceled, the inverse
model must be very accurate.

2.12.2 Indirect Neural Control


Indirect neural control design is based on a neural network model of the
system to be controlled. The controller itself may not be a neural network,
but it is derived from a plant that is modeled by a neural network. This is
similar to standard control in that a mathematical model is needed, but here
the mathematical model is a neural network.

Indirect neural control designs involve two phases. First the plant dy-
namics are identified by a neural network from training data, that is, system
identification. In the second phase the control design can be rather conven-
tional though the controller is derived, not from a standard mathematical
model of a plant, but from a neural network model.

2.12.3 Example: Temperature Controller Design


Consider a naive model for a room to be heated:

y(k) = 0.9y(k−1) + 0.2u(k−1) + 1.5     (2.2)
Here y is the temperature of the room and u is the power in some unknown
unit applied to a heating element. If zero power is applied, the temperature
stays at 15°C (say the temperature surrounding the room is 15°C). If a
power of '5 units' is applied, the temperature will rise to 25°C, which is seen
by applying such a step. This step response is shown in Figure 2.14.


Figure 2.14: Step response of open loop room model
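The step response in Figure 2.14 can be reproduced directly from the difference
equation; the sketch below assumes an arbitrary simulation length and a step
in u at k = 10:

% Simulating the room model y(k) = 0.9*y(k-1) + 0.2*u(k-1) + 1.5
N = 70;
u = [zeros(1,10), 5*ones(1,N-10)];     % power steps from 0 to 5 units at k = 10
y = zeros(1,N); y(1) = 15;             % start at the surrounding temperature
for k = 2:N
    y(k) = 0.9*y(k-1) + 0.2*u(k-1) + 1.5;
end
plot(1:N, y)                           % temperature rises from 15 to 25 degrees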

We will now try out a couple of different control designs for controlling
the temperature of the room, but first of all, why do we need a controller?
We saw from the step response that we get 25°C if we apply the power 5.
But what if we want 22°C? We can guess some power, apply it and come
back an hour later to check the temperature. And then try another power (if
we do this, we have applied feedback to the system). But we want to have
a controller with a centigrade scale, so that we can set a temperature, say
25°C, and after some time, that is the temperature we will have.
Clearly we first of all need to establish some inverse relationship between
the room temperature y and the power u to achieve this.
Secondly, in our heating system we have infinite power at our disposal!
Should we then not crank up the power considerably to achieve the reference
temperature faster, lowering the power when we get close to the target? With
infinite heating power, we can get to the target temperature in one time step.

Direct derivation of inverse controller


In order to establish what power u to apply to get a certain temperature
y(k + 1) given the current temperature y(k) (and infinite power), we can
derive the inverse function u(k) from the above difference equation:

y(k) = 0.9y(k − 1) + 0.2u(k − 1) + 1.5
≡ {move u(k − 1) to the left}
u(k − 1) = (y(k) − 0.9y(k − 1) − 1.5) / 0.2
≡ {substitute k + 1 for k}
u(k) = (y(k + 1) − 0.9y(k) − 1.5) / 0.2

We now have an expression for what power u(k) to apply to achieve a certain
temperature y(k + 1) give the current temperature y(k). The temperature
y(k + 1) is the one we want to achieve, i.e. we set y(k + 1) to the reference
temperature r, i.e. we substitute r(k + 1) for y(k + 1):

u(k) = (r(k + 1) − 0.9y(k) − 1.5) / 0.2     (2.3)
This equation is our controller. It needs as inputs the reference temperature
we wish to achieve r(k + 1) and the current room temperature y(k) and its
output is the control signal, i.e. the power u to be applied to the heating
element. A block diagram (Simulink screenshot) of our controller cascaded
with the room model, along with a Simulink implementation of the controller
is depicted in Figure 2.15. Note that this is now a closed loop feedback
system, as the controller uses the output (the temperature) of the system.
We saw earlier that when cascading an inverse controller with the system, the
output should equal the set point. In the Laplace plane this meant

r · Gc(s) · Gp(s) = r · Gp^{-1}(s) · Gp(s) = r

Figure 2.15: Simulink model of direct inverse controller derived from difference
equation by inverting

In our difference equation model, cascading the controller with the room
model means substituting the function u(k) for the occurrences of u in the
room model y(k) as below:

y(k) = 0.9y(k − 1) + 0.2u(k − 1) + 1.5
≡ {substitute equation 2.3 for u(k − 1)}
y(k) = 0.9y(k − 1) + 0.2 · (r(k) − 0.9y(k − 1) − 1.5)/0.2 + 1.5
≡ {simplify}
y(k) = 0.9y(k − 1) + r(k) − 0.9y(k − 1)
≡ {simplify}
y(k) = r(k)

So using our controller given by equation 2.3 we force the output to directly
follow the reference. The results of the above derivation should come as
no surprise as we just inverted y(k) to obtain u(k). A heating system that
instantly reaches any desired temperature is not practically possible. A more
realistic model is obtained simply by saturating the control signal u, which

would slow down the rise (or fall) time of the temperature.
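A small closed-loop sketch of the inverse controller with a saturated control
signal is shown below; the saturation limits (0 to 10 power units) and the
reference of 22°C are illustrative assumptions:

% Closed-loop simulation of the inverse controller (equation 2.3) with saturation
N = 100; r = 22*ones(1,N+1);                 % constant reference temperature
y = zeros(1,N); y(1) = 15; u = zeros(1,N);
for k = 1:N-1
    u(k) = (r(k+1) - 0.9*y(k) - 1.5)/0.2;    % inverse controller, equation 2.3
    u(k) = min(max(u(k), 0), 10);            % saturate the heating power
    y(k+1) = 0.9*y(k) + 0.2*u(k) + 1.5;      % room model
end
plot(1:N, y)                                 % y now approaches r over several steps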

Supervised control
We can now try to train a neural network to work the same way as our math-
ematically derived controller. Our controller was given by the Equation 2.3
repeated below.
r(k + 1) − 0.9y(k) − 1.5
u(k) =
0.2
Since we know this equation we see that it will be no problem for a neural
network to learn it, we also understand that there is no point in replacing
it with a neural network, but we do it anyway as an example. The neural
network shall have two nodes in the input layer, y and r and one in the
output layer u. In Figure 2.16 a), the training of the neural controller to
mimic the mathematically derived one is illustrated, while Figure 2.16 b)
shows how the NN has replaced the original controller.

Figure 2.16: Training a) and using b) a supervised neural network controller

We create and train a network in Matlab. We run the Simulink model for, say, 500 steps with
random values for the reference temperature r and record the output u from
the controller and y from the system. The neural network in Matlab takes
the set of inputs as a matrix with the input signals on separate rows, with
each test round represented by a column. We create this matrix, create and
train the network, and test it as:
% supervised learning
% create a copy cat of the inverse function controller

% create a two row matrix with the set of samples for the two inputs
p = [r y]';
% create a network with 10 nodes in the hidden layer
net = newff(p, u', 10);
% train the network with the recorded control signal u as target
net = train(net, p, u');

% plot the output from the neural network when fed with the same input
% against the output u from the controller model
% (we use the same data for training and verification here, which is not
% good practice; see the note below)
figure, plot(sim(net, p))
hold on
plot(u, 'r')

% create a simulink model of the neural network
gensim(net)
The obtained simulink model of the neural controller can then replace the
directly derived controller (except the neural network takes a vector input
instead of r and y separately). In the Matlab code above we plot the control
signal of the original controller along with the control signal produced by
the neural controller, and they match (plot excluded). You should really use
different data to test the neural network than the data used for training.

Direct Inverse
The supervised method above was nice and simple, but if you don’t already
have an existing controller you can learn to mimic (usually you don’t) and
you do not even have a model for the plant, you could try to teach a neural
network the inverse of the system. This inverse model should then also
behave the same as the inverse controller we derived from the difference
equation describing the system.
So the task is to do system identification on the inverse room dynam-
ics. The room system has one input and one output, heating power and
temperature respectively. In the inverse model the input/output would then
be reversed. So should our neural network have one input and one output?
The room is a dynamical system. The output depends on previous output.
We cannot create an inverse model that maps the temperature in some time
instant to a specific heating power. Our neural network needs two inputs,
the output y(k) of the room, and the previous output y(k − 1) of the room.
Figure 2.17 a) illustrates the training of the neural network. We do not
have any existing controller to copy cat, instead we try to learn how the

output of the room (temperature) is related to the input (power u) to it.
When the network is trained, we exchange the current temperature y input
to the neural network for the reference value r, and keep the delayed y as the
second. This is exactly what we did with the mathematically derived inverse
controller.

Figure 2.17: Training a) and using b) a direct inverse neural controller

The following Matlab code creates an input matrix p containing the sets of
y(k), y(k − 1) pairs used for training and creates and trains a neural network.

% direct inverse neural network

p = [y(1:len-1) y(2:len)]';   % the second row is the delayed y value

% create network with 10 neurons in the hidden layer
net2 = newff(p, u', 10);

% train the network
net2 = train(net2, p, (u(2:length(u)))');

% compare results
figure, plot(sim(net2, p))
hold on
plot(u(2:length(u)), 'r')
gensim(net2)

2.13 Exercises
1. How do you construct and train a neural network to approximate a
linear function, e.g. y = 2x + 4? Input layer? Hidden layer? Output
layer? Weights? Bias? Activation function?

2. Try to create and manually teach a neural network that implements


the logical operators AND and OR. Which activation function do you
choose? How many neurons do you need? Weights? Bias?

3. Can you create a network that solves the XOR operator as you did in
the previous exercise?

4. Think about how to construct and train a network to approximate a


second order function y(x) = x^2.
Note 1: your network should approximate the function, it need not be
exact.
Note 2: Neural networks usually work only in the range of inputs they have
been trained with.
Is your method somehow tailored, i.e. specialized for this problem, or
would any default (say default training methods in Matlab) be able to
train your network?

5. Train a network in Matlab to approximate a third order function. Use


as simple a network as you can. Test the network within the range you
used for training. Test the network with inputs outside the range of the
training set. How does it perform? Can you figure out how the Matlab
neural network approximates the function? Is it the same method you
thought of?

6. What types of functions could a sigmoid be fitted to?

7. It was claimed in the text that a benefit of direct inverse control using
neural networks is that an existing controller is not required. But what
if the system is unstable?

Chapter 3

Fuzzy Systems

Fuzzy logic and fuzzy systems are again a response to the fact that real-
ity is often not black or white, true or false. Fuzzy systems have perhaps
most notably been applied in control systems, hence the contents of this
chapter will focus on examples from control systems. This should also suit
the audience of this course. Other fields where fuzzy systems are used are
for example artificial intelligence and expert systems (an expert system is
software that attempts to provide an answer to a problem, or clarify un-
certainties where normally one or more human experts would need to be
consulted). Examples of expert systems with fuzzy logic central to their con-
trol are decision-support systems, financial planners, diagnostic systems for
determining soybean pathology, and a meteorological expert system in China
for determining areas in which to establish rubber tree orchards.

3.1 Fuzzy Sets and Linguistic Variables


A normal set is a set of values with crisp boundaries, a value either belongs or
does not belong to a set. As an example the following is a set A of numbers
taken from the universe of natural numbers (N):

A = {1, 2, 3, 4}

That is A is the set of numbers smaller than 5. There is a crisp unambiguous


boundary (the number 5) such that if a number is less than 5 it belongs to
the set, otherwise it does not.
A crisp set as the above often does not reflect the nature of human con-
cepts and thoughts, which tend to be more imprecise. As an example it does
not make sense to draw a sharp line between a set of tall and a set of not
tall persons, so that e.g. a 180cm person is not tall, while a 181cm person is

tall. Instead we would like to say that a 180cm person belongs a little bit less
to the set of tall persons than does the 181cm person. To express this we can
use fuzzy sets. Fuzzy sets include, in addition to the values, a membership
function, stating the grade (degree) of which a value belongs to the set.
As an example the fuzzy set tall below lists four persons along with at
degree of membership to the set:

tall = {(150, 0), (180, 0.5), (185, 0.7), (195, 1)}

The crisp set A above can be expressed as a fuzzy set with crisp boundaries
as:
A = {(1, 1), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0), (7, 0), . . .}
The tall set can also be represented more generally by the membership
function µtall as
tall = {(x, µtall (x))|x ∈ N}
Where µtall is the membership function for the fuzzy set tall. The mem-
bership function can be plotted as in Figure 3.1. The universe of discourse
above is the set of natural numbers N. So the definition of the set tall above
states that it is the set of all (infinite number of values) pairs of values be-
longing to the natural numbers along with their degree of membership. The
membership degrees for values 0,1,2,3,4. . .160 would be zero, and those we
would quite naturally omit if we were to list the set. The set of heights with

0.8
membership grade

0.6

0.4

0.2

0
160 165 170 175 180 185 190 195 200
height

Figure 3.1: Membership function on a discrete universe. Tall persons

1cm spacing is a discrete universe, resulting in the plot with discrete points.
The membership function can also be continuous.

The function µtall(x) above resembles a sigmoid function, which is one of
the common shapes of membership functions. The most common shapes are
listed in Figure 3.2.

a) Triangular:

µ(x) = α(x − a)/(c − a) if a ≤ x ≤ c;  α(x − b)/(c − b) if c ≤ x ≤ b;  0 otherwise

b) Trapezoidal:

µ(x) = α(x − a)/(c − a) if a ≤ x ≤ c;  α if c ≤ x ≤ d;  α(x − b)/(d − b) if d ≤ x ≤ b;  0 otherwise

c) Gaussian:

µ(x) = e^(−x²/2)

d) Sigmoidal:

µ(x) = 1/(1 + e^(−x+1))

Figure 3.2: Common shapes of membership functions a) Triangular b) Trapezoidal
c) Gaussian d) Sigmoidal

(main sources [JSCTE97] Chapter 2.4)

3.1.1 Linguistic variables


Just as numerical variables take numerical values, in fuzzy logic, linguistic
variables take on linguistic values. Instead of a variable height assuming a

numerical value of 1.75 meters, it is treated as a linguistic variable that may
assume, for example, linguistic values of ”tall” with a degree of membership
of 0.92, ”very short” with a degree of 0.06, or ”very tall” with a degree of
0.7. The set of values a linguistic variable can be assigned is called its term
set. For example, for linguistic variable height, the term set T (height) may
be defined as follows:

T (height) = {”short”, ”average”, ”tall”}


Translating the crisp numerical value 1.75m to degrees of membership of
linguistic terms can be called fuzzification (we talk about fuzzification again
in Section 3.1.2).

Note: the words linguistic term and linguistic value are synonyms, also
the word linguistic label is used in literature but not in this text.

Each linguistic term is associated with a fuzzy set, each of which has a
defined membership function. The possible values for a linguistic variable
can thus be illustrated by a plot containing the membership functions of all
the terms. Such a plot for the height variable is shown in Figure 3.3.

Figure 3.3: Membership functions of linguistic values/terms ”average”, ”tall”, ”short”

As another example, consider a linguistic variable body temperature with


five associated linguistic terms namely ”no fever”, ”slight fever”, ”moderately
high fever”, ”high fever”, and ”very high fever”. Each of these linguistic
terms is again associated with a fuzzy set defined by a corresponding mem-
bership function. In this case we use triangular membership functions. The
membership functions for the terms of this variable are plotted in Figure 3.4

Figure 3.4: Membership functions of linguistic values/terms for body tem-
perature

3.1.2 Fuzzification
The fuzzification comprises the process of transforming crisp values into
grades of membership for linguistic terms. The membership function is used
to associate a grade to each linguistic term. As an example we can fuzzify the
height 182cm of a person using the terms ”short”, ”average”, ”tall” with the
corresponding membership functions µshort , µaverage and µtall from Figure 3.3.
The crisp value 182cm in this case fits all these terms, and the membership
degrees from the membership functions are:
µshort (182) = 0.10
µaverage (182) = 0.80
µtall (182) = 0.17
So the fuzzification of the crisp value 182cm resulted in a list of memberships
to linguistic terms. So it is not as simple as stating that a linguistic
variable height would have the value short. You also state how much, i.e. to which
degree, the variable height has the value short. It is not clear to me what
notation should actually be used to present the value of a linguistic variable.
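As a sketch of the fuzzification step, the membership degrees of 182cm could
be computed as below. The triangular membership functions and their
breakpoints are invented for illustration, so the numbers will not exactly
match those above:

% Fuzzification of the crisp height 182 cm with assumed triangular membership functions
tri = @(x, a, c, b) max(min((x - a)/(c - a), (b - x)/(b - c)), 0);
mu_short   = tri(182, 140, 160, 185);   % degree of membership in "short"
mu_average = tri(182, 160, 177, 195);   % degree of membership in "average"
mu_tall    = tri(182, 175, 200, 225);   % degree of membership in "tall"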

3.1.3 Summary
• A fuzzy set is a set to which elements can belong to a degree 0 . . . 1. A
fuzzy set is best illustrated by a plot of its membership function.

• A linguistic variable is a variable that can take linguistic values.

• A linguistic value, or term, is something that humans normally use to


describe things, subjectively, e.g. hot, tall, fast etc. These linguistic
values/terms are represented by a fuzzy set. E.g. the term fast, is a
fuzzy set of speeds belonging to it to a certain degree. The degree of

membership to the set fast, is of course again described by its member-
ship function.

3.2 Fuzzy If-Then Rules


A fuzzy if-then rule assumes the form

if x is A then y is B

where A and B are linguistic values from some term set, e.g. ”young” and x
and y are linguistic variables. As an example consider:

If pressure is high, then volume is small


Here pressure and volume are linguistic variables and ”high” and ”small”
are linguistic values (terms).
These rules look very much like the kind of boolean logic we use in pro-
gramming. In an if statement in programming, the premise (the guard of the
if statement), should evaluate to either true or false. In our examples above,
the expression ”pressure is high” for example is not a crisp true/false value,
but a degree of membership of the variable pressure to the fuzzy set high.

3.2.1 Evaluating a Single rule, with one Premise


So the premise is fuzzy. What about the result of the evaluation of the
rule? The ”then volume is small” part? In the two-valued logic we use in
programming the rule could look something like below:

if (pressure==HIGH) then
volume = SMALL;
endif

According to this piece of pseudo code: If the premise (pressure==HIGH)


evaluates to true, then the variable volume is set to SMALL. If in our fuzzy
rule, the premise is a little bit true, say to a degree of 0.4, then the fuzzy
volume variable should be set a little bit to SMALL, right? Say about 0.4
times the small thing. What is the small thing again? It is a linguistic
value/term. A linguistic value corresponds to a fuzzy set. A fuzzy set is
represented by its membership function. The result of evaluating the fuzzy
if-then rule is a fuzzy set, limited by the premise. Fuzzy in the head yet?
Let’s draw it.

In Figure 3.5 I have drawn membership functions µhigh and µsmall for
the fuzzy sets representing the linguistic terms/values high, for the linguistic
variable pressure, and small for the linguistic variable volume. I picked the
shapes for the membership functions arbitrarily, usually they would probably
both be the same shape but there is nothing that says they must.

Figure 3.5: Membership functions for the term high of the pressure linguistic
variable, and the term small of the volume variable

According to the If-then rule, high pressure implies small volume. So the plot to the
left somehow implies the plot to the right (implication means that if the one
holds, so does the second). Note that the two plots have different scales and
units altogether.
Our linguistic variable, as explained in Section 3.1.2, has some linguistic
value, e.g. high, but also a degree of membership to the fuzzy set represent-
ing the linguistic term high. And underneath, there is the original numerical
value. Say the pressure is numerically 3.6bars, as in Figure 3.6. Then the
linguistic variable pressure has value high to a degree of 0.4. Then, as men-
tioned, the volume also belongs to the fuzzy set small to a degree of 0.4.
That is, we limit the fuzzy set small, with the membership function µsmall
shown in the plot, to max 0.4.

Figure 3.6: A numerical pressure value 3.6 is fuzzified, and results in mem-
bership degree 0.4 of the high fuzzy set. The membership function of the
consequence of our if-then rule µsmall is limited accordingly

Or in other words, we take the minimum of µhigh (3.6) and µsmall , that is
the result of evaluating our if-then rule is a fuzzy set, with the membership
function given by min(µhigh (3.6), µsmall ). In this case we get the same results
if we just multiply µsmall by 0.4, but let us stick to the min operator.

The evaluation of an if-then rule:

if a is A then b is B

with one premise results in a fuzzy set with the membership function

µA→B (u) = min{µA (x), µB (u)}


For some specific point x in the membership function µA (x). This (the min
operator) is only one of many methods of evaluating these expressions, it is
however a popular one, and the only one I consider in this text.
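The clipping can be written out in a couple of lines. In the sketch below the
membership function for small volumes is an invented decreasing ramp, and
µhigh(3.6) = 0.4 is taken from the example in Figure 3.6:

% Evaluating "if pressure is high then volume is small" with the min operator
v = linspace(0, 0.35, 200);                  % volume axis (m^3)
mu_small = max(min((0.3 - v)/0.2, 1), 0);    % assumed membership function for "small"
mu_high_at_input = 0.4;                      % mu_high(3.6 bar) from the example
mu_result = min(mu_high_at_input, mu_small); % output fuzzy set, clipped at 0.4
plot(v, mu_small, '--', v, mu_result)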

3.2.2 Evaluating a Single rule, with Several Premises


In practice, the fuzzy rule sets usually have several premises that are com-
bined using operators, such as AND, OR, and NOT. The definitions of these
operators tend to vary, as did the evaluation of the single premise case above.
So how do we interpret the following if-then rule:

If pressure is high AND temperature is low, then volume is very small

Again in two-valued logic, the premise would evaluate to true or false. Now
the ”pressure is high” expression will be more or less true, and the ”temperature
is low” will be more or less true. Think about the case when one of those
is zero, e.g. the pressure is not at all member of the set high. Then, as in
two-valued logic, the AND should evaluate to false, or in fuzzy terms, the
membership of volume to the fuzzy set very small should be zero.
This gives us a hint that we could use the min operator also to combine
the subexpressions in our premise. This is in fact a popular way of doing it,
and the only one I consider in this text. We might also intuitively be able to
guess that the max operator could be used for OR expressions.
We illustrate the evaluation of the if-then rule in Figure 3.7. I again
choose a different shape for the membership function µlow , just for fun. The
result of combining the subexpressions with AND results in the value 0.4 in
this example, that is the temperature is ignored.
The evaluation of an if-then rule:

if a is A AND b is B then c is C

with two premises results in a fuzzy set with the membership function

µA&B→C (u) = min{µA (x), µB (y), µC (u)}


For some specific values x and y.

pressure

µhigh
1
0.4 volume
0.4 AND 0.9
0 µsmall
=min{0.4,0.9} 1
1 2 3 4 bar =0.4 min{0.4, µsmall }
3.6
0 3
temperature m
0.9 0.1 0.2 0.3
1
µlow
0
5 10 15 20 degrees

Figure 3.7: Evaluation of an if-then rule with two premises separated by AND

3.2.3 Evaluating Several Rules


A fuzzy system always consists of a (large) set of rules in which we gather
our intuitive knowledge about a system. Take the following (boring) rules as
an example:
If pressure is high, then volume is small
If pressure is medium, then volume is medium
If pressure is low, then volume is large
We can gather these rules in a table:

pressure   volume
high       small
medium     medium
low        large
The evaluation of each rule separately, as we did earlier, results in three
fuzzy sets that are limited (clipped) versions of some of the membership functions
µsmall etc. How do we combine the results? If we use the min operator
again, some of the resulting fuzzy sets will be ignored. But certainly all the
rules should be considered. In Figure 3.8 we use the max operator to combine
the results of evaluating the three rules above for a numerical pressure value
of 1.9 bar. The result is the hatched area in the bottommost plot. Does
the max operator seem to be a valid choice? In fact it is a popular choice.
The evaluation of a set of rules thus becomes a min-max operation: the
individual rules are evaluated using min, and the results are combined using
max.
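The min-max scheme is compact to write down in code. The sketch below evaluates the three pressure-volume rules on a sampled volume universe; the triangular membership functions and their corner points are assumptions for illustration only, so the numbers will not match Figure 3.8 exactly.

import numpy as np

def tri(x, a, b, c):
    # triangular membership function with corners a, b, c
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

vol = np.linspace(0.0, 0.3, 301)
mu_p = {"low":    lambda p: tri(p, 0.0, 1.0, 2.0),
        "medium": lambda p: tri(p, 1.0, 2.0, 3.0),
        "high":   lambda p: tri(p, 2.0, 3.0, 4.0)}
mu_v = {"small":  tri(vol, -0.05, 0.05, 0.15),
        "medium": tri(vol, 0.08, 0.15, 0.22),
        "large":  tri(vol, 0.15, 0.25, 0.35)}

rules = [("high", "small"), ("medium", "medium"), ("low", "large")]

p = 1.9
clipped  = [np.minimum(mu_p[ante](p), mu_v[cons]) for ante, cons in rules]   # min per rule
combined = np.maximum.reduce(clipped)                                        # max over rules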


Figure 3.8: Evaluation of the three pressure-volume rules for the crisp pressure value 1.9 bar; the clipped output sets are combined with the max operator (hatched area)

3.3 Set-Theoretic Operations


Union and intersection ∪, ∩ are the most basic operations on classical (crisp)
sets. It is pretty intuitive how these are extended to fuzzy sets. We will
however not be needing set operations in our control examples later, so I leave
it to the student to think about the definition of the set operators intuitively.

3.4 Defuzzification
The most common methods for combining fuzzy rules produce a fuzzy set.
When we e.g. design a controller we need a crisp numerical output value.
This requires some process of defuzzification, i.e. producing a numerical
value that best reflects the fuzzy set. There are many techniques for
defuzzification; we will mention only three.

3.4.1 The Centroid Method
The center of area, or center of gravity, or centroid method computes the
center of area of the region under the membership function.
u* = ∫ u · µA(u) du / ∫ µA(u) du

Or in discrete form:

u* = ( Σ_{i=1..n} xi · µA(xi) ) / ( Σ_{i=1..n} µA(xi) )

3.4.2 Mean of Maximum (middle of maxima)


Let m denote the number of output points whose membership function
values reach the maximum value. The mean of maximum strategy calculates
the mean value of these m output points:

u* = ( Σ_{i=1..m} xi ) / m

3.4.3 First of Maxima Method


First of maxima takes the smallest value in the domain on which the fuzzy
set assumes its maximum.
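The three methods are easy to compare on a sampled fuzzy set. The helper names and the example output set in the sketch below are assumptions made for illustration, not part of the text.

import numpy as np

def centroid(u, mu):
    return np.sum(u * mu) / np.sum(mu)

def mean_of_maximum(u, mu):
    return np.mean(u[mu == mu.max()])

def first_of_maximum(u, mu):
    return u[int(np.argmax(mu))]        # argmax returns the first maximal index

u  = np.linspace(0.0, 0.3, 301)
mu = np.minimum(0.4, np.maximum(1.0 - u / 0.25, 0.0))   # a clipped ramp, as an example
print(centroid(u, mu), mean_of_maximum(u, mu), first_of_maximum(u, mu))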

3.5 Fuzzy Control
Fuzzy control is typically used when an explicit analytical model of the system is
not available. Fuzzy control is intuitive to understand and easy to design for
engineers who are unfamiliar with classical control theory. A fuzzy controller
can be designed based on e.g. a human operator's experience. Fuzzy control
consists of selecting and using

1. a collection of rules that describe the control strategy

2. membership functions for the linguistic variables in the rules

3. logical connections for fuzzy relations

4. a defuzzification method

3.5.1 Example: Inverted Pendulum on a Cart


As an example of fuzzy control we will design a fuzzy controller for the
classical inverted pendulum on a moving cart. The system has one input,
the force u applied to the cart, and two outputs, the angle of the rod θ and the
angular velocity θ′, where the latter is the derivative of the former; that is,
we actually have only one independent output θ. The objective is to balance
the rod in the vertical position by applying an appropriate force u(t).

Fuzzification
System Model We have at our disposal a discrete model of the system as
a difference equation:

y(k) = 2.1y(k − 1) − 0.98y(k − 2) − 0.06u(k − 1) − 0.06u(k − 2)

In reality we would not have this model, that is primarily the reason for using
fuzzy control instead of classical methods.
We can also do a z-transformation of the model in order to obtain a
transfer function:

y(z) = 2.1y(z)z⁻¹ − 0.98y(z)z⁻² − 0.06u(z)z⁻¹ − 0.06u(z)z⁻²

y(z) − 2.1y(z)z⁻¹ + 0.98y(z)z⁻² = −0.06u(z)z⁻¹ − 0.06u(z)z⁻²

y(z)(1 − 2.1z⁻¹ + 0.98z⁻²) = −0.06u(z)z⁻¹ − 0.06u(z)z⁻²


Figure 3.9: The inverted pendulum on a cart

y(z) = [ (−0.06z⁻¹ − 0.06z⁻²) / (1 − 2.1z⁻¹ + 0.98z⁻²) ] u(z)
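As a quick sanity check of the model (not part of the text), one can iterate the difference equation directly; the zero initial conditions and the small constant input below are assumptions.

def simulate(u_seq, y1=0.0, y2=0.0):
    # iterate y(k) = 2.1 y(k-1) - 0.98 y(k-2) - 0.06 u(k-1) - 0.06 u(k-2)
    ys = [y2, y1]
    for k in range(2, len(u_seq)):
        y = 2.1 * ys[-1] - 0.98 * ys[-2] - 0.06 * u_seq[k - 1] - 0.06 * u_seq[k - 2]
        ys.append(y)
    return ys

angles = simulate([0.01] * 50)   # the response grows without bound

The open-loop response diverges, as expected for an inverted pendulum: the denominator polynomial has a root outside the unit circle, at z = 1.4.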
We start by looking at the universes of discourse, that is the intervals,
of the input variables θ(t) and θ′(t) and the output u(t). For θ we have an
interval X in degrees, say limited to X = [−45°, +45°]; for θ′ we have an
interval Y in degrees per second, which depends on the pendulum used and
on how violent the disturbances it may be subjected to are; let us say Y = [−30°/s, 30°/s].
The output, that is the control space, of u(t) could be an interval repre-
senting the force in newtons applied to the cart. The pendulum in Technobot-
nia instead uses a positioning signal for the cart, with the interval [−0.5, 0.5]
(unitless, but it should represent a force). We use this interval and call the
output universe of discourse U = [−0.5, 0.5].
Next we define linguistic variables that divide the universes into fuzzy
subsets and define the membership functions for the terms. In this example
it is suitable to use the same termset for all variables:

T (angle) = T (angular velocity) = T (positioning) =


{”negative big”, ”negative medium”, ”negative small”,”positive small”,
”positive medium”, ”positive big”}

We abbreviate the terms as NB, NM, NS, PS, PM and PB. Although the
termsets are the same, the membership functions are naturally different. The
shapes of the membership functions can be the same, but at least the intervals

differ in this example. For this example we choose triangular membership
functions for all the linguistic variables. We define membership functions
µNB, µNM, µNS, µPS, µPM, µPB for all linguistic variables as depicted in Fig-
ure 3.10. The six functions are the same for all linguistic variables, except
for the scale.
The membership functions chosen to represent the linguistic
variables are somewhat arbitrary. One of the adjustments made during test-
ing of the control system will be to experiment with different membership
functions, that is, changing the parameters of these triangular functions, or
perhaps switching to other shapes of functions, e.g. Gaussian. This process is
called tuning.
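One possible starting point for such a termset, before any tuning, is a set of evenly spaced triangles over each universe. The sketch below is only one assumption among many; the corner placements are exactly the kind of parameters that tuning would later adjust.

import numpy as np

def make_termset(xmin, xmax):
    # six evenly spaced triangular terms over [xmin, xmax] (an assumed placement)
    names = ["NB", "NM", "NS", "PS", "PM", "PB"]
    centers = np.linspace(xmin, xmax, 6)
    width = centers[1] - centers[0]
    def tri(c):
        return lambda x: max(1.0 - abs(x - c) / width, 0.0)
    return {name: tri(c) for name, c in zip(names, centers)}

angle_terms    = make_termset(-45.0, 45.0)   # degrees
velocity_terms = make_termset(-30.0, 30.0)   # degrees per second
position_terms = make_termset(-0.5, 0.5)     # positioning signal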


Figure 3.10: The membership functions for the terms of the linguistic vari-
ables a) angle b) positioning and c) angular velocity

Designing If-Then Rules
We now use our intuition and experience to design a set of If...then rules.

1. If angle is negative medium and angular velocity is positive small then
   positioning is negative small

2. If angle is negative medium and angular velocity is positive medium
   then positioning is positive small

3. If angle is negative small and angular velocity is positive small then
   positioning is positive small

4. If angle is negative small and angular velocity is positive medium then
   positioning is positive small

5. . . .

We have six terms in both inputs. Designing rules for each combination of
the inputs thus results in 36 rules, four of which are listed above. Instead
of listing the rest of the rules in the format above, we can (while we have
only two inputs) compactly represent them in a lookup table as below.

θ/θ′  NB  NM  NS  PS  PM  PB
NB    NB  NB  NB  NM  NS  PS
NM    NB  NB  NM  NS  PS  PS
NS    NB  NM  NS  PS  PS  PM
PS    NM  NS  NS  PS  PM  PB
PM    NS  NS  PS  PM  PB  PB
PB    NS  PS  PM  PB  PB  PB
The rules again look like crisp rules but remember that we are talking
about fuzzy sets.
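For simulation purposes the lookup table is conveniently stored as a dictionary keyed by the two input terms. The sketch below reuses the (assumed) termsets from the earlier sketch, so exactly which rules it reports as firing is only illustrative.

terms = ["NB", "NM", "NS", "PS", "PM", "PB"]
table = ["NB NB NB NM NS PS",
         "NB NB NM NS PS PS",
         "NB NM NS PS PS PM",
         "NM NS NS PS PM PB",
         "NS NS PS PM PB PB",
         "NS PS PM PB PB PB"]
rule_table = {(a, v): out
              for a, row in zip(terms, table)
              for v, out in zip(terms, row.split())}

def firing_rules(theta, theta_dot):
    # return (output term, firing strength) for every rule with nonzero strength
    fired = []
    for a in terms:
        for v in terms:
            strength = min(angle_terms[a](theta), velocity_terms[v](theta_dot))
            if strength > 0.0:
                fired.append((rule_table[(a, v)], strength))
    return fired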
We take as an example the inputs θ = 13° and θ′ = 6°/s. This value for
θ falls under the terms NM and NS so that µNM(13) = 0.6 and µNS(13) = 0.4.
For θ′ = 6°/s the fuzzy sets (terms) PS and PM have nonzero membership
grades so that µPS(6) = 0.8 and µPM(6) = 0.3. If we remove from the table
above the rules whose terms have degree zero, we are left with the table below.

θ/θ′  PS  PM
NM    NS  PS
NS    PS  PS

The four rules in this table are those that are said to fire for the input θ = 13°
and θ′ = 6°/s. These rules are the same ones we listed earlier in the form of
If-then sentences.
The AND in the rules, as explained in Section 3.2, means taking the
minimum of the membership degrees. We illustrate the evaluation of the
rules in the figures below.
[Figure: rule 1, angle is NM to degree 0.6 AND angular velocity is PS to degree 0.8; 0.6 AND 0.8 = 0.6, which clips the output set µNS of positioning]

[Figure: rule 2, angle is NM to degree 0.6 AND angular velocity is PM to degree 0.3; 0.6 AND 0.3 = 0.3, which clips the output set µPS]

[Figure: rule 3, angle is NS to degree 0.4 AND angular velocity is PS to degree 0.8; 0.4 AND 0.8 = 0.4, which clips the output set µPS]

[Figure: rule 4, angle is NS to degree 0.4 AND angular velocity is PM to degree 0.3; 0.4 AND 0.3 = 0.3, which clips the output set µPS]

Formally the process of evaluating a rule results in a fuzzy subset of the
output universe U with a membership function ϕj for the jth rule. The
four rules that fire in our example thus give the following four membership
functions:

ϕ1(u) = min{µNM(13), µPS(6), µNS(u)}
      = min{0.6, 0.8, µNS(u)}
      = min{0.6, µNS(u)}

ϕ2(u) = min{µNM(13), µPM(6), µPS(u)}
      = min{0.6, 0.3, µPS(u)}
      = min{0.3, µPS(u)}

ϕ3(u) = min{µNS(13), µPS(6), µPS(u)}
      = min{0.4, 0.8, µPS(u)}
      = min{0.4, µPS(u)}

ϕ4(u) = min{µNS(13), µPM(6), µPS(u)}
      = min{0.4, 0.3, µPS(u)}
      = min{0.3, µPS(u)}

The membership function ϕ1(u) = min{0.6, µNS(u)}, for example, is a
restriction of the membership function µNS(u) to the maximum membership
degree 0.6.

Combining the Rules
The application of the rules just illustrated resulted in a number of fuzzy
subsets of U with the membership functions ϕj (u). Next we combine these
membership functions into one function representing the control action. For
this we use the OR operator, which means taking the maximum.

Ψ(u) = max{min{0.6, µNS(u)},
           min{0.3, µPS(u)},
           min{0.4, µPS(u)},
           min{0.3, µPS(u)}}
     = max{min{0.6, µNS(u)}, min{0.4, µPS(u)}}

[Figure: the combined output fuzzy set Ψ(u) over the positioning universe, i.e. the max (union) of the clipped µNS and µPS sets, shown as the hatched area]

Defuzzification
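The worked example stops short of this last step in the text. As a hedged sketch, the crisp positioning command can be obtained by applying, for example, the centroid method of Section 3.4.1 to Ψ; the membership functions below come from the earlier (assumed) termset sketch, so the numerical value is illustrative only.

import numpy as np

u = np.linspace(-0.5, 0.5, 501)                 # the positioning universe U
mu_NS = np.array([position_terms["NS"](x) for x in u])   # assumed shapes, see above
mu_PS = np.array([position_terms["PS"](x) for x in u])

psi = np.maximum(np.minimum(0.6, mu_NS), np.minimum(0.4, mu_PS))

u_star = np.sum(u * psi) / np.sum(psi)          # centroid: the crisp positioning output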

3.5.2 Stability of Fuzzy Control Systems


In standard control theory, stability analysis is based on the availability of
a mathematical model of the system. In fuzzy control we do not have an
exact model (well, the fuzzy model is also a mathematical one).
From a practical viewpoint, one can always use simulations to test the sta-
bility of the system when a fuzzy controller has been designed. But blind
simulations cannot provide the level of certainty that a mathematical proof
can.
The problem of stability analysis of fuzzy control systems is still an active
research area in the theory of fuzzy control.
(main sources [NPWW03] Chapter 5, [Goe])

3.6 Exercises
1. The terms/values ”moderately high fever” and ”high fever” for the
linguistic variable ”body temperature” are fuzzy sets with the mem-
bership functions shown in Figure 3.4. Try to draw the intersection
and union of these fuzzy sets (this was not covered in the text, use
your intuition).

2.

Chapter 4

Neuro-Fuzzy and Fuzzy-Neural


Control

So far we have discussed two distinct methods for building controllers: fuzzy
and neural. Often the choice of method is dictated by the data available on
the plant to be controlled. If the data are pairs of numbers, we may turn to
a neural method, and if the data are rules, fuzzy methods are appropriate.
Neural methods provide learning capability, whereas fuzzy methods provide
flexible knowledge-representational capability.

Chapter 5

Reinforcement Learning

5.1 Introduction
The following introduction is a selection of directly stolen pieces from [SB98].
Reinforcement learning is different from supervised learning, the kind of
learning studied in most current research in machine learning, statistical
pattern recognition, and artificial neural networks. Supervised learning is
learning from examples provided by a knowledgeable external supervisor.
This is an important kind of learning, but alone it is not adequate for learn-
ing from interaction. In interactive problems it is often impractical to obtain
examples of desired behavior that are both correct and representative of all
the situations in which the agent has to act. In uncharted territory, where
one would expect learning to be most beneficial, an agent must be able to
learn from its own experience.
One of the challenges that arise in reinforcement learning and not in other
kinds of learning is the trade-off between exploration and exploitation. To
obtain a lot of reward, a reinforcement learning agent must prefer actions
that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected
before. The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action selections
in the future. The dilemma is that neither exploration nor exploitation can
be pursued exclusively without failing at the task.
All reinforcement learning agents have explicit goals, can sense aspects of
their environments, and can choose actions to influence their environments.
Reinforcement learning agents try to function in environments where correct
choice requires taking into account indirect, delayed consequences of actions,
and thus may require foresight or planning.

5.1.1 Elements of Reinforcement Learning
One can identify four main subelements of a reinforcement learning system:

1. a policy

2. a reward function

3. a value function

4. optionally, a model of the environment

A policy defines the learning agent’s way of behaving at a given time. Roughly
speaking, a policy is a mapping from perceived states of the environment to
actions to be taken when in those states.
A reward function defines the goal in a reinforcement learning problem.
Roughly speaking, it maps each perceived state (or state/action pair) of the
environment to a single number, a reward, indicating the intrinsic desirability
of that state. A reinforcement learning agent’s sole objective is to maximize
the total reward it receives in the long run.
Whereas a reward function indicates what is good in an immediate sense,
a value function specifies what is good in the long run. Roughly speaking,
the value of a state is the total amount of reward an agent can expect to
accumulate over the future, starting from that state. For example, a state
might always yield a low immediate reward but still have a high value because
it is regularly followed by other states that yield high rewards. Or the reverse
could be true. We seek actions that bring about states of highest value, not
highest reward, because these actions obtain the greatest amount of reward
for us over the long run.
The fourth and final element of some reinforcement learning systems is
a model of the environment. Models are used for planning, by which we
mean any way of deciding on a course of action by considering possible future
situations before they are actually experienced. Early reinforcement learning
systems were explicitly trial-and-error learners; what they did was viewed as
almost the opposite of planning.
Reinforcement learning can be used when the state set is very large or
even infinite.

5.2 Example: Riding a Bicycle


As a first example consider the following: The goal given to the Reinforce-
ment Learning (RL) system is simply to ride the bicycle without falling over.

In the first trial, the RL system begins riding the bicycle and performs a
series of actions that result in the bicycle being tilted 45 degrees to the right.
At this point there are two actions possible: turn the handle bars left or turn
them right. The RL system turns the handle bars to the left and immedi-
ately crashes to the ground, thus receiving a negative reinforcement. The
RL system has just learned not to turn the handle bars left when tilted 45
degrees to the right. In the next trial the RL system performs a series of ac-
tions that again result in the bicycle being tilted 45 degrees to the right. The
RL system knows not to turn the handle bars to the left, so it performs the
only other possible action: turn right. It immediately crashes to the ground,
again receiving a strong negative reinforcement. At this point the RL system
has not only learned that turning the handle bars right or left when tilted
45 degrees to the right is bad, but that the ”state” of being tilted 45 degrees
to the right is bad. Again, the RL system begins another trial and performs
a series of actions that result in the bicycle being tilted 40 degrees to the
right. Two actions are possible: turn right or turn left. The RL system
turns the handle bars left which results in the bicycle being tilted 45 degrees
to the right, and ultimately results in a strong negative reinforcement. The
RL system has just learned not to turn the handle bars to the left when
tilted 40 degrees to the right. By performing enough of these trial-and-error
interactions with the environment, the RL system will ultimately learn how
to prevent the bicycle from ever falling over.

5.3 Example: Jackpot Journey


The objective of reinforcement learning is to find an optimal policy for select-
ing a series of actions. The following example, taken from [JSCTE97], will
illustrate the basics. I feel this example is nice as it inherently is structured
as a tree of paths to goals and failures, which in other examples needs to
be constructed. Figure 5.1 illustrates the problem. The traveler, that is the
agent, needs to find an optimal route to the gold. The states of the journey
are the positions the agent can be in. At each junction there is a signpost that
has a box with some white and black stones in it. The agent picks a stone
from the signpost box and goes up if the stone is white, otherwise down. The
journey is always started at state A.
The agent’s behavior can be described as follows: At vertex A, pick one
stone (selection) and put it on the signpost. Proceed to the next vertex
according to the color of the stone (action). Repeat this selection-action
procedure at the second and third vertices. After the third action, the traveler
will reach one of the four terminal vertices: G,H,I or J. When the jackpot is

[Figure content: a small binary tree of junctions; starting from vertex A, three successive up/down choices lead to one of the terminal vertices G, H, I or J, with the gold at H]
Figure 5.1: The jackpot journey problem

hit (success), prepare a reward scheme. When the gold is not found (failure),
prepare a penalty scheme. Then trace back to the starting vertex A; at each
visited vertex, apply the reward or penalty scheme. That is, put the placed
stone back into the signpost box with an additional stone of the same color
(reward), or take the placed stone away from the signpost (penalty). When
the traveler returns, the next traveler will hit the road with a bit more
hope. Repeat the same journey many times.
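A small simulation sketch of the stone scheme is given below. The text does not fix the exact tree, which leaf holds the gold, or what happens when a box runs out of stones, so those details are assumptions made purely for illustration.

import random

UP, DOWN = "white", "black"
tree = {"A": ("C", "B"),                           # assumed layout of the junctions
        "C": ("F", "E"), "B": ("E", "D"),
        "F": ("J", "I"), "E": ("I", "H"), "D": ("H", "G")}
boxes = {v: [UP, UP, DOWN, DOWN] for v in tree}    # a few stones of each color per box

def journey():
    path, vertex = [], "A"
    while vertex in tree:                          # three junctions per journey
        stone = random.choice(boxes[vertex])       # selection
        boxes[vertex].remove(stone)                # place the stone on the signpost
        path.append((vertex, stone))
        vertex = tree[vertex][0 if stone == UP else 1]   # action: up or down
    success = vertex == "H"                        # assumed location of the gold
    for v, stone in path:                          # trace back to A
        if success:
            boxes[v] += [stone, stone]             # reward: return the stone plus one more
        elif not boxes[v]:
            boxes[v].append(stone)                 # never leave a box empty (assumption)
        # on failure the placed stone is otherwise not returned (penalty)
    return success

print(sum(journey() for _ in range(500)), "successes out of 500 journeys")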

5.4 Credit Assignment Problem


The jackpot journey example was strictly success or failure driven; its ad-
justment scheme (adding or removing stones) is always applied only when
the final outcome becomes known after the entire sequence of actions. Thus
it is close to ordinary supervised learning methods. In other words, it does not
exploit the intrinsic sequential structure of the problem to make adjustments to
individual states along the way. The scheme seems to work well in small finite state space problems
such as the tic-tac-toe game and our jackpot journey problem; the entire
state space can be searched rather easily.
In playing chess, such a scheme seems impractical because of the huge
number of possible states. The player (agent) would only receive feedback
(a reinforcement signal) concerning a win or a loss, after a long sequence of
moves. How would the agent know which moves were inappropriate after an
unsuccessful experience? In chess playing, the learner needs to make better
moves with no performance indication regarding winning during the game.

The power of reinforcement learning lies in the fact that the agent does
not need to wait until it receives feedback at the end to make adjustments.
This is achieved by the temporal-difference methods discussed in
Section 5.5.

5.5 Temporal-Difference Learning


Temporal difference methods update estimates based in part on other learned
estimates. In the jackpot journey example, updates were done only after
reaching a final good or bad state. There are many learning schemes for RL
systems, and many variants utilizing temporal differences. We will look at
one.

5.5.1 Q-Learning
Q-learning maintains a matrix, call it the Q-matrix, which has one row for
every possible state, and one column for every action the agent can take in
these states. We look at Q-learning through a cliff walker example ([SB98]).
Consider the grid world shown in Figure 5.2. The walker (agent) should find
the optimal path from start to stop without falling down the cliff (the black
area is no-no). For us it is easy to see that the optimal path is to follow
the edge of the cliff. The walker can be in 4*10 positions, i.e. states, so
[Figure content: a 4 × 10 grid; the start cell is at the lower left, the goal at the lower right, and the cliff (black area) occupies the bottom row between them]

Figure 5.2: The cliffwalker problem

the Q-matrix should have 40 rows. The agent can go up, down, right or
left (actions), that is the matrix should have four columns. The matrix is a
listing of state-action pairs. The values filled into the matrix are the expected
returns (rewards) of a certain action in a certain state. An episode stops when
we hit the goal, or run into the cliff. The Q-matrix is the memory of the
system, and the key issue is how to update the matrix so that the walker

makes better and better decisions. Initially the agent knows nothing, so the
matrix is filled with random values. The walker can do one of two things:

• take the optimal action, which means he’ll take the action correspond-
ing to the biggest Q-value.

• explore his world by taking a random action, just to see what happens.

This choice is the important exploration-exploitation issue. We use a param-
eter ε to set the probability of choosing the optimal action.

Value function So the value of a given state is the total amount of reward
an agent can expect to accumulate in the long run. The value function V
can be defined as the value of the state's best state-action pair:

V(s) = max_a Q(s, a)

We then update the Q-matrix according to the following function:

Q(st, at) = Q(st, at) + α [r + γV(st+1) − Q(st, at)]

which is the same as:

Q(st, at) = (1 − α)Q(st, at) + α [r + γV(st+1)]

where r is the reward (reinforcement) for taking the action at, and α and γ are
factors used for tuning the learning. We can, for example, set γ = 1. The above equation
(I prefer the latter form) thus reads something like: update the Q value for the
state-action pair st, at so that you give weight 1 − α to its previous value, and
weight α to the value of the state reached by action at plus the reward/cost
of moving to that state.

Reward function We define the reward function informally as:

• The reward for walking into the cliff is -100.

• The reward for taking any other step is -1.

That is, there is a cost (negative reward) for each step we take in the grid.
Thus the agent should learn to take as few steps as possible.

Initialize Q(s, a) randomly
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode, i.e. until the goal or the cliff is reached):
        Choose a from s using the policy derived from Q (depending on ε)
        Take action a, observe r and the new state s′
        Q(s, a) = (1 − α)Q(s, a) + α [r + γV(s′)]
        s ← s′
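A hedged Python sketch of this loop is given below; the grid layout, the start and goal cells and the parameter values are assumptions consistent with the description in the text, not prescribed by it.

import random
import numpy as np

ROWS, COLS = 4, 10
START, GOAL = (3, 0), (3, 9)                        # bottom-left and bottom-right cells
CLIFF = {(3, c) for c in range(1, 9)}               # the bottom edge between them
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
alpha, gamma, eps = 0.1, 1.0, 0.9                   # eps = probability of the greedy action

Q = np.random.uniform(-0.1, 0.1, (ROWS, COLS, len(ACTIONS)))

def step(state, action):
    r = min(max(state[0] + ACTIONS[action][0], 0), ROWS - 1)
    c = min(max(state[1] + ACTIONS[action][1], 0), COLS - 1)
    if (r, c) in CLIFF:
        return (r, c), -100.0, True                 # fell off the cliff: episode ends
    return (r, c), -1.0, (r, c) == GOAL             # every ordinary step costs -1

for episode in range(500):
    s, done = START, False
    while not done:
        a = int(np.argmax(Q[s])) if random.random() < eps else random.randrange(4)
        s2, reward, done = step(s, a)
        v_next = 0.0 if done else float(np.max(Q[s2]))   # V(s') = max over actions
        Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (reward + gamma * v_next)
        s = s2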

5.6 Exercises
1. What is reinforcement learning good at, compared to, say, neural networks?

2. Try to think about how a neural network could manage the bicycle
riding. The only input-output pairs we could present to the network would
be disaster/not disaster for the output, and a sequence of actions as
inputs.

3. How many states are there in a game of tic-tac-toe? Draw a tree of


some depth.

4. Could a neural network solve the jackpot journey problem? What


would be the input layer, and output layer. Could it learn?

5. Think about the ”dynamic state” of the inverted pendulum mentioned


in the text. How many states are there? Depends doesn’t it? Maybe
the fuzzy pendulum controller could give some hints on how to reduce
the state space. Draw a tree of actions resulting in failure or success.
Define failure. Define actions. Define rewards. Choose some initial
state other than still and upright. Illustrate a training sequence and
how reinforcements are updated at each step on the way. Use a pure
delayed reward reinforcement function.

6. Try to implement the above on the inverted pendulum in the lab.

Bibliography

[Cal03]    Tim Callinan. Artificial neural network identification and control
           of the inverted pendulum. Master's thesis, -, 2003.

[Goe]      Greg Goebel. An introduction to fuzzy control systems.
           http://www.faqs.org/docs/fuzzy/.

[Hay99]    Simon Haykin. Neural Networks, a Comprehensive Foundation.
           Prentice Hall, 1999.

[JSCTE97]  J.-S. Jang, C.-T. Sun, and E. Mizutani. Neuro-Fuzzy and Soft
           Computing. Prentice Hall, 1997.

[NPWW03]   Hung T. Nguyen, Nadipuram R. Prasad, Carol L. Walker, and
           Elbert A. Walker. A First Course in Fuzzy and Neural Control.
           Chapman & Hall/CRC, 2003.

[SB98]     Richard S. Sutton and Andrew G. Barto. Reinforcement Learning:
           An Introduction. MIT Press, 1998.

[SX09]     Yung C. Shin and Chengying Xu. Intelligent Systems: Modeling,
           Optimization and Control. CRC Press, 2009.
