Dag Björklund
September 8, 2010
Contents

1 Introduction
1.1 Intelligent Systems and Soft computing
1.2 System Identification
2 Neural Networks
2.1 The Neuron
2.2 Activation functions
2.2.1 Step Function (Threshold Function)
2.2.2 Piecewise Linear
2.2.3 Sigmoid Function
2.2.4 Other Activation Functions
2.2.5 Why Use Activation Functions
2.3 Network Architectures
2.4 Knowledge
2.5 Designing and using a Neural Network
2.6 Transformations
2.7 Learning Processes
2.7.1 Learning with a teacher (supervised learning)
2.7.2 Unsupervised Learning
2.7.3 Learning tasks
2.8 Perceptron
2.9 Adaline
2.10 Function Approximation
2.10.1 Linear Regression and Multiple Regression
2.10.2 Nonlinear Models
2.10.3 Neural Networks for Function Approximation
2.11 Dynamic Systems and Neural Networks
2.12 Neural Control
2.12.1 Direct Neural Control
2.12.2 Indirect Neural Control
2.12.3 Example: Temperature Controller Design
2.13 Exercises
3 Fuzzy Systems
3.1 Fuzzy Sets and Linguistic Variables
3.1.1 Linguistic variables
3.1.2 Fuzzification
3.1.3 Summary
3.2 Fuzzy If-Then Rules
3.2.1 Evaluating a Single Rule with One Premise
3.2.2 Evaluating a Single Rule with Several Premises
3.2.3 Evaluating Several Rules
3.3 Set-Theoretic Operations
3.4 Defuzzification
3.4.1 The Centroid Method
3.4.2 Mean of Maximum (Middle of Maxima)
3.4.3 First of Maxima Method
3.5 Fuzzy Control
3.5.1 Example: Inverted Pendulum on a Cart
3.5.2 Stability of Fuzzy Control Systems
3.6 Exercises
5 Reinforcement Learning
5.1 Introduction
5.1.1 Elements of Reinforcement Learning
5.2 Example: Riding a Bicycle
5.3 Example: Jackpot Journey
5.4 Credit Assignment Problem
5.5 Temporal-Difference Learning
5.5.1 Q-Learning
5.6 Exercises
Chapter 1
Introduction
Methodology              Strength
Neural networks          Learning and adaptation
Fuzzy set theory         Knowledge representation via fuzzy if-then rules
Genetic algorithms       Systematic random search
Reinforcement learning   Learning sequences of actions based on previous failure and success
1.2 System Identification
The problem of determining a mathematical model for an unknown system by
observing its input-output data pairs is referred to as system identification.
The purposes of system identification are multiple:
• To design controllers
System identification is usually done in two steps:
• Structure identification
In this step we apply a priori knowledge about the target system to determine a class of models that we think could be fitted to the system. We usually know a great deal about industrial processes, for instance, so that we can choose a model that can be fitted by tuning parameters.
• Parameter identification
In this step we apply optimization techniques to determine the parameters of the model chosen in step one.
After these steps we usually also do validation tests to check whether the identified model responds correctly to an unseen data set. If not, the first two steps are repeated. In statistics the terms linear regression and curve fitting are used, which is really the same thing as system identification.
Chapter 2
Neural Networks
• Simple processing units that can store experience
• A learning process
• Input-output mapping.
By presenting random examples together with the desired output, a neural network can be trained to generalize and solve problems it has not previously encountered. Neural networks provide a generic method of mapping or representing input-output relationships.
• Adaptive
A NN can adapt to changes in its surroundings. (Remember that adaptivity often brings problems: if the system reacts too fast, it reacts strongly to noise and can become unstable; if it reacts too slowly, noise is filtered out, but the system might become too slow to react to changes in the environment.)
• VLSI implementation.
NNs are, as mentioned, massively parallel. Unfortunately instruction processors are sequential. In order to take advantage of the parallelism, NNs can be implemented in hardware, e.g. as ASIC chips.
Figure 2.2: A neuron in the brain
and the output of the neuron with the activation function applied becomes
yk = ϕ(vk )
(sources [Hay99])
[Figure: Plots of common activation functions: a) step (threshold), b) piecewise linear, c) sigmoid]

a) Step (threshold) function:

ϕ(v) = 1 if v ≥ 0
ϕ(v) = 0 if v < 0

b) Piecewise linear function:

ϕ(v) = 1 if v ≥ +1/2
ϕ(v) = v if −1/2 < v < +1/2
ϕ(v) = 0 if v ≤ −1/2

c) Sigmoid function with slope parameter a:

ϕ(v) = 1/(1 + e^(−av))
2.2.2 Piecewise Linear
A piecewise linear activation function can work as a step function, if the linear region is made very small, or it can work as a linear combiner, if only the linear region is used, without running into saturation. It can naturally also be used as an output-limiting saturation device. In the linear region

ϕ(v) = v
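The three activation functions above can be sketched in a few lines. This is my own Python sketch (the course material itself uses Matlab), with ϕ(v) = v in the linear region as stated above:

```python
import numpy as np

def step(v):
    # threshold function: 1 if v >= 0, else 0
    return np.where(v >= 0, 1.0, 0.0)

def piecewise_linear(v):
    # 1 above +1/2, 0 below -1/2, phi(v) = v in the linear region
    return np.where(v >= 0.5, 1.0, np.where(v <= -0.5, 0.0, v))

def sigmoid(v, a=1.0):
    # logistic sigmoid with slope parameter a
    return 1.0 / (1.0 + np.exp(-a * v))

print(step(np.array([-0.3, 0.0, 0.7])))   # [0. 1. 1.]
print(sigmoid(0.0))                       # 0.5
```

Increasing the slope parameter a makes the sigmoid approach the step function, which is why it is often used as a differentiable stand-in for the threshold.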
sigmoid units, a small change in the weights will usually produce a change in
the output, which makes it possible to tell whether that change in the weights
is good or bad. With threshold units, a small change in the weights will often
produce no change in the outputs.
For the output units, you should choose an activation function suited to
the distribution of the target values:
Notice that the weights, biases and activation functions are omitted from
this figure, as they usually are when network structures are drawn. Notice
also that this single layer network has a layer of source nodes and a layer of
neurons, the output layer. It only has one layer of neurons, thus it is a single
layer network.
A two-layer network is shown in Figure 2.6. Each input is sent to every neuron in the hidden layer, and then each hidden-layer neuron's output is connected to every neuron in the next layer, i.e. the network is fully connected. The hidden layer is not directly connected to the output, hence its name. There can be any number of hidden layers within a feedforward network, but one usually suffices for most problems you will tackle.
Figure 2.6: Network with one hidden layer and one output layer: an input layer of source nodes, a hidden layer of neurons and an output layer of neurons. Fully connected feedforward network.
2.4 Knowledge
”Stored information used by a person or machine to interpret, predict and respond to the outside world.” (Fischler et al.) A NN shall learn a model of its surroundings and maintain it.
3. We test the network with examples it has not previously seen. The network should generalize to be able to recognize these examples.
2.6 Transformations
Usually objects need to be identified although they are transformed, e.g.
rotated, moved or scaled, in different ways compared to the original training
set. This can be achieved in a couple of different ways, like:
1. Invariance by structure
For example, invariance to in-plane rotations around a center point can be achieved by forcing all synaptic weights at equal distances from the center to be equal. But consider what happens in the training process if we, e.g., try to recognize some complex pattern in an image: we tune the weights to better match an input image, but then we pick one weight at a given distance from the origin and overwrite all other weights at the same distance with that one? The solution lies in duplicating the synapses for each pixel in the image, which makes the network computationally intense even for moderate problem sizes.
2. Invariance by training
In principle a network can be trained to recognize transformations of the same object, which requires presenting it with those transformations. The question is whether the network's learning capacity will be enough for learning to map transformed versions of patterns onto the same output. This method also requires large amounts of training data.
Not only objects in a shape recognition system are subject to transformations, of course. In voice recognition, as another example, a voice pattern arrives at different time instants and at different pitch depending on the speaker.
As error-correction learning is a supervised learning algorithm, we present the neuron with inputs x1, . . . , xm and a corresponding desired output dk. We do this over and over again during the training process, with different input vector → output pairs. We denote the nth input data set as x1(n), . . . , xm(n) and the nth output yk(n), which should be close to the nth desired output dk(n) after training.
The error from our neuron k during the nth round is given by:

ek(n) = dk(n) − yk(n)

The aim is to tune the weights in order to reduce this error. The error-correction learning is illustrated in Figure 2.7. At each step we update the weights wk1(n), . . . , wkm(n).
[Figure 2.7: Error-correction learning: inputs x1(n), . . . , xm(n) enter neuron k through weights wk1(n), . . . , wkm(n); the output is compared with the desired output dk(n)]
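The update loop itself is short. A minimal Python sketch of error-correction (delta-rule) training for one linear neuron (the learning rate η, the identity activation and the example target function are my assumptions, not from the text):

```python
import numpy as np

def train_neuron(X, d, eta=0.1, epochs=50):
    """Error-correction training of one linear neuron.
    X: (n_samples, m) inputs, d: (n_samples,) desired outputs."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, dk in zip(X, d):
            y = w @ x + b        # neuron output (identity activation)
            e = dk - y           # error e(n) = d(n) - y(n)
            w += eta * e * x     # adjust weights proportionally to error and input
            b += eta * e
    return w, b

# learn d = 2*x1 - x2 + 1 from noise-free examples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
d = 2 * X[:, 0] - X[:, 1] + 1
w, b = train_neuron(X, d)
print(np.round(w, 2), round(b, 2))
```

After training, w and b approach the coefficients of the target function, since the data are noise-free and the model class matches.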
The Backpropagation Algorithm
The error-correction learning presented above only works for single-layer networks. When we have hidden nodes in our neural network, we face a credit-assignment problem. We only see the output of the network, which we can compare with a desired output in order to tune the output neurons, but we do not directly know how the hidden-layer nodes can be held accountable for the error e = d − y. This can however be solved by, rather complicated, backpropagation algorithms.
I will briefly outline a gradient-descent based backpropagation algorithm. Gradient descent means we use derivatives of the activation functions to see in which direction weights should be tuned in order to reduce the error. That is, we take the derivative ϕ′ of the activation function (the activation function needs to be differentiable) and adjust weights proportionally to it, to the error magnitude and to the magnitude of the inputs to the neuron. Note, for instance, that the sigmoid function has zero derivative in the saturated regions, so there the weights will not be changed, ∆wji = 0. The derivative is roughly constant in the linear region of the sigmoid, and decreases when approaching saturation.
gradients of the output nodes k that j is connected to, times the weight between j and k:

δj(n) = µ ϕ′j(vj(n)) Σk δk(n) wkj(n)
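A compact Python sketch of the whole gradient-descent backpropagation loop. The XOR task, the layer sizes and the learning rate are illustrative choices of mine; for the sigmoid, the derivative is ϕ′(v) = y(1 − y):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

# XOR data: a task a single-layer network cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(1)
W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # hidden layer, 4 neurons
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # output layer
eta = 0.5                                        # learning rate

for epoch in range(20000):
    h = sigmoid(X @ W1 + b1)                     # forward pass: hidden outputs
    y = sigmoid(h @ W2 + b2)                     # network output
    # output local gradient: error times sigmoid derivative phi'(v) = y(1 - y)
    delta2 = (d - y) * y * (1 - y)
    # hidden local gradient: backpropagated output gradients times the weights
    delta1 = (delta2 @ W2.T) * h * (1 - h)
    # adjust each weight proportionally to gradient and the neuron's inputs
    W2 += eta * h.T @ delta2; b2 += eta * delta2.sum(axis=0)
    W1 += eta * X.T @ delta1; b1 += eta * delta1.sum(axis=0)

print(np.round(y.ravel(), 2))   # trained outputs for the four XOR patterns
```

The `delta1` line is exactly the δj formula above: the output-node gradients are propagated back through the weights W2 and scaled by the hidden units' own derivatives.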
• Hebbian learning
• Self-organizing maps
• Reinforcement learning
• Genetic-algorithm-based learning
2.8 Perceptron
You will encounter the word perceptron when you use different NN tools. A perceptron is just a certain kind of neuron together with a certain training algorithm. The activation function of a perceptron is a hard limiter (threshold), i.e. it has a binary output (0/1 or −1/+1). It can thus only classify the input into two classes. In order to classify the input into several classes, several neurons are required. Training of a single-layer perceptron is done using an error-correction algorithm:
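A Python sketch of such a perceptron training loop (the AND task and the learning rate are illustrative assumptions of mine):

```python
import numpy as np

def perceptron_train(X, d, eta=0.25, epochs=20):
    """Single perceptron with hard-limiter activation and 0/1 output."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, dk in zip(X, d):
            y = 1 if w @ x + b >= 0 else 0   # threshold activation
            w += eta * (dk - y) * x          # error-correction update
            b += eta * (dk - y)
    return w, b

# linearly separable example: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1])
w, b = perceptron_train(X, d)
print([1 if w @ x + b >= 0 else 0 for x in X])  # [0, 0, 0, 1]
```

For linearly separable data like AND the perceptron convergence theorem guarantees the loop finds a separating line in a finite number of updates; for XOR it would never converge, which is exactly the limitation the hidden layer fixes.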
2.9 Adaline
2.10 Function Approximation
Function approximation or function fitting is pretty much the same thing as system identification in the case where the systems are rather simple, i.e. non-dynamical. A system is functional if, given a certain input, it always gives the same output, regardless of other previous input. The output of a dynamical system (see Section 2.11) varies depending on the history of inputs. A stationary dynamic system gives the same output given the same history (of some length) of inputs, while a time-varying system can have changing dynamics and behavior over time.
output linear function to some measured data. That is, a function with one
independent and one dependent variable.
ŷ = a + bx
Figure 2.8 depicts a plot of measured unemployment rates versus economic growth. It is suspected that the latter depends on the former (could it be the other way around?). It is further suspected that the dependence is linear, so that a straight line could be fitted to the scattered points, i.e. the parameters a and b above should be tuned to minimize the error between measured and predicted values.
It turns out the parameters a and b satisfying this can be obtained by:
b = ( Σi=1..n (xi − x̄)(yi − ȳ) ) / ( Σi=1..n (xi − x̄)² )
and
a = ȳ − bx̄
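The two formulas can be applied directly. A small Python check on made-up data taken exactly from the line y = 3 + 2x:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3 + 2 * x                      # exact data from the line y = 3 + 2x

xbar, ybar = x.mean(), y.mean()
# b = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
b = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
# a = ybar - b * xbar
a = ybar - b * xbar
print(a, b)  # 3.0 2.0
```

Since the data lie exactly on a line, the least-squares fit recovers the intercept and slope exactly.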
A probably more realistic situation than the single-input/single-output model above is one where we have more than one independent variable. In multiple regression we have several independent variables x1, x2, . . . that affect a dependent variable y as:

ŷ = a + b1 x1 + b2 x2 + . . . + bp xp
Again we then wish to minimize the error between the predicted and measured outputs ŷ and y:

min over a, b1, b2, . . . of Σi=1..n (yi − ŷi)²

That is, we minimize the squared error. Tools like Excel and Matlab have features for finding the optimal parameters a, b1, b2, . . ..
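In Matlab this is e.g. the backslash operator; an equivalent least-squares fit in Python (the data here are invented and noise-free, so the known coefficients are recovered):

```python
import numpy as np

rng = np.random.default_rng(42)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(0, 1, 50)
y = 1.0 + 2.0 * x1 - 3.0 * x2          # data from known a = 1, b1 = 2, b2 = -3

# design matrix with a column of ones for the intercept a
X = np.column_stack([np.ones_like(x1), x1, x2])
params, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = params
print(np.round(params, 2))  # approximately [1, 2, -3]
```

With noisy measurements the same call gives the least-squares estimate of the parameters instead of the exact values.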
ŷ = f (x1 , x2 , . . .)
by a transfer function in the Laplace plane (analog) or z-plane (digital). We constantly deal with dynamical systems in e.g. control theory and signal processing (two closely related fields).
Feedforward neural networks have no memory or state. The output of the network for a given input is the same regardless of previous inputs (the network is functional). A neural network thus cannot model a dynamical system, or can it?
We can use several samples in time of the input signal as inputs to the neural network. The network will then still have the same output given a set of input samples regardless of older samples, that is, it will be time invariant, but dynamic. Figure 2.9 shows a single-input single-output (SISO) system. The neural network, however, has several inputs, but these are time-shifted samples of the one actual input. This dynamic system can be represented by

y(k) = f(x(k), x(k − 1), x(k − 2))
where the function f(·) is the neural network. A reader familiar with digital signal processing might notice that this NN resembles a digital filter. If we remove the hidden layer, we get a neural network that is structurally the same as a finite impulse response (FIR) filter (Figure 2.10). The only difference is in the tuning of the weights: the weights in the NN are tuned by learning algorithms, while the weights in a FIR filter are tuned by filter design methods.
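This equivalence is easy to check numerically. A Python sketch with arbitrarily chosen tap weights (the signal and weights are invented for illustration):

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])          # three tap weights / synaptic weights

def nn_output(x_window):
    # single linear neuron fed with x(k), x(k-1), x(k-2)
    return w @ x_window

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# the "network" output for k = 2, 3, 4 ...
y_nn = [nn_output(np.array([x[k], x[k - 1], x[k - 2]])) for k in range(2, len(x))]
# ... equals FIR filtering of x with the same coefficients
y_fir = np.convolve(x, w, mode="full")[2:len(x)]
print(np.allclose(y_nn, y_fir))  # True
```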
We can also have a recursive dynamic system as:

y(k) = f(x(k), x(k − 1), y(k − 1))

That is, one where the output y(k) also depends on the previous output y(k − 1). This difference equation can be modeled by the neural network in Figure 2.11.
Figure 2.10: Comparison of a dynamic neural network and a finite impulse
response (FIR) filter
have proven to be most useful for such time-invariant systems. This section
on neural control is mainly based on [SX09] Chapter 2.4.1, [NPWW03] and
[Cal03].
Supervised Control
The training data could e.g. come from a human controller, that is, the neural network would learn to act as the human (the human should solely base his control actions upon the input signal available also to the neural controller). The training data can also be obtained from an existing controller. This is also called supervised control. You could actually call this system identification, where the system is the human controller and we try to identify what he has eaten. The teaching of a neural network to mimic a human operator is depicted in Figure 2.12. When the network is tuned, it would replace the human operator as the controller of the plant.
[Figure 2.12: a) The human operator controls the plant: the error e between reference r and plant output y drives his control action u, which is also recorded for training. b) The trained network replaces the human operator as controller of the plant.]
means that the controller can operate correctly if the plant/process operates around a certain point. These controllers will fail if there is any sort of uncertainty or change in the unknown plant. The advantage of neuro-control is that if an uncertainty in the plant occurs, the ANN will be able to adapt its parameters and continue controlling the plant where other robust controllers would fail.
Indirect neural control designs involve two phases. First the plant dy-
namics are identified by a neural network from training data, that is, system
identification. In the second phase the control design can be rather conven-
tional though the controller is derived, not from a standard mathematical
model of a plant, but from a neural network model.
y(k) = 0.9 y(k − 1) + 0.2 u(k − 1) + 1.5    (2.2)

Here y is the temperature of the room and u is the power, in some unknown unit, applied to a heating element. If zero power is applied, the temperature stays at 15°C (say the temperature surrounding the room is 15°C). If a power of '5 units' is applied, the temperature will rise to 25°C, which is seen by applying such a step. This step response is shown in Figure 2.14.
[Figure 2.14: Step response of the room model: the power u is stepped from 0 to 5 units and the temperature y (degrees Celsius) rises from 15°C toward 25°C]
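The step response is easy to reproduce by iterating difference equation (2.2) directly. A Python sketch (the step time is my choice; the steady states 15°C for u = 0 and 25°C for u = 5 follow from solving y = 0.9y + 0.2u + 1.5):

```python
def room(y_prev, u):
    # difference equation (2.2): y(k) = 0.9 y(k-1) + 0.2 u(k-1) + 1.5
    return 0.9 * y_prev + 0.2 * u + 1.5

y = 15.0                          # start at the u = 0 steady state
trace = []
for k in range(70):
    u = 5.0 if k >= 10 else 0.0   # step the power from 0 to 5 units at k = 10
    y = room(y, u)
    trace.append(y)

print(round(trace[-1], 2))        # the trace approaches the steady state 25
```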
We will now try out a couple of different control designs for controlling the temperature of the room. But first of all, why do we need a controller? We saw from the step response that we get 25°C if we apply the power 5. But what if we want 22°C? We can guess some power, apply it and come back an hour later to check the temperature, and then try another power (if we do this, we have applied feedback to the system). But we want to have a controller with a centigrade scale, so that we can set a temperature, say 25°C, and after some time, that is the temperature we will have.
Clearly we first of all need to establish some inverse relationship between the room temperature y and the power u to achieve this.
Secondly, in our heating system we have infinite power at our disposal! Should we then not crank up the power considerably to achieve the reference
temperature faster, lowering the power when we get close to the target? With infinite heating power, we can get to the target temperature in one time step. Solving (2.2) for the input gives an expression for what power u(k) to apply to achieve a certain temperature y(k + 1) given the current temperature y(k). The temperature y(k + 1) is the one we want to achieve, i.e. we set y(k + 1) to the reference temperature r, i.e. we substitute r(k + 1) for y(k + 1):

u(k) = (r(k + 1) − 0.9 y(k) − 1.5) / 0.2    (2.3)

In our difference equation model, cascading the controller with the room model means substituting the function u(k) for the occurrences of u in the difference equation.
Figure 2.15: Simulink model of direct inverse controller derived from the difference equation by inverting
So using our controller given by Equation 2.3 we force the output to directly follow the reference. The result of the above derivation should come as no surprise, as we just inverted y(k) to obtain u(k). A heating system that instantly reaches any desired temperature is not practically possible. A more realistic model is obtained simply by saturating the control signal u, which would slow down the rise (or fall) time of the temperature.
Supervised control
We can now try to train a neural network to work the same way as our mathematically derived controller. Our controller was given by Equation 2.3, repeated below:

u(k) = (r(k + 1) − 0.9 y(k) − 1.5) / 0.2
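Cascading this controller with the room model confirms that the output reaches the reference in one time step. A quick numerical check in Python (my own sketch, mirroring equations (2.2) and (2.3)):

```python
def room(y_prev, u):
    # room model, difference equation (2.2)
    return 0.9 * y_prev + 0.2 * u + 1.5

def controller(r_next, y):
    # inverse controller, Equation (2.3)
    return (r_next - 0.9 * y - 1.5) / 0.2

y = 15.0
reached = []
for r in [22.0, 25.0, 18.0]:          # a few reference changes
    u = controller(r, y)
    y = room(y, u)                    # one time step later...
    reached.append(round(y, 10))
print(reached)  # [22.0, 25.0, 18.0] -- each reference reached in one step
```

Algebraically this is just 0.9y + 0.2·((r − 0.9y − 1.5)/0.2) + 1.5 = r, the cancellation the text describes.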
Since we know this equation, we see that it will be no problem for a neural network to learn it. We also understand that there is no point in replacing it with a neural network, but we do it anyway as an example. The neural network shall have two nodes in the input layer, y and r, and one in the output layer, u. In Figure 2.16 a), the training of the neural controller to mimic the mathematically derived one is illustrated, while Figure 2.16 b) shows how the NN has replaced the original controller. We create and train
[Figure 2.16: a) Training the neural controller to mimic the inverse-function controller: the reference r (temp) and room output y (temp) are inputs, and the error e against the controller output u (power) drives training. b) The trained NN replacing the original controller.]
a network in Matlab. We run the simulink model for say 500 steps with
random values for the reference temperature r and record the output u from
the controller and y from the system. The neural network in Matlab takes
the set of inputs as a matrix with the input signals on separate rows, with
each test round represented by a column. We create this matrix, create and
train the network, and test it as:
% supervised learning
% create a copy cat of the inverse function controller

% create two row matrix for the set of samples for the two inputs
p = [r y]';
% create a network with 10 nodes in the hidden layer
net = newff(p, u', 10);
% train the network with the output u
net = train(net, p, u');
% plot the output from the neural network when fed with the same input
% against the output u from the controller model
% (we use the same data used for training to verify the performance,
% which is not good practice)
figure, plot(sim(net, p))
hold on
plot(u, 'r')
Direct Inverse
The supervised method above was nice and simple, but if you don’t already
have an existing controller you can learn to mimic (usually you don’t) and
you do not even have a model for the plant, you could try to teach a neural
network the inverse of the system. This inverse model should then also
behave the same as the inverse controller we derived from the difference
equation describing the system.
So the task is to do system identification on the inverse room dynam-
ics. The room system has one input and one output, heating power and
temperature respectively. In the inverse model the input/output would then
be reversed. So should our neural network have one input and one output?
The room is a dynamical system. The output depends on previous output.
We cannot create an inverse model that maps the temperature in some time
instant to a specific heating power. Our neural network needs two inputs,
the output y(k) of the room, and the previous output y(k − 1) of the room.
Figure 2.17 a) illustrates the training of the neural network. We do not have any existing controller to copy; instead we try to learn how the output of the room (temperature) is related to the input (power u) to it. When the network is trained, we exchange the current temperature input y to the neural network for the reference value r, and keep the delayed y as the second input. This is exactly what we did with the mathematically derived inverse controller.
[Figure 2.17: a) Training the inverse model: the room output y and its delayed version (z⁻¹ blocks) are fed to the NN, whose output is compared with the applied power u]
The following Matlab code creates an input matrix p containing the sets of
y(k), y(k − 1) pairs used for training and creates and trains a neural network.
% compare results
figure,plot(sim(net2,p))
hold on
plot(u(2:length(u)-1),’r’)
gensim(net2)
2.13 Exercises
1. How do you construct and train a neural network to approximate a
linear function, e.g. y = 2x + 4? Input layer? Hidden layer? Output
layer? Weights? Bias? Activation function?
3. Can you create a network that solves the XOR operator as you did in
the previous exercise?
7. It was claimed in the text that a benefit of direct inverse control using neural networks is that an existing controller is not required. But what if the system is unstable?
Chapter 3
Fuzzy Systems
Fuzzy logic and fuzzy systems are again a response to the fact that reality is often not black or white, true or false. Fuzzy systems have perhaps most notably been applied in control systems, hence the contents of this chapter will focus on examples from control systems. This should also suit the audience of this course. Other fields where fuzzy systems are used are for example artificial intelligence and expert systems (an expert system is software that attempts to provide an answer to a problem, or clarify uncertainties where normally one or more human experts would need to be consulted). Examples of expert systems with fuzzy logic central to their control are decision-support systems, financial planners, diagnostic systems for determining soybean pathology, and a meteorological expert system in China for determining areas in which to establish rubber tree orchards.
A = {1, 2, 3, 4}
tall. Instead we would like to say that a 180 cm person belongs a little bit less to the set of tall persons than does the 181 cm person. To express this we can use fuzzy sets. Fuzzy sets include, in addition to the values, a membership function, stating the grade (degree) to which a value belongs to the set.
As an example, the fuzzy set tall below lists four persons along with a degree of membership to the set:
The crisp set A above can be expressed as a fuzzy set with crisp boundaries as:

A = {(1, 1), (2, 1), (3, 1), (4, 1), (5, 0), (6, 0), (7, 0), . . .}
The tall set can also be represented more generally by the membership function µtall as

tall = {(x, µtall(x)) | x ∈ N}

where µtall is the membership function for the fuzzy set tall. The membership function can be plotted as in Figure 3.1. The universe of discourse above is the set of natural numbers N. So the definition of the set tall above states that it is the set of all (infinitely many) pairs of values belonging to the natural numbers along with their degree of membership. The
membership degrees for values 0,1,2,3,4. . .160 would be zero, and those we
would quite naturally omit if we were to list the set. The set of heights with
1 cm spacing is a discrete universe, resulting in a plot with discrete points.
[Figure 3.1: The membership function µtall plotted over heights 160–200 cm (discrete points)]
The membership function can also be continuous.
The function µtall(x) above resembles a sigmoid function, which is one of the common shapes of membership functions. The most common shapes are listed below.
[Figure: Common membership function shapes: a) triangular, b) trapezoidal, c) Gaussian, d) sigmoid]

a) Triangular, with peak α at x = c and support [a, b]:

µ(x) = α(x − a)/(c − a) if a ≤ x ≤ c
µ(x) = α(x − b)/(c − b) if c ≤ x ≤ b
µ(x) = 0 otherwise

b) Trapezoidal, with plateau α on [c, d] and support [a, b]:

µ(x) = α(x − a)/(c − a) if a ≤ x ≤ c
µ(x) = α if c ≤ x ≤ d
µ(x) = α(x − b)/(d − b) if d ≤ x ≤ b
µ(x) = 0 otherwise

c) Gaussian:

µ(x) = e^(−x²/2)

d) Sigmoid:

µ(x) = 1/(1 + e^(−x+1))
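Written as code, the four shapes are short case analyses. A Python sketch using the parameter names from the figure (the test values are my own):

```python
import math

def triangular(x, a, c, b, alpha=1.0):
    # rises on [a, c], peak alpha at c, falls on [c, b]
    if a <= x <= c:
        return alpha * (x - a) / (c - a)
    if c < x <= b:
        return alpha * (x - b) / (c - b)
    return 0.0

def trapezoidal(x, a, c, d, b, alpha=1.0):
    # rises on [a, c], plateau alpha on [c, d], falls on [d, b]
    if a <= x <= c:
        return alpha * (x - a) / (c - a)
    if c < x <= d:
        return alpha
    if d < x <= b:
        return alpha * (x - b) / (d - b)
    return 0.0

def gaussian(x):
    return math.exp(-x ** 2 / 2)

def sigmoid_mf(x):
    return 1.0 / (1.0 + math.exp(-x + 1))

print(triangular(1.0, 0, 1, 2))       # 1.0 (the peak)
print(trapezoidal(2.5, 0, 1, 2, 3))   # 0.5 (halfway down the falling edge)
```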
numerical value of 1.75 meters, it is treated as a linguistic variable that may
assume, for example, linguistic values of ”tall” with a degree of membership
of 0.92, ”very short” with a degree of 0.06, or ”very tall” with a degree of
0.7. The set of values a linguistic variable can be assigned is called its term
set. For example, for linguistic variable height, the term set T (height) may
be defined as follows:
Note: the words linguistic term and linguistic value are synonyms; the word linguistic label is also used in the literature, but not in this text.
Each linguistic term is associated with a fuzzy set, each of which has a
defined membership function. The possible values for a linguistic variable
can thus be illustrated by a plot containing the membership functions of all
the terms. Such a plot for the height variable is shown in Figure 3.3.
[Figure 3.3: Membership functions µshort, µaverage and µtall over heights 150–210 cm]
Figure 3.4: Membership functions of linguistic values/terms for body temperature
3.1.2 Fuzzification
The fuzzification comprises the process of transforming crisp values into
grades of membership for linguistic terms. The membership function is used
to associate a grade to each linguistic term. As an example we can fuzzify the
height 182cm of a person using the terms ”short”, ”average”, ”tall” with the
corresponding membership functions µshort , µaverage and µtall from Figure 3.3.
The crisp value 182cm in this case fits all these terms, and the membership
degrees from the membership functions are:
µshort (182) = 0.10
µaverage (182) = 0.80
µtall (182) = 0.17
So the fuzzification of the crisp value 182 cm resulted in a list of memberships to linguistic terms. So it is not as simple as stating that the linguistic variable height would have the value short; you also state how much, i.e. to which degree, the variable has the value short. It is not clear to me what notation should actually be used to present the value of a linguistic variable.
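Fuzzification is thus just evaluating each term's membership function at the crisp value. A Python sketch with hypothetical triangular membership functions (the breakpoints, and hence the resulting degrees, are invented; they are not those of Figure 3.3):

```python
def tri(x, a, c, b):
    # triangular membership: 0 at a and b, peak 1 at c
    if a <= x <= c:
        return (x - a) / (c - a)
    if c < x <= b:
        return (b - x) / (b - c)
    return 0.0

# one membership function per linguistic term (breakpoints are made up)
terms = {
    "short":   lambda h: tri(h, 140, 160, 185),
    "average": lambda h: tri(h, 160, 175, 195),
    "tall":    lambda h: tri(h, 170, 195, 220),
}

def fuzzify(height):
    # degree of membership of the crisp value to every term
    return {term: round(mu(height), 2) for term, mu in terms.items()}

print(fuzzify(182))
```

The result is exactly the kind of term-to-degree list discussed above, one entry per linguistic term.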
3.1.3 Summary
• A fuzzy set is a set to which elements can belong to a degree 0 . . . 1. A
fuzzy set is best illustrated by a plot of its membership function.
membership to the set fast, is of course again described by its member-
ship function.
if x is A then y is B
where A and B are linguistic values from some term set, e.g. ”young” and x
and y are linguistic variables. As an example consider:
if (pressure==HIGH) then
volume = SMALL;
endif
In Figure 3.5 I have drawn membership functions µhigh and µsmall for
the fuzzy sets representing the linguistic terms/values high, for the linguistic
variable pressure, and small for the linguistic variable volume. I picked the
shapes for the membership functions arbitrarily; usually they would probably both be the same shape, but there is nothing that says they must.
[Figure 3.5: Membership functions for the term high of the linguistic variable pressure, and the term small of the variable volume]
According to the if-then rule, high pressure implies small volume. So the plot to the
left somehow implies the plot to the right (implication means that if the one
holds, so does the second). Note that the two plots have different scales and
units altogether.
Our linguistic variable, as explained in Section 3.1.2, has some linguistic value, e.g. high, but also a degree of membership to the fuzzy set representing the linguistic term high. And underneath, there is the original numerical value. Say the pressure is numerically 3.6 bar, as in Figure 3.6. Then the linguistic variable pressure has the value high to a degree of 0.4. Then, as mentioned, the volume also belongs to the fuzzy set small to a degree of 0.4. That is, we limit the fuzzy set small, with the membership function µsmall shown in the plot, to a maximum of 0.4.
Figure 3.6: A numerical pressure value 3.6 is fuzzified and results in membership degree 0.4 of the high fuzzy set; the membership function of the consequence of our if-then rule, µsmall, is limited accordingly
Or in other words, we take the minimum of µhigh (3.6) and µsmall , that is
the result of evaluating our if-then rule is a fuzzy set, with the membership
function given by min(µhigh (3.6), µsmall ). In this case we get the same results
if we just multiply µsmall by 0.4, but let us stick to the min operator.
The evaluation of an if-then rule:

if a is A then b is B

with one premise results in a fuzzy set with the membership function

min{µA(a), µB}
Again in two-valued logic, the premise would evaluate to true or false. Now
the ”pressure is high” expression will be more or less true, and the ”temperature
is low” will be more or less true. Think about the case when one of those
is zero, e.g. the pressure is not at all member of the set high. Then, as in
two-valued logic, the AND should evaluate to false, or in fuzzy terms, the
membership of volume to the fuzzy set very small should be zero.
This gives us a hint that we could use the min operator also to combine
the subexpressions in our premise. This is in fact a popular way of doing it,
and the only one I consider in this text. We might also intuitively be able to
guess that the max operator could be used for OR expressions.
We illustrate the evaluation of the if-then rule in Figure 3.7. I again
choose a different shape for the membership function µlow, just for fun. Combining
the subexpressions with AND results in the value 0.4 in this example;
that is, the temperature is effectively ignored.
The evaluation of an if-then rule:

if a is A AND b is B then c is C

with two premises results in a fuzzy set with the membership function

min(min(µA(a), µB(b)), µC(c))
[Figure: the pressure 3.6 bar gives µhigh = 0.4 and the temperature gives
µlow = 0.9; their AND is min{0.4, 0.9} = 0.4, which limits µsmall.]
Figure 3.7: Evaluation of an if-then rule with two premises separated by AND
Consider now the following set of three rules relating pressure to volume:

    pressure    volume
    high        small
    medium      medium
    low         large
The evaluation of each rule separately, as we did earlier, results in three
fuzzy sets that are limited versions of some of the membership functions
µsmall etc. How do we combine the results? If we use the min operator
again, some of the resulting fuzzy sets will be ignored. But certainly all the
rules should be considered. In Figure 3.8 we use the max operator to combine
the results of evaluating the three rules above for a numerical pressure value
of 1.9. The result is the hatched area in the bottommost plot. Does
the max operator seem to be a valid choice? In fact it is a popular choice.
The evaluation of a set of rules thus becomes a min-max operation. The
individual rules are evaluated using min, and the results are combined using
max.
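The min-max evaluation just described can be sketched in code. The triangular membership functions and their breakpoints below are hypothetical placeholders, not the exact shapes in the figures.

```python
def tri(a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# input side: pressure in bar (assumed breakpoints)
mu_low, mu_medium, mu_high = tri(0.0, 1.0, 2.5), tri(1.0, 2.5, 4.0), tri(2.5, 4.0, 5.0)
# output side: volume in m^3 (assumed breakpoints)
mu_small, mu_medium_v, mu_large = (tri(0.0, 0.05, 0.15),
                                   tri(0.05, 0.15, 0.25),
                                   tri(0.15, 0.25, 0.35))

# the three rules: high -> small, medium -> medium, low -> large
rules = [(mu_high, mu_small), (mu_medium, mu_medium_v), (mu_low, mu_large)]

def mu_combined(pressure, v):
    """Aggregated output set: max over the rules of min(rule strength, mu(v))."""
    return max(min(mu_p(pressure), mu_v(v)) for mu_p, mu_v in rules)
```

With these placeholder shapes, a pressure of 1.9 bar fires the medium and low rules, and mu_combined gives the membership of each candidate volume in the aggregated fuzzy set.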
[Figure: for the pressure 1.9 bar the three rules are evaluated and combined
as max{min{0.4, µmedium}, min{0.2, µsmall}, min{0.1, µlarge}}.]
Figure 3.8: Evaluation of the three rules for the numerical pressure value
1.9; the results are combined with the max operator
3.4 Defuzzification
The most common methods for combining fuzzy rules produce a fuzzy set.
When we e.g. design a controller, we need a crisp numerical output value.
This requires some process of defuzzification, i.e. producing a numerical
value that best reflects the fuzzy set. There are many techniques for
defuzzification; we will mention only three.
3.4.1 The Centroid Method
The center of area, or center of gravity, or centroid method computes the
center of area of the region under the membership function.
u^* = \frac{\int u \cdot \mu_A(u) \, du}{\int \mu_A(u) \, du}

Or in discrete form:

u^* = \frac{\sum_{i=1}^{n} x_i \cdot \mu_A(x_i)}{\sum_{i=1}^{n} \mu_A(x_i)}
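The discrete form translates directly into code; the sampled universe and the triangular example set below are assumptions for illustration only.

```python
def centroid(xs, mu):
    """Discrete centroid: sum(x_i * mu(x_i)) / sum(mu(x_i)) over the samples xs."""
    den = sum(mu(x) for x in xs)
    if den == 0.0:
        return 0.0  # convention for an empty fuzzy set
    return sum(x * mu(x) for x in xs) / den

# Example: a symmetric triangular set peaking at 0.15 on a sampled universe
xs = [i * 0.01 for i in range(31)]               # [0, 0.3] in steps of 0.01
mu = lambda x: max(0.0, 1.0 - abs(x - 0.15) / 0.1)
# for a symmetric fuzzy set the centroid coincides with the peak, here 0.15
```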
3.5 Fuzzy Control
Fuzzy control is typically used when an explicit analytical model of the system
is not available. Fuzzy control is intuitive to understand and easy to design for
engineers who are unfamiliar with classical control theory. A fuzzy controller
can be designed based on e.g. a human operator's experience. Fuzzy control
consists of selecting and using

1. input and output variables and their universes of discourse

2. membership functions for the linguistic terms (fuzzification)

3. a set of if-then rules and a method for combining them

4. a defuzzification method
Fuzzification
System Model We have at our disposal a discrete model of the system as
a difference equation:

y(k) = 2.1 y(k-1) - 0.98 y(k-2) - 0.6 u(k-1) - 0.06 u(k-2)

In reality we would not have this model; that is primarily the reason for using
fuzzy control instead of classical methods.
We can also do a z-transformation of the model in order to obtain a
transfer function:
y(z) = \frac{-0.6z^{-1} - 0.06z^{-2}}{1 - 2.1z^{-1} + 0.98z^{-2}} \, u(z)
We start by looking at the universes of discourse, that is the intervals,
of the input variables θ(t) and θ′(t) and the output u(t). For θ we have an
interval X in degrees, say limited to X = [−45°, +45°]. For θ′ we have an
interval Y in degrees per second, which depends on the pendulum used and
on how violent the disturbances acting on it may be; let's say Y = [−30°/s, 30°/s].
The output, that is the control space, of u(t) could be an interval representing
the force in Newtons applied to the cart. A pendulum in Technobotnia
instead uses a positioning signal for the cart, with an interval [−0.5, 0.5]
(unitless, but it should represent a force). We use this interval and call the
output universe of discourse U = [−0.5, 0.5].
Next we define linguistic variables that divide the universes into fuzzy
subsets and define the membership functions for the terms. In this example
it is suitable to use the same termset for all variables: NB, NM, NS, PS,
PM, PB (negative big, negative medium, negative small, positive small,
positive medium, positive big); only the scales of the universes
differ in this example. For this example we choose triangular membership
functions for all the linguistic variables. We define membership functions
µNB, µNM, µNS, µPS, µPM, µPB for all linguistic variables as depicted in Figure
3.10. The six functions are the same for all linguistic variables, except
for the scale.
The choice of membership functions to represent the linguistic
variables is somewhat arbitrary. One of the adjustments made during testing
of the control system will be experimenting with different membership
functions, that is, changing the parameters of these triangular functions, or
perhaps switching to other shapes of functions, e.g. Gaussian. This process is
called tuning.
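A parametrized termset makes such tuning easy: regenerate the functions with new breakpoints and re-test. The evenly spaced peaks below are an assumption, not the exact breakpoints of Figure 3.10.

```python
def make_termset(xmin, xmax):
    """Six triangular membership functions NB..PB over [xmin, xmax],
    with evenly spaced peaks (an assumed, tunable layout)."""
    def tri(a, b, c):
        def mu(x):
            if x <= a or x >= c:
                return 0.0
            return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
        return mu
    step = (xmax - xmin) / 5.0
    names = ["NB", "NM", "NS", "PS", "PM", "PB"]
    return {name: tri(xmin + i * step - step, xmin + i * step, xmin + i * step + step)
            for i, name in enumerate(names)}

angle    = make_termset(-45.0, 45.0)   # degrees
velocity = make_termset(-30.0, 30.0)   # degrees per second
position = make_termset(-0.5, 0.5)     # the cart positioning signal
```

Tuning then amounts to changing the breakpoints passed to tri, or swapping tri for some other shape.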
[Figure: three panels, each showing the six triangular membership functions
µNB, µNM, µNS, µPS, µPM, µPB over the interval [xmin, xmax] of the respective
universe, with membership degree on the vertical axis.]
Figure 3.10: The membership functions for the terms of the linguistic variables
a) angle b) positioning and c) angular velocity
Designing If-Then Rules
We now use our intuition and experience to design a set of If-then rules,
for example:

1. If θ is NM AND θ′ is PS then u is NS

2. If θ is NM AND θ′ is PM then u is PS

3. If θ is NS AND θ′ is PS then u is PS

. . .
We have six terms in both inputs. Designing rules for each combination of
the inputs thus results in 36 rules, three of which are listed above. Instead
of listing the rest of the rules in the format above, we can (while we have
only two inputs) compactly represent them in a lookup table as below.
θ\θ′   NB   NM   NS   PS   PM   PB
NB     NB   NB   NB   NM   NS   PS
NM     NB   NB   NM   NS   PS   PS
NS     NB   NM   NS   PS   PS   PM
PS     NM   NS   NS   PS   PM   PB
PM     NS   NS   PS   PM   PB   PB
PB     NS   PS   PM   PB   PB   PB
The rules again look like crisp rules but remember that we are talking
about fuzzy sets.
We take as an example the inputs θ = 13° and θ′ = 6°/s. This value of
θ falls under the terms NM and NS, so that µNM(13) = 0.6 and µNS(13) = 0.4.
For θ′ = 6°/s the fuzzy sets (terms) PS and PM have nonzero membership
grades, so that µPS(6) = 0.8 and µPM(6) = 0.3. If we remove from the table
above the rules for terms equal to zero, we are left with the table below.
θ\θ′   PS   PM
NM     NS   PS
NS     PS   PS
The four rules in this table are those that are said to fire for the input θ = 13°
and θ′ = 6°/s. These rules are the same ones that we listed earlier in the form
of If-then sentences.
The AND in the rules, as explained in Section 3.2, means taking the
minimum of the membership degrees. We illustrate the evaluation of the
rules in the figures below.
[Figures: evaluation of the four firing rules for θ = 13° and θ′ = 6°/s.
Rule (NM, PS → NS) fires with 0.6 AND 0.8 = min{0.6, 0.8} = 0.6, limiting µNS.
Rule (NM, PM → PS) fires with 0.6 AND 0.3 = 0.3, limiting µPS.
Rule (NS, PS → PS) fires with 0.4 AND 0.8 = 0.4, limiting µPS.
Rule (NS, PM → PS) fires with 0.4 AND 0.3 = 0.3, limiting µPS.]
Combining the Rules
The application of the rules just illustrated resulted in a number of fuzzy
subsets of U with the membership functions ϕj (u). Next we combine these
membership functions into one function representing the control action. For
this we use the OR operator, which means taking the maximum.
[Figure: the output fuzzy sets of the four fired rules (µNS limited to 0.6,
µPS limited to 0.4) are combined with the max operator; the hatched area is
the resulting fuzzy subset of U.]
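The max-combination of the fired rules can be sketched as follows. The NS and PS output sets on U = [−0.5, 0.5] are hypothetical placeholders; the firing strengths 0.6, 0.3, 0.4 and 0.3 are those computed above for θ = 13° and θ′ = 6°/s.

```python
def tri(a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

mu_ns = tri(-0.3, -0.1, 0.1)   # hypothetical NS output set on U
mu_ps = tri(-0.1, 0.1, 0.3)    # hypothetical PS output set on U

# (firing strength, output set) for the four rules that fired
fired = [(0.6, mu_ns), (0.3, mu_ps), (0.4, mu_ps), (0.3, mu_ps)]

def mu_control(u):
    """The combined control fuzzy set: OR (max) over the fired rules."""
    return max(min(w, mu(u)) for w, mu in fired)
```

Defuzzifying mu_control (e.g. with the centroid method of Section 3.4.1) then yields the crisp positioning signal for the cart.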
Defuzzification
3.6 Exercises
1. The terms/values ”moderately high fever” and ”high fever” for the
linguistic variable ”body temperature” are fuzzy sets with the membership
functions shown in Figure 3.4. Try to draw the intersection
and union of these fuzzy sets (this was not covered in the text, use
your intuition).
2.
Chapter 4
So far we have discussed two distinct methods for building controllers: fuzzy
and neural. Often the choice of method is dictated by the data available on
the plant to be controlled. If the data are pairs of numbers, we may turn to
a neural method, and if the data are rules, fuzzy methods are appropriate.
Neural methods provide learning capability, whereas fuzzy methods provide
flexible knowledge-representational capability.
Chapter 5
Reinforcement Learning
5.1 Introduction
The following introduction is a selection of directly stolen pieces from [SB98].
Reinforcement learning is different from supervised learning, the kind of
learning studied in most current research in machine learning, statistical
pattern recognition, and artificial neural networks. Supervised learning is
learning from examples provided by a knowledgeable external supervisor.
This is an important kind of learning, but alone it is not adequate for learning
from interaction. In interactive problems it is often impractical to obtain
examples of desired behavior that are both correct and representative of all
the situations in which the agent has to act. In uncharted territory, where
one would expect learning to be most beneficial, an agent must be able to
learn from its own experience.
One of the challenges that arise in reinforcement learning and not in other
kinds of learning is the trade-off between exploration and exploitation. To
obtain a lot of reward, a reinforcement learning agent must prefer actions
that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected
before. The agent has to exploit what it already knows in order to obtain
reward, but it also has to explore in order to make better action selections
in the future. The dilemma is that neither exploration nor exploitation can
be pursued exclusively without failing at the task.
All reinforcement learning agents have explicit goals, can sense aspects of
their environments, and can choose actions to influence their environments.
Reinforcement learning agents try to function in environments where correct
choice requires taking into account indirect, delayed consequences of actions,
and thus may require foresight or planning.
5.1.1 Elements of Reinforcement Learning
One can identify four main subelements of a reinforcement learning system:

1. a policy

2. a reward function

3. a value function

4. a model of the environment
A policy defines the learning agent’s way of behaving at a given time. Roughly
speaking, a policy is a mapping from perceived states of the environment to
actions to be taken when in those states.
A reward function defines the goal in a reinforcement learning problem.
Roughly speaking, it maps each perceived state (or state/action pair) of the
environment to a single number, a reward, indicating the intrinsic desirability
of that state. A reinforcement learning agent’s sole objective is to maximize
the total reward it receives in the long run.
Whereas a reward function indicates what is good in an immediate sense,
a value function specifies what is good in the long run. Roughly speaking,
the value of a state is the total amount of reward an agent can expect to
accumulate over the future, starting from that state. For example, a state
might always yield a low immediate reward but still have a high value because
it is regularly followed by other states that yield high rewards. Or the reverse
could be true. We seek actions that bring about states of highest value, not
highest reward, because these actions obtain the greatest amount of reward
for us over the long run.
The fourth and final element of some reinforcement learning systems is
a model of the environment. Models are used for planning, by which we
mean any way of deciding on a course of action by considering possible future
situations before they are actually experienced. Early reinforcement learning
systems were explicitly trial-and-error learners; what they did was viewed as
almost the opposite of planning.
Reinforcement learning can be used when the state set is very large or
even infinite.
In the first trial, the RL system begins riding the bicycle and performs a
series of actions that result in the bicycle being tilted 45 degrees to the right.
At this point there are two actions possible: turn the handle bars left or turn
them right. The RL system turns the handle bars to the left and immedi-
ately crashes to the ground, thus receiving a negative reinforcement. The
RL system has just learned not to turn the handle bars left when tilted 45
degrees to the right. In the next trial the RL system performs a series of ac-
tions that again result in the bicycle being tilted 45 degrees to the right. The
RL system knows not to turn the handle bars to the left, so it performs the
only other possible action: turn right. It immediately crashes to the ground,
again receiving a strong negative reinforcement. At this point the RL system
has not only learned that turning the handle bars right or left when tilted
45 degrees to the right is bad, but that the ”state” of being tilted 45 degrees
to the right is bad. Again, the RL system begins another trial and performs
a series of actions that result in the bicycle being tilted 40 degrees to the
right. Two actions are possible: turn right or turn left. The RL system
turns the handle bars left which results in the bicycle being tilted 45 degrees
to the right, and ultimately results in a strong negative reinforcement. The
RL system has just learned not to turn the handle bars to the left when
tilted 40 degrees to the right. By performing enough of these trial-and-error
interactions with the environment, the RL system will ultimately learn how
to prevent the bicycle from ever falling over.
[Figure: a graph of vertices (A, B, C, E, F, H, I, J, . . . ) through which a
traveler, starting at vertex A, searches for the gold.]
hit (success), prepare a reward scheme. When the gold is not found (failure),
prepare a penalty scheme. Then trace back to the starting vertex A; at each
visited vertex, apply the reward or penalty scheme. That is, put the placed
stone back into the signpost box with an additional stone of the same color
(reward), or take the placed stone away from the signpost (penalty). When
the traveler returns, the next traveler will hit the road with a bit more
hope. Repeat the same journey many times.
The power of reinforcement learning actually lies in the fact that the agent
does not need to wait until it receives feedback at the end to make
adjustments. This is achieved by temporal difference methods discussed in
Section 5.5.
5.5.1 Q-Learning
Q-learning maintains a matrix, call it the Q-matrix, which has one row for
every possible state, and one column for every action the agent can take in
these states. We look at Q-learning through a cliff walker example ([SB98]).
Consider the grid world shown in Figure 5.2. The walker (agent) should find
the optimal path from start to stop without falling down the cliff (the black
area is no-no). For us it is easy to see that the optimal path is to follow
the edge of the cliff.

[Figure 5.2: the 4 × 10 grid world; ’start’ and ’stop’ lie at opposite ends of
the cliff (the black area).]

The walker can be in 4 · 10 = 40 positions, i.e. states, so
the Q-matrix should have 40 rows. The agent can go up, down, right or
left (actions), that is the matrix should have four columns. The matrix is a
listing of state-action pairs. The values filled into the matrix are the expected
returns (rewards) of a certain action in a certain state. An episode stops when
we hit the goal, or run into the cliff. The Q-matrix is the memory of the
system, and the key issue is how to update the matrix so that the walker
makes better and better decisions. Initially the agent knows nothing, so the
matrix is filled with random values. The walker can do one of two things:
• take the optimal action, which means he’ll take the action correspond-
ing to the biggest Q-value.
• explore his world by taking a random action, just to see what happens.
Value function So the value of a given state is the total amount of reward
an agent can expect to accumulate in the long run. The value function V
can be defined as the value of the state's best state-action pair:

V(s) = max_a Q(s, a)

In the cliff walker there is a cost (negative reward) for each step we take in
the grid. Thus the agent should learn to take as few steps as possible.
Initialize Q(s, a) randomly
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode, i.e. until the goal or the cliff is met):
        Choose a from s using a policy derived from Q (e.g. ε-greedy, depending on ε)
        Take action a, observe r and the new state s′
        Q(s, a) ← (1 − α)Q(s, a) + α [r + γV(s′)]
        s ← s′
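The pseudocode above can be sketched as a small program for the cliff walker. The reward scheme (−1 per step, −100 for the cliff), the parameter values and the ε-greedy policy are assumptions in the spirit of [SB98], not the only possible choices.

```python
import random

ROWS, COLS = 4, 10
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]        # up, down, left, right
START, GOAL = (3, 0), (3, 9)
ALPHA, GAMMA, EPS = 0.5, 1.0, 0.1                   # assumed parameter values

def step(state, a):
    """Move in the grid; an episode ends at the goal or in the cliff."""
    r, c = state
    r = min(max(r + ACTIONS[a][0], 0), ROWS - 1)
    c = min(max(c + ACTIONS[a][1], 0), COLS - 1)
    if r == 3 and 0 < c < 9:                        # the cliff along the bottom row
        return START, -100.0, True
    if (r, c) == GOAL:
        return (r, c), -1.0, True
    return (r, c), -1.0, False

def train(episodes=2000, seed=0):
    """Fill the Q-'matrix' (here a dict over the 40 x 4 state-action pairs)."""
    rng = random.Random(seed)
    Q = {((r, c), a): rng.uniform(-0.01, 0.01)      # random initialization
         for r in range(ROWS) for c in range(COLS) for a in range(4)}
    for _ in range(episodes):
        s, done = START, False
        while not done:
            # epsilon-greedy: explore with probability EPS, else exploit
            if rng.random() < EPS:
                a = rng.randrange(4)
            else:
                a = max(range(4), key=lambda x: Q[(s, x)])
            s2, r, done = step(s, a)
            v = 0.0 if done else max(Q[(s2, x)] for x in range(4))
            Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * v)
            s = s2
    return Q

def greedy_path(Q, limit=100):
    """Follow the learned policy (always the biggest Q-value) from the start."""
    s, path = START, [START]
    for _ in range(limit):
        a = max(range(4), key=lambda x: Q[(s, x)])
        s, r, done = step(s, a)
        path.append(s)
        if done:
            break
    return path
```

After training, the greedy walker reaches the goal, typically along the edge of the cliff as anticipated above.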
5.6 Exercises
1. What is reinforcement learning good at, compared to, say, neural networks?

2. Try to think about how a neural network could manage the bicycle
riding. The only input-output pairs we could present to the network
would be disaster/not disaster for the output, and a sequence of actions
as inputs.
Bibliography

[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction.
MIT Press, 1998.