I. INTRODUCTION
Although conventional control and artificial intelligence
researchers have made many advances, neither ideology
seems capable of realizing autonomous operation. That
is, neither can produce machines which can interact with
the world with an ease comparable to humans or at least
higher animals. In responding to such limitations, many
researchers have looked to biological/physiological based
systems as the motivation to design artificial systems. As
Examples are the behavior based systems of Brooks [1]
and Beer [2]. Behavior based control systems consist of a
hierarchical structure of simple behavior modules. Each
module is responsible for the sensory motor responses of
a particular level of behavior. The overall effect is that
higher level behaviors are recursively built upon lower ones
and the resulting system operates in a self-organizing manner. Both Brooks' and Beer's systems were loosely based
upon the nervous systems of insects. These artificial insects operated in a hardwired manner and exhibited an
interesting repertoire of simple behaviors. By hardwired it is meant that each behavior module had its responses predetermined and was simply programmed externally. Although this approach is successful with simple behaviors, it is obvious that many situations exist where predetermined solutions are impossible or impractical to obtain. It is subsequently proposed that by incorporating learning into the behavior based control system, these difficult behaviors could be acquired through self-exploration and self-learning.

0-7803-0999-5/93/$03.00 © 1993 IEEE
Complex behaviors are usually characterized by a sequence of actions, with success or failure known only at
the end of that sequence. Also, the critical error signal is only an indication of the success or failure of the
system and no information regarding error gradients can
be determined, as in the case of continuous valued error
feedback. Thus the required learning mechanism must
be capable of both reinforcement learning and temporal credit assignment. Incremental dynamic programming techniques such as Barto's [3] temporal difference
(TD) learning appear to be well suited to such tasks. Based upon
Barto's earlier adaptive heuristic critic [4], TD employs
adaptive state and action evaluation functions to incrementally improve its action policy until successful operation is attained. The incorporation of TD learning into
behavior based control results in a framework of adaptive behavior modules (ABMs) and non-adaptive behavior modules which
is referred to here as a distributed adaptive control system (DACS). The remainder of this report will be concerned with a brief description of the DACS and ABMs,
and the implementation of the locomotion level ABM within the
DACS of a simulated quadruped mobile robot. This level
is considered appropriate because the actuator sequences
for quadruped locomotion are not intuitively obvious and are difficult to determine. Other levels, such as global navigation, task planning and task coordination, are implemented
and discussed by Digney [5].
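The learning problem the introduction describes — reinforcement arriving only at the end of an action sequence, with no error gradients — can be sketched with tabular Q-learning on a toy chain task. The task, the constants and all names below are illustrative, not taken from the paper; the sketch also uses the common reward-maximizing formulation rather than the paper's negative-cost convention.

```python
import random

# Tabular Q-learning on a chain: the critic signals success only when
# the final state is reached, so credit must propagate backward in time.
random.seed(0)

N_STATES, GOAL = 6, 5
ACTIONS = (-1, 1)                  # step left / step right
GAMMA, ALPHA, EPS = 0.9, 0.5, 0.1  # discount, adaptation rate, exploration

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def choose(s):
    """Epsilon-greedy action selection over the current evaluations."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

for _ in range(500):
    s = 0
    while s != GOAL:
        a = choose(s)
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0  # reinforcement only at the end
        best_next = 0.0 if s2 == GOAL else max(Q[(s2, b)] for b in ACTIONS)
        # LMS-style update toward the temporal-difference target.
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
        s = s2

policy = [max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(GOAL)]
```

After training, the greedy policy steps toward the goal from every state, even though no intermediate state ever received an explicit error signal.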
II. DISTRIBUTED ADAPTIVE CONTROL SYSTEMS
The DACS shown in Figure 1 is composed of various adaptive and non-adaptive behavior modules. Non-adaptive
Authorized licensed use limited to: Khajeh Nasir Toosi University of Technology. Downloaded on December 21, 2009 at 05:54 from IEEE Xplore. Restrictions apply.
r_t = a_e r_e + a_g r_g + a_s r_s                            (1)

where a_e, a_g and a_s are the relative importance of the
reinforcements.
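Equation 1's weighted combination can be sketched directly; the default weights below are placeholders, not values from the paper.

```python
# Composite reinforcement of Equation 1: a weighted sum of the
# environment-, goal- and sensor-based reinforcements.
def composite_reinforcement(r_e, r_g, r_s, a_e=1.0, a_g=1.0, a_s=1.0):
    """a_e, a_g and a_s set the relative importance of each signal."""
    return a_e * r_e + a_g * r_g + a_s * r_s

# Example: doubling a_g makes the goal reinforcement dominate.
r_t = composite_reinforcement(r_e=-1.0, r_g=-2.0, r_s=0.0, a_g=2.0)
```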
It can be seen from Figure 2 that the flow of environmental and sensor based reinforcement is in the upward
direction. This will result in lower level skills and behaviors being learned first, then other higher level behaviors,
converging in a recursive manner toward the highest level.
Figure 1 shows this highest level as existing within a single physical machine. However, in the case of multiple
machines operating in a collective, higher abstract behavior levels are possible. Within the context of this paper,
only behaviors relevant to individual machines will be discussed.

[Figure 2: Adaptive behavior module with environment and sensor based reinforcement.]
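The upward flow of reinforcement through the module hierarchy might be organized as in the following sketch; the class, method and level names are invented for illustration and are not from the paper.

```python
# Sketch of the DACS hierarchy: behavior modules stacked lowest to
# highest, with reinforcement delivered upward so that lower-level
# skills receive training signals (and converge) before higher ones.
class BehaviorModule:
    """One level of the DACS; ABMs learn, non-adaptive modules are hardwired."""
    def __init__(self, name, adaptive):
        self.name = name
        self.adaptive = adaptive
        self.reinforcement_log = []

    def receive(self, r):
        self.reinforcement_log.append(r)

def propagate_upward(levels, r):
    """Deliver reinforcement from the lowest level toward the highest."""
    visited = []
    for module in levels:          # levels ordered lowest to highest
        module.receive(r)
        visited.append(module.name)
    return visited

levels = [BehaviorModule("locomotion", adaptive=True),
          BehaviorModule("navigation", adaptive=True),
          BehaviorModule("task planning", adaptive=True)]
order = propagate_upward(levels, -1.0)
```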
e_{s,u} = gamma Q_virtual - Q_{s,u} + r_t                    (4)
Q_virtual = min_u { Q_{w,u} }                                (5)
It is easily seen that Q_virtual becomes the minimum action-dependent evaluation function of the new state, w, (remember the evaluation functions are negative in sign) and in effect corresponds to the action most likely to be taken when the system leaves state w.
If the action, u, achieves the desired goal, the virtual state evaluation is

Q_virtual = 0.                                               (6)

This provides relative state evaluations and allows for open-ended or cyclic goal states. This is illustrated by considering that for cyclic goals it is the dynamic transitions between states that constitute a goal state, and not simply the arrival at a static system state(s).
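The virtual evaluation and the error it produces can be sketched as below; the learning-rate name alpha, the dictionary encoding of Q and the action names are assumptions, not from the paper.

```python
GAMMA, ALPHA = 0.9, 0.5  # discount factor and adaptation rate

def td_update(Q, s, u, w, r_t, actions, goal_reached):
    """One evaluation-function update for taking action u in state s,
    arriving in state w and receiving reinforcement r_t."""
    # Equations 5/6: virtual evaluation of the new state w
    # (minimum because the evaluations are negative in sign; 0 at the goal).
    q_virtual = 0.0 if goal_reached else min(Q[(w, b)] for b in actions)
    # Equation 4: error between predicted and observed evaluation.
    error = GAMMA * q_virtual - Q[(s, u)] + r_t
    # LMS rule: move Q[s, u] a fraction ALPHA toward the target.
    Q[(s, u)] += ALPHA * error
    return error

actions = ["hold", "v_ex", "v_rt"]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}
e = td_update(Q, 0, "v_ex", 1, -1.0, actions, goal_reached=False)
```

Repeated application of this update drives the negative-valued evaluations toward the discounted cost-to-goal of each state-action pair.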
This error is used to adapt the evaluation functions according to LMS rules as follows
Q_{s,u} = Q_{s,u} + alpha e_{s,u}                            (7)

III. QUADRUPED MOBILE ROBOT

For the quadruped, the goal and sensor based reinforcements are

r_g = { 0        if locomotion is performed
        -R_g     otherwise }                                 (15)

r_s = { -R_low   easy operation
        -R_high  difficult operation }                       (16)

and the locomotion commands are encoded as

c_locomotion = { 0  forward
                 1  left turn }                              (10)
For any specific command the locomotion ABM will issue action responses, u_locomotion,
to the actuators driving the legs in the horizontal, h, and vertical, v, directions. Within this action vector are the individual actuator commands to extend, ex, or retract, rt, as shown in
Equations 11 and 12.
where

c_leg = { hold
          v_ex  extend vertical
          v_rt  retract vertical
          h_ex  extend horizontal
          h_rt  retract horizontal }                         (12)
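The per-leg command vocabulary of Equation 12 might be encoded as follows; the dictionary and function names, and the tuple encoding of u_locomotion, are assumptions for illustration.

```python
# Per-leg actuator primitives of Equation 12.
LEG_COMMANDS = {
    "hold": "hold position",
    "v_ex": "extend vertical",
    "v_rt": "retract vertical",
    "h_ex": "extend horizontal",
    "h_rt": "retract horizontal",
}

def locomotion_action(per_leg):
    """Build the action vector u_locomotion from one command per leg."""
    for c in per_leg:
        if c not in LEG_COMMANDS:
            raise ValueError(f"unknown leg command: {c!r}")
    return tuple(per_leg)

# Example: one candidate action for four legs (legs 0-3).
u_locomotion = locomotion_action(["h_ex", "h_rt", "h_rt", "h_ex"])
```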
Each leg is equipped with sensors for measuring the forces on each foot and the positions of each leg. The forces on the foot are biased such that -f_max < f_leg < f_max, where f_max is the highest force magnitude expected. Similarly, the position sensors are biased such that -l_max < l_leg < l_max, where l_max is half the stroke of the linear actuator. For an arbitrary leg, leg_i, and direction, d, the force and position descriptions of state are
a_force,leg_i,d = { -1  if f_leg_i,d < 0
                     0  if f_leg_i,d = 0
                    +1  if f_leg_i,d > 0 }                   (13)

a_pos,leg_i,d = { -1  if l_leg_i,d < 0
                  +1  if l_leg_i,d > 0 }                     (14)
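Equations 13 and 14 reduce the biased, continuous sensor readings to signs; a minimal sketch (function names are illustrative):

```python
# State discretisation of Equations 13-14: only the sign of the biased
# force and position readings enters the state description.
def force_state(f):
    """-1, 0 or +1 according to the sign of the biased foot force."""
    return (f > 0) - (f < 0)

def position_state(l):
    """-1 or +1 according to the sign of the biased leg position."""
    return 1 if l > 0 else -1

# Example: one leg's (force, position) contribution to the state.
state = (force_state(-3.2), position_state(0.7))
```

Discretising to signs keeps the state space small enough for the tabular evaluation functions of the ABM.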
IV. SIMULATION AND RESULTS
Using the reinforcement scheme described above, the locomotion ABM was simulated and its abilities to learn in
an initially unknown environment and adapt to changes
and malfunctions were evaluated. In the following section
results from the locomotion tests are presented. Note that
the temporal discount factor and the adaptation rate chosen for these tests were 0.9 and 0.5 respectively. The reinforcements R_g, R_low, R_high and R_e were set to 1.0, 1.0, 2.0 and 4.0 respectively.
All possible gaits, including forward, backward, right
turn and left turn, were discovered and eventually mastered. Figure 6 shows the quadruped's performance while
learning the four gaits. Shown on the vertical axis is the
[Figure 6: Locomotion learning performance for the four gaits; curves 'Forward2.dat', 'Left2.dat', 'Right2.dat' and 'Forward1.dat' plotted against Steps/20, with the commands and reinforcements shown per frame.]
[Table: a learned gait sequence, listing the per-leg actuator commands (h_ex, h_rt, v_ex, v_rt) for legs 0-3 in each frame, together with the reinforcements r_e and r_g (values 0, R_low, R_g).]

V. DISCUSSION

The results achieved in the locomotion tests showed initially poor performance followed by rapid improvement and eventual convergence to an optimal solution. It is during the initial poor performance that the ABM explores the state space, discovering all the obtainable states of the system. For the locomotion ABM, these states are the leg positions and the force on each foot. During this exploration, the learning system refined the initially random action policy into a policy that achieves the specified command. These commands themselves are not supplied to the system as predetermined knowledge but are discovered as realizable commands by the system above. If the quadruped had been placed in water, it would learn to swim; had the physical configuration allowed it to perform other gaits, then those gaits would have been learned as well.
In response to actuator malfunctions the ABM, after a

REFERENCES

Brooks, R. (1991) Intelligence Without Reason, AI Memo No. 1293, MIT Artificial Intelligence Lab.

Beer, R.D., Chiel, H.J., and Sterling, L.S. (1990) A biological perspective on autonomous agent design, Robotics and Autonomous Systems.

Barto, A.G., R.S. Sutton and C.H. Watkins (1989) Learning and Sequential Decision Making, COINS Technical Report.

Digney, B.L. (1992) Emergent Intelligence in a Distributed Adaptive Control System, Ph.D. Thesis circulated in draft, University of Saskatchewan, Saskatoon, Saskatchewan.

Carpenter, G.A., S. Grossberg, and D. Rosen (1990) ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition, IJCNN 1991, Seattle, Washington, Vol. II, pp. 151-156.

Barto, A.G., S.J. Bradtke and S.P. Singh (1991) Real-time Learning and Control Using Asynchronous Dynamic Programming, University of Massachusetts at Amherst Technical Report 91-57.