
A Distributed Adaptive Control System

for a Quadruped Mobile Robot


Bruce L. Digney and M. M. Gupta
Intelligent Systems Research Laboratory, College of Engineering
University of Saskatchewan, Saskatoon, Sask. CANADA S7N 0W0
Email: digney@dvinci.usask.ca
Abstract - In this research, a method by which reinforcement learning can be incorporated into a behavior based control system is presented. Behaviors which are impossible or impractical to embed as predetermined responses are learned through self-exploration and self-organization using a temporal difference reinforcement learning technique. This results in what is referred to as a distributed adaptive control system (DACS); in effect the robot's artificial nervous system. A DACS is developed for a simulated quadruped mobile robot and the locomotion behavior level is isolated and evaluated. At the locomotion level the proper actuator sequences were learned for all possible gaits, and eventually graceful gait transitions were also learned. When confronted with an actuator malfunction, all gaits and transitions were adapted, resulting in new limping gaits for the quadruped.

I. INTRODUCTION
Although conventional control and artificial intelligence
researchers have made many advances, neither ideology
seems capable of realizing autonomous operation. That
is, neither can produce machines which can interact with
the world with an ease comparable to humans or at least
higher animals. In responding to such limitations, many
researchers have looked to biological/physiological based
systems as the motivation to design artificial systems. As
an example are the behavior based systems of Brooks [l]
and Beer [2]. Behavior based control systems consist of a
hierarchical structure of simple behavior modules. Each
module is responsible for the sensory motor responses of
a particular level of behavior. The overall effect is that
higher level behaviors are recursively built upon lower ones
and the resulting system operates in a self-organizing manner. Both Brooks' and Beer's systems were loosely based
upon the nervous systems of insects. These artificial insects operated in a hardwired manner and exhibited an
interesting repertoire of simple behaviors. By hardwired it is meant that each behavior module had its responses predetermined and was simply programmed externally. Although this approach is successful with simple behaviors,
it is obvious that many situations exist where predetermined solutions are impossible or impractical to obtain.
It is subsequently proposed that by incorporating learning into the behavior based control system, these difficult
behaviors could be acquired through self-exploration and
self-learning.
Complex behaviors are usually characterized by a sequence of actions with success or failure only known at the end of that sequence. Also, the critical error signal is only an indication of the success or failure of the system, and no information regarding error gradients can be determined, as in the case of continuous valued error feedback. Thus the required learning mechanism must be capable of both reinforcement learning and temporal credit assignment. Incremental dynamic programming techniques such as Barto's [3] temporal difference (TD) method appear to be well suited to such tasks. Based upon Barto's previous adaptive heuristic critic [4], TD employs adaptive state and action evaluation functions to incrementally improve its action policy until successful operation is attained. The incorporation of TD learning into behavior based control results in a framework of adaptive behavior modules (ABMs) and non-adaptive behavior modules which is referred to here as a distributed adaptive control system (DACS). The remainder of this report is concerned with a brief description of the DACS and ABMs, and with the implementation of the locomotion level ABM within the DACS of a simulated quadruped mobile robot. This level is considered appropriate because the actuator sequences for quadruped locomotion are not intuitively obvious and are difficult to determine. Other levels such as global navigation, task planning and task coordination are implemented and discussed by Digney [5].
II. DISTRIBUTED ADAPTIVE CONTROL SYSTEMS
The DACS shown in Figure 1 is comprised of various adaptive and non-adaptive behavior modules. Non-adaptive modules are present as inherent knowledge and are used where adaptive solutions are not required. All modules receive sensory inputs and respond with actions in an attempt to perform a command specified by a higher level.
The performance of commands in most cases will require a sequence of actions by the lower level system and possibly the cooperation of many lower level systems. The coupling between ABMs is shown in Figure 2. In this configuration, the action from level l+1 becomes the command for level l. Level l+1 also supplies goal based reinforcement, r_g, to drive level l towards successful completion of that command. Level l in turn issues actions to level l-1 and receives environmental based reinforcement, r_e, from level l-1. This environment based reinforcement is representative of the difficulty or cost incurred while performing the requested actions and is included to drive level l to a cost effective solution. While operating, level l may enter a state which is in some way damaging or dangerous. To drive the system away from such a state, sensor based reinforcement, r_s, is used. Sensor based reinforcement is supplied from sensors at level l. It is analogous to pain or fear and will ensure that level l operates in a safe manner. These three reinforcements are combined into a total reinforcement signal, r_t, according to Equation 1.

$$r_t = a_e \, r_e + a_g \, r_g + a_s \, r_s \qquad (1)$$

where: a_e, a_g and a_s are the relative importance of the reinforcements.
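As a concrete illustration of Equation 1, the short sketch below combines the three reinforcement signals with their importance weights. The function name and the default weight values are assumptions made for illustration, not values taken from the paper.

```python
def total_reinforcement(r_e, r_g, r_s, a_e=1.0, a_g=1.0, a_s=1.0):
    """Combine environmental, goal and sensor based reinforcement
    (Equation 1): r_t = a_e*r_e + a_g*r_g + a_s*r_s.

    The weights express the relative importance of the three signals;
    the defaults here are placeholders, not the paper's values.
    """
    return a_e * r_e + a_g * r_g + a_s * r_s

# Example: goal not yet reached (r_g = -1.0), difficult motion (r_e = -2.0),
# no pain signal (r_s = 0.0).
r_t = total_reinforcement(r_e=-2.0, r_g=-1.0, r_s=0.0)
```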

Figure 1: Schematic of DACS

Figure 2: Hierarchy of Three ABMs


It can be seen from Figure 2 that the flow of environmental and sensor based reinforcement is in the upward direction. This will result in lower level skills and behaviors being learned first, then other higher level behaviors, converging in a recursive manner toward the highest level. Figure 1 shows this highest level as existing within a single physical machine. However, in the case of multiple machines operating in a collective, higher abstract behavior levels are possible. Within the context of this paper, only behaviors relevant to individual machines will be discussed. In the absence of higher collective behaviors controlling individual machines, the purpose or task of the machine is embedded within the DACS as an instinct or drive. This instinct is the high level action which results in a feeling of accomplishment or positive reinforcement within the DACS. It is then the responsibility of the adaptive behavior modules within the DACS to learn the skills and behaviors necessary to fulfill this drive. This concept, as well as the self-organizing characteristics that result from such interactions, are further discussed by Digney [5].
The ABM is the primary adaptive building block for the DACS. Within it exist computational mechanisms for state classification, learning and the combination of reinforcement signals. Figure 3 shows a schematic of an ABM complete with incoming command, sensory and reinforcement signals. For clarity the outgoing reinforcement signals have been removed. For any particular level, say l, the ABM observes the relevant system states through appropriate sensors. For a perception system consisting of N sensors, the state, S_l, is defined as

$$S_l = \{ s_n \} \qquad (2)$$

where: s_n is the individual sensor reading, 0 < n < N.
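A minimal sketch of Equation 2: the level state is formed by collecting the N sensor readings, and a state transition is simply a change in that collection. The coarse quantization used here merely stands in for the idealized neural classification described below and is an assumption for illustration.

```python
def observe_state(sensors, resolution=0.1):
    """Form the level state S_l (Equation 2) from N raw sensor readings.
    Readings are coarsely quantized so that distinct physical situations
    map to a manageable set of discrete states; this quantization is only
    a stand-in for the paper's idealized neural classification."""
    return tuple(round(s / resolution) for s in sensors)

def transition_detected(previous_state, current_state):
    """A state transition occurs whenever the classified state changes."""
    return previous_state != current_state
```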




Figure 3: Single ABM


State transitions are detected and the resulting states are classified using an idealized neural classification scheme. This classification embodies the macroscopic operating principles of unsupervised neural networks such as ART 2-A [6] and will be assumed adequate in the context of these simulations. The Temporal Difference (TD) algorithm as developed by Barto [3] learns by adjusting state and action evaluation functions, then uses these evaluations to choose an optimum action policy. It can be shown that these two evaluation functions can be combined into a single action dependent evaluation function, say Q_{s,u}, similar to that described by Barto [7]. Given the system at state s, the action taken, u*, is the action which satisfies

$$u^* = \arg\max_{u} \left\{ Q_{s,u} + C \right\} \qquad (3)$$

where: C is a random valued function.

In Equation 3, Q_{s,u} and C can be thought of as the goal driven and exploration driven components of the action policy respectively. Taking the action u* results in the transition from state s to state w and the incurring of a total reinforcement signal r_t. The action dependent evaluation function error is obtained by modifying the TD error equation and is

$$e = \gamma \, Q_{virtual} - Q_{s,u^*} + r_t \qquad (4)$$

where: Q_{virtual} is the virtual state evaluation value of the next state w and \gamma is the temporal discount factor.

If the action u* does not achieve the desired goal, the virtual state evaluation is

$$Q_{virtual} = \max_{u} \left\{ Q_{w,u} \right\}. \qquad (5)$$

It is easily seen that Q_{virtual} becomes the least costly action dependent evaluation of the new state, w (remember the evaluation functions are negative in sign), and in effect corresponds to the action most likely to be taken when the system leaves state w. If the action u* achieves the desired goal, the virtual state evaluation is

$$Q_{virtual} = 0. \qquad (6)$$

This provides relative state evaluations and allows for open-ended or cyclic goal states. This is illustrated by considering that for cyclic goals it is the dynamic transitions between states that constitute a goal state and not simply the arrival at a static system state.

This error is used to adapt the evaluation functions according to LMS rules (Equations 7 and 8), updates of the form

$$Q^{k+1}_{s,u^*} = Q^{k}_{s,u^*} + \eta \, e$$

where: \eta is the rate of adaption and k is the index of adaption.

As the evaluation function converges, the goal driven component begins to dominate over the exploration driven component. The resulting action policy will perform the command in a successful and efficient manner. Generally, an ABM will be capable of performing more than a single command. For an ABM capable of c_max commands, the vector of the evaluation functions is defined as

$$\mathbf{Q}_{s,u} = \left[ \, Q^{0}_{s,u}, \; Q^{1}_{s,u}, \; \ldots, \; Q^{c_{max}}_{s,u} \, \right] \qquad (9)$$

where: Q^{c}_{s,u} is the evaluation function for the particular command c, 0 < c < c_max.
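To make the learning mechanism concrete, the sketch below gives one plausible reading of Equations 3 through 8 in Python: evaluations are kept in a dictionary keyed by (state, action), the exploration term C is drawn uniformly at random, and a single LMS step of the form Q <- Q + eta*e performs the adaptation. The function names, the dictionary representation and the noise model are illustrative assumptions, not the paper's implementation; the default gamma and eta match the values quoted later in Section IV.

```python
import random

def select_action(Q, state, actions, noise=0.1):
    """Equation 3: pick the action maximizing Q[state, u] + C, where C is a
    random valued exploration term. Evaluations are negative costs, so the
    maximum corresponds to the least costly action."""
    return max(actions,
               key=lambda u: Q.get((state, u), 0.0) + random.uniform(0.0, noise))

def td_update(Q, state, action, next_state, actions, r_t, goal_reached,
              gamma=0.9, eta=0.5):
    """Equations 4-8: form the TD error against a 'virtual' evaluation of the
    next state and adapt Q with an LMS step."""
    if goal_reached:
        q_virtual = 0.0                                                # Equation 6
    else:
        q_virtual = max(Q.get((next_state, u), 0.0) for u in actions)  # Equation 5
    e = gamma * q_virtual - Q.get((state, action), 0.0) + r_t          # Equation 4
    Q[(state, action)] = Q.get((state, action), 0.0) + eta * e         # LMS step (Eqs. 7, 8)
    return Q
```

With the evaluation functions initialized to zero and reinforcements defined as negative penalties, repeated application of td_update drives each Q entry toward the (negative) cost of completing the command from that state-action pair, while the exploration noise gradually loses influence as the entries separate.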

III. DACS FOR A QUADRUPED MOBILE ROBOT

To evaluate the DACS, the simulated quadruped shown in Figure 4 was used. This mobile robot was placed inside a simulated three dimensional landscape where it is left to develop skills and behaviors as it interacts with its environment. This world is made up of ramps, plateaus, cliffs and walls, as well as various substances of interest. In the absence of any predetermined knowledge it is the responsibility of the DACS, and in particular the ABMs, to acquire the skills and behaviors for successful operation.

Figure 4: Simulated Quadruped


Although not the most efficient method of locomotion,
the learning of quadruped walking provides interesting
and challenging problems. Involved is the learning of complex actuator sequences in the midst of numerous false
goal states and modes of failure. Figure 5 shows the locomotion ABM with the appropriate sensory, reinforcement
and motor action connections.
Figure 5: Locomotion ABM


The commands, C_locomotion, are issued from the ABM above and are dependent upon the possible sensory states of that module. In this case these sensors are capable of detecting all realizable modes of body motion. The commands for the locomotion level are defined in Equation 10.

$$C_{locomotion} = \begin{cases} 0 & \text{forward} \\ 1 & \text{left turn} \\ \vdots & \\ c_{max} & \text{all possible modes} \end{cases} \qquad (10)$$

For any specific command the locomotion ABM will issue action responses, u_locomotion, to the actuators driving the legs in the horizontal, h, and vertical, v, directions. Within this action vector are the individual actuator commands to extend, ex, or retract, rt, as shown in Equations 11 and 12.

$$u_{locomotion} = \left[ \, c_{leg_0}, \; c_{leg_1}, \; c_{leg_2}, \; c_{leg_3} \, \right] \qquad (11)$$

$$c_{leg} \in \{ \, \text{hold}, \; v_{ex}\ \text{(extend vertical)}, \; v_{rt}\ \text{(retract vertical)}, \; h_{ex}\ \text{(extend horizontal)}, \; h_{rt}\ \text{(retract horizontal)} \, \} \qquad (12)$$

Each leg is equipped with sensors for measuring the forces on each foot and the positions of each leg. The forces on the foot are biased such that -f_max < f_leg < f_max, where f_max is the highest force magnitude expected. Similarly, the position sensors are biased such that -l_max < l_leg < l_max, where l_max is half the stroke of the linear actuator. For an arbitrary leg, leg_i, and direction, d, the force and position descriptions of state are

$$s^{force}_{leg_i,d} = \begin{cases} -1 & \text{if } f_{leg_i,d} < 0 \\ \;\;\,0 & \text{if } f_{leg_i,d} = 0 \\ +1 & \text{if } f_{leg_i,d} > 0 \end{cases} \qquad (13)$$

$$s^{pos}_{leg_i,d} = \begin{cases} -1 & \text{if } l_{leg_i,d} < 0 \\ +1 & \text{if } l_{leg_i,d} > 0 \end{cases} \qquad (14)$$
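As an illustration of Equations 13 and 14, the sketch below discretizes the per-leg force and position readings into the -1/0/+1 state components and collects them into a per-leg state tuple. The function names and the tuple layout are assumptions made for illustration only.

```python
def force_state(f):
    """Equation 13: ternary force state of one leg in one direction."""
    if f < 0:
        return -1
    if f > 0:
        return +1
    return 0

def position_state(pos):
    """Equation 14: binary position state (retracted vs. extended)."""
    return -1 if pos < 0 else +1

def leg_state(forces, positions):
    """Collect the discretized force and position readings of one leg over
    its horizontal and vertical directions into a single state tuple."""
    return tuple(force_state(f) for f in forces) + \
           tuple(position_state(p) for p in positions)

# Example: leg pressing on the ground (vertical force positive), no horizontal
# load, leg extended vertically and retracted horizontally.
s_leg = leg_state(forces=(0.0, 3.2), positions=(-0.4, 0.7))
```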

The reinforcement signals are defined as

$$r_g = \begin{cases} \;\;\,0 & \text{if } C_{locomotion} \text{ is performed} \\ -R_g & \text{otherwise} \end{cases} \qquad (15)$$

$$r_e = \begin{cases} -R_{low} & \text{easy operation} \\ -R_{high} & \text{difficult operation} \end{cases} \qquad (16)$$

$$r_s = \begin{cases} -R_s & \text{if } f_{belly} > 0 \\ \;\;\,0 & \text{otherwise} \end{cases} \qquad (17)$$

where: f_belly is a force sensor on the bottom of the robot and the R values are appropriate positive values.
The net effect of these reinforcements is to drive the
locomotion system to learn efficient actuator sequences
that will perform the specified gaits while maintaining
the quadruped's balance. At this level of abstraction the
quadruped is said to lose balance when the line drawn between any two legs in contact with the ground does not
pass through the quadruped's center of gravity.
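The balance criterion just stated can be checked with a small piece of plane geometry: project the center of gravity onto the ground and test whether it lies on the segment joining the two supporting feet. The 2-D simplification, the tolerance value and the function name below are assumptions; this is only a sketch of the stated rule, not the simulator's code.

```python
def is_balanced(cog, foot_a, foot_b, tol=1e-6):
    """Return True if the ground projection of the center of gravity lies on
    the line segment between the two supporting feet (all points are (x, y)
    ground-plane coordinates)."""
    ax, ay = foot_a
    bx, by = foot_b
    cx, cy = cog
    # Cross product near zero means the three points are collinear.
    cross = (bx - ax) * (cy - ay) - (by - ay) * (cx - ax)
    if abs(cross) > tol:
        return False
    # Dot-product test keeps the projection between the two feet.
    dot = (cx - ax) * (bx - ax) + (cy - ay) * (by - ay)
    return 0.0 <= dot <= (bx - ax) ** 2 + (by - ay) ** 2

# Example: feet at (0, 0) and (2, 0); a center of gravity at (1, 0) is
# supported, while one at (1, 0.5) is not.
```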
Equations 13 and 14, when combined into Equation 18,
define the complete state of the locomotion system. For
the transition between any pair of past and present states,
the total reinforcement can be determined by combining
Equations 15, 16 and 17 into Equation 1. The total reinforcement is then used to adapt the command specific
evaluation functions of Equation 19 according to Equations 4, 7 and 8. As these evaluation functions converge,
all realizable gaits should be achieved.
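As a rough sketch of how these pieces could be wired together (reusing the hypothetical td_update() function from the earlier sketch), the complete locomotion state of Equation 18 is assumed here to be the tuple of per-leg states, and the command specific evaluation functions of Equation 19 are assumed to be one table per gait. None of these layout choices are taken from the paper.

```python
# Assumed layout: one evaluation table per command (Equation 19), each keyed
# by (state, action) exactly as in the earlier td_update() sketch.
Q_tables = {command: {} for command in ("forward", "backward", "left", "right")}

def locomotion_state(leg_states):
    """Equation 18 (assumed form): the complete state is taken here to be the
    tuple of the four per-leg state tuples from Equations 13 and 14."""
    return tuple(leg_states)

def learn_step(command, state, action, next_state, actions, r_t, goal_reached):
    """Adapt the evaluation table of the commanded gait (Equations 4, 7, 8)."""
    td_update(Q_tables[command], state, action, next_state, actions,
              r_t, goal_reached)
```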


IV. SIMULATION AND RESULTS

Using the reinforcement scheme described above, the locomotion ABM was simulated and its abilities to learn in an initially unknown environment and to adapt to changes and malfunctions were evaluated. In the following section results from the locomotion tests are presented. Note that the temporal discount factor and the adaption rate chosen for these tests were 0.9 and 0.5 respectively. The reinforcements R_g, R_low, R_high and R_s were set to 1.0, 1.0, 2.0 and 4.0 respectively.

All possible gaits, including forward, backward, right turn and left turn, were discovered and eventually mastered. Figure 6 shows the quadruped's performance while learning the four gaits. Shown on the vertical axis is the
cumulative negative reinforcement per step, which can be thought of as the difficulty encountered while performing a single step. The number of steps taken is shown on the horizontal axis. Improvement is obvious as the quadruped initially functions poorly, then eventually learns the optimum actions necessary to perform and maintain a particular gait. In the simulation, the quadruped alternates between all possible gaits, making it possible to evaluate the quadruped's ability to learn graceful transitions between the gaits. In the performance graph of Figure 6, a zone of slow improvement is evident. It is here where the graceful gait transitions are being learned. A mechanical malfunction was simulated by deactivating the horizontal actuator on a single leg. This effectively made the leg able to support only a vertical load and unable to apply any horizontal force to the body of the quadruped. Figure 7 shows the performance of the quadruped under both intact and crippled conditions. Initially, the intact quadruped learns the four standard gaits. Once crippling occurs, a recovery period is required as the quadruped relearns all gaits, this time with a limp. As before, graceful transitions are eventually learned for the new limping gaits. For an intact quadruped, Figures 8 and 9 show the individual leg movements that comprise the forward and right turning gaits respectively. The corresponding leg commands and reinforcement signals are shown in Tables 1 and 2 for these forward and right turning gaits respectively. In these simulations the end of the quadruped closest to the reader is the front, and leg_0, leg_1, leg_2 and leg_3 are considered the front right, front left, rear left and rear right legs respectively. Also, the actuator extensions v_ex and h_ex cause the leg to move downward and towards the front. The actuator retractions v_rt and h_rt cause the leg to move upward and towards the rear.

Figure 6: Performance of Intact Quadruped (cumulative negative reinforcement per step versus steps taken, for the forward, backward, left and right gaits).

Figure 7: Performance of Crippled Quadruped (same axes as Figure 6; the point at which the leg is crippled is marked on the curves).

Figure 8: Forward gait (motion towards the reader). Note frame sequence: top left (frame 1) → top right (frame 2) → bottom left (frame 3) → bottom right (frame 4) → top left (frame 1).

Table 1: Command Sequence: Forward Gait (frame-by-frame leg commands and the corresponding reinforcements).

Figure 9: Right Turn Gait (rotation is clockwise as viewed from the top). Note frame sequence: top left (frame 1) → top right (frame 2) → bottom left (frame 3) → bottom right (frame 4) → top left (frame 1).

Table 2: Command Sequence: Right Turn Gait (frame transitions 1→2, 2→3, 3→4 and 4→1, listing the h_ex, h_rt, v_ex and v_rt commands for leg_0 through leg_3 together with the reinforcements r_e, r_g and r_s).

V. DISCUSSION

The results achieved in the locomotion tests showed initial poor performance followed by rapid improvement and eventual convergence to an optimum solution. It is during the initial poor performance that the ABM explores the state space, discovering all the obtainable states of the system. For the locomotion ABM, these states are the leg positions and the force on each foot. During this exploration, the learning system refined the initially random action policy to a policy that achieves the specified command. These commands themselves are not supplied to the system as predetermined knowledge but are discovered as realizable commands by the system above. If the quadruped had been placed in water, then it would learn to swim, or had the physical configuration allowed it to perform other gaits, then these gaits would be learned also.

In response to actuator malfunctions the ABM, after a period of recovery, adapted to the changes and converged to another optimum action policy, in this case a limping leg sequence. This malfunction is labeled as non-severe because the adaption required for the DACS to recover is confined to a single ABM. If the malfunction was such that some gaits became impossible to perform, then adaption of higher ABMs would be required for recovery, and what is called a severe malfunction would have occurred.

VI. CONCLUSIONS

The ABM for locomotion successfully learned the complex action sequences for all realizable gaits. The locomotion ABM was also able to successfully recover from a non-severe malfunction. The success of a single ABM cannot be extrapolated to imply the success of the entire DACS. Work is currently being performed on the simulation of other behaviors such as body coordination, local navigation, global navigation, task planning, and task coordination. Once these other levels can be incorporated into the DACS, the concepts of self-organization and self-configuration can be explored.

REFERENCES

[1] Brooks, R. (1991) Intelligence Without Reason, AI Memo No. 1293, MIT Artificial Intelligence Lab.

[2] Beer, R.D., Chiel, H.J., and Sterling, L.S. (1990) A biological perspective on autonomous agent design, Robotics and Autonomous Systems 6, pp 169-186.

[3] Barto, A.G., Sutton, R.S., and Watkins, C.H. (1989) Learning and Sequential Decision Making, COINS Technical Report.

[4] Barto, A.G., Sutton, R.S., and Anderson, C.W. (1983) Neuronlike adaptive elements that can solve difficult learning control problems, IEEE Transactions on Systems, Man, and Cybernetics SMC-13, pp 834-846.

[5] Digney, B.L. (1992) Emergent Intelligence in a Distributed Adaptive Control System, Ph.D. Thesis circulated in draft, University of Saskatchewan, Saskatoon, Saskatchewan.

[6] Carpenter, G.A., Grossberg, S., and Rosen, D. (1991) ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition, IJCNN 1991, Seattle, Washington, Vol. II, pp 151-156.

[7] Barto, A.G., Bradtke, S.J., and Singh, S.P. (1991) Real-time Learning and Control Using Asynchronous Dynamic Programming, Technical Report 91-57, University of Massachusetts, Amherst.
