
$s, s_n \in \mathcal{S}$

$S, S_n$

$\mathcal{A}$

$a, a_n \in \mathcal{A}$

$A, A_n$

$\mathcal{R}$

$r, r_n \in \mathcal{R}$

$\xi = (s_0, a_0, \ldots, s_N, a_N)$

$N$

$R, R_n$

$\{S_n\}_{n=0,\ldots,N}$

$\mathcal{M}$

$T(S_{n-1}, A_{n-1}, S_n)$

$R(S_n, A_n, R_n)$

$r(s_n, a_n)$

$\pi(s)$

$E(\cdot \mid \cdot)$

$V(s)$

$V^{\pi}(s)$

$Q(s, a)$

$Q_k(s, a)$

$I_d$

$I$

$\sigma$

$\mathcal{N}$

$A_k(\cdot, w_A)$

$U(A)$

$a{:}b$

$w_A$

[Figure: diagram with the labels "SAMPLE NODE", "divergence time" and "choose"]

$\{S_n\}_{n=0,\ldots,N}$

$\mathcal{S}$

$S_n$

$S_{n-1}$

[Figure: chain of state variables $S_0, S_1, \ldots, S_{N-1}, S_N$]

[Figure: chain of state variables $S_0, S_1, \ldots, S_{N-1}, S_N$ with actions $A_0, A_1, \ldots, A_{N-1}$]

$S_n$, $A_{n-1}$, $A_n \in \mathcal{A}$

$\{S_n\}_{n=0,\ldots,N}$

$\{A_n\}_{n=0,\ldots,N-1}$

$S_n$

$S_{n-1}$

$A_{n-1}$

[Figure: state variables $S_0, S_1, \ldots, S_{N-1}, S_N$ with actions $A_0, A_1, \ldots, A_{N-1}$ and rewards $R_0, R_1, \ldots, R_{N-1}, R_N$]

$S_n$, $R_n$

$S_n$

$S_{n-1}$, $A_{n-1}$, $T$

$T(S_{n-1}, A_{n-1}, S_n) = P(S_n \mid S_{n-1}, A_{n-1})$

$S_{n-1}$

$S_n$, $A_{n-1}$

$S_n$, $R_n$

$0.1$, $1$, $-\sqrt{2}$, $100$

$R_n$, $S_n$, $A_n$

$R(S_n, A_n, R_n) = P(R_n \mid S_n, A_n)$

$\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R)$

$\mathcal{S}$, $\mathcal{A}$, $T$, $R$
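The tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R)$ can be written down directly as a small data structure. The following is only a minimal sketch: the class name `MDP`, the helpers `sample` and `step`, and the two-state example with all of its probabilities are hypothetical and not taken from the text.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list   # the state set S
    actions: list  # the action set A
    T: dict        # T[(s_prev, a_prev)][s] = P(S_n = s | S_{n-1} = s_prev, A_{n-1} = a_prev)
    R: dict        # R[(s, a)][r] = P(R_n = r | S_n = s, A_n = a)


def sample(dist, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(mdp, s, a, rng):
    """Sample a reward for (s, a) and then the successor state."""
    reward = sample(mdp.R[(s, a)], rng)
    s_next = sample(mdp.T[(s, a)], rng)
    return s_next, reward


# Hypothetical two-state example; all numbers are made up for illustration.
toy = MDP(
    states=["low", "high"],
    actions=["wait", "move"],
    T={
        ("low", "wait"): {"low": 1.0},
        ("low", "move"): {"high": 0.8, "low": 0.2},
        ("high", "wait"): {"high": 1.0},
        ("high", "move"): {"low": 1.0},
    },
    R={
        ("low", "wait"): {0.0: 1.0},
        ("low", "move"): {0.0: 1.0},
        ("high", "wait"): {1.0: 0.9, 0.0: 0.1},
        ("high", "move"): {0.0: 1.0},
    },
)
```

Calling `step(toy, "low", "move", random.Random(0))` draws one reward and one successor state according to $R$ and $T$.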

$\pi$

$\pi : \mathcal{S} \to \mathcal{A}$

$\pi(s)$, $P(A \mid s)$

$\xi = (s_0, a_0, s_1, a_1, \ldots, s_N, a_N)$

$r(s_n, a_n) = r(s_n, \pi(s_n))$

$s_n$, $a_n$, $\xi$

$r(s_n, a_n)$

$s_n$, $a_n$

π

π

π

$V^{\pi}(s) = E\big( r(s, \pi(s)) + \gamma V^{\pi}(s_{n+1}) \mid s_n = s \big)$

$\gamma \in [0, 1]$

γ

γ 0

γ 1

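for $m = 1, 2, \ldots$
    for $k = 1, 2, \ldots$
        for $s \in \mathcal{S}$
            $V_k^{\pi_m}(s) \leftarrow E\big( r(s, \pi_m(s)) + \gamma V_{k-1}^{\pi_m}(s_{n+1}) \mid s_n = s \big)$
    for $s \in \mathcal{S}$
        $\pi_{m+1}(s) \leftarrow \arg\max_a E\big( r(s, a) + \gamma V^{\pi_m}(s_{n+1}) \mid s_n = s \big)$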
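The alternation of policy evaluation (index $k$) and policy improvement (index $m$) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the author's implementation: the two-state MDP, its transition table `P`, the expected rewards `r`, the fixed number of evaluation sweeps and the stopping rule $\pi_{m+1} = \pi_m$ are all hypothetical choices.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

# P[s][a] is a list of (probability, next_state); r[s][a] is the expected reward r(s, a).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
r = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}


def backup(s, a, V):
    """E( r(s, a) + gamma * V(s_{n+1}) | s_n = s ) for the tabular model above."""
    return r[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def evaluate(pi, sweeps=500):
    """Policy evaluation: V_k is computed from V_{k-1} for a fixed policy pi."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: backup(s, pi[s], V) for s in STATES}
    return V


def policy_iteration():
    """Alternate evaluation and greedy improvement until the policy stops changing."""
    pi = {s: ACTIONS[0] for s in STATES}
    while True:
        V = evaluate(pi)
        new_pi = {s: max(ACTIONS, key=lambda a: backup(s, a, V)) for s in STATES}
        if new_pi == pi:
            return pi, V
        pi = new_pi


pi_star, V_star = policy_iteration()
```

In this toy problem the procedure settles on switching from state 0 to state 1 and then staying there, since only staying in state 1 is rewarded.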

V

π

$V_k$, $V_{k-1}$

$\pi_m$, $\pi_{m+1}$

$V^{\pi}$

$V_k(s) = \max_a E\big( r(s, a) + \gamma V_{k-1}(s_{n+1}) \mid s_n = s \big)$
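The value-iteration update for $V_k(s)$ folds evaluation and improvement into a single sweep that takes the maximum over actions. The sketch below is illustrative only: the toy MDP, the table names `P` and `r`, and the tolerance-based stopping rule are assumptions, not part of the text.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

# P[s][a] is a list of (probability, next_state); r[s][a] is the expected reward r(s, a).
P = {
    0: {0: [(1.0, 0)], 1: [(0.8, 1), (0.2, 0)]},
    1: {0: [(1.0, 1)], 1: [(1.0, 0)]},
}
r = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 0.0}}


def backup(s, a, V):
    """E( r(s, a) + gamma * V(s_{n+1}) | s_n = s ) for the tabular model above."""
    return r[s][a] + GAMMA * sum(p * V[s2] for p, s2 in P[s][a])


def value_iteration(tol=1e-10):
    """Repeat V_k(s) = max_a backup(s, a, V_{k-1}) until the largest change is below tol."""
    V = {s: 0.0 for s in STATES}
    while True:
        V_new = {s: max(backup(s, a, V) for a in ACTIONS) for s in STATES}
        if max(abs(V_new[s] - V[s]) for s in STATES) < tol:
            return V_new
        V = V_new


V_opt = value_iteration()
# A greedy policy can be read off the converged values afterwards.
pi_opt = {s: max(ACTIONS, key=lambda a: backup(s, a, V_opt)) for s in STATES}
```

Because the update is a contraction for $\gamma < 1$, the loop terminates; the tolerance only controls how close the returned table is to the fixed point.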

$m^d$


T

$V(s)$, $s$, $Q(s, a)$

$(s, a)$, $a$, $s$

$s_n$, $a_n$

$Q(s_n, a_n) \leftarrow Q(s_n, a_n) + \alpha \big( r(s_n, a_n) + \gamma \max_a Q(s_{n+1}, a) - Q(s_n, a_n) \big).$

$\alpha$

$r(s_n, a_n) + \gamma \max_a Q(s_{n+1}, a) - Q(s_n, a_n)$

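$Q(s, a) \leftarrow 0$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$

$a_n = \pi(s_n)$

$a_n$, $s_{n+1}$, $r(s_n, a_n)$

$Q(s_n, a_n) \leftarrow (1 - \alpha)\, Q(s_n, a_n) + \alpha \big( r(s_n, a_n) + \gamma \max_a Q(s_{n+1}, a) \big)$

$\pi(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$ for all $s \in \mathcal{S}$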
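The tabular Q-learning update can be tried out on a tiny problem. The following is a hedged sketch rather than the procedure of the text: the deterministic two-state environment `env_step`, the epsilon-greedy choice of $a_n$ (the text only states $a_n = \pi(s_n)$), and all constants are assumptions made for illustration.

```python
import random

GAMMA = 0.9   # discount factor gamma
ALPHA = 0.1   # learning rate alpha
STATES = [0, 1]
ACTIONS = [0, 1]


def env_step(s, a):
    """Hypothetical deterministic toy environment: returns (next state, reward)."""
    if a == 0:
        # Action 0 stays in place; staying in state 1 pays a reward of 1.
        return s, (1.0 if s == 1 else 0.0)
    # Action 1 switches to the other state without reward.
    return 1 - s, 0.0


def greedy(Q, s):
    """pi(s) = argmax_a Q(s, a)."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


def q_learning(steps=100_000, epsilon=0.3, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}  # Q(s, a) <- 0
    s = 0
    for _ in range(steps):
        # Epsilon-greedy exploration (an assumption, see the lead-in).
        a = rng.choice(ACTIONS) if rng.random() < epsilon else greedy(Q, s)
        s_next, reward = env_step(s, a)
        target = reward + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
        s = s_next
    policy = {state: greedy(Q, state) for state in STATES}
    return Q, policy


Q, policy = q_learning()
```

Exploration matters here: with a purely greedy choice from the zero initialisation, the agent would never leave state 0 and never observe the reward.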

$d$, $m$

$O(m^d)$

$m$

$s_{n+1} = s_n + a_n$, $\quad s_n, a_n \in \mathbb{R}$

$s \in [s_{\min}, s_{\max}]$

$s_0$, $s = 0$

$a_0 = -s_0$
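The one-dimensional dynamics $s_{n+1} = s_n + a_n$ and the $O(m^d)$ size of a discretised table are easy to check numerically. This is a small sketch under assumptions: the interval bounds, the clipping to $[s_{\min}, s_{\max}]$, and the function names `step`, `grid` and `table_size` are hypothetical.

```python
def step(s, a, s_min=-1.0, s_max=1.0):
    """One transition s_{n+1} = s_n + a_n, clipped to [s_min, s_max] (clipping is an assumption)."""
    return min(max(s + a, s_min), s_max)


def grid(m, s_min=-1.0, s_max=1.0):
    """m equally spaced grid points covering [s_min, s_max]."""
    return [s_min + i * (s_max - s_min) / (m - 1) for i in range(m)]


def table_size(m, d):
    """Number of cells when each of d dimensions is discretised into m points."""
    return m ** d


s0 = 0.7
s1 = step(s0, -s0)          # the action a_0 = -s_0 moves the state to s = 0 in one step
cells = table_size(10, 3)   # 10 points per dimension in 3 dimensions
```

The exponential growth of `table_size` in `d` is what makes a plain lookup table impractical for higher-dimensional state spaces.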

m

$s = 0$

[Figure: three panels with axes $s$, $a$ and $t$; labels $a = -s_0$, $s_0$, $s_1$]

$s$

$s = 0$

$a_0 = -s_0$

$V(s) = f_\theta(s).$

$\theta$, $f$

$(s_n, a, s_{n+1}, r(s_n, a))$

$\pi(s) = g_\theta(s)$

$\theta$, $g$

$\pi_\theta$

$\theta$

$r(s, a) = h(g_\theta(s), a)$

$\theta$

k

0 1

i 1

p i

h

h

p i

h < h

1

p i

p i i

h

h

1 p i

p i

p i

1

h

$r(s_W, a_C) = 100$

$s_W$, $a_C$

$s_W$

$X(t)$

$X(t-1)$

$X_1(t)$

$t$

$dt$

$X_1(t + dt) = X_1(t) + N_1(t)$

$N_1(t)$, $\sigma^2 I\, dt$, $\sigma$

$X_1(1)$

$0$, $\sigma^2 I$

$X_2(t)$

$T_d$
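The increment rule $X_1(t + dt) = X_1(t) + N_1(t)$ can be simulated step by step. The sketch below assumes that $N_1(t)$ is zero-mean Gaussian noise with covariance $\sigma^2 I\, dt$ and that the process starts at $X_1(0) = 0$; both assumptions, the function name `simulate_x1` and the parameter values are illustrative and not stated in the text.

```python
import random


def simulate_x1(sigma=0.5, d=2, steps=50, seed=0):
    """Simulate X_1 from t = 0 to t = 1 in `steps` increments of length dt."""
    rng = random.Random(seed)
    dt = 1.0 / steps
    x = [0.0] * d  # assumed start X_1(0) = 0
    for _ in range(steps):
        # N_1(t): independent zero-mean Gaussian noise with covariance sigma^2 * I * dt,
        # so each coordinate gets standard deviation sigma * sqrt(dt).
        x = [xi + rng.gauss(0.0, sigma * dt ** 0.5) for xi in x]
    return x


# Empirical check: the per-coordinate variance of X_1(1) should be close to sigma^2.
samples = [c for seed in range(2000) for c in simulate_x1(sigma=0.5, seed=seed)]
variance = sum(c * c for c in samples) / len(samples)
```

Under these assumptions the increments add up so that $X_1(1)$ has mean $0$ and covariance $\sigma^2 I$, independent of the number of steps.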