$s, s_n \in S_n$   $S$   $S_n$   $A$   $a, a_n \in A_n$   $A$   $A_n$   $R$   $r, r_n \in R_n$   $\xi = (s_0, a_0, \ldots, s_N, a_N)$   $N$   $R$   $R_n$

$\{S_n\}_{n=0,\ldots,N}$

M

M

$T(S_{n-1}, A_{n-1}, S_n)$

$R(S_n, A_n, R_n)$   $r(s_n, a_n)$   $\pi(s)$   $E(\cdot \mid \cdot)$   $V(s)$   $V^\pi(s)$

V ( s )

π

k

Q( s, a)

$Q_k(s, a)$

I d

s

s π

k

( s, a)

k

 I σ

+

$N$   $A$   $k(\cdot, w_A)$   $U(A)$

$a : b$

$w_A$

S

$\{S_n\}_{n=0,\ldots,N}$

$S$

$S_n$

$\{S_n\}_{n=0,\ldots,N}$

$S_n$

$S_{n-1}$

[Figure: chain of state nodes $S_0, S_1, \ldots, S_{N-1}, S_N$]

[Figure: state nodes $S_0, S_1, \ldots, S_{N-1}, S_N$ with action nodes $A_0, A_1, \ldots, A_{N-1}$]

$S_n$   $A_{n-1}$   $A_n \in A$

$\{S_n\}_{n=0,\ldots,N}$

$\{A_n\}_{n=0,\ldots,N-1}$   $S_n$

$S_{n-1}$

$A_{n-1}$

[Figure: state nodes $S_0, S_1, \ldots, S_{N-1}, S_N$, action nodes $A_0, A_1, \ldots, A_{N-1}$, and reward nodes $R_0, R_1, \ldots, R_{N-1}, R_N$]

$S_n$   $R_n$

$S_n$

$S_{n-1}$   $A_{n-1}$   $T$

$T(S_{n-1}, A_{n-1}, S_n) = P(S_n \mid S_{n-1}, A_{n-1})$   $P(S_n \mid S_{n-1}, A_{n-1})$

$S_{n-1}$

$S_n$   $A_{n-1}$

$S_n$   $R_n$

$0.1$   $1$

$-\sqrt{2}$

100

$R_n$   $S_n$   $A_n$

$R(S_n, A_n, R_n) = P(R_n \mid S_n, A_n)$   $M$

$M = (S, A, T, R)$

$S$   $A$   $T$

$R$   $S$   $A$

π

π π

$\pi : S \to A$

$\pi(s)$   $P(A \mid s)$

$\xi = (s_0, a_0, s_1, a_1, \ldots, s_N, a_N)$

$r(s_n, a_n) = r(s_n, \pi(s_n))$

s n ,a n ξ

s

n ξ

$r(s_n, a_n)$

$s_n$   $a_n$

π

π

π

$V^\pi(s) = E\left( r(s, \pi(s)) + \gamma V^\pi(s_{n+1}) \mid s_n = s \right)$   $\gamma \in [0, 1]$

γ

γ 0

γ 1

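$m = 1, 2, \ldots$

$k = 1, 2, \ldots$

$s \in S$

$V_k^{\pi_m}(s) \leftarrow E\left( r(s, \pi_m(s)) + \gamma V_{k-1}^{\pi_m}(s_{n+1}) \mid s_n = s \right)$

$s \in S$   $\pi_{m+1} \leftarrow \arg\max_a E\left( r(s, a) + \gamma V^{\pi_m}(s_{n+1}) \mid s_n = s \right)$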
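The two updates above (repeated evaluation of $V_k^{\pi_m}$, then a greedy step to $\pi_{m+1}$) can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the three-state, two-action chain, its rewards, the discount $\gamma = 0.9$, the fixed number of evaluation sweeps, and the names `T`, `r`, `backup`, `evaluate_policy`, `improve_policy`, and `policy_iteration`.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: T[s][a] is a list of (probability, next_state) pairs.
T = {
    0: {0: [(1.0, 0)], 1: [(1.0, 1)]},
    1: {0: [(1.0, 0)], 1: [(1.0, 2)]},
    2: {0: [(1.0, 1)], 1: [(1.0, 2)]},
}
# Hypothetical expected immediate reward r(s, a).
r = {
    0: {0: 0.0, 1: 0.0},
    1: {0: 0.0, 1: 1.0},
    2: {0: 0.0, 1: 2.0},
}


def backup(V, s, a):
    """E(r(s, a) + gamma * V(s_{n+1}) | s_n = s) for a tabular model."""
    return r[s][a] + GAMMA * sum(p * V[s2] for p, s2 in T[s][a])


def evaluate_policy(pi, sweeps=200):
    """Inner loop over k: repeatedly apply the backup under the fixed policy pi."""
    V = {s: 0.0 for s in T}
    for _ in range(sweeps):
        V = {s: backup(V, s, pi[s]) for s in T}  # uses the previous V throughout
    return V


def improve_policy(V):
    """Greedy step: pick the action with the largest backup in every state."""
    return {s: max(T[s], key=lambda a: backup(V, s, a)) for s in T}


def policy_iteration():
    """Outer loop over m: alternate evaluation and improvement until pi is stable."""
    pi = {s: 0 for s in T}
    while True:
        V = evaluate_policy(pi)
        new_pi = improve_policy(V)
        if new_pi == pi:
            return pi, V
        pi = new_pi


if __name__ == "__main__":
    policy, values = policy_iteration()
    print(policy, values)
```

On this toy chain the loop stops once the greedy policy no longer changes; the fixed sweep count stands in for a convergence test on $V_k$ and is a simplification.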

$V^\star$   $\pi^\star$

$V_k$   $V_{k-1}$

$\pi_m$   $\pi_{m+1}$

V π

$V_k(s) = \max_a E\left( r(s, a) + \gamma V_{k-1}(s_{n+1}) \mid s_n = s \right)$
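As a rough illustration of the update for $V_k$, the sketch below runs it on a small deterministic chain, so the expectation reduces to a single next state. The chain, the rewards, $\gamma = 0.9$, the stopping tolerance, and the names `NEXT`, `REWARD`, `value_iteration`, and `greedy_policy` are all assumptions introduced for the example.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical deterministic MDP: NEXT[s][a] is the successor state,
# REWARD[s][a] the immediate reward r(s, a).
NEXT = [[0, 1], [0, 2], [1, 2]]
REWARD = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
STATES = range(len(NEXT))
ACTIONS = range(2)


def q_value(V, s, a):
    """r(s, a) + gamma * V(s_{n+1}); no expectation needed, transitions are deterministic."""
    return REWARD[s][a] + GAMMA * V[NEXT[s][a]]


def value_iteration(tol=1e-10):
    """Iterate V_k(s) = max_a q_value(V_{k-1}, s, a) until successive iterates agree."""
    V = [0.0 for _ in STATES]
    while True:
        V_new = [max(q_value(V, s, a) for a in ACTIONS) for s in STATES]
        if max(abs(x - y) for x, y in zip(V, V_new)) < tol:
            return V_new
        V = V_new


def greedy_policy(V):
    """Read a policy off the converged values by taking the arg max over actions."""
    return [max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES]


if __name__ == "__main__":
    V = value_iteration()
    print(V, greedy_policy(V))
```

The value update takes a maximum over actions; the arg max only enters when a policy is extracted from the converged values.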

$m^d$

T

$V(s)$   $s$   $Q(s, a)$

$(s, a)$   $a$   $s$

$s_n$   $a_n$

$Q(s_n, a_n) \leftarrow Q(s_n, a_n) + \alpha \left( r(s, a) + \gamma \max_a Q(s_{n+1}, a) - Q(s_n, a_n) \right).$

$\alpha$

$r(s, a) + \gamma \max_a Q(s_{n+1}, a) - Q(s_n, a_n)$

$Q(s, a) \leftarrow 0$

$Q$

$(s, a) \in S \times A$

$a_n = \pi(s_n)$   $a_n$   $s_{n+1}$   $r(s_n, a_n)$   $Q(s_n, a_n) \leftarrow (1 - \alpha) Q(s_n, a_n) + \alpha \left( r(s_n, a_n) + \gamma \max_a Q(s_{n+1}, a) \right)$

$\pi$   $\pi(s) = \arg\max_{a \in A} Q(s, a)$

$s \in S$
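The tabular update can be sketched as follows: initialise $Q$ to zero, act, observe $s_{n+1}$ and $r(s_n, a_n)$, blend the old estimate with the bootstrapped target, and finally read off $\pi(s) = \arg\max_a Q(s, a)$. Several pieces are assumptions not fixed by the text: the behaviour policy is taken to be epsilon-greedy with respect to the current $Q$, the environment is a hypothetical deterministic three-state chain run without resets, and the constants `GAMMA`, `ALPHA`, `EPSILON` and the step count are arbitrary.

```python
import random

GAMMA, ALPHA, EPSILON = 0.9, 0.5, 0.3  # assumed constants

# Hypothetical deterministic environment: NEXT[s][a] and REWARD[s][a].
NEXT = [[0, 1], [0, 2], [1, 2]]
REWARD = [[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
N_STATES, N_ACTIONS = 3, 2


def q_learning(steps=20000, seed=0):
    rng = random.Random(seed)
    # Q(s, a) <- 0 for all (s, a) in S x A
    Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    s = 0
    for _ in range(steps):
        # a_n = pi(s_n): here an epsilon-greedy policy (an assumption)
        if rng.random() < EPSILON:
            a = rng.randrange(N_ACTIONS)
        else:
            a = max(range(N_ACTIONS), key=lambda b: Q[s][b])
        # Execute a_n, observe s_{n+1} and r(s_n, a_n)
        s_next, reward = NEXT[s][a], REWARD[s][a]
        # Q <- (1 - alpha) Q + alpha (r + gamma max_a Q(s_{n+1}, a))
        target = reward + GAMMA * max(Q[s_next])
        Q[s][a] = (1 - ALPHA) * Q[s][a] + ALPHA * target
        s = s_next
    # pi(s) = arg max_a Q(s, a) for every s
    policy = [max(range(N_ACTIONS), key=lambda b: Q[st][b]) for st in range(N_STATES)]
    return Q, policy


if __name__ == "__main__":
    Q, policy = q_learning()
    print(policy)
    for row in Q:
        print([round(q, 2) for q in row])
```

Because the learner only needs sampled transitions, the transition function $T$ never appears in the update; the lookup tables here play the role of the environment.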

d m

$O(m^d)$

m
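To make the $O(m^d)$ growth concrete, the short sketch below counts the cells of a regular grid, assuming $m$ is the number of bins per dimension and $d$ the number of state dimensions. That reading of the symbols, and the helper name `grid_size`, are assumptions.

```python
def grid_size(m: int, d: int) -> int:
    """Number of grid cells when each of d dimensions is split into m bins."""
    return m ** d


if __name__ == "__main__":
    # With 10 bins per dimension the table grows by a factor of 10 per added dimension.
    for d in range(1, 7):
        print(f"d={d}: {grid_size(10, d)} cells")
```

A tabular method needs at least one entry per cell, so the table size grows exponentially with the dimension.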

$s_{n+1} = s_n + a_n$   $s_n, a_n \in \mathbb{R}$

$s \in [s_{\min}, s_{\max}]$

$s_0$   $s = 0$

$a_0 = -s_0$

m

[Figure: plot residue; surviving labels are $s = 0$, $a$, $a = -s_0$, $s_0$, $s_1$, $s$, $t$]

s = 0

$a_0 = -s_0$

$V(s) = f_\theta(s).$

$\theta$   $f$

$(s_n, a, s_{n+1}, r(s_n, a))$

$\pi(s) = g_\theta(s)$

$\theta$   $g$

π θ

θ

$r(s, a) = h(g_\theta(s), a)$

θ

k

0 1

i 1

p i

h

h

p i

h < h

1

p i

p i i

h

h

1 p i

p i

p i

1

h

$r(s_W, a_C) = 100$

$s_W$   $a_C$

$s_W$

X ( t)

$X(t - 1)$

$1$   $X_1(t)$

$t$

$dt$

$X_1(t + dt) = X_1(t) + N_1(t)$   $N_1(t)$

$\sigma^2 I \, dt$   $\sigma$   $N_1(t)$

$X_1(1)$

$0$   $\sigma^2 I$
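The recursion $X_1(t + dt) = X_1(t) + N_1(t)$ can be checked numerically with a small simulation. This is only a sketch under stated assumptions: $N_1(t)$ is taken to be zero-mean Gaussian with covariance $\sigma^2 I \, dt$, the process starts at $X_1(0) = 0$, and only the one-dimensional case is simulated. The function name `simulate_x1` and the values of `sigma` and `dt` are made up for the example.

```python
import random


def simulate_x1(sigma=0.5, dt=0.01, horizon=1.0, seed=0):
    """Simulate X_1 up to t = horizon in one dimension and return X_1(horizon)."""
    rng = random.Random(seed)
    x = 0.0  # assumption: X_1(0) = 0
    steps = round(horizon / dt)
    for _ in range(steps):
        # N_1(t) assumed Gaussian with variance sigma^2 * dt, i.e. std sigma * sqrt(dt)
        x += rng.gauss(0.0, sigma * dt ** 0.5)
    return x


def sample_moments(n_paths=2000, sigma=0.5):
    """Empirical mean and variance of X_1(1) over independently seeded paths."""
    samples = [simulate_x1(sigma=sigma, seed=i) for i in range(n_paths)]
    mean = sum(samples) / n_paths
    var = sum((x - mean) ** 2 for x in samples) / n_paths
    return mean, var


if __name__ == "__main__":
    mean, var = sample_moments()
    print(f"mean ~ {mean:.3f}, variance ~ {var:.3f} (sigma^2 = 0.25)")
```

Under these assumptions the increments are independent, so their variances add up to $\sigma^2$ at $t = 1$, which is what the empirical variance should approach.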

$X_2(t)$

T d