
AlphaGo: Mastering the game of Go

with deep neural networks and tree search

Karel Ha
article by Google DeepMind

Optimization Seminar, 20th April 2016


Why AI?
Applications of AI

 spam filters
 recommender systems (Netflix, YouTube)
 predictive text (Swiftkey)
 audio recognition (Shazam, SoundHound)
 self-driving cars

1
Artistic-Style Painting (1/2)

[1] Gatys, Ecker, and Bethge 2015 [2] Li and Wand 2016 2
Artistic-Style Painting (2/2)

Champandard 2016 3
C Code Generated Character by Character

Karpathy 2015 4
Algebraic Geometry Generated Character by Character

Karpathy 2015 5
Game of Thrones Generated Character by Character
JON

He leaned close and onions, barefoot from


his shoulder. “I am not a purple girl,” he
said as he stood over him. “The sight of
you sell your father with you a little choice.”

“I say to swear up his sea or a boy of stone


and heart, down,” Lord Tywin said. “I love
your word or her to me.”

Darknet (on Linux)


JON

Each in days and the woods followed his


king. “I understand.”

“I am not your sister Lord Robert?”

“The door was always some cellar to do his


being girls and the Magnar of Baratheon,
and there were thousands of every bite of
half the same as though he was not a great
knight should be seen, and not to look at
the Redwyne two thousand men.”

Darknet (on OS X)
http://pjreddie.com/darknet/rnns-in-darknet/ 5
DeepDrumpf: a Twitter bot / neural network which learned
the language of Donald Trump from his speeches

 We’ve got nuclear weapons that are obsolete. I’m going to create jobs just by making the worst thing ever.
 The biggest risk to the world, is me, believe it or not.
 I am what ISIS doesn’t need.
 I’d like to beat that @HillaryClinton. She is a horror. I told my supporter Putin to say that all the time. He
has been amazing.
 I buy Hillary, it’s beautiful and I’m happy about it.

Hayes 2016 6
Atari Player by Google DeepMind

https://youtu.be/0X-NdPtFKq0?t=21m13s
Mnih et al. 2015 7
https://xkcd.com/1002/ 7
Heads-up Limit Hold’em Poker Is Solved!

Cepheus http://poker.srv.ualberta.ca/
exploitable by at most 0.000986 big blinds per game in expectation
Bowling et al. 2015 8
Basics of Machine Learning
https://dataaspirant.com/2014/09/19/supervised-and-unsupervised-learning/ 8
Supervised Learning (SL)

1. data collection: Google Search, Facebook “Likes”, Siri, Netflix, YouTube views, LHC collisions, KGS Go
Server...
2. training on training set
3. testing on testing set
4. deployment

http://www.nickgillian.com/ 9
Regression

9
Mathematical Regression

https://thermanuals.wordpress.com/descriptive-analysis/sampling-and-regression/
10
Classification

https://kevinbinz.files.wordpress.com/2014/08/ml-svm-after-comparison.png 11
Underfitting and Overfitting

Beware of overfitting!
It is like learning for a mathematical exam by memorizing proofs.

https://www.researchgate.net/post/How_to_Avoid_Overfitting 12
Reinforcement Learning (RL)

In particular: games of self-play

https://youtu.be/0X-NdPtFKq0?t=16m57s 13
Monte Carlo Tree Search
Tree Search

Optimal value v ∗ (s) determines the outcome of the game:


 from every board position or state s
 under perfect play by all players.

It is computed by recursively traversing a search tree containing
approximately b^d possible sequences of moves, where

 b is the game’s breadth (number of legal moves per position)
 d is its depth (game length)

Silver et al. 2016 14


Game tree of Go

Sizes of trees for various games:


 chess: b ≈ 35, d ≈ 80
 Go: b ≈ 250, d ≈ 150 ⇒ more positions than atoms in the
universe!

That makes Go a googol [10^100] times more complex than chess.

https://deepmind.com/alpha-go.html

How to handle the size of the game tree?
 for the breadth: a neural network to select moves
 for the depth: a neural network to evaluate the current
position
 for the tree traverse: Monte Carlo tree search (MCTS)
Allis et al. 1994 15
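
As a quick back-of-the-envelope check (not from the paper), the orders of magnitude of b^d can be computed in a few lines of Python. Note that b^d counts move sequences; the “googol times” comparison above refers to a different measure of complexity.

```python
import math

def log10_tree_size(b, d):
    """Base-10 order of magnitude of b**d, i.e. d * log10(b)."""
    return d * math.log10(b)

print(f"chess: ~10^{log10_tree_size(35, 80):.0f} move sequences")    # roughly 10^124
print(f"Go:    ~10^{log10_tree_size(250, 150):.0f} move sequences")  # roughly 10^360
```
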
Monte Carlo tree search

16
Neural networks
Neural Networks (NN): Inspiration

 inspired by the neuronal structure of the mammalian cerebral cortex
 but on much smaller scales
 suitable to model systems with a high tolerance to error
 e.g. audio or image recognition
http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 17
Neural Networks: Modes

Two modes
 feedforward for making predictions
 backpropagation for learning
Dieterle 2003 18
Neural Networks: an Example of Feedforward

http://stevenmiller888.github.io/mind-how-to-build-a-neural-network/ 19
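
As an illustration of the feedforward mode, here is a minimal sketch in the spirit of the tutorial linked above; the layer sizes and random weights are made up and have nothing to do with AlphaGo’s networks.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# a made-up 2-3-1 network: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 3))   # input -> hidden weights
W2 = rng.normal(size=(3, 1))   # hidden -> output weights

def feedforward(x):
    """One forward pass: an affine map followed by a nonlinearity at each layer."""
    hidden = sigmoid(x @ W1)       # hidden-layer activations
    output = sigmoid(hidden @ W2)  # network prediction
    return output

print(feedforward(np.array([1.0, 0.0])))   # prediction for one 2-dimensional input
```
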
Gradient Descent in Neural Networks

Motto: “Learn by mistakes!”

However, error functions are not necessarily convex or so “smooth”.


http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 20
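
A toy illustration of the “learn by mistakes” motto: plain gradient descent on a one-dimensional convex error function. Real error surfaces of deep networks are neither one-dimensional nor this smooth, and training uses stochastic minibatch updates; this is only a sketch of the idea.

```python
def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient of the error ("learn by mistakes")."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x

# toy error E(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
minimum = gradient_descent(grad=lambda x: 2 * (x - 3), x0=0.0)
print(minimum)   # converges towards 3
```
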
http://xkcd.com/1425/ 20
Convolutional Neural Networks (CNN or ConvNet)

http://code.flickr.net/2014/10/20/introducing-flickr-park-or-bird/ 21
(Deep) Convolutional Neural Networks

The hierarchy of concepts is captured in the number of layers: the deep in “Deep Learning”.

http://pages.cs.wisc.edu/~bolo/shipyard/neural/local.html 22
Rules of Go
Backgammon: Man vs. Fate

Chess: Man vs. Man

22
Go: Man vs. Self

Robert Šámal (White) versus Karel Král (Black), Spring School of Combinatorics 2016 22
Rules of Go

Black versus White. Black starts the game.

the rule of liberty

the “ko” rule

Handicap for difference in ranks: Black can place 1 or more stones
in advance (compensation for White’s greater strength). 23
Scoring Rules: Area Scoring

A player’s score is:

 the number of stones that the player has on the board
 plus the number of empty intersections surrounded by that player’s stones
 plus komi (komidashi) points for the White player,
which is compensation for the first-move advantage of the Black player

https://en.wikipedia.org/wiki/Go_(game) 24
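
A minimal sketch of area scoring for a position given as a grid of 'B', 'W' and '.' characters. The komi value of 7.5 is just one commonly used setting, not part of the rules stated above, and the handling of empty regions here is deliberately naive.

```python
from collections import deque

def area_score(board, komi=7.5):
    """Area scoring: stones on the board + surrounded empty points; White also gets komi."""
    rows, cols = len(board), len(board[0])
    score = {"B": 0.0, "W": komi}              # komi compensates White for moving second
    seen = set()
    for r in range(rows):
        for c in range(cols):
            colour = board[r][c]
            if colour in ("B", "W"):
                score[colour] += 1             # count the stone itself
            elif (r, c) not in seen:
                # flood-fill this empty region and note which colours border it
                region, borders, queue = 0, set(), deque([(r, c)])
                seen.add((r, c))
                while queue:
                    y, x = queue.popleft()
                    region += 1
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols:
                            neighbour = board[ny][nx]
                            if neighbour == "." and (ny, nx) not in seen:
                                seen.add((ny, nx))
                                queue.append((ny, nx))
                            elif neighbour in ("B", "W"):
                                borders.add(neighbour)
                if len(borders) == 1:          # territory only if bordered by one colour
                    score[borders.pop()] += region
    return score

print(area_score(["BB.", "BW.", "..."]))       # tiny 3x3 position, purely illustrative
```
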
Ranks of Players

Kyu and Dan ranks

or alternatively, Elo ratings

https://en.wikipedia.org/wiki/Go_(game) 25
Chocolate micro-break

25
AlphaGo: Inside Out
Policy and Value Networks

Silver et al. 2016 26


Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 27


SL Policy Network (1/2)

 13-layer deep convolutional neural network


 goal: to predict expert human moves
 task of classification
 trained from 30 million positions from the KGS Go Server
 stochastic gradient ascent:

∆σ ∝ ∂ log pσ(a|s) / ∂σ

(to maximize the likelihood of the human move a selected in state s)

Results:

 44.4% accuracy (the state-of-the-art from other groups)


 55.7% accuracy (raw board position + move history as input)
 57.0% accuracy (all input features)
Silver et al. 2016 28
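
A minimal sketch of the ascent step ∆σ ∝ ∂ log pσ(a|s)/∂σ, using a toy linear–softmax policy over a hand-rolled feature vector as a stand-in for the 13-layer CNN; the sizes and learning rate are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sl_policy_update(sigma, features, expert_move, learning_rate=0.01):
    """One ascent step on log p_sigma(a|s) for a toy linear-softmax policy.

    sigma:       weight matrix of shape (num_moves, num_features)
    features:    feature vector of position s
    expert_move: index of the human expert's move a
    """
    probs = softmax(sigma @ features)           # p_sigma(. | s)
    grad_logits = -probs                        # d log p(a|s) / d logits = one_hot(a) - probs
    grad_logits[expert_move] += 1.0
    sigma += learning_rate * np.outer(grad_logits, features)   # gradient ascent
    return sigma

# toy usage with made-up sizes: 5 candidate moves, 8 features
rng = np.random.default_rng(0)
sigma = rng.normal(scale=0.1, size=(5, 8))
sigma = sl_policy_update(sigma, rng.normal(size=8), expert_move=2)
```
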
SL Policy Network (2/2)

Small improvements in accuracy led to large improvements in playing strength

Silver et al. 2016 29


Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 30


Rollout Policy

 The rollout policy pπ(a|s) is faster but less accurate than the SL policy network.
 accuracy of 24.2%
 It takes 2 µs to select an action, compared to 3 ms in the case of the SL policy network.

Silver et al. 2016 31


Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 32


RL Policy Network (1/2)

 identical in structure to the SL policy network


 goal: to win in the games of self-play
 task of classification
 weights ρ initialized to the same values, ρ := σ
 games of self-play
 between the current RL policy network and a randomly
selected previous iteration
 to prevent overfitting to the current policy

 stochastic gradient ascent:


∆ρ ∝ zt ∂ log pρ(at|st) / ∂ρ
at time step t, where reward function zt is +1 for winning and −1 for losing.

Silver et al. 2016 33
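
The same ascent step with the game outcome zt as a scaling factor, again with a toy linear–softmax policy standing in for pρ. This is only a REINFORCE-style sketch: minibatching, the opponent pool, and any variance-reduction tricks are omitted.

```python
import numpy as np

def rl_policy_update(rho, trajectory, learning_rate=0.01):
    """Ascent on z_t * log p_rho(a_t|s_t) over one self-play game.

    rho:        weight matrix of a toy linear-softmax policy, shape (num_moves, num_features)
    trajectory: list of (features, chosen_move, z) tuples, with z = +1 for a win, -1 for a loss
    """
    for features, move, z in trajectory:
        logits = rho @ features
        logits -= logits.max()
        probs = np.exp(logits) / np.exp(logits).sum()
        grad_logits = -probs
        grad_logits[move] += 1.0
        rho += learning_rate * z * np.outer(grad_logits, features)
    return rho
```
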


RL Policy Network (2/2)

Results (by sampling each move at ∼ pρ(·|st)):

 80% win rate against the SL policy network
 85% win rate against the strongest open-source Go program, Pachi (Baudiš and Gailly 2011)
 The previous state of the art, based only on SL of CNNs: 11% “win” rate against Pachi

Silver et al. 2016 34


Training the (Deep Convolutional) Neural Networks

Silver et al. 2016 35


Value Network (1/2)

 similar architecture to the policy network, but outputs a single prediction instead of a probability distribution
 goal: to estimate a value function

v^p(s) = E[zt | st = s, at...T ∼ p]

that predicts the outcome from position s (of games played by using policy p)

 Double approximation: vθ(s) ≈ v^pρ(s) ≈ v∗(s).
 task of regression
 stochastic gradient descent:

∆θ ∝ (z − vθ(s)) ∂vθ(s) / ∂θ

(to minimize the mean squared error (MSE) between the predicted vθ(s) and the true z)

Silver et al. 2016 36
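
A sketch of the descent step ∆θ ∝ (z − vθ(s)) ∂vθ(s)/∂θ, with a toy tanh value function standing in for the value network; the feature representation and learning rate are made up.

```python
import numpy as np

def value_update(theta, features, z, learning_rate=0.01):
    """One step that reduces the squared error (z - v_theta(s))^2 for a toy value function.

    theta:    weight vector of shape (num_features,)
    features: feature vector of position s
    z:        game outcome from s (+1 win, -1 loss)
    """
    v = np.tanh(theta @ features)                # prediction in (-1, 1)
    grad_v = (1.0 - v**2) * features             # d v_theta(s) / d theta for tanh(theta . x)
    theta += learning_rate * (z - v) * grad_v    # delta_theta ∝ (z - v) * grad_v
    return theta
```
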


Value Network (2/2)

Beware of overfitting!

 Consecutive positions are strongly correlated.


 The value network memorized the game outcomes, rather than generalizing to new positions.
 Solution: generate 30 million (new) positions, each sampled from a separate game
 almost the accuracy of Monte Carlo rollouts (using pρ ), but
15000 times less computation!

Silver et al. 2016 37


Evaluation Accuracy in Various Stages of a Game

Move number is the number of moves that had been played in the given position.

Each position evaluated by:


 forward pass of the value network vθ
 100 rollouts, played out using the corresponding policy
Silver et al. 2016 38
Elo Ratings for Various Combinations of Networks

Silver et al. 2016 39


The Main Algorithm

Silver et al. 2016 39


MCTS Algorithm

The next action is selected by lookahead search, using simulation:

1. selection phase
2. expansion phase
3. evaluation phase
4. backup phase (at end of all simulations)

Each edge (s, a) keeps:

 action value Q(s, a)


 visit count N(s, a)
 prior probability P(s, a) (from SL policy network pσ )

The tree is traversed by simulation (descending the tree) from the root state.
Silver et al. 2016 40
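
A possible skeleton of these four phases and of the per-edge statistics, purely as a sketch: the real search is asynchronous and far more involved, the four phase functions are left to the caller, and the simulation budget of 1600 is an arbitrary illustrative number.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    """Statistics stored for a tree edge (s, a), as listed above."""
    P: float           # prior probability from the SL policy network
    N: int = 0         # visit count
    W: float = 0.0     # total value accumulated over simulations

    @property
    def Q(self) -> float:
        """Mean action value."""
        return self.W / self.N if self.N else 0.0

@dataclass
class Node:
    edges: dict = field(default_factory=dict)    # move -> Edge

def search(root, select, expand, evaluate, backup, num_simulations=1600):
    """Skeleton of one MCTS run over the four phases named on the slide."""
    for _ in range(num_simulations):
        leaf, path = select(root)    # 1. selection: descend the tree
        expand(leaf)                 # 2. expansion: add children with priors P(s, a)
        value = evaluate(leaf)       # 3. evaluation: value network and/or rollout
        backup(path, value)          # 4. backup: update Q and N along the path
    # afterwards, play the most-visited move from the root
    return max(root.edges, key=lambda move: root.edges[move].N)
```
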
MCTS Algorithm: Selection

At each time step t, an action at is selected from state st:

at = arg max_a (Q(st, a) + u(st, a))

where the bonus

u(st, a) ∝ P(s, a) / (1 + N(s, a))
Silver et al. 2016 41
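
The selection rule as code, reusing the Edge/Node sketch from above. The exploration constant c_puct is illustrative; the full bonus in the paper additionally scales with the parent's total visit count.

```python
def select_move(node, c_puct=5.0):
    """argmax over a of Q(s, a) + u(s, a), with u(s, a) ∝ P(s, a) / (1 + N(s, a))."""
    def score(move):
        edge = node.edges[move]
        return edge.Q + c_puct * edge.P / (1 + edge.N)
    return max(node.edges, key=score)
```
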
MCTS Algorithm: Expansion

A leaf position may be expanded (just once) by the SL policy network pσ .

The output probabilities are stored as priors P(s, a) := pσ (a|s).

Silver et al. 2016 42


MCTS: Evaluation

 evaluation from the value network vθ (s)


 evaluation by the outcome z using the fast rollout policy pπ until the end of the game

Using a mixing parameter λ, the final leaf evaluation V (s) is

V (s) = (1 − λ)vθ (s) + λz

Silver et al. 2016 43
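
The mixed leaf evaluation as code; λ = 0.5 is reportedly the setting used by AlphaGo, but treat the default here as illustrative.

```python
def leaf_evaluation(value_net_output, rollout_outcome, lam=0.5):
    """V(s) = (1 - lambda) * v_theta(s) + lambda * z."""
    return (1.0 - lam) * value_net_output + lam * rollout_outcome
```
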


MCTS: Backup

At the end of simulation, each traversed edge is updated by accumulating:

 the action values Q


 visit counts N

Silver et al. 2016 44


Once the search is complete, the algorithm
chooses the most visited move from the root
position.

Silver et al. 2016 44
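
A sketch of the backup step and of the final move choice, again using the Edge/Node sketch above; sign handling for the two alternating players is omitted for brevity.

```python
def backup(path, leaf_value):
    """Propagate one simulation's leaf evaluation along the traversed edges."""
    for edge in path:
        edge.N += 1           # visit count
        edge.W += leaf_value  # value total, so that Q = W / N

def best_move(root):
    """Once the search is complete, choose the most-visited move from the root."""
    return max(root.edges, key=lambda move: root.edges[move].N)
```
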


Percentage of Simulations

percentage frequency with which actions were selected from the root during simulations

Silver et al. 2016 45


Principal Variation (Path with Maximum Visit Count)

The moves are presented in a numbered sequence.

 AlphaGo selected the move indicated by the red circle;


 Fan Hui responded with the move indicated by the white square;
 in his post-game commentary, he preferred the move (labelled 1) predicted by AlphaGo.

Silver et al. 2016 46


Scalability

 asynchronous multi-threaded search


 simulations on CPUs
 computation of neural networks on GPUs

AlphaGo:

 40 search threads
 40 CPUs
 8 GPUs

Distributed version of AlphaGo (on multiple machines):

 40 search threads
 1202 CPUs
 176 GPUs
Silver et al. 2016 47
Elo Ratings for Various Combinations of Threads

Silver et al. 2016 48


Results: the strength of AlphaGo
Tournament with Other Go Programs

Silver et al. 2016 49


Fan Hui

 professional 2 dan
 European Go Champion in 2013, 2014 and 2015
 European Professional Go Champion in 2016
 biological neural network:
 100 billion neurons
 100 to 1,000 trillion neuronal connections
https://en.wikipedia.org/wiki/Fan_Hui 50
AlphaGo versus Fan Hui

AlphaGo won 5:0 in a formal match in October 2015.


[AlphaGo] is very strong and stable, it seems
like a wall. ... I know AlphaGo is a computer,
but if no one told me, maybe I would think
the player was a little strange, but a very
strong player, a real person.

Fan Hui 51
Lee Sedol “The Strong Stone”

 professional 9 dan
 the 2nd in international titles
 the 5th youngest (12 years 4 months) to become
a professional Go player in South Korean history
 Lee Sedol would win 97 out of 100 games against Fan Hui.
 biological neural network comparable to Fan Hui’s (in number
of neurons and connections)
https://en.wikipedia.org/wiki/Lee_Sedol 52
I heard Google DeepMind’s AI is surprisingly
strong and getting stronger, but I am
confident that I can win, at least this time.

Lee Sedol

...even beating AlphaGo by 4:1 may allow
the Google DeepMind team to claim its de
facto victory and the defeat of him
[Lee Sedol], or even humankind.

interview in JTBC
Newsroom

52
AlphaGo versus Lee Sedol

In March 2016 AlphaGo won 4:1 against the legendary Lee Sedol.
AlphaGo won all but the 4th game; all games were won
by resignation.
The winner of the match was slated to win $1 million.
Since AlphaGo won, Google DeepMind stated that the prize would be
donated to charities, including UNICEF, and to Go organisations.
Lee received $170,000 ($150,000 for participating in all the five
games, and an additional $20,000 for each game won).

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol 53
Who’s next?

53
http://www.goratings.org/ (18th April 2016) 53
AlphaGo versus Ke Jie?

 professional 9 dan
 the 1st in (unofficial) world ranking list
 the youngest player to win 3 major international tournaments
 head-to-head record against Lee Sedol 8:2
 biological neural network comparable to Fan Hui’s, and thus
by transitivity, also comparable to Lee Sedol’s
https://en.wikipedia.org/wiki/Ke_Jie 54
I believe I can beat it. Machines can be very
strong in many aspects but still have
loopholes in certain calculations.

Ke Jie

Now facing AlphaGo, I do not feel the same
strong instinct of victory when I play a
human player, but I still believe I have the
advantage against it. It’s 60 percent in
favor of me.

Ke Jie

Even though AlphaGo may have defeated
Lee Sedol, it won’t beat me.

Ke Jie

54
Conclusion
Difficulties of Go

 challenging decision-making
 intractable search space
 complex optimal solution that appears infeasible to approximate
directly using a policy or value function!

Silver et al. 2016 55


AlphaGo: summary

 Monte Carlo tree search


 effective move selection and position evaluation
 through deep convolutional neural networks
 trained by novel combination of supervised and reinforcement
learning
 new search algorithm combining
 neural network evaluation
 Monte Carlo rollouts

 scalable implementation
 multi-threaded simulations on CPUs
 parallel GPU computations
 distributed version over multiple machines

Silver et al. 2016 56


Novel approach

During the match against Fan Hui, AlphaGo evaluated thousands of times
fewer positions than Deep Blue did against Kasparov.
It compensated for this by:

 selecting those positions more intelligently (policy network)
 evaluating them more precisely (value network)

Deep Blue relied on a handcrafted evaluation function.

AlphaGo was trained directly and automatically from gameplay.
It used general-purpose learning.
This approach is not specific to the game of Go. The algorithm
can be used for a much wider class of (so far seemingly)
intractable problems in AI!
Silver et al. 2016 57
Thank you!
Questions?

57
Backup Slides
Input features for rollout and tree policy

Silver et al. 2016


Selection of Moves by the SL Policy Network

move probabilities taken directly from the SL policy network pσ (reported as a percentage if above 0.1%).

Silver et al. 2016


Selection of Moves by the Value Network

evaluation of all successors s′ of the root position s, using vθ(s′)

Silver et al. 2016


Tree Evaluation from Value Network

action values Q(s, a) for each tree-edge (s, a) from root position s (averaged over value network evaluations only)
Silver et al. 2016
Tree Evaluation from Rollouts

action values Q(s, a), averaged over rollout evaluations only


Silver et al. 2016
Results of a tournament between different Go programs

Silver et al. 2016


Results of a tournament between AlphaGo and distributed AlphaGo,
testing scalability with hardware

Silver et al. 2016


AlphaGo versus Fan Hui: Game 1

Silver et al. 2016


AlphaGo versus Fan Hui: Game 2

Silver et al. 2016


AlphaGo versus Fan Hui: Game 3

Silver et al. 2016


AlphaGo versus Fan Hui: Game 4

Silver et al. 2016


AlphaGo versus Fan Hui: Game 5

Silver et al. 2016


AlphaGo versus Lee Sedol: Game 1

https://youtu.be/vFr3K2DORc8

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (1/2)

https://youtu.be/l-GsfyVCBu0

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 2 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 3

https://youtu.be/qUAmTYHEyM8
https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 4

https://youtu.be/yCALyQRN3hw

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (1/2)

https://youtu.be/mzpW10DPHeQ

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
AlphaGo versus Lee Sedol: Game 5 (2/2)

https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol
Further Reading I

AlphaGo:

 Google Research Blog


http://googleresearch.blogspot.cz/2016/01/alphago-mastering-ancient-game-of-go.html
 an article in Nature
http://www.nature.com/news/google-ai-algorithm-masters-ancient-game-of-go-1.19234
 a reddit article claiming that AlphaGo is even stronger than it appears to be:
“AlphaGo would rather win by less points, but with higher probability.”
https://www.reddit.com/r/baduk/comments/49y17z/the_true_strength_of_alphago/
 a video of how AlphaGo works (put in layman’s terms) https://youtu.be/qWcfiPi9gUU

Articles by Google DeepMind:

 Atari player: a DeepRL system which combines Deep Neural Networks with Reinforcement Learning (Mnih
et al. 2015)
 Neural Turing Machines (Graves, Wayne, and Danihelka 2014)

Artificial Intelligence:

 Artificial Intelligence course at MIT


http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/
6-034-artificial-intelligence-fall-2010/index.htm
Further Reading II
 Introduction to Artificial Intelligence at Udacity
https://www.udacity.com/course/intro-to-artificial-intelligence--cs271
 General Game Playing course https://www.coursera.org/course/ggp
 Singularity http://waitbutwhy.com/2015/01/artificial-intelligence-revolution-1.html + Part 2
 The Singularity Is Near (Kurzweil 2005)

Combinatorial Game Theory (founded by John H. Conway to study endgames in Go):

 Combinatorial Game Theory course https://www.coursera.org/learn/combinatorial-game-theory


 On Numbers and Games (Conway 1976)
 Computer Go as a sum of local games: an application of combinatorial game theory (Müller 1995)

Chess:

 Deep Blue beats G. Kasparov in 1997 https://youtu.be/NJarxpYyoFI

Machine Learning:

 Machine Learning course


https://www.coursera.org/learn/machine-learning/
 Reinforcement Learning http://reinforcementlearning.ai-depot.com/
 Deep Learning (LeCun, Bengio, and Hinton 2015)
Further Reading III

 Deep Learning course https://www.udacity.com/course/deep-learning--ud730

 Two Minute Papers https://www.youtube.com/user/keeroyz

 Applications of Deep Learning https://youtu.be/hPKJBXkyTKM

Neuroscience:

 http://www.brainfacts.org/
References I

Allis, Louis Victor et al. (1994). Searching for solutions in games and artificial intelligence. Ponsen & Looijen.

Baudiš, Petr and Jean-loup Gailly (2011). “Pachi: State of the art open source Go program”. In: Advances in
Computer Games. Springer, pp. 24–38.

Bowling, Michael et al. (2015). “Heads-up limit hold’em poker is solved”. In: Science 347.6218, pp. 145–149. url:
http://poker.cs.ualberta.ca/15science.html.

Champandard, Alex J (2016). “Semantic Style Transfer and Turning Two-Bit Doodles into Fine Artworks”. In:
arXiv preprint arXiv:1603.01768.

Conway, John Horton (1976). “On Numbers and Games”. In: London Mathematical Society Monographs 6.

Dieterle, Frank Jochen (2003). “Multianalyte quantifications by means of integration of artificial neural networks,
genetic algorithms and chemometrics for time-resolved analytical data”. PhD thesis. Universität Tübingen.

Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge (2015). “A Neural Algorithm of Artistic Style”. In:
CoRR abs/1508.06576. url: http://arxiv.org/abs/1508.06576.

Graves, Alex, Greg Wayne, and Ivo Danihelka (2014). “Neural turing machines”. In: arXiv preprint
arXiv:1410.5401.
Hayes, Bradley (2016). url: https://twitter.com/deepdrumpf.

Karpathy, Andrej (2015). The Unreasonable Effectiveness of Recurrent Neural Networks. url:
http://karpathy.github.io/2015/05/21/rnn-effectiveness/ (visited on 04/01/2016).
References II

Kurzweil, Ray (2005). The singularity is near: When humans transcend biology. Penguin.

LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton (2015). “Deep learning”. In: Nature 521.7553, pp. 436–444.

Li, Chuan and Michael Wand (2016). “Combining Markov Random Fields and Convolutional Neural Networks for
Image Synthesis”. In: CoRR abs/1601.04589. url: http://arxiv.org/abs/1601.04589.

Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518.7540,
pp. 529–533. url:
https://storage.googleapis.com/deepmind-data/assets/papers/DeepMindNature14236Paper.pdf.

Müller, Martin (1995). “Computer Go as a sum of local games: an application of combinatorial game theory”.
PhD thesis. TU Graz.
Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature
529.7587, pp. 484–489.
