Tetris can be modeled as a Markov decision process. In the most typical formulation, a state includes the current configuration of the grid as well as the identity of the falling Tetrimino. The available actions are the possible legal placements that can be achieved by rotating and translating the Tetrimino before dropping it.

This formulation ignores some pieces of information provided in some implementations of the game, for example, the identity of the next Tetrimino to fall after the current one is placed. It also excludes actions that are available in some implementations of the game. One example is sliding a Tetrimino under a cliff, which is known as an overhang. Another example is rotating the T-shaped Tetrimino to fill an otherwise unreachable slot at the very last moment. This maneuver is known as a T-spin and it gives extra points in some implementations of the game.

The original version of the game used a scoring function that awards one point for each cleared line. Subsequent versions allocate more points for clearing more than one line simultaneously. Clearing four lines at once, by placing an I-shaped Tetrimino in a deep well, is allocated the largest number of points. Most implementations of Tetris by researchers use the original scoring function, where a single point is allocated to each cleared line.

3. Algorithms and Features

Tetris is estimated to have 7 × 2^200 states. Given this large number, the general approach has been to approximate a value function or learn a policy using a set of features that describe either the current state or the current state–action pair.

Tetris poses a number of difficulties for research. First, small changes in the implementation of the game cause very large differences in scores. This makes comparison of scores from different research articles difficult. Second, the score obtained in the game has a very large variance. Therefore, a large number of games need to be completed to accurately assess average performance. Furthermore, games can take a very long time to complete. Researchers who have developed algorithms that can play Tetris reasonably well have found themselves waiting for days for a single game to be over. For that reason, it is common to work with grids smaller than the standard grid size of 20 × 10. A reasonable way to make the game shorter, without compromising its nature, is to use a grid size of 10 × 10. The scores reported in this article are those achieved on the standard grid of size 20 × 10, unless otherwise noted.

The most common approach to Tetris has been to develop a linear evaluation function, where each possible placement of the Tetrimino is evaluated to select the placement with the highest value. In the next sections, we discuss the features used in these evaluation functions, as well as how the weights are tuned.

3.1. Early Attempts

Tsitsiklis & Van Roy (1996) used Tetris as a test bed for large-scale feature-based dynamic programming. They used two simple state features: the number of holes, and the height of the highest column. They achieved a score of around 30 cleared lines on a 16 × 10 grid.

Bertsekas & Tsitsiklis (1996) added two sets of features: the height of each column, and the difference in height between consecutive columns. Using lambda-policy iteration, they achieved a score of around 2,800 lines. Note, however, that their implementation for ending the game effectively reduces the grid to a size of 19 × 10.

Lagoudakis et al. (2002) added further features, including mean column height and the sum of the differences in consecutive column heights. Using least-squares policy iteration, they achieved an average score of between 1,000 and 3,000 lines.

Kakade (2002) used a policy-gradient algorithm to clear 6,800 lines on average using the same features as Bertsekas & Tsitsiklis (1996).

Farias & Van Roy (2006) studied an algorithm that samples constraints in the form of Bellman equations for a linear programming solver. The solver finds a policy that clears around 4,500 lines using the features used by Bertsekas & Tsitsiklis (1996).

Ramon & Driessens (2004) used relational reinforcement learning with a Gaussian kernel and achieved a score of around 50 cleared lines. Romdhane & Lamontagne (2008) combined reinforcement learning and case-based reasoning using patterns of small parts of the grid. Their scores were also around 50 cleared lines.

3.2. Hand-Crafted Agent

Until 2008, the best artificial Tetris player was handcrafted, as reported by Fahey (2003). Pierre Dellacherie, a self-declared average Tetris player, identified six simple features and tuned the weights by trial and error. These features were: number of holes; landing height of the piece; number of row transitions (transitions from a full square to an empty one, or vice versa, examining each row from end to end); number of column transitions; cumulative number of wells; and eroded cells (number of cleared lines multiplied by the number of holes in those lines filled by the present Tetrimino). The evaluation function was as follows:

    −4 × holes − cumulative wells − row transitions − column transitions − landing height + eroded cells
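As a concrete illustration, Dellacherie's linear rule can be written as a small scoring function. This is a sketch only: the six feature values are assumed to be computed elsewhere by the game implementation, and the dictionary keys used here are hypothetical names, not part of the original description.

```python
def dellacherie_score(features):
    """Dellacherie's hand-tuned linear evaluation of one candidate placement.

    `features` maps each of the six feature names to its value for the
    placement under consideration; how the values are computed from the
    grid is left to the game implementation.
    """
    return (-4.0 * features["holes"]
            - features["cumulative_wells"]
            - features["row_transitions"]
            - features["column_transitions"]
            - features["landing_height"]
            + features["eroded_cells"])
```

The agent then plays the placement with the highest score, e.g. `max(placements, key=lambda p: dellacherie_score(extract_features(p)))` for some hypothetical `extract_features` helper.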
The Game of Tetris in Machine Learning
This linear evaluation function cleared an average of 660,000 lines on the full grid. The scores were reported on an implementation where the game was over if the falling Tetrimino had no space to appear on the grid (in the center of the top row). In the simplified implementation used by the approaches discussed earlier, the games would have continued further, until every placement would overflow the grid. Therefore, this report underrates this simple linear rule compared to other algorithms.

3.3. Genetic Algorithms

Böhm et al. (2005) used evolutionary algorithms to develop a Tetris controller. In their implementation, the agent knows not only the falling Tetrimino but also the next one. This makes their results incomparable to those achieved on versions with knowledge of only the current Tetrimino. They evolved both a linear and an exponential policy. They reported 480,000,000 lines cleared using the linear function and 34,000,000 using the exponential function, both on the standard grid. They introduced new features such as the number of connected holes, the number of occupied cells, and the number of occupied cells weighted by their height. These additional features were not picked up in subsequent research.

Szita & Lőrincz (2006) used the cross-entropy algorithm and achieved 350,000 lines cleared. The algorithm probes random parameter vectors in search of the linear policy that maximizes the score. For each parameter vector, a number of games are played. The mean and standard deviation of the best parameter vectors are used to generate a new generation of policies. A constantly decreasing noise term allows for an efficient exploration of the parameter space. Later, Szita & Lőrincz (2007) also successfully applied a version of the cross-entropy algorithm to Ms. Pac-Man, another difficult domain.

Following this work, Thiery & Scherrer (2009a;b) added a couple of features (hole depth and rows with holes) and developed the BCTS controller using the cross-entropy algorithm, where BCTS stands for building controllers for Tetris. They achieved an average score of 35,000,000 cleared lines. With the addition of a new feature, pattern diversity, BCTS won the 2008 Reinforcement Learning Competition.

In 2009, Boumaza introduced another evolutionary algorithm to Tetris, the covariance matrix adaptation evolution strategy (CMA-ES). He found that the resulting weights were very close to Dellacherie's and also cleared 35,000,000 lines on average.

The success of genetic algorithms deserves our attention given the recent resurgence of evolutionary strategies as strong competitors to reinforcement learning algorithms (Salimans et al., 2017). Notably, they are easier to parallelize than reinforcement learning algorithms. As discussed below, the latest reported reinforcement learning algorithm uses an evolutionary strategy inside the policy evaluation step (Gabillon et al., 2013; Scherrer et al., 2015).

3.4. Approximate Modified Policy Iteration

Gabillon et al. (2013) found a vector of weights that achieved 51,000,000 cleared lines using a classification-based policy iteration algorithm inspired by Lagoudakis & Parr (2003). This is the first reinforcement learning algorithm whose performance is comparable with that of the genetic algorithms. Lagoudakis & Parr's idea is to make use of sophisticated classifiers inside the loop of reinforcement learning algorithms to identify good actions. Gabillon et al. (2013) estimated values of state–action pairs using rollouts and then minimized a complex function of these rollouts using the CMA-ES algorithm (Hansen & Ostermeier, 2001), the most widely known evolutionary strategy (Salimans et al., 2017). Within this algorithm, CMA-ES performs a cost-sensitive classification task, hence the name of the algorithm: classification-based modified policy iteration (CBMPI).

The sample of states used by CBMPI was extracted from trajectories played by BCTS and then subsampled to make the grid-height distribution more uniform. CBMPI takes at least as long as BCTS to achieve a good level of performance. Furthermore, how best to subsample to achieve good performance is not well understood (Scherrer et al., 2015, p. 27).

We next review an approach of a different nature: how the structure of the decision environment allows for domain knowledge to be integrated into the learning algorithm.

4. Structure of the Decision Environment

Real-world problems present regularities of various types. It is therefore reasonable to expect regularities in the decision problems encountered when playing video games such as Tetris. Recent work has identified some of these regularities and has shown that learning algorithms can exploit them to achieve better performance (Şimşek et al., 2016).

Specifically, Şimşek et al. (2016) showed that three types of regularities are prevalent in Tetris: simple dominance, cumulative dominance, and noncompensation. When these conditions hold, a large number of placements can be (correctly) eliminated from consideration even when the exact weights of the evaluation function are unknown. For example, when one placement simply dominates another placement (which means that one placement is superior to the other in at least one feature and inferior in none), the dominated placement can be eliminated.
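Simple dominance can be tested pairwise over feature vectors. The sketch below is an illustration of the idea, not the implementation of Şimşek et al. (2016); it assumes every feature has been oriented so that larger values are always better (for example, by negating counts such as holes).

```python
def dominates(a, b):
    """True if placement a simply dominates placement b: a is at least as
    good as b on every feature and strictly better on at least one.
    Feature vectors are assumed oriented so that larger is better."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def prune_dominated(placements):
    """Drop every placement that is simply dominated by some other placement.
    `placements` maps a placement label to its feature vector."""
    return {
        name: f for name, f in placements.items()
        if not any(dominates(g, f)
                   for other, g in placements.items() if other != name)
    }
```

For example, with hypothetical two-feature vectors (negated holes, negated maximum height), `prune_dominated({"left": (-2, -8), "middle": (-1, -7), "right": (-2, -9)})` keeps only `"middle"`, which dominates both alternatives, and does so without knowing the evaluation weights.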
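The cross-entropy search described in Section 3.3 can also be sketched in a few lines. The objective below is a stand-in for the average score of games played with a weight vector, and the constants in the decreasing noise schedule are illustrative, not those used by Szita & Lőrincz (2006).

```python
import numpy as np

def cross_entropy_search(evaluate, dim, iterations=50, population=100,
                         elite_frac=0.1, rng=None):
    """Cross-entropy method over linear policy weights.

    evaluate(w) should return the (average) game score achieved with
    weight vector w. A noise term added to the elite variance decreases
    over iterations, so exploration does not collapse too early.
    """
    rng = np.random.default_rng(rng)
    mean, std = np.zeros(dim), np.ones(dim)
    n_elite = max(1, int(elite_frac * population))
    for t in range(iterations):
        # Probe random parameter vectors around the current distribution.
        samples = rng.normal(mean, std, size=(population, dim))
        scores = np.array([evaluate(w) for w in samples])
        # Refit the distribution to the best-performing vectors.
        elite = samples[np.argsort(scores)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0)
        # Constantly decreasing noise, added to the variance.
        std = np.sqrt(std ** 2 + max(0.0, 5.0 - t / 10.0))
    return mean
```

On a toy objective such as `lambda w: -((w - 3.0) ** 2).sum()`, the returned mean converges toward the maximizer; in the Tetris setting, `evaluate` would instead play a number of games with the candidate weights.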
People enjoy playing Tetris and can learn to play reasonably well after a little practice. Some effort is being made to understand how people do this (Sibert et al., 2015). Progress in this research may help AI tackle the type of problems that gamers face every day.

Kirsh & Maglio (1994) gave a detailed account of how people perceive a Tetrimino and execute the actions necessary to place it. But an important question is not yet answered: How do people decide where to place the Tetrimino?

Finally, we have not yet developed an efficient learning algorithm for Tetris. Current best approaches require hundreds of thousands of samples, have a noisy learning process, and rely on a prepared sample of states extracted from games played by a policy that already plays well. The challenge is still to develop an algorithm that reliably learns to play using little experience.

As AI research continues to deal with increasingly difficult real-world problems, inspiration may come from how people cope with the uncertainties in their lives. Video games, such as Tetris, provide controlled environments where people's decision making can be studied in depth. Game players have limited time and limited computational resources to consider every possible action open to them. Uncertainty about consequences, limited information about alternatives, and the sheer complexity of the problems they face make any exhaustive computation infeasible (Simon, 1972, p. 169). Yet they make hundreds of decisions in a minute and outperform every existing algorithm in a number of domains.
Reference                               Algorithm                       Grid     Score (lines)  Features
Tsitsiklis & Van Roy (1996)             Approximate value iteration     16 × 10  30             Holes and pile height
Bertsekas & Tsitsiklis (1996)           λ-policy iteration              19 × 10  2,800          Bertsekas
Lagoudakis et al. (2002)                Least-squares policy iteration  20 × 10  ≈ 2,000        Lagoudakis
Kakade (2002)                           Natural policy gradient         20 × 10  ≈ 5,000        Bertsekas
Dellacherie [reported by Fahey (2003)]  Hand tuned                      20 × 10  660,000        Dellacherie
Ramon & Driessens (2004)                Relational RL                   20 × 10  ≈ 50
Böhm et al. (2005)                      Genetic algorithm (two piece)   20 × 10  480,000,000    Böhm
Farias & Van Roy (2006)                 Linear programming              20 × 10  4,274          Bertsekas
Szita & Lőrincz (2006)                  Cross entropy                   20 × 10  348,895        Dellacherie
Romdhane & Lamontagne (2008)            Case-based reasoning and RL     20 × 10  ≈ 50
Boumaza (2009)                          CMA-ES                          20 × 10  35,000,000     BCTS
Thiery & Scherrer (2009a;b)             Cross entropy                   20 × 10  35,000,000     DT
Gabillon et al. (2013)                  Classification-based policy     20 × 10  51,000,000     DT for policy;
                                        iteration                                               DT + RBF for value
Gabillon, Victor, Ghavamzadeh, Mohammad, and Scherrer, Bruno. Approximate dynamic programming finally performs well in the game of Tetris. In Advances in Neural Information Processing Systems, pp. 1754–1762, 2013.

Hansen, Nikolaus and Ostermeier, Andreas. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001. doi: 10.1162/106365601750190398.

Kakade, Sham. A natural policy gradient. Advances in Neural Information Processing Systems, 2:1531–1538, 2002.

Kirsh, David and Maglio, Paul. On distinguishing epistemic from pragmatic action. Cognitive Science, 18(4):513–549, 1994.

Lagoudakis, Michail G. and Parr, Ronald. Reinforcement learning as classification: Leveraging modern classifiers. In ICML, volume 3, pp. 424–431, 2003.

Sibert, Catherine, Gray, Wayne D., and Lindstedt, John K. Tetris: Exploring human performance via cross entropy reinforcement learning models. Submitted to Proceedings of the 38th Annual Conference of the Cognitive Science Society, 2015.

Simon, Herbert A. Theories of bounded rationality. Decision and Organization, 1(1):161–176, 1972.

Şimşek, Ö., Algorta, S., and Kothiyal, A. Why most decisions are easy in Tetris, and perhaps in other sequential decision problems, as well. In 33rd International Conference on Machine Learning, ICML 2016, volume 4, 2016.

Stevens, Matt and Pradhan, Sabeek. Playing Tetris with deep reinforcement learning. Unpublished.

Sutton, Richard S., Precup, Doina, and Singh, Satinder. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.