Novel Regression-Based Chess Evaluation Function Challenges Conventional Evaluation Function

Rahul Desirazu The Harker School

Abstract
21st-century chess engines use an evaluation function to determine the best move in a given board state. This evaluation function uses weighted metrics created by chess experts. Without these experts, building a world-class evaluation function is difficult. In this paper, we present a method to create a chess evaluation function that does not require chess expertise. The evaluation function that we present uses 55 metrics that are based only on piece weights and attacking relationships in a position. The weights of these metrics were obtained by applying a least squares regression to evaluations of 6588 board positions produced by a well-known chess engine, Rybka. We replaced the evaluation function in Stockfish, an open-source chess engine, with our regression-based evaluation function and played against other computer bots on the Internet Chess Club (ICC). The piece weights determined through our regression model differed from the conventional weights of these pieces, suggesting that the conventional piece weights used to evaluate board states may not be entirely accurate. After playing 31 games, our engine obtained a rating of 2346, which is better than 99.5% of active players on the ICC. Based on the engine's success, it appears that someone with little knowledge of chess can in fact engineer a chess engine's evaluation function by using regression analysis. The same analysis may be tested in fields like medicine to weight risk factors for diseases like heart disease, cancer, and diabetes.

1. Introduction
21st-century chess engines use an evaluation function to find the best move in a board state. These evaluation functions use a set of weighted metrics created by chess experts. Without access to experts, creating a chess engine is difficult. In this project, we attempt to create, without a team of experts, a chess engine that can perform at a reasonable playing level.

1.1 A History of the Chess Engine [1]

Hungarian engineer Baron Wolfgang von Kempelen built the first chess-playing machine in 1769. Known as The Turk, it was not actually a machine but a contraption with a human chess master hidden inside. Since 1769, chess-playing machines have progressed significantly. Mathematician Claude Shannon was one of the first to create an algorithm for a chess engine. His algorithm looks at every possible sequence of moves that can be played by both players in a given board state. Each sequence, or continuation, is analyzed to a certain depth, that is, the number of moves in the continuation. For example, if white plays 8 moves and black plays 7 moves in a certain continuation, the depth of the continuation is 15. The algorithm evaluates the board state at the end of every continuation to determine the best move in the original board state. Shannon also realized that chess has many possible continuations, so he devised a pruning algorithm to reduce the number of continuations analyzed by the engine. Fewer continuations mean that each continuation can be analyzed to a greater depth. Over time, mathematicians have focused on increasing the depth of the continuations analyzed and improving the algorithm used to evaluate positions.

1.2 FEN String [2] and Evaluation Function [3]

In chess engines, Forsyth-Edwards Notation (FEN) strings are used to represent a board state. FEN strings provide the information necessary to restart a game from a particular board state and contain 6 fields: an alphanumeric representation of the board state; the side to move; castling availability for both sides; the en passant target square; the number of half-moves since the last pawn move or capture (used in claiming a draw); and the number of moves played in the game. The FEN string for Figure 1 is 8/8/8/8/3B1k2/8/1r6/4K3 b - - 0 35. Chess engines read a FEN string and generate the evaluation of the board state using an evaluation function. A board state with a positive evaluation corresponds to an advantage for white, while a board state with a negative evaluation corresponds to an advantage for black. The larger the magnitude of the evaluation, the larger the advantage. An engine's evaluation function uses weighted positional metrics to determine the evaluation of the position. Figure 1 illustrates how a board state could be evaluated:

Figure 1: In this example, only three metrics are used to evaluate the board state. The first metric, with weight a, is the number of white bishops minus the number of black bishops on the board; the second, with weight b, is the number of white rooks minus the number of black rooks on the board; and the third, with weight c, is the number of white bishops attacking black rooks minus the number of black bishops attacking white rooks. The first metric value is +1, because white has one bishop on the board and black has none. The second metric value is -1, because white has no rooks on the board and black has one. The third metric value is +1, because white has one bishop attacking a black rook, while black has none. From white's perspective, the evaluation of this board state is (+1)a + (-1)b + (+1)c, or a - b + c.

In Figure 1, each metric value was multiplied by the corresponding weight, and the products were summed over the three metrics. Equation 1 generalizes an evaluation function for n metrics:

$$e = \sum_{i=1}^{n} w_i m_i$$

Equation 1: m_i is the value of metric i; w_i is the weight of metric i; e is the evaluation of the board state
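The Figure 1 calculation can be sketched in a few lines of Python. This is only an illustration, not the paper's Stockfish code: the helper names (parse_board, bishop_attacks, metric_values, evaluate) are hypothetical, and the numeric values chosen for a, b, and c are placeholders, since the paper leaves them symbolic at this point.

```python
# Illustrative sketch: evaluate the Figure 1 position with the three metrics.
# All function names and the weights a, b, c are assumptions for this example.

def parse_board(fen):
    """Return {(file, rank): piece} from the first FEN field (files/ranks 0-7)."""
    board = {}
    rows = fen.split()[0].split("/")
    for row_index, row in enumerate(rows):
        rank = 7 - row_index            # FEN lists rank 8 first
        file = 0
        for ch in row:
            if ch.isdigit():
                file += int(ch)         # a digit means that many empty squares
            else:
                board[(file, rank)] = ch
                file += 1
    return board

def bishop_attacks(board, square):
    """Yield the first piece met on each diagonal from `square`."""
    for df, dr in ((1, 1), (1, -1), (-1, 1), (-1, -1)):
        f, r = square[0] + df, square[1] + dr
        while 0 <= f < 8 and 0 <= r < 8:
            if (f, r) in board:
                yield board[(f, r)]
                break
            f, r = f + df, r + dr

def metric_values(fen):
    """Compute the three Figure 1 metrics from white's perspective."""
    board = parse_board(fen)
    pieces = list(board.values())
    bishops = pieces.count("B") - pieces.count("b")
    rooks = pieces.count("R") - pieces.count("r")
    white_hits = sum(t == "r" for sq, p in board.items() if p == "B"
                     for t in bishop_attacks(board, sq))
    black_hits = sum(t == "R" for sq, p in board.items() if p == "b"
                     for t in bishop_attacks(board, sq))
    return [bishops, rooks, white_hits - black_hits]

def evaluate(metrics, weights):
    """Equation 1: weighted sum of metric values."""
    return sum(w * m for w, m in zip(weights, metrics))

FIGURE_1_FEN = "8/8/8/8/3B1k2/8/1r6/4K3 b - - 0 35"
metrics = metric_values(FIGURE_1_FEN)
weights = [3.0, 5.0, 1.0]               # placeholder values for a, b, c
score = evaluate(metrics, weights)      # a - b + c with the placeholders
print(metrics, score)
```

With these placeholder weights the position evaluates as a - b + c = 3 - 5 + 1, a small advantage for black, because the rook outweighs the bishop even after the attacking relationship is counted.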

1.3 Minimax and Alpha-Beta Pruning [4]

Chess engines use Minimax with alpha-beta pruning in conjunction with an evaluation function to arrive at the best move in a board state. Minimax is a decision rule used to minimize the possible loss in a worst-case scenario. It evaluates every possible board state n moves after the current board state using the engine's evaluation function and determines which move would result in the best scenario given perfect play from the opponent. The Minimax algorithm has exponential runtime complexity: with roughly b legal moves per position, a search to depth n visits on the order of b^n board states, i.e. O(b^n). Because Minimax is so expensive, most modern chess engines use an alpha-beta algorithm to prune branches in the tree of possible continuations. The algorithm stops evaluating a possible move as soon as at least one continuation has been found that proves the move to be worse than a previously examined move. Although searching the pruned tree still has exponential runtime complexity, fewer branches of the tree are examined, so each branch can be explored to a greater depth.
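As a minimal sketch of the idea, the Python below runs plain Minimax and alpha-beta pruning on a tiny hand-made game tree. The tree, its leaf evaluations, and the function names are assumptions made for illustration; a real engine would generate moves from a board state and call its evaluation function at the leaves.

```python
import math

def minimax(node, maximizing):
    """Plain Minimax over a tree of nested lists; leaves are evaluations."""
    if not isinstance(node, list):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)

def alphabeta(node, maximizing, alpha=-math.inf, beta=math.inf, visited=None):
    """Minimax with alpha-beta pruning; `visited` records evaluated leaves."""
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)
        return node
    if maximizing:
        best = -math.inf
        for child in node:
            best = max(best, alphabeta(child, False, alpha, beta, visited))
            alpha = max(alpha, best)
            if alpha >= beta:       # this move is already refuted: stop
                break
        return best
    best = math.inf
    for child in node:
        best = min(best, alphabeta(child, True, alpha, beta, visited))
        beta = min(beta, best)
        if alpha >= beta:           # a previous move is at least as good: stop
            break
    return best

# Depth-2 toy tree: three candidate moves, each with two replies.
tree = [[3, 5], [2, 9], [0, 1]]
leaves = []
best_plain = minimax(tree, True)
best_pruned = alphabeta(tree, True, visited=leaves)
print(best_plain, best_pruned, leaves)
```

Both searches agree on the value of the root, but the pruned search skips the second reply in the last two branches: once one reply shows a move to be worse than the first move, the remaining replies cannot change the decision.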

1.4 Problems with Existing Evaluation Functions [5]

Chess engines use thousands of weighted positional metrics, created and hand-tuned by a panel of experts. Stockfish, an open-source chess engine, uses metrics such as pawn structure, king safety, and piece mobility to evaluate board states; each of these metrics is then subdivided into hundreds of more specific metrics. Each of these metrics has a middlegame weight and an endgame weight, so the weight that applies during evaluation depends on the stage of the game. Creating a world-class evaluation function requires expertise in chess. The purpose of this project was to create a relatively accurate evaluation function without experts. We created a simplified list of metrics without experts and weighted these metrics by applying a least squares regression to a database of board state evaluations. We replaced the evaluation function in an open-source chess engine with our regression-based evaluation function and played games to determine how well our regression-based chess engine could play.

2. Methods
2.1 Collecting Board States and Evaluations in Rybka
We used the tool AutoIT [6] to collect p = 6588 board states and evaluations from Deep Rybka 4 [7], a world champion chess engine. The AutoIT script performs the following tasks:
1. Choose a random board state from Rybka's database of 1.5 million games
2. Use Rybka's Copy Position to export the board state as a FEN string
3. Copy the FEN string to a data file
4. Start Rybka's analysis engine
5. Wait 30 seconds for the evaluation value to converge
6. Match the evaluation of the board state with the FEN string obtained in Step 3
7. Repeat the process for another random board state

2.2 Representing our System of Equations


An amateur would evaluate a board state by looking for material imbalances (comparing the pieces on the white and black sides) and attacking relationships (e.g. pawn attacking queen, knight attacking a rook) on the board. So, we presented a list of n = 55 metrics with weights initially unknown that only consider piece weights and attacking relationships. Since we have a database of p board states and corresponding evaluations, we represent the system of p equations and n unknown metric weights in the form Ax = b, where A is the matrix of metric values derived from every board state, x is the unknown vector of weights, and b is the vector of Rybka evaluations:

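$$
\begin{pmatrix}
m_{11} & m_{12} & \cdots & m_{1n} \\
m_{21} & m_{22} & \cdots & m_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
m_{p1} & m_{p2} & \cdots & m_{pn}
\end{pmatrix}
\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix}
=
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_p \end{pmatrix}
$$

Equation 2: m_{pn} is the value of the nth metric in the pth board state, w_n is the unknown weight of the nth metric, and e_p is the known evaluation of the pth board state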

2.3 Collecting Metric Values


To collect our matrix of metric values, the FEN strings collected in Rybka were imported into Stockfish. We modified Stockfish to compute metric values for the p board states. Figure 2 gives pseudo code describing our modifications.

    // create a matrix of metric values
    // with p rows and n columns
    metricValues[p][n];
    // index of the first FEN string to be read
    boardState = 0;
    while (boardState < p) {
        read boardState
        for (metricNumber = 0; metricNumber < n; metricNumber++) {
            metricValues[boardState][metricNumber] = <compute metric value of metric>
        }
        // next boardState
        boardState++;
    }

Figure 2: Pseudo code describing how the matrix of metric values is created using the database of board states.

2.4 Calculating Metric Weights


Assuming no two board states are the same, the columns of the matrix in Equation 2 are linearly independent, and p >> n, we have an overconstrained system, so no single set of metric weights satisfies all the equations. We used a least squares regression [8] in Mathematica 8 [9] to calculate the set of metric weights that minimizes the sum of the squared errors between our evaluations and the database evaluations, which automates the weighting of our metrics. The table in the Appendix displays the 55 metrics, their weights, and the method of computing each metric value. We replaced the evaluation function in Stockfish, an open-source chess engine, with our weighted metrics. Figure 3 shows how our modified version of Stockfish evaluates a board state.

    weights[n] = <import weights>
    metricValues[n];
    for (metricNumber = 0; metricNumber < n; metricNumber++) {
        metricValues[metricNumber] = <compute metric value of the metric>
    }
    eval = <dot product of metricValues and weights>

Figure 3: Pseudo code describing how a board state is evaluated. A list of metric values is computed and dotted with a list of weights to obtain the evaluation.
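The regression step can be illustrated with a small, self-contained Python sketch. It is not the Mathematica computation used in this project: it solves the normal equations (A^T A) x = A^T b by Gauss-Jordan elimination, which is adequate for a toy example but numerically less robust than the methods a dedicated library would use. The function names, the 5 x 2 matrix, and the "evaluations" are all made up for illustration.

```python
def least_squares(A, b):
    """Minimize ||Ax - b||^2 by solving the normal equations (A^T A) x = A^T b."""
    p, n = len(A), len(A[0])
    # Augmented n x (n+1) system [A^T A | A^T b].
    M = [[sum(A[k][i] * A[k][j] for k in range(p)) for j in range(n)]
         + [sum(A[k][i] * b[k] for k in range(p))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("metric columns are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate(metric_values, weights):
    """Dot product of metric values and weights, as in Figure 3."""
    return sum(m * w for m, w in zip(metric_values, weights))

# Toy data: 5 board states, 2 metrics. The "true" weights and the noise
# stand in for engine evaluations and are invented for this example.
A = [[1, 0], [0, 1], [1, 1], [2, -1], [-1, 2]]
true_w = [3.0, 5.0]
noise = [0.1, -0.1, 0.0, 0.05, -0.05]
b = [evaluate(row, true_w) + e for row, e in zip(A, noise)]

weights = least_squares(A, b)
print(weights)                      # close to [3.0, 5.0]
print(evaluate([1, -1], weights))   # evaluation of a new board state
```

With more equations than unknowns, no weight vector reproduces every noisy evaluation exactly; the fitted weights land near the values that generated the data, and the same dot product then scores positions the regression never saw.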

2.5 Non-Regression-Based Engine


Given our knowledge of chess, we wanted to test whether a regression-based engine played at a higher level than an engine whose metrics were weighted without a regression. Of our 55 metrics, we knew only the weights of the 5 piece value metrics from convention (Table 1). These weighted metrics became the basis of our non-regression-based engine's evaluation function.

Metric    Weight (non-regression)    How metric value is computed
Pawn      1                          Number of white pawns - number of black pawns
Knight    3                          Number of white knights - number of black knights
Bishop    3                          Number of white bishops - number of black bishops
Rook      5                          Number of white rooks - number of black rooks
Queen     9                          Instance of white queen - instance of black queen

Table 1: 5 metrics and their known weights for the non-regression-based evaluation function

2.6 Playing Games on the Internet Chess Club

The original version of Stockfish could only evaluate a board state. We modified it to be able to play an entire game. The regression-based engine played 31 games against computer bots on the Internet Chess Club (ICC) [10], while the non-regression-based engine played 24 games against computer bots on the ICC. For every game, a numerical score was assigned to each possible result: 1 for a win, 0.5 for a draw, and 0 for a loss. After each engine had played all its games, the average scores and opponent ratings were calculated and entered into a rating calculator on the United States Chess Federation [11] website to determine the rating of each engine.
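The scoring step can be sketched as follows. The exact formula behind the rating calculator is not given in this paper, so the sketch substitutes a commonly used linear approximation of performance rating (average opponent rating plus 800 times the margin of the average score over 0.5); it should not be read as the calculator's method. The game results and opponent ratings below are hypothetical.

```python
def average_score(results):
    """results: 'W', 'D', or 'L' per game from the engine's point of view."""
    points = {"W": 1.0, "D": 0.5, "L": 0.0}
    return sum(points[r] for r in results) / len(results)

def performance_rating(avg_opponent_rating, avg_score):
    """Linear approximation (an assumption, not the calculator's formula):
    a 100% score maps to +400 points, a 0% score to -400 points."""
    return avg_opponent_rating + 800.0 * (avg_score - 0.5)

# Hypothetical mini-sample of four games.
results = ["W", "W", "D", "L"]
opponents = [2300, 2350, 2400, 2450]

score = average_score(results)
rating = performance_rating(sum(opponents) / len(opponents), score)
print(score, rating)
```

The approximation makes the dependence visible: the estimate moves with both the strength of the opposition and the fraction of points scored against it, which is why the average score and the opponent ratings are both needed.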

3. Data and Results


The weights of pieces obtained from the regression-based evaluation function are different from the conventional weights of those pieces. Graph 1 depicts the difference between the two sets of weights for the different pieces:

[Graph 1: bar chart titled "Discrepancy Between Conventional and Regression-based Piece Weights". Vertical axis: Piece Weight (0 to 10). Horizontal axis: Knight, Bishop, Rook, Queen. Series: Theoretical Weights (3, 3, 5, 9) and Regression-based Weights.]

Graph 1: Illustrates the difference between the expected piece weights and the regression-based piece weights when the weight of a pawn is normalized (set to 1). The expected weight is larger than the regression-based weight for each piece, and the magnitude of the difference increases with the weight of the piece.

Our attacking relationship metrics can be divided into 4 categories:
1. Side to move's lower weighted piece attacking a higher weighted piece
2. Side to move's higher weighted piece attacking a lower weighted piece
3. Defending side's lower weighted piece attacking a higher weighted piece
4. Defending side's higher weighted piece attacking a lower weighted piece
Table 2 shows the average weights for these four metric categories:

                  Lower Weighted -> Higher Weighted    Higher Weighted -> Lower Weighted
Side to Move      2.62 (1)                             0.11 (2)
Defending Side    0.39 (3)                             0. (4)

Table 2: Shows the average weights of metrics in each attacking relationship category (category number in parentheses). Illustrates the importance of attacking relationships in the 1st category (side to move's lower weighted pieces attacking the opponent's higher weighted pieces)

There is a piece weight difference between the two pieces in an attacking relationship. Going by the weights in the Appendix, the piece weight difference for a pawn attacking a queen is 6.64 (the weight of a queen) - 1.04 (the weight of a pawn), or 5.60. Graph 2 highlights the correlation between the regression-based weight of the attacking relationship and the piece weight difference for attacking relationships in the 1st category.
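The quantities in this paragraph can be recomputed from the Appendix. The Python sketch below takes the side-to-move weights for lower weighted pieces attacking higher weighted pieces, averages them, computes each piece weight difference, and then computes the squared Pearson correlation between the two. The paper does not list the exact points plotted in Graph 2, so which relationships belong in the fit is an assumption here, and the correlation this sketch yields need not equal the reported value.

```python
# Regression-based piece weights from the Appendix.
PIECE = {"P": 1.03723, "N": 2.27615, "B": 2.38974, "R": 3.35919, "Q": 6.64456}

# Side-to-move weights from the Appendix, keyed by (attacker, target),
# restricted to a lower weighted piece attacking a higher weighted one.
FIRST_CATEGORY = {
    ("P", "N"): 1.33703, ("P", "B"): 1.74922, ("P", "R"): 1.02793,
    ("P", "Q"): 4.7279, ("N", "B"): 0.760167, ("N", "R"): 1.02338,
    ("N", "Q"): 5.36788, ("B", "R"): 1.79835, ("B", "Q"): 4.54408,
    ("R", "Q"): 3.86158,
}

def mean(values):
    return sum(values) / len(values)

def r_squared(xs, ys):
    """Squared Pearson correlation, equal to R^2 of a simple linear fit."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

# Piece weight difference = weight of the target - weight of the attacker.
differences = [PIECE[target] - PIECE[attacker]
               for attacker, target in FIRST_CATEGORY]
relationship_weights = list(FIRST_CATEGORY.values())

category_average = mean(relationship_weights)
pawn_queen_difference = PIECE["Q"] - PIECE["P"]
r2 = r_squared(differences, relationship_weights)
print(round(category_average, 2), round(pawn_queen_difference, 2), r2)
```

The category average rounds to the 2.62 shown in Table 2 for the 1st category, and the pawn-queen difference agrees with the 5.60 worked out above up to rounding of the two piece weights.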

[Graph 2: scatter plot titled "Correlation between Piece Weight Difference and Attacking Relationship Weight". Horizontal axis: Piece Weight Difference (0 to 6). Vertical axis: Regression-based Weight (0 to 6). Series: Difference vs. Weight, with a linear fit, R^2 = 0.8224.]

Graph 2: Illustrates the strong positive correlation (R^2 = 0.82) between piece weight difference and regression-based weight for attacking relationships in the 1st category.

The regression-based engine obtained an ICC rating of 2346, which is better than 99.5% of active players on the ICC, while the non-regression-based engine obtained an ICC rating of 2213, which is better than 98.6% of active players on the ICC. Graph 3 depicts trends in score due to opponent rating.

[Graph 3: bar chart titled "How Opponent Playing Level Affects Results". Vertical axis: Average Score (0 to 0.8). Horizontal axis: Opponent Playing Level (Under 2400, 2400, Over 2400). Series: Regression and Non Regression.]

Graph 3: Illustrates the average score of both the regression-based and non-regression-based engines against opponents rated under 2400, 2400, and above 2400. For both engines, the average score decreased as the playing level increased. The regression-based engine scored higher against opponents of all categories, especially opponents rated above 2400, scoring an average of 0.36 against opponents rated over 2400, versus the 0.00 scored by the non-regression-based engine.

4. Discussion
4.1 Results from the Games
Our regression-based model played better than 99.5% of people on the ICC, while the non-regression-based engine played better than 98.6% of people on the ICC. The regression-based engine held its ground against higher rated players much more consistently than the non-regression-based engine. Also, the regression-based engine both lost to and won against players of a higher caliber than the non-regression-based engine did. Both the regression-based and non-regression-based engines performed at a reasonable playing level, which suggests that someone with little knowledge of chess can make a decent chess evaluation function. The regression-based engine outperformed the non-regression-based engine, so using a regression to model metric weights was more effective than using basic chess conventions to model these weights.

4.1.1 Small Sample Size of Games Played


The small sample size of games played (31 for the regression-based engine, 24 for the non-regression-based engine) means that a few outlier game results may be affecting both the engines' ratings and the discrepancies between the regression-based and non-regression-based engines. We played the games on the ICC manually (we did not create a bot interface), so playing games took time, and we could not collect more games.

Our engine also evaluates certain types of board states more accurately than others. From observation, the regression-based engine converted advantageous board states into wins more effectively in the middlegame than in the endgame. A majority of the 6588 board states and evaluations in our database came from the middlegame because many of the 1.5 million source games end in the middlegame. The regression-based evaluation function therefore used metric weights that favored middlegame board states over endgame board states, so an engine whose games happened to involve more middlegame than endgame board states would report better results.

4.2 Metric Weights


Although most metrics followed an expected trend, there were some discrepancies between the expected metric weights and our regression-based metric weights. The regression-based weights of our pieces were smaller than, and in a different ratio from, the conventional weights of those pieces. The pawn x pawn attacking relationship for the side to move received a strongly negative weight, contrary to expectation. For lesser-weighted pieces attacking larger-weighted pieces, a few weights did not adhere to the correlation between the weight of an attacking relationship and the difference between the weights of the two pieces.

4.2.1 Important 1st Category Attacking Relationships


1st category attacking relationships were weighted much higher than other types of attacking relationships. In 1st category attacking relationships, the active side has the chance to trade a lowly weighted piece for a highly weighted piece. By contrast, in 2nd category relationships, the side to move does not have the chance to make a beneficial trade of pieces, and in the 3rd category, the side to move can move the highly weighted piece before the defending side can capture it.

4.2.2 Correlation between Piece Weights and Attacking Relationship Weights in 1st Category
The metric weights in the Appendix mostly conformed to the expected metric weights. For the side to move, a large positive weight was given to lesser-weighted pieces attacking pieces with larger weights. After the exchange of pieces, the evaluation of the board state should favor the attacking side, which traded off a lesser piece for the opponent's better piece. As expected, the magnitude of the regression-based metric weights strongly correlated (R^2 = 0.82) with the piece weight difference. The strong correlation suggests that we could have used piece weight difference instead of a regression to weight attacking relationships in the 1st category, the most significant category.

4.3 Discrepancies in Piece Weight


The most surprising result was the discrepancy between the conventional piece weights and the regression-based piece weights. Based on chess convention, the normalized regression piece weights should match those shown in Table 1. While the regression ranked the pieces in the same increasing order of magnitude, the weights did not conform to the conventional ratio.

4.3.1 Sample Set of Board States


The weights 1, 3, 3, 5, 9 can vary by board state. In board states with closed pawn structures, the knight is more valuable than its nominally equal counterpart, the bishop, and far advanced pawns that threaten to become queens are usually worth more than their given weight. Although the 6588 sample board states should have averaged out such variation in the weights of the metrics, every board state was collected from a game between two top-level opponents. Perhaps the weights of pieces in high-level games stray from the 1, 3, 3, 5, 9 ratio. To normalize the effect of player level in the sample games on metric weights, we should collect board states from games contested between players of various playing strengths and perform the regression on the new data set.

4.3.2 Effect of other Metrics on Piece Weight


The 1, 3, 3, 5, 9 weights are generally used when piece weights are the only positional metrics. In other words, these weights are used so that anybody can quickly and somewhat accurately evaluate a board state. However, we used 50 other metrics in conjunction with the piece weight metrics. The weights of these 50 metrics may be compensating for the weights of pieces, skewing the piece weight ratios. To isolate the effect of our 50 additional metrics, we must perform a regression on only the 5 piece weight metrics. If the new weighted metrics still have different weight ratios and an engine using these weighted metrics performs better than an engine using the 1, 3, 3, 5, 9 weighted metrics, the hypothesis that the existing 1, 3, 3, 5, 9 weight ratio is not optimal would be supported.

4.4 Pawn x Pawn Attacking Relationship Disadvantage


The weight of the Pawn x Pawn metric for the active side was large and negative, suggesting that it is a significant disadvantage to attack an enemy pawn. We expected all attacking relationships to favor the active side or at least be insignificant in the evaluation of the board state. The most likely explanation for this large negative weight is that a majority of the board states with pawn x pawn relationships happened to be losing for the active side, even with the large sampling of board states. To explore the correlation between the pawn x pawn attacking relationship and board state evaluation, more board states would have to be collected to reduce the effect of outliers in the current data set.

4.5 Some Weights did not follow the Correlation


For lesser pieces attacking greater pieces, there was a positive correlation (R^2 = 0.82) between the weight of an attacking relationship and the difference between the weights of the two pieces. However, not every attacking relationship followed this trend. For example, the Pawn x Knight relationship held more weight than the Pawn x Rook relationship even though the latter is a more favorable exchange. All of the 6588 board states were taken from games contested between players of a high caliber. High-level players don't leave pieces hanging, so the only time the active side's lesser pieces attack the opponent's greater pieces is during a piece exchange. Even though we have 6588 board states, only a few of them capture a snapshot of a piece exchange. To determine the true strength of the correlation between attacking relationship weight and the difference between the weights of the pieces, we must collect more board states in the midst of a piece exchange.

5. Direction of Future Research


5.1 Implications in Other Fields
The idea of creating simplified, cost-effective models like the one we used for a chess engine's evaluation function could be tested in other fields like medicine. Doctors use subjective criteria to determine the risk of a disease such as heart disease. If a group of doctors can compile a set of known risk factors for heart disease and collect EKGs and blood samples from patients who suffered a heart attack, they can use regression modeling to create a universal set of weights for these risk factors. The use of regression modeling to weight risk factors can be extended beyond heart disease to conditions like cancer, diabetes, and obesity.

5.2 Implications in Chess


Our regression modeled metric weights for 55 metrics, while current world-class engines use thousands of weighted metrics. Since there is a correlation between the number of metrics and the playing strength of the engine, we can continue adding metrics to our regression model to try to increase the playing level of our engine. We can also give each metric a middlegame weight and an endgame weight so that the engine evaluates each part of the game with equal accuracy. Eventually, the goal is to have our engine's playing level surpass Stockfish's.

6. Acknowledgements
Thanks to my mentor, Dr. Peter Danzig, for his guidance throughout the development of the paper, Mr. Richard Page, whose class project inspired me to pursue this project and who worked with me over the summer on my project goals, and Mr. Christopher Spenner for his advice in writing this paper. Also thanks to the very helpful support staff at stockfishchess.org.

7. References
[1] A short history of computer chess. Accessed November 11, 2012. http://www.chessbase.com/columns/column.asp?pid=102.
[2] FEN Standard. Accessed November 11, 2012. http://www.chessville.com/Reference_Center/FEN_Description.htm.
[3] Shannon, Claude E. "Programming a Computer for Playing Chess." Philosophical Magazine, November 8, 1949. Accessed November 11, 2012. http://vision.unipv.it/IA1/ProgrammingaComputerforPlayingChess.pdf.
[4] Minimax search and Alpha-Beta Pruning. Last modified 2002. Accessed November 11, 2012. http://www.cs.cornell.edu/courses/cs312/2002sp/lectures/rec21.htm.
[5] Romstad, Tord, Marco Costalba, Joona Kiiski, Daylen Yang, Salvo Spitaleri, and Jim Ablett. Stockfish. Version 2.3.1. 2012. Stockfish. Accessed November 11, 2012. http://stockfishchess.org/.
[6] AutoIT. Version 3.3.8.1. 2012. autoitscript. Accessed November 11, 2012. http://www.autoitscript.com/site/.
[7] Rajlich, Vasik. Rybka. Version 4. 2010. CD-ROM.
[8] Papoulis, Athanasios, and S. Unnikrishna Pillai. Probability, Random Variables and Stochastic Processes. 4th ed. New York, NY: McGraw-Hill, 2002.
[9] Wolfram, Stephen. Mathematica. Version 8.0.4. Champaign, IL: Wolfram, 2011. CD-ROM.
[10] BlitzIn. Version 3.0.5. 2011. ICC. Accessed November 11, 2012. http://www.chessclub.com/download-software.
[11] Rating Estimator. Last modified August 4, 2012. Accessed November 11, 2012. http://www.uschess.org/content/view/9177/679/.

8. Appendix
Metric              Regression (weight)    How metric value is computed

Side to Move
Pawn X Pawn         -1.08354               Instances of pawns attacking pawns
Pawn X Knight       1.33703                Instances of pawns attacking knights
Pawn X Bishop       1.74922                Instances of pawns attacking bishops
Pawn X Rook         1.02793                Instances of pawns attacking rooks
Pawn X Queen        4.7279                 Instances of pawns attacking queen
Knight X Pawn       0.0956523              Instances of knights attacking pawns
Knight X Knight     -0.228428              Instances of knights attacking knights
Knight X Bishop     0.760167               Instances of knights attacking bishops
Knight X Rook       1.02338                Instances of knights attacking rooks
Knight X Queen      5.36788                Instances of knights attacking queen
Bishop X Pawn       0.150581               Instances of bishops attacking pawns
Bishop X Knight     0.297229               Instances of bishops attacking knights
Bishop X Bishop     0.0591714              Instances of bishops attacking bishops
Bishop X Rook       1.79835                Instances of bishops attacking rooks
Bishop X Queen      4.54408                Instances of bishops attacking queen
Rook X Pawn         0.188971               Instances of rooks attacking pawns
Rook X Knight       0.426902               Instances of rooks attacking knights
Rook X Bishop       0.760438               Instances of rooks attacking bishops
Rook X Rook         0.528506               Instances of rooks attacking rooks
Rook X Queen        3.86158                Instances of rooks attacking queen
Queen X Pawn        0.196079               Instances of queen attacking pawns
Queen X Knight      -0.00773957            Instances of queen attacking knights
Queen X Bishop      0.453631               Instances of queen attacking bishops
Queen X Rook        -0.279459              Instances of queen attacking rooks
Queen X Queen       0.18281                Instance of queen attacking queen

Defending Side
Pawn X Pawn         -1.14423               Instances of pawns attacking pawns
Pawn X Knight       -0.0870649             Instances of pawns attacking knights
Pawn X Bishop       0.230158               Instances of pawns attacking bishops
Pawn X Rook         0.863338               Instances of pawns attacking rooks
Pawn X Queen        0.363773               Instances of pawns attacking queen
Knight X Pawn       0.0560502              Instances of knights attacking pawns
Knight X Knight     -0.281736              Instances of knights attacking knights
Knight X Bishop     0.180345               Instances of knights attacking bishops
Knight X Rook       -0.10701               Instances of knights attacking rooks
Knight X Queen      0.859854               Instances of knights attacking queen
Bishop X Pawn       0.120128               Instances of bishops attacking pawns
Bishop X Knight     0.1253                 Instances of bishops attacking knights
Bishop X Bishop     -0.422367              Instances of bishops attacking bishops
Bishop X Rook       0.233419               Instances of bishops attacking rooks
Bishop X Queen      0.529539               Instances of bishops attacking queen
Rook X Pawn         0.109619               Instances of rooks attacking pawns
Rook X Knight       0.573432               Instances of rooks attacking knights
Rook X Bishop       0.421293               Instances of rooks attacking bishops
Rook X Rook         0.119157               Instances of rooks attacking rooks
Rook X Queen        0.878515               Instances of rooks attacking queen
Queen X Pawn        0.039703               Instances of queen attacking pawns
Queen X Knight      0.020088               Instances of queen attacking knights
Queen X Bishop      0.499491               Instances of queen attacking bishops
Queen X Rook        -0.0405067             Instances of queen attacking rooks
Queen X Queen       -0.18281               Instance of queen attacking queen

Piece Value
Pawn                1.03723                Number of white pawns - number of black pawns
Knight              2.27615                Number of white knights - number of black knights
Bishop              2.38974                Number of white bishops - number of black bishops
Rook                3.35919                Number of white rooks - number of black rooks
Queen               6.64456                Instance of white queen - instance of black queen
