Connect-4 Minimax and Learning a Neural Network Utility Function

(Practicum in Artificial Intelligence and Advanced Artificial Intelligence Project)

BOOM 2002

 

Motivation and Background:

Game playing in general is an important artificial intelligence domain because it is hoped that many of the problems solved in the organized structure of a game can be extrapolated to the real world.  This is because the essential structure of a game -- that is, a defined virtual world where players compete for goals -- is similar to the real world, where an intelligent agent (be it a robot or a search engine) tries to reason about and accomplish some task.  In addition, it is believed that if computers can become good at tasks thought to require intelligence (like game playing), then those computers can be considered intelligent.

 

A great deal of research in game playing has revolved around search-based play.  The approach is basically the following: when the AI makes a move, it examines all possible moves it can make and the resulting board states.  From those board states, it then looks at all possible moves the opponent can make, and from the states the opponent can reach, it again calculates all of its own possible moves.  This process is repeated to some depth, or ply, and if winning boards aren’t found, a board evaluation function is applied that rates how good the board states at the search frontier are.  These ratings are then propagated up the game tree and used to determine which move the AI should make.  This is the approach taken by the famed “Deep Blue” chess supercomputer, which defeated the human world chess champion, Garry Kasparov.
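As a rough illustration of this kind of search, here is a minimal depth-limited minimax sketch in Python.  The helper names (legal_moves, apply_move, is_terminal, evaluate) are assumptions standing in for whatever board representation a program uses; this is a sketch of the general technique, not the code from my project.

    def minimax(board, depth, maximizing, legal_moves, apply_move, is_terminal, evaluate):
        """Depth-limited minimax search.  Helpers are hypothetical:
        legal_moves(board) lists moves, apply_move(board, m) returns the new
        board, is_terminal(board) detects won/lost/full boards, and
        evaluate(board) is the board evaluation function."""
        if depth == 0 or is_terminal(board):
            return evaluate(board)                 # score the search frontier
        values = [minimax(apply_move(board, m), depth - 1, not maximizing,
                          legal_moves, apply_move, is_terminal, evaluate)
                  for m in legal_moves(board)]
        # The AI maximizes its own rating; the opponent is assumed to minimize it.
        return max(values) if maximizing else min(values)

At the root of the tree, the AI simply plays the move whose subtree receives the highest propagated value.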

 

Many people would not consider such a system to be intelligent.  Indeed, Deep Blue lacks the vital ability to learn from its own play.  While Deep Blue did include some learning aspects and was in some sense “trained” by a human grandmaster, the process wasn’t very analogous to the manner in which humans learn to play chess.

 

This is in marked contrast to recent success in the game of backgammon, where a neural network was trained, using a learning method known as temporal difference learning, to play at the level of the best human players simply by playing more than a million games against itself (combined with some search).  This type of learning, in which an AI learns from a final reward (winning or losing), is known as reinforcement learning.
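To make the idea concrete, here is a minimal sketch of a tabular temporal difference (TD(0)) update, the core rule behind this style of training.  The function name and the dictionary-based value table are illustrative assumptions; in the backgammon work the value function is a neural network, and the error term is backpropagated through the network’s weights rather than written into a table.

    def td0_update(V, state, next_state, reward, alpha=0.1, gamma=1.0):
        """Nudge the value of `state` toward the reward received plus the
        (discounted) value of the successor state."""
        v = V.get(state, 0.0)
        target = reward + gamma * V.get(next_state, 0.0)
        V[state] = v + alpha * (target - v)
        return V[state]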

 

One major difference between a game like chess and a game such as backgammon is that backgammon has a chance element: dice rolls introduce non-determinism into the game.  It has been hypothesized that this is one reason reinforcement learning works for backgammon but doesn’t work so spectacularly for chess.

 

Connect-4:

For my project I chose Connect-4, a game that, like chess, is deterministic.  The board set-up and game play are basically the same as in the popular children’s game (my board is slightly larger).  In my 473 project I implemented a search-based AI that evaluates board states with a hand-coded heuristic.  I am continuing this project in CS 672, examining various ways to bootstrap reinforcement learning for Connect-4.  This is done by first pre-training a neural network, using standard backpropagation, on board evaluations calculated by the search-based AI, and then updating the resulting network via reinforcement learning.  For the reinforcement learning segment of training I have used both self-play and play against the search player with the hand-coded heuristic.
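The sketch below outlines the shape of this two-phase training scheme.  It assumes a hypothetical network object with a backprop_update method and a play_game helper that returns the visited states and the final outcome; the reinforcement phase shown is the simplest variant, pushing every visited state toward the final reward, and is not my exact implementation.

    def pretrain(network, boards, search_eval):
        """Phase 1: supervised pre-training.  Fit the network, via ordinary
        backpropagation, to the evaluations produced by the search-based AI
        with the hand-coded heuristic."""
        for board in boards:
            network.backprop_update(board, search_eval(board))

    def reinforce(network, play_game, episodes):
        """Phase 2: reinforcement learning.  After each game (self-play or
        play against the search player), nudge the value of every visited
        state toward the final reward: +1 for a win, -1 for a loss, 0 for a
        draw."""
        for _ in range(episodes):
            states, outcome = play_game(network)
            for state in states:
                network.backprop_update(state, outcome)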

 

Evaluation of Play:

To evaluate the play of my heuristics (neural network and hand-coded), I wrote a baseline heuristic.  This heuristic returns a meaningful value only if a board is a terminal state, that is, a winning or losing state; otherwise it returns a constant.  This behavior is intended to strip away all knowledge of the game other than that imparted by search.  A search-based player using a heuristic that encapsulates real knowledge about boards should therefore be able to beat the baseline player, even when both search to the same ply, because the heuristic should guide the AI’s play in a “good” direction.  To evaluate the performance of the other heuristics, each is played against the baseline heuristic at various plies.
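A minimal sketch of such a baseline heuristic is shown below.  The winner helper, the player argument, and the particular constants are illustrative assumptions, not the values used in my project.

    WIN_VALUE  =  1000.0   # value of a board the player has already won
    LOSS_VALUE = -1000.0   # value of a board the player has already lost
    NEUTRAL    =     0.0   # constant returned for every non-terminal board

    def baseline_heuristic(board, player, winner):
        """Return a meaningful value only for terminal boards.

        winner(board) is a hypothetical helper that returns the player with
        four in a row, or None.  Because every non-terminal board looks the
        same to this heuristic, any skill the baseline player shows comes
        purely from search."""
        w = winner(board)
        if w is None:
            return NEUTRAL
        return WIN_VALUE if w == player else LOSS_VALUE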

 

Results:

So far I have obtained limited results via pre-training alone.  For some network topologies, the neural network appears capable of performing better than the hand-coded heuristic, although most trained networks perform approximately the same or worse.  Success with reinforcement learning has been limited, but this part of the project is still a work in progress.  Numerical results to come.

 

 

References:

 

Mitchell, Tom M.  Machine Learning.  McGraw-Hill.  New York, NY.  1997.

 

Russell, Stuart and Norvig, Peter.  Artificial Intelligence: A Modern Approach.  Prentice Hall.  Upper Saddle River, NJ.  1995.

 

Tesauro, G.  Temporal Difference Learning and TD-Gammon.  Communications of the ACM, 38(3): 58-68.  1995.