CS 473 - An Intelligent Agent For Risk


David Keppler and Edward Choi
Fall 2000
 

Introduction


The purpose of this project was to create an agent that could train itself to play Risk, the board game [5]. We expected this to be a significant challenge, for the game is inherently complicated, and we hoped the agent might offer new insight into game strategy, much as TD-Gammon changed the way human players approach backgammon. We could find no definitive strategy for Risk: many websites mention holding strong points, capturing Australia first, or consolidating one's forces, but none offers a clear winning approach. We did come across a few online versions of Risk with computer players, but they did not seem to be particularly strong [4]. This project was an attempt to create a worthy opponent to play against.

Our basic approach uses a neural network, trained with Temporal Difference (TD) learning, to output the utility of a candidate move. The agent picks the move with the highest utility and then “learns” by back-propagation.

Evaluating such an agent is difficult; we therefore give qualitative observations of how the agent played, and we also play it against agents with random weights to measure its performance.
 

Problem Definition


A game of Risk can be broken into two stages:

- Initial claiming of territories at the beginning of the game
- Player moves during the game, which consist of:
  - Placing armies in your territories
  - Attacking the territories of the other players
  - Moving armies between territories at the end of a turn

We determined that a neural net could be used effectively in both of these stages.

The neural net takes as input an array representation of the board: one cell for each of the 42 territories and one for each of the 6 continents. Each territory input is a true/false value encoded as 1 or 0, while each continent indicator is a percentage.
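As a concrete illustration, the sketch below assembles such an input vector. The class and method names are ours and do not appear in AIPlayer.java, and we assume here that each continent indicator is the fraction of that continent’s territories held by the player:

    // Hypothetical sketch of the 48-element network input: 42 territory-ownership
    // flags followed by 6 continent fractions. Names are ours, not from AIPlayer.java.
    public final class BoardEncoder {
        public static double[] encode(boolean[] owned,           // owned[t] is true if we hold territory t (42 entries)
                                      int[] territoryContinent,  // continent id (0..5) of each territory
                                      int[] continentSize) {     // number of territories in each continent
            double[] input = new double[owned.length + continentSize.length];
            int[] held = new int[continentSize.length];
            for (int t = 0; t < owned.length; t++) {
                if (owned[t]) {
                    input[t] = 1.0;                              // territory flag: 1 = owned, 0 = not owned
                    held[territoryContinent[t]]++;
                }
            }
            for (int c = 0; c < continentSize.length; c++) {
                // Continent indicator: assumed to be the fraction of the continent we hold.
                input[owned.length + c] = (double) held[c] / continentSize[c];
            }
            return input;
        }
    }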

Each step of a move, as described above, consists of calls to the neural net to evaluate the utility of possible moves and then logic to select the best move given the data generated by the neural network.

For the “initial claiming of territories” stage we run through the list of unclaimed territories and evaluate the board as though the agent already held the territory in question. The agent then claims the territory whose resulting board returns the highest utility.
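In outline, this claiming step is a one-ply greedy search over the free territories. The sketch below reuses the hypothetical BoardEncoder above and introduces a stand-in Evaluator interface for the trained network; it is illustrative only, not our actual AIPlayer code:

    // Hypothetical sketch of the claiming step: tentatively claim each free
    // territory, score the resulting board with the network, and keep the best.
    public final class ClaimStep {
        /** Stand-in for the trained network's evaluation call. */
        public interface Evaluator { double utility(double[] boardInput); }

        public static int bestClaim(boolean[] owned, boolean[] unclaimed,
                                    int[] territoryContinent, int[] continentSize,
                                    Evaluator net) {
            int best = -1;
            double bestUtility = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < owned.length; t++) {
                if (!unclaimed[t]) continue;                 // only consider free territories
                owned[t] = true;                             // tentatively claim territory t
                double u = net.utility(BoardEncoder.encode(owned, territoryContinent, continentSize));
                owned[t] = false;                            // undo the tentative claim
                if (u > bestUtility) { bestUtility = u; best = t; }
            }
            return best;                                     // index of the territory to claim
        }
    }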

In the “placement of armies” stage we create a “fake player” consisting of all of the enemy players combined and evaluate the board from its point of view. We then find the territories this amalgamated enemy is most likely to attack (see the attack stage below) and assign a certain percentage of our armies as “defense” to guard against these attacks. Finally, we evaluate the board from the agent’s own point of view to determine where it is most likely to want to attack from, and place a certain number of armies on “offense” to facilitate that attack.

In the “attack” stage we create a list of territories that the agent is capable of attacking. We then iterate through this list and keep only the moves for which the agent has a sufficient chance of winning, as determined by the simulated-battle method described below; we chose 60% as the threshold success probability for performing an attack. For each remaining possibility, we evaluate the board position as though the agent already owned the target territory; the board with the highest utility indicates the territory to attack first. After each successful attack we reevaluate the entire board and all possible moves and pick the next best move from the new position. The agent stops attacking when no remaining move meets both the utility-gain and success-probability criteria.
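The sketch below illustrates one pass of this selection. It reuses the hypothetical BoardEncoder and Evaluator above and assumes the per-territory win probabilities have already been estimated by the simulated-battle routine described in the next paragraph; only the 60% threshold is taken directly from our implementation:

    // Hypothetical sketch of one pass of attack selection. Everything except the
    // 60% threshold is illustrative rather than our actual code.
    public final class AttackStep {
        public static int pickTarget(boolean[] owned,        // current ownership, as in BoardEncoder
                                     boolean[] attackable,   // enemy territories adjacent to one of ours
                                     double[] winProb,       // estimated win probability for each territory
                                     int[] territoryContinent, int[] continentSize,
                                     ClaimStep.Evaluator net) {
            final double WIN_THRESHOLD = 0.60;               // minimum chance of taking the territory
            int best = -1;
            double bestUtility = Double.NEGATIVE_INFINITY;
            for (int t = 0; t < owned.length; t++) {
                if (owned[t] || !attackable[t]) continue;
                if (winProb[t] < WIN_THRESHOLD) continue;    // pre-filter unlikely attacks
                owned[t] = true;                             // score the board as if we had taken t
                double u = net.utility(BoardEncoder.encode(owned, territoryContinent, continentSize));
                owned[t] = false;
                if (u > bestUtility) { bestUtility = u; best = t; }
            }
            return best;                                     // -1 means no attack meets the criteria
        }
    }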

In the “fortify” stage we use a strategy similar to the “placement of armies” stage: we create a fake player composed of all of the enemies put together and determine, using the agent’s utility function, where this fake player is most likely to attack. If the enemy is likely to succeed, the agent searches for friendly territories adjacent to the threatened territory, called “buddies,” from which armies can be sent to help defend against the impending attack. Once a buddy is found, we move armies from it to the territory in danger.

To determine whether an attack is winnable and worth making, we run a “simulated battle” that pits the armies in the two combating territories against each other in a mock battle. We assume that both players (human or computer) fight as advantageously as possible, i.e., that each rolls the maximum number of dice allowed. A random number generator then determines who wins the battle and how many armies are left over. Averaging a number of such results gives the probability that the agent would win the battle given the numbers of armies on both sides. All of the probability values used in this function are calculated averages for the various combinations of dice rolls; see the comments in AIPlayer.java for more information.
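For illustration, a Monte Carlo version of this estimate is sketched below. It rolls the dice explicitly instead of using our precomputed roll averages, and the class and method names are hypothetical. A call such as BattleSimulator.winProbability(7, 4, 1000) would estimate the chance that seven attacking armies defeat four defenders over 1000 simulated battles:

    import java.util.Arrays;
    import java.util.Random;

    // Hypothetical Monte Carlo estimate of the chance that an attack succeeds.
    public final class BattleSimulator {
        private static final Random RNG = new Random();

        /** Fraction of simulated battles in which the attacker conquers the territory. */
        public static double winProbability(int attackerArmies, int defenderArmies, int trials) {
            int wins = 0;
            for (int i = 0; i < trials; i++) {
                if (simulateBattle(attackerArmies, defenderArmies)) wins++;
            }
            return (double) wins / trials;
        }

        /** Fight until the defender is eliminated or the attacker can no longer attack. */
        private static boolean simulateBattle(int attackers, int defenders) {
            while (attackers > 1 && defenders > 0) {
                int aDice = Math.min(3, attackers - 1);      // attacker must leave one army behind
                int dDice = Math.min(2, defenders);          // both sides roll the maximum dice allowed
                int[] a = roll(aDice);
                int[] d = roll(dDice);
                for (int k = 0; k < Math.min(aDice, dDice); k++) {
                    if (a[k] > d[k]) defenders--;            // ties go to the defender
                    else attackers--;
                }
            }
            return defenders == 0;
        }

        /** Roll n six-sided dice and return them sorted from highest to lowest. */
        private static int[] roll(int n) {
            int[] dice = new int[n];
            for (int i = 0; i < n; i++) dice[i] = RNG.nextInt(6) + 1;
            Arrays.sort(dice);
            for (int i = 0, j = n - 1; i < j; i++, j--) {    // reverse into descending order
                int tmp = dice[i]; dice[i] = dice[j]; dice[j] = tmp;
            }
            return dice;
        }
    }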

The neural net consists of 48 input nodes, 10 hidden nodes, and one output node. Each node has a sigmoid activation function, and the weights are modified by the standard back-propagation method used in class [1]. Unlike the standard back-propagation setting, however, the network has no outside error signal to compare its output against. A normal training set for a game as complex as Risk would be prohibitively large and would rely on an observer’s skill in rating the utility of many example board positions. The state space of possible boards is so large that we could not generate and rate enough examples to train the network meaningfully, and doing so would require assigning discrete numerical values to each example board, a very subjective process at best.
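For reference, evaluating a board with the 48-10-1 sigmoid network described above amounts to the following forward pass; the weight layout and names are illustrative and are not taken from our implementation:

    // Minimal forward pass of a fully connected 48-10-1 sigmoid network.
    public final class UtilityNet {
        private final double[][] hiddenWeights;  // [10][48 + 1], last column is the bias
        private final double[] outputWeights;    // [10 + 1], last entry is the bias

        public UtilityNet(double[][] hiddenWeights, double[] outputWeights) {
            this.hiddenWeights = hiddenWeights;
            this.outputWeights = outputWeights;
        }

        /** Returns the estimated utility of a board encoded as a 48-element input. */
        public double evaluate(double[] input) {
            double[] hidden = new double[hiddenWeights.length];
            for (int h = 0; h < hidden.length; h++) {
                double sum = hiddenWeights[h][input.length];       // bias term
                for (int i = 0; i < input.length; i++) {
                    sum += hiddenWeights[h][i] * input[i];
                }
                hidden[h] = sigmoid(sum);
            }
            double out = outputWeights[hidden.length];             // bias term
            for (int h = 0; h < hidden.length; h++) {
                out += outputWeights[h] * hidden[h];
            }
            return sigmoid(out);
        }

        private static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }
    }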

To overcome this problem we implement a form of Temporal Difference learning to self-generate a teaching signal to drive the back-propagation training algorithm. We use the output of the standard TD state utility function as the “expected” output value of the network for a particular input set and then feed this data into the back-propagation algorithm.

Board state rewards are also implemented to assign a reward of +1 to a winning position (all territories owned), and -1 to a losing board position (no territories owned).

Using this combined TD / back-propagation learning system, we can expect the neural network to learn the utility function on its own and to converge toward a good set of node connection weights.
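A minimal sketch of this combined update is shown below; the Network interface, learning rate, and discount factor are assumptions for illustration, not the exact interface or values used in our code:

    // Hypothetical sketch of the TD teaching signal driving back-propagation.
    public final class TdTrainer {
        private static final double ALPHA = 0.1;    // TD learning rate (assumed, kept small)
        private static final double GAMMA = 0.9;    // discount factor (assumed)

        public interface Network {
            double evaluate(double[] input);                     // forward pass
            void backPropagate(double[] input, double target);   // gradient step toward target
        }

        /** One update after observing the transition from prevBoard to nextBoard. */
        public static void update(Network net, double[] prevBoard, double[] nextBoard,
                                  double reward, boolean gameOver) {
            double current = net.evaluate(prevBoard);
            double successor = gameOver ? 0.0 : net.evaluate(nextBoard);
            // Standard TD state-utility update: U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s)).
            double teachingSignal = current + ALPHA * (reward + GAMMA * successor - current);
            // The TD estimate becomes the "expected" output that back-propagation trains toward.
            net.backPropagate(prevBoard, teachingSignal);
        }
    }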

The agent learns by playing games against itself. We scripted the learning games to include a variety of three-, four-, five-, and six-player games. In each game, every player’s neural network is initialized with the weights carried over from the previous game, and at the conclusion of a game the winner’s network is saved and used as the basis for the next game in the series. Because weights are updated during play, the players’ networks diverge slightly over the course of a game, and the winner is likely to be the network that learned the most useful changes. Those changes are carried over to all players of the next game, whose winner will hopefully gain another small advantage over the others. A basic form of competitive learning is thus established.
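The schedule can be summarized by the following sketch; the TrainableNet, Game, and GameFactory types are stand-ins for illustration, not our actual classes:

    // Hypothetical outline of the self-play schedule: every player starts each game
    // from the previous winner's weights, and the new winner is carried forward.
    public final class SelfPlaySchedule {
        public interface TrainableNet { TrainableNet copy(); }
        public interface Game { int playToCompletion(); }            // returns the winning player's index
        public interface GameFactory { Game create(TrainableNet[] players); }

        public static TrainableNet train(TrainableNet initial, GameFactory factory, int numGames) {
            TrainableNet champion = initial;
            int[] playerCounts = {3, 4, 5, 6};                       // vary the number of players per game
            for (int g = 0; g < numGames; g++) {
                int n = playerCounts[g % playerCounts.length];
                TrainableNet[] nets = new TrainableNet[n];
                for (int p = 0; p < n; p++) nets[p] = champion.copy(); // everyone starts from the champion
                int winner = factory.create(nets).playToCompletion();  // weights drift apart during play
                champion = nets[winner];                               // carry the winner's weights forward
            }
            return champion;
        }
    }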

The nature of the TD algorithm requires a slow learning rate for the network. Therefore, the changes made during any one game will be very small and not much will be learned. But over the course of many trial games, a good evaluation of board position values will evolve.
 
 
 
 

Evaluation


There are two ways to evaluate the agent: playing against it ourselves and making qualitative comments on how intelligently it seems to play, and playing it against other instances of itself and against agents with random weights. The criterion for the first is that the agent be a worthy adversary for human opponents; we do not expect this, given that it has explored only a vanishingly small fraction of the state space. The criterion for the second is that it beat the naïve, untrained computer agent; failing this would indicate that the update function was performing poorly, incorrectly, or not at all. Our hypothesis is therefore that, if the agent is learning anything at all, it will outperform an agent with random weights but still be unable to beat a human player.

The training data consisted of games the agent played against itself. The test data consisted of games played against it by both human and random opponents. Both sets of data are therefore realistic in that they represent a typical game almost entirely. (The only exception is the omission of cards, which feature prominently in the board game but which we considered irrelevant to the overall strategy.)

The performance data thus consist of a qualitative evaluation of how the agent performs against humans in a single game (playing many human vs. computer games would take too long), along with statistics on how it performs against computer opponents with randomized, i.e. untrained, neural nets. We played against it at three stages of its development: the initial agent, the agent after 1000 training games, and the agent after 5000 training games. We also played it against random agents after two periods of training: after 2000 games and after 5000 games.

Initially the agent has essentially random weights, modified only slightly by the evaluations it has made while choosing moves. Some apparent intelligence is visible, but it comes mostly from the hard-coded defensive strategy: the initial agent plays like a paranoid, “knee-jerk” beginner, defending against an impending attack whenever armies build up in an adjacent territory, and it makes no attempt to consolidate its holdings, leaving its territories scattered randomly around the board.

After approximately 1000 training games, we could discern a slight improvement in its strategy over the untrained agent when we played against it. However, it was still relatively simple to defeat.

After more than 5000 training games, the agent immediately tried to occupy a few territories that we consider strategic, and after a couple of moves set out to take all of Australia, which is usually a good continent to control. It was also interesting to watch it attempt to conquer Africa; we could clearly see where the “pre-selection of moves” function was hindering its choices of where to attack. It showed an overwhelming urge to conquer a few specific territories, undoubtedly a holdover from the initial random weights, and it occupied essentially all of Asia but left the last two territories untouched. In short, it was much more intelligent than the earlier agents we played against, but still very far from a worthy adversary for a human player.

After about 2000 training games, we played the agent against a random agent (i.e. one with random weights) and against a previous incarnation that had been trained with an incorrect update function (and hence should also be essentially random). The trained agent won 40 of 100 games, the random agent 29, and the incorrectly trained agent 31. This is promising but hardly excellent: the slightly trained agent clearly does better than the other two, yet only about 7 percentage points better than pure chance. As expected, the incorrectly trained agent performs almost exactly like the random agent.

Testing the agent against random computer agents after 5600 games of learning also supported our qualitative observation that it had learned significantly. In three-player games against two random agents, the trained network won 48% of the time while the two random players combined won only 52%; pure chance would give each player about 33%. This indicates that the training it has received has helped it perform better.

We anticipate that with much more training the agent may perform competently, but as it stands it does not satisfy our first and most important criterion: it is not up to the level of human gameplay. It does, however, confirm our hypothesis that the agent performs better than an untrained agent yet, for lack of training, remains inferior to a human player.
 

Related Work


We based our choice of method for the Risk agent on an agent that has had great success in a functionally similar domain: TD-Gammon, which plays backgammon phenomenally well [3]. The domains are similar for the following reasons:

- Element of chance

Both backgammon and Risk involve dice rolling, so both include an element of uncertainty that must be handled, something that approaches built for many other games do not adequately account for.

- Large search space

Gerald Tesauro estimates that backgammon has a state space well in excess of 10^20 positions [3], so brute-force search in the style of Deep Blue (for chess) is poorly suited to backgammon.
We estimate that Risk (considering only territory ownership) has a state space on the order of 2^42, which is also far too large for brute-force search.

- Fallibility of human judgement

Many backgammon experts discovered that their long-held beliefs were in fact misguided. We do not claim expert knowledge of Risk, and even if we had it, that would not mean we knew all of the correct strategies for winning. Some strategy sources suggest Australia or South America as good starting points, but we have found that North America can be a good base with much room for expansion, and even holding Africa is viable. Controlling a single territory within a continent is also often useful for denying an opponent that continent, yet opinions differ widely as to which territories are worth holding.

- Feedback only at the end

Instead of relying on an arbitrarily designated heuristic, the agent receives a definite response to its actions only at the end of the game, telling it whether it won or lost. As in many games, there are intermediate cues to how the game is developing (more territories or armies in Risk, more of one’s pieces near home in backgammon), but the underlying strategy often runs deeper than these immediate signals.
 

The domains differ in the following respects:

- Smaller branching factor

Rather than arising from the dice rolls, as in backgammon, Risk’s branching factor comes from the number of possible moves, which depends on how many enemy territories are adjacent to your own. In a standard three-player setup a player holds about 14 territories. Every territory has at least two neighbors and some have as many as seven; on average, discounting adjacent territories you already own (which cannot be attacked), there are about two territories you can attack at any time from any one of your territories. With the right strategy, controlling a few “choke points,” you need fortify only three or four territories, which together can attack or be attacked by seven or eight adjacent territories. The branching factor is thus far smaller than the several hundred possible moves per turn in backgammon (roughly 21 distinct dice rolls times about 20 legal moves each).

- The game dynamics

Fundamentally, backgammon isn’t Risk. There are huge differences between the strategies and rules of the two games. Moving all of your checkers to one end of the one-dimensional array-like playing surface is a much different problem from trying to place an army in each position on a graph-like playing board.

TD-Gammon has had much success in its domain, and the approach that Tesauro et al. [3] used is at least as good as, if not superior to, ours. Any comparison, however, must take into account the amount of training each system has received: TD-Gammon was trained on approximately 100,000 games in early versions and over 1,000,000 games in later versions, while our Risk agent has been trained on fewer than 10,000 games, reflecting the more limited resources of college students compared with an IBM researcher. Our approach also differs in that we abstract the board further; this abstraction discards much of the board’s information but significantly reduces the complexity of the function to be learned. We did not set out to build an agent that rivals TD-Gammon; the domains are too different for that comparison. Rather, we used the temporal-difference concept explored by TD-Gammon to train a neural network to play Risk.

The only meaningful comparison we can make is with other computer agents for Risk. The ease with which this learning approach can extend its knowledge of the game is a great advantage over agents that use a hard-coded internal strategy with some randomization [4]. A machine-learning approach has the dual benefits of not relying on the correctness of expert domain knowledge and of conceivably exceeding the ability of any human Risk player involved in the project.
 

Future Work


There are several ways this project could be improved. We did not implement the game cards that feature prominently in the board game, because of the complexity they would introduce.

We also hard-coded significant portions of the strategy: we assume the agent will split its armies between defense and offense, that it will concentrate on one attacking front at a time, and so on. These tasks could likely be handled by a machine-learning algorithm as well; at the very least, better expert knowledge could be applied to these functions of the agent.

Also, we abstracted away the number of armies in territories so the network does not even consider that as a factor; the possibilities for attack are filtered by statistical analysis before they even reach the network. This could also be handled by the neural network, using a modified version of the agent we have used.

Finally, given more time to play training games, the agent should continue to improve thanks to its self-learning neural network.
 

Conclusion


The agent has clearly shown that its level of play improves after as few as 5000 games. After this short period of training it is, not surprisingly, far from a human level of gameplay, but it is superior to the untrained agent.

We believe that this agent, although it did not perform as well as we had hoped, shows real potential to become a competent Risk player that rivals the average human opponent. We intend to train it further and will reevaluate its performance after a more suitable period of training. This project has shown that a machine-learning approach can improve an agent’s play at Risk and, given sufficient training time, may yield a computer player that plays at the level of human players.
 

References

1. Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.

2. Bart Kosko, Neural Networks and Fuzzy Systems, Prentice Hall, 1992.

3. Gerald Tesauro, “Temporal Difference Learning and TD-Gammon,” Communications of the ACM, 1995, http://www.research.ibm.com/massive/tdl.html

4. Clevermedia’s “World Conquest” agent: an online agent that can be tailored for various levels of aggressiveness and personality types and plays a fair game, http://clevermedia.com/worldconquest

5. Risk, Hasbro, 1999 (the actual board game)