AlphaGo: what it does and why it works

Gregory Bonaert

In 2016, the team led by David Silver and Aja Huang released AlphaGo, the first computer program to beat a professional human player at Go. It also beat every previous Go engine by a large margin. Their approach is detailed in the paper Mastering the game of Go with deep neural networks and tree search.

Theoretically, to choose the optimal move, it’s enough to examine every possible sequence of moves that could happen and pick the one that most often leads to victory. In practice, however, the tree is enormous for most games, so this approach isn’t feasible. For example, chess has about 35 possible legal moves per position and games last about 80 moves, meaning the search tree has roughly 10¹²³ leaf nodes.
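That estimate comes from raising the branching factor b to the typical game length d:

b^d \approx 35^{80} \approx 10^{123}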

To ensure the tree doesn’t get too big, we could stop considering new moves after K moves. Once we reach the last move, we need to estimate how good the resulting game situation is. That is exactly the purpose of a value function v(s): to evaluate the value of a state, i.e. its quality. Practical value functions are approximations, since building a perfect value function would require exploring the whole tree, and we’d be stuck again.
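As a rough sketch of this idea (the game interface with is_terminal, outcome, legal_moves and play is a hypothetical placeholder, not AlphaGo's actual code), a depth-limited search looks K moves ahead and then falls back on v(s):

    def negamax(state, depth, value_fn):
        """Depth-limited search: look `depth` moves ahead, then fall back
        on an approximate value function v(s) at the cutoff."""
        if state.is_terminal():
            return state.outcome()        # +1 win, -1 loss, 0 draw
        if depth == 0:
            return value_fn(state)        # approximate the rest of the tree
        # Each child's value is negated: a good position for the opponent
        # is a bad position for us.
        return max(-negamax(state.play(move), depth - 1, value_fn)
                   for move in state.legal_moves())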

This idea was enough to beat professionals at some board games, such as checkers or chess, but it wasn’t good enough for Go. What makes Go harder than chess? It has a much larger number of possible moves (250 vs 35) and its games tend to last longer (150 moves vs 80), meaning we explore the search tree much more shallowly than in chess. This leads to worse estimates and worse moves.

Instead of limiting the depth of the tree (i.e. considering only K moves ahead), we could limit its breadth. Instead of considering all possible moves, we only consider one or a few. Sampling at most K moves creates a sequence of moves and states called a Monte Carlo rollout. If the game hasn’t ended by the end of a rollout, we can approximate its quality with a value function. Since at each step we only consider one move instead of all of them, we can make our rollouts much deeper. By averaging over many rollouts, we can estimate the quality of each move available now and pick the best one. This technique is enough to beat humans at backgammon and Scrabble, but it only reaches the level of a weak amateur at Go.
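A minimal sketch of this in Python (the game interface and value_fn are again hypothetical placeholders):

    import random

    def rollout(state, value_fn, max_moves=200):
        """Play randomly sampled moves until the game ends (or a cap is hit),
        then score the final position."""
        for _ in range(max_moves):
            if state.is_terminal():
                return state.outcome()
            state = state.play(random.choice(state.legal_moves()))
        return value_fn(state)   # game not finished: fall back on v(s)

    def estimate_moves(state, value_fn, n_rollouts=100):
        """Estimate each legal move by averaging the outcomes of many rollouts."""
        return {move: sum(rollout(state.play(move), value_fn)
                          for _ in range(n_rollouts)) / n_rollouts
                for move in state.legal_moves()}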

As we do rollouts, we start to get a better idea of whether a specific move is good or not. Yet when doing the next rollout, we pick good and bad moves with equal probability, which doesn’t make much sense. The idea behind Monte Carlo tree search is to select higher-quality children / moves more often, concentrating most of the search effort in the parts of the search tree that are likely to be useful. If we learn that action A is good and action B is bad, in the future we’ll explore A’s subtree much more often than B’s. Note that we don’t make it impossible to search B’s subtree, just less likely. Since we rely on estimates to pick an action, this is safer: it ensures we can still explore an action’s subtree even if its initial estimate was incorrect.

Previous work showed that Monte Carlo tree search could reach the level of a strong amateur at Go when the policies and value functions were linear combinations of their input features. Much better performance should be possible with more powerful models, such as deep convolutional neural networks. Two main networks are used in AlphaGo:

  • A 13-layer RL policy network that samples actions / moves and is trained using reinforcement learning
  • A value network that evaluates positions

They also add two helper networks:

  • An SL policy network with the same structure as the RL policy network, which also samples actions / moves. It’s trained using supervised learning on a dataset of human games before the RL policy network, and its final weights are used to initialize the weights of the RL policy network.
  • A rollout policy network, which is used to sample actions during a rollout. This network is much smaller and about 1,000 times faster than the other two networks, making it possible to do many more rollouts. Each rollout’s quality is lower, but the sheer number of rollouts compensates for that.

Training process

1. Train the SL policy network. Using 30 million (state, action) pairs from the KGS Go server, they train a policy network with supervised learning to predict the best action given a state as input. This network predicts the move human experts would make in 57% of the test set states (unseen during training). A small increase in accuracy led to large increases in win rate.
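A toy version of one such supervised step, using a linear softmax policy as a stand-in for the 13-layer convolutional network (an illustration of the cross-entropy update only, not the paper's pipeline):

    import numpy as np

    def sl_policy_step(weights, features, expert_move, lr=0.01):
        """One supervised step: increase the probability of the expert's move.
        weights: (num_features, num_moves), features: (num_features,)."""
        logits = features @ weights
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Gradient of -log p(expert_move) for a linear softmax model
        grad = np.outer(features, probs)
        grad[:, expert_move] -= features
        return weights - lr * grad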

2. Train the fast rollout policy network. The same training process was applied to a much smaller network (the rollout network), which only achieves 24% accuracy but takes 2μs to pick an action instead of the 3,000μs needed by the larger networks.

3. Train the RL policy network. AlphaGo’s goal is to win Go games, but the SL network is instead trained to copy human moves. To remedy that, the RL policy network is explicitly trained with the goal of winning games.

The initial values of the weights are set to the final weights of the trained SL policy network. Instead of re-using the 30 million human positions, the RL policy network learns by playing against a randomly selected previous version of the policy. Using a random previous opponent instead of the latest one makes training more stable, because we can’t overfit to the current policy: instead of being good only against the latest version, we have to be good against all previous versions. Every 500 iterations, the current state of the network is added to the pool of possible opponents.
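A sketch of that self-play loop (the play_games and policy_gradient_update helpers are hypothetical placeholders for the paper's machinery):

    import copy
    import random

    def train_rl_policy(sl_policy, play_games, policy_gradient_update,
                        num_iterations=10_000):
        """Self-play training: face a randomly chosen past version of the policy,
        and add the current policy to the opponent pool every 500 iterations."""
        current = copy.deepcopy(sl_policy)           # initialized from the SL network
        opponent_pool = [copy.deepcopy(current)]
        for iteration in range(1, num_iterations + 1):
            opponent = random.choice(opponent_pool)  # random earlier version
            games = play_games(current, opponent)
            policy_gradient_update(current, games)   # see the update sketch below
            if iteration % 500 == 0:
                opponent_pool.append(copy.deepcopy(current))
        return current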

The reward is 0 everywhere except in the final state, where it’s 1 (victory) or -1 (defeat). The outcome z at time t is then set to the reward obtained at the end of the game. This makes it possible to improve the weights by stochastic gradient ascent in the direction that maximizes the outcome (i.e. that leads to victory).
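For a toy linear softmax policy (an assumption for illustration, not the paper's network), that gradient-ascent step scales each move's log-probability gradient by the final outcome z, so moves from won games become more likely and moves from lost games less likely:

    import numpy as np

    def reinforce_step(weights, action_features, action, z, lr=1e-3):
        """One REINFORCE-style step. action_features: (num_actions, num_features)
        for the current state; z: final outcome (+1 win, -1 loss)."""
        logits = action_features @ weights
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # Gradient of log pi(action | state) for a linear softmax policy
        grad_log_pi = action_features[action] - probs @ action_features
        return weights + lr * z * grad_log_pi        # ascent: move toward wins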

An RL network trained in this manner is strong enough to beat the SL network 80% of the time. It also beats Pachi, the best open-source Go engine, 85% of the time. Previous approaches based only on supervised learning managed to beat Pachi just 11% of the time.

4. Train the value function. The goal of this value function is to predict the outcome from a game state s, assuming both players use policy p. Ideally we’d use the optimal policy p*, but since we don’t know it, we estimate the value function of the best policy we have trained. This value function is approximated by a neural network with the same structure as the policy networks, except that the last layer outputs a single prediction instead of a probability distribution. The result thus involves two approximations: learning the value function of the best policy we know rather than of the optimal policy, and approximating that value function with a neural network.

The network is trained using stochastic gradient descent to minimize the mean squared error (MSE) between the predicted value and the real outcome.
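A toy version of that update, with a linear-plus-tanh model standing in for the value network (an assumption for illustration; the real network shares the policy network's convolutional structure):

    import numpy as np

    def value_step(weights, features, outcome, lr=0.01):
        """One SGD step on the squared error between the predicted value
        v(s) = tanh(w . features) and the real outcome z in {-1, +1}."""
        pred = np.tanh(features @ weights)
        error = pred - outcome                       # derivative of 0.5 * MSE
        grad = error * (1.0 - pred ** 2) * features  # chain rule through tanh
        return weights - lr * grad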

The obvious way to train this network is to use all successive positions from the games obtained from the KGS Go server. Unfortunately, this leads to overfitting, because successive game states are highly correlated (they differ by only one stone) while the target is the same for the whole game. In practice, this meant the network learned to memorize the game outcomes instead of learning to predict them: the MSE was 0.19 on the training set but 0.37 on the test set, almost double.

To solve this problem, they created 30 million distinct positions sampled from 30 million different games, ensuring the game states are only weakly correlated and that the target varies often. This makes it hard for the network to simply memorize outcomes, forcing it to generalize. Each game was generated by making the RL policy network play against itself until the game ended. This led to a training set MSE of 0.226 and a test set MSE of 0.234. This value function approaches the accuracy of Monte Carlo rollouts while requiring 15,000 times less computation.
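A sketch of that sampling scheme (the self_play_game helper, returning the list of positions and the final outcome, is a hypothetical placeholder):

    import random

    def build_value_dataset(self_play_game, num_games=30_000_000):
        """One randomly chosen position per self-play game, paired with that
        game's outcome, so that training examples are only weakly correlated."""
        dataset = []
        for _ in range(num_games):
            positions, outcome = self_play_game()
            dataset.append((random.choice(positions), outcome))
        return dataset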

How to choose the move

Once the 4 networks are trained, we can use them to pick actions. AlphaGo combines the policy network, the value network and the rollout network to choose good moves by doing lookahead search.

We run many tree-traversal simulations: during a simulation, we go down the search tree until we reach a leaf node, which is then evaluated (more on this below). The choice of action inside the search tree is dictated by a few rules. Each edge (s, a) in the search tree stores three values:

  • An action value Q(s, a)
  • A visit count N(s, a)
  • A prior probability P(s, a)

At time t, the action that is chosen is

\underset{a}{argmax}(Q(s, a) + U(s, a))

which maximises the action value and a bonus

U(s, a) \propto \frac{P(s, a)}{1 + N(s, a)}

which encourages visiting high-probability moves that haven’t been visited much. As the visit counts grow, the bonus shrinks, so in the long run the action value Q dominates.
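In code, the in-tree selection rule might look like the sketch below (the edge objects and the exploration constant c_puct are simplifications; as noted later, the paper's exact bonus also depends on the parent's total visit count):

    def select_action(node, c_puct=5.0):
        """Pick the edge maximizing Q(s, a) + U(s, a), with
        U proportional to P(s, a) / (1 + N(s, a))."""
        def score(action):
            edge = node.edges[action]
            return edge.Q + c_puct * edge.P / (1 + edge.N)
        return max(node.edges, key=score)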

So how are P, Q and N created and updated? When we want to visit a never-before-seen edge (s, a), i.e. at a leaf node, we use the SL policy network to compute p(a|s) and set P(s, a) to that value. The authors also tested using the RL policy network, but it performed worse, “presumably because humans select a diverse beam of promising moves, whereas RL optimizes for the single best move.”

We also need to know that node’s value V(s). To do so, we combine two estimates:

  1. The prediction E(s) of the value network
  2. The outcome Z of a random rollout obtained using the fast rollout policy network

These estimates are combined using a mixing parameter:

V(s) = (1 - \lambda) E(s) + \lambda Z

Contrary to the prior probabilities, the value function derived from the RL policy performed better than the value function derived from the SL policy. The authors do not comment on why.
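As a small sketch, the leaf evaluation is just a weighted average of the two estimates (the paper reports λ = 0.5 as the best-performing mixing parameter; the helper names are illustrative):

    def leaf_value(state, value_net, fast_rollout, lam=0.5):
        """Mix the value-network prediction E(s) with the outcome Z of a
        fast-rollout playout, weighted by the mixing parameter lambda."""
        return (1.0 - lam) * value_net(state) + lam * fast_rollout(state)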

At the end of each simulation, we update the visit count N(s, a) of each traversed edge and update the action values:

Q(s, a) = \frac{1}{N(s, a)} \sum_{i=1}^{n} I(s, a, i) \, V(s_L^i)
where s_L^i is the leaf node of the i-th simulation and I(s, a, i) indicates whether the edge (s, a) was traversed during the i-th simulation.

The action value of an edge (s, a) is thus the mean value of the leaf nodes of all simulations that went through it. After all simulations are complete, we have to pick an action. The action with the highest action value may seem best, but the authors instead pick the action that was visited most often. Why? According to the authors, this approach is less sensitive to outliers than maximizing the action value.
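A sketch of that backup and of the final move choice (edge objects with Q and N fields are assumed, matching the selection sketch above):

    def backup(path, leaf_value):
        """Propagate the leaf evaluation V(s_L) up the traversed path: each
        edge's visit count grows by one and its action value becomes the
        running mean of the leaf values that passed through it."""
        for edge in path:
            edge.N += 1
            edge.Q += (leaf_value - edge.Q) / edge.N   # incremental mean

    def best_move(root):
        """Final choice: the most visited edge at the root, which the authors
        found more robust than picking the highest Q."""
        return max(root.edges, key=lambda a: root.edges[a].N)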

Once an action is picked, we set the corresponding child node as the new root of the search tree. The subtrees of all other nodes are discarded, since we won’t need them.

Additional improvements

  • To improve performance and distribute the workload, they developed an asynchronous distributed version of MCTS that runs on multiple CPUs (search simulations) and GPUs (policy and value network evaluations). The distributed version beat the non-distributed version in 77% of games and other programs in 100% of games.
  • The formula for the bonus U(s, a) is more involved than the one shown here, but the proportionality is correct
  • AlphaGo resigns if it estimates it has less than a 10% chance of winning (i.e. the biggest Q(s, a) is smaller than -0.8)
  • The input contains:
    • The stone color at each position
    • A board-sized plane filled with 0’s and another filled with 1’s
    • The legality of each move
    • The liberties of each position
    • The number of turns since each move was played (up to 8 turns)
    • For each move, if it’s a ladder escape, and if it’s a ladder capture
    • For each move, how many of the opponent’s stones would be captured (up to 8) and how many of our own stones would be captured (up to 8)
    • Whether the current player is white or black
  • Since Go has several symmetries, a random rotation / reflection is applied when evaluating the network, making the evaluation more robust. At training time, all 8 equivalent game states are used in a single batch (a sketch of the 8 symmetries follows this list).
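A minimal sketch of those eight symmetries with NumPy, assuming the input is a (channels, 19, 19) array of feature planes:

    import numpy as np

    def board_symmetries(planes):
        """The 8 dihedral symmetries of the board: 4 rotations, each with and
        without a reflection. `planes` has shape (channels, 19, 19)."""
        out = []
        for k in range(4):
            rotated = np.rot90(planes, k, axes=(1, 2))
            out.append(rotated)
            out.append(np.flip(rotated, axis=2))
        return out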

Performance

The version of AlphaGo that combines the value function and rollouts beats the versions that don’t 95% of the time. AlphaGo also won 99.8% of its games against other programs. Even when Crazy Stone (the strongest commercial program) was given 4 handicap stones, AlphaGo still beat it 77% of the time.

AlphaGo played 10 games against Fan Hui, a 2-dan professional player, beating him in all 5 formal games. In the 5 informal games with shorter time controls, AlphaGo won 3 and lost 2. It was the first time a computer program had beaten a professional human player at Go.
