![]() | ||
In computer science, Monte Carlo tree search (MCTS) is a heuristic search algorithm for some kinds of decision processes, most notably those employed in game play. Two leading examples of Monte Carlo tree search are the computer game Total War: Rome II's implementation in their high level campaign AI and recent computer Go programs, but it also has been used in other board games, as well as real-time video games and non-deterministic games such as poker (see history section).
Contents
Principle of operation
The focus of Monte Carlo tree search is on the analysis of the most promising moves, expanding the search tree based on random sampling of the search space. The application of Monte Carlo tree search in games is based on many playouts. In each playout, the game is played out to the very end by selecting moves at random. The final game result of each playout is then used to weight the nodes in the game tree so that better nodes are more likely to be chosen in future playouts.
The most basic way to use playouts is to apply the same number of playouts after each legal move of the current player, then choosing the move which led to the most victories. The efficiency of this method—called Pure Monte Carlo Game Search—often increases with time as more playouts are assigned to the moves that have frequently resulted in the player's victory (in previous playouts). Full Monte Carlo tree search employs this principle recursively on many depths of the game tree. Each round of Monte Carlo tree search consists of four steps:
Sample steps from one round are shown in the figure below. Each tree node stores the number of won/played playouts.
Note that the updating of the number of wins in each node during backpropagation should arise from the player who made the move that resulted in that node (this is not accurately reflected in the sample image above). This ensures that during selection, each player's choices expand towards the most promising moves for that player, which mirrors the goal of each player to maximize the value of their move.
Rounds of search are repeated as long as the time allotted to a move remains. Then the move with the most simulations made is selected rather than the move with the highest average win rate.
Pure Monte Carlo game search
This basic procedure can be applied to any game whose positions necessarily have a finite number of moves and finite length. For each position, all feasible moves are determined: k random games are played out to the very end, and the scores are recorded. The move leading to the best score is chosen. Ties are broken by fair coin flips. Pure Monte Carlo Game Search results in strong play in several games with random elements, as in EinStein würfelt nicht!. It converges to optimal play (as k tends to infinity) in board filling games with random turn order, for instance in Hex with random turn order.
Exploration and exploitation
The main difficulty in selecting child nodes is maintaining some balance between the exploitation of deep variants after moves with high average win rate and the exploration of moves with few simulations. The first formula for balancing exploitation and exploration in games, called UCT (Upper Confidence Bound 1 applied to trees), was introduced by Levente Kocsis and Csaba Szepesvári. UCT is based on the UCB1 formula derived by Auer, Cesa-Bianchi, and Fischer and the provably convergent AMS (Adaptive Multi-stage Sampling) algorithm first applied to multi-stage decision making models (specifically, Markov Decision Processes) by Chang, Fu, Hu, and Marcus. Kocsis and Szepesvári recommend to choose in each node of the game tree the move, for which the expression
The first component of the formula above corresponds to exploitation; it is high for moves with high average win ratio. The second component corresponds to exploration; it is high for moves with few simulations.
Most contemporary implementations of Monte Carlo tree search are based on some variant of UCT that traces its roots back to the AMS simulation optimization algorithm for estimating the value function in finite-horizon Markov Decision Processes (MDPs) introduced by Chang et al. (2005) in Operations Research. (AMS was the first work to explore the idea of UCB-based exploration and exploitation in constructing sampled/simulated (Monte Carlo) trees and was the main seed for UCT.)
Advantages and disadvantages
Although it has been proved that the evaluation of moves in Monte Carlo tree search converges to minimax, the basic version of Monte Carlo tree search converges very slowly. However Monte Carlo tree search does offer significant advantages over alpha–beta pruning and similar algorithms that minimize the search space.
In particular, Monte Carlo tree search does not need an explicit evaluation function. Simply implementing the game's mechanics is sufficient to explore the search space (i.e. the generating of allowed moves in a given position and the game-end conditions). As such, Monte Carlo tree search can be employed in games without a developed theory or in general game playing.
The game tree in Monte Carlo tree search grows asymmetrically as the method concentrates on the more promising subtrees. Thus it achieves better results than classical algorithms in games with a high branching factor.
Moreover, Monte Carlo tree search can be interrupted at any time yielding the most promising move already found.
Improvements
Various modifications of the basic Monte Carlo tree search method have been proposed to shorten the search time. Some employ domain-specific expert knowledge, others do not.
Monte Carlo tree search can use either light or heavy playouts. Light playouts consist of random moves while heavy playouts apply various heuristics to influence the choice of moves. These heuristics may employ the results of previous playouts (e.g. the Last Good Reply heuristic) or expert knowledge of a given game. For instance, in many go-playing programs certain stone patterns in a portion of the board influence the probability of moving into that area. Paradoxically, playing suboptimally in simulations sometimes makes a Monte Carlo tree search program play stronger overall.
Domain-specific knowledge may be employed when building the game tree to help the exploitation of some variants. One such method assigns nonzero priors to the number of won and played simulations when creating each child node, leading to artificially raised or lowered average win rates that cause the node to be chosen more or less frequently, respectively, in the selection step. A related method, called progressive bias, consists in adding to the UCB1 formula a
The basic Monte Carlo tree search collects enough information to find the most promising moves only after many rounds; until then its moves are essentially random. This exploratory phase may be reduced significantly in a certain class of games using RAVE (Rapid Action Value Estimation). In these games, permutations of a sequence of moves lead to the same position. Typically, they are board games in which a move involves placement of a piece or a stone on the board. In such games the value of each move is often only slightly influenced by other moves.
In RAVE, for a given game tree node N, its child nodes Ci store not only the statistics of wins in playouts started in node N but also the statistics of wins in all playouts started in node N and below it, if they contain move i (also when the move was played in the tree, between node N and a playout). This way the contents of tree nodes are influenced not only by moves played immediately in a given position but also by the same moves played later.
When using RAVE, the selection step selects the node, for which the modified UCB1 formula
Heuristics used in Monte Carlo tree search often require many parameters. There are automated methods to tune the parameters to maximize the win rate.
Monte Carlo tree search can be concurrently executed by many threads or processes. There are several fundamentally different methods of its parallel execution:
History
The Monte Carlo method, based on random sampling, dates back to the 1940s. Bruce Abramson explored the idea in his 1987 PhD thesis and said it "is shown to be precise, accurate, easily estimable, efficiently calculable, and domain-independent." He experimented in-depth with Tic-tac-toe and then with machine generated evaluation functions for Othello and Chess.
Specifically Monte Carlo tree search was first explored and successfully applied to heuristic search in the field of automated theorem proving by W. Ertel, J. Schumann and C. Suttner in 1989, thus improving the exponential search times of uninformed search algorithms such as e.g. breadth-first search, depth-first search or iterative deepening.
In 1992, B. Brügmann employed it for the first time in a go-playing program. Chang et al. proposed the idea of "recursive rolling out and backtracking" with "adaptive" sampling choices in their Adaptive Multi-stage Sampling (AMS) algorithm for the model of Markov decision processes. (AMS was the first work to explore the idea of UCB-based exploration and exploitation in constructing sampled/simulated (Monte Carlo) trees and was the main seed for UCT.). Inspired by these predecessors, Rémi Coulom described the application of the Monte Carlo method to game-tree search and coined the name Monte Carlo tree search, L. Kocsis and Cs. Szepesvári developed the UCT algorithm, and S. Gelly et al. implemented UCT in their program MoGo. In 2008, MoGo achieved dan (master) level in 9×9 go, and the Fuego program began to win with strong amateur players in 9×9 go. In January 2012, the Zen program won 3:1 in a Go match on a 19×19 board with an amateur 2 dan player. Google Deepmind developed the program AlphaGo, which in October 2015 became the first Computer Go program to beat a professional human Go player without handicaps on a full-sized 19x19 board. In March 2016, AlphaGo was awarded an honorary 9-dan (master) level in 19×19 go for defeating Lee Sedol in a five-game match with a final score of four games to one. AlphaGo represents a significant improvement over previous Go programmes as well as a milestone in machine learning as it uses Monte Carlo tree search via artificial neural networks (a deep learning method) with efficiency far surpassing previous programmes.
Monte Carlo tree search has also been used in programs that play other board games (for example Hex, Havannah, Game of the Amazons, and Arimaa), real-time video games (for instance Ms. Pac-Man, Fable Legends and Total War: Rome II), and nondeterministic games (such as skat, poker, Magic: The Gathering, or Settlers of Catan).