In artificial intelligence, Thompson sampling, named after William R. Thompson, is a heuristic for choosing actions that addresses the exploration-exploitation dilemma in the multi-armed bandit problem. It consists in choosing the action that maximizes the expected reward with respect to a randomly drawn belief.
Description
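Consider a set of contexts $\mathcal{X}$, a set of actions $\mathcal{A}$, and rewards in $\mathbb{R}$. In each round, the player obtains a context $x \in \mathcal{X}$, plays an action $a \in \mathcal{A}$, and receives a reward $r \in \mathbb{R}$ drawn from a distribution that depends on the context and the action played. The aim of the player is to choose actions so as to maximize the cumulative reward.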
The elements of Thompson sampling are as follows:
- a likelihood function $P(r \mid \theta, a, x)$;
- a set $\Theta$ of parameters $\theta$ of the distribution of $r$;
- a prior distribution $P(\theta)$ on these parameters;
- past observation triplets $D = \{(x; a; r)\}$;
- a posterior distribution $P(\theta \mid D) \propto P(D \mid \theta) P(\theta)$, where $P(D \mid \theta)$ is the likelihood function.
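Thompson sampling consists in playing the action $a^*$ according to the probability that it maximizes the expected reward; that is, action $a^*$ is chosen with probability
$$\int \mathbb{I}\left[\mathbb{E}(r \mid \theta, a^*, x) = \max_{a'} \mathbb{E}(r \mid \theta, a', x)\right] P(\theta \mid D)\, d\theta,$$
where $\mathbb{I}$ is the indicator function.
In practice, the rule is implemented by sampling, in each round, a parameter $\theta^*$ from the posterior $P(\theta \mid D)$ and choosing the action $a^*$ that maximizes $\mathbb{E}[r \mid \theta^*, a^*, x]$, the expected reward given the sampled parameter, the action, and the current context.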
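The sampling rule described in this section can be illustrated with a minimal sketch for the simplest special case: a context-free bandit with Bernoulli rewards and an independent Beta(1, 1) prior on each arm. These modelling choices, the function name `thompson_sampling_bernoulli`, and the simulated reward probabilities are assumptions made for illustration only and do not come from the article.

```python
import random


def thompson_sampling_bernoulli(true_means, rounds, seed=0):
    """Run Beta-Bernoulli Thompson sampling; return pull counts per arm.

    true_means holds the (hypothetical) success probability of each arm,
    used here only to simulate the environment.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Beta(1, 1) prior on each arm's success probability (uniform belief).
    successes = [1] * n_arms
    failures = [1] * n_arms
    pulls = [0] * n_arms

    for _ in range(rounds):
        # Sample one parameter per arm from the current posterior.
        sampled = [rng.betavariate(successes[a], failures[a])
                   for a in range(n_arms)]
        # Act greedily with respect to the sampled parameters.
        arm = max(range(n_arms), key=lambda a: sampled[a])
        # Observe a simulated Bernoulli reward for the played arm.
        reward = 1 if rng.random() < true_means[arm] else 0
        # Conjugate update: the posterior stays a Beta distribution.
        successes[arm] += reward
        failures[arm] += 1 - reward
        pulls[arm] += 1
    return pulls


pulls = thompson_sampling_bernoulli([0.2, 0.5, 0.8], rounds=2000)
print(pulls)
```

Because the Beta distribution is conjugate to the Bernoulli likelihood, the posterior update reduces to incrementing two counters. Richer likelihoods generally require approximate posterior sampling instead.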
History
Thompson sampling was originally described by Thompson in a 1933 article, but was long largely ignored by the artificial intelligence community. It was subsequently rediscovered independently numerous times in the context of reinforcement learning. A first proof of convergence for the bandit case was given in 1997. The first application to Markov decision processes was in 2000. A related approach (see Bayesian control rule) was published in 2010. In 2010 it was also shown that Thompson sampling is instantaneously self-correcting. Asymptotic convergence results for contextual bandits were published in 2011. Thompson sampling is now widely used in online learning problems: it has been applied to A/B testing in website design and online advertising; it has formed the basis for accelerated learning in decentralized decision making; and a Double Thompson Sampling (D-TS) algorithm has been proposed for dueling bandits, a variant of the traditional multi-armed bandit in which feedback comes in the form of pairwise comparisons.
Probability matching
Probability matching is a decision strategy in which predictions of class membership are proportional to the class base rates. Thus, if in the training set positive examples are observed 60% of the time, and negative examples are observed 40% of the time, the observer using a probability-matching strategy will predict (for unlabeled examples) a class label of "positive" on 60% of instances, and a class label of "negative" on 40% of instances.
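The 60/40 example above can be sketched in a few lines. This is an illustrative sketch only: the function name `probability_matching_predict` and the synthetic training labels are hypothetical, and the predictor deliberately ignores the features of the unlabeled examples, using only the class base rates.

```python
import random
from collections import Counter


def probability_matching_predict(train_labels, n_predictions, seed=0):
    """Draw predicted labels with probabilities equal to the base rates."""
    rng = random.Random(seed)
    counts = Counter(train_labels)
    classes = sorted(counts)
    # Weights are the observed class frequencies in the training set.
    weights = [counts[c] for c in classes]
    return rng.choices(classes, weights=weights, k=n_predictions)


# Synthetic training set: 60% positive, 40% negative.
train = ["positive"] * 60 + ["negative"] * 40
preds = probability_matching_predict(train, 10000)
share = preds.count("positive") / len(preds)
print(round(share, 2))
```

The printed share of "positive" predictions should be close to the 60% base rate, up to sampling noise.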
Bayesian control rule
A generalization of Thompson sampling to arbitrary dynamical environments and causal structures, known as the Bayesian control rule, has been shown to be the optimal solution to the adaptive coding problem with actions and observations. In this formulation, an agent is conceptualized as a mixture over a set of behaviours. As the agent interacts with its environment, it learns the causal properties and adopts the behaviour that minimizes the relative entropy to the behaviour with the best prediction of the environment's behaviour. If these behaviours have been chosen according to the maximum expected utility principle, then the asymptotic behaviour of the Bayesian control rule matches the asymptotic behaviour of the perfectly rational agent.
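The setup is as follows. Let $a_1, a_2, \ldots, a_T$ be the actions issued by an agent up to time $T$, and let $o_1, o_2, \ldots, o_T$ be the observations gathered by the agent up to time $T$. Then the agent issues the action $a_{T+1}$ with probability
$$P(a_{T+1} \mid \hat{a}_{1:T}, o_{1:T}),$$
where the "hat"-notation $\hat{a}_t$ denotes the fact that $a_t$ is a causal intervention and not an ordinary observation. If the agent holds beliefs $\theta \in \Theta$ over its behaviours, then the Bayesian control rule becomes
$$P(a_{T+1} \mid \hat{a}_{1:T}, o_{1:T}) = \int_{\Theta} P(a_{T+1} \mid \theta, \hat{a}_{1:T}, o_{1:T})\, P(\theta \mid \hat{a}_{1:T}, o_{1:T})\, d\theta,$$
where $P(\theta \mid \hat{a}_{1:T}, o_{1:T})$ is the posterior distribution over the parameter $\theta$ given actions $a_{1:T}$ and observations $o_{1:T}$.
In practice, the Bayesian control amounts to sampling, in each time step, a parameter $\theta^*$ from the posterior distribution $P(\theta \mid \hat{a}_{1:T}, o_{1:T})$, where the posterior is computed using Bayes' rule by considering only the (causal) likelihoods of the observations $o_1, \ldots, o_T$ and ignoring the (causal) likelihoods of the actions $a_1, \ldots, a_T$, and then sampling the action $a^*_{T+1}$ from the action distribution $P(a_{T+1} \mid \theta^*, \hat{a}_{1:T}, o_{1:T})$.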