A partially observable Markov decision process (POMDP) is a generalization of a Markov decision process (MDP). A POMDP models an agent decision process in which it is assumed that the system dynamics are determined by an MDP, but the agent cannot directly observe the underlying state. Instead, it must maintain a probability distribution over the set of possible states, based on a set of observations and observation probabilities, and the underlying MDP.
Contents
- Formal definition
- Discussion
- Belief update
- Belief MDP
- Policy and Value Function
- Approximate POMDP solutions
- POMDP uses
- References
The POMDP framework is general enough to model a variety of real-world sequential decision processes. Applications include robot navigation problems, machine maintenance, and planning under uncertainty in general. The framework originated in the operations research community, and was later taken over by the artificial intelligence and automated planning communities.
An exact solution to a POMDP yields the optimal action for each possible belief over the world states. The optimal action maximizes (or minimizes) the expected reward (or cost) of the agent over a possibly infinite horizon. The sequence of optimal actions is known as the optimal policy of the agent for interacting with its environment.
Formal definition
A discrete-time POMDP models the relationship between an agent and its environment. Formally, a POMDP is a 7-tuple
At each time period, the environment is in some state
Discussion
Because the agent does not directly observe the environment's state, the agent must make decisions under uncertainty of the true environment state. However, by interacting with the environment and receiving observations, the agent may update its belief in the true state by updating the probability distribution of the current state. A consequence of this property is that the optimal behavior may often include information gathering actions that are taken purely because they improve the agent's estimate of the current state, thereby allowing it to make better decisions in the future.
It is instructive to compare the above definition with the definition of a Markov decision process. An MDP does not include the observation set, because the agent always knows with certainty the environment's current state. Alternatively, an MDP can be reformulated as a POMDP by setting the observation set to be equal to the set of states and defining the observation conditional probabilities to deterministically select the observation that corresponds to the true state.
Belief update
An agent needs to update its belief upon taking the action
After reaching
where
Belief MDP
A Markovian belief state allows a POMDP to be formulated as a Markov decision process where every belief is a state. The resulting belief MDP will thus be defined on a continuous state space, since there are infinite beliefs for any given POMDP. In other words, there are technically infinite belief states (in
Formally, the belief MDP is defined as a tuple
Of these,
where
The belief MDP reward function (
The belief MDP is not partially observable anymore, since at any given time the agent knows its belief, and by extension the state of the belief MDP.
Also, unlike the "originating" MDP, where each action is available from only one state; in the corresponding Belief MDP, all belief states allow all actions, since you (almost) always have some probability of believing you are in any (originating) state.
Policy and Value Function
The expected reward for policy
where
where
The optimal policy, denoted by
For finite-horizon POMDPs, the optimal value function is piecewise-linear and convex. It can be represented as a finite set of vectors. In the infinite-horizon formulation, a finite vector set can approximate
Approximate POMDP solutions
In practice, POMDPs are often computationally intractable to solve exactly, so computer scientists have developed methods that approximate solutions for POMDPs.
Grid-based algorithms comprise one approximate solution technique. In this approach, the value function is computed for a set of points in the belief space, and interpolation is used to determine the optimal action to take for other belief states that are encountered which are not in the set of grid points. More recent work makes use of sampling techniques, generalization techniques and exploitation of problem structure, and has extended POMDP solving into large domains with millions of states. For example, point-based methods sample random reachable belief points to constrain the planning to relevant areas in the belief space. Dimensionality reduction using PCA has also been explored.
POMDP uses
POMDPs can be used to model many kinds of real-world problems. Notable applications include the use of a POMDP in management of patients with ischemic heart disease, assistive technology for persons with dementia, the conservation of the critically endangered and difficult to detect Sumatran tigers and aircraft collision avoidance.