|  | ||
A Boltzmann machine is a type of stochastic recurrent neural network and Markov Random Field invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield nets. They were one of the first examples of a neural network capable of learning internal representations, and are able to represent and (given sufficient time) solve difficult combinatoric problems. They are theoretically intriguing because of the locality and Hebbian nature of their training algorithm, and because of their parallelism and the resemblance of their dynamics to simple physical processes. Due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference, but if the connectivity is properly constrained, the learning can be made efficient enough to be useful for practical problems
Contents
- Structure
- Probability of a units state
- Equilibrium state
- Training
- Problems
- Restricted Boltzmann machine
- History
- References
They are named after the Boltzmann distribution in statistical mechanics, which is used in their sampling function.
Structure
A Boltzmann machine, like a Hopfield network, is a network of units with an "energy" defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy,                     
Where:
Often the weights are represented in matrix form with a symmetric matrix                     
Probability of a unit's state
The difference in the global energy that results from a single unit                     
This can be expressed as the difference of energies of two states:
We then substitute the energy of each state with its relative probability according to the Boltzmann Factor (the property of a Boltzmann distribution that the energy of a state is proportional to the negative log probability of that state):
where                     
We can now finally solve for                     
where the scalar                     
Equilibrium state
The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state's energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is "at thermal equilibrium", meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
If we want to train the network so that the chance it will converge to a global state is according to an external distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities will get the lowest energies. This is done by the following training procedure.
Training
The units in the Boltzmann Machine are divided into 'visible' units, V, and 'hidden' units, H. The visible units are those which receive information from the 'environment', i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted                     
As is discussed above, the distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after we marginalize it over the hidden units, as                     
Our goal is to approximate the "real" distribution                     
where the sum is over all the possible states of                     
There are two phases to Boltzmann machine training, and we switch iteratively between them. One is the "positive" phase where the visible units' states are clamped to a particular binary state vector sampled from the training set (according to                     
where:
This result follows from the fact that at thermal equilibrium the probability                     
Remarkably, this learning rule is fairly biologically plausible because the only information needed to change the weights is provided by "local" information. That is, the connection (or synapse biologically speaking) does not need information about anything other than the two neurons it connects. This is far more biologically realistic than the information needed by a connection in many other neural network training algorithms, such as backpropagation.
The training of a Boltzmann machine does not use the EM algorithm, which is heavily used in machine learning. By minimizing the KL-divergence, it is equivalent to maximizing the log-likelihood of the data. Therefore, the training procedure performs gradient ascent on the log-likelihood of the observed data. This is in contrast to the EM algorithm, where the posterior distribution of the hidden nodes must be calculated before the maximization of the expected value of the complete data likelihood during the M-step.
Training the biases is similar, but uses only single node activity:
Problems
The Boltzmann machine would theoretically be a rather general computational medium. For instance, if trained on photographs, the machine would theoretically model the distribution of photographs, and could use that model to, for example, complete a partial photograph.
Unfortunately, there is a serious practical problem with the Boltzmann machine, namely that it seems to stop learning correctly when the machine is scaled up to anything larger than a trivial machine. This is due to a number of effects, the most important of which are:
Restricted Boltzmann machine
Although learning is impractical in general Boltzmann machines, it can be made quite efficient in an architecture called the "restricted Boltzmann machine" or "RBM" which does not allow intralayer connections between hidden units. After training one RBM, the activities of its hidden units can be treated as data for training a higher-level RBM. This method of stacking RBMs makes it possible to train many layers of hidden units efficiently and is one of the most common deep learning strategies. As each new layer is added the overall generative model gets better.
There is an extension to the restricted Boltzmann machine that affords using real valued data rather than binary data. Along with higher order Boltzmann machines, it is outlined here [1].
One example of a practical application of Restricted Boltzmann machines is the performance improvement of speech recognition software.
History
The Boltzmann machine is a Monte Carlo version of the Hopfield network.
The idea of using annealed Ising models for inference is often thought to have been first described by:
However, it should be noted that these articles appeared after the seminal publication by John Hopfield, where the connection to physics and statistical mechanics was made in the first place, mentioning spin glasses:
The idea of applying the Ising model with annealed Gibbs sampling is also present in Douglas Hofstadter's Copycat project:
Similar ideas (with a change of sign in the energy function) are also found in Paul Smolensky's "Harmony Theory".
The explicit analogy drawn with statistical mechanics in the Boltzmann Machine formulation led to the use of terminology borrowed from physics (e.g., "energy" rather than "harmony"), which has become standard in the field. The widespread adoption of this terminology may have been encouraged by the fact that its use led to the importation of a variety of concepts and methods from statistical mechanics. However, there is no reason to think that the various proposals to use simulated annealing for inference described above were not independent. (Helmholtz made a similar analogy during the dawn of psychophysics.)
Ising models are now considered to be a special case of Markov random fields, which find widespread application in various fields, including linguistics, robotics, computer vision, and artificial intelligence.
