Vanishing gradient problem

In machine learning, the vanishing gradient problem is a difficulty encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, each of the neural network's weights receives an update proportional to the gradient of the error function with respect to the current weight in each iteration of training. Traditional activation functions such as the hyperbolic tangent have derivatives in the range (0, 1], and backpropagation computes gradients by the chain rule. This has the effect of multiplying n of these small numbers together to compute gradients of the "front" layers in an n-layer network, meaning that the gradient (error signal) decreases exponentially with n and the front layers train very slowly.
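
To see the effect numerically, here is a minimal sketch (illustrative only; the 20-layer tanh network, its width, and the weight scale are arbitrary choices, not taken from any reference):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, width = 20, 50                     # arbitrary depth and layer width
x = rng.standard_normal(width)

# Forward pass through a deep tanh network, keeping each layer's output.
weights = [rng.standard_normal((width, width)) * 0.1 for _ in range(n_layers)]
activations = [x]
for W in weights:
    activations.append(np.tanh(W @ activations[-1]))

# Backward pass: the chain rule multiplies one small factor per layer,
# so the gradient norm shrinks roughly exponentially with depth.
grad = np.ones(width)                        # gradient of some scalar loss w.r.t. the last layer
for W, a in zip(reversed(weights), reversed(activations[1:])):
    grad = W.T @ (grad * (1.0 - a ** 2))     # tanh'(z) = 1 - tanh(z)^2
    print(f"gradient norm: {np.linalg.norm(grad):.3e}")
```

Running this prints a gradient norm that collapses toward zero long before the signal reaches the first layers.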

With the advent of the back-propagation algorithm in the 1970s, many researchers tried to train supervised deep artificial neural networks from scratch, initially with little success. Sepp Hochreiter's diploma thesis of 1991 formally identified the reason for this failure as the "vanishing gradient problem", which affects not only many-layered feedforward networks but also recurrent neural networks. The latter are trained by unfolding them into very deep feedforward networks, in which a new layer is created for each time step of an input sequence processed by the network.
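
The same shrinkage appears when a recurrent network is unrolled: the error signal is pushed back through one copy of the shared recurrent weights per time step. A hedged sketch with arbitrary sizes:

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_in, n_hid = 30, 8, 32                        # sequence length and sizes (arbitrary)
W_h = rng.standard_normal((n_hid, n_hid)) * 0.1   # recurrent weights, shared across steps
W_x = rng.standard_normal((n_hid, n_in)) * 0.1
xs = rng.standard_normal((T, n_in))

# Unrolling: one "layer" per time step, all sharing the same weights.
h, states = np.zeros(n_hid), []
for x_t in xs:
    h = np.tanh(W_h @ h + W_x @ x_t)
    states.append(h)

# Pushing an error signal back through the unrolled steps shrinks it,
# just as in a deep feedforward network.
grad = np.ones(n_hid)                             # dL/dh at the final step, for some scalar loss
for k, h_t in enumerate(reversed(states), start=1):
    grad = W_h.T @ (grad * (1.0 - h_t ** 2))
    print(f"{k:2d} steps back: gradient norm = {np.linalg.norm(grad):.3e}")
```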

When activation functions are used whose derivatives can take on larger values (greater than 1), one instead risks encountering the related exploding gradient problem.

Multi-level hierarchy

To overcome this problem, several methods were proposed. One is Jürgen Schmidhuber's multi-level hierarchy of networks (1992), pre-trained one level at a time through unsupervised learning and fine-tuned through backpropagation. Here each level learns a compressed representation of the observations that is fed to the next level.

Similar ideas have been used in feed-forward neural networks for unsupervised pre-training, structuring a neural network so that it first learns generally useful feature detectors. The network is then trained further by supervised back-propagation to classify labeled data. The deep belief network model of Hinton et al. (2006) learns the distribution of a high-level representation using successive layers of binary or real-valued latent variables, with a restricted Boltzmann machine modelling each new layer of higher-level features. Each new layer guarantees an increase in the lower bound on the log likelihood of the data, so the model improves if trained properly. Once sufficiently many layers have been learned, the deep architecture can be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top-level feature activations. Hinton reported that his models are effective feature extractors for high-dimensional, structured data. This work played a key role in reviving interest in deep neural network research and consequently led to the development of deep learning, although deep belief networks are no longer the main deep learning technique.
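
A compact sketch of this greedy, layer-by-layer idea, assuming binary inputs, CD-1 contrastive-divergence updates, and toy hyperparameters chosen only for illustration (not Hinton's exact training setup):

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, lr=0.1, epochs=20):
    """One restricted Boltzmann machine trained with CD-1 (hyperparameters arbitrary)."""
    n_visible = data.shape[1]
    W = rng.standard_normal((n_visible, n_hidden)) * 0.01
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W + b_h)                      # hidden probabilities given data
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + b_v)                    # one-step reconstruction
        p_h1 = sigmoid(p_v1 @ W + b_h)
        W   += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(data)
        b_v += lr * (v0 - p_v1).mean(axis=0)
        b_h += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_v, b_h

# Greedy stacking: each new RBM models the hidden activities of the one below.
data = (rng.random((500, 64)) < 0.3).astype(float)        # toy binary data
layers, x = [], data
for n_hidden in (32, 16):
    W, b_v, b_h = train_rbm(x, n_hidden)
    layers.append((W, b_v, b_h))
    x = sigmoid(x @ W + b_h)                              # representation fed to the next level
```

Each layer is trained without any error signal having to travel through the layers below it, which is what sidesteps the vanishing gradient during pre-training.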

Long short-term memory

Another method, used in particular for recurrent neural networks, is the long short-term memory (LSTM) network of 1997 by Hochreiter and Schmidhuber. In 2009, deep multidimensional LSTM networks demonstrated the power of deep learning with many nonlinear layers by winning three ICDAR 2009 competitions in connected handwriting recognition, without any prior knowledge about the three different languages to be learned.
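
A minimal sketch of a single LSTM step in the standard formulation (sizes arbitrary, biases omitted for brevity), showing the additive cell-state update that lets the error signal pass through many steps without being repeatedly squashed:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hid = 10, 20                                  # arbitrary sizes
# One weight matrix per gate, applied to [x_t, h_prev] (biases omitted for brevity).
Wf, Wi, Wo, Wg = (rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(Wf @ z)                               # forget gate
    i = sigmoid(Wi @ z)                               # input gate
    o = sigmoid(Wo @ z)                               # output gate
    g = np.tanh(Wg @ z)                               # candidate cell update
    c = f * c_prev + i * g                            # additive cell-state update: gradients can
                                                      # flow through c without being squashed
    h = o * np.tanh(c)
    return h, c

# Run over a toy sequence.
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((30, n_in)):
    h, c = lstm_step(x_t, h, c)
```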

Faster hardware

Hardware advances have meant that from 1991 to 2015, computer power (especially as delivered by GPUs) has increased around a million-fold, making standard backpropagation feasible for networks several layers deeper than when the vanishing gradient problem was recognized. Schmidhuber notes that this "is basically what is winning many of the image recognition competitions now", but that it "does not really overcome the problem in a fundamental way".

Residual networks

One of the newest and most effective ways to address the vanishing gradient problem is the residual neural network, or ResNet (not to be confused with recurrent neural networks). It had been observed before ResNets that a deeper network would often have higher training error than a comparable shallower network. Intuitively, information was disappearing or being mangled as it passed through too many layers: whatever a shallow layer had computed was getting lost or distorted by the deeper layers, yielding a worse result. Following this intuition, researchers at Microsoft Research found that splitting a deep network into chunks of a few layers each and adding an identity "skip" connection, so that each chunk's input is passed straight through and added to its output, eliminated much of this degradation. Each chunk then only needs to learn a residual function F(x), its output being F(x) + x, and the identity path also gives gradients a direct route back to earlier layers. No extra parameters or changes to the learning algorithm were needed. These deeper residual networks achieved lower training error (and test error) than their shallower counterparts simply by reintroducing the outputs of earlier layers to compensate for the lost or mangled signal. It is an example of a simple observation leading to a simple solution, one that turned out to have a profound effect even though it is not yet fully understood.
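
A minimal sketch of the skip connection, assuming a toy fully connected block rather than the convolutional blocks used in practice:

```python
import numpy as np

rng = np.random.default_rng(4)
width = 64                                            # arbitrary feature width

def residual_block(x, W1, W2):
    """y = x + F(x): the block only has to learn the residual F, and
    dy/dx = I + dF/dx, so the identity path carries signal and gradient directly."""
    f = np.maximum(0.0, W1 @ x)                       # first layer + ReLU
    f = W2 @ f                                        # second layer: the residual F(x)
    return x + f                                      # skip connection adds the input back

# Stack several blocks; even with tiny weights the input is never "lost".
x = rng.standard_normal(width)
for _ in range(10):
    W1 = rng.standard_normal((width, width)) * 0.01
    W2 = rng.standard_normal((width, width)) * 0.01
    x = residual_block(x, W1, W2)
```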

Other

Sven Behnke relied only on the sign of the gradient (Rprop) when training his Neural Abstraction Pyramid to solve problems like image reconstruction and face localization.
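
For illustration, a sketch of a sign-based (Rprop-style) update; the constants are typical published defaults, and the simplified variant shown here omits weight backtracking:

```python
import numpy as np

def rprop_update(w, grad, prev_grad, step,
                 eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One Rprop step: only the *sign* of the gradient is used, so the update
    size cannot vanish just because the gradient magnitude does."""
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, step

# Usage sketch: even with vanishingly small gradients, the step size stays useful.
w, step, prev_grad = np.zeros(5), np.full(5, 0.1), np.zeros(5)
for grad in np.random.default_rng(5).standard_normal((3, 5)) * 1e-8:
    w, step = rprop_update(w, grad, prev_grad, step)
    prev_grad = grad
```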

Neural networks can also be optimized with a universal search algorithm over the space of the network's weights, for example by random guessing or, more systematically, by a genetic algorithm. Such approaches are not based on gradients at all and therefore avoid the vanishing gradient problem entirely.
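
As an illustration of the gradient-free idea, a (1+1)-style random search (hill climbing) over the weights of a toy network; the task, network size, and perturbation scale are arbitrary, and a genetic algorithm would add a population and crossover:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy task: fit y = sin(x) with a tiny tanh network, using pure random search
# over the 48 weights -- no gradients are ever computed.
X = np.linspace(-3, 3, 200)
y = np.sin(X)

def loss(theta):
    W1, b1, W2 = theta[:16], theta[16:32], theta[32:48]
    hidden = np.tanh(np.outer(X, W1) + b1)             # (200, 16) hidden activations
    pred = hidden @ W2
    return np.mean((pred - y) ** 2)

best = rng.standard_normal(48)
best_loss = loss(best)
for _ in range(5000):
    candidate = best + rng.standard_normal(48) * 0.1   # random perturbation of the weights
    cand_loss = loss(candidate)
    if cand_loss < best_loss:                          # keep the candidate only if it improves
        best, best_loss = candidate, cand_loss
```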
