Stochastic approximation algorithms are recursive update rules that can be used, among other things, to solve optimization problems and fixed point equations (including standard linear systems) when the collected data is subject to noise. In engineering, optimization problems are often of this type, when you do not have a mathematical model of the system (which can be too complex) but still would like to optimize its behavior by adjusting certain parameters.
Contents
- RobbinsMonro algorithm
- Complexity results
- Subsequent developments and Polyak Ruppert Averaging
- Application in Stochastic Optimization
- Convergence of the Algorithm
- Example where the stochastic gradient method is appropriate
- Kiefer Wolfowitz algorithm
- Subsequent developments and important issues
- Further developments
- References
For this purpose, you can do experiments or run simulations to evaluate the performance of the system at given values of the parameters. Stochastic approximation algorithms have also been used in the social sciences to describe collective dynamics: fictitious play in learning theory and consensus algorithms can be studied using their theory
Stochastic approximation methods are a family of iterative stochastic optimization algorithms that attempt to find zeroes or extrema of functions which cannot be computed directly, but only estimated via noisy observations. This situation is common, for instance, when taking noisy measurements of empirical data, or when computing parameters of a statistical model.
Mathematically, the goal of these algorithms is to understand properties of a function
which is the expected value of a function depending on a random variable
The earliest, and prototypical, algorithms of this kind are the Robbins-Monro and Kiefer-Wolfowitz algorithms introduced respectively in 1951 and 1952.
Robbins–Monro algorithm
The Robbins–Monro algorithm, introduced in 1951 by Herbert Robbins and Sutton Monro, presented a methodology for solving a root finding problem, where the function is represented as an expected value. Assume that we have a function
Here,
A particular sequence of steps which satisfy these conditions, and was suggested by Robbins–Monro, have the form:
Complexity results
- If
f ( θ ) is twice continuously differentiable, and strongly convex, and the minimizer off ( θ ) belongs to the interior ofΘ , then the Robbins-Monro algorithm will achieve the asymptotically optimal convergence rate, with respect to the objective function, beingE [ f ( θ n ) − f ∗ ] = O ( 1 / n ) , wheref ∗ f ( θ ) overθ ∈ Θ . - Conversely, in the general convex case, where we lack both the assumption of smoothness and strong convexity, Nemirovski and Yudin have shown that the asymptotically optimal convergence rate, with respect to the objective function values, is
O ( 1 / n ) . They have also proven that this rate cannot be improved.
Subsequent developments and Polyak-Ruppert Averaging
While the Robbins-Monro algorithm is theoretically able to achieve
Chung(1954) and Fabian's(1968) showed that we would achieve optimal convergence rate
A1)
Therefore, the sequence
A more general result is given in Chapter 11 of Kushner and Yin by defining interpolated time
With assumption A1) and the following A2)
A2) There is a Hurwitz matrix
satisfied, and define
The success of the averaging idea is because of the time scale separation of the original sequence
Application in Stochastic Optimization
Suppose we want to solve the following stochastic optimization problem
Here
We then define a recursion analogously to Newton's Method in the deterministic algorithm:
Convergence of the Algorithm
The following result gives sufficient conditions on
C1)
C2)
C3)
C4)
C5)
Then
Here are some intuitive explanations about these conditions. Suppose
Example (where the stochastic gradient method is appropriate)
Suppose
Kiefer-Wolfowitz algorithm
The Kiefer-Wolfowitz algorithm, was introduced in 1952, and was motivated by the publication of the Robbins-Monro algorithm. However, the algorithm was presented as a method which would stochastically estimate the maximum of a function. Let
where
A suitable choice of sequences, as recommended by Kiefer and Wolfowitz, would be
Subsequent developments and important issues
- The Kiefer Wolfowitz algorithm requires that for each gradient computation, at least
d + 1 different parameter values must be simulated for every iteration of the algorithm, whered is the dimension of the search space. This means that whend is large, the Kiefer-Wolfowitz algorithm will require substantial computational effort per iteration, leading to slow convergence.- To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension
d .
- To address this problem, Spall proposed the use of simultaneous perturbations to estimate the gradient. This method would require only two simulations per iteration, regardless of the dimension
- In the conditions required for convergence, the ability to specify a predetermined compact set that fulfills strong convexity (or concavity) and contains the unique solution can be difficult to find. With respect to real world applications, if the domain is quite large, these assumptions can be fairly restrictive and highly unrealistic.
Further developments
An extensive theoretical literature has grown up around these algorithms, concerning conditions for convergence, rates of convergence, multivariate and other generalizations, proper choice of step size, possible noise models, and so on. These methods are also applied in control theory, in which case the unknown function which we wish to optimize or find the zero of may vary in time. In this case, the step size
C. Johan Masreliez and R. Douglas Martin were the first to apply stochastic approximation to robust estimation.
The main tool for analyzing stochastic approximations algorithms (including the Robbins-Monro and the Kiefer-Wolfowitz algorithms) is a theorem by Aryeh Dvoretzky published in the proceedings of the third Berkeley symposium on mathematical statistics and probability, 1956.