Proximal gradient (forward backward splitting) methods for learning is an area of research in optimization and statistical learning theory which studies algorithms for a general class of convex regularization problems where the regularization penalty may not be differentiable. One such example is $\ell_1$ regularization (also known as lasso) of the form

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|_1, \qquad x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}.$$
Proximal gradient methods offer a general framework for solving regularization problems from statistical learning theory with penalties that are tailored to a specific problem application. Such customized penalties can help to induce certain structure in problem solutions, such as sparsity (in the case of lasso) or group structure (in the case of group lasso).
Relevant background
Proximal gradient methods are applicable in a wide variety of scenarios for solving convex optimization problems of the form

$$\min_{x \in \mathcal{H}} F(x) + R(x),$$

where $F$ is convex and differentiable with Lipschitz continuous gradient, $R$ is a convex, lower semicontinuous function which is possibly nondifferentiable, and $\mathcal{H}$ is some set, typically a Hilbert space. In the convex, differentiable setting the usual optimality criterion $\nabla (F + R)(x) = 0$ is now replaced by

$$0 \in \partial (F + R)(x),$$

where $\partial \varphi$ denotes the subdifferential of a real-valued, convex function $\varphi$.

Given a convex function $\varphi : \mathcal{H} \to \mathbb{R}$, an important operator to consider is its proximity operator $\operatorname{prox}_{\varphi} : \mathcal{H} \to \mathcal{H}$, defined by

$$\operatorname{prox}_{\varphi}(u) = \operatorname*{arg\,min}_{x \in \mathcal{H}} \varphi(x) + \frac{1}{2}\|u - x\|_2^2,$$

which is well-defined because of the strict convexity of the squared $\ell_2$ norm. The proximity operator can be seen as a generalization of a projection. Moreover, $x^*$ minimizes $F(x) + R(x)$ if and only if it satisfies the fixed point equation

$$x^* = \operatorname{prox}_{\gamma R}\bigl(x^* - \gamma \nabla F(x^*)\bigr)$$

for any $\gamma > 0$.
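As a small illustration of the definition, the function $\varphi(x) = \frac{\gamma}{2}\|x\|_2^2$ has the closed-form proximity operator $\operatorname{prox}_{\varphi}(u) = u/(1+\gamma)$, which follows from the first-order optimality condition. The following sketch (a hedged NumPy/SciPy check on a hypothetical test vector, not part of the original exposition) compares this closed form against direct numerical minimization of the defining objective.

```python
import numpy as np
from scipy.optimize import minimize

gamma = 0.7
u = np.array([1.5, -2.0, 0.3])            # hypothetical input vector

# phi(x) = (gamma / 2) * ||x||^2, a simple convex function with a known prox
def phi(x):
    return 0.5 * gamma * np.dot(x, x)

# prox_phi(u) = argmin_x  phi(x) + 0.5 * ||u - x||^2
def prox_objective(x):
    return phi(x) + 0.5 * np.sum((u - x) ** 2)

numerical = minimize(prox_objective, x0=np.zeros_like(u)).x
closed_form = u / (1.0 + gamma)           # solves gamma*x + (x - u) = 0

print(numerical)       # approximately [ 0.882 -1.176  0.176]
print(closed_form)
```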
Moreau decomposition
One important technique related to proximal gradient methods is the Moreau decomposition, which decomposes the identity operator as the sum of two proximity operators. Namely, let $\varphi : \mathcal{X} \to \mathbb{R}$ be a lower semicontinuous, convex function on a vector space $\mathcal{X}$. Its Fenchel conjugate $\varphi^* : \mathcal{X} \to \mathbb{R}$ is defined by

$$\varphi^*(u) = \sup_{x \in \mathcal{X}} \langle x, u \rangle - \varphi(x).$$

The general form of Moreau's decomposition states that for any $x \in \mathcal{X}$ and any $\gamma > 0$,

$$x = \operatorname{prox}_{\gamma \varphi}(x) + \gamma \operatorname{prox}_{\varphi^*/\gamma}(x/\gamma),$$

which for $\gamma = 1$ implies that $x = \operatorname{prox}_{\varphi}(x) + \operatorname{prox}_{\varphi^*}(x)$. The Moreau decomposition can be seen as a generalization of the usual orthogonal decomposition of a vector space, analogous with the fact that proximity operators are generalizations of projections.

In certain situations it may be easier to compute the proximity operator for the conjugate $\varphi^*$ instead of the function $\varphi$, and therefore the Moreau decomposition can be applied. This is the case for group lasso.
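To make the decomposition concrete, consider $\varphi(x) = \|x\|_1$ with $\gamma = 1$. Its Fenchel conjugate is the indicator function of the $\ell_\infty$ unit ball, so $\operatorname{prox}_{\varphi^*}$ is the projection onto that ball (componentwise clipping to $[-1, 1]$), while $\operatorname{prox}_{\varphi}$ is soft thresholding. The sketch below, a NumPy illustration on an arbitrary vector rather than an example from the original text, checks numerically that $x = \operatorname{prox}_{\varphi}(x) + \operatorname{prox}_{\varphi^*}(x)$.

```python
import numpy as np

def prox_l1(x, gamma=1.0):
    # Soft thresholding: proximity operator of gamma * ||x||_1
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def prox_l1_conjugate(x):
    # The conjugate of ||.||_1 is the indicator of the l-infinity unit ball,
    # so its proximity operator is the Euclidean projection onto that ball.
    return np.clip(x, -1.0, 1.0)

x = np.array([2.5, -0.4, 1.0, -3.1, 0.0])     # arbitrary test vector
decomposed = prox_l1(x) + prox_l1_conjugate(x)

print(np.allclose(x, decomposed))   # True: Moreau decomposition with gamma = 1
```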
Lasso regularization
Consider the regularized empirical risk minimization problem with square loss and with the $\ell_1$ norm as the regularization penalty:

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|_1,$$

where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. This $\ell_1$ regularization problem is referred to as lasso (least absolute shrinkage and selection operator). It is of interest because it induces sparse solutions, that is, solutions $w$ with relatively few nonzero components. Lasso can be seen as a convex relaxation of the non-convex problem

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda \|w\|_0,$$

where $\|w\|_0$ denotes the $\ell_0$ "norm", the number of nonzero entries of the vector $w$. Sparse solutions are of particular interest in learning theory for interpretability of results: a sparse solution can identify a small number of important factors.
Solving for the $\ell_1$ proximity operator
For simplicity we restrict our attention to the problem where $\lambda = 1$. To solve the problem

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \|w\|_1,$$

we consider our objective function in two parts: a convex, differentiable term $F(w) = \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2$ and a convex, non-differentiable term $R(w) = \|w\|_1$.

Let us compute the proximity operator for $R(w)$. First we find an alternative characterization of the proximity operator $\operatorname{prox}_R(x)$ as follows:

$$u = \operatorname{prox}_R(x) \iff 0 \in \partial\Bigl(R(u) + \tfrac{1}{2}\|u - x\|_2^2\Bigr) \iff 0 \in \partial R(u) + u - x \iff x - u \in \partial R(u).$$

For $R(w) = \|w\|_1$ the subdifferential is given componentwise by

$$\bigl(\partial \|w\|_1\bigr)_i = \begin{cases} 1, & w_i > 0 \\ -1, & w_i < 0 \\ [-1, 1], & w_i = 0. \end{cases}$$

Using the recharacterization of the proximity operator given above, for the choice of $R(w) = \|w\|_1$ and $\gamma > 0$ we have that $\operatorname{prox}_{\gamma R}(x)$ is given componentwise by

$$\bigl(\operatorname{prox}_{\gamma R}(x)\bigr)_i = \begin{cases} x_i - \gamma, & x_i > \gamma \\ 0, & |x_i| \le \gamma \\ x_i + \gamma, & x_i < -\gamma, \end{cases}$$

which is known as the soft thresholding operator $S_{\gamma}(x) = \operatorname{prox}_{\gamma \|\cdot\|_1}(x)$.
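As a concrete illustration, the componentwise soft thresholding operator can be written in a few lines of NumPy. This is a minimal sketch of the operator derived above, applied to an arbitrary example vector.

```python
import numpy as np

def soft_threshold(x, gamma):
    """Soft thresholding S_gamma: the proximity operator of gamma * ||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

x = np.array([3.0, 0.2, -0.5, -2.0])
print(soft_threshold(x, gamma=1.0))   # [ 2.  0. -0. -1.]
```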
Fixed point iterative schemes
To finally solve the lasso problem we consider the fixed point equation shown earlier:

$$x^* = \operatorname{prox}_{\gamma R}\bigl(x^* - \gamma \nabla F(x^*)\bigr).$$

Given that we have computed the form of the proximity operator explicitly, we can define a standard fixed point iteration procedure. Namely, fix some initial $w^0 \in \mathbb{R}^d$, and for $k = 0, 1, 2, \ldots$ define

$$w^{k+1} = S_{\gamma}\bigl(w^k - \gamma \nabla F(w^k)\bigr).$$

Note here the effective trade-off between the empirical error term $F(w)$ and the regularization penalty $R(w)$: the iteration decouples the two convex terms of the objective into a gradient descent step on $F$ followed by a soft thresholding step.

Convergence of this fixed point scheme is well-studied in the literature and is guaranteed under an appropriate choice of step size $\gamma$ and loss function (such as the square loss taken here). Accelerated variants, which improve the convergence rate under additional regularity assumptions on $F$, have also been studied extensively.
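The iteration above is commonly known as the iterative soft thresholding algorithm (ISTA). The following sketch runs it on the lasso problem with square loss; the synthetic data, the step size choice $\gamma = 1/L$ with $L$ the Lipschitz constant of $\nabla F$, and the fixed iteration count are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def soft_threshold(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Proximal gradient (ISTA) for  min_w (1/n)||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    # F(w) = (1/n)||y - Xw||^2 has Lipschitz-continuous gradient with constant (2/n)||X||_2^2
    L = 2.0 / n * np.linalg.norm(X, 2) ** 2
    gamma = 1.0 / L
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)                  # gradient step on F
        w = soft_threshold(w - gamma * grad, gamma * lam)   # prox step on lam * ||.||_1
    return w

# Synthetic example: a sparse ground truth recovered approximately by lasso
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(lasso_ista(X, y, lam=0.1).round(2))
```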
Practical considerations
There have been numerous developments within the past decade in convex optimization techniques which have influenced the application of proximal gradient methods in statistical learning theory. Here we survey a few important topics which can greatly improve practical algorithmic performance of these methods.
Adaptive step size
In the fixed point iteration scheme

$$w^{k+1} = \operatorname{prox}_{\gamma R}\bigl(w^k - \gamma \nabla F(w^k)\bigr),$$

one can allow a variable step size $\gamma_k$ instead of a constant $\gamma$. Numerous adaptive step size schemes have been proposed in the literature, and in practice they can substantially reduce the number of iterations required for fixed point convergence.
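One common adaptive scheme, shown here only as a hedged sketch and not as a method singled out by the text, is a backtracking line search: start from a tentative step size and shrink it until a sufficient-decrease condition, relating $F$ at the candidate point to its quadratic model, holds. The function and data names below are hypothetical.

```python
import numpy as np

def soft_threshold(x, gamma):
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def backtracking_prox_step(w, F, grad_F, lam, gamma0=1.0, beta=0.5):
    """One proximal gradient step with a backtracking choice of the step size gamma.

    The step size is shrunk by `beta` until the quadratic upper bound on F at the
    candidate point holds (a standard sufficient-decrease criterion).
    """
    g = grad_F(w)
    gamma = gamma0
    while True:
        p = soft_threshold(w - gamma * g, gamma * lam)   # candidate prox step
        diff = p - w
        if F(p) <= F(w) + g @ diff + (0.5 / gamma) * (diff @ diff):
            return p, gamma
        gamma *= beta                                    # step too large: backtrack

# Example usage on a small least squares problem (illustrative data)
rng = np.random.default_rng(1)
X, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
F = lambda w: np.mean((y - X @ w) ** 2)
grad_F = lambda w: 2.0 / len(y) * X.T @ (X @ w - y)
w_next, gamma_used = backtracking_prox_step(np.zeros(10), F, grad_F, lam=0.1)
```

The accepted step size can be reused as the starting guess at the next iteration, so the scheme adapts across iterations rather than fixing a single $\gamma$ in advance.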
Elastic net (mixed norm regularization)
Elastic net regularization offers an alternative to pure $\ell_1$ regularization. The lasso penalty $R(w) = \|w\|_1$ is not strictly convex, so solutions to $\min_w F(w) + R(w)$, where $F$ is some empirical loss function, need not be unique. This is often avoided by including an additional strictly convex term, such as an $\ell_2$ norm penalty. For example, one can consider the problem

$$\min_{w \in \mathbb{R}^d} \frac{1}{n} \sum_{i=1}^n (y_i - \langle w, x_i \rangle)^2 + \lambda \bigl(\|w\|_1 + \mu \|w\|_2^2\bigr),$$

where $x_i \in \mathbb{R}^d$ and $y_i \in \mathbb{R}$. For $\mu > 0$ the penalty term $\lambda(\|w\|_1 + \mu \|w\|_2^2)$ is strictly convex, and hence the minimization problem admits a unique solution. It has been observed that for sufficiently small $\mu > 0$ the additional term acts as a preconditioner that can substantially improve convergence while not adversely affecting the sparsity of solutions.
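The same componentwise argument used for the lasso proximity operator yields a closed form for the elastic net penalty: soft thresholding followed by a uniform shrinkage by $1/(1 + 2\gamma\lambda\mu)$. The sketch below is derived from that observation under the problem formulation above; it is an illustration, not a formula quoted from the text.

```python
import numpy as np

def prox_elastic_net(v, gamma, lam, mu):
    """Proximity operator of  gamma * lam * (||w||_1 + mu * ||w||_2^2).

    Componentwise: soft thresholding at level gamma*lam, then shrinkage
    by the factor 1 / (1 + 2 * gamma * lam * mu).
    """
    soft = np.sign(v) * np.maximum(np.abs(v) - gamma * lam, 0.0)
    return soft / (1.0 + 2.0 * gamma * lam * mu)

v = np.array([2.0, -0.3, 0.8])
print(prox_elastic_net(v, gamma=1.0, lam=0.5, mu=0.1))
```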
Exploiting group structure
Proximal gradient methods provide a general framework which is applicable to a wide variety of problems in statistical learning theory. Problems in learning often involve data with additional structure that is known a priori. In the past several years there have been new developments which incorporate information about group structure to provide methods tailored to different applications. Here we survey a few such methods.
Group lasso
Group lasso is a generalization of the lasso method to the case when features are grouped into disjoint blocks. Suppose the features are grouped into blocks $\{w_1, \ldots, w_G\}$. Here we take as a regularization penalty

$$R(w) = \sum_{g=1}^G \|w_g\|_2,$$

which is the sum of the $\ell_2$ norms over the feature groups. A similar proximity operator analysis as above can be used to compute the proximity operator for this penalty. Where the lasso penalty has a proximity operator which is soft thresholding on each individual component, the proximity operator for the group lasso is soft thresholding on each group: for the penalty $\lambda\gamma \sum_{g} \|w_g\|_2$ it is given blockwise by

$$\bigl(\operatorname{prox}_{\lambda\gamma R}(w)\bigr)_g = \begin{cases} w_g - \lambda\gamma \dfrac{w_g}{\|w_g\|_2}, & \|w_g\|_2 > \lambda\gamma \\ 0, & \|w_g\|_2 \le \lambda\gamma, \end{cases}$$

where $w_g$ denotes the $g$th block of components of $w$.
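The group soft thresholding operator above translates directly into code: each block is either zeroed out or shrunk toward zero along its own direction. The sketch below assumes groups are given as lists of index arrays, a hypothetical representation chosen for illustration.

```python
import numpy as np

def prox_group_lasso(w, groups, threshold):
    """Group soft thresholding: prox of  threshold * sum_g ||w_g||_2.

    `groups` is a list of index arrays partitioning the coordinates of w,
    and `threshold` plays the role of lam * gamma.
    """
    out = np.zeros_like(w)
    for g in groups:
        norm_g = np.linalg.norm(w[g])
        if norm_g > threshold:
            # Shrink the whole block toward zero along its own direction
            out[g] = w[g] * (1.0 - threshold / norm_g)
        # otherwise the block is thresholded to zero
    return out

w = np.array([3.0, 4.0, 0.1, -0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
print(prox_group_lasso(w, groups, threshold=1.0))   # first group shrunk, second zeroed
```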
In contrast to lasso, the derivation of the proximity operator for group lasso relies on the Moreau decomposition. Here the proximity operator of the conjugate of the group lasso penalty becomes a projection onto the ball of a dual norm.
Other group structures
In contrast to the group lasso problem, where features are grouped into disjoint blocks, it may be the case that grouped features are overlapping or have a nested structure. Such generalizations of group lasso have been considered in a variety of contexts. For overlapping groups one common approach is known as latent group lasso which introduces latent variables to account for overlap. Nested group structures are studied in hierarchical structure prediction and with directed acyclic graphs.
