Numerical properties of solutions of lasso regression

With Joab Winkler (University of Sheffield)


The determination of a concise model of a linear system when there are fewer samples m than predictors n requires the solution of the equation Ax = b, where A ∈ R^{m×n} and rank(A) = m, such that the solution selected from the infinite set of solutions is sparse, that is, many of its components are zero. This leads to the minimisation with respect to x of f(x, λ) = ||Ax − b||_2^2 + λ||x||_1, where λ is the regularisation parameter. This problem, which is called lasso regression, is more difficult than Tikhonov regularisation because there does not exist a 1-norm matrix decomposition and the 1-norm of a vector or matrix is not a differentiable function.
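As an illustration of how the non-differentiable 1-norm term is handled in practice, the following sketch minimises f(x, λ) by proximal gradient descent (ISTA), in which the 1-norm enters only through the soft-thresholding operator. This is a minimal sketch of one standard solver, not necessarily the method used in the talk; the step-size rule and iteration count are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t*||.||_1: shrinks each component towards zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=5000):
    """Minimise ||Ax - b||_2^2 + lam*||x||_1 by proximal gradient (ISTA).

    A sketch only: the step size is 1/L, where L = 2*||A||_2^2 is the
    Lipschitz constant of the gradient of the smooth term.
    """
    m, n = A.shape
    x = np.zeros(n)
    L = 2.0 * np.linalg.norm(A, 2) ** 2   # Lipschitz constant of grad ||Ax-b||_2^2
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ x - b)    # gradient of the smooth term only
        x = soft_threshold(x - grad / L, lam / L)
    return x
```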
Lasso regression yields a family of solutions x_lasso(λ), and it is necessary to determine the optimal value of λ, that is, the value of λ that balances the fidelity of the model, ||A x_lasso(λ) − b|| ≈ 0, against the satisfaction of the constraint that x_lasso(λ) be sparse. A solution that satisfies both these conditions, that is, for which the objective function and the constraint assume, approximately, their minimum values, is called an optimal sparse solution. In particular, a sparse solution exists for many values of λ, but the restriction of interest to an optimal sparse solution places restrictions on λ, A and b.
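The family x_lasso(λ) can be computed on a grid of regularisation values, as in the sketch below, which uses scikit-learn's lasso_path; note that scikit-learn minimises (1/(2m))||Ax − b||_2^2 + α||x||_1, so its parameter α is a rescaled version of λ. The problem sizes and sparse ground truth are hypothetical, chosen only to show the trade-off between fidelity and sparsity along the path.

```python
import numpy as np
from sklearn.linear_model import lasso_path

# Underdetermined system: m = 20 samples, n = 100 predictors (illustrative sizes).
rng = np.random.default_rng(0)
m, n = 20, 100
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[:5] = rng.standard_normal(5)       # sparse ground truth
b = A @ x_true

# lasso_path returns the solutions x_lasso(alpha) on a grid of alpha values.
alphas, coefs, _ = lasso_path(A, b, n_alphas=50)
for alpha, x in zip(alphas, coefs.T):
    n_nonzero = np.count_nonzero(x)
    residual = np.linalg.norm(A @ x - b)
    print(f"alpha={alpha:.4f}  non-zeros={n_nonzero:3d}  ||Ax-b||={residual:.3e}")
```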
It is shown that there does not exist an optimal sparse solution when the least squares (LS) solution x_ls = A†b = A^T(AA^T)^{-1}b is well conditioned. It is also shown that, by contrast, an optimal sparse solution, that is, one with few non-zero entries, exists if x_ls is ill conditioned. This association between sparsity and numerical condition has been observed in feature selection and in the analysis of fMRI images of the brain. The relationship between the numerical condition of the LS problem and the existence of an optimal sparse solution requires that a refined condition number of x_ls be developed, because it cannot be obtained from the condition number κ(A) of A. This inadequacy of κ(A) follows because κ(A) is a function of A only, whereas x_ls is a function of both A and b.
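The quantities involved are straightforward to compute, as the following sketch shows for the minimum-norm LS solution and κ(A); the refined condition number developed in the talk is not reproduced here, and the matrix sizes are again illustrative assumptions.

```python
import numpy as np

def min_norm_ls(A, b):
    """Minimum 2-norm least squares solution x_ls = A^T (A A^T)^{-1} b
    for a full row rank A with m < n (a sketch; in practice the SVD or
    np.linalg.lstsq is preferable for numerical robustness)."""
    return A.T @ np.linalg.solve(A @ A.T, b)

rng = np.random.default_rng(0)
m, n = 20, 100
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

x_ls = min_norm_ls(A, b)
kappa = np.linalg.cond(A, 2)              # kappa(A) = sigma_max(A) / sigma_min(A)

print(f"||A x_ls - b|| = {np.linalg.norm(A @ x_ls - b):.3e}")
print(f"kappa(A)       = {kappa:.3e}")
# kappa(A) depends on A only; the refined condition number discussed in the
# talk also depends on b, which is why kappa(A) alone is inadequate.
```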
The optimal value of λ is usually computed by cross validation (CV), but this method is problematic because it requires the determination of a shallow minimum of a function. A better method, which requires the computation of the coordinates of the corner of a curve that has the form of the letter L (the L-curve), is described, and examples that demonstrate its superiority over CV are presented.
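A generic corner-finding heuristic gives the flavour of the L-curve approach: trace log||Ax − b||_2 against log||x||_1 over a grid of penalty values and take the point of maximum discrete curvature as the corner. The sketch below implements this heuristic with scikit-learn's Lasso; it is not necessarily the corner-detection method of the talk, and the function name l_curve_corner and the curvature formula are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l_curve_corner(A, b, lambdas):
    """Hypothetical L-curve sketch for lasso: trace (log residual,
    log 1-norm) over a grid of positive penalty values and return the
    value at the point of maximum curvature.  A generic heuristic, not
    necessarily the method described in the talk."""
    res, pen = [], []
    for lam in lambdas:
        # scikit-learn's alpha is a rescaled version of lambda.
        model = Lasso(alpha=lam, fit_intercept=False, max_iter=50_000).fit(A, b)
        x = model.coef_
        res.append(np.linalg.norm(A @ x - b))
        pen.append(np.linalg.norm(x, 1))
    res, pen, lambdas = np.asarray(res), np.asarray(pen), np.asarray(lambdas)
    keep = pen > 0                        # drop all-zero solutions at large penalties
    lambdas, u, v = lambdas[keep], np.log(res[keep]), np.log(pen[keep])
    # Discrete curvature of the parametric curve (u(lambda), v(lambda));
    # the corner of the L is the point of maximum curvature.
    du, dv = np.gradient(u), np.gradient(v)
    d2u, d2v = np.gradient(du), np.gradient(dv)
    curvature = np.nan_to_num(np.abs(du * d2v - dv * d2u) / (du**2 + dv**2) ** 1.5)
    return lambdas[np.argmax(curvature)]
```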
The talk includes examples of well conditioned and ill conditioned solutions x_ls of the LS problem that do, and do not, possess an optimal sparse solution. The examples include the effect of noise on the results obtained by the L-curve and by CV.
