There are many variants of quasi-Newton methods. In all of them, the idea is to base the matrix Bk
in the quadratic model
on an approximation of the Hessian matrix built up from the function and gradient values from some or all steps previously taken.
This loads a package that contains some utility functions
This shows a plot of the steps taken by the quasi-Newton method. The path is much less direct than for Newton's method. The quasi-Newton method is used by default by FindMinimum
for problems that are not sums of squares.
The first thing to notice about the path taken in this example is that it starts in the wrong direction. This direction is chosen because at the first step all the method has to go by is the gradient, and so it takes the direction of steepest descent. However, in subsequent steps, it incorporates information from the values of the function and gradient at the steps taken to build up an approximate model of the Hessian.
The methods used by Mathematica
are the Broyden-Fletcher-Goldfarb-Shanno (BFGS) updates and, for large systems, the limited memory BFGS (L-BFGS) methods, in which the model Bk
is not stored explicitly, but rather Bk-1 f (xk)
is calculated by gradients and step directions stored from past steps.
The BFGS method is implemented such that instead of forming the model Hessian Bk
at each step, Cholesky factors Lk
such that Lk.LkT
are computed so that only O (n2)
operations are needed to solve the system Bksk = - f (xk)
] for a problem with n
For large-scale sparse problems, the BFGS method can be problematic because, in general, the Cholesky factors (or the Hessian approximation Bk
or its inverse) are dense, so the O (n2)
memory and operations requirements become prohibitive compared to algorithms that take advantage of sparseness. The L-BFGS algorithm [NW99
] forms an approximation to the inverse Hessian based on a the last m
past steps, which are stored. The Hessian approximation may not be as complete, but the memory and order of operations are limited to O (n m)
for a problem with n
variables. In Mathematica
5, for problems over 250 variables, the algorithm is switched automatically to L-BFGS. You can control this with the method option "StepMemory"->m
. With m=
, the full BFGS method will always be used. Choosing an appropriate value of m
is a tradeoff between speed of convergence and the work done per step. With m<3
, you are most likely better off using a "conjugate gradient
This shows the same example function with the minimum computed using L-BFGS with m=5
Quasi-Newton methods are chosen as the default in Mathematica
because they are typically quite fast and do not require computation of the Hessian matrix, which can be quite expensive both in terms of the symbolic computation and numerical evaluation. With an adequate "line search
", they can be shown to converge superlinearly [NW99
] to a local minimum where the Hessian is positive definite. This means that
or, in other words, the steps keep getting smaller. However, for very high precision, this does not compare to the q
-quadratic convergence rate of Newton's
This shows the number of steps and function evaluations required to find the minimum to high precision for the problem shown.
Newton's method is able to find ten times as many digits with far fewer steps because of its quadratic convergence rate. However, the convergence with the quasi-Newton method is still superlinear since the ratio of the errors is clearly going to zero.
This makes a plot showing the ratios of the errors in the computation. The ratios of the errors are shown on a logarithmic scale so that the trend can clearly be seen over a large range of magnitudes.
The following table summarizes the options you can use with quasi-Newton methods.
|"StepMemory"||Automatic||the effective number of steps to "remember" in the Hessian approximation, can be a positive integer or Automatic|
|"StepControl"||"LineSearch"||how to control steps, can be "LineSearch" or None|
Method options for Method->"QuasiNewton".