Nonlinear Conjugate Gradient Methods
The basis for a nonlinear conjugate gradient method is to effectively apply the linear conjugate gradient method, with the residual replaced by the gradient. A model quadratic function is never explicitly formed, so such a method is always combined with a "line search" method.
The first nonlinear conjugate gradient method was proposed by Fletcher and Reeves as follows. Given a step direction p_{k}, use the line search to find α_{k} such that x_{k+1} = x_{k} + α_{k} p_{k}. Then compute the new step direction

p_{k+1} = -∇f(x_{k+1}) + β_{k+1} p_{k},   β_{k+1} = (∇f(x_{k+1})·∇f(x_{k+1})) / (∇f(x_{k})·∇f(x_{k})).   (2)

It is essential that the line search for choosing α_{k} satisfies the strong Wolfe conditions; this is necessary to ensure that the directions p_{k} are descent directions [NW99].
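As an illustrative sketch, and in Python rather than the Wolfram Language (this is not Mathematica's internal code), the Fletcher-Reeves update of equation (2) can be written as:

```python
# Sketch of the Fletcher-Reeves direction update, equation (2).
# grad_new = gradient at x_{k+1}, grad_old = gradient at x_k, p_old = p_k.
def fletcher_reeves_direction(grad_new, grad_old, p_old):
    # beta_{k+1} = (grad_new . grad_new) / (grad_old . grad_old)
    beta = sum(g * g for g in grad_new) / sum(g * g for g in grad_old)
    # p_{k+1} = -grad(x_{k+1}) + beta_{k+1} * p_k
    return [-g + beta * p for g, p in zip(grad_new, p_old)]
```

Note that β_{k+1} is always nonnegative here, since it is a ratio of squared gradient norms.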
An alternate method, which generally (but not always) works better in practice, is that of Polak and Ribiere, where equation (2) is replaced with

β_{k+1} = (∇f(x_{k+1})·(∇f(x_{k+1}) - ∇f(x_{k}))) / (∇f(x_{k})·∇f(x_{k})).   (3)

In formula (3), it is possible for β_{k+1} to become negative, in which case Mathematica uses the algorithm modified by taking p_{k+1} = -∇f(x_{k+1}) + max(β_{k+1}, 0) p_{k}. In
Mathematica, the default conjugate gradient method is Polak-Ribiere, but the Fletcher-Reeves method can be chosen by using the method option
Method→{"ConjugateGradient", Method→"FletcherReeves"}
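The Polak-Ribiere update with the max(β_{k+1}, 0) modification can be sketched the same way (again an illustrative Python re-implementation, not Mathematica's internal code):

```python
# Sketch of the Polak-Ribiere direction update, formula (3), with the
# max(beta, 0) modification described above.
def polak_ribiere_direction(grad_new, grad_old, p_old):
    # beta_{k+1} = grad_new . (grad_new - grad_old) / (grad_old . grad_old)
    beta = sum(gn * (gn - go) for gn, go in zip(grad_new, grad_old)) \
        / sum(go * go for go in grad_old)
    # A negative beta is clipped to 0, which reduces the update to a
    # steepest descent step.
    beta = max(beta, 0.0)
    return [-g + beta * p for g, p in zip(grad_new, p_old)]
```

When successive gradients are nearly identical (little progress), the numerator in formula (3) is small, so Polak-Ribiere automatically damps the contribution of the old direction; this is one reason it often behaves better than Fletcher-Reeves.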
The advantage of conjugate gradient methods is that they use relatively little memory for large-scale problems and require no numerical linear algebra, so each step is quite fast. The disadvantage is that they typically converge much more slowly than Newton or quasi-Newton methods. Also, steps are typically poorly scaled for length, so the "line search" algorithm may require more iterations each time to find an acceptable step.
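To make the overall structure concrete, here is a minimal Polak-Ribiere loop in Python with a crude backtracking (Armijo) line search. This is a sketch only: a real implementation, including Mathematica's, would use a line search enforcing the strong Wolfe conditions, as noted above.

```python
def norm_sq(v):
    return sum(x * x for x in v)

def cg_minimize(f, grad, x, iters=200):
    """Minimal Polak-Ribiere conjugate gradient sketch.

    f: callable returning a scalar; grad: callable returning a list;
    x: starting point as a list of floats.
    """
    g = grad(x)
    p = [-gi for gi in g]          # initial direction: steepest descent
    for _ in range(iters):
        if norm_sq(g) < 1e-18:     # gradient essentially zero: done
            break
        gp = sum(gi * pi for gi, pi in zip(g, p))
        if gp >= 0:                # safeguard: not a descent direction,
            p = [-gi for gi in g]  # so restart with steepest descent
            gp = -norm_sq(g)
        # Backtracking (Armijo) line search -- a crude stand-in for a
        # strong Wolfe line search.
        alpha, fx = 1.0, f(x)
        while f([xi + alpha * pi for xi, pi in zip(x, p)]) > fx + 1e-4 * alpha * gp:
            alpha *= 0.5
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        g_new = grad(x)
        # Polak-Ribiere beta, clipped at zero as in formula (3) modified.
        beta = max(sum(gn * (gn - go) for gn, go in zip(g_new, g))
                   / norm_sq(g), 0.0)
        p = [-gn + beta * pi for gn, pi in zip(g_new, p)]
        g = g_new
    return x
```

Note that the only linear algebra is dot products and vector additions, which is why each step is cheap and the memory footprint is a handful of vectors of length n.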
This loads a package that contains some utility functions. 
This shows a plot of the steps taken by the nonlinear conjugate gradient method. The path is much less direct than for Newton's method.
One issue that arises with nonlinear conjugate gradient methods is when to restart them. As the search moves, the nature of the local quadratic approximation to the function may change substantially. The local convergence of the method depends on that of the linear conjugate gradient method, where the quadratic function is constant. With a constant quadratic function for
n variables and an exact line search, the linear algorithm will converge in
n or fewer iterations. By restarting (taking a steepest descent step with β_{k+1} = 0) every so often, it is possible to eliminate information from previous points, which may not be relevant to the local quadratic model at the current search point. If you look carefully at the example, you can see where the method was restarted and a steepest descent step was taken. One option is to simply restart after every
k iterations, where
k <= n. You can specify this using the method option
"RestartIterations"→k. An alternative is to restart when consecutive gradients are not sufficiently orthogonal, according to the test

|∇f(x_{k+1})·∇f(x_{k})| / (∇f(x_{k+1})·∇f(x_{k+1})) ≥ ν,

with a threshold ν between 0 and 1. You can specify this using the method option "RestartThreshold"→ν.
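The orthogonality test can be sketched as follows (a Powell-style restart criterion in illustrative Python; the exact normalization Mathematica uses internally is not shown in this text, so take the formula as an assumption consistent with a threshold ν between 0 and 1):

```python
# Sketch of a restart test on consecutive gradients: restart when the
# gradients are far from orthogonal relative to threshold nu.
def should_restart(grad_new, grad_old, nu=0.1):
    # |grad_new . grad_old| >= nu * (grad_new . grad_new)
    dot = abs(sum(gn * go for gn, go in zip(grad_new, grad_old)))
    return dot >= nu * sum(gn * gn for gn in grad_new)
```

For a fixed quadratic model with exact line searches, consecutive gradients are exactly orthogonal, so this test fires only when the local quadratic approximation has changed appreciably.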
The table summarizes the options you can use with the conjugate gradient methods.

"Method"  "PolakRibiere"  nonlinear conjugate gradient method, which can be "PolakRibiere" or "FletcherReeves"
"RestartThreshold"  1/10  threshold ν for gradient orthogonality below which a restart will be done
"RestartIterations"  ∞  number of iterations after which to restart
"StepControl"  "LineSearch"  must be "LineSearch", but can be used to specify line search methods

Method options for Method→"ConjugateGradient".
It should be noted that the default method for FindMinimum in Mathematica 4 was a conjugate gradient method with a nearly exact line search. This has been maintained for legacy reasons and can be accessed by using the FindMinimum option Method→"Gradient". Typically, this will use more function and gradient evaluations than the newer Method→"ConjugateGradient", which itself often uses far more than the methods Mathematica currently uses as defaults.