Introduction to Unconstrained Optimization
Mathematica has a collection of commands that do unconstrained optimization (
FindMinimum and
FindMaximum) and solve nonlinear equations (
FindRoot) and nonlinear fitting problems (
FindFit). All these functions work, in general, by doing a search, starting at some initial values and taking steps that decrease (or for
FindMaximum, increase) an objective or merit function.
The search process for
FindMaximum is somewhat analogous to a climber trying to reach a mountain peak in a thick fog; at any given point, basically all that climber knows is their position, how steep the slope is, and the direction of the fall line. One approach is always to go uphill. As long as climbers go uphill steeply enough, they will eventually reach a peak, though it may not be the highest one. Similarly, in a search for a maximum, most methods are ascent methods where every step increases the height and stops when it reaches any peak, whether it is the highest one or not.
The analogy with hill climbing can be reversed to consider descent methods for finding local minima. For the most part, the literature in optimization considers the problem of finding minima, and since this applies to most of the
Mathematica commands, from here on, this documentation will follow that convention.
For example, the function
x sin (x + 1) is not bounded from below, so it has no global minimum, but it has an infinite number of local minima.
This loads a package that contains some utility functions. 
This shows a plot of the function x Sin[x + 1].
Out[2]=  

The
FindMinimumPlot command is defined in the
Optimization`UnconstrainedProblems` package loaded automatically by this notebook. It runs
FindMinimum, keeps track of the function and gradient evaluations and steps taken during the search (using the
EvaluationMonitor and
StepMonitor options), and shows them superimposed on a plot of the function. Steps are indicated with blue lines, function evaluations are shown with green points, and gradient evaluations are shown with red points. The minimum found is shown with a large black point. From the plot, it is clear that
FindMinimum has found a local minimum point.
Starting at 2,
FindMinimum heads to different local minima, at which the function is smaller than at the first minimum found.
From these two plots, you might come to the conclusion that if you start at a point where the function is sloping downward, you will always head toward the next minimum in that direction. However, this is not always the case; the steps
FindMinimum takes are typically determined using the value of the function and its derivatives, so if the derivative is quite small,
FindMinimum may think it has to go quite a long way to find a minimum point.
When starting at
x = 7, which is near to a local maximum, the first step is quite large, so
FindMinimum returns a completely different local minimum.
All these commands have "find" in their name because, in general, their design is to search to find any point where the desired condition is satisfied. The point found may not be the only one (in the case of roots) or even the best one (in the case of fits, minima, or maxima), or, as you have seen, not even the closest one to the starting condition. In other words, the goal is to find any point at which there is a root or a local maximum or minimum. In contrast, the function
NMinimize tries harder to find the global minimum for the function, but
NMinimize is also generally given constraints to bound the problem domain. However, there is a price to pay for this generality;
NMinimize has to do much more work and, in fact, may call one of the
"Find" functions to polish a result at the end of its process, so it generally takes much more time than the
"Find" functions.
In two dimensions, the minimization problem is more complicated because both a step direction and step length need to be determined.
This shows the steps taken by FindMinimum to find a local minimum of the function cos (x^{2} 3y) + sin (x^{2}+y^{2}) starting at the point {x, y}={1, 1}.
Out[6]=  

The
FindMinimumPlot command for two dimensions is similar to the onedimensional case, but it shows the steps and evaluations superimposed on a contour plot of the function. In this example, it is apparent that
FindMinimum needed to change directions several times to get to the local minimum. You may notice that the first step starts in the direction of steepest descent (i.e., perpendicular to the contour or parallel to the gradient). Steepest descent is indeed a possible strategy for local minimization, but it often does not converge quickly. In subsequent steps in this example, you may notice that the search direction is not exactly perpendicular to the contours. The search is using information from past steps to try to get information about the curvature of the function, which typically gives it a better direction to go. Another strategy, which usually converges faster, but can be more expensive, is to use the second derivative of the function. This is usually referred to as
Newton's method.
This shows the steps taken using Newton's method.
Out[7]=  

In this example, it is clear that the extra information, that
Newton's method uses about the curvature of the function, makes a big difference in how many steps it takes to get to the minimum. Even though Newton's method takes fewer steps, it may take more total execution time since the symbolic Hessian has to be computed once and then evaluated numerically at each step.
Usually there are tradeoffs between the rate of convergence or total number of steps taken and cost per step. Depending on the size of the problems you want to solve, you may want to pick a particular method to best match that tradeoff for a particular problem. This documentation is intended to help you understand those choices as well as some ways to get the best results from the functions in
Mathematica. For the most part, examples will be used to illustrate the ideas, but a limited exposition on the mathematical theory behind the methods will be given so that you can better understand how the examples work.
For the most part, local minimization methods for a function
f are based on a quadratic model
The subscript
k refers to the
k^{th} iterative step. In Newton's method, the model is based on the exact Hessian matrix,
B_{k} =
^{2}f (x_{k}) , but other methods use approximations to
^{2}f (x_{k}), which are typically less expensive to compute. A trial step
s_{k} is typically computed to be the minimizer of the model, which satisfies the system of linear equations.
If
f is sufficiently smooth and
x_{k} is sufficiently close to a local minimum, then with
B_{k}=
^{2}f (x_{k}), the sequence of steps
x_{k+1} = s_{k} + x_{k} is guaranteed to converge to the local minimum. However, in a typical search, the starting value is rarely close enough to give the desired convergence. Furthermore,
B_{k} is often an approximation to the actual Hessian and, at the beginning of a search, the approximation is frequently quite inaccurate. Thus, it is necessary to provide additional control to the step sequence to improve the chance and rate of convergence. There are two frequently used methods for controlling the steps: line search and trust region methods.
In a "
line search" method, for each trial step
s_{k} found, a onedimensional search is done along the direction of
s_{k} so that
x_{k+1} = x_{k} + _{k} s_{k}. You could choose
_{k} so that it minimizes
f (x_{k+1}) in this direction, but this is excessive, and with conditions that require that
f (x_{k+1}) decreases sufficiently in value and slope, convergence for reasonable approximations
B_{k} can be proven.
Mathematica uses a formulation of these conditions called the Wolfe conditions.
In a "
trust region" method, a radius
_{k} within which the quadratic model
q_{k} (p) in equation (
1) is "trusted" to be reasonably representative of the function. Then, instead of solving for the unconstrained minimum of (
1), the trust region method tries to find the constrained minimum of (
1) with
p≤_{k}. If the
x_{k} are sufficiently close to a minimum and the model is good, then often the minimum lies within the circle, and convergence is quite rapid. However, near the start of a search, the minimum will lie on the boundary, and there are a number of techniques to find an approximate solution to the constrained problem. Once an approximate solution is found, the actual reduction of the function value is compared to the predicted reduction in the function value and, depending on how close the actual is to the predicted, an adjustment is made for
_{k+1}.
For symbolic minimization of a univariate smooth function, all that is necessary is to find a point at which the derivative is zero and the second derivative is positive. In multiple dimensions, this means that the gradient vanishes and the Hessian needs to be positive definite. (If the Hessian is positive semidefinite, the point is a minimizer, but is not necessarily a strict one.) As a numerical algorithm converges, it is necessary to keep track of the convergence and make some judgment as to when a minimum has been approached closely enough. This is based on the sequence of steps taken and the values of the function, its gradient, and possibly its Hessian at these points. Usually, the
Mathematica Find... functions will issue a message if they cannot be fairly certain that this judgment is correct. However, keep in mind that discontinuous functions or functions with rapid changes of scale can fool any numerical algorithm.
When solving "
nonlinear equations", many of the same issues arise as when finding a "
local minimum". In fact, by considering a socalled merit function, which is zero at the root of the equations, it is possible to use many of the same techniques as for minimization, but with the advantage of knowing that the minimum value of the function is 0. It is not always advantageous to use this approach, and there are some methods specialized for nonlinear equations.
Examples shown will be from one and twodimensions for the most part. This is by no means because
Mathematica is restricted to computing with such small examples, but because it is much easier to visually illustrate the main principles behind the theory and methods with such examples.