Introduction to Unconstrained Optimization

The Wolfram Language has a collection of commands that do unconstrained optimization (FindMinimum and FindMaximum) and solve nonlinear equations (FindRoot) and nonlinear fitting problems (FindFit). All these functions work, in general, by doing a search, starting at some initial values and taking steps that decrease (or for FindMaximum, increase) an objective or merit function.

The search process for FindMaximum is somewhat analogous to a climber trying to reach a mountain peak in a thick fog; at any given point, basically all that climbers know is their position, how steep the slope is, and the direction of the fall line. One approach is always to go uphill. As long as climbers go uphill steeply enough, they will eventually reach a peak, though it may not be the highest one. Similarly, in a search for a maximum, most methods are ascent methods where every step increases the height and stops when it reaches any peak, whether it is the highest one or not.

The analogy with hill climbing can be reversed to consider descent methods for finding local minima. For the most part, the literature in optimization considers the problem of finding minima, and since this applies to most of the Wolfram Language commands, from here on, this documentation will follow that convention.

For example, the function is not bounded from below, so it has no global minimum, but it has an infinite number of local minima.

This loads a package that contains some utility functions:
This shows a plot of the function x Sin[x+1]:
This shows the details of the steps taken by FindMinimum for the function x Sin[x+1] starting at x=0. The details can be extracted using the property "SolutionData":
FindMinimumPlot gives only the plot by default:

The FindMinimumPlot command is defined in the Optimization`UnconstrainedProblems` package loaded automatically by this notebook. It runs FindMinimum; keeps track of the function, gradient evaluations and steps taken during the search (using the EvaluationMonitor and StepMonitor options); and shows them superimposed on a plot of the function. Steps are indicated with blue lines, function evaluations are shown with green points and gradient evaluations are shown with red points. The minimum found is shown with a large black point. From the plot, it is clear that FindMinimum has found a local minimum point.

This shows the steps taken by FindMinimum for the function x Sin[x+1] starting at x=2:

Starting at 2, FindMinimum heads to different local minima, at which the function is smaller than at the first minimum found.

From these two plots, you might come to the conclusion that if you start at a point where the function is sloping downward, you will always head toward the next minimum in that direction. However, this is not always the case; the steps FindMinimum takes are typically determined using the value of the function and its derivatives, so if the derivative is quite small, FindMinimum may think it has to go quite a long way to find a minimum point.

This shows the steps taken by FindMinimum for the function x Sin[x+1] starting at x=7:

When starting at x=7, which is near a local maximum, the first step is quite large, so FindMinimum returns a completely different local minimum.

All these commands have "find" in their name because, in general, their design is to search to find any point where the desired condition is satisfied. The point found may not be the only one (in the case of roots) or even the best one (in the case of fits, minima, or maxima), or, as you have seen, not even the closest one to the starting condition. In other words, the goal is to find any point at which there is a root or a local maximum or minimum. In contrast, the function NMinimize tries harder to find the global minimum for the function, but NMinimize is also generally given constraints to bound the problem domain. However, there is a price to pay for this generalityNMinimize has to do much more work and, in fact, may call one of the "Find" functions to polish a result at the end of its process, so it generally takes much more time than the "Find" functions.

In two dimensions, the minimization problem is more complicated because both a step direction and step length need to be determined.

This shows the steps taken by FindMinimum to find a local minimum of the function starting at the point :

The FindMinimumPlot command for two dimensions is similar to the one-dimensional case, but it shows the steps and evaluations superimposed on a contour plot of the function. In this example, it is apparent that FindMinimum needed to change direction several times to get to the local minimum. You may notice that the first step starts in the direction of steepest descent (i.e. perpendicular to the contour or parallel to the gradient). Steepest descent is indeed a possible strategy for local minimization, but it often does not converge quickly. In subsequent steps in this example, you may notice that the search direction is not exactly perpendicular to the contours. The search is using information from past steps to try to get information about the curvature of the function, which typically gives it a better direction to go. Another strategy, which usually converges faster but can be more expensive, is to use the second derivative of the function. This is usually referred to as Newton's method.

This shows the steps taken using Newton's method:

In this example, it is clear that the extra information that Newton's method uses about the curvature of the function makes a big difference in how many steps it takes to get to the minimum. Even though Newton's method takes fewer steps, it may take more total execution time since the symbolic Hessian has to be computed once and then evaluated numerically at each step.

Usually, there are tradeoffs between the rate of convergence or total number of steps taken and cost per step. Depending on the size of the problems you want to solve, you may want to pick a particular method to best match that tradeoff for a particular problem. This documentation is intended to help you understand those choices as well as some ways to get the best results from the functions in the Wolfram Language. For the most part, examples will be used to illustrate the ideas, but a limited exposition on the mathematical theory behind the methods will be given so that you can better understand how the examples work.

For the most part, local minimization methods for a function are based on a quadratic model

The subscript refers to the th iterative step. In Newton's method, the model is based on the exact Hessian matrix, , but other methods use approximations to , which are typically less expensive to compute. A trial step is typically computed to be the minimizer of the model, which satisfies the system of linear equations.

If is sufficiently smooth and is sufficiently close to a local minimum, then with , the sequence of steps is guaranteed to converge to the local minimum. However, in a typical search, the starting value is rarely close enough to give the desired convergence. Furthermore, is often an approximation to the actual Hessian and, at the beginning of a search, the approximation is frequently quite inaccurate. Thus, it is necessary to provide additional control to the step sequence to improve the chance and rate of convergence. There are two frequently used methods for controlling the steps: line search and trust region methods.

In a line search method, for each trial step found, a one-dimensional search is done along the direction of so that . You could choose so that it minimizes in this direction, but this is excessive, and with conditions that require that decreases sufficiently in value and slope, convergence for reasonable approximations can be proven. The Wolfram Language uses a formulation of these conditions called the Wolfe conditions.

In a trust region method, a radius within which the quadratic model in equation (1) is "trusted" to be reasonably representative of the function. Then, instead of solving for the unconstrained minimum of (1), the trust region method tries to find the constrained minimum of (1) with . If the are sufficiently close to a minimum and the model is good, then often the minimum lies within the circle, and convergence is quite rapid. However, near the start of a search, the minimum will lie on the boundary, and there are a number of techniques to find an approximate solution to the constrained problem. Once an approximate solution is found, the actual reduction of the function value is compared to the predicted reduction in the function value and, depending on how close the actual value is to the predicted, an adjustment is made for .

For symbolic minimization of a univariate smooth function, all that is necessary is to find a point at which the derivative is zero and the second derivative is positive. In multiple dimensions, this means that the gradient vanishes and the Hessian needs to be positive definite. (If the Hessian is positive semidefinite, the point is a minimizer, but is not necessarily a strict one.) As a numerical algorithm converges, it is necessary to keep track of the convergence and make some judgment as to when a minimum has been approached closely enough. This is based on the sequence of steps taken and the values of the function, its gradient, and possibly its Hessian at these points. Usually, the Wolfram Language Find* functions will issue a message if they cannot be fairly certain that this judgment is correct. However, keep in mind that discontinuous functions or functions with rapid changes of scale can fool any numerical algorithm.

When solving nonlinear equations, many of the same issues arise as when finding a local minimum. In fact, by considering a so-called merit function, which is zero at the root of the equations, it is possible to use many of the same techniques as for minimization, but with the advantage of knowing that the minimum value of the function is 0. It is not always advantageous to use this approach, and there are some methods specialized for nonlinear equations.

Most examples shown will be from one and two dimensions. This is by no means because the Wolfram Language is restricted to computing with such small examples, but because it is much easier to visually illustrate the main principles behind the theory and methods with such examples.