Multivariable CalculusOptimization

Multivariable optimization problems are ubiquitous in applied math and data science, because a common path to achieving desirable results for real-world problems is to specify a value which captures some notion of badness and use optimization methods to make it as small as possible. Let's begin with a concrete example of this pattern.

The line of best fit

Suppose we want to use a line to describe the relationship between the -values and -values for a collection of points in the -plane. Some lines clearly do a very poor job of aligning with the points, while others are better. Try adjusting the sliders below to see how the resulting graph changes.

y=${m} x+${b}

To identify which line which does the "best" job, we need to quantify how well a given line fits the points. The most common way to do this is to measure the sum of the squares of the vertical distances from each point to the line. We seek to make that sum of squared distances as small as possible.

If the points are , then the value we're trying to minimize is

The letter is used in this context in reference to the word loss, which is a synonym for badness.

Here's the key insight: at any point where is a nonzero vector, there will be directions in which the pair may be perturbed to increase 's value (specifically, any directions which make an angle with ) and perturbation directions which decrease the value of (specifically, any directions which make an angle with ). Therefore, at any point where has a minimum, we must have .

Taking the partial derivatives of and setting them equal to zero, we get the system of equations

We have two equations and two unknowns ( and ), so we can solve using standard techniques for systems of linear equations. For example, solving the second equation for gives

and substituting into the first equation and solving for yields

We can substitute back into the equation for to find the only pair at which is equal to the zero vector. If we trust our intuition that at least one minimizing pair must exist, then this is the one. In fact, this intuition is correct, and the formula above is indeed the formula for the line of best fit.

Exercise
Run the code below to confirm visually that the formula we derived above fits the sampled points reasonably well. (Note that you have to run the cell twice to get the plot to show.)

import numpy as np
import matplotlib.pyplot as plt
n = 10
x = np.random.random_sample(n)
y = 2 * x + 3 + 0.2 * np.random.standard_normal(n)
m = ((np.sum(x*y) - np.sum(x)*np.sum(y)/n) /
     (np.sum(x**2) - np.sum(x)**2/n))
b = (np.sum(y) - m * np.sum(x))/n
yfit = m*x + b
plt.scatter(x,y)
plt.plot(x,yfit)

Critical points

Let's consider the problem of optimizing a differentiable function from a subset of to .

The graph of the function XEQUATIONX4188XEQUATIONX over the disk XEQUATIONX4189XEQUATIONX.

As we observed in the previous section, if is nonzero at a point away from the boundary of , then there are directions in which decreases as you move away from that point and other directions where it increases. Therefore, any minima or maxima of away from the boundary of must occur at points where the gradient of is zero. We call points where the gradient of is zero or where is non-differentiable critical points of . If a function has a minimum or a maximum at a point, then either that point is a critical point, or it is on the of the domain of the function.

Now let's consider the possibility that has a maximum or minimum on the boundary of its domain . For simplicity, we'll assume that is a subset of the plane. Because of the directional derivative formula, if is nonzero, then there are 180 degrees worth of directions in which may be nudged to increase the value of . Therefore, if reaches its maximum value at a boundary point, then either the gradient of is zero at that point, or else all of the directions in which increases are excluded by the constraint that is in . If has a tangent line at such a point, the latter possibility requires that the tangent line be to the gradient of .

If the gradient of is not perpendicular to the tangent line at a boundary point, then there are permissible directions of increase and does not have a maximum there.

If the boundary of is specified as a level set of a function , then being perpendicular to the tangent line is the same as being parallel to the gradient of . Two vectors are parallel if one is a multiple of the other, so we arrive at the equation

for some . The variable is called a Lagrange multiplier.

Although we framed the discussion assuming that the domain is a subset of the plane, the same analysis works in any dimension, and the resulting equation is still .

The extreme value theorem confirms the intuition that a continuous function defined on a closed and bounded subset of must have a maximum somewhere and a minimum somewhere. Let's combine all of these points into a single theorem statement:

Theorem (Extreme value theorem and Lagrange multipliers)
Suppose that is a continuous function defined on a closed and bounded subset of . Then

realizes an absolute maximum and absolute minimum on (the extreme value theorem)
any point where realizes an extremum is either a critical point—meaning that or is non-differentiable at that point—or at a point on the boundary.
if realizes an extremum at a point on a portion of the boundary which is the level set of a differentiable function with non-vanishing gradient , then either is non-differentiable at that point or the equation is satisfied at that point, for some .

This theorem suggests a program for optimizing a function on a domain whose boundary is a level set of a function .

Identify any points where
- ,
- is undefined,
- the boundary of is not smooth, or
- for some
Calculate at all of these points and select the largest and smallest values.

Exercise
Find the maximum and minimum values of on the rectangle .

Solution. The system of equations tells us that and . Substituting, we find , which can be rearranged and factored (by repeated applications of difference of squares) to get

The only value of in the interior of the rectangle which satisfies this equation is . Substituting into , we find the critical point in the interior of the rectangle. The value of the function this point is .

Along the bottom edge of the rectangle, the value of the function at is which ranges monotonically from 0 to 81. Along the top edge, we have . This has a critical point at .

Along the left edge, the function is equal to , which has no interior critical points. And finally, along the right edge, we have , which has a critical point at .

So all together, checking critical points and corners gives

Therefore, the absolute maximum is and the absolute minimum is .

Exercise
Find the point closest to the origin in the region .

Solution. We can rule out the possibility that the point nearest the origin is in the interior of the region, on geometric grounds. To find the boundary critical points, we set the objective function

and the constraint function describing the boundary of the given set. Solving the Lagrange equation

together with the constraint equation , we get , which implies . So we get a single critical point , and since there's only one, the minimum must occur here.

We can also use the Lagrange multiplier idea in the reverse direction. Let's look at an example from statistics where Lagrange multipliers give rise to a meaningful insight.

Lasso and ridge regression

Consider the problem of fitting a plane of the form to a set of points in . As we did for the line of best fit above, we will look for the plane which minimizes the sum of squared distances in the -direction from the points to the plane.

Sometimes this approach gives undesirable results because the resulting values for and are considered too large for a given application. If we want to force them to be smaller, we can add a term to our objective function which penalizes largeness of and . We choose some positive value and propose the objective function

This is called the ridge objective function. Alternatively, we could use

This is called the lasso objective function.

It turns out in practice that the lasso minimizer frequently has either or equal to zero, while the ridge minimizer almost never does. Zero coefficients can be helpful, because if then we are expressing a simpler relationship between just the -value and -value of each point (and similarly if is zero instead). Being able to ignore some of the coordinates is especially helpful if we have hundreds of coordinates rather than just two, as is often the case in statistical applications.

Example
Use the method of Lagrange multipliers to explain why the lasso minimizer tends to have solutions on the coordinate axes in the plane, while the ridge minimizer does not.

Solution. Let's define and . The function is minimized at a point where . In other words, the minimizer must be a point where and are vectors.

By the method of Lagrange multipliers, this is the same equation we get if consider the optimization problem of minimizing subject to the constraint that , where is any constant. Therefore, the ridge minimizer will occur at a point where has a minimum on some set of the form . Since is a disk in the -plane, we would expect the typical solution to be at a point on the boundary of the circle where neither nor is zero.

Applying the same analysis to the lasso minimizer, we get an optimization problem on the squared-shaped set . Since the square has corners which always contribute candidate points for minimization of (since there is no tangent line at those points), it's reasonable to expect the lasso minimizer to occur at these points.

Ridge. The point of tangency is where is minimized subject to the constraint XEQUATIONX4190XEQUATIONX. This point is typically not on a coordinate axis.

Lasso. The lowest level line which intersects the square does so at one of the square's corners.

Exercises

Exercise
Many airlines require that the sum of the length, width and height of a checked baggage cannot exceed 62 inches. Find the dimensions of the rectangular baggage that has the greatest possible volume under this regulation.

Solution. The objective function is the volume formula for a box, namely . The constraint equation is . The Lagrange equation tells us that

Since we also have the equation , we have four equations and four variables. Multiplying the three equations above, we get , which means . Dividing this equation by each of the original three equations gives . Substituting into the last equation, we find . Thus the maximum-volume luggage shape is a cube.

Change Language

Sign in to Mathigon

Share

Reset Progress

Glossary

Multivariable CalculusOptimization

The line of best fit

Critical points

Lasso and ridge regression

Exercises