Select one of the keywords on the left…

Multivariable CalculusOptimization

Reading time: ~45 min

Multivariable optimization problems are ubiquitous in applied math and data science, because a common path to achieving desirable results for real-world problems is to specify a value which captures some notion of badness and use optimization methods to make it as small as possible. Let's begin with a concrete example of this pattern.

The line of best fit

Suppose we want to use a line to describe the relationship between the x-values and y-values for a collection of points in the xy-plane. Some lines clearly do a very poor job of aligning with the points, while others are better. Try adjusting the sliders below to see how the resulting graph changes.

y=${m} x+${b}

To identify which line which does the "best" job, we need to quantify how well a given line fits the points. The most common way to do this is to measure the sum of the squares of the vertical distances from each point to the line. We seek to make that sum of squared distances as small as possible.

If the points are (x_1, y_1), \ldots, (y_n, y_n), then the value we're trying to minimize is

\begin{align*}L(m,b) = (mx_1 + b - y_1)^2 + \cdots + (mx_n + b - y_n)^2.\end{align*}

The letter L is used in this context in reference to the word loss, which is a synonym for badness.

Here's the key insight: at any point (m,b) where \nabla L(m,b) is a nonzero vector, there will be directions in which the pair (m,b) may be perturbed to increase L's value (specifically, any directions which make an angle with \nabla L(m,b)) and perturbation directions which decrease the value of L (specifically, any directions which make an angle with \nabla L(m,b)). Therefore, at any point where L has a minimum, we must have \nabla L = 0.

Taking the partial derivatives of L and setting them equal to zero, we get the system of equations

\begin{align*}2x_1(mx_1 + b - y_1) + \cdots + 2x_n(mx_n + b - y_n) &= 0 \\\ 2(mx_1 + b - y_1) + \cdots + 2(mx_n + b - y_n) &= 0\end{align*}

We have two equations and two unknowns (m and b), so we can solve using standard techniques for systems of linear equations. For example, solving the second equation for b gives

\begin{align*}b = \frac{1}{n}(y_1 + \cdots + y_n) - \frac{m}{n}(x_1 +\cdots + x_n),\end{align*}

and substituting into the first equation and solving for m yields

\begin{align*}m = \frac{(x_1y_1 + \cdots + x_ny_n) - \frac{1}{n}(x_1 + \cdots + x_n)(y_1 + \cdots + y_n)}{(x_1^2 + \cdots + x_n^2) - \frac{1}{n}(x_1 + \cdots + x_n)^2}\end{align*}

We can substitute back into the equation for b to find the only (m,b) pair at which \nabla L(m,b) is equal to the zero vector. If we trust our intuition that at least one minimizing (m,b) pair must exist, then this is the one. In fact, this intuition is correct, and the formula above is indeed the formula for the line of best fit.

Run the code below to confirm visually that the formula we derived above fits the sampled points reasonably well. (Note that you have to run the cell twice to get the plot to show.)

import numpy as np
import matplotlib.pyplot as plt
n = 10
x = np.random.random_sample(n)
y = 2 * x + 3 + 0.2 * np.random.standard_normal(n)
m = ((np.sum(x*y) - np.sum(x)*np.sum(y)/n) /
     (np.sum(x**2) - np.sum(x)**2/n))
b = (np.sum(y) - m * np.sum(x))/n
yfit = m*x + b

Critical points

Let's consider the problem of optimizing a differentiable function f from a subset D of \mathbb{R}^d to \mathbb{R}.

The graph of the function f(x,y) = -x^2 - y^2 + x + \frac{2}{3} y + \frac{23}{36} over the disk (x-\frac{1}{2})^2 + (y-\frac{1}{2})^2 \leq \frac{1}{4}.

As we observed in the previous section, if \nabla f is nonzero at a point away from the boundary of D, then there are directions in which f decreases as you move away from that point and other directions where it increases. Therefore, any minima or maxima of f away from the boundary of D must occur at points where the gradient of f is zero. We call points where the gradient of f is zero or where f is non-differentiable critical points of f. If a function has a minimum or a maximum at a point, then either that point is a critical point, or it is on the of the domain of the function.

Now let's consider the possibility that f has a maximum or minimum on the boundary of its domain D. For simplicity, we'll assume that D is a subset of the plane. Because of the directional derivative formula, if \nabla f is nonzero, then there are 180 degrees worth of directions in which (x,y) may be nudged to increase the value of f(x,y). Therefore, if f reaches its maximum value at a boundary point, then either the gradient of f is zero at that point, or else all of the directions in which f increases are excluded by the constraint that (x,y) is in D. If D has a tangent line at such a point, the latter possibility requires that the tangent line be to the gradient of f.

If the gradient of f is not perpendicular to the tangent line at a boundary point, then there are permissible directions of increase and f does not have a maximum there.

If the boundary of D is specified as a level set of a function g, then being perpendicular to the tangent line is the same as being parallel to the gradient of g. Two vectors are parallel if one is a multiple of the other, so we arrive at the equation

\begin{align*}\nabla f = \lambda \nabla g,\end{align*}

for some \lambda \in \mathbb{R}. The variable \lambda is called a Lagrange multiplier.

Although we framed the discussion assuming that the domain D is a subset of the plane, the same analysis works in any dimension, and the resulting equation is still \nabla f = \lambda \nabla g.

The extreme value theorem confirms the intuition that a continuous function defined on a closed and bounded subset D of \mathbb{R}^n must have a maximum somewhere and a minimum somewhere. Let's combine all of these points into a single theorem statement:

Theorem (Extreme value theorem and Lagrange multipliers)
Suppose that f is a continuous function defined on a closed and bounded subset D of \mathbb{R}^n. Then

  • f realizes an absolute maximum and absolute minimum on D (the extreme value theorem)
  • any point where f realizes an extremum is either a critical point—meaning that \nabla f = 0 or f is non-differentiable at that point—or at a point on the boundary.
  • if f realizes an extremum at a point on a portion of the boundary which is the level set of a differentiable function g with non-vanishing gradient \nabla g, then either f is non-differentiable at that point or the equation \nabla f = \lambda \nabla g is satisfied at that point, for some \lambda \in \mathbb{R}.

This theorem suggests a program for optimizing a function f on a domain D whose boundary is a level set of a function g.

  • Identify any points where
    • \nabla f = 0,
    • \nabla f is undefined,
    • the boundary of D is not smooth, or
    • \nabla f = \lambda \nabla g for some \lambda \in \mathbb{R}
  • Calculate f at all of these points and select the largest and smallest values.

Find the maximum and minimum values of f(x,y) = x^4 + y^4 - 4xy on the rectangle [0,3]\times [0,2].

Solution. The system of equations \frac{\partial}{\partial x} f = \frac{\partial}{\partial y} f = 0 tells us that y = x^3 and x = y^3. Substituting, we find x = x^9, which can be rearranged and factored (by repeated applications of difference of squares) to get

\begin{align*}x(x-1)(x+1)(x^2 + 1)(x^4+1) = 0. \end{align*}

The only value of x in the interior of the rectangle which satisfies this equation is x = 1. Substituting into y=x^3, we find the critical point (1,1) in the interior of the rectangle. The value of the function this point is -2.

Along the bottom edge of the rectangle, the value of the function at (x,0) is x^4 which ranges monotonically from 0 to 81. Along the top edge, we have f(x,2) = x^4 - 8x + 16. This has a critical point at x = \sqrt[3]{2}.

Along the left edge, the function is equal to y^4, which has no interior critical points. And finally, along the right edge, we have f(3,y) = y^4 - 12y + 81, which has a critical point at \sqrt[3]{3}.

So all together, checking critical points and corners gives

\begin{align*}\begin{array}{c|c} (x,y) & f(x,y) \\ \hline (1,1) & -2 \\\ (0,0) & 0 \\\ (3,0) & 81 \\\ (0,2) & 16 \\\ (3,2) & 73 \\\ (3,\sqrt[3]{3}) & 81 - 9\sqrt[3]{3} \approx 68.02 \\\ (\sqrt[3]{2},2) & 16 - 6\sqrt[3]{2} \approx 8.44 \end{array}\end{align*}

Therefore, the absolute maximum is \boxed{81} and the absolute minimum is \boxed{-2}.

Find the point closest to the origin in the region 3x + 2y + z \geq 6.

Solution. We can rule out the possibility that the point nearest the origin is in the interior of the region, on geometric grounds. To find the boundary critical points, we set the objective function

\begin{align*}f(x,y,z) = x^2 + y^2 + z^2\end{align*}

and the constraint function g(x,y,z) = 3x + 2y + z describing the boundary of the given set. Solving the Lagrange equation

\begin{align*}[2x, 2y, 2z] = \lambda [3, 2, 1],\end{align*}

together with the constraint equation 3x + 2y + z = 6, we get \frac{\lambda}{2}(3(3) + 2(2) + 1(1)) = 6, which implies \lambda = \frac{6}{7}. So we get a single critical point \left(\frac{9}{7},\frac{6}{7}, \frac{3}{7}\right), and since there's only one, the minimum must occur here.

We can also use the Lagrange multiplier idea in the reverse direction. Let's look at an example from statistics where Lagrange multipliers give rise to a meaningful insight.

Lasso and ridge regression

Consider the problem of fitting a plane of the form z = ax + by to a set of points (x_1, y_1, z_1), \ldots, (x_n, y_n, z_n) in \mathbb{R}^3. As we did for the line of best fit above, we will look for the plane which minimizes the sum of squared distances in the z-direction from the points to the plane.

Sometimes this approach gives undesirable results because the resulting values for a and b are considered too large for a given application. If we want to force them to be smaller, we can add a term to our objective function which penalizes largeness of a and b. We choose some positive value \lambda and propose the objective function

\begin{align*}(ax_1 + by_1 - z_1)^2 + \cdots + (ax_n + by_n - z_n)^2 + \lambda(a^2 + b^2)\end{align*}

This is called the ridge objective function. Alternatively, we could use

\begin{align*}(ax_1 + by_1 - z_1)^2 + \cdots + (ax_n + by_n - z_n)^2 + \lambda(|a| + |b|)\end{align*}

This is called the lasso objective function.

It turns out in practice that the lasso minimizer frequently has either a or b equal to zero, while the ridge minimizer almost never does. Zero coefficients can be helpful, because if a = 0 then we are expressing a simpler relationship between just the y-value and z-value of each point (and similarly if b is zero instead). Being able to ignore some of the coordinates is especially helpful if we have hundreds of coordinates rather than just two, as is often the case in statistical applications.

Use the method of Lagrange multipliers to explain why the lasso minimizer tends to have solutions on the coordinate axes in the (a,b) plane, while the ridge minimizer does not.

Solution. Let's define f(a,b) = (ax_1 + by_1 - z_1)^2 + \cdots + (ax_n + by_n - z_n)^2 and g(a,b) = a^2 + b^2. The function f(a,b) + \lambda g(a,b) is minimized at a point where \nabla f(a,b) + \lambda \nabla g(a,b) = 0. In other words, the minimizer must be a point where \nabla f and \nabla g are vectors.

By the method of Lagrange multipliers, this is the same equation we get if consider the optimization problem of minimizing f(a,b) subject to the constraint that g(a,b) \leq c, where c is any constant. Therefore, the ridge minimizer will occur at a point where f has a minimum on some set of the form g(a,b) \leq c. Since a^2 + b^2 \leq c is a disk in the ab-plane, we would expect the typical solution to be at a point on the boundary of the circle where neither a nor b is zero.

Applying the same analysis to the lasso minimizer, we get an optimization problem on the squared-shaped set |a| + |b| \leq c. Since the square has corners which always contribute candidate points for minimization of f (since there is no tangent line at those points), it's reasonable to expect the lasso minimizer to occur at these points.

Ridge. The point of tangency is where f(a,b) is minimized subject to the constraint a^2 + b^2 \leq c. This point is typically not on a coordinate axis.

Lasso. The lowest level line which intersects the square does so at one of the square's corners.


Many airlines require that the sum of the length, width and height of a checked baggage cannot exceed 62 inches. Find the dimensions of the rectangular baggage that has the greatest possible volume under this regulation.

Solution. The objective function is the volume formula for a box, namely f(l,w,h) = lwh. The constraint equation is g(l,w,h) = lw + wh + hl = 62. The Lagrange equation tells us that

\begin{align*}wh &= \lambda \\\ lh &= \lambda \\\ lw &= \lambda\end{align*}

Since we also have the equation lw + wh + hl = 62, we have four equations and four variables. Multiplying the three equations above, we get (whl)^2=\lambda^3, which means whl = \lambda^{3/2}. Dividing this equation by each of the original three equations gives w = h = l = \lambda^{1/2}. Substituting into the last equation, we find w = h = l = 62/3. Thus the maximum-volume luggage shape is a cube.

Bruno Bruno