Select one of the keywords on the left…

Linear AlgebraSingular Value Decomposition

Reading time: ~60 min

In this section we will develop one of the most powerful ideas in linear algebra: the singular value decomposition. The first step on this journey is the polar decomposition.

Polar decomposition

The Gram matrix of a square matrix A is a useful tool for understanding the behavior of A. Let's define the matrix \sqrt{A' A} to be V \Lambda^{1/2} V', where V \Lambda V' is the diagonalization of A' A and \Lambda^{1/2} is the matrix obtained by taking the square root of the diagonal entries of \Lambda. Then \sqrt{A' A} is and satisfies

\begin{align*}\sqrt{A' A}\sqrt{A' A} = V \Lambda^{1/2}V' V\Lambda^{1/2}V'= A' A,\end{align*}

as befits the notation \sqrt{A'A}.

The matrix \sqrt{A' A} is simpler to understand than A because it is symmetric and positive semidefinite, yet it transforms space very similarly to A: if \mathbf{x} \in \mathbb{R}^n, then

\begin{align*}|A\mathbf{x}|^2 = \mathbf{x}' A' A \mathbf{x} = \mathbf{x}' \sqrt{A' A} \sqrt{A' A} \mathbf{x} = |\sqrt{A' A}\,\mathbf{x}|^2.\end{align*}

In other words, for all \mathbf{x}, the images of \mathbf{x} under A and under \sqrt{A' A} have equal norm. If points are the same distance from the origin, then one may be mapped to the other by a about the origin.

Therefore, for each \mathbf{x} \in \mathbb{R}^n, there is an orthogonal transformation from the range of \sqrt{A' A} to the range of A which sends \sqrt{A' A}\mathbf{x} to A\mathbf{x}.

The grid-line images under A and \sqrt{A' A} have the same shape; they are related by an orthogonal transformation.

It can be shown that the orthogonal transformation mapping A\mathbf{x} to \sqrt{A' A}\mathbf{x} is the same transformation for every \mathbf{x}. Furthermore, even if the range of \sqrt{A'A} is not all of \mathbb{R}^n, we can extend this orthogonal transformation to an orthogonal transformation on \mathbb{R}^n by arbitrarily mapping vectors in a basis of the orthogonal complement of the range of \sqrt{A'A} to the orthogonal complement of the range of A. In this way, we obtain the polar decomposition:

Theorem (Polar Decomposition)
For any n \times n matrix A, there exists an orthogonal matrix R such that

\begin{align*}A = R\sqrt{A' A}.\end{align*}

This representation is useful because it represents an arbitrary square matrix as a product of matrices whose properties are easier to understand (the orthogonal matrix because it is distance- and angle-preserving, and the positive-definite matrix \sqrt{A' A} because it is orthogonally diagonalizable, by the Spectral Theorem.

Let's explore a fast method of computing a polar decomposition A=R\sqrt{A' A}. This method actually works by calculating R and then recovering \sqrt{A' A} as R^{-1}A (since this is computationally faster than calculating the matrix square root). We call R the orthogonal part of A and \sqrt{A' A} the symmetric part of A.

We set R_{0} = A and define the iteration

\begin{align*}R_{k+1} = \frac{R_{k} + (R_{k}')^{-1}}{2}\end{align*}

Let's see why this converges to R.

  1. Defining P = \sqrt{A' A} and using the equation A = RP, show that R_{1} = \frac{A + (A')^{-1}}{2} = R\left(\frac{P + P^{-1}}{2}\right). Use the prior step to explain why the R_k's all have the same orthogonal parts and have symmetric parts converging to the identity matrix.
    Hint: consider the eigendecompositions of the symmetric parts. You may assume that the sequence defined by x_{k+1} = \frac{1}{2}(x_k+1/x_k) converges to 1 regardless of the starting value x_0>0.

  2. Write some code to apply this algorithm to the matrix

import numpy as np
A = np.array([[1,3,4],[7,-2,5],[-3,4,11]])
A = [1 3 4; 7 -2 5; -3 4 11]

and confirm that the resulting matrices R and P satisfy R' R = I and P^2 = A' A.

Solution. Since both conjugation and inversion reverse the order of matrix products, we get

\begin{align*}R_{1} = \frac{RP + ((RP)')^{-1}}{2} = \frac{RP + (R^*)^{-1}(P')^{-1}}{2}\end{align*}

Since R is orthogonal, (R^*)^{-1} = R, as R' R = I. Since P is symmetric, P' = P. So this is equal to

\begin{align*}R_{1} = \frac{RP + RP^{-1}}{2} = R \left(\frac{P + P^{-1}}{2}\right)\end{align*}

We see that the R_0 = A and R_1 have the same orthogonal part, and repeating the calculation shows that all the R_{k} have the same orthogonal part. As for the symmetric parts P_{k}, we see that

\begin{align*}P_{k+1} = \frac{P_{k} + P_{k}^{-1}}{2}\end{align*}

Let's see why this averaging process converges to the identity matrix. Symmetric matrices are diagonalizable by the Spectral Theorem, so suppose P diagonalizes as V \Lambda V^{-1}. Then

\begin{align*}\frac{1}{2}(P + P^{-1}) = V\left(\frac{1}{2}\Lambda + \frac{1}{2}\Lambda^{-1}\right)\mathbf{v}\end{align*}

Thus the P_{k}'s converge to the matrix V \Lambda_\infty V^{-1}, where \Lambda_\infty is the diagonal matrix whose (i,i) th entry is the limit obtained when you start with \Lambda_{i,i} and repeatedly apply the function x \mapsto \frac{1}{2}\left(x + \frac{1}{x}\right). By the fact about this iteration given in the problem statement, we conclude that \Lambda_\infty is the identity matrix. Therefore, the limit of P_k as k\to\infty is equal to V I V^{-1} = I.

For example:

import numpy as np

def polar(A,n):
    R = A
    for i in range(n):
        R = (R + np.linalg.inv(R.T))/2
    return R, np.linalg.solve(R, A)

I = np.eye(3)
A = np.array([[1, 3, 4],[7, -2, 5], [-3, 4, 11]])
R, P = polar(A,100)
R.T @ R - I, P @ P - A.T @ A
using LinearAlgebra

function polar(A,n)
    R = A
    for i=1:n
        R = (R + inv(R'))/2
    R, R \ A

A = [1 3 4; 7 -2 5; -3 4 11]
R, P = polar(A,100)
R'*R - I, P^2 - A'*A

Both of the matrices returned on the last line have entries which are within 3\times 10^{-14} of zero.

Show that the product of two matrices with orthonormal columns has orthonormal columns.

Solution. If U' U = I and V' V = I, then (UV)' UV = V' U' UV = V' V = I.

The singular value decomposition

The polar decomposition tells us that any square matrix A is almost the same as some symmetric matrix, and the spectral theorem tells us that a symmetric matrix is almost the same as a simple scaling along the coordinate axes. (In both cases, the phrase "almost the same" disguises a composition with an orthogonal transformation.) We should be able to combine these ideas to conclude that any square matrix is basically the same as a simple scaling along the coordinate axes!

Let's be more precise. Suppose that A is a square matrix. The polar decomposition tells us that

\begin{align*}A = R \sqrt{A' A}\end{align*}

for some orthogonal matrix R. The spectral theorem tells us that \sqrt{A' A} = V \Sigma V' for some orthogonal matrix V and a diagonal matrix \Sigma with nonnegative diagonal entries.

Combining these equations, we get

\begin{align*}A = R V \Sigma V'.\end{align*}

Since a product of orthogonal matrices is , we can define U = RV and obtain the singular value decomposition (SVD) of A:

\begin{align*}A = U \Sigma V'\end{align*}

where U and V are orthogonal matrices.

We can visualize the decomposition A = U \Sigma V' geometrically by making a figure like the one shown below, which illustrates the successive effects of each map in the composition U \Sigma V'. If we draw grid lines on the second plot (just before \Sigma is applied) and propagate those grid lines to the other plots by applying the indicated maps, then we endow the domain and range of A with orthogonal sets of gridlines with A mapping one to the other.

We can extend the singular value decomposition to rectangular matrices A (that is, matrices which are not necessarily square) by adding rows or columns of zeros to a rectangular matrix to get a square matrix, applying the SVD to that square matrix, and then trimming the resulting \Sigma matrix as well as either U or V' (depending on which dimension of A is smaller). We get a decomposition of the form A = U \Sigma V' where U is an m \times m orthogonal matrix, V' is an n \times n orthogonal matrix, and \Sigma is a rectangular m \times n diagonal matrix.

The matrix A maps one set of orthogonal grid lines to another

Theorem (Singular value decomposition)
Suppose that A is an m \times n matrix. Then there exist orthogonal matrices U and V and a rectangular diagonal matrix \Sigma such that

\begin{align*}A = \underbrace{U}_{m \times m} \underbrace{\Sigma}_{m \times n} \underbrace{V'}_{n \times n} \:,\end{align*}

We call A = U \Sigma V' the a singular value decomposition (or SVD) of A. The diagonal entries of \Sigma are called the singular values of A.

The diagonal entries of \Sigma, which are the square roots of the eigenvalues of A' A, are called the singular values of A. The columns of U are called left singular vectors, and the columns of V are called right singular vectors.

Looking at the bottom half of the SVD figure, we see that the singular values of A are the lengths of the semi-axes of the ellipsoid in \mathbb{R}^m obtained as the image under A of the unit ball in \mathbb{R}^n. Moreover, the directions of these axes are the columns of U, since they are the images under U of the standard basis vectors. We will see an important application of this feature of the SVD in the probability chapter when we discuss principal component analysis.

As an example of how the singular value decomposition can be used to understand the structure of a linear transformation, we introduce the Moore-Penrose pseudoinverse A^+ of an m \times n matrix A. We define A^+ to be V \Sigma^+ U', where \Sigma^+ is the matrix obtained by inverting each nonzero element of \Sigma. The pseudoinverse is a swiss-army knife for solving the linear system A\mathbf{x} = \mathbf{b}:

  • If A is square and invertible, then A^+ = A^{-1}
  • If A\mathbf{x} = \mathbf{b} has no solution, then A^+\mathbf{b} is the value of \mathbf{x} which minimizes |A\mathbf{x} - \mathbf{b}|^2 (in other words, the closest thing to a solution you can get).
  • If A\mathbf{x} = \mathbf{b} has multiple solutions, then A^+\mathbf{b} is the solution with minimal norm.

Show that \left[\begin{smallmatrix}-160 & -120 \\ -12 & -134 \\\ 141 & 12 \\ \end{smallmatrix}\right] has SVD \left[\begin{smallmatrix} -\frac{4}{5} & 0 & \frac{3}{5}\\\ -\frac{9}{25} & -\frac{4}{5} & -\frac{12}{25} \\\ \frac{12}{25} & -\frac{3}{5} & \frac{16}{25} \end{smallmatrix}\right] \left[\begin{smallmatrix}250 & 0 \\\ 0 & 125 \\ 0 & 0 \end{smallmatrix}\right]\left[\begin{smallmatrix}\frac{4}{5} & \frac{3}{5} \\ -\frac{3}{5} & \frac{4}{5} \\ \end{smallmatrix}\right]. Find its Moore-Penrose pseudoinverse.

Solution. Let

\begin{align*}A = \left[\begin{smallmatrix}-160 & -120 \\\ -12 & -134 \\\ 141 & 12 \\\ \end{smallmatrix}\right],\end{align*}

and let U, \Sigma, and V' be the three matrices given in the problem statement.

We need to check that U,V are orthogonal, and that A = U \Sigma V'. We can verify that U,V are orthogonal by showing that their columns are orthogonal unit vectors. Equivalently, we may compute the products U' U and V' V and observe that they are identity matrices. Similarly, A = U \Sigma V' can be verified by hand or on a computer.

The formula for the Moore-Penrose pseudoinverse is

\begin{align*}A^{+} = V \Sigma^{+}U'\end{align*}

The matrix \Sigma^{+} is obtained by inverting the nonzero elements on the diagonal of \Sigma, and the transposing the resulting matrix.

\begin{align*}\Sigma^{+} = \begin{bmatrix} 1/125 & 0 \\0 & 1/250 \end{bmatrix}.\end{align*}

With a little calculation, we arrive at

\begin{align*}A^{+} = \frac{1}{31250} \begin{bmatrix} -80&84&138\\-60&-187&-84 \end{bmatrix}.\end{align*}

SVD and linear dependence

Linear dependence is numerically fragile: if the columns of a matrix (with more rows than columns) are linearly dependent, then perturbing the entries slightly by adding tiny independent random numbers is almost certain to result in a matrix with linearly independent columns. However, intuition suggests that subverting the principles of linear algebra in this way going to solve any real-world problems that emerge from linear dependence relationships among columns of a matrix.

This intuition is accurate, and it highlights the utility of having a generalization of the idea of linear independence which can quantify how close a list of vectors is to having linear dependence relationships, rather than remaining within the confines of the binary labels "linearly dependent" or "linearly independent". The singular value decomposition provides such a tool.

Define a matrix with 100 rows and 5 columns, and do it in such a way that two of the five columns are nearly equal to some linear combination of the other three. Calculate the singular values of the matrix, and make a conjecture about how the number of approximate linear dependencies could have been detected from the list of singular values.

import numpy as np

Solution. We see that two of the singular values are much smaller than the other three. (Remember that you have to run the cell twice to get the plot to show.)

import numpy as np
import matplotlib.pyplot as plt
A = np.random.standard_normal((100,5))
A[:,3] = A[:,2] + A[:,1] + 1e-2*np.random.standard_normal(100)
A[:,4] = A[:,1] - A[:,0] + 1e-2*np.random.standard_normal(100),np.linalg.svd(A)[1])
using Plots, LinearAlgebra
A = randn(100, 5)
A[:,4] = A[:,3] + A[:,2] + 1e-2*randn(100)
A[:,5] = A[:,2] - A[:,1] + 1e-2*randn(100)
bar(1:5, svdvals(A), label = "singular values")

We conjecture that k very small singular values indicates that k columns would need to be removed to obtain a matrix which does not have approximate linear dependence relationships among its columns.

In fact, the idea developed in this exercise is used by the NumPy function np.linalg.matrix_rank to calculate the rank of a matrix. Because of the roundoff errors associated with representing real numbers in memory on a computer, most matrices with float entries technically have full rank. Thus np.linalg.matrix_rank computes the singular value decomposition and returns the number of of the matrix which are larger than a given threshold. The threshold is adjustable, but one common setting is 10^{-15} times the largest entry of the matrix times the largest dimension of the matrix.

In fact, the idea developed in this exercise is used by the Julia function rank to calculate the rank of a matrix. Because of the roundoff errors associated with representing real numbers in memory on a computer, most matrices with float entries technically have full rank. Thus rank computes the singular value decomposition and returns the number of of the matrix which are larger than a given threshold. The threshold is adjustable, but one common setting is 10^{-15} times the largest entry of the matrix times the largest dimension of the matrix.

SVD and image compression

We close this section with a computational exercise illustrating another widely applicable feature of the singular value decomposition.


  • Show that if \mathbf{u}_1, \ldots, \mathbf{u}_n are the columns of U, \mathbf{v}_1, \ldots \mathbf{v}_n are the columns of V, and \sigma_1, \ldots, \sigma_n are the diagonal entries of \Sigma, then A = \sigma_{1} \mathbf{u}_{1}\mathbf{v}_{1}'+\sigma_{2}\mathbf{u}_{2}\mathbf{v}_{2}'+\cdots+ \sigma_{n}\mathbf{u}_{n}\mathbf{v}_{n}'.

  • The equation is useful for compression, because terms with sufficiently small singular value factors can be dropped and the remaining vectors and singular values can be stored using less space. Suppose that A is a 256 \times 128 matrix—how many entries does A have, and how many entries do \mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3, \mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3 have in total?

  • The Python code below creates a matrix A with pixel values for the image shown. How many nonzero singular values does A have? Explain how you can tell just from looking at the picture.

    figure: img(src="/content/linear-algebra/images/zero.svg")

import numpy as np
import matplotlib.pyplot as plt

m = 80
n = 100
a = m // 8
b = m // 4
A = np.ones((m,n))

def pixel(i,j):
    if (a <= i <= b or m-b <= i <= m-a) and a <= j <= n-a:
        return 0
    elif (a <= j <= b or n-b <= j <= n-a) and a <= i <= m-a:
        return 0
    return 1

A = np.array([[pixel(i,j) for i in range(1,m+1)] for j in range(1,n+1)])

U, Σ, V = np.linalg.svd(A)

using LinearAlgebra, Plots
m = 80
n = 100
a = m ÷ 8
b = m ÷ 4
A = ones(m,n)

function pixel(i,j)
    if (a ≤ i ≤ b || m-b ≤ i ≤ m-a) && a ≤ j ≤ n - a
    elseif (a ≤ j ≤ b || n-b ≤ j ≤ n-a) && a ≤ i ≤ m - a

A = [pixel(i,j) for i=1:m,j=1:n]

U, Σ, V = svd(A)
  • Now add some noise to the image:
B = A +  0.05*np.linalg.standard_normal((m,n))
B = A + 0.05 * randn(m, n)

Display this new matrix B, and also find the matrix obtained by keeping only the first three terms of \sigma_{1}\mathbf{u}_{1}\mathbf{v}_{1}' +\sigma_{2}\mathbf{u}_{2}\mathbf{v}_{2}' +\cdots+\sigma_{n}\mathbf{u}_{n}\mathbf{v}_{n}' for this matrix B. Which looks more like the original image A: (i) B or (ii) the three-term approximation of B?

Hint: you can achieve this computationally either by setting some singular values to 0 or by indexing the matrices U, \Sigma, and V' appropriately. Also, you will need the function np.diagonal diagm to generate a diagonal matrix from the vector of \Sigma values returned by svd.

import numpy as np
import matplotlib.pyplot as plt


  • Let M = \sigma_1\mathbf{u}_1\mathbf{v}_1' + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n'. We need to show that A = M. We will do this by first showing that A\mathbf{x} = M\mathbf{x} for all \mathbf{x} \in \mathbb{R}^n.

Now, A\mathbf{x} = U \Sigma V' \mathbf{x} for all \mathbf{x} \in \mathbb{R}^n. By definition of matrix-vector product, U \Sigma V' \mathbf{x} is a linear combination of the columns of U \Sigma with weights given by V' \mathbf{x}. Since \Sigma is diagonal, it is not hard to see that the i th column of U\Sigma is \Sigma_{ii}\mathbf{u}_i = \sigma_i \mathbf{u}_i. Using definition of matrix-vector product again, we find that the i th weight \left(V' \mathbf{x}\right)_i is the dot product of the i th row of V' and \mathbf{x}. But the i th row of V' is \mathbf{v}_i by definition, and thus \left(V' \mathbf{x}\right)_i = \mathbf{v}_i \cdot \mathbf{x}. Therefore,

\begin{align*}A\mathbf{x} = U \Sigma V' \mathbf{x} &= \left(V' \mathbf{x}\right)_1 \sigma_1\mathbf{u}_1 + \cdots + \left(V' \mathbf{x}\right)_n \sigma_n\mathbf{u}_n \\ &= \left(\mathbf{v}_1\cdot \mathbf{x}\right) \sigma_1\mathbf{u}_1 + \cdots + \left(\mathbf{v}_n \cdot \mathbf{x}\right) \sigma_n\mathbf{u}_n \\ &= \sigma_1\mathbf{u}_1\mathbf{v}_1' \mathbf{x} + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n' \mathbf{x}\end{align*}

where \mathbf{v}_i' \mathbf{x} is being treated as a 1 \times 1 matrix for all 1 \leq i \leq n.

By linearity of matrix multiplication,

\begin{align*}\sigma_1\mathbf{u}_1\mathbf{v}_1' \mathbf{x} + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n' \mathbf{x} = \left(\sigma_1\mathbf{u}_1\mathbf{v}_1' + \cdots + \sigma_n\mathbf{u}_n\mathbf{v}_n'\right) \mathbf{x}\end{align*}

and thus A\mathbf{x} = M \mathbf{x} for all \mathbf{x} \in \mathbb{R}^n. Since A\mathbf{x} = M\mathbf{x} for all \mathbf{x} \in \mathbb{R}^n, it follows that AB = MB for any n \times p matrix B. In particular, if B is the identity matrix in \mathbb{R}^n, we have

\begin{align*}A = AB = MB = M\end{align*}

as required.

  • A has 256 \times 128 = 32768 entries and \mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3, \mathbf{v}_1, \mathbf{v}_2, \mathbf{v}_3 combined have 3(256 + 128) = 1152 entries.

  • It can be seen from the picture that A has 3 kinds of columns: one whose components are all dark, another whose components are light in the middle, and the other whose components are dark on the outside and in the middle with strips of light in between. These columns are clearly linearly independent, and thus A has rank 3. Therefore A has 3 non-zero singular values.

We can select only the first three terms by suitably indexing the vectors, as follows:

U, Σ, V = np.linalg.svd(B)
plt.matshow(U[:,:3] * np.diag(Σ[:3]) * V.T[:3,:])
U, Σ, V = svd(B)
heatmap(U[:,1:3] * diagm(Σ[1:3]) * V'[1:3,:])
Bruno Bruno