Matrix Calculus
by Heinrich Hartmann / 2025-06-23 / Stemwede
Introduction
Good notation has vast influence on how well we are able to perform computations, and how much we can trust the results. A good "calculus" allows us to comprehend formulas and spot inconsistencies easily.
In this text we follow the general "function space" approach laid out in Natural Operators, but focus on matrix algebra suited for computation.
Vectors \(a\) are defined as functions on a (finite) set \(X\), with entries \(a[x], x\in X\). Matrices are defined as functions on products \(X \times Y\). Matrix & Vector operations are defined as "tensor" contractions of adjacent indices.
In this way we arrive at a calculus that makes the following properties clear:
- Explicit index computations \(i=1,\dots,N\) are naturally discouraged, since we are dealing with sets and not numbers as indices.
- Results and formulas specialize to numerical formulas by taking \(X = [N] = \{ 0, \dots, N-1 \}\). Here the notation \(a[x]\) is consistent with Python's indexing of arrays/lists.
- The concepts of row and column vectors are avoided. Vectors are one-dimensional structures, and can be multiplied from the left and right (by contracting adjacent indices). Embeddings of vectors into matrix spaces are realized as outer products: \(1 \tensor a \in \IR[1,X]\).
Function Spaces
Let \(X\) be a set. Typical examples are \(X=[N]=\{0,\dots,N-1\}\) if we are doing computations, but also infinite sets like \(X=\IR\). We introduce the following notation:
- \(\IR[X]= (\stext{real valued functions on X}) = Map(X, \IR)\).
- For \(a\in \IR[X], x\in X\), we denote by \(a[x]\in \IR\) the value of \(a\) at \(x\).
- For \(x \in X\) we denote by \(\hat{x} \in \IR[X]\) the "delta" function with \(\hat{x}[x]=1\), and \(\hat{x}[y] = 0\) for all \(y\neq x\).
- \(\IR[X]\) is a commutative, associative, unital algebra with pointwise operations:
- \((a + b)[x] = a[x]+b[x] \in \IR\)
- \((a \cdot b)[x] = a[x]b[x] \in \IR\)
- The scalar multiplication by \(r \in \IR\) is denoted by \(r a\). We sometimes write \(r.a\) if we want to stress the scalar nature of \(r\).
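As a quick illustration (not part of the formal development), here is a minimal Python sketch of these definitions, using plain dictionaries keyed by an arbitrary finite set; the helper names `delta`, `add`, `mul` are ad-hoc choices:

```python
# Functions on a finite set X, represented as dicts {x: value}.
X = {"a", "b", "c"}            # an arbitrary finite index set, not necessarily numbers

def delta(x, X):
    """The 'delta' function x-hat with value 1 at x and 0 elsewhere."""
    return {y: (1.0 if y == x else 0.0) for y in X}

def add(f, g):
    """(f + g)[x] = f[x] + g[x]"""
    return {x: f[x] + g[x] for x in f}

def mul(f, g):
    """(f . g)[x] = f[x] * g[x]  (pointwise product)"""
    return {x: f[x] * g[x] for x in f}

a = {"a": 1.0, "b": 2.0, "c": 3.0}    # an element of IR[X]
print(add(a, delta("b", X)))           # {'a': 1.0, 'b': 3.0, 'c': 3.0}
print(mul(a, a))                       # {'a': 1.0, 'b': 4.0, 'c': 9.0}
```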
Notes on Notation
- We choose the variables \(x \in X\) to denote points. \(x\) is a non-linear discrete entity without further structure. It may also be an index, \(x \in [N]\).
In other texts, we would see: \(i = 1,\dots,N\).
- For functions \(a \in \IR[X]\) we choose the variable \(a\). Here \(a\) is an algebraic entity,
with a rich set of operations, inherited from the real numbers. In general, functions are
best denoted by letters which are similar to the target space. We parametrize values in the target \(a[x]\) by \(x \in X\).
In other texts, we would see \(v\) with \(v_i, i = 1,\dots, N\).
- The notation \(a[x]\) is inspired by Python array indexing. If we choose \(X = \{ 0,\dots,N-1\}\) we exactly recover python-style array indexing \(a[i]\).
Slicing
For \(A \in \IR[X,Y], x\in X,y\in Y\) we define:
- \(A[x,:] \in \IR[Y]\) by \(A[x,:][y] = A[x,y]\).
- \(A[:,y] \in \IR[X]\) by \(A[:,y][x] = A[x,y]\).
Inner Products
Assume \(X\) is finite, and \(a,b\) in \(\IR[X]\).
We define the inner product as $$ a * b = \sum_{x \in X} a[x] \, b[x] \in \IR. $$
- Commutativity We have \(a * b = b * a\).
- Reconstruction We have \(a = \sum_x a[x] \hat{x} = \sum_x (a * \hat{x})\hat{x}\)
Side-note The inner product is also defined if \(X\) is infinite but either \(a\) or \(b\) have finite support, i.e. \(a[x]=0\) for almost all \(x\).
Matrix Products
Let \(X,Y,Z\) be finite sets, \(x \in X,y \in Y\), \(a \in \IR[X], b \in \IR[Y], c \in \IR[Z]\).
- We define \(\IR[X,Y]=\IR[ X\times Y]\) as real valued functions on \(X\times Y\).
- For \(A \in \IR[X,Y]\) we denote the entries by \(A[x,y] \in \IR\).
- We define the vector-matrix product \(a * A \in \IR[Y]\) by \((a * A)[y] = \sum_x a[x]A[x,y].\)
- We define the matrix-vector product \(A * b \in \IR[X]\) by \((A * b)[x] = \sum_y A[x,y]b[y].\)
- We define the matrix-matrix product \(A * B\in \IR[X,Z]\) by \((A*B)[x,z]=\sum_y A[x,y]B[y,z]\), where \(B \in \IR[Y,Z]\).
Note
- There are no row and column vectors here. Matrix-vector products are defined in both directions: \(a * A\) and \(A * b\). This is perfectly natural from an indexing standpoint, as "*" is defined as a "contraction operator" that sums over adjacent indices. This extends naturally to n-dimensional arrays.
- Traditionally \(v * A\) is denoted as \(v^T \cdot A\) and \(A * v\) is \(A \cdot v\), where \(v^T\) is a 1xN "row matrix" and \(v\) an Nx1 "column matrix".
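For concreteness, here is a small numpy sketch of these contractions, specializing \(X, Y, Z\) to Python ranges as described above; `np.einsum` makes the index pattern explicit, and the assertions are sanity checks rather than part of the theory:

```python
import numpy as np

# Specialize X = [3], Y = [4], Z = [2] to Python ranges, so IR[X] ~ arrays of shape (3,).
a = np.arange(3.0)                   # a in IR[X]
b = np.arange(4.0)                   # b in IR[Y]
A = np.arange(12.0).reshape(3, 4)    # A in IR[X,Y]
B = np.arange(8.0).reshape(4, 2)     # B in IR[Y,Z]

# "*" is contraction over adjacent indices:
aA = np.einsum("x,xy->y", a, A)      # (a * A)[y] = sum_x a[x] A[x,y]
Ab = np.einsum("xy,y->x", A, b)      # (A * b)[x] = sum_y A[x,y] b[y]
AB = np.einsum("xy,yz->xz", A, B)    # (A * B)[x,z] = sum_y A[x,y] B[y,z]

# The same contractions via the @ operator (1d arrays need no transposition):
assert np.allclose(aA, a @ A)
assert np.allclose(Ab, A @ b)
assert np.allclose(AB, A @ B)
```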
Associativity
- The matrix product is associative: \(A * (B * C) = (A * B) * C\), hence we write \(A * B * C\) without ambiguity.
- The matrix product is associative with vector-matrix product: \(a * (A * B) = (a * A) * B\), hence we write \(a * A * B\) without ambiguity.
- The vector-matrix product is associative with the scalar product: \((a * A) * b = a * (A * b)\), hence we write \(a * A * b\) without ambiguity.
Note that, since the scalar product is also commutative, we have: $$ a * A * b = (a * A) * b = b * (a * A). $$
It's important to note that in this case we can't remove the brackets, since "\((b * a) . A\)" is something entirely different: it's the scalar \(b * a \in \IR\) multiplied with the matrix \(A\).
Unit
- The identity matrix \(I_X \in \IR[X,X]\) is given by: \(I_X[x,y] = \delta_{x,y}\), where \(\delta_{x,x} = 1\) and \(\delta_{x,y} = 0\) for \(x \neq y \in X\).
- For any vector \(a \in \IR[X]\) we have \(a * I_X = I_X * a = a\).
- For any matrix \(A \in \IR[X,Y]\), \(B \in \IR[Y,X]\) we have: \(I_X * A = A\) and \(B * I_X = B\).
Transposition
- We define \(A^* \in \IR[Y,X]\) as \(A^*[y,x] = A[x,y]\).
- We have \(A^* * a = a * A\)
In this text we will not use transposition of vectors / 1d arrays, since vectors can be multiplied from the left and right directly.
Outer Products
For \(a \in \IR[X]\) and \(b \in \IR[Y]\), define \(a \tensor b \in \IR[X,Y]\) as \((a \tensor b)[x,y] = a[x]b[y]\).
Properties
- For \(A \in \IR[W,X]\) and \(B \in \IR[Y,Z]\) we have: \(A * (a \tensor b) * B = (A * a) \tensor (b * B)\).
- The matrix \(a \tensor b\) acts on vectors \(a' \in \IR[X], b' \in \IR[Y]\) as \(a' * (a \tensor b) * b' = (a' * a) . (b * b')\).
- (Row- and Column-vectors) We can convert vectors into matrices via \(1 \tensor a \in \IR[1,X]\) and \(a \tensor 1 \in \IR[X,1]\).
- For \(x \in X\) the matrix \(\hat{x} \tensor b = (\hat{x} \tensor 1) * (1 \tensor b)\) is the "row vector" centered at \(x\).
- For \(y \in Y\) the matrix \(a \tensor \hat{y} = (a \tensor 1) * (1 \tensor \hat{y})\) is the "column vector" centered at \(y\).
- The identity matrix can be expressed as \(I_X = \sum_{x\in X} \hat{x} \tensor \hat{x}\), for \(X\) finite.
- The identity \(I_X * A = A\) recovers \(A\) as sum of its "rows" $$ A = I_X * A = \sum_{x\in X} \hat{x} \tensor (\hat{x} * A) = \sum_{x\in X} \hat{x} \tensor A[x,:] $$
- The identity \(A * I_Y\) recovers \(A\) as sum of its "columns" $$ A = A * I_Y = \sum_{y\in Y} (A * \hat{y}) \tensor \hat{y} = \sum_{y\in Y} A[:,y] \tensor \hat{y}. $$
Proposition
- The matrix \(a \tensor b\) has rank one.
- Every rank one matrix \(R \in \IR[X,Y]\) is of the form \(a \tensor b\).
Remark
- In classical treatments, vectors \(a \in \IR[X]\) are identified with "column vectors" in \(\IR[X,1]\), which we denote by \(a \tensor 1\).
- The classical scalar product \(v^T \cdot w\) is expressed as \((1 \tensor v) * (w \tensor 1) = v * w \in \IR\).
- The classical rank-1 matrix \(v \cdot w^T\) is expressed as \((v \tensor 1) * (1 \tensor w) = v \tensor w \in \IR[X,Y]\).
- The product \(\tensor\) generalizes naturally to the Kronecker Product or Tensor Product for higher order-tensors \(\IR[X_1, \dots, X_n]\).
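A short numpy sketch (illustrative only) of outer products and the row/column embeddings discussed above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])         # a in IR[X], X = [3]
b = np.array([4.0, 5.0])              # b in IR[Y], Y = [2]

ab = np.outer(a, b)                    # (a ⊗ b)[x,y] = a[x] b[y], shape (3, 2)

one = np.ones(1)                       # the vector 1 in IR[1]
row = np.outer(one, a)                 # 1 ⊗ a in IR[1,X]: a as a 1x3 "row matrix"
col = np.outer(a, one)                 # a ⊗ 1 in IR[X,1]: a as a 3x1 "column matrix"

# (v ⊗ 1) * (1 ⊗ w) = v ⊗ w   and   (1 ⊗ v) * (w ⊗ 1) = v * w
assert np.allclose(col @ np.outer(one, b), ab)
v = np.array([1.0, 1.0, 2.0])
assert np.allclose(np.outer(one, a) @ np.outer(v, one), np.array([[a @ v]]))

# a ⊗ b has rank one
assert np.linalg.matrix_rank(ab) == 1
```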
Diagonal Matrices
Each \(m \in \IR[X]\) induces a linear transformation on \(\IR[X]\) by multiplication: \(a \mapsto m \cdot a\). This multiplication is represented by a diagonal matrix \(M = diag(m)\) with entries \(M[x,x] = m[x]\) and zero elsewhere: $$ diag(m) = \sum_x m[x] \hat{x} \tensor \hat{x} \in \IR[X,X] $$ So \(M * a = m \cdot a\).
We characterize multiplication operators as diagonal matrices, or equivalently as operators for which all \(\hat{x}\) are eigenvectors.
Proposition
- Let \(M=diag(m)\) be a multiplication operator; then all \(\hat{x}\) are eigenvectors for \(M\): \(M * \hat{x} = m[x] . \hat{x}.\)
- Conversely, let \(A \in \IR[X,X]\) be a matrix with \(A * \hat{x} = m[x] . \hat{x}\) for some function \(m \in \IR[X]\); then \(A = diag(m)\).
Next we study matrices that commute with diagonal operators. It turns out that those matrices decouple along the level-sets, and have a block diagonal form.
Proposition
If \(A \in \IR[X,X]\) and \(M = diag(m)\) is a multiplication operator, the following are equivalent:
- \(A * M = M * A\).
- \(A[x,y] = 0\) for all \(x,y\) with \(m[x] \neq m[y]\).
- \(A = \vsum_\mu A_\mu\), where \(A_\mu \in \IR[X_\mu, X_\mu]\) is the restriction of \(A\) to the level-set \(X_\mu = \{x \in X : m[x] = \mu\} \subset X\).
Moreover in this case:
- If \(n\) is another function which induces the same partition of \(X\) into level-sets, with \(N = diag(n)\), then \(A * N = N * A\) if and only if \(A * M = M * A\).
- If all values \(m[x]\) are distinct, and \(A * M = M * A\), then \(A\) is diagonal as well.
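The following numpy snippet illustrates the proposition on a small example; the block-diagonal matrix `A` and the function `m` are ad-hoc choices:

```python
import numpy as np

m = np.array([1.0, 1.0, 2.0])                      # m in IR[X] with level-sets {0,1} and {2}
X = range(len(m))

# diag(m) = sum_x m[x] . x-hat ⊗ x-hat
def delta(x, n):
    e = np.zeros(n); e[x] = 1.0
    return e

M = sum(m[x] * np.outer(delta(x, 3), delta(x, 3)) for x in X)
assert np.allclose(M, np.diag(m))

# M * a = m . a (pointwise product)
a = np.array([5.0, 6.0, 7.0])
assert np.allclose(M @ a, m * a)

# A block-diagonal along the level-sets of m commutes with M ...
A = np.array([[1.0, 2.0, 0.0],
              [3.0, 4.0, 0.0],
              [0.0, 0.0, 5.0]])
assert np.allclose(A @ M, M @ A)

# ... while an entry linking different level-sets breaks commutativity.
A[0, 2] = 1.0
assert not np.allclose(A @ M, M @ A)
```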
Orthonormal Systems
- An orthonormal system \(u_z \in \IR[X], z \in Z\) is a set of vectors for which \(u_z * u_w = \delta_{z,w}\) for \(z,w \in Z\).
- The vectors \(u_z \in \IR[X], z\in Z\) form an orthonormal system if and only if the matrix \(U = \sum_z u_z \tensor \hat{z} \in \IR[X,Z]\) fulfills \(U^* * U = I_Z\).
- An orthonormal basis (ONB) \(u_x \in \IR[X], x \in X\) is an orthonormal system in \(\IR[X]\) indexed by \(X\).
Proposition
- Orthonormal systems are linearly independent in \(\IR[X]\).
- Any orthonormal system \(u_z \in \IR[X]\), with \(Z \subset X\) can be extended to an orthonormal basis.
- The reverse composition \(P = U * U^* \in \IR[X,X]\) and \(Q = I_X - P\) are orthogonal projection matrices:
- Projection property: \(P * P = P, \quad Q*Q = Q\).
- Orthogonality: \(P * Q = 0, \quad Q * P = 0\).
- \(Im(P) = Im(U) = Span(u_z, z\in Z) = Ker(Q)\)
- \(Im(Q) = Im(P)^\perp\)
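As an illustration, the following numpy sketch builds an orthonormal system via a QR factorization (a convenient assumption, not part of the text) and checks the projector properties:

```python
import numpy as np

rng = np.random.default_rng(0)
X, Z = 5, 3

# An orthonormal system u_z in IR[X], z in Z, stored as the columns of U in IR[X,Z].
U, _ = np.linalg.qr(rng.standard_normal((X, Z)))   # U^* * U = I_Z

assert np.allclose(U.T @ U, np.eye(Z))

# P = U * U^* and Q = I_X - P are orthogonal projectors.
P = U @ U.T
Q = np.eye(X) - P
assert np.allclose(P @ P, P)
assert np.allclose(Q @ Q, Q)
assert np.allclose(P @ Q, np.zeros((X, X)))
```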
Linear Analysis
We repeat some of the results from the Analysis of Linear Functions post.
- For a real-valued differentiable function \(f: \IR[N] \lra \IR\) define the gradient \(d f(p) \in \IR[N]\) at a point \(p \in \IR[N]\) as $$ d f(p)[n] = \frac{\del}{\del t}|_{t=0} f(p + t\hat{n}). $$
- For a vector-valued differentiable function \(f: \IR[N] \lra \IR[M]\), with components \(f^m: \IR[N] \lra \IR\), and a point \(p \in \IR[N]\) we define the Jacobi Matrix \((Df)(p) \in \IR[M,N]\) as \(Df(p) = \sum_m \hat{m} \tensor df^m(p)\) so that: $$ Df(p)[m,n] = \frac{\del}{\del t}|_{t=0} f^m(p + t \hat{n}). $$
- For a real-valued differentiable function \(f: \IR[N] \lra \IR\) define the Hessian \(Hf(p) \in \IR[N,N]\) at a point \(p \in \IR[N]\) as \(Hf(p) = Ddf(p)\), so that: $$ Hf(p)[n,m] = \frac{\del}{\del s}\frac{\del}{\del t}|_{t=0,s=0} f(p + t\hat{n} + s \hat{m}). $$
Calculus
Let \(a, b\) be two real-valued functions on \(\IR[M]\), and \(\lambda \in \IR\). Let \(c\) be a real-valued function on \(\IR[M]\). Let \(f: \IR[M] \lra \IR[N]\), \(g: \IR[L] \lra \IR[M]\), and pick \(p \in \IR[M]\), \(q\in \IR[L]\) with \(g(q) = p\).
We omit the points from the notation and write \(Df = Df(p)\), \(da = da(p)\), etc.
- Linearity $$ d (a + \lambda b) = da + \lambda db \quad \in \IR[M]. $$
- Chain Rule $$ D_q (f \circ g) = (D_p f) * (D_q g) \quad \in \IR[N,L]. $$
- Chain Rule (1-dim) $$ d(a \circ g)(q) = d a (p) * D_q g \quad \in \IR[L]. $$
- Product Rule $$ d(a \cdot b) = (da).b + a.(db) \quad \in \IR[M]. $$
- Product Rule (N-dim) $$ D (f \cdot c) = D f . c + f \tensor d c \quad \in \IR[N,M]. $$
- Quotient Rule $$ d(a/b) = (da \cdot b - a \cdot db)/b^2 \quad \in \IR[M]. $$
- Quotient Rule (N-dim) $$ D(f/c) = (Df . c - f \tensor dc)/c^2 \quad \in \IR[N,M]. $$
- Power rule \(\quad d(a^n) = n a^{n-1} da\).
- Inverse Rule \(\quad d(1/a) = - da / a^2\).
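These rules can be sanity-checked numerically. The sketch below verifies the chain rule with forward finite differences; the helper `num_jacobian` and the concrete choices of \(f\) and \(g\) are ad-hoc illustrations:

```python
import numpy as np

def num_jacobian(f, p, eps=1e-6):
    """Finite-difference Jacobian, Df(p)[m,n] ~ (f(p + eps n-hat)[m] - f(p)[m]) / eps."""
    f0 = np.atleast_1d(f(p))
    J = np.zeros((f0.size, p.size))
    for n in range(p.size):
        dp = np.zeros_like(p); dp[n] = eps
        J[:, n] = (np.atleast_1d(f(p + dp)) - f0) / eps
    return J

L, M, N = 2, 3, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((N, M))
B = rng.standard_normal((M, L))

f = lambda p: A @ p                             # f: IR[M] -> IR[N]
g = lambda q: B @ q + q[0] ** 2 * np.ones(M)    # g: IR[L] -> IR[M]

q = rng.standard_normal(L)
p = g(q)

# Chain rule: D_q(f ∘ g) = (D_p f) * (D_q g)  in IR[N,L]
lhs = num_jacobian(lambda q_: f(g(q_)), q)
rhs = num_jacobian(f, p) @ num_jacobian(g, q)
assert np.allclose(lhs, rhs, atol=1e-4)
```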
Examples I
- Let \(f(p) = a * p\), for some fixed \(a \in \IR[M]\) we have \(df (p) = a\) and \(Hf(p) = 0\).
- Let \(f(p) = A * p\), for some \(A \in \IR[N,M]\) we have \(Df (p) = A\).
- Let \(f(p) = p*p\), then \(df (p) = 2 p\) and \(Hf(p) = 2I.\)
- Let \(f(p) = p * G * p\), then \(df (p) = G*p + p*G\) and \(Hf(p) = G + G^*\)
- Let \(f(p) = p * A^* * A * p\), then \(df (p) = 2 A^* * A * p\) and \(Hf(p) = 2 A^* * A\)
- Let \(f(p) = \| A*p - t \|^2\), then \(d f(p) = 2A^* * A * p - 2A^* * t\) and \(Hf(p) = 2 A^* * A\)
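A quick numerical check of the last example (illustrative only, with randomly chosen \(A\), \(t\), \(p\)):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 6, 4
A = rng.standard_normal((N, M))
t = rng.standard_normal(N)
p = rng.standard_normal(M)

f = lambda p: np.sum((A @ p - t) ** 2)        # f(p) = ||A*p - t||^2

# Analytic gradient and Hessian from the table above
grad = 2 * A.T @ A @ p - 2 * A.T @ t
hess = 2 * A.T @ A

# Central finite-difference gradient
eps = 1e-6
num_grad = np.array([
    (f(p + eps * e) - f(p - eps * e)) / (2 * eps)
    for e in np.eye(M)
])
assert np.allclose(grad, num_grad, atol=1e-4)
assert np.allclose(hess, hess.T)
```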
Examples II
- Let \(n(p) = \| p \|\), \(p \neq 0\), then $$ dn (p) = p / \| p \|. $$ Note that \(dn(p)\) is the unit vector in direction \(p\).
- Let \(u(p) = p/\|p\|\), \(p \neq 0\) be the unit vector in direction \(p\).
We have \(u(p) = p / n(p)\), so \(Du(p) = (I_N . n(p) - p \tensor dn(p)) / n(p)^2\) which simplifies to: $$ Du(p) = \frac{1}{\| p \|} (I_N - u(p) \tensor u(p) ) \in \IR[N,N]. $$ Note that \(u(p) \tensor u(p)\) is the orthogonal projector onto the 1-d space spanned by \(u(p)\). Therefore, \(I - u(p) \tensor u(p)\) projects onto \(u(p)^\perp\).
Proposition (Rayleigh Quotient)
Let \(r(p) = p * A * p / (p * p) \in \IR\), \(p\neq 0\) for a symmetric matrix \(A \in \IR[N,N], A^* = A\).
- The derivative is given by: $$ dr(p) = \frac{2}{p*p} (A - r(p) I_N) * p \in \IR[N]. $$
- At a critical point \(p\) (\(dr(p) = 0\)), the Hessian is given by a very similar formula: $$ Hr(p) = \frac{2}{ p * p} (A - r(p) I_N) \in \IR[N,N]. $$
- If \(p\) is an eigenvector for \(A\) with eigenvalue \(\lambda\) then \(r(p) = \lambda\) and \(dr(p) = 0\).
- Conversely, if \(dr(p) = 0\) then \(p\) is an eigenvector of \(A\) for eigenvalue \(r(p)\).
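The following sketch checks the gradient formula with central differences and confirms that eigenvectors are critical points; the random symmetric matrix is an ad-hoc choice:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 5
G = rng.standard_normal((N, N))
A = G + G.T                                    # symmetric matrix
p = rng.standard_normal(N)

r = lambda p: (p @ A @ p) / (p @ p)

# Analytic gradient from the proposition
grad = 2.0 / (p @ p) * (A @ p - r(p) * p)

# Central finite differences
eps = 1e-6
num_grad = np.array([(r(p + eps * e) - r(p - eps * e)) / (2 * eps) for e in np.eye(N)])
assert np.allclose(grad, num_grad, atol=1e-5)

# At an eigenvector the gradient vanishes and r equals the eigenvalue.
lam, Q = np.linalg.eigh(A)
v = Q[:, 0]
assert np.isclose(r(v), lam[0])
assert np.allclose(2.0 / (v @ v) * (A @ v - r(v) * v), 0.0, atol=1e-10)
```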
Spectral Theory
Spectral Decomposition
Theorem — Spectral Decomposition (Vector Form)
Let \(A \in \IR[X,X]\) be a symmetric (\(A^* = A\)), positive semi-definite (\(a * A * a \geq 0\) for all \(a\)) matrix. There exist a number \(r \geq 0\), positive real numbers \(\lambda_1 \geq \dots \geq \lambda_r > 0\), and a length-\(r\) orthonormal system \(q_z, z \in Z = [r]\), so that: $$ A = \sum_{z \in Z} \lambda_z . q_z \tensor q_z. $$
Equivalently, by arranging the vectors \(q_z\) into a matrix \(Q \in \IR[X,Z]\), and the numbers \(\lambda_z\) into a diagonal matrix \(D \in \IR[Z,Z]\), we get a factorization: $$ A = Q * D * Q^*. $$
Where:
- \(Q \in \IR[X,Z]\) is an orthonormal system: \(Q^* * Q = I_Z\)
- \(D \in \IR[Z,Z]\) is a diagonal matrix with positive descending entries \(D[z,z] = \lambda_z > 0\).
Proof
We prove by induction on the dimension of the space.
The case \(\#X=1\) is trivial, since every \(1 \times 1\) matrix has the form \(a . \hat{x} \tensor \hat{x}\) with \(a \geq 0\).
For the inductive step, assume the theorem holds for all symmetric positive semi-definite matrices on index sets of size \(\#X - 1\).
The function \(f(a) = a * A * a \in \IR\) is continuous on the compact set \(S = \{a \in \mathbb{R}^n : a * a - 1 = 0\}\) (the unit sphere). By the extreme value theorem, \(f\) attains its maximum on \(S\) at some point \(u \in S\).
We apply the Lagrange multiplier criterion for maximizing \(f(a) = a * A * a\) subject to the constraint \(g(a) = a * a - 1 = 0\): at the maximum \(u\), we have \(df(u) = \lambda dg(u)\). The derivatives are given by \(df(a) = 2 A*a\) (here the assumption \(A^* = A\) enters) and \(dg(a) = 2 a\). Therefore, at the maximum point \(u\), we have \(A * u = \lambda . u\).
If \(\lambda = 0\), then \(0 \leq f(a) \leq \lambda = 0\) for all unit vectors \(a\) (using positive semi-definiteness), hence \(A = 0\), which already has the desired form with \(r = 0\). Hence we can assume \(\lambda > 0\) from here on.
Consider the orthogonal complement \(u^\perp = \{a \in \IR[X] : a * u = 0\}\). For \(a \in u^\perp\), we find \(a * A * u = \lambda . (a * u) = 0\), hence \(A * a \in u^\perp\) (using \(A^* = A\)), and \(A\) restricts to a linear operator \(A_u: u^\perp \ra u^\perp\).
Now \(dim(u^\perp) = \# X - 1\), so by induction we find a representation \(A_u = \sum_{z \in Z'} \lambda_z . u_z \tensor u_z\), indexed by a set \(Z'\) of cardinality \(\# Z' \leq \# X - 1\). Extending \(Z'\) by a new symbol \(*\) to \(Z = Z' \cup \{*\}\), and setting \(u_* = u\), \(\lambda_* = \lambda\), we get $$ A = \sum_{z \in Z} \lambda_z . u_z \tensor u_z, $$ since both sides agree on \(u\) and on \(u^\perp\). Finally \(\lambda = \max_S f \geq \lambda_z\) for all \(z \in Z'\), so the eigenvalues can be arranged in descending order. qed.
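As a numerical illustration of the theorem (not part of the proof), the sketch below builds a reduced spectral decomposition of a random positive semi-definite matrix from `np.linalg.eigh`, discarding the (numerically) zero eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(4)
X = 5
B = rng.standard_normal((X, 3))
A = B @ B.T                                    # symmetric positive semi-definite, rank <= 3

# np.linalg.eigh returns eigenvalues in ascending order.
lam, Q_full = np.linalg.eigh(A)

# Keep only the (numerically) positive eigenvalues, sorted in descending order.
keep = np.where(lam > 1e-10)[0][::-1]
D = np.diag(lam[keep])                         # D in IR[Z,Z], positive descending entries
Q = Q_full[:, keep]                            # Q in IR[X,Z], orthonormal system

assert np.allclose(Q.T @ Q, np.eye(len(keep)))     # Q^* * Q = I_Z
assert np.allclose(A, Q @ D @ Q.T)                 # A = Q * D * Q^*
```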
Comparison Theorem for Spectral Decompositions
Let \(A \in \IR[X,X]\) be a symmetric matrix (\(A^* = A\)), and \(A = Q*D*Q^*\) a Spectral Decomposition for \(A\). Here \(Q \in \IR[X,Z], D \in \IR[Z,Z], Z=[r]\).
- Let \(T \in \IR[Z,Z]\) be an orthogonal matrix (\(T^* * T = I_Z\)) with \(T*D=D*T\); then \[ A = (Q*T) * D * (Q*T)^* \] is another spectral decomposition for \(A\), with \(D' = D\) and \(Q' = Q * T\).
- Conversely, let \(A = Q' * D' * Q'^*\) be another Spectral Decomposition, with index sets \(Z=[r]\) and \(Z'=[r']\) respectively. Then:
- The "rank" and "Eigenvalues \(D\)" are the same: \(r=r', Z=Z', D=D' \in \IR[Z,Z]\).
- The matrix \(T = Q'^* Q \in \IR[Z,Z]\) is orthogonal and commutes with \(D\): $$ T * D = D * T, \quad T * T^* = I_Z $$
Proof
Let \(A = Q D Q^*\) and \(A = Q' D' Q'^*\) be two spectral decompositions, as described above. We first show that \(r = r'\), so \(Z=[r]=Z'\); then that \(D=D'\); finally we verify that \(T = Q'^* * Q\) has the desired properties.
Step 1: The rank \(r\) is uniquely determined
The rank \(r\) equals the number of non-zero eigenvalues of \(A\). Since \(A\) is symmetric, this equals the rank of \(A\) as a linear transformation. Hence \(r = r'\) and we can identify \(Z = Z' = [r]\).
Step 2: The eigenvalues are uniquely determined
Consider the Rayleigh quotient \(f(a) = a * A * a / (a * a)\) for \(a \neq 0\). By the Rayleigh quotient proposition, the critical points of \(f\) are exactly the eigenvectors of \(A\), and the value of \(f\) at a critical point equals the corresponding eigenvalue.
The maximum value of \(f\) over the unit sphere \(a * a = 1\) is the largest eigenvalue. This maximum is uniquely determined by \(A\). Similarly, restricting to the orthogonal complement of the first eigenvector, we get the second largest eigenvalue, and so on.
More precisely, let \(\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_r > 0\) be the eigenvalues from the first decomposition, and \(\mu_1 \geq \mu_2 \geq \dots \geq \mu_r > 0\) from the second. Then:
- \(\lambda_1 = \max_{a*a=1} a * A * a = \mu_1\)
- \(\lambda_2 = \max_{a*a=1, a*q_1=0} a * A * a = \mu_2\)
- And so on.
Therefore \(D = D'\) (up to reordering, which we can assume without loss of generality).
Step 3: The transformation \(T\) has the required properties
Set \(T = Q'^* * Q \in \IR[Z,Z]\). Both \(Q\) and \(Q'\) are orthonormal systems with the same image \(Im(Q) = Im(A) = Im(Q')\), so the projector \(Q' * Q'^*\) acts as the identity on the columns of \(Q\). Hence: \(T^* * T = (Q'^* * Q)^* * (Q'^* * Q) = Q^* * Q' * Q'^* * Q = Q^* * Q = I_Z\).
From \(A = Q * D * Q^* = Q' * D * Q'^*\), we get: \(Q * D * Q^* = Q' * D * Q'^*\).
Multiplying on the left by \(Q'^*\) and on the right by \(Q\): \(Q'^* * Q * D * Q^* * Q = Q'^* * Q' * D * Q'^* * Q\), which simplifies to \(T * D = D * T\).
Therefore \(T\) is orthogonal and commutes with \(D\), completing the proof.
Theorem — Full Spectral Decomposition (Matrix Form)
Let \(A \in \IR[X,X]\) be any symmetric matrix.
Then \(A\) can be factored as
$$
A = U * D * U^*
$$
where
- \(U \in \IR[X,X]\) is an orthogonal transformation: \(U^* * U = U * U^* = I_X\).
- \(D \in \IR[X,X]\) is diagonal: \(D = \sum_x \lambda_x . \hat{x} \tensor \hat{x}\).
Theorem - Abstract Spectral Decomposition
Let \(A \in \IR[X,X]\) be any symmetric matrix. Let \(A_\lambda = Ker(A - \lambda I_X)\) be the eigenspace of \(A\) for eigenvalue \(\lambda\).
Then
- Only finitely many \(A_\lambda\) are non-zero.
- \(\IR[X]\) decomposes into an orthogonal direct sum \(\bigoplus_\lambda A_\lambda\).
- Let \(P_\lambda \in \IR[X,X]\) be the orthogonal projector with image \(A_\lambda\), then: $$ A = \sum_{\lambda \in \IR} \lambda . P_\lambda $$
- The projectors are mutually orthogonal: \(P_\lambda * P_\mu = \delta_{\lambda, \mu} P_\lambda.\)
- The identity matrix decomposes as \(I_X = \sum_\lambda P_\lambda\)
Singular Value Decomposition (SVD)
Theorem - SVD Vector Form
Let \(A \in \IR[Y,X]\) be any matrix. Then there is an index set \(Z\) and orthonormal systems \(u_z \in \IR[X], z \in Z\) and \(v_z \in \IR[Y], z\in Z\) so that: $$ A = \sum_{z \in Z} \sigma_z . v_z \tensor u_z $$ Here \(\sigma_z \in \IR, z\in Z\), are positive numbers \(\sigma_z > 0\), called the singular values of \(A\).
- The singular values are uniquely determined by \(A\) up to re-ordering.
- The size of the index set \(\# Z\) is equal to the rank of \(A\).
Proof
Existence
- The matrix \(K = A^* * A \in \IR[X,X]\) is symmetric and positive semi-definite.
- By the spectral theorem (matrix form), we find an orthogonal matrix \(U \in \IR[X,X]\) with columns \(u_x = U * \hat{x}\), so that \(K = \sum_{x \in X} \lambda_x . u_x \tensor u_x\).
- Since \(K\) is positive semidefinite, we have \(u_x * K * u_x = \lambda_x \geq 0\).
- Let \(Z \subset X\) be the subset of \(z \in X\) with \(\lambda_z > 0\). With this notation we have \(K = \sum_{z \in Z} \lambda_{z} . u_{z} \tensor u_z.\)
- For \(z \in Z\), let \(\sigma_z = \sqrt{\lambda_z}\), and set \(v_z = A * u_z / \sigma_z \in \IR[Y]\).
This is an ortho-normal system in \(\IR[Y]\) since for \(z,w \in Z\) we have \(v_z * v_w = u_{z} * A^* * A * u_{w} / (\sigma_z \sigma_w) = \delta_{z,w}\).
Now \(A = \sum_{z \in Z} \sigma_z v_z \tensor u_z\), since the values on the basis \(u_x, x \in X\) agree:
- For \(x \in Z \subset X\), we find \(\left(\sum_{z \in Z} \sigma_z v_z \tensor u_z\right) * u_x = \sigma_x v_x = A * u_x\).
- For \(x \in X, x \notin Z\), we find \(\left(\sum_{z \in Z} \sigma_z v_z \tensor u_z\right) * u_x = 0\), and \(\| A * u_x \|^2 = u_x *K *u_x = 0\), so \(A * u_x = 0\).
Here are two more equivalent versions of the SVD theorem:
Theorem - Reduced Matrix SVD
Let \(A \in \IR[Y,X]\) be any matrix. Then \(A\) can be factored as
$$ A = V * \Sigma * U^* $$
where
- \(V \in \IR[Y,Z]\) is an orthonormal system in \(\IR[Y]\): \(V^* * V = I_Z\).
- \(U \in \IR[X,Z]\) is an orthonormal system in \(\IR[X]\): \(U^* * U = I_Z\).
- \(\Sigma \in \IR[Z,Z]\) is a diagonal matrix \(\Sigma = \sum_z \sigma_z . \hat{z} \tensor \hat{z}\).
- \(rk(U)=rk(V)=rk(\Sigma)=rk(A) = \# Z\).
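For illustration, the sketch below recovers this factorization from `np.linalg.svd`; note that numpy returns factors `u, s, vh` with `A = u @ diag(s) @ vh`, so numpy's left factor plays the role of our \(V\) and `vh`\(^*\) plays the role of our \(U\):

```python
import numpy as np

rng = np.random.default_rng(5)
Y, X = 5, 4
A = rng.standard_normal((Y, 2)) @ rng.standard_normal((2, X))   # rank 2 matrix in IR[Y,X]

U_np, s, Vh = np.linalg.svd(A, full_matrices=False)

r = np.sum(s > 1e-10)                  # rank = #Z
V = U_np[:, :r]                        # V in IR[Y,Z]
U = Vh[:r, :].T                        # U in IR[X,Z]
Sigma = np.diag(s[:r])                 # Sigma in IR[Z,Z], positive entries

assert np.allclose(V.T @ V, np.eye(r))          # V^* * V = I_Z
assert np.allclose(U.T @ U, np.eye(r))          # U^* * U = I_Z
assert np.allclose(A, V @ Sigma @ U.T)          # A = V * Sigma * U^*
assert r == np.linalg.matrix_rank(A)
```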
Theorem - Full Matrix SVD
Let \(A \in \IR[Y,X]\) be any matrix. Then \(A\) can be factored as
$$ A = V * \Sigma * U^* $$
where
- \(V \in \IR[Y,Y]\) is an orthonormal transformation: \(V^* * V = V * V^* = I_Y\).
- \(U \in \IR[X,X]\) is an orthonormal transformation: \(U^* * U = U * U^* = I_X\).
- \(\Sigma \in \IR[Y,X]\) is a (non-square) diagonal matrix: \(\Sigma = \sum_z \sigma_z . \widehat{y(z)} \tensor \widehat{x(z)}\), for a pair of embeddings \(x:Z\ra X\), \(y:Z\ra Y\).
Comparison Theorem
Let \(A \in \IR[Y,X]\) be any matrix, and \(A = V * \Sigma * U^*\) a reduced Singular Value Decomposition for \(A\), so \(U \in \IR[X,Z]\), \(V \in \IR[Y,Z]\) are orthonormal systems and \(\Sigma \in \IR[Z,Z]\) is diagonal with positive (non-zero) entries.
- Let \(T\) be any orthogonal matrix in \(\IR[Z,Z]\) satisfying the condition \(T * \Sigma = \Sigma * T\); then: $$ A = (V * T) * \Sigma * (U * T)^* $$ is a Singular Value Decomposition of \(A\).
- Let \(A = V' * \Sigma * U'^*\) be another reduced singular value decomposition for \(A\). Then there is an orthogonal matrix \(T \in \IR[Z,Z]\) with \(T * \Sigma = \Sigma * T\), so that \(U' = U * T\) and \(V' = V * T.\)
Recall from above that the condition \(T*\Sigma = \Sigma*T\) is equivalent to \(T\) having a block-diagonal structure along the level-sets of \(\Sigma\).
Proof
The first claim follows by direct computation: \((V*T) * \Sigma * (U*T)^* = V * T * \Sigma * T^* * U^* = V * \Sigma * T * T^* * U^* = V * \Sigma * U^* = A\).
Now assume \(V * \Sigma * U^* = A = V' * \Sigma * U'^*\). Let \(S = V'^* * V\) and \(T = U'^* * U\); we need to show that \(S = T\) and \(T * \Sigma = \Sigma * T\).
Since \(V * \Sigma^2 * V^* = A * A^* = V' * \Sigma^2 * V'^*\), we find that \(S * \Sigma^2 = \Sigma^2 * S\). Since \(\Sigma > 0\), the diagonal operators \(\Sigma\) and \(\Sigma^2\) induce the same partition of \(Z\) into level-sets \(Z = \Union_\sigma Z_\sigma\). Therefore \(S * \Sigma = \Sigma * S\) as well.
Now \(V * \Sigma * U^* = A = V' * \Sigma * U'^*\) shows that \(S * \Sigma = \Sigma * T\). Combining both relations we find \(\Sigma * S = \Sigma * T\). Since \(\Sigma > 0\) we conclude \(S = T\). QED.
Relationship between SVD and Spectral Decomposition
For a symmetric matrix \(A = A^*\), any Spectral Decomposition \(A = Q * D * Q^*\) is also a (reduced) Singular Value Decomposition, with \(V = U = Q\) and \(\Sigma = D\).
Conversely, if \(A = V * \Sigma * U^*\) is a reduced Singular Value Decomposition of a symmetric, positive semi-definite matrix \(A\), we automatically get a (reduced) Spectral Decomposition, i.e. \(U = V\).
The example of the Zero matrix shows that this is not the case for the full SVD.
Theorem
Let \(A \in \IR[X,X]\) be symmetric (\(A = A^*\)) and positive semi-definite.
Let \(A = V * \Sigma * U^*\) be a reduced Singular Value Decomposition for \(A\), where \(U, V \in \IR[X,Z]\) are orthonormal systems and \(\Sigma \in \IR[Z,Z]\) is diagonal with positive (non-zero) entries.
Then \(U = V\).
Proof
Since \(A = A^*\) is positive semi-definite, \(A\) admits a reduced spectral decomposition \(A = Q * D * Q^*\), where \(Q\) is an orthonormal system and \(D\) is diagonal with positive entries.
As any such spectral decomposition is also a SVD, we can apply the comparison theorem to find an orthogonal \(T \in \IR[Z,Z]\) with \(T * \Sigma = \Sigma * T\), \(U = Q * T\) and \(V = Q * T\). Hence \(U = V\). QED.
Proposition
Let \(A \in \IR[Y,X]\) be an arbitrary matrix with singular value decomposition \(A = \sum_{z \in Z} \sigma_z v_z \tensor u_z\).
Then:
- \(A^* * A = \sum_{z \in Z} \sigma_z^2 \, u_z \tensor u_z\) is a spectral decomposition of \(A^* * A \in \IR[X,X]\).
- \(A * A^* = \sum_{z \in Z} \sigma_z^2 \, v_z \tensor v_z\) is a spectral decomposition of \(A * A^* \in \IR[Y,Y]\).
Proposition
Let \(A \in \IR[Y,X]\) be an arbitrary matrix, \(K = A^* * A \in \IR[X,X]\) and \(L = A * A^* \in \IR[Y,Y]\).
Let \(\IR[X] = \bigoplus_\lambda K_\lambda\) and \(\IR[Y] = \bigoplus_\lambda L_\lambda\) be the abstract spectral decompositions of \(K\) and \(L\) into eigenspaces.
Then:
- \(A(K_\lambda) \subseteq L_\lambda\), hence we get an induced map as restriction which we denote by \(A_\lambda: K_\lambda \ra L_\lambda\).
- \(A\) decomposes as \(A = \sum_\lambda A_\lambda\)
- \(A_\lambda^* A_\lambda = \lambda I\) on \(K_\lambda\) and \(A_\lambda A_\lambda^* = \lambda I\) on \(L_\lambda\)
- If \(\lambda > 0\) then \(A_\lambda\) is an isomorphism.
- For \(\lambda = 0\) we have \(A(K_0)=0\) and \(A^*(L_0)=0\).
Pseudoinverse
Definition — Pseudoinverse
Let \(A \in \IR[Y,X]\) be any matrix.
The pseudoinverse \(A^+ \in \IR[X,Y]\) is the unique map satisfying:
- \(A * A^+ * A = A\) - Normal Equation
- \(A^+ * A * A^+ = A^+\) - Normal Equation
- \((A * A^+)^* = A * A^+\) - Symmetry
- \((A^+ * A)^* = A^+ * A\) - Symmetry
Proposition
- The inverse matrix is a pseudoinverse if it exists.
- If a pseudoinverse exists it is unique.
Proposition — Full Column Rank Case
Let \(A \in \IR[Y,X]\).
- If \(A\) has full column rank (\(\operatorname{rk}(A) = \# X\)). Then \(A^+ = (A^* * A)^{-1} * A^*.\)
- If \(A\) has full row rank (\(\operatorname{rk}(A) = \# Y\)). Then \(A^+ = A^* * (A * A^*)^{-1}.\)
Proof
Set \(y = A^+ * x\) for \(x \in \IR[Y]\). The pseudoinverse satisfies \(A^* * A * A^+ = A^*\) (transpose \(A * A^+ * A = A\) and use the symmetry of \(A * A^+\)), so \(A^* * A * y = A^* * x\), i.e. \(y\) solves the "normal equation".
Since \(A\) has full column rank, \(A^* * A\) is invertible, and solving for \(y\) gives \(A^+ * x = (A^* * A)^{-1} * A^* * x\).
Verifying properties:
- \(A A^+ A = A\) since \((A^* A)^{-1} A^* A = I\).
- \(A^+ A A^+ = A^+\) since \(A^+ A = I\).
- Symmetry conditions follow since \(A^* A\) is symmetric.
The row rank case follows similarly.
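A quick numerical check of the full column rank formula against `np.linalg.pinv`; the random tall matrix is assumed to have full column rank, which holds generically:

```python
import numpy as np

rng = np.random.default_rng(6)
Y, X = 6, 3
A = rng.standard_normal((Y, X))               # generic tall matrix: full column rank

# Full column rank formula: A^+ = (A^* * A)^{-1} * A^*
A_pinv = np.linalg.inv(A.T @ A) @ A.T

assert np.allclose(A_pinv, np.linalg.pinv(A))
assert np.allclose(A @ A_pinv @ A, A)          # A * A^+ * A = A
assert np.allclose(A_pinv @ A, np.eye(X))      # A^+ * A = I in the full column rank case
```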
Proposition — Pseudoinverse via SVD
Let \(A = \sum_{z \in Z} \sigma_z v_z \tensor u_z\) be the singular value decomposition of \(A\), with singular values \(\sigma_z > 0\).
Then $$ A^+ = \sum_{z \in Z} \frac{1}{\sigma_z} u_z \tensor v_z. $$
Proof
We compute:
$$
A * A^+ = \sum_{z} v_z \tensor v_z, \quad
A^+ * A = \sum_{z} u_z \tensor u_z
$$
These are orthogonal projectors onto \(\operatorname{Im}(A)\) and \(\operatorname{Im}(A^*)\).
Verifying the defining properties:
- \(A A^+ A = A\)
- \(A^+ A A^+ = A^+\)
- \(A A^+\), \(A^+ A\) symmetric.
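The following sketch builds \(A^+\) from the SVD formula above and checks the four defining properties; the rank-deficient test matrix is an ad-hoc choice:

```python
import numpy as np

rng = np.random.default_rng(7)
Y, X = 5, 4
A = rng.standard_normal((Y, 2)) @ rng.standard_normal((2, X))   # rank-deficient matrix

U_np, s, Vh = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10)

# A^+ = sum_z (1/sigma_z) u_z ⊗ v_z   (u_z in IR[X], v_z in IR[Y])
A_plus = sum(np.outer(Vh[z, :], U_np[:, z]) / s[z] for z in range(r))

assert np.allclose(A_plus, np.linalg.pinv(A))
assert np.allclose(A @ A_plus @ A, A)
assert np.allclose(A_plus @ A @ A_plus, A_plus)
assert np.allclose((A @ A_plus).T, A @ A_plus)
assert np.allclose((A_plus @ A).T, A_plus @ A)
```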
Spectral Functions
Definition Let \(f:\IR \to \IR\) be any function defined on the real line. For a symmetric matrix \(A \in \IR[X,X]\) with spectral decomposition \(A = \sum_{\lambda \in \IR} \lambda . P_\lambda\), we define the matrix function \(f(A)\) as: $$ f(A) = \sum_{\lambda \in \IR} f(\lambda) . P_\lambda. $$ Note that \(f(A)\) is determined by the values of \(f\) on the spectrum \(\sigma(A) \subset \IR\), i.e. the set of eigenvalues.
Proposition
Let \(A \in \IR[X,X]\) be a symmetric matrix with abstract spectral decomposition \(\sum_{\lambda \in \IR} \lambda . P_\lambda\).
- Let \(P(x) = \sum_{k=0}^d a_k x^k\) be a formal polynomial with real coefficients \(a_k\). Then $$ \sum_{k=0}^d a_k A^{k} = P(A) = \sum_{\lambda \in \IR} P(\lambda) . P_\lambda $$ i.e. evaluating \(P\) via matrix multiplication gives the same result as the action on the eigenvalues.
- Let \(P(x) = \sum_{k=0}^\infty a_k x^k\) be a power series with real coefficients \(a_k\) convergent within \(|x|<\rho\). Assume that \(|\lambda| < \rho\) for all eigenvalues \(\lambda\) of \(A\), then the matrix power series $$ P(A) = \sum_{k=0}^\infty a_k A^k $$ converges and equals \(\sum_{\lambda \in \IR} P(\lambda) . P_\lambda\).
Proof
Since \(A = \sum_\lambda \lambda P_\lambda\), we have \(A^k = \sum_\lambda \lambda^k P_\lambda\) for all \(k \geq 0\). Hence the partial sums are \(\sum_{k=0}^n a_k A^k = \sum_\lambda \left(\sum_{k=0}^n a_k \lambda^k\right) P_\lambda\). Since \(|\lambda| < \rho\) and \(P(x) = \sum_{k=0}^\infty a_k x^k\) converges for \(|x| < \rho\), each scalar series \(\sum_{k=0}^\infty a_k \lambda^k = P(\lambda)\) converges. The matrix series converges to \(\sum_{\lambda \in \IR} P(\lambda) . P_\lambda\) since the projectors \(P_\lambda\) are mutually orthogonal and only finitely many are non-zero.
Examples
- \(\sqrt{A} = \sum_{\lambda \geq 0} \sqrt{\lambda} . P_\lambda\) (defined when all eigenvalues are non-negative)
- \(e^A = \sum_{\lambda \in \IR} e^\lambda . P_\lambda\) (always defined)
- \(\log(A) = \sum_{\lambda > 0} \log(\lambda) . P_\lambda\) (defined when all eigenvalues are positive)
- \(\sin(A) = \sum_{\lambda \in \IR} \sin(\lambda) . P_\lambda\) (always defined)
- \(\cos(A) = \sum_{\lambda \in \IR} \cos(\lambda) . P_\lambda\) (always defined)
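These spectral functions are straightforward to compute from an eigendecomposition. A small numpy sketch (the helper `spectral_function` is an ad-hoc name, and the positive definite test matrix is an arbitrary choice):

```python
import math
import numpy as np

rng = np.random.default_rng(8)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                       # symmetric, positive definite

lam, Q = np.linalg.eigh(A)

def spectral_function(f, lam, Q):
    """f(A) = sum_lambda f(lambda) . P_lambda, built from the eigendecomposition."""
    return Q @ np.diag(f(lam)) @ Q.T

sqrt_A = spectral_function(np.sqrt, lam, Q)
exp_A  = spectral_function(np.exp, lam, Q)

assert np.allclose(sqrt_A @ sqrt_A, A)        # sqrt(A) * sqrt(A) = A
# exp(A) agrees with the matrix power series sum_k A^k / k!
series = sum(np.linalg.matrix_power(A, k) / math.factorial(k) for k in range(100))
assert np.allclose(exp_A, series)
```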
Theorem — Polynomial Formulas for Spectral Objects
Let \(A \in \IR[X,X]\) be a symmetric matrix with distinct eigenvalues \(\lambda_1, \dots, \lambda_m \in \IR\).
- Sylvester Formula for Spectral Projectors.
The projector \(P_{\lambda_1}\) onto the eigenspace for \(\lambda_1\) can be computed as: $$ P_{\lambda_1} = \prod_{j=2}^m \frac{A - \lambda_j I}{\lambda_1 - \lambda_j} $$
- Polynomial Formula for Pseudoinverse.
Let \(\lambda_1, \dots, \lambda_m \in \IR\) be the distinct eigenvalues of \(A\) as above. The pseudoinverse can be computed by summing over the non-zero eigenvalues: $$ A^+ = \sum_{i:\, \lambda_i \neq 0} \frac{1}{\lambda_i} \prod_{j=1, j \neq i}^m \frac{A - \lambda_j I}{\lambda_i - \lambda_j} $$
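A numerical check of both formulas on a small symmetric matrix with the known, distinct eigenvalues \(0, 1, 3\) (an ad-hoc construction):

```python
import numpy as np
from functools import reduce

# A symmetric matrix with distinct eigenvalues 0, 1, 3, built from an orthogonal Q.
Q, _ = np.linalg.qr(np.random.default_rng(9).standard_normal((3, 3)))
lams = np.array([0.0, 1.0, 3.0])
A = Q @ np.diag(lams) @ Q.T
I = np.eye(3)

def sylvester_projector(i):
    """P_{lambda_i} = prod_{j != i} (A - lambda_j I) / (lambda_i - lambda_j)"""
    factors = [(A - lams[j] * I) / (lams[i] - lams[j]) for j in range(len(lams)) if j != i]
    return reduce(lambda P, F: P @ F, factors)

# Spectral projectors from the formula agree with the eigenvector projectors.
for i in range(3):
    assert np.allclose(sylvester_projector(i), np.outer(Q[:, i], Q[:, i]))

# Polynomial formula for the pseudoinverse (sum over the non-zero eigenvalues).
A_plus = sum(sylvester_projector(i) / lams[i] for i in range(3) if lams[i] != 0)
assert np.allclose(A_plus, np.linalg.pinv(A))
```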
Norms
Vector Norms
Let \(a \in \IR[X]\). We define the euclidean norm (or \(L^2\)-norm) as $$ \| a \| = \sqrt{a * a} = \sqrt{\sum_{x} a[x]^2}. $$
Proposition
For \(a,b \in \IR[X]\), \(\lambda \in \IR\)
- \(\| \lambda a \| = |\lambda| \, \|a\|\). (Scaling)
- \(\| a + b \| \leq \| a \| + \|b\|\). (Triangle Inequality)
- \(\| a \| = 0\) only if \(a = 0\). (Positive Definiteness)
- For all \(a, b \in \IR[X]\): we have \(| a * b | \le \| a \| \, \| b \|.\) (Cauchy-Schwarz)
Proof
For the Triangle Inequality, fix \(b \neq 0\) and \(l = \|a\|\), and consider the function \(f(a) = \|a + b\|^2\) on the sphere \(g(a) = a * a - l^2 = 0\). We have to show that \(f(a)\) is not larger than the constant \((l + \|b\|)^2\). By compactness, \(f\) attains its maximum at a point \(a\) where \(df(a) = \lambda dg(a)\) by the Lagrange criterion. Now compute: $$ df(a) = 2(a + b), \quad dg(a) = 2a. $$ Hence \(a + b = \lambda a\), which shows that \(a\) and \(b\) are co-linear: \(a = \mu b\). In this case \(\|a + b\| = |1 + \mu| \, \|b\| \leq \|a\| + \|b\| = l + \|b\|\). qed.
For Cauchy-Schwarz, it suffices by scaling to consider \(\|a\| = \|b\| = 1\). Maximize \(f(a) = (a * b)^2\) over the sphere \(g(a) = a * a - 1 = 0\). By compactness a maximum \(a\) exists, for which \(df(a) = \lambda dg(a)\) by the Lagrange criterion. Compute: $$ df(a) = 2 (a*b) b, \quad dg(a) = 2a $$ This shows that either \(a * b=0\), in which case \(f(a) = 0\), or \(b = \mu a\), hence \(a = \pm b\), in which case \(f(a) = 1\). So \(0 \leq f(a) \leq 1\) for all unit vectors, qed.
Matrix Norms
Let \(A \in \IR[Y,X]\), define:
- The operator (or spectral) norm is defined as \(\| A \|_2 = \sup_{a \neq 0} \| A * a \| / \|a\|.\)
- The Frobenius norm is \(\|A\|_F = \sqrt{ \sum_{x,y} A[x,y]^2 } = \sqrt{ \operatorname{trace}(A^* * A) }.\)
Proposition
- If \(U \in \IR[Y,Y], V \in \IR[X,X]\) are orthogonal matrices then:
- \(\|A\|_2 =\|U*A*V\|_2\)
- \(\|A\|_F =\|U*A*V\|_F\)
- Let \(A = V * \Sigma * U^*\) be a SVD for \(A\); then
- \(\|A\|_2 = \| \Sigma \|_2 = \sigma_{max}\), where \(\sigma_{max}\) is the largest singular value for \(A\).
- \(\|A\|_F = \| \Sigma \|_F = \sqrt{\sum_z \sigma_z^2}\).
- If \(A = A^* \in \IR[X,X]\) and \(A = U D U^*\) is the spectral decomposition, then:
- \(\|A\|_2 = \| D \|_2 = \max_x |\lambda_x|\), the largest eigenvalue of \(A\) in absolute value.
- \(\|A\|_F = \| D \|_F = \sqrt{\sum_x \lambda_x^2}\).
Proof
For an orthogonal \(U \in \IR[Y,Y]\), we find \(\|U * A * a\| = \|A * a\|\) and hence \(\|U*A\|_2=\|A\|_2\). For an orthogonal \(V \in \IR[X,X]\), we have \(\|a\|= \|V * a\|\) and hence \(\|A * V\|_2=\|A\|_2\).
For the Frobenius norm we have \(\| V \Sigma U^* \|_F^2 = tr(U \Sigma V^* V \Sigma U^*) = tr(\Sigma \Sigma U^* U) = tr(\Sigma\Sigma)\) by cyclicity of the trace. qed.
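A short numerical confirmation using numpy's built-in norms and singular values (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((5, 3))
s = np.linalg.svd(A, compute_uv=False)          # singular values, descending

# Operator norm = largest singular value, Frobenius norm = sqrt of sum of squares.
assert np.isclose(np.linalg.norm(A, 2), s[0])
assert np.isclose(np.linalg.norm(A, "fro"), np.sqrt(np.sum(s ** 2)))

# For a symmetric matrix the operator norm is the largest |eigenvalue|.
B = rng.standard_normal((4, 4)); S = B + B.T
lam = np.linalg.eigvalsh(S)
assert np.isclose(np.linalg.norm(S, 2), np.max(np.abs(lam)))
```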
Proposition
Let \(A \in \IR[X,Y]\), \(B \in \IR[Y,Z]\), \(b \in \IR[Y]\), and let \(\|-\|\) denote either the Operator or the Frobenius Norm.
- \(\| A \|_2 = \| A^* \|_2\) and \(\| A \|_F = \| A^* \|_F\).
- \(\| A * b \|_2 \le \|A\|_2 \|b\|\) and \(\| A * b \|_F \le \|A\|_F \|b\|.\)
- \(\| A * B \|_2 \le \|A\|_2 \|B\|_2\) and \(\| A * B \|_F \le \|A\|_F \|B\|_F.\)