β

Linear Algebra

This is really just a refresher on Linear Algebra for me. My full reference is Strang's Introduction to Linear Algebra. This doesn't go quite as deep as that, but captures the key stuff needed for deep learning.

Β

Matrix Multiplication

Key Terms

• Span (of a matrix / set of vectors): the set of all vectors obtainable via linear combination
• Column / row space: span of a matrix's columns / rows
• Linear independence: no vectors are in the span of any other group of vectors
• Invertible matrix: square matrix with linearly independent columns

Motivation

Consider the "basic" matrix-vector multiplication equation: , where is a fixed matrix and is a variable vector to be found.
For a fixed , we care about the possible solutions (i.e. values of ): how many are there and what are they?
Considering the case where is arbitrary (i.e. considering what is true for all ) is perhaps more interesting, and can tell us a lot about and its properties. The key question is: does the equation have a solution for all ?

Solutions

In the case of a fixed the basic equation either has 0, 1 or many solutions. The authors provide a useful way of thinking about this:
To analyse how many solutions the equation has, think of the columns of as specifying different directions we can travel in from the origin ... then determine how many ways there are of reaching .
In the case of an arbitrary :
• There is >= 1 solution for all iff has a set of linearly independent columns.
• This is due to the column space of being all of .
• A necessary (but not sufficient) condition here is (at least as many columns as rows), otherwise the column space can't span .
• There is = 1 solution for all iff has exactly linearly independent columns.
• A necessary condition is therefore that m = n (i.e. is square).
• If satisfies this condition we say it is invertible.
• Note that a square matrix that is not invertible is called singular or degenerate.
Why is this all useful? Consider...

Inverse Matrices

Think of as applying a transformation to a vector or matrix.
If the basic equation has one solution (i.e. is invertible) then this transformation can be reversed. This is often really useful!
This reversal can be expressed as the matrix inverse .
In practice computing directly is often avoided as it can be numerically unstable, but this property is still very important.

Norms

Motivation

A norm is a function that gives us some measure of the distance from the origin to a vector or matrix.
Clearly this is a useful concept!

Definition

A norm is any function which satisfies the following three properties (for all ):
point-separating:
absolutely scalable:
triangle inequality:

Vector Norms

The norm of , often denoted by , is defined as:
where () .
The norm is called the Manhattan norm.
The norm is called the Euclidean norm. This is the standard norm and is commonly referred to without the subscript as simply . The squared norm is also used in some contexts, which is simply .
The norm is called the max norm. It is defined as .

Matrix Norms

(This is a much more complex field that we only touch on briefly here!)
We consider two analogous matrix norms for the vector norm.
In wider mathematics the spectral norm is often used. It can be useful in ML for analysing (among other things) exploding/vanishing gradients. It is defined as 's largest singular value:
However, most in most ML applications it is assumed the Frobenius norm is used. This is defined as:
Note that this is equivalent to: .

Eigendecomposition

Key Terms

• Unit vector: the norm = 1
• Orthogonal vectors:
• Orthonormal vectors: orthogonal unit vectors
• Orthogonal matrix: rows & columns are mutually orthonormal

Motivation

Eigendecomposition decomposes a matrix into eigenvectors and eigenvalues.
A linear transformation applied to one of its eigenvectors , simply shrinks or elongates in its current direction.

Eigenvectors & Eigenvalues

Vector and scalar are an eigenvector-eigenvalue pair for square matrix iff:
(Strictly speaking, here is a right eigenvector. A left eigenvector is such that . We care primarily about right eigenvectors.)
If is an eigenvector it follows that any rescaled version of is also an eigenvector with the same eigenvalue. We typically use a scale such that we have a unit eigenvector.

Calculating Eigenvector/values

How to find the eigenvalues of ?
They are the solution to:
1. We call the characteristic polynomial of .
1. Its roots are solution to the above equations, and hence equal the eigenvalues of .
1. There are guaranteed to be roots.
Β
The algebraic multiplicity of eigenvalue is the number of times it appears as a root in . i.e. the value in
Β
How to find the eigenvectors of , given the eigenvalues ?
These are the linearly independent solutions of:
• There are guaranteed to be per eigenvalue, where is the algebraic multiplicity.
• The value is termed the geometric multiplicity
Β
If all , we say the matrix is diagonalisable.
Only diagonalisable matrices have an eigendecomposition.

Decomposition

If has independent eigenvectors we can combine them into the columns of a matrix, , such that .
The eigendecomposition of is then defined as: .
We are only guaranteed an eigendecomposition if is symmetric (and real-valued). In this case it is often denoted:
Here the decomposition is guaranteed to be real-valued and is orthogonal.
The decomposition may not be unique if two (independent) eigenvectors have the same eigenvalues.
Zero-valued eigenvalues exist iff is singular.

Geometric interpretation

Multiplying by a diagonal matrix (e.g. ) scales allong the standard axes:
If we multiply by an orthogonal matrix before and after, this gives us a rotation which allows us to scale in any directions:
Why is multiplying by an orthogonal matrix a rotation?
1. Length is preserved:
1. Angle between rotated vectors is preserved:

The Determinant

The determinant, noted , is a scalar that reflects how much multiplying by alters the volume of an object.
The determinant is calculated by taking the product of 's eigenvalues.
volume is preserved.
space is contracted completely in at least one dimension. Thus volume = 0.

Singular Value Decomposition

Motivation

We want a decomposition for all real matrices. SVD provides this.
It is also key to computing the pseudoinverse, which enables us to find solutions to in all cases.

Singular Values & Vectors

The SVD decomposes a matrix into:
where:
• is an orthogonal matrix of left-singular vectors (columns)
• is an orthogonal matrix of right-singular vectors (columns)
• is a diagonal matrix of singular values (not necessarily square)
These matrices are calculated as follows:
• = the (unit norm) eigenvectors of
• = the (unit norm) eigenvectors of
• = the square roots of the eigenvalues of either (padded with zeroes if there aren't enough).
With the SVD we can calculate the...

Pseudoinverse

Motivation

There are either 0, 1 or solutions to .
The inverse, , allows us to find if there is a singular solution.
However, for the inverse to exist, must satisfy the properties for being invertible (exactly linearly independent columns).
The pseudoinverse gives us a method for finding the "best" for all possible values of and .

The Moore-Penrose Pseudoinverse

We denote the pseudoinverse as and take our candidate for as follows:
The pseudoinverse is defined as follows:
where,
• and are taken directly from the SVD, although the order here is reversed.
• contains the reciprocals of the (non-zero) values in , with a transpose applied to fix the dimensions.
When there are valid solutions, the pseudoinverse gives the value of with minimal norm.
When there are 0 valid solutions, the pseudoinverse gives the value of such that has minimal norm.

Principal Component Analysis

Motivation

We often encounter high dimensional data that can be compressed by projecting it to a lower dimensional space.
We can do this using linear algebra, by multiplying by a matrix (encoding) and multiplying by its inverse (decoding).
We want to find the matrix that minimises information loss when encoding.

Process

Suppose we have data where .
Encoding/decoding works as follows:
• Encoder:
• Code:
• Decoder: where
We first define the decoder: where .
The columns of are orthogonal and have unit norm, but not square (so itβs not technically an orthogonal matrix).
The encoder is then defined: .
It can be shown this choice minimises . (i.e. the gradient is zero when )
The entire process defines the PCA reconstruction operation: .

Choice of Decoding Matrix

The only question that remains is what we should actually choose for the -dimensional vectors that make up .
It turns out the optimal choice here is the eigenvectors corresponding to the largest eigenvalues of (i.e. squared singular values of ).
The proof of this relies on the following:
1. To find we seek to minimise .
1. This can be shown to be equivalent to maximising subject to .
1. A standard result for a positive semidefinite matrix like is that this fraction's maximum possible value is equal to the largest eigenvalue of .
1. To reach this possible value, is required to equal the corresponding eigenvector.
1. This can be repeated for the subsequent values of , which then correspond to the eigenvectors of the next smallest eigenvalues.

Β