Support Vector Machines


Background

Projections

Given vectors $\mathbf{a}$ and $\mathbf{b}$, we wish to find the vector projection of $\mathbf{a}$ onto $\mathbf{b}$, denoted $\mathrm{proj}_{\mathbf{b}}\mathbf{a}$. This is the component of $\mathbf{a}$ that lies on $\mathbf{b}$.
To find an expression for $\mathrm{proj}_{\mathbf{b}}\mathbf{a}$ in terms of $\mathbf{a}$ and $\mathbf{b}$, we note the following:
  • $\mathrm{proj}_{\mathbf{b}}\mathbf{a}$ and $\mathbf{b}$ point in the same direction, meaning they have the same unit vectors: $\frac{\mathrm{proj}_{\mathbf{b}}\mathbf{a}}{\|\mathrm{proj}_{\mathbf{b}}\mathbf{a}\|} = \frac{\mathbf{b}}{\|\mathbf{b}\|}$
  • We have a right-angled triangle, meaning: $\|\mathrm{proj}_{\mathbf{b}}\mathbf{a}\| = \|\mathbf{a}\|\cos\theta$
  • Our standard rule for the dot product applies: $\mathbf{a}\cdot\mathbf{b} = \|\mathbf{a}\|\|\mathbf{b}\|\cos\theta$
Using these three equations, we can express $\mathrm{proj}_{\mathbf{b}}\mathbf{a}$ as follows:
$$\mathrm{proj}_{\mathbf{b}}\mathbf{a} = \|\mathbf{a}\|\cos\theta\,\hat{\mathbf{b}} = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{b}\|}\,\hat{\mathbf{b}} = \frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{b}\|^2}\,\mathbf{b}$$
where $\hat{\mathbf{b}} = \frac{\mathbf{b}}{\|\mathbf{b}\|}$, and the scalar projection is $\frac{\mathbf{a}\cdot\mathbf{b}}{\|\mathbf{b}\|}$.
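As a quick sanity check, here's a minimal NumPy sketch of these projection formulas (the example vectors are arbitrary):

```python
import numpy as np

def project(a, b):
    """Vector projection of a onto b: (a·b / ||b||^2) b."""
    return (a @ b) / (b @ b) * b

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

print(project(a, b))                 # [3. 0.] -- the component of a along b
print((a @ b) / np.linalg.norm(b))   # 3.0     -- the scalar projection
```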

Hyperplanes

Witten et al. (2013) give the following definition:
In a p-dimensional space, a hyperplane is a flat affine subspace of dimension p − 1.
Given $\beta \in \mathbb{R}^p$ and $\beta_0 \in \mathbb{R}$, a hyperplane is defined by:
$$\{\mathbf{x} \in \mathbb{R}^p : \beta \cdot \mathbf{x} + \beta_0 = 0\}$$
This can be turned into an inequality ($\beta \cdot \mathbf{x} + \beta_0 > 0$ or $< 0$) to separate the p-dimensional space.
To visualise this, think of the vector $\beta$ extending out into space and then taking the dot product with an arbitrary vector $\mathbf{x}$ in that space. At some set of values for $\mathbf{x}$ the dot product will equal $-\beta_0$. This is our hyperplane!
Visually it makes sense that this set of values is perpendicular to $\beta$. The maths checks out: for any two points $\mathbf{x}_1, \mathbf{x}_2$ on the hyperplane, $\beta \cdot (\mathbf{x}_1 - \mathbf{x}_2) = -\beta_0 - (-\beta_0) = 0$.

The (shortest) distance of a point $\mathbf{x}$ to the hyperplane is given by:
$$d(\mathbf{x}) = \|\mathbf{x} - \mathbf{x}'\| = \frac{|\beta \cdot \mathbf{x} + \beta_0|}{\|\beta\|}$$
where $\mathbf{x}'$ is the nearest point on the hyperplane, making $\mathbf{x} - \mathbf{x}'$ perpendicular to the hyperplane and thus parallel to $\beta$.
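A minimal sketch of this distance formula (the coefficients below just define an example hyperplane $x_1 + x_2 - 1 = 0$):

```python
import numpy as np

def distance_to_hyperplane(x, beta, beta0):
    """Shortest distance from point x to the hyperplane beta·x + beta0 = 0."""
    return abs(beta @ x + beta0) / np.linalg.norm(beta)

beta = np.array([1.0, 1.0])   # hyperplane: x1 + x2 - 1 = 0
beta0 = -1.0

print(distance_to_hyperplane(np.array([1.0, 1.0]), beta, beta0))  # ~0.707
print(distance_to_hyperplane(np.array([0.5, 0.5]), beta, beta0))  # 0.0 (on the hyperplane)
```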

Maximal Margin Classifiers

We assume a binary classification problem, $y_i \in \{-1, +1\}$.
The maximal margin hyperplane is the hyperplane that correctly splits the space into two halves corresponding to the classes, and has the largest possible distance between the hyperplane and any point(s).
Mathematically, this is the solution to the optimisation problem:
$$\max_{\beta_0, \beta, M} M \quad \text{subject to} \quad \|\beta\| = 1, \quad y_i(\beta \cdot \mathbf{x}_i + \beta_0) \geq M \;\; \text{for all } i$$
Using what we know about the shortest distance to the hyperplane, we can view the second constraint as requiring every point to lie at least a distance $M$ from the hyperplane, on the correct side (with $\|\beta\| = 1$, $y_i(\beta \cdot \mathbf{x}_i + \beta_0)$ is exactly that signed distance). Thus the first constraint is essentially arbitrary and we could fix $\|\beta\|$ to any value; equivalently, we could fix the constraint to $y_i(\beta \cdot \mathbf{x}_i + \beta_0) \geq 1$ and minimise $\|\beta\|$ instead.
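As an illustration (my own sketch, not part of the derivation above), we can approximate the maximal margin hyperplane on separable data by fitting scikit-learn's linear SVC with a huge cost parameter, which effectively forbids any margin violations:

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated blobs of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C leaves essentially no room for slack, approximating the hard margin.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta0 = clf.coef_[0], clf.intercept_[0]
print("margin width:", 1 / np.linalg.norm(beta))
print("support vectors:\n", clf.support_vectors_)  # the points that define the margin
```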

Support Vector (/Soft Margin) Classifier

We identify two main problems with MMCs:
  1. They only work if the data is linearly separable.
  2. They are particularly sensitive to the support vectors, suggesting overfitting.
To address these weaknesses, we introduce a series of optimised slack variables $\epsilon_i \geq 0$ representing the distance (as a fraction of the margin) by which individual points are able to violate the margin:
$$\max_{\beta_0, \beta, \epsilon, M} M \quad \text{subject to} \quad \|\beta\| = 1, \quad y_i(\beta \cdot \mathbf{x}_i + \beta_0) \geq M(1 - \epsilon_i), \quad \epsilon_i \geq 0, \quad \sum_{i=1}^{n} \epsilon_i \leq C$$
If $\epsilon_i > 0$ then $\mathbf{x}_i$ crosses the margin, and if $\epsilon_i > 1$ then it also crosses the hyperplane. These margin-violating points, together with those lying exactly on the margin, are termed the support vectors.
$C$ is a budget for how much we can effectively deviate from the MMC. It also acts as an upper bound on how many points can cross the hyperplane (each such point requires $\epsilon_i > 1$).
The value of $C$ effectively controls the bias-variance trade-off:
  • Low $C$ means few points define the boundary: high variance.
  • High $C$ means we fit the data less closely: high bias.
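One practical note (my addition): scikit-learn's C parameter plays roughly the opposite role to the budget $C$ above, since it multiplies the slack penalty rather than capping the total slack. Large sklearn C means little slack (low bias, high variance); small C means lots of slack (high bias, low variance). A quick sketch:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Smaller sklearn C -> bigger slack budget -> wider margin -> more support vectors.
    print(f"C={C:>6}: {len(clf.support_)} support vectors")
```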

Support Vector Machines

SVCs are an improvement, but still only offer a linear separating boundary, which in many cases won't be sufficient. Enter the SVM, which addresses this using the kernel trick.
The maths underlying this gets quite complicated, so rather than go into detail I'll give a quick sketch with sources to follow up on.
  1. The constrained optimisation problem presented for SVCs can be framed in terms of the primal or dual Lagrangian (see my notes, and this explanation).
  2. Both have their advantages, but the latter is easier to optimise and gives the benefit of only requiring the inner product of each pair of observations, $\langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle$ (see The Elements of Statistical Learning, Ch 12 for the mathematical proof of this, via the KKT conditions, or my notes).
  3. The dual form then gives the following classifier: $f(\mathbf{x}) = \beta_0 + \sum_{i=1}^{n} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle$.
  4. Note that $\alpha_i = 0$ for all $i \notin \mathcal{S}$, where $\mathcal{S}$ is the set of support vectors, meaning our sum reduces to $f(\mathbf{x}) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i y_i \langle \mathbf{x}, \mathbf{x}_i \rangle$, although (I think) we still have to compute all pairs in practice because we don't know the support vectors in advance.
  5. The kernel trick allows us to essentially compute the separating hyperplane as though we had projected the data into some higher-dimensional space that's easier to separate. This is done by computing the inner product as though it were the one taken in that high-dimensional space. The functions that compute these special "products" are known as kernels, $K(\mathbf{x}_i, \mathbf{x}_{i'}) = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_{i'}) \rangle$, and we can substitute them to give our final SVM equation: $f(\mathbf{x}) = \beta_0 + \sum_{i \in \mathcal{S}} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i)$. See the sketch after this list.
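As a sanity check on the dual form (my own sketch), we can recover this decision function by hand from a fitted scikit-learn SVC: its dual_coef_ attribute stores $\alpha_i y_i$ for each support vector, and intercept_ stores $\beta_0$:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)

# f(x) = beta_0 + sum_{i in S} alpha_i y_i K(x, x_i), summing over support vectors only.
K = rbf_kernel(X[:5], clf.support_vectors_, gamma=1.0)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.allclose(f_manual, clf.decision_function(X[:5])))  # True
```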
The standard (linear) kernel we've already seen: it's just the inner product, $K(\mathbf{x}_i, \mathbf{x}_{i'}) = \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle$. We can think of the kernel as representing a measure of the similarity between two vectors. Let's examine a few more:

Polynomial Kernel

The polynomial kernel of degree d is represented by:
$$K(\mathbf{x}_i, \mathbf{x}_{i'}) = (1 + \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle)^d$$
This is equivalent to using the inner product between the standard d-degree polynomial basis expansions $\phi(\mathbf{x})$ (all monomials in the components of $\mathbf{x}$ up to degree d) for both vectors: i.e. $\langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_{i'}) \rangle$. The complexity of computing this product explicitly is $O(p^d)$.
Our polynomial kernel is a little different. It can be seen that the terms of $(1 + \langle \mathbf{x}_i, \mathbf{x}_{i'} \rangle)^d$ are the same as those of the inner product described above (albeit with some scaled differently). The difference is that, because we can compute the power via an iterative self-product, we can do it in $O(p)$, and never have to bother computing the full basis expansion. In fact, it's because we don't need to compute this full expansion that this gain is made possible!
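A tiny sketch of this equivalence for $d = 2$ and $p = 2$, where the explicit (suitably scaled) feature map is $\phi(\mathbf{x}) = (1, \sqrt{2}x_1, \sqrt{2}x_2, x_1^2, x_2^2, \sqrt{2}x_1x_2)$:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """(1 + <x, z>)^d, computed in O(p) with no basis expansion."""
    return (1 + x @ z) ** d

def phi(x):
    """Explicit degree-2 feature map whose inner product matches the kernel."""
    x1, x2 = x
    return np.array([1, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, z))   # (1 + 3 - 2)^2 = 4.0
print(phi(x) @ phi(z))     # 4.0 -- same value via the explicit expansion
```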

Radial Kernel

The radial basis function (RBF) kernel is given by:
$$K(\mathbf{x}_i, \mathbf{x}_{i'}) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_{i'}\|^2\right)$$
where $\gamma$ is a positive constant. This Stack Exchange post outlines the equivalent inner product for this kernel. We can think of this as a bit like a Gaussian, meaning that unless we're near the mean/peak then the similarity will be very low. In the context of SVMs, this means that a point's classification will be determined by its proximity to the peaks of the different support vectors (not necessarily organised linearly!). The width of these peaks is determined by our choice of $\gamma$.
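A quick sketch of this decaying-similarity behaviour (the value of gamma here is just illustrative):

```python
import numpy as np

def rbf(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2): close to 1 for nearby points, close to 0 far away."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.zeros(2)
for dist in (0.1, 0.5, 1.0, 2.0, 5.0):
    z = np.array([dist, 0.0])
    print(f"distance {dist}: similarity {rbf(x, z, gamma=1.0):.4f}")
# Increasing gamma makes the peaks narrower, so similarity drops off faster.
```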

Multi-class Classification

We have two options here:
Pairwise / One-vs-one: Train a classifier for each of the $\binom{K}{2}$ pairs of classes and choose the class that is assigned most often across all the pairwise classifiers.
One-vs-all: Train a classifier for each class versus all other classes, and choose the class whose classifier assigns the most distance/confidence to the point.
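Both strategies are available in scikit-learn (SVC itself uses one-vs-one internally for multi-class problems); a small sketch:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 3 classes

ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)   # one classifier per pair of classes
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)  # one classifier per class

print(len(ovo.estimators_), len(ovr.estimators_))  # 3 3  (K(K-1)/2 = 3 when K = 3)
print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```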

Link to Logistic Regression

SVMs are actually doing something very similar to logistic regression.
To understand this, let's revisit the optimisation problem for the linear SVC and look at its Lagrangian (primal):
$$\min_{\beta_0, \beta} \left\{ \sum_{i=1}^{n} \max\left[0,\ 1 - y_i(\beta \cdot \mathbf{x}_i + \beta_0)\right] + \lambda \|\beta\|^2 \right\}$$
This looks just like a loss with an L2 regularisation term! In fact this loss has a name - we call it the hinge loss.
With this in mind, let's look again at the linear SVC and compare it with the standard loss formulation for linear logistic regression (ignoring the regularisation term). We'll do this for a single data pair $(\mathbf{x}, y)$ with $y \in \{-1, +1\}$. We use two intermediate variables here: $z = \beta \cdot \mathbf{x} + \beta_0$ is the output of the linear transformation, and $\hat{y}$ is the "output" that we use in our loss (potentially different from our actual class prediction).
For linear logistic regression, we have:
$$\hat{y} = \sigma(z), \qquad \mathcal{L}(z, y) = -\log \sigma(yz) = \log\left(1 + e^{-yz}\right)$$
For a linear SVC, we have:
$$\hat{y} = z, \qquad \mathcal{L}(z, y) = \max(0,\ 1 - yz)$$
 
This side-by-side should make clear the similarities in how these models learn. For logistic regression we apply this "logistic" loss (the softplus of $-yz$) to $z$ and get the red curve. Because of the sigmoid output attached to $z$, we want to push $z$ as far towards the correct extreme as possible. Hence the asymptote.
The margin classifier approach is simply to optimise $z$ to directly match the desired class label in a non-smooth way, applying the hinge loss to $z$ in order to push values towards the correct classification.
One key difference is that the margin classifier is not looking to improve points that are already correctly classified beyond the margin. In a way, the sigmoid (and the loss it requires) means we never reach a point where we're fully happy with our classification, whereas for the hinge loss that point comes as soon as we hit the margin.
[not my diagram] On the x-axis we have $yz$; it can be easier to think of this just in the $y=1$ case ($y=-1$ is basically the same but flipped). On the y-axis we have each loss.
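A minimal sketch comparing the two losses as functions of the margin quantity $yz$ (reproducing the shape of the curves in the diagram):

```python
import numpy as np

def hinge_loss(yz):
    """SVC hinge loss: zero once the point is beyond the margin (yz >= 1)."""
    return np.maximum(0.0, 1.0 - yz)

def logistic_loss(yz):
    """Logistic-regression loss: softplus(-yz), strictly positive for every yz."""
    return np.log1p(np.exp(-yz))

for yz in (-2.0, 0.0, 1.0, 3.0):
    print(f"yz={yz:+.1f}  hinge={hinge_loss(yz):.3f}  logistic={logistic_loss(yz):.3f}")
# Beyond the margin the hinge loss is exactly 0, while the logistic loss keeps
# rewarding ever-more-confident classifications (hence the asymptote).
```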
 
Here we've just looked at SVCs. But we can use the kernel trick for logistic regression (and other ML problems) too! In all, it's hard not to conclude that despite initial appearances, SVMs are really very similar to standard ML methods. Finally, although we won't cover it here, SVMs can also be used for regression as well as classification. The difference between this approach and linear regression is that the loss only punishes residuals larger in absolute terms than some positive constant $\epsilon$ (the $\epsilon$-insensitive loss).
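As a final sketch, support vector regression in scikit-learn, where the epsilon parameter is exactly that "don't-care" band around the fit:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, (80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 80)

# Residuals smaller than epsilon incur no loss at all.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print(svr.predict([[1.0], [2.5]]))
print(len(svr.support_))  # only points outside the epsilon tube become support vectors
```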