Support Vector Machines (SVM) - From Hinge Loss to Optimization
A Quick Recap: Hinge Loss
To understand Support Vector Machines (SVM) fully, we need to revisit Hinge Loss, which is fundamental to SVM’s working mechanism. Hinge loss is mathematically defined as:
\[\ell_{\text{Hinge}} = \max(1 - m, 0) = (1 - m)_+\]Here, \(m = y f(x)\) represents the margin, and the notation \((x)_+\) refers to the “positive part” of \(x\). This simply means that \((x)_+\) equals \(x\) when \(x \geq 0\), and is zero otherwise.
Why is this loss function important? Hinge loss is a convex upper bound on the 0–1 loss, which makes the resulting optimization problem tractable. It is not without limitations, however: it is not differentiable at \(m = 1\), a point we'll return to later. A “margin error” occurs whenever \(m < 1\), and this notion of margin violation is central to how SVM is formulated.
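To make this concrete, here is a minimal NumPy sketch of the hinge loss (the margin values below are made up purely for illustration):

```python
import numpy as np

def hinge_loss(m):
    """Hinge loss max(1 - m, 0), applied elementwise to an array of margins m = y * f(x)."""
    return np.maximum(1.0 - m, 0.0)

# Margins of at least 1 incur no loss; anything below 1 is a margin error.
margins = np.array([2.0, 1.0, 0.5, -1.0])
print(hinge_loss(margins))  # [0.  0.  0.5 2. ]
```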
With this understanding of Hinge Loss in place, we’re ready to explore how SVM is formulated as an optimization problem.
SVM as an Optimization Problem
At its core, SVM aims to find a hyperplane that maximizes the separation, or margin, between data points of different classes. This task is mathematically framed as the following optimization problem:
We minimize the objective:
\[\frac{1}{2} \|w\|^2 + c \sum_{i=1}^n \xi_i\]subject to the constraints:
\[\xi_i \geq 1 - y_i (w^T x_i + b), \quad \text{for } i = 1, \dots, n\] \[\xi_i \geq 0, \quad \text{for } i = 1, \dots, n\]which is equivalent to
\[\text{minimize } \frac{1}{2} \|w\|^2 + c \sum_{i=1}^n \xi_i\] \[\text{subject to } \xi_i \geq \max \big(0, 1 - y_i (w^T x_i + b) \big), \quad \text{for } i = 1, \dots, n\]In this formulation:
- The term \(\|w\|^2\) represents the L2 regularizer, which helps prevent overfitting by penalizing large weights.
- The variables \(\xi_i\) (slack variables) account for data points that violate the margin constraint, allowing some degree of misclassification.
- The parameter \(c\) controls the trade-off between maximizing the margin and minimizing classification errors.
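As a sanity check, this constrained form can be handed directly to a generic solver. Below is a minimal sketch using cvxpy on synthetic data (the data, the value of \(c\), and all variable names are illustrative assumptions, not part of the formulation itself):

```python
import cvxpy as cp
import numpy as np

# Synthetic two-class data with labels in {-1, +1}.
rng = np.random.default_rng(0)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

c = 1.0  # trade-off parameter
w, b, xi = cp.Variable(d), cp.Variable(), cp.Variable(n)

# minimize (1/2)||w||^2 + c * sum(xi)  s.t.  y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0
objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(objective, constraints).solve()
print(w.value, b.value)
```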
To simplify, we can integrate the constraints directly into the objective function. This gives us:
\[\min_{w \in \mathbb{R}^d, b \in \mathbb{R}} \frac{1}{2} \|w\|^2 + c \sum_{i=1}^n \max(0, 1 - y_i (w^T x_i + b))\]This new formulation has two terms:
- The first term, \(\frac{1}{2} \|w\|^2\), is the L2 regularizer.
- The second term, \(\sum_{i=1}^n \max(0, 1 - y_i (w^T x_i + b))\), captures the Hinge Loss for all data points.
This concise representation highlights the two fundamental objectives of SVM: maintaining a large margin and minimizing misclassifications.
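In code, this unconstrained objective is essentially a one-liner; here is a sketch (the function and variable names are my own):

```python
import numpy as np

def svm_objective(w, b, X, y, c):
    """Soft-margin SVM objective: 0.5 * ||w||^2 + c * sum_i max(0, 1 - y_i (w^T x_i + b))."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + c * np.maximum(0.0, 1.0 - margins).sum()
```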
Depending on the nature of the data, SVM can be approached in two ways. If the data is perfectly separable, we use a hard-margin SVM. In this case, all data points must be correctly classified while maintaining the margin constraints.
However, real-world data is often noisy and not perfectly separable. Here, we use a soft-margin SVM, which introduces the slack variables \(\xi_i\) to allow some margin violations. The degree of permissible violations is controlled by the parameter \(c\), striking a balance between classification accuracy and margin size.
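For context, this \(c\) plays essentially the same role as the C parameter exposed by common SVM implementations. A quick sketch with scikit-learn's LinearSVC on toy data (the data and settings here are illustrative choices):

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
# Large C tolerates few margin violations; small C trades violations for a wider margin.
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.coef_, clf.intercept_)
```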
The SVM Objective Function
The final objective function for SVM, combining regularization and hinge loss, is given by:
\[J(w) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i w^T x_i) + \lambda \|w\|^2\]Here the bias \(b\) has been absorbed into \(w\) (e.g., by appending a constant feature to each \(x_i\)), and \(\lambda\) is the regularization parameter, which is inversely related to \(c\). This function encapsulates both the classification objective (minimizing hinge loss) and the regularization goal (keeping the weights small).
The relationship is inverse because:
- When \(c\) is large, margin violations are penalized heavily, which corresponds to a small \(\lambda\) and weak regularization.
- When \(c\) is small, the model tolerates more margin violations, which corresponds to a large \(\lambda\) and strong regularization.
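To make the inverse relationship concrete, here is a short derivation (a sketch under the convention above, with \(b\) absorbed into \(w\)): dividing the earlier objective by the positive constant \(nc\) leaves the minimizer unchanged,

\[\frac{1}{nc} \left( \frac{1}{2} \|w\|^2 + c \sum_{i=1}^n \max(0, 1 - y_i w^T x_i) \right) = \frac{1}{n} \sum_{i=1}^n \max(0, 1 - y_i w^T x_i) + \frac{1}{2nc} \|w\|^2\]Matching this against \(J(w)\) gives \(\lambda = \frac{1}{2nc}\): as \(c\) grows, \(\lambda\) shrinks, and vice versa.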
Gradients and Optimization
At this point, you might wonder: how do we actually optimize the SVM objective? This is where gradients come into play. Let’s break it down.
Derivative of Hinge Loss
The derivative of Hinge Loss, \(\ell(m) = \max(0, 1 - m)\), is as follows:
\[\ell'(m) = \begin{cases} 0 & \text{if } m > 1 \\ -1 & \text{if } m < 1 \\ \text{undefined} & \text{if } m = 1 \end{cases}\]Using the chain rule, the gradient of \(\ell(y_i w^T x_i)\) with respect to \(w\) is:
\[\nabla_w \ell(y_i w^T x_i) = \begin{cases} 0 & \text{if } y_i w^T x_i > 1 \\ - y_i x_i & \text{if } y_i w^T x_i < 1 \\ \text{undefined} & \text{if } y_i w^T x_i = 1 \end{cases}\]
Gradient of the SVM Objective
Combining the gradients for all data points, the gradient of the SVM objective is:
\[\nabla_w J(w) = \nabla_w \left( \frac{1}{n} \sum_{i=1}^{n} \ell(y_i w^T x_i) + \lambda \|w\|^2 \right)\] \[= \frac{1}{n} \sum_{i=1}^{n} \nabla_w \ell(y_i w^T x_i) + 2\lambda w\] \[= \frac{1}{n} \sum_{i: y_i w^T x_i < 1} (-y_i x_i) + 2\lambda w, \quad \text{provided } y_i w^T x_i \neq 1 \text{ for all } i\]At any point where \(y_i w^T x_i = 1\), the gradient is undefined.
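Here is a NumPy sketch of this gradient (the names are my own; points with margin exactly 1 are folded into the zero branch, which still yields a valid descent direction in practice):

```python
import numpy as np

def svm_gradient(w, X, y, lam):
    """Gradient of J(w) = (1/n) * sum_i max(0, 1 - y_i w^T x_i) + lam * ||w||^2,
    wherever it is defined (i.e., no margin is exactly 1)."""
    n = X.shape[0]
    margins = y * (X @ w)
    active = margins < 1  # only these points contribute -y_i * x_i
    return -(y[active, None] * X[active]).sum(axis=0) / n + 2.0 * lam * w
```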
A common concern with the SVM objective is that it is not differentiable at \(y_i w^T x_i = 1\). However, in practice:
- Starting with a random \(w\), the probability of hitting exactly \(y_i w^T x_i = 1\) is negligible.
- Even if this occurs, small perturbations in the step size can help bypass such points.
Thus, gradient-based optimization methods, such as gradient descent, can be effectively applied to the SVM objective.
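Putting it together, here is a minimal gradient-descent loop, reusing svm_gradient from the sketch above (the step size, epoch count, and toy data are illustrative choices):

```python
import numpy as np

def train_svm_gd(X, y, lam=0.1, lr=0.05, epochs=200):
    """Fixed-step gradient descent on the SVM objective J(w)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * svm_gradient(w, X, y, lam)  # defined in the previous sketch
    return w

# Toy usage on roughly separable synthetic data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1.0] * 50 + [1.0] * 50)
w = train_svm_gd(X, y)
print((np.sign(X @ w) == y).mean())  # training accuracy on the toy data
```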
Wrapping Up
SVM is a robust and versatile algorithm, but understanding it fully requires breaking it into manageable pieces. In this post, we explored its formulation, loss functions, and gradients. This sets the stage for discussing subgradients and subgradient descent in the next post, which address the non-differentiability issues of hinge loss.
SVM is undoubtedly a vast topic, but step by step, it all begins to make sense. Trust the process, and by the end of this series, you’ll appreciate the journey. Stay tuned!