Empirical Risk Minimization (ERM)

Continuing from our discussion of supervised learning, we now dive into Empirical Risk Minimization (ERM). While the ultimate goal is to minimize the true risk, ERM provides a practical way to approximate this goal using finite data. Let’s break it down.


Definition: Empirical Risk Minimization

A function \(\hat{f}\) is an empirical risk minimizer if:

\[\hat{f} \in \arg\min_f \hat{R}_n(f),\]

where \(\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\) is the empirical risk, i.e., the average loss \(\ell\) over a finite sample \((x_1, y_1), \dots, (x_n, y_n)\), and the minimum is taken over all functions \(f: X \to Y\). In an ideal scenario, we would minimize the true risk \(R(f)\) directly. This raises a critical question:

Is the empirical risk minimizer close enough to the true risk minimizer?
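Before looking at an example, here is a minimal sketch (in Python, with hypothetical helper names) of what the empirical risk looks like in code; the two losses defined here are the ones used in the example below.

```python
import numpy as np

def empirical_risk(f, xs, ys, loss):
    """Empirical risk R_hat_n(f): average loss of prediction function f over a finite sample."""
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

# Two standard losses, both used in the example below
zero_one_loss = lambda pred, y: float(pred != y)
square_loss = lambda pred, y: (pred - y) ** 2
```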

Example of ERM in Action

Let’s consider a simple case:

  • Let \(P_X = \text{Uniform}[0,1]\) and \(Y \equiv 1\) (i.e., \(Y\) is always 1).
  • A proposed prediction function:
\[\hat{f}(x) = \begin{cases} 1 & \text{if } x \in \{0.25, 0.5, 0.75\}, \\ 0 & \text{otherwise.} \end{cases}\]

Loss Analysis:

Under both the square loss and the 0-1 loss:

  • Empirical Risk: 0
  • True Risk: 1

Explanation:

  • The empirical risk measures the loss on the training data. Since \(\hat{f}(x)\) perfectly predicts the labels at the training points \(x \in \{0.25, 0.5, 0.75\}\), the empirical risk is zero: there are no errors on the observed data points.
  • The true risk, however, measures the loss over the entire distribution \(P_X\). For all \(x \notin \{0.25, 0.5, 0.75\}\), \(\hat{f}(x)\) predicts 0 instead of the correct label 1. Since \(P_X\) is uniform over \([0,1]\) and the three training points have probability zero, \(\hat{f}\) is wrong almost everywhere, leading to a true risk of 1.

This illustrates a key problem: ERM can lead to overfitting by simply memorizing the training data.
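A quick simulation makes the gap explicit (a sketch, assuming the three training points from the example above): the empirical risk is exactly 0, while a Monte Carlo estimate of the true risk under \(P_X = \text{Uniform}[0,1]\) comes out at essentially 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# The memorizing predictor from the example: correct only at the three training points
def f_hat(x):
    return 1.0 if x in (0.25, 0.5, 0.75) else 0.0

zero_one_loss = lambda pred, y: float(pred != y)

# Empirical risk: average loss on the observed training points (the label is always 1)
train_x = [0.25, 0.5, 0.75]
emp_risk = np.mean([zero_one_loss(f_hat(x), 1.0) for x in train_x])

# True risk: expected loss under P_X = Uniform[0, 1], estimated by Monte Carlo
test_x = rng.uniform(0.0, 1.0, size=100_000)
true_risk = np.mean([zero_one_loss(f_hat(x), 1.0) for x in test_x])

print(emp_risk)   # 0.0
print(true_risk)  # ~1.0: the training points have probability zero under Uniform[0, 1]
```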


Generalization: Improving ERM

In the above example, ERM failed to generalize to unseen data. To improve generalization, we must “smooth things out”—a process that spreads and extrapolates information from observed parts of the input space \(X\) to unobserved parts.

One solution is Constrained ERM: Instead of minimizing empirical risk over all possible functions, we restrict our search to a subset of functions, known as a hypothesis space.

Hypothesis Spaces

A hypothesis space \(\mathcal{F}\) is a set of prediction functions mapping \(X \to Y\) that we consider when applying ERM.

Desirable Properties of a Hypothesis Space

  • Includes only functions with the desired “regularity” (e.g., smoothness or simplicity).
  • Is computationally tractable (efficient algorithms exist for finding the best function in \(\mathcal{F}\)).

In practice, much of machine learning involves designing appropriate hypothesis spaces for specific tasks.
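To make this concrete, here is a minimal sketch (hypothetical names, not from the lecture) of a hypothesis space as a parameterized family: the affine functions \(\mathcal{F} = \{x \mapsto wx + b\}\). Choosing a function in \(\mathcal{F}\) amounts to choosing the parameters \((w, b)\), which is what makes the search computationally tractable.

```python
import numpy as np

def make_affine(w, b):
    """One member of the hypothesis space F = {x -> w*x + b}."""
    return lambda x: w * x + b

# "Searching over F" means searching over the parameters (w, b),
# e.g., over a coarse grid of candidate functions:
candidates = [make_affine(w, b)
              for w in np.linspace(-2, 2, 9)
              for b in np.linspace(-2, 2, 9)]
print(len(candidates))  # 81 candidate prediction functions
```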

Constrained ERM

Given a hypothesis space \(\mathcal{F}\), the empirical risk minimizer in \(\mathcal{F}\) is defined as:

\[\hat{f}_n \in \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i),\]

where \(\ell(f(x), y)\) is the loss function.

Similarly, the true risk minimizer in \(\mathcal{F}\) is:

\[f_{\mathcal{F}} \in \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(f(x), y)].\]
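As a concrete sketch of constrained ERM (a toy setup of my own, not from the lecture): with the square loss and the affine hypothesis space from the previous sketch, the empirical risk minimizer in \(\mathcal{F}\) has a closed-form least-squares solution, which `np.polyfit` with degree 1 computes.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy training sample (hypothetical data-generating process)
n = 50
x = rng.uniform(-1, 1, size=n)
y = np.sin(3 * x) + 0.1 * rng.normal(size=n)

# Constrained ERM over F = {affine functions} with the square loss:
# minimize (1/n) * sum_i (w * x_i + b - y_i)^2, i.e. ordinary least squares
w, b = np.polyfit(x, y, deg=1)
f_hat = lambda t: w * t + b

print(np.mean((f_hat(x) - y) ** 2))  # empirical risk of the ERM solution within F
```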

Excess Risk Decomposition

We analyze the performance of ERM through excess risk decomposition, which breaks down the gap between the risk of the function we actually use (e.g., the function returned by ERM) and the risk of the Bayes optimal function \(f^*\).

Definitions (Recap)

  • Bayes Optimal Function:

    \[f^* = \arg\min_f \mathbb{E}[\ell(f(x), y)]\]
  • Risk Minimizer in \(\mathcal{F}\):

    \[f_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} \mathbb{E}[\ell(f(x), y)]\]
  • ERM Solution:

    \[\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\]

Excess Risk Decomposition for ERM

Definition

The excess risk measures how much worse the risk of a function \(f\) is compared to the risk of the Bayes optimal function \(f^*\), which minimizes the true risk. Mathematically, it is defined as:

\[\text{Excess Risk}(f) = R(f) - R(f^*)\]

where:

  • \(R(f)\) is the true risk of the function \(f\).
  • \(R(f^*)\) is the Bayes risk, i.e., the lowest achievable risk.

Can Excess Risk Be Negative? No, the excess risk can never be negative because the Bayes optimal function \(f^*\) achieves the minimum possible risk by definition. For any other function \(f\), the risk \(R(f)\) will be equal to or greater than \(R(f^*)\).

Decomposition of Excess Risk for ERM

For the empirical risk minimizer \(\hat{f}_n\), the excess risk can be decomposed as follows:

\[\text{Excess Risk}(\hat{f}_n) = R(\hat{f}_n) - R(f^*) = \underbrace{R(\hat{f}_n) - R(f_\mathcal{F})}_{\text{Estimation Error}} + \underbrace{R(f_\mathcal{F}) - R(f^*)}_{\text{Approximation Error}}\]

where:

  • \(f_\mathcal{F}\) is the best function within the chosen hypothesis space \(\mathcal{F}\).
  • Estimation Error: This term captures the error due to estimating the best function \(f_\mathcal{F}\) using finite training data.
  • Approximation Error: This term reflects the penalty for restricting the search to the hypothesis space \(\mathcal{F}\) instead of considering all possible functions.

Key Insight: Tradeoff Between Errors

There is always a tradeoff between approximation and estimation errors:

  • A larger hypothesis space \(\mathcal{F}\) reduces approximation error but increases estimation error (due to greater model complexity).
  • A smaller hypothesis space \(\mathcal{F}\) reduces estimation error but increases approximation error (due to limited flexibility).

This tradeoff is crucial when designing models and choosing hypothesis spaces.
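Here is a rough numerical sketch of this tradeoff (a toy setup of my own, with the square loss): the Bayes optimal function is the conditional mean, here \(f^*(x) = \sin(3x)\); \(\mathcal{F}\) is the space of polynomials of a given degree; risks are estimated by Monte Carlo, so very small values are noisy. Increasing the degree shrinks the approximation error but tends to inflate the estimation error for a fixed sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE = 0.3   # y = sin(3x) + NOISE * eps, so the Bayes risk under square loss is NOISE**2

def sample(size):
    x = rng.uniform(-1, 1, size=size)
    return x, np.sin(3 * x) + NOISE * rng.normal(size=size)

def risk(f, n_mc=200_000):
    """Monte Carlo estimate of the true risk R(f) under the square loss."""
    x, y = sample(n_mc)
    return np.mean((f(x) - y) ** 2)

n = 30                    # training sample size
x_train, y_train = sample(n)
bayes_risk = NOISE ** 2   # R(f*), the irreducible error

for degree in [1, 3, 7]:
    # ERM in F = {polynomials of this degree}: least-squares fit on the training sample
    f_erm = np.poly1d(np.polyfit(x_train, y_train, degree))
    # Approximate the best-in-class f_F by fitting on a very large sample
    x_big, y_big = sample(200_000)
    f_best = np.poly1d(np.polyfit(x_big, y_big, degree))

    approx_err = risk(f_best) - bayes_risk   # R(f_F) - R(f*)
    estim_err = risk(f_erm) - risk(f_best)   # R(f_hat_n) - R(f_F)
    print(degree, round(approx_err, 3), round(estim_err, 3))
```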


ERM in Practice

In real-world machine learning, finding the exact ERM solution is challenging. We often settle for an approximate solution:

  • Let \(\tilde{f}_n\) be the function returned by an optimization algorithm.
  • The optimization error is:
\[\text{Optimization Error} = R(\tilde{f}_n) - R(\hat{f}_n)\]

where:

  • \(\tilde{f}_n\) is the function returned by the optimization method.
  • \(\hat{f}_n\) is the empirical risk minimizer.

Practical Decomposition

For \(\tilde{f}_n\), the excess risk can be further decomposed as:

\[\text{Excess Risk}(\tilde{f}_n) = \underbrace{R(\tilde{f}_n) - R(\hat{f}_n)}_{\text{Optimization Error}} + \underbrace{R(\hat{f}_n) - R(f_{\mathcal{F}})}_{\text{Estimation Error}} + \underbrace{R(f_{\mathcal{F}}) - R(f^*)}_{\text{Approximation Error}}.\]
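To see where optimization error comes from, here is a sketch with a toy setup of my own (not from the lecture): under the square loss with an affine hypothesis space, the exact empirical risk minimizer \(\hat{f}_n\) is available in closed form, while a few steps of gradient descent return an approximate minimizer \(\tilde{f}_n\). The gap between the two is the optimization error (measured on the empirical risk here, as a simple proxy for the true risk).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = 2.0 * x + 0.5 + 0.3 * rng.normal(size=n)

def emp_risk(w, b):
    return np.mean((w * x + b - y) ** 2)

# Exact ERM in F = {affine functions}: closed-form least squares
w_erm, b_erm = np.polyfit(x, y, deg=1)

# Approximate minimizer: a few steps of gradient descent on the empirical risk
w, b, lr = 0.0, 0.0, 0.1
for _ in range(20):
    resid = w * x + b - y
    w -= lr * np.mean(2 * resid * x)   # d/dw of the empirical (square-loss) risk
    b -= lr * np.mean(2 * resid)       # d/db of the empirical (square-loss) risk

# Gap between the approximate and exact minimizers
# (computed on the empirical risk here, as a stand-in for the true risk R)
print(emp_risk(w, b) - emp_risk(w_erm, b_erm))  # >= 0, shrinks with more steps
```

This also previews the gradient descent algorithm teased at the end of this post.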

Summary: ERM Overview

To apply ERM in practice:

  1. Choose a loss function \(\ell(f(x), y)\).
  2. Define a hypothesis space \(\mathcal{F}\) that balances approximation and estimation error.
  3. Use an optimization method to find \(\hat{f}_n = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i)\) (or an approximate solution \(\tilde{f}_n\)).

As the size of training data increases, we can use larger hypothesis spaces to reduce approximation error while keeping estimation error manageable.


Conclusion

Empirical Risk Minimization (ERM) provides a foundational framework for supervised learning by optimizing a model’s performance on training data. However, achieving a balance between approximation, estimation, and optimization errors is key to building effective models. This naturally raises the question: How do we efficiently minimize empirical risk in practice, especially for complex models and large datasets?

In the next blog, we’ll dive into Gradient Descent, one of the most powerful and widely used optimization algorithms for minimizing risk, and explore how it enables us to tackle the challenges of ERM. Stay tuned and see you! 👋