On-line to Batch Conversion

In previous sections, we explored various algorithms for online learning, such as the Perceptron and Winnow algorithms, analyzing their performance within the mistake model—a setting where no assumptions are made about how the training sequence is generated.

A natural question arises:

Can these algorithms be leveraged to derive hypotheses with small generalization error in a standard stochastic setting?
Moreover, how can the intermediate hypotheses they generate be combined to form an accurate predictor?

These questions are the focus of this section.


Problem Setup

We consider a supervised learning setting where we have a labeled sample:

\[S = ((x_1, y_1), (x_2, y_2), \dots, (x_T, y_T)) \in (X \times Y)^T\]

drawn i.i.d. from some unknown but fixed distribution \(D\).

Let \(H\) be a hypothesis class consisting of functions mapping \(X\) to \(Y'\), and let \(L: Y' \times Y \to \mathbb{R}^+\) be a bounded loss function, meaning there exists a constant \(M \geq 0\) such that:

\[L(y', y) \leq M \quad \text{for all } (y', y) \in Y' \times Y.\]

An online learning algorithm \(A\) sequentially processes the dataset \(S\), generating hypotheses:

\[h_1, h_2, \dots, h_T \in H.\]

It starts with an initial hypothesis \(h_1\) and updates it after processing each training example \((x_t, y_t)\), for \(t \in [T]\).


Regret and Generalization Error

Before we tackle the question we started with, let’s clarify why it matters and pin down the key terms and concepts we’ll need as we proceed.

1. Regret:

The regret of an algorithm is a measure of how much worse the algorithm’s performance is compared to the best possible hypothesis in the class \(H\), in hindsight.

Formally, it’s defined as:

\[R_T = \sum_{t=1}^{T} L(h_t(x_t), y_t) - \min_{h \in H} \sum_{t=1}^{T} L(h(x_t), y_t)\]

Let’s break this down:

  • \(h_t(x_t)\) is the prediction of the algorithm at time \(t\): the hypothesis \(h_t\) was formed from the first \(t-1\) examples, and it is used to predict on \(x_t\) before the algorithm updates with \((x_t, y_t)\).
  • The term \(\sum_{t=1}^{T} L(h_t(x_t), y_t)\) is the total loss the algorithm incurs by predicting \(h_t(x_t)\) for each training example.
  • \(\min_{h \in H} \sum_{t=1}^{T} L(h(x_t), y_t)\) represents the loss of the best hypothesis \(h \in H\) if it were chosen in advance and used to predict every \(x_t\) in the sequence.

Why regret matters:
Regret tells us how far off the algorithm’s performance is compared to the best possible performance it could have had with a perfect hypothesis. If the regret is small, it means the algorithm is making predictions that are close to the best possible predictions, at least in terms of total loss.
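
To make the definition concrete, here is a minimal Python sketch under toy assumptions of my own (a small finite class of threshold classifiers, the absolute loss, and a naive follow-the-leader learner, none of which come from the formal setup above) that computes \(R_T\) for a sequence of online hypotheses.

```python
import numpy as np

# Toy loss: absolute loss, bounded by M = 1 for predictions/labels in [0, 1].
def loss(y_pred, y):
    return abs(y_pred - y)

def regret(hypotheses, examples, hypothesis_class):
    """R_T = sum_t L(h_t(x_t), y_t) - min_{h in H} sum_t L(h(x_t), y_t)."""
    online_loss = sum(loss(h_t(x), y) for h_t, (x, y) in zip(hypotheses, examples))
    best_in_hindsight = min(
        sum(loss(h(x), y) for x, y in examples) for h in hypothesis_class
    )
    return online_loss - best_in_hindsight

# Hypothetical finite class H: threshold classifiers on [0, 1].
H = [lambda x, c=c: float(x > c) for c in np.linspace(0.0, 1.0, 11)]

rng = np.random.default_rng(0)
xs = rng.random(200)
ys = (xs > 0.35).astype(float)
examples = list(zip(xs, ys))

# A naive online learner (follow the leader): at round t, play the hypothesis
# with the smallest loss on the first t-1 examples.
hyps = []
for t in range(len(examples)):
    past = examples[:t]
    best = min(H, key=lambda h: sum(loss(h(x), y) for x, y in past)) if past else H[0]
    hyps.append(best)

print("R_T =", regret(hyps, examples, H))
```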

2. Generalization Error:

The generalization error of a hypothesis \(h \in H\) refers to how well it performs on unseen data from the same distribution \(D\), and it’s typically measured by the expected loss:

\[R(h) = \mathbb{E}_{(x,y) \sim D} [L(h(x), y)]\]

In other words, the generalization error measures how well the hypothesis \(h\) performs on future data (or on data drawn from the same distribution) as opposed to just the training data. This is crucial because we want our hypothesis not just to perform well on the training set (which it could overfit) but also to generalize well to new, unseen examples.

How does this tie into the problem setup?
The goal in this section is to bound the average generalization error of the sequence of hypotheses \(h_1, h_2, \dots, h_T\) generated by the algorithm \(A\). You want to know if the algorithm can generate hypotheses whose average loss on the training dataset \(S\) is a good predictor of their generalization error—meaning, will the algorithm’s performance on the training set reflect its performance on unseen data?

Connecting Regret to Generalization Error:

Now, let’s dive into the key insight of this section:

  • The algorithm generates a sequence of hypotheses \(h_1, h_2, \dots, h_T\), but we don’t necessarily care about each individual hypothesis. Rather, we care about the average performance over all of them.

Let’s assume the average loss of the hypotheses on the training data is:

\[\hat{L} = \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t)\]

The idea is that, if the average loss \(\hat{L}\) is small (i.e., the algorithm has done well on the training set), then the generalization error (i.e., the expected loss on unseen data) should also be small. The question here is how small, and whether we can bound this generalization error using the average loss on the training set.

Key Insights and the Challenge:

The big challenge in this problem setup is to establish a connection between regret and generalization error. Specifically, you are interested in showing that even though the algorithm has no prior knowledge of the distribution \(D\), the average loss of the hypotheses it generates will give you a good idea of their generalization performance.

How do we approach this?

To answer the question of how to use the average loss to bound the generalization error, we might use tools from PAC learning (Probably Approximately Correct) theory, where we can bound the generalization error of a hypothesis by looking at its empirical loss on the training set and how much variability exists due to the randomness in the sampling process.

In simple terms:

  • If the training loss is small, can we show that the hypothesis will likely also perform well on unseen data?
  • Can we bound how much the training loss differs from the generalization error for the hypotheses \(h_1, h_2, \dots, h_T\)?

The rest of this section explores these questions in detail, using concentration inequalities for martingales and for sums of independent variables (Azuma’s and Hoeffding’s inequalities) to establish the bounds.

Key Takeaways:
  • Regret measures how well the algorithm’s performance compares to the best possible hypothesis over the sequence.
  • Generalization error measures how well a hypothesis performs on new, unseen data.
  • The main goal is to relate the average loss of the hypotheses generated by the online algorithm to their generalization error, so that we can prove the algorithm produces good generalizers, not just good fitters to the training data.

Alright, we’re all set! Now, let’s take a look at the actual bounds that are available.


Generalization Error Bound

The following lemma provides a bound on the average generalization error in terms of the empirical loss:

Lemma 8.14

Let \(S = ((x_1,y_1), \dots, (x_T,y_T))\) be an i.i.d. sample from \(D\), and let \(h_1, ..., h_T\) be the sequence of hypotheses generated by an online algorithm \(A\) processing \(S\). If the loss function is bounded by \(M\), then for any \(\delta > 0\), with probability at least \(1 - \delta\):

\[\frac{1}{T} \sum_{t=1}^{T} R(h_t) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log(1/\delta)}{T}}\]
Proof Sketch:

We begin by defining the random variable:

\[V_t = R(h_t) - L(h_t(x_t), y_t)\]

This represents the difference between the true expected loss (generalization error) of \(h_t\) and the empirical loss observed on the sample \((x_t, y_t)\).

Step 1: Establishing Expectation Property

Since \(h_t\) depends only on the first \(t-1\) examples, and \((x_t, y_t)\) is drawn from \(D\) independently of them, the conditional expectation of the observed loss equals the true risk of \(h_t\):

\[\mathbb{E}\big[L(h_t(x_t), y_t) \mid (x_1, y_1), \dots, (x_{t-1}, y_{t-1})\big] = R(h_t)\]

Subtracting this from \(R(h_t)\) gives:

\[\mathbb{E}\big[V_t \mid (x_1, y_1), \dots, (x_{t-1}, y_{t-1})\big] = R(h_t) - R(h_t) = 0\]

Thus, the sequence \(\{V_t\}_{t=1}^{T}\) forms a martingale difference sequence: given everything observed so far, the expected value of the next difference is zero. Knowing past values does not help predict the next value’s expectation, so the sequence has no systematic drift; it behaves like a fair game in probability.

Step 2: Bounding the Range of \(V_t\)

Since the loss function takes values in \([0, M]\), both \(L(h_t(x_t), y_t)\) and \(R(h_t)\) lie in \([0, M]\), so:

\[V_t = R(h_t) - L(h_t(x_t), y_t) \in [-M, +M]\]

This ensures that \(V_t\) is always within a fixed range.

Step 3: Applying Azuma’s Inequality

Azuma’s inequality states that for a martingale difference sequence \(V_t\) with bounded differences \(\vert V_t \vert \leq M\) (so each \(V_t\) ranges over an interval of length \(2M\)), the probability that the average deviates from zero satisfies:

\[P\left( \frac{1}{T} \sum_{t=1}^{T} V_t \geq \epsilon \right) \leq \exp \left( -\frac{2 (T\epsilon)^2}{T (2M)^2} \right)\]

Simplifying the exponent:

\[P\left( \frac{1}{T} \sum_{t=1}^{T} V_t \geq \epsilon \right) \leq \exp \left( -\frac{T\epsilon^2}{2M^2} \right)\]

Step 4: Solving for \(\epsilon\)

We now set the right-hand side equal to \(\delta > 0\):

\[\exp \left( -\frac{T\epsilon^2}{2M^2} \right) = \delta\]

Taking the natural logarithm on both sides:

\[-\frac{T\epsilon^2}{2M^2} = \log \delta\]

Solving for \(\epsilon\):

\[\epsilon = M \sqrt{\frac{2 \log(1/\delta)}{T}}\]

Step 5: Concluding the Bound

Since \(\frac{1}{T} \sum_{t=1}^{T} V_t\) is exactly the difference between the average generalization error and the empirical average loss, we obtain that with probability at least \(1 - \delta\):

\[\frac{1}{T} \sum_{t=1}^{T} R(h_t) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log(1/\delta)}{T}}\]

This completes the proof.
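
To get a feel for the rate, here is a tiny sketch (plain Python, a helper of my own) that evaluates the correction term \(M\sqrt{2\log(1/\delta)/T}\) from the lemma for a few sample sizes; the \(O(1/\sqrt{T})\) decay is visible directly.

```python
import math

def correction_term(T, delta, M=1.0):
    """The deviation term in Lemma 8.14: M * sqrt(2 * log(1/delta) / T)."""
    return M * math.sqrt(2.0 * math.log(1.0 / delta) / T)

for T in (100, 1_000, 10_000, 100_000):
    print(f"T = {T:>7d}  correction = {correction_term(T, delta=0.05):.4f}")
# Shrinks like 1/sqrt(T): roughly 0.2448, 0.0774, 0.0245, 0.0077 for delta = 0.05.
```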


Before we move forward, let’s break down what all of this means and how we’re using it to reach our desired outcome.

Why does the martingale difference property allow us to apply Azuma’s inequality?
  • The key idea of a martingale difference sequence is that at each step, the expectation of the difference \(V_t\) is zero given past observations.
  • This means that on average, the sequence doesn’t drift systematically in any direction—it’s balanced in expectation.
  • Moreover, since each \(V_t\) is bounded within \([-M, M]\), there are no extreme jumps.
  • Azuma’s inequality applies precisely in such cases—it tells us that despite small fluctuations at each step, the cumulative deviation from zero remains small with high probability.

💡 Analogy: Imagine you’re walking on a balance beam with a safety harness.

  • You might take small steps to the left or right, but each step is unbiased, meaning you’re not favoring one direction over time.
  • Also, your steps are limited in size (bounded).
  • Azuma’s inequality is like saying that, with high probability, you won’t drift too far from the center because of these properties.

How does this help bound generalization error?
  • The empirical loss over the training sample is what we observe, but what we really care about is the generalization error (expected loss over unseen data).
  • The sum \(\sum_{t=1}^{T} V_t\) captures how much our empirical loss deviates from the true generalization error.
  • Since we established that this deviation behaves like a martingale difference sequence, we can now use Azuma’s inequality to show that this deviation is small with high probability.
  • This means that our empirical loss is a good estimate of the true generalization error, up to a small correction term!

💡 Analogy: Think of a weather forecast model trained on past temperature data.

  • If the model is unbiased and doesn’t systematically overestimate or underestimate temperatures, then on average, its predictions should be close to reality.
  • Azuma’s inequality tells us that, with high probability, the difference between past observed temperatures (empirical loss) and future actual temperatures (generalization error) remains small.
  • This justifies why our training loss is a reliable estimate of test loss.
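
To see the lemma at work numerically, here is a small simulation under toy assumptions of my own (not from the text): the labels are draws of a biased coin, the loss is the absolute loss (so \(M = 1\)), and the online algorithm simply predicts the running mean of the labels seen so far. Because the true risk \(R(h_t)\) has a closed form here, we can check how often the bound of Lemma 8.14 is violated across repeated trials.

```python
import numpy as np

rng = np.random.default_rng(1)
q, T, delta, M = 0.3, 2_000, 0.05, 1.0   # P(y = 1), rounds, confidence, loss bound

def true_risk(p):
    # R(p) = E|p - y| for y ~ Bernoulli(q): |p - 0| w.p. 1-q, |p - 1| w.p. q.
    return p * (1 - q) + (1 - p) * q

violations, trials = 0, 200
for _ in range(trials):
    ys = rng.binomial(1, q, size=T)
    preds = np.empty(T)
    preds[0] = 0.5                                    # initial hypothesis h_1
    preds[1:] = np.cumsum(ys)[:-1] / np.arange(1, T)  # h_{t+1} = mean of first t labels
    avg_online_loss = np.mean(np.abs(preds - ys))     # (1/T) sum_t L(h_t(x_t), y_t)
    avg_risk = np.mean(true_risk(preds))              # (1/T) sum_t R(h_t)
    bound = avg_online_loss + M * np.sqrt(2 * np.log(1 / delta) / T)
    violations += int(avg_risk > bound)

print(f"bound violated in {violations}/{trials} trials (theory allows at most {delta:.0%})")
```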

Application: Averaging Hypotheses

When the loss function is convex in its first argument, we can bound the generalization error of the average hypothesis:

\[\bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t\]

Since \(L\) is convex in its first argument, we can use Jensen’s inequality, which states that for a convex function \(f\):

\[f\left(\frac{1}{T} \sum_{t=1}^{T} x_t\right) \leq \frac{1}{T} \sum_{t=1}^{T} f(x_t)\]

Applying this to the generalization error:

\[R(\bar{h}) = \mathbb{E}_{(x,y) \sim D} [L(\bar{h}(x), y)]\]

By convexity of \(L\), we get:

\[R(\bar{h}) \leq \frac{1}{T} \sum_{t=1}^{T} R(h_t)\]

From our previous bound:

\[\frac{1}{T} \sum_{t=1}^{T} R(h_t) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log(1/\delta)}{T}}\]

Thus, we obtain:

\[R(\bar{h}) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log(1/\delta)}{T}}\]
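
Here is a quick numerical illustration of the Jensen step, under toy assumptions of my own (linear predictors and the squared loss, which is convex in its first argument): the risk of the averaged predictor never exceeds the average of the individual risks.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, n_test = 5, 50, 10_000

# A toy data distribution, approximated by a large test sample.
w_true = rng.normal(size=d)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.1 * rng.normal(size=n_test)

def risk(w):
    """Monte Carlo estimate of R(w) = E[(w . x - y)^2]."""
    return np.mean((X_test @ w - y_test) ** 2)

# Stand-ins for the hypotheses h_1, ..., h_T produced by an online algorithm.
ws = [w_true + rng.normal(scale=0.5, size=d) for _ in range(T)]
w_bar = np.mean(ws, axis=0)                 # the averaged hypothesis h_bar

risk_of_avg = risk(w_bar)
avg_of_risks = np.mean([risk(w) for w in ws])
print(f"R(h_bar) = {risk_of_avg:.4f}  <=  (1/T) sum_t R(h_t) = {avg_of_risks:.4f}")
assert risk_of_avg <= avg_of_risks + 1e-12  # Jensen's inequality
```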

Key Insights:

  • The generalization error of the averaged hypothesis is controlled by the average loss of the individual hypotheses plus a small correction term.
  • This shows that averaging hypotheses is a simple yet powerful way to obtain a low generalization error in an online learning setting.

Connection to Regret Minimization

If the online learning algorithm has small regret \(R_T\), then:

\[\frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) \approx \inf_{h \in H} \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t)\]

Since the correction term \(M \sqrt{2 \log(1/\delta)/T}\) vanishes as \(T \to \infty\), and \(R_T/T \to 0\) whenever the regret is sublinear, we conclude:

\[R(\bar{h}) \approx \inf_{h \in H} R(h)\]

This means that the average hypothesis is nearly optimal in terms of generalization error.

If the above argument feels a bit hand-wavy at this point, no worries: we’ll now work through it carefully and make it precise.


Theorem 8.15: Generalization Error of Averaged Hypotheses

Let \(S = ((x_1, y_1), \dots, (x_T, y_T))\) be a labeled sample drawn i.i.d. according to distribution \(D\). Let \(L\) be a loss function that is bounded by \(M\) and convex with respect to its first argument. Consider a sequence of hypotheses \(h_1, \dots, h_T\) generated by an online learning algorithm \(A\) processing \(S\) sequentially.

Then, for any \(\delta > 0\), with probability at least \(1 - \delta\), the following bounds hold:

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}}.\]

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \inf_{h \in \mathcal{H}} R(h) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]
Proof

Step 1: Bounding the Generalization Error of the Averaged Hypothesis

By the convexity of \(L\) in its first argument, for any \((x, y)\), we have (via Jensen’s inequality):

\[L\left(\frac{1}{T} \sum_{t=1}^{T} h_t(x), y\right) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x), y).\]

Taking expectations over the data distribution \(D\):

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) = \mathbb{E}_{(x,y) \sim D} \left[ L\left(\frac{1}{T} \sum_{t=1}^{T} h_t(x), y\right) \right] \leq \frac{1}{T} \sum_{t=1}^{T} R(h_t).\]

Using Lemma 8.14, we get:

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log \frac{1}{\delta}}{T}}.\]

Step 2: Connection to Regret Minimization

By Lemma 8.14, applied with \(\delta/2\) in place of \(\delta\), the following holds with probability at least \(1 - \delta/2\):

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]

Next, by the very definition of the regret \(R_T\) (divide the defining identity by \(T\)):

\[\frac{1}{T} \sum_{t=1}^{T} L(h_t(x_t), y_t) \leq \min_{h \in \mathcal{H}} \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) + \frac{R_T}{T}.\]

Note that this is nothing new: it is exactly the regret definition from the start of the section, divided through by \(T\).

Thus, with probability at least \(1 - \delta/2\):

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \min_{h \in \mathcal{H}} \frac{1}{T} \sum_{t=1}^{T} L(h(x_t), y_t) + \frac{R_T}{T} + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]

Using Hoeffding’s inequality, we can show that for any fixed hypothesis \(h^* \in H\) (chosen independently of the sample), with probability at least \(1 - \delta/2\):

\[\frac{1}{T} \sum_{t=1}^{T} L(h^*(x_t), y_t) \leq R(h^*) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]

Combining these results (by the union bound, both events hold simultaneously with probability at least \(1 - \delta\)):

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq R(h^*) + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]

By the definition of \(\inf_{h \in H} R(h)\), for any \(\epsilon > 0\), there exists \(h^* \in H\) with

\[R(h^*) \leq \inf_{h \in H} R(h) + \epsilon.\]

Plugging \(R(h^*) \leq \inf_{h \in H} R(h) + \epsilon\) into the combined bound gives:

\[R\left(\frac{1}{T} \sum_{t=1}^{T} h_t\right) \leq \inf_{h \in \mathcal{H}} R(h) + \epsilon + \frac{R_T}{T} + 2M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}.\]

Since this holds for every \(\epsilon > 0\), letting \(\epsilon \to 0\) proves the second statement.
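
As a rough illustration, here is a small helper of my own (the regret value \(R_T\) is supplied externally, e.g. from a known regret bound) that evaluates the excess-risk guarantee of the second bound:

```python
import math

def excess_risk_bound(regret_T, T, M=1.0, delta=0.05):
    """Theorem 8.15, second bound: R(h_bar) - inf_h R(h) <= R_T/T + 2M sqrt(2 log(2/delta)/T)."""
    return regret_T / T + 2 * M * math.sqrt(2 * math.log(2 / delta) / T)

# Example: a sublinear regret R_T = sqrt(T) with T = 10_000 rounds.
print(excess_risk_bound(regret_T=math.sqrt(10_000), T=10_000))  # roughly 0.064
```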


Application to Online Learning Algorithms

The theorem can be applied to a variety of online regret minimization algorithms. A key case is when:

\[\frac{R_T}{T} = O(1/\sqrt{T}).\]

In particular, we can apply this theorem to the exponentially weighted average algorithm. Assuming:

  • The loss function \(L\) is bounded by \(M = 1\).
  • The number of rounds \(T\) is known to the algorithm.

Then, using the regret bound from Theorem 8.6, we obtain:

\[R_T \leq \sqrt{(T/2) \log N}\]

where \(N\) is the number of experts, or the dimension of the weight vectors.

If \(T\) is not known in advance, we can apply the doubling trick (Theorem 8.7) to derive a similar bound.
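
For concreteness, here is a minimal sketch of the exponentially weighted average algorithm over \(N\) experts with losses in \([0, 1]\) (the random expert losses are a toy assumption of mine). It uses the learning rate \(\eta = \sqrt{8 \log N / T}\) associated with the \(\sqrt{(T/2)\log N}\) regret bound; the per-round loss below is the weighted mixture of expert losses, which upper-bounds the loss of the averaged prediction when \(L\) is convex.

```python
import numpy as np

def exponentially_weighted_average(expert_losses):
    """expert_losses: array of shape (T, N) with entries in [0, 1].
    Returns the algorithm's per-round (mixture) losses and its regret."""
    T, N = expert_losses.shape
    eta = np.sqrt(8 * np.log(N) / T)   # learning rate for the sqrt((T/2) log N) bound
    weights = np.ones(N) / N           # uniform initial weights
    alg_losses = np.empty(T)
    for t in range(T):
        alg_losses[t] = weights @ expert_losses[t]            # loss of the weighted mixture
        weights = weights * np.exp(-eta * expert_losses[t])   # exponential update
        weights /= weights.sum()
    regret = alg_losses.sum() - expert_losses.sum(axis=0).min()
    return alg_losses, regret

# Usage sketch with hypothetical random expert losses.
rng = np.random.default_rng(3)
T, N = 5_000, 20
losses = rng.random((T, N))
_, R_T = exponentially_weighted_average(losses)
print(f"regret = {R_T:.1f},  bound sqrt((T/2) log N) = {np.sqrt(T / 2 * np.log(N)):.1f}")
```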

Before we wrap up, one last ingredient deserves a closer look: Hoeffding’s inequality, which we used above for the fixed hypothesis \(h^*\).

Hoeffding’s Inequality and Its Application in Generalization Bounds

Hoeffding’s inequality is a fundamental concentration inequality that provides a bound on the probability that the sum of bounded independent random variables deviates from its expected value.

Statement of Hoeffding’s Inequality

Let \(X_1, X_2, \dots, X_T\) be independent random variables such that for each \(t\),

\[a_t \leq X_t \leq b_t.\]

Define the sum:

\[S_T = \sum_{t=1}^{T} X_t.\]

Then, the probability that \(S_T\) deviates from its expected value by more than \(\epsilon\) satisfies:

\[P\left( \left| S_T - \mathbb{E}[S_T] \right| \geq \epsilon \right) \leq 2 \exp \left( -\frac{2 \epsilon^2}{\sum_{t=1}^{T} (b_t - a_t)^2} \right).\]

If each \(X_t\) is bounded in the range \([a, b]\), this simplifies to:

\[P\left( \left| S_T - \mathbb{E}[S_T] \right| \geq \epsilon \right) \leq 2 \exp \left( -\frac{2 \epsilon^2}{T (b - a)^2} \right).\]
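
A quick simulation (a toy sketch of my own, with uniform variables on \([0, 1]\)) makes the inequality tangible: the empirical frequency of large deviations of the sum sits well below the Hoeffding bound.

```python
import numpy as np

rng = np.random.default_rng(4)
T, a, b, eps, trials = 500, 0.0, 1.0, 0.05, 10_000

# i.i.d. uniform variables on [a, b]; E[S_T] = T * (a + b) / 2.
samples = rng.uniform(a, b, size=(trials, T))
S = samples.sum(axis=1)
deviation_freq = np.mean(np.abs(S - T * (a + b) / 2) >= T * eps)   # deviation by T * eps

hoeffding_bound = 2 * np.exp(-2 * (T * eps) ** 2 / (T * (b - a) ** 2))
print(f"empirical P = {deviation_freq:.4f}  <=  Hoeffding bound = {hoeffding_bound:.4f}")
```
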
Application in Generalization Bounds

In our proof, we use Hoeffding’s inequality to control the deviation of the empirical risk from the expected risk.

Define the random variables:

\[X_t = L(h^*(x_t), y_t),\]

where \(h^*\) is the optimal hypothesis. Since the loss function is bounded by \(M\), we have:

\[0 \leq X_t \leq M.\]

Thus, the sum:

\[\sum_{t=1}^{T} L(h^*(x_t), y_t)\]

is a sum of bounded independent random variables (independence holds because \(h^*\) is a fixed hypothesis that does not depend on the sample). Applying Hoeffding’s inequality, we get:

\[P\left( \left| \frac{1}{T} \sum_{t=1}^{T} L(h^*(x_t), y_t) - R(h^*) \right| \geq \epsilon \right) \leq 2 \exp \left( -\frac{2 T \epsilon^2}{M^2} \right).\]

Setting \(\epsilon = M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}}\) makes the right-hand side at most \(\delta/2\), which yields the one-sided bound we need:

\[P\left( \frac{1}{T} \sum_{t=1}^{T} L(h^*(x_t), y_t) \geq R(h^*) + M \sqrt{\frac{2 \log \frac{2}{\delta}}{T}} \right) \leq \frac{\delta}{2}.\]

This bound ensures that, with high probability, the empirical loss is close to the expected loss, which is crucial for proving the generalization bounds of the learning algorithm.

A few questions that I had, and you might too:

Why Does the Optimal \(h^*\) Minimize the Loss?
The optimal hypothesis \(h^*\) is defined as the one that minimizes the expected loss over the data distribution:

\[h^* = \arg\min_{h \in \mathcal{H}} R(h),\]

where the expected loss (also called the risk) is:

\[R(h) = \mathbb{E}_{(x,y) \sim D} [L(h(x), y)].\]

Since \(h^*\) minimizes this risk, it achieves the smallest possible expected loss over all hypotheses in \(\mathcal{H}\).


Why This Justifies Applying Hoeffding’s Inequality

We consider the empirical loss:

\[\frac{1}{T} \sum_{t=1}^{T} L(h^*(x_t), y_t),\]

which is an estimate of the true expected loss \(R(h^*)\). However, since the sample is drawn i.i.d., there is some randomness, so the empirical loss might deviate from the expected loss.

Since the loss function is bounded (i.e., \(0 \leq L(h^*(x), y) \leq M\)), we can apply Hoeffding’s inequality to bound this deviation. This ensures that, with high probability, the empirical loss does not stray too far from \(R(h^*)\), giving us a reliable way to estimate the generalization error.


Intuitive Explanation

Think of this as estimating the average height of people in a city by measuring the height of a random sample. The true average height (analogous to \(R(h^*)\)) is unknown, but if we take a large enough sample, the average height in our sample (empirical loss) will be close to the true average height with high probability.

By Hoeffding’s inequality, the probability that our sample average significantly deviates from the true mean is exponentially small, ensuring our empirical loss is a good approximation of the expected loss.


Key Insights

  • \(h^*\) is optimal since it minimizes the expected loss.
  • The empirical loss is a sample-based approximation of \(R(h^*)\).
  • Hoeffding’s inequality guarantees that the empirical loss is close to the true loss with high probability.
  • This allows us to confidently generalize from training data to unseen data.

Conclusion

This analysis demonstrates that online learning algorithms, despite being designed for adversarial settings, can be effectively converted into batch learning methods. By averaging the hypotheses they produce, we can derive strong generalization guarantees, bridging the gap between online and batch learning paradigms.

This insight is particularly valuable in large-scale machine learning, where online learning methods can be computationally efficient while still ensuring strong predictive performance.
