Exponential Weighted Average Algorithm

The Exponential Weighted Average Algorithm (EWAA) is an online learning algorithm that provides elegant guarantees for minimizing regret in adversarial settings. It extends the principles of the Weighted Majority Algorithm by incorporating exponential weight updates, making it particularly effective for handling convex loss functions.

How the Exponential Weighted Average Algorithm Works

At its core, the EWAA maintains and updates weights for a set of experts, similar to the Weighted Majority Algorithm. However, it uses an exponential weighting scheme to achieve better bounds on regret, especially for convex losses.

Steps of the Algorithm

Initialization Set initial weights for all \(N\) experts:

\[w_{1,i} = 1, \quad \forall i \in \{1, \dots, N\}\]

Prediction at Round \(t\)

  • Observe the predictions \(\hat{y}_{t,i}\) from all experts.
  • Compute the aggregate prediction as a weighted average:
\[\hat{y}_t = \frac{\sum_{i=1}^N w_{t,i} \cdot \hat{y}_{t,i}}{\sum_{i=1}^N w_{t,i}}\]

Update Weights

  • After receiving the true outcome \(y_t\), update the weights for the next round:
\[w_{t+1,i} = w_{t,i} \cdot e^{-\eta L(\hat{y}_{t,i}, y_t)}\]
  • The weight update can also be expressed directly using the cumulative loss \(L_t(i)\) of expert \(i\) after \(t\) rounds:
\[w_{t+1,i} = e^{-\eta L_t(i)}\]
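Putting the three steps together, here is a minimal NumPy sketch of the full loop. The function name `ewaa` and the squared loss are illustrative choices of mine, not a fixed API; any loss bounded in \([0, 1]\) and convex in its first argument fits the analysis below.

```python
import numpy as np

def ewaa(expert_preds, labels, eta, loss=lambda y_hat, y: (y_hat - y) ** 2):
    """Exponential Weighted Average Algorithm (sketch).

    expert_preds: (T, N) array; prediction of each of the N experts per round.
    labels:       (T,) array of true outcomes.
    loss:         per-round loss, convex in its first argument and in [0, 1]
                  (squared loss over [0, 1]-valued predictions, as assumed here).
    """
    T, N = expert_preds.shape
    w = np.ones(N)                        # initialization: w_{1,i} = 1
    preds = np.empty(T)
    for t in range(T):
        p = w / w.sum()                   # normalized weights
        preds[t] = p @ expert_preds[t]    # weighted-average prediction
        # update: w_{t+1,i} = w_{t,i} * exp(-eta * L(yhat_{t,i}, y_t))
        w = w * np.exp(-eta * loss(expert_preds[t], labels[t]))
    return preds, w
```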

Key Points to Highlight:

Simplification of Weight Updates Although the update rule is written in terms of \(w_{t,i}\), unrolling it shows that the final weight \(w_{t+1,i}\) depends only on the cumulative loss \(L_t(i)\), not on the intermediate weights.
This is why we can write \(w_{t+1,i} = e^{-\eta L_t(i)}\) directly; any normalization can be applied afterward, since it cancels in the weighted average.

Interpretation of \(e^{-\eta x}\) The multiplier \(e^{-\eta x}\) decreases exponentially as the loss \(x\) increases.
This ensures poorly performing experts are rapidly down-weighted. A plot of \(e^{-x}\) can visually illustrate this decay; the quick sketch below makes the same point numerically.
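In lieu of a plot, a tiny sketch (with an arbitrary \(\eta = 2\)) prints the multiplier applied to an expert's weight for a few per-round losses:

```python
import math

eta = 2.0  # arbitrary illustrative learning rate
for loss in [0.0, 0.25, 0.5, 0.75, 1.0]:
    # an expert's weight is multiplied by exp(-eta * loss) this round
    print(f"loss={loss:.2f} -> multiplier {math.exp(-eta * loss):.3f}")
# prints 1.000, 0.607, 0.368, 0.223, 0.135: exponential decay in the loss
```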

If you’re still unclear about the final weight update rule, keep reading — the explanation below should clarify things.


We start from the original weight update equation and simplify it step by step to express it in terms of the cumulative loss.

1. Original Weight Update The weight update at time \(t+1\) is defined as:

\[w_{t+1,i} = w_{t,i} \cdot e^{-\eta L(\hat{y}_{t,i}, y_t)}\]

2. Recursive Application Expanding the recursion over all previous rounds \(1, \dots, t\):

\[w_{t+1,i} = w_{1,i} \cdot e^{-\eta L(\hat{y}_{1,i}, y_1)} \cdot e^{-\eta L(\hat{y}_{2,i}, y_2)} \cdots e^{-\eta L(\hat{y}_{t,i}, y_t)}\]

3. Simplify the Product Using the property of exponents \(a^x \cdot a^y = a^{x+y}\):

\[w_{t+1,i} = w_{1,i} \cdot e^{-\eta \sum_{s=1}^t L(\hat{y}_{s,i}, y_s)}\]

4. Initial Weights Since the initial weights are set to \(w_{1,i} = 1\) for all experts, this simplifies to:

\[w_{t+1,i} = e^{-\eta \sum_{s=1}^t L(\hat{y}_{s,i}, y_s)}\]

5. Cumulative Loss Definition Define the cumulative loss of expert \(i\) after \(t\) rounds as:

\[L_t(i) = \sum_{s=1}^t L(\hat{y}_{s,i}, y_s)\]

6. Final Simplified Form Substituting \(L_t(i)\) into the equation gives the simplified weight update:

\[w_{t+1,i} = e^{-\eta L_t(i)}\]

Key Insight:

  • The weight at time \(t+1\) depends on the individual losses of previous rounds only through their sum \(L_t(i)\); neither the order of the losses nor the intermediate weights matters.
  • This simplification is possible because the update rule is multiplicative, and the cumulative loss naturally aggregates all penalties from previous rounds.
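A quick numeric check of this simplification: applying the multiplicative update round by round lands exactly on the closed form \(e^{-\eta L_t(i)}\). The losses here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, T = 0.5, 10
losses = rng.uniform(0.0, 1.0, size=T)     # per-round losses of one expert

w = 1.0                                    # w_{1,i} = 1
for loss in losses:
    w *= np.exp(-eta * loss)               # recursive multiplicative update

closed_form = np.exp(-eta * losses.sum())  # e^{-eta * L_t(i)}
print(np.isclose(w, closed_form))          # True
```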

Theorem: Regret Bound for EWAA

Let \(L(\hat{y}, y)\) be a loss function that is convex in its first argument and takes values in \([0, 1]\). For any \(\eta > 0\) and any sequence of labels \(y_1, \dots, y_T \in \mathcal{Y}\), the regret of the Exponential Weighted Average Algorithm satisfies:

\[R_T \leq \frac{\log N}{\eta} + \frac{\eta T}{8}\]

Here, the regret is defined as the difference between the total loss of the algorithm and the loss of the best expert:

\[R_T = \sum_{t=1}^T L(\hat{y}_t, y_t) - \min_{i \in [N]} \sum_{t=1}^T L(\hat{y}_{t,i}, y_t)\]

Optimized Learning Rate

By choosing \(\eta = \sqrt{\frac{8 \log N}{T}}\), we minimize the regret bound, resulting in:

\[R_T \leq \sqrt{\frac{T}{2} \log N}\]

This demonstrates that the regret grows logarithmically with the number of experts \(N\) and sublinearly with the number of time steps \(T\), indicating the efficiency of EWAA.
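As a worked example with illustrative sizes \(N = 10\) and \(T = 1000\), the optimized \(\eta\) and the resulting bound are easy to compute, and the two forms of the bound agree:

```python
import math

N, T = 10, 1000                               # illustrative sizes
eta = math.sqrt(8 * math.log(N) / T)          # optimized learning rate, ~0.136
bound = math.log(N) / eta + eta * T / 8       # log(N)/eta + eta*T/8, ~33.93
print(bound, math.sqrt(T / 2 * math.log(N)))  # both ~33.93
```

A regret of about 34 over 1000 rounds means the per-round gap to the best expert is only about 0.034, even though each loss can be as large as 1.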

Convex Loss Function in Its First Argument

Before we dive deeper, let’s clarify what we mean by a loss function that is convex in its first argument. Throughout this post the loss is written \(L(\hat{y}, y)\): the first argument \(\hat{y}\) is the prediction, and the second argument \(y\) is the true outcome (this matches the regret definition above, where the algorithm’s loss is \(L(\hat{y}_t, y_t)\)).

To break it down:

  • The loss function \(L(\hat{y}, y)\) measures how far the prediction \(\hat{y}\) is from the true outcome \(y\).
  • Convexity in the first argument means that for any fixed outcome \(y\), the function \(\hat{y} \mapsto L(\hat{y}, y)\) is convex: the loss curves upward in a “bowl-shaped” manner as the prediction varies. This property is exactly what the regret proof needs, because it lets us relate the loss of the weighted-average prediction to the weighted average of the experts’ losses (Jensen’s inequality, used in Step 5 below).

In mathematical terms, for any fixed \(y\), the function \(L(\cdot, y)\) satisfies the convexity condition:

\[L(\lambda \hat{y}_1 + (1-\lambda) \hat{y}_2, y) \leq \lambda L(\hat{y}_1, y) + (1-\lambda) L(\hat{y}_2, y)\]

for any predictions \(\hat{y}_1, \hat{y}_2\) and \(\lambda \in [0, 1]\).
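A minimal numeric sanity check, using the squared loss \(L(\hat{y}, y) = (\hat{y} - y)^2\) as an assumed example of a loss convex in its first argument:

```python
import numpy as np

def sq_loss(y_hat, y):                    # squared loss, convex in y_hat
    return (y_hat - y) ** 2

rng = np.random.default_rng(1)
y = 0.3                                   # fixed true outcome
for _ in range(1000):
    y1, y2, lam = rng.uniform(size=3)     # two predictions and a mixing weight
    mixed = sq_loss(lam * y1 + (1 - lam) * y2, y)
    assert mixed <= lam * sq_loss(y1, y) + (1 - lam) * sq_loss(y2, y) + 1e-12
print("convexity inequality holds on all sampled points")
```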


Proof of the Regret Bound

Define the potential function:

\[\Phi_t = \log \left( \sum_{i=1}^N w_{t,i} \right)\]

The goal is to derive an upper bound and a lower bound for \(\Phi_t\), and then combine them to establish the regret bound.

Upper Bound

Step 1: Change in Potential Function

From the weight update rule:

\[w_{t+1,i} = w_{t,i} e^{-\eta L(\hat{y}_{t,i}, y_t)},\]

we can write the change in the potential function as:

\[\Phi_{t+1} - \Phi_t = \log \left( \frac{\sum_{i=1}^N w_{t,i} e^{-\eta L(\hat{y}_{t,i}, y_t)}}{\sum_{i=1}^N w_{t,i}} \right) = \log \left( \mathbb{E}_{p_t}[e^{\eta X}] \right),\]

where \(X\) is the random variable taking value \(-L(\hat{y}_{t,i}, y_t)\) when expert \(i\) is drawn from the distribution \(p_t(i) = \frac{w_{t,i}}{\sum_{j=1}^N w_{t,j}}\). Since losses lie in \([0, 1]\), we have \(X \in [-1, 0]\).

Step 2: Centering the Random Variable

Recall that \(X = -L(\hat{y}_{t,i}, y_t)\), with \(X \in [-1, 0]\). The exponent can be centered around its mean:

\[\mathbb{E}_{p_t} \left[ e^{-\eta L(\hat{y}_{t,i}, y_t)} \right] = \mathbb{E}_{p_t} \left[ e^{\eta (X - \mathbb{E}_{p_t}[X])} \right] e^{\eta \mathbb{E}_{p_t}[X]}\]

Substituting this back, we have:

\[\Phi_{t+1} - \Phi_t = \log \mathbb{E}_{p_t} \left[ e^{\eta (X - \mathbb{E}_{p_t}[X])} \right] + \eta \mathbb{E}_{p_t}[X]\]

Step 3: Applying Hoeffding’s Lemma

By Hoeffding’s Lemma, for a random variable \(X\) supported on an interval of length 1 (here \(X \in [-1, 0]\)), the centered variable satisfies:

\[\log \mathbb{E}_{p_t} \left[ e^{\eta (X - \mathbb{E}_{p_t}[X])} \right] \leq \frac{\eta^2}{8}\]

Substituting this bound:

\[\Phi_{t+1} - \Phi_t \leq \eta \mathbb{E}_{p_t}[X] + \frac{\eta^2}{8}\]
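A quick empirical check of this step, with an arbitrary distribution \(p_t\) and arbitrary losses sampled once for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
eta = 0.7                              # arbitrary positive learning rate
p = rng.dirichlet(np.ones(5))          # a distribution p_t over 5 experts
X = -rng.uniform(size=5)               # X_i = -loss_i, each in [-1, 0]

mean = p @ X                           # E_{p_t}[X]
lhs = np.log(p @ np.exp(eta * (X - mean)))
print(lhs <= eta**2 / 8)               # True: Hoeffding's Lemma bound holds
```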

Step 4: Substituting \(X = -L(\hat{y}_{t,i}, y_t)\)

Recall that \(X = -L(\hat{y}_{t,i}, y_t)\), so \(\mathbb{E}_{p_t}[X] = -\mathbb{E}_{p_t}[L(\hat{y}_{t,i}, y_t)]\). Substituting:

\[\Phi_{t+1} - \Phi_t \leq -\eta \mathbb{E}_{p_t}[L(\hat{y}_{t,i}, y_t)] + \frac{\eta^2}{8}\]

Step 5: Applying Convexity of \(L\)

By the convexity of \(L\) in its first argument:

\[\mathbb{E}_{p_t}[L(\hat{y}_{t,i}, y_t)] \geq L(\mathbb{E}_{p_t}[\hat{y}_{t,i}], y_t) = L(\hat{y}_t, y_t),\]

where \(\hat{y}_t = \mathbb{E}_{p_t}[\hat{y}_{t,i}]\) is the prediction of the algorithm. Using this:

\[\Phi_{t+1} - \Phi_t \leq -\eta L(\hat{y}_t, y_t) + \frac{\eta^2}{8}\]

This step is Jensen’s inequality applied to the convex function \(\hat{y} \mapsto L(\hat{y}, y_t)\): the expected loss of a randomly drawn expert is at least the loss of the expected (weighted-average) prediction. The convexity assumption on \(L\) is used exactly here, and nowhere else in the proof.
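In code, the Jensen step looks like this (squared loss again as the assumed convex example):

```python
import numpy as np

rng = np.random.default_rng(3)
N, y_t = 5, 0.4
p = rng.dirichlet(np.ones(N))          # distribution p_t over experts
yhat = rng.uniform(size=N)             # expert predictions at round t
loss = lambda a, b: (a - b) ** 2       # convex in its first argument

expected_loss = p @ loss(yhat, y_t)    # E_{p_t}[L(yhat_i, y_t)]
loss_of_mean = loss(p @ yhat, y_t)     # L(yhat_t, y_t), yhat_t = E_{p_t}[yhat_i]
print(expected_loss >= loss_of_mean)   # True, by Jensen's inequality
```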

Step 6: Summing Over \(t = 1, \dots, T\)

Summing this inequality over all time steps:

\[\sum_{t=1}^T (\Phi_{t+1} - \Phi_t) \leq -\eta \sum_{t=1}^T L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}\]

The left-hand side telescopes:

\[\Phi_{T+1} - \Phi_1 \leq -\eta \sum_{t=1}^T L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}\]

Final Upper Bound

This establishes the upper bound for the change in potential:

\[\Phi_{T+1} - \Phi_1 \leq -\eta \sum_{t=1}^T L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}\]

Lower Bound

Step 1: Potential Function at \(T+1\)

From the definition of the potential function:

\[\Phi_{T+1} = \log \left( \sum_{i=1}^N w_{T+1,i} \right),\]

where \(w_{T+1,i}\) is the weight of expert \(i\) at time \(T+1\).

Step 2: Weight Update Rule

Using the weight update rule:

\[w_{T+1,i} = e^{-\eta L_{T,i}},\]

where \(L_{T,i} = \sum_{t=1}^T L(\hat{y}_{t,i}, y_t)\) is the cumulative loss of expert \(i\) up to time \(T\) (the same quantity written \(L_T(i)\) earlier).

Substituting into the potential function:

\[\Phi_{T+1} = \log \left( \sum_{i=1}^N e^{-\eta L_{T,i}} \right).\]

Step 3: Lower Bound for Log-Sum-Exp

A sum of nonnegative terms is at least its largest term, so \(\sum_{i=1}^N e^{-\eta L_{T,i}} \geq \max_{i \in [N]} e^{-\eta L_{T,i}}\). Taking logarithms gives the log-sum-exp lower bound:

\[\log \left( \sum_{i=1}^N e^{-\eta L_{T,i}} \right) \geq \max_{i \in [N]} \left( -\eta L_{T,i} \right).\]

Rewriting:

\[\Phi_{T+1} \geq -\eta \min_{i \in [N]} L_{T,i},\]

where \(\min_{i \in [N]} L_{T,i}\) is the smallest cumulative loss among all experts.

Note: If this isn’t clear, recall the standard log-sum-exp bounds: \(\max_i a_i \leq \log \sum_{i=1}^N e^{a_i} \leq \max_i a_i + \log N\). We use the lower one here.
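Both bounds are easy to verify numerically; the values of \(-\eta L_{T,i}\) below are arbitrary placeholders:

```python
import numpy as np

a = -0.5 * np.array([3.2, 1.7, 4.0, 2.1])          # a_i = -eta * L_{T,i}
lse = np.log(np.sum(np.exp(a)))                    # log-sum-exp
print(a.max() <= lse <= a.max() + np.log(len(a)))  # True: both bounds hold
```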

Step 4: Initial Potential Function

From the initial condition, the potential function at \(t = 1\) is:

\[\Phi_1 = \log N\]

Step 5: Combining Results

Combining the expression for \(\Phi_{T+1}\) with \(\Phi_1 = \log N\), we obtain:

\[\Phi_{T+1} - \Phi_1 \geq -\eta \min_{i \in [N]} L_{T,i} - \log N\]

Final Lower Bound

Thus, the lower bound for the change in potential is:

\[\Phi_{T+1} - \Phi_1 \geq -\eta \min_{i \in [N]} L_{T,i} - \log N\]

Combining Bounds

From the upper and lower bounds:

\[-\eta \min_{i \in [N]} L_{T,i} - \log N \leq -\eta \sum_{t=1}^T L(\hat{y}_t, y_t) + \frac{\eta^2 T}{8}\]

Rearranging terms and dividing by \(\eta\):

\[\sum_{t=1}^T L(\hat{y}_t, y_t) - \min_{i \in [N]} L_{T,i} \leq \frac{\log N}{\eta} + \frac{\eta T}{8}\]

Thus, the regret satisfies:

\[R_T \leq \frac{\log N}{\eta} + \frac{\eta T}{8}\]

Note: \(\min_{i \in [N]} L_{T,i}\) and Its Meaning

The term \(\min_{i \in [N]} L_{T,i}\) refers to the minimum cumulative loss among all the experts (or models) after \(T\) rounds. Specifically:

  • \(L_{T,i}\) is the cumulative loss of expert (or model) \(i\) after \(T\) rounds.
  • \(\min_{i \in [N]} L_{T,i}\) represents the smallest cumulative loss incurred by any expert over the \(T\) rounds.

This is exactly the benchmark term in the definition of regret: the algorithm is measured against the best expert in hindsight.

Key Takeaways:

  • The regret bound of the EWAA is a function of both the learning rate \(\eta\) and the time horizon \(T\).
  • By choosing \(\eta = \sqrt{\frac{8 \log N}{T}}\), the regret grows sublinearly with \(T\) and logarithmically with \(N\), ensuring the algorithm’s efficiency.
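To close the loop, here is a small end-to-end experiment: random expert predictions and labels with squared loss (all choices illustrative). On such benign data the empirical regret typically lands far below the worst-case bound, which holds even against an adversary:

```python
import numpy as np

rng = np.random.default_rng(4)
N, T = 10, 1000
eta = np.sqrt(8 * np.log(N) / T)               # optimized learning rate

expert_preds = rng.uniform(size=(T, N))        # experts' predictions in [0, 1]
labels = rng.uniform(size=T)                   # outcomes in [0, 1]

w = np.ones(N)                                 # w_{1,i} = 1
alg_loss = 0.0
expert_loss = np.zeros(N)
for t in range(T):
    p = w / w.sum()
    y_hat = p @ expert_preds[t]                # weighted-average prediction
    losses = (expert_preds[t] - labels[t]) ** 2
    alg_loss += (y_hat - labels[t]) ** 2
    expert_loss += losses
    w *= np.exp(-eta * losses)                 # exponential weight update

regret = alg_loss - expert_loss.min()
bound = np.sqrt(T / 2 * np.log(N))
print(f"empirical regret {regret:.2f} <= theoretical bound {bound:.2f}")
```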

Advantages and Disadvantages of EWAA

Advantages

  1. Strong Theoretical Guarantees:
    • The regret bound for the EWAA is logarithmic in the number of experts \(N\) and sublinear in the number of time steps \(T\). This means that even as the number of experts or rounds increases, the regret grows slowly, offering a strong theoretical guarantee on performance (this is exactly the \(\sqrt{\frac{T}{2} \log N}\) bound proved above).
  2. Applicability to Convex Losses:
    • Unlike algorithms specifically tailored for binary losses, EWAA can handle convex loss functions. This makes it a more versatile algorithm since convex losses are more general and can cover a wider range of applications beyond binary classification.
  3. Weight Adaptivity:
    • The exponential weight updates in EWAA ensure that poor-performing experts are penalized efficiently over time. This adaptive mechanism allows the algorithm to focus more on better-performing experts, while discouraging the influence of worse-performing ones, improving its overall performance.

Disadvantages

  • Requires Knowledge of Horizon \(T\):
    • A disadvantage of the EWAA is that it requires knowledge of the horizon \(T\), the total number of rounds the algorithm will run. Specifically, the learning rate \(\eta\) that optimizes the regret bound depends on \(T\): the choice above, \(\eta = \sqrt{\frac{8 \log N}{T}}\), scales as \(1/\sqrt{T}\).
    • This means that to optimize the regret bound, you need to have some insight or knowledge about the total number of rounds \(T\) in advance. This can be a significant limitation in practical applications, where \(T\) is not always known or fixed in advance. In real-world scenarios, you might need to adapt to changing environments without prior knowledge of how long the process will last, making it challenging to choose the best parameters like \(\eta\).

Before we wrap up, let’s take a step back and get a clearer picture of the whole thing:

Why Convexity Helps in EWAA
  1. Optimization and Regret Minimization:
    • The convexity of the loss function with respect to the prediction \(\hat{y}\) (the first argument) ensures that the algorithm can effectively minimize cumulative loss. Since every local minimum of a convex function is a global minimum, optimization is well-behaved and the analysis guarantees convergence toward low regret.
  2. Exponential Weight Updates:
    • In EWAA, the weight updates are based on the exponential of the loss, and convexity lets the weighted-average prediction inherit the experts’ guarantees (via Jensen’s inequality), while the exponential update penalizes poorly performing experts increasingly heavily. Together, these focus the algorithm on the most promising experts while reducing the influence of poor ones.
  3. Efficient Learning:
    • Convexity ensures that the loss function grows in a predictable manner, which helps in adjusting the weights efficiently across time steps. This is important for the overall performance of the algorithm, as it leads to effective adaptation and faster convergence to a good solution.
  4. Theoretical Guarantees:
    • The convexity property allows the theoretical regret bounds for EWAA to be derived cleanly. Since convex functions satisfy Jensen-type inequalities and have well-behaved (sub)gradients and curvature, we can make rigorous claims about the regret bound, such as the logarithmic growth in the number of experts \(N\) and sublinear growth in the number of time steps \(T\). Without convexity, such guarantees would not be as strong or as easily established.

And, If you’re unsure about the difference between linear and sublinear growth, here’s a quick clarification:

  • Linear growth means the value grows at a constant rate (proportional to the parameter). Mathematically: \(f(x) = O(x)\).
  • Sublinear growth means the value grows more slowly than the parameter, so the ratio \(f(x)/x\) tends to zero. Mathematically: \(f(x) = O(x^a)\) where \(0 < a < 1\), or \(f(x) = O(\log(x))\). For example, quadrupling \(T\) only doubles a \(\sqrt{T}\)-shaped regret bound.

Conclusion

The Exponential Weighted Average Algorithm provides strong guarantees for regret minimization with convex loss functions. Its use of exponential weight updates makes it both adaptable and theoretically elegant, though its dependence on the time horizon \(T\) can present practical challenges.

In the next post, we’ll dive into the doubling trick for selecting \(\eta\) and how it improves regret bounds. Stay tuned—see you in the next one!
