Follow the Leader (FL) and Follow the Perturbed Leader (FPL) in Online Learning

Online learning is a fascinating area of machine learning where an agent makes sequential decisions, aiming to minimize loss over time. Unlike traditional supervised learning, where a model is trained on a fixed dataset, online learning involves continuously adapting based on feedback from past decisions. One of the simplest and most natural strategies in this setting is Follow the Leader (FL)—a method that always picks the action that has performed best in hindsight.

While FL is intuitive, it has critical weaknesses that make it unreliable in adversarial settings. To address these limitations, a more robust strategy called Follow the Perturbed Leader (FPL) was introduced, which injects randomness to stabilize decision-making. In this post, we will explore the core ideas behind FL and FPL, develop their mathematical formulations, and derive their performance bounds.

Learning from Structure: Decomposing Losses

Many learning problems can be framed in terms of minimizing cumulative loss, which often decomposes naturally across substructures. For instance, in structured prediction tasks, the total loss may be expressed as a sum over smaller components such as:

  • The sum of losses along edges in a tree.
  • The sum of losses along a specific path in a graph.
  • The sum of individual substructure losses in a discrete optimization problem.
  • The cumulative loss incurred in an expert setting, where multiple strategies are available, and the goal is to learn from the best-performing one.

This decomposition plays a key role in designing online learning algorithms, as it allows us to analyze decision-making strategies in a structured manner.

The Follow the Leader (FL) Strategy

To formalize the Follow the Leader approach, let’s consider a general linear decision problem. At each round \(t\), a player selects a decision \(w_t\) from a set \(W\). Once the decision is made, the environment reveals a loss vector \(x_t\), and the player incurs a loss given by the inner product:

\[\ell_t = w_t \cdot x_t\]

The objective is to minimize the cumulative loss over \(T\) rounds:

\[L_T = \sum_{t=1}^{T} w_t \cdot x_t\]

Alternatively, we can evaluate performance through regret, which measures how much worse our strategy is compared to the best fixed action in hindsight:

\[\text{Regret}_T = L_T - \min_{w \in W} \sum_{t=1}^{T} w \cdot x_t\]

The Follow the Leader strategy takes a straightforward approach: at each step, it selects the action that has performed best so far. Mathematically, this means choosing:

\[w_t = \arg\min_{w \in W} \sum_{s=1}^{t-1} w \cdot x_s\]

This approach, sometimes referred to as fictitious play, seems reasonable—after all, picking the historically best-performing action should be a good idea, right? However, FL has a major flaw: it can be highly unstable. In adversarial settings, small changes in past losses can cause large shifts in decisions, leading to poor performance.

To illustrate this, consider two experts and a loss sequence designed against FL: in the first round the losses are \((1/2, 0)\), and afterwards the loss vectors alternate between \((0, 1)\) and \((1, 0)\). From the second round onward, the current leader is exactly the expert about to incur a loss of 1, so FL pays roughly 1 per round and its cumulative loss grows like \(T\), while either fixed expert accumulates only about \(T/2\). This instability calls for a more robust alternative.
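
To see the failure mode concretely, here is a minimal numpy simulation of FL on this two-expert sequence; the function and variable names are mine, and breaking ties by lowest index is an arbitrary choice:

```python
import numpy as np

def follow_the_leader(loss_matrix):
    """Run FL on a (T, N) matrix of per-round expert losses.

    At each round, pick the expert with the smallest cumulative past
    loss (ties broken by lowest index), then observe the round's losses."""
    T, N = loss_matrix.shape
    cumulative = np.zeros(N)
    total_loss = 0.0
    for t in range(T):
        leader = int(np.argmin(cumulative))   # best expert in hindsight
        total_loss += loss_matrix[t, leader]  # pay that expert's loss
        cumulative += loss_matrix[t]          # the round's losses are revealed
    return total_loss, cumulative

# Adversarial sequence for two experts: (0.5, 0), then (0, 1), (1, 0), ...
T = 100
losses = np.zeros((T, 2))
losses[0] = [0.5, 0.0]
losses[1::2] = [0.0, 1.0]
losses[2::2] = [1.0, 0.0]

fl_loss, totals = follow_the_leader(losses)
print(f"FL cumulative loss:       {fl_loss:.1f}")       # ~ T
print(f"best fixed expert's loss: {totals.min():.1f}")  # ~ T / 2
```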

Follow the Perturbed Leader (FPL): Adding Randomization

A simple yet effective modification to FL is Follow the Perturbed Leader (FPL). The idea is to introduce random noise before selecting the best action, preventing the algorithm from making overly rigid choices based on past data.

Instead of selecting the exact leader, FPL picks an action by solving:

\[w_t = \arg\min_{w \in W} \left( \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t \right)\]

where \(p_t\) is a random perturbation added to the cumulative loss. This small tweak dramatically improves performance in adversarial settings by preventing sudden shifts in decisions.

A common choice for the perturbation is uniform noise:

\[p_t \sim U([0, 1/\epsilon]^N)\]

where \(N\) is the dimension of the decision space (e.g., the number of experts). This ensures that, almost surely, no two actions have exactly the same perturbed cumulative loss, so ties are broken at random and small differences in past losses no longer trigger abrupt switches. Another approach, used by the FPL* variant, draws each coordinate of \(p_t\) independently from a two-sided exponential (Laplace) density:

\[f(x) = \frac{\epsilon}{2} e^{-\epsilon |x|}\]

so that the perturbation vector has density proportional to \(e^{-\epsilon \|x\|_1}\). FPL* enjoys even stronger, multiplicative-style guarantees and is particularly effective in adversarial settings. The idea of perturbing the leader dates back to Hannan (1957) and was refined and analyzed in this form by Kalai & Vempala (2004).
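
In the expert setting, FPL is only a few lines of code. The sketch below is my own minimal implementation (not taken from Kalai & Vempala); it supports both the uniform perturbation of plain FPL and the two-sided exponential perturbation used by FPL*:

```python
import numpy as np

rng = np.random.default_rng(0)

def fpl_choice(cumulative_loss, epsilon, perturbation="uniform"):
    """One FPL decision over N actions: perturb the cumulative losses
    with fresh noise and return the index of the perturbed leader."""
    N = cumulative_loss.shape[0]
    if perturbation == "uniform":
        # plain FPL: p_t ~ U([0, 1/epsilon]^N)
        p = rng.uniform(0.0, 1.0 / epsilon, size=N)
    else:
        # FPL*: each coordinate from a two-sided exponential (Laplace)
        # density (epsilon / 2) * exp(-epsilon * |x|)
        p = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=N)
    return int(np.argmin(cumulative_loss + p))

def run_fpl(loss_matrix, epsilon, perturbation="uniform"):
    """Play FPL against a (T, N) matrix of per-round expert losses."""
    T, N = loss_matrix.shape
    cumulative = np.zeros(N)
    total_loss = 0.0
    for t in range(T):
        i = fpl_choice(cumulative, epsilon, perturbation)
        total_loss += loss_matrix[t, i]
        cumulative += loss_matrix[t]
    return total_loss
```

On the adversarial sequence from the FL example, running `run_fpl(losses, epsilon=1/np.sqrt(T))` should keep the expected loss within roughly \(O(\sqrt{T})\) of the best fixed expert, whereas FL pays about twice as much.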


Follow-Up Questions!

Is FPL an online learning strategy similar to the Randomized Weighted Majority Algorithm (RWMA), or is it something else designed for specific cases?

Yes, Follow the Perturbed Leader (FPL) is an online learning strategy, and it shares some similarities with the Randomized Weighted Majority Algorithm (RWMA) in that both incorporate randomness to improve decision-making and achieve better regret bounds. However, their underlying mechanisms differ.

  • RWMA (Randomized Weighted Majority Algorithm):
    This is an expert-based algorithm where the learner maintains a set of expert weights, updating them multiplicatively based on incurred losses. RWMA is particularly effective in adversarial settings and can be viewed as a special case of Exponentiated Gradient (EG) methods. Its main idea is to give more weight to the experts who perform well, based on past performance.

  • FPL (Follow the Perturbed Leader):
    FPL, on the other hand, introduces random perturbations to the past loss values before selecting the best action. This process smooths decision-making, avoiding the instability seen in Follow the Leader (FL) while keeping the simplicity of choosing the best historical action.

FPL is especially useful when working with linear loss functions and structured decision spaces, such as combinatorial optimization problems (e.g., shortest paths, spanning trees). It is closely related to mirror descent and regularization-based algorithms and, in some cases, can be viewed as an implicit form of regularization via perturbations.

Key Differences:

  • RWMA is expert-based, where weights are adjusted multiplicatively based on performance.
  • FPL uses random perturbations to past losses, making it more general and suitable for structured decision-making problems, particularly in combinatorial settings.

In summary, while both algorithms use randomness to stabilize learning, RWMA is more tailored to expert settings, while FPL is a broader decision-making framework effective for combinatorial and structured problems.
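
To make the contrast concrete, here is a bare-bones sketch of one Randomized Weighted Majority round (my own notation; losses are assumed to lie in \([0, 1]\)). Note that it keeps one explicit weight per expert, which is exactly what becomes infeasible when the "experts" are exponentially many combinatorial objects:

```python
import numpy as np

rng = np.random.default_rng(1)

def rwma_round(weights, losses, eta):
    """One round of Randomized Weighted Majority.

    Sample an expert in proportion to its weight, then shrink every
    expert's weight multiplicatively according to its observed loss."""
    probs = weights / weights.sum()
    chosen = int(rng.choice(len(weights), p=probs))  # randomized prediction
    new_weights = weights * (1.0 - eta) ** losses    # multiplicative update
    return chosen, new_weights
```

The \((1 - \eta)^{\text{loss}}\) factor is the classical Littlestone–Warmuth update; the closely related Hedge algorithm uses \(e^{-\eta \cdot \text{loss}}\) instead.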

What Do We Mean by Structured Problems?

A structured problem refers to a decision-making setting where the available choices or actions have an underlying structure, often governed by combinatorial, geometric, or graph-based constraints. Instead of choosing from a simple finite set of actions (like in a standard multi-armed bandit or expert setting), the learner must select complex objects such as:

  • Paths in a graph (e.g., shortest path routing)
  • Spanning trees (e.g., network design)
  • Matchings in a bipartite graph (e.g., job allocation)
  • Binary vectors with constraints (e.g., feature selection)
  • Matrices or sequences (e.g., scheduling problems)

In these cases, the decision space is often exponentially large, making direct enumeration or simple expert-based approaches infeasible.


Why Does FL/FPL Apply to Structured Problems?

The Follow the Leader (FL) and Follow the Perturbed Leader (FPL) strategies naturally extend to structured problems because:

  1. Linear Loss Structure: Many structured problems can be formulated in terms of a linear loss function over the structure. For example, if selecting a path in a graph, the total loss might be the sum of edge losses along the path. FL/FPL directly works with such additive losses.

  2. Combinatorial Decision Spaces: Since the space of possible decisions is structured (e.g., spanning trees, paths), explicitly maintaining weights over all choices, as in the Weighted Majority Algorithm, is impractical. Instead, FL/FPL selects actions by solving an optimization problem over the structured space.

  3. Computational Feasibility: The FL and FPL updates involve solving an argmin optimization problem over past losses:

\[w_t = \arg \min_{w \in W} \sum_{s=1}^{t-1} w \cdot x_s\]

This optimization often reduces to a well-known combinatorial problem, such as minimum spanning tree, shortest path, or maximum matching, which can be solved efficiently using algorithms from combinatorial optimization.
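
As a concrete illustration, the sketch below runs FPL over the s–t paths of a tiny graph, using networkx as the shortest-path oracle. The toy graph, the random delay model, and the value of \(\epsilon\) are all invented for the example; only the per-edge cumulative delays are perturbed, and the oracle call replaces any explicit bookkeeping over the many possible paths.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(2)

# Toy road network: 4 intersections, 5 roads (names are illustrative).
G = nx.Graph()
G.add_edges_from([("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")])

# Cumulative observed delay for each road, keyed by edge.
cumulative = {e: 0.0 for e in G.edges}

def fpl_route(graph, cumulative, source, target, epsilon):
    """Pick a route by FPL: perturb each road's cumulative delay,
    then call a shortest-path oracle on the perturbed weights."""
    for e in graph.edges:
        graph.edges[e]["weight"] = cumulative[e] + rng.uniform(0, 1 / epsilon)
    return nx.shortest_path(graph, source, target, weight="weight")

total_delay = 0.0
for day in range(30):
    route = fpl_route(G, cumulative, "A", "D", epsilon=0.5)
    # The environment reveals today's delay on every road ...
    today = {e: rng.uniform(0.0, 1.0) for e in G.edges}
    # ... we pay the delays along the roads of our chosen route ...
    total_delay += sum(today.get((u, v), today.get((v, u), 0.0))
                       for u, v in zip(route, route[1:]))
    # ... and every road's cumulative delay is updated for future rounds.
    for e in cumulative:
        cumulative[e] += today[e]

print(f"total delay over 30 days: {total_delay:.2f}")
```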


How Does FL/FPL Work in Structured Settings?
  • Follow the Leader (FL):
    FL picks the structure (e.g., path, tree) that has accumulated the lowest loss so far. However, FL is unstable in adversarial settings because a small change in losses can drastically change the optimal structure.

  • Follow the Perturbed Leader (FPL):
    FPL introduces random perturbations to smooth decision-making. The learner solves:

\[w_t = \arg \min_{w \in W} \left( \sum_{s=1}^{t-1} w \cdot x_s + w \cdot p_t \right)\]

where \(p_t\) is a noise vector. This prevents overfitting to small loss differences and ensures better stability.


Key Takeaways:

  • FL/FPL applies well to structured problems where decisions involve combinatorial choices rather than simple finite sets.
  • FL works when losses are additive over substructures, but it is unstable in adversarial settings.
  • FPL improves stability by adding randomness, making it a powerful tool for structured online learning.

This is why FL/FPL is commonly used in structured online learning, where actions have combinatorial dependencies, and stability is crucial.

Understanding Structured Problems: An Intuition

Imagine you’re navigating a city using Google Maps, and every road has a traffic delay (which changes over time).

Your goal is to pick the best route (shortest travel time) every day. The total delay you experience is the sum of individual road delays along your chosen path.

This is an example of a structured decision problem, because:

  • You’re not picking a single element (e.g., a traffic light or a road segment).
  • You’re picking a structured object—a full path (set of roads that form a connected route).
  • The loss (delay) of a path is determined by the sum of its parts (individual road delays).

Contrast with Expert-Based Settings

Now, contrast this with a typical expert-based setting, where you simply choose among a handful of fixed alternatives (as in prediction with expert advice or a multi-armed bandit):

  • If Google simply gave you a few pre-defined expert routes, and you chose among them, it wouldn’t be flexible.
  • If a road suddenly gets blocked, all routes using that road become bad at once, and the whole system collapses.
  • Expert algorithms assume fixed actions, but in structured problems, you can dynamically build better solutions.

Instead of treating entire routes as fixed “experts,” FL/FPL can adaptively construct the best route based on updated delays.


Why FL/FPL is Better for Structured Problems

Expert-Based Methods Struggle with Exponentially Large Action Spaces

If we use an expert-based algorithm, we’d need to maintain weights over all possible routes, but the number of possible paths grows exponentially. Instead of tracking every possible path, FL/FPL works directly with roads (substructures), summing losses efficiently.

Follow the Leader (FL) Uses the Best Past Route

In each round, FL picks the route with the lowest total delay so far. However, FL is unstable—if one road suddenly gets bad, it can completely switch to a totally different route, causing large fluctuations.

Follow the Perturbed Leader (FPL) Adds Stability

Instead of blindly trusting past delays, FPL adds a small random perturbation to each road’s delay before choosing a route. This prevents overcommitting to minor variations and makes the algorithm more robust to sudden changes.

Final Takeaway:

  • FL/FPL works well for structured problems where decisions are not independent (like picking a single stock) but instead composed of multiple interacting elements (like building a portfolio or choosing a route).
  • FL is too reactive, making drastic changes when small variations happen.
  • FPL stabilizes decisions, avoiding erratic shifts and leading to better long-term performance.

This is why FL/FPL is widely used in structured online learning, like routing, combinatorial optimization, and dynamic decision-making.


What happens if we stop adding perturbation after some rounds \(T\), that is, when we have achieved good low regret?

Great question! If you stop adding perturbations after some round \(T\), once you have achieved good performance and low regret, the behavior of the algorithm changes: it essentially reverts to the plain Follow the Leader strategy (FL, often written FTL). This transition can have interesting consequences.

Understanding the Effect of Stopping Perturbations

  1. Initial Phase with Perturbations (Exploration):
    • When you apply perturbations, the Follow the Perturbed Leader (FPL) strategy introduces randomness into the decision-making process. This helps the algorithm explore different experts and prevents it from sticking to an expert that might perform poorly in the future, despite performing well in the past.
    • The perturbations encourage the algorithm to occasionally “switch” experts, even if the cumulative loss of one expert is slightly better than the others. This helps to avoid overfitting to a single expert and explores new possibilities, which is particularly helpful if the environment is non-stationary or if the best expert changes over time.
  2. After Stopping Perturbations:
    • Once you stop adding perturbations after round \(T\), the algorithm will no longer introduce any randomness and will instead follow the expert with the lowest cumulative loss up until that point. Effectively, the algorithm will behave like Follow the Leader (FTL) after the perturbation phase.
    • FTL always follows the expert with the best cumulative performance (i.e., the lowest cumulative loss). If one expert has been performing better consistently, the algorithm will stick to that expert.

Possible Outcomes of Stopping Perturbations

Let’s consider a few different scenarios that can arise depending on when you stop perturbing and how the environment behaves.

  1. Environment is Stationary (Best Expert Doesn’t Change)
    • Situation: If the best expert has been consistently the best throughout all rounds, stopping the perturbations at round \(T\) will result in the algorithm behaving like FTL from round \(T\) onward.
    • Result: The algorithm will follow the best expert without any risk of choosing a suboptimal one. The cumulative loss will continue to accumulate for the best expert, and the regret will stay small or grow slowly.
    • Performance: If the best expert is already clearly the best by round \(T\), the algorithm will perform optimally from round \(T\) onward (i.e., it accrues no further regret). Its cumulative loss will track the best expert’s cumulative loss, and the only regret is the small amount accumulated during the exploration phase before round \(T\).
  2. Environment is Non-Stationary (Best Expert Changes Over Time)
    • Situation: If the best expert changes over time (e.g., due to shifts in the environment or external factors), stopping perturbations at round \(T\) might cause the algorithm to get stuck with the wrong expert.
    • Result: Once you stop perturbing, the algorithm might follow an expert that was the best up to round \(T\) but is no longer the best expert from that point onward.
    • Performance: In this case, the algorithm could suffer from increased regret after round \(T\), because it will stick with an expert that was optimal earlier but is now suboptimal. The regret will start to grow again because the algorithm has stopped exploring and is no longer adapting to changes in the environment.
    • Consequence: The algorithm will lose the flexibility that perturbations gave it, and it might be too slow to react to changes in the best expert. The regret could potentially grow at a linear rate after perturbations stop, similar to FTL.
  3. Regret After Stopping Perturbations
    • If the perturbations were helping to adapt to changes in the environment, stopping them could result in the regret increasing once the environment changes and the algorithm locks onto a potentially suboptimal expert.
    • The cumulative regret after round \(T\) (where perturbations stop) would depend on how well the algorithm was able to adapt and learn the best expert before the perturbations were turned off.

Long-Term Regret Bound After Stopping Perturbations

If you stop perturbing after \(T\) rounds, the regret bound will likely change.

  • Before round \(T\), the regret is bounded by \(O(\sqrt{T \log N})\), because the perturbations provide exploration and prevent the algorithm from locking onto a single, possibly suboptimal, expert.
  • After round \(T\), once perturbations are stopped, the algorithm behaves like Follow the Leader (FTL), and in the worst case the additional regret can grow linearly:
\[R(T') - R(T) \leq O(T' - T)\]

where \(T'\) is the total number of rounds played, so that \(T' - T\) rounds occur after the perturbations are switched off. Essentially, the regret may grow linearly with the number of rounds after \(T\), because the algorithm no longer explores and may follow a suboptimal expert.

Thus, the overall regret after \(T'\) rounds will be:

\[R(T') \leq O(\sqrt{T \log N}) + O(T' - T)\]

where:

  • The first term accounts for the exploration phase with perturbations.
  • The second term captures the linear regret that may occur once perturbations are turned off.

Optimizing the Stopping Point: To minimize the long-term regret after perturbations stop, you would want to stop perturbing at a point where:

  1. The best expert has stabilized — meaning the algorithm has successfully identified the best expert up until round \(T\), and the environment is either stationary or predictable.
  2. The expert switching rate is low — if the best expert has already been chosen and the environment isn’t changing much, then stopping perturbations at round \(T\) will not lead to significant regret in the long run.

Summary of Outcomes

  • If the environment is stationary, stopping perturbations at round \(T\) when good performance has been achieved is beneficial and will result in very low regret going forward, since the algorithm will just follow the best expert.
  • If the environment is non-stationary, stopping perturbations could lead to higher regret since the algorithm will stop adapting to new changes and could get stuck with a suboptimal expert.
  • The regret bound will change after you stop perturbations. Initially, regret grows sublinearly (\(O(\sqrt{T \log N})\)), but after stopping perturbations, regret may grow linearly, like FTL (\(O(T' - T)\)).
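
A small simulation makes the transition concrete (entirely illustrative; the cutoff policy and all parameters are invented). On the adversarial alternating sequence from the FL example, the variant that keeps perturbing stays close to the best fixed expert, while the variant that stops halfway reverts to FTL behavior and its regret starts growing roughly linearly from the cutoff onward:

```python
import numpy as np

rng = np.random.default_rng(3)

def fpl_with_cutoff(loss_matrix, epsilon, stop_round):
    """FPL that stops adding perturbations after `stop_round`,
    reverting to plain Follow the Leader from then on."""
    T, N = loss_matrix.shape
    cumulative = np.zeros(N)
    total = 0.0
    for t in range(T):
        scores = cumulative.copy()
        if t < stop_round:                       # perturb only up to the cutoff
            scores += rng.uniform(0, 1 / epsilon, size=N)
        i = int(np.argmin(scores))
        total += loss_matrix[t, i]
        cumulative += loss_matrix[t]
    return total

# The adversarial alternating sequence from the FL example.
T = 2000
losses = np.zeros((T, 2))
losses[0] = [0.5, 0.0]
losses[1::2] = [0.0, 1.0]
losses[2::2] = [1.0, 0.0]
best_fixed = losses.sum(axis=0).min()

for stop in (T, T // 2):   # keep perturbing forever vs. stop halfway
    loss = fpl_with_cutoff(losses, epsilon=1 / np.sqrt(T), stop_round=stop)
    print(f"stop_round={stop}: regret ~ {loss - best_fixed:.0f}")
```

With these numbers, regret stays on the order of \(\sqrt{T}\) when the perturbations are never switched off, but grows by roughly \(0.5\) per round after the cutoff in the second run, since FTL then pays 1 per round while the best fixed expert pays about \(0.5\).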

Theoretical Bounds for FPL

A crucial advantage of FPL is that it achieves sublinear regret: its average regret per round vanishes as \(T\) grows, so its cumulative loss approaches that of the best fixed decision in hindsight. We now present regret bounds for standard FPL (an additive bound) and for FPL* (a multiplicative-style bound).

Additive FPL Bound

For a fixed \(\epsilon > 0\), the expected cumulative loss of FPL with uniform perturbations is bounded as:

\[E[\mathcal{L}_T] \leq \mathcal{L}_T^{\min} + \epsilon R X_1 T + \frac{W_1}{\epsilon}\]

where \(\mathcal{L}_T^{\min}\) denotes the cumulative loss of the best fixed decision in hindsight, \(W_1\) bounds the \(\ell_1\) diameter of the decision set \(W\), \(X_1\) bounds \(\|x_t\|_1\), and \(R\) bounds \(|w \cdot x_t|\).

By choosing an optimal \(\epsilon\) value:

\[\epsilon = \sqrt{\frac{W_1}{R X_1 T}}\]

we obtain the bound:

\[E[\mathcal{L}_T] \leq \mathcal{L}_T^{\min} + 2 \sqrt{X_1 W_1 R T}\]

which ensures that the regret grows sublinearly with \(T\), implying that as time progresses, the algorithm performs nearly as well as the best fixed action in hindsight.
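
The stated \(\epsilon\) is precisely the value that balances the two \(\epsilon\)-dependent terms of the bound; a quick check by differentiation:

\[\frac{d}{d\epsilon}\left(\epsilon R X_1 T + \frac{W_1}{\epsilon}\right) = R X_1 T - \frac{W_1}{\epsilon^2} = 0 \quad\Longrightarrow\quad \epsilon = \sqrt{\frac{W_1}{R X_1 T}}, \qquad \epsilon R X_1 T + \frac{W_1}{\epsilon} = 2\sqrt{W_1 R X_1 T}\]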

Multiplicative FPL* Bound

For FPL* with the exponential (Laplace) perturbations described earlier, the expected cumulative loss is bounded by:

\[E[\mathcal{L}_T] \leq \mathcal{L}_T^{\min} + 4 \sqrt{\mathcal{L}_T^{\min} X_1 W_1 (1 + \log N)} + 4 X_1 W_1 (1 + \log N)\]

where the optimal choice of \(\epsilon\) is:

\[\epsilon = \min \left( \frac{1}{2X_1}, \sqrt{\frac{W_1(1 + \log N)}{X_1 \mathcal{L}_T^{\min}}} \right)\]

This bound highlights the benefits of multiplicative perturbations, which further stabilize decision-making in adversarial scenarios.

Conclusion

The Follow the Leader (FL) algorithm, while simple, suffers from instability in adversarial settings. By introducing random perturbations, the Follow the Perturbed Leader (FPL) approach overcomes these weaknesses, ensuring better performance over time.

  • Uniform perturbations smooth out decision-making, reducing instability.
  • Exponential (Laplace) perturbations (FPL*) provide even stronger, multiplicative-style guarantees, particularly in adversarial environments.

By leveraging randomization, FPL achieves better regret bounds, making it a powerful tool in online learning. These ideas form the foundation for many modern online learning algorithms, and understanding them provides valuable insight into sequential decision-making strategies.

If you’re interested in diving deeper into online learning, stay tuned for more explorations into regret minimization techniques and advanced algorithms!

References