My Understanding of "Efficient Algorithms for Online Decision Problems" Paper

Here is the link to the paper: Efficient Algorithms for Online Decision Problems, by Kalai and Vempala. I highly recommend reading through the entire paper once or twice before continuing with this post.

1. Introduction

This paper explores online decision problems, where decisions must be made sequentially without knowing future costs. The goal is to ensure that the total cost incurred is close to the best fixed decision in hindsight.

Key Contributions:
  • Introducing Follow the Perturbed Leader (FPL), a computationally efficient alternative to exponential weighting methods like Weighted Majority.
  • Extending FPL to structured problems (e.g., shortest paths, tree-based search, adaptive Huffman coding).
  • Demonstrating that FPL achieves low regret while remaining computationally efficient.
  • Showing how FPL can be applied even when exact offline solutions are infeasible, by leveraging approximation algorithms.

2. Online Decision Problem Setup

Problem Definition

  • At each time step \(t\), a cost vector \(s_t\) is revealed.
  • The algorithm picks a decision \(d_t\) from a set \(D\).
  • The objective is to minimize the cumulative cost: \(\sum_{t=1}^{T} d_t \cdot s_t\) compared to the best single decision in hindsight:

    \[\min_{d \in D} \sum_{t=1}^{T} d \cdot s_t.\]

Example: The Experts Problem

  • Suppose there are \(n\) experts providing recommendations.
  • Each expert incurs a cost at each time step.
  • The goal is to perform as well as the best expert.

Traditional solutions use exponential weighting, which becomes computationally expensive when the decision set is large or structured. FPL provides a simpler and faster alternative.

3. Follow the Perturbed Leader (FPL) Algorithm

Key Idea

  • Instead of following the exact best decision so far (Follow the Leader, FTL), which can lead to excessive switching, we introduce random perturbations.
  • This smooths decision-making and prevents adversarial manipulation.

Algorithm

  1. Compute cumulative costs for each decision over the rounds observed so far: \(c_t(e) = \sum_{\tau=1}^{t-1} s_\tau(e)\)
  2. Add random perturbation \(p_t(e)\) drawn from an exponential distribution: \(\tilde{c}_t(e) = c_t(e) + p_t(e).\)
  3. Choose the decision with the lowest perturbed cost.
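
Here is a minimal sketch of this loop in Python for the experts setting (names like `fpl_experts` and `eta` are mine, not from the paper; perturbations are drawn fresh each round):

```python
import numpy as np

def fpl_experts(costs, eta, rng=None):
    """Follow the Perturbed Leader on a (T, n) array of expert costs.

    costs[t, i] is expert i's cost at round t; eta scales the noise
    (larger eta -> smaller perturbations). Returns the total cost paid.
    """
    if rng is None:
        rng = np.random.default_rng()
    T, n = costs.shape
    cumulative = np.zeros(n)  # costs observed before the current round
    total = 0.0
    for t in range(T):
        # Fresh exponential perturbation p_t(e) for each decision.
        perturbation = rng.exponential(scale=1.0 / eta, size=n)
        choice = np.argmin(cumulative + perturbation)
        total += costs[t, choice]  # pay the chosen expert's cost for round t
        cumulative += costs[t]     # only then observe this round's costs
    return total
```

Note that the decision at round \(t\) uses only costs observed through round \(t - 1\); the current round's cost is added after the choice is made.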

Why Does This Work?

  • Reduces frequent switching of decisions.
  • Makes the algorithm less predictable to adversarial environments.
  • Maintains a low regret bound.

4. Regret Analysis of FPL

The regret measures how much worse FPL is compared to the best decision in hindsight.

\[E[\text{Cost of FPL}] - \min_{d \in D} \sum_{t=1}^{T} d \cdot s_t.\]

Step 1: The “Be the Leader” Algorithm

  • The hypothetical “Be the Leader” algorithm always picks the best decision so far: \(d_t = M(s_{1:t})\)
  • It incurs zero regret: \(\sum_{t=1}^{T} M(s_{1:t}) \cdot s_t \leq M(s_{1:T}) \cdot s_{1:T}.\)
  • However, it switches too often.

Step 2: Introducing Perturbations

FPL adds perturbations to cumulative costs before selecting the leader: \(M(s_{1:t-1} + p_t).\) The regret bound becomes:

\[E[\text{Cost of FPL}] \leq \min_{\text{fixed decision } d} \sum_{t=1}^{T} d \cdot s_t + \eta R A T + D / \eta.\]

Step 3: Choosing the Optimal Perturbation Scale

To balance stability and adaptation, we set \(\eta = \sqrt{\frac{D}{RAT}}\), which leads to a final regret bound of \(O(\sqrt{T})\). This sublinear regret means FPL performs nearly as well as the best expert over time.

5. Applying FPL to Structured Problems

FPL can be extended to problems beyond expert selection, such as shortest paths.

Online Shortest Paths Problem

  • Given a graph with edge costs changing over time.
  • Each round, the algorithm must select a path from source \(s\) to destination \(t\).
  • The naive approach (treating paths as independent experts) is computationally infeasible.

Efficient Approach: FPL at the Edge Level

  1. Instead of applying FPL to entire paths, apply perturbations at the edge level.
  2. Compute perturbed edge costs: \(\tilde{c}_t(e) = c_t(e) + p_t(e).\)
  3. Compute the shortest path under these perturbed edge costs (e.g., with Dijkstra's algorithm).
  4. Follow that path for this round.

Why Does This Work?

  • Avoids exponential complexity by working at the edge level.
  • Efficiently computed using shortest path algorithms (e.g., Dijkstra’s).
  • Regret bound remains low: \((1 + \epsilon) \times (\text{Best Path Cost}) + O(m n \log n).\)

Other Structured Problems

FPL has also been applied to:

  • Tree search: Efficiently updating search trees.
  • Adaptive Huffman coding: Dynamically optimizing prefix codes.
  • Online approximation algorithms: Extending FPL when exact offline solutions are infeasible.

6. Summary

  • FPL is a simple, efficient alternative to exponential weighting methods.
  • It achieves regret \(O(\sqrt{T})\), ensuring performance close to the best decision in hindsight.
  • The perturbation method generalizes to structured problems like shortest paths.
  • FPL is computationally feasible even when the number of decisions is large.

Follow-Up Questions

Understanding the Additive Analysis and Regret Bounds

The Additive Analysis section in the paper derives a regret bound for the Follow the Perturbed Leader (FPL) algorithm. The key goal is to compare the cumulative cost of FPL to the best fixed decision in hindsight.

Key Notation

Before deriving the regret bound, let’s clarify the notation:

  • \(s_t\): State (cost vector) at time \(t\)
    • At each time step, we observe a cost vector \(s_t\), where each component represents the cost of a different decision.
  • \(d_t\): Decision made at time \(t\)
    • The action chosen by the algorithm at time \(t\).
  • \(M(x)\): Best fixed decision in hindsight
    • Given a total cost vector \(x\), \(M(x)\) returns the best decision: \(M(x) = \arg\min_{d \in D} d \cdot x\)
  • \(s_{1:T}\): Total cost vector over all \(T\) time steps
    • This is simply the sum of all cost vectors: \(s_{1:T} = s_1 + s_2 + \dots + s_T\)
  • \(p_t\): Random perturbation added at time \(t\)
    • Introduced to smooth out decision-making and prevent frequent switches.

Goal: Regret Minimization

We define regret as the difference between FPL’s total cost and the best fixed decision in hindsight:

\[E \left[ \sum_{t=1}^{T} d_t \cdot s_t \right] - \min_{d \in D} \sum_{t=1}^{T} d \cdot s_t.\]

Step 1: The Hypothetical “Be the Leader” Algorithm

To analyze FPL, we first consider a hypothetical “Be the Leader” algorithm, which always picks the best decision so far:

\[d_t = M(s_{1:t}).\]

The key property of this algorithm is:

\[\sum_{t=1}^{T} M(s_{1:t}) \cdot s_t \leq M(s_{1:T}) \cdot s_{1:T}.\]

Why Does This Hold?

  • The best decision in hindsight is optimal for the full sequence.
  • If we could always “be the leader,” we would incur zero regret.
  • However, this algorithm switches too frequently, making it unstable.

Step 2: Introducing Perturbations

FPL smooths out decision-making by adding random perturbations before selecting the leader:

\[M(s_{1:t-1} + p_t).\]

The analysis shows:

\[\sum_{t=1}^{T} M(s_{1:t} + p_t) \cdot s_t \leq M(s_{1:T}) \cdot s_{1:T} + D \sum_{t=1}^{T} |p_t - p_{t-1}|_\infty.\]

Key Insight

  • The first term on the RHS is the cost of the best decision in hindsight.
  • The second term represents the additional cost due to perturbations.
  • This term grows at most as \(O(\sqrt{T})\), ensuring that regret remains sublinear.

Step 3: Bounding the Impact of Perturbations

The final step is to bound the effect of perturbations:

\[E[\text{Cost of FPL}] \leq \min_{\text{fixed decision } d} \sum_{t=1}^{T} d \cdot s_t + \eta R A T + D / \eta.\]

where:

  • \(\eta\) is a tuning parameter controlling the size of perturbations.
  • \(R, A, D\) are problem-dependent constants: bounds on the per-round cost \(|d \cdot s|\), the \(L_1\) norm of the cost vectors, and the \(L_1\) diameter of the decision set, respectively.

Choosing the Optimal \(\eta\)

To minimize regret, we set:

\[\eta = \sqrt{\frac{D}{RAT}}.\]
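
Why this choice of \(\eta\)? It simply balances the two competing terms of the bound: setting the derivative of \(f(\eta) = \eta RAT + D/\eta\) to zero gives

\[f'(\eta) = RAT - \frac{D}{\eta^2} = 0 \quad\Longrightarrow\quad \eta = \sqrt{\frac{D}{RAT}}, \qquad f\!\left(\sqrt{\tfrac{D}{RAT}}\right) = \sqrt{DRAT} + \sqrt{DRAT} = 2\sqrt{DRAT}.\]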

Plugging this into the regret bound:

\[E[\text{Cost of FPL}] \leq \min_{\text{fixed decision } d} \sum_{t=1}^{T} d \cdot s_t + 2 \sqrt{DRAT}.\]

Final Regret Bound: \(O(\sqrt{T})\).

This ensures sublinear regret, meaning that over time, FPL performs nearly as well as the best fixed decision.

Key Takeaways

  1. “Be the Leader” is optimal but switches too often.
  2. FPL adds perturbations to smooth decision-making.
  3. The regret bound is controlled by the trade-off between making stable choices and avoiding excessive randomness.
  4. Choosing \(\eta\) optimally gives a regret bound of \(O(\sqrt{T})\), ensuring long-term efficiency.

Understanding the Key Notation and “Be the Leader” Algorithm

This section provides a clear explanation of the Key Notation and the “Be the Leader” algorithm used in the Additive Analysis of the paper.

1. Key Notation (Clarified)

To understand the regret bound derivation, let’s first clarify the key mathematical symbols.

Online Decision Problem Setup

In an online decision problem, we repeatedly make decisions without knowing future costs. Our goal is to minimize total cost over time.


| Symbol | Definition |
|---|---|
| \(s_t\) | State (cost vector) at time \(t\), representing the cost of each decision at step \(t\). |
| \(d_t\) | Decision chosen at time \(t\) (e.g., selecting an expert or a path). |
| \(M(x)\) | Best fixed decision in hindsight, meaning the best decision if we knew all costs in advance. |
| \(s_{1:T}\) | Total cost vector over \(T\) rounds, defined as \(s_{1:T} = \sum_{t=1}^{T} s_t\). |
| \(p_t\) | Random perturbation added at time \(t\) to smooth decision-making. |

Example: Experts Problem

  • Suppose we are choosing between two experts (A and B).
  • Each expert has a different cost at each time step.
  • We want to pick the expert that minimizes the total cost over time.

| Time \(t\) | Expert A's Cost \(s_t(A)\) | Expert B's Cost \(s_t(B)\) |
|---|---|---|
| \(t = 1\) | 0.3 | 0.4 |
| \(t = 2\) | 0.5 | 0.2 |
| \(t = 3\) | 0.4 | 0.5 |
| \(t = 4\) | 0.2 | 0.3 |

  • Without perturbations, we would always select the expert with the lowest cumulative cost.
  • However, this can cause excessive switching, which leads to instability.

2. Step 1: The “Be the Leader” Algorithm

The “Be the Leader” algorithm is a hypothetical strategy where we always choose the best decision so far.

How It Works

At time \(t\), select:

\[d_t = M(s_{1:t}),\]

meaning:

  • Choose the decision that has had the lowest total cost so far.
  • This ensures no regret because we are always picking the best option up to that point.

Key Property

\[\sum_{t=1}^{T} M(s_{1:t}) \cdot s_t \leq M(s_{1:T}) \cdot s_{1:T}.\]

Why is this true?

  1. The best decision in hindsight is optimal for the full sequence.
  2. If we could always “be the leader,” we would incur zero regret.
  3. However, this algorithm switches decisions too frequently, making it unstable.
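
A short induction makes point 1 precise. Assume the claim holds through round \(T - 1\); then

\[\sum_{t=1}^{T} M(s_{1:t}) \cdot s_t \;\leq\; M(s_{1:T-1}) \cdot s_{1:T-1} + M(s_{1:T}) \cdot s_T \;\leq\; M(s_{1:T}) \cdot s_{1:T-1} + M(s_{1:T}) \cdot s_T \;=\; M(s_{1:T}) \cdot s_{1:T},\]

where the first inequality applies the inductive hypothesis to the first \(T - 1\) rounds and the second uses the fact that \(M(s_{1:T-1})\) is, by definition, the cheapest decision against \(s_{1:T-1}\).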

Example Calculation


| Time \(t\) | Cumulative Cost \(A\) | Cumulative Cost \(B\) | Chosen Expert |
|---|---|---|---|
| \(t = 1\) | 0.3 | 0.4 | A |
| \(t = 2\) | 0.8 | 0.6 | B |
| \(t = 3\) | 1.2 | 1.1 | B |
| \(t = 4\) | 1.4 | 1.4 | Tie: switching rapidly! |

  • The leader switches frequently whenever cumulative costs change slightly.
  • This is problematic, especially in adversarial settings, because an adversary can force unnecessary switches.

3. Why Do We Need Perturbations?

The “Be the Leader” algorithm switches too often, making it inefficient.

Follow the Perturbed Leader (FPL) Fixes This

To avoid excessive switching, FPL adds random perturbations before selecting the leader:

\[M(s_{1:t-1} + p_t).\]

This smooths out decision-making:

  • Prevents rapid switches caused by small cost changes.
  • Balances stability and adaptability.
  • Maintains a low regret bound.

Key Insight

  • FPL ensures that decisions do not fluctuate excessively.
  • Adding perturbations leads to a regret bound of \(O(\sqrt{T})\), ensuring long-term efficiency.

Understanding Step 2 and Step 3 in Additive Analysis

In the Additive Analysis section, Steps 2 and 3 are crucial for deriving the regret bound for Follow the Perturbed Leader (FPL). These steps show how perturbations help smooth decision-making while maintaining low regret.


Step 2: Introducing Perturbations

The issue with the “Be the Leader” algorithm is that it switches too frequently, leading to instability. To fix this, Follow the Perturbed Leader (FPL) adds small random perturbations to past costs before selecting the leader:

\[d_t = M(s_{1:t-1} + p_t).\]

Effect of Perturbations

Instead of choosing the decision with the exact lowest cumulative cost, FPL selects the decision with the lowest perturbed cost:

\[\tilde{c}_t(e) = c_t(e) + p_t(e).\]

This ensures:

  1. Fewer unnecessary switches: Small cost fluctuations no longer cause frequent decision changes.
  2. Better robustness against adversarial cost sequences.

Key Inequality

The analysis shows:

\[\sum_{t=1}^{T} M(s_{1:t} + p_t) \cdot s_t \leq M(s_{1:T}) \cdot s_{1:T} + D \sum_{t=1}^{T} |p_t - p_{t-1}|_\infty.\]

Breaking It Down

  • LHS: The total cost incurred by FPL.
  • RHS: The total cost of the best decision in hindsight plus an extra term due to perturbations.
  • The second term captures the additional cost introduced by randomness.

Since perturbations are drawn from a well-chosen distribution, their effect remains small (bounded by \(O(\sqrt{T})\)).

Step 3: Bounding the Impact of Perturbations

The final step is to quantify how much extra cost perturbations introduce.

The key regret bound derived is:

\[E[\text{Cost of FPL}] \leq \min_{\text{fixed decision } d} \sum_{t=1}^{T} d \cdot s_t + \eta R A T + D / \eta.\]

where:

  • \(\eta\) controls how large the perturbations are.
  • \(R, A, D\) are the same problem-dependent constants as before (the per-round cost bound, the cost-vector norm bound, and the diameter of the decision set).

Choosing the Optimal Perturbation Scale

To balance stability and adaptation, we set:

\[\eta = \sqrt{\frac{D}{RAT}}.\]

Plugging this into the regret bound:

\[E[\text{Cost of FPL}] \leq \min_{\text{fixed decision } d} \sum_{t=1}^{T} d \cdot s_t + 2 \sqrt{DRAT}.\]

Final Regret Bound: \(O(\sqrt{T})\).

This means that the regret grows sublinearly, ensuring that over time, FPL performs nearly as well as the best fixed decision.


Understanding FPL with a Worked-Out Example

To better understand Follow the Perturbed Leader (FPL), let’s go through a step-by-step numerical example.

Problem Setup

  • We have two experts: A and B.
  • Each expert incurs a cost at each time step.
  • We must pick one expert per round without knowing future costs.
  • Our goal is to minimize total cost over \(T = 4\) rounds.

Cost Sequence


| Time \(t\) | Expert A's Cost \(s_t(A)\) | Expert B's Cost \(s_t(B)\) |
|---|---|---|
| \(t = 1\) | 0.3 | 0.4 |
| \(t = 2\) | 0.5 | 0.2 |
| \(t = 3\) | 0.4 | 0.5 |
| \(t = 4\) | 0.2 | 0.3 |

  • Without perturbations, this reduces to Follow the Leader (FTL): always pick the expert with the lowest cumulative cost.
  • However, this leads to frequent switching.

Step 1: Follow the Leader (FTL) - No Perturbation

If we naively follow the leader without perturbations, we get:

| Time \(t\) | Cumulative Cost \(A\) | Cumulative Cost \(B\) | Chosen Expert |
|---|---|---|---|
| \(t = 1\) | 0.3 | 0.4 | A |
| \(t = 2\) | 0.8 | 0.6 | B |
| \(t = 3\) | 1.2 | 1.1 | B |
| \(t = 4\) | 1.4 | 1.4 | Tie: switching rapidly! |
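
These two columns are just running sums; a couple of NumPy lines (a sketch, with index 0 for A and 1 for B) reproduce the table and the FTL choices:

```python
import numpy as np

# Per-round costs from the table above: rows are t = 1..4, columns are (A, B).
costs = np.array([[0.3, 0.4],
                  [0.5, 0.2],
                  [0.4, 0.5],
                  [0.2, 0.3]])

cumulative = np.cumsum(costs, axis=0)  # cumulative (A, B) cost after each round
leaders = cumulative.argmin(axis=1)    # FTL's pick each round: [0, 1, 1, 0]
# argmin breaks the t = 4 tie (1.4 vs 1.4) in favor of A (index 0).
```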

Problem with FTL

  • The algorithm switches too frequently, making it unstable.
  • Small cost differences cause unnecessary leader changes.
  • This is bad in adversarial settings, where cost sequences can be manipulated.

Step 2: Follow the Perturbed Leader (FPL) - Adding Perturbations

Now, let’s add perturbations.

  • We randomly sample fresh perturbations \(p_t(A)\) and \(p_t(B)\) from an exponential distribution at every round.
  • Suppose at \(t = 1\) we draw: \(p_1(A) = 0.1, \quad p_1(B) = 0.2\)

Step 2.1: Compute Perturbed Costs


| Time \(t\) | Cumulative Cost \(A\) | Cumulative Cost \(B\) | Perturbed \(A\) | Perturbed \(B\) | Chosen Expert |
|---|---|---|---|---|---|
| \(t = 1\) | 0.3 | 0.4 | 0.4 | 0.6 | A |
| \(t = 2\) | 0.8 | 0.6 | 0.9 | 0.8 | B |
| \(t = 3\) | 1.2 | 1.1 | 1.3 | 1.2 | B |
| \(t = 4\) | 1.4 | 1.4 | 1.5 | 1.6 | A |

(Each row uses that round's fresh perturbation draws, which is why the added noise differs from round to round.)

Step 2.2: What Changed?

  • FPL smooths out decisions: The perturbations prevent unnecessary switching.
  • Perturbations reduce instability: Instead of switching too frequently (like FTL), FPL stabilizes.
  • Better stability → Lower regret!

Step 3: Computing the Regret

Step 3.1: Compute Cost of FPL

The total cost incurred by FPL is \(\sum_{t=1}^{T} s_t(d_t)\), i.e., the chosen expert's cost at each round. Using the decisions above (A, B, B, A):

\[\text{Total Cost of FPL} = 0.3 + 0.2 + 0.5 + 0.2 = 1.2.\]

Step 3.2: Compute Cost of Best Expert in Hindsight

If we had perfect hindsight, we would choose the expert with the lowest total cost over all \(T\) rounds.

\[\text{Total Cost of Best Expert} = \min(1.4, 1.4) = 1.4.\]

Step 3.3: Compute Regret

Regret is the difference between FPL and the best fixed decision:

\[\text{Regret} = E[\text{Cost of FPL}] - \text{Cost of Best Expert}.\]

Since FPL performs better than the best single expert here, it actually has negative regret in this case! (Regret can be negative on particular sequences; the guarantee is a worst-case bound on expected regret.) In general, the regret is bounded by \(O(\sqrt{T})\), ensuring FPL converges to the best decision in hindsight over time.
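
A short script reproducing the regret computation (the perturbed costs are hard-coded to the "suppose" values from the Step 2.1 table rather than sampled, so the output matches the tables exactly):

```python
import numpy as np

# Per-round costs, rows t = 1..4, columns (A, B).
costs = np.array([[0.3, 0.4],
                  [0.5, 0.2],
                  [0.4, 0.5],
                  [0.2, 0.3]])

# Perturbed cumulative costs from the Step 2.1 table.
perturbed = np.array([[0.4, 0.6],
                      [0.9, 0.8],
                      [1.3, 1.2],
                      [1.5, 1.6]])

choices = perturbed.argmin(axis=1)             # [0, 1, 1, 0] -> A, B, B, A
fpl_cost = costs[np.arange(4), choices].sum()  # 0.3 + 0.2 + 0.5 + 0.2 = 1.2
best_cost = costs.sum(axis=0).min()            # min(1.4, 1.4) = 1.4

print(f"FPL: {fpl_cost:.1f}, best expert: {best_cost:.1f}, "
      f"regret: {fpl_cost - best_cost:+.1f}")  # regret: -0.2
```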


Extending FPL to Structured Problems: Online Shortest Paths

One of the major contributions of the paper is extending Follow the Perturbed Leader (FPL) beyond the experts setting to more structured problems, such as online shortest paths. This is important because treating every possible path as an independent expert is computationally infeasible when the number of paths is exponential in the number of edges.

1. The Online Shortest Path Problem

Problem Setup

  • Given a directed graph with \(n\) nodes and \(m\) edges.
  • Each edge \(e\) has a cost (e.g., travel time) \(c_t(e)\) at each time step \(t\).
  • The goal is to select a path from source \(s\) to destination \(t\) at each time step without knowing future costs.
  • After selecting a path, we observe the costs of all edges.

Objective

We want to ensure that the total travel time over \(T\) rounds is close to the best single path in hindsight.

Challenges

  • The number of possible paths grows exponentially with the number of nodes, so treating each path as a separate expert is infeasible.
  • Instead of treating whole paths as independent decisions, we need a way to apply FPL efficiently at the edge level.

2. Applying FPL to Online Shortest Paths

Naïve Approach (Infeasible)

  • If we were to apply vanilla FPL directly, we would:
    1. Treat each entire path as an “expert.”
    2. Maintain cumulative travel time for every path.
    3. Apply perturbations to total path costs.
    4. Choose the best path.
  • Problem: The number of paths grows exponentially, making this computationally infeasible.

Efficient Approach: FPL at the Edge Level

To make FPL work efficiently, we apply perturbations to edges instead of whole paths:

Follow the Perturbed Leading Path (FPL for Paths)

At each time step \(t\):

  1. For each edge \(e\), draw a random perturbation \(p_t(e)\) from an exponential distribution.
  2. Compute perturbed edge costs: \(\tilde{c}_t(e) = c_t(e) + p_t(e)\)
  3. Compute the shortest path under these perturbed edge costs.
  4. Follow that path for this round (a sketch follows below).
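
Here is a minimal sketch of one such round, assuming the graph is given as adjacency lists `{node: [(neighbor, edge_id), ...]}` with cumulative edge costs in a dict (all names and the data layout are illustrative, not from the paper):

```python
import heapq
import numpy as np

def perturbed_shortest_path(graph, cum_cost, eta, source, target, rng=None):
    """One FPL round: perturb cumulative edge costs, then run Dijkstra.

    graph: {node: [(neighbor, edge_id), ...]}, cum_cost: {edge_id: cost so far}.
    Returns the edge_ids of the chosen source -> target path.
    """
    if rng is None:
        rng = np.random.default_rng()

    # Steps 1-2: perturbed cost c~_t(e) = c_t(e) + p_t(e), p_t(e) exponential.
    tilde = {e: c + rng.exponential(scale=1.0 / eta) for e, c in cum_cost.items()}

    # Step 3: plain Dijkstra on the perturbed (nonnegative) weights.
    dist = {source: 0.0}
    back = {}  # node -> (previous node, edge_id taken)
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, e in graph.get(u, []):
            nd = d + tilde[e]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                back[v] = (u, e)
                heapq.heappush(heap, (nd, v))

    # Step 4: walk predecessors back from target to recover the chosen path.
    path, node = [], target
    while node != source:
        node, e = back[node]
        path.append(e)
    return path[::-1]
```

The per-round work is a single shortest-path computation over the perturbed weights, which is exactly why this beats maintaining an explicit weight per path.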

Why Does This Work?

  • Since perturbations are applied at the edge level, we avoid maintaining explicit costs for all paths.
  • The standard shortest path algorithm (e.g., Dijkstra’s) can efficiently compute the best path at each step.
  • Theoretical guarantees remain valid: The expected regret is at most:

    \[(1 + \epsilon) \times (\text{Best Path Cost}) + O(m n \log n)\]

    where \(m\) is the number of edges and \(n\) is the number of nodes.

3. Understanding the Regret Bound

We now derive the regret bound for online shortest paths.

Key Definitions

  • Let \(P_t\) be the path chosen by FPL at time \(t\).
  • Let \(P^*\) be the best fixed path in hindsight, i.e., the path with the lowest total cost over \(T\) rounds:

    \[P^* = \arg\min_{P} \sum_{t=1}^{T} c_t(P).\]
  • The regret of FPL is:

    \[\sum_{t=1}^{T} c_t(P_t) - \sum_{t=1}^{T} c_t(P^*).\]

Applying FPL Analysis

  • The perturbation ensures that bad paths are not chosen too often and good paths are discovered quickly.
  • The additional regret due to perturbation is \(O(m n \log n)\), independent of \(T\), so the average extra cost per round vanishes as \(T\) grows.

Thus, FPL guarantees:

\[E[\text{Total Cost of FPL}] \leq (1 + \epsilon) \times \text{Best Path Cost} + O(m n \log n).\]

4. Why is This Important?

1. Generalization to Graph-Structured Problems

  • This method can be applied to any structured problem where the decision space is large.
  • Example: Instead of treating each full decision tree as an expert, perturbations can be applied at the node level.

2. Computational Efficiency

  • Unlike exponential weighting algorithms (which require maintaining weights for each path), this approach only requires standard shortest-path computations.
  • Time complexity: \(O(mn)\) per round using Bellman-Ford, or \(O(m + n \log n)\) using Dijkstra's algorithm with a Fibonacci heap (the exponential perturbations keep edge weights nonnegative).

3. Practical Use Cases

  • Network Routing: Selecting optimal paths in a dynamic network.
  • Robot Navigation: Choosing paths in a changing environment.
  • Traffic Prediction: Adjusting routes based on real-time conditions.

5. Summary

Problem: Online shortest path selection where edge costs change over time.
FPL Extension: Instead of treating full paths as experts, apply perturbations to edges.
Algorithm:

  1. Add random noise to edge costs.
  2. Compute the shortest path with the perturbed costs.
  3. Follow that path.
Regret Bound:

  • Competitive with the best fixed path in hindsight.
  • Extra cost due to perturbations is small (only \(O(m n \log n)\)).

Key Benefit: Works efficiently even when the number of paths is exponential.