Optimizing Stochastic Gradient Descent - Key Recommendations for Effective Training

This is a continuation of the previous blog, and the content presented here consists of notes extracted from Léon Bottou’s Stochastic Gradient Descent Tricks. If you haven’t read my previous blog, I recommend taking a look, as the notations used here are introduced there. Alternatively, if you’re comfortable with the notations, you can jump straight into the content. So, let’s get started!

Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is a simplified version of the standard gradient descent algorithm. Rather than computing the exact gradient of the entire cost function \(E_n(f_w)\), each iteration of SGD estimates this gradient based on a single randomly chosen example \(z_t\). The update rule for the weights \(w\) at each iteration is:

\[w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t)\]

where \(\gamma_t\) is the learning rate, and \(Q(z_t, w_t)\) represents the cost function evaluated at the current weights \(w_t\) for the randomly selected example \(z_t\).
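
To make the update rule concrete, here is a minimal sketch of an SGD loop, assuming a linear model with squared loss \(Q(z, w) = \tfrac{1}{2}(w \cdot x - y)^2\); the function and parameter names are illustrative, not from Bottou’s paper.

```python
import numpy as np

def sgd(examples, w0, gamma0=0.1, lam=1e-2, n_epochs=1, seed=0):
    """Minimal SGD loop: one randomly chosen example per update.

    Assumes examples are (x, y) pairs and squared loss Q(z, w) = 0.5 * (w.x - y)^2.
    """
    w = np.array(w0, dtype=float)
    rng = np.random.default_rng(seed)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(examples)):         # visit examples in random order
            x, y = examples[i]
            gamma_t = gamma0 / (1.0 + gamma0 * lam * t)  # decreasing learning rate
            grad = (w @ x - y) * x                       # gradient of Q(z_t, w_t) for squared loss
            w = w - gamma_t * grad                       # w_{t+1} = w_t - gamma_t * grad
            t += 1
    return w
```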

Key Features of SGD:
  • Randomness: The algorithm’s stochastic nature means that the updates depend on the examples randomly picked at each iteration. This randomness introduces some noise into the optimization process, but it is hoped that the algorithm behaves like its expectation despite this noise.

  • On-the-Fly Computation: Since SGD does not need to store information about the examples visited in previous iterations, it can process examples one by one. This makes it suitable for online learning or deployed systems, where data can arrive sequentially, and the model is updated in real-time.

  • Expected Risk Optimization: In a deployed system, where examples are drawn randomly from the ground truth distribution, SGD directly optimizes the expected risk, which is the expected value of the loss function over all possible examples. To put it simply: each loss computed during an SGD update serves as an approximation of the expected loss over the true data distribution, and as more examples are processed, SGD gradually optimizes the expected risk.

Convergence of SGD:

The convergence of SGD has been studied extensively in the stochastic approximation literature. Convergence typically requires that the learning rates satisfy the conditions:

\[\sum_{t=1}^{\infty} \gamma_t = \infty \quad \text{and} \quad \sum_{t=1}^{\infty} \gamma_t^2 < \infty\]

The first condition requires the total sum of the learning rates to diverge, so the algorithm always has enough “travel budget” left to reach the optimum from wherever it currently is. The second condition requires the sum of the squared learning rates to remain finite, which forces the updates to become small enough that the gradient noise averages out as the algorithm proceeds.

These conditions strike a balance between making large enough updates early on to explore the parameter space and small enough updates later on to fine-tune the model without overshooting the optimum.
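
As a quick sanity check on these conditions, a schedule of the form \(\gamma_t = \gamma_0 / (1 + \gamma_0 \lambda t)\), which behaves like \(1/t\) for large \(t\), satisfies both: its partial sums keep growing without bound while its squared partial sums approach a finite limit. A small numeric sketch (the constants are arbitrary illustrative values):

```python
import numpy as np

gamma0, lam = 0.5, 0.1
t = np.arange(1, 1_000_001)
gamma = gamma0 / (1.0 + gamma0 * lam * t)   # ~1/t schedule

print(gamma.sum())          # large, and keeps growing as more terms are added (diverges like log t)
print((gamma ** 2).sum())   # approaches a finite limit (converges)
```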

The Robbins-Siegmund theorem offers a formal way to prove that, under the right conditions (such as appropriately decreasing learning rates), SGD converges almost surely even when the loss function is non-smooth, for example when the loss has kinks where the gradient jumps. This makes SGD a robust optimization method.

Convergence Speed:

The speed of convergence in SGD is ultimately limited by the noisy gradient approximations. Several factors impact the rate at which the algorithm converges:

  • Learning Rate Decay:
    • If the learning rates decrease too slowly, the variance of the parameter estimates \(w_t\) decreases equally slowly. Why? The updates stay large for too long, so the gradient noise keeps the estimates bouncing around and the algorithm struggles to stabilize near the optimum.
    • If the learning rates decay too quickly, the parameter estimates \(w_t\) take a very long time to approach the optimum. Why? The steps shrink before the algorithm has covered enough ground, so the remaining distance to the optimum is closed extremely slowly.
  • Optimal Convergence Speed:
    • When the Hessian matrix of the cost function at the optimum is strictly positive definite, the best convergence rate is achieved using learning rates of the form \(\gamma_t \sim t^{-1}\). In this case, the expectation of the residual error \(\rho\) decreases at the same rate, i.e., \(E(\rho) \sim t^{-1}\). This rate is commonly observed in practice.
  • Relaxed Assumptions:
    • When these regularity assumptions (such as a positive definite Hessian, smoothness, and strong convexity) are relaxed, the convergence rate slows down. The theoretical rate in such cases is typically \(E(\rho) \sim t^{-1/2}\). In practice, however, this slower regime tends to appear only during the final stages of optimization, and training is often stopped before that stage is reached, making the slower rate less significant.

In summary, while the convergence of SGD can be slow due to its noisy nature, proper management of the learning rate and understanding of the problem’s characteristics can ensure good performance in practice.


Second-Order Stochastic Gradient Descent (2SGD)

Second-Order Stochastic Gradient Descent (2SGD) extends stochastic gradient descent by incorporating curvature information through a positive definite matrix \(\Gamma_t\), which approximates the inverse of the Hessian matrix. The update rule for 2SGD is:

\[w_{t+1} = w_t - \gamma_t \Gamma_t \nabla_w Q(z_t, w_t),\]

where:

  • \(w_t\): Current weights at iteration \(t\).
  • \(\gamma_t\): Learning rate (step size), which may vary over iterations.
  • \(\Gamma_t\): A positive definite matrix that approximates the inverse of the Hessian.
  • \(\nabla_w Q(z_t, w_t)\): Gradient of the loss function \(Q\) with respect to \(w_t\) for the stochastic sample \(z_t\).
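
In practice \(\Gamma_t\) is rarely a full inverse Hessian; a common cheap choice is a diagonal approximation. The sketch below illustrates one such 2SGD step for squared loss, using a running estimate of the diagonal of the Hessian; the smoothing constants and names are assumptions for illustration only.

```python
import numpy as np

def sgd2_step(w, x, y, gamma_t, h_diag, eps=1e-8):
    """One 2SGD-style update with a diagonal preconditioner Gamma_t ~ H^{-1}.

    Squared loss Q(z, w) = 0.5 * (w.x - y)^2 has Hessian x x^T; we keep only
    its diagonal as a cheap running curvature estimate.
    """
    grad = (w @ x - y) * x                    # gradient of Q(z_t, w_t)
    h_diag = 0.99 * h_diag + 0.01 * x * x     # running diagonal Hessian estimate
    gamma_diag = 1.0 / (h_diag + eps)         # diagonal of Gamma_t, approximating H^{-1}
    w = w - gamma_t * gamma_diag * grad       # w_{t+1} = w_t - gamma_t * Gamma_t * grad
    return w, h_diag
```
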
Key Advantages of 2SGD
  1. Curvature Awareness:
    • The inclusion of \(\Gamma_t\) enables the algorithm to account for the curvature of the loss surface.
    • This adaptation improves convergence by rescaling updates to balance faster progress in flat directions and slower progress in steep directions.
  2. Improved Constants:
    • The scaling introduced by \(\Gamma_t\) can reduce the condition number of the problem. What? The condition number is the ratio of the largest to smallest eigenvalue of the Hessian, reflecting the curvature’s uniformity. A high condition number implies uneven curvature, slowing convergence.
    • 2SGD addresses this by rescaling the parameter space to reduce the condition number, making the optimization landscape more uniform; this yields faster progress per iteration than standard SGD (see the small numerical sketch below).
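
To see what reducing the condition number means numerically, the sketch below builds a toy Hessian with very uneven curvature and shows how a simple diagonal rescaling (one possible choice of preconditioner) brings the condition number down; the matrices and numbers are purely illustrative.

```python
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0,   1.0]])                  # toy Hessian with very uneven curvature
eigs = np.linalg.eigvalsh(H)
print(eigs.max() / eigs.min())                # condition number = 100: very non-uniform curvature

D = np.diag(1.0 / np.sqrt(np.diag(H)))        # simple diagonal rescaling (preconditioner)
H_scaled = D @ H @ D
eigs_scaled = np.linalg.eigvalsh(H_scaled)
print(eigs_scaled.max() / eigs_scaled.min())  # condition number = 1: uniform curvature
```
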
Challenges in 2SGD

Despite the advantages, 2SGD has significant limitations:

  1. Stochastic Noise:
    • The introduction of \(\Gamma_t\) does not address the stochastic noise inherent in gradient estimates.
    • As a result, the variance in the weights \(w_t\) remains high, which limits its convergence benefits.
  2. Asymptotic Behavior:
    • The expected residual error decreases at a rate of \(\mathbb{E}[\rho] \sim t^{-1}\) at best.
    • While the constant factors in the convergence bound are improved, the rate itself remains fundamentally constrained by the stochastic nature of the gradients.
Comparison with Batch Algorithms

Batch Algorithms:

  • Batch methods utilize the full dataset to compute gradients at each iteration.
  • They achieve better asymptotic optimization performance: full-batch gradient descent converges linearly (the error shrinks by a constant factor each iteration), and second-order batch methods converge faster still.

2SGD:

  • 2SGD operates on a per-sample basis, which limits its ability to achieve higher convergence rates in expectation.
  • The variance introduced by stochastic gradients limits its asymptotic efficiency compared to batch methods.
The Bigger Picture

Despite being asymptotically slower than batch algorithms, 2SGD remains highly relevant in modern machine learning for many reasons:

  1. Efficiency in Large Datasets:
    • When datasets are too large to process as a batch, 2SGD provides an efficient alternative.
    • It avoids the computational and memory overhead of storing and processing the entire dataset.
  2. Online Learning:
    • In online learning scenarios, where data arrives sequentially, 2SGD offers a practical approach to updating models in real time.
Summary of Convergence Behavior

| Algorithm | Error Decay | Asymptotic Behavior |
|---|---|---|
| Gradient Descent (GD) | \(\lvert w_t - w^* \rvert \sim \rho^t\) | Linear convergence (exponential error decay) |
| Stochastic Gradient Descent (SGD) | \(\mathbb{E}[\lvert w_t - w^* \rvert] \sim t^{-1}\) | Asymptotic rate \(t^{-1}\) |
| Second-Order Stochastic GD (2SGD) | \(\mathbb{E}[\lvert w_t - w^* \rvert] \sim t^{-1}\) | Same rate as SGD, but with improved constants |

Note:

  • Linear Convergence (\(\rho^t\)): Implies an exponential decay of the error over time, with the error shrinking by a constant factor at each step.

  • Asymptotic Rate (\(t^{-1}\)): Describes the long-term error decay rate, indicating a polynomial decay (slower than exponential) where the error decreases inversely with time.

By incorporating second-order information through \(\Gamma_t\), 2SGD makes more informed updates. However, its performance is ultimately limited by the stochastic noise in gradient estimates. In practice, 2SGD is a compromise between computational efficiency and convergence speed, making it suitable for large-scale and online learning tasks.

When to Use Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) is particularly well-suited when training time is the bottleneck. It is an effective choice in scenarios where computational efficiency and scalability are critical, such as in large-scale machine learning tasks.

Key Insights from the Table

The table below summarizes the asymptotic behavior of four optimization algorithms:

  • Gradient Descent (GD): Standard first-order method.
  • Second-Order Gradient Descent (2GD): Incorporates curvature information.
  • Stochastic Gradient Descent (SGD): A stochastic variant of GD.
  • Second-Order Stochastic Gradient Descent (2SGD): Combines stochastic updates with curvature adaptation.

| Algorithm | Time per Iteration | Iterations to Accuracy \(\rho\) | Time to Accuracy \(\rho\) | Time to Excess Error \(\epsilon\) |
|---|---|---|---|---|
| Gradient Descent (GD) | \(n\) | \(\log(1/\rho)\) | \(n \log(1/\rho)\) | \(\frac{1}{\epsilon^{1/\alpha}} \log(1/\epsilon)\) |
| Second-Order Gradient Descent (2GD) | \(n\) | \(\log\log(1/\rho)\) | \(n \log\log(1/\rho)\) | \(\frac{1}{\epsilon^{1/\alpha}} \log(1/\epsilon) \log\log(1/\epsilon)\) |
| Stochastic Gradient Descent (SGD) | \(1\) | \(1/\rho\) | \(1/\rho\) | \(1/\epsilon\) |
| Second-Order Stochastic GD (2SGD) | \(1\) | \(1/\rho\) | \(1/\rho\) | \(1/\epsilon\) |

Discussion
  1. Per-Iteration Cost:
    • GD and 2GD: Both require \(\mathcal{O}(n)\) time per iteration due to full-batch gradient computations.
    • SGD and 2SGD: Require \(\mathcal{O}(1)\) time per iteration, making them computationally inexpensive for large datasets.
  2. Convergence Speed:
    • GD and 2GD converge faster in terms of the number of iterations but incur higher computational costs because of full-batch updates.
    • SGD and 2SGD require more iterations to converge but compensate with lower per-iteration costs.
  3. Asymptotic Performance:
    • While SGD and 2SGD suffer from noisier, less accurate optimization per iteration, they require significantly less total time to achieve a predefined expected risk \(\epsilon\) because each iteration is so cheap.
    • In large-scale settings where computation time is the limiting factor, stochastic learning algorithms are asymptotically better.
Key Takeaways
  • Use SGD when:
    • Dataset size is large, and full-batch methods become computationally infeasible.
    • Real-time or online learning scenarios require frequent updates with minimal latency.
    • Memory efficiency is a concern, as SGD processes one sample at a time.
  • Despite higher variance in updates, SGD and 2SGD are preferred in large-scale setups due to their faster convergence to the expected risk with minimal computational resources.

In conclusion, while SGD and 2SGD might appear less efficient in small-scale setups, their practical advantages in high-dimensional, data-intensive tasks make them highly favorable in modern machine learning applications.


General Recommendations for Stochastic Gradient Descent (SGD)

The following is a series of recommendations for using stochastic gradient algorithms. Though they may seem trivial, the author’s experience highlights how easily they can be overlooked.

1. Randomly Shuffle the Training Examples

Although the theory behind Stochastic Gradient Descent (SGD) calls for picking examples randomly, it is often tempting to process them sequentially through the training set. While sequentially passing through the examples may seem like an optimization, it can be problematic when the data is structured in a way that affects training performance.

Key Points:
  • Class Grouping and Order: If training examples are grouped by class or presented in a particular order, processing them in sequence can lead to biases in the gradient updates.
  • The Importance of Randomization: Randomizing the order helps break any inherent structure or patterns in the dataset that may skew the learning process. This ensures that each update is less dependent on the order of the examples, promoting better convergence.
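
A minimal sketch of this recommendation: draw a fresh random ordering of the example indices at the start of every epoch instead of walking through the data in its stored (possibly class-sorted) order. The helper name and seed handling are illustrative.

```python
import numpy as np

def epoch_indices(n_examples, epoch, base_seed=42):
    """Return a fresh random ordering of example indices for each epoch."""
    rng = np.random.default_rng(base_seed + epoch)
    return rng.permutation(n_examples)

# usage sketch: iterate the examples in a different shuffled order every epoch
# for epoch in range(n_epochs):
#     for i in epoch_indices(len(dataset), epoch):
#         w = sgd_update(w, dataset[i])
```
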
Analogy:

Think of SGD like a person learning to navigate a maze. If they always follow the same path (training examples in order), they may become “stuck” in a loop. However, if they randomly choose different routes (randomized examples), they are more likely to explore and discover the optimal path.

2. Use Preconditioning Techniques

Stochastic Gradient Descent (SGD) is a first-order optimization algorithm, meaning it only uses the first derivatives (gradients) to guide the updates. However, this can lead to significant issues when the optimization process encounters areas where the Hessian (the matrix of second derivatives) is ill-conditioned. In such regions, the gradients may not provide efficient updates, slowing down convergence or leading to poor results.

Fortunately, preconditioning techniques such as Adagrad or Adam adjust the learning rates based on past gradients, helping the optimizer make progress in ill-conditioned regions and converge faster and more stably.
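
As a concrete illustration of per-parameter preconditioning, here is a minimal Adagrad-style update, a sketch of the general idea rather than a faithful reproduction of any published implementation:

```python
import numpy as np

def adagrad_step(w, grad, grad_sq_sum, gamma=0.01, eps=1e-8):
    """Adagrad-style update: scale each coordinate by its accumulated gradient history."""
    grad_sq_sum = grad_sq_sum + grad ** 2              # per-parameter sum of squared gradients
    adjusted = gamma / (np.sqrt(grad_sq_sum) + eps)    # smaller steps where gradients have been large
    w = w - adjusted * grad
    return w, grad_sq_sum
```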

Key Points:
  • Ill-conditioned regions: Areas where the curvature (second derivatives) of the cost function varies dramatically, making it hard for SGD to make efficient progress.
  • Improved convergence: Preconditioning techniques can rescale the gradients to make the learning process more stable and faster, improving convergence even in difficult regions.
Analogy:

Imagine trying to push a boulder up a steep hill (representing optimization in ill-conditioned areas). Without a proper approach, the effort may be inefficient or lead you off-course. Preconditioning techniques act like a ramp, providing a smoother path and making it easier to move the boulder in the right direction.

3. Monitor Both the Training Cost and the Validation Error

To effectively gauge the performance of your model during training, it is crucial to monitor both the training cost and the validation error. A simple yet effective approach involves repeating the following steps:

Key Steps:
  1. Stochastic Gradient Descent (SGD) Update: Make one pass through the shuffled training set, applying the SGD update to each example. This adjusts the model’s parameters based on the current data.

  2. Compute Training Cost: After the updates, run another loop over the training set to compute the training cost. This cost represents the criterion (such as the loss function) the algorithm is optimizing. Monitoring the training cost provides insight into how well the model is minimizing the objective.

  3. Compute Validation Error: With another loop, calculate the validation error using the validation set. This error is the performance measure of interest (such as classification error, accuracy, etc.). The validation error helps track how well the model generalizes to unseen data.

Although these steps require additional computational effort, including extra passes over both the training and validation datasets, they provide critical feedback: the training cost tells you whether optimization is actually making progress, and the validation error warns you early about overfitting or divergence instead of letting the model train in isolation.
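
Here is a sketch of this three-step loop, assuming a simple linear model with squared loss as the training criterion and mean absolute error as the validation metric; all names and values are illustrative placeholders for your own model, cost, and metric.

```python
import numpy as np

def sgd_update(w, z, gamma):
    """One SGD step for squared loss Q(z, w) = 0.5 * (w.x - y)^2."""
    x, y = z
    return w - gamma * (w @ x - y) * x

def train_cost(w, data):
    """Average training loss (the criterion SGD is optimizing)."""
    return np.mean([0.5 * (w @ x - y) ** 2 for x, y in data])

def val_error(w, data):
    """Validation metric of interest; here, mean absolute error."""
    return np.mean([abs(w @ x - y) for x, y in data])

def train_with_monitoring(w, train_set, val_set, gamma=0.05, n_epochs=5, seed=0):
    rng = np.random.default_rng(seed)
    for epoch in range(n_epochs):
        # 1. one SGD pass over the shuffled training set
        for i in rng.permutation(len(train_set)):
            w = sgd_update(w, train_set[i], gamma)
        # 2. extra pass over the training set to compute the training cost
        cost = train_cost(w, train_set)
        # 3. extra pass over the validation set to compute the validation error
        err = val_error(w, val_set)
        print(f"epoch {epoch}: train cost {cost:.4f}, val error {err:.4f}")
    return w
```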

Analogy:

Think of training a model like tuning a musical instrument. The training cost is like checking the sound of the instrument while you play (adjusting and fine-tuning as you go), while the validation error is like getting feedback from a concert audience (seeing how the performance holds up in a real-world scenario). Without both, you might end up with a well-tuned instrument that doesn’t sound good in a performance.

4. Check the Gradients Using Finite Differences

When the computation of gradients is slightly incorrect, Stochastic Gradient Descent (SGD) tends to behave slowly and erratically. This often leads to the misconception that such behavior is the normal operation of the algorithm.

Over the years, many practitioners have sought advice on how to set the learning rates \(\gamma_t\) for a rebellious SGD program. However, the best advice is often to forget about the learning rates and ensure that the gradients are being computed correctly. Once gradients are correctly computed, setting small enough learning rates becomes easy. Those who struggle with tuning learning rates often have faulty gradients in their computations.

How to Check Gradients Using Finite Differences:

Rather than manually checking each line of the gradient computation code, use finite differences to verify the accuracy of the gradients.

Steps:
  1. Pick an Example: Choose a training example \(z\) from the dataset.

  2. Compute the Loss: Calculate the loss function \(Q(z, w)\) for the current weights \(w\).

  3. Compute the Gradient: Calculate the gradient of the loss with respect to the weights: \(g = \nabla_w Q(z, w)\)

  4. Apply a Perturbation: Slightly perturb the weights. This can be done by either:
    • Changing a single weight by a small increment: \(w' = w + \delta\)
    • Perturbing the weights using the gradient: \(w' = w - \gamma g\), where \(\gamma\) is small enough.
  5. Compute the New Loss: After applying the perturbation, compute the new loss \(Q(z, w')\).

  6. Verify the Approximation: Ensure that the new loss approximates the original loss plus the perturbation multiplied by the gradient: \(Q(z, w') \approx Q(z, w) + \delta g\)

As an example, consider the MSE loss function:

\[Q(z, w) = (w - z)^2\]
  1. Pick an Example: Let \(z = 5\), \(w = 4\).
  2. Compute the Loss: \(Q(z, w) = (4 - 5)^2 = 1\)
  3. Compute the Gradient: \(g = \nabla_w Q(z, w) = 2(w - z) = -2\)
  4. Apply Perturbation: \(w' = w + 0.01 = 4.01\)
  5. Compute the New Loss: \(Q(z, w') = (4.01 - 5)^2 = 0.9801\)
  6. Verify: \(Q(z, w') \approx Q(z, w) + 0.01 \cdot (-2) = 0.98\)
Automating the Process:

This process can be automated and should be repeated for many examples \(z\), many perturbations \(\delta\), and many initial weights \(w\). Often, flaws in the gradient computation only appear under peculiar conditions, and it’s not uncommon to discover such bugs in SGD code that has been used for years without issue.
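
The whole procedure can be wrapped in a small checker that compares the analytic gradient against a finite-difference estimate for many random examples, perturbations, and starting weights. The sketch below does this for an illustrative squared-error loss; the tolerance and number of trials are arbitrary choices.

```python
import numpy as np

def check_gradients(loss, grad, dim, n_trials=1000, delta=1e-5, tol=1e-3, seed=0):
    """Compare analytic gradients with finite differences over many random cases."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        w = rng.normal(size=dim)                     # random initial weights
        z = rng.normal(size=dim), rng.normal()       # random example z = (x, y)
        g = grad(z, w)
        for j in range(dim):                         # perturb one weight at a time
            w_pert = w.copy()
            w_pert[j] += delta
            fd = (loss(z, w_pert) - loss(z, w)) / delta   # finite-difference estimate of g[j]
            if abs(fd - g[j]) > tol * (1.0 + abs(g[j])):
                raise AssertionError(f"gradient mismatch at weight {j}: {fd} vs {g[j]}")
    print("all gradient checks passed")

# illustrative loss Q(z, w) = 0.5 * (w.x - y)^2 and its analytic gradient
loss = lambda z, w: 0.5 * (w @ z[0] - z[1]) ** 2
grad = lambda z, w: (w @ z[0] - z[1]) * z[0]
check_gradients(loss, grad, dim=3)
```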

Analogy:

Think of gradient checking like testing the brakes of a car. If the brakes (gradients) are faulty, the car (SGD) might not stop properly, leading to erratic behavior. Instead of repeatedly adjusting the speed (learning rate), you test the brakes by applying a small perturbation to the system (finite differences). If the brakes are working well, the car will stop smoothly at the right place (convergence).

5. Experiment with the Learning Rates \(\gamma_t\) Using a Small Sample of the Training Set

The mathematics behind Stochastic Gradient Descent (SGD) are surprisingly independent of the training set size. Specifically, the asymptotic convergence rates of SGD are not influenced by the sample size. This means that once you’ve ensured the gradients are correct, the most effective way to determine appropriate learning rates is to experiment with a small, but representative sample of the training set.

Key Steps:
  1. Use a Small Sample: Select a small subset of the training data that still reflects the diversity of the full dataset. The small size allows you to test different learning rates quickly without incurring the computational cost of working with the entire dataset.

  2. Traditional Optimization Methods: Since the sample is small, you can apply traditional optimization techniques (e.g., gradient descent or other optimization algorithms) to find a reference point and set the training cost target. This provides a useful benchmark for SGD.

  3. Refining Learning Rates: Experiment with various learning rates on this small dataset to find a value that minimizes the training cost efficiently. Once you identify a good learning rate, it’s likely to work well on the full dataset.

  4. Scale to Full Dataset: Once the learning rates are set based on the small sample, use the same rates on the full training set. Keep in mind that the performance on the validation set is expected to plateau after a number of epochs. The number of epochs required to reach this plateau should be roughly the same as what was needed on the small dataset.
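
A sketch of this procedure: carve out a small random subsample, try a handful of candidate learning rates on it with plain SGD, and keep the rate that drives the training cost down fastest. The candidate rates, sample size, and squared-loss model below are illustrative assumptions.

```python
import numpy as np

def pick_learning_rate(train_set, candidates=(1.0, 0.3, 0.1, 0.03, 0.01),
                       sample_size=1000, n_epochs=3, seed=0):
    """Try several learning rates on a small random subsample of the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(train_set), size=min(sample_size, len(train_set)), replace=False)
    sample = [train_set[i] for i in idx]

    best_gamma, best_cost = None, np.inf
    for gamma in candidates:
        w = np.zeros(len(sample[0][0]))                # fresh start for each candidate rate
        for _ in range(n_epochs):
            for i in rng.permutation(len(sample)):
                x, y = sample[i]
                w -= gamma * (w @ x - y) * x           # SGD step with squared loss
        cost = np.mean([0.5 * (w @ x - y) ** 2 for x, y in sample])
        if cost < best_cost:                           # diverging rates give huge/NaN costs and are never picked
            best_gamma, best_cost = gamma, cost
    return best_gamma  # then reuse this rate for training on the full dataset
```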

Analogy:

Think of this like testing the settings of a new recipe. Instead of preparing a full meal, you start with a small portion of ingredients (a small sample). Once you find the perfect amount of seasoning (learning rates), you can apply it to the full dish (the entire training set). While the small sample may not capture every nuance of the full dish, it gives you a good starting point without wasting resources.


This concludes the key points related to Stochastic Gradient Descent (SGD). After iterating through Gradient Descent (GD) and SGD multiple times, I hope the concepts are now firmly imprinted, even if briefly. In the upcoming blog posts, we will delve into loss functions and regression, so stay tuned!