From Baseline to DeepSeek: Single-GPU MoE Training Efficiency
A systems-level analysis of training Mixture-of-Experts (MoE) Transformer models under single-GPU constraints. We compare a naive PyTorch MoE implementation, ScatterMoE, MegaBlocks, and DeepSeek-inspired architectures, revealing trade-offs between convergence behavior, memory footprint, and training throughput.
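For orientation, the sketch below shows what a naive PyTorch MoE baseline might look like: a softmax top-k router followed by a Python loop that masks and dispatches tokens to each expert. The class name `NaiveMoE`, the hyperparameters, and the top-2 routing are illustrative assumptions, not this project's actual implementation; the point is only to show why per-expert looping over masked token subsets is the kind of baseline that grouped or block-sparse kernels (ScatterMoE, MegaBlocks) aim to outperform.

```python
import torch
import torch.nn as nn


class NaiveMoE(nn.Module):
    """Illustrative naive MoE layer: top-k softmax router + per-expert Python loop.

    A minimal sketch under assumed hyperparameters, not the repository's code.
    """

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten to tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                                     # (T, E)
        weights, indices = logits.softmax(-1).topk(self.top_k, dim=-1)   # (T, k)
        weights = weights / weights.sum(dim=-1, keepdim=True)            # renormalize over chosen experts

        out = torch.zeros_like(tokens)
        # The "naive" part: a Python loop with boolean masks per expert and slot.
        # Each iteration launches small, irregular kernels, which is what
        # grouped/block-sparse MoE kernels avoid.
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = NaiveMoE(d_model=256, d_ff=1024)
    y = layer(torch.randn(2, 16, 256))
    print(y.shape)  # torch.Size([2, 16, 256])
```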