-
GEMM Kernel Optimization Notes
My notes from Simon Boehm's CUDA GEMM optimization blog
-
SiLU+Mul+FP8 Block Quant Pattern Matching Pipeline - vLLM Notes
Detailed walkthrough of vLLM's torch.compile pattern matching pipeline that fuses SiLU+Mul and FP8 block quantization into a single kernel launch, covering FX graphs, matchers, and the dispatch machinery
-
Fused SiLU+Mul+FP8 Block Quantization CUDA Kernel - vLLM Notes
Detailed walkthrough of a fused SiLU+Mul+FP8 block quantization CUDA kernel for vLLM, covering memory access patterns, quantization math, and dispatch mechanics
-
Anatomy of a Spark Job Run
Complete flow of a Spark job run
-
Transformer Block FLOPs & Parameters Calculations
Resource accounting for a Transformer block