How Compilers Optimize Kernels
When I first learned how a simple kernel like matrix multiplication actually runs on hardware, I was surprised by how much performance comes down to how the loops are structured, not just what the math is doing. At first glance, C = A × B looks harmless. But even a 512×512 multiply takes 512³ ≈ 134 million multiply-add operations. That’s where compilers and kernel engineers step in: rearranging loops, tuning memory access, and squeezing every bit of performance from the hardware....
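To make the loop-structure point concrete, here is a minimal sketch in plain Python. The function names matmul_ijk and matmul_ikj are made up for illustration and are not taken from any compiler or library. Both versions do the same n³ multiply-adds and produce the same result. The only difference is the loop order, which decides whether the innermost loop walks down a column of B or along a row of B. Python lists will not show the cache effect the way compiled code on row-major arrays would, so treat this as a picture of the transformation (usually called loop interchange), not as a benchmark.

```python
def matmul_ijk(A, B, n):
    """Textbook order: the innermost loop (k) walks down a column of B."""
    C = [[0.0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += A[i][k] * B[k][j]
                ops += 1
            C[i][j] = acc
    return C, ops


def matmul_ikj(A, B, n):
    """Interchanged order: the innermost loop (j) walks along a row of B and C."""
    C = [[0.0] * n for _ in range(n)]
    ops = 0
    for i in range(n):
        for k in range(n):
            a = A[i][k]      # reused across the whole inner loop
            row_b = B[k]     # one row of B, read left to right
            row_c = C[i]     # one row of C, updated left to right
            for j in range(n):
                row_c[j] += a * row_b[j]
                ops += 1
    return C, ops


if __name__ == "__main__":
    n = 8  # kept small so pure Python finishes instantly
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
    C1, ops1 = matmul_ijk(A, B, n)
    C2, ops2 = matmul_ikj(A, B, n)
    print(C1 == C2, ops1, ops2)  # same result, same operation count (n**3)
    print(512 ** 3)              # 134217728 multiply-adds for n = 512
```

The operation count is identical in both versions, which is the point: the math does not change, only the order in which memory is touched.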