How Compilers Optimize Kernels
When I first learned how a simple kernel like matrix multiplication actually runs on hardware, I was surprised by how much performance comes down to how the loops are structured, not just what the math is doing. At first glance, C = A × B looks harmless. But even a 512×512 multiply involves over 100 million operations. That’s where compilers and kernel engineers step in - rearranging loops, tuning memory access, and squeezing every bit of performance from the hardware....
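The excerpt's point about loop structure can be made concrete with a tiny sketch (not from the post itself): the same C = A × B math written with two different loop orders. The `i-k-j` order below streams across rows of B, which tends to be friendlier to caches in row-major layouts than the textbook `i-j-k` order; the arithmetic is identical either way.

```python
def matmul(A, B):
    """Naive triple-loop matrix multiply: C[i][j] += A[i][k] * B[k][j].

    Uses the i-k-j loop order: the innermost loop walks one row of B
    and one row of C contiguously, instead of striding down a column
    of B as the i-j-k order would.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]  # hoisted: constant across the inner loop
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C
```

For n = 512 this inner statement runs 512³ ≈ 134 million times (each a multiply and an add), which is where the "over 100 million operations" figure comes from.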
The Visitor Design Pattern
Description
Adding new methods within subclasses works fine if the program being developed isn’t complex - simply add the methods to each of the individual subclasses. For example, if we have an Animal parent class, each of the Animal subclasses may need to implement a makeSounds() method. However, when our codebase requires dozens of new methods to be implemented across dozens of subclasses, the code within each subclass becomes cluttered and difficult to maintain....
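The problem described above is what Visitor solves: instead of adding makeSounds() (and every future method) to each subclass, the new behavior lives in one visitor object and each subclass only implements accept(). A minimal sketch, assuming illustrative Dog/Cat subclasses that are not named in the excerpt:

```python
class Dog:
    def accept(self, visitor):
        # Double dispatch: the subclass picks which visit method runs.
        return visitor.visit_dog(self)

class Cat:
    def accept(self, visitor):
        return visitor.visit_cat(self)

# Each new behavior is one new visitor class, not one new method
# scattered across every Animal subclass.
class MakeSoundVisitor:
    def visit_dog(self, dog):
        return "woof"

    def visit_cat(self, cat):
        return "meow"
```

Adding, say, a feeding behavior later means writing one FeedVisitor rather than touching every subclass, at the cost of having to extend every visitor when a new subclass appears.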
Getting Started with MLIR
Most of my experience so far has been with CPU compilers. I’m used to thinking about traditional compiler pipelines that start from source code, parse it into an AST, lower it to an intermediate representation (IR), optimize it, and then generate machine code. The work usually revolves around control flow, register allocation, and data dependencies. It’s all about turning human-written code into something that runs efficiently on a CPU. But machine learning (ML) compilers operate in a completely different world....
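The pipeline described above (source → AST → IR → optimize → machine code) can be sketched end to end for a toy expression language; this is a deliberately tiny illustration, nothing like a real compiler's structure. It parses an arithmetic expression with Python's own ast module, lowers the AST to a stack-machine IR, and runs one optimization pass (constant folding):

```python
import ast
import operator

def lower(node, ir):
    """Lower a Python expression AST into stack-machine IR tuples."""
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        lower(node.left, ir)
        lower(node.right, ir)
        ir.append(("add",))
    elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
        lower(node.left, ir)
        lower(node.right, ir)
        ir.append(("mul",))
    elif isinstance(node, ast.Constant):
        ir.append(("push", node.value))
    else:
        raise NotImplementedError(ast.dump(node))
    return ir

def fold(ir):
    """Peephole pass: rewrite push, push, op into a single push."""
    out = []
    for instr in ir:
        if instr[0] in ("add", "mul") and len(out) >= 2 \
                and out[-1][0] == out[-2][0] == "push":
            op = operator.add if instr[0] == "add" else operator.mul
            b, a = out.pop()[1], out.pop()[1]
            out.append(("push", op(a, b)))
        else:
            out.append(instr)
    return out

# Parse "2*3+4" into an AST, lower it, then optimize:
ir = lower(ast.parse("2*3+4", mode="eval").body, [])
optimized = fold(ir)
```

Here the whole expression is constant, so folding collapses the IR to a single push; in a real pipeline, code generation would then map the surviving IR to machine instructions. ML compilers like MLIR start from a very different place, which the rest of the post explores.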