When I first learned how a simple kernel like matrix multiplication actually runs on hardware, I was surprised by how much performance comes down to how the loops are structured, not just what the math is doing.

At first glance, C = A × B looks harmless. But even a 512×512 multiply involves roughly 134 million multiply-adds (512³ of them). That’s where compilers and kernel engineers step in - rearranging loops, tuning memory access, and squeezing every bit of performance from the hardware.

The Naive Implementation

A basic matrix multiply might look like this:

// assumes C is zero-initialized beforehand
for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    for (int k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];

The reason this version is so slow is the memory access pattern, not the arithmetic. Matrices are stored in row-major order, so A[i][k] walks along a row of A with good locality - but B[k][j] walks down a column of B, jumping a full row of N elements ahead on every iteration of the inner k loop. That means almost every access to B[k][j] forces the CPU to bring in a completely new cache line, of which only a single element gets used before the loop moves on.

Even worse, by the time the loop returns to a value it has seen before, the cache has already evicted it, so it must be loaded again from RAM. Main memory is orders of magnitude slower than the L1 and L2 caches, so the program ends up spending far more time waiting for data than doing actual multiplication. This is why the naïve implementation becomes painfully memory-bound as N grows.

Loop Tiling (Blocking)

To fix that, compilers use loop tiling, also known as blocking. It breaks large loops into smaller chunks that fit in cache.

for (int i0 = 0; i0 < N; i0 += T)
  for (int j0 = 0; j0 < N; j0 += T)
    for (int k0 = 0; k0 < N; k0 += T)
      // multiply T×T sub-matrices (assumes N is a multiple of T)
      for (int i = i0; i < i0 + T; i++)
        for (int j = j0; j < j0 + T; j++)
          for (int k = k0; k < k0 + T; k++)
            C[i][j] += A[i][k] * B[k][j];

This way, each T×T tile of A, B, and C is reused many times while it still sits in cache, instead of being evicted between uses. The compiler or JIT can even adjust the tile size T depending on the hardware’s cache size.

Loop Unrolling and Vectorization

Another trick is loop unrolling - replacing a short loop with repeated statements. This reduces loop overhead (the branch and counter update paid on every iteration) and exposes more independent instructions the CPU can execute in parallel.

// Instead of looping four times:
// for (int k = 0; k < 4; k++) C += A[k] * B[k];
C += A[0]*B[0] + A[1]*B[1] + A[2]*B[2] + A[3]*B[3];

Then comes vectorization, where the compiler packs multiple operations into one instruction using SIMD (Single Instruction, Multiple Data). CPUs and GPUs love this as they can process four, eight, or even sixteen numbers at once.

Loop Fusion

Loop fusion combines two separate loops that iterate over the same range.

// Before:
for (int i = 0; i < N; i++) A[i] = f(B[i]);
for (int i = 0; i < N; i++) C[i] = g(A[i]);

// After:
for (int i = 0; i < N; i++) {
  A[i] = f(B[i]);
  C[i] = g(A[i]);
}

It may look trivial, but it eliminates an entire pass over memory: the fused loop streams through B once, and each A[i] is still in cache (often still in a register) when g consumes it, instead of the whole of A being written out and reloaded on a second pass.

Real-World Perspective

In real ML compilers like TVM, XLA, or MLIR, these optimizations are applied automatically through passes that detect patterns and rewrite them for better hardware utilization. It’s the same math, just scheduled smarter.

Once I realized that, I stopped thinking of “compiler optimizations” as mysterious black boxes. They’re really about respecting hardware limits: cache size, bandwidth, instruction width, and memory hierarchy.

Final Thoughts

Understanding how these optimizations work changed how I think about performance. It’s not just about clever math - it’s about moving data efficiently and keeping the machine busy.

When I see a matrix multiply run almost instantly, I now imagine all the hidden layers of transformation - from loop tiling to vectorization - that made it possible.