What:

A way of optimising running code in High Performance Machine Learning by cutting the overhead of launching work on the GPU.

How:

As in all HPC, we want to keep our GPUs as busy as possible. The GPU can run the maths incredibly quickly, but each individual piece of work (a kernel) has to be launched from the CPU, and that takes time.

Every launch carries a delay on the order of microseconds, called “Launch Overhead”. After all, the CPU has to interpret the Python, talk over the PCIe bus, and send the instructions. For small kernels this can be orders of magnitude slower than the actual maths, so in the worst case the GPU sits idle for 99%+ of the time!
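The idle-time figure can be sanity-checked with a back-of-the-envelope model. This is a minimal sketch, assuming every launch is fully serialised (the CPU never queues work ahead) and using made-up timings rather than measurements; idle_fraction is a hypothetical helper, not part of PyTorch or CUDA.

```python
def idle_fraction(launch_overhead_us, kernel_time_us):
    """Fraction of wall time the GPU spends waiting, if launches never overlap."""
    return launch_overhead_us / (launch_overhead_us + kernel_time_us)


# Illustrative numbers only (assumptions, not measurements).
tiny = idle_fraction(launch_overhead_us=10.0, kernel_time_us=0.1)
large = idle_fraction(launch_overhead_us=10.0, kernel_time_us=1000.0)

print(f"tiny kernel: {tiny:.1%} idle, large kernel: {large:.1%} idle")
```

With these assumed numbers the tiny kernel leaves the GPU idle about 99% of the time, while the large kernel hides the same overhead almost completely. Launch overhead only hurts when the kernels themselves are small.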

How to fix:

  • JIT (Just-In-Time) Compilation: In PyTorch, we use a Just-In-Time compiler (for example torch.compile). When some maths is written in Python (z = (x ** 2) + (y * 3)), the compiler writes a brand-new, optimised GPU kernel that does all three steps (square, multiply, add) at once, keeping the intermediate values in the GPU’s registers. One launch replaces three, and there is no need to go back to the CPU between steps.
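  • CUDA Graphs: The CPU (Host) and GPU (Device) are separate chips with physical distance between them, communicating over a bridge called the PCIe bus. A CUDA Graph records a whole sequence of kernel launches once and then replays it with a single launch from the CPU, so the launch cost is paid once per graph rather than once per kernel.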
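The two fixes can be compared with a toy model in plain Python. This is only a CPU-side analogy under stated assumptions: ToyDevice, three_kernels and fused_kernel are hypothetical names invented for illustration, and "submissions" stands in for CPU-to-GPU launches. In PyTorch itself the corresponding entry points are torch.compile for fusion, and torch.cuda.CUDAGraph with torch.cuda.graph for graph capture and replay.

```python
class ToyDevice:
    """Stand-in for a GPU that counts CPU -> GPU submissions (hypothetical)."""

    def __init__(self):
        self.submissions = 0

    @staticmethod
    def _run(kernel, *buffers):
        # Apply the kernel elementwise, as a GPU would across its threads.
        return [kernel(*vals) for vals in zip(*buffers)]

    def launch(self, kernel, *buffers):
        # One kernel launch costs one submission.
        self.submissions += 1
        return self._run(kernel, *buffers)

    def replay(self, graph, *buffers):
        # A recorded graph costs one submission, however many kernels it holds.
        self.submissions += 1
        return graph(self._run, *buffers)


def three_kernels(run, x, y):
    """z = x**2 + y*3 as three separate kernels with two intermediate buffers."""
    a = run(lambda xi: xi ** 2, x)               # kernel 1: square
    b = run(lambda yi: yi * 3, y)                # kernel 2: scale
    return run(lambda ai, bi: ai + bi, a, b)     # kernel 3: add


def fused_kernel(xi, yi):
    """The same maths as one kernel; intermediates never leave the kernel body."""
    return xi ** 2 + yi * 3


x, y = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]

eager_dev = ToyDevice()
z_eager = three_kernels(eager_dev.launch, x, y)   # unfused: 3 submissions

fused_dev = ToyDevice()
z_fused = fused_dev.launch(fused_kernel, x, y)    # JIT-style fusion: 1 submission

graph_dev = ToyDevice()
z_graph = graph_dev.replay(three_kernels, x, y)   # graph-style replay: 1 submission

print(eager_dev.submissions, fused_dev.submissions, graph_dev.submissions)
```

All three paths compute the same z. The difference is that fusion removes the intermediate buffers as well as the extra launches, whereas graph replay still runs three kernels but submits them in one go.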