The Problem:

The CPU (Host) and GPU (Device) are physically separate chips that communicate over a link called the PCIe bus. In simplified form, the usual workflow is:

  1. CPU calculates what needs to be done.
  2. CPU sends a command over the bus: “Launch matrix multiply kernel”. (Each launch carries roughly 5-10 microseconds of overhead.)
  3. GPU receives it, spins up cores and runs the math.
  4. GPU finishes.
  5. CPU realises it finished and sends the next command over the bus (another ~5-10 microseconds of overhead).
  6. Process repeats.

In modern DL, networks consist of thousands of tiny operations. If a kernel takes 2 microseconds to run but the CPU overhead is 5 microseconds, then roughly 5/7 (about 71%) of the GPU’s time is spent waiting for the CPU to speak to it.
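The waiting fraction can be checked with a few lines of arithmetic. This is a minimal sketch that reads the 2 and 5 above as microseconds and assumes launches and kernels strictly alternate with no overlap, as in the simplified workflow; the function name `gpu_idle_fraction` is made up for illustration.

```python
def gpu_idle_fraction(kernel_us: float, overhead_us: float) -> float:
    """Fraction of each launch-plus-run cycle the GPU spends waiting.

    Assumes a strictly serial model: the GPU is idle for the whole
    launch overhead and busy only while the kernel runs.
    """
    return overhead_us / (kernel_us + overhead_us)


# Kernel runs for 2 us, launch overhead is 5 us.
print(f"{gpu_idle_fraction(2.0, 5.0):.1%}")  # 71.4%
```

The longer the kernel runs relative to the overhead, the smaller this fraction gets, which is why the problem bites hardest on networks built from many tiny operations.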

Solution:

Record & Playback: CUDA graphs take the CPU out of the per-kernel loop by capturing the whole sequence of operations as a single executable graph, which the CPU then launches with one command.

Step 1: The Dry Run:

Create dummy tensors of the same size and shape as the data you’ll be dealing with. Run a forward and backward pass once. This allocates memory and lets the exact memory footprint settle before anything is recorded.

Step 2: Record Phase:

Run the forward and backward pass again, this time in capture mode. The kernels are not actually executed; instead CUDA records every kernel launch, every memory allocation and the order they happen in, and builds a Directed Acyclic Graph of dependencies from them.

Step 3: Replay:

You drop your real training data into the same memory addresses that you used for the dummy data. Now your CPU issues a single command (“Execute graph”). The GPU takes over and runs the recorded kernels in dependency order, without waiting for the CPU to issue each one over the bus.
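The three steps can be sketched in PyTorch, which is an assumption here since the notes do not name a framework. The sketch uses `torch.cuda.CUDAGraph`, `torch.cuda.graph` and `graph.replay()`; the tiny model, the shapes, the random stand-in data and the function name `train_with_cuda_graph` are made up for illustration. It returns `None` when PyTorch or a CUDA device is missing.

```python
try:
    import torch  # assumption: PyTorch; the notes do not name a framework
except ImportError:
    torch = None


def train_with_cuda_graph(num_steps=5):
    """Warm up, capture one training step, then replay it on new data."""
    if torch is None or not torch.cuda.is_available():
        return None  # CUDA graphs need a CUDA build of PyTorch and a GPU

    model = torch.nn.Sequential(
        torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10)
    ).cuda()
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Step 1 (dry run): dummy tensors with the real shapes. Their
    # addresses are the ones the graph will read from later.
    static_input = torch.randn(32, 64, device="cuda")
    static_target = torch.randn(32, 10, device="cuda")

    # Warm up on a side stream before capturing.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_input), static_target).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side)

    # Step 2 (record): work issued inside this block is captured into
    # the graph rather than executed.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_input), static_target)
        static_loss.backward()
        optimizer.step()

    # Step 3 (replay): copy new data into the same buffers, then launch
    # the whole recorded step with one call. Random tensors stand in
    # for real training batches.
    losses = []
    for _ in range(num_steps):
        static_input.copy_(torch.randn(32, 64, device="cuda"))
        static_target.copy_(torch.randn(32, 10, device="cuda"))
        graph.replay()
        losses.append(static_loss.item())
    return losses
```

Note that the data is copied into `static_input` with `copy_` rather than rebinding the name to a new tensor: the graph reads from fixed addresses, so only the contents of those buffers may change.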

The Catch:

If your model has dynamic control flow, e.g. an if/else statement whose outcome depends on the data, it won’t work. The graph is a fixed recording and can’t make decisions on the fly.
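Why a fixed recording breaks on data-dependent branches can be shown with a toy analogy in plain Python. This is not CUDA: a list of functions stands in for the captured graph, a one-element list stands in for the static buffer, and every name here is hypothetical.

```python
def double(buf):
    buf[0] *= 2


def negate(buf):
    buf[0] = -buf[0]


def eager_step(buf):
    """Ordinary execution: the branch is re-evaluated on every call."""
    if buf[0] > 0:
        double(buf)
    else:
        negate(buf)


def record_step(buf):
    """Toy 'capture': note which op the branch picks, without running it."""
    tape = []
    tape.append(double if buf[0] > 0 else negate)
    return tape


def replay(tape, buf):
    """Toy 'replay': run the recorded ops as-is; no branch is evaluated."""
    for op in tape:
        op(buf)


tape = record_step([3.0])   # positive input, so the tape holds `double`

eager = [-4.0]
eager_step(eager)           # the branch sees a negative value and negates

replayed = [-4.0]
replay(tape, replayed)      # the tape still doubles, whatever the data

print(eager, replayed)      # [4.0] [-8.0]
```

The branch was decided once, at recording time, so the replay silently follows the path taken by the dummy data instead of the one the real data calls for.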