The Problem:
The CPU (Host) and GPU (Device) are physically separate chips that communicate over a link called the PCIe bus. Normally, this is the workflow:
- CPU calculates what needs to be done
- CPU sends a command over the bus: “Launch matrix multiply kernel”. (Each launch carries roughly 5-10 microseconds of overhead.)
- GPU receives it, spins up cores and runs the math.
- GPU finishes.
- CPU realises it finished and sends the next command over the bus (another ~5-10 microseconds overhead)
- Process repeats
In modern DL, networks consist of thousands of tiny operations. If a kernel takes 2 microseconds to run but the CPU overhead is 5 microseconds, then the GPU spends most of its time waiting for the CPU to speak to it.
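To put numbers on this, here is a minimal back-of-the-envelope sketch. It assumes the strictly serialized workflow described above (one launch, then one kernel, then the next launch) and reads the 2 and 5 as microseconds. The function names and the kernel count of 1000 are made up for illustration, and real CUDA launches are asynchronous, so treat the output as a rough picture rather than a measurement.

```python
def eager_time_us(num_kernels: int, kernel_us: float, launch_us: float) -> float:
    """Total time when every kernel waits for its own CPU launch."""
    return num_kernels * (launch_us + kernel_us)


def graph_time_us(num_kernels: int, kernel_us: float, launch_us: float) -> float:
    """Total time when one launch starts the whole recorded sequence."""
    return launch_us + num_kernels * kernel_us


def idle_fraction(total_us: float, num_kernels: int, kernel_us: float) -> float:
    """Share of the total time in which the GPU is not running a kernel."""
    busy_us = num_kernels * kernel_us
    return (total_us - busy_us) / total_us


# Illustrative values only: 1000 kernels, 2 us each, 5 us launch overhead.
num_kernels, kernel_us, launch_us = 1000, 2.0, 5.0

eager = eager_time_us(num_kernels, kernel_us, launch_us)
graph = graph_time_us(num_kernels, kernel_us, launch_us)

print(f"one launch per kernel: {eager:.0f} us, GPU idle {idle_fraction(eager, num_kernels, kernel_us):.1%}")
print(f"one launch in total:   {graph:.0f} us, GPU idle {idle_fraction(graph, num_kernels, kernel_us):.1%}")
```

Under these assumptions the launch overhead, not the math, dominates the per-kernel case, which is the cost the solution below goes after.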
Solution:
Record & Playback: CUDA graphs remove the per-kernel CPU round trip by capturing the entire sequence of operations as a single executable object that the GPU can run from one launch command.
Step 1: The Dry Run:
Create dummy tensors of the same size and shape as the data you’ll be dealing with. Run a forward and backward pass once. This allocates memory and settles exactly what the memory footprint looks like before anything is recorded.
Step 2: Record:
Run the forward and backward pass again, this time recording every kernel launch, every memory allocation, and the order in which they happen. CUDA assembles these into a directed acyclic graph (DAG) of dependencies that it keeps for later replay.
Step 3: Replay:
Copy your real training data into the same memory addresses that held the dummy data. The CPU now issues a single command (“Execute graph”), and the GPU runs the recorded kernels in order without waiting for the CPU to send each instruction over the bus.
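The three steps can be sketched as a toy model in plain Python. This is not the CUDA API: `ToyGraph` and its methods are hypothetical names, lists stand in for GPU buffers, and lambdas stand in for kernels. (In PyTorch the real counterparts are the `torch.cuda.graph` capture context and `CUDAGraph.replay()`, which this sketch does not call.) What it does show is the key constraint: the recording is bound to fixed buffers, so new data has to be copied into them in place.

```python
class ToyGraph:
    """Pure-Python stand-in for a captured graph: a fixed list of
    operations, each bound to fixed source and destination buffers."""

    def __init__(self):
        self.ops = []
        self.launches = 0  # simulated CPU-to-GPU commands

    def record(self, fn, src, dst):
        # Remember the operation and the exact buffers it touches.
        self.ops.append((fn, src, dst))

    def replay(self):
        # One command from the CPU runs every recorded operation in order.
        self.launches += 1
        for fn, src, dst in self.ops:
            dst[:] = fn(src)  # write in place so the buffer object never changes


# Step 1: dummy buffers of the right size (their contents do not matter).
static_input = [0.0] * 4
hidden = [0.0] * 4
static_output = [0.0] * 4

# Step 2: record the sequence of "kernels" against those buffers.
graph = ToyGraph()
graph.record(lambda xs: [2 * x for x in xs], static_input, hidden)
graph.record(lambda xs: [x + 1 for x in xs], hidden, static_output)

# Step 3: copy real data into the same input buffer, then replay.
results = []
for batch in ([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]):
    static_input[:] = batch  # in-place copy, not a new list
    graph.replay()
    results.append(list(static_output))

print(results)
print("commands issued:", graph.launches)
```

Rebinding the name instead (`static_input = batch`) would leave the recorded operations pointing at the old buffer, which mirrors why real data must go into the same memory addresses.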
The Catch:
If your model has dynamic control flow that depends on the data, e.g. an if/else statement, it won’t work as expected. The graph is a fixed recording and can’t make decisions on the fly: it replays whichever branch was taken during recording.
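A small toy model makes the failure mode concrete. Again, this is plain Python with hypothetical names (`capture`, `step`, `replay`), not the CUDA API. The branch is evaluated once, while recording on dummy data, and the replay then applies that same branch to data for which the other branch would have been correct.

```python
def step(x, ops):
    """One 'training step' with a data-dependent branch.
    The if/else runs on the CPU and only decides which op gets recorded."""
    if sum(x) > 0:
        ops.append(lambda v: [e * 2 for e in v])  # branch A: double
    else:
        ops.append(lambda v: [-e for e in v])     # branch B: negate


def capture(step_fn, static_input):
    """Run the step once and keep whichever operations it chose."""
    recorded = []
    step_fn(static_input, recorded)
    return recorded


def replay(recorded, static_input):
    """Apply the recorded operations in order; no branching happens here."""
    out = static_input
    for op in recorded:
        out = op(out)
    return out


# Record with dummy data whose sum is positive, so branch A is captured.
static_input = [1.0, 1.0]
graph = capture(step, static_input)

# Real data has a negative sum, so an unrecorded run would take branch B.
static_input[:] = [-3.0, -1.0]
replayed = replay(graph, static_input)

eager_ops = []
step(static_input, eager_ops)
eager = eager_ops[0](static_input)

print("replayed recording:", replayed)
print("fresh run:         ", eager)
```

The two results differ because the recording contains only the operation chosen at capture time, not the condition that chose it.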