CUDA Graphs

The Problem:

The CPU (Host) and GPU (Device) are separate chips that sit some physical distance apart and communicate over the PCIe bus. In a simplified picture, the workflow looks like this:

1. CPU works out what needs to be done.
2. CPU sends a command over the bus: “Launch matrix multiply kernel” (each launch carries roughly 5-10 microseconds of overhead).
3. GPU receives it, spins up cores and runs the math.
4. GPU finishes.
5. CPU realises it has finished and sends the next command over the bus (another ~5-10 microseconds of overhead).
6. The process repeats.

In modern deep learning, networks consist of thousands of tiny operations. If a kernel takes $2\,\mu s$ to run but the CPU overhead is $5\,\mu s$, then the GPU spends $5/(5+2) \approx 71\%$ of its time waiting for the CPU to speak to it.
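The idle-time estimate above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: `idle_fraction` and its parameters are hypothetical names, and the `n_kernels` variant assumes (as a simplification) that one graph launch costs about the same as one ordinary kernel launch.

```python
def idle_fraction(kernel_us, launch_overhead_us, n_kernels=1):
    """Fraction of wall time the GPU sits idle waiting on a launch.

    Simplified model: one launch overhead is paid, then n_kernels
    kernels run back to back with no further CPU involvement.
    """
    busy = n_kernels * kernel_us
    return launch_overhead_us / (busy + launch_overhead_us)


# One launch per kernel: 5 us of overhead for 2 us of work.
print(f"{idle_fraction(2.0, 5.0):.1%}")                  # 71.4%

# Assumed graph case: a single launch amortised over 1000 kernels.
print(f"{idle_fraction(2.0, 5.0, n_kernels=1000):.1%}")  # 0.2%
```

The second figure is only as good as the stated assumption, but it shows why amortising the launch cost over many kernels matters.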

Solution:

Record & Playback: CUDA graphs take the CPU out of the per-kernel loop by capturing the whole sequence of operations as a single executable graph, which the CPU then launches with one command.

Step 1: The Dry Run:

Create dummy tensors with the exact size and shape you’ll be dealing with. Run a forward and backward pass once. This allocates memory and allows the GPU runtime to figure out exactly what the memory footprint looks like.

Step 2: Record:

Run the forward and backward pass again, but this time record every kernel launch, every memory allocation and the order they happen in. CUDA builds a Directed Acyclic Graph (DAG) of these operations and their dependencies in its own memory.

Step 3: Replay:

Copy your real training data into the same memory addresses that held the dummy data. Now the CPU issues a single command (“Execute graph”), and the GPU works through the recorded kernels in dependency order without waiting for the CPU to send each instruction over the bus.
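The three steps can be sketched with PyTorch's CUDA graph API (`torch.cuda.CUDAGraph`, `torch.cuda.graph`, `replay()`). This is a minimal sketch rather than a definitive recipe: it assumes a CUDA build of PyTorch and an NVIDIA GPU, and the function name `train_with_cuda_graph`, the tiny `Linear` model and the tensor sizes are all hypothetical choices made for illustration. It returns `None` when no GPU is available.

```python
try:
    import torch
except ImportError:  # lets the sketch degrade gracefully without PyTorch
    torch = None


def train_with_cuda_graph(batches, n_in=64, n_out=8, batch_size=32):
    """Capture one training step in a CUDA graph, then replay it per batch."""
    if torch is None or not torch.cuda.is_available():
        return None  # CUDA graphs need a GPU and a CUDA build of PyTorch

    device = torch.device("cuda")
    model = torch.nn.Linear(n_in, n_out).to(device)
    loss_fn = torch.nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Step 1 (dry run): dummy tensors with the final shapes, plus a few
    # warm-up iterations on a side stream so allocations settle.
    static_x = torch.randn(batch_size, n_in, device=device)
    static_y = torch.randn(batch_size, n_out, device=device)
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            optimizer.zero_grad(set_to_none=True)
            loss_fn(model(static_x), static_y).backward()
            optimizer.step()
    torch.cuda.current_stream().wait_stream(side)

    # Step 2 (record): work issued inside this block is captured into
    # the graph instead of being run immediately.
    graph = torch.cuda.CUDAGraph()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.graph(graph):
        static_loss = loss_fn(model(static_x), static_y)
        static_loss.backward()
        optimizer.step()

    # Step 3 (replay): overwrite the static tensors in place, then
    # launch the whole recorded step with a single call.
    losses = []
    for x, y in batches:
        static_x.copy_(x)
        static_y.copy_(y)
        graph.replay()
        losses.append(static_loss.item())
    return losses
```

A caller would pass an iterable of `(x, y)` tensor pairs whose shapes match `batch_size`, `n_in` and `n_out`. The key design point is that real data is copied into `static_x` and `static_y` rather than assigned to new tensors, since the graph always reads from the addresses it saw during capture.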

The Catch:

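If your model has dynamic control flow, e.g. an if/else statement whose outcome depends on the data, it won’t work as intended. The graph is a fixed recording: it replays whichever branch was taken during capture and can’t make decisions on the fly.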
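A toy model in plain Python shows why a fixed recording breaks on branches. Nothing here touches CUDA: `step`, `record`, `replay` and `eager` are hypothetical stand-ins, with a list of Python functions playing the role of the captured graph.

```python
def step(x, launch):
    """One 'model step' with a data-dependent branch decided on the CPU."""
    if x > 0:
        launch("double", lambda v: v * 2)
    else:
        launch("negate", lambda v: -v)


def record(example_input):
    """Run the step once and keep only the operations that were launched."""
    ops = []
    step(example_input, lambda name, fn: ops.append((name, fn)))
    return ops


def replay(ops, x):
    """Re-run the recorded operations; the if/else is never re-evaluated."""
    for _name, fn in ops:
        x = fn(x)
    return x


def eager(x):
    """Normal execution: the branch is re-decided for every input."""
    return replay(record(x), x)


graph = record(3)         # 3 > 0, so only "double" is recorded
print(replay(graph, 3))   # 6, matches eager execution
print(replay(graph, -5))  # -10, the recorded branch is reused
print(eager(-5))          # 5, what the model was meant to compute
```

The replayed result for `-5` is wrong precisely because the branch was resolved once, at record time, and baked into the recording.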
