What:

CUDA: NVIDIA’s platform and C++ language extension for running code directly on the GPU.

Remember:

  • A program built for CPU executes its instructions one after another.
  • A program built for GPU consists of small functions called kernels. The GPU launches that exact kernel across e.g. 100,000 threads, as many at once as the hardware fits, each on a different piece of data.
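
A minimal sketch of the "same kernel, different data" idea, assuming a machine with a CUDA device and `nvcc`. The names `add_vectors` and `add_on_gpu` are made up for illustration, and error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: every thread runs this same function on one element.
__global__ void add_vectors(const float* a, const float* b, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element is mine?
    if (i < n) {                                    // the last block may overshoot n
        out[i] = a[i] + b[i];
    }
}

// Host helper (hypothetical name): copy inputs to the GPU, launch, copy back.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_out;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    add_vectors<<<blocks, threads>>>(d_a, d_b, d_out, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_out);
    return out;
}

int main() {
    std::vector<float> a(100000, 1.0f), b(100000, 2.0f);
    std::vector<float> out = add_on_gpu(a, b);
    printf("out[0] = %.1f, out[99999] = %.1f\n", out[0], out[99999]);
    return 0;
}
```

Note there is no loop over the 100,000 elements anywhere: the launch creates one thread per element instead.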

Grids, Blocks & Threads:

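  • The Grid represents the total amount of work to be done: every thread launched by one kernel call. It’s not the GPU itself, but its Blocks get spread across the whole GPU.
  • The Grid is divided into Blocks. Each Block is assigned to one specific physical part of the GPU (a Streaming Multiprocessor, or SM) and stays there until it finishes.
  • A Block is divided into Threads. The Thread is the individual worker doing the maths.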
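
To see how the hierarchy maps to code, here is a small sketch (assuming a 1D launch) where each thread writes down which Block it is in and which Thread it is inside that Block. `record_ids`, `Ids` and `launch_and_record` are illustrative names, not part of CUDA. `blockIdx`, `blockDim` and `threadIdx` are the built-in variables CUDA provides inside a kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread records its own coordinates at its global position.
__global__ void record_ids(int* block_of, int* thread_of) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // position in the whole Grid
    block_of[g] = blockIdx.x;    // which Block am I in?
    thread_of[g] = threadIdx.x;  // which Thread am I within that Block?
}

struct Ids { std::vector<int> block_of, thread_of; };

Ids launch_and_record(int blocks, int threads_per_block) {
    int total = blocks * threads_per_block;
    size_t bytes = total * sizeof(int);
    int *d_block, *d_thread;
    cudaMalloc(&d_block, bytes);
    cudaMalloc(&d_thread, bytes);

    // The two numbers in <<<...>>> are the Grid size (in Blocks)
    // and the Block size (in Threads).
    record_ids<<<blocks, threads_per_block>>>(d_block, d_thread);

    Ids ids{std::vector<int>(total), std::vector<int>(total)};
    cudaMemcpy(ids.block_of.data(), d_block, bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(ids.thread_of.data(), d_thread, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_block); cudaFree(d_thread);
    return ids;
}

int main() {
    Ids ids = launch_and_record(2, 4);  // a Grid of 2 Blocks, 4 Threads each
    for (size_t g = 0; g < ids.block_of.size(); ++g)
        printf("global %zu -> block %d, thread %d\n", g, ids.block_of[g], ids.thread_of[g]);
    return 0;
}
```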

Threads inside the same Block can share data through fast, on-chip Shared Memory. Threads in different Blocks have to go through slower, global memory.
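
A sketch of Shared Memory in action, assuming a simple per-Block sum: each Block loads its slice into a `__shared__` array, adds it up there, and only touches global memory once at the end. `block_sum`, `sum_on_gpu` and `THREADS` are names invented for this example, and the final additions are done on the CPU to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // threads per Block; a power of two for the loop below

// Each Block sums its own THREADS-element slice of `in` into one entry of `block_sums`.
__global__ void block_sum(const float* in, float* block_sums, int n) {
    __shared__ float scratch[THREADS];  // visible only to the threads of this Block
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (g < n) ? in[g] : 0.0f;
    __syncthreads();                    // wait until every thread has written its value

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_sums[blockIdx.x] = scratch[0];  // one write to global memory per Block
}

float sum_on_gpu(const std::vector<float>& data) {
    int n = static_cast<int>(data.size());
    int blocks = (n + THREADS - 1) / THREADS;
    float *d_in, *d_sums;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_sums, blocks * sizeof(float));
    cudaMemcpy(d_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS>>>(d_in, d_sums, n);

    std::vector<float> sums(blocks);
    cudaMemcpy(sums.data(), d_sums, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in); cudaFree(d_sums);

    float total = 0.0f;
    for (float s : sums) total += s;  // Blocks can't see each other's Shared Memory
    return total;
}

int main() {
    std::vector<float> data(1000, 1.0f);
    printf("sum = %.1f\n", sum_on_gpu(data));
    return 0;
}
```

The per-Block partial sums have to travel through global memory precisely because one Block cannot read another Block's `scratch` array.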

The Warp

Inside Blocks, threads are grouped into squads of 32, called Warps. (Someone had fun when naming things lol).

The Warp has a rule: Single Instruction, Multiple Threads (SIMT). All threads within it execute the same instruction at the same time, each on its own data.

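Thus, it’s smarter to avoid if/else statements that split a Warp. If some threads take the if and others the else, the Warp runs both branches one after the other, and only the threads on the current branch do useful work each time. An if/else where all 32 threads go the same way costs nothing extra.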
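
A sketch of the difference, assuming Blocks of 64 threads (so two Warps per Block). The kernels `branch_per_thread` and `branch_per_warp` and the helper `run_branches` are made-up names. The two kernels deliberately don't produce the same output: they only show which threads end up on which branch. For bodies this tiny the compiler may well remove the real branch altogether, so treat it as an illustration of the pattern rather than a benchmark.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Divergent: even and odd threads sit side by side in the same Warp,
// so every Warp has to walk through both branches.
__global__ void branch_per_thread(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadIdx.x % 2 == 0) x[i] *= 2.0f;
    else                      x[i] += 1.0f;
}

// Uniform: the branch depends on the Warp number within the Block,
// so all threads of a Warp agree on which way to go.
__global__ void branch_per_warp(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((threadIdx.x / warpSize) % 2 == 0) x[i] *= 2.0f;
    else                                   x[i] += 1.0f;
}

// Host helper (hypothetical): runs one of the two kernels on a copy of x.
std::vector<float> run_branches(bool per_warp, std::vector<float> x) {
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);
    float* d_x;
    cudaMalloc(&d_x, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 64, blocks = (n + threads - 1) / threads;
    if (per_warp) branch_per_warp<<<blocks, threads>>>(d_x, n);
    else          branch_per_thread<<<blocks, threads>>>(d_x, n);

    cudaMemcpy(x.data(), d_x, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    return x;
}

int main() {
    std::vector<float> x(64, 3.0f);
    std::vector<float> a = run_branches(false, x);
    std::vector<float> b = run_branches(true, x);
    printf("per thread: x[0]=%.1f x[1]=%.1f\n", a[0], a[1]);
    printf("per warp:   x[0]=%.1f x[32]=%.1f\n", b[0], b[32]);
    return 0;
}
```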

How To Feed The Beast Blocks:

You want the GPU doing maths as fast as possible. It thus shouldn’t wait for data from its global memory (VRAM).

So when a Warp asks for data from memory, the GPU doesn’t fetch one value at a time: it fetches one large, contiguous chunk of memory (a cache line) and serves as many of the Warp’s threads from it as it can. The cost of the trip is thus amortised across the whole Warp. This only works if the threads ask for data sitting right next to each other (Coalesced Access). If one thread asks for Address 1, another for Address 250 etc., then we’re cooked, because the addresses land in different chunks and the GPU has to make multiple trips to global memory for a single Warp.
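
A rough way to see this on your own card, sketched under a few assumptions: a hypothetical `gather` kernel where thread i reads element `(i * stride) % n`, timed with CUDA events. With `stride = 1` neighbouring threads read neighbouring floats (coalesced); with `stride = 32` each thread in a Warp reads 128 bytes away from its neighbour. `time_gather_ms` is an illustrative helper, and the actual timings depend on the GPU, so none are quoted here.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i copies in[(i * stride) % n] to out[i].
// stride == 1 gives coalesced reads; larger strides scatter a Warp's reads.
__global__ void gather(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int src = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[i] = in[src];
    }
}

// Host helper (hypothetical): fills in[i] = i, times one launch, returns milliseconds.
float time_gather_ms(int n, int stride, std::vector<float>& out) {
    std::vector<float> in(n);
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);
    size_t bytes = n * sizeof(float);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    gather<<<blocks, threads>>>(d_in, d_out, n, stride);  // warm-up launch, not timed

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    gather<<<blocks, threads>>>(d_in, d_out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait for the timed launch to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    out.resize(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(start); cudaEventDestroy(stop);
    cudaFree(d_in); cudaFree(d_out);
    return ms;
}

int main() {
    const int n = 1 << 20;
    std::vector<float> out;
    printf("stride 1  (coalesced): %.3f ms\n", time_gather_ms(n, 1, out));
    printf("stride 32 (scattered): %.3f ms\n", time_gather_ms(n, 32, out));
    return 0;
}
```

Both launches move exactly the same number of floats; only the order in which a Warp's threads touch memory changes.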

AMD also has its own CUDA look-alike: HIP, part of ROCm!