Background:

Basically all of (Machine Learning) Models nowadays are behemoths of engineering challenges. We should optimise them so that we’re bottlenecked by the most expensive - and upgradable - part, the GPU.

Simple Bottlenecks on GPU’s:

  • I/O Bottlenecks: Reading from hard-drive, off-loading to CPU. Fix: Use multiple CPU worker processes to load and pre-fetch data.
  • Kernel Proliferation: Read it.

GPU Parallelism:

Your model, with billions of weights, may not fit in a single GPU’s memory. Or, it may fit, but it takes year to train on a single GPU. Thus, we parallelise it.

There’s different ways to parallelise DNN training:

Making the Model Cheaper:

How Do GPUs Talk To Each Other?