Background:
Basically all of (Machine Learning) Models nowadays are behemoths of engineering challenges. We should optimise them so that we’re bottlenecked by the most expensive - and upgradable - part, the GPU.
Simple Bottlenecks on GPU’s:
- I/O Bottlenecks: Reading from hard-drive, off-loading to CPU. Fix: Use multiple CPU worker processes to load and pre-fetch data.
- Kernel Proliferation: Read it.
GPU Parallelism:
Your model, with billions of weights, may not fit in a single GPU’s memory. Or, it may fit, but it takes year to train on a single GPU. Thus, we parallelise it.
There’s different ways to parallelise DNN training: