What:
In high-performance machine learning, a model can be too big to fit on a single GPU, so we split it across multiple GPUs. There are two main ways of doing this:
1. Pipeline Parallelism:
Split the model by chunking sequential layers and putting each chunk on a different GPU:
- GPU 1 takes the input data, runs it through layers 1-3 and gets an intermediate result (a tensor).
- GPU 1 then physically sends that tensor over the interconnect or network to GPU 2.
- GPU 2 runs layers 4-6, sends its result to GPU 3, and so on.
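The hand-off between GPUs can be sketched in plain Python. This is a toy illustration only: the "layers" are simple functions on a number, the stages are ordinary lists rather than real GPUs, and `split_into_stages` and `run_pipeline` are hypothetical helper names, not part of any library.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def split_into_stages(layers: List[Layer], num_stages: int) -> List[List[Layer]]:
    """Chunk consecutive layers into contiguous groups, one group per GPU.

    Assumes len(layers) is divisible by num_stages; otherwise the ceiling
    division below may produce fewer groups than requested.
    """
    per_stage = -(-len(layers) // num_stages)  # ceiling division
    return [layers[i:i + per_stage] for i in range(0, len(layers), per_stage)]


def run_pipeline(stages: List[List[Layer]], x: float) -> float:
    """Run the input through each stage in order."""
    for stage in stages:          # each stage would live on its own GPU
        for layer in stage:
            x = layer(x)
        # At this point a real system would send x to the next GPU.
    return x


# Six toy layers split over three "GPUs": layers 1-2, 3-4, 5-6.
layers = [lambda v, k=k: v * 2 + k for k in range(6)]
stages = split_into_stages(layers, 3)
result = run_pipeline(stages, 1.0)
```

The result is the same as running all six layers on one device; only where each layer lives has changed.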
The Problem: The Pipeline Bubble.
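With a single batch, only one GPU is active at a time; the rest sit idle. But if each chunk of work takes very little time, work flows through the GPUs much more smoothly, similar to traffic on a motorway. Thus, the solution is micro-batching: split each batch into smaller micro-batches, so that GPU 2 can work on the first micro-batch while GPU 1 already starts on the second.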
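To get a feel for how much micro-batching helps, here is a minimal sketch that lays out a forward-only schedule and counts idle slots. It assumes every stage takes exactly one time step per micro-batch and ignores the backward pass and transfer time; `pipeline_timeline` and `idle_fraction` are hypothetical names used only for this illustration.

```python
from typing import List, Optional


def pipeline_timeline(num_stages: int, num_microbatches: int) -> List[List[Optional[int]]]:
    """Return a grid timeline[stage][step] holding the micro-batch index
    processed at that step, or None if the stage is idle."""
    total_steps = num_stages + num_microbatches - 1
    timeline: List[List[Optional[int]]] = [
        [None] * total_steps for _ in range(num_stages)
    ]
    for mb in range(num_microbatches):
        for stage in range(num_stages):
            # Micro-batch mb reaches stage `stage` after `stage` earlier hops.
            timeline[stage][mb + stage] = mb
    return timeline


def idle_fraction(timeline: List[List[Optional[int]]]) -> float:
    """Fraction of (stage, step) slots in which a GPU does nothing."""
    slots = [slot for row in timeline for slot in row]
    return sum(slot is None for slot in slots) / len(slots)


one_batch = idle_fraction(pipeline_timeline(num_stages=4, num_microbatches=1))
eight_microbatches = idle_fraction(pipeline_timeline(num_stages=4, num_microbatches=8))
```

Under these assumptions, 4 GPUs with a single batch are idle 75% of the time, while 8 micro-batches bring that down to 3/11 (about 27%). The bubble shrinks as the number of micro-batches grows, but it never fully disappears.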
2. Tensor Parallelism:
Pipeline parallelism splits the model horizontally (between layers); tensor parallelism splits it vertically (inside each layer). Remember, a single layer is just a matrix multiplication, $Y = XW$. Thus, we can split $W$ column-wise into $W_1$ and $W_2$, compute $Y_1 = XW_1$ and $Y_2 = XW_2$ on separate GPUs, and combine them later. Also, most activation functions are element-wise, so we can apply them to each part separately as well. The problem comes later, when we sync the parts across GPUs.
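As a sketch of the idea, assume the layer is a matrix multiplication Y = XW followed by an element-wise ReLU, and that the weight matrix is split by columns across two GPUs. The example uses plain Python lists instead of real tensors, and all helper names (`matmul`, `relu`, `split_columns`, `concat_columns`) are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Naive matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    """Element-wise activation: works the same on a full matrix or a shard."""
    return [[max(0.0, v) for v in row] for row in m]


def split_columns(w: Matrix, parts: int) -> List[Matrix]:
    """Split w into `parts` equal column blocks (assumes divisibility)."""
    width = len(w[0]) // parts
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(parts)]


def concat_columns(blocks: List[Matrix]) -> Matrix:
    """Glue column blocks back together; this is the cross-GPU sync step."""
    return [sum(rows, []) for rows in zip(*blocks)]


X = [[1.0, -2.0, 3.0],
     [0.5, 4.0, -1.0]]
W = [[1.0, -1.0, 2.0, 0.0],
     [0.0, 3.0, -2.0, 1.0],
     [2.0, 1.0, 0.0, -1.0]]

# Single-GPU reference: the whole layer in one place.
full = relu(matmul(X, W))

# "Two GPUs": each holds half the columns of W and computes its own part.
shards = split_columns(W, 2)
partial = [relu(matmul(X, shard)) for shard in shards]
combined = concat_columns(partial)
```

Here `combined` equals `full`: each GPU only needs its own slice of the weights, and the activation can be applied before the parts are gathered.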
The Problem: Insane communication overhead.
Here, GPUs must talk to each other on every single forward pass and every single backward pass of every single layer. Thus, Tensor Parallelism should ONLY be used inside a single physical node (e.g. across 8 GPUs inside one DGX server, communicating over NVLink).
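A rough way to see the difference is to count communication events per training step. The sketch below uses a deliberately simplified model: pipeline parallelism sends once per stage boundary in each direction, and tensor parallelism syncs once per layer in each direction. Real systems may sync more or less often than this, and the layer and stage counts are hypothetical.

```python
def pipeline_transfers(num_stages: int) -> int:
    """One activation send per stage boundary on the forward pass,
    one gradient send per boundary on the backward pass."""
    return 2 * (num_stages - 1)


def tensor_syncs(num_layers: int) -> int:
    """Simplified: one cross-GPU sync per layer on the forward pass
    and one per layer on the backward pass."""
    return 2 * num_layers


# Hypothetical model: 48 layers, spread over 4 pipeline stages.
pp_events = pipeline_transfers(num_stages=4)
tp_events = tensor_syncs(num_layers=48)
```

With these made-up numbers, pipeline parallelism needs 6 transfers per batch while tensor parallelism needs 96 syncs, which is why the latter wants the fastest link available.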