Intra-Node Problem:
In an 8x GPU (and 2x CPU) server, how does GPU 1 talk to GPU 2? Going over the PCIe bus is slow.
Solution:
NVIDIA invented its own proprietary interconnect and switches, NVLink and NVSwitch, that link the GPUs to each other directly instead of going through PCIe. NVLink meshes the 8 GPUs together. …
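To get a feel for why the link type matters, here is a rough back-of-the-envelope sketch. The bandwidth figures are ballpark assumptions (roughly a PCIe 4.0 x16 slot and A100-generation NVLink, per direction), not measurements, and `transfer_time_ms` is a hypothetical helper that ignores latency and protocol overhead.

```python
# Assumed, rounded per-direction bandwidths in GB/s. Illustrative only.
LINK_BANDWIDTH_GB_PER_S = {
    "PCIe 4.0 x16": 32,
    "NVLink (A100 generation)": 300,
}


def transfer_time_ms(size_gb, bandwidth_gb_per_s):
    """Idealised time to move `size_gb` gigabytes over a link, in milliseconds."""
    return size_gb / bandwidth_gb_per_s * 1000


# How long would it take to ship 10 GB of gradients from GPU 1 to GPU 2?
for link, bandwidth in LINK_BANDWIDTH_GB_PER_S.items():
    print(f"{link}: {transfer_time_ms(10, bandwidth):.1f} ms")
```

Under these assumptions the same transfer is roughly ten times faster over NVLink, which adds up quickly when GPUs exchange data on every training step.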
Inter-Node Problem:
Now, how do you get 32 servers to talk to each other in a massive data centre? Without special hardware, the flow is:
- GPU → CPU RAM → CPU processes network protocols → Network Card (NIC) → Cable → Server B NIC → Server B CPU → Server B GPU.
- That’s a lot of steps 😅
Solution:
Remote Direct Memory Access (RDMA), which is built into high-end NICs.
- The NIC on Server A (which houses the sending GPU) reaches directly into memory, grabs the data, and sends it straight over to the receiving GPU in Server B, skipping the CPU steps above.
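The difference between the two paths can be sketched as plain lists of stops. The first path is copied from the flow above. The second is an assumed GPU-to-GPU RDMA path in which the NIC reads and writes GPU memory directly. The names `CPU_PATH`, `RDMA_PATH`, `hops` and `cpu_stops` are made up for this illustration.

```python
# The classic path: every stop from the flow above, in order.
CPU_PATH = [
    "GPU (A)",
    "CPU RAM (A)",
    "CPU network stack (A)",
    "NIC (A)",
    "cable",
    "NIC (B)",
    "CPU (B)",
    "GPU (B)",
]

# Assumed RDMA path: the NICs touch GPU memory directly, so no CPU stops.
RDMA_PATH = ["GPU (A)", "NIC (A)", "cable", "NIC (B)", "GPU (B)"]


def hops(path):
    """Number of hand-offs between consecutive stops."""
    return len(path) - 1


def cpu_stops(path):
    """Stops where a CPU (or its RAM) has to get involved."""
    return [stop for stop in path if "CPU" in stop]


for name, path in [("Without RDMA", CPU_PATH), ("With RDMA", RDMA_PATH)]:
    print(f"{name}: {hops(path)} hops, {len(cpu_stops(path))} CPU stops")
```

Fewer hops is only half the win. The CPU stops are the expensive ones, because each involves a memory copy or protocol processing.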
Network Cables & Protocols Powering the Comms:
- InfiniBand: A networking standard that supports RDMA and is lossless. A competitor is Slingshot.
- Cables: Fibre-optic.
Network Topologies:
You have to connect your servers in a clever topology. Ideally, it's arranged to maximise Bisection Bandwidth (if you cut the data centre into two equal halves, how much data can flow between them in the worst case).
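Bisection bandwidth is easy to compute by brute force on a toy network. This is a minimal sketch, assuming servers are wired directly to each other (no switches) and every link has a made-up bandwidth of 100. `bisection_bandwidth` is a hypothetical helper that tries every way of splitting the servers into two equal halves and keeps the worst case.

```python
from itertools import combinations


def bisection_bandwidth(nodes, links):
    """Minimum total bandwidth crossing any split of `nodes` into equal halves.

    `links` maps an (a, b) pair of nodes to the bandwidth of that link.
    Brute force, so only suitable for tiny examples.
    """
    nodes = list(nodes)
    worst = None
    for group in combinations(nodes, len(nodes) // 2):
        side_a = set(group)
        # A link crosses the cut when exactly one endpoint is on side A.
        crossing = sum(
            bandwidth
            for (a, b), bandwidth in links.items()
            if (a in side_a) != (b in side_a)
        )
        if worst is None or crossing < worst:
            worst = crossing
    return worst


servers = [0, 1, 2, 3]
line = {(0, 1): 100, (1, 2): 100, (2, 3): 100}
ring = {**line, (3, 0): 100}
full_mesh = {pair: 100 for pair in combinations(servers, 2)}

for name, links in [("line", line), ("ring", ring), ("full mesh", full_mesh)]:
    print(name, bisection_bandwidth(servers, links))
```

With the same four servers, the line scores 100, the ring 200 and the full mesh 400. More cross-links means a higher score, at the price of more cables.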
- Fat Tree: Imagine a tree, but the edges closer to the root node are thicker. In our example, the thickness represents how many cables connect those nodes. The leaves are GPUs. Every GPU is the same number of hops from the root, and because the upper links are thicker, there's no bottleneck at the root.
- Dragonfly: Groups servers into densely connected "pods". The pods themselves are connected to each other more sparsely. Requires fewer cables.
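The "thicker towards the root" idea from the Fat Tree bullet can be checked with a few lines of arithmetic. This sketch assumes a simplified binary tree in which the cable count per link doubles at every level going up. Real fat trees are built from switches with many more ports, and `cables_per_level` is a name invented for this example.

```python
def cables_per_level(num_leaves, fat=True):
    """Total cables at each tree level, listed from the leaves up to the root.

    Assumes a binary tree with a power-of-two number of leaves. In the fat
    variant each link carries twice as many cables as the level below it.
    """
    if num_leaves < 2 or num_leaves & (num_leaves - 1):
        raise ValueError("num_leaves must be a power of two, at least 2")
    depth = num_leaves.bit_length() - 1  # log2(num_leaves)
    totals = []
    for level in range(depth):
        links = num_leaves >> level           # links halve at each level up
        cables = 2 ** level if fat else 1     # thickness of each link
        totals.append(links * cables)
    return totals


print("plain tree:", cables_per_level(8, fat=False))
print("fat tree:  ", cables_per_level(8, fat=True))
```

In the plain tree the capacity shrinks from 8 to 4 to 2 cables as traffic climbs towards the root, so the root becomes the bottleneck. In the fat tree every level carries the same 8 cables' worth of traffic.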
Software For Communicating:
Of course, NVIDIA built a library that automatically handles communication across the entire stack. NCCL (NVIDIA Collective Communications Library, pronounced "Nickel") detects the hardware it is running on and uses the fastest path available.
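A "collective" is an operation that all GPUs take part in together, such as summing everyone's gradients (all-reduce). As a rough illustration of what such a library has to coordinate, here is a pure-Python simulation of a ring all-reduce, one well-known algorithm for this. It is not NCCL's API or its actual implementation. `ring_all_reduce` is a hypothetical function, each "worker" is just a list, and each chunk is a single number.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over n workers, each holding n chunks.

    Every worker only ever sends to its right-hand neighbour in the ring.
    Returns the final buffer of every worker.
    """
    n = len(buffers)
    data = [list(buffer) for buffer in buffers]

    # Phase 1, reduce-scatter: after n - 1 steps, worker r holds the
    # complete sum for chunk (r + 1) % n.
    for step in range(n - 1):
        # Collect all messages first so the sends happen "simultaneously".
        sends = [(r, (r - step) % n) for r in range(n)]
        messages = [(r, idx, data[r][idx]) for r, idx in sends]
        for r, idx, value in messages:
            data[(r + 1) % n][idx] += value

    # Phase 2, all-gather: pass the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n) for r in range(n)]
        messages = [(r, idx, data[r][idx]) for r, idx in sends]
        for r, idx, value in messages:
            data[(r + 1) % n][idx] = value

    return data


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(ring_all_reduce(gradients))
```

Every worker ends up with the same summed vector, `[111, 222, 333]`, even though each one only ever talked to a single neighbour. That pattern maps naturally onto the fast point-to-point links described earlier.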