High Performance Machine Learning

Background:

Nearly all machine learning models nowadays are behemoth engineering challenges. We should optimise them so that the bottleneck is the most expensive (and upgradable) part of the system: the GPU.

Simple Bottlenecks on GPUs:

I/O Bottlenecks: Reading from the hard drive and off-loading work to the CPU. Fix: use multiple CPU worker processes to load and pre-fetch data.
Kernel Proliferation: Read it.
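The pre-fetching fix for I/O bottlenecks can be sketched with the standard library alone. This is a minimal illustration, not a production loader: `load_batch` is a hypothetical stand-in for reading a batch from disk, and threads stand in for the worker processes to keep the example self-contained. In PyTorch, the same idea is roughly what `DataLoader` does when `num_workers` and `prefetch_factor` are set.

```python
import time
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def load_batch(index):
    """Hypothetical loader: pretend to read one batch of 4 items from disk."""
    time.sleep(0.01)  # stands in for disk latency
    return [index * 4 + i for i in range(4)]


def prefetching_loader(num_batches, num_workers=4, prefetch=8):
    """Yield batches in order while background workers load ahead."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        pending = deque()
        next_index = 0
        while next_index < num_batches or pending:
            # Keep up to `prefetch` loads in flight ahead of the consumer.
            while next_index < num_batches and len(pending) < prefetch:
                pending.append(pool.submit(load_batch, next_index))
                next_index += 1
            # Hand the oldest batch to the consumer, waiting only if needed.
            yield pending.popleft().result()


for batch in prefetching_loader(num_batches=16):
    pass  # the training step would consume `batch` here
```

While the consumer is busy with one batch, the workers are already loading the next few, so the consumer (the GPU, in the real setting) spends less time waiting on I/O.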

GPU Parallelism:

Your model, with billions of weights, may not fit in a single GPU’s memory. Or it may fit, but take years to train on a single GPU. Thus, we parallelise it.

There are different ways to parallelise DNN training:

Data Parallelism (FSDP)
Model Parallelism
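To make data parallelism concrete, here is a toy sketch in plain Python. It assumes a one-parameter model `y = w * x` with a mean-squared-error loss, and it simulates the GPUs as list entries; all names are illustrative. Every simulated worker holds the same weight, computes a gradient on its own shard of the batch, and the gradients are then averaged (the role an all-reduce plays on real hardware). Note that this shows plain replicated data parallelism; FSDP additionally shards the parameters, gradients, and optimiser state across workers, which is not modelled here.

```python
def gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, batch, num_workers, lr=0.01):
    """One SGD step with the batch split evenly across simulated workers."""
    # Split the batch into equal shards, one per simulated GPU.
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    # Each worker computes a gradient on its shard using the same weight.
    local_grads = [gradient(w, shard) for shard in shards]
    # Average the gradients so every replica applies the same update.
    avg_grad = sum(local_grads) / num_workers
    return w - lr * avg_grad


# Toy data generated with a true weight of 3.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = data_parallel_step(0.0, batch, num_workers=1)
w_parallel = data_parallel_step(0.0, batch, num_workers=4)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so `w_single` and `w_parallel` agree up to floating-point error. The parallel version simply spreads the work across devices.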

Making the Model Cheaper:

Quantisation
Pruning
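Both techniques can be illustrated on a plain list of weights. The sketch below assumes the simplest variants: symmetric per-tensor int8 quantisation (one scale for the whole list) and unstructured magnitude pruning (zero the smallest weights). Function names are illustrative, and real frameworks offer many more refined schemes.

```python
def quantise_int8(weights):
    """Symmetric per-tensor quantisation: floats -> int8 codes plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantise(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    num_pruned = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:num_pruned]:
        pruned[i] = 0.0
    return pruned


weights = [0.02, -1.27, 0.5, -0.01, 0.9, 0.3]
codes, scale = quantise_int8(weights)
restored = dequantise(codes, scale)
sparse = prune_by_magnitude(weights, sparsity=0.5)
```

Quantisation stores each weight in 8 bits instead of 32 at the cost of a rounding error of at most half a scale step. Pruning keeps full precision but makes half of the weights exactly zero, which only saves memory or compute if the storage format or kernels exploit the sparsity.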

How Do GPUs Talk To Each Other?

Intra- and Inter-Node GPU Communication (NVLink, NVSwitch, RDMA)
