What

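Taking an LLM trained in FP32 and converting it to FP16 or INT8. Going from FP32 to INT8 cuts each weight from 4 bytes to 1, so a 230GB model shrinks to roughly 58GB if every weight is converted (somewhat more in practice if some layers are kept in higher precision). It’s one of the optimisations we can make on a GPU for High Performance Machine Learning.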
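As a rough check on the size arithmetic, here is a minimal sketch that assumes every weight is converted and ignores the small overhead of storing scales and zero points. The helper name `converted_size_gb` is made up for illustration.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"FP32": 4, "FP16": 2, "INT8": 1}

def converted_size_gb(size_gb, src, dst):
    """Model size after converting every weight from src to dst precision."""
    return size_gb * BYTES_PER_VALUE[dst] / BYTES_PER_VALUE[src]

for dst in ("FP16", "INT8"):
    print(dst, converted_size_gb(230, "FP32", dst), "GB")
```

Real checkpoints usually land a little above these figures, because some tensors tend to stay in higher precision.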

How:

Remember: FP32 values are 32-bit floating-point numbers. INT8 values are 8-bit integers between -128 and 127.

1. Post-Training Quantisation:

After training, you freeze the model and convert the FP32 weights by scaling them to the INT8 range, then rounding each one to the nearest integer. The scale and zero_point are kept so the real value can be approximately recovered:

real_value ≈ scale × (int8_value - zero_point)
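The formula above can be sketched in a few lines of plain Python. This is one common way to pick `scale` and `zero_point` (min/max calibration over a single list of weights), not the only one; the function names and the sample weights are assumptions made for illustration.

```python
QMIN, QMAX = -128, 127  # signed INT8 range

def choose_params(values):
    """Pick scale and zero_point so the range of values maps onto [QMIN, QMAX]."""
    lo = min(min(values), 0.0)  # widen the range so 0.0 is exactly representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN)
    if scale == 0.0:            # all values are zero; any positive scale works
        scale = 1.0
    zero_point = round(QMIN - lo / scale)
    return scale, max(QMIN, min(QMAX, zero_point))

def quantize(x, scale, zero_point):
    """real value -> INT8 value, clamped to the representable range."""
    q = round(x / scale) + zero_point
    return max(QMIN, min(QMAX, q))

def dequantize(q, scale, zero_point):
    """INT8 value -> approximate real value: scale * (q - zero_point)."""
    return scale * (q - zero_point)

weights = [-1.0, -0.25, 0.0, 0.5, 1.5]  # made-up FP32 weights
scale, zero_point = choose_params(weights)
for w in weights:
    q = quantize(w, scale, zero_point)
    print(w, "->", q, "->", dequantize(q, scale, zero_point))
```

For values inside the calibrated range, the round trip is off by at most half a step (`scale / 2`). Note that Python's `round` rounds exact ties to the nearest even integer, which is harmless here.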

Problems:

Quantisation Noise: Training put a lot of effort into delicately balancing the weights. A weight of 0.4999 was similar to, but distinct from, 0.5001. After naive rounding (taking a scale of 1 for illustration), they become 0 and 1. A slight nuance has turned into a massive difference (one completely off and the other on).
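A small sketch of the effect, using the two weights from the example. The helper `snap_to_grid` and the two step sizes are chosen purely for illustration.

```python
def snap_to_grid(x, scale):
    """Quantise then dequantise: snap x to the nearest multiple of scale."""
    return scale * round(x / scale)

a, b = 0.4999, 0.5001

# Coarse grid (step 1): the two nearly equal weights land on different levels.
coarse = (snap_to_grid(a, 1.0), snap_to_grid(b, 1.0))

# Finer grid (step 0.01): both land on the same level, so the distinction is lost.
fine = (snap_to_grid(a, 0.01), snap_to_grid(b, 0.01))

print("coarse:", coarse)
print("fine:  ", fine)
```

Either way the original difference of 0.0002 is not preserved: it is either blown up to a full step or erased.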

2. Quantisation-Aware Training:

You’ve pre-trained your model and are now fine-tuning it. The weights are stored in FP32. Before each forward pass, we simulate quantisation by rounding the weights (e.g. 0.4999 -> 0). We then pass the rounded value (0) through the network, so the loss reflects the rounding error.

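Then, during backpropagation, we treat the rounding step as if it were the identity: the gradients computed with the rounded weights are passed straight through to the hidden high-precision FP32 weights (the straight-through estimator). This is needed because rounding has a zero gradient almost everywhere, so no learning signal would reach the weights otherwise.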
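To make the forward/backward split concrete, here is a toy sketch in plain Python: a single weight fitted to `y = 0.73 * x` by gradient descent, with the weight rounded to a grid of step 0.1 in the forward pass. The data, step size, learning rate and function name are all assumptions chosen for illustration; a real setup would use a framework's fake-quantisation ops rather than hand-written gradients.

```python
SCALE = 0.1  # hypothetical quantisation step

def fake_quant(w, scale=SCALE):
    """Round w onto the quantisation grid but keep it as a float."""
    return scale * round(w / scale)

# Toy data for y = 0.73 * x; the target weight is deliberately off the grid.
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.73 * x for x in xs]

w = 0.2    # high-precision "hidden" weight that the optimiser updates
lr = 0.05

for step in range(200):
    w_q = fake_quant(w)  # the forward pass sees the rounded weight
    # Gradient of the mean squared error with respect to the rounded weight.
    grad_wq = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Straight-through estimator: pretend d(w_q)/d(w) = 1 and update w directly.
    w -= lr * grad_wq

print("hidden weight:", w, "quantised weight:", fake_quant(w))
```

The hidden weight keeps accumulating small updates that would each be invisible after rounding, and ends up hovering near the boundary between the two grid values closest to the target.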

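Quantisation can roughly double throughput, since each weight takes fewer bytes to move and low-precision arithmetic is faster on hardware that supports it. QAT typically costs little accuracy, because the model learns to compensate for the rounding during fine-tuning.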

Summary

Quantisation allows for smaller models.