What:
A way of making an LLM easier and cheaper to run, relevant in high-performance machine learning.
How:
- Unstructured pruning: zero out the individual weights whose values are almost 0 (e.g. 0.000001); a large model can have millions of them. Normal GPUs aren't great at processing scattered zeros efficiently.
- Structured pruning: instead of deleting scattered individual weights, we delete whole rows, columns, or attention heads. Keeping the matrices dense and regularly shaped lets GPUs process them much faster.
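
A minimal sketch of the two approaches on a toy weight matrix, in plain Python. The function names (`prune_unstructured`, `prune_rows`), the example weights, and the 1e-3 threshold are all assumptions made for illustration; real pruning pipelines pick their thresholds and importance scores differently and usually fine-tune afterwards.

```python
# Toy weight matrix: each row is one output neuron, each column one input.
# The values are made up for illustration.
weights = [
    [0.8, 0.000001, -0.5, 0.3],
    [0.000002, -0.000001, 0.000003, 0.0000005],
    [-0.7, 0.4, 0.0000001, 0.9],
]


def prune_unstructured(matrix, threshold):
    """Zero every individual weight whose magnitude is below threshold.

    The matrix keeps its shape; the zeros end up scattered through it.
    """
    return [[0.0 if abs(w) < threshold else w for w in row] for row in matrix]


def prune_rows(matrix, threshold):
    """Structured variant: drop whole rows whose L2 norm is below threshold.

    The result is a smaller matrix that is still dense.
    """
    kept = []
    for row in matrix:
        norm = sum(w * w for w in row) ** 0.5
        if norm >= threshold:
            kept.append(list(row))
    return kept


# Assumed threshold, chosen only so the toy example shows both effects.
sparse = prune_unstructured(weights, threshold=1e-3)
smaller = prune_rows(weights, threshold=1e-3)

print(len(sparse), len(sparse[0]))    # same shape as the input, with zeros inside
print(len(smaller), len(smaller[0]))  # one row fewer, no zeros introduced
```

The unstructured result has the same number of rows and columns as the input, so a dense matrix multiply does the same amount of work unless the hardware or kernel can exploit the zeros. The structured result is physically smaller, so an ordinary dense multiply gets cheaper with no special support.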