Pruning

A way of making an LLM easier and cheaper to run by removing the parameters that contribute least to its output. Relevant in High Performance Machine Learning.

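Unstructured Pruning: Literally just zeroing out the millions of individual weights that are almost 0 (e.g. 0.000001). The matrices keep their original shape, and normal GPUs aren’t great at skipping scattered zeros, so the speedup is usually smaller than the number of zeros suggests.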
Structured Pruning: Instead of deleting scattered individual weights, we delete entire rows, columns or attention heads. What remains is a smaller but still dense matrix, which GPUs can process much faster.
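
To make the difference concrete, here is a minimal sketch of both ideas in plain Python. The toy matrix `W`, the `1e-3` threshold, the helper names `prune_unstructured` and `prune_rows`, and the choice of ranking rows by L1 norm are all assumptions made for illustration; real pruning code works on framework tensors and may use other importance scores.

```python
# Toy weight matrix: 4 rows x 4 columns (values are made up for illustration).
W = [
    [0.90, 0.000001, -0.70, 0.00002],
    [0.00001, 0.000003, 0.00002, -0.00001],
    [-0.50, 0.80, 0.000004, 0.60],
    [0.30, -0.00002, 0.40, -0.90],
]


def prune_unstructured(weights, threshold=1e-3):
    """Zero every weight whose magnitude is below `threshold`.

    The matrix keeps its shape; it just gains scattered zeros.
    """
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def prune_rows(weights, keep):
    """Structured pruning sketch: keep the `keep` rows with the largest L1 norm.

    Whole rows are dropped, so the result is a smaller dense matrix.
    """
    ranked = sorted(
        range(len(weights)),
        key=lambda i: sum(abs(w) for w in weights[i]),
        reverse=True,
    )
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i] for i in kept]


sparse = prune_unstructured(W)
smaller = prune_rows(W, keep=3)

zeros = sum(w == 0.0 for row in sparse for w in row)
print(len(sparse), len(sparse[0]), zeros)  # 4 4 8  (same shape, half the entries zeroed)
print(len(smaller), len(smaller[0]))       # 3 4    (one whole row removed)
```

The unstructured result is still a 4 x 4 matrix, so a dense matrix multiply does the same amount of work as before. The structured result is genuinely smaller, which is where the speedup comes from.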

~/leocamacho.co