What:

A way of making an LLM easier and cheaper to run, relevant in High Performance Machine Learning.

How:

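  • Unstructured Pruning: Zero out the individual weights whose values are almost 0 (e.g. 0.000001); in a large model there can be millions of them. The matrix keeps its original shape, and normal GPUs aren't great at processing scattered zeros efficiently, so this tends to save storage more than compute time.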
  • Structured Pruning: Instead of zeroing individual weights scattered across the matrix, we delete whole rows, columns, or attention heads. What remains is a smaller but still dense matrix, which GPUs can process much faster.
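
A minimal sketch of the two approaches, in plain Python on a toy 4x4 weight matrix. The helper names (`prune_unstructured`, `prune_rows`), the threshold of 1e-3, and the choice of ranking rows by L1 norm are all assumptions made for illustration, not a fixed recipe; real frameworks offer their own pruning utilities.

```python
# Illustrative only: helper names, threshold, and the L1-norm ranking are assumptions.

def prune_unstructured(weights, threshold):
    """Zero every weight whose magnitude is below threshold; the shape is unchanged."""
    return [[0.0 if abs(w) < threshold else w for w in row] for row in weights]


def prune_rows(weights, keep):
    """Keep the `keep` rows with the largest L1 norm; the result is a smaller dense matrix."""
    norms = [sum(abs(w) for w in row) for row in weights]
    strongest = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)[:keep]
    # Restore the original row order among the survivors.
    return [weights[i] for i in sorted(strongest)]


# Toy weight matrix: row 1 is almost entirely zero, other rows have a few tiny entries.
W = [
    [0.80, 0.000001, -0.50, 0.30],
    [0.000002, 0.000001, 0.000003, -0.000001],
    [-0.40, 0.60, 0.000001, 0.90],
    [0.000001, -0.70, 0.20, 0.000002],
]

sparse_W = prune_unstructured(W, threshold=1e-3)
small_W = prune_rows(W, keep=3)

zeros = sum(w == 0.0 for row in sparse_W for w in row)
print(len(sparse_W), len(sparse_W[0]), zeros)  # still 4x4, with zeros scattered through it
print(len(small_W), len(small_W[0]))           # 3x4: one whole row removed
```

Unstructured pruning leaves a matrix of the same size with zeros scattered through it, while row pruning returns a genuinely smaller dense matrix. That difference is why the second form maps more directly onto ordinary GPU matrix multiplication.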