Everything to know about LLM compression

Large language models (LLMs) such as LLaMA 2 and Falcon can require dozens, if not hundreds, of gigabytes of GPU memory. They are expensive to run, challenging to set up, and often require access to powerful cloud servers.
To overcome these hurdles, researchers have been developing LLM compression techniques that make models more compact and able to run on devices with limited resources.
Different LLM compression techniques:
Pruning techniques remove model components that contribute little or nothing to the output.
Unstructured pruning involves zeroing out irrelevant parameters without considering the model's structure, creating sparse models.
Structured pruning involves removing entire parts of a model, such as neurons, channels, or layers.
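To make this concrete, here is a minimal sketch of both pruning styles applied to a single linear layer using PyTorch's torch.nn.utils.prune utilities (the layer size and pruning amounts are illustrative assumptions; in a real LLM, pruning is applied across many layers and is usually followed by fine-tuning):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy linear layer standing in for one layer of an LLM.
layer = nn.Linear(in_features=1024, out_features=1024)

# Unstructured pruning: zero out the 30% of weights with the smallest
# L1 magnitude, regardless of where they sit in the weight matrix.
prune.l1_unstructured(layer, name="weight", amount=0.3)

# Structured pruning: remove entire rows of the weight matrix
# (i.e., output neurons), here the 25% of rows with the smallest L2 norm.
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Fold the pruning masks into the weights to make the sparsity permanent.
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"Fraction of zeroed weights: {sparsity:.2%}")
```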
Knowledge distillation transfers knowledge from a large “teacher” model to a smaller “student” model.
Standard knowledge distillation aims to transfer the general knowledge of the teacher model to the student.
Emergent ability distillation seeks to extract a specific ability that the teacher model has learned and transfer it to the student model.
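As an illustration of standard knowledge distillation, here is a minimal sketch of the usual training loss, which blends cross-entropy on the ground-truth labels with a KL-divergence term that pulls the student's output distribution toward the teacher's (the temperature and weighting values are illustrative assumptions, not figures from the article):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the supervised loss with a soft-label loss that pushes the
    student's output distribution toward the teacher's."""
    # Hard-label loss: standard cross-entropy against the ground truth.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Soft-label loss: KL divergence between temperature-softened
    # distributions; the T^2 factor keeps gradient scales comparable.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy usage: a batch of 4 examples with a 10-class output (hypothetical shapes).
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)          # produced by the frozen teacher
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```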
Quantization converts floating-point parameters (2 or 4 bytes) into single-byte or smaller integers, significantly reducing the size of an LLM.
Quantization-aware training integrates low-precision support into the LLM’s training process.
Quantization-aware fine-tuning adapts a pre-trained high-precision model to maintain its quality with lower-precision weights.
Post-training quantization transforms the parameters of the LLM to lower-precision data types after the model is trained.
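To show the core arithmetic behind these methods, here is a minimal sketch of symmetric per-tensor post-training quantization of a weight matrix to int8 (a deliberately simplified scheme; production quantizers typically add per-channel scales, zero points, and calibration data):

```python
import torch

def quantize_int8(weights: torch.Tensor):
    """Symmetric per-tensor post-training quantization to int8."""
    # One floating-point scale maps the int8 range [-127, 127]
    # onto the observed range of the weights.
    scale = weights.abs().max() / 127.0
    q = torch.clamp(torch.round(weights / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # At inference time the int8 values are mapped back (explicitly, or
    # implicitly inside fused kernels) to approximate the originals.
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)      # fp32 weights: 4 bytes per parameter
q, scale = quantize_int8(w)      # int8 weights: 1 byte per parameter
error = (dequantize(q, scale) - w).abs().max().item()
print(f"Bytes per parameter: {w.element_size()} -> {q.element_size()}, "
      f"max abs error: {error:.4f}")
```

The printout illustrates the storage saving described above: going from 4-byte floats to 1-byte integers cuts the weight memory roughly fourfold, at the cost of a small rounding error per parameter.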
Read the complete guide to LLM compression on TechTalks.
For more AI explainers: