The fastest and most efficient prompt compression technique
LLMLingua-2 is a prompt compression technique by Microsoft that can reduce the size of prompts by up to five times.
Microsoft Research has released LLMLingua-2, a task-agnostic prompt compression technique that outperforms similar methods.
LLMLingua-2 formulates prompt compression as a token classification task (as opposed to other methods, which estimate per-token entropy). It then trains a small encoder-only transformer model to predict whether each token in a prompt should be preserved or dropped.
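The keep-or-drop idea can be sketched in a few lines. This is a toy illustration, not the actual LLMLingua-2 implementation: the `scores` here are hypothetical per-token "preserve" probabilities standing in for the output of the trained encoder classifier, and the selection rule (keep the top fraction of tokens by score, in original order) is a simplified stand-in for the paper's procedure.

```python
def compress(tokens, scores, rate=0.5):
    """Keep the top `rate` fraction of tokens by score, preserving order.

    `scores[i]` is a classifier's probability that token i should be kept;
    in LLMLingua-2 these come from a small encoder transformer, while here
    they are supplied by hand for illustration.
    """
    k = max(1, int(len(tokens) * rate))
    # indices of the k highest-scoring tokens, restored to original order
    keep = sorted(sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in keep]

tokens = ["Please", "kindly", "summarize", "the", "attached", "quarterly", "report"]
# toy keep-probabilities: content words score high, filler words low
scores = [0.2, 0.1, 0.9, 0.3, 0.6, 0.8, 0.9]
print(compress(tokens, scores, rate=0.5))  # ['summarize', 'quarterly', 'report']
```

At a 0.5 rate the filler tokens are dropped and the information-bearing ones survive, which is the intuition behind training a classifier on examples of good compressions rather than relying on a causal LM's entropy estimates.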
Experiments show that LLMLingua-2 can compress large prompts by 2-5x and is 3-6x faster than other task-agnostic techniques. It can be a very useful tool for cutting the costs of LLM applications that rely on very long prompts.
I spoke to the authors of the paper about the background of the paper, its connection to the original LLMLingua paper, and how it can be used in applications.
Read the full article on TechTalks and the paper on arXiv.
For more on AI research: