LLMs like GPT-4 and Claude can perform impressive tasks when given long, detailed prompts. However, longer prompts quickly increase the cost of using these models.
There are several ways to optimize models and reduce costs, such as fine-tuning or quantization, but these techniques do not apply to private LLMs like GPT-4 that are only accessible through an API.
A different approach to reducing costs is compressing the prompts and removing parts that will not affect the model’s performance. This approach is based on the premise that natural language contains inherent redundancies that can be removed without losing information.
LLMLingua, a new prompt compression technique developed by Microsoft, uses a small LLM to measure the perplexity of different parts of the prompt and removes the parts that do not affect the response of the main model (e.g., GPT-4).
LLMLingua is available as an open-source library and you can integrate it into your applications.
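To make the idea concrete, here is a toy sketch of perplexity-based compression. This is not the LLMLingua implementation: a simple unigram frequency model stands in for the small LLM, each token is scored by its surprisal (negative log probability), and the most predictable tokens are dropped. The function name and `keep_ratio` parameter are illustrative; the real library computes perplexity with an actual autoregressive model.

```python
import math
from collections import Counter

def compress_prompt(prompt: str, keep_ratio: float = 0.6) -> str:
    """Toy perplexity-style compression: keep the highest-surprisal tokens.

    A unigram frequency model is a stand-in for the small LLM that
    LLMLingua uses; frequent (predictable) tokens score low and get cut.
    """
    tokens = prompt.split()
    counts = Counter(t.lower() for t in tokens)
    total = len(tokens)
    # Surprisal: -log p(token). Lower surprisal = more redundant.
    surprisal = {t: -math.log(counts[t.lower()] / total) for t in tokens}
    n_keep = max(1, int(len(tokens) * keep_ratio))
    # Rank token positions by surprisal; keep the top n_keep in original order.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: surprisal[tokens[i]], reverse=True)
    keep = sorted(ranked[:n_keep])
    return " ".join(tokens[i] for i in keep)

# Repeated filler words are pruned first, rarer content words survive.
print(compress_prompt("the cat sat on the mat the cat", keep_ratio=0.5))
```

For real use, the open-source library exposes a `PromptCompressor` class with a `compress_prompt` method (check the repository for the current API and parameters).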
Read all about LLMLingua on TechTalks.
Recommendations:
My go-to platform for working with GPT-4 and Claude is ForeFront.ai, which has a flexible pricing plan and plenty of good features for writing and coding. I use ForeFront for all kinds of tasks, including writing, coding, and testing new prompting techniques.