How RelayAttention achieves high throughput on LLM applications

RelayAttention is a technique that increases the throughput of LLM servers by reducing memory access to KV values of system prompts.

Mar 12, 2024

Long system prompts can help LLMs perform specialized tasks. However, they also slow down inference and reduce the throughput of the server that hosts the model.

RelayAttention, a new technique developed by researchers at City University of Hong Kong and SenseTime Research, improves the efficiency of LLM services that involve long system prompts.

The main idea behind RelayAttention is to reorganize how the KV values of the system prompt are computed and retrieved, so that redundant memory accesses are removed.
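To make the idea concrete, here is a minimal NumPy sketch of one way such a reorganization could look. It is an illustration under assumptions, not the authors' implementation: the function names (`softmax_attention`, `relay_style_attention`, `naive_attention`) are hypothetical, it covers a single attention head at one decoding step, and the log-sum-exp fusion is a standard identity for splitting softmax attention that is assumed here rather than taken from the text above. The shared system-prompt keys and values are used once for the whole batch through a single matrix product, while each request's own keys and values are handled separately.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Attention for a batch of single-token queries.

    q: (B, d); k, v: (n, d) if shared by all rows, or (B, n, d) if per row.
    Returns the outputs (B, d) and the log-sum-exp of the scores (B,).
    """
    d = q.shape[-1]
    if k.ndim == 2:
        # Shared keys: one matrix product serves the whole batch.
        scores = q @ k.T / np.sqrt(d)
    else:
        scores = np.einsum("bd,bnd->bn", q, k) / np.sqrt(d)
    m = scores.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    probs = np.exp(scores - lse[:, None])
    if v.ndim == 2:
        out = probs @ v
    else:
        out = np.einsum("bn,bnd->bd", probs, v)
    return out, lse


def relay_style_attention(q, k_sys, v_sys, k_req, v_req):
    """Attend to the shared system prompt and to per-request context
    separately, then fuse the two partial results (assumed scheme)."""
    out_sys, lse_sys = softmax_attention(q, k_sys, v_sys)
    out_req, lse_req = softmax_attention(q, k_req, v_req)
    # Share of the total softmax mass that falls on the system prompt.
    w_sys = 1.0 / (1.0 + np.exp(lse_req - lse_sys))
    return w_sys[:, None] * out_sys + (1.0 - w_sys)[:, None] * out_req


def naive_attention(q, k_sys, v_sys, k_req, v_req):
    """Baseline: every request re-reads the system-prompt KV values."""
    outs = []
    for b in range(q.shape[0]):
        k = np.concatenate([k_sys, k_req[b]])
        v = np.concatenate([v_sys, v_req[b]])
        out, _ = softmax_attention(q[b:b + 1], k, v)
        outs.append(out[0])
    return np.stack(outs)


# Toy sizes, chosen arbitrarily for the demonstration.
rng = np.random.default_rng(0)
B, d, n_sys, n_req = 4, 16, 32, 5
q = rng.standard_normal((B, d))
k_sys = rng.standard_normal((n_sys, d))
v_sys = rng.standard_normal((n_sys, d))
k_req = rng.standard_normal((B, n_req, d))
v_req = rng.standard_normal((B, n_req, d))

relay = relay_style_attention(q, k_sys, v_sys, k_req, v_req)
reference = naive_attention(q, k_sys, v_sys, k_req, v_req)
print(np.allclose(relay, reference))
```

The two paths compute the same result. The difference is that the baseline touches `k_sys` and `v_sys` once per request, whereas the reorganized version touches them once per batch, which is where the memory-access saving would come from on real hardware.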

RelayAttention is especially useful for LLM services that share one system prompt across many user requests and process those requests in batches. Empirical results show that it can increase both speed and throughput by up to 2x compared with other optimization techniques such as PagedAttention and PromptCaching.
