What to know about Claude's prompt caching feature
Claude's new prompt caching feature enables you to considerably cut the costs of using the LLM and make your applications faster.
Anthropic has just released a prompt caching feature for its Claude large language models (LLMs), which reduces the computational costs of using them. The feature will have a significant impact on LLM applications with long prompts.
The feature is still in beta. According to Anthropic, using prompt caching can reduce costs by up to 90% and latency by up to 85% for long prompts.
The stated use cases are applications that use very long instruction sets, documents and books, and code bases. The common denominator of all these applications is that the conversation includes multiple interactions with the model and a very long part of the prompt remains unchanged throughout the conversation.
The pricing for prompt caching is interesting. Writing a prompt to the cache is more expensive than the basic cost of input tokens. For example, Claude 3.5 Sonnet costs $3 per million input tokens and $3.75 per million tokens written to the cache. But reusing cached tokens costs $0.30 per million tokens, which is an order of magnitude cheaper. This means you will only start seeing returns once you reuse the cached prompt across multiple interactions with the model. The cost of output tokens is not affected by prompt caching (which is logical, because token generation uses a different mechanism).
Anthropic shares some stats on how its customers have reduced latency and cost through prompt caching. For example, when chatting with a book loaded as a 100,000-token cached prompt, developers were able to reduce costs by 90% and “time to first token” by 79%. On a 10,000-token prompt, latency was reduced by a smaller margin (31%), but costs still dropped by 86%.
What is missing here is the number of interactions with the model. Prompt caching becomes useful when you reuse the same prompt multiple times in a single conversation. For example, if you’re doing question answering on a long book and only ask one question, caching will actually make the exchange more expensive. But the more you reuse the cached prompt, the more you save on input tokens (while still spending the same amount on output tokens).
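To put rough numbers on this, here is a quick back-of-the-envelope calculation using the Claude 3.5 Sonnet rates quoted above. The prompt size and number of queries are made-up values for illustration, and output tokens are left out since their price does not change.

```python
# Rough cost comparison for a long, reused prompt (Claude 3.5 Sonnet rates quoted above).
# The prompt size and query count below are illustrative, not real benchmarks.

PRICE_INPUT = 3.00 / 1_000_000        # $ per regular input token
PRICE_CACHE_WRITE = 3.75 / 1_000_000  # $ per token when writing to the cache
PRICE_CACHE_READ = 0.30 / 1_000_000   # $ per token when reading from the cache

prompt_tokens = 100_000  # e.g., a book loaded into the context window
queries = 10             # number of questions asked against the same prompt

without_cache = queries * prompt_tokens * PRICE_INPUT
with_cache = (prompt_tokens * PRICE_CACHE_WRITE                     # first call: cache write
              + (queries - 1) * prompt_tokens * PRICE_CACHE_READ)   # later calls: cache reads

print(f"Without caching: ${without_cache:.2f}")  # roughly $3.00
print(f"With caching:    ${with_cache:.2f}")     # roughly $0.65
```

With a single question, the cached run would cost $0.375 against $0.30 without caching, so the savings only kick in from the second reuse onward.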
Prompt caching is especially useful for few-shot and many-shot learning, where you try to customize the LLM to perform a new task based on examples provided in the prompt. Since the training examples do not change while the model is being used, you can cache them in your first request and avoid paying the extra cost in subsequent requests.
It is also useful for in-context learning with large documents, which is becoming an easy-to-use alternative to retrieval-augmented generation (RAG). Instead of retrieving specific chunks of information from a database for each new request, you can insert your entire knowledge base into the prompt and have the model choose the parts that are relevant to a given request. With the context windows of LLMs reaching millions of tokens, in-context learning with long documents is becoming an attractive option for developing LLM applications that require bespoke information.
How prompt caching works
From the client’s perspective, you mark the parts of your prompt that should be cached when sending them to the API for the first time. On subsequent API calls, you send the same prompt prefix again, and the server reuses the cached computation instead of reprocessing those tokens from scratch, billing them at the much lower cache-read rate. This is especially valuable for prompts that include large documents or images.
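To make this concrete, here is a minimal sketch using the Anthropic Python SDK, based on the beta documentation available at the time of writing. The beta header value, model name, and file path are illustrative assumptions, so check the current docs for exact parameter names before relying on them.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

book_text = open("book.txt").read()  # the long, unchanging part of the prompt

def ask(question: str):
    # The "cache_control" marker asks the API to cache everything up to this block.
    # The beta header and field names follow the docs at the time of writing and
    # may change as the feature matures.
    return client.messages.create(
        model="claude-3-5-sonnet-20240620",
        max_tokens=1024,
        extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
        system=[
            {"type": "text", "text": "You answer questions about the attached book."},
            {
                "type": "text",
                "text": book_text,
                "cache_control": {"type": "ephemeral"},  # mark this block for caching
            },
        ],
        messages=[{"role": "user", "content": question}],
    )

first = ask("Who is the narrator?")       # pays the cache-write rate for the book
second = ask("Summarize chapter three.")  # reuses the cached book at the cache-read rate
print(second.usage)                       # usage stats report cached vs. regular input tokens
```

The same pattern applies to the few-shot and knowledge-base use cases above: the stable examples or documents go in the cached block, and only the final user message changes between calls.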
According to Claude’s documentation, the cache has a “5-minute lifetime, refreshed each time the cached content is used.” This means that if you don’t reuse the cached prompt for five minutes, you have to pay the cache-write price again. This will not be a problem if your application is used frequently. Otherwise, you might end up paying the caching premium only to have the cache expire before you reuse it.
Anthropic does not provide technical details on how prompt caching works behind the scenes. But from what we know of LLMs, there must be more at work than just caching your prompt data on the servers.
LLMs like Claude are “autoregressive models” that generate one token at a time. At each step, the model calculates attention over the previous tokens to predict which token is most likely to appear next. Calculating attention is very expensive, especially as the sequence becomes longer. However, since each token only depends on the tokens before it, you can save a lot of computation by caching the key and value vectors of previous tokens and only computing them for the new ones.
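To illustrate the idea (this is a conceptual sketch, not Anthropic’s implementation), here is a stripped-down single-head attention loop in Python with made-up projection matrices, showing how cached keys and values let each new token avoid recomputing the past:

```python
import numpy as np

d = 64                       # embedding/head dimension (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))  # toy projection matrices

k_cache, v_cache = [], []    # the "KV cache": keys/values of tokens already processed

def attend(x_new):
    """Process one new token embedding, reusing cached keys/values of past tokens."""
    # Only the new token's key and value are computed; older ones come from the cache.
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    K = np.stack(k_cache)          # (seq_len, d)
    V = np.stack(v_cache)
    q = x_new @ W_q                # query for the new token only
    scores = K @ q / np.sqrt(d)    # attention of the new token over all tokens so far
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()       # softmax
    return weights @ V             # attention output for the new token

for token_embedding in rng.standard_normal((5, d)):  # feed 5 toy tokens one at a time
    out = attend(token_embedding)
```

Each call computes only one new query, key, and value; everything already seen is read from the cache, which is the work a server-side prompt cache lets the model skip across requests.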
This is often referred to as KV caching and is used by most LLM services, which is why input tokens are usually priced much lower than output tokens. It is also why, in conversations with very long system prompts, the first token takes a long time to generate but subsequent tokens come out faster. Other mechanisms, such as paged attention and multi-query attention, further optimize the attention computation and are used regularly in model-serving platforms.
Anthropic is probably layering additional mechanisms on top of KV caching to further improve the speed and reduce the costs of cached prompts. Google Gemini has a similar feature called “Context Caching.” Other platforms are likely to add similar features soon.
This feature will further heat up the competition between LLM providers. It means better pricing for application developers and less friction to adopt LLMs in applications.