Prompt Caching

Reusing computed context to reduce latency and cost in AI applications.


Prompt caching is an optimization technique that stores computed model state from previous requests. This allows subsequent requests with identical prefixes to skip redundant processing.

Mechanism

When a language model processes tokens, its attention layers build key and value tensors (the KV cache) representing the computed context. Prompt caching saves this state at specified breakpoints.

On subsequent requests with the same prefix, the system detects the cached prefix and loads precomputed state instead of reprocessing. Only new tokens after the cache point are processed. This reduces time-to-first-token and input token costs.
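The lookup can be pictured with a toy sketch. This is an illustration only, not the provider's actual implementation; the saved "state" string stands in for real KV tensors, and `PrefixCache` is a hypothetical name:

```python
import hashlib

class PrefixCache:
    """Toy model of prompt caching: map a hash of the prompt prefix
    to its (stand-in) precomputed attention state."""

    def __init__(self):
        self._store = {}

    def _key(self, prefix_tokens):
        return hashlib.sha256(" ".join(prefix_tokens).encode()).hexdigest()

    def process(self, prefix_tokens, suffix_tokens):
        key = self._key(prefix_tokens)
        if key in self._store:
            # Cache hit: reuse saved state, process only the new tokens.
            state = self._store[key]
            processed = len(suffix_tokens)
        else:
            # Cache miss: process everything, then save state at the breakpoint.
            state = f"state-for-{key[:8]}"  # stand-in for real KV tensors
            self._store[key] = state
            processed = len(prefix_tokens) + len(suffix_tokens)
        return state, processed

cache = PrefixCache()
prefix = ["system", "prompt"] * 1000        # 2,000 tokens of stable context
_, first = cache.process(prefix, ["query", "one"])
_, second = cache.process(prefix, ["query", "two"])
# first request processes prefix + suffix; second processes the suffix only
```

The second request touches 2 tokens instead of 2,002, which is where the latency and cost reduction comes from.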

Pricing

Anthropic's prompt caching charges 25% more than the base input token price for cache writes, paid each time an entry is written (typically once per TTL window for a stable prefix). Cache reads cost 90% less than the base input token price.

For a 2,000-token system prompt called 1,000 times, processing without caching requires 2,000,000 input tokens. With caching, the system processes and writes the 2,000-token prompt once (billed at the 25% write surcharge), then serves the remaining 999 calls as cache reads.
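The arithmetic can be checked directly. The base price below ($3 per million input tokens) is an illustrative placeholder; consult current pricing before relying on it:

```python
# Worked version of the example above: one cache write, 999 cache reads.
BASE = 3.00 / 1_000_000      # $ per input token (illustrative placeholder)
WRITE = BASE * 1.25          # cache writes cost 25% more
READ = BASE * 0.10           # cache reads cost 90% less

prompt_tokens = 2_000
calls = 1_000

without_caching = prompt_tokens * calls * BASE
with_caching = (prompt_tokens * WRITE                  # first call writes
                + prompt_tokens * (calls - 1) * READ)  # 999 calls read

savings = 1 - with_caching / without_caching
print(f"without: ${without_caching:.2f}")  # $6.00
print(f"with:    ${with_caching:.4f}")
```

At these rates the cached workload costs roughly a tenth of the uncached one, consistent with the 90% read discount dominating once the prefix is warm.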

Use Cases

Prompt caching benefits workloads with repeated context. System prompts containing instructions, persona definitions, and tool descriptions rarely change between requests. Document analysis workflows query large documents multiple times. Conversation history grows but the prefix stays stable. Few-shot examples guiding output format remain fixed across requests.

Implementation

The Anthropic API uses cache control markers:

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a financial analyst assistant...",  # 2000 tokens
            "cache_control": {"type": "ephemeral"}  # end of cacheable prefix
        }
    ],
    messages=[
        {"role": "user", "content": "Analyze Q4 revenue trends."}
    ]
)

The cache_control marker caches all content up to and including the block that carries it. Subsequent requests with identical prefixes use the cached state.

Cache Requirements

Caching requires an exact prefix match: any changed character or reordered block causes a cache miss. Dynamic content embedded in the prefix, such as timestamps or request IDs, guarantees a miss on every request.

Effective cache design places stable content first including system prompts, examples, and reference documents. Variable content such as user queries and dynamic data goes last. Dynamic values should not appear in cacheable regions.
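This layout can be enforced at request-assembly time. The helper below (`build_request`, a hypothetical name) keeps the system block byte-identical across calls and pushes the dynamic timestamp into the user turn, outside the cacheable region:

```python
import json
import time

def build_request(user_query):
    """Assemble a request with stable content first, variable content last.
    The system block never changes between calls, so it stays cacheable;
    the timestamp lives in the user message, after the cache breakpoint."""
    stable_system = [
        {
            "type": "text",
            "text": "You are a financial analyst assistant...",  # fixed prompt
            "cache_control": {"type": "ephemeral"},
        }
    ]
    variable_messages = [
        {"role": "user",
         "content": f"[{time.strftime('%Y-%m-%d')}] {user_query}"}
    ]
    return {"system": stable_system, "messages": variable_messages}

a = build_request("Analyze Q4 revenue.")
b = build_request("Summarize Q3 expenses.")
# The cacheable prefix is byte-identical across requests:
assert json.dumps(a["system"]) == json.dumps(b["system"])
```

Had the timestamp been interpolated into the system text instead, every request would rewrite the cache rather than read it.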

Cache Lifetime

Anthropic's ephemeral caches have a time to live (TTL) of at least 5 minutes, refreshed each time the cached content is used. High-traffic applications therefore maintain warm caches naturally.
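The refresh-on-use behavior is what keeps busy caches warm. A minimal sketch of that policy (an illustration, not Anthropic's implementation; the injectable clock exists only to make the simulation deterministic):

```python
import time

class EphemeralCache:
    """Sketch of a 5-minute TTL that is refreshed on every hit."""

    TTL = 5 * 60  # seconds

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> expiry timestamp

    def put(self, key):
        self._entries[key] = self._clock() + self.TTL

    def hit(self, key):
        now = self._clock()
        expiry = self._entries.get(key)
        if expiry is None or expiry < now:
            self._entries.pop(key, None)
            return False
        # Each hit pushes expiry another full TTL into the future.
        self._entries[key] = now + self.TTL
        return True

# Simulated clock: requests every few minutes keep the entry alive.
t = [0.0]
cache = EphemeralCache(clock=lambda: t[0])
cache.put("system-prompt")
t[0] = 240.0
assert cache.hit("system-prompt")      # 4 min later: hit, TTL refreshed
t[0] = 520.0
assert cache.hit("system-prompt")      # still warm thanks to the refresh
t[0] = 900.0
assert not cache.hit("system-prompt")  # idle > 5 min: entry expired
```

As long as requests arrive less than one TTL apart, the entry never expires, which is the "warm cache" effect described above.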


Application in Agent Systems

For agent loops running multi-step tasks, caching becomes architectural. Tool schemas included in every request should be cached. Retrieved documents can form cacheable prefixes for follow-up reasoning. During iterative refinement, stable context remains cached while only the latest attempt varies.
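One concrete pattern puts a cache_control marker on the last tool definition so the whole tool list forms the cacheable prefix, mirroring the system-block example earlier. The helper below (`agent_step`, a hypothetical name) shows the shape of each loop iteration; the tool definitions are illustrative:

```python
# Tool schemas are identical on every iteration, so they cache well.
tools = [
    {
        "name": "get_stock_price",
        "description": "Look up the latest price for a ticker symbol.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
    },
    {
        "name": "get_filings",
        "description": "Fetch recent regulatory filings for a company.",
        "input_schema": {
            "type": "object",
            "properties": {"ticker": {"type": "string"}},
            "required": ["ticker"],
        },
        "cache_control": {"type": "ephemeral"},  # breakpoint after all tools
    },
]

def agent_step(history, new_message):
    """One agent-loop request: the stable tools prefix is reused verbatim,
    while only the growing message history varies between iterations."""
    return {"tools": tools, "messages": history + [new_message]}
```

Because `tools` is the same object on every call, each iteration's request shares the cached prefix and pays read pricing for it.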

Metrics

Key performance indicators include cache hit rate (percentage of requests using cached state), tokens saved (input tokens that avoided reprocessing), latency reduction (time-to-first-token improvement), and cost savings (reduction in API spend).
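These indicators can be computed from the per-request usage counters the API returns. The field names below follow the token fields reported in the Anthropic response usage object; `cache_metrics` is a hypothetical helper:

```python
def cache_metrics(usages):
    """Aggregate cache KPIs from a list of per-request usage dicts with
    input_tokens, cache_creation_input_tokens, cache_read_input_tokens."""
    total = len(usages)
    hits = sum(1 for u in usages if u["cache_read_input_tokens"] > 0)
    tokens_saved = sum(u["cache_read_input_tokens"] for u in usages)
    return {
        "cache_hit_rate": hits / total if total else 0.0,
        "tokens_saved": tokens_saved,  # input tokens served from cache
    }

usages = [
    {"input_tokens": 12, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
    {"input_tokens": 15, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"input_tokens": 9,  "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
]
m = cache_metrics(usages)  # 2 of 3 requests hit the cache
```

Cost savings follow by pricing the saved tokens at the read discount rather than the base rate.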

Applications typically achieve 30-70% cost reduction on input tokens with proper caching implementation.