Technical
Prompt Caching in Production: Cutting Costs Without Cutting Quality
Token costs add up fast when you run AI in production. My monthly bill tripled between July and September as I added more agent workflows. Prompt caching is the single biggest lever I found for bringing that bill back down without losing any quality.
What Prompt Caching Does
Most AI workflows send the same system prompt, the same style guide, and the same project context on every request. That is a lot of tokens getting retransmitted for no reason. Prompt caching tells the model provider to hold those tokens in a short-lived cache so the next request can reuse them at a steep discount.
On Anthropic's API, cached input tokens cost about 10% of the normal rate. Over a workflow that calls the API fifty times with a shared preamble, that is a massive saving. In my case, prompt caching cut my daily spend by roughly 60% for the workflows that use it.
What To Cache
Cache the things that change rarely:
- System prompts describing the agent's role
- Style guides and voice specifications
- Project context like repository structure or domain glossaries
- Large reference documents the agent consults across many calls
Do not cache:
- Per-user data that changes every request
- Conversation history that grows with each turn
- Anything smaller than the minimum cacheable block size
Code Structure
messages = [
{
"role": "system",
"content": SYSTEM_PROMPT,
"cache_control": {"type": "ephemeral"},
},
{
"role": "user",
"content": current_user_message,
},
]That single cache_control marker flips the cost curve for that request. The first call primes the cache, every subsequent call within the cache window pays the discounted rate.
The Tradeoff
Cache entries are short-lived. If your workflow has long gaps between calls, the cache expires and you pay the full rate to re-prime it. For bursty workloads it is free money. For sparse workloads it is a wash.
Read the Anthropic prompt caching docs before you enable it.
RELATED READING
The Consulting Shift I Am Making In Year Two
After a year of writing and building, my consulting practice is changing shape. Shorter engagements. Sharper outcomes.
ReadThe Frontend Shift: Shipping Less JavaScript In Year Two
A year ago I reached for Next.js for everything. This year I often reach for nothing.
ReadThe Serverless Lesson I Would Write On A Sticky Note
After a year of shipping serverless projects, one rule explains most of the wins and all of the losses.
Read