LLM Evals: Everything You Need to Know – Hamel Husain
A comprehensive guide to LLM evals, drawn from questions asked in our popular course on AI Evals. Covers everything from basic to advanced topics.
Prompt caching can make cached input tokens roughly 10x cheaper than regular input tokens on the OpenAI and Anthropic APIs. It works by storing attention-mechanism state, the key-value tensors computed for a repeated prompt prefix, so that state does not have to be recomputed on every request.
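As a concrete illustration, here is a minimal sketch of opting into Anthropic's prompt caching with the `cache_control` field on a system block and reading the cache-related usage counters back. The model id, the long system prompt, and the user message are placeholders; check Anthropic's current docs for minimum cacheable prompt lengths and pricing on your model.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Placeholder: a long, stable prefix (rubric, reference doc, few-shot examples)
# that will be reused verbatim across many requests.
long_system_prompt = "..."

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            # Mark the prefix as cacheable; later requests that repeat the
            # exact same prefix are billed as much cheaper cache reads.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Grade this transcript: ..."}],
)

# Usage fields show how much of the prompt was written to vs. read from the cache.
print(response.usage.cache_creation_input_tokens)
print(response.usage.cache_read_input_tokens)
```

On the first call you should see cache-creation tokens; on repeat calls with the identical prefix, cache-read tokens instead.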
On subsequent requests that reuse the same prefix, the provider skips recomputing that state, which can cut time-to-first-token latency by up to 85% for long prompts, as seen in tests with GPT-5 and Sonnet 4.5. The prompt is still tokenized into integer IDs and embedded, but the transformer layers reuse the cached key-value tensors for the prefix and only process the new tokens. The cache stores intermediate computation, not full responses, so the model still generates a fresh output every time.
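For OpenAI, prefix caching is applied automatically once a prompt is long enough; the main lever you control is prompt structure, keeping the stable content at the front and the variable content at the end. A small sketch, assuming a placeholder model id and prompt text, of how you might verify cache hits from the usage details:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Placeholder: long system prompt / instructions kept byte-identical across calls,
# placed first so the shared prefix is as long as possible.
stable_prefix = "..."

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model id
    messages=[
        {"role": "system", "content": stable_prefix},      # stable content first
        {"role": "user", "content": "New question here"},  # variable content last
    ],
)

# Caching is automatic for sufficiently long prompts; the usage details report
# how many input tokens were served from the cache on this request.
print(response.usage.prompt_tokens_details.cached_tokens)
```

Logging `cached_tokens` (or Anthropic's `cache_read_input_tokens`) alongside your eval runs is an easy way to confirm that the latency and cost savings described above are actually being realized.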