WebUI Svelte App for llama.cpp · ggml-org/llama.cpp
Overview

This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp. The new WebUI in combination with the advanced backend ...
Prompt caching makes cached input tokens 10x cheaper than regular ones on the OpenAI and Anthropic APIs by storing attention-mechanism data, namely the key-value tensors computed for repeated prompt prefixes.
On subsequent requests that share the same prefix, that computation is skipped entirely, reducing time-to-first-token latency by up to 85% for long prompts, as shown in tests with GPT-5 and Sonnet 4.5. During inference, a prompt is tokenized into integer IDs, embedded, and passed through the transformer layers; it is the key-value tensors produced by those layers that get cached, so inference speeds up without any full responses being reused.
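The prefix-reuse idea above can be sketched in a few lines. This is a toy illustration, not the actual OpenAI, Anthropic, or llama.cpp implementation: the class and function names are invented, and an integer stands in for the expensive per-token key-value computation. The point is only the lookup pattern, reuse the longest cached prefix and recompute just the suffix.

```python
def compute_kv(token_id: int) -> int:
    """Stand-in for the expensive per-token key/value computation."""
    return token_id * token_id + 1


class PrefixKVCache:
    """Toy prompt-prefix cache keyed by token-ID tuples (hypothetical API)."""

    def __init__(self):
        # Maps a tuple of token IDs -> list of cached per-token KV values.
        self._cache = {}

    def process(self, tokens):
        """Return (kv_values, tokens_recomputed), reusing the longest cached prefix."""
        best, kv = 0, []
        # Find the longest already-cached prefix of this token sequence.
        for cut in range(len(tokens), 0, -1):
            hit = self._cache.get(tuple(tokens[:cut]))
            if hit is not None:
                best, kv = cut, list(hit)
                break
        # Compute KV values only for the uncached suffix.
        for tok in tokens[best:]:
            kv.append(compute_kv(tok))
        # Cache every new prefix so future requests can reuse them.
        for cut in range(best + 1, len(tokens) + 1):
            self._cache[tuple(tokens[:cut])] = list(kv[:cut])
        return kv, len(tokens) - best


cache = PrefixKVCache()
system_prompt = [101, 102, 103, 104]          # shared prefix (e.g. a system prompt)
_, first = cache.process(system_prompt + [7])  # cold: recomputes all 5 tokens
_, second = cache.process(system_prompt + [8]) # warm: recomputes only 1 token
print(first, second)  # → 5 1
```

In a real transformer the cached values are per-layer key/value tensors rather than integers, and production systems bound cache size and evict stale prefixes, but the latency win comes from exactly this skip-the-shared-prefix lookup.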