William's Bookmark Library
GitHub - AtomicBot-ai/atomic-llama-cpp-turboquant: llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).
github.com
Saved May 21, 2026
24 min
LLM Inference Optimization
Summary
A llama.cpp fork featuring TurboQuant compression and speculative decoding for Gemma 4 and Qwen 3.6 to boost throughput by up to 50%. It optimizes memory usage via low-bit KV caches and supports multimodal inference on various hardware backends.
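For readers new to speculative decoding, the draft-and-verify idea behind the throughput gain can be sketched as below. This is a generic greedy illustration, not the fork's code: `speculative_decode`, `target`, `draft`, and `k` are hypothetical names, and the bookmark does not show how the Gemma 4 MTP or Qwen 3.6 NextN heads are actually wired in. A cheap drafter proposes a few tokens, the full model checks them, and the output matches what the full model would have produced alone.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token
# (greedy decoding). Both models are stand-ins supplied by the caller.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify loop; returns prompt plus n_new tokens."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The cheap draft model proposes up to k tokens in sequence.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model checks each proposed position. A real engine
        #    would score all positions in one batched forward pass, which is
        #    where the speedup comes from; here it is a plain loop.
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # Keep the agreed prefix, then take the target's own token.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            # Every drafted token was accepted.
            tokens.extend(proposal)
    return tokens
```

The more often the drafter agrees with the full model, the more tokens are accepted per verification step, which is consistent with the gain being reported for specific model and prompt combinations rather than as a fixed figure.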
Highlights
TurboQuant WHT-rotated quantization provides ~4.3x KV cache compression.
Gemma 4 MTP speculative decoding increases short-prompt throughput by 30-50%.
Qwen 3.6 NextN decoding improves speed by 24-36% on MoE models.
Multimodal support allows image inputs to be processed alongside speculative decoding of text.
Compatible with Metal, CUDA, Vulkan, and HIP backends.
auto-generated
AtomicBot-ai
· via GitHub
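The "WHT-rotated" quantization named in the highlights can be illustrated with a small sketch. It assumes a plain orthonormal Walsh-Hadamard transform followed by symmetric 4-bit rounding with one scale per vector; the fork's real kernels, block sizes, and codebooks are not described in this bookmark, and all function names below are hypothetical. The point is only that rotating a vector spreads a single outlier across every coordinate, so a low-bit grid wastes less of its range. For scale, if the ~4.3x figure is measured against a 16-bit cache (an assumption; the baseline is not stated here), it works out to about 16 / 4.3 ≈ 3.7 bits per stored value.

```python
import math
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (self-inverse).
    len(x) must be a power of two."""
    n = len(x)
    assert n and n & (n - 1) == 0, "length must be a power of two"
    out = list(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize(x, bits):
    """Symmetric uniform quantization with a single scale for the vector."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax or 1.0  # avoid a zero scale
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


def rms_error(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def demo(seed=0, n=128, bits=4):
    """Compare 4-bit error with and without the rotation on a vector
    that contains one large outlier (synthetic data, illustration only)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x[0] = 30.0  # a single outlier dominates the quantization range

    # Direct: quantize the raw values.
    direct = dequantize(*quantize(x, bits))

    # Rotated: transform, quantize, dequantize, then transform back.
    rotated = fwht(dequantize(*quantize(fwht(x), bits)))

    return rms_error(x, direct), rms_error(x, rotated)


if __name__ == "__main__":
    err_direct, err_rotated = demo()
    print(f"direct:  {err_direct:.3f}")
    print(f"rotated: {err_rotated:.3f}")
```

Practical schemes of this kind usually add random sign flips before the transform and quantize in small blocks; both are left out here to keep the sketch short.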
Context
Audience
AI Engineers and Developers
Domain
Machine Learning
Format
Open-source software repository
Access
Open source
Topics
LLM Inference Optimization
Llama.cpp Forks
Speculative Decoding Techniques
LLM Quantization & Compression
On-Device AI Deployment
Related
llama.cpp
TurboQuant
Speculative Decoding
Gemma 4
Qwen 3.6
GGUF Format