William's Bookmark Library

GitHub - AtomicBot-ai/atomic-llama-cpp-turboquant: llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% throughput).

github.com · Saved May 21, 2026 · 24 min

LLM Inference Optimization

Summary

A llama.cpp fork featuring TurboQuant compression and speculative decoding for Gemma 4 and Qwen 3.6 to boost throughput by up to 50%. It optimizes memory usage via low-bit KV caches and supports multimodal inference on various hardware backends.

Highlights

TurboQuant WHT-rotated quantization provides ~4.3x KV cache compression.
Gemma 4 MTP speculative decoding increases short-prompt throughput by 30-50%.
Qwen 3.6 NextN decoding improves speed by 24-36% on MoE models.
Multimodal support allows image processing alongside text speculative decoding.
Compatible with Metal, CUDA, Vulkan, and HIP backends.
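
The rotation-then-quantize idea behind the first highlight can be illustrated with a minimal sketch. This is not the fork's implementation: the summary does not describe TurboQuant's actual scheme, so the code below only shows a generic orthonormal Walsh-Hadamard transform (WHT) followed by symmetric uniform quantization. All function names are hypothetical, and the 16-bit baseline used to interpret the ~4.3x figure is an assumption.

```python
import math


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # makes the transform orthonormal
    return [v * scale for v in out]


def quantize(x, bits):
    """Symmetric uniform quantization with one shared scale per vector."""
    qmax = 2 ** (bits - 1) - 1
    amax = max(abs(v) for v in x)
    scale = amax / qmax if amax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in x]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to floating-point values."""
    return [c * scale for c in codes]


def rotated_roundtrip(x, bits):
    """Rotate, quantize, dequantize, rotate back.

    The orthonormal WHT is its own inverse, so the same function undoes it.
    """
    codes, scale = quantize(fwht(x), bits)
    return fwht(dequantize(codes, scale))


def effective_bits(baseline_bits, ratio):
    """Average bits per stored value implied by a compression ratio."""
    return baseline_bits / ratio
```

If the uncompressed cache stores 16-bit values (an assumption, not stated in the summary), `effective_bits(16, 4.3)` comes to roughly 3.7 bits per cached value. The rotation step matters because it spreads a single large outlier across all coordinates, which tends to make a shared quantization scale less wasteful.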

auto-generated

AtomicBot-ai · via GitHub

Context

Audience: AI Engineers and Developers

Domain: Machine Learning

Format: open source software repository

Access: open source

Topics

LLM Inference Optimization
Llama.cpp Forks
Speculative Decoding Techniques
LLM Quantization & Compression
On-Device AI Deployment
