Lucebox-hub offers hand-tuned LLM inference for consumer GPUs such as the RTX 3090. It includes two components: Megakernel, for efficient Qwen3.5-0.8B inference, and DFlash, for speculative decoding of larger models. Both aim for high throughput and energy efficiency.
Highlights
Megakernel runs Qwen3.5-0.8B in a single CUDA dispatch and achieves 1.87 tok/J (tokens per joule) on an RTX 3090.
DFlash enables speculative decoding for Qwen3.5/3.6-27B GGUF models, reaching up to 207 tok/s.
Supports a 256K context in 24 GB of VRAM via the TurboQuant KV cache.
Optimizations focus on power efficiency and cooperative grid synchronization.
Benchmarks compare performance against llama.cpp and PyTorch.
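
The tok/J figure in the highlights is an energy-efficiency metric: tokens generated per joule of energy, which equals throughput in tok/s divided by average power draw in watts. The sketch below shows that arithmetic. The function names are hypothetical and not part of Lucebox-hub, and the 400 tok/s at 200 W input is made up for illustration, not a measured result.

```python
def tokens_per_joule(tokens_per_second: float, avg_power_watts: float) -> float:
    """Energy efficiency: (tok/s) / (J/s) = tok/J."""
    if avg_power_watts <= 0:
        raise ValueError("average power must be positive")
    return tokens_per_second / avg_power_watts


def joules_per_token(tok_per_joule: float) -> float:
    """Inverse view of the same metric: energy spent on each token."""
    if tok_per_joule <= 0:
        raise ValueError("tok/J must be positive")
    return 1.0 / tok_per_joule


# Illustrative input only (not a benchmark number): 400 tok/s at 200 W.
example_efficiency = tokens_per_joule(400.0, 200.0)

# The 1.87 tok/J figure quoted above, expressed as energy per token.
megakernel_joules_per_token = joules_per_token(1.87)
```

Reading the metric as joules per token can make comparisons between runs easier to reason about, since lower is then better in the same way as latency.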
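
For readers unfamiliar with speculative decoding, the general idea can be sketched as follows: a small draft model proposes several tokens, and the large target model verifies them, keeping the longest prefix it agrees with. This is a generic greedy variant with toy stand-in models, written under the assumption that both models are simple next-token functions. It is not DFlash's implementation, and every name here is hypothetical.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify round and return the accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    ctx = list(context)
    for _ in range(k):
        token = draft(ctx)
        proposed.append(token)
        ctx.append(token)

    # 2. The target model checks each proposal in order. A real system
    #    scores all k positions in one batched forward pass; this loop
    #    only mirrors the acceptance logic.
    accepted: List[int] = []
    ctx = list(context)
    for token in proposed:
        expected = target(ctx)
        if expected != token:
            accepted.append(expected)  # replace the first wrong guess
            return accepted
        accepted.append(token)
        ctx.append(token)

    # 3. Every proposal matched, so the target contributes one extra token.
    accepted.append(target(ctx))
    return accepted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             max_new_tokens: int, k: int = 4) -> List[int]:
    """Generate tokens with repeated speculative rounds."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[:len(prompt) + max_new_tokens]


# Toy models: the target counts upward mod 10; the draft agrees except
# that it guesses 0 after seeing a 4.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy verification, the output matches what the target model alone would produce; the draft model affects only how many target passes are needed, which is where the speedup comes from.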