Lucebox-hub provides hand-tuned LLM inference optimized for specific consumer GPUs such as the RTX 3090.
It includes three projects: Megakernel, which runs Qwen3.5-0.8B in a single CUDA dispatch at 1.87 tok/J; DFlash, which serves Qwen3.5/3.6-27B GGUF models with DDTree speculative decoding at up to 207 tok/s; and PFlash. Each project ships benchmarks, installs via git clone and a CUDA build, and supports extended contexts up to 256K tokens on 24GB cards.
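The speculative decoding that DFlash uses can be illustrated at a high level: a cheap draft model proposes several tokens, and the expensive target model verifies them in one pass, accepting the longest agreeing prefix. The sketch below is a toy illustration of that general idea, not the DDTree implementation; `draft_model` and `target_model` are hypothetical stand-ins using simple arithmetic instead of real networks.

```python
# Toy speculative-decoding sketch (hypothetical; not DFlash's DDTree code).
# A cheap draft model proposes k tokens; the target model verifies them,
# keeping the longest matching prefix and correcting the first mismatch.

def draft_model(context, k):
    # Cheap proposer stand-in: deterministically continues last_token + 1.
    out, last = [], context[-1]
    for _ in range(k):
        last = (last + 1) % 100
        out.append(last)
    return out

def target_model(context):
    # Expensive verifier stand-in: the "true" next token for a context.
    return (context[-1] + 1) % 100

def speculative_step(context, k=4):
    """Propose k draft tokens, accept the prefix the target agrees with,
    then append one token from the target. Returns the new tokens."""
    proposals = draft_model(context, k)
    accepted, ctx = [], list(context)
    for tok in proposals:
        true_tok = target_model(ctx)
        if tok == true_tok:
            accepted.append(tok)     # draft token verified, keep it
            ctx.append(tok)
        else:
            accepted.append(true_tok)  # target's correction ends the step
            ctx.append(true_tok)
            break
    else:
        # All k proposals accepted: one bonus token from the target.
        accepted.append(target_model(ctx))
    return accepted

print(speculative_step([1], k=4))  # → [2, 3, 4, 5, 6]
```

When the draft model agrees with the target often, each verification pass yields several tokens instead of one, which is how such schemes reach throughputs well above plain autoregressive decoding.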