llama.cpp is a C/C++ library for LLM inference, designed for minimal setup and high performance across diverse hardware, including CPUs, GPUs, and Apple silicon.
It supports integer quantization from 1.5-bit to 8-bit, many model architectures (LLaMA, Mistral, Qwen, and others), and hybrid CPU+GPU inference for models larger than available VRAM. Recent additions include multimodal support, the GGUF model file format (widely distributed via Hugging Face), and a built-in WebUI.
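As an illustration of the hybrid CPU+GPU path, here is a minimal sketch using llama.cpp's C API to load a GGUF model with partial GPU offload. The function names follow the long-standing `llama.h` interface, but the API evolves quickly, so treat this as a version-dependent sketch (the model path and layer count are placeholders) and check `llama.h` in your checkout for current signatures.

```cpp
// Minimal sketch: load a GGUF model with partial GPU offload via the C API.
// Names reflect the historical llama.h interface; newer releases rename some
// of these (e.g. llama_model_load_from_file), so verify against your version.
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    const char * path = argc > 1 ? argv[1] : "model.gguf"; // placeholder path

    llama_backend_init(); // some older versions take a NUMA flag here

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 32; // hybrid inference: offload 32 layers to the
                               // GPU, run the remaining layers on the CPU

    llama_model * model = llama_load_model_from_file(path, mparams);
    if (!model) {
        fprintf(stderr, "failed to load %s\n", path);
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048; // context window size in tokens

    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // ... tokenize the prompt, call llama_decode(), and sample tokens here ...

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The same offload behavior is exposed on the command line via the `-ngl` / `--n-gpu-layers` flag of `llama-cli` and `llama-server`.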