llama.cpp is a plain C/C++ library for LLM inference with state-of-the-art performance on a wide range of hardware, including Apple silicon, x86, RISC-V, and NVIDIA GPUs, as well as through backends such as Vulkan.
It supports models including the LLaMA series, Mistral, Mixtral, DBRX, and PLaMo-13B via the GGUF format, with quantization levels ranging from 1.5-bit to 8-bit. Installation options include Homebrew, Docker, pre-built binaries, and building from source. The main entry points are llama-cli for running models locally and llama-server for exposing an OpenAI-compatible HTTP API.
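As a minimal sketch of that workflow, the session below installs llama.cpp via Homebrew, runs a one-off generation with llama-cli, and serves a model over the OpenAI-compatible API with llama-server. The model path and prompt are placeholders, not files shipped with the project:

```shell
# Install via Homebrew (macOS, or Linux with brew available)
brew install llama.cpp

# One-off local inference: load a GGUF model, generate up to 128 tokens
# (./models/model.gguf is a placeholder path to a model you have downloaded)
llama-cli -m ./models/model.gguf -p "Explain quantization in one sentence." -n 128

# Serve an OpenAI-compatible HTTP API on localhost:8080
llama-server -m ./models/model.gguf --port 8080

# Query the server's OpenAI-style chat endpoint from another terminal
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello"}]}'
```

Because llama-server speaks the OpenAI chat-completions wire format, existing OpenAI client libraries can usually be pointed at it by changing only the base URL.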