ds4 is a native Metal inference engine built specifically for DeepSeek V4 Flash. It is designed around the model's efficient parameter usage, its markedly shorter thinking sections, whose length is proportional to problem complexity, and its support for context windows of 1 million tokens.
The engine uses a highly compressed KV cache, which makes long-context inference practical on local machines such as a MacBook with 128GB of RAM. It works efficiently with 2-bit quantization and includes an HTTP API for integration. Rather than aiming for generic GGUF support, the project prioritizes end-to-end functionality, validated against the official logits and through agent integration testing. Reported prefill performance is 468 tokens/second on a Mac Studio M3 Ultra.
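
To put the prefill figure in perspective, the following minimal Python sketch estimates how long prompt ingestion would take at the reported 468 tokens/second. It is not part of ds4: the function name `estimate_prefill_seconds` is hypothetical, and the sketch assumes throughput stays constant at every context length, which is a simplification, since prefill speed may well change as the context grows.

```python
# Back-of-the-envelope prefill time estimate.
# Assumption: throughput is constant regardless of context length.
# The 468 tokens/second figure is the one reported for a Mac Studio M3 Ultra.
PREFILL_TOKENS_PER_SECOND = 468.0


def estimate_prefill_seconds(
    prompt_tokens: int,
    tokens_per_second: float = PREFILL_TOKENS_PER_SECOND,
) -> float:
    """Return the estimated seconds needed to prefill `prompt_tokens` tokens."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative prompt sizes, up to the 1 million token context window.
    for tokens in (4_000, 100_000, 1_000_000):
        seconds = estimate_prefill_seconds(tokens)
        print(f"{tokens:>9,} tokens: {seconds:8.1f} s ({seconds / 60:.1f} min)")
```

Under this constant-throughput assumption, filling the full 1 million token window would take roughly 36 minutes, so the estimate is best read as a rough order of magnitude and not as a measured result.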