Memchunk is a Rust library for fast text chunking in RAG pipelines. It splits at semantic delimiters such as periods, newlines, and question marks, so chunks do not end mid-sentence.
For one to three delimiters it uses the memchr crate's SIMD-accelerated (AVX2/SSE2) search; for more than three it uses a lookup table. In both cases it searches backward, which the project credits for its efficiency. The project's benchmarks report up to 1 TB/s throughput and chunking English Wikipedia in 120 ms.
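
The backward-search idea can be sketched with the standard library alone. This is a minimal illustration, not Memchunk's actual API: the names `delimiter_table` and `chunk` are hypothetical, and it assumes the search runs backward from each chunk's size limit to the last delimiter in the window. It uses only the lookup-table path; for one to three delimiters, the memchr crate's reverse-search functions would presumably replace the `rposition` scan.

```rust
/// Build a 256-entry lookup table: table[b] is true when byte b is a delimiter.
/// (Hypothetical helper, not part of Memchunk.)
fn delimiter_table(delims: &[u8]) -> [bool; 256] {
    let mut table = [false; 256];
    for &d in delims {
        table[d as usize] = true;
    }
    table
}

/// Split `text` into chunks of at most `size` bytes. Each chunk ends just
/// after the last delimiter found in its window; if the window contains no
/// delimiter, the chunk is cut at the size limit instead.
fn chunk<'a>(text: &'a [u8], size: usize, delims: &[u8]) -> Vec<&'a [u8]> {
    assert!(size > 0, "chunk size must be positive");
    let table = delimiter_table(delims);
    let mut chunks = Vec::new();
    let mut start = 0;

    while start < text.len() {
        let hard_end = (start + size).min(text.len());
        let end = if hard_end == text.len() {
            // The remaining text fits in one chunk.
            hard_end
        } else {
            // Backward search: scan from the size limit toward `start`
            // and stop at the first delimiter encountered.
            match text[start..hard_end]
                .iter()
                .rposition(|&b| table[b as usize])
            {
                Some(i) => start + i + 1, // keep the delimiter in this chunk
                None => hard_end,         // no delimiter: hard cut
            }
        };
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}
```

Calling `chunk(text, 16, b".?\n")` on a short byte string returns slices that borrow from the input, so no text is copied. Searching backward means only the tail of each window is scanned rather than the whole window, which is one plausible reading of the efficiency claim above.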