GLM-5: From Vibe Coding to Agentic Engineering
GLM-5 is a 744B-parameter MoE model (40B active) from Zhipu AI, scaled up from GLM-4.5's 355B parameters, pre-trained on 28.5T tokens and using DeepSeek Sparse Attention.
DeepSeek is an open-source LLM built on Multi-head Latent Attention (MLA) and Mixture of Experts (MoE); its scale demands advanced parallelism for efficient large-scale inference.
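An MoE layer routes each token to a small subset of experts chosen by a learned gate. A minimal sketch of top-k gating, the selection mechanism MoE models of this kind use (the function name `top_k_route` and the logit values are illustrative assumptions, not DeepSeek's actual implementation):

```python
import math

def top_k_route(logits, k=2):
    """Pick the k experts with the highest gate logits and return
    (expert_index, weight) pairs, with weights softmax-normalized
    over just the selected experts."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in idx]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(idx, exps)]

# Example: with 4 experts, each token activates only k=2 of them,
# which is how a large total parameter count yields a small active count.
print(top_k_route([0.1, 2.0, 1.0, -1.0], k=2))
```

Because only k experts run per token, compute cost scales with the active parameters rather than the total, but the experts must be spread across devices (expert parallelism) to fit in memory.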
The implementation uses prefill-decode disaggregation and expert parallelism across 96 H100 GPUs, reaching 52.3k input and 22.3k output tokens per second per node. This setup delivers up to 5x higher throughput and an 80% lower per-token cost than the official DeepSeek Chat API, with all code and experiments open-sourced.
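The two headline numbers are consistent with each other: at a fixed hardware cost per node-hour, per-token cost is inversely proportional to throughput, so a 5x throughput gain implies an 80% cost reduction. A back-of-the-envelope sketch (the $200/hour node price is a placeholder, not a figure from the source):

```python
def cost_per_million_tokens(node_cost_per_hour, tokens_per_second):
    # Per-token cost at fixed hardware cost: dollars per node-hour
    # divided by tokens produced per node-hour, scaled to 1M tokens.
    tokens_per_hour = tokens_per_second * 3600
    return node_cost_per_hour / tokens_per_hour * 1e6

# Hypothetical baseline at 1/5 the output throughput vs. the reported 22.3k tok/s.
baseline = cost_per_million_tokens(200.0, 22_300 / 5)
ours = cost_per_million_tokens(200.0, 22_300)
print(round(1 - ours / baseline, 2))  # -> 0.8, i.e. 80% less per token
```

The node price cancels out of the ratio, which is why the 80% figure follows from the 5x throughput claim alone.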