Synthetic LLM Hosted Models
Chat with open-source models privately
DeepSeek is an open-source LLM built on Multi-head Latent Attention (MLA) and a Mixture-of-Experts (MoE) architecture. Both features demand advanced parallelism strategies to serve efficiently at scale.
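To give a rough feel for why MoE inference calls for expert parallelism, here is a minimal top-k routing sketch in PyTorch. The function name, expert sizes, and top-2 routing are illustrative choices of ours, not DeepSeek's actual kernels: each token activates only a few experts, so the experts can be sharded across GPUs and each token's hidden state dispatched only to the ranks that own its chosen experts.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2):
    """Dispatch each token to its top-k experts and mix the weighted results."""
    probs = F.softmax(gate(x), dim=-1)        # (tokens, num_experts) routing scores
    weights, idx = torch.topk(probs, top_k)   # per-token expert choices
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):      # with expert parallelism, each rank owns a subset
        tok, slot = (idx == e).nonzero(as_tuple=True)   # tokens routed to expert e
        if tok.numel():
            out[tok] += weights[tok, slot].unsqueeze(-1) * expert(x[tok])
    return out

# Toy usage: 8 experts, hidden size 64, 16 tokens, top-2 routing.
experts = torch.nn.ModuleList(
    torch.nn.Sequential(torch.nn.Linear(64, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64))
    for _ in range(8)
)
gate = torch.nn.Linear(64, 8)
y = moe_forward(torch.randn(16, 64), gate, experts)
print(y.shape)  # torch.Size([16, 64])
```

Because each token touches only `top_k` of the experts, sharding experts across devices keeps per-GPU memory and compute manageable; the cost that remains is the all-to-all communication needed to move tokens to their experts and back.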
The implementation combines prefill-decode disaggregation with large-scale expert parallelism across 96 H100 GPUs, reaching 52.3k input tokens/s and 22.3k output tokens/s per node. This setup delivers up to 5x the output throughput of vanilla tensor parallelism, and the resulting cost per output token is roughly one-fifth (80% less) of the official DeepSeek Chat API. All code and experiments are open-sourced.
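The motivation for prefill-decode disaggregation is that the two phases have opposite performance profiles: prefill is compute-bound and benefits from large, dense batches, while decode is memory-bandwidth-bound and benefits from batching many in-flight requests, so colocating them lets each interfere with the other. The toy sketch below shows the handoff shape only; all class names are hypothetical, and a real system transfers actual per-layer KV tensors between GPU pools rather than token ids.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class KVCache:
    # Stand-in for per-layer key/value tensors of one request.
    tokens: List[int] = field(default_factory=list)

@dataclass
class Request:
    prompt: List[int]
    kv: KVCache = field(default_factory=KVCache)
    output: List[int] = field(default_factory=list)

class PrefillWorker:
    """Compute-bound prompt pass: builds the KV cache for the whole prompt at once."""
    def run(self, req: Request) -> Request:
        req.kv.tokens = list(req.prompt)   # placeholder for the real KV-cache build + transfer
        return req

class DecodeWorker:
    """Memory-bound generation pass: extends the transferred KV cache one token at a time."""
    def step(self, req: Request) -> int:
        next_token = hash(tuple(req.kv.tokens)) % 50000   # placeholder for a forward pass
        req.kv.tokens.append(next_token)
        req.output.append(next_token)
        return next_token

# Toy usage: prefill once, then hand the request to the decode pool.
prefill, decode = PrefillWorker(), DecodeWorker()
req = prefill.run(Request(prompt=[101, 2009, 2003]))
for _ in range(4):
    decode.step(req)
print(req.output)
```

Splitting the phases this way lets each worker pool be sized and batched for its own bottleneck, which is where the throughput gains over a colocated deployment come from.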