Repository providing Docker-based configurations for running large language models locally on RTX 3090 GPUs using multiple inference engines (vLLM, llama.cpp, SGLang).
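For example, the vLLM variant could be launched with a one-line `docker run` (a minimal sketch using vLLM's public `vllm/vllm-openai` image; the model id, port, and flags are illustrative assumptions, not this repository's exact configuration):

```bash
# Sketch: serve a model across both RTX 3090s with tensor parallelism.
# <hf-model-id> is a placeholder, not this repo's exact config.
docker run --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <hf-model-id> \
  --tensor-parallel-size 2
```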
Currently supports Qwen3.6-27B in single- or dual-GPU setups, delivering throughput up to 127 tokens/second or a 262K-token context window, depending on the engine chosen. Includes an OpenAI-compatible API, benchmarking tools, and scaling guidance for clusters of 3+ GPUs.
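Once a server is running, the OpenAI-compatible endpoint can be exercised with a plain HTTP request (a sketch; the port and served model name are assumptions):

```bash
# Chat completion against the local OpenAI-compatible endpoint.
# Port 8000 and <served-model-id> are assumptions, not repo specifics.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<served-model-id>",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```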