Andrea Pellegrini details building a local LLM inference stack on the AMD Ryzen AI Max+ 395 (Strix Halo) with 128GB unified LPDDR5X memory shared between CPU and GPU.
The article covers running LLMs of up to 122B parameters using backends such as HIP, Vulkan, and ROCm, with benchmarks showing up to 884 tokens/s for Llama 2 7B Q4_0 on Vulkan and 270 tokens/s at pp512 for 120B models. Performance varies with the model, quantization, and build options such as hipBLASLt and WMMA, and the unified memory allows loading models with up to 142B weights.
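A rough back-of-the-envelope check shows why 128GB of unified memory can hold models in the 120B-142B range at 4-bit quantization. The sketch below uses llama.cpp's Q4_0 layout (32 weights per block: 16 bytes of packed 4-bit values plus a 2-byte fp16 scale, i.e. 4.5 bits per weight); the parameter counts are taken from the figures above, and the estimate ignores the KV cache and runtime overhead, which also claim part of the 128GB.

```python
def q4_0_bytes(n_params: int) -> int:
    # Q4_0 packs 32 weights into an 18-byte block:
    # 16 bytes of 4-bit values + 2-byte fp16 scale.
    return n_params * 18 // 32

for n_params in (7_000_000_000, 120_000_000_000, 142_000_000_000):
    gb = q4_0_bytes(n_params) / 1e9
    print(f"{n_params // 10**9}B params -> ~{gb:.1f} GB at Q4_0")
```

A 142B model comes out to roughly 80GB of weights, leaving headroom in the 128GB pool for the KV cache and the OS, which is what makes this class of model practical on Strix Halo in the first place.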