Andrea Pellegrini details building a local LLM inference stack on the AMD Ryzen AI Max+ 395 with 128GB unified memory. The article covers performance benchmarks using Vulkan and ROCm backends for models up to 142B parameters.
Highlights
Utilization of AMD Ryzen AI Max+ 395 with 128GB unified memory for local LLM inference.
Performance benchmarks showing up to 884 tokens/s for Llama 2 7B Q4_0 using Vulkan.
Achievement of 270 tokens/s for 120B-parameter models in the pp512 prompt-processing test (a 512-token prompt).
Comparison of different inference backends including HIP, Vulkan, and ROCm.
Demonstration of running large models with up to 142B parameters through hardware optimization.
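As a rough check on why models of this size can fit in 128GB of unified memory, the weight footprint can be estimated from the parameter count and the bits stored per weight. The sketch below is illustrative only. It assumes Q4_0 quantization for every model size (the summary states Q4_0 only for Llama 2 7B), uses 4.5 bits per weight (Q4_0 stores 32 four-bit weights plus one 16-bit scale per block), and ignores the KV cache, activations, and memory reserved by the operating system. The function name and the constants are hypothetical and not taken from the article.

```python
def quantized_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes) for a quantized model."""
    return n_params * bits_per_weight / 8 / 1e9


# Q4_0 block: 32 weights * 4 bits + one 16-bit scale = 144 bits per 32 weights.
Q4_0_BITS_PER_WEIGHT = 144 / 32  # 4.5

# Total unified memory from the summary; the share usable by the GPU is smaller.
UNIFIED_MEMORY_GB = 128

# Assumption: every size is quantized as Q4_0, which the summary does not state.
for name, n_params in [("7B", 7e9), ("120B", 120e9), ("142B", 142e9)]:
    size = quantized_size_gb(n_params, Q4_0_BITS_PER_WEIGHT)
    fits = size < UNIFIED_MEMORY_GB
    print(f"{name}: ~{size:.1f} GB of weights, below 128 GB: {fits}")
```

Under these assumptions, a 142B-parameter model needs about 80 GB for weights alone, which leaves headroom for context and runtime buffers. A 16-bit version of the same model (about 284 GB) would not fit.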