Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC - MiniMax-M2 and GLM 4.6

Donato Capitella via YouTube

Overview

Learn to build a distributed inference cluster from two AMD Strix Halo systems (a Framework Desktop and an HP Z2 Mini workstation) for running large language models with llama.cpp's RPC backend. Configure the network and establish remote procedure call connections between the machines so that their memory forms a unified 256 GB pool for large-scale model inference. Set up the ROCm 7 Toolbox container environment, which ships with llama.cpp pre-compiled with RPC support, then deploy and benchmark MiniMax-M2 (Unsloth Q6_K_XL dynamic quantization) at 17 tokens per second and GLM 4.6 (Q4_K_XL) at 7-8 tokens per second. Analyze performance with llama-bench and weigh the practicality of expanding the cluster to additional Strix Halo nodes. The video covers the complete workflow from initial network configuration through performance optimization, including practical demonstrations of distributed inference and an assessment of the limits of cluster expansion.
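
As a concrete illustration of the workflow described above, here is a minimal sketch of the llama.cpp RPC commands involved. It assumes a llama.cpp build with RPC support enabled (-DGGML_RPC=ON); the IP address, port, and model file paths are illustrative placeholders, not values taken from the video.

```bash
# All hostnames, IPs, ports, and model paths below are placeholders.

# On the worker node (e.g. the HP Z2 Mini): start the RPC server that
# ships with llama.cpp when built with -DGGML_RPC=ON.
# -H 0.0.0.0 makes it reachable from the other machine; 50052 is the
# default RPC port.
rpc-server -H 0.0.0.0 -p 50052

# On the head node (e.g. the Framework Desktop): point llama-cli at the
# worker with --rpc so model layers are split across both machines.
# -ngl 99 requests that all layers be offloaded to the backends.
llama-cli \
  -m ~/models/MiniMax-M2-Q6_K_XL.gguf \
  --rpc 192.168.1.42:50052 \
  -ngl 99 \
  -p "Hello, my name is"

# llama-bench accepts the same --rpc flag, which is how throughput
# figures like those quoted above can be measured.
llama-bench \
  -m ~/models/GLM-4.6-Q4_K_XL.gguf \
  --rpc 192.168.1.42:50052
```

The design is a simple head/worker split: rpc-server exposes a machine's compute and memory as a ggml backend over TCP, and the head node's llama.cpp process distributes model layers across its local backend and every RPC server listed after --rpc.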

Syllabus

00:00 – Intro
01:48 – Network Setup
04:04 – llama.cpp RPC Setup
06:14 – Running MiniMax-M2 Q6_K_XL
16:56 – Running GLM 4.6 Q4_K_XL
22:37 – llama-bench Results
24:28 – Cluster with 4 Strix Halos?

Taught by

Donato Capitella
