Building a Two-Node AMD Strix Halo Cluster for LLMs with llama.cpp RPC - MiniMax-M2 and GLM 4.6

Donato Capitella via YouTube

Overview

Learn to build a distributed inference cluster from two AMD Strix Halo systems (a Framework Desktop and an HP Z2 Mini workstation) for running large language models with llama.cpp's RPC backend. Configure the network and establish remote procedure call connections between the machines so that their memory forms a unified 256 GB pool for large-scale model inference. Set up the ROCm 7 Toolbox container environment, which ships with llama.cpp pre-compiled with RPC support, then deploy and benchmark MiniMax-M2 (Unsloth Q6_K_XL dynamic quantization) at 17 tokens per second and GLM 4.6 (Q4_K_XL) at 7-8 tokens per second. Analyze performance with llama-bench and weigh the practicality of expanding the cluster to additional Strix Halo nodes. The video covers the complete workflow from initial network configuration through performance optimization, including practical demonstrations of distributed inference and an assessment of the limits of cluster expansion.
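
As a concrete illustration of the workflow described above, here is a minimal sketch of the llama.cpp RPC commands involved. It assumes a llama.cpp build with RPC support enabled (-DGGML_RPC=ON); the IP address, port, and model file paths are illustrative placeholders, not values taken from the video.

```bash
# All hostnames, IPs, ports, and model paths below are placeholders.

# On the worker node (e.g. the HP Z2 Mini): start the RPC server that
# ships with llama.cpp when built with -DGGML_RPC=ON.
# -H 0.0.0.0 makes it reachable from the other machine; 50052 is the
# default RPC port.
rpc-server -H 0.0.0.0 -p 50052

# On the head node (e.g. the Framework Desktop): point llama-cli at the
# worker with --rpc so model layers are split across both machines.
# -ngl 99 requests that all layers be offloaded to the backends.
llama-cli \
  -m ~/models/MiniMax-M2-Q6_K_XL.gguf \
  --rpc 192.168.1.42:50052 \
  -ngl 99 \
  -p "Hello, my name is"

# llama-bench accepts the same --rpc flag, which is how throughput
# figures like those quoted above can be measured.
llama-bench \
  -m ~/models/GLM-4.6-Q4_K_XL.gguf \
  --rpc 192.168.1.42:50052
```

The design is a simple head/worker split: rpc-server exposes a machine's compute and memory as a ggml backend over TCP, and the head node's llama.cpp process distributes model layers across its local backend and every RPC server listed after --rpc.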

Syllabus

00:00 – Intro
01:48 – Network Setup
04:04 – llama.cpp RPC Setup
06:14 – Running MiniMax-M2 Q6_K_XL
16:56 – Running GLM 4.6 Q4_K_XL
22:37 – llama-bench Results
24:28 – Cluster with 4 Strix Halos?

Taught by

Donato Capitella
