I just threw Qwen3-0.6B in BF16 into an on-device AI drag race on AMD Strix Halo with vLLM:
564 tokens/sec on short 100-token sprints
96 tokens/sec on 8K-token marathons
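If you want to reproduce the sprint number, here's a minimal sketch of that kind of throughput measurement with vLLM's Python API. The model name is the real one; the prompt and batch size are illustrative placeholders, not my exact setup:

```python
# Minimal vLLM throughput sketch (prompt and batch size are placeholders).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B", dtype="bfloat16")

# "Sprint" case: short 100-token generations, measure decode throughput.
params = SamplingParams(max_tokens=100, ignore_eos=True)
prompts = ["Tell me about AMD Strix Halo."] * 8  # batch size is an assumption

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tokens/sec")
```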
TL;DR You don't just run AI on AMD. You negotiate with it.
The hardware absolutely delivers. Spoiler alert: there is exactly ONE configuration where vLLM + ROCm + Triton + PyTorch + drivers + the Ubuntu kernel all work at the same time. Finding it required the patience of a saint.
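If you're hunting for that one working configuration yourself, the first sanity check is whether your PyTorch build actually sees the GPU through ROCm at all. A generic check, not my exact debugging script:

```python
# Quick ROCm stack sanity check (generic; not the full compatibility matrix).
import torch

print("PyTorch:", torch.__version__)
print("HIP/ROCm:", torch.version.hip)  # None on CUDA-only or CPU-only builds
print("Device visible:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```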
Consumer AMD for AI inference is the ultimate "budget warrior" play: insane performance-per-euro, but you need hardcore technical skills that would make a senior sysadmin nod in quiet respect.
Reacted to mike-ravkine's post with 🔥 · 29 days ago
There is no anxiety quite like powering up 2 kW of basement compute after rewiring it all. Small bit of trouble with the horizontal 3090 because I misread my motherboard manual, but otherwise so far so good. Next we see if I've built up enough cooling to hit my target TDP, especially on those 3-slot NVLinked cards. The 4-slot bridges are much easier to work with, but their prices went bananas and I couldn't acquire a second, so gotta get a little creative with intakes.
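A minimal sketch of how one might watch power draw and temperature across the cards while ramping load, to see whether the cooling holds at the target TDP. This uses pynvml; the 2-second poll interval is an arbitrary choice, and no specific wattage target is assumed:

```python
# Poll GPU power draw and temperature while the cards heat up under load.
import time

import pynvml

pynvml.nvmlInit()
handles = [
    pynvml.nvmlDeviceGetHandleByIndex(i)
    for i in range(pynvml.nvmlDeviceGetCount())
]

try:
    while True:
        for i, h in enumerate(handles):
            watts = pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0  # NVML reports mW
            temp = pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU)
            print(f"GPU{i}: {watts:6.1f} W  {temp:3d} C")
        time.sleep(2)
finally:
    pynvml.nvmlShutdown()
```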