What is the minimum hardware needed to run Llama-4-Scout-17B-16E-Instruct with vLLM?
Hi everyone,
I’m planning to deploy Llama-4-Scout-17B-16E-Instruct using vLLM, and I want to confirm the realistic hardware requirements for this model.
Right now, my environment is:
ecs.gn8is.2xlarge on Alibaba Cloud
Instance family reference: https://www.alibabacloud.com/help/en/ecs/user-guide/gpu-accelerated-compute-optimized-and-vgpu-accelerated-instance-families-1?spm=a2c63.p38356.help-menu-25365.d_0_1_0_2_10.a54477e7Q37YI8
8 vCPUs
64 GB RAM
1 × NVIDIA L20 GPU (48 GB VRAM)
When I try to load the model, vLLM fails with a GPU out-of-memory error.
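From my back-of-the-envelope math this seems expected: Scout is a 16-expert MoE with roughly 109B total parameters (17B active per token), so the BF16 weights alone come to around 218 GB. A quick sketch, assuming that approximate parameter count:

```python
# Rough weight-memory math for Llama-4-Scout-17B-16E-Instruct.
# Assumption: ~109B total parameters (16 experts, 17B active per token).
TOTAL_PARAMS = 109e9

for dtype, bytes_per_param in [("BF16", 2), ("FP8", 1), ("INT4", 0.5)]:
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{dtype}: ~{weights_gb:.0f} GB of weights, before KV cache")

# None of these fit in a single 48 GB L20, and that is before
# accounting for the KV cache and activation memory.
```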
My questions:
1. What is the minimum GPU setup required to load Llama-4-Scout-17B-16E-Instruct for inference?
2. Is it possible to run this model using tensor parallelism (TP=2, TP=4, etc.), heavy offloading, or quantization (FP8 / FP4 / AWQ / GPTQ)? (See the launch sketch below for what I have in mind.)
3. Do you recommend a specific configuration, such as A100/H100 clusters or a multi-node setup?
4. Which Alibaba Cloud instance type (from the instance-family page linked above) would you recommend for this model?
5. Are there any reference scripts or deployment notes for running this model with vLLM?
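For question 2, this is the kind of single-node launch I had in mind: a minimal sketch assuming 4 × 48 GB GPUs and vLLM's on-the-fly FP8 weight quantization. I haven't been able to test it beyond my single L20, so the flag values are guesses to be tuned:

```python
# Minimal launch sketch, assuming a single node with 4 x 48 GB GPUs
# and vLLM's on-the-fly FP8 weight quantization. Untested on this
# model; flag values are guesses to be tuned.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=4,       # shard weights across the 4 GPUs
    quantization="fp8",           # quantize BF16 weights to FP8 at load time
    max_model_len=8192,           # modest context to bound the KV cache
    gpu_memory_utilization=0.90,
    # cpu_offload_gb=32,          # optional: spill weights to host RAM (slow)
)

outputs = llm.generate(
    ["Summarize the Llama 4 Scout architecture in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```

I also saw that vLLM exposes a cpu_offload_gb engine argument (commented out above); is offloading weights to host RAM at all practical for a MoE of this size, or is FP8/INT4 quantization the only realistic path on smaller GPUs?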
Thanks in advance — any official or community guidance would be greatly appreciated.