What is the minimum hardware needed to run Llama-4-Scout-17B-16E-Instruct with vLLM?

#106 by PYTHON01100100

Hi everyone,

I’m planning to deploy Llama-4-Scout-17B-16E-Instruct using vLLM, and I want to confirm the realistic hardware requirements for this model.

Right now, my environment is:

- ecs.gn8is.2xlarge on Alibaba Cloud (instance family reference: https://www.alibabacloud.com/help/en/ecs/user-guide/gpu-accelerated-compute-optimized-and-vgpu-accelerated-instance-families-1?spm=a2c63.p38356.help-menu-25365.d_0_1_0_2_10.a54477e7Q37YI8)
- 8 vCPUs
- 64 GB RAM
- 1 × NVIDIA L20 GPU (48 GB VRAM)

When I try to load the model, vLLM fails with a GPU out-of-memory error.
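For context on the scale: if I understand the model card correctly, Scout is a mixture-of-experts model with 16 experts and 17B active parameters, but roughly 109B total parameters. So BF16 weights alone would be on the order of 109e9 params × 2 bytes ≈ 218 GB, which a single 48 GB L20 clearly cannot hold. That leads to my questions below.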

My questions:

What is the minimum GPU setup required to load Llama-4-Scout-17B-16E-Instruct for inference?

Is it possible to run this model using:

- Tensor parallelism (TP=2, TP=4, etc.)?
- Heavy offloading or quantization (FP8 / FP4 / AWQ / GPTQ)?

(A rough launch sketch follows below.)
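For concreteness, this is the kind of vLLM launch I have in mind. It is only a sketch: the TP degree, quantization mode, offload budget, and context length are placeholder values I have not validated for this model.

```python
# Rough sketch of the launch I have in mind (untested placeholders,
# not a known-working configuration for this model).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=4,       # would TP=2 or TP=4 across larger GPUs be enough?
    quantization="fp8",           # or should I use a pre-quantized AWQ/GPTQ checkpoint?
    cpu_offload_gb=32,            # per-GPU CPU offload budget, if offloading is viable at all
    max_model_len=8192,           # shorter context to keep KV-cache pressure down
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Explain tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```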

Do you recommend a specific configuration such as A100/H100 clusters or multi-node setups?
Which Alibaba Cloud instance type (from the same instance family list linked above) should I use for this model?

Are there any reference scripts or deployment notes for running this model with vLLM?
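To make that question concrete: assuming the model is served with vLLM's OpenAI-compatible server (something like `vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct --tensor-parallel-size 4`), I would expect client code along these lines. Is that the intended deployment path, or is there an official reference script?

```python
# Minimal client sketch against vLLM's OpenAI-compatible server
# (assumes a `vllm serve` process is already listening on localhost:8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Summarize your hardware requirements."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```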

Thanks in advance — any official or community guidance would be greatly appreciated.

PYTHON01100100 changed discussion title from "can Llama-4-Scout-17B-16E-Instruct run on ecs.gn8is.2xlarge L20 gpu on alibaba cloud" to "What is the minimum hardware needed to run Llama-4-Scout-17B-16E-Instruct with vLLM?"
