Could you provide the configuration for a 1M context?
just added a new file: 1m-ctx.config.json
hope this helps!
How do I enable vLLM on an RTX Pro 6000 96GB device? The official setting, VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN, is causing an error.
The RTX Pro 6000 is a Blackwell-architecture GPU, which is relatively new. The DUAL_CHUNK_FLASH_ATTN backend you're trying to use is likely erroring because it isn't yet fully optimized or compatible with this architecture.
Try one of the following backends instead:
- FLASHINFER
  pip install "vllm[flashinfer]"  # vLLM has to be installed with FlashInfer support first
  VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve aquif-ai/aquif-3.5-Max-1205
- FLASH_ATTN (standard FlashAttention)
  VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve aquif-ai/aquif-3.5-Max-1205
- XFORMERS (use this as a fallback)
  VLLM_ATTENTION_BACKEND=XFORMERS vllm serve aquif-ai/aquif-3.5-Max-1205
I hope this fixes your issue. I haven't used Blackwell GPUs myself, so I can't test this.
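If the backend switch works, it can also help to cap the context window explicitly so the KV cache fits alongside the weights in 96 GB. A minimal launch sketch; `--max-model-len` and `--gpu-memory-utilization` are standard vLLM flags, but the specific values here are assumptions to tune for your setup:

```shell
# Sketch: serve with the FlashInfer backend and an explicit context cap.
# The 131072 token limit is an assumption -- raise or lower it to fit VRAM.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve aquif-ai/aquif-3.5-Max-1205 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95
```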
Can the RTX Pro 6000 run with a 1M context?
With FP16 you can fit roughly up to 160K context, with FP8 roughly up to 330K, and with INT4 roughly up to 660K. These are rough estimates.
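Those numbers come from KV-cache memory scaling linearly with context length and halving with each precision step. A back-of-envelope sketch; the model dimensions below are placeholder assumptions (the real config of aquif-3.5-Max-1205 may differ), only the formula is the standard per-token KV-cache cost:

```python
# Rough KV-cache sizing sketch. Model dimensions are ASSUMPTIONS,
# not the real model's config; the formula itself is standard.

GiB = 1024**3

# Hypothetical model config (placeholder values)
num_layers = 60
num_kv_heads = 8        # grouped-query attention
head_dim = 128

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # 2 tensors (K and V) per layer, each num_kv_heads * head_dim wide
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_context_tokens(free_vram_gib: float, bytes_per_elem: float) -> int:
    # Tokens whose cache fits in the VRAM left after weights/activations
    return int(free_vram_gib * GiB / kv_bytes_per_token(bytes_per_elem))

# Suppose ~40 GiB remain for the cache after loading the weights
for name, b in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{max_context_tokens(40, b):,} tokens")
```

With these placeholder dimensions the sketch lands in the same ballpark as the figures above, and it makes the doubling pattern obvious: each precision halving doubles the context that fits.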