Could you provide the configuration for a 1M context?
just added a new file: 1m-ctx.config.json
hope this helps!
How do I enable vLLM on an RTX Pro 6000 96GB device? The official setting, VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN, is causing an error.
The RTX Pro 6000 is a Blackwell-architecture GPU, which is relatively new. The DUAL_CHUNK_FLASH_ATTN backend you're trying to use is likely erroring because it isn't yet fully optimized or compatible with this architecture.
Try one of the following backends instead:
- FLASHINFER
  pip install "vllm[flashinfer]"  # vLLM has to be installed with FlashInfer support first
  VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve aquif-ai/aquif-3.5-Max-1205
- FLASH_ATTN (standard FlashAttention)
  VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve aquif-ai/aquif-3.5-Max-1205
- XFORMERS (use this as a fallback)
  VLLM_ATTENTION_BACKEND=XFORMERS vllm serve aquif-ai/aquif-3.5-Max-1205
I hope this fixes your issue. I haven't used Blackwell GPUs myself, so I can't test this.
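If the backend switch works, it can also help to cap the context window explicitly so the KV cache fits alongside the weights in 96 GB. A minimal launch sketch; `--max-model-len` and `--gpu-memory-utilization` are standard vLLM flags, but the specific values here are assumptions to tune for your setup:

```shell
# Sketch: serve with the FlashInfer backend and an explicit context cap.
# The 131072 token limit is an assumption -- raise or lower it to fit VRAM.
VLLM_ATTENTION_BACKEND=FLASHINFER \
vllm serve aquif-ai/aquif-3.5-Max-1205 \
    --max-model-len 131072 \
    --gpu-memory-utilization 0.95
```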
Can the RTX Pro 6000 run with a 1M context?
With FP16 you can fit roughly up to 160K context, with FP8 roughly up to 330K, and with INT4 roughly up to 660K. These are rough estimates.
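Those numbers come from KV-cache memory scaling linearly with context length and halving with each precision step. A back-of-envelope sketch; the model dimensions below are placeholder assumptions (the real config of aquif-3.5-Max-1205 may differ), only the formula is the standard per-token KV-cache cost:

```python
# Rough KV-cache sizing sketch. Model dimensions are ASSUMPTIONS,
# not the real model's config; the formula itself is standard.

GiB = 1024**3

# Hypothetical model config (placeholder values)
num_layers = 60
num_kv_heads = 8        # grouped-query attention
head_dim = 128

def kv_bytes_per_token(bytes_per_elem: float) -> float:
    # 2 tensors (K and V) per layer, each num_kv_heads * head_dim wide
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

def max_context_tokens(free_vram_gib: float, bytes_per_elem: float) -> int:
    # Tokens whose cache fits in the VRAM left after weights/activations
    return int(free_vram_gib * GiB / kv_bytes_per_token(bytes_per_elem))

# Suppose ~40 GiB remain for the cache after loading the weights
for name, b in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{max_context_tokens(40, b):,} tokens")
```

With these placeholder dimensions the sketch lands in the same ballpark as the figures above, and it makes the doubling pattern obvious: each precision halving doubles the context that fits.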