amd/Llama-3.2-90B-Vision-Instruct-FP8-KV


luow-amd committed on Sep 27, 2024

Commit d488ea1 · verified · 1 Parent(s): bd24176

Update README.md

Files changed (1)

README.md +6 -0


@@ -23,6 +23,9 @@ python3 quantize_quark.py \
         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
         --num_calib_data 128 \
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
         --model_dir $MODEL_DIR \
@@ -30,6 +33,9 @@ python3 quantize_quark.py \
         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
         --num_calib_data 128 \
 ```
 ## Deployment
 Quark has its own export format and allows FP8 quantized models to be efficiently deployed using the vLLM backend(vLLM-compatible).

         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
         --num_calib_data 128 \
+        --model_export quark_safetensors \
+        --no_weight_matrix_merge \
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
         --model_dir $MODEL_DIR \
         --quant_scheme w_fp8_a_fp8 \
         --kv_cache_dtype fp8 \
         --num_calib_data 128 \
+        --model_export quark_safetensors \
+        --no_weight_matrix_merge \
+        --multi_gpu
 ```
 ## Deployment
 Quark has its own export format and allows FP8 quantized models to be efficiently deployed using the vLLM backend(vLLM-compatible).
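
Review note: after this change the single-GPU and multi-GPU invocations differ only by the trailing `--multi_gpu` flag. A minimal sketch of how the two could share one argument list is shown below. The function name `build_quark_args` and its second positional toggle are hypothetical and not part of Quark. The list contains only the flags visible in this diff; the hunk context may omit other arguments, so this is not a complete invocation.

```shell
# Sketch only: print the quantize_quark.py arguments visible in this diff,
# one per line. build_quark_args and its "1"/"0" toggle are hypothetical
# helpers for illustration, not part of Quark.
build_quark_args() {
    model_dir="$1"
    multi_gpu="${2:-0}"

    # Flags shared by both invocations in the README after this commit.
    set -- --model_dir "$model_dir" \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
        --num_calib_data 128 \
        --model_export quark_safetensors \
        --no_weight_matrix_merge

    # The multi-GPU variant only appends one extra flag.
    if [ "$multi_gpu" = "1" ]; then
        set -- "$@" --multi_gpu
    fi

    printf '%s\n' "$@"
}
```

Something like `build_quark_args "$MODEL_DIR" 1 | xargs python3 quantize_quark.py` would then run the multi-GPU variant, assuming the model path contains no whitespace and any arguments not shown in this diff are added separately.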