GLM-4.5-GGUF-3.2162bpw
This is a 3.2 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.
The quant aims to achieve best-in-class performance by relying on:
- SOTA IQK-quants by @ikawrakow
- GGUF Tool Suite with the amazing calibration data by @Thireus
- Well-balanced importance matrix by @mradermacher
- Top-notch knowledge sharing by @ubergarm, @bartowski, @eaddario, @AesSedai, and many others
Size
The MoE per-expert FFN tensors take about 120 GiB and are loaded into System RAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.
The remaining tensors take about 12 GiB and are loaded into VRAM, leaving some room for the context and compute buffers.
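A minimal launch sketch for this split (assuming ik_llama.cpp's or llama.cpp's llama-server; the model path, context size, and thread count below are placeholders, adjust to taste):

```bash
# -ngl 99 offloads every layer to the GPU, then -ot "exps=CPU" overrides the
# routed-expert FFN tensors (ffn_*_exps) back to System RAM, reproducing the
# ~12 GiB VRAM / ~120 GiB RAM split described above.
./llama-server \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 32768 \
  -t 16
```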
Size from llama-server output:
llm_load_print_meta: model size = 134.433 GiB (3.223 BPW)
llm_load_print_meta: repeating layers = 133.360 GiB (3.211 BPW, 356.786 B parameters)
...
llm_load_tensors: CPU buffer size = 122416.56 MiB
llm_load_tensors: CUDA_Host buffer size = 486.20 MiB
llm_load_tensors: CUDA0 buffer size = 11829.74 MiB
System RAM usage from top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9851 sayap 20 0 185.2g 121.1g 1.2g R 94.6 97.0 1:14.58 llama-server
Quality
Recipe: a mixture of IQ4_KSS, IQ3_KS, IQ2_KL, and IQ2_KS for the per-expert FFN tensors, with no harmonization.
## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/
## Model head & embeddings — qbits: 32 6 5
^output_norm\.weight$=f32
^token_embd\.weight$=iq5_ks
^output\.weight$=iq6_k
## Multi-headed attention parameters — qbits: 32 8 5
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$=iq5_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=iq5_ks
## Dense Feed-Forward Network weights — qbits: 8
^blk\.[0-2]\.ffn_gate\.weight$=q8_0
^blk\.[0-2]\.ffn_down\.weight$=q8_0
^blk\.[0-2]\.ffn_up\.weight$=q8_0
## NextN tensors — qbits: 32 5
^blk\.92\.nextn\.enorm\.weight$=f32
^blk\.92\.nextn\.eh_proj\.weight$=iq5_ks
^blk\.92\.nextn\.embed_tokens\.weight$=iq5_ks
^blk\.92\.nextn\.shared_head_norm\.weight$=f32
^blk\.92\.nextn\.shared_head_head\.weight$=iq5_ks
^blk\.92\.nextn\.hnorm\.weight$=f32
## MoE Gating & Routing — qbits: 32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight$=f32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias$=f32
## Misc / Other tensors — qbits: 32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight$=f32
## GPU-loaded - MoE Shared Experts Feed-Forward Network - ffn_*_shexp
# ffn_down_shexp — down-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_down_shexp\.weight$=q8_0
# ffn_up_shexp — up-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_up_shexp\.weight$=q8_0
# ffn_gate_shexp — gating network (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_shexp\.weight$=q8_0
## CPU-friendly - MoE Per-expert Feed-Forward Network - ffn_*_exps
# ffn_down_exps — down-projection (per-expert) — qbits: 4 3 2
^blk\.([3-5]|8|11|14|18|22|25|3[0-1]|35|43|46|5[6-7]|61|92)\.ffn_down_exps\.weight$=iq4_kss
^blk\.(6|13|1[5-6]|19|24|28|37|42|4[4-5]|4[7-9]|5[0-5]|59|60|6[2-6]|68|7[1-6]|78|8[1-2]|86)\.ffn_down_exps\.weight$=iq3_ks
^blk\.(10|12|17|21|23|2[6-7]|29|32|34|36|3[8-9]|40)\.ffn_down_exps\.weight$=iq2_ks
^blk\.(7|9|20|33|41|58|67|69|70|77|79|80|8[3-5]|8[7-9]|9[0-1])\.ffn_down_exps\.weight$=iq2_kl
# ffn_up_exps — up-projection (per-expert) — qbits: 4 3 2
^blk\.(11|13|1[5-6]|19|20|24|28|34|44|5[6-7]|92)\.ffn_up_exps\.weight$=iq4_kss
^blk\.(14|17|23|3[0-1]|36|4[0-3]|4[6-9]|5[0-5]|5[8-9]|6[0-2]|6[4-5]|6[7-9]|7[0-9]|8[1-2]|85|88|91)\.ffn_up_exps\.weight$=iq3_ks
^blk\.([3-5]|8|12|2[1-2]|3[8-9]|45|63|66|80|8[3-4]|8[6-7]|89|90)\.ffn_up_exps\.weight$=iq2_kl
^blk\.([6-7]|9|10|18|2[5-7]|29|3[2-3]|35|37)\.ffn_up_exps\.weight$=iq2_ks
# ffn_gate_exps — gating network (per-expert) — qbits: 4 3 2
^blk\.(1[0-2]|17|19|2[0-3]|28|3[0-1]|92)\.ffn_gate_exps\.weight$=iq4_kss
^blk\.(1[3-6]|24|26|41|4[3-8]|5[0-9]|6[0-1]|6[3-5]|6[7-9]|7[0-6]|78|[8-9][0-1]|87)\.ffn_gate_exps\.weight$=iq3_ks
^blk\.([3-4]|[6-7]|33|39|42|49|62|66|77|79|8[2-6]|8[8-9])\.ffn_gate_exps\.weight$=iq2_kl
^blk\.(5|[8-9]|18|25|27|29|32|3[4-8]|40)\.ffn_gate_exps\.weight$=iq2_ks
## Summary of tensor sizes per class
# GPU Total: 12.13 GiB (100.0%) | 12.13 GiB max, if all were q8_0 | 12.13 GiB min, if all were q8_0
# CPU Total: 122.03 GiB (76.7%) | 159.18 GiB max, if all were iq4_kss | 88.29 GiB min, if all were iq2_ks
# GPU+CPU Total: 134.17 GiB (88.3%)
## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +f32 835 32 0.28 GiB - -
# +q8_0 186 8.5 0.96 GiB - -
# q8_0 279 8.5 2.66 GiB 100.0% 2.66
# +iq6_k 1 6.625 0.60 GiB - -
# +iq5_ks 187 5.25 7.63 GiB - -
#
# CPU-friendly quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +iq5_ks 3 5.25 0.98 GiB - -
# +iq4_kss 3 4 1.76 GiB - -
# iq4_kss 41 4 24.02 GiB 15.4% 156.45
# iq3_ks 127 3.1875 59.30 GiB 47.6% 124.67
# iq2_kl 58 2.6875 22.83 GiB 21.7% 105.11
# iq2_ks 41 2.1875 13.14 GiB 15.4% 85.56
#
# -Average BPW: 3.2162
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Recipe produced on the 2025-11-28 06:08:52 WIB+0700 using Thireus' GGUF tools (https://gguf.thireus.com/)
# - Script SHA-256: 569b7f6a3239c9173d71ca1fadf34222607d72a2cfed2c284b42633e95b4a627
# - Calibration dataset 'models/GLM-4.5/ppl_results.csv' SHA-256: eee3f314cf1ecaa746932341b2189f8a7718e62489bbd250f957c4aec82c0015
# - tensors.bf16.map SHA-256: 4e8b7b435f6257174a7adfc90290ac92c36758fef201ba0f5358338eea7606b8
# - tensors.bf16.map model name: GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-01762-of-01762
# - tensors.q8_0.map SHA-256: 2814e1547cf288d327264135ed3f83e612b879826640283037d45f95a22ebfe2
# - tensors.q8_0.map model name: GLM-4.5-THIREUS-Q8_0-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq6_k.map SHA-256: 8ef4d5c379126fc13dfb46bbc8c10308d2c8e78602c0b3f6cea197d963fc80f1
# - tensors.iq6_k.map model name: GLM-4.5-THIREUS-IQ6_K-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq5_ks.map SHA-256: 9af5c1536eebc84dc6c71d855517c9acd534fbeefc3067aef26def0c104c7e64
# - tensors.iq5_ks.map model name: GLM-4.5-THIREUS-IQ5_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_kss.map SHA-256: 78644e76c921c329b6cf32d1c8711766170edea7e8960fcd3e9eb6d94601bc4b
# - tensors.iq4_kss.map model name: GLM-4.5-THIREUS-IQ4_KSS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq3_ks.map SHA-256: 4a33c7b3901cadf1a4e6130aeaed8806168249a0d386219db4ffec31188cb6af
# - tensors.iq3_ks.map model name: GLM-4.5-THIREUS-IQ3_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_kl.map SHA-256: ad6ca540847d2e75cb537169f77e91d8453b70613b3171fed20bc374f202d58a
# - tensors.iq2_kl.map model name: GLM-4.5-THIREUS-IQ2_KL-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_ks.map SHA-256: 4ed9fc5a73d854ad30f9f75577a1d826150cb23283a0bb54dba45d6aba6c9de2
# - tensors.iq2_ks.map model name: GLM-4.5-THIREUS-IQ2_KS-SPECIAL_TENSOR-01762-of-01762
# - GPG signatures: DISABLED
# - Command used:
# ./quant_assign.py models/GLM-4.5/ppl_results.csv --tolerance 0.001 --cpu-tensors-max-size 121.9 \
# --gpu-tensors-max-size 12.2 --exponential-factor 1.0 --skip-gpg --cpu-tensors \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_down_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_up_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_gate_exps\.weight' --gpu-tensors '.*' --cpu-quants iq4_kss iq3_ks iq2_kl \
# iq2_ks --gpu-quants q8_0 --cpu-assign-tensors '^blk\.(92)\.ffn_down_exps\.weight=iq4_kss' \
# '^blk\.(92)\.ffn_up_exps\.weight=iq4_kss' '^blk\.(92)\.ffn_gate_exps\.weight=iq4_kss' \
# '^blk\.92\.nextn\.shared_head_head\.weight=iq5_ks' '^blk\.92\.nextn\.embed_tokens\.weight=iq5_ks' \
# '^blk\.92\.nextn\.eh_proj\.weight=iq5_ks' --gpu-assign-qtype q8_0 --gpu-assign-tensors \
# '^blk\..*\.attn_(q|output)\.weight=iq5_ks' '^token_embd\.weight=iq5_ks' '^output\.weight=iq6_k' --harmonize-tensors \
# '' --harmonization-technique 0
## THE END!
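Every non-comment line of the recipe is a `regex=qtype` pair matched against full tensor names. A small sketch to check which qtype a given tensor ends up with (assuming the recipe above is saved verbatim as `recipe.txt`; the tensor name is just an example):

```bash
# Resolve the qtype the recipe assigns to a single tensor name.
tensor='blk.41.ffn_down_exps.weight'   # example tensor
grep -Ev '^(#|$)' recipe.txt | while IFS='=' read -r pattern qtype; do
  if printf '%s\n' "$tensor" | grep -Eq "$pattern"; then
    echo "$tensor -> $qtype"           # prints: blk.41.ffn_down_exps.weight -> iq2_kl
  fi
done
```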
PPL result with wiki.test.raw:
Final estimate: PPL over 565 chunks for n_ctx=512 = 3.3627 +/- 0.01897
For comparison, see the graph at https://huggingface.co/ubergarm/GLM-4.5-GGUF.
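A hedged sketch of how this number can be reproduced (assuming llama.cpp's or ik_llama.cpp's llama-perplexity, with the same CPU/GPU split as the llama-server example above; the model path is a placeholder):

```bash
# -c 512 matches the n_ctx=512 of the result above.
./llama-perplexity \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 -ot "exps=CPU" \
  -c 512 \
  -f wiki.test.raw
```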
KLD result over combined_all_micro.txt, with GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin as the reference logits:
====== Perplexity statistics ======
Mean PPL(Q) : 4.690483 ± 0.030264
Mean PPL(base) : 4.593228 ± 0.029642
Cor(ln(PPL(Q)), ln(PPL(base))): 96.97%
Mean ln(PPL(Q)/PPL(base)) : 0.020952 ± 0.001589
Mean PPL(Q)/PPL(base) : 1.021174 ± 0.001623
Mean PPL(Q)-PPL(base) : 0.097255 ± 0.007404
====== KL divergence statistics ======
Mean KLD: 0.138892 ± 0.002236
Maximum KLD: 24.789902
99.9% KLD: 14.156975
99.0% KLD: 2.560629
95.0% KLD: 0.300147
90.0% KLD: 0.147368
Median KLD: 0.013864
10.0% KLD: 0.000065
5.0% KLD: 0.000017
1.0% KLD: 0.000002
0.1% KLD: -0.000000
Minimum KLD: -0.000019
====== Token probability statistics ======
Mean Δp: -0.783 ± 0.025 %
Maximum Δp: 99.991%
99.9% Δp: 88.958%
99.0% Δp: 26.269%
95.0% Δp: 8.516%
90.0% Δp: 3.926%
75.0% Δp: 0.273%
Median Δp: -0.009%
25.0% Δp: -1.137%
10.0% Δp: -6.428%
5.0% Δp: -12.124%
1.0% Δp: -37.051%
0.1% Δp: -91.898%
Minimum Δp: -99.991%
RMS Δp : 10.192 ± 0.077 %
Same top p: 90.530 ± 0.073 %
For comparison, see the graph at https://huggingface.co/AesSedai/GLM-4.6-GGUF/discussions/1.
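For reference, a hedged two-step sketch of the KLD workflow (assuming llama.cpp's or ik_llama.cpp's llama-perplexity with its --kl-divergence-base / --kl-divergence options; model paths are placeholders, and the context length should match whatever the reference logits were generated with, presumably the 8192 in the file name):

```bash
# Step 1 (once): run the Q8_0 reference over the evaluation text and save its
# token logits to a base file.
./llama-perplexity \
  -m ./GLM-4.5-Q8_0.gguf \
  -f combined_all_micro.txt \
  --kl-divergence-base GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin

# Step 2: run this quant against the saved logits to produce the PPL, KLD and
# token-probability tables above.
./llama-perplexity \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 -ot "exps=CPU" \
  --kl-divergence-base GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin \
  --kl-divergence
```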