GLM-4.5-GGUF-3.2162bpw
This is a 3.2 BPW quantized model for the GPU poors with 128 GiB of System RAM and 24 GiB of VRAM.
The quant aims to achieve best-in-class performance by relying on:
- SOTA IQK-quants by @ikawrakow
- GGUF Tool Suite with the amazing calibration data by @Thireus
- Well-balanced importance matrix by @mradermacher
- Top-notch knowledge sharing by @ubergarm, @bartowski, @eaddario, @AesSedai, and many others
Size
The MoE per-expert FFN tensors take about 120 GiB and are loaded into System RAM, leaving absolutely no space for anything else. No GUI, no syslog, no cronie, no chronyd. For the GPU poors, every single bit matters.
The remaining tensors take about 12 GiB and are loaded into VRAM, leaving some room for the context and compute buffers.
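A minimal launch sketch for this split (assuming ik_llama.cpp's or llama.cpp's llama-server; the model path, context size, and thread count below are placeholders, adjust to taste):

```bash
# -ngl 99 offloads every layer to the GPU, then -ot "exps=CPU" overrides the
# routed-expert FFN tensors (ffn_*_exps) back to System RAM, reproducing the
# ~12 GiB VRAM / ~120 GiB RAM split described above.
./llama-server \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 \
  -ot "exps=CPU" \
  -c 32768 \
  -t 16
```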
Size from llama-server output:
llm_load_print_meta: model size = 134.433 GiB (3.223 BPW)
llm_load_print_meta: repeating layers = 133.360 GiB (3.211 BPW, 356.786 B parameters)
...
llm_load_tensors: CPU buffer size = 122416.56 MiB
llm_load_tensors: CUDA_Host buffer size = 486.20 MiB
llm_load_tensors: CUDA0 buffer size = 11829.74 MiB
System RAM usage from top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9851 sayap 20 0 185.2g 121.1g 1.2g R 94.6 97.0 1:14.58 llama-server
Quality
Recipe: a mixture of IQ4_KSS, IQ3_KS, IQ2_KL, and IQ2_KS for the per-expert FFN tensors, with no harmonization.
## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/
## Model head & embeddings — qbits: 32 6 5
^output_norm\.weight$=f32
^token_embd\.weight$=iq5_ks
^output\.weight$=iq6_k
## Multi-headed attention parameters — qbits: 32 8 5
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$=iq5_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=iq5_ks
## Dense Feed-Forward Network weights — qbits: 8
^blk\.[0-2]\.ffn_gate\.weight$=q8_0
^blk\.[0-2]\.ffn_down\.weight$=q8_0
^blk\.[0-2]\.ffn_up\.weight$=q8_0
## NextN tensors — qbits: 32 5
^blk\.92\.nextn\.enorm\.weight$=f32
^blk\.92\.nextn\.eh_proj\.weight$=iq5_ks
^blk\.92\.nextn\.embed_tokens\.weight$=iq5_ks
^blk\.92\.nextn\.shared_head_norm\.weight$=f32
^blk\.92\.nextn\.shared_head_head\.weight$=iq5_ks
^blk\.92\.nextn\.hnorm\.weight$=f32
## MoE Gating & Routing — qbits: 32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight$=f32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias$=f32
## Misc / Other tensors — qbits: 32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight$=f32
## GPU-loaded - MoE Shared Experts Feed-Forward Network - ffn_*_shexp
# ffn_down_shexp — down-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_down_shexp\.weight$=q8_0
# ffn_up_shexp — up-projection (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_up_shexp\.weight$=q8_0
# ffn_gate_shexp — gating network (shared experts) — qbits: 8
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_shexp\.weight$=q8_0
## CPU-friendly - MoE Per-expert Feed-Forward Network - ffn_*_exps
# ffn_down_exps — down-projection (per-expert) — qbits: 4 3 2
^blk\.([3-5]|8|11|14|18|22|25|3[0-1]|35|43|46|5[6-7]|61|92)\.ffn_down_exps\.weight$=iq4_kss
^blk\.(6|13|1[5-6]|19|24|28|37|42|4[4-5]|4[7-9]|5[0-5]|59|60|6[2-6]|68|7[1-6]|78|8[1-2]|86)\.ffn_down_exps\.weight$=iq3_ks
^blk\.(10|12|17|21|23|2[6-7]|29|32|34|36|3[8-9]|40)\.ffn_down_exps\.weight$=iq2_ks
^blk\.(7|9|20|33|41|58|67|69|70|77|79|80|8[3-5]|8[7-9]|9[0-1])\.ffn_down_exps\.weight$=iq2_kl
# ffn_up_exps — up-projection (per-expert) — qbits: 4 3 2
^blk\.(11|13|1[5-6]|19|20|24|28|34|44|5[6-7]|92)\.ffn_up_exps\.weight$=iq4_kss
^blk\.(14|17|23|3[0-1]|36|4[0-3]|4[6-9]|5[0-5]|5[8-9]|6[0-2]|6[4-5]|6[7-9]|7[0-9]|8[1-2]|85|88|91)\.ffn_up_exps\.weight$=iq3_ks
^blk\.([3-5]|8|12|2[1-2]|3[8-9]|45|63|66|80|8[3-4]|8[6-7]|89|90)\.ffn_up_exps\.weight$=iq2_kl
^blk\.([6-7]|9|10|18|2[5-7]|29|3[2-3]|35|37)\.ffn_up_exps\.weight$=iq2_ks
# ffn_gate_exps — gating network (per-expert) — qbits: 4 3 2
^blk\.(1[0-2]|17|19|2[0-3]|28|3[0-1]|92)\.ffn_gate_exps\.weight$=iq4_kss
^blk\.(1[3-6]|24|26|41|4[3-8]|5[0-9]|6[0-1]|6[3-5]|6[7-9]|7[0-6]|78|[8-9][0-1]|87)\.ffn_gate_exps\.weight$=iq3_ks
^blk\.([3-4]|[6-7]|33|39|42|49|62|66|77|79|8[2-6]|8[8-9])\.ffn_gate_exps\.weight$=iq2_kl
^blk\.(5|[8-9]|18|25|27|29|32|3[4-8]|40)\.ffn_gate_exps\.weight$=iq2_ks
## Summary of tensor sizes per class
# GPU Total: 12.13 GiB (100.0%) | 12.13 GiB max, if all were q8_0 | 12.13 GiB min, if all were q8_0
# CPU Total: 122.03 GiB (76.7%) | 159.18 GiB max, if all were iq4_kss | 88.29 GiB min, if all were iq2_ks
# GPU+CPU Total: 134.17 GiB (88.3%)
## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +f32 835 32 0.28 GiB - -
# +q8_0 186 8.5 0.96 GiB - -
# q8_0 279 8.5 2.66 GiB 100.0% 2.66
# +iq6_k 1 6.625 0.60 GiB - -
# +iq5_ks 187 5.25 7.63 GiB - -
#
# CPU-friendly quants:
# QTYPE Count BPW Assigned GiB % Assigned Max GiB (all)
# +iq5_ks 3 5.25 0.98 GiB - -
# +iq4_kss 3 4 1.76 GiB - -
# iq4_kss 41 4 24.02 GiB 15.4% 156.45
# iq3_ks 127 3.1875 59.30 GiB 47.6% 124.67
# iq2_kl 58 2.6875 22.83 GiB 21.7% 105.11
# iq2_ks 41 2.1875 13.14 GiB 15.4% 85.56
#
# -Average BPW: 3.2162
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Recipe produced on the 2025-11-28 06:08:52 WIB+0700 using Thireus' GGUF tools (https://gguf.thireus.com/)
# - Script SHA-256: 569b7f6a3239c9173d71ca1fadf34222607d72a2cfed2c284b42633e95b4a627
# - Calibration dataset 'models/GLM-4.5/ppl_results.csv' SHA-256: eee3f314cf1ecaa746932341b2189f8a7718e62489bbd250f957c4aec82c0015
# - tensors.bf16.map SHA-256: 4e8b7b435f6257174a7adfc90290ac92c36758fef201ba0f5358338eea7606b8
# - tensors.bf16.map model name: GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-01762-of-01762
# - tensors.q8_0.map SHA-256: 2814e1547cf288d327264135ed3f83e612b879826640283037d45f95a22ebfe2
# - tensors.q8_0.map model name: GLM-4.5-THIREUS-Q8_0-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq6_k.map SHA-256: 8ef4d5c379126fc13dfb46bbc8c10308d2c8e78602c0b3f6cea197d963fc80f1
# - tensors.iq6_k.map model name: GLM-4.5-THIREUS-IQ6_K-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq5_ks.map SHA-256: 9af5c1536eebc84dc6c71d855517c9acd534fbeefc3067aef26def0c104c7e64
# - tensors.iq5_ks.map model name: GLM-4.5-THIREUS-IQ5_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_kss.map SHA-256: 78644e76c921c329b6cf32d1c8711766170edea7e8960fcd3e9eb6d94601bc4b
# - tensors.iq4_kss.map model name: GLM-4.5-THIREUS-IQ4_KSS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq3_ks.map SHA-256: 4a33c7b3901cadf1a4e6130aeaed8806168249a0d386219db4ffec31188cb6af
# - tensors.iq3_ks.map model name: GLM-4.5-THIREUS-IQ3_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_kl.map SHA-256: ad6ca540847d2e75cb537169f77e91d8453b70613b3171fed20bc374f202d58a
# - tensors.iq2_kl.map model name: GLM-4.5-THIREUS-IQ2_KL-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_ks.map SHA-256: 4ed9fc5a73d854ad30f9f75577a1d826150cb23283a0bb54dba45d6aba6c9de2
# - tensors.iq2_ks.map model name: GLM-4.5-THIREUS-IQ2_KS-SPECIAL_TENSOR-01762-of-01762
# - GPG signatures: DISABLED
# - Command used:
# ./quant_assign.py models/GLM-4.5/ppl_results.csv --tolerance 0.001 --cpu-tensors-max-size 121.9 \
# --gpu-tensors-max-size 12.2 --exponential-factor 1.0 --skip-gpg --cpu-tensors \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_down_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_up_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[012])\.ffn_gate_exps\.weight' --gpu-tensors '.*' --cpu-quants iq4_kss iq3_ks iq2_kl \
# iq2_ks --gpu-quants q8_0 --cpu-assign-tensors '^blk\.(92)\.ffn_down_exps\.weight=iq4_kss' \
# '^blk\.(92)\.ffn_up_exps\.weight=iq4_kss' '^blk\.(92)\.ffn_gate_exps\.weight=iq4_kss' \
# '^blk\.92\.nextn\.shared_head_head\.weight=iq5_ks' '^blk\.92\.nextn\.embed_tokens\.weight=iq5_ks' \
# '^blk\.92\.nextn\.eh_proj\.weight=iq5_ks' --gpu-assign-qtype q8_0 --gpu-assign-tensors \
# '^blk\..*\.attn_(q|output)\.weight=iq5_ks' '^token_embd\.weight=iq5_ks' '^output\.weight=iq6_k' --harmonize-tensors \
# '' --harmonization-technique 0
## THE END!
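Every non-comment line of the recipe is a `regex=qtype` pair matched against full tensor names. A small sketch to check which qtype a given tensor ends up with (assuming the recipe above is saved verbatim as `recipe.txt`; the tensor name is just an example):

```bash
# Resolve the qtype the recipe assigns to a single tensor name.
tensor='blk.41.ffn_down_exps.weight'   # example tensor
grep -Ev '^(#|$)' recipe.txt | while IFS='=' read -r pattern qtype; do
  if printf '%s\n' "$tensor" | grep -Eq "$pattern"; then
    echo "$tensor -> $qtype"           # prints: blk.41.ffn_down_exps.weight -> iq2_kl
  fi
done
```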
PPL result with wiki.test.raw:
Final estimate: PPL over 565 chunks for n_ctx=512 = 3.3627 +/- 0.01897
For comparison, see the graph at https://huggingface.co/ubergarm/GLM-4.5-GGUF.
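A hedged sketch of how this number can be reproduced (assuming llama.cpp's or ik_llama.cpp's llama-perplexity, with the same CPU/GPU split as the llama-server example above; the model path is a placeholder):

```bash
# -c 512 matches the n_ctx=512 of the result above.
./llama-perplexity \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 -ot "exps=CPU" \
  -c 512 \
  -f wiki.test.raw
```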
KLD result over combined_all_micro.txt, with GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin as the reference logits:
====== Perplexity statistics ======
Mean PPL(Q) : 4.690483 ± 0.030264
Mean PPL(base) : 4.593228 ± 0.029642
Cor(ln(PPL(Q)), ln(PPL(base))): 96.97%
Mean ln(PPL(Q)/PPL(base)) : 0.020952 ± 0.001589
Mean PPL(Q)/PPL(base) : 1.021174 ± 0.001623
Mean PPL(Q)-PPL(base) : 0.097255 ± 0.007404
====== KL divergence statistics ======
Mean KLD: 0.138892 ± 0.002236
Maximum KLD: 24.789902
99.9% KLD: 14.156975
99.0% KLD: 2.560629
95.0% KLD: 0.300147
90.0% KLD: 0.147368
Median KLD: 0.013864
10.0% KLD: 0.000065
5.0% KLD: 0.000017
1.0% KLD: 0.000002
0.1% KLD: -0.000000
Minimum KLD: -0.000019
====== Token probability statistics ======
Mean Δp: -0.783 ± 0.025 %
Maximum Δp: 99.991%
99.9% Δp: 88.958%
99.0% Δp: 26.269%
95.0% Δp: 8.516%
90.0% Δp: 3.926%
75.0% Δp: 0.273%
Median Δp: -0.009%
25.0% Δp: -1.137%
10.0% Δp: -6.428%
5.0% Δp: -12.124%
1.0% Δp: -37.051%
0.1% Δp: -91.898%
Minimum Δp: -99.991%
RMS Δp : 10.192 ± 0.077 %
Same top p: 90.530 ± 0.073 %
For comparison, see the graph at https://huggingface.co/AesSedai/GLM-4.6-GGUF/discussions/1.
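For reference, a hedged two-step sketch of the KLD workflow (assuming llama.cpp's or ik_llama.cpp's llama-perplexity with its --kl-divergence-base / --kl-divergence options; model paths are placeholders, and the context length should match whatever the reference logits were generated with, presumably the 8192 in the file name):

```bash
# Step 1 (once): run the Q8_0 reference over the evaluation text and save its
# token logits to a base file.
./llama-perplexity \
  -m ./GLM-4.5-Q8_0.gguf \
  -f combined_all_micro.txt \
  --kl-divergence-base GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin

# Step 2: run this quant against the saved logits to produce the PPL, KLD and
# token-probability tables above.
./llama-perplexity \
  -m ./GLM-4.5-3.2162bpw.gguf \
  -ngl 99 -ot "exps=CPU" \
  --kl-divergence-base GLM-4.5-KLD-8192-ref-logits-ed-combined-all-micro-Q8_0.bin \
  --kl-divergence
```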