---
language:
- en
license: gemma
library_name: transformers
tags:
- text-generation
- gemma
- quantized
- bnb
- nf4
base_model: google/gemma-3-1b-it-qat-int4-unquantized
pipeline_tag: text-generation
---

# Gemma-3-1B-IT BitsAndBytesConfig NF4 Quantized

This model is a quantized version of `google/gemma-3-1b-it-qat-int4-unquantized`, produced with a `BitsAndBytesConfig` using NF4 quantization.

## Model Details

- **Base Model**: google/gemma-3-1b-it-qat-int4-unquantized
- **Quantization**: BitsAndBytesConfig NF4 (4-bit)
- **Quantization Type**: NF4 with double quantization
- **Compute Dtype**: bfloat16
- **Storage Dtype**: uint8

## Quantization Configuration

```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.uint8,
)
```

A sketch of how the base model can be quantized and serialized with this configuration is included at the end of this card.

## Usage

Gemma 3 support requires a recent `transformers` release (v4.50 or later) together with `bitsandbytes` and `accelerate`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the quantized model; the bitsandbytes quantization config is stored
# in the repository, so no extra quantization arguments are needed here.
model = AutoModelForCausalLM.from_pretrained(
    "WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("WaveCut/gemma-3-1b-it-qat-int4-bnb-nf4")

# Generate text
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

Since this is an instruction-tuned checkpoint, prompting through the tokenizer's chat template usually works better than raw strings; a sketch follows the License section below.

## Benefits

- **Reduced Memory Usage**: roughly 75% smaller weight footprint than 16-bit precision
- **Faster Inference**: lower weight memory traffic can speed up memory-bandwidth-bound generation
- **Maintained Quality**: NF4 quantization preserves model quality well for its size

## Hardware Requirements

- **GPU Memory**: roughly 1-2 GB of VRAM for the weights and a short context (vs. ~2 GB for the weights alone in bfloat16)
- **CUDA Compatible**: a CUDA-capable GPU is required for best performance with the bitsandbytes 4-bit kernels
- **CPU Fallback**: can run on CPU with reduced performance

A quick way to check the actual weight footprint on your hardware is sketched after the License section.

## Quantization Details

This model uses BitsAndBytesConfig for 4-bit quantization:

- NF4 (Normal Float 4) quantization for a favorable quality/size trade-off
- Double quantization for additional compression of the quantization constants
- Mixed precision with a bfloat16 compute dtype

## License

This model inherits the Gemma Terms of Use from the base model.
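## Chat Template Usage

Because this is an instruction-tuned checkpoint, prompts are generally best rendered through the tokenizer's chat template rather than passed as raw strings. The snippet below is a minimal sketch that reuses the `model` and `tokenizer` objects from the Usage section; the example message is arbitrary.

```python
# Build a single-turn conversation and render it with Gemma's chat template
messages = [
    {"role": "user", "content": "Explain NF4 quantization in one sentence."},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=100)
# Strip the prompt tokens and decode only the model's reply
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```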
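## Checking the Memory Footprint

To verify the memory figures above on your own hardware, `transformers` exposes `get_memory_footprint()` on loaded models. The sketch below reuses the `model` from the Usage section and reports only the weight footprint; the KV cache and activations add to this at runtime.

```python
# Report the weight memory of the loaded 4-bit model in GiB
footprint_gib = model.get_memory_footprint() / 1024**3
print(f"Weight footprint: {footprint_gib:.2f} GiB")
```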
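## Reproducing the Quantization

The export script used to create this repository is not included here. The following is a minimal sketch, assuming only the configuration documented in the Quantization Configuration section, of how the base model could be quantized with bitsandbytes and serialized; the local output directory name is a placeholder.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_id = "google/gemma-3-1b-it-qat-int4-unquantized"

# Quantization settings matching the "Quantization Configuration" section
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.uint8,
)

# Load the base checkpoint and quantize its linear layers on the fly
model = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Serialize the 4-bit weights (supported for bitsandbytes 4-bit models
# in recent transformers releases)
output_dir = "gemma-3-1b-it-qat-int4-bnb-nf4"  # placeholder local path
model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```

From there, the quantized weights can be uploaded with `model.push_to_hub(...)` and `tokenizer.push_to_hub(...)`.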