Sarashina2-7B BitsAndBytes 4-bit Quantized

This is a 4-bit quantized version of sbintuitions/sarashina2-7b using BitsAndBytes (NF4).

Model Description

  • Base Model: sarashina2-7b (7B parameters)
  • Quantization Method: BitsAndBytes (bitsandbytes library)
  • Quantization Type: NF4 (Normal Float 4-bit)
  • Double Quantization: Enabled
  • Compute dtype: bfloat16
  • Original Size: ~14.6 GB
  • Quantized Size: ~4-5 GB
  • Memory Reduction: ~70-75% (a rough estimate follows below)
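
As a rough sanity check on these figures, here is a back-of-the-envelope estimate. The exact parameter count (~7.3B) and the ~0.127 bits/parameter scale-constant overhead for double quantization (the figure reported in the QLoRA paper) are assumptions, not values taken from this repository:

n_params = 7.3e9  # assumed parameter count; "7B" is nominal

fp16_gb = n_params * 2 / 1e9   # 2 bytes per weight in FP16
nf4_gb = n_params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight
# Double quantization compresses the per-block scale constants,
# leaving roughly 0.127 bits/parameter of overhead (QLoRA figure).
overhead_gb = n_params * 0.127 / 8 / 1e9

print(f"FP16 weights: ~{fp16_gb:.1f} GB")               # ~14.6 GB
print(f"NF4 weights:  ~{nf4_gb + overhead_gb:.1f} GB")  # ~3.8 GB
print(f"Reduction:    ~{(1 - (nf4_gb + overhead_gb) / fp16_gb) * 100:.0f}%")

Layers that BitsAndBytes leaves in higher precision (embeddings, the LM head, norms) account for the gap between this estimate and the ~4-5 GB observed size.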

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "ronantakizawa/sarashina2-7b-4bit-bnb"

# Configure BitsAndBytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Generate text ("Good morning, today's weather is" in Japanese)
prompt = "おはようございます、今日の天気は"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    temperature=0.7,
    top_p=0.95
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
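
To verify the reduction on your own hardware after loading, Transformers can report the model's in-memory size directly:

# Total size of the model's parameters and buffers.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")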

Installation

pip install transformers accelerate bitsandbytes
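
A quick, model-agnostic way to confirm the environment is set up correctly before loading:

import torch
import transformers
import bitsandbytes

print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())  # must be True for 4-bit loading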

Requirements

  • CUDA GPU: BitsAndBytes requires CUDA (not compatible with CPU)
  • GPU Memory: ~5-6 GB VRAM recommended (see the check below)
  • Python: 3.8+
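
If you are unsure whether your GPU meets the memory recommendation, PyTorch can report the total VRAM of the device:

import torch

# Total VRAM on the first visible CUDA device (~5-6 GB needed).
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")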

Performance

  • Memory Usage: Reduced by ~70-75% compared to FP16
  • Inference Speed: Typically somewhat slower than FP16, since NF4 weights are dequantized on the fly during generation (a simple timing sketch follows below)
  • Quality: Minimal accuracy loss with NF4 quantization
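
A simple way to measure generation speed on your own hardware, assuming the model and tokenizer from the Usage section are already loaded (the prompt is arbitrary):

import time
import torch

inputs = tokenizer("こんにちは、", return_tensors="pt").to(model.device)  # "Hello," in Japanese

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"~{new_tokens / elapsed:.1f} tokens/s")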

Advantages of BitsAndBytes

  • โœ… No calibration required - quantizes on model load
  • โœ… Easy to use - single configuration parameter
  • โœ… Widely compatible - works with most Hugging Face models
  • โœ… Double quantization - additional memory savings
  • โœ… NF4 quantization - optimized for neural network weights

Limitations

  • Requires CUDA GPU (no CPU support)
  • May have slight quality degradation compared to full precision
  • Cannot export to ONNX or other formats

License

MIT License (inherited from base model)

Citation

@misc{sarashina2-7b-bnb,
  author = {Ronan Takizawa},
  title = {Sarashina2-7B BitsAndBytes 4-bit Quantized},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ronantakizawa/sarashina2-7b-4bit-bnb}}
}

Base Model Citation

Please refer to the original model card for the base model citation.
