RadiPro Chatbot - Llama 3.2 1B Instruct (GGUF Q4_K_M)
Model Description
This is a quantized version of the Meta Llama 3.2 1B Instruct model, optimized for efficient inference using the GGUF format with Q4_K_M quantization. The model has been quantized to 4-bit precision using the K-quant method (Medium quality), providing a good balance between model size, inference speed, and quality.
This model has been trained to provide helpful, accurate, and contextually appropriate responses about RadiPro's services. Since RadiPro is a rather small AI agency with a limited number of services, the chatbot's main purpose is to demonstrate to clients what a potential implementation on their own platforms might look like.
Model Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Quantization Method: Q4_K_M (4-bit K-quant, Medium quality)
- Format: GGUF (GPT-Generated Unified Format)
- Model Size: ~700MB (quantized from ~2GB)
- Architecture: Transformer-based language model
- Context Length: 128K tokens
- Languages: Primarily English, with multilingual capabilities
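Details such as the context length and quantization type are also stored in the GGUF file header and can be inspected with the gguf Python package (shipped with llama.cpp and available on PyPI). A minimal sketch:
import gguf

# Open the GGUF file and list the metadata keys stored in its header
reader = gguf.GGUFReader("radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf")
for key in reader.fields:
    print(key)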
Quantization Details
- Quantization Type: Q4_K_M
- Bits: 4-bit
- Method: K-quant (llama.cpp's block-wise quantization scheme using super-blocks with per-block scales and minimums)
- Quality: Medium (balanced quality/size trade-off)
- Compression Ratio: ~3x reduction in model size
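For intuition, the sketch below shows the general idea behind block-wise 4-bit quantization: weights are grouped into small blocks, and each block stores 4-bit indices plus a per-block scale and minimum. This is an illustrative simplification only; the actual Q4_K_M format is more elaborate (super-blocks with additionally quantized scales).
import numpy as np

def quantize_block_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy block-wise 4-bit quantization: per-block scale + minimum.
    Illustrative only -- not the real Q4_K_M kernel."""
    blocks = weights.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0  # 4 bits -> 16 levels
    scales[scales == 0] = 1.0                                   # avoid division by zero
    q = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_block_4bit(q, scales, mins):
    return q.astype(np.float32) * scales + mins

w = np.random.randn(4096).astype(np.float32)
q, s, m = quantize_block_4bit(w)
w_hat = dequantize_block_4bit(q, s, m).reshape(-1)
print("mean abs error:", np.abs(w - w_hat).mean())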
How to Use
Using llama.cpp
# Download and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Run inference (recent llama.cpp builds name the CLI llama-cli; older ones used ./main)
./build/bin/llama-cli -m radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf \
    -p "Your prompt here" \
    -n 128
Using Python with llama-cpp-python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,    # Context window
    n_threads=4,   # Number of CPU threads
)

# Generate text
response = llm(
    "What is artificial intelligence?",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
)
print(response['choices'][0]['text'])
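Because this is an instruct-tuned model, prompts generally work better when wrapped in the Llama 3 chat template. llama-cpp-python can apply the template stored in the GGUF metadata via its chat API; a minimal sketch (the system and user messages here are illustrative):
from llama_cpp import Llama

llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
)

# create_chat_completion applies the chat template from the GGUF metadata
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are the RadiPro assistant."},
        {"role": "user", "content": "What services does RadiPro offer?"},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])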
Using Ollama
# Create a Modelfile
cat > Modelfile << EOF
FROM ./radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF
# Create and run
ollama create radipro-chatbot -f Modelfile
ollama run radipro-chatbot
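Once created, the model can also be queried programmatically through Ollama's local REST API (by default at http://localhost:11434); a minimal sketch using the requests library, with an illustrative prompt:
import requests

# Non-streaming generation request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "radipro-chatbot",
        "prompt": "What services does RadiPro offer?",
        "stream": False,
    },
)
print(resp.json()["response"])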
Performance
- Inference Speed: Significantly faster than FP16/BF16 models
- Memory Usage: ~1-2GB RAM (depending on context size)
- Quality: Q4_K_M typically retains most of the original model's quality (roughly 95-98% by common estimates), with only a small perplexity increase relative to FP16
- Best For: CPU inference, edge devices, resource-constrained environments
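Memory usage grows with context because of the KV cache. A rough back-of-the-envelope estimate, assuming the published Llama 3.2 1B configuration (16 layers, 8 KV heads, head dimension 64) and an FP16 cache:
# Rough KV-cache size estimate (config values assumed from the HF model config)
layers, kv_heads, head_dim = 16, 8, 64
bytes_per_value = 2  # FP16 cache

def kv_cache_bytes(n_ctx: int) -> int:
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per_value * n_ctx

for n_ctx in (2048, 8192, 32768):
    print(f"{n_ctx:>6} tokens -> {kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
This is why the full 128K context is rarely practical on low-memory machines, even though the quantized weights themselves are small.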
System Requirements
- Minimum RAM: 2GB
- Recommended RAM: 4GB+
- CPU: Any modern CPU (x86_64, ARM64)
- GPU: Optional (supports GPU acceleration via llama.cpp with CUDA/Metal)
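When a GPU-enabled build of llama.cpp / llama-cpp-python is installed (e.g. with CUDA or Metal support), layers can be offloaded with the n_gpu_layers parameter; a minimal sketch:
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU when a GPU build is available
llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)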
Limitations
- Reduced Precision: 4-bit quantization may result in slight quality degradation compared to the full-precision model
- Context Window: While the base model supports 128K tokens, practical context limits depend on available memory
- Language: Primarily optimized for English, though multilingual capabilities exist
- Bias: This model has been specifically trained to serve as a RadiPro chatbot. It will not answer general questions and will always try to steer the conversation toward the agency and its services.
Ethical Considerations
This model is intended for research and educational purposes. Users should be aware of potential biases and limitations. Please review the Llama 3.2 Community License for usage terms.
Citation
If you use this model, please cite the original Llama 3.2 model:
@misc{llama32,
  title        = {Llama 3.2},
  author       = {{Meta AI}},
  year         = {2024},
  howpublished = {\url{https://llama.meta.com/llama3/}}
}
License
This model is released under the Llama 3.2 Community License. Please review the license terms before use.
Acknowledgments
- Base Model: Meta AI for the Llama 3.2 1B Instruct model
- Quantization: GGUF format and quantization tools
- Community: The open-source AI community for tools and support
Model Card Contact
For questions, issues, or contributions related to this quantized model, please open an issue in the repository.
Note: This is a quantized model card. For the full-precision model, please refer to the original Llama 3.2 1B Instruct model.