RadiPro Chatbot - Llama 3.2 1B Instruct (GGUF Q4_K_M)
Model Description
This is a quantized version of the Meta Llama 3.2 1B Instruct model, optimized for efficient inference using the GGUF format with Q4_K_M quantization. The model has been quantized to 4-bit precision using the K-quant method (Medium quality), providing a good balance between model size, inference speed, and quality.
This model has been trained to provide helpful, accurate, and contextually appropriate responses about RadiPro's services. Since RadiPro is a rather small AI agency with a limited number of services, the chatbot's main purpose is to demonstrate to clients what a potential implementation on their own platforms might look like.
Model Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Quantization Method: Q4_K_M (4-bit K-quant, Medium quality)
- Format: GGUF (GPT-Generated Unified Format)
- Model Size: ~700MB (quantized from ~2GB)
- Architecture: Transformer-based language model
- Context Length: 128K tokens
- Languages: Primarily English, with multilingual capabilities
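Details such as the context length and quantization type are also stored in the GGUF file header and can be inspected with the gguf Python package (shipped with llama.cpp and available on PyPI). A minimal sketch:
import gguf

# Open the GGUF file and list the metadata keys stored in its header
reader = gguf.GGUFReader("radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf")
for key in reader.fields:
    print(key)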
Quantization Details
- Quantization Type: Q4_K_M
- Bits: 4-bit
- Method: K-quant (llama.cpp's block-wise quantization scheme using super-blocks with per-block scales and minimums)
- Quality: Medium (balanced quality/size trade-off)
- Compression Ratio: ~3x reduction in model size
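For intuition, the sketch below shows the general idea behind block-wise 4-bit quantization: weights are grouped into small blocks, and each block stores 4-bit indices plus a per-block scale and minimum. This is an illustrative simplification only; the actual Q4_K_M format is more elaborate (super-blocks with additionally quantized scales).
import numpy as np

def quantize_block_4bit(weights: np.ndarray, block_size: int = 32):
    """Toy block-wise 4-bit quantization: per-block scale + minimum.
    Illustrative only -- not the real Q4_K_M kernel."""
    blocks = weights.reshape(-1, block_size)
    mins = blocks.min(axis=1, keepdims=True)
    scales = (blocks.max(axis=1, keepdims=True) - mins) / 15.0  # 4 bits -> 16 levels
    scales[scales == 0] = 1.0                                   # avoid division by zero
    q = np.clip(np.round((blocks - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_block_4bit(q, scales, mins):
    return q.astype(np.float32) * scales + mins

w = np.random.randn(4096).astype(np.float32)
q, s, m = quantize_block_4bit(w)
w_hat = dequantize_block_4bit(q, s, m).reshape(-1)
print("mean abs error:", np.abs(w - w_hat).mean())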
How to Use
Using llama.cpp
# Download and build llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
# Run inference (recent llama.cpp builds name the CLI llama-cli; older ones used ./main)
./build/bin/llama-cli -m radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf \
    -p "Your prompt here" \
    -n 128
Using Python with llama-cpp-python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,    # Context window
    n_threads=4,   # Number of CPU threads
)

# Generate text
response = llm(
    "What is artificial intelligence?",
    max_tokens=128,
    temperature=0.7,
    top_p=0.9,
)
print(response['choices'][0]['text'])
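Because this is an instruct-tuned model, prompts generally work better when wrapped in the Llama 3 chat template. llama-cpp-python can apply the template stored in the GGUF metadata via its chat API; a minimal sketch (the system and user messages here are illustrative):
from llama_cpp import Llama

llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
)

# create_chat_completion applies the chat template from the GGUF metadata
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are the RadiPro assistant."},
        {"role": "user", "content": "What services does RadiPro offer?"},
    ],
    max_tokens=128,
    temperature=0.7,
)
print(response["choices"][0]["message"]["content"])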
Using Ollama
# Create a Modelfile
cat > Modelfile << EOF
FROM ./radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf
TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>
{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>
{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>
{{ .Response }}<|eot_id|>"""
PARAMETER temperature 0.7
PARAMETER top_p 0.9
EOF
# Create and run
ollama create radipro-chatbot -f Modelfile
ollama run radipro-chatbot
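Once created, the model can also be queried programmatically through Ollama's local REST API (by default at http://localhost:11434); a minimal sketch using the requests library, with an illustrative prompt:
import requests

# Non-streaming generation request against the local Ollama server
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "radipro-chatbot",
        "prompt": "What services does RadiPro offer?",
        "stream": False,
    },
)
print(resp.json()["response"])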
Performance
- Inference Speed: Significantly faster than FP16/BF16 models
- Memory Usage: ~1-2GB RAM (depending on context size)
- Quality: Q4_K_M typically retains most of the original model's quality (roughly 95-98% by common estimates), with only a small perplexity increase relative to FP16
- Best For: CPU inference, edge devices, resource-constrained environments
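Memory usage grows with context because of the KV cache. A rough back-of-the-envelope estimate, assuming the published Llama 3.2 1B configuration (16 layers, 8 KV heads, head dimension 64) and an FP16 cache:
# Rough KV-cache size estimate (config values assumed from the HF model config)
layers, kv_heads, head_dim = 16, 8, 64
bytes_per_value = 2  # FP16 cache

def kv_cache_bytes(n_ctx: int) -> int:
    # 2x for keys and values
    return 2 * layers * kv_heads * head_dim * bytes_per_value * n_ctx

for n_ctx in (2048, 8192, 32768):
    print(f"{n_ctx:>6} tokens -> {kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
This is why the full 128K context is rarely practical on low-memory machines, even though the quantized weights themselves are small.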
System Requirements
- Minimum RAM: 2GB
- Recommended RAM: 4GB+
- CPU: Any modern CPU (x86_64, ARM64)
- GPU: Optional (supports GPU acceleration via llama.cpp with CUDA/Metal)
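When a GPU-enabled build of llama.cpp / llama-cpp-python is installed (e.g. with CUDA or Metal support), layers can be offloaded with the n_gpu_layers parameter; a minimal sketch:
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU when a GPU build is available
llm = Llama(
    model_path="radipro-chatbot-Llama-3.2-1B-Instruct.Q4_K_M.gguf",
    n_ctx=2048,
    n_gpu_layers=-1,
)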
Limitations
- Reduced Precision: 4-bit quantization may result in slight quality degradation compared to the full-precision model
- Context Window: While the base model supports 128K tokens, practical context limits depend on available memory
- Language: Primarily optimized for English, though multilingual capabilities exist
- Bias: This model has been specifically trained to serve as a RadiPro chatbot. It will not answer general questions and will always try to steer the conversation toward the agency and its services.
Ethical Considerations
This model is intended for research and educational purposes. Users should be aware of potential biases and limitations. Please review the Llama 3.2 Community License for usage terms.
Citation
If you use this model, please cite the original Llama 3.2 model:
@misc{llama32,
  title        = {Llama 3.2},
  author       = {{Meta AI}},
  year         = {2024},
  howpublished = {\url{https://llama.meta.com/llama3/}}
}
License
This model is released under the Llama 3.2 Community License. Please review the license terms before use.
Acknowledgments
- Base Model: Meta AI for the Llama 3.2 1B Instruct model
- Quantization: GGUF format and quantization tools
- Community: The open-source AI community for tools and support
Model Card Contact
For questions, issues, or contributions related to this quantized model, please open an issue in the repository.
Note: This is a quantized model card. For the full-precision model, please refer to the original Llama 3.2 1B Instruct model.