# GGUF Conversion Script

This script converts DragonLLM models from Hugging Face format to GGUF for use with Ollama on macOS.

## Quick Start

```bash
# Activate virtual environment
cd /Users/jeanbapt/simple-llm-pro-finance
source venv/bin/activate

# Run conversion (uses default: Qwen-Pro-Finance-R-32B)
python3 scripts/convert_to_gguf.py

# Or specify a model by number (1-5) or name
python3 scripts/convert_to_gguf.py 1   # Qwen-Pro-Finance-R-32B
python3 scripts/convert_to_gguf.py 2   # qwen3-32b-fin-v1.0
python3 scripts/convert_to_gguf.py "DragonLLM/qwen3-32b-fin-v1.0"
```

## Available 32B Models

1. **DragonLLM/Qwen-Pro-Finance-R-32B** (Recommended - latest)
2. DragonLLM/qwen3-32b-fin-v1.0
3. DragonLLM/qwen3-32b-fin-v0.3
4. DragonLLM/qwen3-32b-fin-v1.0-fp8 (already quantized to FP8)
5. DragonLLM/Qwen-Pro-Finance-R-32B-FP8 (already quantized to FP8)

## What It Does

1. **Downloads llama.cpp** (if not already present)
2. **Converts the model to a base GGUF** (FP16, ~64GB)
3. **Quantizes it to multiple levels**:
   - Q5_K_M (~20GB) - **Best balance** ⭐
   - Q6_K (~24GB) - Higher quality
   - Q4_K_M (~16GB) - Smaller size
   - Q8_0 (~32GB) - Highest quality

## Memory Requirements

- **Base conversion (FP16)**: ~64GB RAM
- **Quantization**: ~32GB RAM (can be done separately)

## Output

Files are saved to `simple-llm-pro-finance/gguf_models/`:

```
gguf_models/
├── Qwen-Pro-Finance-R-32B-f16.gguf    (~64GB)
├── Qwen-Pro-Finance-R-32B-q5_k_m.gguf (~20GB) ⭐ Recommended
├── Qwen-Pro-Finance-R-32B-q6_k.gguf   (~24GB)
├── Qwen-Pro-Finance-R-32B-q4_k_m.gguf (~16GB)
└── Qwen-Pro-Finance-R-32B-q8_0.gguf   (~32GB)
```

## Using with Ollama

After conversion, create an Ollama model:

```bash
# Create Modelfile
cat > Modelfile << EOF
FROM ./gguf_models/Qwen-Pro-Finance-R-32B-q5_k_m.gguf
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
{{ .Response }}<|im_end|>
"""
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
EOF

# Create model
ollama create qwen-finance-32b -f Modelfile

# Use it
ollama run qwen-finance-32b "What is compound interest?"
```
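Once the model has been created, you can also query it programmatically through Ollama's HTTP API. This is a minimal sketch, assuming Ollama is running on its default port (11434) and the model was created under the name used above:

```bash
# Send a non-streaming chat request to the local Ollama server
curl http://localhost:11434/api/chat -d '{
  "model": "qwen-finance-32b",
  "messages": [
    {"role": "user", "content": "Explain the difference between simple and compound interest."}
  ],
  "stream": false
}'
```

With `"stream": false`, the response comes back as a single JSON object containing the assistant message.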
## Tool Calling Support

GGUF models maintain their tool calling capabilities after conversion. Ollama exposes an OpenAI-compatible endpoint, so function calling works through the standard OpenAI client:

```python
from openai import OpenAI

# Point the OpenAI client at Ollama's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="qwen-finance-32b",
    messages=[{"role": "user", "content": "Calculate future value of 10000 at 5% for 10 years"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "calculate_fv",
            "description": "Calculate future value",
            "parameters": {
                "type": "object",
                "properties": {
                    "pv": {"type": "number"},
                    "rate": {"type": "number"},
                    "nper": {"type": "number"}
                }
            }
        }
    }],
    tool_choice="auto"
)
```

## Troubleshooting

### Out of Memory

- Use Q4_K_M instead of Q5_K_M
- Close other applications
- Reduce the context window in Ollama (`num_ctx 4096`)

### Conversion Fails

- Ensure `HF_TOKEN_LC2` is set in `.env`
- Check that you have access to the model on Hugging Face
- Verify you have enough disk space (~200GB recommended)

### Quantization Fails

- The base FP16 file is still usable
- Try quantizing manually: `./llama.cpp/llama-quantize input.gguf output.gguf Q5_K_M` (see the manual conversion sketch at the end of this README)

## Notes

- **FP8 models** (models 4 and 5) are already quantized, but converting to GGUF still provides benefits for Ollama
- **Q5_K_M is recommended** for the best quality/size trade-off on Mac
- Conversion takes 30-60 minutes depending on your system
- Quantization takes 10-20 minutes per level
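## Manual Conversion (Optional)

If you want to run individual steps yourself, for example to produce a single quantization level without repeating the full conversion, the underlying llama.cpp commands look roughly like the sketch below. The paths, local directory names, and script locations here are assumptions based on a default llama.cpp checkout in `./llama.cpp`; they can vary between llama.cpp versions and build layouts.

```bash
# Download the model weights locally (requires access to the repo on Hugging Face)
huggingface-cli download DragonLLM/Qwen-Pro-Finance-R-32B --local-dir ./hf_models/Qwen-Pro-Finance-R-32B

# Convert the Hugging Face checkpoint to a base FP16 GGUF (~64GB)
python3 llama.cpp/convert_hf_to_gguf.py ./hf_models/Qwen-Pro-Finance-R-32B \
  --outfile gguf_models/Qwen-Pro-Finance-R-32B-f16.gguf \
  --outtype f16

# Quantize the FP16 GGUF to the recommended Q5_K_M level (~20GB)
./llama.cpp/llama-quantize \
  gguf_models/Qwen-Pro-Finance-R-32B-f16.gguf \
  gguf_models/Qwen-Pro-Finance-R-32B-q5_k_m.gguf \
  Q5_K_M
```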