eric8810 committed on
Commit bd713c3 · 1 Parent(s): 7b8c2e7

update README for llama.cpp diffusion running examples

GGUF_Q8_0_README.md CHANGED
@@ -91,16 +91,19 @@ gguf_output/

### llama.cpp Command Line

+ Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
+
```bash
# Basic usage
- ./llama.cpp/main \
+ ./llama.cpp/build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "def quicksort(arr):" \
-n 512 \
- -c 2048
+ -c 2048 \
+ --diffusion-steps 128

# Advanced parameters
- ./llama.cpp/main \
+ ./llama.cpp/build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "Write a binary search function" \
-n 256 \
@@ -108,9 +111,33 @@ gguf_output/
--temp 0.1 \
--top-p 0.95 \
--repeat-penalty 1.1 \
+ --diffusion-steps 128 \
+ --diffusion-algorithm 4 \
+ --diffusion-alg-temp 0.0 \
-t 8
+
+ # Visualize generation process
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def fibonacci(n):" \
+ -n 256 \
+ --diffusion-steps 64 \
+ --diffusion-visual
```

+ #### Diffusion Parameter Description
+
+ - `--diffusion-steps N`: Diffusion denoising steps (default: 128)
+ - `--diffusion-algorithm N`: Algorithm selection:
+ - 0 = ORIGIN (original algorithm)
+ - 1 = ENTROPY_BASED (entropy-based)
+ - 2 = MARGIN_BASED (margin-based)
+ - 3 = RANDOM (random)
+ - 4 = LOW_CONFIDENCE (low confidence, default)
+ - `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
+ - `--diffusion-visual`: Enable visualization mode to show generation progress
+ - `--diffusion-eps F`: Timestep epsilon value
+
### Python (llama-cpp-python)

```bash
@@ -151,10 +178,11 @@ make clean
make LLAMA_CUBLAS=1 -j$(nproc)

# Use GPU acceleration (partial layers)
- ./main \
+ ./build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "def quicksort(arr):" \
-n 512 \
+ --diffusion-steps 128 \
-ngl 20 # Number of GPU layers
```

@@ -183,7 +211,7 @@ make LLAMA_CUBLAS=1 -j$(nproc)
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test
- echo "def hello():" | ./llama.cpp/main -m gguf_output/dream-coder-7b-q8_0.gguf -n 20
+ echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
```

## Performance Optimization
@@ -206,9 +234,11 @@ echo "def hello():" | ./llama.cpp/main -m gguf_output/dream-coder-7b-q8_0.gguf -
## Important Notes

1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
- 2. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
- 3. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
- 4. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 2. **Dedicated Tool**: Must use `llama-diffusion-cli` instead of the regular `main` tool
+ 3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
+ 4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
+ 5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 6. **Diffusion Steps**: Recommend 64-128 steps; more steps may improve quality but increase inference time

## Technical Support
 
GGUF_Q8_0_README_EN.md ADDED
@@ -0,0 +1,253 @@
+ # Dream-Coder GGUF Q8_0 Quantization Guide
+
+ This guide is specifically designed for GGUF Q8_0 quantization of the Dream-Coder v0-Instruct-7B model.
+
+ ## Quick Start
+
+ ### 1. Environment Setup
+
+ ```bash
+ # 1. Clone and compile llama.cpp
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release -j
+
+ # 2. Install Python dependencies
+ pip install "transformers>=4.46.2" torch safetensors numpy
+ ```
+
+ ### 2. Execute Quantization
+
+ #### Method 1: Use the provided script
+
+ ```bash
+ # Set the llama.cpp path
+ export LLAMA_CPP_PATH=/path/to/llama.cpp
+
+ # Run the quantization script
+ ./quantize_example.sh
+ ```
+
+ #### Method 2: Manual execution
+
+ ```bash
+ python quantize_dream_q8_0.py \
+ --model_path /path/to/Dream-Coder-v0-Instruct-7B \
+ --llama_cpp_path /path/to/llama.cpp \
+ --output_dir ./gguf_output \
+ --keep_f16
+ ```
+
+ ### 3. Parameter Description
+
+ - `--model_path`: Dream-Coder model path (default: current directory)
+ - `--llama_cpp_path`: llama.cpp project path (required)
+ - `--output_dir`: Output directory (default: ./gguf_output)
+ - `--keep_f16`: Keep the F16 intermediate file
+
+ ## Architecture Adaptation
+
+ ### Dream-Coder Special Configuration Handling
+
+ This quantization script specifically handles the following Dream-Coder configuration details (a sketch of the remapping step follows the list):
+
+ 1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for converter compatibility)
+ 2. **Special Token IDs**:
+ - `mask_token_id`: 151666 (critical diffusion token)
+ - `bos_token_id`: 151665
+ - `eos_token_id`: 151643
+ - `pad_token_id`: 151643
+
+ 3. **Model Parameters**:
+ - Vocabulary size: 152,064
+ - Hidden dimension: 3,584
+ - Attention heads: 28 (4 key-value heads)
+ - Layers: 28
+ - Context length: 32,768
+
+ 4. **Diffusion Features**:
+ - Preserve `mask_token_id` metadata
+ - RoPE theta: 1,000,000.0
+ - Activation function: SiLU
+
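+ As a minimal sketch of the remapping step (illustrative only, assuming `jq` is installed and `config.json` sits in the model directory; this is not the actual `quantize_dream_q8_0.py` source):
+
+ ```bash
+ cd /path/to/Dream-Coder-v0-Instruct-7B
+
+ # Point the converter at a Llama-style architecture; token IDs and
+ # dimensions are left untouched.
+ jq '.architectures = ["LlamaForCausalLM"]' config.json > config.patched.json
+
+ # Sanity-check the special token IDs the guide relies on.
+ jq '{mask_token_id, bos_token_id, eos_token_id, vocab_size}' config.patched.json
+ # expected: 151666 / 151665 / 151643 / 152064
+ ```
+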
+ ## Output Description
+
+ ### File Structure
+ ```
+ gguf_output/
+ ├── dream-coder-7b-f16.gguf # F16 intermediate file (optionally kept)
+ └── dream-coder-7b-q8_0.gguf # Final Q8_0 quantized file
+ ```
+
+ ### Performance Expectations
+
+ | Metric | Original (BF16) | Q8_0 |
+ |--------|-----------------|------|
+ | Memory Usage | ~14GB | ~6.7GB |
+ | Inference Speed | 1.0x | 1.2-1.5x |
+ | Precision Loss | 0% | <0.1% |
+
+ ## Usage
+
+ ### llama.cpp Command Line
+
+ Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
+
+ ```bash
+ # Basic usage
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 512 \
+ -c 2048 \
+ --diffusion-steps 128
+
+ # Advanced parameters
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "Write a binary search function" \
+ -n 256 \
+ -c 2048 \
+ --temp 0.1 \
+ --top-p 0.95 \
+ --repeat-penalty 1.1 \
+ --diffusion-steps 128 \
+ --diffusion-algorithm 4 \
+ --diffusion-alg-temp 0.0 \
+ -t 8
+
+ # Visualize generation process
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def fibonacci(n):" \
+ -n 256 \
+ --diffusion-steps 64 \
+ --diffusion-visual
+ ```
+
+ #### Diffusion Parameter Description
+
+ The diffusion-specific flags accepted by `llama-diffusion-cli` are listed below; a step-count comparison sketch follows the list.
+
+ - `--diffusion-steps N`: Diffusion denoising steps (default: 128)
+ - `--diffusion-algorithm N`: Algorithm selection:
+ - 0 = ORIGIN (original algorithm)
+ - 1 = ENTROPY_BASED (entropy-based)
+ - 2 = MARGIN_BASED (margin-based)
+ - 3 = RANDOM (random)
+ - 4 = LOW_CONFIDENCE (low confidence, default)
+ - `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
+ - `--diffusion-visual`: Enable visualization mode to show generation progress
+ - `--diffusion-eps F`: Timestep epsilon value
+
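+ To get a feel for the speed/quality trade-off, a simple step-count sweep over the same prompt can be run (only flags documented above are used; timings are machine-dependent):
+
+ ```bash
+ # More denoising steps cost roughly proportionally more time.
+ for steps in 32 64 128; do
+   echo "--- diffusion-steps=$steps ---"
+   time ./llama.cpp/build/bin/llama-diffusion-cli \
+     -m gguf_output/dream-coder-7b-q8_0.gguf \
+     -p "def is_prime(n):" \
+     -n 128 \
+     --diffusion-steps "$steps"
+ done
+ ```
+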
+ ### Python (llama-cpp-python)
+
+ ```bash
+ pip install llama-cpp-python
+ ```
+
+ ```python
+ from llama_cpp import Llama
+
+ # Load the model
+ llm = Llama(
+     model_path="gguf_output/dream-coder-7b-q8_0.gguf",
+     n_ctx=2048,
+     n_threads=8,
+     n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
+ )
+
+ # Generate code
+ output = llm(
+     "def fibonacci(n):",
+     max_tokens=512,
+     temperature=0.1,
+     top_p=0.95,
+     repeat_penalty=1.1
+ )
+
+ print(output['choices'][0]['text'])
+ ```
+
+ Note: this drives llama.cpp's standard sampling loop; the diffusion-specific flags are only exposed by `llama-diffusion-cli` (see above).
+
+ ### With GPU Acceleration
+
+ If compiled with CUDA support:
+
+ ```bash
+ # Build the CUDA version
+ cd llama.cpp
+ cmake -B build -DGGML_CUDA=ON
+ cmake --build build --config Release -j
+
+ # Use GPU acceleration (partial layers)
+ ./build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 512 \
+ --diffusion-steps 128 \
+ -ngl 20 # Number of GPU layers
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **Conversion Failure**:
+ - Ensure llama.cpp is compiled correctly
+ - Check Python dependency versions
+ - Verify model file integrity
+
+ 2. **Quantization Failure**:
+ - Check disk space (~20GB of temporary space needed)
+ - Ensure sufficient memory (32GB+ recommended)
+
+ 3. **Inference Errors**:
+ - Verify GGUF file integrity
+ - Check context length settings
+ - Try reducing `n_gpu_layers`
+
+ ### Model Validation
+
+ ```bash
+ # File integrity check
+ ls -lh gguf_output/dream-coder-7b-q8_0.gguf
+
+ # Simple inference test
+ echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
+ ```
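+
+ For a deeper check than file size, the `gguf` Python package (a separate install that ships the `gguf-dump` utility) can print the metadata baked in at conversion time:
+
+ ```bash
+ pip install gguf
+
+ # Inspect header metadata: architecture, context length, token IDs, etc.
+ gguf-dump gguf_output/dream-coder-7b-q8_0.gguf | head -n 40
+ ```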
+
+ ## Performance Optimization
+
+ The main knobs are summarized below; a combined example follows the lists.
+
+ ### CPU Optimization
+ - Use the `-t` parameter to set the thread count
+ - Enable AVX2/AVX512 compilation options
+ - Adjust the batch size (`-b` parameter)
+
+ ### GPU Optimization
+ - Use CUDA/OpenCL compilation
+ - Adjust the GPU layer count (`-ngl`)
+ - Monitor GPU memory usage
+
+ ### Memory Optimization
+ - Memory mapping is enabled by default; use `--no-mmap` to disable it
+ - Use `--mlock` to pin the model in RAM
+ - Set an appropriate context length
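+
+ Putting a few of these together (the thread count, batch size, and layer split below are machine-dependent starting points, not tuned values):
+
+ ```bash
+ # 8 threads, batch 512, 20 layers offloaded to the GPU (requires a CUDA build)
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 256 \
+ -c 2048 \
+ -t 8 \
+ -b 512 \
+ -ngl 20 \
+ --diffusion-steps 128
+ ```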
+
+ ## Important Notes
+
+ 1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
+ 2. **Dedicated Tool**: Must use `llama-diffusion-cli` instead of the regular `main` tool
+ 3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
+ 4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
+ 5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 6. **Diffusion Steps**: Recommend 64-128 steps; more steps may improve quality but increase inference time
+
+ ## Technical Support
+
+ If you encounter issues, please check:
+ 1. llama.cpp version and compilation status
+ 2. Python dependency version compatibility
+ 3. Model file integrity
+ 4. System resources (memory/disk)
+
+ For more information, refer to:
+ - [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+ - [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
gguf_output/dream-coder-7b-q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfbb639c8ad72d325476e7509176f9ec85b4fe07e6af1aa9a15c8fae891bf4b1
+ size 4431390752
gguf_output/dream-coder-7b-q5_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:afca0e2704ba0aad39efc14c32ef6ef8af54d7ac3ff70e2e0bdf7cb6491ffcfb
+ size 5315176480
gguf_output/dream-coder-7b-q5_1.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:81ec984ab3a2ca4b99cc7ef4976581dbbea653d3e9fb0838041f9f743a6f6544
+ size 5757069344
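These ADDED entries are Git LFS pointer files, not the model weights themselves. Assuming git-lfs is installed, the actual GGUF payloads can be fetched after cloning with:

```bash
# Fetch the real GGUF files behind the LFS pointers
git lfs install
git lfs pull --include "gguf_output/*.gguf"
```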