eric8810 committed on
Commit bd713c3 · 1 Parent(s): 7b8c2e7

update README for llama.cpp diffusion running examples

GGUF_Q8_0_README.md CHANGED
@@ -91,16 +91,19 @@ gguf_output/

### llama.cpp Command Line

+ Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
+
```bash
# Basic usage
- ./llama.cpp/main \
+ ./llama.cpp/build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "def quicksort(arr):" \
-n 512 \
- -c 2048
+ -c 2048 \
+ --diffusion-steps 128

# Advanced parameters
- ./llama.cpp/main \
+ ./llama.cpp/build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "Write a binary search function" \
-n 256 \
@@ -108,9 +111,33 @@ gguf_output/
--temp 0.1 \
--top-p 0.95 \
--repeat-penalty 1.1 \
+ --diffusion-steps 128 \
+ --diffusion-algorithm 4 \
+ --diffusion-alg-temp 0.0 \
-t 8
+
+ # Visualize generation process
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def fibonacci(n):" \
+ -n 256 \
+ --diffusion-steps 64 \
+ --diffusion-visual
```

+ #### Diffusion Parameter Description
+
+ - `--diffusion-steps N`: Diffusion denoising steps (default: 128)
+ - `--diffusion-algorithm N`: Algorithm selection:
+ - 0 = ORIGIN (original algorithm)
+ - 1 = ENTROPY_BASED (entropy-based)
+ - 2 = MARGIN_BASED (margin-based)
+ - 3 = RANDOM (random)
+ - 4 = LOW_CONFIDENCE (low confidence, default)
+ - `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
+ - `--diffusion-visual`: Enable visualization mode to show generation progress
+ - `--diffusion-eps F`: Timestep epsilon value
+
### Python (llama-cpp-python)

```bash
@@ -151,10 +178,11 @@ make clean
make LLAMA_CUBLAS=1 -j$(nproc)

# Use GPU acceleration (partial layers)
- ./main \
+ ./build/bin/llama-diffusion-cli \
-m gguf_output/dream-coder-7b-q8_0.gguf \
-p "def quicksort(arr):" \
-n 512 \
+ --diffusion-steps 128 \
-ngl 20 # Number of GPU layers
```

@@ -183,7 +211,7 @@ make LLAMA_CUBLAS=1 -j$(nproc)
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test
- echo "def hello():" | ./llama.cpp/main -m gguf_output/dream-coder-7b-q8_0.gguf -n 20
+ echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
```

## Performance Optimization
@@ -206,9 +234,11 @@ echo "def hello():" | ./llama.cpp/main -m gguf_output/dream-coder-7b-q8_0.gguf -
## Important Notes

1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
- 2. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
- 3. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
- 4. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 2. **Dedicated Tool**: Must use `llama-diffusion-cli` instead of the regular `main` tool
+ 3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
+ 4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
+ 5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 6. **Diffusion Steps**: Recommend 64-128 steps; more steps may improve quality but increase inference time

## Technical Support
 
GGUF_Q8_0_README_EN.md ADDED
@@ -0,0 +1,253 @@
+ # Dream-Coder GGUF Q8_0 Quantization Guide
+
+ This guide is specifically designed for GGUF Q8_0 quantization of the Dream-Coder v0-Instruct-7B model.
+
+ ## Quick Start
+
+ ### 1. Environment Setup
+
+ ```bash
+ # 1. Clone and compile llama.cpp
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ cmake -B build
+ cmake --build build --config Release -j
+
+ # 2. Install Python dependencies
+ pip install "transformers>=4.46.2" torch safetensors numpy
+ ```
+
+ ### 2. Execute Quantization
+
+ #### Method 1: Use the provided script
+
+ ```bash
+ # Set the llama.cpp path
+ export LLAMA_CPP_PATH=/path/to/llama.cpp
+
+ # Run the quantization script
+ ./quantize_example.sh
+ ```
+
+ #### Method 2: Manual execution
+
+ ```bash
+ python quantize_dream_q8_0.py \
+ --model_path /path/to/Dream-Coder-v0-Instruct-7B \
+ --llama_cpp_path /path/to/llama.cpp \
+ --output_dir ./gguf_output \
+ --keep_f16
+ ```
+
+ ### 3. Parameter Description
+
+ - `--model_path`: Dream-Coder model path (default: current directory)
+ - `--llama_cpp_path`: llama.cpp project path (required)
+ - `--output_dir`: Output directory (default: ./gguf_output)
+ - `--keep_f16`: Keep the F16 intermediate file
+
+ ## Architecture Adaptation
+
+ ### Dream-Coder Special Configuration Handling
+
+ This quantization script specifically handles the following Dream-Coder configuration details (a sketch of the remapping step follows the list):
+
+ 1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for converter compatibility)
+ 2. **Special Token IDs**:
+ - `mask_token_id`: 151666 (critical diffusion token)
+ - `bos_token_id`: 151665
+ - `eos_token_id`: 151643
+ - `pad_token_id`: 151643
+
+ 3. **Model Parameters**:
+ - Vocabulary size: 152,064
+ - Hidden dimension: 3,584
+ - Attention heads: 28 (4 key-value heads)
+ - Layers: 28
+ - Context length: 32,768
+
+ 4. **Diffusion Features**:
+ - Preserve `mask_token_id` metadata
+ - RoPE theta: 1,000,000.0
+ - Activation function: SiLU
+
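+ As a minimal sketch of the remapping step (illustrative only, assuming `jq` is installed and `config.json` sits in the model directory; this is not the actual `quantize_dream_q8_0.py` source):
+
+ ```bash
+ cd /path/to/Dream-Coder-v0-Instruct-7B
+
+ # Point the converter at a Llama-style architecture; token IDs and
+ # dimensions are left untouched.
+ jq '.architectures = ["LlamaForCausalLM"]' config.json > config.patched.json
+
+ # Sanity-check the special token IDs the guide relies on.
+ jq '{mask_token_id, bos_token_id, eos_token_id, vocab_size}' config.patched.json
+ # expected: 151666 / 151665 / 151643 / 152064
+ ```
+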
+ ## Output Description
+
+ ### File Structure
+ ```
+ gguf_output/
+ ├── dream-coder-7b-f16.gguf # F16 intermediate file (optionally kept)
+ └── dream-coder-7b-q8_0.gguf # Final Q8_0 quantized file
+ ```
+
+ ### Performance Expectations
+
+ | Metric | Original (BF16) | Q8_0 |
+ |--------|-----------------|------|
+ | Memory Usage | ~14GB | ~6.7GB |
+ | Inference Speed | 1.0x | 1.2-1.5x |
+ | Precision Loss | 0% | <0.1% |
+
+ ## Usage
+
+ ### llama.cpp Command Line
+
+ Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:
+
+ ```bash
+ # Basic usage
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 512 \
+ -c 2048 \
+ --diffusion-steps 128
+
+ # Advanced parameters
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "Write a binary search function" \
+ -n 256 \
+ -c 2048 \
+ --temp 0.1 \
+ --top-p 0.95 \
+ --repeat-penalty 1.1 \
+ --diffusion-steps 128 \
+ --diffusion-algorithm 4 \
+ --diffusion-alg-temp 0.0 \
+ -t 8
+
+ # Visualize generation process
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def fibonacci(n):" \
+ -n 256 \
+ --diffusion-steps 64 \
+ --diffusion-visual
+ ```
+
+ #### Diffusion Parameter Description
+
+ The diffusion-specific flags accepted by `llama-diffusion-cli` are listed below; a step-count comparison sketch follows the list.
+
+ - `--diffusion-steps N`: Diffusion denoising steps (default: 128)
+ - `--diffusion-algorithm N`: Algorithm selection:
+ - 0 = ORIGIN (original algorithm)
+ - 1 = ENTROPY_BASED (entropy-based)
+ - 2 = MARGIN_BASED (margin-based)
+ - 3 = RANDOM (random)
+ - 4 = LOW_CONFIDENCE (low confidence, default)
+ - `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
+ - `--diffusion-visual`: Enable visualization mode to show generation progress
+ - `--diffusion-eps F`: Timestep epsilon value
+
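+ To get a feel for the speed/quality trade-off, a simple step-count sweep over the same prompt can be run (only flags documented above are used; timings are machine-dependent):
+
+ ```bash
+ # More denoising steps cost roughly proportionally more time.
+ for steps in 32 64 128; do
+   echo "--- diffusion-steps=$steps ---"
+   time ./llama.cpp/build/bin/llama-diffusion-cli \
+     -m gguf_output/dream-coder-7b-q8_0.gguf \
+     -p "def is_prime(n):" \
+     -n 128 \
+     --diffusion-steps "$steps"
+ done
+ ```
+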
+ ### Python (llama-cpp-python)
+
+ ```bash
+ pip install llama-cpp-python
+ ```
+
+ ```python
+ from llama_cpp import Llama
+
+ # Load the model
+ llm = Llama(
+     model_path="gguf_output/dream-coder-7b-q8_0.gguf",
+     n_ctx=2048,
+     n_threads=8,
+     n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
+ )
+
+ # Generate code
+ output = llm(
+     "def fibonacci(n):",
+     max_tokens=512,
+     temperature=0.1,
+     top_p=0.95,
+     repeat_penalty=1.1
+ )
+
+ print(output['choices'][0]['text'])
+ ```
+
+ Note: this drives llama.cpp's standard sampling loop; the diffusion-specific flags are only exposed by `llama-diffusion-cli` (see above).
+
+ ### With GPU Acceleration
+
+ If compiled with CUDA support:
+
+ ```bash
+ # Build the CUDA version
+ cd llama.cpp
+ cmake -B build -DGGML_CUDA=ON
+ cmake --build build --config Release -j
+
+ # Use GPU acceleration (partial layers)
+ ./build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 512 \
+ --diffusion-steps 128 \
+ -ngl 20 # Number of GPU layers
+ ```
+
+ ## Troubleshooting
+
+ ### Common Issues
+
+ 1. **Conversion Failure**:
+ - Ensure llama.cpp is compiled correctly
+ - Check Python dependency versions
+ - Verify model file integrity
+
+ 2. **Quantization Failure**:
+ - Check disk space (~20GB of temporary space needed)
+ - Ensure sufficient memory (32GB+ recommended)
+
+ 3. **Inference Errors**:
+ - Verify GGUF file integrity
+ - Check context length settings
+ - Try reducing `n_gpu_layers`
+
+ ### Model Validation
+
+ ```bash
+ # File integrity check
+ ls -lh gguf_output/dream-coder-7b-q8_0.gguf
+
+ # Simple inference test
+ echo "def hello():" | ./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -n 20 --diffusion-steps 64
+ ```
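+
+ For a deeper check than file size, the `gguf` Python package (a separate install that ships the `gguf-dump` utility) can print the metadata baked in at conversion time:
+
+ ```bash
+ pip install gguf
+
+ # Inspect header metadata: architecture, context length, token IDs, etc.
+ gguf-dump gguf_output/dream-coder-7b-q8_0.gguf | head -n 40
+ ```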
+
+ ## Performance Optimization
+
+ The main knobs are summarized below; a combined example follows the lists.
+
+ ### CPU Optimization
+ - Use the `-t` parameter to set the thread count
+ - Enable AVX2/AVX512 compilation options
+ - Adjust the batch size (`-b` parameter)
+
+ ### GPU Optimization
+ - Use CUDA/OpenCL compilation
+ - Adjust the GPU layer count (`-ngl`)
+ - Monitor GPU memory usage
+
+ ### Memory Optimization
+ - Memory mapping is enabled by default; use `--no-mmap` to disable it
+ - Use `--mlock` to pin the model in RAM
+ - Set an appropriate context length
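+
+ Putting a few of these together (the thread count, batch size, and layer split below are machine-dependent starting points, not tuned values):
+
+ ```bash
+ # 8 threads, batch 512, 20 layers offloaded to the GPU (requires a CUDA build)
+ ./llama.cpp/build/bin/llama-diffusion-cli \
+ -m gguf_output/dream-coder-7b-q8_0.gguf \
+ -p "def quicksort(arr):" \
+ -n 256 \
+ -c 2048 \
+ -t 8 \
+ -b 512 \
+ -ngl 20 \
+ --diffusion-steps 128
+ ```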
+
+ ## Important Notes
+
+ 1. **Diffusion Features**: Dream-Coder uses diffusion generation, different from traditional autoregressive models
+ 2. **Dedicated Tool**: Must use `llama-diffusion-cli` instead of the regular `main` tool
+ 3. **Special Tokens**: Maintain correct handling of `mask_token_id` (151666)
+ 4. **Context Length**: Supports maximum 32K tokens, but 2K-4K recommended for optimal performance
+ 5. **Generation Parameters**: Recommend using lower temperature (0.1-0.3) and appropriate top_p (0.9-0.95)
+ 6. **Diffusion Steps**: Recommend 64-128 steps; more steps may improve quality but increase inference time
+
+ ## Technical Support
+
+ If you encounter issues, please check:
+ 1. llama.cpp version and compilation status
+ 2. Python dependency version compatibility
+ 3. Model file integrity
+ 4. System resources (memory/disk)
+
+ For more information, refer to:
+ - [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
+ - [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)
gguf_output/dream-coder-7b-q4_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfbb639c8ad72d325476e7509176f9ec85b4fe07e6af1aa9a15c8fae891bf4b1
+ size 4431390752
gguf_output/dream-coder-7b-q5_0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:afca0e2704ba0aad39efc14c32ef6ef8af54d7ac3ff70e2e0bdf7cb6491ffcfb
+ size 5315176480
gguf_output/dream-coder-7b-q5_1.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:81ec984ab3a2ca4b99cc7ef4976581dbbea653d3e9fb0838041f9f743a6f6544
+ size 5757069344
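These ADDED entries are Git LFS pointer files, not the model weights themselves. Assuming git-lfs is installed, the actual GGUF payloads can be fetched after cloning with:

```bash
# Fetch the real GGUF files behind the LFS pointers
git lfs install
git lfs pull --include "gguf_output/*.gguf"
```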