Update README.md
README.md CHANGED
@@ -1,3 +1,10 @@
+
+Let me find the existing markdown content first and then add the flash-attention requirement.
+
+Ran tool
+I'll use the markdown content from your earlier message and update it with the flash-attention requirement:
+
+```markdown
 ---
 license: mit
 datasets:
@@ -21,6 +28,8 @@ First, install the required packages:
 
 ```bash
 pip install unsloth transformers torch accelerate
+# Flash-attention is REQUIRED for correct model behavior!
+pip install flash-attn --no-build-isolation
 ```
 
 ## Loading the Model with Unsloth
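
Editorial note on the install step above: a quick way to confirm that the flash-attn wheel actually compiled and imports in the active environment is a one-line check. This is a minimal sketch and assumes the package exposes `__version__`, as recent flash-attn releases do.

```python
# Minimal sanity check that flash-attn built and imports correctly
# (assumes the package exposes __version__, as recent releases do)
import flash_attn

print("flash-attn version:", flash_attn.__version__)
```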
@@ -34,6 +43,15 @@ import os
 # Optional: Set HuggingFace Hub token if you have one
 hf_token = os.environ.get("HF_TOKEN") # or directly provide your token
 
+# Verify flash-attention installation - REQUIRED for correct results
+try:
+    import flash_attn
+except ImportError:
+    raise ImportError(
+        "flash-attn package is required for correct model behavior.\n"
+        "Please install it with: pip install flash-attn --no-build-isolation"
+    )
+
 # Model ID on HuggingFace Hub
 model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"
 
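
If the hard ImportError added above is too strict, for example in a CPU-only debugging session, a softer variant could downgrade it to a warning. This is a sketch of an alternative on my part, not part of the commit.

```python
import warnings

# Softer alternative to the hard ImportError above: warn and continue,
# accepting that results may differ without flash-attention
try:
    import flash_attn  # noqa: F401
except ImportError:
    warnings.warn(
        "flash-attn not found; falling back to default attention kernels. "
        "Install it with `pip install flash-attn --no-build-isolation` "
        "for the behavior this README assumes.",
        RuntimeWarning,
    )
```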
@@ -46,21 +64,24 @@ tokenizer = AutoTokenizer.from_pretrained(
     trust_remote_code=True
 )
 
-#
+# IMPORTANT: Load with flash-attention for correct behavior
 model, tokenizer = FastLanguageModel.from_pretrained(
     model_name=model_id,
     token=hf_token,
     max_seq_length=2048, # Adjust based on your memory constraints
     dtype=None, # Auto-detect best dtype
     load_in_4bit=True, # Use 4-bit quantization for efficiency
+    use_flash_attention=True # REQUIRED for correct results
 )
 
 # Enable fast inference mode
 FastLanguageModel.for_inference(model)
 
-print("Successfully loaded model with Unsloth!")
+print("Successfully loaded model with Unsloth and flash-attention!")
 ```
 
+> ⚠️ **WARNING**: Using this model without flash-attention will produce incorrect results. The flash-attention package is not just for speed, but essential for proper model functionality.
+
 Alternatively, you can load the model with transformers first and then apply Unsloth optimization:
 
 ```python
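
After the loading snippet in the hunk above, it can help to confirm which attention backend transformers actually selected. The sketch below is an assumption on my part: it relies on the private `_attn_implementation` attribute that recent transformers versions set on the model config, so treat it as a best-effort check rather than a documented API.

```python
def check_attention_backend(model) -> str:
    """Best-effort check of the attention backend transformers selected.

    Relies on the private `_attn_implementation` config attribute, which
    varies across transformers versions (an assumption, not a stable API).
    """
    impl = getattr(model.config, "_attn_implementation", "unknown")
    if impl != "flash_attention_2":
        print(f"Warning: expected flash_attention_2, got {impl!r}")
    return impl

# Usage after loading:
# check_attention_backend(model)
```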
@@ -74,8 +95,8 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True
 )
 
-# Then apply Unsloth optimization
-FastLanguageModel.for_inference(model)
+# Then apply Unsloth optimization with flash-attention
+FastLanguageModel.for_inference(model, use_flash_attention=True)
 ```
 
 ## Running Multiple-Choice Inference
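
For the transformers-first path shown in this hunk, the standard way to request flash attention is the `attn_implementation` argument of `from_pretrained` (available in transformers 4.36+). The sketch below is an alternative to the Unsloth-specific `use_flash_attention` flag in the diff and assumes a CUDA GPU with bfloat16 support.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # assumes an Ampere-or-newer GPU
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
    trust_remote_code=True,
)
```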
@@ -245,27 +266,29 @@ def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
 
 ## Performance Tips for Unsloth
 
-1. **
+1. **Flash Attention REQUIRED**: Flash Attention is not just a performance option but a requirement for this model to function correctly:
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+2. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
 ```python
 model, tokenizer = FastLanguageModel.from_pretrained(
     model_name=model_id,
     max_seq_length=1024, # Reduced from 2048
-    load_in_4bit=True
+    load_in_4bit=True,
+    use_flash_attention=True # Always enable
 )
 ```
 
-
+3. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.
 
-
+4. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.
 
-
+5. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
 ```python
 os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Use first GPU
 ```
 
-
-
-pip install flash-attn --no-build-isolation
-```
-
-With these optimizations, Qwen25_Coder_MultipleChoice should run significantly faster while maintaining the same high-quality multiple-choice reasoning and answers.
+<!-- With these optimizations, Qwen25_Coder_MultipleChoice will run correctly while maintaining the high-quality multiple-choice reasoning and answers. -->
+```
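
On tip 3 (batch processing) in the hunk above: a minimal sketch of the idea follows. The function and variable names are illustrative, not taken from the README's `batch_process_questions`, and it assumes a decoder-only tokenizer for which left padding is appropriate.

```python
def answer_batch(prompts, model, tokenizer, max_new_tokens=16):
    """Tokenize several prompts at once and run a single generate() call."""
    tokenizer.padding_side = "left"  # left-pad so generation continues from each prompt's end
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, then decode
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```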