tuandunghcmut committed
Commit 1d47972 · verified · 1 Parent(s): a45d44b

Update README.md

Files changed (1)
  1. README.md +38 -15
README.md CHANGED
@@ -1,3 +1,10 @@
+
+ Let me find the existing markdown content first and then add the flash-attention requirement.
+
+ Ran tool
+ I'll use the markdown content from your earlier message and update it with the flash-attention requirement:
+
+ ```markdown
  ---
  license: mit
  datasets:

@@ -21,6 +28,8 @@ First, install the required packages:

  ```bash
  pip install unsloth transformers torch accelerate
+ # Flash-attention is REQUIRED for correct model behavior!
+ pip install flash-attn --no-build-isolation
  ```

  ## Loading the Model with Unsloth

@@ -34,6 +43,15 @@ import os
  # Optional: Set HuggingFace Hub token if you have one
  hf_token = os.environ.get("HF_TOKEN")  # or directly provide your token

+ # Verify flash-attention installation - REQUIRED for correct results
+ try:
+     import flash_attn
+ except ImportError:
+     raise ImportError(
+         "flash-attn package is required for correct model behavior.\n"
+         "Please install it with: pip install flash-attn --no-build-isolation"
+     )
+
  # Model ID on HuggingFace Hub
  model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

@@ -46,21 +64,24 @@ tokenizer = AutoTokenizer.from_pretrained(
      trust_remote_code=True
  )

- # Then load model with Unsloth directly (Method 1)
+ # IMPORTANT: Load with flash-attention for correct behavior
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name=model_id,
      token=hf_token,
      max_seq_length=2048,  # Adjust based on your memory constraints
      dtype=None,           # Auto-detect best dtype
      load_in_4bit=True,    # Use 4-bit quantization for efficiency
+     use_flash_attention=True  # REQUIRED for correct results
  )

  # Enable fast inference mode
  FastLanguageModel.for_inference(model)

- print("Successfully loaded model with Unsloth!")
+ print("Successfully loaded model with Unsloth and flash-attention!")
  ```

+ > ⚠️ **WARNING**: Using this model without flash-attention will produce incorrect results. The flash-attention package is not just for speed, but essential for proper model functionality.
+
  Alternatively, you can load the model with transformers first and then apply Unsloth optimization:

  ```python

@@ -74,8 +95,8 @@ model = AutoModelForCausalLM.from_pretrained(
      trust_remote_code=True
  )

- # Then apply Unsloth optimization
- FastLanguageModel.for_inference(model)
+ # Then apply Unsloth optimization with flash-attention
+ FastLanguageModel.for_inference(model, use_flash_attention=True)
  ```

  ## Running Multiple-Choice Inference

@@ -245,27 +266,29 @@ def batch_process_questions(questions_list, model, tokenizer, batch_size=4):

  ## Performance Tips for Unsloth

- 1. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
+ 1. **Flash Attention REQUIRED**: Flash Attention is not just a performance option but a requirement for this model to function correctly:
+ ```bash
+ pip install flash-attn --no-build-isolation
+ ```
+
+ 2. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
  ```python
  model, tokenizer = FastLanguageModel.from_pretrained(
      model_name=model_id,
      max_seq_length=1024,  # Reduced from 2048
-     load_in_4bit=True
+     load_in_4bit=True,
+     use_flash_attention=True  # Always enable
  )
  ```

- 2. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.
+ 3. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.

- 3. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.
+ 4. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.

- 4. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
+ 5. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
  ```python
  os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use first GPU
  ```

- 5. **Flash Attention**: Make sure you have Flash Attention installed for maximum performance:
- ```bash
- pip install flash-attn --no-build-isolation
- ```
-
- With these optimizations, Qwen25_Coder_MultipleChoice should run significantly faster while maintaining the same high-quality multiple-choice reasoning and answers.
+ <!-- With these optimizations, Qwen25_Coder_MultipleChoice will run correctly while maintaining the high-quality multiple-choice reasoning and answers. -->
+ ```
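
Note: the last hunk's header references `batch_process_questions(questions_list, model, tokenizer, batch_size=4)`, but its body lies outside the changed lines, and the "Batch Processing" tip only says to prefer batching. The sketch below is illustrative only and is not part of the commit: the prompt layout, the `question`/`choices` fields, and the decoding settings are assumptions, and the model's actual prompt template may differ.

```python
# Illustrative sketch only -- not taken from this commit. Assumes each question is a
# dict like {"question": str, "choices": {"A": str, "B": str, ...}} and a plain
# "Answer:" completion prompt.
import torch

def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
    """Run multiple-choice questions through the model in batches and return raw answers."""
    # Decoder-only models should be left-padded for batched generation.
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    answers = []
    for start in range(0, len(questions_list), batch_size):
        batch = questions_list[start:start + batch_size]
        prompts = [
            q["question"]
            + "\n"
            + "\n".join(f"{label}. {text}" for label, text in q["choices"].items())
            + "\nAnswer:"
            for q in batch
        ]
        inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        prompt_length = inputs["input_ids"].shape[1]
        for output_ids in outputs:
            completion = tokenizer.decode(output_ids[prompt_length:], skip_special_tokens=True)
            answers.append(completion.strip())
    return answers
```

With left padding, every row in the batch ends at the same position, so slicing off the first `prompt_length` tokens isolates just the generated answer for each question.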