tuandunghcmut committed e8eac59 (verified) · Parent: ce35eb5

Create README.md

Files changed (1): README.md added (+257 lines)

# Using Unsloth to Load and Run Qwen25_Coder_MultipleChoice

Unsloth offers significant inference speed improvements for the Qwen25_Coder_MultipleChoice model. Here's how to load and use the model with Unsloth.

## Installation

First, install the required packages:

```bash
pip install unsloth transformers torch accelerate
```
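
Optionally, you can verify the environment before loading anything. This quick check is only a convenience snippet, not something Unsloth requires:

```python
# Optional sanity check: confirm the packages import and a GPU is visible.
import torch
import unsloth  # noqa: F401

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
```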

## Loading the Model with Unsloth

```python
import os

import torch
from unsloth import FastLanguageModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Optional: set a Hugging Face Hub token if you have one
hf_token = os.environ.get("HF_TOKEN")  # or provide your token directly

# Model ID on the Hugging Face Hub
model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

print(f"Loading model from the Hugging Face Hub: {model_id}")

# Load the model and its tokenizer with Unsloth (Method 1)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_id,
    token=hf_token,
    max_seq_length=2048,  # Adjust based on your memory constraints
    dtype=None,           # Auto-detect the best dtype
    load_in_4bit=True,    # Use 4-bit quantization for efficiency
)

# Enable Unsloth's fast inference mode
FastLanguageModel.for_inference(model)

print("Successfully loaded model with Unsloth!")
```

Alternatively, you can load the model with plain `transformers` first and then apply Unsloth's inference optimization (Method 1 is generally the better-supported path):

```python
# Alternative approach (Method 2)
# First load the tokenizer and model with transformers
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=hf_token,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Then apply Unsloth's inference optimization
FastLanguageModel.for_inference(model)
```
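
Either way, a quick smoke test is an easy way to confirm the model and tokenizer are wired up correctly. This snippet is purely illustrative; the prompt text is arbitrary:

```python
# Quick smoke test (illustrative): generate a few tokens and print them.
messages = [{"role": "user", "content": "Reply with the single word: ready"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)

print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```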

## Running Multiple-Choice Inference

After loading the model with Unsloth, use it to answer multiple-choice questions:

```python
import re


def format_prompt(question, choices):
    # Format choices as a lettered list (A, B, C, ...)
    formatted_choices = "\n".join(
        f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)
    )

    return f"""
QUESTION:
{question}

CHOICES:
{formatted_choices}

Analyze this question step-by-step and provide a detailed explanation.
Your response MUST be in YAML format as follows:

understanding: |
<your understanding of what the question is asking>
analysis: |
<your analysis of each option>
reasoning: |
<your step-by-step reasoning process>
conclusion: |
<your final conclusion>
answer: <single letter A through {chr(64 + len(choices))}>

The answer field MUST contain ONLY a single character letter.
"""


def get_answer(question, choices, model, tokenizer):
    # Create the prompt
    prompt = format_prompt(question, choices)

    # Format as a chat message for the model
    messages = [{"role": "user", "content": prompt}]
    chat_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Tokenize the prompt
    inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

    # Generate with the Unsloth-optimized model (greedy decoding for
    # deterministic multiple-choice answers)
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=512,
        do_sample=False,
    )

    # Decode only the newly generated part of the sequence
    response = tokenizer.decode(
        output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    # Extract the answer letter from the YAML output; fall back to "A" if none is found
    answer_match = re.search(r"answer:\s*([A-Z])", response)
    answer = answer_match.group(1) if answer_match else "A"

    return {
        "answer": answer,
        "full_response": response,
    }


# Example usage
python_example = {
    "question": "Which of the following correctly defines a list comprehension in Python?",
    "choices": [
        "[x**2 for x in range(10)]",
        "for(x in range(10)) { return x**2; }",
        "map(lambda x: x**2, range(10))",
        "[for x in range(10): x**2]",
    ],
}

result = get_answer(
    python_example["question"],
    python_example["choices"],
    model,
    tokenizer,
)

print(f"Answer: {result['answer']}")
print(f"Full explanation:\n{result['full_response']}")
```
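
Because the prompt asks for a YAML-formatted explanation, you may also want to parse the full structured response rather than only the final letter. Below is a minimal, best-effort sketch that assumes PyYAML (`pip install pyyaml`) is available and that falls back to the regex extraction above when the output is not valid YAML; the helper name `parse_yaml_response` is just for illustration:

```python
import re

import yaml  # requires: pip install pyyaml


def parse_yaml_response(response, fallback_answer="A"):
    """Best-effort parse of the model's YAML-formatted explanation."""
    try:
        parsed = yaml.safe_load(response)
        if isinstance(parsed, dict) and "answer" in parsed:
            return parsed
    except yaml.YAMLError:
        pass

    # Fall back to the same regex-based extraction used in get_answer()
    match = re.search(r"answer:\s*([A-Z])", response)
    return {"answer": match.group(1) if match else fallback_answer}


parsed = parse_yaml_response(result["full_response"])
print(f"Parsed answer: {parsed['answer']}")
```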

## Processing Multiple Questions in Batch

For better efficiency with multiple questions, use batch processing:

```python
import re


def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
    """Process multiple questions in efficient batches."""
    results = []

    # Left padding is important for causal LM generation
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    for i in range(0, len(questions_list), batch_size):
        batch = questions_list[i:i + batch_size]

        # Prepare all prompts in the batch
        batch_prompts = []
        for item in batch:
            prompt = format_prompt(item["question"], item["choices"])
            messages = [{"role": "user", "content": prompt}]
            chat_text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
            batch_prompts.append(chat_text)

        # Tokenize all inputs with padding
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
        ).to(model.device)

        # Generate all outputs with greedy decoding
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=512,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )

        # With left padding, the generated text for every sequence begins
        # right after the padded prompt length
        prompt_length = inputs.input_ids.shape[1]

        # Process each response
        for j, output_ids in enumerate(outputs):
            # Decode only the generated part
            response = tokenizer.decode(
                output_ids[prompt_length:],
                skip_special_tokens=True,
            )

            # Extract the answer letter
            answer_match = re.search(r"answer:\s*([A-Z])", response)
            answer = answer_match.group(1) if answer_match else "A"

            # Store result
            results.append({
                "question": batch[j]["question"],
                "answer": answer,
                "full_response": response,
            })

    return results
```
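
If you have ground-truth labels, the same helper can drive a quick accuracy check. The data below is purely illustrative, and the `correct` key is an assumption of this sketch rather than something `batch_process_questions` requires:

```python
# Illustrative evaluation loop: score the model on a small labeled set.
labeled_questions = [
    {
        "question": "Which built-in function returns the number of items in a Python list?",
        "choices": ["len()", "count()", "size()", "length()"],
        "correct": "A",  # hypothetical ground-truth label for this sketch
    },
    # ... add more labeled examples here ...
]

eval_results = batch_process_questions(labeled_questions, model, tokenizer, batch_size=4)

num_correct = sum(
    1
    for item, res in zip(labeled_questions, eval_results)
    if res["answer"] == item["correct"]
)
print(f"Accuracy: {num_correct}/{len(labeled_questions)} = {num_correct / len(labeled_questions):.2%}")
```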

## Performance Tips for Unsloth

1. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
   ```python
   model, tokenizer = FastLanguageModel.from_pretrained(
       model_name=model_id,
       max_seq_length=1024,  # Reduced from 2048
       load_in_4bit=True,
   )
   ```

2. **Batch Processing**: For multiple questions, batching is significantly faster than answering them one at a time (see the timing sketch after this list).

3. **Prefill Optimization**: Unsloth's prefill optimizations are most effective with long contexts and batched inputs.

4. **GPU Selection**: If you have multiple GPUs, you can restrict which one is used. Set this before importing `torch` or loading the model:
   ```python
   os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use the first GPU
   ```

5. **Flash Attention**: Install Flash Attention for maximum performance:
   ```bash
   pip install flash-attn --no-build-isolation
   ```
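
As a rough way to confirm the benefit of batching on your own hardware, you can time the two code paths from the earlier sections. This is an illustrative sketch; the workload simply repeats `python_example`, and absolute numbers will depend on your GPU:

```python
import time

# Reuse the example question from the inference section, repeated to form a small workload.
questions = [python_example] * 8

start = time.perf_counter()
for q in questions:
    get_answer(q["question"], q["choices"], model, tokenizer)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
batch_process_questions(questions, model, tokenizer, batch_size=8)
batched_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.1f}s  |  Batched: {batched_time:.1f}s")
```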

With these optimizations, Qwen25_Coder_MultipleChoice should run significantly faster while maintaining the same high-quality multiple-choice reasoning and answers.