tuandunghcmut committed e8eac59 (verified) · Parent: ce35eb5

Create README.md

Files changed (1): README.md added (+257 lines)

# Using Unsloth to Load and Run Qwen25_Coder_MultipleChoice

Unsloth offers significant inference speed improvements for the Qwen25_Coder_MultipleChoice model. Here's how to load and use the model with Unsloth.

## Installation

First, install the required packages:

```bash
pip install unsloth transformers torch accelerate
```
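
Optionally, you can verify the environment before loading anything. This quick check is only a convenience snippet, not something Unsloth requires:

```python
# Optional sanity check: confirm the packages import and a GPU is visible.
import torch
import unsloth  # noqa: F401

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
```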

## Loading the Model with Unsloth

```python
import os

import torch
from unsloth import FastLanguageModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Optional: set a Hugging Face Hub token if you have one
hf_token = os.environ.get("HF_TOKEN")  # or provide your token directly

# Model ID on the Hugging Face Hub
model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

print(f"Loading model from the Hugging Face Hub: {model_id}")

# Load the model and its tokenizer with Unsloth (Method 1)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_id,
    token=hf_token,
    max_seq_length=2048,  # Adjust based on your memory constraints
    dtype=None,           # Auto-detect the best dtype
    load_in_4bit=True,    # Use 4-bit quantization for efficiency
)

# Enable Unsloth's fast inference mode
FastLanguageModel.for_inference(model)

print("Successfully loaded model with Unsloth!")
```

Alternatively, you can load the model with plain `transformers` first and then apply Unsloth's inference optimization (Method 1 is generally the better-supported path):

```python
# Alternative approach (Method 2)
# First load the tokenizer and model with transformers
tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    token=hf_token,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    token=hf_token,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Then apply Unsloth's inference optimization
FastLanguageModel.for_inference(model)
```
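
Either way, a quick smoke test is an easy way to confirm the model and tokenizer are wired up correctly. This snippet is purely illustrative; the prompt text is arbitrary:

```python
# Quick smoke test (illustrative): generate a few tokens and print them.
messages = [{"role": "user", "content": "Reply with the single word: ready"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)

print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```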

## Running Multiple-Choice Inference

After loading the model with Unsloth, use it to answer multiple-choice questions:

```python
import re


def format_prompt(question, choices):
    # Format choices as a lettered list (A, B, C, ...)
    formatted_choices = "\n".join(
        f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)
    )

    return f"""
QUESTION:
{question}

CHOICES:
{formatted_choices}

Analyze this question step-by-step and provide a detailed explanation.
Your response MUST be in YAML format as follows:

understanding: |
<your understanding of what the question is asking>
analysis: |
<your analysis of each option>
reasoning: |
<your step-by-step reasoning process>
conclusion: |
<your final conclusion>
answer: <single letter A through {chr(64 + len(choices))}>

The answer field MUST contain ONLY a single character letter.
"""


def get_answer(question, choices, model, tokenizer):
    # Create the prompt
    prompt = format_prompt(question, choices)

    # Format as a chat message for the model
    messages = [{"role": "user", "content": prompt}]
    chat_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    # Tokenize the prompt
    inputs = tokenizer(chat_text, return_tensors="pt").to(model.device)

    # Generate with the Unsloth-optimized model (greedy decoding for
    # deterministic multiple-choice answers)
    output = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_new_tokens=512,
        do_sample=False,
    )

    # Decode only the newly generated part of the sequence
    response = tokenizer.decode(
        output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True
    )

    # Extract the answer letter from the YAML output; fall back to "A" if none is found
    answer_match = re.search(r"answer:\s*([A-Z])", response)
    answer = answer_match.group(1) if answer_match else "A"

    return {
        "answer": answer,
        "full_response": response,
    }


# Example usage
python_example = {
    "question": "Which of the following correctly defines a list comprehension in Python?",
    "choices": [
        "[x**2 for x in range(10)]",
        "for(x in range(10)) { return x**2; }",
        "map(lambda x: x**2, range(10))",
        "[for x in range(10): x**2]",
    ],
}

result = get_answer(
    python_example["question"],
    python_example["choices"],
    model,
    tokenizer,
)

print(f"Answer: {result['answer']}")
print(f"Full explanation:\n{result['full_response']}")
```
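
Because the prompt asks for a YAML-formatted explanation, you may also want to parse the full structured response rather than only the final letter. Below is a minimal, best-effort sketch that assumes PyYAML (`pip install pyyaml`) is available and that falls back to the regex extraction above when the output is not valid YAML; the helper name `parse_yaml_response` is just for illustration:

```python
import re

import yaml  # requires: pip install pyyaml


def parse_yaml_response(response, fallback_answer="A"):
    """Best-effort parse of the model's YAML-formatted explanation."""
    try:
        parsed = yaml.safe_load(response)
        if isinstance(parsed, dict) and "answer" in parsed:
            return parsed
    except yaml.YAMLError:
        pass

    # Fall back to the same regex-based extraction used in get_answer()
    match = re.search(r"answer:\s*([A-Z])", response)
    return {"answer": match.group(1) if match else fallback_answer}


parsed = parse_yaml_response(result["full_response"])
print(f"Parsed answer: {parsed['answer']}")
```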

## Processing Multiple Questions in Batch

For better efficiency with multiple questions, use batch processing:

```python
import re


def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
    """Process multiple questions in efficient batches."""
    results = []

    # Left padding is important for causal LM generation
    tokenizer.padding_side = "left"
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    for i in range(0, len(questions_list), batch_size):
        batch = questions_list[i:i + batch_size]

        # Prepare all prompts in the batch
        batch_prompts = []
        for item in batch:
            prompt = format_prompt(item["question"], item["choices"])
            messages = [{"role": "user", "content": prompt}]
            chat_text = tokenizer.apply_chat_template(
                messages,
                tokenize=False,
                add_generation_prompt=True,
            )
            batch_prompts.append(chat_text)

        # Tokenize all inputs with padding
        inputs = tokenizer(
            batch_prompts,
            return_tensors="pt",
            padding=True,
        ).to(model.device)

        # Generate all outputs with greedy decoding
        outputs = model.generate(
            inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=512,
            do_sample=False,
            pad_token_id=tokenizer.pad_token_id,
        )

        # With left padding, the generated text for every sequence begins
        # right after the padded prompt length
        prompt_length = inputs.input_ids.shape[1]

        # Process each response
        for j, output_ids in enumerate(outputs):
            # Decode only the generated part
            response = tokenizer.decode(
                output_ids[prompt_length:],
                skip_special_tokens=True,
            )

            # Extract the answer letter
            answer_match = re.search(r"answer:\s*([A-Z])", response)
            answer = answer_match.group(1) if answer_match else "A"

            # Store result
            results.append({
                "question": batch[j]["question"],
                "answer": answer,
                "full_response": response,
            })

    return results
```
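
If you have ground-truth labels, the same helper can drive a quick accuracy check. The data below is purely illustrative, and the `correct` key is an assumption of this sketch rather than something `batch_process_questions` requires:

```python
# Illustrative evaluation loop: score the model on a small labeled set.
labeled_questions = [
    {
        "question": "Which built-in function returns the number of items in a Python list?",
        "choices": ["len()", "count()", "size()", "length()"],
        "correct": "A",  # hypothetical ground-truth label for this sketch
    },
    # ... add more labeled examples here ...
]

eval_results = batch_process_questions(labeled_questions, model, tokenizer, batch_size=4)

num_correct = sum(
    1
    for item, res in zip(labeled_questions, eval_results)
    if res["answer"] == item["correct"]
)
print(f"Accuracy: {num_correct}/{len(labeled_questions)} = {num_correct / len(labeled_questions):.2%}")
```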

## Performance Tips for Unsloth

1. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
   ```python
   model, tokenizer = FastLanguageModel.from_pretrained(
       model_name=model_id,
       max_seq_length=1024,  # Reduced from 2048
       load_in_4bit=True,
   )
   ```

2. **Batch Processing**: For multiple questions, batching is significantly faster than answering them one at a time (see the timing sketch after this list).

3. **Prefill Optimization**: Unsloth's prefill optimizations are most effective with long contexts and batched inputs.

4. **GPU Selection**: If you have multiple GPUs, you can restrict which one is used. Set this before importing `torch` or loading the model:
   ```python
   os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # Use the first GPU
   ```

5. **Flash Attention**: Install Flash Attention for maximum performance:
   ```bash
   pip install flash-attn --no-build-isolation
   ```
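
As a rough way to confirm the benefit of batching on your own hardware, you can time the two code paths from the earlier sections. This is an illustrative sketch; the workload simply repeats `python_example`, and absolute numbers will depend on your GPU:

```python
import time

# Reuse the example question from the inference section, repeated to form a small workload.
questions = [python_example] * 8

start = time.perf_counter()
for q in questions:
    get_answer(q["question"], q["choices"], model, tokenizer)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
batch_process_questions(questions, model, tokenizer, batch_size=8)
batched_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.1f}s  |  Batched: {batched_time:.1f}s")
```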

With these optimizations, Qwen25_Coder_MultipleChoice should run significantly faster while maintaining the same high-quality multiple-choice reasoning and answers.