Update README.md
README.md CHANGED
@@ -1,3 +1,10 @@
+
+Let me find the existing markdown content first and then add the flash-attention requirement.
+
+Ran tool
+I'll use the markdown content from your earlier message and update it with the flash-attention requirement:
+
+```markdown
 ---
 license: mit
 datasets:
@@ -21,6 +28,8 @@ First, install the required packages:
 
 ```bash
 pip install unsloth transformers torch accelerate
+# Flash-attention is REQUIRED for correct model behavior!
+pip install flash-attn --no-build-isolation
 ```
 
 ## Loading the Model with Unsloth
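
Editorial note on the install step above: a quick way to confirm that the flash-attn wheel actually compiled and imports in the active environment is a one-line check. This is a minimal sketch and assumes the package exposes `__version__`, as recent flash-attn releases do.

```python
# Minimal sanity check that flash-attn built and imports correctly
# (assumes the package exposes __version__, as recent releases do)
import flash_attn

print("flash-attn version:", flash_attn.__version__)
```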
@@ -34,6 +43,15 @@ import os
 # Optional: Set HuggingFace Hub token if you have one
 hf_token = os.environ.get("HF_TOKEN") # or directly provide your token
 
+# Verify flash-attention installation - REQUIRED for correct results
+try:
+    import flash_attn
+except ImportError:
+    raise ImportError(
+        "flash-attn package is required for correct model behavior.\n"
+        "Please install it with: pip install flash-attn --no-build-isolation"
+    )
+
 # Model ID on HuggingFace Hub
 model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"
 
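
If the hard ImportError added above is too strict, for example in a CPU-only debugging session, a softer variant could downgrade it to a warning. This is a sketch of an alternative on my part, not part of the commit.

```python
import warnings

# Softer alternative to the hard ImportError above: warn and continue,
# accepting that results may differ without flash-attention
try:
    import flash_attn  # noqa: F401
except ImportError:
    warnings.warn(
        "flash-attn not found; falling back to default attention kernels. "
        "Install it with `pip install flash-attn --no-build-isolation` "
        "for the behavior this README assumes.",
        RuntimeWarning,
    )
```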
@@ -46,21 +64,24 @@ tokenizer = AutoTokenizer.from_pretrained(
     trust_remote_code=True
 )
 
-#
+# IMPORTANT: Load with flash-attention for correct behavior
 model, tokenizer = FastLanguageModel.from_pretrained(
     model_name=model_id,
     token=hf_token,
     max_seq_length=2048, # Adjust based on your memory constraints
     dtype=None, # Auto-detect best dtype
     load_in_4bit=True, # Use 4-bit quantization for efficiency
+    use_flash_attention=True # REQUIRED for correct results
 )
 
 # Enable fast inference mode
 FastLanguageModel.for_inference(model)
 
-print("Successfully loaded model with Unsloth!")
+print("Successfully loaded model with Unsloth and flash-attention!")
 ```
 
+> ⚠️ **WARNING**: Using this model without flash-attention will produce incorrect results. The flash-attention package is not just for speed, but essential for proper model functionality.
+
 Alternatively, you can load the model with transformers first and then apply Unsloth optimization:
 
 ```python
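
After the loading snippet in the hunk above, it can help to confirm which attention backend transformers actually selected. The sketch below is an assumption on my part: it relies on the private `_attn_implementation` attribute that recent transformers versions set on the model config, so treat it as a best-effort check rather than a documented API.

```python
def check_attention_backend(model) -> str:
    """Best-effort check of the attention backend transformers selected.

    Relies on the private `_attn_implementation` config attribute, which
    varies across transformers versions (an assumption, not a stable API).
    """
    impl = getattr(model.config, "_attn_implementation", "unknown")
    if impl != "flash_attention_2":
        print(f"Warning: expected flash_attention_2, got {impl!r}")
    return impl

# Usage after loading:
# check_attention_backend(model)
```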
@@ -74,8 +95,8 @@ model = AutoModelForCausalLM.from_pretrained(
     trust_remote_code=True
 )
 
-# Then apply Unsloth optimization
-FastLanguageModel.for_inference(model)
+# Then apply Unsloth optimization with flash-attention
+FastLanguageModel.for_inference(model, use_flash_attention=True)
 ```
 
 ## Running Multiple-Choice Inference
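
For the transformers-first path shown in this hunk, the standard way to request flash attention is the `attn_implementation` argument of `from_pretrained` (available in transformers 4.36+). The sketch below is an alternative to the Unsloth-specific `use_flash_attention` flag in the diff and assumes a CUDA GPU with bfloat16 support.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,               # assumes an Ampere-or-newer GPU
    attn_implementation="flash_attention_2",  # requires flash-attn to be installed
    device_map="auto",
    trust_remote_code=True,
)
```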
@@ -245,27 +266,29 @@ def batch_process_questions(questions_list, model, tokenizer, batch_size=4):
 
 ## Performance Tips for Unsloth
 
-1. **
+1. **Flash Attention REQUIRED**: Flash Attention is not just a performance option but a requirement for this model to function correctly:
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+2. **Memory Optimization**: If you encounter memory issues, reduce `max_seq_length` or use 4-bit quantization:
 ```python
 model, tokenizer = FastLanguageModel.from_pretrained(
     model_name=model_id,
     max_seq_length=1024, # Reduced from 2048
-    load_in_4bit=True
+    load_in_4bit=True,
+    use_flash_attention=True # Always enable
 )
 ```
 
-
+3. **Batch Processing**: For multiple questions, always use batching as it's significantly faster.
 
-
+4. **Prefill Optimization**: Unsloth has special optimizations for prefill that work best with long contexts and batch processing.
 
-
+5. **GPU Selection**: If you have multiple GPUs, you can specify which to use:
 ```python
 os.environ["CUDA_VISIBLE_DEVICES"] = "0" # Use first GPU
 ```
 
-
-
-pip install flash-attn --no-build-isolation
-```
-
-With these optimizations, Qwen25_Coder_MultipleChoice should run significantly faster while maintaining the same high-quality multiple-choice reasoning and answers.
+<!-- With these optimizations, Qwen25_Coder_MultipleChoice will run correctly while maintaining the high-quality multiple-choice reasoning and answers. -->
+```
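
On tip 3 (batch processing) in the hunk above: a minimal sketch of the idea follows. The function and variable names are illustrative, not taken from the README's `batch_process_questions`, and it assumes a decoder-only tokenizer for which left padding is appropriate.

```python
def answer_batch(prompts, model, tokenizer, max_new_tokens=16):
    """Tokenize several prompts at once and run a single generate() call."""
    tokenizer.padding_side = "left"  # left-pad so generation continues from each prompt's end
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, then decode
    new_tokens = outputs[:, inputs["input_ids"].shape[1]:]
    return tokenizer.batch_decode(new_tokens, skip_special_tokens=True)
```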