---
license: mit
datasets:
  - tuandunghcmut/normal_dataset
language:
  - en
base_model:
  - unsloth/Qwen2.5-Coder-1.5B-Instruct
pipeline_tag: text-generation
---

Using tuandunghcmut/Qwen25_Coder_MultipleChoice

The project "Knowledge Distillation About YAML-Based Structured Multi-Step Reasoning from a Teacher Model GPT-4o to a Small LLM: Qwen2.5 Coder 1.5B-Instruct" focuses on distilling structured multi-step reasoning from GPT-4o into a smaller model.

This document provides everything you need to get started with tuandunghcmut/Qwen25_Coder_MultipleChoice, a model designed for multiple-choice coding questions.

I plan to refactor the project into a well-structured GitHub repository, expand the dataset, and re-train it later with distributed training for better scalability.

Installation and Setup

Prerequisites

Make sure you have Python 3.8+ installed. Then install the required packages:

# Install core dependencies
pip install transformers torch pandas

# For faster inference and quantization support (recommended)
pip install unsloth accelerate bitsandbytes

# Flash Attention (highly recommended for speed)
pip install flash-attn --no-build-isolation

# For dataset handling and YAML parsing
pip install datasets pyyaml

Flash Attention Setup

Flash Attention provides a significant speedup for transformer models. To use it with the Qwen model:

  1. Install Flash Attention as shown above
  2. Enable it when loading the model:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    "tuandunghcmut/Qwen25_Coder_MultipleChoice",
    trust_remote_code=True
)

# Enable Flash Attention during model loading
model = AutoModelForCausalLM.from_pretrained(
    "tuandunghcmut/Qwen25_Coder_MultipleChoice",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    use_flash_attention_2=True  # Enable Flash Attention
)

Flash Attention will provide:

  • 2-3x faster inference speed
  • Lower memory usage
  • Compatible with 4-bit quantization for even more efficiency
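
To confirm that the flash-attn package is importable before loading the model, a minimal check (not part of the original setup) looks like this:

# Quick sanity check that the flash-attn package is importable
import importlib.util

if importlib.util.find_spec("flash_attn") is not None:
    print("flash-attn is installed")
else:
    print("flash-attn not found; install it with: pip install flash-attn --no-build-isolation")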

Environment Variables

If you're using Hugging Face Hub models, you may want to set up your access token:

# Set environment variable for Hugging Face token
export HF_TOKEN="your_huggingface_token_here"

# Or in Python
import os
os.environ["HF_TOKEN"] = "your_huggingface_token_here"

GPU Setup

For optimal performance, you'll need a CUDA-compatible GPU. Check your installation:

# Verify CUDA is available
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"

# Print CUDA device info
python -c "import torch; print('CUDA device count:', torch.cuda.device_count()); print('CUDA device name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"

Required Classes

Below are the essential classes needed to work with the model. Copy these into your Python files to use them in your project.

PromptCreator

This class formats prompts for multiple-choice questions:

class PromptCreator:
    """
    Creates and formats prompts for multiple choice questions
    Supports different prompt styles for training and inference
    """

    # Prompt types
    BASIC = "basic"  # Simple answer-only format
    YAML_REASONING = "yaml"  # YAML formatted reasoning
    TEACHER_REASONED = "teacher"  # Same YAML format as YAML_REASONING but using teacher completions for training

    def __init__(self, prompt_type=BASIC):
        self.prompt_type = prompt_type
        # Initialize parser mode based on prompt type
        if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
            self.parser_mode = "yaml"
        else:
            self.parser_mode = "basic"

    def format_choices(self, choices):
        """Format choices with letter prefixes"""
        return "\n".join([f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)])

    def get_max_letter(self, choices):
        """Get the last valid letter based on choice count"""
        return chr(65 + len(choices) - 1)

    def create_inference_prompt(self, question, choices):
        """Create a prompt for inference based on the configured prompt type"""
        formatted_choices = self.format_choices(choices)
        max_letter = self.get_max_letter(choices)
        
        if self.prompt_type == self.BASIC:
            return self._create_basic_prompt(question, formatted_choices, max_letter)
        elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
            return self._create_yaml_prompt(question, formatted_choices, max_letter)
        else:
            return self._create_basic_prompt(question, formatted_choices, max_letter)

    def _create_basic_prompt(self, question, formatted_choices, max_letter):
        """Create a basic prompt that just asks for an answer letter"""
        return f"""
{question}

{formatted_choices}

Select the correct answer from A through {max_letter}:
"""

    def _create_yaml_prompt(self, question, formatted_choices, max_letter):
        """Create a prompt with YAML formatted reasoning structure"""
        return f"""
{question}

{formatted_choices}

Think through this step-by-step:
- Understand what the question is asking
- Analyze each option carefully
- Reason about why each option might be correct or incorrect
- Select the most appropriate answer

Your response should be in YAML format:
understanding: |
  <your understanding of the question>
analysis: |
  <your analysis of each option>
reasoning: |
  <your reasoning about the correct answer>
conclusion: |
  <your final conclusion>
answer: <single letter A through {max_letter} representing your final answer>
"""

    def create_training_prompt(self, question, choices):
        """Create a prompt for training based on the configured prompt type"""
        formatted_choices = self.format_choices(choices)
        max_letter = self.get_max_letter(choices)
        
        if self.prompt_type == self.BASIC:
            return self._create_basic_training_prompt(question, formatted_choices, max_letter)
        elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
            return self._create_yaml_training_prompt(question, formatted_choices, max_letter)
        else:
            return self._create_basic_training_prompt(question, formatted_choices, max_letter)

    def _create_basic_training_prompt(self, question, formatted_choices, max_letter):
        """Create a basic training prompt"""
        return f"""
{question}

{formatted_choices}

Select the correct answer from A through {max_letter}:
"""

    def _create_yaml_training_prompt(self, question, formatted_choices, max_letter):
        """Create a training prompt with YAML formatted reasoning structure"""
        return f"""
{question}

{formatted_choices}

Think through this step-by-step:
- Understand what the question is asking
- Analyze each option carefully
- Reason about why each option might be correct or incorrect
- Select the most appropriate answer

Your response should be in YAML format:
understanding: |
  <your understanding of the question>
analysis: |
  <your analysis of each option>
reasoning: |
  <your reasoning about the correct answer>
conclusion: |
  <your final conclusion>
answer: <single letter A through {max_letter} representing your final answer>
"""

    def set_prompt_type(self, prompt_type):
        """Set the prompt type and update parser mode accordingly"""
        self.prompt_type = prompt_type
        if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
            self.parser_mode = "yaml"
        else:
            self.parser_mode = "basic"

    def is_teacher_mode(self):
        """Check if prompt type is teacher mode"""
        return self.prompt_type == self.TEACHER_REASONED
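
As a quick sanity check, the sketch below (using a made-up question) prints the YAML-reasoning prompt the class produces:

# Example: build a YAML-reasoning prompt for a sample question
creator = PromptCreator(PromptCreator.YAML_REASONING)
sample_question = "Which built-in function returns the number of items in a Python list?"
sample_choices = ["len()", "count()", "size()", "length()"]

print(creator.create_inference_prompt(sample_question, sample_choices))
# The prompt lists options A-D and asks for a YAML response ending in 'answer: <letter>'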

ResponseParser

This class extracts answers from model responses:

class ResponseParser:
    """
    Parser for model responses with support for different formats
    Extracts answers and reasoning from model outputs
    """
    
    # Parser modes
    BASIC = "basic"        # Extract single letter answer
    YAML = "yaml"          # Parse YAML formatted response with reasoning
    
    def __init__(self, parser_mode=BASIC):
        """Initialize with parser mode (basic or yaml)"""
        self.parser_mode = parser_mode
        
    def parse(self, response_text):
        """Parse the response text and extract answer and reasoning"""
        if self.parser_mode == self.YAML:
            return self._parse_yaml_response(response_text)
        else:
            return self._parse_basic_response(response_text)
    
    def _parse_basic_response(self, response_text):
        """
        Parse a basic response to extract the answer letter
        
        Returns:
            tuple: (answer_letter, None)
        """
        # Look for just the letter at the end of text
        import re
        
        # Try to find the last occurrence of letters A-Z by themselves
        matches = re.findall(r'\b([A-Z])\b', response_text)
        if matches:
            return matches[-1], None  # Return the last matching letter
            
        # Try to find "The answer is X" pattern
        answer_match = re.search(r'[Tt]he answer is[:\s]+([A-Z])', response_text)
        if answer_match:
            return answer_match.group(1), None
            
        # If nothing else works, just get the last uppercase letter
        uppercase_letters = re.findall(r'[A-Z]', response_text)
        if uppercase_letters:
            return uppercase_letters[-1], None
            
        return None, None  # No answer found
    
    def _parse_yaml_response(self, response_text):
        """
        Parse a YAML formatted response to extract the answer and reasoning
        
        Returns:
            tuple: (answer_letter, reasoning_dict)
        """
        import re
        import yaml
        
        # First try to extract just the answer field
        answer_match = re.search(r'answer:\s*([A-Z])', response_text)
        answer = answer_match.group(1) if answer_match else None
        
        # Try to extract the entire YAML
        try:
            # Remove potential code block markers
            yaml_text = response_text
            if "```yaml" in yaml_text:
                yaml_text = yaml_text.split("```yaml")[1]
                if "```" in yaml_text:
                    yaml_text = yaml_text.split("```")[0]
            elif "```" in yaml_text:
                # Assume the whole thing is a code block
                parts = yaml_text.split("```")
                if len(parts) >= 3:
                    yaml_text = parts[1]
                    
            # Parse the YAML
            parsed_yaml = yaml.safe_load(yaml_text)
            
            # If successful, use the answer from the YAML, and return the parsed structure
            if isinstance(parsed_yaml, dict) and "answer" in parsed_yaml:
                return parsed_yaml.get("answer"), parsed_yaml
        except Exception:
            # If YAML parsing fails, we already have the answer from regex
            pass
        
        return answer, None
    
    def set_parser_mode(self, parser_mode):
        """Set the parser mode"""
        self.parser_mode = parser_mode
    
    @classmethod
    def from_prompt_type(cls, prompt_type):
        """
        Create a ResponseParser with the appropriate mode based on prompt type
        
        Args:
            prompt_type: The prompt type (e.g., PromptCreator.YAML_REASONING)
            
        Returns:
            ResponseParser: A parser configured for the prompt type
        """
        if prompt_type in ["yaml", "teacher"]:
            return cls("yaml")
        else:
            return cls("basic")

QwenModelHandler

This class handles model loading and inference:

class QwenModelHandler:
    def __init__(self, model_name="unsloth/Qwen2.5-7B", max_seq_length=768, 
                 quantization=None, device_map="auto", cache_dir=None,
                 use_flash_attention=True):
        """
        Initialize a handler for Qwen models
        
        Args:
            model_name: Model identifier (local path or Hugging Face model ID)
            max_seq_length: Maximum sequence length
            quantization: Quantization method ("4bit", "8bit", or None)
            device_map: Device mapping strategy
            cache_dir: Directory to cache downloaded models
            use_flash_attention: Whether to use Flash Attention 2 for faster inference
        """
        self.model_name = model_name
        self.max_seq_length = max_seq_length
        self.quantization = quantization
        self.device_map = device_map
        self.cache_dir = cache_dir
        self.use_flash_attention = use_flash_attention
        
        self.model = None
        self.tokenizer = None
        
        # Load the model and tokenizer
        self._load_model()
        
    def _load_model(self):
        """Load the model and tokenizer with appropriate settings"""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            trust_remote_code=True,
            cache_dir=self.cache_dir
        )
        
        # Prepare model loading kwargs
        model_kwargs = {
            "trust_remote_code": True,
            "cache_dir": self.cache_dir,
            "device_map": self.device_map,
        }
        
        # Add Flash Attention if requested and available
        if self.use_flash_attention:
            try:
                import flash_attn
                model_kwargs["use_flash_attention_2"] = True
                print("Flash Attention 2 enabled!")
            except ImportError:
                print("Flash Attention not available. For faster inference, install with: pip install flash-attn")
        
        # Add quantization if specified
        if self.quantization == "4bit":
            try:
                from transformers import BitsAndBytesConfig
                model_kwargs["quantization_config"] = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_compute_dtype=torch.bfloat16
                )
            except ImportError:
                print("bitsandbytes not available, loading without 4-bit quantization")
        elif self.quantization == "8bit":
            model_kwargs["load_in_8bit"] = True
        else:
            model_kwargs["torch_dtype"] = torch.bfloat16
        
        # Load the model
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            **model_kwargs
        )
        
    def generate_with_streaming(self, prompt, temperature=0.7, max_tokens=1024, stream=True):
        """
        Generate text from the model with optional streaming
        
        Args:
            prompt: Input text prompt
            temperature: Temperature for sampling (0 for deterministic)
            max_tokens: Maximum number of tokens to generate
            stream: Whether to stream the output
            
        Returns:
            String containing the generated text (the stream=True path currently
            also returns the full decoded text at once)
        """
        import torch
        
        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        input_ids = inputs.input_ids
        attention_mask = inputs.attention_mask
        
        # Set generation parameters (sampling options are only passed when temperature > 0,
        # which avoids generation-config warnings for greedy decoding)
        generation_config = {
            "max_new_tokens": max_tokens,
            "do_sample": temperature > 0,
            "repetition_penalty": 1.1,
            "pad_token_id": self.tokenizer.eos_token_id,
        }
        if temperature > 0:
            generation_config["temperature"] = temperature
            generation_config["top_p"] = 0.95
        
        # If not streaming, do normal generation
        if not stream:
            with torch.no_grad():
                outputs = self.model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    **generation_config
                )
            
            # Decode the generated text (skip the prompt)
            generated_text = self.tokenizer.decode(
                outputs[0][input_ids.shape[1]:], 
                skip_special_tokens=True
            )
            
            return generated_text
        
        # If streaming was requested, fall back to generating the full sequence and
        # returning it at once (see the TextIteratorStreamer sketch after this class
        # for true token-by-token streaming)
        else:
            with torch.no_grad():
                generated_ids = self.model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    **generation_config
                )

            # Decode the entire generated sequence (skipping the prompt)
            full_text = self.tokenizer.decode(
                generated_ids[0][input_ids.shape[1]:],
                skip_special_tokens=True
            )

            return full_text
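
The stream=True branch above returns the full decoded text rather than streaming token by token. For true streaming, here is a minimal sketch using transformers' TextIteratorStreamer; this is an addition to the original handler and assumes a QwenModelHandler instance (handler) has already been constructed:

# True token-by-token streaming with TextIteratorStreamer
from threading import Thread

from transformers import TextIteratorStreamer


def stream_print(handler, prompt, max_tokens=1024):
    """Print tokens as they are generated by handler.model."""
    inputs = handler.tokenizer(prompt, return_tensors="pt").to(handler.model.device)
    streamer = TextIteratorStreamer(
        handler.tokenizer, skip_prompt=True, skip_special_tokens=True
    )

    # Run generation in a background thread so we can consume the streamer here
    thread = Thread(
        target=handler.model.generate,
        kwargs=dict(
            **inputs,
            max_new_tokens=max_tokens,
            do_sample=False,
            streamer=streamer,
        ),
    )
    thread.start()

    for text_chunk in streamer:  # yields decoded text pieces as they arrive
        print(text_chunk, end="", flush=True)
    thread.join()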

Hardware Requirements and Optimization

Flash Attention Benefits

Flash Attention is a highly optimized implementation of the attention mechanism that:

  1. Speeds up inference by 2-3x compared to standard attention
  2. Reduces memory usage by avoiding materializing large attention matrices
  3. Works perfectly with 4-bit quantization for even further optimization
  4. Scales better with sequence length, which is important for complex coding questions

For the best performance, make sure to:

  • Install Flash Attention (pip install flash-attn)
  • Enable it when loading the model (see QwenModelHandler class)
  • Use with CUDA-compatible NVIDIA GPUs

Hardware Recommendations

For optimal performance, we recommend:

  • GPU: NVIDIA GPU with at least 8GB VRAM (16GB+ recommended for larger models)
  • RAM: 16GB+ system RAM
  • Storage: At least 10GB free disk space for model files
  • CPU: Modern multi-core processor (for preprocessing)

Reducing Memory Usage

If you're facing memory constraints:

# Use 4-bit quantization with Flash Attention for optimal memory-efficiency
model_handler = QwenModelHandler(
    model_name="tuandunghcmut/Qwen25_Coder_MultipleChoice",
    quantization="4bit",
    use_flash_attention=True
)

# Further optimize with unsloth
try:
    from unsloth.models import FastLanguageModel
    FastLanguageModel.for_inference(model_handler.model)
    print("Using unsloth for additional optimization")
except ImportError:
    print("unsloth not available")

Usage Example

Here's how to use these classes with Flash Attention enabled:

# 1. Load the model with Flash Attention and 4-bit quantization
# (uses the PromptCreator, ResponseParser, and QwenModelHandler classes defined above)

hub_model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

# Create model handler with Flash Attention and 4-bit quantization
model_handler = QwenModelHandler(
    model_name=hub_model_id,
    max_seq_length=2048,
    quantization="4bit",
    use_flash_attention=True
)

# Optional: Use unsloth for even faster inference
try:
    from unsloth.models import FastLanguageModel
    FastLanguageModel.for_inference(model_handler.model)
    print("Using unsloth for faster inference")
except ImportError:
    print("unsloth not available, using standard inference")

# 2. Create prompt creator with YAML reasoning format
prompt_creator = PromptCreator(PromptCreator.YAML_REASONING)

# 3. Example question
question = "Which of the following correctly defines a list comprehension in Python?"
choices = [
    "[x**2 for x in range(10)]",
    "for(x in range(10)) { return x**2; }",
    "map(lambda x: x**2, range(10))",
    "[for x in range(10): x**2]"
]

# 4. Create prompt and generate answer
prompt = prompt_creator.create_inference_prompt(question, choices)
response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)

# 5. Parse the response
parser = ResponseParser(prompt_creator.parser_mode)
answer, reasoning = parser.parse(response)

print(f"Question: {question}")
print(f"Answer: {answer}")
if reasoning:
    print(f"Reasoning: {reasoning}")

Troubleshooting

Common Issues

  1. Flash Attention Installation Issues: If flash-attn fails to build, make sure ninja and a CUDA toolkit matching your PyTorch build are installed, and note that Flash Attention 2 requires an Ampere-or-newer NVIDIA GPU:

    # Install the build dependency first
    pip install ninja

    # Pin a known-good release if the latest version fails to build
    pip install flash-attn==2.3.4 --no-build-isolation
    
  2. CUDA Out of Memory: Try combining 4-bit quantization with Flash Attention.

    model_handler = QwenModelHandler(
        model_name=hub_model_id, 
        quantization="4bit",
        use_flash_attention=True
    )
    
  3. Module Not Found Errors: Make sure you've installed all required packages.

    pip install transformers torch accelerate unsloth datasets pyyaml bitsandbytes flash-attn
    
  4. Parsing Errors: If the model isn't producing valid YAML responses, try adjusting the temperature:

    response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
    

Getting Help

If you encounter issues, check the model repository on Hugging Face for updates and community discussions.

This guide provides you with all the necessary code and optimization techniques to use the model effectively for multiple-choice coding questions.