license: mit
datasets:
- tuandunghcmut/normal_dataset
language:
- en
base_model:
- unsloth/Qwen2.5-Coder-1.5B-Instruct
pipeline_tag: text-generation
Using tuandunghcmut/Qwen25_Coder_MultipleChoice
The project "Knowledge Distillation About YAML-Based Structured Multi-Step Reasoning from a Teacher Model GPT-4o to a Small LLM: Qwen2.5 Coder 1.5B-Instruct" focuses on distilling structured multi-step reasoning from GPT-4o into a smaller model.
This document provides everything you need to get started with tuandunghcmut/Qwen25_Coder_MultipleChoice, a model designed for multiple-choice coding questions.
I plan to refactor the project into a well-structured GitHub repository, expand the dataset, and later re-train the model with distributed training for better scalability.
Installation and Setup
Prerequisites
Make sure you have Python 3.8+ installed. Then install the required packages:
# Install core dependencies
pip install transformers torch pandas
# For faster inference (important)
pip install unsloth accelerate bitsandbytes
# Flash Attention (highly recommended for speed)
pip install flash-attn --no-build-isolation
# For dataset handling and YAML parsing
pip install datasets pyyaml
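To confirm the core packages are importable and see which versions you ended up with, a quick check like this (a minimal sketch) can save debugging time later:
import torch
import transformers
import datasets

# Print the versions that were actually installed
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("datasets:", datasets.__version__)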
Flash Attention Setup
Flash Attention provides a significant speedup for transformer models. To use it with the Qwen model:
- Install Flash Attention as shown above
- Enable it when loading the model:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Enable Flash Attention during model loading
model = AutoModelForCausalLM.from_pretrained(
    "tuandunghcmut/Qwen25_Coder_MultipleChoice",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"  # Enable Flash Attention 2 (replaces the older use_flash_attention_2 flag)
)
Flash Attention will provide:
- 2-3x faster inference speed
- Lower memory usage
- Compatibility with 4-bit quantization for even more efficiency
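To see what these numbers mean on your own hardware, the rough timing sketch below measures tokens per second. It assumes the model object loaded in the snippet above and loads the matching tokenizer; the prompt text is arbitrary and results will vary with GPU and sequence length:
import time
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "tuandunghcmut/Qwen25_Coder_MultipleChoice", trust_remote_code=True
)

prompt = "Explain what a Python list comprehension is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Time one generation pass and report throughput
start = time.perf_counter()
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs.input_ids.shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f}s ({new_tokens / elapsed:.1f} tok/s)")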
Environment Variables
If you're using Hugging Face Hub models, you may want to set up your access token:
# Set environment variable for Hugging Face token
export HF_TOKEN="your_huggingface_token_here"
# Or in Python
import os
os.environ["HF_TOKEN"] = "your_huggingface_token_here"
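Alternatively, you can authenticate programmatically with the huggingface_hub client, which reuses the same token (a minimal sketch; swap in your own token source if you prefer):
import os
from huggingface_hub import login

# Log in for the current session using the token set above
login(token=os.environ["HF_TOKEN"])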
GPU Setup
For optimal performance, you'll need a CUDA-compatible GPU. Check your installation:
# Verify CUDA is available
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
# Print CUDA device info
python -c "import torch; print('CUDA device count:', torch.cuda.device_count()); print('CUDA device name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'No GPU')"
Required Classes
Below are the essential classes needed to work with the model. Copy these into your Python files to use them in your project.
PromptCreator
This class formats prompts for multiple-choice questions:
class PromptCreator:
    """
    Creates and formats prompts for multiple choice questions
    Supports different prompt styles for training and inference
    """

    # Prompt types
    BASIC = "basic"  # Simple answer-only format
    YAML_REASONING = "yaml"  # YAML formatted reasoning
    TEACHER_REASONED = "teacher"  # Same YAML format as YAML_REASONING but using teacher completions for training

    def __init__(self, prompt_type=BASIC):
        self.prompt_type = prompt_type
        # Initialize parser mode based on prompt type
        if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
            self.parser_mode = "yaml"
        else:
            self.parser_mode = "basic"

    def format_choices(self, choices):
        """Format choices with letter prefixes"""
        return "\n".join([f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices)])

    def get_max_letter(self, choices):
        """Get the last valid letter based on choice count"""
        return chr(65 + len(choices) - 1)

    def create_inference_prompt(self, question, choices):
        """Create a prompt for inference based on the configured prompt type"""
        formatted_choices = self.format_choices(choices)
        max_letter = self.get_max_letter(choices)
        if self.prompt_type == self.BASIC:
            return self._create_basic_prompt(question, formatted_choices, max_letter)
        elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
            return self._create_yaml_prompt(question, formatted_choices, max_letter)
        else:
            return self._create_basic_prompt(question, formatted_choices, max_letter)

    def _create_basic_prompt(self, question, formatted_choices, max_letter):
        """Create a basic prompt that just asks for an answer letter"""
        return f"""
{question}
{formatted_choices}
Select the correct answer from A through {max_letter}:
"""

    def _create_yaml_prompt(self, question, formatted_choices, max_letter):
        """Create a prompt with YAML formatted reasoning structure"""
        return f"""
{question}
{formatted_choices}
Think through this step-by-step:
- Understand what the question is asking
- Analyze each option carefully
- Reason about why each option might be correct or incorrect
- Select the most appropriate answer
Your response should be in YAML format:
understanding: |
  <your understanding of the question>
analysis: |
  <your analysis of each option>
reasoning: |
  <your reasoning about the correct answer>
conclusion: |
  <your final conclusion>
answer: <single letter A through {max_letter} representing your final answer>
"""

    def create_training_prompt(self, question, choices):
        """Create a prompt for training based on the configured prompt type"""
        formatted_choices = self.format_choices(choices)
        max_letter = self.get_max_letter(choices)
        if self.prompt_type == self.BASIC:
            return self._create_basic_training_prompt(question, formatted_choices, max_letter)
        elif self.prompt_type == self.YAML_REASONING or self.prompt_type == self.TEACHER_REASONED:
            return self._create_yaml_training_prompt(question, formatted_choices, max_letter)
        else:
            return self._create_basic_training_prompt(question, formatted_choices, max_letter)

    def _create_basic_training_prompt(self, question, formatted_choices, max_letter):
        """Create a basic training prompt"""
        return f"""
{question}
{formatted_choices}
Select the correct answer from A through {max_letter}:
"""

    def _create_yaml_training_prompt(self, question, formatted_choices, max_letter):
        """Create a training prompt with YAML formatted reasoning structure"""
        return f"""
{question}
{formatted_choices}
Think through this step-by-step:
- Understand what the question is asking
- Analyze each option carefully
- Reason about why each option might be correct or incorrect
- Select the most appropriate answer
Your response should be in YAML format:
understanding: |
  <your understanding of the question>
analysis: |
  <your analysis of each option>
reasoning: |
  <your reasoning about the correct answer>
conclusion: |
  <your final conclusion>
answer: <single letter A through {max_letter} representing your final answer>
"""

    def set_prompt_type(self, prompt_type):
        """Set the prompt type and update parser mode accordingly"""
        self.prompt_type = prompt_type
        if prompt_type == self.YAML_REASONING or prompt_type == self.TEACHER_REASONED:
            self.parser_mode = "yaml"
        else:
            self.parser_mode = "basic"

    def is_teacher_mode(self):
        """Check if prompt type is teacher mode"""
        return self.prompt_type == self.TEACHER_REASONED
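Here is a minimal usage sketch for PromptCreator; the question and choices are made up for illustration:
creator = PromptCreator(PromptCreator.YAML_REASONING)

question = "What does the len() built-in return for a Python dict?"
choices = [
    "The number of key-value pairs",
    "The total size in bytes",
    "The number of unique values",
]

# Builds the YAML-reasoning prompt with lettered choices A through C
prompt = creator.create_inference_prompt(question, choices)
print(prompt)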
ResponseParser
This class extracts answers from model responses:
class ResponseParser:
    """
    Parser for model responses with support for different formats
    Extracts answers and reasoning from model outputs
    """

    # Parser modes
    BASIC = "basic"  # Extract single letter answer
    YAML = "yaml"  # Parse YAML formatted response with reasoning

    def __init__(self, parser_mode=BASIC):
        """Initialize with parser mode (basic or yaml)"""
        self.parser_mode = parser_mode

    def parse(self, response_text):
        """Parse the response text and extract answer and reasoning"""
        if self.parser_mode == self.YAML:
            return self._parse_yaml_response(response_text)
        else:
            return self._parse_basic_response(response_text)

    def _parse_basic_response(self, response_text):
        """
        Parse a basic response to extract the answer letter

        Returns:
            tuple: (answer_letter, None)
        """
        # Look for just the letter at the end of text
        import re
        # Try to find the last occurrence of letters A-Z by themselves
        matches = re.findall(r'\b([A-Z])\b', response_text)
        if matches:
            return matches[-1], None  # Return the last matching letter
        # Try to find "The answer is X" pattern
        answer_match = re.search(r'[Tt]he answer is[:\s]+([A-Z])', response_text)
        if answer_match:
            return answer_match.group(1), None
        # If nothing else works, just get the last uppercase letter
        uppercase_letters = re.findall(r'[A-Z]', response_text)
        if uppercase_letters:
            return uppercase_letters[-1], None
        return None, None  # No answer found

    def _parse_yaml_response(self, response_text):
        """
        Parse a YAML formatted response to extract the answer and reasoning

        Returns:
            tuple: (answer_letter, reasoning_dict)
        """
        import re
        import yaml
        # First try to extract just the answer field
        answer_match = re.search(r'answer:\s*([A-Z])', response_text)
        answer = answer_match.group(1) if answer_match else None
        # Try to extract the entire YAML
        try:
            # Remove potential code block markers
            yaml_text = response_text
            if "```yaml" in yaml_text:
                yaml_text = yaml_text.split("```yaml")[1]
                if "```" in yaml_text:
                    yaml_text = yaml_text.split("```")[0]
            elif "```" in yaml_text:
                # Assume the whole thing is a code block
                parts = yaml_text.split("```")
                if len(parts) >= 3:
                    yaml_text = parts[1]
            # Parse the YAML
            parsed_yaml = yaml.safe_load(yaml_text)
            # If successful, use the answer from the YAML, and return the parsed structure
            if isinstance(parsed_yaml, dict) and "answer" in parsed_yaml:
                return parsed_yaml.get("answer"), parsed_yaml
        except Exception:
            # If YAML parsing fails, we already have the answer from regex
            pass
        return answer, None

    def set_parser_mode(self, parser_mode):
        """Set the parser mode"""
        self.parser_mode = parser_mode

    @classmethod
    def from_prompt_type(cls, prompt_type):
        """
        Create a ResponseParser with the appropriate mode based on prompt type

        Args:
            prompt_type: The prompt type (e.g., PromptCreator.YAML_REASONING)

        Returns:
            ResponseParser: A parser configured for the prompt type
        """
        if prompt_type in ["yaml", "teacher"]:
            return cls("yaml")
        else:
            return cls("basic")
QwenModelHandler
This class handles model loading and inference:
class QwenModelHandler:
    def __init__(self, model_name="unsloth/Qwen2.5-7B", max_seq_length=768,
                 quantization=None, device_map="auto", cache_dir=None,
                 use_flash_attention=True):
        """
        Initialize a handler for Qwen models

        Args:
            model_name: Model identifier (local path or Hugging Face model ID)
            max_seq_length: Maximum sequence length
            quantization: Quantization method ("4bit", "8bit", or None)
            device_map: Device mapping strategy
            cache_dir: Directory to cache downloaded models
            use_flash_attention: Whether to use Flash Attention 2 for faster inference
        """
        self.model_name = model_name
        self.max_seq_length = max_seq_length
        self.quantization = quantization
        self.device_map = device_map
        self.cache_dir = cache_dir
        self.use_flash_attention = use_flash_attention
        self.model = None
        self.tokenizer = None
        # Load the model and tokenizer
        self._load_model()

    def _load_model(self):
        """Load the model and tokenizer with appropriate settings"""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch
        # Load tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_name,
            trust_remote_code=True,
            cache_dir=self.cache_dir
        )
        # Prepare model loading kwargs
        model_kwargs = {
            "trust_remote_code": True,
            "cache_dir": self.cache_dir,
            "device_map": self.device_map,
        }
        # Add Flash Attention if requested and available
        if self.use_flash_attention:
            try:
                import flash_attn  # noqa: F401  # only imported to check availability
                # Newer transformers versions use attn_implementation instead of use_flash_attention_2
                model_kwargs["attn_implementation"] = "flash_attention_2"
                print("Flash Attention 2 enabled!")
            except ImportError:
                print("Flash Attention not available. For faster inference, install with: pip install flash-attn")
        # Add quantization if specified
        if self.quantization == "4bit":
            try:
                from transformers import BitsAndBytesConfig
                model_kwargs["quantization_config"] = BitsAndBytesConfig(
                    load_in_4bit=True,
                    bnb_4bit_compute_dtype=torch.bfloat16
                )
            except ImportError:
                print("bitsandbytes not available, loading without 4-bit quantization")
        elif self.quantization == "8bit":
            model_kwargs["load_in_8bit"] = True
        else:
            model_kwargs["torch_dtype"] = torch.bfloat16
        # Load the model
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_name,
            **model_kwargs
        )
    def generate_with_streaming(self, prompt, temperature=0.7, max_tokens=1024, stream=True):
        """
        Generate text from the model with optional streaming

        Args:
            prompt: Input text prompt
            temperature: Temperature for sampling (0 for deterministic)
            max_tokens: Maximum number of tokens to generate
            stream: Whether to print tokens to stdout as they are generated

        Returns:
            String containing the generated text (without the prompt)
        """
        import torch
        # Tokenize prompt
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        input_ids = inputs.input_ids
        attention_mask = inputs.attention_mask
        # Set generation parameters
        generation_config = {
            "max_new_tokens": max_tokens,
            "temperature": temperature,
            "do_sample": temperature > 0,
            "top_p": 0.95 if temperature > 0 else 1.0,
            "repetition_penalty": 1.1,
            "pad_token_id": self.tokenizer.eos_token_id,
        }
        # If not streaming, do normal generation
        if not stream:
            with torch.no_grad():
                outputs = self.model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    **generation_config
                )
            # Decode the generated text (skip the prompt)
            generated_text = self.tokenizer.decode(
                outputs[0][input_ids.shape[1]:],
                skip_special_tokens=True
            )
            return generated_text
        # If streaming, print tokens to stdout as they are generated using
        # transformers' TextStreamer, then return the full decoded text
        else:
            from transformers import TextStreamer
            streamer = TextStreamer(self.tokenizer, skip_prompt=True, skip_special_tokens=True)
            with torch.no_grad():
                generated_ids = self.model.generate(
                    input_ids=input_ids,
                    attention_mask=attention_mask,
                    streamer=streamer,
                    **generation_config
                )
            # Decode the completion (skip the prompt) for the return value
            full_text = self.tokenizer.decode(
                generated_ids[0][input_ids.shape[1]:],
                skip_special_tokens=True
            )
            return full_text
Hardware Requirements and Optimization
Flash Attention Benefits
Flash Attention is a highly optimized implementation of the attention mechanism that:
- Speeds up inference by 2-3x compared to standard attention
- Reduces memory usage by avoiding materializing large attention matrices
- Combines well with 4-bit quantization for further memory savings
- Scales better with sequence length, which is important for complex coding questions
For the best performance, make sure to:
- Install Flash Attention (pip install flash-attn)
- Enable it when loading the model (see the QwenModelHandler class)
- Use a CUDA-compatible NVIDIA GPU
Hardware Recommendations
For optimal performance, we recommend:
- GPU: NVIDIA GPU with at least 8GB VRAM (16GB+ recommended for larger models)
- RAM: 16GB+ system RAM
- Storage: At least 10GB free disk space for model files
- CPU: Modern multi-core processor (for preprocessing)
Reducing Memory Usage
If you're facing memory constraints:
# Use 4-bit quantization with Flash Attention for optimal memory-efficiency
model_handler = QwenModelHandler(
    model_name="tuandunghcmut/Qwen25_Coder_MultipleChoice",
    quantization="4bit",
    use_flash_attention=True
)

# Further optimize with unsloth
try:
    from unsloth.models import FastLanguageModel
    FastLanguageModel.for_inference(model_handler.model)
    print("Using unsloth for additional optimization")
except ImportError:
    print("unsloth not available")
Usage Example
Here's how to use these classes with Flash Attention enabled:
# 1. Load the model with Flash Attention and 4-bit quantization
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

hub_model_id = "tuandunghcmut/Qwen25_Coder_MultipleChoice"

# Create model handler with Flash Attention and 4-bit quantization
model_handler = QwenModelHandler(
    model_name=hub_model_id,
    max_seq_length=2048,
    quantization="4bit",
    use_flash_attention=True
)

# Optional: Use unsloth for even faster inference
try:
    from unsloth.models import FastLanguageModel
    FastLanguageModel.for_inference(model_handler.model)
    print("Using unsloth for faster inference")
except ImportError:
    print("unsloth not available, using standard inference")

# 2. Create prompt creator with YAML reasoning format
prompt_creator = PromptCreator(PromptCreator.YAML_REASONING)

# 3. Example question
question = "Which of the following correctly defines a list comprehension in Python?"
choices = [
    "[x**2 for x in range(10)]",
    "for(x in range(10)) { return x**2; }",
    "map(lambda x: x**2, range(10))",
    "[for x in range(10): x**2]"
]

# 4. Create prompt and generate answer
prompt = prompt_creator.create_inference_prompt(question, choices)
response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)

# 5. Parse the response
parser = ResponseParser(prompt_creator.parser_mode)
answer, reasoning = parser.parse(response)

print(f"Question: {question}")
print(f"Answer: {answer}")
if reasoning:
    print(f"Reasoning: {reasoning}")
Troubleshooting
Common Issues
Flash Attention Installation Issues: If you encounter problems installing flash-attn:
# Try with a specific CUDA version (e.g., for CUDA 11.8)
pip install flash-attn==2.3.4+cu118 --no-build-isolation
# For older GPUs
pip install flash-attn==2.3.4 --no-build-isolation
CUDA Out of Memory: Try combining 4-bit quantization with Flash Attention:
model_handler = QwenModelHandler(
    model_name=hub_model_id,
    quantization="4bit",
    use_flash_attention=True
)
Module Not Found Errors: Make sure you've installed all required packages:
pip install transformers torch unsloth datasets pyyaml bitsandbytes flash-attn
Parsing Errors: If the model isn't producing valid YAML responses, try lowering the temperature:
response = model_handler.generate_with_streaming(prompt, temperature=0.0, stream=False)
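If YAML parsing still fails occasionally, a simple fallback is to re-parse the same response in basic mode, which only extracts the answer letter:
answer, reasoning = parser.parse(response)
if answer is None:
    # Fall back to the basic letter-extraction parser defined above
    answer, _ = ResponseParser(ResponseParser.BASIC).parse(response)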
Getting Help
If you encounter issues, check the model repository on Hugging Face for updates and community discussions.
This guide provides you with all the necessary code and optimization techniques to use the model effectively for multiple-choice coding questions.