Qwen3-1.7B Magistral Math (BF16)

License: Apache-2.0 Model: Qwen3-1.7B Domain: Math Reasoning Precision: BF16


Disclaimer:

“Magistral” here refers to lecture-style (magistral) math reasoning supervision. This is an independent community model, not affiliated with or endorsed by Mistral AI.

TL;DR

Qwen3-1.7B Magistral Math is a math-specialized fine-tune of unsloth/Qwen3-1.7B-Base, trained in full BF16 on a compact, high-quality chain-of-thought dataset:

  • Data: HAD653/GSM8K-OpenMath-MathReason-13k – 13.9k grade-school & early high-school word problems with structured CoT.

  • Goal: a 1.7B model that reliably solves GSM8K / OpenMath-style problems with clear step-by-step reasoning.

  • Answer format:

    Problem:
    ...
    
    Reasoning:
    ...
    
    Answer:
    <final numeric answer>
    
  • Best use: GSM8K-like word problems, OpenMath-style exercises, or as a local math tutor.

Model Details

  • Base model: unsloth/Qwen3-1.7B-Base
  • Architecture: Qwen3 dense causal LM, ~1.7B params, 28 layers, grouped-query attention (GQA), long-context support.
  • Training stage: supervised fine-tuning (SFT) for math reasoning.
  • Precision: BF16 (torch_dtype=torch.bfloat16 recommended).
  • Intended backend: Hugging Face transformers, vLLM, TGI.

This repo stores full fine-tuned weights (no LoRA, no adapters).


Training Data

The model is fine-tuned on:

  • Dataset: HAD653/GSM8K-OpenMath-MathReason-13k

  • Size: 13,857 samples.

  • Fields:

    • question: natural-language math word problem.

    • cot: chain-of-thought solution with 3 blocks:

      • Problem:
      • Reasoning:
      • Answer:
    • final_answer: canonical numeric answer (string).

The dataset focuses on easy–medium difficulty:

  • arithmetic, percentages, fractions,
  • basic algebra,
  • simple combinatorics / number problems.

It is deliberately aimed at what a 1–3B model can realistically master.
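
For a quick look at the data, a minimal sketch using the datasets library (assuming the default train split) is:

from datasets import load_dataset

ds = load_dataset("HAD653/GSM8K-OpenMath-MathReason-13k", split="train")

print(len(ds))                     # 13,857 samples
example = ds[0]
print(example["question"])         # natural-language word problem
print(example["cot"])              # Problem: / Reasoning: / Answer: blocks
print(example["final_answer"])     # canonical numeric answer (string)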


Training Setup (Summary)

Fine-tuning was done with Unsloth + TRL on a single RTX 4090, in full BF16 (no LoRA); a configuration sketch follows the hyperparameter list below.

Hyperparameters

  • Base: unsloth/Qwen3-1.7B-Base

  • Sequence length: 2048

  • Epochs: 2

  • Batching:

    • per_device_train_batch_size = 2
    • gradient_accumulation_steps = 8
    • Effective batch size ≈ 16 sequences
  • Optimizer / schedule:

    • learning_rate = 7e-5
    • Linear LR scheduler, warmup_ratio = 0.05
    • weight_decay = 0.01
  • Precision / memory:

    • dtype = bfloat16
    • gradient_checkpointing = True
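
As a reference, the settings above map roughly onto the following transformers TrainingArguments. This is a sketch: the actual run used Unsloth with TRL's SFTTrainer, and output_dir is a placeholder.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="qwen3-1.7b-magistral-math-sft",  # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size ≈ 16 sequences
    num_train_epochs=2,
    learning_rate=7e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    weight_decay=0.01,
    bf16=True,                       # full BF16 fine-tuning, no LoRA
    gradient_checkpointing=True,
)
# The 2048-token sequence length is applied when tokenizing/packing the data
# (e.g. via the SFT trainer's maximum-sequence-length setting).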

Supervision format

Each example is converted to a single text field:

### Instruction:
{question}

### Response:
{cot}</s>

Here, </s> is the tokenizer's EOS token.

Adding the EOS token at the end of the target helps the model learn when to stop, and largely removes pathological loops like:

Answer:
36

Answer:
36

Answer:
36
...
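
A minimal sketch of this conversion, applied per dataset row (the exact whitespace is an approximation of the template above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("unsloth/Qwen3-1.7B-Base")

def to_training_text(example: dict) -> dict:
    # Instruction/Response template + EOS so the model learns to stop.
    text = (
        "### Instruction:\n"
        f"{example['question']}\n\n"
        "### Response:\n"
        f"{example['cot']}{tokenizer.eos_token}"
    )
    return {"text": text}

# dataset.map(to_training_text) produces the single "text" field used for SFT.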

Prompting & Templates

Recommended system prompt (optional but helpful)

You are a math reasoning assistant.

For every question, answer in exactly this format:

Problem:
<restate the problem in your own words>

Reasoning:
<step-by-step reasoning showing all intermediate steps>

Answer:
<final numeric answer only, on its own line>

Do not add any extra commentary before or after the answer.
Do not repeat the answer multiple times.
Stop after writing the final answer.

Inference template (matches training)

For single-turn usage:

### Instruction:
{question}

### Response:

The model will then generate:

Problem:
...

Reasoning:
...

Answer:
<number>
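
A small helper that builds this prompt, optionally prepending the system prompt above. How the system prompt is combined with the template is an assumption on my part; training used only the Instruction/Response part.

def build_prompt(question: str, system_prompt: str = "") -> str:
    # Matches the training-time Instruction/Response template.
    prompt = f"### Instruction:\n{question}\n\n### Response:\n"
    if system_prompt:
        # Prepending the system prompt is an assumption; it was not part of
        # the SFT template.
        prompt = f"{system_prompt}\n\n{prompt}"
    return prompt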

Suggested decoding

For math, use low-temperature decoding (a matching GenerationConfig sketch follows this list):

  • temperature: 0.0 – 0.2
  • top_p: 0.9
  • top_k: 20–40 (optional)
  • repetition_penalty: 1.05 – 1.10
  • max_new_tokens: 256–512 for long CoT
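
Expressed as a transformers GenerationConfig, with concrete values picked from the ranges above:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    do_sample=True,           # use do_sample=False for pure greedy decoding
    temperature=0.2,
    top_p=0.9,
    top_k=20,
    repetition_penalty=1.05,
    max_new_tokens=512,
)
# Pass to generation: model.generate(**inputs, generation_config=gen_config)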

Usage (Transformers)

Basic example

from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_NAME = "HAD653/qwen3-1.7b-magistral-math"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype="auto",          # will be bf16 if supported
    device_map="auto",
)

def format_prompt(question: str) -> str:
    return f"### Instruction:\n{question}\n\n### Response:\n"

question = "Albert buys 2 large pizzas and 2 small pizzas. A large pizza has 16 slices and a small pizza has 8 slices. If he eats it all, how many pieces does he eat that day?"

prompt = format_prompt(question)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,          # greedy is usually best for math
    repetition_penalty=1.05,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
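
Continuing the example above, a simple way to pull out the final numeric answer (the regex is an assumption based on the documented Answer: block):

import re

def extract_answer(generation: str):
    # Grab the first number after the "Answer:" marker.
    match = re.search(r"Answer:\s*([-+]?\d[\d,./]*)", generation)
    return match.group(1) if match else None

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(extract_answer(text))   # e.g. "48" for the pizza question above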

vLLM / TGI

The model is a standard Qwen3-architecture checkpoint, so it should work out of the box with any backend that supports Qwen3 via transformers>=4.51.
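
For example, a minimal offline-inference sketch with vLLM (assuming a recent vLLM release; the example question is arbitrary):

from vllm import LLM, SamplingParams

llm = LLM(model="HAD653/qwen3-1.7b-magistral-math", dtype="bfloat16")

sampling_params = SamplingParams(
    temperature=0.0,          # greedy; raise to ~0.2 for light sampling
    top_p=0.9,
    repetition_penalty=1.05,
    max_tokens=512,
)

prompt = (
    "### Instruction:\n"
    "A book costs $12 and a pen costs $3. How much do 2 books and 4 pens cost?\n\n"
    "### Response:\n"
)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)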


Intended Uses & Limitations

Intended uses

  • Step-by-step solutions to GSM8K-like and OpenMath-style word problems.
  • Experiments on small-model math reasoning (1–3B scale).
  • As a local math tutor for grade-school / early high-school algebra & arithmetic.

Limitations

  • Not a general instruction model; it is biased toward math.
  • Chain-of-thought traces are synthetic (generated by a teacher model), not human-authored.
  • Not suitable for high-stakes educational or decision-making use without human review.
  • Limited performance on very hard competition math (Olympiad / proof-heavy).

If you evaluate on GSM8K- or OpenMath-derived benchmarks, please check for data leakage first, since the training data is built from those sources; a rough overlap check is sketched below.
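
A rough exact-match check against the GSM8K test split (assuming the openai/gsm8k dataset on the Hub; a thorough audit should also look for near-duplicates):

from datasets import load_dataset

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

train_questions = {
    normalize(q)
    for q in load_dataset("HAD653/GSM8K-OpenMath-MathReason-13k", split="train")["question"]
}
gsm8k_test = load_dataset("openai/gsm8k", "main", split="test")["question"]

overlap = [q for q in gsm8k_test if normalize(q) in train_questions]
print(f"{len(overlap)} of {len(gsm8k_test)} GSM8K test questions appear verbatim in the training set")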


Citation

If you use this model in your work, please cite:

@misc{had653_qwen3_magistral_math_2025,
  author       = {HAD653},
  title        = {Qwen3-1.7B Magistral Math: A 1.7B Math Reasoning Model with Magistral Chain-of-Thought},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/HAD653/qwen3-1.7b-magistral-math}},
  note         = {Fine-tuned on GSM8K + OpenMath MathReason 13k with full BF16 supervision.}
}