WeDLM-8B-Instruct-MLX

This is an unquantized, 16-bit (bfloat16) MLX conversion of tencent/WeDLM-8B-Instruct for inference on Apple Silicon.

Model Details

About WeDLM

WeDLM (Window-based Efficient Diffusion Language Model) is a novel approach that combines:

  • Entropy-based parallel decoding: multiple tokens are committed per step, with how many determined by per-position prediction confidence (see the sketch after this list)
  • Topological reordering: an efficient KV-cache layout that preserves each token's logical position via RoPE
  • Window-based generation: a fixed-size window of positions is processed in parallel on each forward pass
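
A minimal sketch of the entropy-based acceptance idea, assuming a simple per-position entropy threshold. The helper names and threshold value are hypothetical illustrations, not the model's actual implementation:

import math

def token_entropy(probs):
    # Shannon entropy of one position's predicted distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def accept_confident_positions(window_probs, threshold=0.5):
    # Accept a token wherever the model is confident (low entropy);
    # uncertain positions stay masked and are re-predicted next pass.
    accepted = {}
    for pos, probs in enumerate(window_probs):
        if token_entropy(probs) <= threshold:
            accepted[pos] = max(range(len(probs)), key=probs.__getitem__)
    return accepted

# Position 0 is confident (low entropy), position 1 is not.
print(accept_confident_positions([[0.97, 0.03], [0.5, 0.5]]))  # -> {0: 0}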

Reference: arXiv:2512.22737

Usage

Installation

pip install mlx mlx-lm

Quick Start

from mlx_lm import load, generate

# Downloads the weights from the Hugging Face Hub on first use.
model, tokenizer = load("zimengxiong/WeDLM-8B-Instruct-MLX")
response = generate(model, tokenizer, prompt="What is machine learning?", max_tokens=256)
print(response)
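
The same generation can be run from the command line. Recent versions of mlx-lm ship a mlx_lm.generate entry point (flag names may vary slightly across versions):

mlx_lm.generate --model zimengxiong/WeDLM-8B-Instruct-MLX \
  --prompt "What is machine learning?" \
  --max-tokens 256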

Chat Template

from mlx_lm import load, generate

model, tokenizer = load("zimengxiong/WeDLM-8B-Instruct-MLX")

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
# Wrap the conversation in the model's chat format before generating.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512)
print(response)
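
For interactive use, tokens can be streamed as they are produced. In recent mlx-lm versions, stream_generate yields response objects with a .text field (older versions yield plain strings), so adjust to your installed version:

from mlx_lm import load, stream_generate

model, tokenizer = load("zimengxiong/WeDLM-8B-Instruct-MLX")

messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Print each chunk as soon as it is generated.
for response in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()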

Model Architecture

Parameter                   Value
Hidden Size                 4096
Intermediate Size           12288
Num Layers                  36
Num Attention Heads         32
Num KV Heads                8
Head Dim                    128
Vocab Size                  151936
Max Position Embeddings     16384
RoPE Theta                  1000000
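
As a quick consistency check on these numbers: 32 attention heads of dimension 128 exactly tile the 4096 hidden size, and the 32:8 head ratio means four query heads share each KV head (grouped-query attention):

hidden_size = 4096
num_heads = 32
num_kv_heads = 8
head_dim = 128

# Query heads tile the hidden dimension: 32 * 128 == 4096.
assert num_heads * head_dim == hidden_size

# Grouped-query attention: each KV head serves num_heads // num_kv_heads queries.
print(num_heads // num_kv_heads, "query heads per KV head")  # -> 4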

License

This model inherits the license from the base model tencent/WeDLM-8B-Instruct.
