nimafathi committed (verified)
Commit 19a2ef5 · Parent(s): 5414c4a

Update README.md

Files changed (1):
  1. README.md +117 -112
README.md CHANGED
@@ -2,139 +2,144 @@
  language:
  - en
  tags:
- - diffusion-language-model
  - dllm
  - text-generation
  - diffusion
  - language-model
- license: mit
  ---

- # hdlm-group/hdlm-base-gamma-0.05
-
- This is a gamma_hybrid diffusion language model trained on text data.
-
- ## Model Details
-
- - **Model Type**: gamma_hybrid
- - **Architecture**: Diffusion-based language model
- - **Training Method**: Gamma-hybrid diffusion training
-
- ## Configuration
-
- ```yaml
- ngpus: 4
- gradient_accumulation_steps: 8
- pretrain_autoregressive_path: /home/toolkit/research-diffcodegen/exp_local/openwebtext/mdlm-autoregressive/org-DiTAR-absorb-v2/checkpoints-meta/checkpoint.pth
- tokenizer:
- tokens: 50257
- model: gpt2
- training:
- batch_size: 512
- accum: ${gradient_accumulation_steps}
- n_iters: 1000000
- snapshot_freq: 500
- log_freq: 100
- eval_freq: 500
- snapshot_freq_for_preemption: 3000
- weight: standard
- snapshot_sampling: true
- ema: 0.9999
- warmup_iter: -1
- data:
- train: openwebtext-train
- valid: wikitext103
- cache_dir: /home/toolkit/research-diffcodegen/data
- debug: false
- graph:
- type: QGamma
- gamma: 0.05
- file: /home/toolkit/research-diffcodegen/data
- report_all: false
- expanded_sigma: true
- noise:
- type: loglinear
- sigma_min: 0.0001
- sigma_max: 2.0
- ar_diffusion: false
- expanded_sigma: ${graph.expanded_sigma}
- sampling:
- predictor: analytic
- steps_per_level: 1
- noise_removal: true
- strategy: direct
- strategy_param: 0.9
- annealing:
- type: block
- efficient: false
- width: 1024
- tau: 2048
- eval_tau: 256
- steps_per_level: ${sampling.steps_per_level}
- sampling_method: SAR
- diffusion_loss_weight: 1.0
- ce_loss_weight: 4.0
- sampling_eps: 0.0001
- attention:
- context_type: block_causal
- block_type: full
- match_inference: true
- eval:
- batch_size: 32
- perplexity: true
- perplexity_batch_size: 16
- optim:
- weight_decay: 0.0
- optimizer: AdamW
- lr: 0.0003
- beta1: 0.9
- beta2: 0.999
- eps: 1.0e-08
- warmup: 10000
- grad_clip: 1.0
- scheduler: lambda
- experiment:
- name: QGamma0.05-v2
- wandb_project: debug-QGamma
- model:
- name: gamma_hdlm
- type: ddit
- hidden_size: 768
- cond_dim: 128
- length: 1024
- n_blocks: 12
- n_heads: 12
- scale_by_sigma: false
- dropout: 0.1
- transformer_sigma_conditioning: true
- hybrid_sigma_embedding: true
- post_process_logits: true
- use_timestep_embedding: true
- model_type: gamma_hybrid

- ```

  ## Usage

  ```python
- from our.hf_utils import smart_model_loader

- # Load the model
- model, config, device, accelerator, metaschedule = smart_model_loader(
-     "hdlm-group/hdlm-base-gamma-0.05",
-     model_type="gamma_hybrid"
  )

  ```

  ## Training Details

- please refer to the official GitHub Repository: https://github.com/ServiceNow/hdlm

  ## Citation

- If you use this model in your research, please cite the original paper and this implementation.

  ## License

- This model is released under the Apache License Version 2.0.
 
  language:
  - en
  tags:
  - dllm
+ - diffusion-language-model
  - text-generation
  - diffusion
  - language-model
+ license: apache-2.0
  ---

+ # HDLM-Gamma: Hybrid Diffusion Language Model

+ [![Paper](https://img.shields.io/badge/Paper-arXiv-red)](https://arxiv.org/abs/2504.06416)
+ [![Code](https://img.shields.io/badge/Code-GitHub-blue)](https://github.com/ServiceNow/hdlm)
+
+ This is the model card for **hdlm-group/hdlm-base-gamma-0.05**.
+
+ ## Model Description
+
+ HDLM-Gamma is a hybrid diffusion language model that unifies autoregressive and diffusion-based sequence generation through gamma-hybrid noising. The model interpolates its transition operators between absorbing and uniform processes, making it conceptually closer to SEDD (Lou et al., 2024) while maintaining the benefits of both paradigms.
+
+ The gamma parameter (γ) controls the blend between the absorbing and uniform transition matrices, Q_gamma = (1 - γ) * Q_absorb + γ * Q_uniform: smaller values of γ emphasize the absorbing process, while larger values incorporate more uniform transitions.
+
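+ To make the blend concrete, here is a minimal toy sketch of the formula above. It treats Q_absorb and Q_uniform as one-step transition probability matrices over a tiny vocabulary, with the absorbing (mask) token at the last index; these conventions are illustrative assumptions, not the repository's actual implementation, which works with continuous-time transition operators.
+
+ ```python
+ import torch
+
+ # Toy vocabulary: 4 regular tokens plus one absorbing (mask) token at the last index (assumed convention).
+ V = 5
+ mask = V - 1
+ gamma = 0.05
+
+ # Absorbing kernel: every token moves all of its probability mass onto the mask token.
+ Q_absorb = torch.zeros(V, V)
+ Q_absorb[:, mask] = 1.0
+
+ # Uniform kernel: probability mass spread evenly over the whole vocabulary.
+ Q_uniform = torch.full((V, V), 1.0 / V)
+
+ # Hybrid blend from the model card: Q_gamma = (1 - γ) * Q_absorb + γ * Q_uniform.
+ Q_gamma = (1 - gamma) * Q_absorb + gamma * Q_uniform
+
+ print(Q_gamma[0])            # most mass on the mask column, a little spread uniformly
+ print(Q_gamma.sum(dim=-1))   # rows still sum to 1, so the blend remains a valid transition kernel
+ ```
+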
+ ## Model Architecture
+
+ - **Base Model**: Transformer architecture with staggered score conditioning
+ - **Vocabulary Size**: 50,258 tokens (GPT-2 vocabulary + absorbing token)
+ - **Context Length**: Variable (supports up to 2048 tokens)
+ - **Training**: Continuous-time diffusion with gamma-hybrid graph structure
+ - **Inference**: Analytic predictor with staggered score computation

  ## Usage

+ ### Quick Start
+
  ```python
+ from hdlm.hf_utils import smart_model_loader
+ from hdlm.gamma_hybrid.sampling import get_sa_sampling_fn
+ from transformers import GPT2TokenizerFast
+ import torch
+
+ # Load model using smart loader (automatically detects model type)
+ model, cfg, device, accelerator, metaschedule = smart_model_loader(
+     model_path="hdlm-group/hdlm-base-gamma-0.05",
+     model_type="auto",  # automatically detects gamma_hybrid
+     device="cuda"
+ )
+
+ # Load tokenizer
+ tokenizer = GPT2TokenizerFast.from_pretrained('gpt2')
+
+ # Generate text
+ prompt = "The future of artificial intelligence"
+ prompt_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
+
+ # Configure sampling function (automatically set up from config)
+ sampling_fn = get_sa_sampling_fn(
+     config=cfg,
+     graph=None,   # Will be created from config
+     noise=None,   # Will be created from config
+     meta_schedule=metaschedule,
+     batch_dims=(1,),
+     eps=1e-4,
+     device=device
+ )

+ # Generate samples
+ generated = sampling_fn(
+     model=model,
+     prompt=prompt_ids,
+     context_length=1024
  )

+ # Decode generated text
+ generated_text = tokenizer.decode(generated[0], skip_special_tokens=True)
+ print(generated_text)
+ ```
+
+ ### Evaluation
+
+ ```bash
+ # Text generation evaluation
+ python hdlm/eval_generation.py \
+     --checkpoint_path hdlm-group/hdlm-base-gamma-0.05 \
+     --sampling_method SAR \
+     --save_samples
+
+ # Perplexity evaluation
+ python hdlm/eval_modeling.py \
+     --checkpoint_path hdlm-group/hdlm-base-gamma-0.05 \
+     --work_dir "./logs/eval_modeling_gamma" \
+     --dataset ptb
  ```

  ## Training Details

+ - **Dataset**: OpenWebText
+ - **Batch Size**: 512
+ - **Learning Rate**: 3e-4 with lambda scheduling
+ - **Gamma (γ)**: 0.05 (controls the hybrid transition blend)
+ - **Graph Type**: QGamma with expanded sigma conditioning
+ - **Noise Schedule**: Log-linear (σ_min=1e-4, σ_max=2.0)
+ - **Training Steps**: 1M iterations
+ - **Warmup**: 10K steps
+
+ ## Key Components
+
+ ### Graph Structure
+ The QGamma graph combines absorbing and uniform transition matrices (a toy simulation follows the list):
+ - **Absorbing component**: Transitions to the absorbing state (mask token)
+ - **Uniform component**: Uniform transitions between all tokens
+ - **Hybrid blend**: Controlled by the gamma parameter
+
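+ As a quick illustration of how these two components show up in corrupted data, the sketch below applies the toy one-step kernel from the earlier snippet to a clean sequence: most positions jump to the mask token (absorbing component) while a few are resampled uniformly (uniform component). The toy kernel and its conventions are assumptions for illustration, not the repository's continuous-time implementation.
+
+ ```python
+ import torch
+
+ # Rebuild the toy hybrid kernel (same assumed conventions as the earlier sketch).
+ V, mask, gamma = 5, 4, 0.05
+ Q_absorb = torch.zeros(V, V)
+ Q_absorb[:, mask] = 1.0
+ Q_uniform = torch.full((V, V), 1.0 / V)
+ Q_gamma = (1 - gamma) * Q_absorb + gamma * Q_uniform
+
+ # Corrupt a clean toy sequence with one application of the kernel.
+ x = torch.tensor([0, 1, 2, 3, 0, 1, 2, 3])
+ x_noisy = torch.multinomial(Q_gamma[x], num_samples=1).squeeze(-1)
+ print(x_noisy)  # mostly the mask index (4), occasionally a uniformly resampled token
+ ```
+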
+ ### Staggered Score
+ The model uses a staggered score computation that applies different transformations to the absorbing and uniform branches before combining them, enabling more flexible generation patterns.
+
+ ### Sampling Strategy
+ - **Predictor**: Analytic predictor with exact transition computation
+ - **Strategy**: Direct sampling with a configurable strategy parameter
+ - **Noise Removal**: Optional final denoising step
+
+ ## Model Variants
+
+ Available gamma values and their characteristics (the short sketch after this list makes the split concrete):
+
+ - **γ = 0.01**: Minimal uniform transitions, closest to a pure absorbing process
+ - **γ = 0.05**: This checkpoint; predominantly absorbing with a small uniform component
+ - **γ = 0.1**: Moderate hybrid behavior with increased uniform mixing
+ - **γ = 0.5**: Balanced absorbing-uniform transition blend
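+
+ Under the same toy one-step reading of Q_gamma used above (an illustrative assumption, not the continuous-time schedule used in training), the variants differ in how a corrupted position splits between being masked and being uniformly resampled:
+
+ ```python
+ V = 50258  # GPT-2 vocabulary + absorbing token, per the model card
+
+ for gamma in (0.01, 0.05, 0.1, 0.5):
+     p_mask = (1 - gamma) + gamma / V   # corrupted position lands on the absorbing (mask) token
+     p_other = gamma * (V - 1) / V      # corrupted position is uniformly resampled to a non-mask token
+     print(f"gamma={gamma:>4}: mask {p_mask:.3f}, uniform {p_other:.3f}")
+ ```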
 
  ## Citation

+ ```bibtex
+ @article{fathi2025unifying,
+   title={Unifying autoregressive and diffusion-based sequence generation},
+   author={Fathi, Nima and Scholak, Torsten and No{\"e}l, Pierre-Andr{\'e}},
+   journal={arXiv preprint arXiv:2504.06416},
+   year={2025}
+ }
+ ```

  ## License

+ This model is released under the Apache License 2.0, the same license as the original HDLM codebase. Please refer to the [GitHub repository](https://github.com/ServiceNow/hdlm) for license details.