Locai L1-Large

🚀 Try it out on locai.chat

Locai L1-Large is an open-source instruction-tuned model based on Qwen3 235B Instruct (2507), post-trained using our Forget-Me-Not framework. This framework combines experience replay and self-improvement to enhance performance whilst mitigating catastrophic forgetting. Paper coming soon.
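
The full recipe will be described in the paper; purely as an illustrative sketch of the experience-replay half of the idea (the replay ratio, data sources, and function below are assumptions, not the L1-Large pipeline), new fine-tuning examples can be interleaved with replayed examples drawn from the base model's original capability domains so the update does not overwrite prior skills:

import random

def mix_with_replay(new_examples, replay_examples, replay_ratio=0.3, seed=0):
    """Interleave new fine-tuning data with replayed examples from the base
    model's prior domains to mitigate catastrophic forgetting.
    The 30% replay ratio is an illustrative assumption."""
    rng = random.Random(seed)
    n_replay = int(len(new_examples) * replay_ratio)
    mixed = new_examples + rng.sample(replay_examples, min(n_replay, len(replay_examples)))
    rng.shuffle(mixed)
    return mixed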

The model achieves state-of-the-art performance on Arena Hard v2, outperforming the non-reasoning variants of GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Mistral Medium, whilst delivering competitive results across instruction-following, mathematics, and scientific reasoning.

For more details on the model training, please refer to our technical report.

Highlights

  • πŸ† State-of-the-art alignment: Highest score on Arena Hard v2, outperforming all non-reasoning frontier models including GPT-5 and Claude Sonnet 4.5
  • 🎯 Improved itself: Model generated and evaluated its own training data across helpfulness, relevance, conciseness, complexity, correctness, and harmlessness, improving the base model's instruction-following, safety and alignment.
  • πŸ›‘οΈ Enhanced Safety: 17% improvement on AgentHarm benchmark (27.7 vs 33.4) compared to the base model
  • πŸ”¬ Maintains base capabilities: Retains Qwen's strong performance in mathematics and scientific reasoning due to forget-me-not method.
  • ⚑ Efficient Training: Parameter Efficient Fine-Tuning using LoRA on just 1 node of 8Γ—H200 GPUs
  • 🌱 Sustainable: Trained using 100% renewable energy on UK data centres
  • 🌍 Low-resource language support: Improved proficiency in Celtic languages (Welsh, Irish, Scottish Gaelic) plus Basque, Armenian, Tagalog, and Swahili through bidirectional translation pairs (see the sketch below)
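
As a minimal sketch of how such bidirectional pairs can be built from a parallel corpus (the function name, prompt wording, and language choice are illustrative assumptions, not the exact data pipeline):

def make_bidirectional_pairs(src_lines, tgt_lines, src_lang="English", tgt_lang="Welsh"):
    """Turn a parallel corpus (e.g. OpenSubtitles) into instruction-style
    translation pairs in both directions. Illustrative only."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        pairs.append({"prompt": f"Translate to {tgt_lang}: {src}", "response": tgt})
        pairs.append({"prompt": f"Translate to {src_lang}: {tgt}", "response": src})
    return pairs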

Evaluation Results

Model                 Arena Hard v2   IFEval   IFBench   GSM Plus   GPQA Diamond   AgentHarm ↓
Locai L1-Large        72.9            92.45    40.14     90.43      63.63          27.7
Qwen3-235B-Instruct   70.8            91.97    39.46     90.48      62.63          33.4
GPT-5                 68.9            91.85    41.5      89.14      70.20          12.8
Claude Sonnet 4.5     52.8            92.57    34.69     91.48      68.69          16.6
Gemini 2.5 Flash      54.4            91.13    34.01     89.67      35.35          40.5
DeepSeek V3.2         52.5            90.89    35.71     90.10      80.30          18.2
Mistral Medium        37.9            81.65    28.91     89.62      71.21          69.1

Benchmark Details

  • Arena Hard v2: Evaluates alignment with human preferences using real-world user queries
  • IFEval: Measures strict instruction-following accuracy
  • IFBench: Tests precise instruction-following on out-of-distribution constraints
  • GSM Plus: Assesses mathematical reasoning on grade-school level problems
  • GPQA Diamond: Evaluates expert-level scientific reasoning
  • AgentHarm: Measures safety and robustness against adversarial attacks (lower is better)

Usage

Installation

pip install transformers torch accelerate

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "locailabs/locai-l1-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # shard the 235B checkpoint across available GPUs
    torch_dtype="auto"    # load in the checkpoint's native precision (BF16)
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # required for temperature/top_k/top_p to take effect
    temperature=0.7,
    top_k=20,
    top_p=0.8
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Using vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

# A 235B-parameter model must be sharded across multiple GPUs;
# adjust tensor_parallel_size to match your hardware (e.g. 8 on an 8-GPU node)
llm = LLM(model="locailabs/locai-l1-large", tensor_parallel_size=8)

sampling_params = SamplingParams(
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    max_tokens=2048,  # vLLM's default completion length is very short
)

prompts = [
    "Explain quantum entanglement in simple terms."
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
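
If you prefer to serve the model behind an OpenAI-compatible endpoint, vLLM also ships a server. The snippet below is a minimal sketch rather than a verified deployment recipe: the launch command, port, and tensor-parallel degree are assumptions you should adapt to your hardware.

# Start the server first (in a shell), sharding the model across the node's GPUs:
#   vllm serve locailabs/locai-l1-large --tensor-parallel-size 8
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="locailabs/locai-l1-large",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=2048,
)
print(response.choices[0].message.content)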

Training Details

Training Configuration

  • Base Model: Qwen3-235B-Instruct-2507
  • Method: Supervised Fine-Tuning (SFT) using Parameter-Efficient Fine-Tuning (PEFT) through Low-Rank Adaptation (LoRA); an illustrative configuration sketch follows this list
  • Hardware: 1 node × 8 NVIDIA H200 GPUs
  • Energy: 100% renewable energy (UK data centres)
  • Parallelisation: Tensor parallelism, expert parallelism, and sequence parallelism
  • MoE Optimisations: Grouped GEMM, permute fusion, shared expert overlap, auxiliary loss for balanced expert utilisation
  • Memory & Compute: Activation recomputation, sample packing, Flash Attention, loss fusion with final layer
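
The exact fine-tuning hyperparameters are given in the technical report; as a rough illustration only, a LoRA setup of this shape with the Hugging Face peft library might look like the sketch below. The rank, alpha, dropout, target modules, and base checkpoint id are placeholder assumptions, not the values used for L1-Large.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA hyperparameters for illustration only; see the
# technical report for the configuration actually used.
lora_config = LoraConfig(
    r=16,              # low-rank dimension (assumed)
    lora_alpha=32,     # scaling factor (assumed)
    lora_dropout=0.05, # regularisation (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-235B-A22B-Instruct-2507",  # base checkpoint (repo id assumed)
    torch_dtype="auto",
    device_map="auto",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable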

Training Data

The model was trained on a curated dataset combining:

  • Self-improvement data: Generated and evaluated by the model across helpfulness, relevance, conciseness, complexity, correctness, and harmlessness (see the illustrative sketch after this list)
  • Low-resource language translations: Bidirectional translation pairs from OpenSubtitles corpora
  • Cultural alignment data: British cultural knowledge generated from CultureBank
  • Self-cognition data: Multilingual Q&A pairs about the model
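
Purely as an illustrative sketch of how self-evaluation filtering can work (the rubric prompt, 1–5 scale, and acceptance threshold below are assumptions, not the exact pipeline), each generated response is scored by the model against the six axes above and only high-scoring pairs are kept for fine-tuning:

# Illustrative self-evaluation filter: score candidates on the six rubric
# axes and keep only examples whose mean score clears a threshold.
AXES = ["helpfulness", "relevance", "conciseness", "complexity", "correctness", "harmlessness"]

def judge_prompt(question: str, answer: str) -> str:
    # Build a rubric prompt that the model answers about its own output
    rubric = ", ".join(AXES)
    return (
        f"Rate the answer on each of: {rubric}. "
        f"Reply with one integer from 1 to 5 per axis, comma-separated.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def keep_example(scores: list[int], threshold: float = 4.0) -> bool:
    # Accept a generated example only if its mean rubric score is high enough
    return sum(scores) / len(scores) >= threshold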

Ethical Considerations

Locai L1-Large has been developed with consideration for:

  • Sustainability: Trained using 100% renewable energy in UK data centres
  • Inclusivity: Enhanced support for low-resource languages to reduce digital inequality
  • Safety: Improved robustness against adversarial attacks (17% lower AgentHarm score than the base model)

Citation

@misc{locai2025l1large,
  title={Locai L1-Large: Self-Improving Language Models with Forget-Me-Not},
  author={Locai Labs},
  year={2025},
  url={https://www.locai.chat}
}

License

Apache 2.0

