Locai L1-Large

🚀 Try it out on locai.chat

Locai L1-Large is an open-source instruction-tuned model based on Qwen3 235B Instruct (2507), post-trained using our Forget-Me-Not framework. This framework combines experience replay and self-improvement to enhance performance whilst mitigating catastrophic forgetting. Paper coming soon.
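
The full recipe will be described in the paper; purely as an illustrative sketch of the experience-replay half of the idea (the replay ratio, data sources, and function below are assumptions, not the L1-Large pipeline), new fine-tuning examples can be interleaved with replayed examples drawn from the base model's original capability domains so the update does not overwrite prior skills:

import random

def mix_with_replay(new_examples, replay_examples, replay_ratio=0.3, seed=0):
    """Interleave new fine-tuning data with replayed examples from the base
    model's prior domains to mitigate catastrophic forgetting.
    The 30% replay ratio is an illustrative assumption."""
    rng = random.Random(seed)
    n_replay = int(len(new_examples) * replay_ratio)
    mixed = new_examples + rng.sample(replay_examples, min(n_replay, len(replay_examples)))
    rng.shuffle(mixed)
    return mixed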

The model achieves state-of-the-art performance on Arena Hard v2, outperforming the non-reasoning variants of GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash, DeepSeek V3.2, and Mistral Medium, whilst delivering competitive results across instruction-following, mathematics, and scientific reasoning.

For more details on the model training, please refer to our technical report.

Highlights

  • πŸ† State-of-the-art alignment: Highest score on Arena Hard v2, outperforming all non-reasoning frontier models including GPT-5 and Claude Sonnet 4.5
  • 🎯 Improved itself: Model generated and evaluated its own training data across helpfulness, relevance, conciseness, complexity, correctness, and harmlessness, improving the base model's instruction-following, safety and alignment.
  • πŸ›‘οΈ Enhanced Safety: 17% improvement on AgentHarm benchmark (27.7 vs 33.4) compared to the base model
  • πŸ”¬ Maintains base capabilities: Retains Qwen's strong performance in mathematics and scientific reasoning due to forget-me-not method.
  • ⚑ Efficient Training: Parameter Efficient Fine-Tuning using LoRA on just 1 node of 8Γ—H200 GPUs
  • 🌱 Sustainable: Trained using 100% renewable energy on UK data centres
  • 🌍 Low-resource language support: Improved proficiency in Celtic languages (Welsh, Irish, Scottish Gaelic) plus Basque, Armenian, Tagalog, and Swahili through bidirectional translation pairs (see the sketch below)
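
As a minimal sketch of how such bidirectional pairs can be built from a parallel corpus (the function name, prompt wording, and language choice are illustrative assumptions, not the exact data pipeline):

def make_bidirectional_pairs(src_lines, tgt_lines, src_lang="English", tgt_lang="Welsh"):
    """Turn a parallel corpus (e.g. OpenSubtitles) into instruction-style
    translation pairs in both directions. Illustrative only."""
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        pairs.append({"prompt": f"Translate to {tgt_lang}: {src}", "response": tgt})
        pairs.append({"prompt": f"Translate to {src_lang}: {tgt}", "response": src})
    return pairs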

Evaluation Results

Model                 Arena Hard v2   IFEval   IFBench   GSM Plus   GPQA Diamond   AgentHarm ↓
Locai L1-Large        72.9            92.45    40.14     90.43      63.63          27.7
Qwen3-235B-Instruct   70.8            91.97    39.46     90.48      62.63          33.4
GPT-5                 68.9            91.85    41.5      89.14      70.20          12.8
Claude Sonnet 4.5     52.8            92.57    34.69     91.48      68.69          16.6
Gemini 2.5 Flash      54.4            91.13    34.01     89.67      35.35          40.5
DeepSeek V3.2         52.5            90.89    35.71     90.10      80.30          18.2
Mistral Medium        37.9            81.65    28.91     89.62      71.21          69.1

Benchmark Details

  • Arena Hard v2: Evaluates alignment with human preferences using real-world user queries
  • IFEval: Measures strict instruction-following accuracy
  • IFBench: Tests precise instruction-following on out-of-distribution constraints
  • GSM Plus: Assesses mathematical reasoning on grade-school level problems
  • GPQA Diamond: Evaluates expert-level scientific reasoning
  • AgentHarm: Measures safety and robustness against adversarial attacks (lower is better)

Usage

Installation

pip install transformers torch accelerate

Basic Inference

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "locailabs/locai-l1-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",    # shard the 235B checkpoint across available GPUs
    torch_dtype="auto"    # load in the checkpoint's native precision (BF16)
)

messages = [
    {"role": "user", "content": "Explain quantum entanglement in simple terms"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # required for temperature/top_k/top_p to take effect
    temperature=0.7,
    top_k=20,
    top_p=0.8
)

# Decode only the newly generated tokens, skipping the echoed prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)

Using vLLM (Recommended for Production)

from vllm import LLM, SamplingParams

# A 235B-parameter model must be sharded across multiple GPUs;
# adjust tensor_parallel_size to match your hardware (e.g. 8 on an 8-GPU node)
llm = LLM(model="locailabs/locai-l1-large", tensor_parallel_size=8)

sampling_params = SamplingParams(
    temperature=0.7,
    top_k=20,
    top_p=0.8,
    max_tokens=2048,  # vLLM's default completion length is very short
)

prompts = [
    "Explain quantum entanglement in simple terms."
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
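
If you prefer to serve the model behind an OpenAI-compatible endpoint, vLLM also ships a server. The snippet below is a minimal sketch rather than a verified deployment recipe: the launch command, port, and tensor-parallel degree are assumptions you should adapt to your hardware.

# Start the server first (in a shell), sharding the model across the node's GPUs:
#   vllm serve locailabs/locai-l1-large --tensor-parallel-size 8
from openai import OpenAI

# vLLM's OpenAI-compatible server listens on http://localhost:8000/v1 by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="locailabs/locai-l1-large",
    messages=[{"role": "user", "content": "Explain quantum entanglement in simple terms."}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=2048,
)
print(response.choices[0].message.content)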

Training Details

Training Configuration

  • Base Model: Qwen3-235B-Instruct-2507
  • Method: Supervised Fine-Tuning (SFT) using Parameter-Efficient Fine-Tuning (PEFT) through Low-Rank Adaptation (LoRA); an illustrative configuration sketch follows this list
  • Hardware: 1 node × 8 NVIDIA H200 GPUs
  • Energy: 100% renewable energy (UK data centres)
  • Parallelisation: Tensor parallelism, expert parallelism, and sequence parallelism
  • MoE Optimisations: Grouped GEMM, permute fusion, shared expert overlap, auxiliary loss for balanced expert utilisation
  • Memory & Compute: Activation recomputation, sample packing, Flash Attention, loss fusion with final layer
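
The exact fine-tuning hyperparameters are given in the technical report; as a rough illustration only, a LoRA setup of this shape with the Hugging Face peft library might look like the sketch below. The rank, alpha, dropout, target modules, and base checkpoint id are placeholder assumptions, not the values used for L1-Large.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical LoRA hyperparameters for illustration only; see the
# technical report for the configuration actually used.
lora_config = LoraConfig(
    r=16,              # low-rank dimension (assumed)
    lora_alpha=32,     # scaling factor (assumed)
    lora_dropout=0.05, # regularisation (assumed)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-235B-A22B-Instruct-2507",  # base checkpoint (repo id assumed)
    torch_dtype="auto",
    device_map="auto",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable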

Training Data

The model was trained on a curated dataset combining:

  • Self-improvement data: Generated and evaluated by the model across helpfulness, relevance, conciseness, complexity, correctness, and harmlessness (see the illustrative sketch after this list)
  • Low-resource language translations: Bidirectional translation pairs from OpenSubtitles corpora
  • Cultural alignment data: British cultural knowledge generated from CultureBank
  • Self-cognition data: Multilingual Q&A pairs about the model
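
Purely as an illustrative sketch of how self-evaluation filtering can work (the rubric prompt, 1–5 scale, and acceptance threshold below are assumptions, not the exact pipeline), each generated response is scored by the model against the six axes above and only high-scoring pairs are kept for fine-tuning:

# Illustrative self-evaluation filter: score candidates on the six rubric
# axes and keep only examples whose mean score clears a threshold.
AXES = ["helpfulness", "relevance", "conciseness", "complexity", "correctness", "harmlessness"]

def judge_prompt(question: str, answer: str) -> str:
    # Build a rubric prompt that the model answers about its own output
    rubric = ", ".join(AXES)
    return (
        f"Rate the answer on each of: {rubric}. "
        f"Reply with one integer from 1 to 5 per axis, comma-separated.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )

def keep_example(scores: list[int], threshold: float = 4.0) -> bool:
    # Accept a generated example only if its mean rubric score is high enough
    return sum(scores) / len(scores) >= threshold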

Ethical Considerations

Locai L1-Large has been developed with consideration for:

  • Sustainability: Trained using 100% renewable energy in UK data centres
  • Inclusivity: Enhanced support for low-resource languages to reduce digital inequality
  • Safety: Improved robustness against adversarial attacks (17% lower AgentHarm score than the base model)

Citation

@misc{locai2025l1large,
  title={Locai L1-Large: Self-Improving Language Models with Forget-Me-Not},
  author={Locai Labs},
  year={2025},
  url={https://www.locai.chat}
}

License

Apache 2.0

