Lettuce MiniLM BGE-M3 v1 - ONNX INT8

Lettuce MiniLM BGE-M3 v1 is a distilled and quantized MiniLM-based sentence embedding model designed for fast, on-device semantic search and conversational memory retrieval.

It provides:

  • Small model size (int8 quantized ONNX)
  • Low-latency inference on CPU and mobile devices
  • 384-dimensional embeddings
  • Sentence-transformers compatible tokenizer + config
  • Great for roleplay memory systems, local RAG, vector search, clustering

This model is ideal for applications where speed and size matter, such as:

  • Offline / on-device chat apps
  • Memory retrieval for roleplay systems
  • Lightweight RAG pipelines
  • Mobile devices (Android, iOS)
  • Desktop apps (Tauri, Electron, native)

Model Description

  • Base architecture: all-MiniLM-L6-v2 (6-layer MiniLM encoder)
  • Teacher: BAAI/bge-m3
  • Embed dimension: 384
  • Format: ONNX (int8 quantized)
  • Pooling: Mean pooling + normalization
  • Tokenizer: WordPiece (MiniLM-compatible)
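The mean pooling + normalization step listed above is baked into the exported graph, but its effect can be sketched in NumPy (the `token_embeddings` array here is a hypothetical raw encoder output, not something this ONNX export returns):

```python
import numpy as np

def mean_pool_normalize(token_embeddings, attention_mask):
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, 384) float array
    attention_mask:   (batch, seq_len) 0/1 array
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    # Sum only the real (non-padding) token vectors.
    summed = (token_embeddings * mask).sum(axis=1)
    # Divide by the number of real tokens (clipped to avoid division by zero).
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    # L2-normalize so cosine similarity becomes a plain dot product.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```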

This model was trained by distilling a larger teacher embedding model (BGE-M3) into a compact MiniLM student, then exporting and quantizing the model to ONNX int8 for maximum runtime efficiency.

Usage (Python + ONNX Runtime)

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zeolit/lettuce-minilm-bge-m3-v1-onnx-int8")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts):
    # Accept a single string or a list of strings.
    if isinstance(texts, str):
        texts = [texts]
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
        },
    )[0]
    # Embeddings are mean-pooled and L2-normalized inside the graph.
    return outputs  # (batch_size, 384)

embeddings = embed("Sam took a bullet for me.")
print(embeddings.shape)  # (1, 384)
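Because the embeddings come out L2-normalized, cosine similarity reduces to a dot product, which makes retrieval for memory systems or RAG a single matrix multiply. A minimal sketch (the `top_k` helper is illustrative, not part of the model package):

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=3):
    """Return indices and scores of the k most similar corpus embeddings.

    Assumes both query_emb (384,) and corpus_embs (n, 384) are L2-normalized,
    so cosine similarity is just a dot product.
    """
    scores = corpus_embs @ query_emb
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

In a memory-retrieval setting, `corpus_embs` would hold embeddings of stored conversation snippets and `query_emb` the embedding of the current turn.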

Limitations

  • Primarily optimized for English text
  • Student model is smaller than the teacher, so it is not intended for high-precision semantic tasks
  • Not suitable for safety-critical decision making

License

This model is released under the Apache-2.0 license.

Acknowledgements

Thanks to:

  • Sentence Transformers
  • BAAI for the BGE-M3 model
  • The ONNX Runtime team