Lettuce MiniLM BGE-M3 v1 - ONNX INT8

Lettuce MiniLM BGE-M3 v1 is a distilled and quantized MiniLM-based sentence embedding model designed for fast, on-device semantic search and conversational memory retrieval.

It provides:

  • Small model size (int8 quantized ONNX)
  • Low-latency inference on CPU and mobile devices
  • 384-dimensional embeddings
  • Sentence-transformers compatible tokenizer + config
  • Great for roleplay memory systems, local RAG, vector search, clustering

This model is ideal for applications where speed and size matter, such as:

  • Offline / on-device chat apps
  • Memory retrieval for roleplay systems
  • Lightweight RAG pipelines
  • Mobile devices (Android, iOS)
  • Desktop apps (Tauri, Electron, native)

Model Description

  • Base architecture: all-MiniLM-L6-v2 (6-layer MiniLM encoder)
  • Teacher: BAAI/bge-m3
  • Embed dimension: 384
  • Format: ONNX (int8 quantized)
  • Pooling: Mean pooling + normalization
  • Tokenizer: WordPiece (MiniLM-compatible)
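The mean pooling + normalization step listed above is baked into the exported graph, but its effect can be sketched in NumPy (the `token_embeddings` array here is a hypothetical raw encoder output, not something this ONNX export returns):

```python
import numpy as np

def mean_pool_normalize(token_embeddings, attention_mask):
    """Mean-pool token embeddings over non-padding positions, then L2-normalize.

    token_embeddings: (batch, seq_len, 384) float array
    attention_mask:   (batch, seq_len) 0/1 array
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    # Sum only the real (non-padding) token vectors.
    summed = (token_embeddings * mask).sum(axis=1)
    # Divide by the number of real tokens (clipped to avoid division by zero).
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    pooled = summed / counts
    # L2-normalize so cosine similarity becomes a plain dot product.
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```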

This model was trained by distilling a larger teacher embedding model (BGE-M3) into a compact MiniLM student, then exporting and quantizing the model to ONNX int8 for maximum runtime efficiency.

Usage (Python + ONNX Runtime)

import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("zeolit/lettuce-minilm-bge-m3-v1-onnx-int8")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

def embed(texts):
    # Accept a single string or a list of strings.
    if isinstance(texts, str):
        texts = [texts]
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=256,
        return_tensors="np",
    )
    outputs = session.run(
        ["sentence_embedding"],
        {
            "input_ids": enc["input_ids"],
            "attention_mask": enc["attention_mask"],
        },
    )[0]
    # Embeddings are mean-pooled and L2-normalized inside the graph.
    return outputs  # (batch_size, 384)

embeddings = embed("Sam took a bullet for me.")
print(embeddings.shape)  # (1, 384)
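Because the embeddings come out L2-normalized, cosine similarity reduces to a dot product, which makes retrieval for memory systems or RAG a single matrix multiply. A minimal sketch (the `top_k` helper is illustrative, not part of the model package):

```python
import numpy as np

def top_k(query_emb, corpus_embs, k=3):
    """Return indices and scores of the k most similar corpus embeddings.

    Assumes both query_emb (384,) and corpus_embs (n, 384) are L2-normalized,
    so cosine similarity is just a dot product.
    """
    scores = corpus_embs @ query_emb
    order = np.argsort(-scores)[:k]
    return order, scores[order]
```

In a memory-retrieval setting, `corpus_embs` would hold embeddings of stored conversation snippets and `query_emb` the embedding of the current turn.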

Limitations

  • Primarily optimized for English text
  • Student model is smaller than the teacher, so it is not intended for high-precision semantic tasks
  • Not suitable for safety-critical decision making

License

This model is released under the Apache-2.0 license.

Acknowledgements

Thanks to:

  • Sentence Transformers
  • BAAI for the BGE-M3 model
  • The ONNX Runtime team