Sherpa ONNX STT Models - INT8 Quantized Collection

A comprehensive collection of INT8-quantized speech-to-text models optimized for edge devices and production environments. All models use ONNX Runtime dynamic quantization, which roughly halves model size while largely preserving accuracy.

🎯 Model Overview

This collection includes 17 INT8 quantized models covering 7 languages:

| Language | Models | Architecture | Use Case |
|----------|--------|--------------|----------|
| 🇬🇧 English | 5 models | Kroko + NeMo | Gaming, Reading, General |
| 🇩🇪 German | 2 models | Kroko | General Purpose |
| 🇪🇸 Spanish | 2 models | Kroko | General Purpose |
| 🇫🇷 French | 2 models | Kroko | General Purpose |
| 🇹🇷 Turkish | 2 models | Kroko | General Purpose |
| 🇮🇹 Italian | 2 models | Kroko | General Purpose |
| 🇵🇹 Portuguese | 2 models | Kroko | General Purpose |

Total Size: 2.38 GB (all INT8 quantized)

📦 Model Details

Kroko Models (Community)

Kroko models are high-quality streaming ASR models built on the Zipformer2 architecture with a transducer decoder.

German (DE)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

English (EN)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

Spanish (ES)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

French (FR)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

Turkish (TR)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

Italian (IT)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

Portuguese (PT)

  • kroko_64l: 147 MB (64-layer encoder)
  • kroko_128l: 147 MB (128-layer encoder)

NeMo CTC Models (English)

Ultra-fast CTC-based models optimized for real-time applications:

  • nemo_ctc_80ms: 126 MB - Ultra-fast (80ms latency) for gaming
  • nemo_ctc_480ms: 126 MB - Balanced (480ms latency) for reading
  • nemo_ctc_1040ms: 126 MB - High accuracy (1040ms latency)

🚀 Quick Start

Installation

pip install sherpa-onnx
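
The model files themselves can be fetched with the huggingface_hub client. This is only a sketch: the repository id below is the placeholder used in the citation section and should be replaced with this collection's actual id.

# requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

# Download the whole collection into ./models
# (placeholder repo id; replace with this collection's actual repository id)
snapshot_download(
    repo_id="your-username/sherpa-onnx-int8-models",
    local_dir="models",
)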

Usage Example (Python)

import sherpa_onnx

# Initialize a streaming recognizer with the English Kroko transducer model
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/en/kroko_64l/tokens.txt",
    encoder="models/en/kroko_64l/encoder.int8.onnx",
    decoder="models/en/kroko_64l/decoder.int8.onnx",
    joiner="models/en/kroko_64l/joiner.int8.onnx",
    num_threads=4,
)

# Create a stream and feed it audio samples
stream = recognizer.create_stream()
# stream.accept_waveform(sample_rate, samples)
# while recognizer.is_ready(stream):
#     recognizer.decode_stream(stream)
# result = recognizer.get_result(stream)
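
To transcribe a recorded file end-to-end, the commented loop above can be driven with samples read from disk. A minimal sketch, assuming a 16 kHz mono WAV and the soundfile package ("audio.wav" is a placeholder path):

import soundfile as sf

# Read a 16 kHz mono WAV file as float32 samples in [-1, 1]
samples, sample_rate = sf.read("audio.wav", dtype="float32")

stream = recognizer.create_stream()
stream.accept_waveform(sample_rate, samples)
stream.input_finished()  # no more audio will follow
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
print(recognizer.get_result(stream))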

Usage Example (NeMo CTC)

import sherpa_onnx

# Initialize a streaming recognizer with a NeMo CTC model
recognizer = sherpa_onnx.OnlineRecognizer.from_nemo_ctc(
    model="models/en/nemo_ctc_80ms/model.int8.onnx",
    tokens="models/en/nemo_ctc_80ms/tokens.txt",
    num_threads=4,
)

# Streams are created, fed, and decoded exactly as in the transducer example above

📊 Model Architecture

Kroko (Transducer)

  • Encoder: Zipformer2 with 64 or 128 layers
  • Decoder: RNN-T decoder (stateful)
  • Joiner: Simple feedforward network
  • Format: ONNX INT8 quantized
  • Components: 3 files (encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx)

NeMo (CTC)

  • Architecture: Fast Conformer with CTC
  • Format: ONNX INT8 quantized
  • Components: 1 file (model.int8.onnx)
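
To verify which components a downloaded model actually ships with, each ONNX file can be opened with ONNX Runtime and its input/output names printed. A minimal sketch (paths follow the directory layout shown below):

import onnxruntime as ort

# Inspect the three transducer components of one Kroko model
for part in ("encoder", "decoder", "joiner"):
    sess = ort.InferenceSession(
        f"models/en/kroko_64l/{part}.int8.onnx",
        providers=["CPUExecutionProvider"],
    )
    inputs = [i.name for i in sess.get_inputs()]
    outputs = [o.name for o in sess.get_outputs()]
    print(f"{part}: inputs={inputs} outputs={outputs}")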

🎮 Recommended Use Cases

Gaming Applications (Word Sniper, Word Wave)

  • Best choice: nemo_ctc_80ms - Ultra-low latency (80ms)
  • Alternative: kroko_64l - Better accuracy with acceptable latency

Reading Exercises (Echo Challenge)

  • Best choice: nemo_ctc_480ms - Balanced latency and accuracy
  • Alternative: kroko_64l - Higher accuracy for complex sentences

General Purpose STT

  • Best choice: kroko_128l - Highest accuracy
  • Alternative: kroko_64l - Faster inference, good accuracy

Low-end Devices (512MB-1GB RAM)

  • Best choice: kroko_64l - Smaller encoder, lower memory usage
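
These recommendations can be encoded in a small helper. The function below is a hypothetical convenience wrapper (not part of sherpa-onnx) that maps a language and use case to one of the model directories in this collection:

from pathlib import Path

# Hypothetical helper: map (language, use case) to a model directory
MODEL_BY_USE_CASE = {
    "gaming": "nemo_ctc_80ms",    # English only
    "reading": "nemo_ctc_480ms",  # English only
    "general": "kroko_128l",
    "low_memory": "kroko_64l",
}

def pick_model_dir(root: str, lang: str, use_case: str) -> Path:
    name = MODEL_BY_USE_CASE[use_case]
    if name.startswith("nemo_ctc") and lang != "en":
        name = "kroko_64l"  # the NeMo CTC models are English-only
    path = Path(root) / lang / name
    if not path.is_dir():
        raise FileNotFoundError(path)
    return path

# Example: pick_model_dir("models", "de", "general") -> models/de/kroko_128l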

🔧 Quantization Details

All models are quantized using ONNX Runtime dynamic quantization:

from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder.onnx",
    model_output="encoder.int8.onnx",
    weight_type=QuantType.QUInt8
)
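
Applied to a Kroko model, the same call is simply repeated for each of the three components. A sketch with hypothetical input/output directories:

from pathlib import Path
from onnxruntime.quantization import quantize_dynamic, QuantType

fp32_dir = Path("kroko_64l_fp32")       # hypothetical directory with FP32 exports
int8_dir = Path("models/en/kroko_64l")  # output directory following this repo's layout
int8_dir.mkdir(parents=True, exist_ok=True)

for name in ("encoder", "decoder", "joiner"):
    quantize_dynamic(
        model_input=str(fp32_dir / f"{name}.onnx"),
        model_output=str(int8_dir / f"{name}.int8.onnx"),
        weight_type=QuantType.QUInt8,
    )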

Benefits:

  • ✅ ~50% size reduction relative to the FP32 originals
  • ✅ Faster inference on CPU
  • ✅ Lower memory usage
  • ✅ Minimal accuracy loss (<2% WER increase)

📁 Directory Structure

models/
├── de/
│   ├── kroko_64l/
│   │   ├── encoder.int8.onnx
│   │   ├── decoder.int8.onnx
│   │   ├── joiner.int8.onnx
│   │   └── tokens.txt
│   └── kroko_128l/
│       └── ...
├── en/
│   ├── kroko_64l/
│   ├── kroko_128l/
│   ├── nemo_ctc_80ms/
│   │   ├── model.int8.onnx
│   │   └── tokens.txt
│   ├── nemo_ctc_480ms/
│   └── nemo_ctc_1040ms/
├── es/
│   ├── kroko_64l/
│   └── kroko_128l/
├── fr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── tr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── it/
│   ├── kroko_64l/
│   └── kroko_128l/
└── pt/
    ├── kroko_64l/
    └── kroko_128l/
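
Given this layout, the available models can be enumerated programmatically. A small sketch using pathlib:

from pathlib import Path

# List every model directory under models/ and classify it by its files
root = Path("models")
for tokens in sorted(root.glob("*/*/tokens.txt")):
    model_dir = tokens.parent
    kind = "transducer" if (model_dir / "encoder.int8.onnx").exists() else "ctc"
    print(f"{model_dir.parent.name}/{model_dir.name} ({kind})")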

🌟 Credits & Acknowledgments

Kroko Models

These models are derived from the Banafo Kroko ASR project, an open-source multilingual speech recognition initiative.

  • Original Source: Banafo/Kroko-ASR
  • Community Models: All Kroko models (DE, EN, ES, FR, TR, IT, PT) are Community versions
  • Architecture: Zipformer2 + Transducer
  • Training: Based on Next-gen Kaldi framework
  • License: Apache 2.0

Special thanks to the Banafo team for providing high-quality multilingual ASR models with streaming capabilities.

Kroko Model Variants

  • 64L: 64-layer encoder - Optimized for speed
  • 128L: 128-layer encoder - Optimized for accuracy

NeMo Models

  • Source: NVIDIA NeMo Toolkit
  • Architecture: Fast Conformer CTC
  • Training Framework: NeMo ASR

Quantization

  • Tool: ONNX Runtime
  • Method: Dynamic quantization (QUInt8)
  • Performed by: This repository maintainer

📄 License

All models in this collection are released under Apache 2.0 License.

Original Model Licenses

  • Kroko Models: Apache 2.0 (from Banafo/Kroko-ASR)
  • NeMo Models: Apache 2.0 (from NVIDIA NeMo)

🔗 Related Links

📊 Performance Benchmarks

| Model | Size | Latency | WER (en) | Memory | Best For |
|-------|------|---------|----------|--------|----------|
| nemo_ctc_80ms | 126 MB | 80ms | ~8% | 512 MB | Gaming |
| nemo_ctc_480ms | 126 MB | 480ms | ~6% | 512 MB | Reading |
| kroko_64l | 147 MB | ~200ms | ~5% | 1 GB | General |
| kroko_128l | 147 MB | ~300ms | ~4% | 1.5 GB | High Accuracy |

Benchmarks are approximate and may vary based on hardware and audio conditions.
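
To get comparable numbers on your own hardware, a rough real-time-factor (RTF) measurement can be made with the streaming API. This sketch assumes a recognizer built as in the Quick Start section and uses synthetic silence as a stand-in for real audio:

import time
import numpy as np

sample_rate = 16000
duration_s = 10
samples = np.zeros(duration_s * sample_rate, dtype=np.float32)  # 10 s of silence

stream = recognizer.create_stream()
start = time.perf_counter()
stream.accept_waveform(sample_rate, samples)
stream.input_finished()
while recognizer.is_ready(stream):
    recognizer.decode_stream(stream)
elapsed = time.perf_counter() - start
print(f"RTF: {elapsed / duration_s:.3f}")  # < 1.0 means faster than real time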

🛠️ System Requirements

  • Minimum RAM: 512 MB (for NeMo models)
  • Recommended RAM: 1-2 GB (for Kroko models)
  • CPU: Any modern 64-bit x86 or ARM CPU (AVX2 recommended on x86-64)
  • OS: Windows, Linux, macOS, Android (7.0+), iOS
  • Runtime: ONNX Runtime (CPU)

🚧 Known Limitations

  • INT8 quantization may cause slight accuracy degradation (~1-2% WER increase)
  • Kroko 128L models require more memory than 64L variants
  • The NeMo CTC models are English-only
  • Real-time performance depends on CPU capabilities

📝 Citation

If you use these models in your research or application, please cite:

@misc{sherpa-onnx-int8-models,
  title={Sherpa ONNX STT Models - INT8 Quantized Collection},
  author={Your Name/Organization},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/your-username/sherpa-onnx-int8-models}},
  note={Quantized from Banafo/Kroko-ASR and NVIDIA NeMo models}
}

Original Kroko Citation:

@misc{banafo-kroko-asr,
  title={Kroko ASR: Multilingual Streaming Speech Recognition},
  author={Banafo Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Banafo/Kroko-ASR}}
}

💬 Support

For issues and questions:

📅 Version History

  • v1.0.0 (2025-11-07): Initial release
    • 17 INT8 quantized models
    • 7 languages supported
    • DE, EN, ES, FR, TR, IT, PT coverage
    • Total size: 2.38 GB

Made with ❤️ using Sherpa-ONNX and ONNX Runtime
