---
tags:
- text-classification
- gibberish
- detector
- spam
- distilbert
- nlp
- text-filter
- akto
language: en
widget:
- text: I love Machine Learning!
license: mit
library_name: transformers
base_model: distilbert-base-uncased
model-index:
- name: gibberish-detector
  results:
  - task:
      type: text-classification
      name: Gibberish Detection
    metrics:
    - type: accuracy
      value: 0.9736
      name: Accuracy
    - type: f1
      value: 0.9736
      name: F1 Score
---

# Gibberish Detector - Text Classification Model

**High-performance gibberish detection model** for identifying nonsensical text, spam, and incoherent input. Built on DistilBERT, it achieves **97.36% accuracy** on multi-class text classification. The model is designed for production use with Akto's security frameworks and LLM protection systems.

## 🎯 Quick Start

```python
from transformers import pipeline

# Initialize the gibberish detector
detector = pipeline("text-classification", model="TangoBeeAkto/gibberish-detector")

# Detect gibberish in text
result = detector("I love Machine Learning!")
print(result)
# Output: [{'label': 'clean', 'score': 0.99}]
```

## 🔥 Key Features

- **🎯 97.36% Accuracy**: High-performance gibberish detection
- **⚡ Fast Inference**: Optimized DistilBERT architecture
- **🏷️ Multi-Class Detection**: Noise, Word Salad, Mild Gibberish, and Clean text
- **🔧 Easy Integration**: Standard transformers pipeline
- **🌐 Production Ready**: Tested and validated for security applications
- **💚 Efficient**: Low computational footprint

## Problem Description

The ability to process and understand user input is crucial for applications such as chatbots and downstream NLP tasks. A common challenge in such systems is gibberish or nonsensical input. This project develops a gibberish detector for English: it classifies user input as **clean** or as one of several grades of **gibberish**, enabling more accurate and meaningful interactions with the system.

## Label Categories

The model classifies text into four categories:

1. **Clean (0)**: Proper, meaningful sentences
   - Example: `I love this website`
2. **Mild Gibberish (1)**: Sentences with grammatical or syntactic errors
   - Example: `I study in a teacher`
3. **Noise (2)**: Random character sequences with no meaningful words
   - Example: `dfdfer fgerfow2e0d qsqskdsd`
4. **Word Salad (3)**: Valid words strung together without coherent meaning
   - Example: `apple banana car house randomly`
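To see the score for every class rather than only the top prediction, and to inspect how these ids map to label strings, a minimal sketch (assuming a recent `transformers` release where `top_k=None` returns all classes) looks like this:

```python
from transformers import pipeline

detector = pipeline("text-classification", model="TangoBeeAkto/gibberish-detector")

# top_k=None asks the pipeline for a score per label instead of only the best one
scores = detector("I study in a teacher", top_k=None)
for entry in scores:
    print(f"{entry['label']}: {entry['score']:.3f}")

# The id-to-name mapping lives in the model config
print(detector.model.config.id2label)
```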
## 🚀 Use Cases

### Input Validation for Security Systems

```python
def validate_user_input(text):
    # `detector` is the pipeline created in Quick Start
    result = detector(text)[0]
    if result['label'] in ['noise', 'word_salad']:
        return "Invalid input detected. Please provide meaningful text."
    # process_query is your application's downstream handler
    return process_query(text)
```

### Content Moderation

```python
def moderate_content(post):
    classification = detector(post)[0]
    if classification['label'] != 'clean':
        return f"Content flagged: {classification['label']}"
    return "Content approved"
```

### LLM Prompt Filtering

```python
def filter_prompt(prompt):
    result = detector(prompt)[0]
    if result['label'] in ['noise', 'word_salad'] and result['score'] > 0.8:
        return "Potentially malicious or gibberish prompt detected"
    return "Prompt is valid"
```

## 🛠️ Installation & Usage

Install the dependencies (`pip install transformers torch`), then load the model directly:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained("TangoBeeAkto/gibberish-detector")
tokenizer = AutoTokenizer.from_pretrained("TangoBeeAkto/gibberish-detector")

def detect_gibberish(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label_id = probabilities.argmax().item()
    return model.config.id2label[predicted_label_id]

# Example usage
print(detect_gibberish("Hello world!"))  # Output: clean
print(detect_gibberish("asdkfj asdf"))   # Output: noise
```

## Model Details

- **Architecture**: DistilBERT for Sequence Classification
- **Base Model**: distilbert-base-uncased
- **Max Length**: 64 tokens
- **Vocab Size**: 30,522
- **Parameters**: ~67M

## Performance Metrics

- **Accuracy**: 97.36%
- **F1 Score**: 97.36%
- **Precision**: 97.38%
- **Recall**: 97.36%

## ONNX Support

This model can be exported to ONNX for faster inference in production environments; pair it with an optimized runtime such as ONNX Runtime for best performance. A minimal export-and-inference sketch is included at the end of this card.

## Integration with Akto Security Framework

This model is optimized for use with Akto's LLM security and protection systems. It provides real-time gibberish detection for:

- Prompt injection detection
- Input validation
- Content filtering
- Security monitoring

## License

This model is licensed under the MIT License.

---

**Developed by Akto for enterprise security applications**
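
As a rough illustration of the ONNX path mentioned above, the sketch below uses Hugging Face Optimum's `ORTModelForSequenceClassification` with on-the-fly export; treat it as an assumption-laden starting point rather than an officially supported recipe, and adjust package versions and the example input as needed.

```python
# Hypothetical ONNX sketch (assumes `pip install optimum[onnxruntime]`);
# not an official export recipe for this model.
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "TangoBeeAkto/gibberish-detector"

# export=True converts the PyTorch checkpoint to ONNX on the fly
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_detector = pipeline("text-classification", model=ort_model, tokenizer=tokenizer)
print(onnx_detector("dfdfer fgerfow2e0d qsqskdsd"))  # expected to be flagged as noise
```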