Upload folder using huggingface_hub

Files changed (7)

README.md +171 -0
config.json +40 -0
model.safetensors +3 -0
special_tokens_map.json +1 -0
tokenizer.json +0 -0
tokenizer_config.json +1 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,171 @@

+---
+tags:
+- text-classification
+- gibberish
+- detector
+- spam
+- distilbert
+- nlp
+- text-filter
+- akto
+language: en
+widget:
+- text: I love Machine Learning!
+license: mit
+library_name: transformers
+base_model: distilbert-base-uncased
+model-index:
+- name: gibberish-detector
+  results:
+  - task:
+      type: text-classification
+      name: Gibberish Detection
+    metrics:
+    - type: accuracy
+      value: 0.9736
+      name: Accuracy
+    - type: f1
+      value: 0.9736
+      name: F1 Score
+---
+# Gibberish Detector - Text Classification Model
+**High-performance gibberish detection model** for identifying nonsensical text, spam, and incoherent input. Built with DistilBERT, achieving **97.36% accuracy** in multi-class text classification.
+This model is designed for production use with Akto's security frameworks and LLM protection systems.
+## 🎯 Quick Start
+```python
+from transformers import pipeline
+# Initialize the gibberish detector
+detector = pipeline("text-classification", model="TangoBeeAkto/gibberish-detector")
+# Detect gibberish in text
+result = detector("I love Machine Learning!")
+print(result)
+# Output: [{'label': 'clean', 'score': 0.99}]
+```
+## 🔥 Key Features
+- **🎯 97.36% Accuracy**: High-performance gibberish detection
+- **⚡ Fast Inference**: Optimized DistilBERT architecture
+- **🏷️ Multi-Class Detection**: Noise, Word Salad, Mild Gibberish, and Clean text
+- **🔧 Easy Integration**: Standard transformers pipeline
+- **🌐 Production Ready**: Tested and validated for security applications
+- **💚 Efficient**: Low computational footprint
+## Problem Description
+Applications that consume free-form user input, such as chatbots and the pipelines behind them, need to process and understand that input reliably. A common obstacle is gibberish or nonsensical text. This project develops a gibberish detector for English.
+The primary goal is to separate **gibberish** from **non-gibberish** input so the system can respond meaningfully. The model does this with four labels rather than a binary flag, so callers can choose how strict to be.
+## Label Categories
+The model classifies text into 4 categories:
+1. **Clean (0)**: Proper, meaningful sentences
+   - Example: `I love this website`
+2. **Mild Gibberish (1)**: Sentences with grammatical or syntactical errors
+   - Example: `I study in a teacher`
+3. **Noise (2)**: Random character sequences with no meaningful words
+   - Example: `dfdfer fgerfow2e0d qsqskdsd`
+4. **Word Salad (3)**: Valid words without coherent meaning
+   - Example: `apple banana car house randomly`
+## 🚀 Use Cases
+### Input Validation for Security Systems
+```python
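+def validate_user_input(text):
+    result = detector(text)[0]
+    # Label strings follow config.json's id2label, which uses spaces
+    if result['label'] in ['noise', 'word salad']:
+        return "Invalid input detected. Please provide meaningful text."
+    # process_query is a placeholder for the application's own handler
+    return process_query(text)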
+```
+### Content Moderation
+```python
+def moderate_content(post):
+    classification = detector(post)[0]
+    if classification['label'] != 'clean':
+        return f"Content flagged: {classification['label']}"
+    return "Content approved"
+```
+### LLM Prompt Filtering
+```python
+def filter_prompt(prompt):
+    result = detector(prompt)[0]
+    if result['label'] in ['noise', 'word salad'] and result['score'] > 0.8:
+        return "Potentially malicious or gibberish prompt detected"
+    return "Prompt is valid"
+```
+## 🛠️ Installation & Usage
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+import torch
+# Load model and tokenizer
+model = AutoModelForSequenceClassification.from_pretrained("TangoBeeAkto/gibberish-detector")
+tokenizer = AutoTokenizer.from_pretrained("TangoBeeAkto/gibberish-detector")
+def detect_gibberish(text):
+    # Truncate to the 64-token limit listed under Model Details
+    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=64)
+    with torch.no_grad():
+        outputs = model(**inputs)
+    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
+    # Single input, so take the class id from the only row
+    predicted_label_id = probabilities.argmax(dim=-1).item()
+    return model.config.id2label[predicted_label_id]
+# Example usage
+print(detect_gibberish("Hello world!"))  # Output: clean
+print(detect_gibberish("asdkfj asdf"))   # Output: noise
+```
+## Model Details
+- **Architecture**: DistilBERT for Sequence Classification
+- **Base Model**: distilbert-base-uncased
+- **Max Length**: 64 tokens
+- **Vocab Size**: 30,522
+- **Parameters**: ~67M
+## Performance Metrics
+- **Accuracy**: 97.36%
+- **F1 Score**: 97.36%
+- **Precision**: 97.38%
+- **Recall**: 97.36%
+## ONNX Support
+This model can be exported to ONNX for faster inference in production environments. Pair the exported model with an optimized runtime such as ONNX Runtime for best performance.
+## Integration with Akto Security Framework
+This model is optimized for use with Akto's LLM security and protection systems. It provides real-time gibberish detection for:
+- Prompt injection detection
+- Input validation
+- Content filtering
+- Security monitoring
+## License
+This model is licensed under the MIT License.
+---
+**Developed by Akto for enterprise security applications**
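
Note on the three use-case snippets in this README: each one reduces to a policy over the predicted label and score. The sketch below writes that policy as a single helper so the label spelling and threshold live in one place. It assumes predictions shaped like the pipeline output in Quick Start and label strings spelled as in `config.json` (`word salad`, with a space). The names `is_blocked` and `BLOCKED_LABELS` and the 0.8 default are illustrative, not part of the model or of Akto's code, and the example predictions are hand-written stand-ins rather than real model output.

```python
from typing import Any, Mapping

# Assumption: label strings match config.json's id2label (spaces, not underscores).
BLOCKED_LABELS = frozenset({"noise", "word salad"})


def is_blocked(prediction: Mapping[str, Any],
               blocked_labels=BLOCKED_LABELS,
               min_score: float = 0.8) -> bool:
    """Return True when a confident prediction falls in a blocked class."""
    label = str(prediction["label"]).lower()
    score = float(prediction["score"])
    return label in blocked_labels and score > min_score


# Hand-written stand-ins for detector(text)[0]; no model is loaded here.
examples = [
    {"label": "clean", "score": 0.99},
    {"label": "noise", "score": 0.95},
    {"label": "word salad", "score": 0.55},
]
decisions = [is_blocked(p) for p in examples]
print(decisions)
```

Low-confidence predictions pass through in this sketch. A stricter deployment could lower `min_score` or add `mild gibberish` to the blocked set.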

config.json ADDED

	@@ -0,0 +1,40 @@

+{
+  "_name_or_path": "TangoBeeAkto/gibberish-detector",
+  "_num_labels": 4,
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "clean",
+    "1": "mild gibberish",
+    "2": "noise",
+    "3": "word salad"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "clean": 0,
+    "mild gibberish": 1,
+    "noise": 2,
+    "word salad": 3
+  },
+  "max_length": 64,
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "padding": "max_length",
+  "problem_type": "single_label_classification",
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.15.0",
+  "vocab_size": 30522
+}
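
Note on `id2label`: the README's `detect_gibberish` helper turns logits into a label through this mapping. A minimal pure-Python sketch of that decoding step follows, assuming only the `id2label` fragment shown in this file. The logits are made up for illustration and `softmax` and `decode` are hypothetical helper names. JSON object keys are strings, so the sketch converts them to integers before indexing.

```python
import json
import math

# Fragment copied from the config.json hunk above.
CONFIG_FRAGMENT = """
{"id2label": {"0": "clean", "1": "mild gibberish", "2": "noise", "3": "word salad"}}
"""

# JSON object keys are always strings, so convert them to ints for indexing.
id2label = {int(k): v for k, v in json.loads(CONFIG_FRAGMENT)["id2label"].items()}


def softmax(logits):
    """Numerically stable softmax over a flat list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Map one row of logits to (label, probability)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Made-up logits for illustration; real values come from the model.
label, score = decode([0.1, 0.2, 3.0, 0.5])
print(label, round(score, 3))
```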

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:278bd923924e82f8021d73c771840e98c5d2a1f032a6cba5d09dab1583cd2e82
+size 267838720
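
Note on the pointer size: the `size` field gives a quick sanity check on the README's "~67M" parameter figure. The sketch below parses the pointer text shown above and divides the byte count by four, assuming every tensor is stored as float32 (as `torch_dtype` in `config.json` suggests). The result is an upper bound, since a safetensors file also carries a small metadata header.

```python
# Pointer text copied from the model.safetensors hunk above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:278bd923924e82f8021d73c771840e98c5d2a1f032a6cba5d09dab1583cd2e82
size 267838720
"""

# Each pointer line is "<key> <value>", split on the first space.
fields = dict(line.split(" ", 1) for line in POINTER.splitlines() if line)

size_bytes = int(fields["size"])
BYTES_PER_FLOAT32 = 4  # assumption: all weights are float32

# Upper bound: the file also holds a small safetensors header.
max_params = size_bytes // BYTES_PER_FLOAT32
print(f"{size_bytes / 1e6:.1f} MB, at most {max_params / 1e6:.2f}M float32 parameters")
```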

special_tokens_map.json ADDED

	@@ -0,0 +1 @@


+{"unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]"}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1 @@


+{"do_lower_case": true, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "name_or_path": "AutoNLP", "tokenizer_class": "DistilBertTokenizer"}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff