Delete Leaderboard.md
Leaderboard.md (deleted, +0 -57)
@@ -1,57 +0,0 @@
# Real Estate Operator Leaderboard v2.0

| Model | Supervisor | Compliance | Logic | Memory | Identity | Score |
|:---------------------------------------------|----------------:|----------------:|-----------:|-----------:|-------------:|----------:|
| qwen/qwen3-235b-a22b-2507 | 65.8 | 100 | 63.2 | 64.5 | 68 | 81.7 |
| qwen/qwen3-next-80b-a3b-instruct | 59 | 86.4 | 85.3 | 64.5 | 78 | 81.5 |
| google/gemini-2.5-flash | 44.5 | 100 | 63.2 | 83.9 | 58 | 80.5 |
| google/gemma-3-27b-it | 63.8 | 100 | 70.6 | 58.1 | 64 | 80 |
| google/gemma-3-12b-it | 51.5 | 100 | 76.5 | 48.4 | 84 | 74.5 |
| mistralai/mixtral-8x22b-instruct | 41.8 | 93.2 | 70.6 | 58.1 | 74 | 71.4 |
| google/gemma-3n-e4b-it | 34.5 | 93.2 | 63.2 | 67.7 | 58 | 70 |
| google/gemini-2.5-flash-lite-preview-09-2025 | 58.1 | 93.2 | 63.2 | 41.9 | 64 | 69.5 |
| google/gemma-3-4b-it | 44.2 | 75 | 61.8 | 64.5 | 52 | 68 |
| qwen/qwen3-30b-a3b-instruct-2507 | 58 | 90.9 | 55.9 | 41.9 | 72 | 67.9 |
| mistralai/ministral-3b-2512 | 40.2 | 88.6 | 48.5 | 58.1 | 62 | 64.9 |
| meta-llama/llama-4-scout | 36.3 | 81.8 | 77.9 | 16.1 | 92 | 54.9 |
| qwen/qwen3-8b | 22.5 | nan | nan | nan | nan | 22.5 |
| meta-llama/llama-4-maverick | 17.3 | nan | nan | nan | nan | 17.3 |
| ibm-granite/granite-4.0-h-micro | 15.7 | nan | nan | nan | nan | 15.7 |

## Violation Analysis

### Top 10 Global Violations
| Violation | Count |
|:----------------------------|--------:|
| Robotic Anchors | 274 |
| Abrupt Goodbyes | 210 |
| Goldfish Memory | 172 |
| Curtness/Dryness | 138 |
| Echoing/Parroting | 116 |
| Looping/Ignoring Correction | 110 |
| Outbound Fail | 109 |
| City Pivot Failure | 96 |
| Introduction Fail | 83 |
| Context Blindness | 70 |

### Model Violation Breakdown
| Model | Total Violations | Top 3 Violations |
|:---------------------------------------------|-------------------:|:------------------------------------------------------------------------------|
| meta-llama/llama-4-maverick | 187 | Abrupt Goodbyes (14), Meta-Commentary (13), Goldfish Memory (13) |
| google/gemma-3n-e4b-it | 174 | Robotic Anchors (23), Goldfish Memory (20), Curtness/Dryness (16) |
| ibm-granite/granite-4.0-h-micro | 173 | Looping/Ignoring Correction (22), Robotic Anchors (18), Curtness/Dryness (16) |
| mistralai/ministral-3b-2512 | 159 | Abrupt Goodbyes (22), Robotic Anchors (19), Goldfish Memory (15) |
| google/gemini-2.5-flash | 159 | Robotic Anchors (22), Abrupt Goodbyes (19), Echoing/Parroting (15) |
| qwen/qwen3-30b-a3b-instruct-2507 | 145 | Robotic Anchors (23), Curtness/Dryness (12), Abrupt Goodbyes (12) |
| mistralai/mixtral-8x22b-instruct | 141 | Abrupt Goodbyes (24), Goldfish Memory (14), Curtness/Dryness (13) |
| google/gemini-2.5-flash-lite-preview-09-2025 | 138 | Robotic Anchors (23), Abrupt Goodbyes (18), Goldfish Memory (14) |
| qwen/qwen3-next-80b-a3b-instruct | 137 | Robotic Anchors (24), Echoing/Parroting (13), Curtness/Dryness (11) |
| google/gemma-3-12b-it | 134 | Robotic Anchors (22), Abrupt Goodbyes (14), Goldfish Memory (13) |
| google/gemma-3-4b-it | 130 | Robotic Anchors (14), Outbound Fail (11), Abrupt Goodbyes (9) |
| meta-llama/llama-4-scout | 127 | Robotic Anchors (17), Abrupt Goodbyes (14), Deflection Fail (11) |
| google/gemma-3-27b-it | 124 | Outbound Fail (19), Robotic Anchors (18), Abrupt Goodbyes (18) |
| qwen/qwen3-235b-a22b-2507 | 93 | Robotic Anchors (23), Abrupt Goodbyes (11), Outbound Fail (10) |
| qwen/qwen3-8b | 51 | Goldfish Memory (7), Robotic Anchors (6), Curtness/Dryness (6) |

*Updated: 2025-12-12 15:20*