languagebench / evals

Commit History

Upload from GitHub Actions: cleaned up code
2586cfe
verified

davidpomerenke committed

Upload from GitHub Actions: added opus 4.5
0a17acf
verified

davidpomerenke committed

Upload from GitHub Actions: add gpt-5.1, gemini-3
9ea2dd3
verified

davidpomerenke committed

Upload from GitHub Actions: flores filter for available dev split
34b05c6
verified

davidpomerenke committed

Upload from GitHub Actions: model name no bracket stuff
aa92add
verified

davidpomerenke committed

Upload from GitHub Actions: drop normalization
972026c
verified

davidpomerenke committed

Upload from GitHub Actions: improve norwegian fix
6f0e312
verified

davidpomerenke committed

Upload from GitHub Actions: fix norwegian
0cbac6c
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #22 from datenlabor-bmz/dev
2cdada4
verified

davidpomerenke committed

Upload from GitHub Actions: Add auto-translated datasets
68a93b5
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #18 from datenlabor-bmz/pr-17
a0d1624
verified

davidpomerenke committed

Upload from GitHub Actions: Add auto-translated datasets
c790fdb
verified

davidpomerenke committed

Upload from GitHub Actions: ran full evaluation locally
088f96f
verified

davidpomerenke committed

Upload from GitHub Actions: minor chashing change
b39df3c
verified

davidpomerenke committed

Upload from GitHub Actions: updated and cleaned up scripts for new eval runs
963cb78
verified

davidpomerenke committed

Upload from GitHub Actions: Update models.py, models.json, and results.json with latest evaluation data and model additions
8eebb41
verified

davidpomerenke committed

Upload from GitHub Actions: Add Todos for using existing machine-translated datasets rather than our own ones
56adaa2
verified

davidpomerenke committed

Upload from GitHub Actions: updated translation functions
8f5ce26
verified

davidpomerenke committed

Upload from GitHub Actions: import flexibility on backend
b8cbeff
verified

davidpomerenke committed

Upload from GitHub Actions: fixed import error
0a30811
verified

davidpomerenke committed

Upload from GitHub Actions: updated frontend and backend to fix bugs
4e8cb1a
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #13 from datenlabor-bmz/jn-dev
80d21cb
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #10 from datenlabor-bmz/jn-dev
c2eeeac
verified

davidpomerenke committed

Upload from GitHub Actions: updated batch size and delay
02f927b
verified

davidpomerenke committed

Upload from GitHub Actions: updated workflow settings
e51c770
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #9 from datenlabor-bmz/jn-dev
7c06aef
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #7 from datenlabor-bmz/jn-dev
6878a71
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #6 from datenlabor-bmz/jn-dev
6234f5c
verified

davidpomerenke committed

Upload from GitHub Actions: Exclude TruthfulQA from proficiency score
3fbff09
verified

davidpomerenke committed

Upload from GitHub Actions: TruthfulQA translation WIP
fd102e9
verified

davidpomerenke committed

Upload from GitHub Actions: Scatterplot
353f761
verified

davidpomerenke committed

Upload from GitHub Actions: Get more results, compute average based on all tasks
98c6811
verified

davidpomerenke committed

Upload from GitHub Actions: Translate MMLU and evaluate
4c5c136
verified

davidpomerenke committed

Upload from GitHub Actions: Correlation plot
b0aa389
verified

davidpomerenke committed

Upload from GitHub Actions: Evaluate on autotranslated GSM dataset
f3a09a2
verified

davidpomerenke committed

Upload from GitHub Actions: Evaluate Google Translate
338dc9b
verified

davidpomerenke committed

Upload from GitHub Actions: More models and languages
a73f888
verified

davidpomerenke committed

Upload from GitHub Actions: Improve UX and style
53d2039
verified

davidpomerenke committed

Upload from GitHub Actions: Merge remote changes and apply terminology updates: Commercial->closed-source, Open->open-source
ebaf279
verified

davidpomerenke committed

Upload from GitHub Actions: Use task subset for average score
b1e5b40
verified

davidpomerenke committed

Upload from GitHub Actions: Evaluate on 40 languages
941d5c5
verified

davidpomerenke committed

Upload from GitHub Actions: Add math benchmarks
549360a
verified

davidpomerenke committed

Upload from GitHub Actions: More results
52abc5b
verified

davidpomerenke committed

Upload from GitHub Actions: Update model ranking fetching
f840423
verified

davidpomerenke committed

Upload from GitHub Actions: Use FLORES+ via Huggingface
913253a
verified

davidpomerenke committed

Upload from GitHub Actions: Quick fixes
9c2c019
verified

davidpomerenke committed

Upload from GitHub Actions: More models
0bd935e
verified

davidpomerenke committed

Upload from GitHub Actions: Increase n_models
d09b095
verified

davidpomerenke committed

Upload from GitHub Actions: New results
b311dd5
verified

davidpomerenke committed

Upload from GitHub Actions: Merge pull request #4 from datenlabor-bmz/jonas-dev
7c6a118
verified

davidpomerenke committed