---
library_name: peft
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.1-8B-Instruct
pipeline_tag: question-answering
tags:
- peft
- Universal
- ORKGSyn
- 33 disciplines
---
Large Language Models (LLMs) have become pivotal in powering scientific question answering across modern search engines, yet the robustness of their evaluation remains largely underexplored. To address this gap, we introduce **YESciEval**, an open-source framework that combines fine-grained rubric-based assessment with reinforcement learning to mitigate optimism bias in LLM evaluators.

YESciEval provides a comprehensive library for evaluating the quality of synthesized scientific answers using predefined rubrics and LLM-based judge models. The framework lets you assess answers against key criteria with pretrained judges and parse LLM outputs into structured JSON for detailed analysis.

**The `YESciEval-ASK-Llama-3.1-8B` model is a multidisciplinary judge fine-tuned on the [ORKGSyn](https://data.uni-hannover.de/dataset/yescieval-corpus) dataset from the Open Research Knowledge Graph.**

## Usage

First, install the YESciEval library via pip:

```bash
pip install yescieval
```

Get started with YESciEval in just a few lines of code. The example below shows how to prepare the inputs, create a rubric, load a judge, and evaluate an answer.

```python
from yescieval import Readability, AskAutoJudge

# Sample papers in the format {"title": "abstract", ...}
papers = {
    "A Study on AI": "This paper discusses recent advances in artificial intelligence, including deep learning.",
    "Machine Learning Basics": "An overview of supervised learning methods such as decision trees and SVMs.",
    "Neural Networks Explained": "Explains backpropagation and gradient descent for training networks.",
    "Ethics in AI": "Explores ethical concerns in automated decision-making systems.",
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}

# Question and synthesized answer
question = "How is AI used in modern healthcare systems?"
answer = (
    "AI is being used in healthcare for diagnosing diseases, predicting patient outcomes, "
    "and assisting in treatment planning. It also supports personalized medicine and medical imaging."
)

# Step 1: Create a rubric
rubric = Readability(papers=papers, question=question, answer=answer)

# Step 2: Load a judge model
judge = AskAutoJudge()
judge.from_pretrained(token="your_huggingface_token")

# Step 3: Evaluate the answer
result = judge.evaluate(rubric=rubric)
print("Raw Evaluation Output:")
print(result)
```

A total of nine evaluation rubrics are defined as part of the YESciEval framework and can be used via `yescieval`. The following example shows how to import them:

```python
from yescieval import (
    Informativeness, Correctness, Completeness, Coherence, Relevancy,
    Integration, Cohesion, Readability, Conciseness
)
```

A complete list of rubrics is available on the YESciEval [📚 Rubrics](https://yescieval.readthedocs.io/rubrics.html) page. For more detailed documentation, visit [📚 https://yescieval.readthedocs.io](https://yescieval.readthedocs.io).

## Citation

If you find our work helpful, please cite us:

```bibtex
@article{d2025yescieval,
  title={YESciEval: Robust LLM-as-a-Judge for Scientific Question Answering},
  author={D'Souza, Jennifer and Giglou, Hamed Babaei and M{\"u}nch, Quentin},
  journal={arXiv preprint arXiv:2505.14279},
  year={2025}
}
```
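## Evaluating Multiple Rubrics

As a further illustration of the structured-output workflow described above, the sketch below evaluates one answer against several rubrics and parses each raw judge output into JSON. It assumes that every rubric class shares the `(papers, question, answer)` constructor shown in the quickstart and that the judge's raw output contains a single JSON object; both are assumptions about the output format, not guarantees of the library.

```python
import json
import re

from yescieval import AskAutoJudge, Coherence, Conciseness, Readability

# Inputs as in the quickstart above
papers = {
    "Applications of AI in Healthcare": "Details how AI improves diagnostics and personalized medicine."
}
question = "How is AI used in modern healthcare systems?"
answer = "AI supports diagnostics, outcome prediction, and personalized medicine."

def extract_json(raw: str) -> dict:
    """Pull the first JSON object out of a raw judge response.

    A minimal sketch: real outputs may wrap the JSON in extra text,
    so we match the outermost braces before decoding.
    """
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in judge output")
    return json.loads(match.group(0))

# Load the judge once and reuse it for every rubric
judge = AskAutoJudge()
judge.from_pretrained(token="your_huggingface_token")

results = {}
for rubric_cls in (Readability, Coherence, Conciseness):
    # Assumes all rubric classes take the same constructor arguments
    rubric = rubric_cls(papers=papers, question=question, answer=answer)
    raw = judge.evaluate(rubric=rubric)
    results[rubric_cls.__name__] = extract_json(str(raw))

print(json.dumps(results, indent=2))
```

Greedy matching on the outermost braces keeps nested objects intact; if a judge emits several JSON objects in one response, a stricter parser would be needed.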