
GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

Features

  • Ground Truth Comparison: Automatically compares agent answers to correct answers from data/metadata.jsonl
  • Multiple Evaluation Metrics: Exact match, similarity score, and contains-answer detection
  • Phoenix Integration: Logs evaluations to Phoenix for persistent tracking and analysis
  • Enhanced Results Display: Shows ground truth and comparison results in the Gradio interface

How It Works

1. Ground Truth Loading

  • Loads correct answers from data/metadata.jsonl
  • Maps task IDs to ground truth answers
  • Currently supports 165 questions from the GAIA dataset
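
A minimal loading sketch (the "task_id" and "Final answer" field names follow the GAIA metadata format but are assumptions here; the real loader in comparison.py may differ):

import json

def load_ground_truth(path="data/metadata.jsonl"):
    # Map each task_id to its ground truth answer.
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            answers[record["task_id"]] = record["Final answer"]
    return answers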

2. Answer Comparison

For each agent answer, the system calculates three metrics (sketched after this list):

  • Exact Match: Boolean indicating if answers match exactly (after normalization)
  • Similarity Score: 0-1 score using difflib.SequenceMatcher
  • Contains Answer: Boolean indicating if the correct answer is contained in the agent's response
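
A sketch of how these three metrics can be computed (illustrative names, not the exact comparison.py API); both inputs are assumed to be normalized as described in the next section:

from difflib import SequenceMatcher

def compare_answers(predicted, actual):
    # Both strings are assumed to be normalized already (see next section).
    return {
        "exact_match": predicted == actual,
        "similarity_score": SequenceMatcher(None, predicted, actual).ratio(),
        "contains_answer": actual in predicted,
    }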

3. Answer Normalization

Before comparison, answers are normalized by the following steps (see the sketch after this list):

  • Converting to lowercase
  • Removing punctuation (.,;:!?"')
  • Normalizing whitespace
  • Trimming leading/trailing spaces
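
An illustrative normalize_answer() implementing these rules (the actual function in comparison.py may differ in detail):

import re

def normalize_answer(text):
    text = text.lower()                      # convert to lowercase
    text = re.sub(r"[.,;:!?\"']", "", text)  # remove the listed punctuation
    text = re.sub(r"\s+", " ", text)         # normalize whitespace runs
    return text.strip()                      # trim leading/trailing spaces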

4. Phoenix Integration

  • Evaluations are automatically logged to Phoenix
  • Each evaluation includes score, label, explanation, and detailed metrics
  • Viewable in Phoenix UI for historical tracking and analysis
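
A hedged sketch of span-level evaluation logging with arize-phoenix (the exact API surface varies across Phoenix versions, so treat this as an illustration rather than the project's phoenix_evaluator.py):

import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

def log_to_phoenix(rows):
    # rows: one dict per answer with "span_id", "score", "label", "explanation".
    df = pd.DataFrame(rows).set_index("span_id")
    df.index.name = "context.span_id"  # Phoenix keys span evaluations by span id
    px.Client().log_evaluations(
        SpanEvaluations(eval_name="GAIA Ground Truth", dataframe=df)
    )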

Usage

In Your Agent App

The comparison happens automatically when you run your agent:

  1. Run your agent - Process questions as usual
  2. Automatic comparison - System compares answers to ground truth
  3. Enhanced results - Results table includes comparison columns
  4. Phoenix logging - Evaluations are logged for persistent tracking
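
Conceptually, the wiring looks roughly like this (results, row["answer"], and the helper names are placeholders from the sketches above, not the app's actual API):

ground_truth = load_ground_truth()                      # task_id -> correct answer
for row in results:                                     # rows produced by your agent run
    truth = ground_truth.get(row["task_id"], "N/A")
    row["ground_truth"] = truth
    row.update(compare_answers(normalize_answer(row["answer"]),
                               normalize_answer(truth)))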

Results Display

Your results table now includes these additional columns:

  • Ground Truth: The correct answer from the GAIA dataset
  • Exact Match: True/False for exact matches
  • Similarity: Similarity score (0-1)
  • Contains Answer: True/False indicating whether the correct answer appears in the agent's response

Status Message

The status message now includes a ground truth comparison summary, for example:

Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅

Testing

Run the test suite to verify functionality:

python test_comparison.py

This will test:

  • Basic comparison functionality
  • Results enhancement
  • Phoenix integration
  • Ground truth loading

Files Added

  • comparison.py: Main comparison logic and AnswerComparator class
  • phoenix_evaluator.py: Phoenix integration for logging evaluations
  • test_comparison.py: Test suite for verification
  • GAIA_COMPARISON.md: This documentation

Dependencies Added

  • arize-phoenix: For observability and evaluation logging
  • pandas: For data manipulation (if not already present)

Example Evaluation Result

{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3", 
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}

Phoenix UI

In the Phoenix interface, you can:

  • View evaluation results alongside agent traces
  • Track accuracy over time
  • Filter by correct/incorrect answers
  • Analyze which question types your agent struggles with
  • Export evaluation data for further analysis

Troubleshooting

No Ground Truth Available

If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.

Phoenix Connection Issues

If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.

Low Similarity Scores

Low similarity scores might indicate:

  • The agent gives verbose answers where short ones are expected
  • The answer format doesn't match the expected format
  • The agent is partially correct but not an exact match

Customization

You can adjust the comparison logic in comparison.py:

  • Modify normalize_answer() for different normalization rules
  • Adjust similarity thresholds
  • Add custom evaluation metrics
  • Modify Phoenix logging format
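
For example, a threshold-based pass/fail on top of the similarity score could look like this (the 0.85 threshold is an arbitrary illustration, not a value used by the project):

SIMILARITY_THRESHOLD = 0.85  # illustrative value; tune for your use case

def is_close_enough(result):
    # Count exact matches, or near-matches above the threshold, as correct.
    return result["exact_match"] or result["similarity_score"] >= SIMILARITY_THRESHOLD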

Performance

The comparison adds minimal overhead:

  • Ground truth loading: ~1-2 seconds (one-time)
  • Per-answer comparison: ~1-10ms
  • Phoenix logging: ~10-50ms per evaluation

Total additional time: Usually < 5 seconds for 50 questions.