GAIA Ground Truth Comparison
This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.
Features
- Ground Truth Comparison: Automatically compares agent answers to the correct answers from data/metadata.jsonl
- Multiple Evaluation Metrics: Exact match, similarity score, and contains-answer detection
- Phoenix Integration: Logs evaluations to Phoenix for persistent tracking and analysis
- Enhanced Results Display: Shows ground truth and comparison results in the Gradio interface
How It Works
1. Ground Truth Loading
- Loads correct answers from data/metadata.jsonl
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset
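In code terms, the loading step is just reading one JSON object per line and indexing it by task ID. A minimal sketch (the "Final answer" field name follows the GAIA metadata format; the real loader lives in comparison.py and may differ):

```python
import json

def load_ground_truth(path: str = "data/metadata.jsonl") -> dict:
    """Map each task_id to its ground-truth answer from the GAIA metadata file."""
    ground_truth = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # GAIA metadata stores the reference answer under "Final answer"
            ground_truth[record["task_id"]] = str(record.get("Final answer", ""))
    return ground_truth
```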
2. Answer Comparison
For each agent answer, the system calculates:
- Exact Match: Boolean indicating if answers match exactly (after normalization)
- Similarity Score: 0-1 score using difflib.SequenceMatcher
- Contains Answer: Boolean indicating if the correct answer is contained in the agent's response
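All three metrics map onto the standard library; the sketch below shows one way to compute them (compare_answer is illustrative, the shipped logic lives in the AnswerComparator class in comparison.py):

```python
import difflib

def compare_answer(predicted: str, actual: str, normalize=str.lower) -> dict:
    """Compute exact match, similarity, and containment on normalized answers."""
    pred, act = normalize(predicted).strip(), normalize(actual).strip()
    return {
        "exact_match": pred == act,
        "similarity_score": difflib.SequenceMatcher(None, pred, act).ratio(),
        "contains_answer": act in pred,
    }
```

In practice the full normalize_answer() described in the next section would be passed in place of the simple str.lower default.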
3. Answer Normalization
Before comparison, answers are normalized by:
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces
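These four rules translate directly into a small helper; a sketch of normalize_answer (the implementation in comparison.py may differ in detail):

```python
import re

def normalize_answer(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = answer.lower()
    text = re.sub(r"[.,;:!?\"']", "", text)  # remove punctuation
    text = re.sub(r"\s+", " ", text)         # normalize whitespace
    return text.strip()                      # trim leading/trailing spaces
```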
4. Phoenix Integration
- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in Phoenix UI for historical tracking and analysis
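A minimal sketch of the logging call, assuming the arize-phoenix SpanEvaluations API with a DataFrame indexed by span id; the span ids, eval name, and values below are placeholders:

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# One row per evaluated answer, indexed by the span id of the agent run (placeholders).
evals_df = pd.DataFrame(
    {
        "score": [1.0, 0.4],
        "label": ["correct", "incorrect"],
        "explanation": ["exact match", "similarity 0.40, answer not contained"],
    },
    index=pd.Index(["<span-id-1>", "<span-id-2>"], name="context.span_id"),
)

px.Client().log_evaluations(
    SpanEvaluations(eval_name="GAIA Ground Truth", dataframe=evals_df)
)
```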
Usage
In Your Agent App
The comparison happens automatically when you run your agent:
- Run your agent - Process questions as usual
- Automatic comparison - System compares answers to ground truth
- Enhanced results - Results table includes comparison columns
- Phoenix logging - Evaluations are logged for persistent tracking
Results Display
Your results table now includes these additional columns:
- Ground Truth: The correct answer from GAIA dataset
- Exact Match: True/False for exact matches
- Similarity: Similarity score (0-1)
- Contains Answer: True/False if correct answer is contained
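In pandas terms, the enhancement is a merge of the per-answer comparison results into the existing results frame; a sketch assuming both sides carry a task_id column and the result keys match the example shown further below:

```python
import pandas as pd

def enhance_results(results_df: pd.DataFrame, comparisons: list) -> pd.DataFrame:
    """Add the ground-truth comparison columns to the agent's results table."""
    comp_df = pd.DataFrame(comparisons).rename(
        columns={
            "actual_answer": "Ground Truth",
            "exact_match": "Exact Match",
            "similarity_score": "Similarity",
            "contains_answer": "Contains Answer",
        }
    )
    cols = ["task_id", "Ground Truth", "Exact Match", "Similarity", "Contains Answer"]
    # Assumes results_df also has a task_id column to join on.
    return results_df.merge(comp_df[cols], on="task_id", how="left")
```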
Status Message
The status message now includes:
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
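The summary figures are plain counts and means over the per-answer results; a sketch of how they could be assembled (summarize is a hypothetical helper, not part of the shipped code):

```python
def summarize(results: list) -> str:
    """Aggregate per-answer comparisons into the status-message summary."""
    scored = [r for r in results if r.get("error") is None]
    if not scored:
        return "Ground Truth Comparison: no ground truth available"
    n = len(scored)
    exact = sum(r["exact_match"] for r in scored)
    contains = sum(r["contains_answer"] for r in scored)
    avg_sim = sum(r["similarity_score"] for r in scored) / n
    return (
        "Ground Truth Comparison:\n"
        f"Exact matches: {exact}/{n} ({100 * exact / n:.1f}%)\n"
        f"Average similarity: {avg_sim:.3f}\n"
        f"Contains correct answer: {contains}/{n} ({100 * contains / n:.1f}%)"
    )
```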
Testing
Run the test suite to verify functionality:
python test_comparison.py
This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading
Files Added
- comparison.py: Main comparison logic and the AnswerComparator class
- phoenix_evaluator.py: Phoenix integration for logging evaluations
- test_comparison.py: Test suite for verification
- GAIA_COMPARISON.md: This documentation
Dependencies Added
- arize-phoenix: For observability and evaluation logging
- pandas: For data manipulation (if not already present)
Example Evaluation Result
{
"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
"predicted_answer": "3",
"actual_answer": "3",
"exact_match": True,
"similarity_score": 1.0,
"contains_answer": True,
"error": None
}
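A hypothetical call that would produce a result like the one above (the constructor and method names are illustrative; check comparison.py for the actual interface):

```python
from comparison import AnswerComparator  # class described under "Files Added"

# Hypothetical usage: the method name and signature may differ in comparison.py.
comparator = AnswerComparator("data/metadata.jsonl")
result = comparator.evaluate(
    task_id="8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    predicted_answer="3",
)
print(result["exact_match"], result["similarity_score"])
```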
Phoenix UI
In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis
Troubleshooting
No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.
Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.
Low Similarity Scores
Low similarity scores might indicate:
- Agent is providing verbose answers when short ones are expected
- Answer format doesn't match expected format
- Agent is partially correct but not exact
Customization
You can adjust the comparison logic in comparison.py:
- Modify normalize_answer() for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify Phoenix logging format
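For example, a numeric-tolerance metric could be added alongside the built-in ones (a sketch; numeric_match is not part of the shipped code):

```python
def numeric_match(predicted: str, actual: str, tol: float = 1e-6) -> bool:
    """Treat answers as equal if both parse as numbers within a tolerance."""
    try:
        return abs(float(predicted) - float(actual)) <= tol
    except ValueError:
        return False
```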
Performance
The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10ms
- Phoenix logging: ~10-50ms per evaluation
Total additional time: Usually < 5 seconds for 50 questions.