GAIA Ground Truth Comparison
This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.
Features
- Ground Truth Comparison: Automatically compares agent answers to the correct answers from data/metadata.jsonl
- Multiple Evaluation Metrics: Exact match, similarity score, and contains-answer detection
- Phoenix Integration: Logs evaluations to Phoenix for persistent tracking and analysis
- Enhanced Results Display: Shows ground truth and comparison results in the Gradio interface
How It Works
1. Ground Truth Loading
- Loads correct answers from data/metadata.jsonl
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset
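In code terms, the loading step is just reading one JSON object per line and indexing it by task ID. A minimal sketch (the "Final answer" field name follows the GAIA metadata format; the real loader lives in comparison.py and may differ):

```python
import json

def load_ground_truth(path: str = "data/metadata.jsonl") -> dict:
    """Map each task_id to its ground-truth answer from the GAIA metadata file."""
    ground_truth = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # GAIA metadata stores the reference answer under "Final answer"
            ground_truth[record["task_id"]] = str(record.get("Final answer", ""))
    return ground_truth
```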
2. Answer Comparison
For each agent answer, the system calculates:
- Exact Match: Boolean indicating if answers match exactly (after normalization)
- Similarity Score: 0-1 score using difflib.SequenceMatcher
- Contains Answer: Boolean indicating if the correct answer is contained in the agent's response
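All three metrics map onto the standard library; the sketch below shows one way to compute them (compare_answer is illustrative, the shipped logic lives in the AnswerComparator class in comparison.py):

```python
import difflib

def compare_answer(predicted: str, actual: str, normalize=str.lower) -> dict:
    """Compute exact match, similarity, and containment on normalized answers."""
    pred, act = normalize(predicted).strip(), normalize(actual).strip()
    return {
        "exact_match": pred == act,
        "similarity_score": difflib.SequenceMatcher(None, pred, act).ratio(),
        "contains_answer": act in pred,
    }
```

In practice the full normalize_answer() described in the next section would be passed in place of the simple str.lower default.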
3. Answer Normalization
Before comparison, answers are normalized by:
- Converting to lowercase
- Removing punctuation (.,;:!?"')
- Normalizing whitespace
- Trimming leading/trailing spaces
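These four rules translate directly into a small helper; a sketch of normalize_answer (the implementation in comparison.py may differ in detail):

```python
import re

def normalize_answer(answer: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace before comparison."""
    text = answer.lower()
    text = re.sub(r"[.,;:!?\"']", "", text)  # remove punctuation
    text = re.sub(r"\s+", " ", text)         # normalize whitespace
    return text.strip()                      # trim leading/trailing spaces
```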
4. Phoenix Integration
- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in Phoenix UI for historical tracking and analysis
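A minimal sketch of the logging call, assuming the arize-phoenix SpanEvaluations API with a DataFrame indexed by span id; the span ids, eval name, and values below are placeholders:

```python
import pandas as pd
import phoenix as px
from phoenix.trace import SpanEvaluations

# One row per evaluated answer, indexed by the span id of the agent run (placeholders).
evals_df = pd.DataFrame(
    {
        "score": [1.0, 0.4],
        "label": ["correct", "incorrect"],
        "explanation": ["exact match", "similarity 0.40, answer not contained"],
    },
    index=pd.Index(["<span-id-1>", "<span-id-2>"], name="context.span_id"),
)

px.Client().log_evaluations(
    SpanEvaluations(eval_name="GAIA Ground Truth", dataframe=evals_df)
)
```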
Usage
In Your Agent App
The comparison happens automatically when you run your agent:
- Run your agent - Process questions as usual
- Automatic comparison - System compares answers to ground truth
- Enhanced results - Results table includes comparison columns
- Phoenix logging - Evaluations are logged for persistent tracking
Results Display
Your results table now includes these additional columns:
- Ground Truth: The correct answer from GAIA dataset
- Exact Match: True/False for exact matches
- Similarity: Similarity score (0-1)
- Contains Answer: True/False if correct answer is contained
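In pandas terms, the enhancement is a merge of the per-answer comparison results into the existing results frame; a sketch assuming both sides carry a task_id column and the result keys match the example shown further below:

```python
import pandas as pd

def enhance_results(results_df: pd.DataFrame, comparisons: list) -> pd.DataFrame:
    """Add the ground-truth comparison columns to the agent's results table."""
    comp_df = pd.DataFrame(comparisons).rename(
        columns={
            "actual_answer": "Ground Truth",
            "exact_match": "Exact Match",
            "similarity_score": "Similarity",
            "contains_answer": "Contains Answer",
        }
    )
    cols = ["task_id", "Ground Truth", "Exact Match", "Similarity", "Contains Answer"]
    # Assumes results_df also has a task_id column to join on.
    return results_df.merge(comp_df[cols], on="task_id", how="left")
```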
Status Message
The status message now includes:
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
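The summary figures are plain counts and means over the per-answer results; a sketch of how they could be assembled (summarize is a hypothetical helper, not part of the shipped code):

```python
def summarize(results: list) -> str:
    """Aggregate per-answer comparisons into the status-message summary."""
    scored = [r for r in results if r.get("error") is None]
    if not scored:
        return "Ground Truth Comparison: no ground truth available"
    n = len(scored)
    exact = sum(r["exact_match"] for r in scored)
    contains = sum(r["contains_answer"] for r in scored)
    avg_sim = sum(r["similarity_score"] for r in scored) / n
    return (
        "Ground Truth Comparison:\n"
        f"Exact matches: {exact}/{n} ({100 * exact / n:.1f}%)\n"
        f"Average similarity: {avg_sim:.3f}\n"
        f"Contains correct answer: {contains}/{n} ({100 * contains / n:.1f}%)"
    )
```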
Testing
Run the test suite to verify functionality:
python test_comparison.py
This will test:
- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading
Files Added
- comparison.py: Main comparison logic and the AnswerComparator class
- phoenix_evaluator.py: Phoenix integration for logging evaluations
- test_comparison.py: Test suite for verification
- GAIA_COMPARISON.md: This documentation
Dependencies Added
- arize-phoenix: For observability and evaluation logging
- pandas: For data manipulation (if not already present)
Example Evaluation Result
{
"task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
"predicted_answer": "3",
"actual_answer": "3",
"exact_match": True,
"similarity_score": 1.0,
"contains_answer": True,
"error": None
}
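A hypothetical call that would produce a result like the one above (the constructor and method names are illustrative; check comparison.py for the actual interface):

```python
from comparison import AnswerComparator  # class described under "Files Added"

# Hypothetical usage: the method name and signature may differ in comparison.py.
comparator = AnswerComparator("data/metadata.jsonl")
result = comparator.evaluate(
    task_id="8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    predicted_answer="3",
)
print(result["exact_match"], result["similarity_score"])
```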
Phoenix UI
In the Phoenix interface, you can:
- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis
Troubleshooting
No Ground Truth Available
If you see "N/A" for ground truth, the question's task_id is not in the metadata.jsonl file.
Phoenix Connection Issues
If Phoenix logging fails, the comparison will still work but won't be persisted. Ensure Phoenix is running and accessible.
Low Similarity Scores
Low similarity scores might indicate:
- Agent is providing verbose answers when short ones are expected
- Answer format doesn't match expected format
- Agent is partially correct but not exact
Customization
You can adjust the comparison logic in comparison.py:
- Modify normalize_answer() for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify Phoenix logging format
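For example, a numeric-tolerance metric could be added alongside the built-in ones (a sketch; numeric_match is not part of the shipped code):

```python
def numeric_match(predicted: str, actual: str, tol: float = 1e-6) -> bool:
    """Treat answers as equal if both parse as numbers within a tolerance."""
    try:
        return abs(float(predicted) - float(actual)) <= tol
    except ValueError:
        return False
```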
Performance
The comparison adds minimal overhead:
- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10ms
- Phoenix logging: ~10-50ms per evaluation
Total additional time: Usually < 5 seconds for 50 questions.