# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to the correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading

- Loads correct answers from `data/metadata.jsonl`
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset

### 2. Answer Comparison

For each agent answer, the system calculates:

- **Exact Match**: Boolean indicating whether the answers match exactly (after normalization)
- **Similarity Score**: 0-1 score using `difflib.SequenceMatcher`
- **Contains Answer**: Boolean indicating whether the correct answer is contained in the agent's response

### 3. Answer Normalization

Before comparison, answers are normalized by:

- Converting to lowercase
- Removing punctuation (`.,;:!?"'`)
- Normalizing whitespace
- Trimming leading/trailing spaces

### 4. Phoenix Integration

- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in the Phoenix UI for historical tracking and analysis

## Usage

### In Your Agent App

The comparison happens automatically when you run your agent:

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - System compares answers to ground truth
3. **Enhanced results** - Results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking
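For reference, the metrics and normalization rules described under "How It Works" boil down to a few lines of standard-library Python. The sketch below is illustrative only: `normalize_answer()` mirrors the rules listed above, and `compare_answer()` is a hypothetical helper name, not the exact API of `comparison.py`.

```python
import re
from difflib import SequenceMatcher


def normalize_answer(text: str) -> str:
    """Lowercase, strip the punctuation listed above, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[.,;:!?\"']", "", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # normalize and trim whitespace


def compare_answer(predicted: str, actual: str) -> dict:
    """Compute the three metrics described under "How It Works"."""
    pred, gold = normalize_answer(predicted), normalize_answer(actual)
    return {
        "exact_match": pred == gold,
        "similarity_score": SequenceMatcher(None, pred, gold).ratio(),
        "contains_answer": gold in pred,
    }


print(compare_answer("The answer is 3.", "3"))
# {'exact_match': False, 'similarity_score': 0.125, 'contains_answer': True}
```

Note that `SequenceMatcher.ratio()` penalizes verbose answers heavily, which is why the contains-answer check is reported as a separate column.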
### Results Display

Your results table now includes these additional columns:

- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False indicating whether the correct answer appears in the agent's response

### Status Message

The status message now includes:

```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:

- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and the `AnswerComparator` class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:

- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available

If you see "N/A" for ground truth, the question's `task_id` is not present in `data/metadata.jsonl`.

### Phoenix Connection Issues

If Phoenix logging fails, the comparison still works but the results are not persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores

Low similarity scores might indicate that:

- The agent provides verbose answers where short ones are expected
- The answer format doesn't match the expected format
- The agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py`:

- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify the Phoenix logging format

## Performance

The comparison adds minimal overhead:

- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10 ms
- Phoenix logging: ~10-50 ms per evaluation

Total additional time: usually under 5 seconds for 50 questions.
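The one-time ground truth loading amounts to a single pass over `data/metadata.jsonl`. The sketch below assumes the GAIA metadata field names `task_id` and `Final answer`; it is an illustration of the loading step, not the actual loader in `comparison.py`.

```python
import json


def load_ground_truth(path: str = "data/metadata.jsonl") -> dict:
    """Map each GAIA task_id to its ground truth answer."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Field names assumed from the GAIA metadata format.
            answers[record["task_id"]] = str(record["Final answer"])
    return answers


ground_truth = load_ground_truth()
print(f"Loaded {len(ground_truth)} ground truth answers")  # expected: 165
```

Questions whose `task_id` is missing from this map are the ones that show "N/A" in the results table (see Troubleshooting above).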