# GAIA Ground Truth Comparison

This project now includes automatic comparison of your agent's answers against the GAIA dataset ground truth, with Phoenix observability integration.

## Features

- **Ground Truth Comparison**: Automatically compares agent answers to the correct answers from `data/metadata.jsonl`
- **Multiple Evaluation Metrics**: Exact match, similarity score, and contains-answer detection
- **Phoenix Integration**: Logs evaluations to Phoenix for persistent tracking and analysis
- **Enhanced Results Display**: Shows ground truth and comparison results in the Gradio interface

## How It Works

### 1. Ground Truth Loading

- Loads correct answers from `data/metadata.jsonl`
- Maps task IDs to ground truth answers
- Currently supports 165 questions from the GAIA dataset

### 2. Answer Comparison

For each agent answer, the system calculates:

- **Exact Match**: Boolean indicating whether the answers match exactly (after normalization)
- **Similarity Score**: 0-1 score using `difflib.SequenceMatcher`
- **Contains Answer**: Boolean indicating whether the correct answer is contained in the agent's response

### 3. Answer Normalization

Before comparison, answers are normalized by:

- Converting to lowercase
- Removing punctuation (`.,;:!?"'`)
- Normalizing whitespace
- Trimming leading/trailing spaces

### 4. Phoenix Integration

- Evaluations are automatically logged to Phoenix
- Each evaluation includes score, label, explanation, and detailed metrics
- Viewable in the Phoenix UI for historical tracking and analysis

## Usage

### In Your Agent App

The comparison happens automatically when you run your agent:

1. **Run your agent** - Process questions as usual
2. **Automatic comparison** - System compares answers to ground truth
3. **Enhanced results** - Results table includes comparison columns
4. **Phoenix logging** - Evaluations are logged for persistent tracking
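For reference, the metrics and normalization rules described under "How It Works" boil down to a few lines of standard-library Python. The sketch below is illustrative only: `normalize_answer()` mirrors the rules listed above, and `compare_answer()` is a hypothetical helper name, not the exact API of `comparison.py`.

```python
import re
from difflib import SequenceMatcher


def normalize_answer(text: str) -> str:
    """Lowercase, strip the punctuation listed above, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[.,;:!?\"']", "", text)   # remove punctuation
    return re.sub(r"\s+", " ", text).strip()  # normalize and trim whitespace


def compare_answer(predicted: str, actual: str) -> dict:
    """Compute the three metrics described under "How It Works"."""
    pred, gold = normalize_answer(predicted), normalize_answer(actual)
    return {
        "exact_match": pred == gold,
        "similarity_score": SequenceMatcher(None, pred, gold).ratio(),
        "contains_answer": gold in pred,
    }


print(compare_answer("The answer is 3.", "3"))
# {'exact_match': False, 'similarity_score': 0.125, 'contains_answer': True}
```

Note that `SequenceMatcher.ratio()` penalizes verbose answers heavily, which is why the contains-answer check is reported as a separate column.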
### Results Display

Your results table now includes these additional columns:

- **Ground Truth**: The correct answer from the GAIA dataset
- **Exact Match**: True/False for exact matches
- **Similarity**: Similarity score (0-1)
- **Contains Answer**: True/False indicating whether the correct answer appears in the agent's response

### Status Message

The status message now includes:

```
Ground Truth Comparison:
Exact matches: 15/50 (30.0%)
Average similarity: 0.654
Contains correct answer: 22/50 (44.0%)
Evaluations logged to Phoenix ✅
```

## Testing

Run the test suite to verify functionality:

```bash
python test_comparison.py
```

This will test:

- Basic comparison functionality
- Results enhancement
- Phoenix integration
- Ground truth loading

## Files Added

- `comparison.py`: Main comparison logic and the `AnswerComparator` class
- `phoenix_evaluator.py`: Phoenix integration for logging evaluations
- `test_comparison.py`: Test suite for verification
- `GAIA_COMPARISON.md`: This documentation

## Dependencies Added

- `arize-phoenix`: For observability and evaluation logging
- `pandas`: For data manipulation (if not already present)

## Example Evaluation Result

```python
{
    "task_id": "8e867cd7-cff9-4e6c-867a-ff5ddc2550be",
    "predicted_answer": "3",
    "actual_answer": "3",
    "exact_match": True,
    "similarity_score": 1.0,
    "contains_answer": True,
    "error": None
}
```

## Phoenix UI

In the Phoenix interface, you can:

- View evaluation results alongside agent traces
- Track accuracy over time
- Filter by correct/incorrect answers
- Analyze which question types your agent struggles with
- Export evaluation data for further analysis

## Troubleshooting

### No Ground Truth Available

If you see "N/A" for ground truth, the question's `task_id` is not present in `data/metadata.jsonl`.

### Phoenix Connection Issues

If Phoenix logging fails, the comparison still works but the results are not persisted. Ensure Phoenix is running and accessible.

### Low Similarity Scores

Low similarity scores might indicate that:

- The agent provides verbose answers where short ones are expected
- The answer format doesn't match the expected format
- The agent is partially correct but not exact

## Customization

You can adjust the comparison logic in `comparison.py`:

- Modify `normalize_answer()` for different normalization rules
- Adjust similarity thresholds
- Add custom evaluation metrics
- Modify the Phoenix logging format

## Performance

The comparison adds minimal overhead:

- Ground truth loading: ~1-2 seconds (one-time)
- Per-answer comparison: ~1-10 ms
- Phoenix logging: ~10-50 ms per evaluation

Total additional time: usually under 5 seconds for 50 questions.
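The one-time ground truth loading amounts to a single pass over `data/metadata.jsonl`. The sketch below assumes the GAIA metadata field names `task_id` and `Final answer`; it is an illustration of the loading step, not the actual loader in `comparison.py`.

```python
import json


def load_ground_truth(path: str = "data/metadata.jsonl") -> dict:
    """Map each GAIA task_id to its ground truth answer."""
    answers = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # Field names assumed from the GAIA metadata format.
            answers[record["task_id"]] = str(record["Final answer"])
    return answers


ground_truth = load_ground_truth()
print(f"Loaded {len(ground_truth)} ground truth answers")  # expected: 165
```

Questions whose `task_id` is missing from this map are the ones that show "N/A" in the results table (see Troubleshooting above).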