---
language:
- en
pipeline_tag: video-classification
tags:
- birds
- swifts
- MViTv2
- Ballinrobe
license: other
license_name: bcs-lcs
license_link: LICENSE
base_model:
- timm/mvitv2_small.fb_in1k
library_name: transformers
datasets:
- odinglynn/swift-150
---

# SwiftViT-150

MViT-v2 fine-tuned on 150 videos for common swift feeding behavior classification.

## Model

Fine-tuned `mvit_v2_s` (Kinetics-400 pretrained) on single-camera nestbox footage. Reaches ~87% validation accuracy (in controlled settings) and shows surprising cross-camera generalization despite being trained on a single viewpoint and a minuscule dataset (150 samples).

## Usage

```python
import torch
import torchvision

# Build the MViT-v2 Small backbone without pretrained weights;
# the fine-tuned checkpoint supplies all parameters.
model = torchvision.models.video.mvit_v2_s(weights=None)

# Replace the default classification head with the custom 768→512→3 head.
model.head = torch.nn.Sequential(
    torch.nn.Dropout(0.5),
    torch.nn.Linear(768, 512),
    torch.nn.GELU(),
    torch.nn.Dropout(0.3),
    torch.nn.Linear(512, 3),
)

# map_location="cpu" lets the checkpoint load on machines without a GPU.
checkpoint = torch.load("swiftvit-150.pth", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Inference
with torch.no_grad():
    video = load_video()  # placeholder for your own loader; shape: [C, T, H, W]
    output = model(video.unsqueeze(0))  # add batch dimension: [1, C, T, H, W]
    prediction = torch.argmax(output, dim=1)
    # 0: feeding, 1: possible_feeding, 2: not_feeding
```
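
The `load_video()` call above is not defined in this card, and the card does not say how the 16 input frames are chosen from a longer recording. A minimal sketch, assuming uniform temporal sampling (an assumption, not a documented detail of the training pipeline), picks evenly spaced frame indices. `sample_frame_indices` is a hypothetical helper name:

```python
def sample_frame_indices(total_frames: int, clip_len: int = 16) -> list[int]:
    """Return `clip_len` evenly spaced frame indices in [0, total_frames).

    Assumption: uniform sampling over the whole recording. Videos shorter
    than `clip_len` repeat frames so the clip still has `clip_len` entries.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    # Take the centre of each of `clip_len` equal-width segments.
    return [
        min(total_frames - 1, int((i + 0.5) * total_frames / clip_len))
        for i in range(clip_len)
    ]


indices = sample_frame_indices(total_frames=160)
print(indices)
```

The frames at those indices would then be resized to 224x224 and stacked into a `[C, T, H, W]` tensor before being passed to the model.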

## Architecture

- Base: MViT-v2 Small (about 34.5M params in the torchvision `mvit_v2_s` implementation)
- Head: Custom 768→512→3 with dropout
- Input: 16 frames @ 224x224
- Classes: 3 (feeding, possible_feeding, not_feeding)
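
As a quick check on the head described above, the number of trainable parameters in its two linear layers can be worked out by hand (dropout and GELU have none). This is plain arithmetic on the layer sizes listed in this card; `linear_params` is an illustrative helper, not part of any library:

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


# Custom head: 768 -> 512 -> 3
hidden = linear_params(768, 512)  # 768*512 + 512 = 393,728
logits = linear_params(512, 3)    # 512*3 + 3     = 1,539
head_total = hidden + logits
print(head_total)  # 395267
```

The head therefore adds roughly 0.4M parameters on top of the backbone.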

## Training

- 120 train / 30 val samples
- Batch size: 4
- Optimizer: AdamW (lr=1e-4, wd=0.05)
- Scheduler: CosineAnnealingWarmRestarts
- Mixed precision training on H100
- Early stopping: 40 epoch patience
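
The list above names the scheduler and the early-stopping patience but not the restart period, the minimum learning rate, or the monitored metric. The sketch below illustrates both mechanisms in plain Python under stated assumptions: `t_0=10` epochs per cycle, fixed cycle length, `eta_min=0`, and a higher-is-better metric are hypothetical choices for illustration, not the settings used for this model. Only `lr=1e-4` and the patience of 40 epochs come from this card.

```python
import math


def cosine_warm_restart_lr(epoch: int, base_lr: float = 1e-4,
                           t_0: int = 10, eta_min: float = 0.0) -> float:
    """Learning rate at `epoch` for cosine annealing with fixed-length restarts.

    t_0 and eta_min are assumed values; the card does not state them.
    """
    t_cur = epoch % t_0  # position inside the current cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))


class EarlyStopping:
    """Stop when the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience: int = 40):
        self.patience = patience
        self.best = float("-inf")
        self.epochs_without_improvement = 0

    def step(self, metric: float) -> bool:
        """Record one epoch's metric; return True if training should stop."""
        if metric > self.best:
            self.best = metric
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

In a training loop, `cosine_warm_restart_lr(epoch)` would give the rate for each epoch (it jumps back to `base_lr` at every restart), and `EarlyStopping().step(val_accuracy)` would be called once per epoch to decide whether to stop.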

## Performance

- Train accuracy: 100%
- Val accuracy: 87%
- Unexpected cross-camera generalization observed
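
With only 30 validation samples, each clip is worth about 3.3 percentage points, so the reported 87% is a coarse figure. A small sketch, assuming the accuracy was computed over all 30 validation clips (the card does not say so explicitly), recovers the implied number of correct predictions and a 95% Wilson score interval as a rough indication of the uncertainty:

```python
import math

N_VAL = 30  # validation samples, from the Training section

# Which counts of correct predictions round to the reported 87%?
candidates = [k for k in range(N_VAL + 1) if round(100 * k / N_VAL) == 87]
print(candidates)  # [26]


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion (z = 1.96)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


low, high = wilson_interval(26, N_VAL)
print(f"{low:.2f} to {high:.2f}")
```

Under that assumption, 26 of 30 clips were classified correctly, and the interval spans roughly 70% to 95%, which is worth keeping in mind when comparing against other models.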

## Dataset

Trained on [swift-150](https://huggingface.co/datasets/odinglynn/swift-150): 150 videos from the GABLE nestbox camera (Ireland, 2020-2025).

## Context

Part of climate research correlating swift feeding patterns with weather data at terabyte scale. Ballinrobe Community School entry for REDACTED.

## Citation

If you reference this work, cite:

```bibtex
@misc{swift150bcs,
  title={Swift-150: A Dataset for Common Swift Feeding Behavior Analysis},
  author={Odin Glynn-Martin and Culan O'Meara and Anas Rashid and Shayden D'Souza and Pádraig Foley and Mark Lally},
  year={2025},
  institution={Ballinrobe Community School},
  url={https://ballinrobecommunityschool.ie},
  note={REDACTED - Entry 2025}
}
```

## License

Proprietary. See LICENSE for restrictions.