nielsr (HF Staff) committed
Commit 1dae94b · verified · 1 parent: 90596f9

Add pipeline tag and library name


Hi! I'm Niels from the community science team at Hugging Face.

I'm opening this PR to improve the model card's metadata. Specifically, I've added:
- `pipeline_tag: image-text-to-text`: This helps users find the model when filtering by task and correctly identifies it as a multimodal agent.
- `library_name: transformers`: Since the repository contains a `config.json` and uses a supported architecture, this enables the "Use in Transformers" button.
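
For illustration, here is a minimal sketch of what this metadata combination enables, assuming the checkpoint is compatible with the generic `transformers` pipeline API; the repo id and file name below are placeholders, not confirmed names:

```python
# Minimal sketch, assuming the checkpoint works with the generic
# image-text-to-text pipeline; "yolay/SmartSnap" and "screenshot.png"
# are placeholders, not confirmed names.
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="yolay/SmartSnap")

# A mobile-agent-style query: a GUI screenshot plus an instruction.
result = pipe(
    images="screenshot.png",
    text="Open the Settings app and enable Wi-Fi.",
    max_new_tokens=128,
)
print(result)
```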

I've also ensured the model is linked to its research paper. The rest of your detailed documentation and benchmark results remain unchanged.
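
After merging, the metadata can be sanity-checked programmatically. A small sketch using `huggingface_hub` (the repo id is again a placeholder):

```python
# Sanity-check the merged model card metadata; the repo id is a placeholder.
from huggingface_hub import ModelCard

card = ModelCard.load("yolay/SmartSnap")
print(card.data.pipeline_tag)  # expected: "image-text-to-text"
print(card.data.library_name)  # expected: "transformers"
print(card.data.license)       # expected: "apache-2.0"
```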

Files changed (1): README.md (+13 -27)
README.md CHANGED
@@ -1,23 +1,22 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen3-32B
 datasets:
 - yolay/SmartSnap-FT
 - yolay/SmartSnap-RL
 language:
 - en
+license: apache-2.0
 metrics:
 - accuracy
-base_model:
-- Qwen/Qwen3-32B
+library_name: transformers
+pipeline_tag: image-text-to-text
 tags:
 - agent
 - gui
 - mobile
 ---
 
-
-
-
 <div align="center">
 <img src="https://raw.githubusercontent.com/yuleiqin/images/master/SmartSnap/mascot_smartsnap.png" width="400"/>
 </div>
@@ -28,8 +27,9 @@ tags:
 &nbsp;
 </p>
 
+This repository contains the model checkpoints for **SmartSnap**, a paradigm presented in the paper [SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents](https://huggingface.co/papers/2512.22322).
 
-We introduce **SmartSnap**, a paradigm shift that transforms GUI agents📱💻🤖 from passive task executors into proactive self-verifiers. By empowering agents to curate their own evidence of success through the **3C Principles** (Completeness, Conciseness, Creativity), we eliminate the bottleneck of expensive post-hoc verification while boosting reliability and performance on complex mobile tasks.
+SmartSnap transforms GUI agents📱💻🤖 from passive task executors into proactive self-verifiers. By empowering agents to curate their own evidence of success through the **3C Principles** (Completeness, Conciseness, Creativity), the framework eliminates the bottleneck of expensive post-hoc verification while boosting reliability and performance on complex mobile tasks.
 
 # 📖 Overview
 
@@ -75,13 +75,9 @@ We release the following resources to accelerate research in self-verifying agents
 # 💡 Key take-home Messages
 
 - **Synergistic learning loop**: The dual mission of executing and verifying cultivates deeper task understanding—agents learn to decompose problems into evidence milestones, implicitly improving planning capabilities.
-
 - **Evidence quality matters**: Vanilla SFT only achieves ~22% SR across models, while self-verifying SFT reaches 23-30% SR, demonstrating that evidence curation training is more effective than solution memorization.
-
 - **RL unlocks generalization**: Fine-tuned models show consistent >16% absolute gains after RL training, with smaller models (8B) outperforming their naive prompting baselines by **26.08%**.
-
 - **Efficiency through conciseness**: Trained agents converge to submitting **~1.5 evidence snapshots** on average, drastically reducing verifier costs while maintaining high reliability.
-
 - **Limitations**: Tasks requiring extensive domain knowledge (e.g., Maps.me navigation) remain challenging without explicit knowledge injection, suggesting RL alone cannot bridge large knowledge gaps.
 
 # 📊 Experimental Results
@@ -93,20 +89,20 @@ We release the following resources to accelerate research in self-verifying agents
 | **PT** | Gemini-1.5-Pro | 18.84 | 22.40 | 57.72 | 83.99 |
 | **PT** | Gemini-1.00 | 8.70 | 10.75 | 51.80 | 71.08 |
 | **PT** | GLM4-Plus | 27.54 | 32.08 | 92.35 | 83.41 |
-| **PT** | DeepSeek-V3.1 | **36.23** | <u>40.95</u> | 81.01 | 94.63 |
-| **PT** | Qwen3-235B-A22B | <u>34.78</u> | 38.76 | 83.35 | 89.48 |
+| **PT** | DeepSeek-V3.1 | **36.23** | 40.95 | 81.01 | 94.63 |
+| **PT** | Qwen3-235B-A22B | 34.78 | 38.76 | 83.35 | 89.48 |
 | | **Act-only**<sup>*</sup> | | | | |
 | **PT** | LLaMA3.1-8B-Instruct<sup>‡</sup> | 2.17 | 3.62 | — | 52.77 |
 | **FT**<sup>†</sup> | LLaMA3.1-8B-Instruct<sup>‡</sup> | 23.91<sup>(+21.74%)</sup> | 30.31 | 75.58 | 92.46 |
 | **PT** | LLaMA3.1-8B-Instruct | 5.07 | 6.28 | 52.77 | 51.82 |
 | **FT**<sup>†</sup> | LLaMA3.1-8B-Instruct | 20.28<sup>(+15.21%)</sup> | 26.13 | 69.44 | 90.43 |
 | **FT (ours)** | LLaMA3.1-8B-Instruct | 23.91<sup>(+18.84%)</sup> | 30.36 | 37.96 | 83.23 |
-| **RL (ours)** | LLaMA3.1-8B-Instruct | 31.15<sup>(+26.08%)</sup> | 38.03 | 81.28 | <u>95.80</u> |
+| **RL (ours)** | LLaMA3.1-8B-Instruct | 31.15<sup>(+26.08%)</sup> | 38.03 | 81.28 | 95.80 |
 | | **ReAct** | | | | |
 | **PT** | Qwen2.5-7B-Instruct | 12.32 | 14.98 | 67.56 | 78.52 |
 | **FT**<sup>†</sup> | Qwen2.5-7B-Instruct | 20.28<sup>(+7.96%)</sup> | 27.05 | 35.52 | 62.46 |
 | **FT (ours)** | Qwen2.5-7B-Instruct | 30.15<sup>(+17.83%)</sup> | 36.59 | 49.19 | 73.28 |
-| **RL (ours)** | Qwen2.5-7B-Instruct | 30.43<sup>(+18.11%)</sup> | 35.20 | <u>102.30</u> | **96.36** |
+| **RL (ours)** | Qwen2.5-7B-Instruct | 30.43<sup>(+18.11%)</sup> | 35.20 | 102.30 | **96.36** |
 | **PT** | Qwen3-8B-Instruct | 10.14 | 12.38 | 66.21 | 67.15 |
 | **FT**<sup>†</sup> | Qwen3-8B-Instruct | 19.56<sup>(+9.41%)</sup> | 25.60 | 38.69 | 65.18 |
 | **FT (ours)** | Qwen3-8B-Instruct | 26.81<sup>(+16.66%)</sup> | 31.09 | 72.16 | 69.85 |
@@ -114,21 +110,12 @@ We release the following resources to accelerate research in self-verifying agents
 | **PT** | Qwen3-32B-Instruct | 18.12 | 21.80 | 91.99 | 87.57 |
 | **FT**<sup>†</sup> | Qwen3-32B-Instruct | 22.46<sup>(+4.34%)</sup> | 28.20 | 39.28 | 65.50 |
 | **FT (ours)** | Qwen3-32B-Instruct | 28.98<sup>(+10.86%)</sup> | 35.92 | 97.79 | 97.33 |
-| **RL (ours)** | Qwen3-32B-Instruct | <u>34.78</u><sup>(+16.66%)</sup> | 40.26 | 89.47 | 93.67 |
-
-
+| **RL (ours)** | Qwen3-32B-Instruct | 34.78<sup>(+16.66%)</sup> | 40.26 | 89.47 | 93.67 |
 
 *<sup>*</sup> LLaMA3.1 models only natively support tool calling w/o reasoning.*
 *<sup>†</sup> The Android Instruct dataset is used for fine-tuning where self-verification is not performed.*
 *<sup>‡</sup> The official results are cited here for comparison.*
 
-
----
-
-- **Performance gains**: All model families achieve >16% improvement over prompting baselines, reaching competitive performance with models 10-30× larger.
-- **RL dynamics**: Training reward increases consistently while intra-group variance decreases, indicating stable convergence despite occasional performance fluctuations in complex domains (Calendar, Zoom).
-- **App-specific analysis**: Dominant improvement in Settings (31% of training tasks) validates the importance of balanced task distribution.
-
 # 📝 Citation
 
 If you use SmartSnap in your research, please cite:
@@ -137,9 +124,8 @@ If you use SmartSnap in your research, please cite:
 @article{smartsnap2025,
   title={SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents},
   author={Shaofei Cai and Yulei Qin and Haojia Lin and Zihan Xu and Gang Li and Yuchen Shi and Zongyi Li and Yong Mao and Siqi Cai and Xiaoyu Tan and Yitao Liang and Ke Li and Xing Sun},
-  journal={arXiv preprint arXiv:2025},
+  journal={arXiv preprint arXiv:2512.22322},
   year={2025},
-  eprint={2512.22322},
   url={https://arxiv.org/abs/2512.22322},
 }
 ```