zhilinw committed · verified
Commit 1b7537e · Parent(s): d2b54aa

Update README.md

Files changed (1): README.md (+66 -28)

README.md CHANGED
@@ -22,12 +22,42 @@ library_name: transformers
22
 
23
 Llama-3.3-Nemotron-Super-49B-GenRM is a generative reward model that leverages Llama-3.3-Nemotron-Super-49B-v1 as the foundation and is fine-tuned using Reinforcement Learning to predict the quality of LLM-generated responses.
24
 
 
 
25
  See details on how this model was trained at [https://arxiv.org/abs/2505.11475](https://arxiv.org/abs/2505.11475)
26
 
 
 
27
  ## License/Terms of Use:
28
 
29
  GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) . Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.
30
31
 ## RM-Bench Leaderboard
32
 
33
  As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the top performing Bradley-Terry reward models on [RM-Bench](https://arxiv.org/abs/2410.16184), an improved variant of RewardBench for evaluating Reward Models in Chat, Math, Code and Safety. Our GenRMs also outperform the corresponding Bradley-Terry reward models.
@@ -65,34 +95,11 @@ As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the
65
 *Note that Skywork-Reward-Gemma-2-27B was the best-performing reward model reported on JudgeBench; we evaluated all other numbers ourselves.*
66
 
67
 
68
- ## Use Case:
69
-
70
- Llama-3.3-Nemotron-Super-49B-GenRM can be used to judge the quality of one response, or the ranking between two responses given an English conversation history. It will first generate reasoning traces then output an integer score.
71
-
72
-
73
- ## Release Date:
74
-
75
- 05/30/2025
76
-
77
-
78
- ## Referencess:
79
-
80
- * [HelpSteer3-Preference](https://arxiv.org/abs/2505.11475)
81
- * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
82
- * [SteerLM method](https://arxiv.org/abs/2310.05344)
83
- * [HelpSteer](https://arxiv.org/abs/2311.09528)
84
- * [HelpSteer2](https://arxiv.org/abs/2406.08673)
85
- * [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
86
- * [The future of AI: Built with Llama](https://ai.meta.com/blog/future-of-ai-built-with-llama/)
87
- * [Meta's Llama 3.3 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3)
88
- * [Meta's Llama 3.3 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)
89
-
90
-
91
  ## Model Architecture:
92
  **Architecture Type:** Transformer <br>
93
  **Network Architecture:** Llama-3.3-Nemotron-Super-49B-v1 <br>
94
 
95
- We developed this model using Llama-3.3-Nemotron-Super-49B-v1 as its foundation. This model contains 49 billion parameters.
96
 
97
  ## Input:
98
  **Input Type(s):** Text <br>
@@ -106,19 +113,21 @@ We developed this model using Llama-3.3-Nemotron-Super-49B-v1 as its foundation.
106
  **Output Parameters:** One-Dimensional (1D) <br>
107
  **Other Properties Related to Output:** The output contains a reasoning trace and a final score. <br>
108
 
 
 
109
  ## Software Integration:
110
  **Runtime Engine(s):** <br>
111
  * vLLM 0.8.3 <br>
112
 
113
  **Supported Hardware Microarchitecture Compatibility:** <br>
114
  * NVIDIA Ampere <br>
115
- * NVIDIA Hopper
116
 
117
  **Supported Operating System(s):** Linux <br>
118
 
119
  ## Quick Start
120
 
121
- We recommend serving the model with vLLM. You can use the model with 2 or more 80GB GPUs (NVIDIA Ampere or newer) with at least 100GB of free disk space to accomodate the download.
122
 
123
  ```
124
  pip install vllm==0.8.3
@@ -208,7 +217,7 @@ Response 2 correctly states "1+2=3". It is accurate, clear, relevant, and fully
208
  ```
209
 Note that the conversation history should be presented in "user" and "assistant" roles, where the last turn is a user turn. The responses to be judged should be in "response_1" (and "response_2") roles.
210
 
211
- ### Intepretation of Scores
212
  When judging one response, the model will generate a helpfulness score from 1 to 5, where higher is better.
213
 
214
  When judging two responses, the model will generate an individual helpfulness score for each response, then a ranking score. The ranking score is a number between 1 and 6, where:
@@ -230,7 +239,7 @@ For details, please see Appendix J in our [paper](https://arxiv.org/abs/2505.114
230
  ## Model Version:
231
  v1.0
232
 
233
- # Training and Testing Datasets:
234
 
235
  ## Training Datasets:
236
 
@@ -260,6 +269,33 @@ v1.0
260
  **Properties:** <br>
261
  * 2,017 prompts, each with a pair of responses as well as human preferences between the pair of responses.
262
 
263
 
264
  # Inference:
265
  **Engine:** vLLM <br>
@@ -268,6 +304,8 @@ v1.0
268
 
269
  ## Ethical Considerations:
270
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
 
 
271
  Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
272
 
273
  ## Citation
 
22
 
23
 Llama-3.3-Nemotron-Super-49B-GenRM is a generative reward model that leverages Llama-3.3-Nemotron-Super-49B-v1 as the foundation and is fine-tuned using Reinforcement Learning to predict the quality of LLM-generated responses.
24
 
25
+ Llama-3.3-Nemotron-Super-49B-GenRM can be used to judge the quality of one response, or the ranking between two responses, given an English conversation history. It will first generate reasoning traces and then output an integer score. A higher score means the response is of higher quality.
26
+
27
  See details on how this model was trained at [https://arxiv.org/abs/2505.11475](https://arxiv.org/abs/2505.11475)
28
 
29
+ This model is ready for commercial/non-commercial use.
30
+
31
  ## License/Terms of Use:
32
 
33
  GOVERNING TERMS: Use of this model is governed by the [NVIDIA Open Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/) . Additional Information: [Llama 3.3 Community License Agreement](https://www.llama.com/llama3_3/license/). Built with Llama.
34
 
35
+ ### Deployment Geography
36
+
37
+ Global
38
+
39
+ ## Use Case:
40
+
41
+ Llama-3.3-Nemotron-Super-49B-GenRM can be used to judge the quality of one response, or the ranking between two responses, given an English conversation history. It will first generate reasoning traces and then output an integer score.
42
+
43
+
44
+ ## Release Date:
45
+
46
+ HuggingFace 06/27/2025 via https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-GenRM
47
+
48
+ ## References:
49
+
50
+ * [HelpSteer3-Preference](https://arxiv.org/abs/2505.11475)
51
+ * [HelpSteer2-Preference](https://arxiv.org/abs/2410.01257)
52
+ * [SteerLM method](https://arxiv.org/abs/2310.05344)
53
+ * [HelpSteer](https://arxiv.org/abs/2311.09528)
54
+ * [HelpSteer2](https://arxiv.org/abs/2406.08673)
55
+ * [Llama-Nemotron: Efficient Reasoning Models](https://arxiv.org/abs/2505.00949)
56
+ * [The future of AI: Built with Llama](https://ai.meta.com/blog/future-of-ai-built-with-llama/)
57
+ * [Meta's Llama 3.3 Webpage](https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_3)
58
+ * [Meta's Llama 3.3 Model Card](https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md)
59
+
60
+
61
 ## RM-Bench Leaderboard
62
 
63
  As of 15 May 2025, our reward models trained with HelpSteer3-Preference are the top performing Bradley-Terry reward models on [RM-Bench](https://arxiv.org/abs/2410.16184), an improved variant of RewardBench for evaluating Reward Models in Chat, Math, Code and Safety. Our GenRMs also outperform the corresponding Bradley-Terry reward models.
 
95
 *Note that Skywork-Reward-Gemma-2-27B was the best-performing reward model reported on JudgeBench; we evaluated all other numbers ourselves.*
96
 
97
 
98
  ## Model Architecture:
99
  **Architecture Type:** Transformer <br>
100
  **Network Architecture:** Llama-3.3-Nemotron-Super-49B-v1 <br>
101
 
102
+ We developed this model using [Llama-3.3-Nemotron-Super-49B-v1](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1) as its foundation. This model contains 49 billion parameters.
103
 
104
  ## Input:
105
  **Input Type(s):** Text <br>
 
113
  **Output Parameters:** One-Dimensional (1D) <br>
114
  **Other Properties Related to Output:** The output contains a reasoning trace and a final score. <br>
115
 
116
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions. <br>
117
+
118
  ## Software Integration:
119
  **Runtime Engine(s):** <br>
120
  * vLLM 0.8.3 <br>
121
 
122
  **Supported Hardware Microarchitecture Compatibility:** <br>
123
  * NVIDIA Ampere <br>
124
+ * NVIDIA Hopper <br>
125
 
126
  **Supported Operating System(s):** Linux <br>
127
 
128
  ## Quick Start
129
 
130
+ We recommend serving the model with vLLM. You can use the model with 2 or more 80GB GPUs (NVIDIA Ampere or newer) with at least 100GB of free disk space to accommodate the download.
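+ Once vLLM is installed (see the `pip install` command in this section), launching an OpenAI-compatible server across two GPUs could look like the following sketch; `--tensor-parallel-size` and `--port` are standard vLLM serve options, and the values shown are examples to adjust for your setup:

```shell
# Sketch: serve the model across 2 GPUs with an OpenAI-compatible API.
# Tensor-parallel size and port are example values, not requirements.
vllm serve nvidia/Llama-3_3-Nemotron-Super-49B-GenRM \
    --tensor-parallel-size 2 \
    --port 8000
```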
131
 
132
  ```
133
  pip install vllm==0.8.3
 
217
  ```
218
 Note that the conversation history should be presented in "user" and "assistant" roles, where the last turn is a user turn. The responses to be judged should be in "response_1" (and "response_2") roles.
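+ Assembling that payload can be sketched as follows; the helper name is ours, but the role layout matches the description above:

```python
def build_pairwise_messages(conversation, response_1, response_2):
    """Assemble the message list for pairwise judging.

    The conversation uses "user"/"assistant" roles and must end with a
    user turn; the two candidate responses go in the "response_1" and
    "response_2" roles, as described above.
    """
    if conversation[-1]["role"] != "user":
        raise ValueError("conversation history must end with a user turn")
    return conversation + [
        {"role": "response_1", "content": response_1},
        {"role": "response_2", "content": response_2},
    ]

messages = build_pairwise_messages(
    [{"role": "user", "content": "What is 1+2?"}],
    "1+2=3",
    "1+2=4",
)
```

+ The resulting list can then be sent as the `messages` field of a chat request to a vLLM server hosting this model.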
219
 
220
+ ### Interpretation of Scores
221
  When judging one response, the model will generate a helpfulness score from 1 to 5, where higher is better.
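+ Since the output is a reasoning trace followed by a final score, the integer can be recovered programmatically; this minimal sketch assumes the score is the last integer in the generation (an assumption about the output format, not a documented guarantee):

```python
import re

def extract_final_score(generation: str) -> int:
    """Return the last integer in the model's output.

    Treating the last integer in the text as the final score is our
    assumption about the output format.
    """
    matches = re.findall(r"-?\d+", generation)
    if not matches:
        raise ValueError("no integer score found in generation")
    return int(matches[-1])

score = extract_final_score(
    "The response is accurate, clear, and fully addresses the prompt. Score: 4"
)
```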
222
 
223
  When judging two responses, the model will generate an individual helpfulness score for each response, then a ranking score. The ranking score is a number between 1 and 6, where:
 
239
  ## Model Version:
240
  v1.0
241
 
242
+ # Training, Testing and Evaluation Datasets:
243
 
244
  ## Training Datasets:
245
 
 
269
  **Properties:** <br>
270
  * 2,017 prompts, each with a pair of responses as well as human preferences between the pair of responses.
271
 
272
+ ## Evaluation Datasets
273
+
274
+ **Dataset Name:** RM-Bench <br>
275
+ **Dataset Link:** https://huggingface.co/datasets/THU-KEG/RM-Bench
276
+
277
+ **Data Collection Method by dataset** <br>
278
+ * [Hybrid: Human, Synthetic] <br>
279
+
280
+ **Labeling Method by dataset** <br>
281
+ * [Hybrid: Human, Synthetic] <br>
282
+
283
+ **Properties:** <br>
284
+ * 1,327 prompts, each with three pairs of responses as well as preferences between each pair of responses.
285
+
286
+
287
+ **Dataset Name:** JudgeBench <br>
288
+ **Dataset Link:** https://huggingface.co/datasets/ScalerLab/JudgeBench
289
+
290
+ **Data Collection Method by dataset** <br>
291
+ * [Hybrid: Human, Synthetic] <br>
292
+
293
+ **Labeling Method by dataset** <br>
294
+ * [Hybrid: Human, Synthetic] <br>
295
+
296
+ **Properties:** <br>
297
+ * 350 prompts, each with a pair of responses as well as a preference between the two responses.
298
+
299
 
300
  # Inference:
301
  **Engine:** vLLM <br>
 
304
 
305
  ## Ethical Considerations:
306
  NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
307
+ For more detailed information on ethical considerations for this model, please see the Model Card++ [Explainability](explainability.md), [Bias](bias.md), [Safety & Security](safety.md), and [Privacy](privacy.md) Subcards.
308
+
309
  Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
310
 
311
  ## Citation