amalad committed
Commit: cd89c5c · Parent: 0f76577

Update README

Files changed (2)
  1. README.md +2 -2
  2. explainability.md +2 -2
README.md CHANGED
@@ -41,7 +41,7 @@ Global
 
 Customers: AI foundry enterprise customers
 
- Use Cases: Image summarization. Text-image analysis, Optical Character Recognition, Interactive Q&A on images, Comparison and contrast of multiple images, Text Chain-of-Thought reasoning.
+ Use Cases: Image summarization. Text-image analysis, Optical Character Recognition, Interactive Q&A on images, Text Chain-of-Thought reasoning
 
 
 ## Release Date:
@@ -62,7 +62,7 @@ Language Encoder: Llama-3.1-8B-Instruct
 ### Input
 
 Input Type(s): Image, Text
- - Input Images Supported: Multiple images within 16K input + output tokens
+ - Input Images
 - Language Supported: English only
 
 Input Format(s): Image (Red, Green, Blue (RGB)), and Text (String)
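
The input section retained above still describes image-plus-text prompting. As a minimal sketch (not taken from the card), assuming the model is served behind a vLLM OpenAI-compatible endpoint, an image question could be sent as follows; the base URL, API key, and model id are placeholders:

```python
# Sketch only: image + text chat request against an OpenAI-compatible vLLM
# server. Base URL, API key, and model id are placeholders, not values from
# this repository.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="example/vision-language-model",  # placeholder model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```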
explainability.md CHANGED
@@ -4,9 +4,9 @@ Intended Application & Domain:
 Model Type: | Transformer
 Intended Users: | Generative AI creators working with conversational AI models and image content.
 Output: | Text (Responds to posed question, stateful - remembers previous answers)
- Describe how the model works: | Chat based on image/video content
+ Describe how the model works: | Chat based on image/text
 Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable
- Technical Limitations: | Max Number of images supported: 4.<br><br>**Context Length:** Supports up to 16,000 tokens total (input + output). If exceeded, input is truncated from the start, and generation ends with an EOS token. Longer prompts may risk performance loss.<br><br>If the model fails (e.g., generates incorrect responses, repeats, or gives poor responses), issues are diagnosed via benchmarks, human review, and internal debugging tools. Only use NVIDIA provided models that use safetensors format. <br><br>Do not expose the vLLM host to a network where any untrusted connections may reach the host. Only use NVIDIA provided models that use safetensors format.
+ Technical Limitations: | <br>**Context Length:** Supports up to 16,000 tokens total (input + output). If exceeded, input is truncated from the start, and generation ends with an EOS token. Longer prompts may risk performance loss.<br><br>If the model fails (e.g., generates incorrect responses, repeats, or gives poor responses), issues are diagnosed via benchmarks, human review, and internal debugging tools. Only use NVIDIA provided models that use safetensors format. <br><br>Do not expose the vLLM host to a network where any untrusted connections may reach the host. Only use NVIDIA provided models that use safetensors format.
 Verified to have met prescribed NVIDIA quality standards: | Yes
 Performance Metrics: | MMMU Val with chatGPT as a judge, AI2D, ChartQA Test, InfoVQA Val, OCRBench, OCRBenchV2 English, OCRBenchV2 Chinese, DocVQA val, VideoMME (16 frames), SlideQA (F1)
 Potential Known Risks: | The Model may produce output that is biased, toxic, or incorrect responses. Therefore, the model may amplify those biases and return toxic responses especially when prompted with toxic prompts. The Model may also generate answers that may be inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.<br>While we have taken safety and security into account and are continuously improving, outputs may still contain political content, misleading information, or unwanted bias beyond our control.
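
The context-length limitation kept above (16,000 tokens shared between input and output, with input truncated from the start when the budget is exceeded) can be pictured with a small sketch; the function name and the 512-token generation reserve are illustrative, not part of the card:

```python
# Sketch of the truncate-from-the-start behaviour described in the
# Technical Limitations row: keep only as many input tokens as fit in the
# 16,000-token budget once room for generation is reserved.
MAX_TOTAL_TOKENS = 16_000  # input + output budget stated in the card

def fit_to_context(input_ids: list[int], max_new_tokens: int = 512) -> list[int]:
    """Drop tokens from the start so input plus planned output fits the budget."""
    budget = MAX_TOTAL_TOKENS - max_new_tokens
    if len(input_ids) <= budget:
        return input_ids
    return input_ids[-budget:]  # truncation removes the oldest input tokens

# Example: a 20,000-token prompt keeps only its most recent 15,488 tokens.
print(len(fit_to_context(list(range(20_000)))))
```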