Error on Spark

#4 opened by zuozuo

Thanks for the effort!
On my DGX Spark, I got the following error:

=============
== PyTorch ==
=============

NVIDIA Release 25.06 (build 177567387)
PyTorch Version 2.8.0a0+5228986
WARNING: Detected NVIDIA GB10 GPU, which may not yet be supported in this version of the container

[2025-11-10 01:55:19] INFO config.py:54: PyTorch version 2.8.0a0+5228986c39.nv25.6 available.
[2025-11-10 01:55:19] INFO config.py:66: Polars version 1.25.2 available.
/usr/local/lib/python3.12/dist-packages/modelopt/torch/init.py:36: UserWarning: transformers version 4.55.0 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with pip install nvidia-modelopt[hf] if working with HF models.
_warnings.warn(
2025-11-10 01:55:20,679 - INFO - flashinfer.jit: Prebuilt kernels not found, using JIT backend
[TensorRT-LLM] TensorRT-LLM version: 1.1.0rc3
/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_fields.py:198: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
warnings.warn(
[11/10/2025-01:55:21] [TRT-LLM] [I] Using LLM with PyTorch backend
[11/10/2025-01:55:21] [TRT-LLM] [I] Set nccl_plugin to None.
[11/10/2025-01:55:21] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[11/10/2025-01:55:21] [TRT-LLM] [I] Found quantization_config field in /workspace/model/config.json, pre-quantized checkpoint is used.
Traceback (most recent call last):
File "/usr/local/bin/trtllm-serve", line 8, in
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1442, in call
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1363, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1830, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1226, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 794, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 358, in serve
launch_server(host, port, llm_args, metadata_server_cfg, server_role)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 164, in launch_server
llm = PyTorchLLM(**llm_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1027, in init
super().init(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 942, in init
super().init(model,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 215, in init
self._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 955, in _build_model
super()._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 645, in _build_model
self._engine_dir, self._hf_model_dir = model_loader()
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 651, in call
self.model_loader._update_from_hf_quant_config()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 439, in _update_from_hf_quant_config
raise NotImplementedError(
NotImplementedError: Unsupported quantization_config: {'config_groups': {'group_0': {'format': 'nvfp4-pack-quantized', 'input_activations': {'actorder': None, 'block_structure': None, 'dynamic': 'local', 'group_size': 16, 'num_bits': 4, 'observer': 'minmax', 'observer_kwargs': {}, 'strategy': 'tensor_group', 'symmetric': True, 'type': 'float'}, 'output_activations': None, 'targets': ['Linear'], 'weights': {'actorder': None, 'block_structure': None, 'dynamic': False, 'group_size': 16, 'num_bits': 4, 'observer': 'minmax', 'observer_kwargs': {}, 'strategy': 'tensor_group', 'symmetric': True, 'type': 'float'}}}, 'format': 'nvfp4-pack-quantized', 'global_compression_ratio': None, 'ignore': ['lm_head'], 'kv_cache_scheme': None, 'quant_method': 'compressed-tensors', 'quantization_status': 'compressed', 'sparsity_config': {}, 'transform_config': {}, 'version': '0.12.2'}.

I used Docker following https://build.nvidia.com/spark/nvfp4-quantization/instructions.
It seems that not all model architectures are supported for NVFP4 quantization yet.
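Roughly, the serve step I ran looks like this sketch (the container image and host path below are placeholders for my local setup, not the literal values from that page):

# sketch only: the image tag and checkpoint path are placeholders
docker run --rm -it --gpus all \
  -v /path/to/checkpoint:/workspace/model \
  -p 8000:8000 \
  <container-image-from-the-instructions> \
  trtllm-serve /workspace/model --host 0.0.0.0 --port 8000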

I'm really unfamiliar with running tensorrt_llm. I usually test the models with vLLM. Next time I've got an instance spun up, I'll double-check that it runs. I don't have a Spark. Maybe someday. I've been doing everything with cloud instances on RTX Pro 6000 Blackwells.

Thank you, I will try it on vLLM.

Were you able to get it running? I just spent some time testing it on a cloud instance and updated the model card with a working Docker vLLM command (at least for a 2x RTX Pro 6000 Blackwell system). Maybe give it a try and see if it works on the Spark?

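For reference, it is roughly along these lines (a sketch only; the image tag and model ID are placeholders, so check the model card for the exact command):

# sketch: OpenAI-compatible vLLM server with tensor parallelism across 2 GPUs;
# the model ID is a placeholder, the model card has the exact command
docker run --rm -it --gpus all --ipc=host \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model <org>/<model-nvfp4> \
  --tensor-parallel-size 2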

I used nvcr.io/nvidia/vllm:25.09-py3 and nvcr.io/nvidia/vllm:25.10-py3.
vLLM runs successfully on the Spark, but some models are not supported:

pydantic_core.ValidationError: 1 validation error for ModelConfig
(APIServer pid=1) Value error, Model architectures ['KimiLinearForCausalLM'] are not supported for now.

qwen3-vl and KimiLinear are not supported for now; I will try Qwen3-Next. Because the Spark has 128 GB of memory, I would like to try a 30B~70B MoE model to actually test the running speed of NVFP4.
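For the speed test, my plan is just a rough single-request timing against the OpenAI-compatible endpoint (the served model name below is a placeholder):

# time one completion; max_tokens divided by wall-clock time gives a rough
# decode tokens/sec figure (model name is a placeholder)
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<served-model-name>", "prompt": "Hello", "max_tokens": 256}'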

Based on the errors I am seeing across the common libraries I use, most issues point to Triton: the currently released stable version does not support the Spark's GB10 architecture (sm_121a) very well, so the quantization paths in vLLM, huggingface/transformers, and diffusers cannot run successfully. qwen-image and qwen-image-edit also run very slowly on the Spark.
Triton's main branch seems to have fixed these issues, but I failed to compile it from source on my device, so I can only wait for the next release.
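For reference, the source build I attempted was roughly this (a sketch only; the install layout has changed across Triton releases, so the repo README is the authority here):

# rough sketch of building Triton from its main branch
git clone https://github.com/triton-lang/triton.git
cd triton
pip install ninja cmake wheel   # build-time dependencies
pip install -e .                # older checkouts used 'pip install -e python' instead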
