I am trying to run the Llama2 benchmark on a server with 8x NVIDIA H100 GPUs, and it is failing with the error below. I am new to Python and ML, so any help is appreciated:
make run RUN_ARGS="--benchmarks=llama2-70b --scenarios=offline,server"
make[1]: Entering directory '/work'
[2024-10-24 13:15:39,363 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC0. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC1. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC2. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC3. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC4. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC5. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC6. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC7. Skipping.
[2024-10-24 13:15:39,364 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC8. Skipping.
[2024-10-24 13:15:39,632 main.py:229 INFO] Detected system ID: KnownSystem.hgx19
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC0. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC1. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC2. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC3. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC4. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC5. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC6. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC7. Skipping.
[2024-10-24 13:16:01,224 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC8. Skipping.
[2024-10-24 13:16:01,437 generate_engines.py:171 INFO] Building engines for llama2-70b benchmark in Offline scenario...
[10/24/2024-13:16:02] [TRT] [I] [MemUsageChange] Init CUDA: CPU +25, GPU +0, now: CPU 44, GPU 535 (MiB)
[10/24/2024-13:16:05] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4517, GPU +1250, now: CPU 4715, GPU 1785 (MiB)
Process Process-1:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/work/code/actionhandler/base.py", line 189, in subprocess_target
return self.action_handler.handle()
File "/work/code/actionhandler/generate_engines.py", line 174, in handle
total_engine_build_time += self.build_engine(job)
File "/work/code/actionhandler/generate_engines.py", line 157, in build_engine
builder = get_benchmark(job.config)
File "/work/code/__init__.py", line 157, in get_benchmark
return builder_op(conf)
File "/work/code/llama2-70b/tensorrt/builder.py", line 276, in __init__
super().__init__(LLAMA2EngineBuilderOp(**args))
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 288, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 256, in _wrapper
raise exc_info[1]
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 243, in _wrapper
retval = obj(*args, **kwargs)
File "/work/code/llama2-70b/tensorrt/builder.py", line 256, in __init__
builder = LLAMA2EngineBuilderOp.COMPONENT_BUILDER_MAP[component](*args, batch_size=component_batch_size, **kwargs)
File "/work/code/llama2-70b/tensorrt/builder.py", line 78, in __init__
assert self.world_size > 0
AssertionError
[2024-10-24 13:16:27,940 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC0. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC1. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC2. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC3. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC4. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC5. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC6. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC7. Skipping.
[2024-10-24 13:16:27,941 systems.py:197 INFO] Found unknown device in GPU connection topology: NIC8. Skipping.
[2024-10-24 13:16:28,170 generate_engines.py:171 INFO] Building engines for llama2-70b benchmark in Offline scenario...
[10/24/2024-13:16:28] [TRT] [I] [MemUsageChange] Init CUDA: CPU +25, GPU +0, now: CPU 44, GPU 535 (MiB)
[10/24/2024-13:16:32] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +4517, GPU +1250, now: CPU 4715, GPU 1785 (MiB)
Process Process-2:
Traceback (most recent call last):
File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/work/code/actionhandler/base.py", line 189, in subprocess_target
return self.action_handler.handle()
File "/work/code/actionhandler/generate_engines.py", line 174, in handle
total_engine_build_time += self.build_engine(job)
File "/work/code/actionhandler/generate_engines.py", line 157, in build_engine
builder = get_benchmark(job.config)
File "/work/code/__init__.py", line 157, in get_benchmark
return builder_op(conf)
File "/work/code/llama2-70b/tensorrt/builder.py", line 276, in __init__
super().__init__(LLAMA2EngineBuilderOp(**args))
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 288, in _wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 256, in _wrapper
raise exc_info[1]
File "/usr/local/lib/python3.10/dist-packages/nvmitten/debug/debug_manager.py", line 243, in _wrapper
retval = obj(*args, **kwargs)
File "/work/code/llama2-70b/tensorrt/builder.py", line 256, in __init__
builder = LLAMA2EngineBuilderOp.COMPONENT_BUILDER_MAP[component](*args, batch_size=component_batch_size, **kwargs)
File "/work/code/llama2-70b/tensorrt/builder.py", line 78, in __init__
assert self.world_size > 0
AssertionError
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/work/code/main.py", line 231, in <module>
main(main_args, DETECTED_SYSTEM)
File "/work/code/main.py", line 144, in main
dispatch_action(main_args, config_dict, workload_setting)
File "/work/code/main.py", line 202, in dispatch_action
handler.run()
File "/work/code/actionhandler/base.py", line 82, in run
self.handle_failure()
File "/work/code/actionhandler/base.py", line 186, in handle_failure
self.action_handler.handle_failure()
File "/work/code/actionhandler/generate_engines.py", line 182, in handle_failure
raise RuntimeError("Building engines failed!")
RuntimeError: Building engines failed!
make[1]: *** [Makefile:37: generate_engines] Error 1
make[1]: Leaving directory '/work'
make: *** [Makefile:31: run] Error 2
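For anyone reading the traceback: the only check that actually fails is `assert self.world_size > 0` in builder.py, so the builder appears to end up with a world size of zero for this system. The sketch below is only an illustration of how that can happen, assuming the world size is the product of a tensor-parallel and a pipeline-parallel size, which is a common convention but is not confirmed by the log. `ParallelismConfig`, its field names, its defaults, and `check_world_size` are hypothetical and are not taken from the benchmark code.

```python
from dataclasses import dataclass


@dataclass
class ParallelismConfig:
    """Hypothetical stand-in for the per-system benchmark settings."""
    tensor_parallelism: int = 0    # assumed default when the field is never set
    pipeline_parallelism: int = 1  # assumed default

    @property
    def world_size(self) -> int:
        # Assumed convention: one rank per cell of the TP x PP grid.
        return self.tensor_parallelism * self.pipeline_parallelism


def check_world_size(cfg: ParallelismConfig) -> int:
    """Same condition as the failing assert, but with a readable message."""
    if cfg.world_size <= 0:
        raise ValueError(
            f"world_size={cfg.world_size}: tensor_parallelism="
            f"{cfg.tensor_parallelism}, pipeline_parallelism="
            f"{cfg.pipeline_parallelism}; both must be >= 1"
        )
    return cfg.world_size


if __name__ == "__main__":
    # A config with both sizes set passes the check.
    print(check_world_size(ParallelismConfig(tensor_parallelism=2)))
    # A config that leaves a size at zero fails it.
    try:
        check_world_size(ParallelismConfig())
    except ValueError as err:
        print(err)
```

If that assumption holds, the place to look would be the llama2-70b configuration entry used for the detected system (`KnownSystem.hgx19`), to see whether the parallelism values are set there at all.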