
Error launching Qwen 2.5 with the vLLM container #678

Open
ljwps opened this issue Oct 12, 2024 · 5 comments

ljwps commented Oct 12, 2024

My device information

NVIDIA Jetson AGX Orin Developer Kit (base), 64 GB

Package: nvidia-jetpack
Version: 6.1+b123
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.1)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.1+b123), nvidia-jetpack-dev (= 6.1+b123)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Manual-Installed: yes
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.4/main arm64 Packages
Description: NVIDIA Jetpack Meta Package

R36 (release), REVISION: 4.0, GCID: 37537400, BOARD: generic, EABI: aarch64, DATE: Fri Sep 13 04:36:44 UTC 2024
KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.5 LTS"

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:14:07_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

Repro steps

1. Update jetson-containers via git
2. Run the install script
3. jetson-containers run $(autotag vllm)
4. Inside the container, run the model with vLLM
5. Error (see logs below; a sketch of the assumed commands follows this list)
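For reference, a minimal sketch of the commands those steps correspond to; the paths and script names are assumptions based on the jetson-containers README, so adjust for your checkout:

    # update the jetson-containers checkout and re-run its installer
    cd ~/jetson-containers && git pull
    bash install.sh

    # start the vLLM container, then launch the server inside it
    jetson-containers run $(autotag vllm)
    python3 -m vllm.entrypoints.openai.api_server --model /data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/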

Error logs

root@ubuntu:~# python3 -m vllm.entrypoints.openai.api_server --model /data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
INFO 10-12 15:47:33 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 10-12 15:47:36 api_server.py:527] vLLM API server version 0.1.dev1+gb22b798.d20241006
INFO 10-12 15:47:36 api_server.py:528] args: Namespace(host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False)
INFO 10-12 15:47:36 api_server.py:165] Multiprocessing frontend to use ipc:///tmp/2c4ac112-50b5-4615-8b1b-8684fe33dbae for IPC Path.
INFO 10-12 15:47:36 api_server.py:178] Started engine process with PID 313
/usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
INFO 10-12 15:47:39 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 10-12 15:47:43 gptq_marlin.py:107] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 10-12 15:47:49 gptq_marlin.py:107] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 10-12 15:47:49 llm_engine.py:237] Initializing an LLM engine (v0.1.dev1+gb22b798.d20241006) with config: model='/data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/', speculative_config=None, tokenizer='/data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 10-12 15:47:51 model_runner.py:1049] Starting to load model /data/models/vllm/Qwen2.5-14B-Instruct-GPTQ-Int4/...
INFO 10-12 15:47:51 gptq_marlin.py:198] Using MarlinLinearKernel for GPTQMarlinLinearMethod
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 21.06it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 20.99it/s]

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 388, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 138, in from_engine_args
    return cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/multiprocessing/engine.py", line 78, in __init__
    self.engine = LLMEngine(*args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 335, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
    self.driver_worker.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 183, in load_model
    self.model_runner.load_model()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1051, in load_model
    self.model = get_model(model_config=self.model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/__init__.py", line 19, in get_model
    return loader.load_model(model_config=model_config,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 413, in load_model
    quant_method.process_weights_after_loading(module)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/gptq_marlin.py", line 296, in process_weights_after_loading
    self.kernel.process_weights_after_loading(layer)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/kernels/marlin.py", line 110, in process_weights_after_loading
    self._transform_param(layer, self.w_s_name, transform_w_s)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/kernels/MPLinearKernel.py", line 64, in _transform_param
    new_param = fn(old_param)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/kernels/marlin.py", line 103, in transform_w_s
    x.data = marlin_permute_scales(x.data.contiguous(),
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 191, in marlin_permute_scales
    s = s.reshape((-1, len(scale_perm)))[:, scale_perm]
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 581, in <module>
    uvloop.run(run_server(args))
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.10/dist-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 548, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 106, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 193, in build_async_engine_client_from_engine_args
    raise RuntimeError(
RuntimeError: Engine process failed to start
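For what it's worth, "no kernel image is available for execution on the device" usually means the compiled CUDA kernels (here, the Marlin GPTQ kernels) were not built for this GPU's compute capability; the AGX Orin is sm_87. A quick sanity check, assuming PyTorch is importable inside the container:

    # print the GPU's compute capability and the CUDA version PyTorch was built against;
    # AGX Orin should report (8, 7) -- a wheel built without sm_87 support will load
    # fine but fail at the first kernel launch, as in the traceback above
    python3 -c "import torch; print(torch.cuda.get_device_capability(0), torch.version.cuda)"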
johnnynunez (Contributor) commented Oct 12, 2024

vLLM is still not usable with the main/dev branch; @dusty-nv has to merge my second PR:

#670

dusty-nv (Owner) commented

@johnnynunez 29843f8 👍

johnnynunez (Contributor) commented

> @johnnynunez 29843f8 👍

You have my whl if you don't want to build it.
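If you take the prebuilt-wheel route, installing it inside the container would look something like this (the wheel filename is a placeholder):

    # install a prebuilt vLLM wheel instead of compiling from source
    pip3 install ./vllm-<version>-cp310-cp310-linux_aarch64.whl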

ljwps (Author) commented Oct 14, 2024

> @johnnynunez 29843f8 👍
> You have my whl if you don't want to build it.

So what should I do now to make the program run properly? Should I wait for the repository to be updated, or can you provide an updated package?

johnnynunez (Contributor) commented

> @johnnynunez 29843f8 👍
> You have my whl if you don't want to build it.
> So what should I do now to make the program run properly? Should I wait for the repository to be updated, or can you provide an updated package?

You have to build the vLLM container.
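A sketch of that build, assuming the jetson-containers build CLI from this repo (the package name and flags may differ by version):

    # rebuild the vLLM container from source so its CUDA kernels target the Orin (sm_87)
    jetson-containers build vllm
    jetson-containers run $(autotag vllm)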
