[Bug] tp-size=2, model launch error #1945

linqingxu · 2024-11-07T06:14:03Z

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
5. Please use English, otherwise it will be closed.

Describe the bug

With --tp-size 2, the server freezes during model launch.

Reproduction

python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --tp-size 2 --kv-cache-dtype int8 --attention-backend triton --sampling-backend pytorch --enable-torch-compile

Environment

AMD GPU: Radeon RX 7900 XTX (gfx1100). rocminfo output:
Name: gfx1100
Uuid: GPU-b1d1b7e55cd7ec87
Marketing Name: Radeon RX 7900 XTX
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 3
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 98304(0x18000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2070
BDFID: 49920
Internal Node ID: 3
Compute Unit: 96
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 202
SDMA engine uCode:: 20
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 25149440(0x17fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***

linqingxu · 2024-11-07T06:15:34Z

python3 -m sglang.launch_server --model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ --port 30000 --mem-fraction-static 0.8 --kv-cache-dtype int8 --tp-size 2 --attention-backend triton --sampling-backend pytorch --enable-torch-compile
WARNING 11-07 06:14:22 rocm.py:13] fork method is not supported by ROCm. VLLM_WORKER_MULTIPROC_METHOD is overridden to spawn instead.
[2024-11-07 06:14:30] server_args=ServerArgs(model_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_path='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='int8', kvint4_groupsize=32, quantization=None, context_length=None, device='cuda', served_model_name='/root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.8, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=2, stream_interval=1, random_seed=589889175, constrained_json_whitespace_pattern=None, decode_log_interval=40, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, watchdog_timeout=600, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='triton', sampling_backend='pytorch', grammar_backend='outlines', disable_flashinfer=False, disable_flashinfer_sampling=False, disable_radix_cache=False, disable_regex_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_penalizer=False, disable_nan_detection=False, enable_overlap_schedule=False, enable_mixed_chunk=False, enable_torch_compile=True, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1)
[2024-11-07 06:14:43 TP0] Init torch distributed begin.
[2024-11-07 06:14:43 TP1] Init torch distributed begin.

There has been no further output in the logs since then.

merrymercy · 2024-11-14T18:55:38Z

Could you try dropping --kv-cache-dtype int8 and --enable-torch-compile?

cc @HaiShaw can you take a look?
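
One way to apply the suggestion above without retyping the command is to strip the two flags programmatically. The sketch below is only an illustration: `drop_flags` is a hypothetical helper, not part of sglang, and it assumes flags are written as separate tokens (`--flag value`), as in the reproduction command, rather than as `--flag=value`.

```python
import shlex


def drop_flags(command, flags_with_value=(), bare_flags=()):
    """Return `command` with the given flags removed.

    flags_with_value: flags followed by one value token (e.g. --kv-cache-dtype int8).
    bare_flags: flags that take no value (e.g. --enable-torch-compile).
    Hypothetical helper for illustration; does not handle --flag=value.
    """
    kept = []
    skip_next = False
    for token in shlex.split(command):
        if skip_next:
            # This token is the value of a flag we just dropped.
            skip_next = False
            continue
        if token in flags_with_value:
            skip_next = True
            continue
        if token in bare_flags:
            continue
        kept.append(token)
    return shlex.join(kept)


# The reproduction command from this issue.
original = (
    "python3 -m sglang.launch_server "
    "--model-path /root/.xinference/cache/qwen2_5-instruct-gptq-7b-Int8/ "
    "--port 30000 --mem-fraction-static 0.8 --tp-size 2 "
    "--kv-cache-dtype int8 --attention-backend triton "
    "--sampling-backend pytorch --enable-torch-compile"
)

reduced = drop_flags(
    original,
    flags_with_value=("--kv-cache-dtype",),
    bare_flags=("--enable-torch-compile",),
)
print(reduced)
```

If the reduced command launches, adding the two flags back one at a time should show which of them triggers the hang.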

HaiShaw · 2024-11-15T00:25:10Z

@linqingxu

Several things to note:

1. As in issue #1953 ("[Bug] amdgpu, tp-size=2, Detected errors during sampling! NaN in the logits."), is --kv-cache-dtype int8 something from your own modified code?
2. gfx1100 (7900 XTX) consumer cards are not officially supported yet. Currently only Instinct (data center) GPUs are supported.
3. When I tried Qwen/Qwen2.5-7B-Instruct-GPTQ-Int8 on MI300X with --enable-torch-compile, using v0.3.5.post1-rocm620, I saw no problem.
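
To check which ISA target a machine exposes before triaging a report like this, the `Name: gfx...` lines of rocminfo output can be extracted with a few lines of Python. This is a rough sketch under stated assumptions: `gfx_targets` and `unsupported_targets` are hypothetical helpers, and `EXAMPLE_INSTINCT_TARGETS` is an illustrative set (gfx90a for MI200-series, gfx942 for MI300-series), not sglang's official support list.

```python
import re

# Illustrative only: not an official list of targets supported by sglang.
EXAMPLE_INSTINCT_TARGETS = {"gfx90a", "gfx942"}


def gfx_targets(rocminfo_text):
    """Return the distinct gfx targets named in rocminfo output, in order.

    Matches agent lines such as "Name: gfx1100". "Marketing Name:" lines and
    ISA names like "amdgcn-amd-amdhsa--gfx1100" are deliberately not matched.
    """
    found = re.findall(
        r"^\s*Name:\s*(gfx[0-9a-f]+)\s*$", rocminfo_text, re.MULTILINE
    )
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(found))


def unsupported_targets(rocminfo_text, supported=EXAMPLE_INSTINCT_TARGETS):
    """Return the targets found in the output that are not in `supported`."""
    return [t for t in gfx_targets(rocminfo_text) if t not in supported]
```

Feeding the text captured from `rocminfo` to `unsupported_targets` would flag the gfx1100 card reported in this issue, given the example set above.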

zhyncs added the amd label Nov 10, 2024
