Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
I want to follow https://github.com/sgl-project/sglang/blob/main/benchmark/llava_bench/README.md and perform batch inference with LLaVA.
First, I launch the LLaVA v1.5 7B model from a local path with:
python3 -m sglang.launch_server --model-path /root/llm-project/utils/models/models-repo/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --port 30000 --disable-cuda-graph
Everything looks fine:
[2024-11-23 20:21:11] server_args=ServerArgs(model_path='/root/llm-project/utils/models/models-repo/llava-v1.5-7b', tokenizer_path='llava-hf/llava-1.5-7b-hf', tokenizer_mode='auto', skip_tokenizer_init=False, load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization=None, context_length=None, device='cuda', served_model_name='/root/llm-project/utils/models/models-repo/llava-v1.5-7b', chat_template=None, is_embedding=False, host='127.0.0.1', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, tp_size=1, stream_interval=1, random_seed=415803303, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='SGLang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=True, disable_cuda_graph_padding=False, disable_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, num_continuous_decode_steps=1, delete_ckpt_after_loading=False)
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens.
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens.
[2024-11-23 20:21:21 DP-1 TP0] Automatically turn off --chunked-prefill-size and adjust --mem-fraction-static for multimodal models.
[2024-11-23 20:21:21 DP-1 TP0] Init torch distributed begin.
[2024-11-23 20:21:21 DP-1 TP0] Load weight begin. avail mem=38.97 GB
[2024-11-23 20:21:22 DP-1 TP0] lm_eval is not installed, GPTQ may not be usable
Loading pt checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
state = torch.load(bin_file, map_location="cpu")
Loading pt checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.43it/s]
Loading pt checkpoint shards: 67% Completed | 2/3 [00:29<00:16, 16.97s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [01:36<00:00, 39.92s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [01:36<00:00, 32.09s/it]
[2024-11-23 20:23:10 DP-1 TP0] Load weight end. type=LlavaLlamaForCausalLM, dtype=torch.float16, avail mem=25.72 GB
[2024-11-23 20:23:10 DP-1 TP0] Memory pool end. avail mem=6.28 GB
Some kwargs in processor config are unused and will not have any effect: num_additional_image_tokens.
[2024-11-23 20:23:12 DP-1 TP0] max_total_num_tokens=39590, max_prefill_tokens=16384, max_running_requests=4097, context_len=4096
[2024-11-23 20:23:12] INFO: Started server process [169184]
[2024-11-23 20:23:12] INFO: Waiting for application startup.
[2024-11-23 20:23:12] INFO: Application startup complete.
[2024-11-23 20:23:12] INFO: Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2024-11-23 20:23:13] INFO: 127.0.0.1:44342 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-23 20:23:13 DP-1 TP0] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-23 20:23:15] INFO: 127.0.0.1:44354 - "POST /generate HTTP/1.1" 200 OK
[2024-11-23 20:23:15] The server is fired up and ready to roll!
Then I run the llava_bench using python3 bench_sglang.py --num-questions 60, and an error occurs.
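For context, my understanding of the per-question program the benchmark runs is roughly the sketch below (simplified and written from memory of the sglang frontend API rather than copied from bench_sglang.py; the function name image_qa and the max_tokens value are placeholders):

import sglang as sgl

@sgl.function
def image_qa(s, image_file, question):
    # One local image plus a short question, then a single generation per request.
    s += sgl.image(image_file) + question
    s += sgl.gen("answer", max_tokens=128)

# The script points the frontend at the running server and batches the questions,
# roughly: image_qa.run_batch([...60 image/question pairs...])
sgl.set_default_backend(sgl.RuntimeEndpoint("http://127.0.0.1:30000"))

The failing logs follow.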
The server side:
[2024-11-23 20:20:03] INFO: 127.0.0.1:52700 - "GET /get_model_info HTTP/1.1" 200 OK
[2024-11-23 20:20:03 DP-1 TP0] Prefill batch. #new-seq: 1, #new-token: 33, #cached-token: 1, cache hit rate: 2.44%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2024-11-23 20:20:03] INFO: 127.0.0.1:52704 - "POST /generate HTTP/1.1" 200 OK
[2024-11-23 20:20:06 DP-1 TP0] Traceback (most recent call last):
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1407, in run_scheduler_process
scheduler.event_loop_overlap()
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 388, in event_loop_overlap
self.process_input_requests(recv_reqs)
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 493, in process_input_requests
self.handle_generate_request(recv_req)
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 556, in handle_generate_request
req.origin_input_ids = self.pad_input_ids_func(
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/models/llava.py", line 67, in pad_input_ids
num_patch_width, num_patch_height = get_anyres_image_grid_shape(
File "/root/anaconda3/envs/sglang/lib/python3.10/site-packages/sglang/srt/mm_utils.py", line 173, in get_anyres_image_grid_shape
possible_resolutions = ast.literal_eval(grid_pinpoints)
File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 110, in literal_eval
return _convert(node_or_string)
File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 109, in _convert
return _convert_signed_num(node)
File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 83, in _convert_signed_num
return _convert_num(node)
File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 74, in _convert_num
_raise_malformed_node(node)
File "/root/anaconda3/envs/sglang/lib/python3.10/ast.py", line 71, in _raise_malformed_node
raise ValueError(msg + f': {node!r}')
ValueError: malformed node or string: None
zsh: killed python3 -m sglang.launch_server --model-path --tokenizer-path --port 30000
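Reading the server traceback: get_anyres_image_grid_shape in sglang/srt/mm_utils.py passes the model config's image_grid_pinpoints straight into ast.literal_eval, and in my run that value is apparently None (my assumption is that it is simply absent from the local checkpoint's config). A minimal sketch of the same failure:

import ast

# Assumed: the local llava-v1.5-7b config has no image_grid_pinpoints entry, so the
# anyres code path receives None instead of a string such as "[[336, 672], [672, 336]]".
grid_pinpoints = None
possible_resolutions = ast.literal_eval(grid_pinpoints)
# ValueError: malformed node or string: None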
The Python client side (running bench_sglang.py):
uncher 58987 -- /root/llm-project/sglang/benchmark/llava_bench/bench_sglang.py
0%| | 0/60 [00:00<?, ?it/s]/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py:339: UserWarning: Error in stream_executor: Traceback (most recent call last):
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 337, in _thread_worker_func
self._execute(expr)
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 380, in _execute
self._execute(x)
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 375, in _execute
self._execute_gen(other)
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/interpreter.py", line 502, in _execute_gen
comp, meta_info = self.backend.generate(
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/lang/backend/runtime_endpoint.py", line 163, in generate
res = http_request(
File "/root/anaconda3/envs/llava/lib/python3.10/site-packages/sglang/utils.py", line 100, in http_request
resp = urllib.request.urlopen(req, data=data, cafile=verify)
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 519, in open
response = self._open(req, data)
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 536, in _open
result = self._call_chain(self.handle_open, protocol, protocol +
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 496, in _call_chain
result = func(*args)
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 1377, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/root/anaconda3/envs/llava/lib/python3.10/urllib/request.py", line 1352, in do_open
r = h.getresponse()
File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 1375, in getresponse
response.begin()
File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 318, in begin
version, status, reason = self._read_status()
File "/root/anaconda3/envs/llava/lib/python3.10/http/client.py", line 287, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
warnings.warn(f"Error in stream_executor: {get_exception_traceback()}")
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 60/60 [00:03<00:00, 17.06it/s]
Latency: 3.821
Write output to answers.jsonl
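In case it helps with triage, this is the check I would run against the local checkpoint to see whether the fields the failing code path reads are actually present (the key names come from my reading of the traceback, not from verifying the sglang source):

import json

# Path from the launch command above.
with open("/root/llm-project/utils/models/models-repo/llava-v1.5-7b/config.json") as f:
    cfg = json.load(f)

# The crash happens when the anyres branch runs with image_grid_pinpoints set to None.
print("image_aspect_ratio:", cfg.get("image_aspect_ratio"))
print("image_grid_pinpoints:", cfg.get("image_grid_pinpoints"))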
Reproduction
Follow https://github.com/sgl-project/sglang/blob/main/benchmark/llava_bench/README.md
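A possibly shorter repro than the full benchmark: a single image request to the native /generate endpoint should go through the same pad_input_ids path (the prompt text and image path below are placeholders I made up; I have not confirmed this exact snippet crashes the server the same way):

import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "<image>\nDescribe this image in one sentence.",
        "image_data": "/path/to/any/local/image.jpg",  # placeholder path
        "sampling_params": {"max_new_tokens": 32},
    },
)
print(resp.status_code, resp.text)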
Environment