[Bug] big TPOT and ITL when running the offline benchmark #2097

Open
TraceIvan opened this issue Nov 20, 2024 · 3 comments

@TraceIvan
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am trying to compare vLLM and SGLang, and found that in the offline benchmark SGLang's TPOT and ITL are significantly higher than vLLM's. In addition, in the online benchmark SGLang is only better than vLLM in TTFT; the gap in the other metrics is not obvious, which differs from the official test results.

| Model | Dataset | Engine | Request rate (req/s) | Num prompts | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Request throughput (req/s) | Output token throughput (tok/s) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | sharegpt | SGLang | 4 | 1200 | 26.35 | 11.75 | 10.71 | 3.96 | 759.50 |
| Qwen2.5-7B-Instruct | sharegpt | SGLang | 8 | 2400 | 30.31 | 15.59 | 12.69 | 7.69 | 1510.00 |
| Qwen2.5-7B-Instruct | sharegpt | SGLang | inf | 5000 | 40115.54 | 480.09 | 184.38 | 25.38 | 5025.33 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | 4 | 1200 | 89.87 | 12.84 | 11.75 | 3.95 | 757.85 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | 8 | 2400 | 95.95 | 14.72 | 12.19 | 7.68 | 1508.08 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | inf | 5000 | 97130.64 | 48.77 | 26.82 | 23.23 | 4597.77 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | 4 | 1200 | 31.98 | 13.19 | 11.92 | 3.96 | 748.92 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | 8 | 2400 | 34.75 | 18.27 | 14.52 | 7.70 | 1481.75 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | inf | 5000 | 76052.20 | 249.41 | 202.11 | 21.44 | 4165.55 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | 4 | 1200 | 111.92 | 13.15 | 12.35 | 3.96 | 747.86 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | 8 | 2400 | 139.91 | 16.02 | 14.14 | 7.69 | 1480.22 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | inf | 5000 | 108027.13 | 55.27 | 33.92 | 20.49 | 3980.71 |
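
For reference, the TTFT, TPOT, and ITL values above relate to per-request timings roughly as in the following minimal Python sketch. The arrival times are made up for illustration; the authoritative definitions are in sglang.bench_serving.

```python
# Rough sketch of how the latency metrics relate (illustrative only).
send_time = 0.0
token_times = [0.05, 0.07, 0.09, 0.12, 0.14]  # hypothetical arrival time of each output token (s)

ttft = token_times[0] - send_time                # time to first token
e2e_latency = token_times[-1] - send_time        # end-to-end latency
itl = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]  # inter-token gaps
tpot = (e2e_latency - ttft) / (len(token_times) - 1)             # time per output token, excl. 1st

print(f"TTFT={ttft*1e3:.1f} ms, TPOT={tpot*1e3:.1f} ms, "
      f"median ITL={sorted(itl)[len(itl)//2]*1e3:.1f} ms")
```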

Reproduction

offline benchmark for sglang:

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     5000      
Benchmark duration (s):                  233.25    
Total input tokens:                      1130466   
Total generated tokens:                  971613    
Total generated tokens (retokenized):    971312    
Request throughput (req/s):              21.44     
Input token throughput (tok/s):          4846.59   
Output token throughput (tok/s):         4165.55   
Total token throughput (tok/s):          9012.13   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122015.10 
Median E2E Latency (ms):                 122810.16 
---------------Time to First Token----------------
Mean TTFT (ms):                          78989.41  
Median TTFT (ms):                        76052.20  
P99 TTFT (ms):                           178133.89 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          323.91    
Median TPOT (ms):                        249.41    
P99 TPOT (ms):                           2075.00   
---------------Inter-token Latency----------------
Mean ITL (ms):                           224.57    
Median ITL (ms):                         202.11    
P99 ITL (ms):                            725.15    
==================================================

offline benchmark for vllm:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    inf       
Successful requests:                     5000      
Benchmark duration (s):                  244.08    
Total input tokens:                      1130466   
Total generated tokens:                  971613    
Total generated tokens (retokenized):    971392    
Request throughput (req/s):              20.49     
Input token throughput (tok/s):          4631.53   
Output token throughput (tok/s):         3980.71   
Total token throughput (tok/s):          8612.24   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   119191.23 
Median E2E Latency (ms):                 118937.91 
---------------Time to First Token----------------
Mean TTFT (ms):                          108699.28 
Median TTFT (ms):                        108027.13 
P99 TTFT (ms):                           216903.82 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.17     
Median TPOT (ms):                        55.27     
P99 TPOT (ms):                           84.80     
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.27     
Median ITL (ms):                         33.92     
P99 ITL (ms):                            412.93    
==================================================

online benchmark for sglang:

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    4.0       
Successful requests:                     1200      
Benchmark duration (s):                  302.97    
Total input tokens:                      269081    
Total generated tokens:                  226901    
Total generated tokens (retokenized):    226855    
Request throughput (req/s):              3.96      
Input token throughput (tok/s):          888.15    
Output token throughput (tok/s):         748.92    
Total token throughput (tok/s):          1637.07   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2564.81   
Median E2E Latency (ms):                 1575.29   
---------------Time to First Token----------------
Mean TTFT (ms):                          38.23     
Median TTFT (ms):                        31.98     
P99 TTFT (ms):                           95.83     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.46     
Median TPOT (ms):                        13.19     
P99 TPOT (ms):                           20.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.43     
Median ITL (ms):                         11.92     
P99 ITL (ms):                            53.09     
==================================================

online benchmark for vllm:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 1200 --request-rate 4

============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    4.0       
Successful requests:                     1200      
Benchmark duration (s):                  303.40    
Total input tokens:                      269081    
Total generated tokens:                  226901    
Total generated tokens (retokenized):    226877    
Request throughput (req/s):              3.96      
Input token throughput (tok/s):          886.89    
Output token throughput (tok/s):         747.86    
Total token throughput (tok/s):          1634.75   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2628.77   
Median E2E Latency (ms):                 1637.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          112.34    
Median TTFT (ms):                        111.92    
P99 TTFT (ms):                           226.57    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.26     
Median TPOT (ms):                        13.15     
P99 TPOT (ms):                           15.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.38     
Median ITL (ms):                         12.35     
P99 ITL (ms):                            33.04     
==================================================

Environment

• python:3.10.15
• torch:2.4.0-cu121
• vLLM:0.6.3.post1
• SGLang:0.3.5.post2
• GPU:1×NVIDIA A800 80GB

@zhyncs
Member

zhyncs commented Nov 20, 2024

Hi @TraceIvan

Regarding the issues you mentioned, there are three main points:

  1. You can refer to our previously published blog posts (v0.2, v0.3) and the README. Those benchmark results are fully reproducible, and we clearly describe the vLLM and SGLang versions, hardware, and scenarios.

  2. Regarding the offline benchmark you mentioned, I believe latency metrics are not meaningful for an offline benchmark; usually only throughput is considered. If users have latency SLOs (e.g., TTFT and ITL must meet certain conditions), that is usually an online scenario, where what matters is throughput under a latency constraint. Obviously, that does not match your scenario of testing 5000 prompts at an inf request rate.

  3. For online scenarios, if the same benchmark configuration is used and neither framework reaches its limit (i.e., the request rate sent is close to the rate being processed), the difference in latency is usually not significant. You need to increase the request rate until a bottleneck is reached, then use that same request rate to benchmark the other engine, or fix latency requirements and measure each engine's maximum throughput under them. Throughput that meets the latency conditions is called goodput (see the sketch after this list).
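
As a rough illustration of the goodput idea in point 3, here is a minimal Python sketch assuming hypothetical per-request TTFT/TPOT measurements and made-up SLO thresholds; none of these numbers come from the benchmarks in this issue.

```python
# Illustrative goodput calculation: throughput counting only requests whose
# latency meets the SLOs. All values below are hypothetical.
requests = [
    # (ttft_ms, tpot_ms) per request
    (35.0, 12.5),
    (42.0, 14.1),
    (610.0, 48.9),   # violates the TTFT SLO
    (38.0, 13.0),
]
benchmark_duration_s = 10.0
SLO_TTFT_MS = 200.0
SLO_TPOT_MS = 20.0

good = [r for r in requests if r[0] <= SLO_TTFT_MS and r[1] <= SLO_TPOT_MS]
throughput = len(requests) / benchmark_duration_s
goodput = len(good) / benchmark_duration_s
print(f"throughput={throughput:.2f} req/s, goodput={goodput:.2f} req/s")
```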

BTW, I can share my offline throughput comparison of Llama 3.1 70B Instruct on H100 TP4 for your reference (since I currently don't have an A100 or A800).

From the results, total token throughput nearly doubles (8893.63 vs 4036.33 tok/s).

If you have any questions, feel free to communicate with me at any time. Cheers!

# sglang latest main (will release 0.3.6 soon)
# vllm 0.6.4.post1
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct  --tp 4 --disable-radix

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  236.36
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    966174
Request throughput (req/s):              21.15
Input token throughput (tok/s):          4782.86
Output token throughput (tok/s):         4110.77
Total token throughput (tok/s):          8893.63
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   121975.17
Median E2E Latency (ms):                 122833.42
---------------Time to First Token----------------
Mean TTFT (ms):                          82343.16
Median TTFT (ms):                        79687.05
P99 TTFT (ms):                           180424.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          276.04
Median TPOT (ms):                        225.49
P99 TPOT (ms):                           1503.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.28
Median ITL (ms):                         184.59
P99 ITL (ms):                            1048.51
==================================================

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  520.79
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    969847
Request throughput (req/s):              9.60
Input token throughput (tok/s):          2170.68
Output token throughput (tok/s):         1865.65
Total token throughput (tok/s):          4036.33
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   315131.04
Median E2E Latency (ms):                 335501.39
---------------Time to First Token----------------
Mean TTFT (ms):                          154018.79
Median TTFT (ms):                        147331.90
P99 TTFT (ms):                           331535.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1032.28
Median TPOT (ms):                        1062.07
P99 TPOT (ms):                           1810.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           830.90
Median ITL (ms):                         298.50
P99 ITL (ms):                            3506.43
==================================================

@sitabulaixizawaluduo

For the SGLang test, you can try enabling enable_mix_chunked and enable_overlap_schedule. In my current tests, SGLang's performance is close to LMDeploy's and better than vLLM's.

@TraceIvan
Author

> (quoting @zhyncs's reply above)

Thank you for your suggestion. I will try the vllm and sglang versions mentioned in the blog posts and increase the request rate for the online benchmark.
