[Bug] big TPOT and ITL when running the offline benchmark #2097

Open
TraceIvan opened this issue Nov 20, 2024 · 3 comments

@TraceIvan
Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am trying to compare vLLM and SGLang, and found that in the offline benchmark SGLang's TPOT and ITL are significantly higher than vLLM's. In addition, in the online benchmark SGLang is only better than vLLM in TTFT; the gap in the other metrics is not obvious, which differs from the official test results.

| Model | Dataset | Engine | Request rate (req/s) | Num prompts | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) | Request throughput (req/s) | Output token throughput (tok/s) |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | sharegpt | SGLang | 4 | 1200 | 26.35 | 11.75 | 10.71 | 3.96 | 759.50 |
| Qwen2.5-7B-Instruct | sharegpt | SGLang | 8 | 2400 | 30.31 | 15.59 | 12.69 | 7.69 | 1510.00 |
| Qwen2.5-7B-Instruct | sharegpt | SGLang | inf | 5000 | 40115.54 | 480.09 | 184.38 | 25.38 | 5025.33 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | 4 | 1200 | 89.87 | 12.84 | 11.75 | 3.95 | 757.85 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | 8 | 2400 | 95.95 | 14.72 | 12.19 | 7.68 | 1508.08 |
| Qwen2.5-7B-Instruct | sharegpt | vLLM | inf | 5000 | 97130.64 | 48.77 | 26.82 | 23.23 | 4597.77 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | 4 | 1200 | 31.98 | 13.19 | 11.92 | 3.96 | 748.92 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | 8 | 2400 | 34.75 | 18.27 | 14.52 | 7.70 | 1481.75 |
| Llama-3.1-8B-Instruct | sharegpt | SGLang | inf | 5000 | 76052.20 | 249.41 | 202.11 | 21.44 | 4165.55 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | 4 | 1200 | 111.92 | 13.15 | 12.35 | 3.96 | 747.86 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | 8 | 2400 | 139.91 | 16.02 | 14.14 | 7.69 | 1480.22 |
| Llama-3.1-8B-Instruct | sharegpt | vLLM | inf | 5000 | 108027.13 | 55.27 | 33.92 | 20.49 | 3980.71 |
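
For reference, the TTFT, TPOT, and ITL values above relate to per-request timings roughly as in the following minimal Python sketch. The arrival times are made up for illustration; the authoritative definitions are in sglang.bench_serving.

```python
# Rough sketch of how the latency metrics relate (illustrative only).
send_time = 0.0
token_times = [0.05, 0.07, 0.09, 0.12, 0.14]  # hypothetical arrival time of each output token (s)

ttft = token_times[0] - send_time                # time to first token
e2e_latency = token_times[-1] - send_time        # end-to-end latency
itl = [t1 - t0 for t0, t1 in zip(token_times, token_times[1:])]  # inter-token gaps
tpot = (e2e_latency - ttft) / (len(token_times) - 1)             # time per output token, excl. 1st

print(f"TTFT={ttft*1e3:.1f} ms, TPOT={tpot*1e3:.1f} ms, "
      f"median ITL={sorted(itl)[len(itl)//2]*1e3:.1f} ms")
```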

Reproduction

offline benchmark for sglang:

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    inf       
Successful requests:                     5000      
Benchmark duration (s):                  233.25    
Total input tokens:                      1130466   
Total generated tokens:                  971613    
Total generated tokens (retokenized):    971312    
Request throughput (req/s):              21.44     
Input token throughput (tok/s):          4846.59   
Output token throughput (tok/s):         4165.55   
Total token throughput (tok/s):          9012.13   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   122015.10 
Median E2E Latency (ms):                 122810.16 
---------------Time to First Token----------------
Mean TTFT (ms):                          78989.41  
Median TTFT (ms):                        76052.20  
P99 TTFT (ms):                           178133.89 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          323.91    
Median TPOT (ms):                        249.41    
P99 TPOT (ms):                           2075.00   
---------------Inter-token Latency----------------
Mean ITL (ms):                           224.57    
Median ITL (ms):                         202.11    
P99 ITL (ms):                            725.15    
==================================================

offline benchmark for vllm:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000

============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    inf       
Successful requests:                     5000      
Benchmark duration (s):                  244.08    
Total input tokens:                      1130466   
Total generated tokens:                  971613    
Total generated tokens (retokenized):    971392    
Request throughput (req/s):              20.49     
Input token throughput (tok/s):          4631.53   
Output token throughput (tok/s):         3980.71   
Total token throughput (tok/s):          8612.24   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   119191.23 
Median E2E Latency (ms):                 118937.91 
---------------Time to First Token----------------
Mean TTFT (ms):                          108699.28 
Median TTFT (ms):                        108027.13 
P99 TTFT (ms):                           216903.82 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          53.17     
Median TPOT (ms):                        55.27     
P99 TPOT (ms):                           84.80     
---------------Inter-token Latency----------------
Mean ITL (ms):                           54.27     
Median ITL (ms):                         33.92     
P99 ITL (ms):                            412.93    
==================================================

online benchmark for sglang:

CUDA_VISIBLE_DEVICES=0 python -m sglang.launch_server --model-path Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    4.0       
Successful requests:                     1200      
Benchmark duration (s):                  302.97    
Total input tokens:                      269081    
Total generated tokens:                  226901    
Total generated tokens (retokenized):    226855    
Request throughput (req/s):              3.96      
Input token throughput (tok/s):          888.15    
Output token throughput (tok/s):         748.92    
Total token throughput (tok/s):          1637.07   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2564.81   
Median E2E Latency (ms):                 1575.29   
---------------Time to First Token----------------
Mean TTFT (ms):                          38.23     
Median TTFT (ms):                        31.98     
P99 TTFT (ms):                           95.83     
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.46     
Median TPOT (ms):                        13.19     
P99 TPOT (ms):                           20.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.43     
Median ITL (ms):                         11.92     
P99 ITL (ms):                            53.09     
==================================================

online benchmark for vllm:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server --model Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 1200 --request-rate 4

============ Serving Benchmark Result ============
Backend:                                 vllm      
Traffic request rate:                    4.0       
Successful requests:                     1200      
Benchmark duration (s):                  303.40    
Total input tokens:                      269081    
Total generated tokens:                  226901    
Total generated tokens (retokenized):    226877    
Request throughput (req/s):              3.96      
Input token throughput (tok/s):          886.89    
Output token throughput (tok/s):         747.86    
Total token throughput (tok/s):          1634.75   
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   2628.77   
Median E2E Latency (ms):                 1637.06   
---------------Time to First Token----------------
Mean TTFT (ms):                          112.34    
Median TTFT (ms):                        111.92    
P99 TTFT (ms):                           226.57    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          13.26     
Median TPOT (ms):                        13.15     
P99 TPOT (ms):                           15.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           13.38     
Median ITL (ms):                         12.35     
P99 ITL (ms):                            33.04     
==================================================

Environment

• python:3.10.15
• torch:2.4.0-cu121
• vLLM:0.6.3.post1
• SGLang:0.3.5.post2
• GPU:1×NVIDIA A800 80GB

@zhyncs
Member

zhyncs commented Nov 20, 2024

Hi @TraceIvan

Regarding the issues you mentioned, there are three main points:

  1. You can refer to our previously published blog posts (v0.2, v0.3) and the README. Those benchmark results are fully reproducible, and we clearly describe the vLLM and SGLang versions, hardware, and scenarios.

  2. Regarding the offline benchmark you mentioned, I believe latency metrics are not meaningful for an offline benchmark; usually only throughput is considered. If users have latency SLOs (e.g., TTFT and ITL must meet certain conditions), that is usually an online scenario, where what matters is throughput under a latency constraint. Obviously, that does not match your scenario of testing 5000 prompts at an inf request rate.

  3. For online scenarios, if the same benchmark configuration is used and neither framework reaches its limit (i.e., the request rate sent is close to the rate being processed), the difference in latency is usually not significant. You need to increase the request rate until a bottleneck is reached, then use that same request rate to benchmark the other engine, or fix latency requirements and measure each engine's maximum throughput under them. Throughput that meets the latency conditions is called goodput (see the sketch after this list).
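
As a rough illustration of the goodput idea in point 3, here is a minimal Python sketch assuming hypothetical per-request TTFT/TPOT measurements and made-up SLO thresholds; none of these numbers come from the benchmarks in this issue.

```python
# Illustrative goodput calculation: throughput counting only requests whose
# latency meets the SLOs. All values below are hypothetical.
requests = [
    # (ttft_ms, tpot_ms) per request
    (35.0, 12.5),
    (42.0, 14.1),
    (610.0, 48.9),   # violates the TTFT SLO
    (38.0, 13.0),
]
benchmark_duration_s = 10.0
SLO_TTFT_MS = 200.0
SLO_TPOT_MS = 20.0

good = [r for r in requests if r[0] <= SLO_TTFT_MS and r[1] <= SLO_TPOT_MS]
throughput = len(requests) / benchmark_duration_s
goodput = len(good) / benchmark_duration_s
print(f"throughput={throughput:.2f} req/s, goodput={goodput:.2f} req/s")
```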

BTW, I can share my offline throughput comparison of Llama 3.1 70B Instruct on H100 TP4 for your reference (since I currently don't have an A100 or A800).

From the results, total token throughput nearly doubles (8893.63 vs 4036.33 tok/s).

If you have any questions, feel free to communicate with me at any time. Cheers!

# sglang latest main (will release 0.3.6 soon)
# vllm 0.6.4.post1
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-70B-Instruct  --tp 4 --disable-radix

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4

python3 -m sglang.bench_serving --backend sglang --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --num-prompts 5000
============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  236.36
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    966174
Request throughput (req/s):              21.15
Input token throughput (tok/s):          4782.86
Output token throughput (tok/s):         4110.77
Total token throughput (tok/s):          8893.63
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   121975.17
Median E2E Latency (ms):                 122833.42
---------------Time to First Token----------------
Mean TTFT (ms):                          82343.16
Median TTFT (ms):                        79687.05
P99 TTFT (ms):                           180424.62
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          276.04
Median TPOT (ms):                        225.49
P99 TPOT (ms):                           1503.42
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.28
Median ITL (ms):                         184.59
P99 ITL (ms):                            1048.51
==================================================

============ Serving Benchmark Result ============
Backend:                                 vllm
Traffic request rate:                    inf
Successful requests:                     5000
Benchmark duration (s):                  520.79
Total input tokens:                      1130466
Total generated tokens:                  971613
Total generated tokens (retokenized):    969847
Request throughput (req/s):              9.60
Input token throughput (tok/s):          2170.68
Output token throughput (tok/s):         1865.65
Total token throughput (tok/s):          4036.33
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   315131.04
Median E2E Latency (ms):                 335501.39
---------------Time to First Token----------------
Mean TTFT (ms):                          154018.79
Median TTFT (ms):                        147331.90
P99 TTFT (ms):                           331535.74
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          1032.28
Median TPOT (ms):                        1062.07
P99 TPOT (ms):                           1810.00
---------------Inter-token Latency----------------
Mean ITL (ms):                           830.90
Median ITL (ms):                         298.50
P99 ITL (ms):                            3506.43
==================================================

@sitabulaixizawaluduo

For the SGLang test, you can try enabling enable_mix_chunked and enable_overlap_schedule. In my current tests, SGLang's performance is close to LMDeploy's and better than vLLM's.

@TraceIvan
Author

> (quoting @zhyncs's reply above)

Thank you for your suggestion. I will try the vllm and sglang versions mentioned in the blog posts and increase the request rate for the online benchmark.
