Merge branch 'vllm-project:main' into main
tybalex authored Jun 25, 2024
2 parents e617b4c + dd248f7 commit 6ec01b5
Showing 259 changed files with 9,485 additions and 1,728 deletions.
21 changes: 13 additions & 8 deletions .buildkite/nightly-benchmarks/README.md
@@ -13,17 +13,24 @@ This benchmark will be *triggered* upon:

**Benchmarking Duration**: about 1hr.

## Configuring the workload for the quick benchmark
**For benchmarking developers**: please try your best to constrain the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.

The workload of the quick benchmark contains two parts: latency tests in `latency-tests.json`, throughput tests in `throughput-tests.json` and serving tests in `serving-tests.json`.

## Configuring the workload

The benchmarking workload contains three parts:
- Latency tests in `latency-tests.json`.
- Throughput tests in `throughput-tests.json`.
- Serving tests in `serving-tests.json`.

See [descriptions.md](tests/descriptions.md) for detailed descriptions.

### Latency test

Here is an example of one test inside `latency-tests.json`:

```json
[
...
{
"test_name": "latency_llama8B_tp1",
"parameters": {
@@ -34,7 +41,6 @@ Here is an example of one test inside `latency-tests.json`:
"num_iters": 15
}
},
...
]
```

@@ -57,7 +63,6 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t

```
[
...
{
"test_name": "serving_llama8B_tp1_sharegpt",
"qps_list": [1, 4, 16, "inf"],
@@ -77,7 +82,6 @@ We test the throughput by using `benchmark_serving.py` with request rate = inf t
"num_prompts": 200
}
},
...
]
```
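The `qps_list` in the serving example mixes numbers with the string `"inf"`. Below is a minimal sketch of how such a list could be normalized in Python, assuming `"inf"` stands for an unbounded request rate. The helper name `parse_qps_list` and the per-rate label format are hypothetical and are not taken from the benchmark suite.

```python
import json
import math


def parse_qps_list(raw):
    """Convert entries such as 1, 4, 16, "inf" to floats (hypothetical helper)."""
    return [float(q) for q in raw]


# Only the two fields visible in the example above are used here.
test = json.loads(
    '{"test_name": "serving_llama8B_tp1_sharegpt", "qps_list": [1, 4, 16, "inf"]}'
)
rates = parse_qps_list(test["qps_list"])

for q in rates:
    # Illustrative label only; the real suite may name its runs differently.
    label = "inf" if math.isinf(q) else f"{q:g}"
    print(f'{test["test_name"]}_qps_{label}')
```

Storing `"inf"` as a string keeps the file valid JSON, since JSON has no literal for infinity.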

@@ -92,7 +96,8 @@ The number of this test is less stable compared to the delay and latency benchma
WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

## Visualizing the results
The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table.
The `convert-results-json-to-markdown.py` script helps you put the benchmarking results inside a markdown table by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The JSON file is also attached within each buildkite job for further analysis.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking.
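
As a rough illustration of what "formatting `descriptions.md` with real benchmarking results" could look like, the sketch below fills a template with Python's `str.format`. The placeholder names match the keyword arguments used by the conversion script in this commit. The template text and table contents are made-up stand-ins, since the real `descriptions.md` is not shown here.

```python
# Hypothetical stand-in for tests/descriptions.md; the real file is not shown here.
template = (
    "## Latency tests\n\n"
    "{latency_tests_markdown_table}\n\n"
    "## Throughput tests\n\n"
    "{throughput_tests_markdown_table}\n\n"
    "## Serving tests\n\n"
    "{serving_tests_markdown_table}\n\n"
    "## JSON version\n\n"
    "{benchmarking_results_in_json_string}\n"
)

# Made-up table contents, for illustration only.
latency_table = "| Test name | Mean latency (ms) |\n|---|---|\n| example_test | 0.0 |"

report = template.format(
    latency_tests_markdown_table=latency_table,
    throughput_tests_markdown_table="(no results)",
    serving_tests_markdown_table="(no results)",
    benchmarking_results_in_json_string='{"latency": {}}',
)
print(report)
```

One consequence of using `str.format`: any literal `{` or `}` in the template itself has to be doubled (`{{`, `}}`), otherwise it is parsed as a placeholder. Braces inside the substituted values are left untouched.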
1 change: 1 addition & 0 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -17,6 +17,7 @@ steps:
plugins:
- kubernetes:
podSpec:
priorityClassName: perf-benchmark
containers:
- image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
6 changes: 3 additions & 3 deletions .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
@@ -343,9 +343,9 @@ main() {
QUICK_BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/

# benchmarking
run_serving_tests $QUICK_BENCHMARK_ROOT/serving-tests.json
run_latency_tests $QUICK_BENCHMARK_ROOT/latency-tests.json
run_throughput_tests $QUICK_BENCHMARK_ROOT/throughput-tests.json
run_serving_tests $QUICK_BENCHMARK_ROOT/tests/serving-tests.json
run_latency_tests $QUICK_BENCHMARK_ROOT/tests/latency-tests.json
run_throughput_tests $QUICK_BENCHMARK_ROOT/tests/throughput-tests.json


# postprocess benchmarking results
279 changes: 158 additions & 121 deletions .buildkite/nightly-benchmarks/scripts/convert-results-json-to-markdown.py
@@ -1,4 +1,5 @@
import json
import os
from pathlib import Path

import pandas as pd
@@ -11,145 +12,181 @@
latency_column_mapping = {
"test_name": "Test name",
"gpu_type": "GPU",
"avg_latency": "Average latency (s)",
"P10": "P10 (s)",
"P25": "P25 (s)",
"P50": "P50 (s)",
"P75": "P75 (s)",
"P90": "P90 (s)",
"avg_latency": "Mean latency (ms)",
# "P10": "P10 (s)",
# "P25": "P25 (s)",
"P50": "Median latency (ms)",
# "P75": "P75 (s)",
# "P90": "P90 (s)",
"P99": "P99 latency (ms)",
}

# thoughput tests and the keys that will be printed into markdown
# throughput tests and the keys that will be printed into markdown
throughput_results = []
throughput_results_column_mapping = {
"test_name": "Test name",
"gpu_type": "GPU",
"num_requests": "# of req.",
"total_num_tokens": "Total # of tokens",
"elapsed_time": "Elapsed time (s)",
# "num_requests": "# of req.",
# "total_num_tokens": "Total # of tokens",
# "elapsed_time": "Elapsed time (s)",
"requests_per_second": "Tput (req/s)",
"tokens_per_second": "Tput (tok/s)",
# "tokens_per_second": "Tput (tok/s)",
}

# serving results and the keys that will be printed into markdown
serving_results = []
serving_column_mapping = {
"test_name": "Test name",
"gpu_type": "GPU",
"completed": "# of req.",
# "completed": "# of req.",
"request_throughput": "Tput (req/s)",
"input_throughput": "Input Tput (tok/s)",
"output_throughput": "Output Tput (tok/s)",
# "input_throughput": "Input Tput (tok/s)",
# "output_throughput": "Output Tput (tok/s)",
"mean_ttft_ms": "Mean TTFT (ms)",
# do not say TTFT again to avoid the table getting too wide
"median_ttft_ms": "Median",
"p99_ttft_ms": "P99",
"mean_tpot_ms": "Mean TPOT (ms)",
"median_tpot_ms": "Median",
"p99_tpot_ms": "P99",
"median_ttft_ms": "Median TTFT (ms)",
"p99_ttft_ms": "P99 TTFT (ms)",
# "mean_tpot_ms": "Mean TPOT (ms)",
# "median_tpot_ms": "Median",
# "p99_tpot_ms": "P99",
"mean_itl_ms": "Mean ITL (ms)",
"median_itl_ms": "Median",
"p99_itl_ms": "P99",
"median_itl_ms": "Median ITL (ms)",
"p99_itl_ms": "P99 ITL (ms)",
}

for test_file in results_folder.glob("*.json"):

with open(test_file, "r") as f:
raw_result = json.loads(f.read())

if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# add the result to raw_result
serving_results.append(raw_result)
continue

elif "latency" in f.name:
# this result is generated via `benchmark_latency.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# get different percentiles
for perc in [10, 25, 50, 75, 90]:
raw_result.update(
{f"P{perc}": raw_result["percentiles"][str(perc)]})

# add the result to raw_result
latency_results.append(raw_result)
continue

elif "throughput" in f.name:
# this result is generated via `benchmark_throughput.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# add the result to raw_result
throughput_results.append(raw_result)
continue

print(f"Skipping {test_file}")

latency_results = pd.DataFrame.from_dict(latency_results)
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)

# remapping the key, for visualization purpose
if not latency_results.empty:
latency_results = latency_results[list(
latency_column_mapping.keys())].rename(columns=latency_column_mapping)
if not serving_results.empty:
serving_results = serving_results[list(
serving_column_mapping.keys())].rename(columns=serving_column_mapping)
if not throughput_results.empty:
throughput_results = throughput_results[list(
throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping)

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
tablefmt='pipe',
showindex=False)
serving_md_table = tabulate(serving_results,
headers='keys',
tablefmt='pipe',
showindex=False)
throughput_md_table = tabulate(throughput_results,
headers='keys',
tablefmt='pipe',
showindex=False)

# document the result
with open(results_folder / "benchmark_results.md", "w") as f:

def read_markdown(file):
if os.path.exists(file):
with open(file, "r") as f:
return f.read() + "\n"
else:
return f"{file} not found.\n"


def results_to_json(latency, throughput, serving):
return json.dumps({
'latency': latency.to_dict(),
'throughput': throughput.to_dict(),
'serving': serving.to_dict()
})


if __name__ == "__main__":

# collect results
for test_file in results_folder.glob("*.json"):

with open(test_file, "r") as f:
raw_result = json.loads(f.read())

if "serving" in str(test_file):
# this result is generated via `benchmark_serving.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# add the result to serving_results
serving_results.append(raw_result)
continue

elif "latency" in f.name:
# this result is generated via `benchmark_latency.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# get different percentiles
for perc in [10, 25, 50, 75, 90, 99]:
# Multiply by 1000 to convert the time unit from s to ms
raw_result.update(
{f"P{perc}": 1000 * raw_result["percentiles"][str(perc)]})
raw_result["avg_latency"] = raw_result["avg_latency"] * 1000

# add the result to latency_results
latency_results.append(raw_result)
continue

elif "throughput" in f.name:
# this result is generated via `benchmark_throughput.py`

# attach the benchmarking command to raw_result
with open(test_file.with_suffix(".commands"), "r") as f:
command = json.loads(f.read())
raw_result.update(command)

# update the test name of this result
raw_result.update({"test_name": test_file.stem})

# add the result to throughput_results
throughput_results.append(raw_result)
continue

print(f"Skipping {test_file}")

latency_results = pd.DataFrame.from_dict(latency_results)
serving_results = pd.DataFrame.from_dict(serving_results)
throughput_results = pd.DataFrame.from_dict(throughput_results)

raw_results_json = results_to_json(latency_results, throughput_results,
serving_results)

# remapping the key, for visualization purpose
if not latency_results.empty:
f.write("## Latency tests\n")
f.write(latency_md_table)
f.write("\n")
if not throughput_results.empty:
f.write("## Throughput tests\n")
f.write(throughput_md_table)
f.write("\n")
latency_results = latency_results[list(
latency_column_mapping.keys())].rename(
columns=latency_column_mapping)
if not serving_results.empty:
f.write("## Serving tests\n")
f.write(serving_md_table)
f.write("\n")
serving_results = serving_results[list(
serving_column_mapping.keys())].rename(
columns=serving_column_mapping)
if not throughput_results.empty:
throughput_results = throughput_results[list(
throughput_results_column_mapping.keys())].rename(
columns=throughput_results_column_mapping)

processed_results_json = results_to_json(latency_results,
throughput_results,
serving_results)

# get markdown tables
latency_md_table = tabulate(latency_results,
headers='keys',
tablefmt='pipe',
showindex=False)
serving_md_table = tabulate(serving_results,
headers='keys',
tablefmt='pipe',
showindex=False)
throughput_md_table = tabulate(throughput_results,
headers='keys',
tablefmt='pipe',
showindex=False)

# document the result
with open(results_folder / "benchmark_results.md", "w") as f:

results = read_markdown(
"../.buildkite/nightly-benchmarks/tests/descriptions.md")
results = results.format(
latency_tests_markdown_table=latency_md_table,
throughput_tests_markdown_table=throughput_md_table,
serving_tests_markdown_table=serving_md_table,
benchmarking_results_in_json_string=processed_results_json)
f.write(results)

# document benchmarking results in json
with open(results_folder / "benchmark_results.json", "w") as f:

results = latency_results.to_dict(
orient='records') + throughput_results.to_dict(
orient='records') + serving_results.to_dict(orient='records')
f.write(json.dumps(results))
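
The latency post-processing in this script (convert seconds to milliseconds, then keep and rename only the mapped columns) can be sketched without pandas as follows. This is a simplified illustration under stated assumptions, not the script's actual code path: the helper `to_display_row` and the sample values are hypothetical, and only a subset of the column mapping is reproduced.

```python
# Subset of latency_column_mapping from the script.
latency_column_mapping = {
    "test_name": "Test name",
    "avg_latency": "Mean latency (ms)",
    "P50": "Median latency (ms)",
    "P99": "P99 latency (ms)",
}


def to_display_row(raw_result, percentiles=(50, 99)):
    """Hypothetical helper: convert s to ms and rename keys for display."""
    row = dict(raw_result)  # copy, so the input is not modified
    for perc in percentiles:
        row[f"P{perc}"] = 1000 * raw_result["percentiles"][str(perc)]
    # Converted once, outside the loop, so the mean is not scaled repeatedly.
    row["avg_latency"] = 1000 * raw_result["avg_latency"]
    return {label: row[key] for key, label in latency_column_mapping.items()}


# Made-up sample values in seconds.
raw = {
    "test_name": "example_test",
    "avg_latency": 1.5,
    "percentiles": {"50": 1.25, "99": 2.0},
}
row = to_display_row(raw)
print(row)
```

Keys that are absent from the mapping (here, the raw `percentiles` dict) are dropped from the displayed row, which mirrors how the commented-out mapping entries keep the table narrow.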