Stable Diffusion XL inference benchmarks
model:
https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
batch size = 1, image size 1024x1024, 50 iterations
Engine
Time
PT2.0,fp16 + compile
5.35 s
Onnxruntime,fp16,ORT_CUDA
4.28 s
Engine
Time
PT2.0,fp16 + compile
6.02 s
Engine
Time
PT2.0,fp16 + compile
9.07 s
Stable Diffusion inference benchmarks
model:
https://huggingface.co/runwayml/stable-diffusion-v1-5 - sd1.5
https://huggingface.co/stabilityai/stable-diffusion-2-1 - sd2.1
batch size = 1, image size 512x512, 50 iterations
Engine
Time, sd1.5
Time, sd2.1, 512x512
Time, sd2.1, 768x768
PT2.0,fp16
1.96 s (4.54gb VRAM)
PT2.0,fp16 + compile
1.36 s (5.96 gb)
2.37 s
AITemplate,fp16
1.01 s (4.06 gb)
DeepSpeed,fp16
1.18 s
2.28 s
Oneflow,fp16
0.98 s (5.62 gb)
TensorRT 8.6.1, fp16
0.98 s
0.81 s
1.88 s
TensorRT 10.0, fp16
1.57 s
Onnxruntime,fp16,ORT_CUDA
0.85 s
0.76 s
1.63 s
Jax,XLA,bf16
1.58 s
1.35 s
3.61 s
Engine
Time, sd1.5
Time, sd2.1
PT2.0,fp16
1.44 s
PT2.0,fp16,compile
1.11 s
TensorRT 8.6.1,fp16
0.75 s
0.68 s
Jax,XLA,bf16
1.18 s
Engine
Time, sd1.5
Time, sd2.1
Time, sd2.1, 768x768
PT2.0,fp16,compile
0.83 s
0.70 s
1.39 s
TensorRT 8.6.1,fp16
0.49 s
0.48 s
1.05 s
Jax,XLA,bf16
1.00 s
0.79 s
Engine
Time, sd1.5
Time, sd2.1
Time, sd2.1, 768x768
PT2.0,fp16,compile
1.17 s
2.26 s
TensorRT 8.6.1, fp16
0.74 s
0.68 s
1.52 s
Engine
Time, sd1.5
Time, sd2.1
Time, sd2.1, 768x768
PT2.0,fp16,compile
2.09 s
3.08 s
TensorRT 8.6.2,fp16
0.91 s
2.19 s
Engine
Time, sd1.5
Time, sd2.1
Time, sd2.1, 768x768
PT2.0,fp16,compile
1.28 s
2.77 s
TensorRT 8.6.2,fp16
0.90 s
2.25 s
GPU
PT2.0,fp16,xformers
V100, 16gb
2.96 s
T4, 16gb
7.83 s
Ubuntu, Debian VM setup https://gist.github.com/alexeigor/b4c21b5e1fe62d670c433d4ac8c9fd83
docker build -f ./Dockerfile --network=host --build-arg HF_TOKEN=xxxxx -t test_pt .
docker run -it --network=host -v ${PWD} /workspace:/workspace -w /workspace --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 test_engine