
[QAT] BERT Model #21228

Closed
Sand3r- opened this issue Nov 18, 2019 · 3 comments
Sand3r- (Contributor) commented Nov 18, 2019

Hello,
as agreed previously, Intel is going to prepare the QAT pass for BERT INT8.
Hence, we'd like to find out the following:

  1. Which flavour (type) of BERT model shall we optimize? There are several that I know of, each built from a different set of ops.
  2. Could you please provide the data reader for us? Two of the QAT BERT models we have received accept only 2 inputs, while other BERT models we have seen (the one from the benchmark repository or the one from the BERT unit test) contain 4 inputs, named placeholder[0-3].
  3. How should we compute the accuracy?
  4. What performance measure should we use for the model in question? Is it words per second (wps) or something else?
Sand3r- (Contributor, Author) commented Nov 19, 2019

So far we've attempted optimisation of the BERT float_model using the QATv1 mechanism. These are the profiling results:
FP32 QAT BERT

Run 100 samples, average latency: 181.305 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 181.006 ms per sample.

QATv1 INT8 model

Run 100 samples, average latency: 50.4984 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 48.1151 ms per sample.

According to the final benchmark result we have managed to achieve a ~3.8x speedup. However, since both the FP32 and INT8 runs had a lot of outliers in their results (a typical result was ~100.712 ms, while some outliers were much larger, e.g. 705.972 ms or 672.464 ms), the averages were skewed. Hence I want to add that the typical latency of a single batch computation was 100.712 ms for FP32 QAT BERT and 44.4283 ms for QAT INT8 BERT.
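As a side note, the effect of those outliers can be made visible by reporting the median alongside the mean of the per-sample latencies. A minimal sketch (the helper name and the sample values below are hypothetical, chosen only to illustrate the outlier effect; the speedup line reuses the steady-state averages reported above):

```python
import statistics

def summarize_latency(latencies_ms, warmup=1):
    """Summarize per-sample latencies, excluding warmup steps.

    The median is robust against a few large outlier samples
    (e.g. >600 ms) that would otherwise skew the mean.
    """
    steady = latencies_ms[warmup:]
    return {
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "max_ms": max(steady),
    }

# Hypothetical samples: mostly ~100 ms, with one large outlier.
stats = summarize_latency([181.3, 100.7, 100.7, 705.9, 100.7])
print(stats["median_ms"])  # → 100.7 (the outlier barely moves the median)

# Speedup from the steady-state averages reported above (FP32 vs INT8).
speedup = 181.006 / 48.1151
print(round(speedup, 2))  # → 3.76, i.e. the ~3.8x quoted above
```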

Full output for FP32 QAT
Full output for INT8 QAT

bingyanghuang (Contributor) commented:

@Sand3r- Thanks for your results. To answer your questions:

  1. The QAT INT8 model which has two inputs.
  2. The UT and input data have been sent to you via Slack.
  3. Accuracy can be calculated by comparing the results against the labels (sent to you via Slack).
  4. Performance should be the average latency with batch_size=1 and max_seqlen=128.
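The accuracy calculation in point 3 can be sketched as a simple element-wise comparison between the model's results and the reference labels; a minimal sketch (the helper name and sample values are hypothetical, not from the data sent via Slack):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    correct = sum(int(p == l) for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical outputs: 3 of the 4 predictions match the labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
```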

bingyanghuang (Contributor) commented:

New benchmark follow-up is in PaddlePaddle/benchmark#275


4 participants