
[QAT] BERT Model #21228

Closed
Sand3r- opened this issue Nov 18, 2019 · 3 comments
Sand3r- (Contributor) commented Nov 18, 2019

Hello,
as agreed previously, Intel is going to prepare the QAT pass for BERT INT8.
Hence, we'd like to find out the following:

  1. Which flavour (type) of BERT model shall we optimize? There are several that I know of, each built from a different set of ops.
  2. Could you please provide the data reader for us? Two of the QAT BERT models we have received accept only 2 inputs, while other BERT models we have seen (the one from the benchmark repository or the one from the BERT unit test) contain 4 inputs, named placeholder[0-3].
  3. How should we compute the accuracy?
  4. What performance measure should we use for the model in question? Is it words per second (wps) or something else?
Sand3r- (Contributor, Author) commented Nov 19, 2019

So far we've attempted optimisation of the BERT float_model using the QATv1 mechanism. These are the profiling results:
FP32 QAT BERT

Run 100 samples, average latency: 181.305 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 181.006 ms per sample.

QATv1 INT8 model

Run 100 samples, average latency: 50.4984 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 48.1151 ms per sample.

According to the final benchmark result we have managed to achieve a ~3.8x speedup. However, since both the FP32 and INT8 runs had a lot of outliers in their results (a typical result was ~100.712 ms, while some outliers were much larger, e.g. 705.972 ms or 672.464 ms), the averages were skewed. Hence I want to add that the typical latency of a single batch computation was 100.712 ms for FP32 QAT BERT and 44.4283 ms for QAT INT8 BERT.
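As a side note, the effect of those outliers can be made visible by reporting the median alongside the mean of the per-sample latencies. A minimal sketch (the helper name and the sample values below are hypothetical, chosen only to illustrate the outlier effect; the speedup line reuses the steady-state averages reported above):

```python
import statistics

def summarize_latency(latencies_ms, warmup=1):
    """Summarize per-sample latencies, excluding warmup steps.

    The median is robust against a few large outlier samples
    (e.g. >600 ms) that would otherwise skew the mean.
    """
    steady = latencies_ms[warmup:]
    return {
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "max_ms": max(steady),
    }

# Hypothetical samples: mostly ~100 ms, with one large outlier.
stats = summarize_latency([181.3, 100.7, 100.7, 705.9, 100.7])
print(stats["median_ms"])  # → 100.7 (the outlier barely moves the median)

# Speedup from the steady-state averages reported above (FP32 vs INT8).
speedup = 181.006 / 48.1151
print(round(speedup, 2))  # → 3.76, i.e. the ~3.8x quoted above
```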

Full output for FP32 QAT
Full output for INT8 QAT

bingyanghuang (Contributor) commented:

@Sand3r- Thanks for your results. To answer your questions:

  1. The QAT INT8 model which has two inputs.
  2. The UT and input data have been sent to you via Slack.
  3. Accuracy can be calculated by comparing the results against the labels (sent to you via Slack).
  4. Performance should be the average latency with batch_size=1 and max_seqlen=128.
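The accuracy calculation in point 3 can be sketched as a simple element-wise comparison between the model's results and the reference labels; a minimal sketch (the helper name and sample values are hypothetical, not from the data sent via Slack):

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have equal length")
    correct = sum(int(p == l) for p, l in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical outputs: 3 of the 4 predictions match the labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # → 0.75
```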

bingyanghuang (Contributor) commented:

New benchmark follow-up is in PaddlePaddle/benchmark#275


4 participants