This is the reference implementation for MLPerf Inference benchmarks for Natural Language Processing.
The chosen model is BERT-Large performing the SQuAD v1.1 question-answering task.
## Prerequisites

- nvidia-docker
- Any NVIDIA GPU supported by TensorFlow or PyTorch
## Supported Models

model | framework | accuracy | dataset | model link | model source | precision | notes |
---|---|---|---|---|---|---|---|
BERT-Large | TensorFlow | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | BERT-Large, trained with NVIDIA DeepLearningExamples | fp32 | |
BERT-Large | PyTorch | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | BERT-Large, trained with NVIDIA DeepLearningExamples, converted with bert_tf_to_pytorch.py | fp32 | |
BERT-Large | ONNX | f1_score=90.874% | SQuAD v1.1 validation set | from zenodo | BERT-Large, trained with NVIDIA DeepLearningExamples, converted with bert_tf_to_pytorch.py | fp32 | |
BERT-Large | ONNX | f1_score=90.067% | SQuAD v1.1 validation set | from zenodo | Fine-tuned from the PyTorch model and converted with bert_tf_to_pytorch.py | int8, symmetrically per-tensor quantized without bias | See [MLPerf INT8 BERT Finetuning.pdf](MLPerf INT8 BERT Finetuning.pdf) for details about the fine-tuning process |
## Disclaimer

This benchmark app is a reference implementation that is not meant to be the fastest implementation possible.
## Commands

Please run the following commands:

- `make setup`: initialize the submodules and download the datasets and models.
- `make build_docker`: build the docker image.
- `make launch_docker`: launch the docker container with an interactive session.
- `python3 run.py --backend=[tf|pytorch|onnxruntime|tf_estimator] --scenario=[Offline|SingleStream|MultiStream|Server] [--accuracy] [--quantized]`: run the harness inside the docker container. Performance or accuracy results will be printed to the console. A sketch of how these flags map onto LoadGen settings follows below.
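
For orientation, here is a minimal sketch of how a harness like `run.py` typically maps the `--scenario` and `--accuracy` flags onto LoadGen's Python API. The stub callbacks and the placeholder sample count are illustrative assumptions, not code from this repository:

```python
# A minimal LoadGen harness sketch (not the actual run.py): how the
# --scenario and --accuracy flags typically map onto LoadGen's Python API.
import argparse
import mlperf_loadgen as lg

SCENARIO_MAP = {
    "Offline": lg.TestScenario.Offline,
    "SingleStream": lg.TestScenario.SingleStream,
    "MultiStream": lg.TestScenario.MultiStream,
    "Server": lg.TestScenario.Server,
}

def issue_queries(query_samples):
    # A real SUT runs BERT here and returns the concatenated logits; this
    # stub completes every query with an empty response so the sketch runs.
    lg.QuerySamplesComplete(
        [lg.QuerySampleResponse(qs.id, 0, 0) for qs in query_samples])

def flush_queries():
    pass

def load_samples(sample_indices):
    pass  # the real QSL loads tokenized SQuAD features into memory

def unload_samples(sample_indices):
    pass

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--scenario", choices=list(SCENARIO_MAP), default="Offline")
    parser.add_argument("--accuracy", action="store_true")
    args = parser.parse_args()

    settings = lg.TestSettings()
    settings.scenario = SCENARIO_MAP[args.scenario]
    # AccuracyOnly writes the accuracy log that accuracy-squad.py consumes;
    # PerformanceOnly measures latency/throughput instead.
    settings.mode = (lg.TestMode.AccuracyOnly if args.accuracy
                     else lg.TestMode.PerformanceOnly)

    total_count = 128  # placeholder; the real QSL reports the SQuAD feature count
    # Note: ConstructSUT/ConstructQSL argument lists vary slightly across
    # LoadGen versions.
    sut = lg.ConstructSUT(issue_queries, flush_queries)
    qsl = lg.ConstructQSL(total_count, total_count, load_samples, unload_samples)
    lg.StartTest(sut, qsl, settings)
    lg.DestroyQSL(qsl)
    lg.DestroySUT(sut)

if __name__ == "__main__":
    main()
```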
## Details

- SUT implementations are in `tf_SUT.py`, `tf_estimator_SUT.py`, and `pytorch_SUT.py`. The QSL implementation is in `squad_QSL.py`.
- The script `accuracy-squad.py` parses the LoadGen accuracy log, post-processes it, and computes the accuracy (see the first sketch after this list).
- Tokenization and detokenization (post-processing) are not included in the timed path.
- The inputs to the SUT are `input_ids`, `input_mask`, and `segment_ids`. The output from the SUT is `start_logits` and `end_logits` concatenated together (see the second sketch after this list). `max_seq_length` is 384.
- The script `tf_freeze_bert.py` freezes the TensorFlow model into a `.pb` file.
- The script `bert_tf_to_pytorch.py` converts the TensorFlow model into the PyTorch `BertForQuestionAnswering` module in HuggingFace Transformers and also exports the model to ONNX format (see the third sketch after this list).
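
To make the bullets above concrete, the following sketches are hedged illustrations, not code from this repository. First, the accuracy path: LoadGen's accuracy log is a JSON list of records whose `data` field is a hex dump of the raw bytes the SUT returned. The path handling and helper name are assumptions, as is the exact layout (start and end logits stacked along the last axis):

```python
# Simplified sketch of what parsing LoadGen's accuracy log involves; the
# authoritative post-processing lives in accuracy-squad.py.
import json
import numpy as np

MAX_SEQ_LENGTH = 384

def load_accuracy_log(path):
    with open(path) as f:
        records = json.load(f)
    results = {}
    for record in records:
        # "data" is a hex dump of the bytes the SUT returned, i.e. the
        # concatenated start/end logits for one sample.
        raw = bytes.fromhex(record["data"])
        logits = np.frombuffer(raw, dtype=np.float32).reshape(MAX_SEQ_LENGTH, 2)
        # qsl_idx says which tokenized SQuAD feature this response answers.
        results[record["qsl_idx"]] = (logits[:, 0], logits[:, 1])
    return results
```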
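Second, the input/output contract: only the tensor names and `max_seq_length=384` come from the list above; the padding helper, the assumed `(max_seq_length, 2)` output layout, and the naive span decoding are illustrative:

```python
# Illustrative sketch of the SUT's input/output contract; helpers are
# hypothetical, not taken from the harness.
import numpy as np

MAX_SEQ_LENGTH = 384

def make_inputs(token_ids, segment_ids):
    """Pad one tokenized question+context pair to max_seq_length."""
    pad = MAX_SEQ_LENGTH - len(token_ids)
    return {
        "input_ids": np.array(token_ids + [0] * pad, dtype=np.int64),
        # 1 marks real tokens, 0 marks padding.
        "input_mask": np.array([1] * len(token_ids) + [0] * pad, dtype=np.int64),
        # 0 marks question tokens, 1 marks context tokens.
        "segment_ids": np.array(segment_ids + [0] * pad, dtype=np.int64),
    }

def decode_output(raw_output):
    """Split the concatenated logits and pick a naive answer span."""
    logits = np.asarray(raw_output, dtype=np.float32).reshape(MAX_SEQ_LENGTH, 2)
    start_logits, end_logits = logits[:, 0], logits[:, 1]
    return int(np.argmax(start_logits)), int(np.argmax(end_logits))
```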
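Third, the conversion path: a hedged outline of building the HuggingFace `BertForQuestionAnswering` module with BERT-Large hyperparameters and exporting it with `torch.onnx.export`. Weight copying from the TensorFlow checkpoint is elided and the file names are placeholders; `bert_tf_to_pytorch.py` is the authoritative implementation:

```python
# Hedged outline of the TF -> PyTorch -> ONNX conversion shape; the real
# logic, including copying weights out of the TF checkpoint, lives in
# bert_tf_to_pytorch.py.
import torch
from transformers import BertConfig, BertForQuestionAnswering

# BERT-Large hyperparameters; return_dict=False makes forward() return a
# (start_logits, end_logits) tuple, which is friendlier to ONNX tracing.
config = BertConfig(
    hidden_size=1024,
    num_hidden_layers=24,
    num_attention_heads=16,
    intermediate_size=4096,
    return_dict=False,
)
model = BertForQuestionAnswering(config)
# ... weight copying from the trained TensorFlow checkpoint goes here ...
model.eval()

# Trace and export with the three inputs the SUT feeds the model.
seq = torch.ones(1, 384, dtype=torch.long)
torch.onnx.export(
    model,
    (seq, seq, seq),          # input_ids, attention_mask, token_type_ids
    "bert_large_qa.onnx",     # placeholder output path
    input_names=["input_ids", "input_mask", "segment_ids"],
    output_names=["start_logits", "end_logits"],
)
```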
## License

Apache License 2.0