diff --git a/neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/index.rst b/neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/index.rst
index b112925b..2bdceec8 100644
--- a/neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/index.rst
+++ b/neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/index.rst
@@ -38,13 +38,11 @@ Computer Vision
 Natural Language Processing
 ---------------------------
 
-* BERT tutorial :ref:`[html] `
 * MXNet 1.8: Using data parallel mode tutorial :ref:`[html] ` :mxnet-neuron-src:`[notebook] `
 
 .. toctree::
    :hidden:
 
-   /neuron-guide/neuron-frameworks/mxnet-neuron/tutorials/bert_mxnet/index
    /src/examples/mxnet/data_parallel/data_parallel_tutorial.ipynb
 
diff --git a/src/examples/mxnet/bert_mxnet.ipynb b/src/examples/mxnet/bert_mxnet.ipynb
deleted file mode 100644
index dc03738e..00000000
--- a/src/examples/mxnet/bert_mxnet.ipynb
+++ /dev/null
@@ -1,676 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Running MXNet BERT-Base on Inf1 instances\n",
-    "\n",
-    "BERT (Bidirectional Encoder Representations from Transformers) is a Google Research project published in 2018 [https://arxiv.org/abs/1810.04805](https://arxiv.org/abs/1810.04805). BERT has a number of practical applications: it can be used for question answering, sequence prediction, and sequence classification, among other tasks.\n",
-    "\n",
-    "This tutorial walks you through the process of modifying and compiling BERT-Base (L-12 H-768 A-12) with sequence length 64 and a batch size of 8 to run on Inf1 instances.\n",
-    "\n",
-    "### Warning\n",
-    "This tutorial was tested on MXNet-1.5.\n",
-    "\n",
-    "MXNet-1.5 entered maintenance mode and requires Neuron Runtime 1.0; please see: [MXNet-1.5 enters maintenance mode](../../../../release-notes/maintenance.html)\n",
-    "\n",
-    "To set up a development environment for MXNet-1.5, see the installation instructions for Neuron 1.15.1: [Neuron-1.15.1 MXNet install](../../../../neuron-intro/neuron-install-guide.html#apache-mxnet-setup)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### PRE-REQUISITES\n",
-    "\n",
-    "To ensure we have a clean working environment, clear the hardware of any prior state."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import mxnet as mx\n",
-    "from packaging import version\n",
-    "mxnet_version = version.parse(mx.__version__)\n",
-    "\n",
-    "if mxnet_version >= version.parse(\"1.8\"):\n",
-    "    !sudo rmmod neuron; sudo modprobe neuron\n",
-    "else:\n",
-    "    !neuron-cli reset"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This notebook uses the `wget` pip package alongside the standard-library modules `os`, `shutil`, and `sys`; only `wget` needs to be installed before proceeding (a convenience install cell is included below)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### SETUP\n",
-    "\n",
-    "We used publicly available instructions to generate a saved model for open-source BERT using fine-tuned SST-2 weights. The steps to generate this model can be found [here](https://gluon-nlp.mxnet.io/v0.9.x/model_zoo/bert/index.html#sentence-classification), or you can download a model trained on SST-2 from [here](https://dist-bert.s3.amazonaws.com/demo/finetune/sst.params). Place the saved model in a directory named \"gluonnlp_bert\" under the bert_demo directory (it is assumed that this notebook is inside the bert_demo directory). Download the gluon-nlp package and add it to the system path so that we can import it in Python as a module."
-   ]
-  },
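-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The setup cell below imports `wget`, so install it first if needed. This is a minimal convenience sketch; it assumes the active kernel's `pip` targets the environment you intend to use."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Install the only non-standard-library dependency used in this notebook.\n",
-    "%pip install wget"
-   ]
-  },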
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import os \n", - "import shutil\n", - "import sys\n", - "import wget\n", - "\n", - "# Clone a copy of gluon-nlp and check out 184a0007bc4165d5fe080a58dd3ff9bb413203a6\n", - "if os.path.isdir('gluon-nlp'):\n", - " print(\"Removing Gluon-nlp... \")\n", - " shutil.rmtree('gluon-nlp')\n", - "os.system(\"git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp; \\\n", - " git checkout 184a0007bc4165d5fe080a58dd3ff9bb413203a6\")\n", - "p = 'gluon-nlp/src/'\n", - "sys.path.insert(0,p)\n", - "\n", - "# Download a copy of sst.params \n", - "if os.path.isdir('gluonnlp_bert'):\n", - " print(\"Removing existing bert params... \")\n", - " shutil.rmtree('gluonnlp_bert')\n", - "os.mkdir('gluonnlp_bert')\n", - "print('Beginning download of sst params...')\n", - "url = 'https://dist-bert.s3.amazonaws.com/demo/finetune/sst.params'\n", - "wget.download(url, 'gluonnlp_bert/sst.params')\n", - "\n", - "# Remove output_director if present from previous runs\n", - "if os.path.isdir('output_dir'):\n", - " print(\"Removing existing output_dir... \")\n", - " shutil.rmtree('output_dir')\n", - "os.mkdir('output_dir')\n", - "print('Download of all necessary files complete.')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### MODIFYING BERT FOR INFERENTIA\n", - "\n", - "Create an instance of BERT classifier from gluonnlp and modify it to make it work on inferentia. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import mxnet as mx\n", - "import gluonnlp as nlp\n", - "from gluonnlp.model import get_model, BERTClassifier\n", - "import logging\n", - "import warnings\n", - "\n", - "nlp.utils.check_version('0.8.1')\n", - "\n", - "# BERT Model design parameters\n", - "seq_length = 64\n", - "model_parameters = 'gluonnlp_bert/sst.params'\n", - "model_name = 'bert_12_768_12'\n", - "dataset_name = 'book_corpus_wiki_en_uncased'\n", - "output_dir = 'output_dir'\n", - "batch_size = 8\n", - "dropout = 0.1\n", - "num_units = 1024 if model_name == 'bert_24_1024_16' else 768\n", - "\n", - "# Create an instance of bert classifier and hybridize it \n", - "bert, _ = get_model(\n", - " name=model_name,\n", - " dataset_name=dataset_name,\n", - " pretrained=False,\n", - " use_pooler=True,\n", - " use_decoder=False,\n", - " use_classifier=False,\n", - " dropout=0.0)\n", - "net = BERTClassifier(bert, num_classes=2, dropout=dropout)\n", - "\n", - "# Load the parameters from downloaded sst.params files\n", - "net.load_parameters(model_parameters)\n", - "net.hybridize(static_alloc=True, static_shape=True)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make the modifications necessary to make bert classifier inferentia compatible and extract maximum performance from the hardware. Following modifications are made: \n", - "1. Remove dropouts from the inference graph\n", - "2. Embedding lookup and processing is removed from the network and done on cpu. \n", - "3. Mask used when sequence length is less than max_sequence length is also generated on CPU and feed into inferentia as an input tensor." 
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import math\n",
-    "f = mx.sym\n",
-    "\n",
-    "def broadcast_axis(data=None, axis=None, size=None, out=None, name=None, **kwargs):\n",
-    "    # Broadcast along axis 1 by dividing by a ones tensor of the target size.\n",
-    "    assert axis == 1\n",
-    "    ones = f.ones((1, size, 1, 1))\n",
-    "    out = f.broadcast_div(data, ones)\n",
-    "    return out\n",
-    "\n",
-    "def div_sqrt_dim(data=None, out=None, name=None, **kwargs):\n",
-    "    # Scale attention scores by sqrt(per-head dimension).\n",
-    "    assert '1024' in model_name or '768' in model_name\n",
-    "    units = 1024 / 16 if '1024' in model_name else 768 / 12\n",
-    "    return data / math.sqrt(units)\n",
-    "\n",
-    "def embedding_op(data=None, weight=None, input_dim=None, output_dim=None, dtype=None,\n",
-    "                 sparse_grad=None, out=None, name=None, batch_mode=True, **kwargs):\n",
-    "    # Embedding lookup is done on the CPU during pre-processing, so inside\n",
-    "    # the compiled graph this op is an identity over the embedding inputs.\n",
-    "    return data\n",
-    "\n",
-    "def embedding(self, F, x, weight):\n",
-    "    out = embedding_op(x, weight, name='fwd', batch_mode=False, **self._kwargs)\n",
-    "    return out\n",
-    "\n",
-    "def gelu(self, F, x):\n",
-    "    # tanh approximation of GELU.\n",
-    "    return 0.5 * x * (1 + F.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * (x ** 3))))\n",
-    "\n",
-    "def layer_norm(self, F, data, gamma, beta):\n",
-    "    # Layer normalization expressed with basic broadcast ops.\n",
-    "    mean = data.mean(axis=self._axis, keepdims=True)\n",
-    "    delta = F.broadcast_sub(data, mean)\n",
-    "    var = (delta ** 2).mean(axis=self._axis, keepdims=True)\n",
-    "    X_hat = F.broadcast_div(delta, var.sqrt() + self._epsilon)\n",
-    "    return F.broadcast_add(F.broadcast_mul(gamma, X_hat), beta)\n",
-    "\n",
-    "def arange_like(x, axis):\n",
-    "    arange = f.arange(start=0, repeat=num_units, step=1, stop=seq_length, dtype='float32')\n",
-    "    return arange\n",
-    "\n",
-    "def where(condition=None, x=None, y=None, name=None, attr=None, out=None, **kwargs):\n",
-    "    # Select x where condition == 1 and y where condition == 0, via arithmetic.\n",
-    "    return (condition == 0) * y + ((1 - condition) == 0) * x\n",
-    "\n",
-    "def dropout(data=None, p=None, mode=None, axes=None, cudnn_off=None, out=None, name=None, **kwargs):\n",
-    "    # Dropout is a no-op in the inference graph.\n",
-    "    return data\n",
-    "\n",
-    "def bert_model___call__(self, inputs, valid_length=None, mask=None, masked_positions=None):\n",
-    "    # pylint: disable=dangerous-default-value, arguments-differ\n",
-    "    \"\"\"Generate the representation given the inputs.\n",
-    "    This is used in training or fine-tuning a BERT model.\n",
-    "    \"\"\"\n",
-    "    return super(nlp.model.BERTModel, self).__call__(inputs, valid_length, mask, masked_positions)\n",
-    "\n",
-    "def bert_model_hybrid_forward(self, F, inputs, valid_length=None, mask=None, masked_positions=None):\n",
-    "    # pylint: disable=arguments-differ\n",
-    "    \"\"\"Generate the representation given the inputs.\n",
-    "    This is used in training or fine-tuning a BERT model.\n",
-    "    \"\"\"\n",
-    "    outputs = []\n",
-    "    seq_out, attention_out = self._encode_sequence(inputs, valid_length, mask)\n",
-    "    outputs.append(seq_out)\n",
-    "\n",
-    "    if self.encoder._output_all_encodings:\n",
-    "        assert isinstance(seq_out, list)\n",
-    "        output = seq_out[-1]\n",
-    "    else:\n",
-    "        output = seq_out\n",
-    "\n",
-    "    if attention_out:\n",
-    "        outputs.append(attention_out)\n",
-    "\n",
-    "    if self._use_pooler:\n",
-    "        pooled_out = self._apply_pooling(output)\n",
-    "        outputs.append(pooled_out)\n",
-    "        if self._use_classifier:\n",
-    "            next_sentence_classifier_out = self.classifier(pooled_out)\n",
-    "            outputs.append(next_sentence_classifier_out)\n",
-    "    if self._use_decoder:\n",
-    "        assert masked_positions is not None, \\\n",
-    "            'masked_positions tensor is required for decoding masked language model'\n",
-    "        decoder_out = self._decode(F, output, masked_positions)\n",
-    "        outputs.append(decoder_out)\n",
-    "    return tuple(outputs) if len(outputs) > 1 else outputs[0]\n",
-    "\n",
-    "def bert_model__encode_sequence(self, inputs, valid_length=None, mask=None):\n",
-    "    \"\"\"Generate the representation given the input sequences.\n",
-    "    This is used for pre-training or fine-tuning a BERT model.\n",
-    "    \"\"\"\n",
-    "    # Embedding lookup and positional embeddings are handled on the CPU,\n",
-    "    # so the inputs here are already embeddings.\n",
-    "    outputs, additional_outputs = self.encoder(inputs, valid_length=valid_length, mask=mask)\n",
-    "    return outputs, additional_outputs\n",
-    "\n",
-    "def bert_encoder___call__(self, inputs, states=None, valid_length=None, mask=None):  # pylint: disable=arguments-differ\n",
-    "    return mx.gluon.HybridBlock.__call__(self, inputs, states, valid_length, mask)\n",
-    "\n",
-    "def bert_encoder_hybrid_forward(self, F, inputs, states=None, valid_length=None, mask=None, position_weight=None):\n",
-    "    if self._dropout:\n",
-    "        inputs = self.dropout_layer(inputs)\n",
-    "    inputs = self.layer_norm(inputs)\n",
-    "    outputs = inputs\n",
-    "\n",
-    "    all_encodings_outputs = []\n",
-    "    additional_outputs = []\n",
-    "    for cell in self.transformer_cells:\n",
-    "        outputs, attention_weights = cell(inputs, mask)\n",
-    "        inputs = outputs\n",
-    "        if self._output_all_encodings:\n",
-    "            if valid_length is not None:\n",
-    "                outputs = F.SequenceMask(outputs, sequence_length=valid_length,\n",
-    "                                         use_sequence_length=True, axis=1)\n",
-    "            all_encodings_outputs.append(outputs)\n",
-    "\n",
-    "        if self._output_attention:\n",
-    "            additional_outputs.append(attention_weights)\n",
-    "\n",
-    "    if valid_length is not None and not self._output_all_encodings:\n",
-    "        outputs = F.SequenceMask(outputs, sequence_length=valid_length,\n",
-    "                                 use_sequence_length=True, axis=1)\n",
-    "\n",
-    "    if self._output_all_encodings:\n",
-    "        return all_encodings_outputs, additional_outputs\n",
-    "    return outputs, additional_outputs\n",
-    "\n",
-    "def bert_classifier___call__(self, inputs, valid_length=None, mask=None):\n",
-    "    # pylint: disable=dangerous-default-value, arguments-differ\n",
-    "    return super(BERTClassifier, self).__call__(inputs, valid_length, mask)\n",
-    "\n",
-    "def bert_classifier_hybrid_forward(self, F, inputs, valid_length=None, mask=None):\n",
-    "    # pylint: disable=arguments-differ\n",
-    "    _, pooler_out = self.bert(inputs, valid_length, mask)\n",
-    "    return self.classifier(pooler_out)\n",
-    "\n",
-    "# Patch the ops and blocks defined above into gluonnlp / mxnet.\n",
-    "nlp.model.GELU.hybrid_forward = gelu\n",
-    "mx.gluon.nn.LayerNorm.hybrid_forward = layer_norm\n",
-    "mx.gluon.nn.Embedding.hybrid_forward = embedding\n",
-    "f.contrib.arange_like = arange_like\n",
-    "f.Embedding = embedding_op\n",
-    "f.contrib.div_sqrt_dim = div_sqrt_dim\n",
-    "f.broadcast_axis = broadcast_axis\n",
-    "f.where = where\n",
-    "f.Dropout = dropout\n",
-    "nlp.model.bert.BERTModel.__call__ = bert_model___call__\n",
-    "nlp.model.bert.BERTModel._encode_sequence = bert_model__encode_sequence\n",
-    "nlp.model.bert.BERTModel.hybrid_forward = bert_model_hybrid_forward\n",
-    "nlp.model.bert.BERTEncoder.__call__ = bert_encoder___call__\n",
-    "nlp.model.bert.BERTEncoder.hybrid_forward = bert_encoder_hybrid_forward\n",
-    "nlp.model.bert.BERTClassifier.__call__ = bert_classifier___call__\n",
-    "nlp.model.bert.BERTClassifier.hybrid_forward = bert_classifier_hybrid_forward"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now that we have modified the graph, let's save it to generate the symbol and param files that will be used for compilation for Inferentia. Since we partitioned the embedding part of the graph to be executed on the CPU, we pull those parameters out of the original params so that we can load them on the CPU for pre-processing."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# Dummy variables for the new inputs we have:\n",
-    "# inputs: embeddings. shape: (batch_size, seq_length, num_units)\n",
-    "# valid_length: number of valid tokens in the input. shape: (batch_size,)\n",
-    "# mask: mask to remove invalid positions in the graph.\n",
-    "\n",
-    "inputs = mx.nd.arange(batch_size * seq_length * num_units)\n",
-    "inputs = inputs.reshape(shape=(batch_size, seq_length, num_units))\n",
-    "valid_length = mx.nd.arange(batch_size)\n",
-    "steps = mx.nd.arange(start=0, stop=seq_length, dtype='float32')\n",
-    "ones = mx.nd.ones_like(steps)\n",
-    "mask = mx.nd.broadcast_lesser(mx.nd.reshape(steps, shape=(1, -1)),\n",
-    "                              mx.nd.reshape(valid_length, shape=(-1, 1)))\n",
-    "mask = mx.nd.broadcast_mul(mx.nd.expand_dims(mask, axis=1),\n",
-    "                           mx.nd.broadcast_mul(ones, mx.nd.reshape(ones, shape=(-1, 1))))\n",
-    "\n",
-    "def export(batch, prefix):\n",
-    "    \"\"\"Export the model.\"\"\"\n",
-    "    print('Exporting the model ... ')\n",
-    "    out = net(inputs, valid_length, mask)\n",
-    "    export_special(net, prefix, epoch=0)\n",
-    "    assert os.path.isfile(prefix + '-symbol.json')\n",
-    "    assert os.path.isfile(prefix + '-0000.params')\n",
-    "\n",
-    "def export_special(net, path, epoch):\n",
-    "    sym = net._cached_graph[1]\n",
-    "    sym.save('%s-symbol.json' % path, remove_amp_cast=False)\n",
-    "\n",
-    "    arg_names = set(sym.list_arguments())\n",
-    "    aux_names = set(sym.list_auxiliary_states())\n",
-    "    arg_dict = {}\n",
-    "    save_fn = mx.nd.save\n",
-    "    embedding_dict = {}\n",
-    "    # Embedding-related parameters are saved separately; they are used on\n",
-    "    # the CPU during pre-processing rather than on Inferentia.\n",
-    "    for name, param in net.collect_params().items():\n",
-    "        if 'position_weight' in name or 'word_embed_embedding0_weight' in name or 'token_type_embed_embedding0_weight' in name:\n",
-    "            embedding_dict[name] = param._reduce()\n",
-    "        elif name in arg_names:\n",
-    "            arg_dict['arg:%s' % name] = param._reduce()\n",
-    "        else:\n",
-    "            assert name in aux_names, name\n",
-    "            arg_dict['aux:%s' % name] = param._reduce()\n",
-    "    save_fn('%s-%04d.params' % (path, epoch), arg_dict)\n",
-    "    save_fn('%s-%04d.embeddings' % (path, epoch), embedding_dict)\n",
-    "\n",
-    "prefix = os.path.join('output_dir', 'classification-' + model_name + '-' + str(seq_length))\n",
-    "export(batch_size, prefix)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we implement the portion of the original graph that we removed (embedding lookup and processing) as a pre-process function using mx.nd. This part of the graph is executed on the CPU, and its outputs are used as input tensors to the Inferentia graph."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def pre_process(sentences, transform, max_len, embedding_dict):\n",
-    "    \"\"\"\n",
-    "    This pre-processing function executes the part of the network we\n",
-    "    removed in the previous sections. It creates input tensors for a\n",
-    "    batch of input data / sentences.\n",
-    "    Arguments:\n",
-    "    - sentences: list of inputs of shape: (batch_size,)\n",
-    "    - transform: sentence transformer which tokenizes the input sentences\n",
-    "    - max_len: max sequence length the network was designed for\n",
-    "    - embedding_dict: the embedding dictionary that we extracted from the\n",
-    "                      graph in the previous section. It is used for\n",
-    "                      embedding-value lookup during inference.\n",
-    "    Return:\n",
-    "    - ips_b: batched input embeddings, shape (batch_size, max_len, num_units)\n",
-    "    - sq_len_b: valid sequence lengths, shape (batch_size,)\n",
-    "    - mask_b: attention masks, shape (batch_size, max_len, max_len)\n",
-    "    \"\"\"\n",
-    "    ips_b = None\n",
-    "    tk_types_b = None\n",
-    "    sq_len_b = None\n",
-    "    mask_b = None\n",
-    "\n",
-    "    for sentence in sentences:\n",
-    "        inputs, seq_len, token_types = transform([sentence])\n",
-    "\n",
-    "        inputs_arr = mx.nd.array([inputs])\n",
-    "        token_types_arr = mx.nd.array([token_types])\n",
-    "        positional_arr = mx.nd.arange(max_len)\n",
-    "        seq_len = mx.nd.array([seq_len])\n",
-    "        max_len1 = mx.nd.array([max_len])\n",
-    "\n",
-    "        # bert_encoder_hybrid_forward ~~\n",
-    "        steps = mx.nd.arange(start=0, stop=max_len, dtype='float32')\n",
-    "        ones = mx.nd.ones_like(steps)\n",
-    "        mask = mx.nd.broadcast_lesser(mx.nd.reshape(steps, shape=(1, -1)),\n",
-    "                                      mx.nd.reshape(max_len1, shape=(-1, 1)))\n",
-    "        mask = mx.nd.broadcast_mul(mx.nd.expand_dims(mask, axis=1),\n",
-    "                                   mx.nd.broadcast_mul(ones, mx.nd.reshape(ones, shape=(-1, 1))))\n",
-    "\n",
-    "        ips = mx.nd.take(embedding_dict['bertmodel0_word_embed_embedding0_weight'], inputs_arr)\n",
-    "        tk_types = mx.nd.take(embedding_dict['bertmodel0_token_type_embed_embedding0_weight'], token_types_arr)\n",
-    "        ps_arr = mx.nd.take(embedding_dict['bertencoder0_position_weight'], positional_arr)\n",
-    "        sq_len = seq_len\n",
-    "\n",
-    "        # BATCHING ~~~\n",
-    "        if ips_b is None:\n",
-    "            ips_b = ips\n",
-    "            tk_types_b = tk_types\n",
-    "            sq_len_b = sq_len\n",
-    "            mask_b = mask\n",
-    "        else:\n",
-    "            ips_b = mx.nd.concat(ips_b, ips, dim=0)\n",
-    "            tk_types_b = mx.nd.concat(tk_types_b, tk_types, dim=0)\n",
-    "            sq_len_b = mx.nd.concat(sq_len_b, sq_len, dim=0)\n",
-    "            mask_b = mx.nd.concat(mask_b, mask, dim=0)\n",
-    "\n",
-    "    # bert_model__encode_sequence ~~~\n",
-    "    ips_b = ips_b + tk_types_b\n",
-    "\n",
-    "    # Broadcast-add the positional embedding on the CPU (it was removed\n",
-    "    # from the Inferentia graph).\n",
-    "    ips_b = mx.nd.broadcast_add(ips_b, mx.nd.expand_dims(ps_arr, axis=0))\n",
-    "    print(type(ips_b), type(sq_len_b), type(mask_b))\n",
-    "    return ips_b, sq_len_b, mask_b"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Now we create a sample input. Since we are compiling for a batch size of 8, we put 8 sentences into the list. The following code downloads the necessary vocabulary files (if they are not already in ~/.mxnet/models/). Using that vocabulary we create a tokenizer that transforms the input sentences. We also load the embeddings file that we created while saving the model earlier. With these we execute the first part of the graph (the pre-process function) on the CPU and generate a feed_dict that is then used to feed input tensors to the Inferentia graph."
-   ]
-  },
\n", - "sentences = ['Neuron is awesome',\n", - " 'Neuron is great',\n", - " 'Neuron is confusing',\n", - " 'I Like Neuron',\n", - " 'Pizza is my favorite food',\n", - " 'I love living in bay area',\n", - " 'Driving is not fun',\n", - " 'Neuron has very good performance']\n", - "\n", - "if len(sentences) != batch_size:\n", - " raise ValueError(\"Input dimensions don't match batch size\")\n", - "\n", - "_, vocabulary = nlp.model.get_model('bert_12_768_12',\n", - " dataset_name='book_corpus_wiki_en_uncased',\n", - " pretrained=False)\n", - "tokenizer = nlp.data.BERTTokenizer(vocabulary)\n", - "transform = nlp.data.BERTSentenceTransform(tokenizer, max_seq_length=seq_length, \n", - " pair=False, pad=True)\n", - "\n", - "embedding_dict = mx.nd.load(prefix + '-0000.embeddings')\n", - "ips_b, sq_len_b, mask_b = pre_process(sentences, transform, seq_length, embedding_dict)\n", - "\n", - "# ips_b: embeddings. shape: (batch_size * seq_length * num_units) \n", - "# sq_len_b: number of valid tokens in the inputs. shape: (batch_size,)\n", - "# mask_b: mask to remove invalid positions in the graph.\n", - "feed_dict = {'data0': ips_b,\n", - " 'data1': sq_len_b,\n", - " 'data2': mask_b}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we create simple method that takes in a mxnet model and a feed dictionary and runs inferences and returns output values and average inference latencies. This method will be used to benchmark and compare inferentia and cpu runs." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import time\n", - "def run_model(sym, args, aux, ctx, args_update, num_runs):\n", - " args.update(args_update)\n", - " exe = sym.bind(ctx, args=args, aux_states=aux, grad_req='null')\n", - "\n", - " # Warmup inference\n", - " start = time.time()\n", - " exe.forward()\n", - " out = exe.outputs[0]\n", - " mx.nd.waitall()\n", - " end = time.time()\n", - " print('Warmup time : ', (end - start))\n", - "\n", - " start = time.time()\n", - " for i in range(num_runs):\n", - " exe.forward()\n", - " out = exe.outputs[0]\n", - " mx.nd.waitall()\n", - " end = time.time()\n", - " print('Avg inference time : ', (end - start) * 1. / num_runs)\n", - " return out, (end - start) * 1. / num_runs" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Before compiling the new graph for infentia, lets test the network we have for correctness. For this, we shall load checkpoint we generated earlier and run it with the output of the pre-processing function from above." 
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Before compiling the new graph for Inferentia, let's test the network for correctness. For this, we load the checkpoint we generated earlier and run it with the output of the pre-processing function from above."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": false
-   },
-   "outputs": [],
-   "source": [
-    "sym_ref, args_ref, aux_ref = mx.model.load_checkpoint(prefix, 0)\n",
-    "ref_out, _ = run_model(sym_ref, args_ref, aux_ref, mx.cpu(), feed_dict, 1)\n",
-    "label = mx.nd.argmax(ref_out, axis=1)\n",
-    "print(\"~~~~~~~~~~ Running on CPU ~~~~~~~~~~~~~ \")\n",
-    "for i, l in enumerate(label):\n",
-    "    print(sentences[i] + ' : ' + ('positive sentiment' if l.asscalar() == 1\n",
-    "                                  else 'negative sentiment'))"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### COMPILING THE NETWORK FOR INFERENTIA"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import json\n",
-    "def sym_nodes(sym):\n",
-    "    \"\"\"\n",
-    "    Return a list of nodes from sym\n",
-    "    \"\"\"\n",
-    "    return json.loads(sym.tojson())['nodes']\n",
-    "\n",
-    "def count_ops(graph_nodes):\n",
-    "    \"\"\"\n",
-    "    Return the number of operations in a node list\n",
-    "    \"\"\"\n",
-    "    return len([x['op'] for x in graph_nodes if x['op'] != 'null'])\n",
-    "\n",
-    "def get_compile_stats(sym):\n",
-    "    \"\"\"\n",
-    "    Return a triplet of compile statistics:\n",
-    "    - count of operations in the symbol database\n",
-    "    - number of Neuron subgraphs\n",
-    "    - number of operations compiled to the Neuron runtime\n",
-    "    \"\"\"\n",
-    "    cnt = count_ops(sym_nodes(sym))\n",
-    "    neuron_subgraph_cnt = 0\n",
-    "    neuron_compiled_cnt = 0\n",
-    "    for g in sym_nodes(sym):\n",
-    "        if g['op'] == '_neuron_subgraph_op':\n",
-    "            neuron_subgraph_cnt += 1\n",
-    "            for sg in g['subgraphs']:\n",
-    "                neuron_compiled_cnt += count_ops(sg['nodes'])\n",
-    "    return (cnt, neuron_subgraph_cnt, neuron_compiled_cnt)\n",
-    "\n",
-    "def neuron_compile(prefix, inputs):\n",
-    "    # Compile for Inferentia using Neuron\n",
-    "    compiler_args = {\"flags\": ['--fp32-cast', 'matmult-fp16']}\n",
-    "    sym, args_loaded, aux = mx.model.load_checkpoint(prefix, 0)\n",
-    "    sym, args_loaded, aux = mx.contrib.neuron.compile(sym, args_loaded, aux, inputs,\n",
-    "                                                      **compiler_args)\n",
-    "\n",
-    "    # Check if compilation was successful\n",
-    "    post_compile_cnt, neuron_subgraph_cnt, neuron_compiled_cnt = get_compile_stats(sym)\n",
-    "    print(\"INFO:mxnet: Number of operations in compiled model: \", post_compile_cnt)\n",
-    "    print(\"INFO:mxnet: Number of Neuron subgraphs in compiled model: \", neuron_subgraph_cnt)\n",
-    "    print(\"INFO:mxnet: Number of operations placed on Neuron runtime: \", neuron_compiled_cnt)\n",
-    "    num_ops_orig = (post_compile_cnt - neuron_subgraph_cnt + neuron_compiled_cnt)\n",
-    "    neuron_percentage = (neuron_compiled_cnt / num_ops_orig) * 100\n",
-    "    compile_success = 1 if neuron_percentage > 99.0 else 0\n",
-    "    assert compile_success, \"Expected > 99% on Inf, but got {}\".format(neuron_percentage)\n",
-    "\n",
-    "    # Save the compiled model\n",
-    "    mx.model.save_checkpoint(prefix + \"_compiled\", 0, sym, args_loaded, aux)\n",
-    "\n",
-    "neuron_compile(prefix, feed_dict)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### RUNNING INFERENCE ON INF1 MACHINES\n",
-    "\n",
-    "Load the checkpoint from the previous step and run inference. Things to note:\n",
-    "1. The prefix path now points to the compiled model.\n",
-    "2. The context has been changed to mx.neuron().\n",
-    "3. Warmup time is generally much higher than subsequent inference times (because of the time taken to load the compiled model)."
-   ]
-  },
" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "# Load Inferentia symbol\n", - "sym, args, aux = mx.model.load_checkpoint(prefix + '_compiled', 0)\n", - "\n", - "# Run model and get output and latency numbers for 100 runs \n", - "inf_out, latency = run_model(sym, args, aux, mx.neuron(), feed_dict, 100)\n", - "label_inf = mx.nd.argmax(inf_out, axis=1)\n", - "print(\"~~~~~~~~~~ Running on Inferentia ~~~~~~~~~~~~~ \")\n", - "for i, l in enumerate(label_inf):\n", - " print(sentences[i]+' : '+'positive sentiment' if l.asscalar() == 1 \\\n", - " else 'negative sentiment')\n", - " \n", - "# Check if the results are similar to CPU\n", - "np.testing.assert_allclose(inf_out.asnumpy(), ref_out.asnumpy(), atol=1e-2, rtol=1e-2)" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Environment (conda_mxnet_p37)", - "language": "python", - "name": "conda_mxnet_p37" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.7.10" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}