ML notebook formatting and text updates (#24437)
* merged and resolved the conflict

* more copy edits to the ML notebooks

* merged and resolved the conflict

* more copy edits to the ML notebooks

* more copy edits to the ML notebooks

* more copy edits to the ML notebooks

* trying to remove a section that shouldn't have been added back in

* Update examples/notebooks/beam-ml/custom_remote_inference.ipynb

Co-authored-by: Danny McCormick <[email protected]>

* Update examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb

Co-authored-by: Danny McCormick <[email protected]>

* review updates

Co-authored-by: Danny McCormick <[email protected]>
rszper and damccorm authored Dec 1, 2022
1 parent 5af2733 commit 2e71061
Showing 8 changed files with 197 additions and 187 deletions.
50 changes: 27 additions & 23 deletions examples/notebooks/beam-ml/custom_remote_inference.ipynb
@@ -4,6 +4,7 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"id": "paYiulysGrwR"
},
"outputs": [],
@@ -36,15 +37,16 @@
"source": [
"# Remote inference in Apache Beam\n",
"\n",
"This example demonstrates how to implement a custom inference call in Apache Beam using the Google Cloud Vision API.\n",
"\n",
"The prefered way to run inference in Apache Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). \n",
"The RunInference API enables you to run your models as part of your pipeline in a way that is optimized for machine learning inference. \n",
"The RunInference API enables you to run models as part of your pipeline in a way that is optimized for machine learning inference. \n",
"To reduce the number of steps that you need to take, RunInference supports features like batching. For more infomation about the RunInference API, review the [RunInference API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.html#apache_beam.ml.inference.RunInference), \n",
"which demonstrates how to implement model inference in PyTorch, scikit-learn, and TensorFlow.\n",
"\n",
"Currently, the RunInference API doesn't support making remote inference calls using the Natural Language API, Cloud Vision API, and so on. \n",
"Therefore, to use these remote APIs with Apache Beam, you need to write custom inference calls.\n",
"\n",
"This notebook shows how to implement a custom inference call in Apache Beam. This example uses the Google Cloud Vision API."
"Therefore, to use these remote APIs with Apache Beam, you need to write custom inference calls.\n"

]
},
{
@@ -53,7 +55,7 @@
"id": "GNbarEZsalS1"
},
"source": [
"## Use case: run the Cloud Vision API\n",
"## Run the Cloud Vision API\n",
"\n",
"You can use the Cloud Vision API to retrieve labels that describe an image.\n",
"For example, the following image shows a lion with possible labels."
@@ -75,20 +77,20 @@
 },
 "source": [
 "We want to run the Google Cloud Vision API on a large set of images, and Apache Beam is the ideal tool to handle this workflow.\n",
-"This example notebook demonstrates how to retrieve image labels with this API on a small set of images.\n",
+"This example demonstrates how to retrieve image labels with this API on a small set of images.\n",
 "\n",
-"The notebook follows these steps to implement this workflow:\n",
+"The example follows these steps to implement this workflow:\n",
 "* Read the images.\n",
 "* Batch the images together to optimize the model call.\n",
 "* Send the images to an external API to run inference.\n",
-"* Post-process the results of your API.\n",
+"* Postprocess the results of your API.\n",
 "\n",
 "**Caution:** Be aware of API quotas and the heavy load you might incur on your external API. Verify that your pipeline and API are configured correctly for your use case.\n",
 "\n",
 "To optimize the calls to the external API, limit the parallel calls to the external remote API by configuring [PipelineOptions](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options).\n",
 "In Apache Beam, different runners provide options to handle the parallelism, for example:\n",
-"* With the [Direct Runner](https://beam.apache.org/documentation/runners/direct/), use `direct_num_workers`.\n",
-"* With the [Google Cloud Dataflow Runner](https://beam.apache.org/documentation/runners/dataflow/), use `max_num_workers`.\n",
+"* With the [Direct Runner](https://beam.apache.org/documentation/runners/direct/), use the `direct_num_workers` pipeline option.\n",
+"* With the [Google Cloud Dataflow Runner](https://beam.apache.org/documentation/runners/dataflow/), use the `max_num_workers` pipeline option.\n",
 "\n",
 "For information about other runners, see the [Beam capability matrix](https://beam.apache.org/documentation/runners/capability-matrix/)."
 ]
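
A minimal sketch (not part of the commit) of the two parallelism options named in the preceding cell; the worker counts are illustrative, not recommendations.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Direct Runner: cap local parallelism with direct_num_workers.
local_options = PipelineOptions(runner="DirectRunner", direct_num_workers=4)

# Dataflow Runner: cap autoscaling with max_num_workers.
# A real Dataflow job also needs project, region, and temp_location.
dataflow_options = PipelineOptions(runner="DataflowRunner", max_num_workers=8)
```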
@@ -99,7 +101,7 @@
"id": "FAawWOaiIYaS"
},
"source": [
"## Installation\n",
"## Before you begin\n",
"\n",
"This section provides installation steps."
]
@@ -170,9 +172,11 @@
"id": "mL4MaHm_XOVd"
},
"source": [
"## Remote inference on Cloud Vision API\n",
"## Run remote inference on Cloud Vision API\n",
"\n",
"This section demonstates the steps to run remote inference on the Cloud Vision API.\n",
"\n",
"This section demonstates the steps to run remote inference on the Cloud Vision API."
"Download and install Apache Beam and the required modules."
]
},
{
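
The installation cell itself is collapsed in this diff. A hypothetical notebook cell matching the "Download and install" step might look like this; the package names are the published PyPI names, but the notebook's actual pins may differ.

```python
# Hypothetical install cell; versions in the real notebook may be pinned.
!pip install --quiet apache-beam[gcp] google-cloud-vision
```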
@@ -199,7 +203,7 @@
"id": "09k08IYlLmON"
},
"source": [
"For this example, we use images from the [MSCoco dataset](https://cocodataset.org/#explore) as a list of image urls.\n",
"This example uses images from the [MSCoco dataset](https://cocodataset.org/#explore) as a list of image URLs.\n",
"This data is used as the pipeline input."
]
},
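
As a rough illustration of that input step (the URLs below are placeholders, not the notebook's MSCoco list):

```python
import apache_beam as beam

# Placeholder URLs standing in for the MSCoco image list.
image_urls = [
    "https://example.com/images/0001.jpg",
    "https://example.com/images/0002.jpg",
]

with beam.Pipeline() as pipeline:
    inputs = pipeline | "CreateImageUrls" >> beam.Create(image_urls)
```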
@@ -234,20 +238,20 @@
"id": "HLy7VKJhLrmT"
},
"source": [
"### Custom DoFn\n",
"### Create a custom DoFn\n",
"\n",
"In order to implement remote inference, create a DoFn class. This class sends a batch of images to the Cloud vision API.\n",
"\n",
"The custom DoFn makes it possible to initialize the API. In case of a custom model, a model can also be loaded in the `setup` function. \n",
"\n",
"The `process` function is the most interesting part. In this function we implement the model call and return its results.\n",
"The `process` function is the most interesting part. In this function, we implement the model call and return its results.\n",
"\n",
"**Caution:** When running remote inference, prepare to encounter, identify, and handle failure as gracefully as possible. We recommend using the following techniques: \n",
"When running remote inference, prepare to encounter, identify, and handle failure as gracefully as possible. We recommend using the following techniques: \n",
"\n",
"* **Exponential backoff:** Retry failed remote calls with exponentially growing pauses between retries. Using exponential backoff ensures that failures don't lead to an overwhelming number of retries in quick succession. \n",
"\n",
"* **Dead letter queues:** Route failed inferences to a separate `PCollection` without failing the whole transform. You can continue execution without failing the job (batch jobs' default behavior) or retrying indefinitely (streaming jobs' default behavior).\n",
"You can then run custom pipeline logic on the deadletter queue to log the failure, alert, and push the failed message to temporary storage so that it can eventually be reprocessed. "
"* **Dead-letter queues:** Route failed inferences to a separate `PCollection` without failing the whole transform. You can continue execution without failing the job (batch jobs' default behavior) or retrying indefinitely (streaming jobs' default behavior).\n",
"You can then run custom pipeline logic on the dead-letter queue (unprocessed messages queue) to log the failure, alert, and push the failed message to temporary storage so that it can eventually be reprocessed."
]
},
{
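
The commit itself contains no code for these two techniques. The sketch below shows one way they could be combined in a DoFn, assuming a hypothetical `call_remote_api` helper in place of the real Vision API call.

```python
import time

import apache_beam as beam
from apache_beam import pvalue


def call_remote_api(batch):
    """Hypothetical stand-in for the real remote call, e.g. batch_annotate_images."""
    raise NotImplementedError


class RemoteInferenceDoFn(beam.DoFn):
    MAX_RETRIES = 5

    def process(self, batch):
        delay = 1.0
        for _ in range(self.MAX_RETRIES):
            try:
                yield pvalue.TaggedOutput("ok", call_remote_api(batch))
                return
            except Exception:  # narrow this to the client's transient errors
                time.sleep(delay)
                delay *= 2  # exponential backoff between retries
        # All retries failed: route the batch to the dead-letter output
        # instead of failing the whole transform.
        yield pvalue.TaggedOutput("failed", batch)


# results.ok carries successful responses; results.failed is the dead-letter
# PCollection to log, alert on, or push to temporary storage:
# results = batches | beam.ParDo(RemoteInferenceDoFn()).with_outputs("ok", "failed")
```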
@@ -277,7 +281,7 @@
" image_requests = [vision.AnnotateImageRequest(image=image, features=[feature]) for image in images]\n",
" batch_image_request = vision.BatchAnnotateImagesRequest(requests=image_requests)\n",
"\n",
" # Send batch request to the remote endpoint.\n",
" # Send the batch request to the remote endpoint.\n",
" responses = self._client.batch_annotate_images(request=batch_image_request).responses\n",
" \n",
" return list(zip(image_urls, responses))\n"
@@ -289,15 +293,15 @@
"id": "lHJuyHhvL0-a"
},
"source": [
"### Batching\n",
"### Manage batching\n",
"\n",
"Before we can chain together the pipeline steps, we need to understand batching.\n",
"When running inference with your model, either in Apache Beam or in an external API, you can batch your input to increase the efficiency of the model execution.\n",
"When using a custom DoFn, as in this example, you need to manage the batching.\n",
"\n",
"To manage the batching in this pipeline, include a `BatchElements` transform to group elements together and form a batch of the desired size.\n",
"\n",
"* If you have a streaming pipeline, consider using [GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/)\n",
"* If you have a streaming pipeline, consider using [GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/),\n",
"because `BatchElements` doesn't batch items across bundles. `GroupIntoBatches` requires choosing a key within which items are batched.\n",
"\n",
"* When batching, make sure that the input batch matches the maximum payload of the external API. \n",
@@ -619,7 +623,7 @@
"id": "7gwn5bF1XaDm"
},
"source": [
"### Metrics\n",
"## Monitor the pipeline\n",
"\n",
"Because monitoring can provide insight into the status and health of the application, consider monitoring and measuring pipeline performance.\n",
"For information about the available tracking metrics, see [RunInference Metrics](https://beam.apache.org/documentation/ml/runinference-metrics/)."