From 2e71061e7b9d2383cbea5531215b68e6ec0236cd Mon Sep 17 00:00:00 2001 From: Rebecca Szper <98840847+rszper@users.noreply.github.com> Date: Thu, 1 Dec 2022 06:13:40 -0800 Subject: [PATCH] ML notebook formatting and text updates (#24437) * merged and resolved the conflict * more copy edits to the ML notebooks * merged and resolved the conflict * more copy edits to the ML notebooks * more copy edits to the ML notebooks * more copy edits to the ML notebooks * trying to remove a section that shouldn't have been added back in * Update examples/notebooks/beam-ml/custom_remote_inference.ipynb Co-authored-by: Danny McCormick * Update examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb Co-authored-by: Danny McCormick * review updates Co-authored-by: Danny McCormick --- .../beam-ml/custom_remote_inference.ipynb | 50 +++++------ .../beam-ml/dataframe_api_preprocessing.ipynb | 82 +++++++++---------- .../beam-ml/run_custom_inference.ipynb | 17 ++-- .../beam-ml/run_inference_multi_model.ipynb | 74 +++++++++-------- .../beam-ml/run_inference_pytorch.ipynb | 32 ++++---- ...inference_pytorch_tensorflow_sklearn.ipynb | 57 ++++++------- .../beam-ml/run_inference_sklearn.ipynb | 30 +++---- .../beam-ml/run_inference_tensorflow.ipynb | 42 ++++++---- 8 files changed, 197 insertions(+), 187 deletions(-) diff --git a/examples/notebooks/beam-ml/custom_remote_inference.ipynb b/examples/notebooks/beam-ml/custom_remote_inference.ipynb index 036a9d39d4ea..ad25849e89ed 100644 --- a/examples/notebooks/beam-ml/custom_remote_inference.ipynb +++ b/examples/notebooks/beam-ml/custom_remote_inference.ipynb @@ -4,6 +4,7 @@ "cell_type": "code", "execution_count": null, "metadata": { + "cellView": "form", "id": "paYiulysGrwR" }, "outputs": [], @@ -36,15 +37,16 @@ "source": [ "# Remote inference in Apache Beam\n", "\n", + "This example demonstrates how to implement a custom inference call in Apache Beam using the Google Cloud Vision API.\n", + "\n", "The prefered way to run inference in Apache 
Beam is by using the [RunInference API](https://beam.apache.org/documentation/sdks/python-machine-learning/). \n", - "The RunInference API enables you to run your models as part of your pipeline in a way that is optimized for machine learning inference. \n", + "The RunInference API enables you to run models as part of your pipeline in a way that is optimized for machine learning inference. \n", "To reduce the number of steps that you need to take, RunInference supports features like batching. For more infomation about the RunInference API, review the [RunInference API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.html#apache_beam.ml.inference.RunInference), \n", "which demonstrates how to implement model inference in PyTorch, scikit-learn, and TensorFlow.\n", "\n", "Currently, the RunInference API doesn't support making remote inference calls using the Natural Language API, Cloud Vision API, and so on. \n", - "Therefore, to use these remote APIs with Apache Beam, you need to write custom inference calls.\n", - "\n", - "This notebook shows how to implement a custom inference call in Apache Beam. This example uses the Google Cloud Vision API." + "Therefore, to use these remote APIs with Apache Beam, you need to write custom inference calls.\n" + ] }, { @@ -53,7 +55,7 @@ "id": "GNbarEZsalS1" }, "source": [ - "## Use case: run the Cloud Vision API\n", + "## Run the Cloud Vision API\n", "\n", "You can use the Cloud Vision API to retrieve labels that describe an image.\n", "For example, the following image shows a lion with possible labels." 
@@ -75,20 +77,20 @@ }, "source": [ "We want to run the Google Cloud Vision API on a large set of images, and Apache Beam is the ideal tool to handle this workflow.\n", - "This example notebook demonstates how to retrieve image labels with this API on a small set of images.\n", + "This example demonstrates how to retrieve image labels with this API on a small set of images.\n", "\n", - "The notebook follows these steps to implement this workflow:\n", + "The example follows these steps to implement this workflow:\n", "* Read the images.\n", "* Batch the images together to optimize the model call.\n", "* Send the images to an external API to run inference.\n", - "* Post-process the results of your API.\n", + "* Postprocess the results of your API.\n", "\n", "**Caution:** Be aware of API quotas and the heavy load you might incur on your external API. Verify that your pipeline and API are configured correctly for your use case.\n", "\n", "To optimize the calls to the external API, limit the parallel calls to the external remote API by configuring [PipelineOptions](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options).\n", "In Apache Beam, different runners provide options to handle the parallelism, for example:\n", - "* With the [Direct Runner](https://beam.apache.org/documentation/runners/direct/), use `direct_num_workers`.\n", - "* With the [Google Cloud Dataflow Runner](https://beam.apache.org/documentation/runners/dataflow/), use `max_num_workers`.\n", + "* With the [Direct Runner](https://beam.apache.org/documentation/runners/direct/), use the `direct_num_workers` pipeline option.\n", + "* With the [Google Cloud Dataflow Runner](https://beam.apache.org/documentation/runners/dataflow/), use the `max_num_workers` pipeline option.\n", "\n", "For information about other runners, see the [Beam capability matrix](https://beam.apache.org/documentation/runners/capability-matrix/) " ] @@ -99,7 +101,7 @@ "id": "FAawWOaiIYaS" }, "source": [ - 
"## Installation\n", + "## Before you begin\n", "\n", "This section provides installation steps." ] @@ -170,9 +172,11 @@ "id": "mL4MaHm_XOVd" }, "source": [ - "## Remote inference on Cloud Vision API\n", + "## Run remote inference on Cloud Vision API\n", + "\n", + "This section demonstates the steps to run remote inference on the Cloud Vision API.\n", "\n", - "This section demonstates the steps to run remote inference on the Cloud Vision API." + "Download and install Apache Beam and the required modules." ] }, { @@ -199,7 +203,7 @@ "id": "09k08IYlLmON" }, "source": [ - "For this example, we use images from the [MSCoco dataset](https://cocodataset.org/#explore) as a list of image urls.\n", + "This example uses images from the [MSCoco dataset](https://cocodataset.org/#explore) as a list of image URLs.\n", "This data is used as the pipeline input." ] }, @@ -234,20 +238,20 @@ "id": "HLy7VKJhLrmT" }, "source": [ - "### Custom DoFn\n", + "### Create a custom DoFn\n", "\n", "In order to implement remote inference, create a DoFn class. This class sends a batch of images to the Cloud vision API.\n", "\n", "The custom DoFn makes it possible to initialize the API. In case of a custom model, a model can also be loaded in the `setup` function. \n", "\n", - "The `process` function is the most interesting part. In this function we implement the model call and return its results.\n", + "The `process` function is the most interesting part. In this function, we implement the model call and return its results.\n", "\n", - "**Caution:** When running remote inference, prepare to encounter, identify, and handle failure as gracefully as possible. We recommend using the following techniques: \n", + "When running remote inference, prepare to encounter, identify, and handle failure as gracefully as possible. We recommend using the following techniques: \n", "\n", "* **Exponential backoff:** Retry failed remote calls with exponentially growing pauses between retries. 
Using exponential backoff ensures that failures don't lead to an overwhelming number of retries in quick succession. \n", "\n", - "* **Dead letter queues:** Route failed inferences to a separate `PCollection` without failing the whole transform. You can continue execution without failing the job (batch jobs' default behavior) or retrying indefinitely (streaming jobs' default behavior).\n", - "You can then run custom pipeline logic on the deadletter queue to log the failure, alert, and push the failed message to temporary storage so that it can eventually be reprocessed. " + "* **Dead-letter queues:** Route failed inferences to a separate `PCollection` without failing the whole transform. You can continue execution without failing the job (batch jobs' default behavior) or retrying indefinitely (streaming jobs' default behavior).\n", + "You can then run custom pipeline logic on the dead-letter queue (unprocessed messages queue) to log the failure, alert, and push the failed message to temporary storage so that it can eventually be reprocessed." 
] }, { @@ -277,7 +281,7 @@ " image_requests = [vision.AnnotateImageRequest(image=image, features=[feature]) for image in images]\n", " batch_image_request = vision.BatchAnnotateImagesRequest(requests=image_requests)\n", "\n", - " # Send batch request to the remote endpoint.\n", + " # Send the batch request to the remote endpoint.\n", " responses = self._client.batch_annotate_images(request=batch_image_request).responses\n", " \n", " return list(zip(image_urls, responses))\n" @@ -289,7 +293,7 @@ "id": "lHJuyHhvL0-a" }, "source": [ - "### Batching\n", + "### Manage batching\n", "\n", "Before we can chain together the pipeline steps, we need to understand batching.\n", "When running inference with your model, either in Apache Beam or in an external API, you can batch your input to increase the efficiency of the model execution.\n", @@ -297,7 +301,7 @@ "\n", "To manage the batching in this pipeline, include a `BatchElements` transform to group elements together and form a batch of the desired size.\n", "\n", - "* If you have a streaming pipeline, consider using [GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/)\n", + "* If you have a streaming pipeline, consider using [GroupIntoBatches](https://beam.apache.org/documentation/transforms/python/aggregation/groupintobatches/),\n", "because `BatchElements` doesn't batch items across bundles. `GroupIntoBatches` requires choosing a key within which items are batched.\n", "\n", "* When batching, make sure that the input batch matches the maximum payload of the external API. 
\n", @@ -619,7 +623,7 @@ "id": "7gwn5bF1XaDm" }, "source": [ - "### Metrics\n", + "## Monitor the pipeline\n", "\n", "Because monitoring can provide insight into the status and health of the application, consider monitoring and measuring pipeline performance.\n", "For information about the available tracking metrics, see [RunInference Metrics](https://beam.apache.org/documentation/ml/runinference-metrics/)." diff --git a/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb b/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb index 645d62d32be3..e45f1bd2d397 100644 --- a/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb +++ b/examples/notebooks/beam-ml/dataframe_api_preprocessing.ipynb @@ -38,29 +38,23 @@ "\n", "For rapid execution, Pandas loads all of the data into memory on a single machine (one node). This configuration works well when dealing with small-scale datasets. However, many projects involve datasets that are too big to fit in memory. These use cases generally require parallel data processing frameworks, such as Apache Beam.\n", "\n", - "\n", - "## Apache Beam DataFrames\n", - "\n", - "\n", - "Beam DataFrames provide a pandas-like\n", + "Beam DataFrames provide a Pandas-like\n", "API to declare and define Beam processing pipelines. 
It provides a familiar interface for machine learning practioners to build complex data-processing pipelines by only invoking standard pandas commands.\n", "\n", "To learn more about Apache Beam DataFrames, see the\n", "[Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) page.\n", "\n", - "## Goal\n", - "The goal of this notebook is to explore a dataset preprocessed with the Beam DataFrame API for machine learning model training.\n", + "## Overview\n", + "The goal of this example is to explore a dataset preprocessed with the Beam DataFrame API for machine learning model training.\n", "\n", - "\n", - "## Tutorial outline\n", - "\n", - "This notebook demonstrates the use of the Apache Beam DataFrames API to perform common data exploration as well as the preprocessing steps that are necessary to prepare your dataset for machine learning model training and inference. These steps include the following: \n", + "This example demonstrates the use of the Apache Beam DataFrames API to perform common data exploration as well as the preprocessing steps that are necessary to prepare your dataset for machine learning model training and inference. This example includes the following steps: \n", "\n", "* Removing unwanted columns.\n", "* One-hot encoding categorical columns.\n", "* Normalizing numerical columns.\n", "\n", - "\n" + "In this example, the first section demonstrates how to build and execute a pipeline locally using the interactive runner.\n", + "The second section uses a distributed runner to demonstrate how to run the pipeline on the full dataset.\n" ], "metadata": { "id": "iFZC1inKuUCy" @@ -69,9 +63,9 @@ { "cell_type": "markdown", "source": [ - "## Installation\n", + "## Install Apache Beam\n", "\n", - "To explore the elements within a `PCollection`, install Apache Beam with the `interactive` component to use the Interactive runner. 
The latest implemented DataFrames API methods invoked in this notebook are available in Apache Beam SDK versions 2.43 and later.\n" + "To explore the elements within a `PCollection`, install Apache Beam with the `interactive` component to use the Interactive runner. The DataFrames API methods invoked in this example are available in Apache Beam SDK versions 2.43 and later.\n" ], "metadata": { "id": "A0f2HJ22D4lt" @@ -105,8 +99,8 @@ { "cell_type": "markdown", "source": [ - "## Part I : Local exploration with the Interactive Beam runner\n", - "Start by using the [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) to explore and develop your pipeline.\n", + "## Local exploration with the Interactive Beam runner\n", + "Use the [Interactive Beam](https://beam.apache.org/releases/pydoc/2.20.0/apache_beam.runners.interactive.interactive_beam.html) runner to explore and develop your pipeline.\n", "This runner allows you to test the code interactively, progressively building out the pipeline before deploying it on a distributed runner. \n", "\n", "\n", @@ -124,12 +118,12 @@ "source": [ "### Load the data\n", "\n", - "To read CSV files into Dataframes, Pandas has the\n", + "To read CSV files into DataFrames, Pandas has the\n", "[`pandas.read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)\n", "function.\n", "This notebook uses the Beam\n", "[`beam.dataframe.io.read_csv`](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html#apache_beam.dataframe.io.read_csv)\n", - "function, which emulates `pandas.read_csv`. The main difference is that the Beam function returns a deferred Beam DataFrame whereas the Pandas function returns a standard DataFrame.\n" + "function, which emulates `pandas.read_csv`. 
The main difference is that the Beam function returns a deferred Beam DataFrame, whereas the Pandas function returns a standard DataFrame.\n" ] }, { @@ -170,8 +164,8 @@ "### Preprocess the data\n", "\n", "This example uses the [NASA - Nearest Earth Objects dataset](https://cneos.jpl.nasa.gov/ca/).\n", - "This dataset includes information about objects in the outer space. Some objects are close enough to Earth to cause harm.\n", - "Therefore, this dataset compiles the list of NASA certified asteroids that are classified as the nearest earth objects to understand which objects pose a risk." + "This dataset includes information about objects in outer space. Some objects are close enough to Earth to cause harm.\n", + "This dataset compiles the list of NASA certified asteroids that are classified as the nearest earth objects to understand which objects pose a risk." ] }, { @@ -673,7 +667,7 @@ { "cell_type": "markdown", "source": [ - "Use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns like percentile, mean, std, and so on. " + "Use the standard pandas command `DataFrame.describe()` to generate descriptive statistics for the numerical columns, such as percentile, mean, std, and so on. " ], "metadata": { "id": "MGAErO0lAYws" @@ -1006,16 +1000,16 @@ "source": [ "Before running any transformations, verify that all of the columns need to be used for model training. 
Start by looking at the column description provided by the [JPL website](https://ssd.jpl.nasa.gov/sbdb_query.cgi):\n", "\n", - "* **spk_id:** Object primary SPK-ID\n", - "* **full_name:** Asteroid name\n", - "* **near_earth_object:** Near-earth object flag\n", + "* **spk_id:** Object primary SPK-ID.\n", + "* **full_name:** Asteroid name.\n", + "* **near_earth_object:** Near-earth object flag.\n", "* **absolute_magnitude:** The apparent magnitude an object would have if it were located at a distance of 10 parsecs.\n", "* **diameter:** Object diameter (from equivalent sphere) km unit.\n", - "* **albedo:** A measure of the diffuse reflection of solar radiation out of the total solar radiation and measured on a scale from 0 to 1.\n", + "* **albedo:** A measure of the diffuse reflection of solar radiation out of the total solar radiation, measured on a scale from 0 to 1.\n", "* **diameter_sigma:** 1-sigma uncertainty in object diameter km unit.\n", - "* **eccentricity:** A value between 0 and 1 that refers to how flat or round the asteroid is \n", - "* **inclination:** The angle with respect to the x-y ecliptic plane\n", - "* **moid_ld:** Earth Minimum Orbit Intersection Distance au unit\n", + "* **eccentricity:** A value between 0 and 1 that refers to how flat or round the asteroid is.\n", + "* **inclination:** The angle with respect to the x-y ecliptic plane.\n", + "* **moid_ld:** Earth Minimum Orbit Intersection Distance au unit.\n", "* **object_class:** The classification of the asteroid. For a more detailed description, see [NASA object classifications](https://pdssbn.astro.umd.edu/data_other/objclass.shtml).\n", "* **Semi-major axis au Unit:** The length of half of the long axis in AU unit.\n", "* **hazardous_flag:** Identifies hazardous asteroids." @@ -1027,7 +1021,7 @@ "id": "DzYVKbwTp72d" }, "source": [ - "The **'spk_id'** and **'full_name'** columns are unique for each row. You can remove these columns, because they are not needed for model training." 
+ "The **spk_id** and **full_name** columns are unique for each row. You can remove these columns, because they are not needed for model training." ] }, { @@ -1153,7 +1147,7 @@ "id": "00MRdFGLwQiD" }, "source": [ - "Most of the columns do not have missing values. However, the columns **'diameter'**, **'albedo'** and **'diameter_sigma'** have many missing values. Because these values cannot be measured or derived and aren't needed for training the ML model, remove the columns." + "Most of the columns do not have missing values. However, the columns **diameter**, **albedo**, and **diameter_sigma** have many missing values. Because these values cannot be measured or derived and aren't needed for training the ML model, remove the columns." ] }, { @@ -1511,7 +1505,7 @@ "id": "a3PojL3WBqgE" }, "source": [ - "Next, normalize the numerical columns so that they can be used to train a model. To standarize the data, you can subtract the mean and divide by the standard deviation. This process is also known as finding the [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score).\n", + "Normalize the numerical columns so that they can be used to train a model. To standardize the data, you can subtract the mean and divide by the standard deviation. 
This process is also known as finding the [z-score](https://developers.google.com/machine-learning/data-prep/transform/normalization#z-score).\n", "This step improves the performance and training stability of the model during training and inference.\n" ] }, @@ -1859,7 +1853,7 @@ "id": "qdNILsajFvex" }, "source": [ - "Convert the categorical columns into one-hot encoded variables to use them during training.\n" + "Next, convert the categorical columns into one-hot encoded variables to use during training.\n" ] }, { @@ -2596,7 +2590,7 @@ "\n", "This section combines the previous steps into a full pipeline implementation, and then visualizes the preprocessed data.\n", "\n", - "Note that the only standard Apache Beam method invoked here is the `pipeline` instance. The rest of the preprocessing commands are based on native Pandas methods that are integrated with the Apache Beam DataFrame API." + "Note that the only standard Apache Beam method invoked here is the `pipeline` instance. The rest of the preprocessing commands are based on native pandas methods that are integrated with the Apache Beam DataFrame API." ] }, { @@ -3339,7 +3333,7 @@ "id": "xZvJTqa3XKI_" }, "source": [ - "## Part II : Process the full dataset with the distributed runner\n", + "## Process the full dataset with the distributed runner\n", "The previous section demonstrates how to build and execute the pipeline locally using the interactive runner.\n", "This section demonstrates how to run the pipeline on the full dataset by switching to a distributed runner. For this example, the pipeline runs on [Dataflow](https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline)." ] @@ -3361,7 +3355,7 @@ { "cell_type": "markdown", "source": [ - "These steps process the full dataset, `full.csv`, which contains approximately one million rows. 
To materialize the deferred dataframe, these steps also write the results to a CSV file instead of using `ib.collect()`.\n", + "These steps process the full dataset, `full.csv`, which contains approximately one million rows. To materialize the deferred DataFrame, these steps also write the results to a CSV file instead of using `ib.collect()`.\n", "\n", "To switch from an interactive runner to a distributed runner, update the pipeline options. The rest of the pipeline steps don't change." ], @@ -3450,12 +3444,10 @@ "\n", "This tutorial demonstrated how to analyze and preprocess a large-scale dataset with the Apache Beam DataFrames API. You can now train a model on a classification task using the preprocessed dataset.\n", "\n", - "To learn more about how to get started with classifying structured data, see:\n", - "\n", - "* [Structred data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/)\n", + "To learn more about how to get started with classifying structured data, see \n", + "[Structured data classification from scratch](https://keras.io/examples/structured_data/structured_data_classification_from_scratch/).\n", "\n", - "To continue learning, find another dataset to use with the Apache Beam DataFrames API processing. Think carefully about which features to include in your model and how to represent them.\n", - "\n" + "To continue learning, find another dataset to use with the Apache Beam DataFrames API processing. 
Think carefully about which features to include in your model and how to represent them.\n" ], "metadata": { "id": "UOLr6YgOOSVQ" @@ -3466,11 +3458,11 @@ "source": [ "## Resources\n", "\n", - "* [Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) -- An overview of the Apache Beam DataFrames API.\n", - "* [Differences from pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas) -- Reviews the differences between Apache Beam DataFrames and Pandas DataFrames, as well as some of the workarounds for unsupported operations.\n", - "* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) -- A quickstart guide to the Pandas DataFrames.\n", - "* [Pandas DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) -- The API reference for the Pandas DataFrames.\n", - "* [Data preparation and feature training in ML](https://developers.google.com/machine-learning/data-prep) -- A guideline about data transformation for ML training." + "* [Beam DataFrames overview](https://beam.apache.org/documentation/dsls/dataframes/overview) - An overview of the Apache Beam DataFrames API.\n", + "* [Differences from pandas](https://beam.apache.org/documentation/dsls/dataframes/differences-from-pandas) - Reviews the differences between Apache Beam DataFrames and Pandas DataFrames, as well as some of the workarounds for unsupported operations.\n", + "* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) - A quickstart guide to the Pandas DataFrames.\n", + "* [Pandas DataFrame API](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) - The API reference for the Pandas DataFrames.\n", + "* [Data preparation and feature training in ML](https://developers.google.com/machine-learning/data-prep) - A guideline about data transformation for ML training." 
], "metadata": { "id": "nG9WXXVcMCe_" diff --git a/examples/notebooks/beam-ml/run_custom_inference.ipynb b/examples/notebooks/beam-ml/run_custom_inference.ipynb index 9d57bf9f475f..c45405204d22 100644 --- a/examples/notebooks/beam-ml/run_custom_inference.ipynb +++ b/examples/notebooks/beam-ml/run_custom_inference.ipynb @@ -5,6 +5,7 @@ "execution_count": 1, "id": "C1rAsD2L-hSO", "metadata": { + "cellView": "form", "id": "C1rAsD2L-hSO" }, "outputs": [], @@ -41,9 +42,10 @@ "This notebook demonstrates how to run inference on your custom framework using the\n", "[ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class.\n", "\n", - "Named-Entity Recognition (NER) is one of the most common tasks for natural language processing (NLP). \n", - "NLP locates and named entities in unstructured text and classifies the entities using pre-defined labels, such as person name, organization, date, and so on.\n", - "This example illustrates how to use the popular `spaCy` package to load an ML model and perform inference in an Apache Beam pipeline using the RunInference `PTransform`.\n", + "Named-entity recognition (NER) is one of the most common tasks for natural language processing (NLP). \n", + "NER locates named entities in unstructured text and classifies the entities using pre-defined labels, such as person name, organization, date, and so on.\n", + "\n", + "This example illustrates how to use the popular `spaCy` package to load a machine learning (ML) model and perform inference in an Apache Beam pipeline using the RunInference `PTransform`.\n", "For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation." 
] }, @@ -58,7 +60,7 @@ "\n", "The RunInference library is available in Apache Beam versions 2.40 and later.\n", "\n", - "For this example, you need to install `spaCy` and `pandas`. A small NER model (`en_core_web_sm`) is also installed, but you can use any valid `spaCy` model." + "For this example, you need to install `spaCy` and `pandas`. A small NER model, `en_core_web_sm`, is also installed, but you can use any valid `spaCy` model." ] }, { @@ -84,7 +86,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Learn more about `spaCy`\n", + "## Learn about `spaCy`\n", "\n", "To learn more about `spaCy`, create a `spaCy` language object in memory using `spaCy`'s trained models.\n", "You can install these models as Python packages.\n", @@ -242,9 +244,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## Create a`ModelHandler` to use `spaCy` for inference\n", + "## Create a model handler\n", "\n", - "This section demonstrates how to create your own `ModelHandler`." + "This section demonstrates how to create your own `ModelHandler` so that you can use `spaCy` for inference." 
] }, { @@ -420,7 +422,6 @@ " | \"CreateSentences\" >> beam.Create(text_strings_with_keys)\n", " | \"RunInferenceSpacy\" >> RunInference(keyed_spacy_model_handler)\n", " # Generate a schema suitable for conversion to a dataframe using Map to Row objects.\n", - " # to a dataframe.\n", " | 'ToRows' >> beam.Map(lambda row: beam.Row(key=row[0], text=row[1][0], predictions=row[1][1]))\n", " )" ] diff --git a/examples/notebooks/beam-ml/run_inference_multi_model.ipynb b/examples/notebooks/beam-ml/run_inference_multi_model.ipynb index a1e52b235461..cabe60e7a3a7 100644 --- a/examples/notebooks/beam-ml/run_inference_multi_model.ipynb +++ b/examples/notebooks/beam-ml/run_inference_multi_model.ipynb @@ -71,7 +71,7 @@ { "cell_type": "markdown", "source": [ - "## Use case: Image captioning with cascade models " + "## Image captioning with cascade models" ], "metadata": { "id": "i1uyzlj3s3e_" @@ -80,12 +80,12 @@ { "cell_type": "markdown", "source": [ - "Image captioning has various applications, such as image indexing for information retreival, virtual assistant training, and various natural language processing applications.\n", + "Image captioning has various applications, such as image indexing for information retrieval, virtual assistant training, and natural language processing.\n", "\n", "This example shows how to generate captions on a a large set of images. Apache Beam is the ideal tool to handle this workflow. We use two models for this task:\n", "\n", - "* [BLIP](https://github.com/salesforce/BLIP): Used to generate a set of candidate captions for a given image. \n", - "* [CLIP](https://github.com/openai/CLIP): Used to rank the generated captions based on accuracy." + "* [BLIP](https://github.com/salesforce/BLIP): Generates a set of candidate captions for a given image. \n", + "* [CLIP](https://github.com/openai/CLIP): Ranks the generated captions based on accuracy." 
], "metadata": { "id": "cP1sBhNacS8b" @@ -106,14 +106,14 @@ "The steps to build this pipeline are as follows:\n", "* Read the images.\n", "* Preprocess the images for caption generation for inference with the BLIP model.\n", - "* Inference with BLIP to generate a list of caption candidates.\n", + "* Run inference with BLIP to generate a list of caption candidates.\n", "* Aggregate the generated captions with their source image.\n", - "* Preprocess the aggregated image-caption pair to rank them with CLIP.\n", - "* Inference with CLIP to generate the caption ranking. \n", + "* Preprocess the aggregated image-caption pairs to rank them with CLIP.\n", + "* Run inference with CLIP to generate the caption ranking. \n", "* Print the image names and the captions sorted according to their ranking.\n", "\n", "\n", - "The following diagram illustrates the steps in the inference pipelines used in this notebook:" + "The following diagram illustrates the steps in the inference pipelines used in this notebook." ], "metadata": { "id": "lBPfy-bYgLuD" @@ -284,7 +284,7 @@ { "cell_type": "markdown", "source": [ - "### CLIP\n", + "### Install CLIP dependencies\n", "\n", "Download and install the CLIP dependencies." ], @@ -343,7 +343,7 @@ { "cell_type": "markdown", "source": [ - "### BLIP\n", + "### Install BLIP dependencies\n", "\n", "Download and install the BLIP dependencies." ], @@ -417,7 +417,7 @@ { "cell_type": "markdown", "source": [ - "### I/O helper functions\n", + "### Install I/O helper functions\n", "\n", "Download and install the dependencies for the I/O helper functions." 
], @@ -430,7 +430,7 @@ "source": [ "class ReadImagesFromUrl(beam.DoFn):\n", " \"\"\"\n", - " Read an image from a given url and return a tuple of the images_url\n", + " Read an image from a given URL and return a tuple of the images_url\n", " and image data.\n", " \"\"\"\n", " def process(self, element: str) -> Tuple[str, Image.Image]:\n", @@ -441,7 +441,7 @@ "\n", "class FormatCaptions(beam.DoFn):\n", " \"\"\"\n", - " Print the image name and it's most relevant captions after CLIP ranking.\n", + " Print the image name and its most relevant captions after CLIP ranking.\n", " \"\"\"\n", " def __init__(self, number_of_top_captions: int):\n", " self._number_of_top_captions = number_of_top_captions\n", @@ -474,10 +474,10 @@ { "cell_type": "markdown", "source": [ - "Define the preprocessing and postprocessing function for each of the models.\n", + "Define the preprocessing and postprocessing functions for each of the models.\n", "\n", "To prepare the instance for processing bundles of elements by initializing and to cache the processing transform resources, use `DoFn.setup()`.\n", - "This step avoids unnecessary re-initializations on every invocation to the processing method." + "This step avoids unnecessary re-initializations on every invocation of the processing method." ], "metadata": { "id": "wEViP715fes4" @@ -486,8 +486,8 @@ { "cell_type": "markdown", "source": [ - "### BLIP\n", - "Define the preprocessing and postprocessing function for BLIP." + "### Define BLIP functions\n", + "Define the preprocessing and postprocessing functions for BLIP." ], "metadata": { "id": "X1UGv6bbyNxY" @@ -499,7 +499,7 @@ "class PreprocessBLIPInput(beam.DoFn):\n", "\n", " \"\"\"\n", - " Process the raw image input to a format suitable for BLIP Inference. The processed\n", + " Process the raw image input to a format suitable for BLIP inference. The processed\n", " images are duplicated to the number of desired captions per image. 
\n", "\n", " Preprocessing transformation taken from: \n", @@ -520,7 +520,7 @@ "\n", " def process(self, element):\n", " image_url, image = element \n", - " # Update this step when this ticket is resolved: https://github.com/apache/beam/issues/21863\n", + " # The following lines provide a workaround to turn off BatchElements.\n", " preprocessed_img = self._transform(image).unsqueeze(0)\n", " preprocessed_img = preprocessed_img.repeat(self._captions_per_image, 1, 1, 1)\n", " # Parse the processed input to a dictionary to a format suitable for RunInference.\n", @@ -546,9 +546,9 @@ { "cell_type": "markdown", "source": [ - "### CLIP \n", + "### Define CLIP functions \n", "\n", - "Define the preprocessing and postprocessing function for CLIP." + "Define the preprocessing and postprocessing functions for CLIP." ], "metadata": { "id": "EZHfa1KzWWDI" @@ -642,8 +642,12 @@ { "cell_type": "markdown", "source": [ - "Note that we use a `KeyedModelHandler` for both models to attach a key to the general `ModelHandler`.\n", - "The key is used to keep a reference to the image that the inference is associated with and is used in the postprocessing steps.\n", + "Use a `KeyedModelHandler` for both models to attach a key to the general `ModelHandler`.\n", + "The key is used for the following purposes:\n", + "* To keep a reference to the image that the inference is associated with.\n", + "* To aggregate transforms of different inputs.\n", + "* To run postprocessing steps correctly.\n", + "\n", "In this example, we use the `image_url` as the key." 
], "metadata": { @@ -655,13 +659,13 @@ "source": [ "class PytorchNoBatchModelHandlerKeyedTensor(PytorchModelHandlerKeyedTensor):\n", " \"\"\"Wrapper to PytorchModelHandler to limit batch size to 1.\n", - " The caption strings generated from BLIP tokenizer may have different\n", - " lengths, which doesn't work with torch.stack() in current RunInference\n", - " implementation since stack() requires tensors to be the same size.\n", + " The caption strings generated from the BLIP tokenizer might have different\n", + " lengths. Different length strings don't work with torch.stack() in the current RunInference\n", + " implementation, because stack() requires tensors to be the same size.\n", " Restricting max_batch_size to 1 means there is only 1 example per `batch`\n", " in the run_inference() call.\n", " \"\"\"\n", - " # Update this step when this ticket is resolved: https://github.com/apache/beam/issues/21863\n", + " # The following lines provide a workaround to turn off BatchElements.\n", " def batch_elements_kwargs(self):\n", " return {'max_batch_size': 1}" ], @@ -683,7 +687,7 @@ { "cell_type": "markdown", "source": [ - "## BLIP\n", + "## Generate captions with BLIP\n", "\n", "Use BLIP to generate a set of candidate captions for a given image." 
], @@ -711,7 +715,7 @@ "source": [ "class BLIPWrapper(torch.nn.Module):\n", " \"\"\"\n", - " Wrapper around the BLIP model to overwrite the default \"forward\" method with the \"generate\" since BLIP uses the \n", + " Wrapper around the BLIP model to overwrite the default \"forward\" method with the \"generate\" method, because BLIP uses the \n", " \"generate\" method to produce the image captions.\n", " \"\"\"\n", " \n", @@ -725,7 +729,7 @@ "\n", " def forward(self, inputs: torch.Tensor):\n", " # Squeeze because RunInference adds an extra dimension, which is empty.\n", - " # Update this step when this ticket is resolved: https://github.com/apache/beam/issues/21863\n", + " # The following lines provide a workaround to turn off BatchElements.\n", " inputs = inputs.squeeze(0)\n", " captions = self._model.generate(inputs,\n", " sample=True,\n", @@ -756,7 +760,7 @@ { "cell_type": "markdown", "source": [ - "## CLIP\n", + "## Rank captions with CLIP\n", "\n", "Use CLIP to rank the generated captions based on the accuracy with which they represent the image." ], @@ -771,7 +775,7 @@ "\n", " def forward(self, **kwargs: Dict[str, torch.Tensor]):\n", " # Squeeze because RunInference adds an extra dimension, which is empty.\n", - " # Update this step when this ticket is resolved: https://github.com/apache/beam/issues/21863.\n", + " # The following lines provide a workaround to turn off BatchElements.\n", " kwargs = {key: tensor.squeeze(0) for key, tensor in kwargs.items()}\n", " output = super().forward(**kwargs)\n", " logits = output.logits_per_image\n", @@ -888,7 +892,7 @@ { "cell_type": "markdown", "source": [ - "## Initialize pipeline run parameters\n", + "## Initialize the pipeline run parameters\n", "\n", "Specify the number of captions generated per image and the number of captions to display with each image." 
], @@ -914,7 +918,7 @@ { "cell_type": "markdown", "source": [ - "## Run pipeline" + "## Run the pipeline" ], "metadata": { "id": "5T9Pcdp7oNb8" @@ -923,7 +927,7 @@ { "cell_type": "markdown", "source": [ - "This example uses raw images from the `read_images` pipeline as inputs for both models, because each model needs to preprocess the raw images differently. They require a different embedding representation for image captioning and image-captions pair ranking.\n", + "This example uses raw images from the `read_images` pipeline as inputs for both models. Each model needs to preprocess the raw images differently, because they require a different embedding representation for image captioning and for image-captions pair ranking.\n", "\n", "To aggregate the raw images with the generated caption by their key (the image URL), this example uses `CoGroupByKey`. This process produces a tuple of image-captions pairs that is then passed to the CLIP transform and used for ranking." ], diff --git a/examples/notebooks/beam-ml/run_inference_pytorch.ipynb b/examples/notebooks/beam-ml/run_inference_pytorch.ipynb index 3afc6bad9890..d0a350982f4e 100644 --- a/examples/notebooks/beam-ml/run_inference_pytorch.ipynb +++ b/examples/notebooks/beam-ml/run_inference_pytorch.ipynb @@ -54,7 +54,7 @@ "This notebook demonstrates the use of the RunInference transform for PyTorch. Apache Beam includes implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class for [users of PyTorch](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.pytorch_inference.html). 
For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n", "\n", "\n", - "This notebook illustrates common RunInference patterns,such as:\n", + "This notebook illustrates common RunInference patterns, such as:\n", "* Using a database with RunInference.\n", "* Postprocessing results after using RunInference.\n", "* Inference with multiple models in the same pipeline.\n", @@ -71,7 +71,7 @@ "source": [ "## Dependencies\n", "\n", - "The RunInference library is available in Apache Beam versions 2.40 and later.\n", + "The RunInference library is available in Apache Beam versions 2.40 and later.\n", "\n", "To use Pytorch RunInference API, you need to install the PyTorch module. To install PyTorch, use `pip`:" ] @@ -235,7 +235,7 @@ }, "source": [ "### Train the linear regression mode on 5 times data\n", - "Use the following to train your linear regression model on the 5 times table." + "Use the following code to train your linear regression model on the 5 times table." ] }, { @@ -270,7 +270,7 @@ "id": "bd106b29-6187-42c1-9743-1666c147b5e3" }, "source": [ - "Save the model using `torch.save()` and then confirm that the saved model file exists." + "Save the model using `torch.save()`, and then confirm that the saved model file exists." ] }, { @@ -304,6 +304,7 @@ }, "source": [ "### Prepare train and test data for a 10 times model\n", + "This example model is a 10 times table.\n", "* `x` contains values in the range from 0 to 99.\n", "* `y` is a list of 10 * `x`. " ] @@ -404,7 +405,7 @@ "source": [ "### Use RunInference within the pipeline\n", "\n", - "1. Create a PyTorch model handler object by passing required arguments such as `state_dict_path`, `model_class`, `model_params` to the `PytorchModelHandlerTensor` class.\n", + "1. 
Create a PyTorch model handler object by passing required arguments such as `state_dict_path`, `model_class`, and `model_params` to the `PytorchModelHandlerTensor` class.\n", "2. Pass the `PytorchModelHandlerTensor` object to the RunInference transform to perform predictions on unkeyed data." ] }, @@ -455,8 +456,8 @@ "id": "9d95e69b-203f-4abb-9abb-360bdf4d769a" }, "source": [ - "## Pattern 2: Post-process RunInference results.\n", - "This pattern demonstrates how to post-process the RunInference results.\n", + "## Pattern 2: Postprocess RunInference results\n", + "This pattern demonstrates how to postprocess the RunInference results.\n", "\n", "Add a `PredictionProcessor` to the pipeline after `RunInference`. `PredictionProcessor` processes the output of the `RunInference` transform." ] @@ -529,11 +530,11 @@ "\n", "Modify the pipeline to read from sources like CSV files and BigQuery.\n", "\n", - "In this step we do the following:\n", + "In this step, you take the following actions:\n", "\n", "* To handle keyed data, wrap the `PytorchModelHandlerTensor` object around `KeyedModelHandler`.\n", "* Add a map transform that converts a table row into `Tuple[str, float]`.\n", - "* Add a map transform that converts `Tuple[str, float]` from to `Tuple[str, torch.Tensor]`.\n", + "* Add a map transform that converts `Tuple[str, float]` to `Tuple[str, torch.Tensor]`.\n", "* Modify the post-inference processor to output results with the key." ] }, @@ -564,7 +565,8 @@ "id": "f22da313-5bf8-4334-865b-bbfafc374e63" }, "source": [ - "### Create a source with attached key\n" + "### Create a source with attached key\n", + "This section shows how to create either a BigQuery or a CSV source with an attached key." ] }, { @@ -573,7 +575,8 @@ "id": "c9b0fb49-d605-4f26-931a-57f42b0ad253" }, "source": [ - "#### Use BigQuery as the source" + "#### Use BigQuery as the source\n", + "Follow these steps to use BigQuery as your source."
] }, { @@ -741,7 +744,8 @@ "id": "53ee7f24-5625-475a-b8cc-9c031591f304" }, "source": [ - "#### Use a CSV file as the source" + "#### Use a CSV file as the source\n", + "Follow these steps to use a CSV file as your source." ] }, { @@ -826,7 +830,7 @@ "## Pattern 4: Inference with multiple models in the same pipeline\n", "This pattern demonstrates how use inference with multiple models in the same pipeline.\n", "\n", - "### Inference with multiple models in parallel\n", + "### Multiple models in parallel\n", "This section demonstrates how use inference with multiple models in parallel." ] }, @@ -926,7 +930,7 @@ "id": "e71e6706-5d8d-4322-9def-ac7fb20d4a50" }, "source": [ - "### Inference with multiple models in sequence\n", + "### Multiple models in sequence\n", "This section demonstrates how use inference with multiple models in sequence.\n", "\n", "In a sequential pattern, data is sent to one or more models in sequence, \n", diff --git a/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb b/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb index 3dac52f9d7a6..60f79d63a5bb 100644 --- a/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb +++ b/examples/notebooks/beam-ml/run_inference_pytorch_tensorflow_sklearn.ipynb @@ -17,11 +17,6 @@ "cells": [ { "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "LzOTNrs_P6Vv" - }, - "outputs": [], "source": [ "# @title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the \"License\")\n", "\n", @@ -41,16 +36,13 @@ "# KIND, either express or implied.
See the License for the\n", "# specific language governing permissions and limitations\n", "# under the License" - ] - }, - { - "cell_type": "markdown", + ], "metadata": { + "cellView": "form", "id": "faayYQYrQzY3" - }, - "source": [ - "## Use RunInference in Apache Beam" - ] + }, + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", @@ -58,8 +50,9 @@ "id": "JjAt1GesQ9sg" }, "source": [ - "Starting with Apache Beam 2.40.0, you can use Apache Beam with the RunInference API to use machine learning (ML) models for local and remote inference with batch and streaming pipelines.\n", - "The RunInference API leverages Apache Beam concepts, such as the BatchElements transform and the Shared class, to support models in your pipelines that create transforms optimized for machine learning inferences.\n", + "# Use RunInference in Apache Beam\n", + "You can use Apache Beam versions 2.40.0 and later with the [RunInference API](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) for local and remote inference with batch and streaming pipelines.\n", + "The RunInference API leverages Apache Beam concepts, such as the `BatchElements` transform and the `Shared` class, to support models in your pipelines that create transforms optimized for machine learning inference.\n", "\n", "For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation." ] @@ -70,13 +63,13 @@ "id": "A8xNRyZMW1yK" }, "source": [ - "This notebook demonstrates how to use the RunInference API with three popular ML frameworks: PyTorch, TensorFlow, and scikit-learn. The three pipelines use a text classification model for generating predictions.\n", + "This example demonstrates how to use the RunInference API with three popular ML frameworks: PyTorch, TensorFlow, and scikit-learn. 
The three pipelines use a text classification model for generating predictions.\n", "\n", "Follow these steps to build a pipeline:\n", "* Read the images.\n", "* If needed, preprocess the text.\n", - "* Inference with the PyTorch, TensorFlow, or Scikit-learn model.\n", - "* If needed, postprocess the output from RunInference." + "* Run inference with the PyTorch, TensorFlow, or Scikit-learn model.\n", + "* If needed, postprocess the output." ] }, { @@ -126,9 +119,9 @@ "id": "ObRPUrlEbjHj" }, "source": [ - "### Model\n", + "### Install the model\n", "\n", - "This example uses a pretrained text classification model, [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you). This model is a checkpoint of DistilBERT-base-uncased, fine-tuned on the SST-2 dataset.\n" + "This example uses a pretrained text classification model, [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english?text=I+like+you.+I+love+you). This model is a checkpoint of `DistilBERT-base-uncased`, fine-tuned on the SST-2 dataset.\n" ] }, { @@ -165,7 +158,7 @@ "id": "vA1UmbFRb5C-" }, "source": [ - "### Helper functions\n", + "### Install helper functions\n", "\n", "The model also uses helper functions." ] @@ -231,9 +224,9 @@ "id": "WYYbQTMWctkW" }, "source": [ - "### RunInference pipeline\n", + "### Run the pipeline\n", "\n", - "This section demonstrates how to use create and run the RunInference pipeline." + "This section demonstrates how to create and run the RunInference pipeline." ] }, { @@ -797,7 +790,7 @@ "id": "h2JP7zsqerCT" }, "source": [ - "### Model" + "### Install the model" ] }, { @@ -827,7 +820,7 @@ "id": "GZ-Ioc8ZfyIT" }, "source": [ - "### Helper functions\n", + "### Install helper functions\n", "\n", "The model also uses helper functions." 
] @@ -874,7 +867,7 @@ "id": "PZVwI4BbgaAI" }, "source": [ - "### Prepare the Input\n", + "### Prepare the input\n", "\n", "This section demonstrates how to prepare the input for your model." ] @@ -921,9 +914,9 @@ "id": "BYkQl_l8gRgo" }, "source": [ - "### RunInference Pipeline\n", + "### Run the pipeline\n", "\n", - "This section demonstrates how to use create and run the RunInference pipeline." + "This section demonstrates how to create and run the RunInference pipeline." ] }, { @@ -991,7 +984,7 @@ "id": "6ArL_55kjxkO" }, "source": [ - "### Install Dependencies\n", + "### Install dependencies\n", "\n", "First, download and install the dependencies." ] @@ -1030,7 +1023,7 @@ "id": "-7ABKlZvkFHy" }, "source": [ - "### Model\n", + "### Install the model\n", "\n", "To classify movie reviews as either positive or negative, train and save a sentiment analysis pipeline about movie reviews." ] @@ -1059,9 +1052,9 @@ "id": "KL4Cx8s0mBqn" }, "source": [ - "### RunInference Pipeline\n", + "### Run the pipeline\n", "\n", - "This section demonstrates how to use create and run the RunInference pipeline." + "This section demonstrates how to create and run the RunInference pipeline." 
] }, { diff --git a/examples/notebooks/beam-ml/run_inference_sklearn.ipynb b/examples/notebooks/beam-ml/run_inference_sklearn.ipynb index 9afcccc30f60..c9e151750a34 100644 --- a/examples/notebooks/beam-ml/run_inference_sklearn.ipynb +++ b/examples/notebooks/beam-ml/run_inference_sklearn.ipynb @@ -51,21 +51,21 @@ }, "source": [ "# Apache Beam RunInference for scikit-learn\n", - "This notebook demonstrates the use of the RunInference transform for [scikit-learn](https://scikit-learn.org/) also called sklearn.\n", + "This notebook demonstrates the use of the RunInference transform for [scikit-learn](https://scikit-learn.org/), also called sklearn.\n", "Apache Beam [RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) has implementations of the [ModelHandler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.ModelHandler) class prebuilt for scikit-learn. 
For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n", "\n", - "Users can choose a model handler for their input data type:\n", - "* The [numpy model handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n", - "* The [pandas dataframes model handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n", + "You can choose the appropriate model handler based on your input data type:\n", + "* [NumPy model handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerNumpy)\n", + "* [Pandas DataFrame model handler](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.sklearn_inference.html#apache_beam.ml.inference.sklearn_inference.SklearnModelHandlerPandas)\n", "\n", - "With RunInference, these ModelHandlers manage batching, vectorization, and prediction optimization for your scikit-learn pipeline or model.\n", + "With RunInference, these model handlers manage batching, vectorization, and prediction optimization for your scikit-learn pipeline or model.\n", "\n", "This notebook demonstrates the following common RunInference patterns:\n", "* Generate predictions.\n", "* Postprocess results after RunInference.\n", - "* Inference with multiple models in the same pipeline.\n", + "* Run inference with multiple models in the same pipeline.\n", "\n", - "The linear regression models used in these samples are trained on data that correspondes to the 5 and 10 times table; that is,`y = 5x` and `y = 10x` respectively."
+ "The linear regression models used in these samples are trained on data that corresponds to the 5 and 10 times tables; that is, `y = 5x` and `y = 10x` respectively." ] }, { @@ -75,7 +75,7 @@ "Complete the following setup steps:\n", "1. Install dependencies for Apache Beam.\n", "1. Authenticate with Google Cloud.\n", - "1. Specify your project and bucket. You need the project and bucket to save and load models." + "1. Specify your project and bucket. You use the project and bucket to save and load models." ], "metadata": { "id": "zzwnMzzgdyPB" @@ -176,7 +176,7 @@ "2. Train the linear regression model.\n", "3. Save the scikit-learn model using `pickle`.\n", "\n", - "In this example, we create two models, one with the 5 times model and a section with the 10 times model." + "In this example, you create two models, one for the 5 times table and a second for the 10 times table." ] }, { @@ -214,9 +214,9 @@ "id": "69008a3d-3d15-4643-828c-b0419b347d01" }, "source": [ - "### scikit-learn RunInference pipeline\n", - "This section demonstrates the following steps:\n", - "1. Define the scikit-learn model handler that accepts an `array_like` object as input.\n", + "### Create a scikit-learn RunInference pipeline\n", + "This section demonstrates how to do the following:\n", + "1. Define a scikit-learn model handler that accepts an `array_like` object as input.\n", "2. Read the data from BigQuery.\n", "3. Use the scikit-learn trained model and the scikit-learn RunInference transform on unkeyed data." ] @@ -360,8 +360,8 @@ "id": "33e901d6-ed06-4268-8a5f-685d31b5558f" }, "source": [ - "### Sklearn RunInference on keyed inputs.\n", - "This section demonstrates the following steps:\n", + "### Use sklearn RunInference on keyed inputs\n", + "This section demonstrates how to do the following:\n", "1. Wrap the `SklearnModelHandlerNumpy` object around `KeyedModelHandler` to handle keyed data.\n", "2. Read the data from BigQuery.\n", "3.
Use the sklearn trained model and the sklearn RunInference transform on a keyed data." @@ -410,7 +410,7 @@ "source": [ "## Run multiple models\n", "\n", - "This pipeline takes two RunInference transforms with different models and then combines the output." + "This code creates a pipeline that takes two RunInference transforms with different models and then combines the output." ], "metadata": { "id": "JQ4zvlwsRK1W" diff --git a/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb b/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb index 3e2e9e428aee..81e3bd38cac6 100644 --- a/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb +++ b/examples/notebooks/beam-ml/run_inference_tensorflow.ipynb @@ -7,8 +7,8 @@ "collapsed_sections": [] }, "kernelspec": { - "display_name": "Python 3", - "name": "python3" + "name": "python3", + "display_name": "Python 3" }, "language_info": { "name": "python" @@ -39,6 +39,7 @@ "# under the License" ], "metadata": { + "cellView": "form", "id": "fFjof1NgAJwu" }, "execution_count": null, @@ -49,11 +50,11 @@ "source": [ "# Apache Beam RunInference with TensorFlow\n", "This notebook demonstrates the use of the RunInference transform for [TensorFlow](https://www.tensorflow.org/).\n", - "Beam [RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) accepts a ModelHandler generated from [`tfx-bsl`](https://github.com/tensorflow/tfx-bsl) via CreateModelHandler.\n", + "Beam [RunInference](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.inference.base.html#apache_beam.ml.inference.base.RunInference) accepts a ModelHandler generated from [`tfx-bsl`](https://github.com/tensorflow/tfx-bsl) using `CreateModelHandler`.\n", "\n", - "The Apache Beam RunInference transform is used for making predictions for\n", + "The Apache Beam RunInference transform is used to make predictions for\n", "a variety of machine learning models. 
In versions 1.10.0 and later of `tfx-bsl`, you can\n", - "create a TensorFlow ModelHandler for use with Apache Beam. For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n", + "create a TensorFlow `ModelHandler` for use with Apache Beam. For more information about the RunInference API, see [Machine Learning](https://beam.apache.org/documentation/sdks/python-machine-learning) in the Apache Beam documentation.\n", "\n", "This notebook demonstrates the following steps:\n", "- Import [`tfx-bsl`](https://github.com/tensorflow/tfx-bsl).\n", @@ -68,6 +69,9 @@ { "cell_type": "markdown", "source": [ + "## Before you begin\n", + "Complete the following setup steps.\n", + "\n", "First, import `tfx-bsl`." ], "metadata": { @@ -123,7 +127,7 @@ { "cell_type": "markdown", "source": [ - "## Authenticate with Google Cloud\n", + "### Authenticate with Google Cloud\n", "This notebook relies on saving your model to Google Cloud. To use your Google Cloud account, authenticate this notebook." ], "metadata": { @@ -145,7 +149,7 @@ { "cell_type": "markdown", "source": [ - "## Import dependencies and set up your bucket\n", + "### Import dependencies and set up your bucket\n", "Replace `PROJECT_ID` and `BUCKET_NAME` with the ID of your project and the name of your bucket.\n", "\n", "**Important**: If an error occurs, restart your runtime." @@ -193,12 +197,20 @@ "source": [ "## Create and test a simple model\n", "\n", - "This step creates a model that predicts the 5 times table." + "This step creates and tests a model that predicts the 5 times table." ], "metadata": { "id": "YzvZWEv-1oiK" } }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Create the model\n", + "Create training data and build a linear regression model." 
+ ] + }, { "cell_type": "code", "metadata": { @@ -296,7 +308,7 @@ "source": [ "### Populate the data in a TensorFlow proto\n", "\n", - "Tensorflow data uses protos. If you are loading from a file, helpers exist for this step. Because we are using generated data, this code populates a proto." + "Tensorflow data uses protos. If you are loading from a file, helpers exist for this step. Because this example uses generated data, this code populates a proto." ], "metadata": { "id": "dEmleqiH3t71" @@ -356,7 +368,7 @@ "source": [ "### Fit The Model\n", "\n", - "This example builds a model. Because RunInference requires pretrained models, this segment builds a usable model." + "This step builds a model. Because RunInference requires pretrained models, this segment builds a usable model." ], "metadata": { "id": "G-sAu3cf31f3" @@ -445,6 +457,7 @@ "cell_type": "markdown", "source": [ "## Run the Pipeline\n", + "Use the following code to run the pipeline.\n", "\n", "`FormatOutput` demonstrates how to extract values from the output protos.\n", "\n", @@ -507,11 +520,10 @@ "\n", "By default, the `ModelHandler` does not expect a key.\n", "\n", - "If you know that keys are associated with your examples, wrap the model handler with `beam.KeyedModelHandler`.\n", - "\n", - "If you don't know whether keys are associated with your examples, use `beam.MaybeKeyedModelHandler`.\n", + "* If you know that keys are associated with your examples, wrap the model handler with `beam.KeyedModelHandler`.\n", + "* If you don't know whether keys are associated with your examples, use `beam.MaybeKeyedModelHandler`.\n", "\n", - "This step also illustrates how to use `tfx-bsl` examples." + "In addition to demonstrating how to use a keyed model handler, this step demonstrates how to use `tfx-bsl` examples." ], "metadata": { "id": "IXikjkGdHm9n" @@ -583,4 +595,4 @@ ] } ] -} +} \ No newline at end of file