From 3260a7bf25aab25186717823d81cb31af0fae346 Mon Sep 17 00:00:00 2001 From: Rebecca Szper <98840847+rszper@users.noreply.github.com> Date: Fri, 5 Jan 2024 22:24:54 -0800 Subject: [PATCH] Final edit on data preprocessing notebooks (#29940) --- .../compute_and_apply_vocab.ipynb | 14 +++++++------- .../beam-ml/data_preprocessing/scale_data.ipynb | 8 ++++---- .../vertex_ai_text_embeddings.ipynb | 6 ++---- 3 files changed, 13 insertions(+), 15 deletions(-) diff --git a/examples/notebooks/beam-ml/data_preprocessing/compute_and_apply_vocab.ipynb b/examples/notebooks/beam-ml/data_preprocessing/compute_and_apply_vocab.ipynb index 76f26f2aabe0..ee47cb7711fa 100644 --- a/examples/notebooks/beam-ml/data_preprocessing/compute_and_apply_vocab.ipynb +++ b/examples/notebooks/beam-ml/data_preprocessing/compute_and_apply_vocab.ipynb @@ -63,9 +63,9 @@ { "cell_type": "markdown", "source": [ - "[`ComputeAndApplyVocabulary`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) is a data processing transform that computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning (ML) tasks.\n", + "The [`ComputeAndApplyVocabulary`](https://beam.apache.org/releases/pydoc/current/apache_beam.ml.transforms.tft.html#apache_beam.ml.transforms.tft.ComputeAndApplyVocabulary) data processing transform computes a unique vocabulary from a dataset and then maps each word or token to a distinct integer index. Use this transform to change textual data into numerical representations for machine learning (ML) tasks.\n", "\n", - "When you train ML models that use text data, generating a vocabulary on the incoming dataset is a crucial preprocessing step. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of dataset. 
This step allows ML models to process the same words in a consistent way.\n", + "When you train ML models that use text data, generating a vocabulary on the incoming dataset is an important preprocessing step. By mapping words to numerical indices, the vocabulary reduces the complexity and dimensionality of the dataset. This step allows ML models to process the same words in a consistent way.\n", "\n", "This notebook shows how to use `MLTransform` to complete the following tasks:\n", "* Use `write` mode to generate a vocabulary on the input text and assign an index value to each token.\n", @@ -120,7 +120,7 @@ { "cell_type": "markdown", "source": [ - "## Artifact location\n", + "## Use the artifact location\n", "\n", "In `write` mode, the artifact location is used to store artifacts, such as the vocabulary file generated by `ComputeAndApplyVocabulary`.\n", "\n", @@ -163,7 +163,7 @@ { "cell_type": "markdown", "source": [ - "In this example, in `write` mode, `MLTransform` uses `ComputeAndApplyVocabulary` to generate vocabulary on the incoming dataset. The incoming text data is split into tokens and each token is assigned an unique index.\n", + "In this example, in `write` mode, `MLTransform` uses `ComputeAndApplyVocabulary` to generate a vocabulary on the incoming dataset. The incoming text data is split into tokens. Each token is assigned a unique index.\n", "\n", " The generated vocabulary is stored in an artifact location that you can use on a different dataset in `read` mode." ], "metadata": { @@ -270,7 +270,7 @@ { "cell_type": "markdown", "source": [ - "## Frequency Threshold\n", + "## Set the frequency threshold\n", "\n", "The `frequency_threshold` parameter identifies the elements that appear frequently in the dataset. This parameter limits the generated vocabulary to elements with an absolute frequency greater than or equal to the specified threshold.
If you don't specify the parameter, the entire vocabulary is generated.\n", "\n", @@ -317,7 +317,7 @@ { "cell_type": "markdown", "source": [ - "In the output, if the frequency of the token is less than the specified frequency, it is assigned to a `default_value` of `-1`. For the other tokens, a vocabulary file is generated." + "In the output, if the frequency of the token is less than the specified threshold, it's assigned the `default_value` of `-1`. For the other tokens, a vocabulary file is generated." ], "metadata": { "id": "h1s4a6hzxKrb" } }, @@ -357,7 +357,7 @@ { "cell_type": "markdown", "source": [ - "## `MLTransform` for inference workloads\n", + "## Use MLTransform for inference workloads\n", "\n", "When `MLTransform` is in `write` mode, it produces artifacts, such as vocabulary files for `ComputeAndApplyVocabulary`. These artifacts allow you to apply the same vocabulary, and any other preprocessing transforms, when you train your model and serve it in production, or when you test its accuracy.\n", "\n", diff --git a/examples/notebooks/beam-ml/data_preprocessing/scale_data.ipynb b/examples/notebooks/beam-ml/data_preprocessing/scale_data.ipynb index abeeba2264ee..3c8946362a31 100644 --- a/examples/notebooks/beam-ml/data_preprocessing/scale_data.ipynb +++ b/examples/notebooks/beam-ml/data_preprocessing/scale_data.ipynb @@ -78,11 +78,11 @@ "\n", "For each data processing transform, `MLTransform` runs in both `write` mode and `read` mode. For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.\n", "\n", - "### MLTransform in write mode\n", + "## MLTransform in write mode\n", "\n", "When `MLTransform` is in `write` mode, it produces artifacts, such as minimum, maximum, and variance, for different data processing transforms.
These artifacts allow you to ensure that you're applying the same artifacts, and any other preprocessing transforms, when you train your model and serve it in production, or when you test its accuracy.\n", "\n", - "### MLTransform in read mode\n", + "## MLTransform in read mode\n", "\n", "In read mode, `MLTransform` uses the artifacts generated in `write` mode to scale the entire dataset." ], @@ -146,7 +146,7 @@ { "cell_type": "code", "source": [ - "# data used in MLTransform's write mode.\n", + "# data used in MLTransform's write mode\n", "data = [\n", " {'int_feature_1' : 11, 'int_feature_2': -10},\n", " {'int_feature_1': 34, 'int_feature_2': -33},\n", @@ -156,7 +156,7 @@ " {'int_feature_1': 63, 'int_feature_2': -21},\n", "]\n", "\n", - "# data used in MLTransform's read mode.\n", + "# data used in MLTransform's read mode\n", "test_data = [\n", " {'int_feature_1': 29, 'int_feature_2': -20},\n", " {'int_feature_1': -5, 'int_feature_2': -11},\n", diff --git a/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb b/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb index db9e550dc913..ed4da57e297b 100644 --- a/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb +++ b/examples/notebooks/beam-ml/data_preprocessing/vertex_ai_text_embeddings.ipynb @@ -72,7 +72,7 @@ "* **Machine translation:** Translate text from one language to another and preserve the meaning.\n", "* **Text summarization:** Create shorter summaries of text.\n", "\n", - "This notebook uses the Vertex AI text-embeddings API to generate text embeddings that use Google’s large generative artificial intelligence (AI) models. To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings). 
\n", + "This notebook uses the Vertex AI text-embeddings API to generate text embeddings by using Google’s large generative artificial intelligence (AI) models. To generate text embeddings by using the Vertex AI text-embeddings API, use `MLTransform` with the `VertexAITextEmbeddings` class to specify the model configuration. For more information, see [Get text embeddings](https://cloud.google.com/vertex-ai/docs/generative-ai/embeddings/get-text-embeddings) in the Vertex AI documentation. \n", "\n", "For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation.\n", "\n", @@ -156,9 +156,7 @@ "\n", "### Use MLTransform in write mode\n", "\n", - "In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy.\n", - "\n", - "For more information about using `MLTransform`, see [Preprocess data with MLTransform](https://beam.apache.org/documentation/ml/preprocess-data/) in the Apache Beam documentation." + "In `write` mode, `MLTransform` saves the transforms and their attributes to an artifact location. Then, when you run `MLTransform` in `read` mode, these transforms are used. This process ensures that you're applying the same preprocessing steps when you train your model and when you serve the model in production or test its accuracy." ], "metadata": { "id": "cokOaX2kzyke"