
Commit

revert html encoding
hoosierEE committed Nov 1, 2023
1 parent 4516e2c commit 99aad3b
Showing 1 changed file with 40 additions and 45 deletions.
85 changes: 40 additions & 45 deletions docs/tutorials/word2vec.ipynb
@@ -37,20 +37,20 @@
"id": "AOpGoE2T-YXS"
},
"source": [
"<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://www.tensorflow.org/text/tutorials/word2vec\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
" </td>\n",
" <td>\n",
" <a target=\"_blank\" href=\"https://github.com/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View on GitHub</a>\n",
" </td>\n",
" <td>\n",
" <a href=\"https://storage.googleapis.com/tensorflow_docs/text/docs/tutorials/word2vec.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
" </td>\n",
"</table>"
"\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/text/tutorials/word2vec\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003eView on TensorFlow.org\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003eRun in Google Colab\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/text/blob/master/docs/tutorials/word2vec.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003eView on GitHub\u003c/a\u003e\n",
" \u003c/td\u003e\n",
" \u003ctd\u003e\n",
" \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/text/docs/tutorials/word2vec.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
" \u003c/td\u003e\n",
"\u003c/table\u003e"
]
},
{
@@ -86,7 +86,7 @@
"id": "xP00WlaMWBZC"
},
"source": [
"## Skip-gram and negative sampling"
"## Skip-gram and negative sampling "
]
},
{
@@ -95,7 +95,7 @@
"id": "Zr2wjv0bW236"
},
"source": [
"While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of `(target_word, context_word)` where `context_word` appears in the neighboring context of `target_word`."
"While a bag-of-words model predicts a word given the neighboring context, a skip-gram model predicts the context (or neighbors) of a word, given the word itself. The model is trained on skip-grams, which are n-grams that allow tokens to be skipped (see the diagram below for an example). The context of a word can be represented through a set of skip-gram pairs of `(target_word, context_word)` where `context_word` appears in the neighboring context of `target_word`. "
]
},
{
@@ -106,7 +106,7 @@
"source": [
"Consider the following sentence of eight words:\n",
"\n",
"> The wide road shimmered in the hot sun.\n",
"\u003e The wide road shimmered in the hot sun.\n",
"\n",
"The context words for each of the 8 words of this sentence are defined by a window size. The window size determines the span of words on either side of a `target_word` that can be considered a `context word`. Below is a table of skip-grams for target words based on different window sizes."
]
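As a minimal illustration of the pairing described above, here is a plain-Python sketch that enumerates `(target_word, context_word)` pairs for the example sentence with a window size of 2 (the variable names are illustrative only, not taken from the notebook):

```python
sentence = "The wide road shimmered in the hot sun"
tokens = sentence.lower().split()
window_size = 2

skip_grams = []
for i, target in enumerate(tokens):
    # Context words are the tokens within `window_size` positions of the target.
    for j in range(max(0, i - window_size), min(len(tokens), i + window_size + 1)):
        if j != i:
            skip_grams.append((target, tokens[j]))

print(skip_grams[:6])
# e.g. [('the', 'wide'), ('the', 'road'), ('wide', 'the'), ('wide', 'road'), ...]
```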
@@ -135,7 +135,7 @@
"id": "gK1gN1jwkMpU"
},
"source": [
"The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words *w<sub>1</sub>, w<sub>2</sub>, ... w<sub>T</sub>*, the objective can be written as the average log probability"
"The training objective of the skip-gram model is to maximize the probability of predicting context words given the target word. For a sequence of words *w\u003csub\u003e1\u003c/sub\u003e, w\u003csub\u003e2\u003c/sub\u003e, ... w\u003csub\u003eT\u003c/sub\u003e*, the objective can be written as the average log probability"
]
},
{
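Written out with a window size of *c*, the average log probability referred to above takes the standard form from Mikolov et al.:

$$
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
$$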
@@ -171,7 +171,7 @@
"id": "axZvd-hhotVB"
},
"source": [
"where *v* and *v<sup>'<sup>* are target and context vector representations of words and *W* is vocabulary size."
"where *v* and *v\u003csup\u003e'\u003csup\u003e* are target and context vector representations of words and *W* is vocabulary size."
]
},
{
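For reference, the full-softmax formulation that these symbols belong to is, in the usual word2vec notation:

$$
p(w_{t+j} \mid w_t) = \frac{\exp\left({v'_{w_{t+j}}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left({v'_{w}}^{\top} v_{w_t}\right)}
$$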
@@ -180,7 +180,7 @@
"id": "SoLzxbqSpT6_"
},
"source": [
"Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary words, which are often large (10<sup>5</sup>-10<sup>7</sup>) terms."
"Computing the denominator of this formulation involves performing a full softmax over the entire vocabulary words, which are often large (10\u003csup\u003e5\u003c/sup\u003e-10\u003csup\u003e7\u003c/sup\u003e) terms."
]
},
{
@@ -189,7 +189,7 @@
"id": "Y5VWYtmFzHkU"
},
"source": [
"The [noise contrastive estimation](https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss) (NCE) loss function is an efficient approximation for a full softmax. With an objective to learn word embeddings instead of modeling the word distribution, the NCE loss can be [simplified](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) to use negative sampling."
"The [noise contrastive estimation](https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss) (NCE) loss function is an efficient approximation for a full softmax. With an objective to learn word embeddings instead of modeling the word distribution, the NCE loss can be [simplified](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) to use negative sampling. "
]
},
{
@@ -198,7 +198,7 @@
"id": "WTZBPf1RsOsg"
},
"source": [
"The simplified negative sampling objective for a target word is to distinguish the context word from `num_ns` negative samples drawn from noise distribution *P<sub>n</sub>(w)* of words. More precisely, an efficient approximation of full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and `num_ns` negative samples."
"The simplified negative sampling objective for a target word is to distinguish the context word from `num_ns` negative samples drawn from noise distribution *P\u003csub\u003en\u003c/sub\u003e(w)* of words. More precisely, an efficient approximation of full softmax over the vocabulary is, for a skip-gram pair, to pose the loss for a target word as a classification problem between the context word and `num_ns` negative samples."
]
},
{
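Concretely, with `num_ns = 4`, one training example pairs a target word with its true context word plus four sampled negatives, and the label marks only the true context. A small illustrative sketch (the indices below are made up):

```python
num_ns = 4
# One training example: a target word index, the true context word followed by
# num_ns sampled negatives, and a label that is 1 only for the true context word.
target = 3
context = [2, 15, 41, 7, 9]   # [true context] + num_ns negative samples
label = [1, 0, 0, 0, 0]
```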
@@ -296,7 +296,7 @@
"source": [
"Consider the following sentence:\n",
"\n",
"> The wide road shimmered in the hot sun.\n",
"\u003e The wide road shimmered in the hot sun.\n",
"\n",
"Tokenize the sentence:"
]
@@ -332,7 +332,7 @@
"outputs": [],
"source": [
"vocab, index = {}, 1 # start indexing from 1\n",
"vocab['<pad>'] = 0 # add a padding token\n",
"vocab['\u003cpad\u003e'] = 0 # add a padding token\n",
"for token in tokens:\n",
" if token not in vocab:\n",
" vocab[token] = index\n",
@@ -437,8 +437,7 @@
},
"outputs": [],
"source": [
"# print([inverse_vocab[x] for x in example_sequence])\n",
"for target, context in positive_skip_grams[:10]:\n",
"for target, context in positive_skip_grams[:5]:\n",
" print(f\"({target}, {context}): ({inverse_vocab[target]}, {inverse_vocab[context]})\")"
]
},
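The `positive_skip_grams` printed above come from `tf.keras.preprocessing.sequence.skipgrams`; a minimal sketch of that call, assuming the `example_sequence` and `vocab_size` built earlier in the notebook (the window size of 2 is an assumption):

```python
import tensorflow as tf

window_size = 2
positive_skip_grams, _ = tf.keras.preprocessing.sequence.skipgrams(
    example_sequence,
    vocabulary_size=vocab_size,
    window_size=window_size,
    negative_samples=0)  # negatives are drawn separately, as described below
```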
@@ -448,7 +447,7 @@
"id": "_ua9PkMTISF0"
},
"source": [
"### Negative sampling for one skip-gram"
"### Negative sampling for one skip-gram "
]
},
{
@@ -457,7 +456,7 @@
"id": "Esqn8WBfZnEK"
},
"source": [
"The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that would serve as negative samples for training, you can to sample random words from the vocabulary. Use the `tf.random.log_uniform_candidate_sampler` function to sample `num_ns` number of negative samples for a given target word in a window. You can pass words from the positive class but this does not exclude them from the results. For large vocabularies, this is not a problem because the chance of drawing one of the positive classes is small. However for small data you may see overlap between negative and positive samples. Later we will add code to exclude positive samples for slightly improved accuracy at the cost of longer runtime."
"The `skipgrams` function returns all positive skip-gram pairs by sliding over a given window span. To produce additional skip-gram pairs that would serve as negative samples for training, you can sample random words from the vocabulary. Use the `tf.random.log_uniform_candidate_sampler` function to sample `num_ns` number of negative samples for a given target word in a window. You can pass words from the positive class but this does not exclude them from the results. For large vocabularies, this is not a problem because the chance of drawing one of the positive classes is small. However for small data you may see overlap between negative and positive samples. Later we will add code to exclude positive samples for slightly improved accuracy at the cost of longer runtime."
]
},
{
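A sketch of drawing `num_ns` negatives for one positive pair with `tf.random.log_uniform_candidate_sampler`, assuming the `positive_skip_grams` and `vocab_size` from earlier cells (`SEED` and the shapes are assumptions):

```python
import tensorflow as tf

num_ns = 4
SEED = 42
target_word, context_word = positive_skip_grams[0]

# The sampler expects the positive class as an int64 tensor of shape (1, num_true).
context_class = tf.reshape(tf.constant(context_word, dtype="int64"), (1, 1))
negative_sampling_candidates, _, _ = tf.random.log_uniform_candidate_sampler(
    true_classes=context_class,  # the positive context word for this pair
    num_true=1,                  # one positive per example
    num_sampled=num_ns,          # number of negatives to draw
    unique=True,                 # no duplicate negatives within one draw
    range_max=vocab_size,        # sample word indices in [0, vocab_size)
    seed=SEED,
    name="negative_sampling")
```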
@@ -631,7 +630,7 @@
"id": "iLKwNAczHsKg"
},
"source": [
"### Skip-gram sampling table"
"### Skip-gram sampling table "
]
},
{
@@ -640,7 +639,7 @@
"id": "TUUK3uDtFNFE"
},
"source": [
"A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as `the`, `is`, `on`) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality."
"A large dataset means larger vocabulary with higher number of more frequent words such as stopwords. Training examples obtained from sampling commonly occurring words (such as `the`, `is`, `on`) don't add much useful information for the model to learn from. [Mikolov et al.](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) suggest subsampling of frequent words as a helpful practice to improve embedding quality. "
]
},
{
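A short sketch of what `tf.keras.preprocessing.sequence.make_sampling_table` produces (the table size of 10 is only for display):

```python
import tensorflow as tf

sampling_table = tf.keras.preprocessing.sequence.make_sampling_table(size=10)
print(sampling_table)
# Entry i is the sampling probability of the i-th most common word in the dataset;
# the most frequent words get the smallest probabilities.
```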
@@ -820,7 +819,7 @@
"id": "sOsbLq8a37dr"
},
"source": [
"Read the text from the file and print the first few lines:"
"Read the text from the file and print the first few lines: "
]
},
{
@@ -1018,7 +1017,7 @@
"outputs": [],
"source": [
"for seq in sequences[:5]:\n",
" print(f\"{seq} => {[inverse_vocab[i] for i in seq]}\")"
" print(f\"{seq} =\u003e {[inverse_vocab[i] for i in seq]}\")"
]
},
{
@@ -1061,7 +1060,7 @@
"print('\\n')\n",
"print(f\"targets.shape: {targets.shape}\")\n",
"print(f\"contexts.shape: {contexts.shape}\")\n",
"print(f\"labels.shape: {labels.shape}\")"
"print(f\"labels.shape: {labels.shape}\")\n"
]
},
{
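With `targets`, `contexts`, and `labels` in hand, a hedged sketch of packing them into a `tf.data.Dataset` for training (the batch and buffer sizes are assumptions):

```python
import tensorflow as tf

BATCH_SIZE = 1024
BUFFER_SIZE = 10000

dataset = tf.data.Dataset.from_tensor_slices(((targets, contexts), labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset = dataset.cache().prefetch(buffer_size=tf.data.AUTOTUNE)
```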
@@ -1200,7 +1199,7 @@
" # word_emb: (batch, embed)\n",
" context_emb = self.context_embedding(context)\n",
" # context_emb: (batch, context, embed)\n",
" dots = tf.einsum('be,bce->bc', word_emb, context_emb)\n",
" dots = tf.einsum('be,bce-\u003ebc', word_emb, context_emb)\n",
" # dots: (batch, context)\n",
" return dots"
]
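The `call` fragment above belongs to a small subclassed Keras model with separate target and context embedding layers; a sketch of what the full class might look like (layer names and constructor arguments are assumptions):

```python
import tensorflow as tf

class Word2Vec(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim):
    super().__init__()
    # One embedding table for target words...
    self.target_embedding = tf.keras.layers.Embedding(
        vocab_size, embedding_dim, name="w2v_embedding")
    # ...and a second one for context words (positives and sampled negatives).
    self.context_embedding = tf.keras.layers.Embedding(
        vocab_size, embedding_dim)

  def call(self, pair):
    target, context = pair
    # target: (batch,), context: (batch, 1 + num_ns)
    if len(target.shape) == 2:
      target = tf.squeeze(target, axis=1)
    word_emb = self.target_embedding(target)       # (batch, embed)
    context_emb = self.context_embedding(context)  # (batch, context, embed)
    dots = tf.einsum('be,bce->bc', word_emb, context_emb)  # (batch, context)
    return dots
```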
@@ -1227,7 +1226,7 @@
" return tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=y_true)\n",
"```\n",
"\n",
"It's time to build your model! Instantiate your word2vec class with an embedding dimension of 128 (you could experiment with different values). Compile the model with the `tf.keras.optimizers.Adam` optimizer."
"It's time to build your model! Instantiate your word2vec class with an embedding dimension of 128 (you could experiment with different values). Compile the model with the `tf.keras.optimizers.Adam` optimizer. "
]
},
{
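A sketch of that instantiation and compile step, assuming the `Word2Vec` class and `vocab_size` defined earlier (the built-in categorical cross-entropy on logits is used here as an alternative to the custom sigmoid loss shown above):

```python
import tensorflow as tf

embedding_dim = 128
word2vec = Word2Vec(vocab_size, embedding_dim)
word2vec.compile(
    optimizer='adam',
    loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])
```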
@@ -1282,11 +1281,7 @@
},
"outputs": [],
"source": [
"word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])\n",
"# original\n",
"# 63/63 [==============================] - 1s 15ms/step - loss: 0.4750 - accuracy: 0.8917\n",
"# with negative samples\n",
"# 39/39 [==============================] - 1s 23ms/step - loss: 0.4328 - accuracy: 0.9214"
"word2vec.fit(dataset, epochs=20, callbacks=[tensorboard_callback])"
]
},
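The `tensorboard_callback` passed to `fit` above can be created with the standard Keras callback; a minimal sketch (the log directory name is an assumption):

```python
import tensorflow as tf

tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir="logs")
```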
{
Expand Down Expand Up @@ -1316,7 +1311,7 @@
"id": "awF3iRQCZOLj"
},
"source": [
"<!-- <img class=\"tfo-display-only-on-site\" src=\"images/word2vec_tensorboard.png\"/> -->"
"\u003c!-- \u003cimg class=\"tfo-display-only-on-site\" src=\"images/word2vec_tensorboard.png\"/\u003e --\u003e"
]
},
{
Expand Down Expand Up @@ -1433,9 +1428,9 @@
],
"metadata": {
"colab": {
"toc_visible": true,
"provenance": [],
"private_outputs": true
"collapsed_sections": [],
"name": "word2vec.ipynb",
"toc_visible": true
},
"kernelspec": {
"display_name": "Python 3",
@@ -1444,4 +1439,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
}
}
