Merge pull request #96 from rstudio/henry090-transformers

Henry090 transformers
rstudio · Jul 30, 2020 · 30a4bea · 30a4bea
2 parents 6c48b90 + bc3a80e
commit 30a4bea
Show file tree

Hide file tree

Showing 46 changed files with 11,468 additions and 906 deletions.
diff --git a/_posts/2020-07-29-parallelized-sampling/parallelized-sampling.html b/_posts/2020-07-29-parallelized-sampling/parallelized-sampling.html
@@ -42,7 +42,7 @@
 
   <!--/radix_placeholder_meta_tags-->
 
-  <meta name="citation_reference" content="citation_title=Weighted Random Sampling;citation_publication_date=2016;citation_publisher=Springer New York;citation_doi=10.1007/978-1-4939-2864-4_478;citation_author=Pavlos Efraimidis;citation_author=Paul (Pavlos) Spirakis"/>
+  <meta name="citation_reference" content="citation_title=Weighted random sampling;citation_publication_date=2016;citation_publisher=Springer New York;citation_doi=10.1007/978-1-4939-2864-4_478;citation_author=Pavlos Efraimidis;citation_author=Paul (Pavlos) Spirakis"/>
   <!--radix_placeholder_rmarkdown_metadata-->
 
   <script type="text/json" id="radix-rmarkdown-metadata">
@@ -955,7 +955,7 @@
 <!--radix_placeholder_front_matter-->
 
 <script id="distill-front-matter" type="text/json">
-{"title":"Parallelized sampling using exponential variates","description":"How can the seemingly iterative process of weighted sampling without replacement be transformed into something highly parallelizable? Turns out a well-known technique based on exponential variates accomplishes exactly that.","authors":[{"author":"Yitao Li","authorURL":"#","affiliation":"RStudio","affiliationURL":"https://www.rstudio.com/"}],"publishedDate":"2020-07-29T00:00:00.000+00:00","citationText":"Li, 2020"}
+{"title":"Parallelized sampling using exponential variates","description":"How can the seemingly iterative process of weighted sampling without replacement be transformed into something highly parallelizable? Turns out a well-known technique based on exponential variates accomplishes exactly that.","authors":[{"author":"Yitao Li","authorURL":"#","affiliation":"RStudio","affiliationURL":"https://www.rstudio.com/"}],"publishedDate":"2020-07-29T00:00:00.000+02:00","citationText":"Li, 2020"}
 </script>
 
 <!--/radix_placeholder_front_matter-->

diff --git a/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/files/dino.jpg b/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/files/dino.jpg
diff --git a/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/files/res.csv b/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/files/res.csv
@@ -0,0 +1,13 @@
+rowname,V1,V2,model_names
+loss,0.559825122356415,0.245607495307922,TFGPT2Model
+auc,0.78228098154068,0.963757991790771,TFGPT2Model
+val_loss,0.448985695838928,0.635270476341248,TFGPT2Model
+val_auc,0.867825031280518,0.855400025844574,TFGPT2Model
+loss,0.523191094398499,0.253391593694687,TFRobertaModel
+auc,0.822929620742798,0.954732954502106,TFRobertaModel
+val_loss,0.498315244913101,0.495664745569229,TFRobertaModel
+val_auc,0.892175018787384,0.892712593078613,TFRobertaModel
+loss,0.591742992401123,0.313754796981812,TFElectraModel
+auc,0.737222373485565,0.939261674880981,TFElectraModel
+val_loss,0.494326084852219,0.63808286190033,TFElectraModel
+val_auc,0.844237506389618,0.845062434673309,TFElectraModel
diff --git a/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/state-of-the-art-nlp-models-from-r.Rmd b/_posts/2020-07-30-state-of-the-art-nlp-models-from-r/state-of-the-art-nlp-models-from-r.Rmd
@@ -0,0 +1,275 @@
+---
+title: "State-of-the-art NLP models from R"
+description: |
+  Nowadays, Microsoft, Google, Facebook, and OpenAI are sharing lots of state-of-the-art models in the field of Natural Language Processing. However, fewer materials exist how to use these models from R. In this post, we will show how R users can access and benefit from these models as well.
+author:
+  - name: Turgut Abdullayev 
+    url: https://github.com/henry090
+    affiliation: QSS Analytics
+    affiliation_url: http://www.qss.az/
+date: 07-30-2020
+categories:
+  - Natural Language Processing
+creative_commons: CC BY
+repository_url: https://github.com/henry090/transformers
+output: 
+  distill::distill_article:
+    self_contained: false
+    toc: true
+    toc_depth: 2
+preview: files/dino.jpg
+---
+
+
+
+<style type="text/css">
+.colab-root {
+    display: inline-block;
+    background: rgba(255, 255, 255, 0.75);
+    padding: 4px 8px;
+    border-radius: 4px;
+    font-size: 11px!important;
+    text-decoration: none;
+    color: #aaa;
+    border: none;
+    font-weight: 300;
+    border: solid 1px rgba(0, 0, 0, 0.08);
+    border-bottom-color: rgba(0, 0, 0, 0.15);
+    text-transform: uppercase;
+    line-height: 16px;
+}
+span.colab-span {
+    background-image: url(https://www.vectorlogo.zone/logos/kaggle/kaggle-ar21.svg);
+    background-repeat: no-repeat;
+    background-size: 51px;
+    background-position-y: -4px;
+    display: inline-block;
+    padding-left: 24px;
+    border-radius: 4px;
+    text-decoration: none;
+}
+</style>
+
+```{r setup, include=FALSE, eval=F,echo=T}
+knitr::opts_chunk$set(echo = FALSE, eval = F, echo = T)
+```
+
+
+## Introduction
+
+The _Transformers_ repository from ["Hugging Face"](https://github.com/huggingface/transformers) contains a lot of ready to use, state-of-the-art models, which are straightforward to download and fine-tune with Tensorflow & Keras. 
+
+For this purpose the users usually need to get:
+
+* The model itself (e.g. Bert, Albert, RoBerta, GPT-2 and etc.)
+* The tokenizer object
+* The weights of the model
+
+In this post, we will work on a classic binary classification task and train our dataset on 3 models: 
+
+* [GPT-2](https://blog.openai.com/better-language-models/) from Open AI
+* [RoBERTa](https://arxiv.org/abs/1907.11692) from Facebook
+* [Electra](https://arxiv.org/abs/2003.10555) from Google Research/Stanford University
+
+However, readers should know that one can work with transformers on a variety of down-stream tasks, such as:
+
+1) feature extraction
+2) sentiment analysis
+3) [text classification](https://github.com/huggingface/transformers/tree/master/examples/text-classification)
+4) [question answering](https://github.com/huggingface/transformers/tree/master/examples/question-answering)
+5) [summarization](https://github.com/huggingface/transformers/tree/master/examples/seq2seq)
+6) [translation](https://github.com/huggingface/transformers/tree/master/examples/seq2seq) and [many more](https://github.com/huggingface/transformers/tree/master/examples).
+
+## Prerequisites
+
+Our first job is to install the _transformers_ package via ```reticulate```.
+
+```{r eval = F, echo = T}
+reticulate::py_install('transformers', pip = TRUE)
+```
+
+Then, as usual, load standard 'Keras', 'TensorFlow' >= 2.0 and some classic libraries from R.
+
+```{r eval = F, echo = T}
+library(keras)
+library(tensorflow)
+library(dplyr)
+library(tfdatasets)
+
+transformer = reticulate::import('transformers')
+```
+
+Note that if running TensorFlow on GPU one could specify the following parameters in order to avoid memory issues.
+
+```{r eval = F, echo = T}
+
+physical_devices = tf$config$list_physical_devices('GPU')
+tf$config$experimental$set_memory_growth(physical_devices[[1]],TRUE)
+
+tf$keras$backend$set_floatx('float32')
+
+```
+
+## Template
+
+We already mentioned that to train a data on the specific model, users should download the model, its tokenizer object and weights. For example, to get a RoBERTa model one has to do the following:
+
+```{r eval = F, echo = T}
+# get Tokenizer
+transformer$RobertaTokenizer$from_pretrained('roberta-base', do_lower_case=TRUE)
+
+# get Model with weights
+transformer$TFRobertaModel$from_pretrained('roberta-base')
+```
+
+
+## Data preparation
+
+A dataset for binary classification is provided in [text2vec](http://text2vec.org/) package. Let's load the dataset and take a sample for fast model training.
+
+```{r eval = F, echo = T}
+library(text2vec)
+data("movie_review")
+df = movie_review %>% rename(target = sentiment, comment_text = review) %>% 
+  sample_n(2000) %>% 
+  data.table::as.data.table()
+```
+
+Split our data into 2 parts:
+
+```{r eval = F, echo = T}
+idx_train = sample.int(nrow(df)*0.8)
+
+train = df[idx_train,]
+test = df[!idx_train,]
+```
+
+## Data input for Keras
+
+Until now, we've just covered data import and train-test split. To feed input to the network we have to turn our raw text into indices via the imported tokenizer. And then adapt the model to do binary classification by adding a dense layer with a single unit at the end.
+
+However, we want to train our data for 3 models GPT-2, RoBERTa, and Electra. We need to write a loop for that.
+
+> Note: one model in general requires 500-700 MB
+
+```{r eval = F, echo = T}
+# list of 3 models
+ai_m = list(
+  c('TFGPT2Model',       'GPT2Tokenizer',       'gpt2'),
+   c('TFRobertaModel',    'RobertaTokenizer',    'roberta-base'),
+   c('TFElectraModel',    'ElectraTokenizer',    'google/electra-small-generator')
+)
+
+# parameters
+max_len = 50L
+epochs = 2
+batch_size = 10
+
+# create a list for model results
+gather_history = list()
+
+for (i in 1:length(ai_m)) {
+  
+  # tokenizer
+  tokenizer = glue::glue("transformer${ai_m[[i]][2]}$from_pretrained('{ai_m[[i]][3]}',
+                         do_lower_case=TRUE)") %>% 
+    rlang::parse_expr() %>% eval()
+  
+  # model
+  model_ = glue::glue("transformer${ai_m[[i]][1]}$from_pretrained('{ai_m[[i]][3]}')") %>% 
+    rlang::parse_expr() %>% eval()
+  
+  # inputs
+  text = list()
+  # outputs
+  label = list()
+  
+  data_prep = function(data) {
+    for (i in 1:nrow(data)) {
+      
+      txt = tokenizer$encode(data[['comment_text']][i],max_length = max_len, 
+                             truncation=T) %>% 
+        t() %>% 
+        as.matrix() %>% list()
+      lbl = data[['target']][i] %>% t()
+      
+      text = text %>% append(txt)
+      label = label %>% append(lbl)
+    }
+    list(do.call(plyr::rbind.fill.matrix,text), do.call(plyr::rbind.fill.matrix,label))
+  }
+  
+  train_ = data_prep(train)
+  test_ = data_prep(test)
+  
+  # slice dataset
+  tf_train = tensor_slices_dataset(list(train_[[1]],train_[[2]])) %>% 
+    dataset_batch(batch_size = batch_size, drop_remainder = TRUE) %>% 
+    dataset_shuffle(128) %>% dataset_repeat(epochs) %>% 
+    dataset_prefetch(tf$data$experimental$AUTOTUNE)
+  
+  tf_test = tensor_slices_dataset(list(test_[[1]],test_[[2]])) %>% 
+    dataset_batch(batch_size = batch_size)
+  
+  # create an input layer
+  input = layer_input(shape=c(max_len), dtype='int32')
+  hidden_mean = tf$reduce_mean(model_(input)[[1]], axis=1L) %>% 
+    layer_dense(64,activation = 'relu')
+  # create an output layer for binary classification
+  output = hidden_mean %>% layer_dense(units=1, activation='sigmoid')
+  model = keras_model(inputs=input, outputs = output)
+  
+  # compile with AUC score
+  model %>% compile(optimizer= tf$keras$optimizers$Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0),
+                    loss = tf$losses$BinaryCrossentropy(from_logits=F),
+                    metrics = tf$metrics$AUC())
+  
+  print(glue::glue('{ai_m[[i]][1]}'))
+  # train the model
+  history = model %>% keras::fit(tf_train, epochs=epochs, #steps_per_epoch=len/batch_size,
+                validation_data=tf_test)
+  gather_history[[i]]<- history
+  names(gather_history)[i] = ai_m[[i]][1]
+}
+
+```
+
+<center>
+<a href="https://www.kaggle.com/henry090/transformers" class="colab-root">Reproduce in a<span class="colab-span">  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Notebook</span></a>
+</center>
+<br>
+
+Extract results to see the benchmarks:
+
+```{r eval = F, echo = T}
+res = sapply(1:3, function(x) {
+  do.call(rbind,gather_history[[x]][["metrics"]]) %>% 
+    as.data.frame() %>% 
+    tibble::rownames_to_column() %>% 
+    mutate(model_names = names(gather_history[x])) 
+}, simplify = F) %>% do.call(plyr::rbind.fill,.) %>% 
+  mutate(rowname = stringr::str_extract(rowname, 'loss|val_loss|auc|val_auc')) %>% 
+  rename(epoch_1 = V1, epoch_2 = V2)
+```
+
+```{r eval=T,echo=F}
+library(dplyr)
+res = data.table::fread('files/res.csv') %>% 
+  filter(rowname %in% 'val_auc') %>% arrange(desc(V2)) %>% 
+   rename(epoch_1 = V1, epoch_2 = V2, metric = rowname) %>% 
+  mutate(epoch_1 = round(epoch_1,3),epoch_2 = round(epoch_2,3))
+DT::datatable(res, options = list(dom = 't'))
+```
+
+Both the _RoBERTa_ and _Electra_ models show some additional improvements after 2 epochs of training, which cannot be said of  _GPT-2_. In this case, it is clear that it can be enough to train a state-of-the-art model even for a single epoch.
+
+
+## Conclusion
+
+In this post, we showed how to use state-of-the-art NLP models from R. 
+To understand how to apply them to more complex tasks, it is highly recommended to review the [transformers tutorial](https://github.com/huggingface/transformers/tree/master/examples).
+
+We encourage readers to try out these models and share their results below in the comments section!
+
+
+