Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Voronov-Andrey · 2024-05-11T08:22:29Z

I use keras to develop an application for sentimental analysis of text in Russian using deep learning models. The vast majority of guides and examples use the methods text_tokenizer, fit_text_tokenizer, texts_to_sequences, pad_sequences to convert texts into numeric sequences, as well as to_categorical() to apply one-hot encoding to class labels. But I ran into the problem that the keras3 package for the R language does not contain the methods I listed. Using keras and keras3 at the same time leads to errors in the code and i always need to restart R session and load the packages to fit model or to use those methods again. As an alternative, I tried to use layer_text_vectorization(), but when using it with the same data as with the text_tokenizer() and other methods, the model does not learn at all. Is there any solutions to this problem?

There is a model plot with using text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() :

And there is with using layer_text_vectorization as alternative:

t-kalinowski · 2024-05-11T20:01:42Z

In {keras3}, much of the legacy text processing API has been removed. Almost everything is now possible with just layer_text_vectorization(). The layer can be used with helpers get_vocabulary(), set_vocabulary() and adapt(). {keras3} also provides text_dataset_from_directory() which maybe be useful.

Some helpful links:

layer_text_vectorization() reference page with many examples and extended descriptions of features: https://keras.posit.co/reference/layer_text_vectorization.html
An end-to-end example showing layer_text_vectorization() in use. https://keras.posit.co/articles/examples/nlp/text_classification_from_scratch.html

If you're running into specific issue with layer_text_vectorization(), please provide a minimal reproducible example and I can help figure out what's going wrong.

Voronov-Andrey · 2024-05-12T09:53:48Z

Thanks for the answer, below I have provided the MRE and dput() output file of the data I use. If you need any additional information, I am always ready to provide it.

library(reticulate)
library(keras3)
library(dplyr)
library(stringr)
library(caret)

data_tidy <- dget("dput_output.txt", keep.source = FALSE)

#Dividing the data into training and test samples
split_data <- function(df) {
  set.seed(123)  
  
  trainIndex <- createDataPartition(df$Sentiment, p = 0.8, list = FALSE)
  train_data <- df[trainIndex, ]
  test_data <- df[-trainIndex, ]
  
  return(list(train_data = train_data, test_data = test_data))
}

splitted_data <- split_data(data_tidy)
train_data <- splitted_data$train_data
test_data <- splitted_data$test_data
train_data_x <- train_data$text_tidy
train_data_y <- to_categorical(train_data$Sentiment)
test_data_x <- test_data$text_tidy
test_data_y <- to_categorical(test_data$Sentiment)

#Setting model constants
max_features <- 50000L
embedding_dim <- 128L
sequence_length <- max(sapply(data_tidy$text_tidy, str_count, pattern = "\\w+")) + 1L

#Initialising vectorize layer (data is already standardized)
vectorize_layer <- layer_text_vectorization(
  standardize = NULL,
  max_tokens = max_features,
  output_mode = "int",
  output_sequence_length = sequence_length,
)

#Adapting vectorize layer on text data
vectorize_layer %>% adapt(data_tidy$text_tidy)

#Building LSTM model
model <- keras_model_sequential() %>%
  vectorize_layer() %>%
  layer_embedding(input_dim = max_features, output_dim = embedding_dim) %>%
  layer_lstm(units = 128, dropout = 0.5) %>%
  layer_dense(3, activation = 'softmax')

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c('accuracy'),
  loss = 'categorical_crossentropy'
)

lstm_model <- model %>% fit(
  train_data_x, train_data_y,
  batch_size = 64,
  epochs = 5,
  verbose = 1,
  validation_split = 0.2
)

dput_output.txt

t-kalinowski · 2024-05-13T11:47:48Z

Hi, did you resolve the issue?

Voronov-Andrey · 2024-05-13T14:45:40Z

Hello, not exactly, i just restructured model with Bidirectional LSTM layers and used manual data separation into training and validation samples to use validation_data instead of validation_split. Under such conditions, the fitting of the model is successful. Using a structure with regular LSTM layers leads to the same problem.

t-kalinowski · 2024-05-13T14:56:28Z

So the issue disappears when switching from validation_split to validation_data? That sounds like a bug. What is the version of keras? keras3:::keras_version()? If not 3.3.3, can you re-run keras3::install_keras() to get the latest?

Voronov-Andrey closed this as completed May 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Voronov-Andrey commented May 11, 2024

t-kalinowski commented May 11, 2024 •

edited

Loading

Voronov-Andrey commented May 12, 2024 •

edited by t-kalinowski

Loading

t-kalinowski commented May 13, 2024

Voronov-Andrey commented May 13, 2024

t-kalinowski commented May 13, 2024

Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Comments

Voronov-Andrey commented May 11, 2024

t-kalinowski commented May 11, 2024 • edited Loading

Voronov-Andrey commented May 12, 2024 • edited by t-kalinowski Loading

t-kalinowski commented May 13, 2024

Voronov-Andrey commented May 13, 2024

t-kalinowski commented May 13, 2024

t-kalinowski commented May 11, 2024 •

edited

Loading

Voronov-Andrey commented May 12, 2024 •

edited by t-kalinowski

Loading