Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444

Closed
Voronov-Andrey opened this issue May 11, 2024 · 5 comments

Comments

@Voronov-Andrey

I am using keras to develop an application for sentiment analysis of Russian text with deep learning models. The vast majority of guides and examples use text_tokenizer(), fit_text_tokenizer(), texts_to_sequences() and pad_sequences() to convert texts into numeric sequences, and to_categorical() to one-hot encode the class labels. However, the keras3 package for R does not contain the functions I listed. Loading keras and keras3 at the same time leads to errors, so I always have to restart the R session and reload the packages whenever I want to fit the model or use those functions again. As an alternative, I tried layer_text_vectorization(), but with the same data that I used with text_tokenizer() and the other functions, the model does not learn at all. Are there any solutions to this problem?

Here is the training plot when using text_tokenizer(), fit_text_tokenizer() and texts_to_sequences():
[Image: Plot_with_using_methods]

And here is the plot when using layer_text_vectorization() as the alternative:
[Image: Plot_with_using_layer]

@t-kalinowski
Member

t-kalinowski commented May 11, 2024

In {keras3}, much of the legacy text processing API has been removed. Almost everything is now possible with just layer_text_vectorization(). The layer can be used with the helpers get_vocabulary(), set_vocabulary() and adapt(). {keras3} also provides text_dataset_from_directory(), which may be useful.

Some helpful links:

If you're running into a specific issue with layer_text_vectorization(), please provide a minimal reproducible example and I can help figure out what's going wrong.
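
For readers migrating from the legacy functions, here is a minimal sketch of how the layer_text_vectorization() workflow mentioned above might stand in for text_tokenizer(), fit_text_tokenizer(), texts_to_sequences() and pad_sequences(). The toy corpus, the variable names and the sizes are made up for illustration, and the mapping to the old functions is approximate rather than exact.

```r
library(keras3)

# Toy corpus (hypothetical data, for illustration only)
texts <- c("the cat sat", "the dog barked loudly", "a cat and a dog")

# Roughly replaces text_tokenizer(): configure the vocabulary size and
# the fixed output length (the layer pads or truncates to this length)
vectorizer <- layer_text_vectorization(
  max_tokens = 1000L,
  output_mode = "int",
  output_sequence_length = 5L
)

# Roughly replaces fit_text_tokenizer(): learn the vocabulary from the texts
adapt(vectorizer, texts)

# Inspect the learned vocabulary; the first two entries are the
# padding token "" and the out-of-vocabulary token "[UNK]"
vocab <- get_vocabulary(vectorizer)

# Roughly replaces texts_to_sequences() + pad_sequences():
# returns an integer tensor with one row per text
sequences <- vectorizer(texts)
```

One difference worth keeping in mind: the layer pads with zeros at the end of each sequence, whereas pad_sequences() pads at the start by default. With long padded sequences this can matter for recurrent layers, so passing mask_zero = TRUE to layer_embedding() may be worth trying. That is a suggestion to test, not a confirmed diagnosis of the problem reported here.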

@Voronov-Andrey
Author

Voronov-Andrey commented May 12, 2024

Thanks for the answer. Below I have provided an MRE and the dput() output file of the data I use. If you need any additional information, I am always ready to provide it.

library(reticulate)
library(keras3)
library(dplyr)
library(stringr)
library(caret)

data_tidy <- dget("dput_output.txt", keep.source = FALSE)

#Dividing the data into training and test samples
split_data <- function(df) {
  set.seed(123)  
  
  trainIndex <- createDataPartition(df$Sentiment, p = 0.8, list = FALSE)
  train_data <- df[trainIndex, ]
  test_data <- df[-trainIndex, ]
  
  return(list(train_data = train_data, test_data = test_data))
}

splitted_data <- split_data(data_tidy)
train_data <- splitted_data$train_data
test_data <- splitted_data$test_data
train_data_x <- train_data$text_tidy
train_data_y <- to_categorical(train_data$Sentiment)
test_data_x <- test_data$text_tidy
test_data_y <- to_categorical(test_data$Sentiment)

#Setting model constants
max_features <- 50000L
embedding_dim <- 128L
sequence_length <- max(sapply(data_tidy$text_tidy, str_count, pattern = "\\w+")) + 1L

#Initialising vectorize layer (data is already standardized)
vectorize_layer <- layer_text_vectorization(
  standardize = NULL,
  max_tokens = max_features,
  output_mode = "int",
  output_sequence_length = sequence_length
)

#Adapting vectorize layer on text data
vectorize_layer %>% adapt(data_tidy$text_tidy)

#Building LSTM model
model <- keras_model_sequential() %>%
  vectorize_layer() %>%
  layer_embedding(input_dim = max_features, output_dim = embedding_dim) %>%
  layer_lstm(units = 128, dropout = 0.5) %>%
  layer_dense(3, activation = 'softmax')

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c('accuracy'),
  loss = 'categorical_crossentropy'
)

lstm_model <- model %>% fit(
  train_data_x, train_data_y,
  batch_size = 64,
  epochs = 5,
  verbose = 1,
  validation_split = 0.2
)

dput_output.txt

@t-kalinowski
Member

Hi, did you resolve the issue?

@Voronov-Andrey
Author

Hello, not exactly. I restructured the model with Bidirectional LSTM layers and split the data manually into training and validation samples so that I could use validation_data instead of validation_split. Under those conditions, the model fits successfully. Using a structure with regular LSTM layers leads to the same problem.

@t-kalinowski
Member

So the issue disappears when switching from validation_split to validation_data? That sounds like a bug. Which version of Keras are you using (you can check with keras3:::keras_version())? If it is not 3.3.3, can you re-run keras3::install_keras() to get the latest?
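
To make the two options discussed above concrete, here is a minimal sketch of a manual split for validation_data, using base R only. The data below are hypothetical stand-ins for train_data_x and train_data_y from the MRE. One documented difference is that validation_split takes the last fraction of the rows before any shuffling, so if the rows are ordered in some way, the held-out part may not be representative. Whether that explains the behaviour in this thread is an assumption, not something established here.

```r
# Hypothetical stand-ins for train_data_x / train_data_y from the MRE
set.seed(123)
n <- 100
x <- paste("text", seq_len(n))
y <- matrix(0, nrow = n, ncol = 3)
y[cbind(seq_len(n), sample(1:3, n, replace = TRUE))] <- 1  # one-hot labels

# Hold out 20% of the rows at random
# (validation_split = 0.2 would instead take the last 20% of the rows)
val_idx <- sample(seq_len(n), size = floor(0.2 * n))

x_train <- x[-val_idx]
y_train <- y[-val_idx, , drop = FALSE]
x_val   <- x[val_idx]
y_val   <- y[val_idx, , drop = FALSE]

# With a compiled model such as the one in the MRE, the call would then be:
# model %>% fit(
#   x_train, y_train,
#   batch_size = 64,
#   epochs = 5,
#   validation_data = list(x_val, y_val)
# )
```

The fit() call is left commented out so that the split itself can be run and checked without a Keras installation.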
