Lack of methods text_tokenizer(), fit_text_tokenizer() and texts_to_sequences() #1444
In {keras3}, much of the legacy text processing API has been removed. Almost everything is now possible with just layer_text_vectorization().

If you're running into a specific issue with it, please provide a minimal reproducible example.
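For reference, here is a minimal sketch of that replacement workflow (the sample texts and parameter values are made up for illustration):

```r
library(keras3)

texts <- c("this movie was great", "terrible film", "not bad at all")

# Build the vectorization layer and adapt it on the raw text
vec <- layer_text_vectorization(
  max_tokens = 1000,
  output_mode = "int",
  output_sequence_length = 10
)
adapt(vec, texts)

# Calling the layer replaces texts_to_sequences() + pad_sequences():
# strings go in, padded integer sequences come out
vec(texts)

# get_vocabulary() plays the role of the old tokenizer's word index
head(get_vocabulary(vec))
```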
Thanks for the answer; below I have provided the MRE and the dput() output file of the data I use. If you need any additional information, I am always ready to provide it.

```r
library(reticulate)
library(keras3)
library(dplyr)
library(stringr)
library(caret)

data_tidy <- dget("dput_output.txt", keep.source = FALSE)

# Dividing the data into training and test samples
split_data <- function(df) {
  set.seed(123)
  trainIndex <- createDataPartition(df$Sentiment, p = 0.8, list = FALSE)
  train_data <- df[trainIndex, ]
  test_data <- df[-trainIndex, ]
  return(list(train_data = train_data, test_data = test_data))
}

splitted_data <- split_data(data_tidy)
train_data <- splitted_data$train_data
test_data <- splitted_data$test_data

train_data_x <- train_data$text_tidy
train_data_y <- to_categorical(train_data$Sentiment)
test_data_x <- test_data$text_tidy
test_data_y <- to_categorical(test_data$Sentiment)

# Setting model constants
max_features <- 50000L
embedding_dim <- 128L
sequence_length <- max(sapply(data_tidy$text_tidy, str_count, pattern = "\\w+")) + 1L

# Initialising the vectorization layer (data is already standardized)
vectorize_layer <- layer_text_vectorization(
  standardize = NULL,
  max_tokens = max_features,
  output_mode = "int",
  output_sequence_length = sequence_length
)

# Adapting the vectorization layer on the text data
vectorize_layer %>% adapt(data_tidy$text_tidy)

# Building the LSTM model
model <- keras_model_sequential() %>%
  vectorize_layer() %>%
  layer_embedding(input_dim = max_features, output_dim = embedding_dim) %>%
  layer_lstm(units = 128, dropout = 0.5) %>%
  layer_dense(3, activation = 'softmax')

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  metrics = c('accuracy'),
  loss = 'categorical_crossentropy'
)

lstm_model <- model %>% fit(
  train_data_x, train_data_y,
  batch_size = 64,
  epochs = 5,
  verbose = 1,
  validation_split = 0.2
)
```
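One hedged debugging idea (an assumption on my part, not a confirmed fix): before training, check what the adapted layer actually emits for the training texts, since sequences dominated by padding or [UNK] tokens would explain a model that cannot learn.

```r
# How many tokens were actually learned vs. the max_tokens requested
length(get_vocabulary(vectorize_layer))

# Inspect the integer sequences for a few training texts;
# rows that are mostly 0 (padding) or 1 ([UNK]) are a red flag
vectorize_layer(head(train_data_x, 3))
```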
Hi, did you resolve the issue?
Hello, not exactly. I just restructured the model with bidirectional LSTM layers and split the data into training and validation samples manually, so that I could pass validation_data instead of using validation_split; a sketch of that setup is below. Under those conditions, fitting the model succeeds. Using a structure with regular LSTM layers leads to the same problem.
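A rough sketch of that restructuring, reusing vectorize_layer, the constants, and the data objects from the MRE above (the 80/20 split is an assumption):

```r
# Manual train/validation split so validation_data can be passed
# to fit() instead of validation_split (assumed 80/20 split)
val_idx <- sample(seq_along(train_data_x), size = floor(0.2 * length(train_data_x)))
val_x <- train_data_x[val_idx];  val_y <- train_data_y[val_idx, ]
trn_x <- train_data_x[-val_idx]; trn_y <- train_data_y[-val_idx, ]

# Same architecture, but with the LSTM wrapped in bidirectional()
model <- keras_model_sequential() %>%
  vectorize_layer() %>%
  layer_embedding(input_dim = max_features, output_dim = embedding_dim) %>%
  bidirectional(layer_lstm(units = 128, dropout = 0.5)) %>%
  layer_dense(3, activation = 'softmax')

model %>% compile(
  optimizer = optimizer_adam(learning_rate = 0.001),
  loss = 'categorical_crossentropy',
  metrics = 'accuracy'
)

model %>% fit(
  trn_x, trn_y,
  batch_size = 64,
  epochs = 5,
  validation_data = list(val_x, val_y)
)
```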
So the issue disappears when switching from regular layer_lstm() layers to bidirectional ones?
I use keras to develop an application for sentiment analysis of Russian-language text using deep learning models. The vast majority of guides and examples use the methods text_tokenizer(), fit_text_tokenizer(), texts_to_sequences() and pad_sequences() to convert texts into numeric sequences, as well as to_categorical() to apply one-hot encoding to the class labels. But I ran into the problem that the keras3 package for R does not contain the methods I listed. Loading keras and keras3 at the same time leads to errors in the code, and I always need to restart the R session and reload the packages before I can fit a model or use those methods again. As an alternative, I tried layer_text_vectorization(), but when using it with the same data as with text_tokenizer() and the other methods, the model does not learn at all. Are there any solutions to this problem?
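For context, the legacy {keras} workflow that those guides follow looks roughly like this (a sketch; the num_words value is a placeholder):

```r
library(keras)  # the legacy package, not {keras3}

tokenizer <- text_tokenizer(num_words = 50000)
tokenizer %>% fit_text_tokenizer(data_tidy$text_tidy)

# Texts -> lists of integer indices -> padded integer matrix
sequences <- texts_to_sequences(tokenizer, data_tidy$text_tidy)
x <- pad_sequences(sequences, maxlen = sequence_length)

# One-hot encode the class labels
y <- to_categorical(data_tidy$Sentiment)
```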
Here is the model training plot when using text_tokenizer(), fit_text_tokenizer() and texts_to_sequences():

And here is the one when using layer_text_vectorization() as the alternative: