I used the sentencepiece library (I built the static library and configured it in Visual Studio), with the source.spm file as the model. My program and its output are:
#include <iostream>
#include <string>
#include <vector>
#include <sentencepiece_processor.h>

int main()
{
    sentencepiece::SentencePieceProcessor processor;
    const auto status = processor.Load("D:\\SentencePiece\\source.spm");
    if (!status.ok()) {
        std::cerr << status.ToString() << std::endl;
        return 1;  // the processor is unusable without a loaded model
    }

    // Encode the sentence into subword pieces (strings).
    std::vector<std::string> pieces;
    processor.Encode("Hola mi amor", &pieces);
    for (const std::string& token : pieces) {
        std::cout << token << std::endl;
    }

    // Encode the same sentence into the ids stored in source.spm.
    std::vector<int> ids;
    processor.Encode("Hola mi amor", &ids);
    for (const int id : ids) {
        std::cout << id << std::endl;
    }
    return 0;
}
Output
ÔûüHola
Ôûümi
Ôûüamor
868
64
866
The text is apparently tokenized correctly. However, my problem is with the ids: my Python program delivers the correct ones.
from transformers import AutoTokenizer, MarianMTModel
src = "es" # source language
trg = "en" # target language
model_name = f"Helsinki-NLP/opus-mt-{src}-{trg}"
model = MarianMTModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sample_text = "Hola mi amor"
batch = tokenizer([sample_text], return_tensors="pt")
print(batch)
Hi,
My intention is to develop a C++ project in Visual Studio (Windows) that runs the MarianMT model (exported to ONNX) to translate from Spanish to English: https://huggingface.co/Helsinki-NLP/opus-mt-es-en. For this reason, I want to develop a C++ tokenizer based on sentencepiece (https://github.com/google/sentencepiece).
Output of the Python program
{'input_ids': tensor([[2119, 155, 1821, 0]]), 'attention_mask': tensor([[1, 1, 1, 1]])}
I'm probably misinterpreting something. Could you please give me a suggestion on how to proceed?
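One possible explanation, stated as an assumption rather than a confirmed diagnosis: the Hugging Face Marian tokenizer seems to use source.spm only to split text into pieces, and then looks each piece up in the separate vocab.json shipped with the model, appending an end-of-sequence id at the end (the trailing 0 in the Python output). If that is right, the ids stored inside source.spm are simply a different numbering. The sketch below shows that lookup step in C++. `PiecesToModelIds` and `ExampleVocab` are hypothetical names, and the tiny table is inferred from the two outputs above, not read from the real vocab.json.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: converts SentencePiece pieces to model ids through a
// piece -> id table (as one would load from vocab.json), then appends EOS.
std::vector<int> PiecesToModelIds(
    const std::vector<std::string>& pieces,
    const std::unordered_map<std::string, int>& vocab,
    int unk_id,   // id used for pieces missing from the table (assumption)
    int eos_id) { // id appended after the last piece
  std::vector<int> ids;
  ids.reserve(pieces.size() + 1);
  for (const std::string& piece : pieces) {
    const auto it = vocab.find(piece);
    ids.push_back(it != vocab.end() ? it->second : unk_id);
  }
  ids.push_back(eos_id);
  return ids;
}

// Illustrative table only: the four entries are inferred by pairing the
// pieces printed by the C++ program with the ids printed by Python.
std::unordered_map<std::string, int> ExampleVocab() {
  // UTF-8 bytes of the SentencePiece word-boundary marker U+2581.
  const std::string marker = "\xE2\x96\x81";
  return {
      {"</s>", 0},
      {marker + "Hola", 2119},
      {marker + "mi", 155},
      {marker + "amor", 1821},
  };
}
```

In a real program the table would be filled by parsing vocab.json with a JSON library of your choice, and the `pieces` vector returned by `processor.Encode` would be passed in directly.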