Model architecture adjustment problem #9
Comments
I can send through the script used to create the tokenizer if you are interested. I am personally not sure of the changes that need to be made to it to switch languages.
From: yihp ***@***.***>
Sent: Monday, July 15, 2024 7:17:54 PM
To: aehrc/cxrmate ***@***.***>
Cc: Subscribed ***@***.***>
Subject: [aehrc/cxrmate] Model architecture adjustment problem (Issue #9)
Hi! Thanks for your contribution. It is an excellent piece of work!
Your idea is great, and I want to test it on my own task. However, my corpus is in Chinese. Do I need to adjust the tokenizer and the pre-trained BERT? Will it work?
Thank you very much for your time and consideration. I eagerly look forward to your response.
|
It would be great if you could send through the script used to create the tokenizer. |
The notebook is here: https://github.com/aehrc/cxrmate/blob/main/examples/tokenizer.ipynb. Sorry @yihp, I am not sure what you meant by the second part. |
Thanks! I'm very interested. Which pre-trained model did you use for the decoder checkpoint in the model architecture? Or did you initialize it yourself?
|
No worries. Actually, no pretrained checkpoint was used; the decoder was randomly initialised.
From: yihp ***@***.***>
Sent: Monday, July 15, 2024 8:10:18 PM
To: aehrc/cxrmate ***@***.***>
Cc: Nicolson, Aaron (H&B, Herston) ***@***.***>; Comment ***@***.***>
Subject: Re: [aehrc/cxrmate] Model architecture adjustment problem (Issue #9)
|
OK, that is great! I found that cxrmate/model/cxrmate-tf/tokenizer.json does not contain the vocabulary or encoding for the language I use. Can I expand it and further train it on my language task? What should I pay attention to? |
Yeah, it's all in English, unfortunately. Just use that tokenizer notebook: you need to create a list of the text from your training set to feed to the tokenizer (as shown in the notebook). You could modify it to use the SentencePiece tokenizer, as it may be more appropriate for some languages: https://huggingface.co/docs/transformers/en/tokenizer_summary#sentencepiece. The other thing you may want to consider is how many sections of the report you want to handle. Currently it is set up to handle two sections, but that can be changed. |
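For anyone following along, here is a rough sketch of the "create a list of the text from your training set" step, followed by one possible way to train a Unigram (SentencePiece-style) tokenizer with the Hugging Face `tokenizers` library. This is not the code from the notebook. The record field names (`findings`, `impression`), the helper names, the vocabulary size, and the special tokens are all placeholders assumed for illustration, so check the notebook for the values the repository actually uses.

```python
from typing import Dict, Iterable, Iterator, List

# Hypothetical field names; change these to match your own dataset.
SECTION_KEYS = ("findings", "impression")


def build_corpus(records: Iterable[Dict[str, str]]) -> List[str]:
    """Collect every non-empty report section into a flat list of strings."""
    corpus = []
    for record in records:
        for key in SECTION_KEYS:
            text = (record.get(key) or "").strip()
            if text:
                corpus.append(text)
    return corpus


def batch_iterator(corpus: List[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Yield the corpus in batches so it can be streamed to a trainer."""
    for start in range(0, len(corpus), batch_size):
        yield corpus[start:start + batch_size]


def train_unigram_tokenizer(corpus: List[str], vocab_size: int = 8000):
    """Train a Unigram tokenizer on the corpus (requires `pip install tokenizers`)."""
    # Imported lazily so the corpus helpers above work without the library.
    from tokenizers import Tokenizer
    from tokenizers.models import Unigram
    from tokenizers.trainers import UnigramTrainer

    tokenizer = Tokenizer(Unigram())
    trainer = UnigramTrainer(
        vocab_size=vocab_size,  # assumed value, tune for your corpus
        special_tokens=["[PAD]", "[UNK]", "[BOS]", "[SEP]", "[EOS]"],  # placeholders
        unk_token="[UNK]",
    )
    tokenizer.train_from_iterator(
        batch_iterator(corpus), trainer=trainer, length=len(corpus)
    )
    return tokenizer


# Toy records standing in for a real training set.
records = [
    {"findings": "双肺纹理清晰。", "impression": "心肺未见异常。"},
    {"findings": "", "impression": "右下肺炎症。"},
]
corpus = build_corpus(records)
```

With a real training set you would call `train_unigram_tokenizer(corpus)` and then `tokenizer.save("tokenizer.json")`. The toy corpus above is far too small to produce a useful vocabulary.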
OK! Do the two sections refer to the impression and findings sections of the report? |
yes :) |