
Model architecture adjustment problem #9

Open
yihp opened this issue Jul 15, 2024 · 9 comments
@yihp

yihp commented Jul 15, 2024

Hi! Thanks for your contribution. It is an excellent piece of work!

Your idea is great, and I want to test my task. But my corpus language is Chinese, do I need to adjust the tokenizer and pre-trained bert? Will it work?

Thank you very much for your time and consideration. I eagerly look forward to your response.

@anicolson
Member

anicolson commented Jul 15, 2024 (via email)

I can send through the script used to create the tokenizer if you are interested. I am personally not sure of the changes that need to be made to it to switch languages.

@yihp
Author

yihp commented Jul 15, 2024

It would be great if you could send through the script used to create the tokenizer.
Another problem is that I can't find the vocab.txt file among the model config files. I want to confirm whether it contains token encodings for other languages.

@anicolson
Member

https://github.com/aehrc/cxrmate/blob/main/examples/tokenizer.ipynb

Sorry @yihp, I am not sure what you meant by the second part.
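On the vocabulary question, one way to check coverage yourself is to count how many entries in the released `tokenizer.json` contain CJK characters. This is a rough sketch, assuming the file follows the usual Hugging Face WordPiece/BPE layout (`model.vocab` as a token-to-id dict); the path is hypothetical:

```python
import json

def count_cjk_tokens(tokenizer_path):
    """Count vocabulary entries in a Hugging Face tokenizer.json that
    contain CJK ideographs (a rough coverage check for Chinese text).
    Assumes a WordPiece/BPE layout where model.vocab is a token -> id dict."""
    with open(tokenizer_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]
    # U+4E00..U+9FFF covers the main CJK Unified Ideographs block
    cjk = [t for t in vocab if any("\u4e00" <= ch <= "\u9fff" for ch in t)]
    return len(vocab), len(cjk)

# Example usage (path is hypothetical):
# total, cjk = count_cjk_tokens("tokenizer.json")
```

If the CJK count comes back zero, the tokenizer would fragment Chinese text into unknown tokens, which is why retraining it on your own corpus is the way to go.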

@yihp
Author

yihp commented Jul 15, 2024

Thanks! I'm very interested: which pre-trained model did you use for the decoder checkpoint in the model architecture? Did you initialize it yourself?


@anicolson
Member

anicolson commented Jul 15, 2024 (via email)

No worries. Actually, no pretraining checkpoint was used; the decoder was randomly initialised.

@yihp
Author

yihp commented Jul 15, 2024

OK, that is great! I found that cxrmate/model/cxrmate-tf/tokenizer.json does not have the vocabulary and encodings of the language I use, so can I expand it and further train it on my language task? What should I pay attention to?

@anicolson
Member

Yeah, it's all in English unfortunately. So just use that tokenizer notebook: you need to create a list of the text from your training set to feed to the tokenizer (as shown in the notebook). You could modify it to use the SentencePiece tokenizer, as it may be more appropriate for some languages: https://huggingface.co/docs/transformers/en/tokenizer_summary#sentencepiece. The other thing you may want to consider is how many sections of the report you want to handle. Currently it is set up to handle two sections, but that can be changed.
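For the SentencePiece-style route, the retraining step could look something like the sketch below, using the `tokenizers` library's Unigram model trained from an iterator over your own report text. This is not the repo's exact notebook recipe; the example sentences, vocabulary size, and special tokens are assumptions you'd adapt to whatever the model expects:

```python
# Hedged sketch: training a Unigram (SentencePiece-style) tokenizer on your
# own training-set text, as an alternative to the notebook's English recipe.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in for the list of report sections from your training set
corpus = [
    "胸部X线片显示双肺纹理清晰。",
    "心影大小正常，未见明显异常。",
]

tokenizer = Tokenizer(models.Unigram())
# Metaspace treats text as a raw stream, so no whitespace between words
# is assumed -- useful for Chinese
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=1000,  # scale up for a real corpus
    special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"],  # match your model
    unk_token="[UNK]",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")
```

The main things to watch are that the special tokens line up with the decoder's expected BOS/EOS/PAD ids, and that the vocabulary size matches the embedding layer you initialise.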

@yihp
Author

yihp commented Jul 16, 2024

OK! Do the two sections refer to the findings and impression sections of the report?

@anicolson
Copy link
Member

yes :)
