
Model architecture adjustment problem #9

Open
yihp opened this issue Jul 15, 2024 · 9 comments
@yihp

yihp commented Jul 15, 2024

Hi! Thanks for your contribution. It is an excellent piece of work!

Your idea is great, and I want to test my task. But my corpus language is Chinese, do I need to adjust the tokenizer and pre-trained bert? Will it work?

Thank you very much for your time and consideration. I eagerly look forward to your response.

@anicolson
Member

anicolson commented Jul 15, 2024 (via email)

I can send through the script used to create the tokenizer if you are interested. I am personally not sure of the changes that need to be made to it to switch languages.

@yihp
Author

yihp commented Jul 15, 2024

It would be great if you could send through the script used to create the tokenizer.
Another problem is that I can't find the vocab.txt file among the model config files. I want to confirm whether it contains token encodings for other languages.

@anicolson
Member

https://github.com/aehrc/cxrmate/blob/main/examples/tokenizer.ipynb

Sorry @yihp, I am not sure what you meant by the second part.
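On the vocabulary question, one way to check coverage yourself is to count how many entries in the released `tokenizer.json` contain CJK characters. This is a rough sketch, assuming the file follows the usual Hugging Face WordPiece/BPE layout (`model.vocab` as a token-to-id dict); the path is hypothetical:

```python
import json

def count_cjk_tokens(tokenizer_path):
    """Count vocabulary entries in a Hugging Face tokenizer.json that
    contain CJK ideographs (a rough coverage check for Chinese text).
    Assumes a WordPiece/BPE layout where model.vocab is a token -> id dict."""
    with open(tokenizer_path, encoding="utf-8") as f:
        vocab = json.load(f)["model"]["vocab"]
    # U+4E00..U+9FFF covers the main CJK Unified Ideographs block
    cjk = [t for t in vocab if any("\u4e00" <= ch <= "\u9fff" for ch in t)]
    return len(vocab), len(cjk)

# Example usage (path is hypothetical):
# total, cjk = count_cjk_tokens("tokenizer.json")
```

If the CJK count comes back zero, the tokenizer would fragment Chinese text into unknown tokens, which is why retraining it on your own corpus is the way to go.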

@yihp
Author

yihp commented Jul 15, 2024

Thanks! I'm very interested: which pre-trained model did you use for the decoder checkpoint in the model architecture? Did you initialize it yourself?


@anicolson
Member

anicolson commented Jul 15, 2024 (via email)

No worries. Actually, no pretraining checkpoint was used; the decoder was randomly initialised.

@yihp
Author

yihp commented Jul 15, 2024

OK, that is great! I found that cxrmate/model/cxrmate-tf/tokenizer.json does not have the vocabulary and encodings of the language I use, so can I expand it and further train it on my language task? What should I pay attention to?

@anicolson
Member

Yeah, it's all in English unfortunately. So just use that tokenizer notebook: you need to create a list of the text from your training set to feed to the tokenizer (as shown in the notebook). You could modify it to use the SentencePiece tokenizer, as it may be more appropriate for some languages: https://huggingface.co/docs/transformers/en/tokenizer_summary#sentencepiece. The other thing you may want to consider is how many sections of the report you want to handle. Currently it is set up to handle two sections, but that can be changed.
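For the SentencePiece-style route, the retraining step could look something like the sketch below, using the `tokenizers` library's Unigram model trained from an iterator over your own report text. This is not the repo's exact notebook recipe; the example sentences, vocabulary size, and special tokens are assumptions you'd adapt to whatever the model expects:

```python
# Hedged sketch: training a Unigram (SentencePiece-style) tokenizer on your
# own training-set text, as an alternative to the notebook's English recipe.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stand-in for the list of report sections from your training set
corpus = [
    "胸部X线片显示双肺纹理清晰。",
    "心影大小正常，未见明显异常。",
]

tokenizer = Tokenizer(models.Unigram())
# Metaspace treats text as a raw stream, so no whitespace between words
# is assumed -- useful for Chinese
tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()

trainer = trainers.UnigramTrainer(
    vocab_size=1000,  # scale up for a real corpus
    special_tokens=["[BOS]", "[EOS]", "[PAD]", "[UNK]"],  # match your model
    unk_token="[UNK]",
)
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")
```

The main things to watch are that the special tokens line up with the decoder's expected BOS/EOS/PAD ids, and that the vocabulary size matches the embedding layer you initialise.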

@yihp
Author

yihp commented Jul 16, 2024

OK! Do the two sections refer to the findings and impression sections of the report?

@anicolson
Copy link
Member

yes :)
