
Lower performance in alignment compared to another preprocessing script. #5

Open
haorannlp opened this issue Jun 28, 2021 · 0 comments

Comments

@haorannlp

Hi Sanxing, thank you for sharing this script!

I ran your preprocess.py (only the empty-line cleaning; I did not run the whole prepare.sh) and then used fast_align to learn an alignment model on the parallel corpus.
I found that the perplexity of the alignments produced by this model is higher than what I get when the parallel corpus is preprocessed by another script, wmt.py.
I guess this is because wmt.py merges the blank lines.
Could you possibly add this blank-line merging step to your script in the future? Thanks a lot!
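For reference, here is a minimal sketch (not from either repository) of one way the blank lines could be handled before building fast_align's input. It assumes the standard `source ||| target` one-pair-per-line format that fast_align reads, and it simply drops any pair in which either side is empty; the file names and the drop-versus-merge choice are assumptions, since the exact behavior of wmt.py is not shown here.

```python
# Sketch only: pair up source/target lines and skip pairs where either side is
# blank, so fast_align never receives an empty sentence. File names are placeholders.

def write_fast_align_input(src_path, tgt_path, out_path):
    with open(src_path, encoding="utf-8") as fs, \
         open(tgt_path, encoding="utf-8") as ft, \
         open(out_path, "w", encoding="utf-8") as fo:
        for src, tgt in zip(fs, ft):
            src, tgt = src.strip(), tgt.strip()
            if not src or not tgt:
                continue  # drop the pair entirely if either side is empty
            fo.write(f"{src} ||| {tgt}\n")

if __name__ == "__main__":
    write_fast_align_input("train.de", "train.en", "train.de-en.fast_align")
```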
