weird tokenization of nunc #34

Open
balmas opened this issue Feb 5, 2015 · 2 comments

Comments

@balmas
Contributor

balmas commented Feb 5, 2015

Tokenize this text:
"+Noster+poēta+,+nisi+cīvis+Rōmānus+esset+,+ā+populō+nunc+cīvitāte+dōnārētur+.+"
(e.g. http://services.perseids.org/llt/segtok?xml=false&shifting=false&newline_boundary=1&inline=true&text=%22+Noster+po%C4%93ta+,+nisi+c%C4%ABvis+R%C5%8Dm%C4%81nus+esset+,+%C4%81+popul%C5%8D+nunc+c%C4%ABvit%C4%81te+d%C5%8Dn%C4%81r%C4%93tur+.+%22&splitting=true)
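
For anyone who wants to reproduce this from a script, here is a minimal sketch that rebuilds the request URL. The endpoint and parameter names are copied from the link above; `build_segtok_url` is a hypothetical helper name, and the sketch only constructs the URL without sending anything.

```python
from urllib.parse import urlencode

# Endpoint and query parameters are taken from the URL in the report.
BASE_URL = "http://services.perseids.org/llt/segtok"


def build_segtok_url(text: str) -> str:
    """Return the segtok URL for `text` (hypothetical helper, not part of LLT)."""
    params = {
        "xml": "false",
        "shifting": "false",
        "newline_boundary": "1",
        "inline": "true",
        "text": text,
        "splitting": "true",
    }
    # urlencode percent-encodes the UTF-8 bytes of the macron vowels
    # and turns spaces into '+', as in the reported URL.
    return BASE_URL + "?" + urlencode(params)


sentence = (
    '" Noster poēta , nisi cīvis Rōmānus esset , '
    'ā populō nunc cīvitāte dōnārētur . "'
)
url = build_segtok_url(sentence)
print(url)
```

Commas come out as `%2C` here rather than as literal commas, which should be equivalent for the server.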

See that nunc gets tokenized as:

<w s_n="1" n="12">nun</w><pc s_n="1" n="13">-</pc><pc s_n="1" n="14">-</pc><pc s_n="1" n="15">-</pc><w s_n="1" n="16">-c</w>
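
To make the split easier to see, the fragment can be listed token by token. This is a rough sketch: the fragment is copied from above and wrapped in a throwaway `<s>` element so it parses as one document, and reading `w` as a word and `pc` as punctuation is my assumption based on the element names.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the report. It has no single root element,
# so it is wrapped in a temporary <s> element before parsing.
fragment = (
    '<w s_n="1" n="12">nun</w><pc s_n="1" n="13">-</pc>'
    '<pc s_n="1" n="14">-</pc><pc s_n="1" n="15">-</pc>'
    '<w s_n="1" n="16">-c</w>'
)
root = ET.fromstring("<s>" + fragment + "</s>")

# Collect (element name, token id, token text) for every child element.
tokens = [(el.tag, el.get("n"), el.text) for el in root]
for tag, n, text in tokens:
    print(tag, n, text)

# The single input word "nunc" ends up as two <w> tokens.
words = [text for tag, _, text in tokens if tag == "w"]
print(words)
```

So one input word occupies five token ids (12 to 16) where a single `w` token would be expected.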

@lichtr
Member

lichtr commented Feb 9, 2015

I had a quick look at this. I could not solve the problem, but I think it has something to do with the preceding nisi, which is not split. I guess the problem lies somewhere in the Worker class...

@balmas
Contributor Author

balmas commented Apr 16, 2015

Another example, even weirder:

Nisi pācem sine morā ab hostibus petīverimus, neque urbs neque domus ūlla stāre poterit.

This crashes the server, but if I take the long i out of petīverimus it doesn't:

Nisi pācem sine morā ab hostibus petiverimus, neque urbs neque domus ūlla stāre poterit.
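
If it helps with narrowing this down, macron-free comparison inputs can be generated instead of typed by hand. A minimal sketch using Unicode normalization; `strip_macrons` is a hypothetical helper and not part of LLT.

```python
import unicodedata

COMBINING_MACRON = "\u0304"


def strip_macrons(text: str) -> str:
    """Remove macrons from vowels, leaving all other characters unchanged."""
    # NFD splits a precomposed long vowel into base letter + combining macron.
    decomposed = unicodedata.normalize("NFD", text)
    without = "".join(ch for ch in decomposed if ch != COMBINING_MACRON)
    # Recompose whatever other diacritics remain.
    return unicodedata.normalize("NFC", without)


print(strip_macrons("petīverimus"))  # prints "petiverimus"
```

Applying it to one word at a time would show which long vowel is needed to trigger the crash.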
