
Add offset correction for split filter #149

Merged: 13 commits into develop on Nov 11, 2024
Conversation

mh-northlander (Collaborator) commented Nov 1, 2024

Fixes #148.

Fix the offsets for the case where the length of a morpheme differs from its actual length in the input text.

Since the correctOffset method is only available on CharFilter and Tokenizer, we need to make the offset-mapping information available to the sudachi_split filter somehow. I chose to put it in MorphemeAttribute, because it is morpheme-related data.
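The idea can be sketched as follows. This is a hypothetical, simplified model (the class and record names here are illustrative, not the plugin's actual API): each morpheme carries the boundary map recorded upstream, so a split filter can translate sub-token boundaries back to input-text offsets without being a CharFilter or Tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of offset correction in a split filter.
// boundaries[i] = input-text offset corresponding to position i of the
// morpheme's surface as seen by the tokenizer (length = surface.length() + 1).
public class SplitOffsetSketch {
    record Morpheme(String surface, int[] boundaries) {}
    record SubToken(String text, int start, int end) {}

    // Split a morpheme at a surface position, correcting both sub-token
    // offsets through the stored boundary map.
    static List<SubToken> split(Morpheme m, int splitAt) {
        List<SubToken> out = new ArrayList<>();
        out.add(new SubToken(m.surface().substring(0, splitAt),
                m.boundaries()[0], m.boundaries()[splitAt]));
        out.add(new SubToken(m.surface().substring(splitAt),
                m.boundaries()[splitAt], m.boundaries()[m.surface().length()]));
        return out;
    }

    public static void main(String[] args) {
        // Here the char filter did not change any lengths, so the boundary
        // map is simply the identity.
        Morpheme m = new Morpheme("外国人参政権", new int[] {0, 1, 2, 3, 4, 5, 6});
        for (SubToken t : split(m, 3)) {
            System.out.println(t.text() + " [" + t.start() + "," + t.end() + "]");
        }
    }
}
```

When a preceding char filter changed the text length, the boundary map is no longer the identity and the corrected offsets diverge from the surface positions; that is the case this PR fixes.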

With combined characters, extended mode behaves badly (e.g. "㍍㍉" is extended to "㍍", "㍉", not "メ", "ー", "ト", "ル", "ミ", "リ"). Using the normalized form would be more natural in this case, but we cannot calculate offsets for it (the mapping between the surface and the normalized form is missing). So I chose to keep using the surface, i.e. the text before normalization.
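To illustrate why offsets cannot simply be read off the normalized form: NFKC normalization (which icu_normalizer applies by default) expands each squared character to several characters, so positions in the normalized text no longer line up with the two-character surface. A minimal demonstration using only the JDK's java.text.Normalizer:

```java
import java.text.Normalizer;

public class NfkcExpansion {
    public static void main(String[] args) {
        String surface = "㍍㍉"; // 2 chars in the input text
        String normalized = Normalizer.normalize(surface, Normalizer.Form.NFKC);
        System.out.println(normalized);          // メートルミリ
        System.out.println(surface.length());    // 2
        System.out.println(normalized.length()); // 6
    }
}
```

Six normalized characters map onto two surface characters, and the per-character correspondence inside each expansion is not recorded anywhere the split filter can see.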

Also, change the dictionary type used for tests from small to core, in order to test "㍿", which is not in the small dictionary.

Note that, due to the correctOffset behavior of icu_normalizer, the offsets for the subsplits of ㍿ (株式 + 会社) are now [0,0] + [0,1].
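This behavior can be modeled as follows. The sketch below is a hypothetical simplification, not the ICU implementation: ㍿ at input offset 0 expands to the four normalized characters 株式会社, and offset correction maps every boundary strictly inside the expansion back to the start of the original character, with only the trailing boundary mapping past it.

```java
// Hypothetical model of offset correction across a 1-char -> 4-char expansion.
public class IcuCorrect {
    // ㍿ occupies input offsets [0,1); its expansion occupies normalized [0,4).
    static int correctOffset(int normalizedOffset) {
        return normalizedOffset < 4 ? 0 : 1;
    }

    public static void main(String[] args) {
        // 株式 covers normalized [0,2), 会社 covers normalized [2,4):
        System.out.printf("[%d,%d] + [%d,%d]%n",
                correctOffset(0), correctOffset(2),  // -> [0,0]
                correctOffset(2), correctOffset(4)); // -> [0,1]
    }
}
```

Both sub-tokens therefore start at input offset 0, and only the last one ends past the ㍿ character, which yields the [0,0] + [0,1] offsets mentioned above.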

mh-northlander merged commit ee664ba into develop on Nov 11, 2024
31 checks passed
mh-northlander deleted the fix/148-correct-offset branch on November 11, 2024, 00:35
Successfully merging this pull request may close these issues.

end_offset of ligature character is wrong when using icu_normalizer (#148)