
Add offset correction for split filter #149

Merged: 13 commits into develop on Nov 11, 2024
Conversation

mh-northlander (Collaborator) commented Nov 1, 2024

Fixes #148.

Fix the offsets for the case where the length of a morpheme differs from its actual length in the input text.

Since the correctOffset method is only available on CharFilter and Tokenizer, we need to make the offset-mapping information available to the sudachi_split filter somehow. I chose to put it in MorphemeAttribute, because it is morpheme-related data.
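The idea can be sketched as follows. This is a hypothetical, simplified model (the class and record names here are illustrative, not the plugin's actual API): each morpheme carries the boundary map recorded upstream, so a split filter can translate sub-token boundaries back to input-text offsets without being a CharFilter or Tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of offset correction in a split filter.
// boundaries[i] = input-text offset corresponding to position i of the
// morpheme's surface as seen by the tokenizer (length = surface.length() + 1).
public class SplitOffsetSketch {
    record Morpheme(String surface, int[] boundaries) {}
    record SubToken(String text, int start, int end) {}

    // Split a morpheme at a surface position, correcting both sub-token
    // offsets through the stored boundary map.
    static List<SubToken> split(Morpheme m, int splitAt) {
        List<SubToken> out = new ArrayList<>();
        out.add(new SubToken(m.surface().substring(0, splitAt),
                m.boundaries()[0], m.boundaries()[splitAt]));
        out.add(new SubToken(m.surface().substring(splitAt),
                m.boundaries()[splitAt], m.boundaries()[m.surface().length()]));
        return out;
    }

    public static void main(String[] args) {
        // Here the char filter did not change any lengths, so the boundary
        // map is simply the identity.
        Morpheme m = new Morpheme("外国人参政権", new int[] {0, 1, 2, 3, 4, 5, 6});
        for (SubToken t : split(m, 3)) {
            System.out.println(t.text() + " [" + t.start() + "," + t.end() + "]");
        }
    }
}
```

When a preceding char filter changed the text length, the boundary map is no longer the identity and the corrected offsets diverge from the surface positions; that is the case this PR fixes.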

With combined characters, extended mode behaves badly (e.g. "㍍㍉" is extended to "㍍", "㍉", not "メ", "ー", "ト", "ル", "ミ", "リ"). Using the normalized form would be more natural in this case, but we cannot calculate offsets for it (the mapping between the surface and the normalized form is missing). So I chose to keep using the surface, i.e. the text before normalization.
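To illustrate why offsets cannot simply be read off the normalized form: NFKC normalization (which icu_normalizer applies by default) expands each squared character to several characters, so positions in the normalized text no longer line up with the two-character surface. A minimal demonstration using only the JDK's java.text.Normalizer:

```java
import java.text.Normalizer;

public class NfkcExpansion {
    public static void main(String[] args) {
        String surface = "㍍㍉"; // 2 chars in the input text
        String normalized = Normalizer.normalize(surface, Normalizer.Form.NFKC);
        System.out.println(normalized);          // メートルミリ
        System.out.println(surface.length());    // 2
        System.out.println(normalized.length()); // 6
    }
}
```

Six normalized characters map onto two surface characters, and the per-character correspondence inside each expansion is not recorded anywhere the split filter can see.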

Also, change the dictionary type used for tests from small to core, in order to test "㍿", which is not in the small dictionary.

Note that, due to the correctOffset behavior of icu_normalizer, the offsets for the subsplits of ㍿ (株式 + 会社) are now [0,0] + [0,1].
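This behavior can be modeled as follows. The sketch below is a hypothetical simplification, not the ICU implementation: ㍿ at input offset 0 expands to the four normalized characters 株式会社, and offset correction maps every boundary strictly inside the expansion back to the start of the original character, with only the trailing boundary mapping past it.

```java
// Hypothetical model of offset correction across a 1-char -> 4-char expansion.
public class IcuCorrect {
    // ㍿ occupies input offsets [0,1); its expansion occupies normalized [0,4).
    static int correctOffset(int normalizedOffset) {
        return normalizedOffset < 4 ? 0 : 1;
    }

    public static void main(String[] args) {
        // 株式 covers normalized [0,2), 会社 covers normalized [2,4):
        System.out.printf("[%d,%d] + [%d,%d]%n",
                correctOffset(0), correctOffset(2),  // -> [0,0]
                correctOffset(2), correctOffset(4)); // -> [0,1]
    }
}
```

Both sub-tokens therefore start at input offset 0, and only the last one ends past the ㍿ character, which yields the [0,0] + [0,1] offsets mentioned above.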

mh-northlander merged commit ee664ba into develop on Nov 11, 2024
31 checks passed
mh-northlander deleted the fix/148-correct-offset branch on November 11, 2024, 00:35
Successfully merging this pull request may close these issues.

end_offset of ligature character is wrong when using icu_normalizer (#148)