Merge pull request #65 from sorami/develop
Add documentation about SudachiDict synonym
mh-northlander authored Jun 12, 2024
2 parents a336eef + 828ee7c commit 5d9a397
Showing 3 changed files with 193 additions and 0 deletions.
8 changes: 8 additions & 0 deletions README.md
@@ -610,6 +610,14 @@ Returns `スシ`.

Returns `susi`.


# Synonym

There is a temporary way to use Sudachi Dictionary's synonym resource ([Sudachi 同義語辞書](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md)) with Elasticsearch.

Please refer to [this document](docs/synonym.md) for details.


# License

Copyright (c) 2017-2024 Works Applications Co., Ltd.
35 changes: 35 additions & 0 deletions docs/ssyn2es.py
@@ -0,0 +1,35 @@
#!/usr/bin/env python

import argparse
import fileinput

def main():
    parser = argparse.ArgumentParser(prog="ssyn2es.py", description="convert Sudachi synonyms to ES")
    parser.add_argument('files', metavar='FILE', nargs='*', help='files to read, if empty, stdin is used')
    parser.add_argument('-p', '--output-predicate', action='store_true', help='output predicates')
    args = parser.parse_args()

    synonyms = {}
    with fileinput.input(files=args.files) as input:
        for line in input:
            line = line.strip()
            if line == "":
                continue
            # Fields: [0] group ID, [1] noun/predicate flag, [2] expansion control flag, ..., [8] headword
            entry = line.split(",")[0:9]
            # Skip entries that must not be expanded (flag 2), and predicates unless -p is given
            if entry[2] == "2" or (not args.output_predicate and entry[1] == "2"):
                continue
            # group[0]: bidirectional entries; group[1]: entries that are only expanded to (flag 1)
            group = synonyms.setdefault(entry[0], [[], []])
            group[1 if entry[2] == "1" else 0].append(entry[8])

    for groupid in sorted(synonyms):
        group = synonyms[groupid]
        if not group[1]:
            # All entries are bidirectional: emit a plain comma-separated group
            if len(group[0]) > 1:
                print(",".join(group[0]))
        else:
            # Some entries are one-directional: emit an explicit "=>" mapping
            if len(group[0]) > 0 and len(group[1]) > 0:
                print(",".join(group[0]) + "=>" + ",".join(group[0] + group[1]))


if __name__ == "__main__":
    main()
150 changes: 150 additions & 0 deletions docs/synonym.md
@@ -0,0 +1,150 @@
# Using SudachiDict Synonyms

Here we describe a temporary way to use Sudachi Dictionary's synonym resource ([Sudachi 同義語辞書](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md)) with Elasticsearch. We plan to create a dedicated Sudachi synonym filter for Elasticsearch in the future.

You can convert the synonym file into the Solr synonyms format and use it with Elasticsearch's built-in synonym filters.


## Format Conversion

You can simply convert the Sudachi synonym file into the Solr synonyms format. The Sudachi format is described in detail [here](https://github.com/WorksApplications/SudachiDict/blob/develop/docs/synonyms.md).

You can use [our example script (ssyn2es.py)](./ssyn2es.py) for the conversion:

```sh
$ python ssyn2es.py SudachiDict/src/main/text/synonyms.txt > synonym.txt
```

### Expansion Suppression

You can partially preserve the Sudachi synonym resource's detailed information using the Solr format's `=>` notation, which controls the direction of expansion.

```
# synonym entry
アイスクリーム,ice cream,ice=>アイスクリーム,ice cream,ice,アイス
# expansion example
# `アイスクリーム` => `アイスクリーム`, `ice cream`, `ice`, `アイス`
# `アイス` => `アイス` (**no expansion**)
```

### Punctuation Symbols

You may need to remove synonym entries that consist only of punctuation symbols (e.g., `€`) when you use the analyzer with `"discard_punctuation": true`; otherwise you will get an error such as `"term: € was completely eliminated by analyzer"`. Alternatively, you can set `"lenient": true` for the synonym filter to ignore these exceptions.

These symbols are defined as punctuation; see [SudachiTokenizer.java](https://github.com/WorksApplications/elasticsearch-sudachi/blob/develop/src/main/java/com/worksap/nlp/lucene/sudachi/ja/SudachiTokenizer.java#L140) for details.
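
For example, enabling `lenient` on the filter definition used later in this document might look like the following sketch. This is only an illustrative fragment of the `analysis` settings; the filter name `sudachi_synonym` and the synonyms path follow the set-up example below, and `lenient` is a standard option of Elasticsearch's synonym filters.

```json
{
  "filter": {
    "sudachi_synonym": {
      "type": "synonym_graph",
      "synonyms_path": "sudachi/synonym.txt",
      "lenient": true
    }
  }
}
```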


## Synonym Filter

You can use the converted Solr-format file with Elasticsearch's built-in synonym filters: the [Synonym token filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-tokenfilter.html) or the [Synonym graph filter](https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-synonym-graph-tokenfilter.html).


### Example: Setup

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "sudachi_synonym": {
          "type": "synonym_graph",
          "synonyms_path": "sudachi/synonym.txt"
        }
      },
      "tokenizer": {
        "sudachi_tokenizer": {
          "type": "sudachi_tokenizer"
        }
      },
      "analyzer": {
        "sudachi_synonym_analyzer": {
          "type": "custom",
          "tokenizer": "sudachi_tokenizer",
          "filter": [
            "sudachi_synonym"
          ]
        }
      }
    }
  }
}
```

Here we assume that the converted synonym file is placed at `$ES_PATH_CONF/sudachi/synonym.txt`.

If you would like to use the `sudachi_split` filter, set it *after* the synonym filter, as sketched below (otherwise you will get an error, e.g., `term: 不明確 analyzed to a token (不) with position increment != 1 (got: 0)`).
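
As a reference, here is a minimal sketch of such a filter chain, reusing the filter and analyzer names from the set-up example above. The `split_filter` name and its `"mode": "search"` setting are illustrative assumptions, not part of the original configuration.

```json
{
  "filter": {
    "sudachi_synonym": {
      "type": "synonym_graph",
      "synonyms_path": "sudachi/synonym.txt"
    },
    "split_filter": {
      "type": "sudachi_split",
      "mode": "search"
    }
  },
  "analyzer": {
    "sudachi_synonym_analyzer": {
      "type": "custom",
      "tokenizer": "sudachi_tokenizer",
      "filter": [
        "sudachi_synonym",
        "split_filter"
      ]
    }
  }
}
```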


### Example: Analysis

#### Case 1.

```json
{
  "analyzer": "sudachi_synonym_analyzer",
  "text": "アイスクリーム"
}
```

Returns

```json
{
  "tokens": [
    {
      "token": "アイスクリーム",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ice cream",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "ice",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0
    },
    {
      "token": "アイス",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0
    }
  ]
}
```

#### Case 2.

```json
{
  "analyzer": "sudachi_synonym_analyzer",
  "text": "アイス"
}
```

Returns

```json
{
  "tokens": [
    {
      "token": "アイス",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    }
  ]
}
```
