Utility scripts to extract entities from sentences in few-shot datasets using TagMe.

Requirements:
- python3
- tagme==0.1.3

To use the TagMe API, you need to register an account here and write the token as TAGME_TOKEN in config.py.
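For reference, a minimal sketch of how a single sentence is annotated with the tagme package (the GCUBE_TOKEN variable and annotate call are the library's documented interface; importing TAGME_TOKEN from config.py follows the convention above). This is only a sketch, not necessarily how the scripts here are implemented internally:

```python
# Minimal sketch: annotate one sentence with tagme==0.1.3.
# TAGME_TOKEN comes from config.py, as described above.
import tagme
from config import TAGME_TOKEN

tagme.GCUBE_TOKEN = TAGME_TOKEN  # the library reads the API token from this module-level variable

response = tagme.annotate("lasts only 2 weeks ! try them if you don't believe me")
for ann in response.get_annotations(0.0):  # keep every annotation regardless of its rho score
    print(ann.begin, ann.end, ann.entity_id, ann.score)
```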
- Data source: https://github.com/zxlzr/FewShotNLP/tree/master/data
- Command: python tag_fewshot.py
- Output: original file named A.train(dev/test) -> A.train(dev/test).json
- Data format:
  - pos_begin: index of the starting character in the sentence
  - pos_end: index of the ending character in the sentence (exclusive, judging from the sample below; see the sketch after it)
  - entity_id: ID of the linked entity
  - score: confidence score returned by tagme
- Sample data:
[{
"sentence": "lasts only 2 weeks ! try them if you don't believe me",
"class": "-1",
"entities": [
{
"pos_begin": 13,
"pos_end": 18,
"entity_id": 27493154,
"score": 0.0007660030387341976
}, {
"pos_begin": 21,
"pos_end": 24,
"entity_id": 3276812,
"score": 0.009006991051137447
}, {
"pos_begin": 30,
"pos_end": 32,
"entity_id": 1685851,
"score": 0.07183314114809036
}, {
"pos_begin": 33,
"pos_end": 36,
"entity_id": 14148802,
"score": 0.19438420236110687
}, {
"pos_begin": 37,
"pos_end": 40,
"entity_id": 294015,
"score": 0.010517369955778122
}, {
"pos_begin": 37,
"pos_end": 50,
"entity_id": 27690196,
"score": 0.005518087185919285
}, {
"pos_begin": 43,
"pos_end": 53,
"entity_id": 38740213,
"score": 0.0672566369175911
}
]
}]
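As a sanity check on the offsets, here is a short sketch that recovers the surface mentions from the sample above; the file name A.train.json is hypothetical:

```python
# Sketch: recover surface mentions from a tagged few-shot file.
# Judging from the sample, pos_end is exclusive (Python-slice convention):
# sentence[13:18] -> "weeks", sentence[21:24] -> "try".
import json

with open("A.train.json") as f:  # hypothetical output file name
    data = json.load(f)

for item in data:
    sentence = item["sentence"]
    for ent in item["entities"]:
        print(sentence[ent["pos_begin"]:ent["pos_end"]], ent["entity_id"], ent["score"])
```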
- Data source: https://github.com/thunlp/FewRel/tree/master/data
- Command: python tag_fewrel.py
- Output: train(val).json -> train(val)_entity.json
- Data format:
  - index_begin: index of the starting word in the sentence
  - index_end: index of the ending word in the sentence (exclusive, judging from the sample below; see the sketch after it)
  - entity_id: ID of the linked entity
  - score: confidence score returned by tagme
- Sample data:
[{
"tokens": ["In", "June", "1987", ",", "the", "Missouri", "Highway", "and", "Transportation", "Department", "approved", "design", "location", "of", "a", "new", "four", "-", "lane", "Mississippi", "River", "bridge", "to", "replace", "the", "deteriorating", "Cape", "Girardeau", "Bridge", "."],
"h": ["cape girardeau bridge", "Q5034838", [[26, 27, 28]]],
"t": ["mississippi river", "Q1497", [[19, 20]]],
"entities": [
{
"index_begin": 5,
"index_end": 6,
"entity_id": 19591,
"score": 0.35695624351501465
}, {
"index_begin": 6,
"index_end": 7,
"entity_id": 48519,
"score": 0.22454741597175598
}, {
"index_begin": 8,
"index_end": 10,
"entity_id": 58235,
"score": 0.015105740167200565
}, {
"index_begin": 11,
"index_end": 12,
"entity_id": 631159,
"score": 0.08602561801671982
}, {
"index_begin": 12,
"index_end": 13,
"entity_id": 2272383,
"score": 0.09302778542041779
}, {
"index_begin": 18,
"index_end": 19,
"entity_id": 95699,
"score": 0.0887017548084259
}, {
"index_begin": 19,
"index_end": 21,
"entity_id": 19579,
"score": 0.6045866012573242
}, {
"index_begin": 26,
"index_end": 29,
"entity_id": 4910093,
"score": 0.5
}
]
}]
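Unlike the character offsets above, these are word indices into tokens. A short sketch, using the train_entity.json output name given above:

```python
# Sketch: recover entity mentions from a tagged FewRel file.
# index_end also looks exclusive: tokens[19:21] -> ["Mississippi", "River"],
# tokens[26:29] -> ["Cape", "Girardeau", "Bridge"].
import json

with open("train_entity.json") as f:
    data = json.load(f)

for item in data:
    tokens = item["tokens"]
    for ent in item["entities"]:
        print(" ".join(tokens[ent["index_begin"]:ent["index_end"]]), ent["entity_id"], ent["score"])
```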
- Max workers: specified in config.py. A larger number can improve throughput, but not always; the default here is 64.
- Termination: the program terminates when all jobs are done. You may stop processing with ctrl+c at any time, but make sure all workers have exited and the program has printed "Saved data.", which means the extracted entities have been written to the output files. If you restart, the program resumes from where you stopped last time.
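For orientation, config.py might look like the following; TAGME_TOKEN is the name this README specifies, while the name of the worker-count setting is a guess:

```python
# config.py (sketch). Only TAGME_TOKEN is named in this README;
# MAX_WORKERS is a hypothetical name for the worker-count setting.
TAGME_TOKEN = "your-gcube-token"
MAX_WORKERS = 64  # larger values can improve throughput, but not always
```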