Extract entities from few-shot learning datasets using the tagme tools.

231sm/FewShotEntityExtraction

 
 

FewShotEntityExtraction

Utility scripts that use tagme to extract entities from the sentences of few-shot datasets.

Requirements

  • python3
  • tagme==0.1.3

To use the tagme API, you need to register an account to obtain a token, then set that token as TAGME_TOKEN in config.py.
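A minimal sketch of the relevant part of config.py, assuming the token is stored as a plain string. Only the name TAGME_TOKEN comes from this README; the placeholder value is not a real token.

```python
# config.py (sketch, not the repository's actual file)
# Token obtained after registering for the tagme API.
# The value below is a placeholder; replace it with your own token.
TAGME_TOKEN = "<your-tagme-token>"
```

The tagme Python package reads its token from the module attribute tagme.GCUBE_TOKEN, so the scripts presumably assign TAGME_TOKEN to it before annotating.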

Fewshot dataset

  • Data source: https://github.com/zxlzr/FewShotNLP/tree/master/data

  • Command: python tag_fewshot.py

  • Output: each input file A.train (or A.dev, A.test) is written to A.train.json (or A.dev.json, A.test.json)

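  • Data format: each record holds the sentence, its class, and an entities list, where each entity has:

    • pos_begin: index of the mention's first character in the sentence (0-based)
    • pos_end: index one past the mention's last character (exclusive; in the sample below, 13 to 18 covers "weeks")
    • entity_id: ID of the linked entity as returned by tagme
    • score: score returned by tagme for the annotation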
  • Sample data:

[{
  "sentence": "lasts only 2 weeks ! try them if you don't believe me",
  "class": "-1",
  "entities": [
    {
      "pos_begin": 13,
      "pos_end": 18,
      "entity_id": 27493154,
      "score": 0.0007660030387341976
    }, {
      "pos_begin": 21,
      "pos_end": 24,
      "entity_id": 3276812,
      "score": 0.009006991051137447
    }, {
      "pos_begin": 30,
      "pos_end": 32,
      "entity_id": 1685851,
      "score": 0.07183314114809036
    }, {
      "pos_begin": 33,
      "pos_end": 36,
      "entity_id": 14148802,
      "score": 0.19438420236110687
    }, {
      "pos_begin": 37,
      "pos_end": 40,
      "entity_id": 294015,
      "score": 0.010517369955778122
    }, {
      "pos_begin": 37,
      "pos_end": 50,
      "entity_id": 27690196,
      "score": 0.005518087185919285
    }, {
      "pos_begin": 43,
      "pos_end": 53,
      "entity_id": 38740213,
      "score": 0.0672566369175911
    }
  ]
}]
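To make the character offsets concrete, the following minimal sketch recovers each mention's text from a record. It assumes pos_end is exclusive, which matches the sample above. The helper name `mentions` and the `min_score` threshold are illustrative and not part of this repository.

```python
import json

# Abbreviated copy of the sample record above (first four entities only).
SAMPLE = json.loads("""
[{
  "sentence": "lasts only 2 weeks ! try them if you don't believe me",
  "class": "-1",
  "entities": [
    {"pos_begin": 13, "pos_end": 18, "entity_id": 27493154, "score": 0.0007660030387341976},
    {"pos_begin": 21, "pos_end": 24, "entity_id": 3276812, "score": 0.009006991051137447},
    {"pos_begin": 30, "pos_end": 32, "entity_id": 1685851, "score": 0.07183314114809036},
    {"pos_begin": 33, "pos_end": 36, "entity_id": 14148802, "score": 0.19438420236110687}
  ]
}]
""")


def mentions(record, min_score=0.0):
    """Return (mention_text, entity_id) pairs whose score is at least min_score.

    Hypothetical helper: slices the sentence with pos_begin/pos_end,
    treating pos_end as exclusive.
    """
    sentence = record["sentence"]
    return [
        (sentence[e["pos_begin"]:e["pos_end"]], e["entity_id"])
        for e in record["entities"]
        if e["score"] >= min_score
    ]


for text, entity_id in mentions(SAMPLE[0]):
    print(f"{entity_id}\t{text}")
```

Many of the scores in the sample are very low, so a caller would likely pass a `min_score` cutoff; the right value depends on the downstream task and is not specified here.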

FewRel dataset

  • Data source: https://github.com/thunlp/FewRel/tree/master/data

  • Command: python tag_fewrel.py

  • Output: train.json (or val.json) is written to train_entity.json (or val_entity.json)

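  • Data format: each record holds the tokens, h, and t fields plus an entities list, where each entity has:

    • index_begin: index of the mention's first token in the sentence (0-based)
    • index_end: index one past the mention's last token (exclusive; in the sample below, 19 to 21 covers "Mississippi River")
    • entity_id: ID of the linked entity as returned by tagme
    • score: score returned by tagme for the annotation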
  • Sample data:

[{
  "tokens": ["In", "June", "1987", ",", "the", "Missouri", "Highway", "and", "Transportation", "Department", "approved", "design", "location", "of", "a", "new", "four", "-", "lane", "Mississippi", "River", "bridge", "to", "replace", "the", "deteriorating", "Cape", "Girardeau", "Bridge", "."],
  "h": ["cape girardeau bridge", "Q5034838", [[26, 27, 28]]],
  "t": ["mississippi river", "Q1497", [[19, 20]]],
  "entities": [
    {
      "index_begin": 5,
      "index_end": 6,
      "entity_id": 19591,
      "score": 0.35695624351501465
    }, {
      "index_begin": 6,
      "index_end": 7,
      "entity_id": 48519,
      "score": 0.22454741597175598
    }, {
      "index_begin": 8,
      "index_end": 10,
      "entity_id": 58235,
      "score": 0.015105740167200565
    }, {
      "index_begin": 11,
      "index_end": 12,
      "entity_id": 631159,
      "score": 0.08602561801671982
    }, {
      "index_begin": 12,
      "index_end": 13,
      "entity_id": 2272383,
      "score": 0.09302778542041779
    }, {
      "index_begin": 18,
      "index_end": 19,
      "entity_id": 95699,
      "score": 0.0887017548084259
    }, {
      "index_begin": 19,
      "index_end": 21,
      "entity_id": 19579,
      "score": 0.6045866012573242
    }, {
      "index_begin": 26,
      "index_end": 29,
      "entity_id": 4910093,
      "score": 0.5
    }
  ]
}]
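The token-level indices can be read the same way. The sketch below, which assumes index_end is exclusive as in the sample above, rebuilds each mention from the tokens and checks whether an extracted entity lines up with the head (h) or tail (t) mention of the record. The helper names `entity_spans` and `matches_mention` are illustrative and not part of this repository.

```python
# Abbreviated copy of the sample record above (three of the eight entities).
RECORD = {
    "tokens": ["In", "June", "1987", ",", "the", "Missouri", "Highway", "and",
               "Transportation", "Department", "approved", "design", "location",
               "of", "a", "new", "four", "-", "lane", "Mississippi", "River",
               "bridge", "to", "replace", "the", "deteriorating", "Cape",
               "Girardeau", "Bridge", "."],
    "h": ["cape girardeau bridge", "Q5034838", [[26, 27, 28]]],
    "t": ["mississippi river", "Q1497", [[19, 20]]],
    "entities": [
        {"index_begin": 5, "index_end": 6, "entity_id": 19591, "score": 0.35695624351501465},
        {"index_begin": 19, "index_end": 21, "entity_id": 19579, "score": 0.6045866012573242},
        {"index_begin": 26, "index_end": 29, "entity_id": 4910093, "score": 0.5},
    ],
}


def entity_spans(record):
    """Return (mention_text, entity_id) pairs, joining tokens with spaces."""
    tokens = record["tokens"]
    return [
        (" ".join(tokens[e["index_begin"]:e["index_end"]]), e["entity_id"])
        for e in record["entities"]
    ]


def matches_mention(entity, mention):
    """True if the entity covers exactly one of the mention's token position lists.

    `mention` is a FewRel h or t field: [name, id, [[positions], ...]].
    """
    span = list(range(entity["index_begin"], entity["index_end"]))
    return any(span == positions for positions in mention[2])


for text, entity_id in entity_spans(RECORD):
    print(f"{entity_id}\t{text}")
```

Exact span equality is a deliberately strict choice for this sketch; a looser overlap test may suit cases where tagme and FewRel disagree on mention boundaries.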

Note

  • Max workers: set in config.py; the default is 64. A larger value usually speeds up processing, but not always.
  • Termination: The program terminates when all jobs are done. You can stop it at any time with Ctrl+C, but wait until all workers have exited and the program has printed "Saved data.", which means the extracted entities have been written to the output files. When restarted, the program resumes from where it stopped.
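The behaviour described in these notes (a bounded worker pool, a save on Ctrl+C, and a resume on restart) can be sketched as follows. This is not the repository's implementation: the function `process_all`, the constant name MAX_WORKERS, the `annotate` callback, and the choice to key saved records by sentence are all assumptions made for illustration.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 64  # hypothetical name; the README only says the value lives in config.py


def process_all(sentences, annotate, out_path, max_workers=MAX_WORKERS):
    """Annotate sentences in parallel, skipping any already saved in out_path.

    `annotate` is any callable mapping a sentence to a list of entity dicts.
    """
    # Resume: load records written by a previous (possibly interrupted) run.
    done = {}
    if os.path.exists(out_path):
        with open(out_path, encoding="utf-8") as f:
            done = {rec["sentence"]: rec for rec in json.load(f)}

    todo = [s for s in sentences if s not in done]
    pool = ThreadPoolExecutor(max_workers=max_workers)
    futures = {pool.submit(annotate, s): s for s in todo}
    try:
        for _ in as_completed(futures):
            pass  # only wait here; results are collected in the finally block
    except KeyboardInterrupt:
        # Drop jobs that have not started; running ones finish during shutdown.
        for fut in futures:
            fut.cancel()
    finally:
        pool.shutdown(wait=True)  # all workers have exited after this call
        for fut, s in futures.items():
            if fut.done() and not fut.cancelled() and fut.exception() is None:
                done[s] = {"sentence": s, "entities": fut.result()}
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(list(done.values()), f)
        print("Saved data.")
    return list(done.values())
```

Calling `process_all` a second time with the same output path only submits the sentences that are missing from the saved file, which is one simple way to get the restart behaviour described above.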
