Python wrapper for extracting candidate and mapping concepts using MetaMap. Pymm parses the XML output of MetaMap. The following concept information is extracted:
- score
- matched word
- cui
- semtypes
- negated
- matched word start position
- matched word end position
- ismapping
The ismapping flag is set to True for a mapping concept and False for a candidate concept.
    git clone https://github.com/smujjiga/pymm.git
    cd pymm
    python setup.py install
Create the Python MetaMap wrapper object by pointing it to the location of MetaMap:

    from pymm import Metamap

    mm = Metamap(METAMAP_PATH)
We can check whether MetaMap is running using:

    assert mm.is_alive()
Concept extraction is done via the parse method:

    mmos = mm.parse(['heart attack', 'myocardial infarction'])
The parse method returns an iterator of MetaMap object iterators, one per input sentence. Each MetaMap object iterator yields the candidate and mapping concepts for its sentence.
    for idx, mmo in enumerate(mmos):
        for jdx, concept in enumerate(mmo):
            print(concept.cui, concept.score, concept.matched)
            print(concept.semtypes, concept.ismapping)
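Since every concept carries the ismapping flag, a common next step is to separate mapping concepts from candidate concepts. The following is a minimal sketch, not part of Pymm: `split_concepts` is a hypothetical helper, and `Concept` is a stand-in namedtuple that only mimics the attributes printed in the loop above so the example runs without MetaMap. The CUIs and scores are made-up placeholders.

```python
from collections import namedtuple

# Stand-in for the concept objects yielded by Pymm (assumption: only the
# attributes used in the loop above are modelled here).
Concept = namedtuple("Concept", ["cui", "score", "matched", "semtypes", "ismapping"])


def split_concepts(mmo):
    """Split one sentence's concepts into (mappings, candidates).

    Iterates over `mmo` exactly once, so it also works with single-pass iterators.
    """
    mappings, candidates = [], []
    for concept in mmo:
        if concept.ismapping:
            mappings.append(concept)
        else:
            candidates.append(concept)
    return mappings, candidates


# Placeholder data, not real MetaMap output.
sample = [
    Concept("CUI_1", 1000, "heart attack", ["dsyn"], True),
    Concept("CUI_2", 861, "heart", ["bpoc"], False),
]

mappings, candidates = split_concepts(iter(sample))
print([c.cui for c in mappings], [c.cui for c in candidates])
```

With real output, the same helper would be called once per `mmo` inside the outer loop.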
The Python MetaMap wrapper object also supports a debug parameter, which persists the input and output files and prints the command line used to run MetaMap:

    mm = Metamap(METAMAP_PATH, debug=True)
The code snippet below extracts concepts from a large number of sentences in batches, recording a checkpoint after each batch.
    import sys
    import traceback

    # MetamapStuck, the exception handled below, also needs to be imported.
    # CLINICAL_TEXT_FILE, BATCH_SIZE, last_checkpoint, clean_text, save and
    # record_checkpoint are defined by the caller.

    def read_lines(file_name, fast_forward_to, batch_size, preprocessing):
        sentences = list()
        with open(file_name, 'r') as fp:
            # Skip the lines that were processed before the last checkpoint
            for i in range(fast_forward_to):
                fp.readline()

            for idx, line in enumerate(fp):
                sentences.append(preprocessing(line))
                if (idx+1) % batch_size == 0:
                    yield sentences
                    sentences.clear()

        # Yield the final, partially filled batch
        if sentences:
            yield sentences

    try:
        for i, sentences in enumerate(read_lines(CLINICAL_TEXT_FILE, last_checkpoint, BATCH_SIZE, clean_text)):
            timeout = 0.33*BATCH_SIZE
            try_again = False
            try:
                mmos = mm.parse(sentences, timeout=timeout)
            except MetamapStuck:
                # Try with larger timeout
                print("Metamap Stuck !!!; trying with larger timeout")
                try_again = True
            except Exception:
                print("Exception in mm; skipping the batch")
                traceback.print_exc(file=sys.stdout)
                continue

            if try_again:
                timeout = BATCH_SIZE*2
                try:
                    mmos = mm.parse(sentences, timeout=timeout)
                except MetamapStuck:
                    # Again stuck; ignore this batch
                    print("Metamap Stuck again !!!; ignoring the batch")
                    continue
                except Exception:
                    print("Exception in mm; skipping the batch")
                    traceback.print_exc(file=sys.stdout)
                    continue

            for idx, mmo in enumerate(mmos):
                for jdx, concept in enumerate(mmo):
                    save(sentences[idx], concept)

            # Number of input lines consumed so far, including earlier runs
            curr_checkpoint = (i+1)*BATCH_SIZE + last_checkpoint
            record_checkpoint(curr_checkpoint)
    finally:
        mm.close()
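The batch loop above relies on `record_checkpoint`, `last_checkpoint`, and `clean_text` without defining them. The sketch below shows one possible way to fill them in. Everything here is an assumption for illustration, not Pymm's own code: the checkpoint is stored as a single integer in a hypothetical file `pymm.checkpoint`, `load_checkpoint` is a hypothetical helper for obtaining `last_checkpoint`, and `clean_text` assumes that stripping non-ASCII characters and collapsing whitespace is suitable preprocessing for your corpus.

```python
import os
import re
import tempfile

CHECKPOINT_FILE = "pymm.checkpoint"  # hypothetical location


def load_checkpoint(path=CHECKPOINT_FILE):
    """Return the number of lines already processed, or 0 if no checkpoint exists."""
    try:
        with open(path, "r") as fp:
            return int(fp.read().strip() or 0)
    except FileNotFoundError:
        return 0


def record_checkpoint(lines_done, path=CHECKPOINT_FILE):
    """Persist the number of processed lines.

    Writes to a temporary file in the same directory and then renames it,
    so an interrupted run never leaves a half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fp:
        fp.write(str(lines_done))
    os.replace(tmp_path, path)


def clean_text(line):
    """Drop non-ASCII characters and collapse runs of whitespace (assumed preprocessing)."""
    ascii_only = line.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()
```

Under these assumptions, `last_checkpoint = load_checkpoint()` would be evaluated once before the batch loop starts, so a restarted run fast-forwards past the lines that were already processed.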
This Python wrapper is motivated by https://github.com/AnthonyMRios/pymetamap. Pymetamap parses the MMI output, whereas Pymm parses the XML output. I wrote Pymm for concept extraction on huge corpora, and have used it to extract candidate and mapping concepts from 10 million sentences.