Recognizing entities in text is the first step towards machines that can extract insights from enormous document repositories like PubMed.
Using the Python NLP library spaCy to extract genes from PubMed text
- Anaconda and Jupyter Notebook
- spaCy - Open-source library for industrial-strength Natural Language Processing (NLP) in Python
Anaconda and Jupyter Notebook :
- Download and install Anaconda from https://repo.anaconda.com/archive/Anaconda3-2019.07-Windows-x86_64.exe. Select the default options when prompted during the installation.
- Open “Anaconda Prompt” by finding it in the Windows (Start) Menu.
- Type the command
python --version
to verify that Anaconda was installed.
- Type the command
jupyter notebook
to start Jupyter Notebook.
spaCy :
spaCy is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest spaCy releases are available over pip and conda.
Windows & OS X & Linux
- Run the command below in Command Prompt
(make sure Python has been added to PATH)
pip install -U spacy
- Run the command below in Anaconda Prompt
(run as administrator)
conda install -c conda-forge spacy
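To confirm that one of the install commands above worked before opening the notebook, a small check like the following can be run in Python. This is a minimal sketch that uses only the standard library to look for the package; the helper name `is_installed` is made up for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be found for import."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("spacy"):
    import spacy

    print("spaCy version:", spacy.__version__)
else:
    print("spaCy not found; re-run one of the install commands above")
```

If the version printed starts with 2, the training and evaluation code in this guide should match the installed API.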
- Step 1 - Open Jupyter Notebook.
- Step 2 - Run "Training spaCy’s Statistical Models". It will output a trained model called Demo_1. You can assign your own test data at line 104.
- Step 3 - Run "Testing". You can assign your own test data at line 55.
- Load the model, or create an empty model
We can create an empty model and train it with our annotated dataset, or we can load an existing spaCy model and retrain it with our annotated data.
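import spacy  # the code in this guide uses the spaCy v2.x API

# `model` is the name or path of an existing model, or None to start blank
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

# Add the NER component if the pipeline does not have one yet
if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")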
- Adding Labels or entities
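# Add every entity label found in the training data to the NER component
for _, annotations in train_data:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])  # ent is (start, end, label)

# Disable all other pipeline components so that only NER is trained
other_pipe = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
with nlp.disable_pipes(*other_pipe):
    if model is None:
        optimizer = nlp.begin_training()  # blank model: initialize weights
    else:
        optimizer = nlp.resume_training()  # existing model: keep its weights
    # the training loop goes here, inside the with block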
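Before training, it can help to list the labels that the loop above would register with the NER component, since a misspelled label silently becomes a new entity type. The sketch below does this in plain Python without spaCy; the sample sentences, the `GENE` and `DRUG` labels, and the helper name `collect_labels` are hypothetical and only illustrate the data shape.

```python
# Hypothetical sample in the (text, {"entities": [...]}) shape used for training
train_data = [
    ("BRCA1 and TP53 are tumour suppressor genes.",
     {"entities": [(0, 5, "GENE"), (10, 14, "GENE")]}),
    ("Aspirin inhibits PTGS2.",
     {"entities": [(0, 7, "DRUG"), (17, 22, "GENE")]}),
]


def collect_labels(data):
    """Return the sorted set of entity labels that occur in the data."""
    labels = set()
    for _, annotations in data:
        for _start, _end, label in annotations.get("entities", []):
            labels.add(label)
    return sorted(labels)


print(collect_labels(train_data))  # ['DRUG', 'GENE']
```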
- Training and updating the model
Training data: annotated data that contains both the texts and their labels.
Text: the input text the model should predict labels for.
Label: the label the model should predict.
# spaCy training data format
# start and end are character offsets into the text (end is exclusive)
train_data = [
    ("Text 1", {
        "entities": [(start, end, "Label 1"), (start, end, "Label 2"), (start, end, "Label 3")]
    }),
    ("Text 2", {
        "entities": [(start, end, "Label 1"), (start, end, "Label 2")]
    }),
    ("Text 3", {
        "entities": [(start, end, "Label 1"), (start, end, "Label 2"),
                     (start, end, "Label 3"), (start, end, "Label 4")]
    }),
]
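Counting character offsets by hand is error-prone, so the start and end values can be computed from the entity strings instead. The following is a minimal sketch in plain Python, assuming each mention appears in the text in the order given; the helper name `make_example`, the sentence, and the `GENE` label are hypothetical.

```python
def make_example(text, mentions):
    """Build one (text, {"entities": [...]}) pair.

    `mentions` is a list of (substring, label) pairs in the order in
    which they occur in `text`.
    """
    entities = []
    search_from = 0
    for mention, label in mentions:
        start = text.find(mention, search_from)
        if start == -1:
            raise ValueError(f"{mention!r} not found in text")
        end = start + len(mention)  # end offset is exclusive
        entities.append((start, end, label))
        search_from = end
    return (text, {"entities": entities})


text = "Mutations in BRCA1 and BRCA2 raise breast cancer risk."
example = make_example(text, [("BRCA1", "GENE"), ("BRCA2", "GENE")])
print(example[1])  # {'entities': [(13, 18, 'GENE'), (23, 28, 'GENE')]}

# Slicing the text with the offsets should give back the entity strings
for start, end, label in example[1]["entities"]:
    print(text[start:end], label)
```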
- We will train our model for a number of iterations so that the model can learn from it effectively.
import random

for itn in range(iteration):
    print("Starting iteration " + str(itn))
    random.shuffle(train_data)  # new example order for every iteration
    losses = {}
- At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations based on the order of examples.
- We will update the model for each iteration using nlp.update().
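# Runs inside the iteration loop, once per training example
for text, annotation in train_data:
    nlp.update(
        [text],          # batch of texts
        [annotation],    # batch of annotations
        drop=0.2,        # dropout rate
        sgd=optimizer,   # optimizer that updates the weights
        losses=losses,   # accumulates the loss per pipeline component
    )
# print(losses)
new_model = nlp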
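The effect of shuffling can be seen without spaCy: every iteration visits the same examples, only in a different order. The sketch below is an illustration under that assumption, with made-up placeholder texts and a hypothetical helper `run_epochs`; the comment marks where the real loop would call nlp.update().

```python
import random

# Placeholder training data in the (text, {"entities": [...]}) shape
train_data = [("text %d" % i, {"entities": []}) for i in range(5)]


def run_epochs(data, iterations, seed=0):
    """Shuffle a copy of the data once per iteration and record each order."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    data = list(data)          # copy, so the caller's list is left untouched
    orders = []
    for _ in range(iterations):
        rng.shuffle(data)
        orders.append([text for text, _ in data])
        # the real loop would call nlp.update(...) for each example here
    return orders


orders = run_epochs(train_data, iterations=3)
for itn, order in enumerate(orders):
    print(itn, order)
```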
- Evaluate the model
# spaCy testing data format (same structure as the training data)
test_data = [
    ("Text 1", {"entities": [(start, end, "Label 1")]}),
    ("Text 2", {"entities": [(start, end, "Label 1"), (start, end, "Label 2")]}),
]

import spacy
from spacy.gold import GoldParse  # available in spaCy v2.x only
from spacy.scorer import Scorer


def evaluate(model, examples):
    scorer = Scorer()
    for input_, annot in examples:
        # Build the gold-standard parse from the annotated entities
        doc_gold_text = model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot["entities"])
        # Run the model and compare its prediction with the gold parse
        pred_value = model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores


test_result = evaluate(new_model, test_data)
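The entity scores summarize how many predicted spans match the gold spans exactly. As a rough illustration of how precision, recall and F-score are computed for entities, the sketch below works on plain (start, end, label) tuples; the helper name `ner_prf` and the sample spans are hypothetical, and the function returns fractions between 0 and 1 rather than whatever scale the spaCy scorer uses.

```python
def ner_prf(gold_entities, predicted_entities):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    gold = set(gold_entities)
    pred = set(predicted_entities)
    true_pos = len(gold & pred)  # spans with identical offsets and label
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(13, 18, "GENE"), (23, 28, "GENE")]
pred = [(13, 18, "GENE"), (35, 41, "GENE")]  # one hit, one false positive
print(ner_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

A span counts as correct only when its start, end and label all match, so a prediction that is off by one character is scored as both a false positive and a false negative.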
- Visualization
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # model name
doc = nlp("""Your test context""")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, jupyter=True, style="ent")
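displaCy can also draw entities that did not come from a model, which is handy for checking annotated data by eye. The sketch below converts one annotated example into the dictionary layout that displacy.render accepts when called with manual=True; the helper name `to_displacy_manual` and the sample sentence are hypothetical, and the render call is left as a comment so the snippet runs without spaCy.

```python
def to_displacy_manual(text, annotations, title=None):
    """Convert (text, {"entities": [...]}) into displaCy's manual 'ent' input."""
    ents = [
        {"start": start, "end": end, "label": label}
        for start, end, label in sorted(annotations["entities"])
    ]
    return {"text": text, "ents": ents, "title": title}


example = (
    "Mutations in BRCA1 and BRCA2 raise breast cancer risk.",
    {"entities": [(13, 18, "GENE"), (23, 28, "GENE")]},
)
manual_doc = to_displacy_manual(*example)
print(manual_doc)

# In a Jupyter notebook with spaCy installed:
# from spacy import displacy
# displacy.render(manual_doc, style="ent", manual=True, jupyter=True)
```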
See the spaCy Tutorials for more details and examples
[1] How to create custom NER in Spacy
[2] How to extract genes from text with Sysrev and spaCy
[3] Custom Named Entity Recognition Using spaCy