Automatically extracting ‘main’ gene name(s) from abstract text - Using the Python NLP software library spaCy

ShangYuChiang/NER_GS

Automatically extracting gene and species name(s) from abstract text

Recognizing entities in text is a first step toward machines that can extract insights from enormous document repositories such as PubMed.

Getting Started

Prerequisites

This project uses the Python NLP library spaCy to extract gene names from PubMed text. You will need:

  • Anaconda and Jupyter Notebook
  • spaCy - Open-source library for industrial-strength Natural Language Processing (NLP) in Python

Installation

Anaconda and Jupyter Notebook :

  1. Download and install Anaconda from https://repo.anaconda.com/archive/Anaconda3-2019.07-Windows-x86_64.exe. Select the default options when prompted during the installation.
  2. Open “Anaconda Prompt” from the Windows (Start) Menu.
  3. Type the command python --version to verify that Anaconda was installed.
  4. Type the command jupyter notebook to start Jupyter Notebook.

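spaCy :
spaCy is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest spaCy releases are available over pip and conda. Note that the code in this README was written against spaCy 2.x (it imports spacy.gold, which spaCy 3 removed), so a newer release needs either a port of the code or a pinned 2.x install.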

Windows, macOS, and Linux

  • With pip: run the command below in Command Prompt (or a terminal)
    ( Make sure Python has been added to PATH )
pip install -U spacy
  • With conda: run the command below in Anaconda Prompt
    ( Run as administrator )
conda install -c conda-forge spacy
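
To confirm which spaCy release ended up in the environment, a small check along the following lines may help. It is only a sketch: it uses the standard library (Python 3.8+), and the helper names installed_version and is_spacy_v2 are made up for this example rather than taken from the repository.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_spacy_v2(version):
    """True when a version string such as '2.3.5' belongs to the 2.x series."""
    return version is not None and version.split(".")[0] == "2"


version = installed_version("spacy")
if version is None:
    print("spaCy is not installed in this environment")
elif is_spacy_v2(version):
    print("spaCy %s found (2.x series)" % version)
else:
    print("spaCy %s found; 2.x-style calls may need porting" % version)
```

Running it inside the same Anaconda Prompt or notebook kernel that will run the training code checks the environment that actually matters.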

Getting started step by step

  • Step 1 - Open Jupyter Notebook.
  • Step 2 - Run "Training spaCy’s Statistical Models". It outputs a trained model called Demo_1. You can assign your own test data at line 104.
  • Step 3 - Run "Testing". You can assign your own test data at line 55.

Code tutorial and processing walkthrough

  • Load the model, or create an empty model
    We can either create an empty model and train it on our annotated dataset, or load an existing spaCy model and retrain it with our annotated data.
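import spacy

# `model` is the name or path of an existing spaCy model, or None to start from scratch
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

# add the NER component to the pipeline if it is not there yet (spaCy 2.x API)
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")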
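  • Adding labels (entity types)
# register every label that occurs in the training data with the NER component
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])  # ent is (start, end, label)

# every other pipeline component, so that it can be disabled during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# train only NER; the training loop shown further down belongs inside this block
with nlp.disable_pipes(*other_pipes):
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()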
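
Before registering labels, it can be useful to see which labels the training data contains and how often each one occurs. The sketch below does this in plain Python. The example sentences and the label names GENE and SPECIES are assumptions made for illustration, as is the helper name count_labels.

```python
from collections import Counter


def count_labels(train_data):
    """Count how often each entity label occurs in spaCy-style training data."""
    counts = Counter()
    for _text, annotations in train_data:
        for _start, _end, label in annotations.get("entities", []):
            counts[label] += 1
    return counts


# Made-up example sentences; offsets are character positions in the text.
example_data = [
    ("BRCA1 mutations were studied in Homo sapiens.",
     {"entities": [(0, 5, "GENE"), (32, 44, "SPECIES")]}),
    ("TP53 and BRCA1 are tumour suppressors.",
     {"entities": [(0, 4, "GENE"), (9, 14, "GENE")]}),
]

label_counts = count_labels(example_data)
print(sorted(label_counts.items()))  # [('GENE', 3), ('SPECIES', 1)]
```

A label that appears only a handful of times is a hint that the model may see too few examples of it to learn it well.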
  • Training and updating the model
    Training data : Annotated examples that pair each text with its labels.
    Text : The input text the model should predict labels for.
    Label : The entity label the model should predict, given together with the start and end character offsets of the entity.
# spaCy training data format
train_data = [
    ("Text 1", {"entities": [
        (start, end, "Label 1"), (start, end, "Label 2"), (start, end, "Label 3")
    ]}),
    ("Text 2", {"entities": [
        (start, end, "Label 1"), (start, end, "Label 2")
    ]}),
    ("Text 3", {"entities": [
        (start, end, "Label 1"), (start, end, "Label 2"),
        (start, end, "Label 3"), (start, end, "Label 4")
    ]}),
]
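
Working out the start and end character offsets by hand is error-prone. One possible way to generate them, assuming a list of known entity strings is available, is to search the text with a regular expression. The helper annotate, the example sentence, and the labels GENE and SPECIES below are illustrative assumptions, not part of the repository.

```python
import re


def annotate(text, terms):
    """Build one (text, {"entities": [...]}) tuple from a {term: label} mapping.

    Every whole-word occurrence of each term is located with a regular
    expression and recorded as a (start, end, label) character span.
    """
    entities = []
    for term, label in terms.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text):
            entities.append((match.start(), match.end(), label))
    entities.sort()
    return (text, {"entities": entities})


example = annotate(
    "BRCA1 mutations were studied in Homo sapiens.",
    {"BRCA1": "GENE", "Homo sapiens": "SPECIES"},
)
print(example[1])  # {'entities': [(0, 5, 'GENE'), (32, 44, 'SPECIES')]}
```

The end offset is exclusive, so text[start:end] returns exactly the entity string. Dictionary matching like this only finds the terms it is given and cannot resolve ambiguous names, so the output is best treated as a draft to review by hand.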
  1. We train the model for a number of iterations so that it can learn from the examples effectively.
import random

for itn in range(iteration):  # `itn` avoids shadowing the built-in int
    print("Starting iteration " + str(itn))
    random.shuffle(train_data)
    losses = {}
  2. At each iteration, the training data is shuffled so that the model doesn’t make any generalisations based on the order of the examples.
  3. Within each iteration, we update the model on every example using nlp.update().
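    for text, annotation in train_data:
        nlp.update(
            [text],          # batch of texts (here a batch of one)
            [annotation],    # batch of annotations
            drop=0.2,        # dropout rate, makes it harder to memorise the data
            sgd=optimizer,   # optimizer that applies the weight updates
            losses=losses    # running loss, keyed by component name
        )
    # print(losses)
new_model = nlp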
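
The loop above updates the model one example at a time. Because nlp.update() takes lists of texts and annotations, the shuffled data could also be passed in small batches. The sketch below shows only the batching logic in plain Python: run_epochs and minibatches are hypothetical helpers, and update_fn is a stand-in for a call such as nlp.update(texts, annotations, ...). spaCy also ships its own batching helper, spacy.util.minibatch.

```python
import random


def minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_epochs(train_data, n_iter, batch_size, update_fn, seed=0):
    """Shuffle once per iteration, then pass each batch to `update_fn`."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    data = list(train_data)    # copy, so the caller's list keeps its order
    for _ in range(n_iter):
        rng.shuffle(data)
        for batch in minibatches(data, batch_size):
            texts = [text for text, _ in batch]
            annotations = [annotation for _, annotation in batch]
            update_fn(texts, annotations)


# Stand-in for nlp.update: it only records the size of each batch it receives.
seen_batch_sizes = []
toy_data = [("text %d" % i, {"entities": []}) for i in range(5)]
run_epochs(
    toy_data,
    n_iter=2,
    batch_size=2,
    update_fn=lambda texts, annotations: seen_batch_sizes.append(len(texts)),
)
print(seen_batch_sizes)  # [2, 2, 1, 2, 2, 1]
```

Whether batching helps depends on the size of the dataset; for a few hundred abstracts the one-example-at-a-time loop is usually fast enough.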
  • Evaluate the model
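# spaCy testing data format (same structure as the training data,
# because evaluate() reads annot['entities'])
test_data = [
    ('Text 1',
     {'entities': [(start, end, 'Label 1')]}),
    ('Text 2',
     {'entities': [(start, end, 'Label 1'), (start, end, 'Label 2')]})
]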
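
spaCy's NER works with non-overlapping entity spans, and offsets that are wrong by a character or two tend to produce confusing errors or silently poor results. A quick sanity check over the annotated spans, sketched here in plain Python with a hypothetical helper check_spans and a made-up sentence, can catch such problems before training or evaluation.

```python
def check_spans(text, entities):
    """Return a list of problems found in (start, end, label) spans for `text`."""
    problems = []
    previous_end = 0
    for span in sorted(entities):
        start, end, _label = span
        if not (0 <= start < end <= len(text)):
            problems.append("out of range: %r" % (span,))
            continue
        if start < previous_end:
            problems.append("overlaps an earlier span: %r" % (span,))
        if text[start:end] != text[start:end].strip():
            problems.append("starts or ends with whitespace: %r" % (span,))
        previous_end = max(previous_end, end)
    return problems


sentence = "TP53 and BRCA1 are tumour suppressors."
good_spans = [(0, 4, "GENE"), (9, 14, "GENE")]
bad_spans = [(0, 4, "GENE"), (2, 14, "GENE"), (30, 99, "GENE")]

print(check_spans(sentence, good_spans))  # []
for problem in check_spans(sentence, bad_spans):
    print(problem)
```

An empty list means the spans passed these three checks; it does not prove that the labels themselves are correct.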
import spacy
from spacy.gold import GoldParse  # available in spaCy 2.x only
from spacy.scorer import Scorer

def evaluate(model, examples):
    """Score the model's predictions against the gold annotations."""
    scorer = Scorer()
    for input_, annot in examples:
        doc_gold_text = model.make_doc(input_)  # tokenize only, no predictions
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        pred_value = model(input_)  # run the full pipeline, including NER
        scorer.score(pred_value, gold)
    return scorer.scores

test_result = evaluate(new_model, test_data)
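
In spaCy 2.x the dictionary returned by scorer.scores includes entity-level precision, recall, and F-score (keys such as ents_p, ents_r, and ents_f). To make those numbers easier to interpret, here is a small plain-Python sketch of exact-match scoring over entity spans. It illustrates the arithmetic only and is not spaCy's implementation; span_prf and the example spans are made up.

```python
def span_prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        f_score = 0.0
    else:
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


gold = [(0, 5, "GENE"), (32, 44, "SPECIES")]
predicted = [(0, 5, "GENE"), (16, 20, "GENE")]  # one hit, one false alarm

precision, recall, f_score = span_prf(gold, predicted)
print(precision, recall, f_score)  # 0.5 0.5 0.5
```

A predicted span counts as correct here only if its start, end, and label all match a gold span, so a prediction that is off by one character scores as both a false positive and a missed entity.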
  • Visualization
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # model name; pass the path of your own trained model to inspect its predictions
doc = nlp("""Your test context""")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, jupyter=True, style='ent')
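
Outside a notebook, a plain-text view of the predicted entities can serve as a quick check. The sketch below marks each span inline. The helper mark_entities is hypothetical, it assumes the spans do not overlap, and the example spans are hard-coded here; with a spaCy Doc they could be collected as (ent.start_char, ent.end_char, ent.label_) for each ent in doc.ents.

```python
def mark_entities(text, entities):
    """Return `text` with each (start, end, label) span shown as [span|LABEL]."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append("[%s|%s]" % (text[start:end], label))
        cursor = end
    pieces.append(text[cursor:])  # remainder after the last entity
    return "".join(pieces)


marked = mark_entities(
    "BRCA1 mutations were studied in Homo sapiens.",
    [(0, 5, "GENE"), (32, 44, "SPECIES")],
)
print(marked)  # [BRCA1|GENE] mutations were studied in [Homo sapiens|SPECIES].
```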

Reference

See the spaCy Tutorials for more details and examples
[1] How to create custom NER in Spacy
[2] How to extract genes from text with Sysrev and spaCy
[3] Custom Named Entity Recognition Using spaCy
