Automatically extracting gene and species name(s) from abstract text

Recognizing entities in text is the first step towards machines that can extract insights from enormous document repositories such as PubMed.

Getting Started

Prerequisites

This project uses the Python NLP library spaCy to extract genes from PubMed text. You will need:

Anaconda and Jupyter Notebook
spaCy - Open-source library for industrial-strength Natural Language Processing (NLP) in Python

Installation

Anaconda and Jupyter Notebook:

Download and install Anaconda from https://repo.anaconda.com/archive/Anaconda3-2019.07-Windows-x86_64.exe. Select the default options when prompted during the installation.
Open “Anaconda Prompt” from the Windows Start menu.
Type the command python --version to verify that Anaconda was installed.
Type the command jupyter notebook to start Jupyter Notebook.

spaCy:
spaCy is compatible with 64-bit CPython 2.7 / 3.5+ and runs on Unix/Linux, macOS/OS X and Windows. The latest spaCy releases are available over pip and conda.

Windows & OS X & Linux

Run the command below in Command Prompt
(make sure Python has been added to PATH)

pip install -U spacy

Or run the command below in Anaconda Prompt
(run as administrator)

conda install -c conda-forge spacy
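
To confirm from inside a notebook that the installation worked, a small check along these lines may help. It is a sketch rather than part of the original project: check_environment is a hypothetical helper name, and it only reports whether the package can be found by the interpreter that is currently running.

```python
import importlib.util
import sys


def check_environment(package="spacy"):
    """Report the running Python version and whether `package` can be imported."""
    # find_spec returns None when a top-level package is not installed
    spec = importlib.util.find_spec(package)
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "installed": spec is not None,
    }


print(check_environment())
```

If "installed" is False, the notebook kernel is probably using a different Python environment from the one where spaCy was installed.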

Getting started step by step

Step 1 - Open Jupyter Notebook.
Step 2 - Run "Training spaCy’s Statistical Models". It outputs a trained model called Demo_1. You can assign your own test data at line 104.
Step 3 - Run "Testing". You can assign your own test data at line 55.

Code tutorial and processing walkthrough

Load the model, or create an empty model
We can create an empty model and train it with our annotated dataset, or we can load an existing spaCy model and retrain it with our annotated data.

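import spacy

# model: name or path of an existing spaCy model, or None to start from scratch
if model is not None:
    nlp = spacy.load(model)  # load existing spaCy model
    print("Loaded model '%s'" % model)
else:
    nlp = spacy.blank("en")  # create blank Language class
    print("Created blank 'en' model")

# Add the NER component if the pipeline does not have one yet (spaCy v2-style API)
if 'ner' not in nlp.pipe_names:
    ner = nlp.create_pipe('ner')
    nlp.add_pipe(ner, last=True)
else:
    ner = nlp.get_pipe("ner")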

Adding Labels or entities

# add labels
for _, annotations in train_data:
    for ent in annotations.get('entities'):
        ner.add_label(ent[2])

other_pipe = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# Disable every other pipeline component so that only NER is trained
with nlp.disable_pipes(*other_pipe):
    if model is None:
        optimizer = nlp.begin_training()
    else:
        optimizer = nlp.resume_training()
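
Before registering labels, it can be useful to see which distinct labels the annotated data actually contains, for example to catch a misspelled label. The following is a minimal sketch in plain Python; collect_labels is a hypothetical helper, and the "GENE" and "SPECIES" labels are assumed for illustration only.

```python
def collect_labels(train_data):
    """Return the sorted set of entity labels used in spaCy-style training data."""
    return sorted(
        {
            label
            for _text, annotations in train_data
            for _start, _end, label in annotations.get("entities", [])
        }
    )


# Illustrative data only; the offsets are character positions in each text
sample_data = [
    ("BRCA1 is a gene.", {"entities": [(0, 5, "GENE")]}),
    ("Apis mellifera carries vg.", {"entities": [(0, 14, "SPECIES"), (23, 25, "GENE")]}),
]

print(collect_labels(sample_data))
```

Each label in the returned list could then be passed to ner.add_label once, instead of once per annotation.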

Training and updating the model
Training data: annotated data containing both the text and its labels.
Text: the input text the model should predict labels for.
Label: the label the model should predict.

# spaCy training data format
# start and end are character offsets into the text; end is exclusive
train_data = [
    ("Text 1", {"entities": [(start, end, "Label 1"), (start, end, "Label 2"), (start, end, "Label 3")]}),
    ("Text 2", {"entities": [(start, end, "Label 1"), (start, end, "Label 2")]}),
    ("Text 3", {"entities": [(start, end, "Label 1"), (start, end, "Label 2"),
                            (start, end, "Label 3"), (start, end, "Label 4")]}),
]
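
Counting character offsets by hand is error-prone, so one option is to derive them from the entity strings themselves. The sketch below assumes each mention appears in the text and uses only its first occurrence; build_example is a hypothetical helper, and the "GENE" and "SPECIES" labels are assumptions made for illustration.

```python
def build_example(text, mentions):
    """Turn (surface string, label) pairs into one spaCy-style training example.

    Only the first occurrence of each surface string is annotated.
    """
    entities = []
    for surface, label in mentions:
        start = text.find(surface)
        if start == -1:
            raise ValueError("%r does not occur in the text" % surface)
        # end offset is exclusive, so text[start:end] == surface
        entities.append((start, start + len(surface), label))
    return (text, {"entities": entities})


example = build_example(
    "BRCA1 is expressed in Drosophila melanogaster.",
    [("BRCA1", "GENE"), ("Drosophila melanogaster", "SPECIES")],
)
print(example)
```

A quick sanity check for any example in this format is that text[start:end] reproduces the annotated string exactly.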

We train the model for a number of iterations so that it can learn from the annotated data effectively.

import random

for itn in range(iteration):
    print("Starting iteration " + str(itn))
    random.shuffle(train_data)
    losses = {}

At each iteration, the training data is shuffled to ensure the model doesn’t make any generalisations based on the order of examples.
We will update the model for each iteration using nlp.update().

    for text, annotation in train_data:
        nlp.update(
            [text],          # batch containing a single text
            [annotation],    # matching batch containing a single annotation
            drop=0.2,        # dropout rate
            sgd=optimizer,
            losses=losses
        )
    # print(losses)
new_model = nlp
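
The loop above updates the model one example at a time. A common variation is to update on small batches instead. The sketch below shows only the shape of such a loop in plain Python: run_epochs and minibatches are hypothetical helpers, and update_fn is a stand-in for a call such as nlp.update(texts, annotations, drop=0.2, sgd=optimizer, losses=losses). spaCy also provides its own spacy.util.minibatch helper, which is not used here so that the example runs without spaCy.

```python
import random


def minibatches(items, size):
    """Yield consecutive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_epochs(train_data, n_iter, batch_size, update_fn, seed=0):
    """Shuffle the data each iteration and pass it to update_fn batch by batch."""
    rng = random.Random(seed)   # fixed seed keeps the shuffling reproducible
    data = list(train_data)     # copy so the caller's list is left untouched
    for _ in range(n_iter):
        rng.shuffle(data)
        for batch in minibatches(data, batch_size):
            texts, annotations = zip(*batch)
            update_fn(list(texts), list(annotations))


# Stand-in update function that just records the size of each batch
batch_sizes = []
demo_data = [("text %d" % i, {"entities": []}) for i in range(5)]
run_epochs(demo_data, n_iter=2, batch_size=2,
           update_fn=lambda texts, anns: batch_sizes.append(len(texts)))
print(batch_sizes)
```

Whether batching helps depends on the size of the dataset; for a handful of examples the one-at-a-time loop is likely sufficient.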

Evaluate the model

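# spaCy testing data format (same structure as the training data,
# because evaluate() below reads annot['entities'])
test_data = [
    ('Text 1',
     {'entities': [(start, end, 'Label 1')]}),
    ('Text 2',
     {'entities': [(start, end, 'Label 1'), (start, end, 'Label 2')]})
]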

import spacy
from spacy.gold import GoldParse
from spacy.scorer import Scorer

def evaluate(model, examples):
    """Score the model's predictions against the gold annotations."""
    scorer = Scorer()
    for input_, annot in examples:
        # Tokenize only, so the gold annotations can be aligned to the tokens
        doc_gold_text = model.make_doc(input_)
        gold = GoldParse(doc_gold_text, entities=annot['entities'])
        # Run the full pipeline to get the predicted entities
        pred_value = model(input_)
        scorer.score(pred_value, gold)
    return scorer.scores

test_result = evaluate(new_model, test_data)
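
The entity scores returned by the scorer are precision, recall and F-score. To make those numbers easier to interpret, here is a minimal sketch of how they can be computed by hand from two sets of (start, end, label) spans, counting a prediction as correct only on an exact match. span_prf is a hypothetical helper and is not how spaCy's Scorer is implemented internally; the spans are invented for illustration.

```python
def span_prf(gold_spans, pred_spans):
    """Return (precision, recall, F1) for exact-match entity spans."""
    gold = set(gold_spans)
    pred = set(pred_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 5, "GENE"), (22, 45, "SPECIES")]
pred = [(0, 5, "GENE"), (9, 18, "GENE")]   # one correct span, one spurious span
print(span_prf(gold, pred))
```

Here one of the two predictions is correct and one of the two gold entities is found, so precision, recall and F1 are all 0.5.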

Visualization

import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")  # model name or path
doc = nlp("""Your test context""")
# Since this is an interactive Jupyter environment, we can use displacy.render here
displacy.render(doc, jupyter=True, style='ent')
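
displaCy can also draw annotations that did not come from a model, which is handy for eyeballing the annotated data itself. The sketch below converts one example in the training data format into the dictionary that displacy.render accepts when called with manual=True. to_displacy_manual is a hypothetical helper, and the text and labels are invented for illustration; the conversion itself is plain Python.

```python
def to_displacy_manual(text, annotation, title=None):
    """Convert one (text, {"entities": [...]}) example to displaCy's manual format."""
    ents = [
        {"start": start, "end": end, "label": label}
        # sort by start offset so the spans are listed in reading order
        for start, end, label in sorted(annotation["entities"])
    ]
    return {"text": text, "ents": ents, "title": title}


text = "BRCA1 in Mus musculus"
annotation = {"entities": [(9, 21, "SPECIES"), (0, 5, "GENE")]}
manual_doc = to_displacy_manual(text, annotation)
print(manual_doc)

# In a notebook with spaCy installed, this could then be rendered with:
# displacy.render(manual_doc, style="ent", manual=True, jupyter=True)
```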

Reference

See the spaCy Tutorials for more details and examples
[1] How to create custom NER in Spacy
[2] How to extract genes from text with Sysrev and spaCy
[3] Custom Named Entity Recognition Using spaCy

ShangYuChiang/NER_GS

Folders and files

Latest commit

History

Repository files navigation

Automatically extracting gene and species name(s) from abstract text

Getting Started

Prerequisites

Installation

Getting started step by step

Code tutorial and processing walkthrough

Reference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages