Implement bio-med named entity recognition (NER) #46
Would it make sense to link :Entity to the text fragment (somehow) where it originates from?
We also have the
I've also been in touch with https://www.scibite.com/, who build https://www.scibite.com/platform/termite/
Interesting paper:
Yes, I think everything is going BERT now: large pretrained models that you fine-tune for your use cases.
Hi, would it make sense to fine-tune a BioBERT model on the NER data above as a first step? :) I could look into that if it would help. However, I would certainly need help with the integration/linking. I would suggest to use

Update: I've started work on a Google Colab notebook to fine-tune BioBERT (or technically any other BERT-like/transformer model — basically anything supported by huggingface/transformers) on the NER data linked above.

Current status:

Missing:

Will finalize in the Colab notebook for quick prototyping; we can run the actual training on separate infrastructure. @mpreusse @seangrant82
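One preprocessing step any such fine-tuning needs (a sketch, not code from the notebook — the toy tokenizer and function names below are illustrative): NER labels come per word, but BERT-style tokenizers split words into subword pieces, so the BIO tags have to be re-aligned to the subwords. A common convention is that the first piece keeps the word's tag and continuation pieces get the matching I- tag:

```python
# Sketch: align word-level BIO tags to subword tokens for BERT-style NER
# fine-tuning. `subwords_per_word` stands in for a real tokenizer (e.g.
# from huggingface/transformers); the alignment rule shown (first piece
# keeps the tag, continuations get I-) is one common convention.

def align_bio_tags(words, tags, subwords_per_word):
    """Expand word-level BIO tags to subword-level tags."""
    subword_tokens, subword_tags = [], []
    for word, tag in zip(words, tags):
        pieces = subwords_per_word(word)
        for i, piece in enumerate(pieces):
            subword_tokens.append(piece)
            if i == 0:
                subword_tags.append(tag)          # first piece keeps the tag
            elif tag == "O":
                subword_tags.append("O")          # O continues as O
            else:
                # continuation of an entity word becomes I-<type>
                subword_tags.append("I-" + tag.split("-", 1)[1])
    return subword_tokens, subword_tags

# Toy "tokenizer" that splits long words in half, for illustration only.
def toy_tokenizer(word):
    return [word] if len(word) <= 4 else [word[:4], "##" + word[4:]]

tokens, tags = align_bio_tags(
    ["Remdesivir", "inhibits", "SARS-CoV-2"],
    ["B-DRUG", "O", "B-SPECIES"],
    toy_tokenizer,
)
```

With a real tokenizer the same function applies unchanged; only `subwords_per_word` is swapped out.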
@MFreidank that looks really interesting. To be honest, I don't get all the details of the Colab notebook. Regarding preprocessing: the text is stored on several different nodes (such as

Last question: what does the output of BioBERT look like for NER?
I have tried out BioBERT on various articles. There is a public endpoint exposed that one can use to analyze papers. Input can be either a PubMed ID or text; details are available at https://bern.korea.ac.kr/. An example output looks like:
All entities are available under the "denotation" object. AFAIK there are 5 types available: disease, drug, species, mutation, and gene. Each entity also has one or more IDs assigned from medical databases like Ensembl, MIM, HGNC, MeSH, and some more. This is very useful, as I have noticed that genes with the same Ensembl ID might not have exactly the same name in the output, so it is IMO better to use the IDs and then enrich them with information from the medical databases. You can check the results directly with Neo4j and APOC.
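To illustrate the grouping-by-ID idea, here is a small sketch; the response structure below is a hand-written assumption modelled on the fields described above (text, denotations, span, obj, id), not a captured API response. It maps each database ID to the surface forms it was seen with, so variant spellings of the same gene collapse onto one Ensembl/MeSH identifier:

```python
# Sketch: group BERN-style NER denotations by their database IDs so that
# variant surface forms of the same gene/disease collapse onto one entity.
# `sample_response` is an illustrative stand-in, not real API output.

from collections import defaultdict

sample_response = {
    "text": "BRCA1 mutations increase cancer risk.",
    "denotations": [
        {"obj": "gene", "span": {"begin": 0, "end": 5},
         "id": ["ENSG00000012048", "HGNC:1100"]},
        {"obj": "disease", "span": {"begin": 25, "end": 31},
         "id": ["MESH:D009369"]},
    ],
}

def group_by_id(response):
    """Map each database ID to the set of (type, surface form) mentions."""
    text = response["text"]
    grouped = defaultdict(set)
    for d in response.get("denotations", []):
        mention = text[d["span"]["begin"]:d["span"]["end"]]
        for entity_id in d.get("id", []):
            grouped[entity_id].add((d["obj"], mention))
    return dict(grouped)

entities = group_by_id(sample_response)
```

Enriching each ID from the corresponding database (Ensembl, MeSH, …) would then happen per key of this mapping.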
I am currently in the process of analyzing 57k articles that are available in PubMed; I will share more once the process completes.
OK, so to give a bit more detail about the BioBERT BERN extraction: when we use the PubMed ID as input, it only evaluates the article's abstract for NER extraction. On the other hand, it looks like it pulls abstracts directly from PubMed, as all articles have them available. Out of 55,000 articles evaluated, only around 32,000 had one or more entities detected, which is only about 60%. It would make sense to run the BioBERT extraction on the whole body text to get better results. Looking at the results of the NER extraction (I used the entity text as the unique identifier of an entity), we get 105,000 distinct entities. Some of them could be grouped by medical IDs like MeSH or Ensembl into a single entity. If we group the results by entity type we get:

Genes are by far the most common entity type, but unfortunately only 20% have a valid ID. I imagine there could also be some false positives among the genes without a valid medical ID. I want to run this NER extraction on the whole body text to see what comes out, but it will definitely take some time. I'll try to split the workload to make it faster.
I would be interested in participating in this task. I recently ran an experiment where I used two spaCy NER models (one domain-specific and one general), and then cross-referenced the entities, which allowed me to eliminate or clarify some bad results. For example, if the legal model identified a JUDGE but the general model did not identify the same entity as a PERSON, it was a bad result. I also stored the text around the entity (these were clauses in a legal contract) and maintained the sequence of all entities for further analysis using a sequence model. This is where it would be valuable to have labeled data for training an LSTM. Do you have labeled entities in any of the documents you are processing?
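The cross-referencing idea can be sketched without the actual spaCy models (the two entity lists below are hand-written stand-ins for real model output): keep a domain-specific entity only if the general model tags the same span with a compatible label.

```python
# Sketch: cross-reference entities from a domain-specific NER model with a
# general-purpose one, dropping domain hits the general model contradicts.
# The entity lists and the compatibility table are illustrative stand-ins.

# (span_text, label) pairs from a hypothetical domain-specific model...
domain_entities = [("Judge Smith", "JUDGE"), ("Acme Corp", "JUDGE")]
# ...and from a hypothetical general model over the same text.
general_entities = [("Judge Smith", "PERSON"), ("Acme Corp", "ORG")]

# Which general-model labels make a domain-specific label plausible.
COMPATIBLE = {"JUDGE": {"PERSON"}}

def cross_reference(domain, general):
    """Keep domain entities whose span the general model tags compatibly."""
    general_labels = {}
    for text, label in general:
        general_labels.setdefault(text, set()).add(label)
    kept = []
    for text, label in domain:
        if general_labels.get(text, set()) & COMPATIBLE.get(label, set()):
            kept.append((text, label))
    return kept

validated = cross_reference(domain_entities, general_entities)
```

For the bio-medical case, the compatibility table would map e.g. GENE or DISEASE onto whatever the general model's closest labels are.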
@LegalBIGuy can you join our chat for more details: https://github.com/covidgraph/documentation/wiki/Getting-started-with-the-community
As we get more data streaming through, we should implement named entity recognition (NER) to identify medical entities such as drugs, genes, and cells, and link them to nodes in our current graph to gather more insights.
Here is an example of identifying entities in CORD-19 data:
COVID-19 NER data here: https://uofi.box.com/s/k8pw7d5kozzpoum2jwfaqdaey1oij93x
Training BioBERT would also apply here: https://github.com/dmis-lab/biobert
conceptual schema for entity and entity_types:
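Since the schema diagram did not survive extraction, here is a rough sketch of the proposed shape as plain records (all labels, property names, and relationship types are illustrative guesses, not the final model): an :Entity carries its database IDs, points to its :EntityType, and links back to the text fragment it was found in, which also answers the question above about linking :Entity to its source text.

```python
# Sketch: the conceptual Entity / EntityType shape as plain records.
# All names (node labels, properties, relationship types) are illustrative
# assumptions about the proposed graph schema.

entity_type = {"label": "EntityType", "name": "gene"}

entity = {
    "label": "Entity",
    "text": "BRCA1",
    "ids": ["ENSG00000012048"],  # IDs from medical databases
}

relationships = [
    ("Entity", "HAS_TYPE", "EntityType"),    # entity -> its type
    ("Entity", "MENTIONED_IN", "Fragment"),  # link back to the source text
]
```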
This relates to the following data sources:
#1 (CORD-19)