Converting annotations from BioC XML to spaCy format with Python
This project converts TeamTat annotations stored as BioC XML into spaCy's entity annotation format.
BioC XML format of a TeamTat annotation
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE collection SYSTEM "BioC.dtd">
<collection>
<source>PubTator</source>
<date/>
<key>BioC.key</key>
<document>
<id>3392027</id>
<infon key="tt_curatable">no</infon>
<infon key="tt_version">0</infon>
<infon key="tt_round">0</infon>
<passage>
<infon key="type">title</infon>
<offset>0</offset>
<text>Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria. Potential amphipathic structures and molecular evolution of an insect apolipoprotein.</text>
<annotation id="1">
<infon key="identifier"></infon>
<infon key="type">Gene</infon>
<infon key="updated_at">1980-01-01T00:00:00Z</infon>
<location offset="21" length="17"/>
<text>apolipophorin-III</text>
</annotation>
<annotation id="2">
<infon key="identifier">7004</infon>
<infon key="type">Species</infon>
<infon key="updated_at">1980-01-01T00:00:00Z</infon>
<location offset="48" length="16"/>
<text>migratory locust</text>
</annotation>
</passage>
</document>
</collection>
spaCy entity annotation format
train_data = [
    ('Primary structure of ....', {'entities': [
        (21, 38, 'Gene'), (48, 64, 'Species'), (66, 84, 'Species'),
        (225, 242, 'Gene'), (248, 266, 'Species'), (423, 440, 'Gene'),
        (450, 466, 'Species'), (468, 481, 'Species'), (597, 610, 'Species'),
        (969, 978, 'Species'), (1159, 1168, 'Species'), (1234, 1243, 'Species'),
    ]}),
    # (ex2), (ex3): further examples in the same shape
]
# Each item: ("example text" (str), {"entities": [(start (int), end (int), "label" (str)), ...]})
# start is the BioC location offset; end is offset + length (e.g. 21 + 17 = 38)
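The mapping between the two formats can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the code in BioCxml2spacy.py: it uses the standard library's `xml.etree.ElementTree` instead of lxml, the function name `bioc_to_spacy` and the `SAMPLE` string are hypothetical, and it assumes that each passage's `<offset>` is its position in the document text, so passages can be joined while keeping the document-level offsets.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the BioC XML sample above.
SAMPLE = """<collection>
  <document>
    <id>3392027</id>
    <passage>
      <offset>0</offset>
      <text>Primary structure of apolipophorin-III from the migratory locust, Locusta migratoria.</text>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <location offset="21" length="17"/>
        <text>apolipophorin-III</text>
      </annotation>
      <annotation id="2">
        <infon key="type">Species</infon>
        <location offset="48" length="16"/>
        <text>migratory locust</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def bioc_to_spacy(xml_string):
    """Return one (text, {'entities': [...]}) tuple per <document>."""
    root = ET.fromstring(xml_string)
    train_data = []
    for document in root.iter("document"):
        doc_text = ""
        entities = []
        for passage in document.iter("passage"):
            offset = int(passage.findtext("offset", default="0"))
            # Pad with spaces so the passage starts at its document-level offset.
            doc_text = doc_text.ljust(offset) + (passage.findtext("text") or "")
            for annotation in passage.iter("annotation"):
                label = annotation.findtext("infon[@key='type']")
                location = annotation.find("location")
                start = int(location.get("offset"))
                end = start + int(location.get("length"))  # spaCy uses an exclusive end
                entities.append((start, end, label))
        train_data.append((doc_text, {"entities": sorted(entities)}))
    return train_data


if __name__ == "__main__":
    for text, annotations in bioc_to_spacy(SAMPLE):
        print(text[:30], annotations)
```

Slicing the text with an entity's start and end, for example `text[21:38]`, should return the annotated string, which is a quick way to check that the offsets survived the conversion.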
- Python 3.x (x > 4), that is, Python 3.5 or later
Check with the command: python --version
- pip (package manager)
Check with the command: pip --version
- lxml module
Install with the command: pip install lxml
Then run pip list and check that lxml appears in the output.
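The same checks can also be run from Python. The helper below is a hypothetical convenience, not part of the repository; it assumes that "x > 4" means Python 3.5 or later and only tests whether the lxml module can be found, not which version is installed.

```python
import importlib.util
import sys


def check_requirements(min_version=(3, 5), module="lxml"):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_version:
        problems.append("Python %d.%d or later is required" % min_version)
    if importlib.util.find_spec(module) is None:
        problems.append("%s is not installed; try: pip install %s" % (module, module))
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print(problem)
```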
- Step1 - Clone the repo to local
git clone https://github.com/ShangYuChiang/BioCxml2spacy.git
cd BioCxml2spacy
- Step2 - Run BioCxml2spacy.py
python BioCxml2spacy.py
- Step3 - The results are written to the output file