The package implements routines that use pre-trained language models (BERT and DistilBERT in particular) to perform (1) Word Sense Induction and (2) induction of a hierarchy of classes for the annotated words/phrases. Both methods first use these models to produce contextual substitutes for the given annotated entities in the corpus. These substitutes are then grouped by factorizing binary matrices, in particular with the Formal Concept Analysis methodology; a minimal sketch of the substitute-generation step follows the references below. Further information can be found in these papers and blog posts:
-
  ```bibtex
  @incollection{revenko2022learning,
    title     = {Learning Ontology Classes from Text by Clustering Lexical Substitutes Derived from Language Models},
    author    = {Revenko, Artem and Mireles, Victor and Breit, Anna and Bourgonje, Peter and Moreno-Schneider, Julian and Khvalchik, Maria and Rehm, Georg},
    booktitle = {Towards a Knowledge-Aware AI},
    pages     = {155--169},
    year      = {2022},
    publisher = {IOS Press}
  }
  ```
-
  ```bibtex
  @InProceedings{10.1007/978-3-030-27684-3_22,
    author    = "Revenko, Artem and Mireles, Victor",
    editor    = "Anderst-Kotsis, Gabriele and Tjoa, A Min and Khalil, Ismail and Elloumi, Mourad and Mashkoor, Atif and Sametinger, Johannes and Larrucea, Xabier and Fensel, Anna and Martinez-Gil, Jorge and Moser, Bernhard and Seifert, Christin and Stein, Benno and Granitzer, Michael",
    title     = "The Use of Class Assertions and Hypernyms to Induce and Disambiguate Word Senses",
    booktitle = "Database and Expert Systems Applications",
    year      = "2019",
    publisher = "Springer International Publishing",
    pages     = "172--181",
    isbn      = "978-3-030-27684-3"
  }
  ```
- https://medium.com/@revenkoartem/label-unstructured-data-using-enterprise-knowledge-graphs-2-d84bda281270
- https://revenkoartem.medium.com/learning-ontology-classes-from-text-14773e61c076
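To give an intuition for the substitute-generation step, here is a minimal sketch that calls the HuggingFace `fill-mask` pipeline directly. It is not the package's own API (see `./scripts` for that); the example sentence and `top_k` value are purely illustrative:

```python
# Minimal sketch of contextual substitute generation with a masked LM.
# This bypasses ptlm_wsid and uses HuggingFace Transformers directly.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="distilbert-base-multilingual-cased")

# Mask the annotated target word to obtain contextual substitutes for it.
sentence = "He deposited the money at the [MASK] on Friday."
for pred in fill_mask(sentence, top_k=5):
    # Each prediction carries the substitute token and its probability.
    print(pred["token_str"], round(pred["score"], 3))
```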
The language model is loaded in `./ptlm_wsid/target_context.py` in `load_model`. The current implementation supports BERT and DistilBERT from the HuggingFace Transformers library. By default `distilbert-base-multilingual-cased` is used; to switch to BERT, define the environment variable `TRANSFORMER_MODEL=bert-base-multilingual-cased`. For further models see the HuggingFace documentation.
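For example, the variable can also be set from Python, as long as this happens before the package loads the model (a minimal sketch; `TRANSFORMER_MODEL` is the variable named above):

```python
import os

# TRANSFORMER_MODEL is the environment variable named in this README.
# It must be set before the package calls load_model.
os.environ["TRANSFORMER_MODEL"] = "bert-base-multilingual-cased"

# ... then import and use ptlm_wsid as usual.
```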
You can use the `Dockerfile` to run the scripts in a Docker container:

- `cd` to the root folder where the `Dockerfile` is located.
- Run `DOCKER_BUILDKIT=1 docker build -t ptlm_wsid .`
- Run `docker run -it ptlm_wsid`

By default this executes the `./scripts/wsid_example.py` script; change the last line in the `Dockerfile` to run a different script.
Find usage examples in the `./scripts` folder.
Easy install with pip from the GitHub repo:

```
pip install git+https://github.com/semantic-web-company/ptlm_wsid.git#egg=ptlm_wsid
```
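Once installed, a quick way to check the setup is to import the loader mentioned above (a minimal smoke test; `load_model`'s signature is not documented here, so no arguments are assumed):

```python
# Smoke test: the module path and function name come from this README.
from ptlm_wsid.target_context import load_model

print(load_model)  # the callable should resolve if installation succeeded
```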
This project is licensed under the MIT License.