This fork of LUKE includes changes that make the pretraining and NER fine-tuning tasks compatible with transformers 4 and newer versions of other dependencies, ultimately allowing training of a Danish LUKE -- see the main Danish LUKE repo. Everything below this point is from the original LUKE repo.
===================================================================================================
LUKE (Language Understanding with Knowledge-based Embeddings) is a new pre-trained contextualized representation of words and entities based on the transformer architecture. It achieves state-of-the-art results on important NLP benchmarks including SQuAD v1.1 (extractive question answering), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), TACRED (relation classification), and Open Entity (entity typing).
This repository contains the source code to pre-train the model and fine-tune it to solve downstream tasks.
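Since this fork targets transformers 4, the released checkpoints can also be used directly through the Hugging Face transformers library. The following is a minimal sketch, assuming the studio-ousia/luke-base checkpoint published on the Hugging Face Hub:

```python
import torch
from transformers import LukeTokenizer, LukeModel

# Assumption: the studio-ousia/luke-base checkpoint is available on the Hugging Face Hub
tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

text = "Beyoncé lives in Los Angeles."
# Character-level spans of the entity mentions we want entity representations for
entity_spans = [(0, 7), (17, 28)]  # "Beyoncé", "Los Angeles"

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

word_embeddings = outputs.last_hidden_state            # contextualized word representations
entity_embeddings = outputs.entity_last_hidden_state   # contextualized entity representations
```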
November 5, 2021: LUKE-500K (base) model
We released LUKE-500K (base), a new pretrained LUKE model which is smaller than the existing LUKE-500K (large). The experimental results of LUKE-500K (base) and LUKE-500K (large) on SQuAD v1.1 and CoNLL-2003 are shown below:
Task | Dataset | Metric | LUKE-500K (base) | LUKE-500K (large) |
---|---|---|---|---|
Extractive Question Answering | SQuAD v1.1 | EM/F1 | 86.1/92.3 | 90.2/95.4 |
Named Entity Recognition | CoNLL-2003 | F1 | 93.3 | 94.3 |
In the LUKE-500K (base) experiments, we tuned only the batch size and learning rate.
LUKE outperforms the previous state-of-the-art methods on five important NLP tasks:
Task | Dataset | Metric | LUKE-500K (large) | Previous SOTA |
---|---|---|---|---|
Extractive Question Answering | SQuAD v1.1 | EM/F1 | 90.2/95.4 | 89.9/95.1 (Yang et al., 2019) |
Named Entity Recognition | CoNLL-2003 | F1 | 94.3 | 93.5 (Baevski et al., 2019) |
Cloze-style Question Answering | ReCoRD | EM/F1 | 90.6/91.2 | 83.1/83.7 (Li et al., 2019) |
Relation Classification | TACRED | F1 | 72.7 | 72.0 (Wang et al., 2020) |
Fine-grained Entity Typing | Open Entity | F1 | 78.2 | 77.6 (Wang et al., 2020) |
These numbers are reported in our EMNLP 2020 paper.
LUKE can be installed using Poetry:
$ poetry install
The virtual environment automatically created by Poetry can be activated with `poetry shell`.
We initially release the pre-trained model with a 500K entity vocabulary based on the roberta.large model.
Name | Base Model | Entity Vocab Size | Params | Download |
---|---|---|---|---|
LUKE-500K (base) | roberta.base | 500K | 253 M | Link |
LUKE-500K (large) | roberta.large | 500K | 483 M | Link |
The experiments were conducted using Python 3.6 and PyTorch 1.2.0 installed on a server with a single NVIDIA V100 GPU or eight V100 GPUs. We used NVIDIA's PyTorch Docker container 19.02. For computational efficiency, we used mixed-precision training based on the APEX library, which can be installed as follows:
$ git clone https://github.com/NVIDIA/apex.git
$ cd apex
$ git checkout c3fad1ad120b23055f6630da0b029c8b626db78f
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" .
The APEX library is not needed if you do not use the --fp16 option or if you only reproduce the results from the trained checkpoint files.
The commands to reproduce the experimental results are provided below.
Entity Typing on Open Entity
Dataset: Link
Checkpoint file (compressed): Link
Using the checkpoint file:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
entity-typing run \
--data-dir=<DATA_DIR> \
--checkpoint-file=<CHECKPOINT_FILE> \
--no-train
Fine-tuning the model:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
entity-typing run \
--data-dir=<DATA_DIR> \
--train-batch-size=2 \
--gradient-accumulation-steps=2 \
--learning-rate=1e-5 \
--num-train-epochs=3 \
--fp16
Relation Classification on TACRED
Dataset: Link
Checkpoint file (compressed): Link
Using the checkpoint file:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
relation-classification run \
--data-dir=<DATA_DIR> \
--checkpoint-file=<CHECKPOINT_FILE> \
--no-train
Fine-tuning the model:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
relation-classification run \
--data-dir=<DATA_DIR> \
--train-batch-size=4 \
--gradient-accumulation-steps=8 \
--learning-rate=1e-5 \
--num-train-epochs=5 \
--fp16
Named Entity Recognition on CoNLL-2003
Dataset: Link
Checkpoint file (compressed): Link
Using the checkpoint file:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
ner run \
--data-dir=<DATA_DIR> \
--checkpoint-file=<CHECKPOINT_FILE> \
--no-train
Fine-tuning the model:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
ner run \
--data-dir=<DATA_DIR> \
--train-batch-size=2 \
--gradient-accumulation-steps=2 \
--learning-rate=1e-5 \
--num-train-epochs=5 \
--fp16
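Because this fork's goal is transformers 4 compatible NER fine-tuning, the CoNLL-2003 model can also be run for inference through the transformers library. A minimal sketch, assuming the studio-ousia/luke-large-finetuned-conll-2003 checkpoint on the Hugging Face Hub:

```python
import torch
from transformers import LukeTokenizer, LukeForEntitySpanClassification

# Assumption: the studio-ousia/luke-large-finetuned-conll-2003 checkpoint is on the Hugging Face Hub
checkpoint = "studio-ousia/luke-large-finetuned-conll-2003"
tokenizer = LukeTokenizer.from_pretrained(checkpoint)
model = LukeForEntitySpanClassification.from_pretrained(checkpoint)

text = "Beyoncé lives in Los Angeles"
# LUKE's NER head classifies candidate spans, so enumerate all word-level spans as candidates
word_starts = [0, 8, 14, 17, 21]   # character offset where each word starts
word_ends = [7, 13, 16, 20, 28]    # character offset where each word ends
entity_spans = [
    (start, end)
    for i, start in enumerate(word_starts)
    for end in word_ends[i:]
]

inputs = tokenizer(text, entity_spans=entity_spans, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predictions = logits.argmax(-1).squeeze().tolist()
for span, label_id in zip(entity_spans, predictions):
    if label_id != 0:  # class 0 means the span is not an entity
        print(text[span[0]:span[1]], model.config.id2label[label_id])
```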
Cloze-style Question Answering on ReCoRD
Dataset: Link
Checkpoint file (compressed): Link
Using the checkpoint file:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
entity-span-qa run \
--data-dir=<DATA_DIR> \
--checkpoint-file=<CHECKPOINT_FILE> \
--no-train
Fine-tuning the model:
$ python -m examples.cli \
--num-gpus=8 \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
entity-span-qa run \
--data-dir=<DATA_DIR> \
--train-batch-size=1 \
--gradient-accumulation-steps=4 \
--learning-rate=1e-5 \
--num-train-epochs=2 \
--fp16
Extractive Question Answering on SQuAD 1.1
Dataset: Link
Checkpoint file (compressed): Link
Wikipedia data files (compressed): Link
Using the checkpoint file:
$ python -m examples.cli \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
reading-comprehension run \
--data-dir=<DATA_DIR> \
--checkpoint-file=<CHECKPOINT_FILE> \
--no-negative \
--wiki-link-db-file=enwiki_20160305.pkl \
--model-redirects-file=enwiki_20181220_redirects.pkl \
--link-redirects-file=enwiki_20160305_redirects.pkl \
--no-train
Fine-tuning the model:
$ python -m examples.cli \
--num-gpus=8 \
--model-file=luke_large_500k.tar.gz \
--output-dir=<OUTPUT_DIR> \
reading-comprehension run \
--data-dir=<DATA_DIR> \
--no-negative \
--wiki-link-db-file=enwiki_20160305.pkl \
--model-redirects-file=enwiki_20181220_redirects.pkl \
--link-redirects-file=enwiki_20160305_redirects.pkl \
--train-batch-size=2 \
--gradient-accumulation-steps=3 \
--learning-rate=15e-6 \
--num-train-epochs=2 \
--fp16
If you use LUKE in your work, please cite the original paper:
@inproceedings{yamada2020luke,
title={LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention},
author={Ikuya Yamada and Akari Asai and Hiroyuki Shindo and Hideaki Takeda and Yuji Matsumoto},
booktitle={EMNLP},
year={2020}
}
Please submit a GitHub issue or send an e-mail to Ikuya Yamada ([email protected]) for help or issues using LUKE.