Empty jsonl file after conversion to DPR format #13

Open
jblagoja opened this issue Sep 7, 2024 · 1 comment
jblagoja commented Sep 7, 2024

I am following the step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.

The problem is that running the command below produces an empty JSONL file:

relik data convert-to-dpr \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json

The same happens with train/dev/test subsets of the BLINK dataset as well as with the AIDA dataset.

I noticed that if I remove or comment out the following lines in the convert_to_dpr function in relik/cli/data.py:

if len(positive_pssgs) == 0:
    continue

then the output JSONL file is populated with records like the following (a sample from the generated file blink-dev-kilt-relik-windowed-dpr.jsonl for the document with id=0):

{"id": "0_0", "doc_topic": "{", "question": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_31", "doc_topic": "{", "question": ""On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of France and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_119", "doc_topic": "{", "question": "place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_198", "doc_topic": "{", "question": "Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the [", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_290", "doc_topic": "{", "question": ". This (a precursor to the Battle of Schleiz on 9 October and the [START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_357", "doc_topic": "{", "question": "START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was the first battle of the War of the Fourth Coalition.", "output"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_423", "doc_topic": "{", "question": "the first battle of the War of the Fourth Coalition.", "output": [{"answer": "Battle of Jena\u2013Auerstedt", "", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_486", "doc_topic": "{", "question": ": [{"answer": "Battle of Jena\u2013Auerstedt", "provenance": [{"wikipedia_id": "295160", "title"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_529", "doc_topic": "{", "question": "provenance": [{"wikipedia_id": "295160", "title": "Battle of Jena\u2013Auerstedt"}]}], "meta"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_577", "doc_topic": "{", "question": ": "Battle of Jena\u2013Auerstedt"}]}], "meta": {"mention": "Battle of Jena-Auerstadt", "left_context", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_617", "doc_topic": "{", "question": ": {"mention": "Battle of Jena-Auerstadt", "left_context": "On 8 October 1806, Napoleon's troops first entered Prussian territory and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_672", "doc_topic": "{", "question": "": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_750", "doc_topic": "{", "question": "battles took place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_831", "doc_topic": "{", "question": "France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_917", "doc_topic": "{", "question": "Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the", "right_context": "on 14 October), was the first", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_979", "doc_topic": "{", "question": "on 9 October and the", "right_context": "on 14 October), was the first battle of the War of the Fourth Coalition."}}", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}

Note that "positive_ctxs" is an empty list, and "negative_ctxs" and "hard_negative_ctxs" are empty strings.
This holds for every document and every line in the JSONL file. Please confirm whether this is expected.

I assume this comes from the following lines in the convert_to_dpr function in relik/cli/data.py:

for idx, entity in enumerate(sentence["window_labels"]):
    entity = entity[2]
    ...
    if entity in documents:
        doc = documents.get_document_from_text(entity)
        ...
        positive_pssgs.append(doc.to_dict())
    ...
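
To illustrate why an empty "window_labels" list would lead to an empty output file, here is a minimal sketch of the control flow quoted above. It is not relik's actual implementation: `documents` is a plain dict standing in for the document store, `collect_positives` and `convert` are hypothetical helper names, and the assumption that each label is a (start, end, title) triple is only inferred from the `entity[2]` access.

```python
def collect_positives(window, documents):
    """Keep the documents for labels whose title (index 2) is in `documents`."""
    positives = []
    for label in window["window_labels"]:
        entity = label[2]  # assumed layout: (start, end, title)
        if entity in documents:
            positives.append(documents[entity])
    return positives


def convert(windows, documents):
    """Build DPR-style records, dropping windows without any positive passage."""
    records = []
    for window in windows:
        positives = collect_positives(window, documents)
        if len(positives) == 0:
            continue  # same guard as in the issue: nothing is written for this window
        records.append({"question": window["text"], "positive_ctxs": positives})
    return records


# Toy data: a stand-in document store and two windows.
documents = {"Battle of Jena-Auerstedt": {"id": 0, "text": "Battle of Jena-Auerstedt"}}
unlabeled = {"text": "On 8 October 1806 ...", "window_labels": []}
labeled = {"text": "the Battle of Jena-Auerstadt", "window_labels": [[4, 28, "Battle of Jena-Auerstedt"]]}
```

With this sketch, a list containing only `unlabeled` windows converts to no records at all, which would match the empty file described above if every window in the input has no labels.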

If we look at the JSONL file generated in the previous step by running the command below:

relik data create-windows \
  data/blink/processed/blink-dev-kilt-relik.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl

we find that the "window_labels" property is empty on every line (a sample from the generated file blink-dev-kilt-relik-windowed.jsonl for the document with id=0):

{"doc_id": 0, "window_id": 0, "text": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "tokens": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "words": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "doc_topic": "{", "offset": 0, "spans": [], "token2char_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 7, "7": 8, "8": 13, "9": 14, "10": 19, "11": 20, "12": 22, "13": 23, "14": 28, "15": 29, "16": 31, "17": 32, "18": 35, "19": 37, "20": 45, "21": 49, "22": 51, "23": 59, "24": 62, "25": 69, "26": 75, "27": 83, "28": 92, "29": 102, "30": 106, "31": 114}, "token2char_end": {"0": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6, "6": 8, "7": 13, "8": 14, "9": 19, "10": 20, "11": 21, "12": 23, "13": 28, "14": 29, "15": 30, "16": 32, "17": 34, "18": 36, "19": 44, "20": 49, "21": 50, "22": 59, "23": 61, "24": 68, "25": 74, "26": 82, "27": 91, "28": 101, "29": 105, "30": 113, "31": 118}, "char2token_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "7": 6, "8": 7, "13": 8, "14": 9, "19": 10, "20": 11, "22": 12, "23": 13, "28": 14, "29": 15, "31": 16, "32": 17, "35": 18, "37": 19, "45": 20, "49": 21, "51": 22, "59": 23, "62": 24, "69": 25, "75": 26, "83": 27, "92": 28, "102": 29, "106": 30, "114": 31}, "char2token_end": {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4, "6": 5, "8": 6, "13": 7, "14": 8, "19": 9, "20": 10, "21": 11, "23": 12, "28": 13, "29": 14, "30": 15, "32": 16, "34": 17, "36": 18, "44": 19, "49": 20, "50": 21, "59": 22, "61": 23, "68": 24, "74": 25, "82": 26, "91": 27, "101": 28, "105": 29, "113": 30, "118": 31}, 
"window_labels": [], "window_labels_tokens": []}

Please check whether the data preparation needed for training behaves as expected.
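
A quick way to verify the windowed file is to count how many windows carry at least one label. The sketch below assumes only what the sample above shows: one JSON object per line with a "window_labels" key. The function name `count_labeled_windows` is hypothetical and not part of relik.

```python
import json
from pathlib import Path


def count_labeled_windows(path):
    """Return (total_windows, windows_with_labels) for a windowed JSONL file."""
    total = 0
    labeled = 0
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            window = json.loads(line)
            total += 1
            if window.get("window_labels"):
                labeled += 1
    return total, labeled
```

If the second number is 0 for a file such as data/blink/processed/blink-dev-kilt-relik-windowed.jsonl, the problem is already present after create-windows rather than in convert-to-dpr. In the sample above, the "text" field also appears to hold the raw JSON of the input record (it starts with `{"id": "blink-dev-0"`, and "doc_topic" is "{"), which may be worth checking as well.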

With the generated train/dev/test files for the BLINK and AIDA datasets, I moved on to the next step: training the retriever model for Entity Linking.

After running the command below:

relik retriever train relik/retriever/conf/pretrain_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

I get the following error:

Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.InBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train

The same happens with the command below:

relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl

which produces the following error:

Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.AidaInBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train

I assume these problems are all related, so please take a look and provide feedback.
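
If the IndexError is indeed caused by the empty dataset files (an assumption on my side, since the error message does not show where the index lookup fails), a small pre-flight check before training would at least make the cause visible. This is a sketch using only the standard library; `find_empty_datasets` is a hypothetical helper, not a relik function.

```python
import os


def find_empty_datasets(paths):
    """Return the paths that are missing or contain zero bytes, in input order."""
    return [p for p in paths if not os.path.isfile(p) or os.path.getsize(p) == 0]


# Paths taken from the BLINK training command above.
blink_paths = [
    "data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl",
    "data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl",
    "data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl",
]
problems = find_empty_datasets(blink_paths)
if problems:
    print("Empty or missing dataset files:", problems)
```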

Thanks and best regards.

@Riccorl Riccorl self-assigned this Sep 9, 2024
Collaborator

Riccorl commented Sep 12, 2024

Hi! Thanks for reporting this issue. We will take a look asap!
