I am following the step-by-step guide on how to prepare the data, train the retriever and reader, and evaluate the model.
The issue is that the command below produces an empty JSONL file:
relik data convert-to-dpr \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data/kb/wikipedia/documents.jsonl \
  --title-map data/kb/wikipedia/title_map.json
The same happens with train/dev/test subsets of the BLINK dataset as well as with the AIDA dataset.
I noticed that if I remove or comment out the following lines in the convert_to_dpr function in relik/cli/data.py:
    if len(positive_pssgs) == 0:
        continue
then the output JSONL file is populated with records like the ones below (a sample from the generated file blink-dev-kilt-relik-windowed-dpr.jsonl for the document with id=0):
{"id": "0_0", "doc_topic": "{", "question": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_31", "doc_topic": "{", "question": ""On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of France and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_119", "doc_topic": "{", "question": "place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_198", "doc_topic": "{", "question": "Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the [", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_290", "doc_topic": "{", "question": ". This (a precursor to the Battle of Schleiz on 9 October and the [START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_357", "doc_topic": "{", "question": "START_ENT] Battle of Jena-Auerstadt [END_ENT] on 14 October), was the first battle of the War of the Fourth Coalition.", "output"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_423", "doc_topic": "{", "question": "the first battle of the War of the Fourth Coalition.", "output": [{"answer": "Battle of Jena\u2013Auerstedt", "", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_486", "doc_topic": "{", "question": ": [{"answer": "Battle of Jena\u2013Auerstedt", "provenance": [{"wikipedia_id": "295160", "title"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_529", "doc_topic": "{", "question": "provenance": [{"wikipedia_id": "295160", "title": "Battle of Jena\u2013Auerstedt"}]}], "meta"", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_577", "doc_topic": "{", "question": ": "Battle of Jena\u2013Auerstedt"}]}], "meta": {"mention": "Battle of Jena-Auerstadt", "left_context", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_617", "doc_topic": "{", "question": ": {"mention": "Battle of Jena-Auerstadt", "left_context": "On 8 October 1806, Napoleon's troops first entered Prussian territory and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_672", "doc_topic": "{", "question": "": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took place on the banks of the Saale between the forces of Napoleon I of", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_750", "doc_topic": "{", "question": "battles took place on the banks of the Saale between the forces of Napoleon I of France and Frederick William III of Prussia. Napoleon spent the night of 8 October in", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_831", "doc_topic": "{", "question": "France and Frederick William III of Prussia. Napoleon spent the night of 8 October in Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_917", "doc_topic": "{", "question": "Schloss Ebersdorf. This (a precursor to the Battle of Schleiz on 9 October and the", "right_context": "on 14 October), was the first", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
{"id": "0_979", "doc_topic": "{", "question": "on 9 October and the", "right_context": "on 14 October), was the first battle of the War of the Fourth Coalition."}}", "positive_ctxs": [], "negative_ctxs": "", "hard_negative_ctxs": ""}
Note that the properties "positive_ctxs", "negative_ctxs", and "hard_negative_ctxs" are empty.
This is the case for every document and every line in the JSONL file. Please confirm whether this is expected.
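To quantify this rather than inspect lines by eye, a small script along the following lines could count how many records lack positives. This is a minimal sketch: the function name `count_empty_positives` is hypothetical and not part of relik, and it assumes each line of the file is one valid JSON object with a "positive_ctxs" key as in the sample above.

```python
import json
from pathlib import Path


def count_empty_positives(path):
    """Return (total_records, records_without_positives) for a DPR-style JSONL file.

    Hypothetical helper, not part of relik. A record counts as "without
    positives" when "positive_ctxs" is missing, empty, or otherwise falsy.
    """
    total = 0
    empty = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            record = json.loads(line)
            total += 1
            if not record.get("positive_ctxs"):
                empty += 1
    return total, empty
```

Calling `count_empty_positives("data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl")` on the file generated with the check removed would show whether every record is affected or only some.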
I assume this comes from the following lines in the convert_to_dpr function in relik/cli/data.py:
for idx, entity in enumerate(sentence["window_labels"]):
    entity = entity[2]
    ...
    if entity in documents:
        doc = documents.get_document_from_text(entity)
        ...
        positive_pssgs.append(doc.to_dict())
    ...
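The effect of that loop can be illustrated with a simplified stand-in. This is only a sketch of the logic in the excerpt, not the actual relik code: `collect_positives` and `convert_windows` are hypothetical names, a plain dict replaces the real document store, and the (start, end, title) layout of each label is an assumption based on the `entity[2]` access.

```python
def collect_positives(window, documents):
    """Collect positive passages for one window.

    `documents` is a plain dict mapping an entity title to a passage dict;
    it stands in for the real document store (assumption for illustration).
    """
    positives = []
    for label in window["window_labels"]:
        entity = label[2]  # assumed layout: (start, end, entity_title)
        if entity in documents:
            positives.append(documents[entity])
    return positives


def convert_windows(windows, documents):
    """Keep only windows that have at least one positive passage."""
    converted = []
    for window in windows:
        positives = collect_positives(window, documents)
        if len(positives) == 0:
            continue  # same skip as in the excerpt: no positives, no output record
        converted.append({"question": window["text"], "positive_ctxs": positives})
    return converted
```

Under these assumptions, a window whose "window_labels" list is empty never enters the inner loop, so it yields no positives and is skipped. If every window is like that, the output is empty, which matches the behaviour described above.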
Looking at the JSONL file generated in the previous step by the command below:
relik data create-windows \
  data/blink/processed/blink-dev-kilt-relik.jsonl \
  data/blink/processed/blink-dev-kilt-relik-windowed.jsonl
the "window_labels" property is empty on every line (a sample from the generated file blink-dev-kilt-relik-windowed.jsonl for the document with id=0):
{"doc_id": 0, "window_id": 0, "text": "{"id": "blink-dev-0", "input": "On 8 October 1806, Napoleon's troops first entered Prussian territory and battles took", "tokens": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "words": ["{", """, "i", "d", """, ":", """, "blink", "-", "dev-0", """, ",", """, "input", """, ":", """, "On", "8", "October", "1806", ",", "Napoleon", "'s", "troops", "first", "entered", "Prussian", "territory", "and", "battles", "took"], "doc_topic": "{", "offset": 0, "spans": [], "token2char_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "6": 7, "7": 8, "8": 13, "9": 14, "10": 19, "11": 20, "12": 22, "13": 23, "14": 28, "15": 29, "16": 31, "17": 32, "18": 35, "19": 37, "20": 45, "21": 49, "22": 51, "23": 59, "24": 62, "25": 69, "26": 75, "27": 83, "28": 92, "29": 102, "30": 106, "31": 114}, "token2char_end": {"0": 1, "1": 2, "2": 3, "3": 4, "4": 5, "5": 6, "6": 8, "7": 13, "8": 14, "9": 19, "10": 20, "11": 21, "12": 23, "13": 28, "14": 29, "15": 30, "16": 32, "17": 34, "18": 36, "19": 44, "20": 49, "21": 50, "22": 59, "23": 61, "24": 68, "25": 74, "26": 82, "27": 91, "28": 101, "29": 105, "30": 113, "31": 118}, "char2token_start": {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5, "7": 6, "8": 7, "13": 8, "14": 9, "19": 10, "20": 11, "22": 12, "23": 13, "28": 14, "29": 15, "31": 16, "32": 17, "35": 18, "37": 19, "45": 20, "49": 21, "51": 22, "59": 23, "62": 24, "69": 25, "75": 26, "83": 27, "92": 28, "102": 29, "106": 30, "114": 31}, "char2token_end": {"1": 0, "2": 1, "3": 2, "4": 3, "5": 4, "6": 5, "8": 6, "13": 7, "14": 8, "19": 9, "20": 10, "21": 11, "23": 12, "28": 13, "29": 14, "30": 15, "32": 16, "34": 17, "36": 18, "44": 19, "49": 20, "50": 21, "59": 22, "61": 23, "68": 24, "74": 25, "82": 26, "91": 27, "101": 28, "105": 29, "113": 30, "118": 31}, 
"window_labels": [], "window_labels_tokens": []}
Please check whether the data preparation for training is working as expected.
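The empty "window_labels" reported above can be checked across a whole windowed file with a short script. The sketch below uses a hypothetical helper, `summarize_windows`, and assumes one valid JSON object per line. Because the sample above shows "text" beginning with the raw JSON of the input record ("doc_topic" is "{"), it also counts windows whose text starts with "{". That extra count is only a heuristic and may or may not be related to the empty labels.

```python
import json


def summarize_windows(path):
    """Summarize a windowed JSONL file (hypothetical helper, not part of relik).

    Counts the windows, the windows with empty or missing "window_labels",
    and the windows whose "text" starts with "{" (heuristic only).
    """
    stats = {"total": 0, "empty_labels": 0, "text_starts_with_brace": 0}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            window = json.loads(line)
            stats["total"] += 1
            if not window.get("window_labels"):
                stats["empty_labels"] += 1
            if str(window.get("text", "")).lstrip().startswith("{"):
                stats["text_starts_with_brace"] += 1
    return stats
```

If `empty_labels` equals `total` for every split, the problem is already present in the output of create-windows rather than being introduced by convert-to-dpr.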
With the generated train/dev/test files for the BLINK and AIDA datasets, I moved on to the next step: training the retriever model for entity linking.
After running the command below:
relik retriever train relik/retriever/conf/pretrain_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl
I get the following error:
Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/blink/processed/blink-train-kilt-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/blink/processed/blink-dev-kilt-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/blink/processed/blink-test-kilt-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.InBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train
The same happens with the command below:
relik retriever train relik/retriever/conf/finetune_iterable_in_batch.yaml \
  model.language_model=intfloat/e5-base-v2 \
  data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl \
  data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl \
  data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl \
  data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl
which fails with the following error:
Error executing job with overrides: ['model.language_model=intfloat/e5-base-v2', 'data.train_dataset_path=data/aida/processed/aida-train-relik-windowed-dpr.jsonl', 'data.val_dataset_path=data/aida/processed/aida-dev-relik-windowed-dpr.jsonl', 'data.test_dataset_path=data/aida/processed/aida-test-relik-windowed-dpr.jsonl', 'data.shared_params.documents_path=data/kb/wikipedia/documents.jsonl']
Error in call to target 'relik.retriever.data.datasets.AidaInBatchNegativesDataset':
IndexError('list index out of range')
full_key: datasets.train
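If the IndexError is caused by the empty DPR files (an assumption, since the messages above do not show where the index lookup fails), a pre-flight check along these lines would catch the problem before training starts. The helper names `count_records` and `find_empty_datasets` are hypothetical and not part of relik.

```python
from pathlib import Path


def count_records(path):
    """Count non-blank lines in a JSONL file; a missing file counts as 0 records."""
    file_path = Path(path)
    if not file_path.exists():
        return 0
    with file_path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def find_empty_datasets(paths):
    """Return the paths that contain no records (hypothetical pre-flight check)."""
    return [str(path) for path in paths if count_records(path) == 0]
```

For example, passing the three BLINK paths from the command above to `find_empty_datasets` and stopping when the returned list is non-empty would separate "the input files are empty" from any other cause of the error.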
I assume all of this is related, so please take a look and provide feedback.
Thanks and best regards.