Bug Description
It appears that a significant issue affects XLNet with CLM, and potentially other models as well. When using the trainer's
evaluation method, the NDCG and MRR scores are near-perfect after just one training epoch.
Inspecting the evaluation process suggests that the model is able to predict the missing item_id,
most likely because of information leakage.
This bug affects the trainer.evaluate method and, consequently, every eval_steps evaluation during training,
so the automatic best-model-saving procedure produces incorrect results.
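To make the impact concrete, here is a minimal sketch of training arguments under which the broken evaluation silently drives checkpoint selection. All values, the output path, and the metric key are hypothetical placeholders, not taken from the notebook:

```python
from transformers4rec.config.trainer import T4RecTrainingArguments

# T4RecTrainingArguments extends the Hugging Face TrainingArguments, so the
# standard evaluation/checkpointing knobs apply. With the leak, every
# eval_steps evaluation reports near-perfect MRR/NDCG, so
# load_best_model_at_end picks a checkpoint based on meaningless scores.
training_args = T4RecTrainingArguments(
    output_dir="./checkpoints",        # hypothetical path
    num_train_epochs=1,
    evaluation_strategy="steps",
    eval_steps=500,                    # hypothetical value
    load_best_model_at_end=True,
    metric_for_best_model="eval_/next-item-prediction/mrr_at_20",  # assumed key
    greater_is_better=True,
)
```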
Steps/Code to Reproduce the Bug
To replicate this issue, you can refer to the code provided here,
which is based on the Yoochoose e-commerce dataset example.
In 01-ETL-with-NVTabular,
the dataset is randomly split into a training and a validation set. The validation set is then duplicated and
transformed into a test set that contains the same entries as the validation set, but with the last item removed from each sequence.
The transformation has been simplified since the item_ids are the only input feature for the transformer being trained.
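For clarity, the test-set construction boils down to duplicating the validation split and truncating each sequence by one item. A minimal pandas sketch; the file and column names are assumptions based on the notebook's output:

```python
import pandas as pd

# Load the validation split written by 01-ETL-with-NVTabular
# (file and column names are assumed here).
valid = pd.read_parquet("valid.parquet")

# Duplicate the validation set and drop the last item of every session,
# so the ground-truth target is absent from the test sequences.
test = valid.copy()
test["item_id-list"] = test["item_id-list"].apply(lambda seq: seq[:-1])
test = test[test["item_id-list"].str.len() > 0]  # drop now-empty sessions
test.to_parquet("test.parquet")
```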
In 02-End-to-End-Session-Based-with-Evaluation,
an XLNet model is trained for a next-item prediction task.
According to your PR,
the last item in the sequence is the one to be predicted for evaluation.
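For reference, the model is assembled roughly as in the example notebook. This is only a sketch under assumed hyperparameters, not the exact training configuration:

```python
from transformers4rec import torch as tr

# Input block built from the NVTabular schema, with causal LM ("clm")
# masking: during evaluation the last item of each sequence is held out
# and becomes the prediction target.
input_module = tr.TabularSequenceFeatures.from_schema(
    schema,                      # schema exported by the ETL notebook
    max_sequence_length=20,      # placeholder
    aggregation="concat",
    d_output=64,                 # placeholder
    masking="clm",
)

# Small XLNet backbone; sizes are placeholders.
transformer_config = tr.XLNetConfig.build(
    d_model=64, n_head=4, n_layer=2, total_seq_length=20
)
model = transformer_config.to_torch_model(
    input_module, tr.NextItemPredictionTask(weight_tying=True)
)
```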
After training and running the evaluation method, the results show exceptionally high ranking metrics (MRR > 0.95).
To rule out the possibility that the validation scores are inflated by sessions identical or highly similar to those seen in training,
the trainer class is used to make predictions on the test set (which, as described above, matches the validation set except that the last item of each sequence is removed).
Calculating the MRR based on these predictions results in a more reasonable score of MRR ≈ 0.2.
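For completeness, the sanity-check MRR can be computed along these lines from the raw prediction scores. Pure NumPy; the score-matrix shape is an assumption about the trainer's prediction output, and the names are illustrative:

```python
import numpy as np

def mrr_at_k(scores: np.ndarray, target_ids: np.ndarray, k: int = 20) -> float:
    """Mean reciprocal rank of each session's held-out item.

    scores:     (n_sessions, n_items) item scores per session
    target_ids: (n_sessions,) the item removed from each test sequence
    """
    target_scores = scores[np.arange(len(target_ids)), target_ids]
    # Rank = 1 + number of items scored strictly higher than the target.
    ranks = (scores > target_scores[:, None]).sum(axis=1) + 1
    reciprocal = np.where(ranks <= k, 1.0 / ranks, 0.0)
    return float(reciprocal.mean())

# e.g. mrr_at_k(prediction_scores, held_out_items)  ->  ~0.2 in this setup
```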
Environment Details
This bug persists across different versions and has been observed in the following environment:
Transformers4Rec version: 23.8.0
Platform: Ubuntu 22
Python version: 3.10
Huggingface Transformers version: 4.28
PyTorch version (GPU): 2.0.1+cu118
Tensorflow version (GPU): N/A
Additional Context
I hope this issue is simply due to a coding mistake on my part.
However, if it turns out to be a genuine bug, I recommend addressing it as a high-priority matter.
Thank you for your amazing support!