add data drift example notebook for text data #427

SangamSwadiK · 2022-10-28T09:35:19Z

Related to :
#392

What does this implement?
This adds an example notebook for detecting data drift in text data using Evidently.

add data drift for text data
add relevant headings
explanation at each step
references and citation for datasets used

Summary of the example notebook:

Each record (sentence) is represented by the average of its GloVe vectors (50 dimensions). This is done for the IMDB movie review train and test sets and for the Amazon Echo-Dot review dataset.
I've created a visualization with Dashboard and a drift measurement with TestSuite, using the Wasserstein distance for numerical features.
Reference dataset (train and test): IMDB movie reviews.
Current dataset (unseen data for drift comparison): Amazon Echo-Dot reviews.
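For readers skimming the thread, the approach summarized above can be sketched roughly as follows. This is a minimal, library-independent illustration and not the notebook's code: the toy 3-dimensional `TOY_VECTORS` stand in for GloVe 50d, the helper names are hypothetical, and the distance is a plain (unnormalized) 1-D Wasserstein distance computed per embedding dimension rather than Evidently's own implementation.

```python
from bisect import bisect_right

# Made-up 3-dimensional vectors standing in for GloVe 50d (assumption).
TOY_VECTORS = {
    "good": [0.9, 0.1, 0.0],
    "bad": [-0.8, 0.2, 0.1],
    "movie": [0.1, 0.9, 0.0],
    "speaker": [0.0, 0.1, 0.9],
}
DIM = 3


def average_vector(text, vectors=TOY_VECTORS, dim=DIM):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in text.lower().split() if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance: integral of |CDF_a - CDF_b|."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    total = 0.0
    for left, right in zip(points, points[1:]):
        cdf_a = bisect_right(a, left) / len(a)
        cdf_b = bisect_right(b, left) / len(b)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


def drift_per_dimension(reference_texts, current_texts):
    """Distance between reference and current for each embedding dimension."""
    ref = [average_vector(t) for t in reference_texts]
    cur = [average_vector(t) for t in current_texts]
    return [
        wasserstein_1d([r[i] for r in ref], [c[i] for c in cur])
        for i in range(DIM)
    ]


drift = drift_per_dimension(
    ["good movie", "bad movie"],      # reference (movie reviews)
    ["good speaker", "bad speaker"],  # current (device reviews)
)
```

In this toy setup the third dimension, which separates "speaker" from "movie", shows the largest distance. In the notebook, each embedding dimension would be treated as a numerical feature and passed to Evidently.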

Please let me know if the way drift is measured for text could be improved. I'm not completely sure this is the right approach.
I also tried running a chi-squared test on word counts, comparing train with test and train with the unseen dataset, but it didn't seem to work well.
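The word-count alternative mentioned above might look something like the sketch below. It is an assumption about how such a test could be set up, not the code that was actually tried: the function name, the whitespace tokenization, and the fixed `vocabulary` argument are all hypothetical. It returns only the chi-squared statistic and the degrees of freedom; a p-value could then be obtained from a chi-squared distribution, for example with SciPy.

```python
from collections import Counter


def chi_squared_word_counts(reference_texts, current_texts, vocabulary):
    """Chi-squared statistic for a 2 x K table of word counts.

    Rows are the two datasets, columns are the vocabulary words that
    occur at least once in either dataset.
    """
    ref = Counter(w for t in reference_texts for w in t.lower().split())
    cur = Counter(w for t in current_texts for w in t.lower().split())

    words = [w for w in vocabulary if ref[w] + cur[w] > 0]
    ref_total = sum(ref[w] for w in words)
    cur_total = sum(cur[w] for w in words)
    if len(words) < 2 or ref_total == 0 or cur_total == 0:
        raise ValueError("need at least two observed words and non-empty rows")

    grand_total = ref_total + cur_total
    statistic = 0.0
    for w in words:
        column_total = ref[w] + cur[w]
        for observed, row_total in ((ref[w], ref_total), (cur[w], cur_total)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected

    degrees_of_freedom = len(words) - 1
    return statistic, degrees_of_freedom


stat, dof = chi_squared_word_counts(["good good"], ["bad bad"], ["good", "bad"])
```

One possible reason such a test behaves poorly on real reviews is that with large corpora even small frequency differences give a very large statistic, so train versus test can look as "drifted" as train versus unseen data.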

I'm reading the data from my own repository via raw.githubusercontent. Please let me know if there is a better way to store and read the data; perhaps it could be stored in a separate folder.

Attaching a link to colab notebook.
https://colab.research.google.com/drive/1Y5X-MkZELIQv3-OHeWZTyyfftE5Lju58?usp=sharing

Thanks !

emeli-dral · 2022-10-31T12:25:48Z

Hi @elenasamuylova,

I believe you would love to have this example 😎

emeli-dral · 2022-10-31T12:32:09Z

@SangamSwadiK we are excited to have an example with text data! Thank you.

Are you going to add more commits to this PR? I marked it as a hacktoberfest-accepted, to give you a chance to add something if you want.
And I kindly ask you to move your example from "sample_notebooks" to "how_to_questions", because it fits there perfectly. It answers the question "How to calculate drift on top of the text data".

SangamSwadiK · 2022-10-31T13:28:11Z

> @SangamSwadiK we are excited to have an example with text data! Thank you.
>
> Are you going to add more commits to this PR? I marked it as a hacktoberfest-accepted, to give you a chance to add something if you want. And I kindly ask you to move your example from "sample_notebooks" to "how_to_questions", because it fits there perfectly. It answers the question "How to calculate drift on top of the text data".

Hi @emeli-dral!
Yes, I think there will be more commits, primarily because of two questions:

How would you suggest handling the loading of GloVe vectors?
- I've commented out the downloading and unpacking of the GloVe vectors because
  - I thought it might make the GitHub workflow take longer to run
  - It can easily be done locally, and the user need not be using GloVe vectors at all. They could be using something else.
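As a rough illustration of what loading the vectors locally involves once the archive is downloaded and unpacked: GloVe text files store one token per line followed by its space-separated components. The loader below is a minimal sketch under that assumption; the function name `load_glove` and the 3-dimensional in-memory sample are hypothetical, and a real run would pass an opened 50-dimensional file instead.

```python
import io


def load_glove(lines, dim=50):
    """Parse GloVe-style text lines into a {token: vector} dictionary.

    Lines that do not contain exactly `dim` components are skipped.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Tiny in-memory sample standing in for an unpacked GloVe file (assumption).
sample = io.StringIO("the 0.1 0.2 0.3\ncat -0.5 0.0 0.25\nbroken 1.0\n")
vectors = load_glove(sample, dim=3)
```

With a real file, the same function could be called as `load_glove(open(path, encoding="utf-8"), dim=50)`, where `path` points to wherever the user unpacked the vectors.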

The data used (IMDB reviews and Amazon Echo-Dot reviews) is read through raw.githubusercontent from https://github.com/SangamSwadiK/test_dataset
I think it would be better if this data were in the Evidently repository, maybe under a directory such as evidently/examples/how_to_questions/data (or a better name). Please let me know if there is another way to handle this.

Other things to be changed:
I think the notebook could be cleaned up further, with more detailed comments at each step and other minor changes.
I agree with moving it to how_to_questions because it answers a question.

What do you suggest?
Thanks.

SangamSwadiK · 2022-11-03T12:28:08Z

~~Updated the notebook (added relevant comments and credits) and added the sample data as well.~~


emeli-dral · 2022-11-16T10:33:49Z

Hi @SangamSwadiK!

I took a look at the dataset.
1. loading glove vectors
I agree with your approach. If users decide to run it locally, they can easily uncomment the downloading and unpacking code and run it.

2. dataset in evidently/examples/how_to_questions/data
We discussed that and decided against storing data in the repo directly. The main reason is that we do not want to start our own data collection now and have to support it further (especially when it comes to storing many datasets that may originally be distributed under a variety of licences).
I suggest adding a link to the original dataset, together with a comment on how the data was processed to get https://github.com/SangamSwadiK/test_dataset, and continuing to use test_dataset. What do you think?

SangamSwadiK · 2022-11-16T11:59:10Z

Hey @emeli-dral !
I've updated the notebook and commented out the GloVe vector download. I've also added introduction and instruction sections, along with the necessary links to the data, the preprocessing, and my preprocessed data (GitHub repo).

Please let me know in case of improvements. Will fix them asap.
Thanks for reviewing!

SangamSwadiK added 2 commits October 28, 2022 14:53

add data drift example for text data

a6bb4f9

fix headings

9d72ca3

emeli-dral added the documentation, hacktoberfest, and hacktoberfest-accepted labels Oct 31, 2022

elenasamuylova mentioned this pull request Oct 31, 2022

library issue : DataDrift,DataQuality #432

Closed

SangamSwadiK added 3 commits November 2, 2022 13:24

update notebook and add dataset

55a1685

fix markdown

2033785

check data

f72116c

Add introduction section, cleanup notebook and links and correct references

69b5f03

add instruction, and links to preprocessing and original data

a7417c7

emeli-dral merged commit 63e001b into evidentlyai:main Nov 18, 2022
