add data drift example notebook for text data #427

Merged: 7 commits, Nov 18, 2022

@SangamSwadiK SangamSwadiK commented Oct 28, 2022

Related to: #392

What does this implement?
This adds an example notebook for detecting data drift in text data using Evidently.

  • add data drift for text data
  • add relevant headings
  • explanation at each step
  • references and citation for datasets used

Summary of the example notebook:

  1. I've used averaged GloVe vectors to represent each sentence (record) across the IMDB movie review train and test sets and the Amazon Echo Dot review dataset. I used the 50-dimensional GloVe vectors to calculate the average vector for each record.

  2. I've created visualizations with Dashboard and measured drift with TestSuite, using the Wasserstein distance for numerical features.

  3. Dataset used for reference (train and test): IMDB movie reviews.
    Dataset used for current (unseen dataset for drift comparison): Amazon Echo Dot reviews.
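
As a rough illustration of steps 1 and 2, the sketch below averages word vectors per record and then compares each embedding dimension between two datasets with a one-dimensional Wasserstein distance. It is not the notebook's code: the tiny `toy_vectors` table is a made-up stand-in for the 50-dimensional GloVe vectors, the helper names are hypothetical, and the distance is computed by hand for equal-sized samples instead of going through Evidently's TestSuite.

```python
from typing import Dict, List

# Hypothetical 3-dimensional stand-in for GloVe vectors (the notebook uses 50 dimensions).
toy_vectors: Dict[str, List[float]] = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.8, 0.1],
    "boring": [-0.7, 0.3, 0.2],
    "speaker": [0.1, -0.5, 0.9],
    "sound": [0.3, -0.4, 0.8],
}


def average_vector(text: str, vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the vectors of the known tokens; return zeros if none are known."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return [0.0] * dim
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def wasserstein_1d(a: List[float], b: List[float]) -> float:
    """1-D Wasserstein distance between two samples of equal size."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-sized samples")
    # For equal-sized samples the optimal transport pairs sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


reference = ["great movie", "boring movie"]   # stands in for IMDB reviews
current = ["great speaker", "great sound"]    # stands in for Echo Dot reviews

ref_emb = [average_vector(t, toy_vectors, 3) for t in reference]
cur_emb = [average_vector(t, toy_vectors, 3) for t in current]

# Treat each embedding dimension as its own numerical feature.
drift_per_dim = [
    wasserstein_1d([e[i] for e in ref_emb], [e[i] for e in cur_emb])
    for i in range(3)
]
print(drift_per_dim)
```

A larger distance on a dimension suggests the two datasets differ more along it; what counts as "drift" still depends on a threshold chosen for the data at hand.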

Please let me know if the approach to measuring data drift for text could be improved. I'm not completely sure this is the right approach.
I also tried using word counts and running a Chi-squared test on them, comparing train against test and train against the unseen dataset, but it didn't seem to work well.
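
For context, a word-count comparison of this kind could look roughly like the sketch below. It is an assumption about the general technique, not the code that was tried: the function name is hypothetical, tokenization is a plain whitespace split, and only the Pearson chi-squared statistic and its degrees of freedom are computed (no p-value).

```python
from collections import Counter
from typing import List, Tuple


def chi_squared_statistic(ref_texts: List[str], cur_texts: List[str]) -> Tuple[float, int]:
    """Pearson chi-squared statistic for a 2 x K table of word counts.

    Rows are the two datasets, columns are the K words in the joint vocabulary.
    Returns (statistic, degrees_of_freedom).
    """
    ref_counts = Counter(w for t in ref_texts for w in t.lower().split())
    cur_counts = Counter(w for t in cur_texts for w in t.lower().split())
    ref_total = sum(ref_counts.values())
    cur_total = sum(cur_counts.values())
    if ref_total == 0 or cur_total == 0:
        raise ValueError("both datasets must contain at least one word")

    vocab = sorted(set(ref_counts) | set(cur_counts))
    grand_total = ref_total + cur_total
    stat = 0.0
    for word in vocab:
        col_total = ref_counts[word] + cur_counts[word]
        for observed, row_total in (
            (ref_counts[word], ref_total),
            (cur_counts[word], cur_total),
        ):
            # Expected count if the word were equally likely in both datasets.
            expected = row_total * col_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat, len(vocab) - 1


stat, dof = chi_squared_statistic(["good movie", "bad movie"], ["good speaker", "loud speaker"])
print(stat, dof)
```

One possible reason such a test behaves poorly on text is that large vocabularies contain many rare words with very small expected counts, while large corpora make even tiny frequency differences look significant.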

I'm reading the data from raw.githubusercontent files in my own repository; please let me know if there is a better way to store and read the data, or whether the data could be stored in a separate folder.

Attaching a link to colab notebook.
https://colab.research.google.com/drive/1Y5X-MkZELIQv3-OHeWZTyyfftE5Lju58?usp=sharing

Thanks !

@emeli-dral added the documentation, hacktoberfest, and hacktoberfest-accepted labels on Oct 31, 2022
@emeli-dral
Contributor

Hi @elenasamuylova,

I believe you would love to have this example 😎

@emeli-dral
Contributor

@SangamSwadiK we are excited to have an example with text data! Thank you.

Are you going to add more commits to this PR? I marked it as hacktoberfest-accepted to give you a chance to add something if you want.
I also kindly ask you to move your example from "sample_notebooks" to "how_to_questions", because it fits there perfectly. It answers the question "How to calculate drift on top of text data".

@SangamSwadiK
Contributor Author

SangamSwadiK commented Oct 31, 2022

Hi @emeli-dral!
Yes, I think there will be more commits, primarily because of two questions:

  1. How would you suggest handling the loading of the GloVe vectors?
    • I've commented out the downloading and unpacking of the GloVe vectors because
      • I thought it might make the GitHub workflow take longer to run.
      • It can easily be done locally, and users need not use GloVe vectors at all; they could use something else.
  2. The data used (IMDB reviews and Amazon Echo Dot reviews) is read from raw.githubusercontent, from https://github.com/SangamSwadiK/test_dataset
    I think it would be better if this data were in the Evidently repository, maybe under a directory such as evidently/examples/how_to_questions/data (or any better name). Please let me know if there is another way to handle this.
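
For readers who do download the vectors locally, the sketch below shows one way to load them. GloVe text files store one token per line followed by its vector components, separated by spaces. The function name `parse_glove` is hypothetical, and the file path in the trailing comment is an assumption that depends on which archive was unpacked and where.

```python
import io
from typing import Dict, Iterable, List


def parse_glove(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse GloVe-style text lines ("token v1 v2 ... vn") into a dict."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


# Small in-memory sample so the sketch runs without any download.
sample = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
vectors = parse_glove(sample)
print(len(vectors))

# With a real file (the path is an assumption; adjust it to the unpacked archive):
# with open("glove.6B.50d.txt", encoding="utf-8") as f:
#     vectors = parse_glove(f)
```

Because the parser only needs an iterable of lines, the same function works for an open file or for any other embedding source exported in this format.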

Other things to change:
I think the notebook could be cleaned up further, with more detailed comments at each step and other minor changes.
I agree with moving it to how_to_questions, because it answers a question.

What do you suggest?
Thanks.

@SangamSwadiK
Contributor Author

SangamSwadiK commented Nov 3, 2022

Updated the notebook (added relevant comments and credits) and added the sample data as well.

@emeli-dral
Contributor

Hi @SangamSwadiK!

I took a look at the dataset.
1. Loading GloVe vectors
I agree with your approach. If a user decides to run it locally, they can easily uncomment the downloading and unpacking code and run it.

2. Dataset in evidently/examples/how_to_questions/data
We discussed that and decided against storing data in the repo directly. The main reason is that we do not want to start our own data collection now and have to support it going forward (especially when it comes to storing many datasets that may originally be distributed under a variety of licences).
I suggest adding a link to the original dataset, together with a comment on how the data was processed to get https://github.com/SangamSwadiK/test_dataset, and continuing to use test_dataset. What do you think?

@SangamSwadiK
Contributor Author

SangamSwadiK commented Nov 16, 2022

Hey @emeli-dral!
I've updated the notebook. I've commented out the GloVe vector download. I've also added an introduction and an instructions section, along with the necessary links to the data, the preprocessing, and my preprocessed data (GitHub repo).

Please let me know if anything needs improving, and I will fix it as soon as possible.
Thanks for reviewing!

@emeli-dral emeli-dral merged commit 63e001b into evidentlyai:main Nov 18, 2022
2 participants