add data drift example notebook for text data #427
Hi @elenasamuylova, I believe you would love to have this example 😎
@SangamSwadiK we are excited to have an example with text data! Thank you. Are you going to add more commits to this PR? I marked it as hacktoberfest-accepted to give you a chance to add something if you want.
Hi @emeli-dral! Are there other things to be changed? What do you suggest?
Hi @SangamSwadiK! I took a look at the dataset.
2. dataset in evidently/examples/how_to_questions/data
Hey @emeli-dral! Please let me know in case of improvements. I will fix them ASAP.
Related to: #392
What does this implement?
This adds an example notebook for detecting data drift for text data using evidently
Summary of things in the example notebook:
- I've used averaged GloVe vectors to represent each sentence (record) across the IMDB movie review train and test sets and the Amazon Echo-Dot review dataset. I've used GloVe 50dim to calculate the average vector for each record.
- I've created a visualization based on Dashboard and a drift measurement based on TestSuite, using wasserstein distance for numerical features.
- Dataset used for reference (train and test): IMDB movie reviews.
- Dataset used for current (unseen dataset for drift comparison): Amazon Echo-Dot reviews.
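To make the approach in the summary concrete, here is a minimal, dependency-free sketch of the two steps: averaging word vectors per record, then computing a Wasserstein distance per embedding dimension between reference and current data. It is not the notebook's code and does not use the Evidently API. The tiny `TOY_VECTORS` table and the helper names `average_vector`, `wasserstein_1d` and `per_dimension_drift` are hypothetical stand-ins for the GloVe 50dim vectors and the TestSuite checks.

```python
from bisect import bisect_right

# Hypothetical 3-dimensional toy vectors standing in for GloVe 50dim.
TOY_VECTORS = {
    "great": [0.9, 0.1, 0.0],
    "movie": [0.2, 0.8, 0.1],
    "boring": [-0.7, 0.3, 0.0],
    "speaker": [0.1, -0.2, 0.9],
    "sound": [0.0, -0.1, 0.8],
}
DIM = 3


def average_vector(text, vectors=TOY_VECTORS, dim=DIM):
    """Mean of the vectors of known words; zero vector if no word is known."""
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def wasserstein_1d(xs, ys):
    """1-D Wasserstein distance: integral of |F_x - F_y| over the real line."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDFs are constant on [left, right).
        fx = bisect_right(xs, left) / len(xs)
        fy = bisect_right(ys, left) / len(ys)
        total += abs(fx - fy) * (right - left)
    return total


def per_dimension_drift(reference_texts, current_texts, dim=DIM):
    """Wasserstein distance for each embedding dimension treated as a numerical feature."""
    ref = [average_vector(t, dim=dim) for t in reference_texts]
    cur = [average_vector(t, dim=dim) for t in current_texts]
    return [
        wasserstein_1d([r[i] for r in ref], [c[i] for c in cur])
        for i in range(dim)
    ]


reference = ["great movie", "boring movie"]      # stands in for IMDB reviews
current = ["great speaker", "great sound"]       # stands in for Echo-Dot reviews
drift = per_dimension_drift(reference, current)
```

Each entry of `drift` is the distance for one embedding dimension; a threshold on these values (an assumption to be tuned per dataset) would decide whether a dimension counts as drifted. In practice `scipy.stats.wasserstein_distance` computes the same one-dimensional quantity.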
- Please let me know if improvements could be made in measuring data drift for text. I'm not completely sure this is the right approach.
- I tried using word counts with a Chi-squared test on the word counts between the train and test datasets, and between the train and unseen datasets. It didn't seem to work well.
- I'm reading the data from my raw.githubusercontent. Please let me know if there is a better way to store and read the data, or perhaps the data could be stored in a separate folder.
Attaching a link to the Colab notebook:
https://colab.research.google.com/drive/1Y5X-MkZELIQv3-OHeWZTyyfftE5Lju58?usp=sharing
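For reference, the word-count alternative mentioned above can be sketched as follows. This is a minimal illustration under assumptions, not the code that was actually tried: it computes only the Pearson chi-squared statistic and degrees of freedom for the 2 x K table of word counts, and the helper names `word_counts` and `chi_squared_statistic` are hypothetical.

```python
from collections import Counter


def word_counts(texts):
    """Count whitespace-separated, lower-cased tokens over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def chi_squared_statistic(counts_a, counts_b):
    """Pearson chi-squared statistic and degrees of freedom for a 2 x K word-count table."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both datasets must contain at least one word")
    vocab = sorted(set(counts_a) | set(counts_b))
    grand = total_a + total_b
    stat = 0.0
    for word in vocab:
        column_total = counts_a[word] + counts_b[word]
        for observed, row_total in ((counts_a[word], total_a), (counts_b[word], total_b)):
            expected = row_total * column_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat, len(vocab) - 1


train = word_counts(["great movie", "boring movie"])
unseen = word_counts(["great speaker", "great sound"])
stat, dof = chi_squared_statistic(train, unseen)
```

One possible reason this behaves poorly on text: with a large vocabulary many words have very small expected counts, where the chi-squared approximation is unreliable. A p-value would additionally need the chi-squared distribution, for example via `scipy.stats.chi2_contingency`.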
Thanks!