Drifted Tabular Data Generation #328

Open
LifeBoey opened this issue Sep 22, 2022 · 1 comment

@LifeBoey

Hi there,

I've been exploring data drift detection and want to test how well evidently measures how much a given dataset has drifted. My main question right now is how to generate drifted data in the first place, and how much to skew it, so that I can check whether evidently detects the amount of drift I applied.

So let's say I have a tabular dataframe like this, where I want to drift just the feature of Age.

[Image: adult df]

What are the ways to artificially create a drifted dataset from a given dataset?

What I've been doing is splitting the data into two extreme ranges (e.g. one set with Age < 50 and one set with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as applying a uniform offset to all the Ages of one dataset? Or adding random noise to all of the Ages, with the noise following some normal distribution? What other standard techniques could be used to apply drift in this manner, to a degree that can be varied?
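To make the three ideas above concrete (mixing two age ranges, a uniform offset, Gaussian noise), here is a minimal sketch using NumPy. The helper names `shift_drift`, `noise_drift` and `mix_drift` are hypothetical, and the `age` array is a synthetic stand-in rather than the real Age column. Each helper takes one parameter that controls how strong the drift is.

```python
import numpy as np

rng = np.random.default_rng(0)

def shift_drift(values, delta):
    """Add a constant offset to every value (a pure mean shift)."""
    return np.asarray(values, dtype=float) + delta

def noise_drift(values, sigma, rng):
    """Add zero-mean Gaussian noise; a larger sigma means stronger drift."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(0.0, sigma, size=values.shape)

def mix_drift(low, high, frac_high, n, rng):
    """Sample n values: a fraction frac_high from `high`, the rest from `low`."""
    n_high = int(round(frac_high * n))
    part_high = rng.choice(high, size=n_high, replace=True)
    part_low = rng.choice(low, size=n - n_high, replace=True)
    return np.concatenate([part_low, part_high])

# Synthetic stand-in for the Age column (assumption: integer ages 17..90).
age = rng.integers(17, 91, size=5000).astype(float)

shifted = shift_drift(age, delta=5.0)             # every age moved up by 5
noisy = noise_drift(age, sigma=3.0, rng=rng)      # ages jittered with sd = 3
young, old = age[age < 50], age[age >= 50]
mixed = mix_drift(young, old, frac_high=0.8, n=len(age), rng=rng)  # 80% from Age >= 50
```

With a pandas dataframe, the same functions could be applied to a copy of the frame by assigning the result back to the Age column. Sweeping `delta`, `sigma` or `frac_high` over a range of values gives a series of datasets with gradually increasing drift.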

Thank you!

@elenasamuylova
Collaborator

Hi @LifeBoey, you might find this blog post useful: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets

There, we generate artificial drift and then explore how each statistical test reacts to it.

There is also a notebook with all the code, including the code where we created artificial drift: https://colab.research.google.com/drive/1EFFcs0wDzToxSR6nw1umXDgPyeoP_Uk6
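For a rough idea of what such an experiment looks like, here is a small sketch that is not taken from the blog or the notebook. It applies increasing mean shifts to synthetic data and reports the two-sample Kolmogorov-Smirnov statistic, written in plain NumPy. The function name `ks_statistic`, the synthetic distribution and the shift sizes are all assumptions made for illustration.

```python
import numpy as np

def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two ECDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # Evaluate both empirical CDFs at every observed point.
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

rng = np.random.default_rng(42)
# Synthetic stand-in for a reference feature (assumption: roughly normal).
reference = rng.normal(40.0, 12.0, size=5000)

results = {}
for delta in [0.0, 1.0, 2.0, 5.0, 10.0]:
    # Fresh sample from the same distribution, then shifted by delta.
    current = rng.normal(40.0, 12.0, size=5000) + delta
    results[delta] = ks_statistic(reference, current)

for delta, stat in results.items():
    print(f"shift={delta:>4}: KS statistic={stat:.3f}")
```

The statistic should grow as the shift grows, which gives a simple way to see how sensitive a test is to a known amount of drift. `scipy.stats.ks_2samp` computes the same statistic and also returns a p-value.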
