Drifted Tabular Data Generation #328

LifeBoey · 2022-09-22T09:34:56Z

Hi there,

I've been exploring data drift detection and want to test how well evidently measures how much a given dataset has drifted. My main question right now is how to generate drifted data in the first place, and how much to skew it, so that I can check whether evidently detects the amount of drift I applied.

So let's say I have a tabular dataframe like this, where I want to drift just the Age feature.

What are the ways to artificially create a drifted dataset from a given dataset?

What I've been doing is splitting the data into two extreme ranges (e.g. one set with Age < 50 and one with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as adding a constant offset to all the Ages in one dataset? Or adding random noise to all of the Ages, with the noise drawn from a normal distribution? What other standard techniques could be used to apply drift this way, to a degree that can be varied?
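
To make the three options concrete, here is a minimal sketch using NumPy only. The Age values, the function names (`shift_feature`, `add_gaussian_noise`, `mix_groups`), and all parameter values are made up for illustration; in a dataframe, the result would be assigned back to the Age column.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def shift_feature(values, delta):
    """Add the same constant to every value (a pure location shift)."""
    return np.asarray(values, dtype=float) + delta


def add_gaussian_noise(values, scale, rng):
    """Add zero-mean normal noise with standard deviation `scale`."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=scale, size=values.shape)


def mix_groups(low, high, high_fraction, size, rng):
    """Sample `size` values with replacement, `high_fraction` of them from `high`."""
    n_high = int(round(high_fraction * size))
    part_high = rng.choice(high, size=n_high, replace=True)
    part_low = rng.choice(low, size=size - n_high, replace=True)
    return np.concatenate([part_low, part_high])


# Hypothetical Age column: integers from 18 to 79, stored as floats.
age = rng.integers(18, 80, size=1000).astype(float)

shifted = shift_feature(age, delta=5.0)              # every Age + 5
noisy = add_gaussian_noise(age, scale=5.0, rng=rng)  # Age + N(0, 5^2)

# The split-and-mix approach: 80% of the sample comes from Age >= 50.
young, old = age[age < 50], age[age >= 50]
mixed = mix_groups(young, old, high_fraction=0.8, size=1000, rng=rng)
```

In each case a single parameter (`delta`, `scale`, or `high_fraction`) controls the degree of drift, so it can be swept over a range of values.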

Thank you!

elenasamuylova · 2022-09-22T13:21:53Z

Hi @LifeBoey, you might find this blog post useful: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets

There, we generate artificial drift and then explore how each statistical test reacts to it.

There is also a notebook with all the code, including the code where we created artificial drift: https://colab.research.google.com/drive/1EFFcs0wDzToxSR6nw1umXDgPyeoP_Uk6
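
As a rough illustration of the idea of applying drift of increasing size and watching how a test reacts, here is a small sketch. It is not the code from the notebook and does not use evidently's API; it assumes SciPy's two-sample Kolmogorov-Smirnov test (`scipy.stats.ks_2samp`) as a stand-in, and the reference data, the `ks_for_shift` helper, and the shift values are made up.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)

# Hypothetical reference sample for the Age feature.
reference = rng.normal(loc=40.0, scale=12.0, size=2000)


def ks_for_shift(reference, delta, rng):
    """Resample the reference, shift it by `delta`, and run a two-sample KS test."""
    current = rng.choice(reference, size=reference.size, replace=True) + delta
    result = ks_2samp(reference, current)
    return result.statistic, result.pvalue


# Sweep the shift size and record how the test statistic and p-value respond.
results = {}
for delta in [0.0, 1.0, 2.0, 5.0, 10.0]:
    statistic, pvalue = ks_for_shift(reference, delta, rng)
    results[delta] = (statistic, pvalue)
    print(f"shift={delta:5.1f}  KS statistic={statistic:.3f}  p-value={pvalue:.3g}")
```

The same loop could wrap any other drift generator or any other test, which gives a simple way to compare how sensitive each test is to a known amount of drift.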
