[ENH] Add sample_indices_ for SMOTE/ADASYN classes #772

glemaitre · 2020-11-02T10:11:52Z

SMOTE/ADASYN classes currently do not provide a sample_indices_ attribute since they are generating samples that do not belong to the original dataset.

However, we could define new semantics for these samplers that generate data. sample_indices_ could expose, for each generated point, a tuple of the indices of the samples used to generate it. For samples that are not generated, it would be a single integer.
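To make the proposed semantics concrete, here is a minimal sketch in plain Python. It is not imblearn's implementation: the helper `toy_oversample` and the name `provenance` are hypothetical, and the partner sample is drawn at random from the minority class rather than from the k nearest neighbours as real SMOTE does.

```python
import random


def toy_oversample(X, minority_idx, n_new, seed=0):
    """Hypothetical helper: SMOTE-like interpolation that records provenance."""
    rng = random.Random(seed)
    X_res = [list(row) for row in X]
    # Original samples map to a single integer: their own index.
    provenance = list(range(len(X)))
    for _ in range(n_new):
        # Pick a base sample and a distinct partner from the minority class.
        # (Assumption: real SMOTE picks the partner among the k nearest neighbours.)
        i, j = rng.sample(minority_idx, 2)
        gap = rng.random()
        # The new point lies on the segment between X[i] and X[j].
        X_res.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        # Generated samples map to the tuple of indices they were built from.
        provenance.append((i, j))
    return X_res, provenance


X = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.2], [6.0, 5.5]]
X_res, provenance = toy_oversample(X, minority_idx=[0, 1, 3], n_new=2)
```

Here the first five entries of `provenance` are plain integers and the last two are `(base, partner)` tuples, which is the mixed int/tuple layout described above.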

This would implement a feature requested in issues and gitter.


glemaitre · 2021-02-15T23:32:18Z

Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with different semantics. However, we could provide a new attribute with proper semantics for the SMOTE-like samplers.

tianlinhe · 2021-04-01T09:27:21Z

I was thinking about the same issue because I need the sample indices for GroupKFold CV after oversampling with SMOTE. So I downloaded the repo and made some small local changes to imblearn/over_sampling/_smote/base.py. The code to oversample stays the same:

import numpy as np
from imblearn.over_sampling import SMOTE as smo

X = np.random.random((8, 3))
y = np.array([0, 0, 2, 0, 2, 2, 2, 2])
oversample = smo(k_neighbors=2)
X_, y_ = oversample.fit_resample(X, y)

Calling oversample.sample_indices() returns:

array([0, 1, 2, 3, 4, 5, 6, 7, 1, 3])

where the index of each synthetic sample is the same as that of its "mother" real sample.

One can also call oversample.sample_indices(get_which_neighbors=True), which returns a list of tuples indicating which neighbor the synthetic sample was generated from:

[(0, 0),
 (1, 0),
 (2, 0),
 (3, 0),
 (4, 0),
 (5, 0),
 (6, 0),
 (7, 0),
 (1, 1),
 (3, 1)]

For a real sample, its neighbor is 0 (itself).
Please let me know if this is also what you have in mind! If you think it is implementable I can open a new branch.

base.txt
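For the GroupKFold use case, a rough sketch of how the output could be consumed is shown below. It starts from the list of tuples printed above; the group labels are made up for illustration, and it relies on the convention stated above that a neighbor value of 0 marks a real sample.

```python
# Output of the locally modified sampler, as printed above:
# (index of the "mother" sample, neighbor number).
pairs = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0),
         (5, 0), (6, 0), (7, 0), (1, 1), (3, 1)]

# Hypothetical group labels for the 8 original samples.
groups = ["a", "a", "b", "b", "c", "c", "d", "d"]

# Flat index array: each resampled row points back at its mother sample.
sample_indices = [mother for mother, _ in pairs]

# By the convention above, neighbor 0 means "itself", i.e. a real sample.
is_synthetic = [neighbor != 0 for _, neighbor in pairs]

# Each synthetic sample inherits the group of its mother sample, so a
# group-aware splitter keeps it in the same fold as its source.
groups_resampled = [groups[i] for i in sample_indices]
```

With NumPy arrays the last step is simply `groups[sample_indices]`, and `groups_resampled` can then be passed as the `groups` argument of a group-aware splitter such as scikit-learn's GroupKFold.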

nhm-7 · 2021-05-17T19:33:25Z

Hi! Thanks for creating this issue. I think this feature can be useful to understand datasets we are working with.

> Thinking a bit more about it, and after reading #724, I think we should avoid reusing sample_indices_ with different semantics. However, we could provide a new attribute with proper semantics for the SMOTE-like samplers.

@glemaitre, IMO, the semantics should be defined by the owners of the datasets. Taking the example of #724: if we oversample the data and sample_indices_ is a tuple of the samples used to generate each new point, we would read it as people generating new points (i.e., new people).

WDYT?

JurajSlivka · 2022-10-04T06:34:14Z

Hi,
Is this issue still open?
I see there was a PR but it seems outdated.

JurajSlivka · 2022-10-09T14:16:49Z

> Hi, Is this issue still open? I see there was a PR but it seems outdated.

So as it seems that no one is currently working on it, I will do it.

glemaitre added the "Type: Enhancement" label (indicates new feature requests) Nov 2, 2020

tianlinhe mentioned this issue Jun 18, 2021

ENH Add sample_indices for SMOTE class #843

Open

JurajSlivka mentioned this issue Oct 28, 2022

[MRG] [ENH] Add sample_indices_ for SMOTE/ADASYN classes #933

Closed
