Ensembles are meta-estimators that fit a number of classifiers on various subsets of the dataset. The ensemble feature selection class EmsembleFS runs the provided feature selection method on N subsets of the dataset, generating N feature subsets and/or rankings. The N selections/rankings are combined using the provided combination method to obtain the final feature selection. The options for combining feature subsets are listed below; a minimal sketch of these rules follows the list.
- union : the union of the feature selection subsets
- intersection : the intersection of the feature selection subsets
- vote-threshold : features selected at least threshold times
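For concreteness, here is a minimal sketch (not the library implementation) of how the three rules combine per-run selections, assuming each run yields an array of selected feature indices:

import numpy as np

selections = [np.array([0, 2, 5]), np.array([0, 2, 7]), np.array([0, 5, 7])]

# union: features selected in at least one run
union = np.unique(np.concatenate(selections))
# intersection: features selected in every run
intersection = set(selections[0]).intersection(*[set(s) for s in selections[1:]])
# vote-threshold: features selected at least `threshold` times
threshold = 2
counts = np.bincount(np.concatenate(selections))
vote = np.flatnonzero(counts >= threshold)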
Alternatively, feature rankings are combined and a threshold is applied to obtain a subset of the features. The options for assigning a combined rank to a feature are listed below; a sketch follows the list.
- mean-rank: mean of the N rankings
- min-rank: the minimum of the N rankings
- median-rank: the median of the N rankings
- gmean-rank: geometric mean of the N rankings
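A minimal sketch of the rank-combination options, assuming rankings is an (N, n_features) array of per-run feature ranks where lower is better (an illustration, not the library code):

import numpy as np
from scipy.stats import gmean

rankings = np.array([[1, 3, 2, 4],
                     [2, 1, 3, 4],
                     [1, 2, 4, 3]])

mean_rank = rankings.mean(axis=0)           # mean-rank
min_rank = rankings.min(axis=0)             # min-rank
median_rank = np.median(rankings, axis=0)   # median-rank
gmean_rank = gmean(rankings, axis=0)        # gmean-rank

# keep the `threshold` best (lowest) combined ranks
threshold = 2
selected = np.argsort(mean_rank)[:threshold]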
- selector : the base feature selection algorithm
- splitter : method for generating subsets of the dataset
- combine : method for combining feature rankings; options are listed above
- threshold: if combine is a ranking option, threshold is the number of features to select; if combine is vote-threshold, threshold is the number of times a feature must be selected to be in the final set.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import ShuffleSplit
estimator = LogisticRegression()
selector = RFECV(estimator, step=0.1, cv=5)
splitter = ShuffleSplit(n_splits=5, test_size=.2)
emsembleFS = EmsembleFS(selector, splitter, combine='vote-threshold', threshold=4)  # output features selected in at least 4 of the 5 runs
emsembleFS.fit(X, y)
emsembleFS.selection_indices
"""
array([ 3, 6, 7, 9, 10, 12, 20, 21, 22, 23, 26, 27])
"""
mRMR selects features by considering both their relevance for predicting the target variable and the redundancy within the selected features (https://arxiv.org/pdf/1908.05376.pdf). The variants, which differ in how relevance and redundancy are scored and balanced, are listed below; a greedy sketch of the FCQ variant follows the list.
- MID (mutual information difference) : mutual information between each feature and the target scores relevance, and mutual information between each pair of features scores redundancy; the difference is used to balance relevance against redundancy.
- MIQ (mutual information quotient) : same scores as MID, with the quotient used to balance the two.
- FCD (F-test correlation difference) : the F-statistic scores relevance and the Pearson correlation scores redundancy; the difference is used to balance the two.
- FCQ (F-test correlation quotient) : same scores as FCD, with the quotient used to balance the two.
- RFCQ (Random Forest correlation quotient) : random forest importance scores relevance and the Pearson correlation scores redundancy; the quotient is used to balance the two.
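As an illustration, here is a hedged sketch of a greedy FCQ-style selection (the library's actual implementation may differ): relevance is the per-feature F-statistic, redundancy is the mean absolute Pearson correlation with the features already selected, and each step picks the candidate with the highest quotient.

import numpy as np
from sklearn.feature_selection import f_classif

def fcq_select(X, y, k):
    relevance, _ = f_classif(X, y)               # per-feature F-statistic (relevance)
    corr = np.abs(np.corrcoef(X, rowvar=False))  # pairwise |Pearson correlation| (redundancy)
    selected = [int(np.argmax(relevance))]       # seed with the most relevant feature
    while len(selected) < k:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        redundancy = corr[np.ix_(candidates, selected)].mean(axis=1)
        score = relevance[candidates] / redundancy
        selected.append(candidates[int(np.argmax(score))])
    return np.array(selected)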
- score_func : the mRMR variant to use
- k : the number of features to select
selector = mRMR(score_func='MIQ', k=10)
selector.fit(X, y)
X_transformed = selector.transform(X)
Feature selections are individuals, binary encoded (1 means a feature is included, 0 means excluded). A population of m individuals is initialized, and the fitness of each individual is the performance metric of the base classifier trained with the feature selection that the individual encodes. In each generation, a new population of size m is created: parents are selected from the current population by rank selection and are crossed over. In single-point crossover, the binary string from the beginning up to the crossover point is copied from one parent and the rest is copied from the second parent (see the sketch after this paragraph). Selected bits are then mutated according to the mutation probability, and the fitness of the population is recalculated. When the maximum number of generations is reached or the fitness stops improving, the fittest individual of the last generation is output.
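The following sketch illustrates single-point crossover and bit-flip mutation on binary-encoded individuals (an illustrative reimplementation, not the AGA source):

import numpy as np

rng = np.random.default_rng(0)

def single_point_crossover(parent1, parent2):
    point = rng.integers(1, len(parent1))              # random crossover point
    return np.concatenate([parent1[:point], parent2[point:]])

def mutate(individual, p_mutation):
    flips = rng.random(len(individual)) < p_mutation   # bits selected for mutation
    individual = individual.copy()
    individual[flips] = 1 - individual[flips]
    return individual

parent1 = np.array([1, 0, 1, 1, 0])
parent2 = np.array([0, 1, 0, 1, 1])
child = mutate(single_point_crossover(parent1, parent2), p_mutation=0.1)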
- clf - the base classifier used for classification
- m - size of population for every generation
- metric - performance metric used as fitness of individual (f1, accuracy, precision, recall, auc)
- crossover - which crossover operator to use (single, double, uniform)
- adaptive - if True, adapt the crossover and mutation probabilities according to the fitness of the individual
- p_crossover - probability of crossover, between 0 and 1
- p_mutation - probability of mutation, between 0 and 1
- max_generations - max number of generations
- tol - stop if the average fitness does not change by more than tol for 5 generations
from sklearn.linear_model import LogisticRegression

base_clf = LogisticRegression()
selector = AGA(base_clf, 50, metric='f1', crossover='single', max_generations=50, tol=0.001, adaptive=True)
selector.fit(X, y)
X_transformed = selector.transform(X)
The Homogeneous Ensemble (HomogeneousEmsemble) runs the same feature selection method and classification algorithm on N subsets of the dataset to generate N classifiers.
The outputs of the N classifiers are combined to obtain the final class labels. Prediction labels can be combined by majority vote, meaning the final prediction is the class predicted by the majority of the classifiers. If the classification method provides probabilities, the prediction scores can be combined instead.
The options for combining prediction labels/scores are listed below; a sketch of the combination rules follows the list.
- majority-vote : predict x as the majority class predicted by classifiers
- product : predict x as class c if the product of prediction scores of c is the max among the classes
- sum : predict x as class c if the sum of prediction scores of c is the max among the classes
- max : predict x as class c if the max of prediction scores of c is the max among the classes
- min : predict x as class c if the min of prediction scores of c is the max among the classes
- median : predict x as class c if the median of prediction scores of c is the max among the classes
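A minimal sketch of how these rules might be applied, assuming probas is an (N, n_samples, n_classes) stack of predict_proba outputs and labels is an (N, n_samples) array of predicted labels (an illustration, not the library code):

import numpy as np

def combine_scores(probas, rule):
    # probas: (N, n_samples, n_classes) stack of predict_proba outputs
    rules = {'sum': np.sum, 'product': np.prod, 'max': np.max,
             'min': np.min, 'median': np.median}
    combined = rules[rule](probas, axis=0)   # (n_samples, n_classes)
    return combined.argmax(axis=1)           # class with the highest combined score

def majority_vote(labels):
    # labels: (N, n_samples) integer predictions from the N classifiers
    return np.apply_along_axis(lambda votes: np.bincount(votes).argmax(), 0, labels)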
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold, ShuffleSplit
import pandas as pd

results = pd.DataFrame(columns=["accuracy", "precision", "recall", "auc", "f1"])
estimator = LogisticRegression()
selector = RFECV(estimator, step=0.1, cv=5)
shuffle_splitter = ShuffleSplit(n_splits=5, test_size=0.2)
clf = LogisticRegression(penalty='l2')
CVSplitter = StratifiedKFold(n_splits=5)

for train_index, test_index in CVSplitter.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # run ensemble feature selection on the training fold
    emsembleFS = EmsembleFS(selector, shuffle_splitter, combine='intersection')
    emsembleFS.fit(X_train, y_train)
    X_train_transformed = emsembleFS.transform(X_train)
    X_test_transformed = emsembleFS.transform(X_test)
    # create a homogeneous ensemble on the selected features
    emsemble = HomogeneousEmsemble(shuffle_splitter, clf, combine='min')
    emsemble.fit(X_train_transformed, y_train)
    y_pred = emsemble.predict(X_test_transformed)
    scores = emsemble.get_scores(y_test, y_pred)
    results = pd.concat([results, pd.DataFrame([scores])], ignore_index=True)
The threshold classifier fits the underlying classifier on the training dataset and, for each class label, computes the optimal prediction-score threshold that yields the highest classification performance on a holdout dataset. An example is predicted as class c if its prediction score for class c is above the threshold for class c, so an example can be labeled with multiple classes.
- Divide the 3-class dataset into train, holdout, and test sets.
- Train the underlying classifier on the training set.
- Compute the optimal prediction threshold for each class on the holdout set.
- Predict the class labels of the test set. predict returns a 2D array (n_samples, n_classes); each row represents the subset of class labels assigned to an example. A sketch of the threshold search follows this list.
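A hedged sketch of the per-class threshold search on the holdout set, here maximizing F1 over a grid of candidate thresholds (the actual criterion used by ThresholdClassifier may differ):

import numpy as np
from sklearn.metrics import f1_score

def optimal_thresholds(scores, y_holdout, classes):
    # scores: (n_samples, n_classes) predict_proba output on the holdout set
    thresholds = {}
    for idx, c in enumerate(classes):
        y_binary = (y_holdout == c).astype(int)        # one-vs-rest labels for class c
        grid = np.linspace(0.01, 0.99, 99)
        f1s = [f1_score(y_binary, (scores[:, idx] >= t).astype(int)) for t in grid]
        thresholds[c] = grid[int(np.argmax(f1s))]      # threshold with the best holdout F1
    return thresholds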
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.3)
X_train, X_holdout, y_train, y_holdout = train_test_split(X_dev, y_dev, test_size=0.2)
clf = ThresholdClassifier(LogisticRegression(), multilabel=True)
clf.fit(X_train, y_train)
clf.optimize_threshold(X_holdout, y_holdout)
predictions = clf.predict(X_test)
"""
array([[0, 0, 1],
       [0, 0, 1],
       [1, 1, 0],
       [1, 1, 1]])
"""
# get the performance for label 1
clf.get_scores(y_test, predictions[:, 1], 1)
"""
{'accuracy': 0.956140350877193,
 'precision': 0.9565217391304348,
 'recall': 0.9705882352941176,
 'auc': 0.952685421994885,
 'f1': 0.9635036496350365}
"""