Feature Selectors

Used to select an optimal subset of features before training a classifier.

Correlation Threshold

This is an unsupervised feature selection algorithm that selects features based on absolute correlation (similar to backward feature selection). It first calculates a pairwise correlation matrix over all features. It then identifies a candidate feature for removal: the feature that correlates with the largest number of other features at a correlation coefficient higher than the threshold. This step is repeated until no feature has a correlation coefficient higher than the threshold or no feature is left.

Parameters
  • threshold – float; default = 0.85. Correlation threshold (0 to 1) above which features are eliminated.

  • passthrough_columns – list of column names; The set of columns the selector should ignore
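
The elimination loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: the function names, the dict-of-lists input, and the tie-breaking rule (the first feature with the highest count is removed) are all hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / (sx * sy)

def correlation_threshold(features, threshold=0.85):
    """features: dict of feature name -> list of values (hypothetical input).
    Returns the names of the features that survive the elimination."""
    remaining = list(features)
    while remaining:
        # For each feature, count the others it correlates with above threshold.
        counts = {
            f: sum(
                1
                for g in remaining
                if g != f and abs(pearson(features[f], features[g])) > threshold
            )
            for f in remaining
        }
        worst = max(remaining, key=lambda f: counts[f])
        if counts[worst] == 0:
            break  # no pair exceeds the threshold any more
        remaining.remove(worst)
    return remaining

demo = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.5],  # close to a multiple of "a"
    "c": [1.1, 2.1, 2.9, 4.2, 5.0],   # close to "a"
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],   # weakly related to the others
}
selected = correlation_threshold(demo, threshold=0.85)
```

In this toy data "a", "b", and "c" are mutually correlated, so two of them are removed one at a time, while "d" never exceeds the threshold against any other feature and is kept.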

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Correlation Threshold',
                            'params':{ "threshold": 0.85 }}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0']
Custom Feature Selection

This feature selection method lets you choose features manually. It takes a list of strings, where each value is the name of a feature to keep.

Parameters
  • input_data (DataFrame) – Input data

  • custom_feature_selection (list) – names of the features to keep

Returns

tuple containing:

selected_features (DataFrame): the selected features and the passthrough columns.

unselected_features (list): the unselected features.

Return type

tuple
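
As a rough illustration of the behavior described above, the sketch below works on a list of column names instead of a DataFrame. The function name `select_by_name` and the `passthrough_columns` argument are assumptions made for the example, not the library's signature.

```python
def select_by_name(columns, custom_feature_selection, passthrough_columns=()):
    """Split column names into kept and dropped lists (hypothetical helper).

    columns: all column names, in order.
    custom_feature_selection: names of the features to keep.
    passthrough_columns: columns that are always kept (assumed behavior).
    """
    keep = set(custom_feature_selection) | set(passthrough_columns)
    selected = [c for c in columns if c in keep]
    unselected = [c for c in columns if c not in keep]
    return selected, unselected

columns = ["Class", "Subject", "gen_0001_accelx_0", "gen_0001_accelx_1",
           "gen_0002_accely_0"]
selected, unselected = select_by_name(
    columns,
    custom_feature_selection=["gen_0001_accelx_1", "gen_0002_accely_0"],
    passthrough_columns=["Class", "Subject"],
)
```

The return value mirrors the documented tuple: the kept columns first, then the list of dropped feature names.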

Custom Feature Selection By Index

This feature selection method lets you choose features manually by index. It takes a dictionary where each key is a feature generator number and each value is an array of the features to keep for that feature generator. All feature generators that are not added as keys in the dictionary will be dropped.

Example

client.pipeline.add_feature_selector([{'name': 'Custom Feature Selection By Index',
                            'params': {"custom_feature_selection":
                                    {1: [0], 2:[0], 3:[1,2,3,4]},
                            }}])

# Selects feature 0 from feature generators 1 and 2, and
# features 1, 2, 3, and 4 from feature generator 3.
Parameters
  • input_data (DataFrame) – Input data

  • custom_feature_selection (dict) – feature generator number and array of features to keep.

Returns

tuple containing:

selected_features (DataFrame): the selected features and the passthrough columns.

unselected_features (list): the unselected features.

Return type

tuple
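
The dictionary lookup can be sketched as follows. This example assumes that column names follow the `gen_NNNN_` pattern seen in the examples on this page and that a feature's index is its position among the columns produced by its generator; both are assumptions, and `select_by_index` is a hypothetical helper name.

```python
import re

# Assumed naming convention: gen_<generator number>_<rest of the name>
GEN_PATTERN = re.compile(r"^gen_(\d+)_")

def select_by_index(columns, custom_feature_selection):
    """Keep features by generator number and position (hypothetical helper)."""
    position = {}  # generator number -> how many of its columns were seen
    selected, unselected = [], []
    for col in columns:
        match = GEN_PATTERN.match(col)
        if match is None:
            selected.append(col)  # not a generated feature: pass it through
            continue
        gen = int(match.group(1))
        index = position.get(gen, 0)
        position[gen] = index + 1
        # Generators missing from the dictionary keep nothing.
        if index in custom_feature_selection.get(gen, []):
            selected.append(col)
        else:
            unselected.append(col)
    return selected, unselected

columns = ["Class", "Subject"] + [
    "gen_{:04d}_{}_{}".format(g, axis, i)
    for g, axis in enumerate(["accelx", "accely", "accelz"], start=1)
    for i in range(5)
]
selected, unselected = select_by_index(columns, {1: [0], 2: [0], 3: [1, 2, 3, 4]})
```

With the dictionary from the example above, six generated features are kept and the remaining nine are reported as unselected.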

Information Gain

This is a supervised feature selection algorithm that selects features based on Information Gain (one class vs other classes approaches).

First, it calculates the Information Gain (IG) of every feature for each class separately, then sorts the features by IG score, standard deviation, and mean difference. A feature with a higher IG is better at differentiating the class from the others. At the end, each class has its own feature list.

Parameters

feature_number – Number of features to select for each class.

Returns

DataFrame which includes selected features for each class.
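
The one-class-versus-others idea can be sketched in plain Python. This is a simplified illustration, not the library's algorithm: it ranks features by IG alone (the standard deviation and mean-difference criteria are left out), it scores a continuous feature by its best single threshold split, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, binary_labels):
    """Best IG over all single-threshold splits of a continuous feature."""
    base = entropy(binary_labels)
    n = len(values)
    best = 0.0
    for t in sorted(set(values))[1:]:  # every split leaves both sides non-empty
        left = [l for v, l in zip(values, binary_labels) if v < t]
        right = [l for v, l in zip(values, binary_labels) if v >= t]
        conditional = len(left) / n * entropy(left) + len(right) / n * entropy(right)
        best = max(best, base - conditional)
    return best

def information_gain_selection(features, labels, feature_number):
    """Return a dict of class -> top features for that class versus the rest."""
    selected = {}
    for cls in sorted(set(labels)):
        one_vs_rest = [l == cls for l in labels]
        ranked = sorted(
            features,
            key=lambda f: information_gain(features[f], one_vs_rest),
            reverse=True,
        )
        selected[cls] = ranked[:feature_number]
    return selected

labels = ["A", "A", "B", "B", "C", "C"]
features = {
    "x": [10.0, 11.0, 1.0, 2.0, 1.5, 2.5],   # high only for class A
    "y": [1.0, 2.0, 10.0, 11.0, 1.5, 2.5],   # high only for class B
    "z": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],     # no clear class structure
}
per_class = information_gain_selection(features, labels, feature_number=1)
```

Here "x" separates class A from the rest perfectly and "y" does the same for class B, so each of those classes ends up with a different top feature.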

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Information Gain',
                            'params':{"feature_number": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0001_accelx_0  gen_0001_accelx_1  gen_0001_accelx_2
    0  Crawling     s01         347.881775         372.258789         208.341858
    1  Crawling     s02         347.713013         224.231735          91.971481
    2  Crawling     s03         545.664429         503.276642         200.263031
    3   Running     s01         -21.588972         -23.511278         -16.322056
    4   Running     s02         422.405182         453.950897         431.893585
    5   Running     s03         350.105774         366.373627         360.777466
    6   Walking     s01         -10.362945         -46.967007           0.492386
    7   Walking     s02         375.751343         413.259460         374.443237
    8   Walking     s03         353.421906         317.618164         283.627502
Recursive Feature Elimination

This is a supervised method of feature selection. The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator (method: ‘Log R’ or ‘Linear SVC’) is trained on the initial set of features and a weight is assigned to each of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features (number_of_features) is reached.

Parameters
  • method – str; The type of selection method. Two options are available: 1) Log R and 2) Linear SVC. For Log R, the inverse of regularization strength C defaults to 1.0 and penalty defaults to l1. For Linear SVC, C defaults to 0.01, penalty is l1, and dual is set to False.

  • number_of_features – int; The number of features you would like the selector to reduce to.

Returns

DataFrame which includes selected features and the passthrough columns.

Return type

DataFrame
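
The pruning loop can be sketched without any dependencies. To stay self-contained, the example swaps in a tiny unregularized logistic regression trained by gradient descent as a stand-in estimator; it is not the l1-penalized Log R or Linear SVC described above, it handles only 0/1 labels, and the function names are hypothetical.

```python
import math

def fit_logistic_weights(rows, labels, epochs=200, lr=0.5):
    """Stand-in estimator: logistic regression by batch gradient descent.
    rows: list of feature vectors; labels: list of 0/1. Returns the weights."""
    n, d = len(rows), len(rows[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * v for w, v in zip(weights, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # prediction minus target
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return weights

def recursive_feature_elimination(features, labels, number_of_features):
    """Drop the feature with the smallest absolute weight until
    number_of_features remain, refitting the estimator each round."""
    remaining = list(features)
    while len(remaining) > number_of_features:
        rows = [[features[f][i] for f in remaining] for i in range(len(labels))]
        weights = fit_logistic_weights(rows, labels)
        weakest = min(range(len(remaining)), key=lambda j: abs(weights[j]))
        del remaining[weakest]
    return remaining

labels = [0, 1, 0, 1, 0, 1, 0, 1]
features = {
    "noise_1": [-1, -1, 1, 1, -1, -1, 1, 1],  # unrelated to the label
    "signal": [-1, 1, -1, 1, -1, 1, -1, 1],   # follows the label exactly
    "noise_2": [-1, -1, -1, -1, 1, 1, 1, 1],  # unrelated to the label
}
kept = recursive_feature_elimination(features, labels, number_of_features=1)
```

Because the estimator is refit after every removal, the weights of the surviving features are re-estimated on the smaller set before the next pruning step, which is what distinguishes RFE from a single ranking pass.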

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Recursive Feature Elimination',
                            'params':{"method": "Log R",
                                      "number_of_features": 3}}],
                          params={'number_of_features':3})
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0001_accelx_2  gen_0003_accelz_1  gen_0003_accelz_4
    0  Crawling     s01         208.341858        3881.038330        3900.734863
    1  Crawling     s02          91.971481        3821.513428        3896.376221
    2  Crawling     s03         200.263031        3896.349121        3889.297119
    3   Running     s01         -16.322056         641.164185         605.192993
    4   Running     s02         431.893585         870.608459         846.671204
    5   Running     s03         360.777466         263.184052         234.177200
    6   Walking     s01           0.492386         559.139587         558.538086
    7   Walking     s02         374.443237         658.902710         669.394592
    8   Walking     s03         283.627502         -87.612816         -98.735649

Notes

For more information on the defaults of Log R, please see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

For Linear SVC, please see: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

Tree-based Selection

Select features using a supervised tree-based algorithm. This class implements a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting. The default number of trees in the forest is set to 250, and random_state is set to 0. Please see the notes for more information.

Parameters

number_of_features – int; The number of features you would like the selector to reduce to.

Returns

DataFrame which includes selected features and the passthrough columns for each class.

Return type

DataFrame

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Tree-based Selection', 'params':{ "number_of_features": 4 }}] )
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0002_accely_0  gen_0002_accely_1  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4  gen_0003_accelz_0  gen_0003_accelz_1  gen_0003_accelz_2  gen_0003_accelz_3  gen_0003_accelz_4
    0  Crawling     s01           1.669203           1.559860           1.526786           1.414068           1.413625           1.360500           1.368615           1.413445           1.426949           1.400083
    1  Crawling     s02           1.486925           1.418474           1.377726           1.414068           1.413625           1.360500           1.368615           1.388456           1.408576           1.397417
    2  Crawling     s03           1.035519           1.252789           1.332684           1.328587           1.324469           1.410274           1.414961           1.384032           1.345107           1.393088
    3   Running     s01          -0.700995          -0.678448          -0.706631          -0.674960          -0.713493          -0.572269          -0.600986          -0.582678          -0.560071          -0.615270
    4   Running     s02          -0.659030          -0.709012          -0.678594          -0.688869          -0.700753          -0.494247          -0.458891          -0.471897          -0.475010          -0.467597
    5   Running     s03          -0.712790          -0.713026          -0.740177          -0.728651          -0.733076          -0.836257          -0.835071          -0.868028          -0.855081          -0.842161
    6   Walking     s01          -0.701450          -0.714677          -0.692671          -0.716556          -0.696635          -0.652326          -0.651784          -0.640956          -0.655958          -0.643802
    7   Walking     s02          -0.698335          -0.689857          -0.696807          -0.702233          -0.682212          -0.551928          -0.590001          -0.570077          -0.558563          -0.576008
    8   Walking     s03          -0.719046          -0.726102          -0.722315          -0.727506          -0.712461          -1.077342          -1.052320          -1.052297          -1.075949          -1.045750

Notes

For more information, please see: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

t-Test Feature Selector

This is a supervised feature selection algorithm that selects features based on a two-tailed t-test. It computes the p-values, then selects the top-performing features for each class, as many as defined by feature_number. It returns a reduced, combined list of the selected features for all classes.

Parameters
  • input_data – DataFrame

  • label_column (str) – Class label

  • feature_number (int) – The number of features to select for each class

  • passthrough_columns – list of columns the selector should ignore
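
A dependency-free sketch of the per-class ranking is shown below. Computing exact p-values needs the t-distribution, which the standard library does not provide, so this example ranks features by the absolute Welch t statistic of one class against the rest instead. That substitution, the one-versus-rest comparison, and the function names are assumptions, and the ordering may differ from a p-value ordering when degrees of freedom vary between features.

```python
import math
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    difference = mean(sample_a) - mean(sample_b)
    if standard_error == 0:
        return math.copysign(math.inf, difference) if difference else 0.0
    return difference / standard_error

def ttest_selection(features, labels, feature_number):
    """Pick the top features per class and merge them into one list."""
    selected = []
    for cls in sorted(set(labels)):
        def score(name):
            inside = [v for v, l in zip(features[name], labels) if l == cls]
            outside = [v for v, l in zip(features[name], labels) if l != cls]
            return abs(welch_t(inside, outside))
        for name in sorted(features, key=score, reverse=True)[:feature_number]:
            if name not in selected:  # combined list holds each feature once
                selected.append(name)
    return selected

labels = ["A", "A", "A", "B", "B", "B"]
features = {
    "x": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],  # class means differ clearly
    "z": [1.0, 5.0, 3.0, 1.1, 4.9, 3.1],  # class means nearly equal
}
combined = ttest_selection(features, labels, feature_number=1)
```

Both classes pick "x" as their best feature, so the combined list contains it only once.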

Univariate Selection

Selects the features with the highest univariate (ANOVA) F-values. It is a supervised feature selection method and requires both input features and labels.

Parameters

number_of_features – int; The number of features you would like the selector to reduce to.

Returns

DataFrame which includes selected features and the passthrough columns.

Return type

DataFrame
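
The ranking can be illustrated with a plain-Python one-way ANOVA F-value. This is a sketch of the statistic and the top-k selection only; the function names and the dict-of-lists input are hypothetical, and the library itself points to scikit-learn's SelectKBest and f_classif in the notes below.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F-value of a single feature across the label groups."""
    groups = {}
    for v, l in zip(values, labels):
        groups.setdefault(l, []).append(v)
    grand_mean = mean(values)
    n, k = len(values), len(groups)
    group_means = {l: mean(g) for l, g in groups.items()}
    ss_between = sum(
        len(g) * (group_means[l] - grand_mean) ** 2 for l, g in groups.items()
    )
    ss_within = sum(
        (v - group_means[l]) ** 2 for l, g in groups.items() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def univariate_selection(features, labels, number_of_features):
    """Keep the number_of_features features with the highest F-values."""
    ranked = sorted(
        features, key=lambda f: anova_f(features[f], labels), reverse=True
    )
    keep = set(ranked[:number_of_features])
    return [f for f in features if f in keep]  # preserve the original order

labels = ["A", "A", "B", "B", "C", "C"]
features = {
    "x": [1.0, 1.2, 5.0, 5.2, 9.0, 9.2],  # every class well separated
    "y": [1.0, 9.0, 5.0, 1.2, 9.2, 5.2],  # large spread inside each class
    "z": [3.0, 3.1, 3.0, 3.2, 4.0, 4.1],  # only class C stands out
}
kept = univariate_selection(features, labels, number_of_features=2)
```

Each feature is scored on its own, so two redundant features with high F-values would both be kept; combining this selector with a correlation-based one is a common way to address that.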

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Univariate Selection',
                    'params': {"number_of_features": 3 } }])
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
    0  Crawling     s01           1.526786           1.496120           1.500535
    1  Crawling     s02           1.377726           1.414068           1.413625
    2  Crawling     s03           1.332684           1.328587           1.324469
    3   Running     s01          -0.706631          -0.674960          -0.713493
    4   Running     s02          -0.678594          -0.688869          -0.700753
    5   Running     s03          -0.740177          -0.728651          -0.733076
    6   Walking     s01          -0.692671          -0.716556          -0.696635
    7   Walking     s02          -0.696807          -0.702233          -0.682212
    8   Walking     s03          -0.722315          -0.727506          -0.712461

Notes

Please see the following for more information:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif

Variance Threshold

Feature selector that removes all low-variance features. It is an unsupervised feature selection algorithm that looks only at the input features (X), not the labels or outputs (y). It selects the features whose variance exceeds the given threshold (default is set to 0.05). It should be applied prior to standardization, since standardized features all have unit variance.

Parameters

threshold – float; default = 0.01. Minimum variance threshold under which features should be eliminated.

Returns

DataFrame which includes selected features and the passthrough columns.

Return type

DataFrame
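
The filter itself is small enough to sketch in a few lines. The example assumes population variance and a strict "greater than" comparison, and it takes the threshold as a required argument; the function name and the dict-of-lists input are hypothetical.

```python
from statistics import pvariance

def variance_threshold(features, threshold):
    """Keep features whose (population) variance exceeds the threshold."""
    selected = [f for f, values in features.items() if pvariance(values) > threshold]
    unselected = [f for f in features if f not in selected]
    return selected, unselected

features = {
    "flat": [1.0, 1.0, 1.001, 1.0],  # almost constant
    "wide": [1.0, 5.0, 9.0, 13.0],   # varies a lot
}
selected, unselected = variance_threshold(features, threshold=0.01)
```

Because the comparison uses raw variances, the threshold has to be chosen on the scale of the data, which is why the example below uses a very large value for the raw accelerometer features.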

Examples

>>> client.pipeline.reset()
>>> df = client.datasets.load_activity_raw_toy()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name':'Variance Threshold',
                            'params':{"threshold": 4513492.05}}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4']