Feature Selectors

Used to select a subset of features before training a classifier.

Correlation Threshold

Correlation feature selection is an unsupervised feature selection algorithm that removes features based on their absolute correlation with the other features in the dataset. The algorithm first computes the pairwise correlation matrix of all features. It then identifies a candidate for removal: the feature that is correlated above the specified threshold with the largest number of other features. This step is repeated until no remaining pair of features has a correlation above the threshold, or until no features are left. Removing the most widely correlated features first can help reduce multicollinearity and may improve model performance.
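
The iterative removal described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names, the dictionary-of-lists input (standing in for a DataFrame), and the tie-breaking rule (the first feature in column order wins) are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y) if norm_x and norm_y else 0.0


def correlation_threshold(features, threshold=0.85):
    """Hypothetical helper: features maps a column name to its list of values."""
    names = list(features)
    # Absolute pairwise correlations, computed once up front.
    corr = {(a, b): abs(pearson(features[a], features[b]))
            for a in names for b in names if a != b}
    kept, removed = names[:], []
    while kept:
        # For each remaining feature, count the partners it exceeds the threshold with.
        counts = {a: sum(corr[(a, b)] > threshold for b in kept if b != a)
                  for a in kept}
        candidate = max(kept, key=counts.get)  # ties: first in column order (assumption)
        if counts[candidate] == 0:
            break  # no remaining pair is correlated above the threshold
        kept.remove(candidate)
        removed.append(candidate)
    return kept, removed


# Toy data: a, b and c are almost perfectly correlated; d is not.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1.1, 2.1, 2.9, 4.2, 5.0],
    "d": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept, removed = correlation_threshold(toy, threshold=0.85)
print(kept, removed)
```

On this toy data the sketch removes a and then b, leaving c and d, because each removal lowers the counts of the remaining correlated features.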

Parameters
  • input_data – DataFrame containing the input features

  • threshold – float, default=0.85. Minimum correlation threshold over which features should be eliminated (0 to 1).

  • passthrough_columns – Optional list of column names to be ignored by the selector.

  • feature_table – Optional DataFrame that contains the correlation matrix of input features. If this argument is provided, the correlation matrix will not be calculated again.

  • median_sample_size – Optional float value to use instead of median when a feature has no correlation with other features.

Returns

A tuple containing the DataFrame with selected features and the list of removed features.

Return type

Tuple[DataFrame, list]

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Correlation Threshold',
                            'params':{ "threshold": 0.85 }}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0']
Custom Feature Selection

This selector keeps only the features you name explicitly. It takes a list of strings, where each string is the name of a feature to keep.

Parameters
  • input_data – DataFrame, input data

  • custom_feature_selection – list, feature generator names to keep

  • passthrough_columns – list, columns to pass through without modification

  • **kwargs – additional keyword arguments

Returns

tuple containing:

  • selected_features: DataFrame, which includes selected features and the passthrough columns.

  • unselected_features: list, unselected features

Return type

tuple

Custom Feature Selection By Index

This selector keeps features by position. It takes a dictionary in which each key is a feature generator number and each value is a list of the feature indices to keep for that generator. Feature generators that do not appear as keys in the dictionary are dropped.

Parameters
  • input_data (DataFrame) – Input data to perform feature selection on.

  • custom_feature_selection (dict) – A dictionary of feature generators and their corresponding features to keep.

  • passthrough_columns (list) – A list of columns to include in the output DataFrame in addition to the selected features.

  • **kwargs – Additional keyword arguments to pass to the function.

Returns

A tuple containing the selected features and the passthrough columns as a DataFrame, and a list of unselected features.

Return type

Tuple[DataFrame, list]

Example

client.pipeline.add_feature_selector([{'name': 'Custom Feature Selection By Index',
                            'params': {"custom_feature_selection":
                                    {1: [0], 2:[0], 3:[1,2,3,4]},
                            }}])

# This selects feature 0 from feature generators 1 and 2, and
# features 1, 2, 3, and 4 from feature generator 3.
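
To make the mapping from the dictionary to column names concrete, here is a small sketch. It assumes feature columns follow the gen_<generator number>_<name>_<feature index> pattern seen in the Downsample examples on this page; the helper name select_by_index and the regular expression are illustrative and not part of the library.

```python
import re

# Assumed column pattern: gen_<generator number>_<name>_<feature index>
COLUMN_PATTERN = re.compile(r"^gen_(\d+)_.*_(\d+)$")


def select_by_index(columns, custom_feature_selection):
    """Hypothetical helper: split columns into kept and dropped lists."""
    kept, dropped = [], []
    for column in columns:
        match = COLUMN_PATTERN.match(column)
        if match:
            generator, index = int(match.group(1)), int(match.group(2))
            # Generators missing from the dictionary keep nothing.
            if index in custom_feature_selection.get(generator, []):
                kept.append(column)
                continue
        dropped.append(column)
    return kept, dropped


# Fifteen Downsample columns named like the ones in the examples on this page.
columns = [
    f"gen_{number:04d}_{axis}_{index}"
    for number, axis in enumerate(["accelx", "accely", "accelz"], start=1)
    for index in range(5)
]
kept, dropped = select_by_index(columns, {1: [0], 2: [0], 3: [1, 2, 3, 4]})
print(kept)
```

With the dictionary from the example above, six of the fifteen columns are kept: index 0 from generators 1 and 2, and indices 1 to 4 from generator 3.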
Feature Selector By Family

This is an unsupervised method of feature selection. The goal is to randomly select features from the specified feature generators until the maximum number of generators given as input is reached. If no specific generator is provided, all feature generators have an equal chance to be selected.
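
The selection logic can be sketched as follows. This is only an illustration under stated assumptions: the function select_by_family, the generator_families mapping (generator id to family name), and the rule that leftover slots are filled from families not listed in any request are hypothetical and not taken from the library source.

```python
import random


def select_by_family(generator_families, generators=None,
                     max_number_generators=5, random_seed=None):
    """Hypothetical helper: pick generator ids at random, family by family."""
    rng = random.Random(random_seed)
    selected, listed = [], set()
    for spec in generators or []:
        names = spec["generator_names"]
        names = [names] if isinstance(names, str) else list(names)
        listed.update(names)
        # Candidates: generators from the requested families not picked yet.
        pool = sorted(g for g, family in generator_families.items()
                      if family in names and g not in selected)
        room = max_number_generators - len(selected)
        selected.extend(rng.sample(pool, min(spec["number"], len(pool), room)))
    # Fill the remaining slots from families that no request mentioned.
    rest = sorted(g for g, family in generator_families.items()
                  if family not in listed)
    room = max_number_generators - len(selected)
    selected.extend(rng.sample(rest, min(room, len(rest))))
    return sorted(selected)


families = {
    "gen_0001": "MFCC", "gen_0002": "Downsample", "gen_0003": "Downsample",
    "gen_0004": "Downsample", "gen_0005": "MFCC", "gen_0006": "Power Spectrum",
    "gen_0007": "Absolute Area", "gen_0008": "Absolute Area",
}
requests = [
    {"generator_names": "Downsample", "number": 2},
    {"generator_names": ["MFCC", "Power Spectrum"], "number": 2},
]
picked = select_by_family(families, requests, max_number_generators=5, random_seed=1)
print(picked)
```

Which generators are picked depends on the seed, but the counts per family are fixed by the requests: two Downsample generators, two from MFCC or Power Spectrum, and one from the remaining families.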

Parameters
  • input_data (DataFrame) – Input data to perform feature selection on.

  • generators (List[Dict[str, Union[str, int]]]) – A list of feature generators to select from. Each member of this list is a dictionary of this form: {“generator_names”: [(str)] or (str), “number”: (int)}, where “generator_names” lists the name(s) of the generator(s) to select from, and “number” is the desired number of generators.

  • max_number_generators (int) – [Default 5] The maximum number of feature generators to keep.

  • random_seed (int) – [Optional] Random initialization seed.

  • passthrough_columns (List[str]) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.

  • **kwargs – Additional keyword arguments to pass.

Returns

A tuple containing a DataFrame that includes the selected features and the passthrough columns and a list containing the unselected feature columns.

Return type

Tuple[DataFrame, List[str]]

Examples

>>> client.project  = <project_name>
>>> client.pipeline = <pipeline_name>
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.reset()
>>> client.pipeline.set_input_data('test_data', df,
                             data_columns = ['accelx', 'accely', 'accelz', 'gyrox', 'gyroy', 'gyroz'],
                             group_columns = ['Subject','Class', 'Rep'],
                             label_column = 'Class')
>>> client.pipeline.add_feature_generator([
            {
                "name": "MFCC",
                "params": {
                    "columns": ["accelx"],
                    "sample_rate": 10,
                    "cepstra_count": 3,
                },
            },
            {
                "name": "Downsample",
                "params": {"columns": ["accelx", "accely", "accelz"], "new_length": 3},
            },
            {
                "name": "MFCC",
                "params": {
                    "columns": ["accely"],
                    "sample_rate": 10,
                    "cepstra_count": 4,
                },
            },
            {
                "name": "Power Spectrum",
                "params": {
                    "columns": ["accelx"],
                    "number_of_bins": 5,
                    "window_type": "hanning",
                },
            },
            {
                "name": "Absolute Area",
                "params": {
                    "sample_rate": 10,
                    "columns": ["accelx", "accelz"],
                },
            },
        ])
>>>
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    # List of all features before the feature selection algorithm
    Out:
    ['gen_0001_accelxmfcc_000000',
    'gen_0001_accelxmfcc_000001',
    'gen_0001_accelxmfcc_000002',
    'Class',
    'Rep',
    'Subject',
    'gen_0002_accelxDownsample_0',
    'gen_0002_accelxDownsample_1',
    'gen_0002_accelxDownsample_2',
    'gen_0003_accelyDownsample_0',
    'gen_0003_accelyDownsample_1',
    'gen_0003_accelyDownsample_2',
    'gen_0004_accelzDownsample_0',
    'gen_0004_accelzDownsample_1',
    'gen_0004_accelzDownsample_2',
    'gen_0005_accelymfcc_000000',
    'gen_0005_accelymfcc_000001',
    'gen_0005_accelymfcc_000002',
    'gen_0005_accelymfcc_000003',
    'gen_0006_accelxPowerSpec_000000',
    'gen_0006_accelxPowerSpec_000001',
    'gen_0006_accelxPowerSpec_000002',
    'gen_0006_accelxPowerSpec_000003',
    'gen_0006_accelxPowerSpec_000004',
    'gen_0007_accelxAbsArea',
    'gen_0008_accelzAbsArea']

Here, the feature selector picks up to 5 feature generators: 2 may come from the “Downsample” generator, 2 may be either “MFCC” or “Power Spectrum” generators, and the rest may be any other generator not listed in any of the “generator_names” lists.

>>> client.pipeline.add_feature_selector([{"name": "Feature Selector By Family",
                                        "params":{
                                            "max_number_generators": 5,
                                            "random_seed": 1,
                                            "generators":[
                                            {
                                                "generator_names": "Downsample",
                                                "number": 2
                                            },
                                            {
                                                "generator_names": ["MFCC", "Power Spectrum"],
                                                "number": 2
                                            }]
                                    }}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    # List of all features after the feature selection algorithm
    # Because of the random nature of this selector function, the output might be
    # slightly different each time based on the chosen seed
    Out:
    ['Class',
    'Rep',
    'Subject',
    'gen_0003_accelyDownsample_0',
    'gen_0003_accelyDownsample_1',
    'gen_0003_accelyDownsample_2',
    'gen_0002_accelxDownsample_0',
    'gen_0002_accelxDownsample_1',
    'gen_0002_accelxDownsample_2',
    'gen_0005_accelymfcc_000000',
    'gen_0005_accelymfcc_000001',
    'gen_0005_accelymfcc_000002',
    'gen_0005_accelymfcc_000003',
    'gen_0001_accelxmfcc_000000',
    'gen_0001_accelxmfcc_000001',
    'gen_0001_accelxmfcc_000002',
    'gen_0008_accelzAbsArea']
Information Gain

This is a supervised feature selection algorithm that selects features based on Information Gain, using a one-class-versus-the-rest approach.

First, it calculates the Information Gain (IG) of every feature for each class separately, then sorts the features by IG score, standard deviation, and mean difference. A feature with a higher IG is better at differentiating that class from the others. At the end, each class has its own feature list.
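
The one-versus-rest ranking can be illustrated with a short sketch. It makes several assumptions that the text above does not confirm: IG is computed for a continuous feature from its best single-threshold split, features are ranked by IG alone (the standard deviation and mean-difference criteria are omitted), and all function names are hypothetical.

```python
from math import log2


def entropy(binary_labels):
    """Shannon entropy (in bits) of a list of 0/1 labels."""
    n = len(binary_labels)
    if n == 0:
        return 0.0
    p = sum(binary_labels) / n
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(values, binary_labels):
    """Best information gain over all single-threshold splits of one feature."""
    base, best, n = entropy(binary_labels), 0.0, len(values)
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, binary_labels) if v <= threshold]
        right = [l for v, l in zip(values, binary_labels) if v > threshold]
        weight = len(left) / n
        gain = base - weight * entropy(left) - (1 - weight) * entropy(right)
        best = max(best, gain)
    return best


def select_by_information_gain(features, classes, feature_number=2):
    """Hypothetical helper: union of the top features for each class."""
    selected = []
    for cls in sorted(set(classes)):
        one_vs_rest = [1 if c == cls else 0 for c in classes]
        ranked = sorted(features,
                        key=lambda name: information_gain(features[name], one_vs_rest),
                        reverse=True)
        for name in ranked[:feature_number]:
            if name not in selected:
                selected.append(name)
    return selected


classes = ["A", "A", "A", "B", "B", "B"]
toy = {
    "separating": [1.0, 2.0, 3.0, 10.0, 11.0, 12.0],
    "noisy": [5.0, 1.0, 6.0, 2.0, 7.0, 3.0],
}
best_gain = information_gain(toy["separating"], [1, 1, 1, 0, 0, 0])
chosen = select_by_information_gain(toy, classes, feature_number=1)
print(best_gain, chosen)
```

The feature that splits the two classes cleanly reaches the maximum gain of one bit and is selected for both classes, so the combined list contains it only once.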

Parameters
  • input_data (DataFrame) – Input data.

  • label_column (str) – The label column in the input_data.

  • feature_number (int) – [Default 2] Number of features to select for each class.

  • passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.

  • **kwargs – Additional keyword arguments.

Returns

A tuple containing the selected features and the passthrough columns as a DataFrame, and a list of unselected features.

Return type

Tuple[DataFrame, list]

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Information Gain',
                            'params':{"feature_number": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0001_accelx_0  gen_0001_accelx_1  gen_0001_accelx_2
    0  Crawling     s01         347.881775         372.258789         208.341858
    1  Crawling     s02         347.713013         224.231735          91.971481
    2  Crawling     s03         545.664429         503.276642         200.263031
    3   Running     s01         -21.588972         -23.511278         -16.322056
    4   Running     s02         422.405182         453.950897         431.893585
    5   Running     s03         350.105774         366.373627         360.777466
    6   Walking     s01         -10.362945         -46.967007           0.492386
    7   Walking     s02         375.751343         413.259460         374.443237
    8   Walking     s03         353.421906         317.618164         283.627502
Recursive Feature Elimination

This is a supervised method of feature selection. The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator (method: ‘Log R’ or ‘Linear SVC’) is trained on the initial set of features and a weight is assigned to each one of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. This procedure is repeated recursively on the pruned set until the desired number of features, number_of_features, is reached.

Parameters
  • input_data (DataFrame) – Input data to perform feature selection on.

  • label_column (str) – Name of the column containing the labels.

  • method (str) – The type of selection method. Two options are available: 1) Log R and 2) Linear SVC. For Log R, the inverse regularization strength C defaults to 1.0 and penalty defaults to l1. For Linear SVC, C defaults to 0.01, penalty defaults to l1, and dual is set to False.

  • number_of_features (int) – The number of features you would like the selector to reduce to.

  • passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.

Returns

A tuple containing:
  • DataFrame: A DataFrame that includes the selected features and the passthrough columns.

  • list: A list of unselected features.

Return type

tuple

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Recursive Feature Elimination',
                            'params':{"method": "Log R",
                                      "number_of_features": 3}}],
                          params={'number_of_features':3})
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0001_accelx_2  gen_0003_accelz_1  gen_0003_accelz_4
    0  Crawling     s01         208.341858        3881.038330        3900.734863
    1  Crawling     s02          91.971481        3821.513428        3896.376221
    2  Crawling     s03         200.263031        3896.349121        3889.297119
    3   Running     s01         -16.322056         641.164185         605.192993
    4   Running     s02         431.893585         870.608459         846.671204
    5   Running     s03         360.777466         263.184052         234.177200
    6   Walking     s01           0.492386         559.139587         558.538086
    7   Walking     s02         374.443237         658.902710         669.394592
    8   Walking     s03         283.627502         -87.612816         -98.735649

Notes

For more information on the defaults of Log R, please see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression

For Linear SVC, please see: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC

Tree-based Selection

Select features using a supervised tree-based algorithm. This selector uses a meta estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. The number of trees in the forest defaults to 250, and random_state defaults to 0. Please see the notes for more information.

Parameters
  • input_data (DataFrame) – Input data.

  • label_column (str) – Label column of input data.

  • number_of_features (int) – The number of features you would like the selector to reduce to.

  • passthrough_columns (list, optional) – A list of columns to include in the output dataframe in addition to the selected features. Defaults to None.

Returns

A tuple containing:
  • selected_features (DataFrame): DataFrame which includes selected features and the passthrough columns for each class.

  • unselected_features (list): A list of unselected features.

Return type

tuple

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Tree-based Selection', 'params':{ "number_of_features": 4 }}] )
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0002_accely_0  gen_0002_accely_1  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4  gen_0003_accelz_0  gen_0003_accelz_1  gen_0003_accelz_2  gen_0003_accelz_3  gen_0003_accelz_4
    0  Crawling     s01           1.669203           1.559860           1.526786           1.414068           1.413625           1.360500           1.368615           1.413445           1.426949           1.400083
    1  Crawling     s02           1.486925           1.418474           1.377726           1.414068           1.413625           1.360500           1.368615           1.388456           1.408576           1.397417
    2  Crawling     s03           1.035519           1.252789           1.332684           1.328587           1.324469           1.410274           1.414961           1.384032           1.345107           1.393088
    3   Running     s01          -0.700995          -0.678448          -0.706631          -0.674960          -0.713493          -0.572269          -0.600986          -0.582678          -0.560071          -0.615270
    4   Running     s02          -0.659030          -0.709012          -0.678594          -0.688869          -0.700753          -0.494247          -0.458891          -0.471897          -0.475010          -0.467597
    5   Running     s03          -0.712790          -0.713026          -0.740177          -0.728651          -0.733076          -0.836257          -0.835071          -0.868028          -0.855081          -0.842161
    6   Walking     s01          -0.701450          -0.714677          -0.692671          -0.716556          -0.696635          -0.652326          -0.651784          -0.640956          -0.655958          -0.643802
    7   Walking     s02          -0.698335          -0.689857          -0.696807          -0.702233          -0.682212          -0.551928          -0.590001          -0.570077          -0.558563          -0.576008
    8   Walking     s03          -0.719046          -0.726102          -0.722315          -0.727506          -0.712461          -1.077342          -1.052320          -1.052297          -1.075949          -1.045750

Notes

For more information, please see: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html

t-Test Feature Selector

This is a supervised feature selection algorithm that selects features based on a two-tailed t-test. It computes the p-values and selects the top-performing number of features for each class as defined by feature_number. It returns a reduced combined list of all the selected features.
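
A per-class ranking of this kind can be sketched in plain Python. The sketch assumes a one-class-versus-the-rest comparison and a pooled-variance (Student's) t-test; neither detail is stated above. With pooled variance every feature shares the same degrees of freedom, so ranking by the absolute t statistic gives the same order as ranking by the two-tailed p-value, which lets the sketch avoid computing p-values. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance t statistic for two samples with at least 2 values each."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    difference = mean(a) - mean(b)
    if pooled == 0:
        return 0.0 if difference == 0 else float("inf")
    return difference / sqrt(pooled * (1 / na + 1 / nb))


def select_by_ttest(features, classes, feature_number=1):
    """Hypothetical helper: union of the top features for each class."""
    selected = []
    for cls in sorted(set(classes)):
        def score(name):
            inside = [v for v, c in zip(features[name], classes) if c == cls]
            outside = [v for v, c in zip(features[name], classes) if c != cls]
            return abs(t_statistic(inside, outside))
        # A larger |t| corresponds to a smaller two-tailed p-value.
        for name in sorted(features, key=score, reverse=True)[:feature_number]:
            if name not in selected:
                selected.append(name)
    return selected


classes = ["A", "A", "A", "B", "B", "B"]
toy = {
    "shifted": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
    "identical": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "overlapping": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
}
chosen = select_by_ttest(toy, classes, feature_number=1)
print(chosen)
```

Because the same feature wins for both classes in this toy data, the combined list is reduced to a single entry.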

Parameters
  • input_data (DataFrame) – Input data

  • label_column (str) – Column containing class labels

  • feature_number (int) – Number of features to select for each class

  • passthrough_columns (Optional[List[str]]) – List of columns that the selector should ignore

Returns

A tuple containing:
  • DataFrame: DataFrame which includes selected features and the passthrough columns.

  • List[str]: List of unselected features.

Return type

Tuple[DataFrame, List[str]]

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name':'ttest Feature Selector',
        'params':{"feature_number": 2 }}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    Out:
     [u'Class',
     u'Subject',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_4']

Univariate Selection

Univariate feature selection using ANOVA (Analysis of Variance) is a statistical method used to identify the most relevant features in a dataset by analyzing the variance between different groups. It is a supervised method of feature selection, which means that it requires labeled data.

The ANOVA test calculates the F-value for each feature by comparing the variance within each class to the variance between classes. The higher the F-value, the more significant the feature is in differentiating between the classes. Univariate feature selection selects the top k features with the highest F-values, where k is a specified parameter.
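
The F-value computation can be written out in a few lines. This is a plain-Python sketch of the standard one-way ANOVA F statistic followed by a top-k selection; the helper names and the dictionary-of-lists input are assumptions, and the selector itself relies on the scikit-learn functions referenced in the notes of this section.

```python
def anova_f(values, labels):
    """One-way ANOVA F statistic of a single feature against class labels."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares.
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def select_k_best(features, labels, number_of_features):
    """Hypothetical helper: names of the k features with the highest F-values."""
    ranked = sorted(features, key=lambda name: anova_f(features[name], labels),
                    reverse=True)
    return ranked[:number_of_features]


labels = ["A", "A", "A", "B", "B", "B"]
toy = {
    "separating": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "noisy": [1.0, 5.0, 3.0, 2.0, 6.0, 4.0],
}
f_separating = anova_f(toy["separating"], labels)
f_noisy = anova_f(toy["noisy"], labels)
top = select_k_best(toy, labels, number_of_features=1)
print(f_separating, f_noisy, top)
```

For the separating feature the between-class sum of squares is 13.5 and the within-class sum is 4 with 4 degrees of freedom, giving F = 13.5; the noisy feature scores 0.375.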

Parameters
  • input_data (DataFrame) – Input data

  • label_column (str) – Label column name

  • number_of_features (int) – The number of features you would like the selector to reduce to.

  • passthrough_columns (Optional[list], optional) – List of columns to pass through. Defaults to None.

Returns

Tuple containing:
  • DataFrame: DataFrame which includes selected features and the passthrough columns.

  • list: List of unselected features.

Return type

tuple

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_selector([{'name':'Univariate Selection',
                    'params': {"number_of_features": 3 } }])
>>> results, stats = client.pipeline.execute()
>>> print(results)
    Out:
          Class Subject  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
    0  Crawling     s01           1.526786           1.496120           1.500535
    1  Crawling     s02           1.377726           1.414068           1.413625
    2  Crawling     s03           1.332684           1.328587           1.324469
    3   Running     s01          -0.706631          -0.674960          -0.713493
    4   Running     s02          -0.678594          -0.688869          -0.700753
    5   Running     s03          -0.740177          -0.728651          -0.733076
    6   Walking     s01          -0.692671          -0.716556          -0.696635
    7   Walking     s02          -0.696807          -0.702233          -0.682212
    8   Walking     s03          -0.722315          -0.727506          -0.712461

Notes

Please see the following for more information:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif

Variance Threshold

Feature selector that removes all low-variance features.

This step is an unsupervised feature selection algorithm: it looks only at the input features (X) and not at the labels or outputs (y). It selects the features whose variance exceeds the given threshold; see the threshold parameter for the default value. It should be applied prior to standardization.
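
As a minimal sketch of the rule described above: the helper name, the dictionary-of-lists input, and the use of population variance (rather than sample variance) are assumptions, and the threshold is passed explicitly.

```python
from statistics import pvariance


def variance_threshold(features, threshold):
    """Hypothetical helper: keep the features whose variance exceeds the threshold."""
    kept = [name for name, values in features.items() if pvariance(values) > threshold]
    removed = [name for name in features if name not in kept]
    return kept, removed


toy = {
    "flat": [1.0, 1.0, 1.0, 1.0],         # variance 0
    "nearly_flat": [1.0, 1.1, 1.0, 1.1],  # variance 0.0025
    "varied": [0.0, 2.0, 0.0, 2.0],       # variance 1.0
}
kept, removed = variance_threshold(toy, threshold=0.01)
print(kept, removed)
```

Standardizing first would rescale every feature to unit variance, which is why the filter only makes sense on unscaled features.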

Parameters
  • input_data (DataFrame) – Input data.

  • threshold (float) – [Default 0.01] Minimum variance threshold under which features should be eliminated.

  • passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.

Returns

tuple containing:

  • selected_features (DataFrame): which includes selected features and the passthrough columns.

  • unselected_features (list): unselected features

Return type

tuple

Examples

>>> client.pipeline.reset()
>>> df = client.datasets.load_activity_raw_toy()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0001_accelx_0',
     u'gen_0001_accelx_1',
     u'gen_0001_accelx_2',
     u'gen_0001_accelx_3',
     u'gen_0001_accelx_4',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4',
     u'gen_0003_accelz_0',
     u'gen_0003_accelz_1',
     u'gen_0003_accelz_2',
     u'gen_0003_accelz_3',
     u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name':'Variance Threshold',
                            'params':{"threshold": 4513492.05}}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
    Out:
    [u'Class',
     u'Subject',
     u'gen_0002_accely_0',
     u'gen_0002_accely_1',
     u'gen_0002_accely_2',
     u'gen_0002_accely_3',
     u'gen_0002_accely_4']