Feature Selectors
Used to select an optimal subset of features before training a classifier.
Copyright 2017-2024 SensiML Corporation
This file is part of SensiML™ Piccolo AI™.
SensiML Piccolo AI is free software: you can redistribute it and/or modify it under the terms of the GNU Affero General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
SensiML Piccolo AI is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Affero General Public License for more details.
You should have received a copy of the GNU Affero General Public License along with SensiML Piccolo AI. If not, see <https://www.gnu.org/licenses/>.
-
Correlation Threshold
Correlation feature selection is an unsupervised feature selection algorithm that aims to select features based on their absolute correlation with the other features in the dataset. The algorithm begins by computing a pairwise correlation matrix of all the features. It then proceeds to identify a candidate feature for removal, which is the feature that correlates with the highest number of other features that have a correlation coefficient greater than the specified threshold. This process is repeated iteratively until there are no more features with a correlation coefficient higher than the threshold, or when there are no features left. The main objective is to remove the most correlated features first, which could help reduce multicollinearity issues and improve model performance.
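The iterative procedure can be sketched in a few lines of pandas. This is an illustrative standalone version, not the Piccolo AI implementation: the function name is hypothetical, and the real selector's tie-breaking and median_sample_size handling are omitted.

```python
import numpy as np
import pandas as pd

def correlation_threshold(features, threshold=0.85):
    """Iteratively drop the feature that exceeds `threshold` correlation
    with the largest number of other features."""
    features = features.copy()
    removed = []
    while features.shape[1] > 1:
        corr = features.corr().abs()
        # zero the diagonal so self-correlation is ignored
        corr = corr.mask(np.eye(len(corr), dtype=bool), 0.0)
        counts = (corr > threshold).sum()   # over-threshold partners per feature
        if counts.max() == 0:
            break                           # nothing left above the threshold
        worst = counts.idxmax()             # most-correlated candidate
        removed.append(worst)
        features = features.drop(columns=worst)
    return features, removed

rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"a": x,
                   "b": x + rng.normal(scale=0.01, size=100),  # near-duplicate of 'a'
                   "c": rng.normal(size=100)})                 # independent
kept, dropped = correlation_threshold(df, threshold=0.85)
```

Here 'a' and 'b' are nearly identical, so one of them is removed; the independent column 'c' survives.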
- Parameters
input_data – DataFrame containing the input features
threshold – float, default=0.85. Minimum correlation threshold over which features should be eliminated (0 to 1).
passthrough_columns – Optional list of column names to be ignored by the selector.
feature_table – Optional DataFrame that contains the correlation matrix of input features. If this argument is provided, the correlation matrix will not be calculated again.
median_sample_size – Optional float value to use instead of median when a feature has no correlation with other features.
- Returns
A tuple containing the DataFrame with selected features and the list of removed features.
- Return type
Tuple[DataFrame, list]
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Correlation Threshold',
                                           'params': {"threshold": 0.85}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
[u'Class', u'Subject', u'gen_0001_accelx_2', u'gen_0001_accelx_4',
 u'gen_0002_accely_0']
-
Custom Feature Selection
This feature selection method keeps only the features you specify. It takes a list of strings, where each value is the name of a feature to keep.
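In pandas terms this amounts to a column filter. A minimal standalone sketch (the function name and signature are illustrative, not the Piccolo AI source):

```python
import pandas as pd

def custom_feature_selection(input_data, keep, passthrough_columns):
    """Keep the named feature columns plus the passthrough columns."""
    wanted = set(keep) | set(passthrough_columns)
    selected = [c for c in input_data.columns if c in wanted]
    unselected = [c for c in input_data.columns if c not in wanted]
    return input_data[selected], unselected

df = pd.DataFrame({"Class": ["A", "B"],
                   "gen_0001_accelx_0": [1.0, 2.0],
                   "gen_0001_accelx_1": [3.0, 4.0]})
selected, unselected = custom_feature_selection(
    df, keep=["gen_0001_accelx_0"], passthrough_columns=["Class"])
```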
- Parameters
input_data – DataFrame, input data
custom_feature_selection – list, feature names to keep
passthrough_columns – list, columns to pass through without modification
**kwargs – additional keyword arguments
- Returns
- tuple containing:
selected_features: DataFrame, which includes the selected features and the passthrough columns.
unselected_features: list, unselected features.
- Return type
tuple
-
Custom Feature Selection By Index
This is a feature selection method which allows custom feature selection. It takes a dictionary where the key is the feature generator number and the value is a list of the feature indices to keep for that generator. All feature generators that are not added as keys in the dictionary will be dropped.
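The index-based filter can be sketched standalone. The `gen_XXXX_<source>_<index>` column naming convention is inferred from the examples in this document, and the function name is hypothetical:

```python
import re
import pandas as pd

def custom_feature_selection_by_index(input_data, selection, passthrough_columns):
    """Keep, for each generator number in `selection`, only the listed
    feature indices; every other generated feature column is dropped."""
    pattern = re.compile(r"gen_(\d+)_.+_(\d+)$")   # assumed naming convention
    keep = list(passthrough_columns)
    for col in input_data.columns:
        m = pattern.match(col)
        if m and int(m.group(2)) in selection.get(int(m.group(1)), []):
            keep.append(col)
    unselected = [c for c in input_data.columns if c not in keep]
    return input_data[keep], unselected

df = pd.DataFrame({"Class": ["A", "B"],
                   "gen_0001_accelx_0": [1.0, 2.0],
                   "gen_0001_accelx_1": [3.0, 4.0],
                   "gen_0002_accely_0": [5.0, 6.0]})
selected, unselected = custom_feature_selection_by_index(
    df, selection={1: [0], 2: [0]}, passthrough_columns=["Class"])
```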
- Parameters
input_data (DataFrame) – Input data to perform feature selection on.
custom_feature_selection (dict) – A dictionary of feature generators and their corresponding features to keep.
passthrough_columns (list) – A list of columns to include in the output DataFrame in addition to the selected features.
**kwargs – Additional keyword arguments to pass to the function.
- Returns
A tuple containing the selected features and the passthrough columns as a DataFrame, and a list of unselected features.
- Return type
Tuple[DataFrame, list]
Example
client.pipeline.add_feature_selector([{'name': 'Custom Feature Selection By Index',
                                       'params': {"custom_feature_selection":
                                           {1: [0], 2: [0], 3: [1, 2, 3, 4]}}}])
# selects feature 0 from feature generators 1 and 2, and
# features 1, 2, 3, 4 from feature generator 3
-
Feature Selector By Family
This is an unsupervised method of feature selection. The goal is to randomly select features from the specified feature generators until the maximum number of generators given as input is reached. If no specific generator is provided, all feature generators have an equal chance to be selected.
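The quota logic can be sketched independently of the pipeline. This is a hypothetical helper: `families` maps generator ids to generator names, which is an assumption about how the selector tracks them, not the Piccolo AI source.

```python
import random

def select_by_family(families, quotas, max_number_generators=5, random_seed=None):
    """families: dict mapping generator id -> generator name.
    quotas: list of (set_of_names, number) pairs, applied in order."""
    rng = random.Random(random_seed)
    remaining = list(families)
    chosen = []
    for names, number in quotas:
        pool = [g for g in remaining if families[g] in names]
        picks = rng.sample(pool, min(number, len(pool)))
        chosen.extend(picks)
        remaining = [g for g in remaining if g not in picks]
    extra = max_number_generators - len(chosen)   # fill the rest at random
    if extra > 0:
        chosen.extend(rng.sample(remaining, min(extra, len(remaining))))
    return chosen[:max_number_generators]

families = {1: "MFCC", 2: "Downsample", 3: "Downsample", 4: "Downsample",
            5: "MFCC", 6: "Power Spectrum", 7: "Absolute Area", 8: "Absolute Area"}
picked = select_by_family(families,
                          quotas=[({"Downsample"}, 2),
                                  ({"MFCC", "Power Spectrum"}, 2)],
                          max_number_generators=5, random_seed=1)
```

Two generators come from the "Downsample" quota, two from the "MFCC"/"Power Spectrum" quota, and the fifth is drawn from whatever remains.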
- Parameters
input_data (DataFrame) – Input data to perform feature selection on.
generators (List[Dict[str, Union[str, int]]]) – A list of feature generators to select from. Each member of this list is a dictionary of this form: {“generator_names”: [(str)] or (str), “number”: (int)}, where “generator_names” lists the name(s) of the generator(s) to select from, and “number” is the desired number of generators.
max_number_generators (int) – [Default 5] The maximum number of feature generators to keep.
random_seed (int) – [Optional] Random initialization seed.
passthrough_columns (List[str]) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.
**kwargs – Additional keyword arguments to pass.
- Returns
A tuple containing a DataFrame that includes the selected features and the passthrough columns and a list containing the unselected feature columns.
- Return type
Tuple[DataFrame, List[str]]
Examples
>>> client.project = <project_name>
>>> client.pipeline = <pipeline_name>
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.reset()
>>> client.pipeline.set_input_data('test_data',
                                   data_columns=['accelx', 'accely', 'accelz', 'gyrox', 'gyroy', 'gyroz'],
                                   group_columns=['Subject', 'Class', 'Rep'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([
        {"name": "MFCC",
         "params": {"columns": ["accelx"], "sample_rate": 10, "cepstra_count": 3}},
        {"name": "Downsample",
         "params": {"columns": ["accelx", "accely", "accelz"], "new_length": 3}},
        {"name": "MFCC",
         "params": {"columns": ["accely"], "sample_rate": 10, "cepstra_count": 4}},
        {"name": "Power Spectrum",
         "params": {"columns": ["accelx"], "number_of_bins": 5, "window_type": "hanning"}},
        {"name": "Absolute Area",
         "params": {"sample_rate": 10, "columns": ["accelx", "accelz"]}},
    ])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
# List of all features before the feature selection algorithm
Out:
['gen_0001_accelxmfcc_000000', 'gen_0001_accelxmfcc_000001', 'gen_0001_accelxmfcc_000002',
 'Class', 'Rep', 'Subject',
 'gen_0002_accelxDownsample_0', 'gen_0002_accelxDownsample_1', 'gen_0002_accelxDownsample_2',
 'gen_0003_accelyDownsample_0', 'gen_0003_accelyDownsample_1', 'gen_0003_accelyDownsample_2',
 'gen_0004_accelzDownsample_0', 'gen_0004_accelzDownsample_1', 'gen_0004_accelzDownsample_2',
 'gen_0005_accelymfcc_000000', 'gen_0005_accelymfcc_000001', 'gen_0005_accelymfcc_000002',
 'gen_0005_accelymfcc_000003',
 'gen_0006_accelxPowerSpec_000000', 'gen_0006_accelxPowerSpec_000001',
 'gen_0006_accelxPowerSpec_000002', 'gen_0006_accelxPowerSpec_000003',
 'gen_0006_accelxPowerSpec_000004',
 'gen_0007_accelxAbsArea', 'gen_0008_accelzAbsArea']
Here, the feature selector picks up to 5 feature generators, of which 2 could be from the “Downsample” generator, 2 could be either “MFCC” or “Power Spectrum” features, and the rest could be any other feature not listed in any of the “generator_names” lists.
>>> client.pipeline.add_feature_selector([{"name": "Feature Selector By Family",
                                           "params": {"max_number_generators": 5,
                                                      "seed": 1,
                                                      "generators": [
                                                          {"generator_names": "Downsample",
                                                           "number": 2},
                                                          {"generator_names": ["MFCC", "Power Spectrum"],
                                                           "number": 2}]}}])
>>> results, stats = client.pipeline.execute()
>>> results.columns.tolist()
# List of all features after the feature selection algorithm
# Because of the random nature of this selector function, the output might be
# slightly different each time based on the chosen seed
Out:
['Class', 'Rep', 'Subject',
 'gen_0003_accelyDownsample_0', 'gen_0003_accelyDownsample_1', 'gen_0003_accelyDownsample_2',
 'gen_0002_accelxDownsample_0', 'gen_0002_accelxDownsample_1', 'gen_0002_accelxDownsample_2',
 'gen_0005_accelymfcc_000000', 'gen_0005_accelymfcc_000001', 'gen_0005_accelymfcc_000002',
 'gen_0005_accelymfcc_000003',
 'gen_0001_accelxmfcc_000000', 'gen_0001_accelxmfcc_000001', 'gen_0001_accelxmfcc_000002',
 'gen_0008_accelzAbsArea']
-
Information Gain
This is a supervised feature selection algorithm that selects features based on Information Gain (one class vs other classes approaches).
First, it calculates Information Gain (IG) for each class separately against all features, then sorts the features based on their IG scores and the differences in standard deviation and mean. A feature with a higher IG is better at differentiating the class from the others. In the end, each class has its own selected feature list.
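For a single continuous feature, a one-vs-rest information gain can be computed by thresholding the feature. Thresholding at the median is an illustrative simplification here, not necessarily the split the real selector uses:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """IG of splitting `labels` on `feature` thresholded at its median."""
    mask = feature > np.median(feature)
    gain = entropy(labels)
    for side in (mask, ~mask):
        if side.any():
            gain -= side.mean() * entropy(labels[side])
    return gain

# one-vs-rest: does this feature separate 'Running' from everything else?
labels = np.array(["Running", "Running", "Walking", "Crawling"])
feature = np.array([0.9, 0.8, 0.1, 0.2])
ig = information_gain(feature, labels == "Running")
```

A perfectly separating feature recovers the full label entropy (here 1 bit).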
- Parameters
input_data (DataFrame) – Input data.
label_column (str) – The label column in the input_data.
feature_number (int) – [Default 2] Number of features to select for each class.
passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.
**kwargs – Additional keyword arguments.
- Returns
A tuple containing the selected features and the passthrough columns as a DataFrame, and a list of unselected features.
- Return type
Tuple[DataFrame, list]
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Information Gain',
                                           'params': {"feature_number": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
      Class Subject  gen_0001_accelx_0  gen_0001_accelx_1  gen_0001_accelx_2
0  Crawling     s01         347.881775         372.258789         208.341858
1  Crawling     s02         347.713013         224.231735          91.971481
2  Crawling     s03         545.664429         503.276642         200.263031
3   Running     s01         -21.588972         -23.511278         -16.322056
4   Running     s02         422.405182         453.950897         431.893585
5   Running     s03         350.105774         366.373627         360.777466
6   Walking     s01         -10.362945         -46.967007           0.492386
7   Walking     s02         375.751343         413.259460         374.443237
8   Walking     s03         353.421906         317.618164         283.627502
-
Recursive Feature Elimination
This is a supervised method of feature selection. The goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator (method: ‘Log R’ or ‘Linear SVC’) is trained on the initial set of features and a weight is assigned to each one of them. Then, the features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features, number_of_features, is eventually reached.
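The same idea is available directly in scikit-learn, which the Notes for this selector already point toward. A standalone sketch (not the pipeline step itself) using `RFE` with a Linear SVC estimator configured with C=0.01, an l1 penalty, and dual=False:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

# synthetic stand-in for a generated feature matrix
X, y = make_classification(n_samples=200, n_features=15, n_informative=4,
                           random_state=0)
estimator = LinearSVC(C=0.01, penalty="l1", dual=False)
# prune one feature per iteration until 3 remain
selector = RFE(estimator, n_features_to_select=3, step=1).fit(X, y)
kept = [i for i, keep in enumerate(selector.support_) if keep]
```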
- Parameters
input_data (DataFrame) – Input data to perform feature selection on.
label_column (str) – Name of the column containing the labels.
method (str) – The type of selection method. Two options are available: 1) Log R and 2) Linear SVC. For Log R, the inverse of regularization strength C defaults to 1.0 and the penalty defaults to l1. For Linear SVC, C defaults to 0.01, the penalty defaults to l1, and dual is set to False.
number_of_features (int) – The number of features you would like the selector to reduce to.
passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.
- Returns
- A tuple containing:
DataFrame: A DataFrame that includes the selected features and the passthrough columns.
list: A list of unselected features.
- Return type
tuple
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Recursive Feature Elimination',
                                           'params': {"method": "Log R",
                                                      "number_of_features": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
      Class Subject  gen_0001_accelx_2  gen_0003_accelz_1  gen_0003_accelz_4
0  Crawling     s01         208.341858        3881.038330        3900.734863
1  Crawling     s02          91.971481        3821.513428        3896.376221
2  Crawling     s03         200.263031        3896.349121        3889.297119
3   Running     s01         -16.322056         641.164185         605.192993
4   Running     s02         431.893585         870.608459         846.671204
5   Running     s03         360.777466         263.184052         234.177200
6   Walking     s01           0.492386         559.139587         558.538086
7   Walking     s02         374.443237         658.902710         669.394592
8   Walking     s03         283.627502         -87.612816         -98.735649
Notes
For more information on the defaults of Log R, please see: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
For Linear SVC, please see: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html#sklearn.svm.LinearSVC
-
Tree-based Selection
Select features using a supervised tree-based algorithm. This class implements a meta-estimator that fits a number of randomized decision trees (a.k.a. extra-trees) on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control overfitting. The default number of trees in the forest is 250, and random_state is set to 0. Please see the notes for more information.
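A standalone sketch of the same idea with scikit-learn's ExtraTreesClassifier, using the documented defaults of 250 trees and random_state=0. The ranking-by-importance step is our illustration, not the Piccolo AI source:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# synthetic stand-in for a generated feature matrix
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
# keep the columns with the highest averaged impurity importances
number_of_features = 4
top = np.argsort(forest.feature_importances_)[::-1][:number_of_features]
X_selected = X[:, np.sort(top)]
```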
- Parameters
input_data (DataFrame) – Input data.
label_column (str) – Label column of input data.
number_of_features (int) – The number of features you would like the selector to reduce to.
passthrough_columns (list, optional) – A list of columns to include in the output dataframe in addition to the selected features. Defaults to None.
- Returns
- A tuple containing:
selected_features (DataFrame): DataFrame which includes selected features and the passthrough columns for each class.
unselected_features (list): A list of unselected features.
- Return type
tuple
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Tree-based Selection',
                                           'params': {"number_of_features": 4}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
      Class Subject  gen_0002_accely_0  gen_0002_accely_1  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
0  Crawling     s01           1.669203           1.559860           1.526786           1.414068           1.413625
1  Crawling     s02           1.486925           1.418474           1.377726           1.414068           1.413625
2  Crawling     s03           1.035519           1.252789           1.332684           1.328587           1.324469
3   Running     s01          -0.700995          -0.678448          -0.706631          -0.674960          -0.713493
4   Running     s02          -0.659030          -0.709012          -0.678594          -0.688869          -0.700753
5   Running     s03          -0.712790          -0.713026          -0.740177          -0.728651          -0.733076
6   Walking     s01          -0.701450          -0.714677          -0.692671          -0.716556          -0.696635
7   Walking     s02          -0.698335          -0.689857          -0.696807          -0.702233          -0.682212
8   Walking     s03          -0.719046          -0.726102          -0.722315          -0.727506          -0.712461

   gen_0003_accelz_0  gen_0003_accelz_1  gen_0003_accelz_2  gen_0003_accelz_3  gen_0003_accelz_4
0           1.360500           1.368615           1.413445           1.426949           1.400083
1           1.360500           1.368615           1.388456           1.408576           1.397417
2           1.410274           1.414961           1.384032           1.345107           1.393088
3          -0.572269          -0.600986          -0.582678          -0.560071          -0.615270
4          -0.494247          -0.458891          -0.471897          -0.475010          -0.467597
5          -0.836257          -0.835071          -0.868028          -0.855081          -0.842161
6          -0.652326          -0.651784          -0.640956          -0.655958          -0.643802
7          -0.551928          -0.590001          -0.570077          -0.558563          -0.576008
8          -1.077342          -1.052320          -1.052297          -1.075949          -1.045750
Notes
For more information, please see: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
-
t-Test Feature Selector
This is a supervised feature selection algorithm that selects features based on a two-tailed t-test. It computes the p-values and selects the top-performing number of features for each class as defined by feature_number. It returns a reduced combined list of all the selected features.
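The per-class ranking can be sketched with scipy.stats. The helper name is hypothetical, and the real selector's tie handling is not shown:

```python
import numpy as np
from scipy import stats

def ttest_select(X, y, feature_number=2):
    """Per class, rank features by two-tailed t-test p-value (class vs. rest)
    and keep the top `feature_number`; return the union over all classes."""
    selected = set()
    for cls in np.unique(y):
        in_cls, rest = X[y == cls], X[y != cls]
        pvalues = [stats.ttest_ind(in_cls[:, j], rest[:, j]).pvalue
                   for j in range(X.shape[1])]
        selected.update(int(j) for j in np.argsort(pvalues)[:feature_number])
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5))
y = np.repeat([0, 1, 2], 30)
X[y == 0, 0] += 5.0   # feature 0 separates class 0
X[y == 1, 1] += 5.0   # feature 1 separates class 1
X[y == 2, 2] += 5.0   # feature 2 separates class 2
chosen = ttest_select(X, y, feature_number=1)
```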
- Parameters
input_data (DataFrame) – Input data
label_column (str) – Column containing class labels
feature_number (int) – Number of features to select for each class
passthrough_columns (Optional[List[str]]) – List of columns that the selector should ignore
- Returns
- A tuple containing:
DataFrame: DataFrame which includes selected features and the passthrough columns.
List[str]: List of unselected features.
- Return type
Tuple[DataFrame, List[str]]
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name': 'ttest Feature Selector',
                                           'params': {"feature_number": 2}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
[u'Class', u'Subject', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_4', u'gen_0003_accelz_1', u'gen_0003_accelz_4']
-
Univariate Selection
Univariate feature selection using ANOVA (Analysis of Variance) is a statistical method used to identify the most relevant features in a dataset by analyzing the variance between different groups. It is a supervised method of feature selection, which means that it requires labeled data.
The ANOVA test calculates the F-value for each feature by comparing the variance within each class to the variance between classes. The higher the F-value, the more significant the feature is in differentiating between the classes. Univariate feature selection selects the top k features with the highest F-values, where k is a specified parameter.
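A standalone scikit-learn equivalent using SelectKBest with the f_classif (ANOVA F-value) scoring function, as a sketch rather than the pipeline step itself:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# synthetic stand-in for a generated feature matrix
X, y = make_classification(n_samples=200, n_features=12, n_informative=3,
                           random_state=0)
# keep the k features with the highest ANOVA F-values
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
X_new = selector.transform(X)
```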
- Parameters
input_data (DataFrame) – Input data
label_column (str) – Label column name
number_of_features (int) – The number of features you would like the selector to reduce to.
passthrough_columns (Optional[list], optional) – List of columns to pass through. Defaults to None.
- Returns
- Tuple containing:
DataFrame: DataFrame which includes selected features and the passthrough columns.
list: List of unselected features.
- Return type
tuple
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.reset(delete_cache=False)
>>> client.pipeline.set_input_data('test_data', results, force=True,
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_selector([{'name': 'Univariate Selection',
                                           'params': {"number_of_features": 3}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
      Class Subject  gen_0002_accely_2  gen_0002_accely_3  gen_0002_accely_4
0  Crawling     s01           1.526786           1.496120           1.500535
1  Crawling     s02           1.377726           1.414068           1.413625
2  Crawling     s03           1.332684           1.328587           1.324469
3   Running     s01          -0.706631          -0.674960          -0.713493
4   Running     s02          -0.678594          -0.688869          -0.700753
5   Running     s03          -0.740177          -0.728651          -0.733076
6   Walking     s01          -0.692671          -0.716556          -0.696635
7   Walking     s02          -0.696807          -0.702233          -0.682212
8   Walking     s03          -0.722315          -0.727506          -0.712461
Notes
Please see the following for more information:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif
-
Variance Threshold
Feature selector that removes all low-variance features.
This step is an unsupervised feature selection algorithm that looks only at the input features (X), not the labels or outputs (y). It selects the features whose variance exceeds the given threshold. It should be applied prior to standardization.
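A standalone sketch with scikit-learn's VarianceThreshold, which implements the same rule (illustration only, not the pipeline step):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0.0, 2.0, 0.10],
              [0.0, 1.0, 0.12],
              [0.0, 3.0, 0.10],
              [0.0, 4.0, 0.11]])
# columns 0 and 2 are (near-)constant, so they fall below the threshold
selector = VarianceThreshold(threshold=0.05)
X_new = selector.fit_transform(X)
```

Only the middle column, whose variance exceeds 0.05, survives.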
- Parameters
input_data (DataFrame) – Input data.
threshold (float) – [Default 0.01] Minimum variance threshold under which features should be eliminated.
passthrough_columns (list) – [Optional] A list of columns to include in the output DataFrame in addition to the selected features.
- Returns
- tuple containing:
selected_features (DataFrame): DataFrame which includes the selected features and the passthrough columns.
unselected_features (list): unselected features.
- Return type
tuple
Examples
>>> client.pipeline.reset()
>>> df = client.datasets.load_activity_raw_toy()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                                   data_columns=['accelx', 'accely', 'accelz'],
                                   group_columns=['Subject', 'Class'],
                                   label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
                                            'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                       "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
# List of all features before the feature selection algorithm
>>> results.columns.tolist()
Out:
[u'Class', u'Subject', u'gen_0001_accelx_0', u'gen_0001_accelx_1',
 u'gen_0001_accelx_2', u'gen_0001_accelx_3', u'gen_0001_accelx_4',
 u'gen_0002_accely_0', u'gen_0002_accely_1', u'gen_0002_accely_2',
 u'gen_0002_accely_3', u'gen_0002_accely_4', u'gen_0003_accelz_0',
 u'gen_0003_accelz_1', u'gen_0003_accelz_2', u'gen_0003_accelz_3',
 u'gen_0003_accelz_4']
>>> client.pipeline.add_feature_selector([{'name':'Variance Threshold', 'params':{"threshold": 4513492.05}}])
>>> results, stats = client.pipeline.execute()
>>> print(results)
Out:
[u'Class', u'Subject', u'gen_0002_accely_0', u'gen_0002_accely_1',
 u'gen_0002_accely_2', u'gen_0002_accely_3', u'gen_0002_accely_4']