Samplers

Samplers remove outliers and noisy data before classification, improving the robustness of the model.

Isolation Forest Filtering

The Isolation Forest algorithm returns an anomaly score for each sample. It isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature.
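The pipeline example below requires the client SDK; as a standalone sketch of the underlying technique, here is a minimal illustration using scikit-learn's IsolationForest on toy data (the pipeline wraps a comparable implementation, but the exact parameters may differ):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = rng.normal(0.0, 1.0, size=(100, 3))
X[:5] += 8.0  # plant five obvious outliers

# contamination plays roughly the role of outliers_fraction
forest = IsolationForest(contamination=0.05, random_state=42)
labels = forest.fit_predict(X)  # -1 for outliers, 1 for inliers
X_filtered = X[labels == 1]
```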

Parameters
  • input_data – DataFrame, the feature set that results from a generator_set or feature_selector.

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes to filter. If not defined, all classes are used.

  • feature_columns – List<String>, list of features to use. If not defined, all features are used.

  • outliers_fraction (float) – The expected ratio of outliers in the data.

  • assign_unknown (bool) – Assign the unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Isolation Forest Filtering",
                   params={ "outliers_fraction": 0.01})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Local Outlier Factor Filtering

The local outlier factor (LOF) measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.

The LOF algorithm is an unsupervised outlier detection method which computes the local density deviation of a given data point with respect to its neighbors. It considers as outliers the samples that have a substantially lower density than their neighbors.
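As a standalone sketch of the density-based idea, here is a toy illustration using scikit-learn's LocalOutlierFactor (an assumption for illustration; the pipeline's wrapped implementation may differ):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = rng.normal(0.0, 1.0, size=(100, 3))
X[:3] += 10.0  # three points far from the dense cluster

# n_neighbors / contamination mirror number_of_neighbors / outliers_fraction
lof = LocalOutlierFactor(n_neighbors=5, contamination=0.05)
labels = lof.fit_predict(X)  # -1 marks low-density outliers
X_filtered = X[labels == 1]
```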

Parameters
  • input_data – DataFrame, the feature set that results from a generator_set or feature_selector.

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes to filter. If not defined, all classes are used.

  • feature_columns – List<String>, list of features to use. If not defined, all features are used.

  • outliers_fraction (float) – The expected ratio of outliers in the data.

  • number_of_neighbors (int) – Number of neighbors to use for the density estimate.

  • norm (string) – Metric that will be used for the distance computation.

  • assign_unknown (bool) – Assign the unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Local Outlier Factor Filtering",
                   params={"outliers_fraction": 0.05,
                            "number_of_neighbors": 5})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
One Class SVM filtering

Unsupervised Outlier Detection. Estimate the support of a high-dimensional distribution. The implementation is based on libsvm.
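A minimal standalone sketch of this technique, using scikit-learn's OneClassSVM (assumed for illustration; the pipeline wraps a comparable libsvm-based implementation):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(1)
X = rng.normal(0.0, 1.0, size=(100, 3))

# nu plays roughly the role of outliers_fraction; kernel as in the parameters
svm = OneClassSVM(kernel="rbf", nu=0.05)
labels = svm.fit(X).predict(X)  # -1 for points outside the learned support
X_filtered = X[labels == 1]
```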

Parameters
  • input_data – DataFrame, the feature set that results from a generator_set or feature_selector.

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes to filter. If not defined, all classes are used.

  • feature_columns – List<String>, list of features to use. If not defined, all features are used.

  • outliers_fraction (float) – The expected ratio of outliers in the data.

  • kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', or 'sigmoid'.

  • assign_unknown (bool) – Assign the unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("One Class SVM filtering",
                   params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Robust Covariance Filtering

Unsupervised Outlier Detection. An object for detecting outliers in a Gaussian distributed dataset.
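As a standalone sketch of robust-covariance outlier detection, here is a toy illustration using scikit-learn's EllipticEnvelope (an assumption for illustration; the pipeline's wrapped implementation may differ):

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.RandomState(2)
X = rng.normal(0.0, 1.0, size=(200, 3))
X[:4] += 6.0  # four points outside the Gaussian bulk

# A robust covariance estimate flags points with large Mahalanobis distance
env = EllipticEnvelope(contamination=0.05, random_state=2)
labels = env.fit_predict(X)  # -1 for outliers, 1 for inliers
X_filtered = X[labels == 1]
```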

Parameters
  • input_data – DataFrame, the feature set that results from a generator_set or feature_selector.

  • label_column (str) – Label column name.

  • filtering_label – List<String>, list of classes to filter. If not defined, all classes are used.

  • feature_columns – List<String>, list of features to use. If not defined, all features are used.

  • outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1]. Defaults to 0.5.

  • assign_unknown (bool) – Assign the unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Robust Covariance Filtering",
                   params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Sample By Metadata

Select rows from the input DataFrame based on a metadata column. Rows that have a metadata value that is in the values list will be returned.

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • metadata_name (str) – Name of the metadata column to use for sampling.

  • metadata_values (list[str]) – List of values of the named column for which to select rows of the input data.

Returns

The input_data DataFrame containing only the rows for which the metadata value is in the accepted list.

Return type

DataFrame
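This selection is a straightforward membership filter; a minimal pandas sketch on hypothetical toy data (the column names here are illustrative, not the pipeline's):

```python
import pandas as pd

# Hypothetical feature table with a metadata column
df = pd.DataFrame({
    "Subject": ["s01", "s01", "s02", "s03"],
    "feat0": [0.1, 0.4, 0.2, 0.9],
})

metadata_name, metadata_values = "Subject", ["s01", "s03"]
# Keep only rows whose metadata value is in the accepted list
selected = df[df[metadata_name].isin(metadata_values)]
```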

Combine Labels

Select rows from the input DataFrame based on a metadata column. Rows that have a label value that is in the combined label list will be returned.

Syntax:
combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'],
                  'group3': ['group5']}

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • label_column (str) – Label column name.

  • combine_labels (dict) – Map of label columns to combine.

Returns

The input_data containing only the rows for which the label value is in the combined list.

Return type

DataFrame
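A minimal pandas sketch of this relabeling on hypothetical toy data (the group and label names are illustrative; the pipeline's own implementation may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Walking", "Running", "Sitting", "Standing", "Crawling"],
    "feat0": [1, 2, 3, 4, 5],
})

combine_labels = {"Moving": ["Walking", "Running"],
                  "Still": ["Sitting", "Standing"]}

# Invert the map, keep only covered rows, and relabel them
label_map = {lbl: grp for grp, lbls in combine_labels.items() for lbl in lbls}
combined = df[df["Class"].isin(label_map)].copy()
combined["Class"] = combined["Class"].map(label_map)
```

Note that "Crawling" is dropped because it appears in no group, matching the behavior described above.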

Zscore Filter

A z-score filter standardizes each feature in a vector to have a mean of zero and a standard deviation of one, then removes feature vectors containing features whose z-scores fall outside a cutoff threshold. The z-score, or standard score, measures how many standard deviations a data point is from the mean of the distribution.
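A minimal standalone sketch of the filtering step on toy data (an illustration only; the drop convention around `feature_threshold` is an assumption and the pipeline's exact rule may differ):

```python
import pandas as pd

# 19 well-behaved feature vectors plus one extreme outlier (toy data)
df = pd.DataFrame({"f0": [1.0] * 19 + [50.0],
                   "f1": [2.0] * 20})

zscore_cutoff, feature_threshold = 3, 1
z = ((df - df.mean()) / df.std(ddof=0)).fillna(0.0)  # constant columns -> z of 0
# Here a vector is dropped once `feature_threshold` features exceed the
# cutoff; the pipeline's exact tie-breaking convention may differ.
outlier_counts = (z.abs() > zscore_cutoff).sum(axis=1)
filtered = df[outlier_counts < feature_threshold]
```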

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • label_column (str) – Label column name.

  • zscore_cutoff (int) – Z-score magnitude above which a feature is considered an outlier.

  • feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector.

  • feature_columns (list) – List of features to filter by. If None, filters all.

  • assign_unknown (bool) – Assign unknown label to outliers.

Returns

The filtered DataFrame with outlier feature vectors removed.

Return type

DataFrame

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Zscore Filter",
                   params={"zscore_cutoff": 3, "feature_threshold": 1})
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
Sigma Outliers Filtering

A sigma outlier filter algorithm is a technique used to identify and remove outliers from feature vectors based on their deviation from the mean. In this algorithm, an outlier is defined as a data point that falls outside a certain number of standard deviations (sigma) from the mean of the distribution.
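A minimal standalone sketch of this idea on toy data (an illustration only; the pipeline's own implementation may differ in how it applies the threshold per class):

```python
import numpy as np

rng = np.random.RandomState(3)
features = rng.normal(0.0, 1.0, size=(200, 3))
features[0] = 10.0  # one vector far outside the distribution

sigma_threshold = 3.0
mu = features.mean(axis=0)
sigma = features.std(axis=0)
# Keep vectors whose every feature lies within sigma_threshold std devs
mask = (np.abs(features - mu) <= sigma_threshold * sigma).all(axis=1)
filtered = features[mask]
```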

Parameters
  • input_data (DataFrame) – The feature set that is a result of either a generator_set or feature_selector.

  • label_column (str) – The label column name.

  • filtering_label (str) – List of classes that will be filtered. If it is not defined, all classes will be filtered.

  • feature_columns (list of str) – List of features. If it is not defined, it uses all features.

  • sigma_threshold (float) – The number of standard deviations from the mean beyond which a feature is treated as an outlier.

  • assign_unknown (bool) – Assigns an unknown label to outliers.

Returns

The filtered DataFrame containing features without outliers and noise.

Return type

DataFrame

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Sigma Outliers Filtering",
                   params={ "sigma_threshold": 1.0 })
>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]

Sampling Techniques for Handling Imbalanced Data


Undersample Majority Classes

Create a balanced data set by undersampling the majority classes using random sampling without replacement.
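A minimal pandas sketch of this balancing on toy data (an illustration only; the pipeline's own sampler may differ):

```python
import pandas as pd

df = pd.DataFrame({"Class": ["A"] * 10 + ["B"] * 4 + ["C"] * 7,
                   "feat0": range(21)})

# Sample every class down to the smallest class size, without replacement
min_size = df["Class"].value_counts().min()
balanced = df.groupby("Class").sample(n=min_size, random_state=42)
```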

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • label_column (str) – The label column to balance against.

  • target_class_size (int) – The number of samples to keep per class. If None, the size of the smallest class is used; a value larger than the smallest class size is capped to that size (default: None).

  • seed (int) – Random seed to use for sampling.

  • maximum_samples_size_per_class (int) – Maximum number of samples to keep per class.

Returns

DataFrame containing undersampled classes

Sampling Techniques for Augmenting Data Sets

Pad Segment

Pad a segment so that its length equals a specified sequence length.
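A minimal numpy sketch of this padding on a toy segment (the function name and noise scheme here are illustrative assumptions, not the pipeline's implementation):

```python
import numpy as np

def pad_segment(segment, sequence_length, noise_level=0.0, seed=0):
    """Pad a (samples, channels) segment to sequence_length samples,
    filling with zeros plus optional uniform noise."""
    pad_len = sequence_length - segment.shape[0]
    if pad_len <= 0:
        return segment
    rng = np.random.RandomState(seed)
    pad = rng.uniform(-noise_level, noise_level,
                      size=(pad_len, segment.shape[1]))
    return np.vstack([segment, pad])

segment = np.ones((3, 2))           # 3 samples, 2 channels
padded = pad_segment(segment, 5)    # padded out to 5 samples
```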

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • group_columns (str) – The columns to group by (should include SegmentID).

  • sequence_length (int) – The sequence length to pad each segment to.

  • noise_level (int) – Maximum amount of noise to add to the padded values.

Returns

DataFrame containing padded segments

Resampling by Majority Vote

For each group, take a majority vote on the specified metadata_name column and set the value of that metadata column to the most frequently occurring value.
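A minimal pandas sketch of this per-group majority vote on toy data (column names here are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    "SegmentID": [0, 0, 0, 1, 1],
    "Label":     ["A", "B", "B", "C", "C"],
})

# Replace each group's label with its most frequent value
df["Label"] = (df.groupby("SegmentID")["Label"]
                 .transform(lambda s: s.mode().iloc[0]))
```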

Parameters
  • input_data (DataFrame) – Input DataFrame.

  • group_columns (list) – Columns to group over.

  • metadata_name (str) – Name of the metadata column to use for sampling.

Returns

The modified input_data DataFrame with the metadata_name column replaced by each group's majority value.

Return type

DataFrame