Samplers

Samplers remove outliers and noisy data before classification, which helps make the resulting model more robust.

Isolation Forest Filtering

Returns the anomaly score of each sample using the IsolationForest algorithm. The Isolation Forest isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature.

Parameters

input_data – DataFrame, the feature set produced by generator_set or feature_selector.
label_column (str) – Label column name.
filtering_label – List<String>, list of classes to filter. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Isolation Forest Filtering",
                   params={ "outliers_fraction": 0.01})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
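
The isolation idea described above can be illustrated outside the pipeline. The following is a minimal standard-library sketch, not the implementation behind the transform; the helper names isolation_depth and mean_isolation_depth and the toy data are hypothetical. It counts how many random feature and split choices are needed to separate one point from the rest. Outliers tend to be isolated after fewer splits, which is what the anomaly score builds on.

```python
import random


def isolation_depth(point, data, rng, max_depth=50):
    """Count the random splits needed to isolate `point` within `data`."""
    current = data
    depth = 0
    while len(current) > 1 and depth < max_depth:
        # Randomly select a feature, then a split value between its min and max.
        feature = rng.randrange(len(point))
        values = [row[feature] for row in current]
        split = rng.uniform(min(values), max(values))
        # Keep only the side of the split that still contains `point`.
        if point[feature] < split:
            current = [row for row in current if row[feature] < split]
        else:
            current = [row for row in current if row[feature] >= split]
        depth += 1
    return depth


def mean_isolation_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Hypothetical data: a tight 4x4 grid of inliers plus one distant point.
inliers = [(x / 10, y / 10) for x in range(4) for y in range(4)]
outlier = (10.0, 10.0)
data = inliers + [outlier]

outlier_depth = mean_isolation_depth(outlier, data)
inlier_depth = mean_isolation_depth((0.1, 0.1), data)
print(outlier_depth, inlier_depth)  # the outlier needs fewer splits on average
```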

Local Outlier Factor Filtering

The local outlier factor (LOF) measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.

The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It treats samples that have a substantially lower density than their neighbors as outliers.

Parameters

input_data – DataFrame, the feature set produced by generator_set or feature_selector.
label_column (str) – Label column name.
filtering_label – List<String>, list of classes to filter. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
number_of_neighbors (int) – Number of neighbors for a vector.
norm (string) – Metric that will be used for the distance computation.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Local Outlier Factor Filtering",
                   params={"outliers_fraction": 0.05,
                            "number_of_neighbors": 5})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
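
To make the density comparison concrete, here is a small standard-library sketch of the textbook LOF computation. It is not the code behind the transform: the function name local_outlier_factors is hypothetical, k plays the role of number_of_neighbors, and Euclidean distance is assumed for the norm. A score near 1 means a point is about as dense as its neighbors; a score well above 1 marks a point in a much sparser region.

```python
import math


def local_outlier_factors(points, k):
    """Return the LOF score of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors, k_distance = [], []
    for i in range(n):
        # Nearest neighbors of point i, excluding the point itself.
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbors relative to the point's own density.
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Hypothetical data: a 3x3 grid of inliers plus one distant point (last entry).
points = [(float(x), float(y)) for x in range(3) for y in range(3)]
points.append((10.0, 10.0))

lof = local_outlier_factors(points, k=3)
print([round(score, 2) for score in lof])
```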

One Class SVM filtering

Unsupervised outlier detection that estimates the support of a high-dimensional distribution. The implementation is based on libsvm.

Parameters

input_data – DataFrame, the feature set produced by generator_set or feature_selector.
label_column (str) – Label column name.
filtering_label – List<String>, list of classes to filter. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("One Class SVM filtering",
                   params={"outliers_fraction": 0.05})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]

Robust Covariance Filtering

Unsupervised outlier detection for data sets that follow a Gaussian distribution.

Parameters

input_data – DataFrame, the feature set produced by generator_set or feature_selector.
label_column (str) – Label column name.
filtering_label – List<String>, list of classes to filter. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

DataFrame containing features without outliers and noise.

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Robust Covariance Filtering",
                   params={"outliers_fraction": 0.05})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]

Sample By Metadata

Select rows from the input DataFrame based on a metadata column. Rows that have a metadata value that is in the values list will be returned.

Parameters

input_data (DataFrame) – Input DataFrame.
metadata_name (str) – Name of the metadata column to use for sampling.
metadata_values (list[str]) – List of values of the named column for which to select rows of the input data.

Returns

The input_data DataFrame containing only the rows for which the metadata value is in the accepted list.

Return type

DataFrame
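
The selection rule can be sketched in plain Python. This is only an illustration under assumptions: a list of dicts stands in for the DataFrame, the function name sample_by_metadata is hypothetical, and the rows are made up.

```python
def sample_by_metadata(rows, metadata_name, metadata_values):
    """Keep only the rows whose metadata value is in `metadata_values`."""
    accepted = set(metadata_values)
    return [row for row in rows if row[metadata_name] in accepted]


# Hypothetical rows; a list of dicts stands in for the input DataFrame.
rows = [
    {"Subject": "s01", "Class": "Walking", "accelx": 12},
    {"Subject": "s02", "Class": "Running", "accelx": 48},
    {"Subject": "s03", "Class": "Walking", "accelx": 15},
]

selected = sample_by_metadata(rows, "Subject", ["s01", "s03"])
print([row["Subject"] for row in selected])
```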

Combine Labels

Select rows from the input DataFrame based on a metadata column. Rows that have a label value that is in the combined label list will be returned.

Syntax:

combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'], 'group3': ['group5']}

Parameters

input_data (DataFrame) – Input DataFrame.
label_column (str) – Label column name.
combine_labels (dict) – Map of label columns to combine.

Returns

The input_data containing only the rows for which the label value is in the combined list.

Return type

DataFrame
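
A rough sketch of how such a mapping could be applied is shown below. It assumes that each matching row is relabeled with its group name and that rows whose label appears in no group are dropped; the function name combine_labels_sketch, the list-of-dicts representation, and the sample labels are all hypothetical.

```python
def combine_labels_sketch(rows, label_column, combine_labels):
    """Relabel rows with their group name; drop rows whose label is in no group."""
    # Invert the mapping: original label -> combined group name.
    label_to_group = {
        label: group
        for group, labels in combine_labels.items()
        for label in labels
    }
    combined = []
    for row in rows:
        label = row[label_column]
        if label in label_to_group:
            # Copy the row so that the input is left unchanged.
            combined.append({**row, label_column: label_to_group[label]})
    return combined


# Hypothetical rows and grouping.
rows = [
    {"Class": "Walking", "accelx": 12},
    {"Class": "Jogging", "accelx": 40},
    {"Class": "Sitting", "accelx": 1},
    {"Class": "Unknown", "accelx": 7},
]
mapping = {"Moving": ["Walking", "Jogging"], "Still": ["Sitting"]}

combined = combine_labels_sketch(rows, "Class", mapping)
print([row["Class"] for row in combined])
```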

Zscore Filter

A z-score filter standardizes feature vectors by transforming each feature to have a mean of zero and a standard deviation of one. The z-score, or standard score, measures how many standard deviations a data point lies from the mean of the distribution. Feature vectors in which more than feature_threshold features have a z-score outside the cutoff threshold are removed.

Parameters

input_data (DataFrame) – Input DataFrame.
label_column (str) – Label column name.
zscore_cutoff (int) – Cutoff for filtering features above z score.
feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector.
feature_columns (list) – List of features to filter by. If None, filters all.
assign_unknown (bool) – Assign unknown label to outliers.

Returns

The filtered DataFrame containing features without outliers.

Return type

DataFrame

Examples

>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
                    data_columns = ['accelx', 'accely', 'accelz'],
                    group_columns = ['Subject','Class'],
                    label_column = 'Class')
>>> client.pipeline.add_feature_generator([{'name':'Downsample',
                             'params':{"columns": ['accelx','accely','accelz'],
                                       "new_length": 5 }}])
>>> results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5, 6, 7, 8]

>>> client.pipeline.add_transform("Zscore Filter",
                   params={"zscore_cutoff": 3, "feature_threshold": 1})

>>> results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
>>> results.index.tolist()
    Out:
    [0, 1, 2, 3, 4, 5]
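
The interaction between zscore_cutoff and feature_threshold can be sketched with the standard library. This is an illustration under assumptions, not the transform's code: the function name zscore_filter is hypothetical, a list of dicts stands in for the DataFrame, the population standard deviation is used, and the absolute z-score is compared with the cutoff.

```python
import statistics


def zscore_filter(rows, feature_columns, zscore_cutoff=3, feature_threshold=1):
    """Drop rows in which more than `feature_threshold` features exceed the cutoff."""
    stats = {}
    for column in feature_columns:
        values = [row[column] for row in rows]
        stats[column] = (statistics.fmean(values), statistics.pstdev(values))

    kept = []
    for row in rows:
        outside = 0
        for column in feature_columns:
            mean, std = stats[column]
            # A constant feature has no spread, so it cannot flag an outlier.
            z = (row[column] - mean) / std if std > 0 else 0.0
            if abs(z) > zscore_cutoff:
                outside += 1
        if outside <= feature_threshold:
            kept.append(row)
    return kept


# Hypothetical feature vectors: 20 ordinary rows and one extreme value in f1.
rows = [{"f1": 0.9 if i % 2 else 1.1, "f2": float(i % 5)} for i in range(20)]
rows.append({"f1": 50.0, "f2": 2.0})

strict = zscore_filter(rows, ["f1", "f2"], zscore_cutoff=3, feature_threshold=0)
lenient = zscore_filter(rows, ["f1", "f2"], zscore_cutoff=3, feature_threshold=1)
print(len(rows), len(strict), len(lenient))
```

With feature_threshold set to 0, a single feature outside the cutoff is enough to remove the row; with 1, the same row is tolerated because only one of its features is outside.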

Sigma Outliers Filtering

A sigma outlier filter algorithm is a technique used to identify and remove outliers from feature vectors based on their deviation from the mean. In this algorithm, an outlier is defined as a data point that falls outside a certain number of standard deviations (sigma) from the mean of the distribution.

Parameters

input_data (DataFrame) – The feature set that is a result of either a generator_set or feature_selector.
label_column (str) – The label column name.
filtering_label (list of str) – List of classes that will be filtered. If it is not defined, all classes will be filtered.
feature_columns (list of str) – List of features. If it is not defined, it uses all features.
sigma_threshold (float) – Number of standard deviations (sigma) from the mean beyond which a data point is treated as an outlier.
assign_unknown (bool) – Assigns an unknown label to outliers.

Returns

The filtered DataFrame containing features without outliers and noise.

Examples

client.pipeline.reset(delete_cache=False)
df = client.datasets.load_activity_raw()
client.pipeline.set_input_data('test_data', df, force=True,
                data_columns = ['accelx', 'accely', 'accelz'],
                group_columns = ['Subject','Class'],
                label_column = 'Class')
client.pipeline.add_feature_generator([{'name':'Downsample',
                        'params':{"columns": ['accelx','accely','accelz'],
                                "new_length": 5 }}])
results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5, 6, 7, 8]

client.pipeline.add_transform("Sigma Outliers Filtering",
            params={ "sigma_threshold": 1.0 })

results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5]

Return type

DataFrame

Sampling Techniques for Handling Imbalanced Data

Undersample Majority Classes

Create a balanced data set by undersampling the majority classes using random sampling without replacement.

Parameters

input_data (DataFrame) – input DataFrame
label_column (str) – The column to split against
target_class_size (int) – Target size for every class. If None, or if greater than the size of the smallest class, the size of the smallest class is used (default: None).
seed (int) – Random seed to use for sampling.
maximum_samples_size_per_class (int) – Maximum number of samples to use per class.

Returns

DataFrame containing undersampled classes
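
Random undersampling without replacement can be sketched as follows. This is a simplified illustration rather than the transform itself: the function name is hypothetical, a list of dicts stands in for the DataFrame, and maximum_samples_size_per_class is left out.

```python
import random
from collections import Counter, defaultdict


def undersample_majority_classes(rows, label_column, target_class_size=None, seed=0):
    """Randomly reduce every class to the same size, without replacement."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_column]].append(row)

    min_size = min(len(group) for group in by_class.values())
    # A requested size above the smallest class falls back to the smallest class.
    size = min_size if target_class_size is None else min(target_class_size, min_size)

    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        # random.sample draws without replacement.
        balanced.extend(rng.sample(by_class[label], size))
    return balanced


# Hypothetical imbalanced data set: 6 Walking, 3 Running, 10 Sitting.
rows = (
    [{"Class": "Walking", "id": i} for i in range(6)]
    + [{"Class": "Running", "id": i} for i in range(3)]
    + [{"Class": "Sitting", "id": i} for i in range(10)]
)

balanced = undersample_majority_classes(rows, "Class", seed=1)
counts = Counter(row["Class"] for row in balanced)
print(counts)
```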

Sampling Techniques for Augmenting Data Sets

Pad Segment

Pad a segment so that its length is equal to a specific sequence length

Parameters

input_data (DataFrame) – input DataFrame
group_columns (str) – The column to group by (should be SegmentID)
sequence_length (int) – The length that each segment is padded to.
noise_level (int) – max amount of noise to add to augmentation

Returns

DataFrame containing padded segments
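
The sketch below shows one way padding to a fixed length could work. How the transform chooses its padding values is not stated here, so the use of integer noise in the range [-noise_level, noise_level] is purely an assumption, as are the function name pad_segment, the choice to append padding at the end, and the dict of lists keyed by segment ID standing in for the grouped DataFrame.

```python
import random


def pad_segment(samples, sequence_length, noise_level=0, rng=None):
    """Append values to `samples` until it holds `sequence_length` entries."""
    rng = rng or random.Random(0)
    missing = max(0, sequence_length - len(samples))
    # ASSUMPTION: padding values are integer noise in [-noise_level, noise_level].
    padding = [rng.randint(-noise_level, noise_level) for _ in range(missing)]
    return list(samples) + padding


# Hypothetical segments keyed by SegmentID.
segments = {0: [5, 7, 6], 1: [1, 2, 3, 4, 5, 6]}
rng = random.Random(42)

padded = {
    segment_id: pad_segment(samples, sequence_length=6, noise_level=2, rng=rng)
    for segment_id, samples in segments.items()
}
print({segment_id: len(samples) for segment_id, samples in padded.items()})
```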

Resampling by Majority Vote

For each group, perform max pooling on the specified metadata_name column: every row in the group has that metadata column set to the group's most frequently occurring value.

Parameters

input_data (DataFrame) – Input DataFrame.
group_columns (list) – Columns to group over.
metadata_name (str) – Name of the metadata column to use for sampling.

Returns

The modified input_data DataFrame with metadata_name column being modified by max pooling.

Return type

DataFrame
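
A majority vote per group can be sketched with collections.Counter. This is an illustration only: the function name resample_by_majority_vote and the list-of-dicts representation are assumptions, and ties are resolved in favor of the value seen first in the group.

```python
from collections import Counter, defaultdict


def resample_by_majority_vote(rows, group_columns, metadata_name):
    """Set `metadata_name` in every row to the most common value of its group."""
    def group_key(row):
        return tuple(row[column] for column in group_columns)

    votes = defaultdict(Counter)
    for row in rows:
        votes[group_key(row)][row[metadata_name]] += 1

    # most_common(1) returns the value with the highest count in each group.
    winners = {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
    return [{**row, metadata_name: winners[group_key(row)]} for row in rows]


# Hypothetical rows: two segments with mixed labels.
rows = [
    {"SegmentID": 0, "Label": "A"},
    {"SegmentID": 0, "Label": "A"},
    {"SegmentID": 0, "Label": "B"},
    {"SegmentID": 1, "Label": "B"},
    {"SegmentID": 1, "Label": "B"},
    {"SegmentID": 1, "Label": "B"},
    {"SegmentID": 1, "Label": "A"},
]

voted = resample_by_majority_vote(rows, ["SegmentID"], "Label")
print([row["Label"] for row in voted])
```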