Samplers
Samplers remove outliers and noisy data before classification, which improves the robustness of the model.
-
Isolation Forest Filtering
Isolation Forest Filtering computes an anomaly score for each sample using the Isolation Forest algorithm. The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of that feature. Samples that can be isolated with fewer splits are more likely to be outliers.
- Parameters
input_data – DataFrame, the feature set that results from a generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
assign_unknown (bool) – Assign unknown label to outliers.
- Returns
DataFrame containing features without outliers and noise.
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
...                                data_columns=['accelx', 'accely', 'accelz'],
...                                group_columns=['Subject', 'Class'],
...                                label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
...                                         'params': {"columns": ['accelx', 'accely', 'accelz'],
...                                                    "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Isolation Forest Filtering", params={"outliers_fraction": 0.01})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
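The idea behind this filter can be tried locally without the pipeline. The following is a minimal sketch, assuming scikit-learn's IsolationForest as a stand-in for the server-side implementation; the helper isolation_forest_filter, the column names, and the mapping of outliers_fraction to the contamination argument are illustrative assumptions, not part of this API.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest


def isolation_forest_filter(df, feature_columns, outliers_fraction, seed=0):
    """Drop the rows that an Isolation Forest flags as outliers (hypothetical helper)."""
    # Assumption: outliers_fraction plays the role of scikit-learn's contamination.
    model = IsolationForest(contamination=outliers_fraction, random_state=seed)
    # fit_predict returns 1 for inliers and -1 for outliers.
    flags = model.fit_predict(df[feature_columns].to_numpy())
    return df[flags == 1]


# Hypothetical feature table: 19 clustered vectors plus one distant vector at index 19.
rng = np.random.default_rng(0)
values = np.vstack([rng.normal(0.0, 1.0, size=(19, 2)), [[25.0, 25.0]]])
features = pd.DataFrame(values, columns=["gen_0001", "gen_0002"])
features["Class"] = "Walking"

filtered = isolation_forest_filter(
    features, ["gen_0001", "gen_0002"], outliers_fraction=0.05
)
print(filtered.index.tolist())
```

With 20 rows and a fraction of 0.05, one row is flagged, and the distant vector is the easiest one to isolate.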
-
Local Outlier Factor Filtering
The local outlier factor (LOF) measures the local deviation of a given data point with respect to its neighbors by comparing their local densities.
The LOF algorithm is an unsupervised outlier detection method that computes the local density deviation of a given data point with respect to its neighbors. It treats samples that have a substantially lower density than their neighbors as outliers.
- Parameters
input_data – DataFrame, the feature set that results from a generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
number_of_neighbors (int) – Number of neighbors for a vector.
norm (string) – Metric that will be used for the distance computation.
assign_unknown (bool) – Assign unknown label to outliers.
- Returns
DataFrame containing features without outliers and noise.
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
...                                data_columns=['accelx', 'accely', 'accelz'],
...                                group_columns=['Subject', 'Class'],
...                                label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
...                                         'params': {"columns": ['accelx', 'accely', 'accelz'],
...                                                    "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Local Outlier Factor Filtering",
...                               params={"outliers_fraction": 0.05, "number_of_neighbors": 5})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
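To see how the density comparison behaves on a small table, here is a minimal local sketch, assuming scikit-learn's LocalOutlierFactor as a stand-in for the server-side implementation. The helper lof_filter is hypothetical, and mapping number_of_neighbors to n_neighbors, norm to metric, and outliers_fraction to contamination is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor


def lof_filter(df, feature_columns, outliers_fraction=0.05,
               number_of_neighbors=5, norm="euclidean"):
    """Drop the rows whose local density is much lower than that of their neighbors."""
    # Assumption: the documented parameters map onto these scikit-learn arguments.
    model = LocalOutlierFactor(
        n_neighbors=number_of_neighbors,
        contamination=outliers_fraction,
        metric=norm,
    )
    # fit_predict returns 1 for inliers and -1 for outliers.
    flags = model.fit_predict(df[feature_columns].to_numpy())
    return df[flags == 1]


# Hypothetical feature table: a dense cluster plus one isolated vector at index 19.
rng = np.random.default_rng(1)
values = np.vstack([rng.normal(0.0, 1.0, size=(19, 2)), [[30.0, -30.0]]])
features = pd.DataFrame(values, columns=["gen_0001", "gen_0002"])

filtered = lof_filter(features, ["gen_0001", "gen_0002"])
print(filtered.index.tolist())
```

The isolated vector has all of its nearest neighbors inside the dense cluster, so its local density is far lower than theirs and it is the row that gets dropped.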
-
One Class SVM filtering
Unsupervised outlier detection that estimates the support of a high-dimensional distribution. The implementation is based on libsvm.
- Parameters
input_data – DataFrame, the feature set that results from a generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – Define the ratio of outliers.
kernel (str) – Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’.
assign_unknown (bool) – Assign unknown label to outliers.
- Returns
DataFrame containing features without outliers and noise.
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
...                                data_columns=['accelx', 'accely', 'accelz'],
...                                group_columns=['Subject', 'Class'],
...                                label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
...                                         'params': {"columns": ['accelx', 'accely', 'accelz'],
...                                                    "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("One Class SVM filtering", params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
-
Robust Covariance Filtering
Unsupervised outlier detection for a Gaussian-distributed dataset.
- Parameters
input_data – DataFrame, the feature set that results from a generator_set or feature_selector
label_column (str) – Label column name.
filtering_label – List<String>, list of classes. If it is not defined, all classes are used.
feature_columns – List<String>, list of features. If it is not defined, all features are used.
outliers_fraction (float) – An upper bound on the fraction of training errors and a lower bound of the fraction of support vectors. Should be in the interval (0, 1]. By default 0.5 will be taken.
assign_unknown (bool) – Assign unknown label to outliers.
- Returns
DataFrame containing features without outliers and noise.
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
...                                data_columns=['accelx', 'accely', 'accelz'],
...                                group_columns=['Subject', 'Class'],
...                                label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
...                                         'params': {"columns": ['accelx', 'accely', 'accelz'],
...                                                    "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Robust Covariance Filtering", params={"outliers_fraction": 0.05})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
-
Sample By Metadata
Select rows from the input DataFrame based on a metadata column. Rows that have a metadata value that is in the values list will be returned.
- Parameters
input_data (DataFrame) – Input DataFrame.
metadata_name (str) – Name of the metadata column to use for sampling.
metadata_values (list[str]) – List of values of the named column for which to select rows of the input data.
- Returns
The input_data DataFrame containing only the rows for which the metadata value is in the accepted list.
- Return type
DataFrame
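The row selection described above can be reproduced on a local DataFrame. This is a minimal pandas sketch; the function sample_by_metadata and the example columns are hypothetical stand-ins, not the library's implementation.

```python
import pandas as pd


def sample_by_metadata(input_data, metadata_name, metadata_values):
    """Keep only the rows whose metadata value is in the accepted list."""
    mask = input_data[metadata_name].isin(metadata_values)
    return input_data[mask]


# Hypothetical input: one metadata column and one sensor column.
df = pd.DataFrame(
    {
        "Subject": ["s01", "s02", "s03", "s01"],
        "accelx": [10, 20, 30, 40],
    }
)

selected = sample_by_metadata(df, "Subject", ["s01", "s03"])
print(selected.index.tolist())  # [0, 2, 3]
```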
-
Combine Labels
Select rows from the input DataFrame based on the label column. Rows that have a label value that is in the combined label list will be returned.
- Syntax
combine_labels = {'group1': ['label1', 'label2'], 'group2': ['label3', 'label4'], 'group3': ['group5']}
- Parameters
input_data (DataFrame) – Input DataFrame.
label_column (str) – Label column name.
combine_labels (dict) – Map of each combined label name to the list of existing labels it replaces.
- Returns
The input_data containing only the rows for which the label value is in the combined list.
- Return type
DataFrame
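One plausible reading of the combine_labels syntax is sketched below in pandas: rows whose label appears in any list are kept and relabeled with the dictionary key, and all other rows are dropped. The helper combine_label_groups and the relabeling step are assumptions made for illustration.

```python
import pandas as pd


def combine_label_groups(input_data, label_column, combine_labels):
    """Keep rows whose label is listed in combine_labels and rename them to the group key."""
    # Invert {'group': ['label', ...]} into {'label': 'group'} for a direct lookup.
    lookup = {
        old: new for new, old_labels in combine_labels.items() for old in old_labels
    }
    kept = input_data[input_data[label_column].isin(list(lookup))].copy()
    kept[label_column] = kept[label_column].map(lookup)
    return kept


# Hypothetical labels; "jump" is not listed in any group, so its row is dropped.
df = pd.DataFrame({"Class": ["walk", "run", "sit", "stand", "jump"]})
groups = {"moving": ["walk", "run"], "still": ["sit", "stand"]}

combined = combine_label_groups(df, "Class", groups)
print(combined["Class"].tolist())  # ['moving', 'moving', 'still', 'still']
```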
-
Zscore Filter
A z-score filter standardizes feature vectors by transforming each feature to have a mean of zero and a standard deviation of one. The z-score, or standard score, measures how many standard deviations a data point is from the mean of the distribution. Feature vectors that have z-scores outside of a cutoff threshold are removed.
- Parameters
input_data (DataFrame) – Input DataFrame.
label_column (str) – Label column name.
zscore_cutoff (int) – Z-score cutoff above which a feature is treated as an outlier.
feature_threshold (int) – The number of features in a feature vector that can be outside of the zscore_cutoff without removing the feature vector.
feature_columns (list) – List of features to filter by. If None, filters all.
assign_unknown (bool) – Assign unknown label to outliers.
- Returns
The filtered DataFrame containing only the feature vectors that pass the z-score cutoff.
- Return type
DataFrame
Examples
>>> client.pipeline.reset(delete_cache=False)
>>> df = client.datasets.load_activity_raw()
>>> client.pipeline.set_input_data('test_data', df, force=True,
...                                data_columns=['accelx', 'accely', 'accelz'],
...                                group_columns=['Subject', 'Class'],
...                                label_column='Class')
>>> client.pipeline.add_feature_generator([{'name': 'Downsample',
...                                         'params': {"columns": ['accelx', 'accely', 'accelz'],
...                                                    "new_length": 5}}])
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices before the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5, 6, 7, 8]
>>> client.pipeline.add_transform("Zscore Filter", params={"zscore_cutoff": 3, "feature_threshold": 1})
>>> results, stats = client.pipeline.execute()
>>> # List of all data indices after the filtering algorithm
>>> results.index.tolist()
Out: [0, 1, 2, 3, 4, 5]
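How zscore_cutoff and feature_threshold interact can be sketched with pandas. This is an illustrative approximation, not the library's code: the helper zscore_filter is hypothetical, it uses the absolute z-score with the population standard deviation, and it keeps a vector when at most feature_threshold of its features are outside the cutoff.

```python
import pandas as pd


def zscore_filter(input_data, feature_columns, zscore_cutoff=3, feature_threshold=1):
    """Drop feature vectors with too many features beyond the z-score cutoff."""
    features = input_data[feature_columns]
    # Standardize each feature column to zero mean and unit standard deviation.
    z = (features - features.mean()) / features.std(ddof=0)
    # Count, per row, how many features fall outside the cutoff.
    n_outside = (z.abs() > zscore_cutoff).sum(axis=1)
    # Assumption: a row survives while that count does not exceed feature_threshold.
    return input_data[n_outside <= feature_threshold]


# Hypothetical table: row 18 has one extreme feature, row 19 has two.
base = list(range(19))
df = pd.DataFrame(
    {
        "f0": base + [1000],
        "f1": base + [1000],
        "f2": list(range(18)) + [1000, 18],
    }
)

tolerant = zscore_filter(df, ["f0", "f1", "f2"], zscore_cutoff=3, feature_threshold=1)
strict = zscore_filter(df, ["f0", "f1", "f2"], zscore_cutoff=3, feature_threshold=0)
print(tolerant.index.tolist())  # rows 0..18: one extreme feature is tolerated
print(strict.index.tolist())    # rows 0..17: any extreme feature removes the row
```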
-
Sigma Outliers Filtering
A sigma outlier filter algorithm is a technique used to identify and remove outliers from feature vectors based on their deviation from the mean. In this algorithm, an outlier is defined as a data point that falls outside a certain number of standard deviations (sigma) from the mean of the distribution.
- Parameters
input_data (DataFrame) – The feature set that is a result of either a generator_set or feature_selector.
label_column (str) – The label column name.
filtering_label (str) – List of classes that will be filtered. If it is not defined, all classes will be filtered.
feature_columns (list of str) – List of features. If it is not defined, it uses all features.
sigma_threshold (float) – Number of standard deviations from the mean beyond which a data point is treated as an outlier.
assign_unknown (bool) – Assigns an unknown label to outliers.
- Returns
The filtered DataFrame containing features without outliers and noise.
Examples
client.pipeline.reset(delete_cache=False)
df = client.datasets.load_activity_raw()
client.pipeline.set_input_data('test_data', df, force=True,
                               data_columns=['accelx', 'accely', 'accelz'],
                               group_columns=['Subject', 'Class'],
                               label_column='Class')
client.pipeline.add_feature_generator([{'name': 'Downsample',
                                        'params': {"columns": ['accelx', 'accely', 'accelz'],
                                                   "new_length": 5}}])
results, stats = client.pipeline.execute()
# List of all data indices before the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
client.pipeline.add_transform("Sigma Outliers Filtering", params={"sigma_threshold": 1.0})
results, stats = client.pipeline.execute()
# List of all data indices after the filtering algorithm
results.index.tolist()
# Out:
# [0, 1, 2, 3, 4, 5]
- Return type
DataFrame
Sampling Techniques for Handling Imbalanced Data
-
Undersample Majority Classes
Create a balanced data set by undersampling the majority classes using random sampling without replacement.
- Parameters
input_data (DataFrame) – input DataFrame
label_column (str) – The column to split against
target_class_size (int) – Number of samples to keep per class. If None, the size of the smallest class is used. If the value is greater than the size of the smallest class, the size of the smallest class is used (default: None)
seed (int) – Specifies a random seed to use for sampling
maximum_samples_size_per_class (int) – Specifies the maximum number of samples to use per class.
- Returns
DataFrame containing undersampled classes
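The balancing rule for target_class_size can be sketched with pandas group-wise sampling. This is a minimal illustration under stated assumptions: the helper undersample_majority is hypothetical, it samples without replacement through DataFrameGroupBy.sample, and it does not model maximum_samples_size_per_class.

```python
import pandas as pd


def undersample_majority(input_data, label_column, target_class_size=None, seed=0):
    """Randomly sample every class down to the same size, without replacement."""
    min_size = int(input_data[label_column].value_counts().min())
    # The target is capped at the size of the smallest class.
    size = min_size if target_class_size is None else min(target_class_size, min_size)
    sampled = input_data.groupby(label_column).sample(n=size, random_state=seed)
    return sampled.sort_index()


# Hypothetical imbalanced table: 5 rows of "A", 3 of "B", 2 of "C".
df = pd.DataFrame({"Class": ["A"] * 5 + ["B"] * 3 + ["C"] * 2, "value": range(10)})

balanced = undersample_majority(df, "Class")
capped = undersample_majority(df, "Class", target_class_size=10)
smaller = undersample_majority(df, "Class", target_class_size=1)
print(balanced["Class"].value_counts().to_dict())
```

Each class ends up with two rows here because the smallest class, "C", has two; asking for ten per class gives the same result, while asking for one keeps a single row per class.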
Sampling Techniques for Augmenting Data Sets
-
Pad Segment
Pad a segment so that its length is equal to a specific sequence length.
- Parameters
input_data (DataFrame) – input DataFrame
group_columns (str) – The column to group by (should be SegmentID)
sequence_length (int) – Specifies the length to which each segment is padded.
noise_level (int) – Maximum amount of noise to add during augmentation.
- Returns
DataFrame containing padded segments
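The documentation does not state how the padding values are chosen, so the sketch below picks one simple strategy purely for illustration: repeat the last row of a short segment and add integer noise of at most noise_level to its data columns. The helper pad_segments, its data_columns argument, and this padding strategy are all assumptions.

```python
import numpy as np
import pandas as pd


def pad_segments(input_data, group_column, data_columns, sequence_length,
                 noise_level=0, seed=0):
    """Pad every segment shorter than sequence_length up to that length."""
    rng = np.random.default_rng(seed)
    padded = []
    for _, segment in input_data.groupby(group_column, sort=False):
        missing = sequence_length - len(segment)
        if missing > 0:
            # Assumption: fill with copies of the last row, jittered by integer noise.
            filler = pd.concat([segment.iloc[[-1]]] * missing, ignore_index=True)
            noise = rng.integers(
                -noise_level, noise_level + 1, size=(missing, len(data_columns))
            )
            filler[data_columns] = filler[data_columns].to_numpy() + noise
            segment = pd.concat([segment, filler], ignore_index=True)
        padded.append(segment)
    return pd.concat(padded, ignore_index=True)


# Hypothetical input: segment 0 has 3 samples, segment 1 already has 5.
df = pd.DataFrame(
    {"SegmentID": [0, 0, 0, 1, 1, 1, 1, 1], "accelx": [1, 2, 3, 4, 5, 6, 7, 8]}
)

padded = pad_segments(df, "SegmentID", ["accelx"], sequence_length=5, noise_level=0)
print(padded["accelx"].tolist())  # [1, 2, 3, 3, 3, 4, 5, 6, 7, 8]
```

Segments that are already at least sequence_length long are left unchanged in this sketch.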
-
Resampling by Majority Vote
For each group, perform max pooling on the specified metadata_name column and set the value of that metadata column to the most frequently occurring value in the group.
- Parameters
input_data (DataFrame) – Input DataFrame.
group_columns (list) – Columns to group over.
metadata_name (str) – Name of the metadata column to use for sampling.
- Returns
The modified input_data DataFrame with metadata_name column being modified by max pooling.
- Return type
DataFrame
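The per-group vote can be sketched with a pandas group-wise transform. The helper resample_by_majority_vote is a hypothetical stand-in for the library's implementation, and breaking ties by taking the first value returned by mode() is an assumption of this sketch.

```python
import pandas as pd


def resample_by_majority_vote(input_data, group_columns, metadata_name):
    """Replace the metadata value of every row with its group's most common value."""
    result = input_data.copy()
    result[metadata_name] = result.groupby(group_columns)[metadata_name].transform(
        lambda values: values.mode().iloc[0]
    )
    return result


# Hypothetical input: each segment contains one row that disagrees with the majority.
df = pd.DataFrame(
    {
        "SegmentID": [0, 0, 0, 1, 1, 1],
        "Label": ["A", "A", "B", "B", "B", "A"],
    }
)

voted = resample_by_majority_vote(df, ["SegmentID"], "Label")
print(voted["Label"].tolist())  # ['A', 'A', 'A', 'B', 'B', 'B']
```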