# SensiML AutoML

AutoML is used to create a set of models within the desired statistical (accuracy, f1-score, sensitivity, etc.) and classifier size (neurons, features) parameters. As the algorithm iterates through each optimization step, it narrows down the search space to find the desired number of models. The optimization terminates when the desired model is found or when the maximum number of iterations is reached.

We take advantage of dynamic programming and optimizations for training algorithms to speed up the computation. This makes it possible to search large parameter spaces quickly and efficiently. The results are ranked by a fitness score that takes into account the model's statistical and hardware parameters.
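The exact fitness function is internal to the engine, but as a purely illustrative sketch, a score of this shape rewards statistical quality while penalizing hardware cost (the helper, names, and weights below are assumptions, not SensiML's actual formula):

```python
# Illustrative only: NOT SensiML's actual fitness function. The idea is
# that fitness adds weighted statistical scores and subtracts weighted
# hardware costs, so higher is better.

def fitness(stats, costs, stat_weights, cost_weights):
    reward = sum(stat_weights[name] * stats[name] for name in stat_weights)
    penalty = sum(cost_weights[name] * costs[name] for name in cost_weights)
    return reward - penalty

score = fitness(
    stats={"accuracy": 0.96, "sensitivity": 0.91},
    costs={"classifiers_sram": 0.2},   # SRAM usage, normalized to [0, 1]
    stat_weights={"accuracy": 1.0, "sensitivity": 0.7},
    cost_weights={"classifiers_sram": 0.5},
)
print(round(score, 3))  # 0.96 + 0.637 - 0.1 = 1.497
```

Under a scheme like this, two models with equal accuracy would be ranked by how cheaply they run on the target hardware.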

## Parameters

Base class for the pipeline execution engine.

`engine.automationengine.AutomationEngine.automate(self, auto_params, run_parallel=True, caching=True)`
Pipeline parameters:

allow_unknown, (bool, False): Allows unknown prediction results for the vectors. A vector is classified as unknown if it cannot be recognized by any neuron.

demean_segments, (bool, False): Removes the mean from the input data before extracting features.

validation_method, (Stratified Shuffle Split): Validation method that will be used in the optimization.

balance_data, (bool, False): Undersamples the majority classes to balance the data prior to model building.

outlier_filter, (bool, False): Filters outliers using the "Isolation Forest Filter" with a filter percent of 5.

Knowledge Pack Architecture Setting:

single_model, (bool, True): Optimize the prediction results by creating a knowledge pack from a single model.

hierarchical_multi_model, (bool, True): Optimize the prediction results by creating a knowledge pack from the hierarchical multi-models.

hierarchical_model_configuration, (dict, None):

Creates custom hierarchical multi-models to optimize the prediction results. If the user has domain knowledge and wants to group the classes into a custom hierarchical multi-model, hierarchical_model_configuration can be used to describe the groups. See the example 'Create a custom hierarchical multi-models' given below. It has 3 parameters:

parent, (dict): Defines the parent of the node in the hierarchical multi-model structure.

train, (dict): Defines how the data will be grouped.

nodes, (dict): Defines how to handle the results of the model. A result can either be reported or passed on to a sub-model.

Optimization algorithm parameters:

search_steps, (list, ['selectorset', 'tvo']): Defines which libraries in the pipeline will be optimized.

population_size, (int, 10): Initial number of randomly created pipelines.

iterations, (int, 1): Maximum number of iterations of the optimization process.

mutation_rate, (float, 0.1): Rate of random changes applied to pipelines from the previous population.

recreation_rate, (float, 0.1): Rate of randomly created pipelines for next generation.

survivor_rate, (float, 0.5): Ratio of the population that will be transferred to the next generation.
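As a rough illustration of how these rates could partition one generation of pipelines (the engine's actual bookkeeping is not specified here; the helper below is hypothetical):

```python
# Hypothetical sketch: split one generation of candidate pipelines by the
# genetic-search rates above. With the defaults (population_size=10,
# survivor_rate=0.5, recreation_rate=0.1, mutation_rate=0.1): 5 pipelines
# survive unchanged, 1 is created from scratch, 1 is a mutated copy, and
# the remaining 3 slots are filled by recombining survivors.

def next_generation_counts(population_size, survivor_rate,
                           recreation_rate, mutation_rate):
    survivors = int(population_size * survivor_rate)    # kept as-is
    recreated = int(population_size * recreation_rate)  # fresh random pipelines
    mutated = int(population_size * mutation_rate)      # randomly perturbed copies
    # Any remaining slots come from recombining the surviving pipelines.
    recombined = population_size - survivors - recreated - mutated
    return {"survivors": survivors, "recreated": recreated,
            "mutated": mutated, "recombined": recombined}

counts = next_generation_counts(10, survivor_rate=0.5,
                                recreation_rate=0.1, mutation_rate=0.1)
print(counts)
```

The point of the sketch is only that the three rates trade off exploitation (survivors) against exploration (recreated and mutated pipelines) within a fixed population size.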

Statistical variables:

accuracy, (float, 1.0): The degree of correctness of all vectors.

f1_score, (float, 0.0): Measure of the test's accuracy.

precision, (float, 0.3): Proportion of positive identifications that are actually correct.

sensitivity, (float, 0.7): Measure of the proportion of actual positives that are correctly identified.

specificity, (float, 0.5): Measure of the proportion of actual negatives that are correctly identified.

positive_predictive_rate, (float, 0.0): Ratio of true positives among all positive predictions made by the model.

Cost variables:

classifiers_sram, (float, 0.5) : Defines the weight of the SRAM usage in the fitness score.

prediction_target, (dict): Prediction targets are the statistical scores accuracy, f1_score, and sensitivity. These scores are used to terminate the optimization process: if the statistical targets (and hardware targets) reach the desired scores, the optimization process is terminated for the statistical optimization.

hardware_target, (dict): Hardware targets are latency and classifiers_sram. These scores are used to terminate the optimization process: if the hardware targets reach the desired scores, the optimization process is terminated for the hardware optimization.
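The termination rule these two targets describe can be sketched as a simple all-targets-met check. The helper below is hypothetical; only the shapes of the `prediction_target` and `hardware_target` dicts mirror the examples later in this page:

```python
# Hypothetical sketch of the early-termination check. Statistical targets
# are floors (higher scores are better); hardware targets are budgets
# (lower costs are better).

def targets_met(scores, prediction_target, hardware_target):
    stats_ok = all(scores.get(name, 0.0) >= goal
                   for name, goal in prediction_target.items())
    hardware_ok = all(scores.get(name, float("inf")) <= budget
                      for name, budget in hardware_target.items())
    return stats_ok and hardware_ok

model_scores = {"accuracy": 0.97, "sensitivity": 0.95, "classifiers_sram": 4000}
done = targets_met(model_scores,
                   prediction_target={"accuracy": 0.95, "sensitivity": 0.90},
                   hardware_target={"classifiers_sram": 5000})
print(done)  # True: both statistical floors and the SRAM budget are met
```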

feature_thresholds, (dict, None): Defines the thresholds for the feature domain. It has 2 parameters:

minimum_variance, (float, 0.05): Removes all features whose variance is lower than the given threshold.

maximum_correlation, (float, 0.95): Calculates a pair-wise correlation matrix of all features and removes all features whose absolute correlation with another feature is greater than the given threshold, keeping only the last feature of each correlated pair.
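A minimal, dependency-free sketch of these two filters (illustrative only; SensiML's implementation will differ), representing features as a name-to-value-list dict:

```python
from math import sqrt
from statistics import pvariance

def _abs_corr(x, y):
    """Absolute Pearson correlation of two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def apply_feature_thresholds(features, minimum_variance=0.05,
                             maximum_correlation=0.95):
    """Illustrative feature_thresholds filter; `features` maps
    feature name -> list of sample values."""
    # 1. Variance filter: drop near-constant features.
    kept = {name: vals for name, vals in features.items()
            if pvariance(vals) >= minimum_variance}
    # 2. Correlation filter: for each highly correlated pair, drop the
    #    earlier feature so only the last one survives.
    names = list(kept)
    drop = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if _abs_corr(kept[a], kept[b]) > maximum_correlation:
                drop.add(a)
    return {name: vals for name, vals in kept.items() if name not in drop}

features = {"mean_x": [1.0, 2.0, 3.0, 4.0],    # perfectly correlated with scaled_x
            "scaled_x": [2.0, 4.0, 6.0, 8.0],
            "flat": [0.0, 0.0, 0.0, 0.01]}     # variance far below 0.05
filtered = apply_feature_thresholds(features)
print(list(filtered))  # ['scaled_x']
```

Here `flat` is removed by the variance filter, and of the correlated pair (`mean_x`, `scaled_x`) only the last feature is kept.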

Status parameters:

lock, (bool, False): Ping for results every 30 seconds until the process finishes.

silent, (bool, True): Silence status updates.

Seeds:

seed, (str): Seeds are predefined feature sets that perform better for specific cases and data sets. There are several seeds you can use when generating a model using AutoML.

Basic Features: Generates a set of all-purpose, high-performance features using statistical, energy, and rate-of-change feature generators. The seed then performs feature selection and model generation algorithms with a genetic algorithm to optimize pipeline parameters. Use this seed if:

- You are wondering where to start
- You want execution to be as quick as possible
- You want simple, easy-to-interpret features

Advanced Features: Generates a comprehensive set of features using statistical, energy, amplitude, shape, time, and rate-of-change feature generators. The seed then performs feature selection and model generation algorithms with a genetic algorithm to optimize pipeline parameters. Use this seed if:

- You tried "Basic Features" and didn't get a good model
- You don't mind if execution takes a while
- You want the best possible features, even if they are complex

Downsampled Features: Generates a set of downsampled features and then performs feature selection and model generation algorithms with a genetic algorithm to optimize pipeline parameters. Use this seed if:

- You are creating a gesture recognition application

Histogram Features: Generates a set of histogram features and then performs feature selection and model generation algorithms with a genetic algorithm to optimize pipeline parameters. Use this seed if:

- You are creating a motor vibration application

Custom Feature Generatorset: Uses a user-defined pipeline to extract features and then searches for optimal parameters for the feature selection step. Use this seed if:

- You tried the other seeds and didn't get a good model
- You want to build your own pipeline and use the genetic algorithm to find the best number of features, best number of neurons, and other model-related parameters

Examples

Using Basic Features:

```python
dsk.pipeline.reset()

dsk.pipeline.set_input_data('Gesture.csv',
                            data_columns=['AccelerometerX', 'GyroscopeY'],
                            group_columns=['Gesture', 'SegmentID', 'Subject'],
                            label_column='Gesture')

dsk.pipeline.add_transform("Windowing", params={"window_size": 400, "delta": 400})

results, summary = dsk.pipeline.auto({
    'seed': 'Basic Features',
    'params': {"search_steps": ['selectorset', 'tvo'],
               "population_size": 6,
               "iterations": 1,
               "mutation_rate": 0.1,
               "recreation_rate": 0.1,
               "survivor_rate": 0.5,
               "number_of_models_to_return": 5,
               "run_parallel": True,
               "allow_unknown": False,
               "auto_group": False,
               "balance_data": True,
               "validation_method": "'Stratified Shuffle Split'",
               "single_model": True,
               "hierarchical_multi_model": False,
               "prediction_target": {'accuracy': 1.0,
                                     'positive_predictive_rate': 0.0,
                                     'sensitivity': 0.0},
               "hardware_target": {'latency': 0,
                                   'classifiers_sram': 5},
               }
    })

summary['fitness_summary']
```


Using Custom Feature Generatorset:

```python
sensors = ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ',
           'GyroscopeX', 'GyroscopeY', 'GyroscopeZ']
sample_rate = 400

custom_seed = {
    'seed': 'Custom Feature Generatorset',
    'params': {
        'iterations': 1,
        'reset': True,
        'population_size': 6,
        'mutation_rate': 0.1,
        'recreation_rate': 0.2,
        'survivor_rate': 0.5,
        'allow_unknown': False,
        'validation_method': "'Stratified Shuffle Split'",
        'balance_data': True,
        'demean_segments': True,
        'outlier_filter': True,
        'single_model': True,
        'hierarchical_multi_model': False,
        'prediction_target(%)': {'f1_score': 100},
        'hardware_target': {'classifiers_sram': 32000},
        'combine_labels': {'Coughing': ['Coughing'], 'Other': ['Other', 'Sniffing']},

        'generatorset': {
            'name': 'generator_set',
            'type': 'generatorset',
            'set': [
                {'function_name': 'Global Peak to Peak', 'inputs': {'columns': sensors}},
                {'function_name': 'Global Min Max Sum', 'inputs': {'columns': sensors}},
                {'function_name': '75th Percentile', 'inputs': {'columns': sensors}},
                {'function_name': '100th Percentile', 'inputs': {'columns': sensors}},
                {'function_name': 'Minimum', 'inputs': {'columns': sensors}},
                {'function_name': 'Maximum', 'inputs': {'columns': sensors}},
                {'function_name': 'Sum', 'inputs': {'columns': sensors}},
            ],
            'inputs': {'group_columns': None, 'input_data': ''},
            'outputs': ['temp.generator_set0', 'temp.features.generator_set0'],
        },
    },
}

dsk.pipeline.reset()
dsk.pipeline.set_input_data(
    "Gesture.csv",
    data_columns=sensors,
    group_columns=["Gesture", "SegmentID", "Subject"],
    label_column="Gesture",
)

results, summary = dsk.pipeline.auto(custom_seed)
summary['fitness_summary']
```


Create a custom hierarchical multi-models:

Case:

Suppose the user has domain knowledge about the gestures (in our case, the gestures are 'A', 'D', 'L', 'M', 'U') and wants to group these gestures to create a more accurate knowledge pack. The design of the hierarchical multi-model architecture is given below.

- The parent model creates a prediction for (A,M), U, (L,D)
- The second model creates a prediction for A and M
- The third model creates a prediction for L and D

```python
hm_configuration = {
    'pr': {'parent': None,
           'train': {'group_1': ['A', 'M'],
                     'U': ['U'],
                     'group_2': ['L', 'D']},
           'nodes': {'group_1': 'model_group_1',
                     'U': 'Report',
                     'group_2': 'model_group_2'}
           },

    'model_group_1': {
        'parent': 'pr',
        'train': {'A': ['A'],
                  'M': ['M']},
        'nodes': {'A': 'Report',
                  'M': 'Report'}
    },

    'model_group_2': {
        'parent': 'pr',
        'train': {'L': ['L'],
                  'D': ['D']},
        'nodes': {'L': 'Report',
                  'D': 'Report'}
    }
}

dsk.pipeline.reset()

dsk.pipeline.set_input_data('Gesture.csv',
                            data_columns=['AccelerometerX', 'GyroscopeY'],
                            group_columns=['Gesture', 'SegmentID', 'Subject'],
                            label_column='Gesture')

dsk.pipeline.add_transform("Windowing", params={"window_size": 400, "delta": 400})

results, summary = dsk.pipeline.auto({
    'seed': 'Basic Features',
    'params': {"search_steps": ['selectorset', 'tvo'],
               "population_size": 6,
               "iterations": 1,
               "mutation_rate": 0.1,
               "recreation_rate": 0.1,
               "survivor_rate": 0.5,
               "number_of_models_to_return": 5,
               "run_parallel": True,
               "allow_unknown": False,
               "auto_group": False,
               "balance_data": True,
               "validation_method": "'Stratified Shuffle Split'",
               "single_model": True,
               "hierarchical_multi_model": True,
               "hierarchical_model_configuration": hm_configuration,
               "prediction_target": {'accuracy': 1.0,
                                     'positive_predictive_rate': 0.0,
                                     'sensitivity': 0.0},
               "hardware_target": {'latency': 0,
                                   'classifiers_sram': 5},
               }
    })

summary['fitness_summary']
```