Getting Started with the SensiML Python SDK

This tutorial is a continuation of the Getting Started tutorial at https://sensiml.com/documentation/guides/getting-started/overview.html

We recommend you complete the main Getting Started Guide above before using the SensiML Python SDK.

This tutorial is good for two scenarios:

  • You are experienced with machine learning and want to create your own Knowledge Pack with customized algorithms

  • You already generated a Knowledge Pack using the Analytics Studio and want to find out how you can tweak the underlying features of the Knowledge Pack even further

Prerequisites: You should have already uploaded the Quick Start project called Slide Demo through the Data Studio

The goal of this tutorial is to give insight into the more advanced features in building a custom algorithm for a Knowledge Pack.

There are three main steps to building a SensiML Knowledge Pack:

- Query your data
- Transform the data into a feature vector
- Build the model to fit on the sensor device

Try It Yourself

You can download the Notebook file to follow along with this tutorial in your own environment.

Loading Your Project

First you need to load the project you created through the Data Studio. In our example it is called ‘Slide Demo’.

[15]:
%matplotlib inline

from sensiml import SensiML
client = SensiML()
[16]:
client.project = 'Slide Demo'

The next step is to initialize a pipeline space to work in. Pipelines store the preprocessing, feature extraction, and model building steps. When training a model, these steps are executed on the SensiML server. Once the model has been trained, the pipeline is converted to firmware code that will run on your target embedded device. Add a pipeline to the project using the following code snippet.

client.pipeline = "Name of your pipeline"
[17]:
client.pipeline = "Slide Demo Pipeline"

Query Your Data

To select the data you want to use in your pipeline, you need to add a query step. Queries give you a way to select and filter that data.

  1. Create a query for all labeled sensor data in your project

We recommend using the Prepare Data page in the Analytics Studio at https://app.sensiml.cloud/ to create your query. Alternatively, you can use the create_query API by running the cell below.

[ ]:
client.create_query(name="My Query",
                    segmenter="My Training Session",
                    label_column="Label",
                    metadata_columns=["Subject"],
                    columns=["AccelerometerX", "AccelerometerY", "AccelerometerZ",
                             "GyroscopeX", "GyroscopeY", "GyroscopeZ"])

Building a Pipeline

Throughout this notebook we will add multiple steps to transform the data in a pipeline.

Note: No work is done on the data until you execute the pipeline, i.e., client.pipeline.execute()

The main steps of a pipeline include:

- Query
- Feature Engineering
- Model Generation

It is important that you add the steps in the right order. If you accidentally add them in the wrong order or want to restart, simply enter the command:

client.pipeline.reset()

Adding your Query step

Let’s add the query step that you created above. Use the command below:

[ ]:
client.pipeline.reset()
client.pipeline.set_input_query('My Query')

Pipeline Progress

To see the current steps in your pipeline you can enter the command:

[ ]:
client.pipeline.describe()

SensiML Core Functions

The Analytics Studio provides a way to define a pipeline for feature vector and model building. The feature vector generation part of the pipeline includes over 100 core functions that can be split into a few different types:

  • Sensor transforms - these are applied to the data directly as it comes off the sensor; examples include smoothing functions and the magnitude of a set of sensor columns.

  • Segmentation - the segmenter selects regions of interest from the streaming data. This can be an event if you are using an event detection segmenter, or simply a sliding window which buffers a segment of data and sends it to the next step.

  • Segment transforms - these operate on a segment of data, typically normalizing it in some way (such as demeaning) to prepare for feature vector generation.

  • Feature generators - Algorithms to extract relevant feature vectors from the data streams in preparation for model building.

  • Feature transforms - these normalize all of the features in the feature vector to values between 0 and 255.

  • Feature selectors - These functions remove features which do not help discriminate between different classes.

The Analytics Studio allows you to string together a pipeline composed of these individual steps. The pipeline is sent to our servers where we can take advantage of optimizations to speed up the pipeline processing.
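
To make this concrete, here is a sketch of how several of these step types map onto pipeline calls, using the same calls demonstrated in this tutorial; the parameter values are illustrative only.

# Illustrative pipeline skeleton; each step is explained in detail below.
client.pipeline.reset()
client.pipeline.set_input_query('My Query')                                # Query

client.pipeline.add_transform("Magnitude",                                 # Sensor transform
                              params={"input_columns": ['GyroscopeX', 'GyroscopeY', 'GyroscopeZ']})

client.pipeline.add_transform("Windowing",                                 # Segmentation
                              params={"window_size": 100, "delta": 100})

client.pipeline.add_feature_generator(["Mean", "Standard Deviation"],      # Feature generators
                                      function_defaults={"columns": ["Magnitude_ST_0000"]})

client.pipeline.add_transform('Min Max Scale')                             # Feature transform

client.pipeline.describe()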

The segmentation and feature engineering part of the pipeline involves transforming data streams into feature vectors that are used to train a model (SensiML Knowledge Pack). This is where we get into the more advanced machine learning part of the Analytics Studio. It is okay if you do not understand everything right away; we are going to walk through some examples of good features for the periodic event use case and give you the tools to explore more features.

The features in the feature vector must be integers between 0 and 255. The feature vector can be any length, but in practice you will be limited by the space available on the device.

Adding a Basic Core Function

Next we’re going to add one core function and explain how to work with other core functions.

A core function that is often useful for normalizing data is the magnitude sensor transform. Add a Magnitude sensor transform using the command below:

[ ]:
client.pipeline.add_transform("Magnitude", params={"input_columns": ['GyroscopeX','GyroscopeY', 'GyroscopeZ']})
client.pipeline.describe()

If you want to see specific documentation about any of the Analytics Studio commands, add a ? to the end of the command:

client.pipeline.add_transform?

Exploring Core Functions

The magnitude sensor transform is just one of over 100 core functions that the Analytics Studio provides. To see a list of the available core functions, use the following command:

[ ]:
client.list_functions()

To get the documentation for any of the functions, use the command:

[6]:
client.function_description('Magnitude')

    Computes the magnitude (square sum) of a signal across the input_columns
    streams.

    Args:
        input_columns (list[str]): sensor streams to use in computing the magnitude

    Returns:
        The input DataFrame with an additional column containing the per-sample
        magnitude of the desired input_columns


Inputs
----------
  input_data: DataFrame
  input_columns: list

Usage
----------
For DataFrame inputs, provide a string reference to the
DataFrame output of a previous step in the pipeline.
For Dataframe output, provide a string name that subsequent
operations can refer to.

To get the function parameters, use the following command:

[7]:
client.function_help('Magnitude')
client.pipeline.add_transform("Magnitude", params={"input_columns": <list>,
                                })

Function Snippets

The SensiML Python SDK includes function snippets that will auto-generate the function parameters for you. To use a function snippet, execute the following command:

client.snippets.Sensor_Transform.Magnitude()

To see snippets in action, go ahead and execute the cell below:

[ ]:
client.snippets.Sensor_Transform.Magnitude()

Pipeline Execution

When you execute the pipeline, two results are always returned. Take a look at the next cell: the first variable, magnitude_data, will be the actual data; the second variable, stats, will contain information about the pipeline execution on the server.

[ ]:
magnitude_data, stats = client.pipeline.execute()

Explore the returned magnitude_data using the command below.

[ ]:
magnitude_data.head()

Notice that an additional column Magnitude_ST_0000 has been added to the DataFrame. The suffix indicates that this column comes from a sensor transform (ST) and that it is the first one added (0000). If you were to add another sensor transform, for example taking the magnitude of the accelerometer data as well, you would get another column, Magnitude_ST_0001.
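
For example (not added in this tutorial, which only uses the gyroscope magnitude), the accelerometer magnitude would be added with the same call and would produce that Magnitude_ST_0001 column:

client.pipeline.add_transform("Magnitude", params={"input_columns": ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ']})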

Performing Segmentation

The next step is to segment our data into windows which we can perform recognition on. For periodic events we want to use the Windowing Transform. Go ahead and look at the function description. Delta is the amount the window slides forward each step; setting delta to the same value as the window size means that there is no overlap between segmented windows.

[ ]:
client.pipeline.add_transform("Windowing", params={"window_size": 300,
                                                "delta": 300,})
client.pipeline.describe(show_params=True)

Different window sizes can lead to better models. For this project let’s reduce the window_size and delta to 100. For this data set a window size of 100 corresponds to 1 second, as our data was recorded at 100 Hz. Go ahead and change the values in the Windowing segmenter and re-execute. You will see the parameters of the windowing segmenter change, but a new step shouldn’t be added.

[ ]:
client.pipeline.add_transform("Windowing", params={"window_size": 100,
                                                "delta": 100,})
client.pipeline.describe(show_params=True)
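
If you want overlapping windows instead, set delta to a value smaller than window_size. For example, a window_size of 100 with a delta of 50 would slide the window forward 50 samples at a time, giving 50% overlap between consecutive segments. This is shown only as an illustration; the rest of this tutorial keeps delta equal to window_size.

client.pipeline.add_transform("Windowing", params={"window_size": 100,
                                                   "delta": 50})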

Feature Vector Generation

At this point we are ready to generate a feature vector from our segments. Feature generators are algorithms to extract relevant feature vectors from the data streams in preparation for model building. They range from simple features, such as the mean, to more complex features, such as the Fourier transform.

Feature generators are all added into a single step and run in parallel against the same input data. Let’s add two feature generators now:

[ ]:
client.pipeline.add_feature_generator(["Mean", 'Standard Deviation'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

We have added two feature generators from the subtype Statistical. The more features you generate, the better chance you have of building a successful model. Let’s try adding a few more feature generators of the same subtype. Call client.list_functions() to find more feature generators of this type.

[ ]:
client.pipeline.add_feature_generator(["Mean", 'Standard Deviation', 'Sum', '25th Percentile'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

Our classifiers are optimized for performance and memory usage to fit on resource-constrained devices. Because of this, each feature in the feature vector must be scaled to a single byte, so we need to add the Min Max Scale transform to the pipeline. This function scales the features in the feature vector to values between 0 and 255.

[ ]:
client.pipeline.add_transform('Min Max Scale')
[ ]:
feature_vectors, stats = client.pipeline.execute()
feature_vectors.head()

Naming Convention

The column header represents the name of the feature generator and can be used to identify which feature generator and which inputs were used. The prefix gen tells us that the column was produced by a feature generator, and the number that follows is the index of that feature generator. After that we have the name of the input column Magnitude_ST_0000 combined with the name of the feature generator Mean, i.e. Magnitude_ST_0000Mean.
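
To check the naming convention and the 0-255 scaling for yourself, you can inspect the returned feature_vectors DataFrame directly with standard pandas calls:

# Feature columns should fall between 0 and 255 after the Min Max Scale step
print(feature_vectors.columns.tolist())
print(feature_vectors.describe().loc[['min', 'max']])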

Visualizing Feature Vectors

Next let’s take a look at the feature vectors that you have generated. We plot the average of all feature vectors grouped by label. Ideally, you are looking for feature vectors that are separable in space. How do the ones you’ve generated look?

[ ]:
client.pipeline.visualize_features(feature_vectors)

Training a Model

  • Train Validate Optimize (tvo): This step defines the model validation, the classifier and the training algorithm to build the model with. With SensiML, the model is first trained using the selected training algorithm, then loaded into the hardware simulator and tested using the specified validation method.

This pipeline uses the validation method “Stratified K-Fold Cross-Validation”. This is a standard validation method used to test the performance of a model by splitting the data into k folds, training on k-1 folds and testing against the excluded fold. Then it switches which fold is tested on, and repeats until all of the folds have been used as a test set. The average of the metrics for each model provides a good estimate of how a model trained on the full data set will perform.

The training algorithm attempts to optimize the number of neurons and their locations in order to create the best model. We are using the training algorithm “Hierarchical Clustering with Neuron Optimization,” which uses a clustering algorithm to optimize neuron placement in feature space.

We are using the Pattern Matching Engine (PME) classifier, which has two classification modes, RBF and KNN, and two distance calculation modes, L1 and LSUP. See the documentation for further descriptions of the classifier.

[ ]:
client.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':3,})

client.pipeline.set_classifier('PME', params={"classification_mode":'RBF','distance_mode':'L1'})

client.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization',
                                    params = {'number_of_neurons':5})

client.pipeline.set_tvo({'validation_seed':2})

Go ahead and execute the full pipeline now.

[ ]:
model_results, stats = client.pipeline.execute()

The model_results object returned after a TVO step contains a wealth of information about the models that were generated and their performance. A simple view is to use the summarize function to see the performance of our model.

[ ]:
model_results.summarize()

Let’s grab the fold with the best performing model to compare with our features.

[ ]:
model = model_results.configurations[0].models[0]

The neurons are contained in model.neurons. Plot these over the feature_vector plot that you created earlier. This step is often useful for debugging.

[ ]:
import pandas as pd
client.pipeline.visualize_neuron_array(model, model_results.feature_vectors,
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[-1],
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[0])

Save the best model as a SensiML Knowledge Pack. Models that aren’t saved will be lost if you run the pipeline again.

[ ]:
model.knowledgepack.save('MyFirstModel_KP')

Downloading the Knowledge Pack

This completes the model training portion of the tutorial.

  1. We recommend using the Download Model page within the Analytics Studio at https://app.sensiml.cloud to download the Knowledge Pack model firmware.

  2. Alternatively, see instructions for setting up the Knowledge Pack API at https://sensiml.com/documentation/sensiml-python-sdk/api-methods/knowledge-packs.html