Overview

This tutorial is a continuation of the Getting Started Guide. You can find this and other notebook tutorials for the Analytics Studio Notebook in our GitHub repository.

It is suggested you read the Getting Started Guide before going through this tutorial.

This tutorial is good for two scenarios:

  • You are experienced with machine learning and want to create your own Knowledge Pack with customized algorithms

  • You already generated a Knowledge Pack using the Analytics Studio and want to find out how you can tweak the underlying features of the Knowledge Pack even further

*Prerequisites:* You should have already uploaded the Quick Start project named Slide Demo through the Data Capture Lab (DCL).

The goal of this tutorial is to give insight into the more advanced features in building a custom algorithm for a Knowledge Pack.

There are three main steps to building a SensiML Knowledge Pack:

- Query your data
- Transform the data into a feature vector
- Build the model to fit on the sensor device

Jupyter Notebooks

The Analytics Studio is a tool based on Jupyter notebooks. If you have not used Jupyter notebooks before, the following keyboard shortcuts will be useful.

  • Execute a cell - Shift + Enter

  • Auto-complete - Press Tab at any time while typing a function/command and the Analytics Studio will show you all available options

Loading your project

First you need to load the project you created through the Data Capture Lab. In our example it is called ‘Slide Demo’.

[15]:
import sys
%matplotlib inline

from sensiml import SensiML
from sensiml.widgets import *

dsk = SensiML()
[16]:
dsk.project = 'Slide Demo'

The next step is to initialize a pipeline space to work in. A pipeline includes each step that you perform on the data to build a SensiML Knowledge Pack. The work you do in the pipeline will be stored in SensiML Cloud so that you can share pipelines with team members and come back to stored work in the future. Add a pipeline to the project using the following code snippet.

dsk.pipeline = "Name of your pipeline"
[17]:
dsk.pipeline = "Slide Demo Pipeline"

Query your data

To select all of the data you labeled through the Data Capture Lab you need to add a query step to your pipeline.

We provided a query widget to make this step easier. To load the query widget, use the command below:

[ ]:
QueryWidget(dsk).create_widget()

Use the query widget to enter the following parameters:

- Query Name: My Query
- Segmenter: Manual
- Label: Label
- Metadata: Subject
- Sources: (Hold shift and select all)

Once you are done, click Add Query.

Building a pipeline

Throughout this notebook we will add multiple steps to transform the data in a pipeline.

Note: No work is done on the data until you execute the pipeline, i.e., dsk.pipeline.execute()

The main steps of a pipeline include:

- Query
- Feature Engineering
- Model Generation

It is important that you add the steps in the right order. If you accidentally add them in the wrong order or want to restart, simply enter the command:

dsk.pipeline.reset()

Let’s add the query step that you created above. Use the command below:

[ ]:
dsk.pipeline.reset()
dsk.pipeline.set_input_query('My Query')

To see the current steps in your pipeline you can enter the command:

[ ]:
dsk.pipeline.describe()

SensiML Core Functions

The Analytics Studio provides a way to define a pipeline for feature vector and model building. The feature vector generation part of the pipeline includes over 100 core functions that can be split into a few different types:

  • Sensor transforms - these are applied directly to the data as it comes off the sensor; they can be smoothing functions, the magnitude of sensor columns, etc.

  • Segmentation - the segmenter selects regions of interest from the streaming data. This can be an event if you are using an event detection segmenter, or simply a sliding window which buffers a segment of data and sends it to the next step.

  • Segment transforms - these operate on a segment of data, typically normalizing it in some way (such as demeaning) to prepare for feature vector generation.

  • Feature generators - Algorithms to extract relevant feature vectors from the data streams in preparation for model building.

  • Feature transforms - these normalize all of the features in the feature vector to values between 0 and 255.

  • Feature selectors - These functions remove features which do not help discriminate between different classes.

The Analytics Studio allows you to string together a pipeline composed of these individual steps. The pipeline is sent to our servers where we can take advantage of optimizations to speed up the pipeline processing.

The segmentation and feature engineering part of the pipeline involves transforming data streams into a feature vector that is used to train a model (SensiML Knowledge Pack). This is where we get into the more advanced machine learning part of the Analytics Studio. It is okay if you do not understand everything right away; we are going to walk through some examples of good features for the periodic event use case and give you the tools to explore more features.

The features in the feature vector must be integers between 0 and 255. The feature vector can be any length, but in practice you will be limited by the space on the device.
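To make the pipeline flow concrete, here is a minimal sketch of how these step types might be strung together for the Slide Demo data. It only uses calls that appear later in this tutorial, so treat it as a reference rather than a cell to run; each step is explained one at a time in the sections that follow.

# Overview sketch only - the tutorial builds this pipeline step by step below
dsk.pipeline.reset()
dsk.pipeline.set_input_query('My Query')                            # Query
dsk.pipeline.add_transform("Magnitude",                             # Sensor transform
                           params={"input_columns": ['GyroscopeX', 'GyroscopeY', 'GyroscopeZ']})
dsk.pipeline.add_transform("Windowing",                             # Segmentation
                           params={"window_size": 200, "delta": 200})
dsk.pipeline.add_feature_generator(["Mean", 'Standard Deviation'],  # Feature generators
                                   function_defaults={"columns": ['Magnitude_ST_0000']})
dsk.pipeline.add_transform('Min Max Scale')                         # Feature transform (scale to 0-255)
dsk.pipeline.describe()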

Adding a basic core function

Next we’re going to add one core function and explain how to work with other core functions.

A core function that is often useful for normalizing data is the magnitude sensor transform. Add a Magnitude sensor transform using the command below:

[ ]:
dsk.pipeline.add_transform("Magnitude", params={"input_columns": ['GyroscopeX','GyroscopeY', 'GyroscopeZ']})
dsk.pipeline.describe()

If you want to see specific documentation about any of the Analytics Studio commands, add a ? to the end of the command.

[ ]:
dsk.pipeline.add_transform?

Exploring core functions

The magnitude sensor transform is just one of over 100 core functions that the Analytics Studio provides. To see a list of the available core functions, use the following command:

[ ]:
dsk.list_functions()

To get the documentation for any of the functions, use the command:

[6]:
dsk.function_description('Magnitude')

    Computes the magnitude (square sum) of a signal across the input_columns
    streams.

    Args:
        input_columns (list[str]): sensor streams to use in computing the magnitude

    Returns:
        The input DataFrame with an additional column containing the per-sample
        magnitude of the desired input_columns


Inputs
----------
  input_data: DataFrame
  input_columns: list

Usage
----------
For DataFrame inputs, provide a string reference to the
DataFrame output of a previous step in the pipeline.
For Dataframe output, provide a string name that subsequent
operations can refer to.

To get the function parameters, use the following command:

[7]:
dsk.function_help('Magnitude')
dsk.pipeline.add_transform("Magnitude", params={"input_columns": <list>,
                                })

Function snippets

The Analytics Studio also includes function snippets that will auto-generate the function parameters for you. To use a snippet, execute the following command:

dsk.snippets.Transform.Magnitude()

To see snippets in action, go ahead and execute the cell below:

[ ]:
dsk.pipeline.add_transform("Magnitude", params={"input_columns": <list>,
                                })

Pipeline Execution

When executing the pipeline, there will always be two results returned. Take a look at the next cell. The first variable magnitude_data will be the actual data. The second variable stats will contain information about the pipeline execution on the server.

[ ]:
magnitude_data, stats = dsk.pipeline.execute()

Explore the returned magnitude_data using the command below.

[ ]:
magnitude_data.head()

Notice that an additional column Magnitude_ST_0000 is added to the dataframe. The subscript indicates that this is a sensor transform (ST) and that it is the first one added (0000). If you were to add another sensor transform, for example taking the magnitude of the accelerometer data as well, you would get another column, Magnitude_ST_0001.
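For example, if your project also captured accelerometer data, a second magnitude transform might look like the snippet below. The accelerometer column names here are assumptions, so check the sensor columns in your own project; this is illustrative only and not part of this tutorial's pipeline.

# Illustrative only: a second magnitude transform would produce Magnitude_ST_0001
dsk.pipeline.add_transform("Magnitude", params={"input_columns": ['AccelerometerX', 'AccelerometerY', 'AccelerometerZ']})
dsk.pipeline.describe()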

Performing Segmentation

The next step is to segment our data into windows which we can perform recognition on. For periodic events we want to use the Windowing Transform. Go ahead and look at the function description. Delta is the sliding window overlap. Setting delta to the same value as the window size means that there is no overlap in our segmented windows.

[ ]:
dsk.pipeline.add_transform("Windowing", params={"window_size": 300,
                                                "delta": 300,})
dsk.pipeline.describe(show_params=True)
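As an aside, setting delta smaller than the window size produces overlapping segments. For instance, the illustrative settings below (not part of this tutorial's pipeline) would give 50% overlap between consecutive 300-sample windows:

# Illustrative only: delta of 150 with window_size of 300 gives 50% overlap
dsk.pipeline.add_transform("Windowing", params={"window_size": 300,
                                                "delta": 150,})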

Different window sizes can lead to better models. For this project, let’s reduce the window_size and delta to 200. For this data set, a window size of 200 corresponds to 2 seconds, as our data was recorded at 100 Hz. Go ahead and change the values in the Windowing segmenter and re-execute. You will see the parameters for the windowing segmenter change, but a new step shouldn’t be added.

[ ]:
dsk.pipeline.add_transform("Windowing", params={"window_size": 200,
                                                "delta": 200,})
dsk.pipeline.describe(show_params=True)

Feature Vector Generation

At this point we are ready to generate a feature vector from our segments. Feature generators are algorithms that extract relevant feature vectors from the data streams in preparation for model building. They can range from simple features, such as the mean, to more complex features, such as the Fourier transform.

Feature generators are all added into a single step and run in parallel against the same input data. Let’s add two feature generators now:

[ ]:
dsk.pipeline.add_feature_generator(["Mean", 'Standard Deviation'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

We have added two feature generators from the subtype Statistical. The more features you have, the better chance you have of building a successful model. Let’s try adding a few more feature generators of the same subtype. Call dsk.list_functions() to find more feature generators of the same type.

[ ]:
dsk.pipeline.add_feature_generator(["Mean", 'Standard Deviation', 'Sum', '25th Percentile'],
                                   function_defaults = {"columns":[u'Magnitude_ST_0000']})

Our classifiers are optimized for performance and memory usage to fit on resource-constrained devices. Because of this, we scale the features in the feature vector to a single byte each, so we need to add the Min Max Scale transform to the pipeline. This function will scale the features in the feature vector to have values between 0 and 255.

[ ]:
dsk.pipeline.add_transform('Min Max Scale')
[ ]:
feature_vectors, stats = dsk.pipeline.execute()
feature_vectors.head()

Naming Convention

The column header represents the name of the feature generator and can be used to identify which feature generator and which inputs were used. The suffix gen lets us know that this was a feature generator. The number that follows is the index of the feature generator. After that we have the name of the input column Magnitude_ST_0000 combined with the name of the feature generator Mean, i.e., Magnitude_ST_0000Mean.
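A quick way to see this naming convention in your own results is to list the column names of the returned feature vectors:

# Inspect the generated feature vector column names
feature_vectors.columns.tolist()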

Visualizing Feature Vectors

Next let’s take a look at the feature vectors that you have generated. We plot the average of all feature vectors grouped by Activity. Ideally, you are looking for feature vectors that are separable in space. How do the ones you’ve generated look?

[ ]:
dsk.pipeline.visualize_features(feature_vectors)

Training a model

  • Train Validate Optimize (tvo): This step defines the model validation, the classifier, and the training algorithm used to build the model. With SensiML, the model is first trained using the selected training algorithm, then loaded into the hardware simulator and tested using the specified validation method.

This pipeline uses the validation method “Stratified K-Fold Cross-Validation”. This is a standard validation method used to test the performance of a model by splitting the data into k folds, training on k-1 folds, and testing against the excluded fold. It then switches which fold is tested on, and repeats until all of the folds have been used as a test set. The average of the metrics for each model provides you with a good estimate of how a model trained on the full data set will perform.

The training algorithm attempts to optimize the number of neurons and their locations in order to create the best model. We are using the training algorithm “Hierarchical Clustering with Neuron Optimization,” which uses a clustering algorithm to optimize neuron placement in feature space.

We are using the Pattern Matching Engine (PME) classifier, which has two classification modes, RBF and KNN, and two distance calculation modes, L1 and LSUP. See the documentation for further descriptions of the classifier.

[ ]:
dsk.pipeline.set_validation_method('Stratified K-Fold Cross-Validation', params={'number_of_folds':3,})

dsk.pipeline.set_classifier('PME', params={"classification_mode":'RBF','distance_mode':'L1'})

dsk.pipeline.set_training_algorithm('Hierarchical Clustering with Neuron Optimization',
                                    params = {'number_of_neurons':5})

dsk.pipeline.set_tvo({'validation_seed':2})
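The cell above configures RBF classification with L1 distance. If you want to experiment with the other modes mentioned earlier, you can call set_classifier again with different parameters before re-executing the pipeline, for example:

# Illustrative alternative: KNN classification mode with LSUP distance
dsk.pipeline.set_classifier('PME', params={"classification_mode":'KNN','distance_mode':'LSUP'})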

Go ahead and execute the full pipeline now.

[ ]:
model_results, stats = dsk.pipeline.execute()

The model_results object returned after a TVO step contains a wealth of information about the models that were generated and their performance. A simple view is to use the summarize function to see the performance of our model.

[ ]:
model_results.summarize()

Let’s grab the fold with the best performing model to compare with our features.

[ ]:
model = model_results.configurations[0].models[0]

The neurons are contained in model.neurons. Plot these over the feature_vector plot that you created earlier. This step is often useful for debugging.

[ ]:
import pandas as pd
dsk.pipeline.visualize_neuron_array(model, model_results.feature_vectors,
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[-1],
                                   pd.DataFrame(model.knowledgepack.feature_summary).Feature.values[0])

Go ahead and save the best model as a SensiML Knowledge Pack. Models that aren’t saved will be lost when the cache is emptied.

[ ]:
model.knowledgepack.save('MyFirstModel_KP')

Generate Knowledge Pack

The most important objective of the Analytics Studio is to allow users to instantly turn their models into downloadable Knowledge Packs that can be flashed to devices to perform the classification tasks.

Let’s generate our Knowledge Pack. We have saved the Knowledge Pack with the name MyFirstModel_KP. Select it in the widget below. Then select your target platform.

DownloadWidget(dsk).create_widget()

Make sure to generate your Knowledge Pack with the same sample rate that you recorded your raw sensor data with, or else you may get unexpected results. For our Slide Demo this should be 100 or 104, depending on whether you are using QuickAI or Nordic Thingy.

Set the following properties:

- HW Platform: Nordic Thingy 2.1 or QuickAI S3 Merced <version>
- Target OS: NordicSDK (Nordic Thingy) or FreeRTOS (QuickAI)
- Format: Binary
- Sample Rate: 100 (Nordic Thingy) or 104 (QuickAI)
- Debug: False
- Test Data: None
- Output: BLE and Serial

To find out more about these properties, check out the Quick Start Guide (Generating a Knowledge Pack).

Flashing a Knowledge Pack

Now that you’ve generated a Knowledge Pack, you just need to flash it to a device! You can find the guide for flashing a Knowledge Pack at How to Flash a Knowledge Pack.

[ ]:
FlashWidget(dsk).create_widget()

Model Validation

Now that you’ve flashed your Knowledge Pack to a device let’s check out the results! The easiest way to see the live event classification results of a Knowledge Pack running on your sensor is over BLE through the SensiML TestApp (PC or Android). Open the TestApp and connect to your device to see the output.

You can also connect your device to your PC and view the output over a serial connection. See here for serial instructions.