Over the past six months, the SensiML team has been busy adding new features that make it easier than ever to create, manage, label, and visualize AI/ML datasets for the IoT edge (aka TinyML models). Our goal is to provide tooling and workflow aids that address ALL the steps that users must undertake to create production-quality TinyML sensor models. One aspect of this is the AutoML engine that is used to convert labeled training data into efficient predictive executable code that fits the target endpoint. But the topic of this article is another equally important part of the process – the creation and management of the train/test dataset used to generate said AutoML model.
Most any TinyML-focused tool suite can produce a model that works for a narrowly defined demo or experiment, but such models often fall apart when the initial proof of concept advances toward a commercially viable algorithm for products that will be sold and supported. Why is this? More often than not, it’s because the dataset used to build the model is too simplistic and/or contains errors or inconsistencies that degrade performance. These failures are not a shortcoming of the AutoML stage used for model generation, but rather the result of dataset issues caused by data management features that are too rudimentary in most TinyML dev tools.
Production-grade ML models must utilize sufficiently large datasets to address real-world factors and corner cases that go well beyond the ‘happy-path’ scenarios typically seen in proof-of-concept datasets. Achieving this requires careful and thorough preparation of high-quality train/test datasets and tools that provide a productive means for scaling these datasets.
In fact, the time typically required for machine learning projects skews heavily (around 80%) toward dataset development and NOT modeling. If developers must spend the majority of their ML project time creating and refining datasets, it stands to reason that this is where ML tool vendors should be investing their efforts as well: automating the work and improving the quality of this vital underlying train/test data.
To this end, SensiML uniquely offers its powerful dataset management tool for time-series sensor data, Data Capture Lab (DCL), to address these very tasks. We have added and improved numerous DCL features that users may not even realize are now available. What follows is a list of the top five new TinyML dataset management enhancements now included in the latest version of the SensiML Data Capture Lab application.
1) Spectrographic data visualization
Are you working with frequency-based datasets such as acoustic samples or perhaps vibration sensor data? If so, you’ll be happy to learn that the SensiML Data Capture Lab now offers spectrogram plots in addition to the usual time-series strip charts. Spectrogram plots, with their characteristic time-frequency image representations of signals, provide a more meaningful view of sounds and other frequency-based signals. Such plots also more closely represent the Fourier transform and Mel-frequency cepstrum pre-processing feature transforms commonly used for ML audio processing and recognition tasks.
Starting with Data Capture Lab v2022.7.0.0, SensiML’s data acquisition and labeling application now provides the option to display these more insightful heat maps with numerous settings to control dB color range, frequency range, overlays, and segment highlights.
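For readers who want a feel for what goes into such a plot, here is a minimal, illustrative Python sketch (using NumPy, SciPy, and Matplotlib, not DCL’s internal implementation) that computes and displays a spectrogram heat map for a synthetic two-tone signal; the sample rate and FFT settings are arbitrary assumptions.

```python
# Illustrative only: compute and plot a spectrogram of a synthetic signal,
# similar in spirit to the time-frequency view DCL now displays.
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt

fs = 16000                                  # assumed sample rate in Hz
t = np.arange(0, 2.0, 1 / fs)               # 2 seconds of data
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)

f, tt, Sxx = signal.spectrogram(x, fs=fs, nperseg=512, noverlap=256)

plt.pcolormesh(tt, f, 10 * np.log10(Sxx + 1e-12))  # heat map in dB
plt.ylabel("Frequency [Hz]")
plt.xlabel("Time [s]")
plt.colorbar(label="Power [dB]")
plt.show()
```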
2) Enhanced Project Explorer summary view
By the time you have collected enough labeled training data to begin building usable ML models, you will no doubt have many dozens, hundreds, or perhaps even thousands of labeled segments spread across dozens to hundreds of data files. As a project dataset grows to include an appreciable number of raw sensor data files and labeled segments, it can quickly get confusing as to what has already been labeled, what is left to do, and which files in your project contain which labels. Without good organization and workflow aids, the opportunities for mistakes compound. Making matters worse, such mistakes may not be readily apparent; often the only clear symptom is a poor-performing model built on inconsistent or mislabeled data. In the end, these faults can lead to condemnation of the ML project overall without a good understanding that they actually originated in the underlying dataset.
Fortunately, these problems can be overcome upfront with new SensiML DCL features supporting good dataset discipline from the outset. One such feature is the enhanced summary view in the DCL Project Explorer that makes readily apparent where a project dataset is in the labeling process. Let’s explore this using a pet activity tracking application dataset example.
In the accompanying screenshot, we have just completed labeling an activity recognition dataset that uses a sensor-equipped dog collar to identify a variety of activities of interest to a dog owner such as running, walking, eating, drinking, and jumping.
Naturally, to build a viable commercial product it is not sufficient to have done this for just a single dog or even a single breed of dog. Rather, we need to collect data across a broad range of dogs of various breeds, ages, and sizes. Only then will we be confident that we have a universally valid model that supports marketing a smart dog collar that works for any dog. Thus we create a project dataset inclusive of 300 data collections across nearly 80 breeds of dogs and set out to accurately label these capture files accordingly.
The effort, given the volume of data involved, could be a dedicated project assigned to one individual juggling all the needed files, but more likely it involves multiple people tasked with data collection in the field as well as with dataset labeling. Data Capture Lab’s Project Explorer view provides new features for understanding at a glance where a project stands in the collection and labeling effort, what data exists, and what data is left to complete.
Let’s have a look at the partially completed state of this dog activity project:
In the video clip above, you’ll note there are columns in the Project Explorer view that show how many segments have been created for each of the capture files in the project as well as the class label distribution (the column with the colored boxes) amongst those segments. These summary data make it easy to understand which raw sensor data files have actually been labeled and which remain to be worked on. By clicking the mouse on the ‘Segments’ column title, we can sort in ascending/descending order to quickly isolate the files that contain zero segments. We also gain a rapid understanding of how the various class labels are distributed across the files, which files have been synchronized to the cloud (so that other team members can edit/view), and which files have associated annotation videos.
The means for quickly understanding our dataset within DCL don’t stop there. In addition to the default columns shown in the Project Explorer table, we can also select and add many additional custom columns, such as project-specific metadata we might have defined. In the case of the dog activity project, it was important to collect information on each dog’s breed and other physical characteristics as well. These might ultimately be used to create model families or to isolate certain subsets of data for subsequent model analysis. We can expose these metadata fields easily in the Project Explorer and use them to understand what subsets of data exist within the project at any given time. In the video clip below, we’ll add the dog’s age (in months), breed, and weight as visible metadata columns in the Project Explorer.
3) Bulk segmentation review and editing
Another often tedious and time-consuming aspect of ML dataset development is data cleansing. We often think about the sensor signal data itself when it comes to cleansing tasks, but inconsistencies in labels, metadata, and segment placements are just as important. Since this information describes the ground truth used for training, it’s vitally important that it be accurate and consistent. DCL’s Project Explorer makes quick work of spotting and correcting such errors.
In the dog activity example below, we see a common labeling and metadata error where multiple metadata values exist for what should be a single label: Breed = “Jack Russell Terrier” versus “Jack russel”. By simply highlighting the affected range of data files and editing the metadata across them, we can resolve the labeling error easily.
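DCL handles this entirely through the GUI, but the underlying idea is simple enough to sketch in Python. The snippet below works on a hypothetical exported metadata table with made-up column names (not DCL’s actual schema) and shows the same normalization concept: mapping inconsistent spellings onto a single canonical value.

```python
# Sketch of the cleanup idea on a hypothetical exported metadata table
# (column names and file names are assumptions for illustration only).
import pandas as pd

df = pd.DataFrame({
    "capture_file": ["dog_017.csv", "dog_042.csv", "dog_088.csv"],
    "breed": ["Jack Russell Terrier", "Jack russel", "Jack Russell Terrier"],
})

# Map inconsistent spellings onto a single canonical label.
canonical = {"Jack russel": "Jack Russell Terrier"}
df["breed"] = df["breed"].replace(canonical)

print(df["breed"].value_counts())  # all rows now share one breed label
```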
Inconsistencies can frequently occur in the definition of segment regions as well, particularly when manual segmentation is utilized to define the sample ranges of interest for various events. Ideally, the model is constructed with segments of consistent size and relative placement when referenced to signal features.
To illustrate how we can quickly edit segments for consistency, let’s look at a robotic movement recognition project. In this example, we have captured the motion profile of a robotic arm’s end joint using a 6-axis inertial sensor and then labeled the various movement stages of a complete program cycle for multiple iterations.
Checking like segments (we’ll focus on the “Picking-Scanning” segments), we can see that their lengths vary slightly, and we wish to make them all identical in size.
We can do so easily by sorting the project in the Project Explorer screen by Label Distribution and selecting all of the files containing the “Picking-Scanning” label. Right-clicking on this set of files, we can select Segments… Edit.
Once in the Edit Segments screen, we can again sort by label by clicking that column title, select all of the “Picking-Scanning” segments, and right-click to choose Adjust Length.
We will then set these segments to all be 265 samples in length centered about the current segment position. We could have also chosen to anchor the revised segments to the start or the end of the existing segments.
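For the curious, the arithmetic behind such an adjust-length operation is straightforward. The sketch below is a hypothetical illustration (not DCL code) that assumes each segment is a pair of sample indices and resizes it to a target length anchored at its center, start, or end.

```python
# Minimal sketch of the "adjust length" arithmetic, assuming each segment
# is a (start, end) pair of sample indices.
def adjust_length(start: int, end: int, target: int, anchor: str = "center"):
    if anchor == "center":
        mid = (start + end) // 2
        new_start = mid - target // 2
    elif anchor == "start":
        new_start = start
    elif anchor == "end":
        new_start = end - target
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    new_start = max(new_start, 0)           # keep the segment in bounds
    return new_start, new_start + target

segments = [(1000, 1248), (2100, 2370), (3500, 3761)]   # slightly varying lengths
equalized = [adjust_length(s, e, target=265) for s, e in segments]
print(equalized)  # every segment is now exactly 265 samples long
```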
And with that, all of our segments for the “Picking-Scanning” label have been equalized to the same length, which improves the eventual model quality in subsequent AutoML efforts. There are many more aspects of bulk segment reviewing and editing possible than we have room to describe here. This summary shows just a couple of examples of how the batch project file editing capability can be used to rapidly cleanse your train/test dataset.
4) Streaming Knowledge Pack model evaluation
When you have built your TinyML model, the time comes to test it and understand how well it works. Any ML tool worth its salt (SensiML included) will provide a basic pass/fail assessment from running set-aside test data against the model. Typically such results are presented as a confusion matrix. This can tell you whether the model is performing well and generalizing to novel data, and it can even highlight specific misclassification pairs.
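As a point of reference, such a tabulation can be produced in a few lines with generic tooling; the sketch below uses scikit-learn with made-up class names and predictions purely to illustrate what a confusion matrix captures.

```python
# Generic illustration of a confusion matrix on hold-out test labels
# (class names, labels, and predictions here are invented for the example).
from sklearn.metrics import confusion_matrix, classification_report

labels = ["Off", "On", "Blade Interference"]
y_true = ["Off", "Off", "On", "On", "On", "Blade Interference", "Blade Interference"]
y_pred = ["Off", "On",  "On", "On", "On", "Blade Interference", "On"]

print(confusion_matrix(y_true, y_pred, labels=labels))
print(classification_report(y_true, y_pred, labels=labels))
```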
But when performance issues arise, you often need more insight into WHY results are not as you expect. Ideally, you could delve into the model itself and gain an intuitive visualization of how the model classifications respond to the source signals driving them. Fortunately, SensiML now delivers a powerful new real-time model evaluation tool that does just that.
The video above shows a real-time classification result as delivered by the ML model superimposed as colored regions on the underlying raw sensor data as it is streamed from the sensor device itself. In this instance, we are analyzing vibration data from a fan motor to detect various states of the fan like ON, OFF, Blade Interference, Blocked Flow, and Mount Shocks. After starting the device data streaming, we simply select from any of the Knowledge Pack ML models that we have created within the toolkit and evaluate that model’s performance in real-time.
In our example video, we can see the model is generally performing quite well under steady-state conditions, but in the transitions between on/off and off/on it tends to misclassify briefly. This suggests we need to do one of the following with our dataset if we seek to eliminate these transient errors altogether:
- Include more labeled transitional segments
- Utilize a form of low-pass filter post-processing on the model output (see the sketch after this list)
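On the second point, one common and simple form of output “low-pass filtering” is a sliding majority vote over the model’s classification stream. The sketch below is a generic illustration of that idea, not DCL’s or the Knowledge Pack’s built-in post-processing.

```python
# Minimal sketch of one common post-processing approach: a sliding
# majority vote that suppresses brief, spurious class flips.
from collections import Counter, deque

def smooth_predictions(preds, window=5):
    """Yield the majority class over the last `window` predictions."""
    recent = deque(maxlen=window)
    for p in preds:
        recent.append(p)
        yield Counter(recent).most_common(1)[0][0]

raw = ["Off", "Off", "On", "Off", "On", "On", "On", "On"]  # brief flicker at start-up
print(list(smooth_predictions(raw, window=3)))
```

A window of just a few classifications is usually enough to suppress single-result flips without adding much decision latency.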
The real-time streaming model evaluation feature in Data Capture Lab makes it easy to identify and understand these issues in a way that would otherwise be obscured in summarized confusion matrix results tabulations.
5) Simultaneous sensor data acquisition and labeling
Where possible, best practice for ML dataset collection calls for isolating the various classes and capturing data representing each class separately. Using the example of the dog activity tracking collar, we would ideally like to capture an entire data file of a given dog as it is running, followed by a separate file of the same dog walking, then sitting, then drinking, and so on until all of the desired activity classes have been captured in distinct data files.
But any dog-owning reader will be quick to point out that you can rarely get a dog to cooperate with your best data collection intentions. Real dogs, even well-trained ones, are difficult to control, and it’s more likely you will end up with capture files that contain a mix of activities. And this challenge isn’t unique to dogs. In fact, many physical processes present similar challenges where it is either impractical or impossible to isolate the event classes of interest or to reproduce them on demand.
To address such cases, one can do one of two things:
- Collect data covering a span of time along with ‘ground-truth’ annotation data like a synchronized video feed. With good annotation, it is then possible to confidently go back and label individual segments based on observed results and ground truth measures that are synchronized to the raw sensor data.
- Use a new feature of SensiML Data Capture Lab known as Live Labeling mode.
Live Labeling mode works similarly to normal data capture with the added capability of interactively adding labeled segments to your sensor streaming data as you acquire it. For use cases where multiple classes must be captured within a given data file, this mechanism provides a rapid means of labeling the dataset in real-time rather than doing so after the fact. A labeling panel is provided with buttons to select which label mode is active at any one time.
By clicking on different label buttons over the course of the data capture session, the labels can be laid down quickly and then edited thereafter in the Label Explorer mode as usual.
Summing It All Up
For those with experience in machine learning, our admonitions about the importance of dataset quantity and quality will likely not come as a revelation. A Google search on machine learning tutorials yields any number of articles referencing Peter Norvig’s adage “More data beats clever algorithms, but better data beats more data.”
But hopefully what you’ve gained from this blog is that great tools exist to help simplify your efforts to develop and manage high-quality TinyML datasets. We hope you’ll try these new features of SensiML Data Capture Lab and agree. Drop us a line at info@sensiml.com and share your experiences with TinyML dataset management whether using our tools or otherwise. We’d love to hear from you.