Part 4: Profiling the performance of the AI-accelerated EFR32MG24 model
In part 1 of this series, we discussed the value of on-device sensor analytics for audio-aware IoT applications. We introduced Silicon Labs’ MG24 and BG24 AI-Accelerated SoCs, which further improve the ability of the SensiML toolkit to execute power-efficient, autonomous smart acoustic sensing applications. In part 2, we explained how to collect and annotate ML train/test audio data for such applications. In part 3, we discussed the process of extracting features and generating specific pipelines to train and evaluate audio models with the SensiML Python SDK. We then built an audio classification model using a convolutional neural network (CNN) implemented in TensorFlow and imported it into the SensiML classification pipeline. Since the Silicon Labs MG24 SoC includes a hardware AI accelerator and an SDK library for optimizing TensorFlow, we then created a Knowledge Pack inference model that takes advantage of this platform’s AI acceleration. In part 4 of our series, we examine the performance of the AI-accelerated IoT device while it executes various Knowledge Packs in real time. For this purpose, we generated multiple audio classification models with different levels of complexity.
Silicon Labs’ MG24 and BG24 SoCs benefit from an AI acceleration unit that expedites the matrix multiplications a neural network requires to infer the final classification. To assess the effect of the AI acceleration, we measure the number of cycles the microcontroller unit (MCU) takes to run the model with and without the AI accelerator. To make the results easier to interpret, we convert the number of clock cycles needed to run the inference model into a latency value using the nominal 78 MHz clock speed of the SoC. SensiML’s edge algorithm can be broken down into two major steps to classify the raw time-series sensor inputs, and in our assessment we measure how much time the device spends on each step.
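The cycles-to-latency conversion described above can be sketched as follows. This is a minimal illustration assuming the nominal 78 MHz clock; the cycle counts in the example are hypothetical placeholders, not measured values.

```python
# Convert measured MCU cycle counts into latency, assuming the
# MG24's nominal 78 MHz core clock. The cycle count passed in the
# example below is a hypothetical placeholder, not a measurement.
CLOCK_HZ = 78_000_000

def cycles_to_latency_ms(cycles: int, clock_hz: int = CLOCK_HZ) -> float:
    """Return the wall-clock latency in milliseconds for a cycle count."""
    return cycles / clock_hz * 1e3

# Example: 1,560,000 cycles at 78 MHz corresponds to 20 ms.
print(cycles_to_latency_ms(1_560_000))  # → 20.0
```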
The first step involves collecting the audio data at a 16 kHz sample rate and performing the pre-processing necessary to extract the input feature vector for classification. For the smart lock application, feature vectors are generated from 400 audio samples (25 msec) as they are being captured. The feature utilized is the Mel-frequency cepstral coefficient (MFCC), common for audio event detection. According to our measurements, extracting 20 MFCC features from 400 audio samples takes on average 2 msec of MCU execution time. This step executes fully in software on the MCU and is therefore not accelerated. That said, the MFCC calculations still benefit from Arm’s CMSIS-DSP library of optimized signal processing functions, which SensiML invokes on Cortex-M devices like the MG24. A detailed description of SensiML’s feature generation pipeline is available here.
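The frame timing above reduces to simple arithmetic, shown here as a small sketch using only the figures stated in the text (400 samples per frame, 16 kHz sample rate, 20 MFCCs, ~2 msec extraction time):

```python
# Frame/feature bookkeeping for the audio front end described above:
# 400 samples per frame at a 16 kHz sample rate, 20 MFCCs per frame.
SAMPLE_RATE_HZ = 16_000
FRAME_SAMPLES = 400
N_MFCC = 20

frame_ms = FRAME_SAMPLES / SAMPLE_RATE_HZ * 1e3
print(frame_ms)  # → 25.0 (one feature vector every 25 msec)

# The ~2 msec of measured MFCC extraction time therefore consumes
# about 8% of the 25 msec real-time budget for each frame.
mfcc_budget_fraction = 2.0 / frame_ms
print(mfcc_budget_fraction)  # → 0.08
```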
The second step is to feed the feature vector to the quantized CNN, which is the core of this classification model. The inference can be made either entirely on the MCU or with the aid of the MG24 AI acceleration unit. Our model combines multiple successive feature vectors, each holding 20 MFCCs, to generate an inference. For instance, a model with a feature cascade size of 10 requires 10×20 MFCC elements to perform classification.
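The cascading step can be pictured as a fixed-length sliding buffer of per-frame MFCC vectors. The sketch below is a hypothetical illustration of that buffering (it is not SensiML's actual implementation), using a cascade size of 10 and 20 MFCCs per frame as in the example above:

```python
from collections import deque

# Hypothetical sketch of the feature-cascade buffering step: each new
# 20-element MFCC vector is appended to a fixed-length window, and the
# CNN input is produced only once the cascade is full.
CASCADE_SIZE = 10
N_MFCC = 20

cascade = deque(maxlen=CASCADE_SIZE)

def push_features(mfcc_vector):
    """Add one frame's MFCCs; return the flattened 10x20 input when ready."""
    assert len(mfcc_vector) == N_MFCC
    cascade.append(mfcc_vector)
    if len(cascade) == CASCADE_SIZE:
        # Flatten to the 200-element input feature vector the CNN expects.
        return [x for frame in cascade for x in frame]
    return None

# Feed 10 dummy frames; only the 10th yields a 200-element input vector.
out = None
for i in range(CASCADE_SIZE):
    out = push_features([float(i)] * N_MFCC)
print(len(out))  # → 200
```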
Model complexity is conventionally defined by counting the number of floating-point arithmetic operations required to run the model. Here, we take a simpler path and quantify complexity by considering only the section of the model that can potentially be executed on the accelerator unit. Hence, in this analysis, we measure model complexity as a combination of the input feature size and the number of parameters required to describe the model's CNN.
We consider three categories of cascade size to study the effect of the CNN input size on classification time. In addition, we feed each category of inputs into CNNs of different complexities. We keep the architecture of the convolutional layers the same for all of our models and alter the complexity only by tuning the number and size of the fully connected layers.
As illustrated in the following plot, we always run the input vector through four sets of convolutional filters, each of size 2×2 pixels. The flattened output of the convolutional layers is then passed through a conventional fully connected network with various architectures. Within each category of feature size, complexity is defined as the number of CNN parameters.
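Counting CNN parameters is straightforward arithmetic. The sketch below illustrates the bookkeeping for a hypothetical model of the shape described above (four 2×2 convolution stages followed by a fully connected head); the channel counts and dense-layer widths are illustrative assumptions, not the actual SensiML-generated architecture:

```python
# Parameter counting for a hypothetical model shaped like the one
# described above: four 2x2 convolution stages, then dense layers.
# Channel counts and dense widths are illustrative assumptions only.

def conv2d_params(in_ch: int, out_ch: int, kh: int = 2, kw: int = 2) -> int:
    """Weights (kh*kw*in_ch per output channel) plus one bias each."""
    return kh * kw * in_ch * out_ch + out_ch

def dense_params(n_in: int, n_out: int) -> int:
    """Fully connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out

# Four 2x2 conv stages, e.g. 1 -> 8 -> 16 -> 32 -> 64 channels.
conv_total = (conv2d_params(1, 8) + conv2d_params(8, 16)
              + conv2d_params(16, 32) + conv2d_params(32, 64))

# A fully connected head, e.g. a flattened 640 -> 128 -> 4 classes.
dense_total = dense_params(640, 128) + dense_params(128, 4)

print(conv_total + dense_total)  # → 93468
```

Note how the dense head dominates the parameter count, which is why the study varies complexity by tuning only the fully connected layers.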
The characteristics of each category are as follows:
| Category | Cascade Size | Feature Size | Input Feature Vector Size |
|---|---|---|---|
| Category 1 | 15 | 15×20 MFCC | 300 |
| Category 2 | 10 | 10×20 MFCC | 200 |
| Category 3 | 6 | 6×20 MFCC | 120 |
Category 1 comprises models that require 15×400 audio samples (375 msec @ 16 kHz) to make the classification. Categories 2 and 3 consist of smaller models that need 250 and 150 msec of audio input, respectively.
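These durations follow directly from the cascade sizes in the table, as this small check shows:

```python
# Audio duration needed per classification for each cascade size,
# at 400 samples per frame and a 16 kHz sample rate.
SAMPLE_RATE_HZ = 16_000
FRAME_SAMPLES = 400

def audio_ms(cascade_size: int) -> float:
    """Milliseconds of audio consumed by one full cascade."""
    return cascade_size * FRAME_SAMPLES / SAMPLE_RATE_HZ * 1e3

for category, cascade in (("Category 1", 15), ("Category 2", 10), ("Category 3", 6)):
    print(category, audio_ms(cascade))
# → Category 1 375.0
# → Category 2 250.0
# → Category 3 150.0
```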
We start with Category 1 models to clearly illustrate how the AI accelerator helps with the calculations. For each network architecture, we execute the model with and without the accelerator assistance and plot the corresponding classification latencies in the following diagram. Open circles represent the latencies when classifications are purely made by the MCU. Green-filled points display the corresponding latencies for the same models when the accelerator is activated.
As seen, the accelerator reduces the classification time by at least a factor of 1.5. As expected, with more model parameters, more matrix multiplications are necessary, and the effect of the accelerator is more pronounced. For a CNN model with about one million parameters, the accelerated classification is nearly twice as fast.
For our smart lock application, the model must be capable of producing robust classifications within the 400-sample frame interval, equivalent to 25 milliseconds. Models in the first category may therefore fail to reach sufficient accuracy in a noisy environment when the device is triggered to run the classification at a higher rate. This means that even though a more complex model might show very promising accuracy in off-device testing, it rapidly loses classification performance in on-device execution, because streaming data that arrives during the classification time cannot be registered. To address this data-loss issue, one can build a more robust model that is insensitive to the absence of some fraction of the data, or simply use smaller feature vectors with a less complex CNN architecture.
In the following plot, we have tested Category 2 and 3 models with various complexities. Open symbols are latencies of classifications utilizing only the MCU and filled symbols represent the latency of the corresponding models utilizing the accelerator. From this, we see the accelerator has reduced the classification latencies by a factor of 1.5-1.7.
The main goal of building any application for an edge device is an accurate model whose latency is well below the model inference interval (represented by the dashed horizontal line in the above diagram). The good news is that the AI accelerator makes it practical to run models with feature vectors of size 10×20 while still generating outputs at the rate the streaming audio arrives.
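The deployability criterion described above can be sketched as a simple budget check: the total per-frame cost (feature extraction plus classification latency) must fit within the 25 msec frame period. The latency figures in the example are hypothetical placeholders, not measurements from the plots:

```python
# Sketch of the real-time feasibility check: a model is deployable only
# if feature extraction plus classification fits in the frame period.
# Classification latencies below are hypothetical placeholders.
FRAME_PERIOD_MS = 25.0
MFCC_MS = 2.0  # measured feature-extraction time per frame

def fits_real_time(classification_ms: float) -> bool:
    """True if the model keeps up with the 25 msec streaming frame rate."""
    return MFCC_MS + classification_ms < FRAME_PERIOD_MS

print(fits_real_time(18.0))  # → True  (fits within the budget)
print(fits_real_time(30.0))  # → False (frames would be dropped)
```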
In this part of the blog series, we measured the latency of audio classification models of various complexities. We found that the classification time is dominated by mapping the feature vector to the final classification through a series of matrix multiplications. We executed multiple models live on the device with and without the AI accelerator activated, and found that the accelerator reduced the classification latency by at least a factor of 1.5. As expected, the improvement factor increases with model complexity, since more complex models require more matrix multiplications whose load can be redirected to the accelerator unit. A similar latency analysis is necessary whenever a model is deployed on an edge device: an acceptable model must handle data capture and feature extraction at the same time it executes the algorithm, guaranteeing that no data is lost. The AI accelerator unit offloads this execution from the MCU, thereby allowing more complex models to be used in real time with sufficient time to generate classification results at the rate of the streaming data.
In Our Next Installment: Using Data Augmentation to Enhance Model Accuracy
Up to this point, we have walked through each of the steps to collect data, build a model, implement the model, and understand the level of optimization provided by the AI-accelerated MG24 and BG24 SoCs. In our final installment of this series, we will explore how to further improve audio classification accuracy and noise tolerance through the use of data augmentation techniques.
Part 5: Using data augmentation to enhance model accuracy <Coming Soon>