The Scenario

Six months and $500,000 into a key initiative to build a more intelligent product using AI, things are not going well. At the outset, the project team and management were excited by the initiative. The vision was to use machine learning to create a differentiated smart device that improves with use and supports a new business model built on extensible algorithms that users would be willing to purchase.

The team set out to combine historical data from past projects with new data gathered from an internal pilot using hundreds of new prototype devices. This involved assembling and programming custom hardware, recruiting a cohort of internal test users, and using a team of technicians and developers to download datasets and cleanse them with custom-written data conversion scripts.

After months of effort and considerable expenditure, the results are falling well short of expectations. Predictive models are performing poorly and the product prototype isn’t delivering on the promised intelligence. Worse, the data strategy turns out to have been badly planned from the outset: it omitted key data needed to explain the correlation between the raw sensor data and the desired modeling insight. Reconstructing this after the fact is deemed nearly impossible, so the data collected over months of effort is largely useless and must be recollected.

An extreme example? According to IDC, “Most organizations reported some failures among their AI projects with a quarter of them reporting up to 50% failure rate; lack of skilled staff and unrealistic expectations were identified as the top reasons for failure.”[1] In the above scenario, a lack of experience with what AI can and cannot do, along with unrealistic expectations, led the team to believe that any data was good data and that more was always better. By contrast, teams with some experience and understanding of how AI algorithms work know that planning for properly labeled training data is critical to getting good performance. When things go wrong, the natural tendency is to suspect the algorithms first. But in most cases, the problem isn’t in the AI algorithms themselves; it’s in the training data and the upfront data collection protocol, where capturing a bit more data can make the difference between salvageable datasets and those that must be recollected.

This troubling scenario is entirely avoidable with a modicum of upfront planning, an understanding of the basics of machine learning, and a phased process for data collection that allows for some human learning too. Most important is the need to approach AI projects with a data-centric mentality. All too often, teams focus on the algorithms and AI methods themselves and overlook the importance of the underlying data, the proper labeling of that data, and the capture of vital contextual metadata. “Companies often don’t have the right data, and get frustrated when they can’t build models with data that isn’t labeled,” says Anand Rao, partner and global AI leader at PricewaterhouseCoopers. “That’s where companies consistently fail.”[2]

This holds true whether AI runs in the cloud with the power of data center computing or is deployed at the edge, with real-time models adapted to fit the compute resources available locally on embedded IoT nodes and edge computing devices for applications that demand immediate responsiveness.

Collecting and labeling new data specific to a project’s needs is another reality that organizations must consider. A common misconception is that existing data lying about will be suitable for a given purpose and that the power of AI can “automagically” extract insight from unstructured or application-mismatched data. As cited in a recent McKinsey report on AI, “on average, only 3 percent of an organization’s data meets the quality standards needed for analytics. And unlike tools, infrastructure, or talent, a complete set of AI-ready data cannot typically be purchased because an agency’s unique use cases and mission demand bespoke data inputs.”[3]


Success Starts Before the First Byte of Data is Collected

Developing high quality datasets starts with a considered approach to how the data should be collected in the first place. Devising a test plan is a seemingly obvious first step, but knowing what questions should be addressed in the plan is where a lack of experience in AI data collection can cause problems. Just a few of the questions to consider include:

  • How many samples will be needed to have a statistically relevant population for modeling purposes?
  • What is the smallest number of samples that can be used to validate assumptions about expected data variance and distribution?
    Consider using this number as the target for your initial phase of data collection.
  • Besides the sensor readings themselves, what are the other knowable parameters that influence the outcome of desired classification models?
  • What sources of undesired variance are likely and how can these be controlled in a manner consistent with eventual predictive use?
    Effective control of this variance can greatly accelerate the effort and reduce cost by lessening the number of samples otherwise needed to get an initial working model.
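
One way to put a rough number on the first two questions is the textbook sample-size formula for estimating a mean, n = (z * sigma / E)^2, fed with the spread observed in a small pilot phase. The sketch below is a minimal illustration in Python and is not taken from the SensiML whitepaper; the function name, the pilot values, and the simplification to a single, roughly normally distributed feature are all assumptions made for the example.

```python
import math
from statistics import NormalDist, stdev


def samples_needed(pilot_values, margin, confidence=0.95):
    """Estimate how many samples are needed so that the mean of one feature
    is known to within +/- margin at the given confidence level.

    Assumes the feature is roughly normally distributed and that the pilot
    data gives a fair estimate of its standard deviation.
    """
    if len(pilot_values) < 2:
        raise ValueError("need at least two pilot samples")
    if margin <= 0:
        raise ValueError("margin must be positive")
    sigma = stdev(pilot_values)                      # spread seen in the pilot
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # two-sided z-score
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical pilot readings of a single feature (illustrative values only).
pilot = [0.92, 1.10, 1.05, 0.88, 1.21, 0.97, 1.02, 1.15]
print(samples_needed(pilot, margin=0.05))
```

Because the required count grows with the square of the standard deviation, halving uncontrolled variance cuts the estimate to roughly a quarter, which is the saving the last question points at. A classifier with many features or classes will generally need more data than this single-feature estimate suggests, so treat the result as a floor for planning rather than a target.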

There are a number of other considerations. A more extensive look at test planning is covered in the SensiML whitepaper “Building Smart IoT Devices Faster with AutoML: A Practical Guide to Zero-Coding Algorithm Design.” As an added resource, that whitepaper also includes a test plan template, which can serve as a useful starting point for devising your own test plan.

Excerpt of SensiML Smart Edge AI Test Plan template included with SensiML whitepaper “Building Smart IoT Devices Faster with AutoML”

Introducing SensiML Data Capture Lab: Built for Collecting and Managing High Quality AI Datasets

Generating high quality training data can be greatly aided (or hindered) by the choice of software tools for AI and how they address the critical task of facilitating data collection, labeling, and management. The choices for data collection and management until now have been limited to either:

  1. Use of general purpose data acquisition software, building data workflows using their provided scripting facilities and exporting data into a suitable form to consume within AI analysis tools
  2. Development of custom-built scripts in generic or open-source programming environments such as Python and importing resulting datasets into AI analysis tools

Either way, these methods are a significant distraction and increase the risk of generating flawed datasets because they overburden the user with myriad non-core dataset manipulation tasks. Rather than focusing on application-specific data analysis, expertise is diverted to low-value data management coding: extracting, ingesting, cleaning, formatting, segmenting, synchronizing, normalizing, labeling, version-controlling, and building meaningful data structures out of raw sensor data as part of the exploratory data modeling process. Such custom-built workflows bottleneck the team, and the people who hold the critical know-how and maintain the proprietary code become burdened with the cumulative support of increasingly complex and fragile internal tooling.
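
To make that burden concrete, here is a minimal sketch of just two of those tasks, building a data structure for a labeled capture and checking it for missing metadata or bad segments, as a team might hand-write them in Python. Every name in it (`Capture`, `Segment`, `REQUIRED_METADATA`, the metadata keys, and the sample values) is hypothetical and chosen for illustration; it does not describe how any particular tool stores data.

```python
from dataclasses import dataclass, field

# Hypothetical metadata fields; a real test plan would define its own.
REQUIRED_METADATA = ("device_id", "subject_id", "sensor_placement")


@dataclass
class Segment:
    start: int   # index of the first sample in the segment
    end: int     # index one past the last sample in the segment
    label: str   # class label assigned to this span of samples


@dataclass
class Capture:
    samples: list
    sample_rate_hz: float
    metadata: dict = field(default_factory=dict)
    segments: list = field(default_factory=list)


def validate(capture):
    """Return a list of human-readable problems found in one capture."""
    problems = []
    for key in REQUIRED_METADATA:
        if not capture.metadata.get(key):
            problems.append(f"missing metadata: {key}")
    for seg in capture.segments:
        if not (0 <= seg.start < seg.end <= len(capture.samples)):
            problems.append(
                f"segment out of range: {seg.label} [{seg.start}, {seg.end})"
            )
        if not seg.label:
            problems.append(f"unlabeled segment [{seg.start}, {seg.end})")
    return problems


# Illustrative capture: 100 samples, one metadata field missing,
# and one segment that runs past the end of the recording.
capture = Capture(
    samples=[0.0] * 100,
    sample_rate_hz=100.0,
    metadata={"device_id": "proto-017", "subject_id": "S03"},
    segments=[Segment(10, 40, "walking"), Segment(60, 120, "running")],
)
for problem in validate(capture):
    print(problem)
```

Even this small check has to be written, tested, and kept in step with the test plan as it evolves, and it covers only a fraction of the tasks listed above.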

Alternatively, SensiML has spent years building the capability to collect, segment, label, and manage datasets in a standardized workflow, using a front-end AI dataset management tool called Data Capture Lab.

Using SensiML’s Data Capture Lab, a key software component of the SensiML Analytics Toolkit, data scientists, developers, or data technicians working collaboratively on a project can efficiently complete the data management tasks leading up to effective data modeling without the need to devise and maintain such complicated and error-prone data processing scripts and workflows. Although data science projects can range widely in terms of scope, technologies, and application, most signal data processing tasks fall within common workflows that need not be reinvented each time. Data Capture Lab is a powerful data-centric edge AI development tool that bridges these workflows into meaningful, reproducible, and collaborative processes that are purpose-built for feeding high quality datasets into AI analysis, code-generation, and testing stages comprising the balance of the SensiML Toolkit suite.

Conclusion

SensiML Data Capture Lab recorded sensor data

Data Capture Lab, as part of the SensiML Analytics Toolkit, helps organizations overcome one of the biggest pitfalls encountered in AI-based algorithm design: failure caused by training data of poor quality for the intended application. With a consistent, mature, reproducible, and supported workflow for collecting and managing labeled time-series sensor datasets, your team can focus on scaling up data coverage, functionality, and ongoing model learning capabilities. Data Capture Lab works in concert with the SensiML Analytics Toolkit’s other applications and your organization’s other AI software tools to create an AI workflow that works.

To learn more about Data Capture Lab and the rest of the SensiML Analytics Toolkit, download a Free Trial of the software and see for yourself how our tools can improve your odds for AI project success.


[1] IDC report, Artificial Intelligence Global Adoption Trends & Strategies (IDC #US45120919), July 2019

[2] Maria Korolov, “6 reasons why AI projects fail”, CIO; August 6, 2019

[3] Anusha Dhasarathy, Ankur Ghia, Sian Griffiths, and Rob Wavra, “Accelerating AI impact by taming the data beast”, McKinsey article, March 2, 2020