Now We’re Talking!

As SensiML’s lead developer for Data Studio, I’m excited to announce that we’ve just added generative AI for speech (Text to Speech and AI Voice Generator) to our powerful edge IoT ML dataset management application. This feature, available starting with release 2024.3.0, combines the ElevenLabs voice generation APIs with Data Studio’s streamlined dataset labeling capabilities to generate rich voice recognition datasets purely from text prompts. With this tool, you can build large speech datasets curated for keyword recognition or speaker identification models in minutes. We are very excited about how much it can reduce the time-consuming process of generating datasets large enough for accurate model training. Typically this is done manually, either by recording phrases from many individuals or by purchasing datasets from expensive proprietary voice databases. With the power of Generative AI, it is now practical to supplement or replace such methods with synthetically created voice data.

Below I’ll go over some of the features in more detail and give some tips for building robust voice datasets ready for downstream use in SensiML Analytics Studio or Piccolo AI, our open-source AutoML tool.

How to Use Data Studio Text-to-Speech and AI Voice Generator

Utilizing the Voice Generator feature involves six steps as listed below:

1. Open a project in Data Studio

2. Open the ‘Text to Speech and AI Voice Generator’ window

3. Enter your ElevenLabs API key (sign up for a free account at https://elevenlabs.io/)

First, generate the API key in the elevenlabs.io web app under your ElevenLabs login:

Copy the generated API Key as highlighted below (yours not ours!)

Paste this key into the API Key field in Data Studio as shown below:

4. Enter a prompt for the voice

5. Select the voices you want to use

6. Adjust the ElevenLabs speech options
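The steps above all happen in the Data Studio UI, but it may help to see roughly what a text-to-speech request to ElevenLabs looks like under the hood. The sketch below builds such a request against the public ElevenLabs REST API (the endpoint, `xi-api-key` header, and `voice_settings` fields come from ElevenLabs’ API documentation; the API key and voice ID shown are placeholders, and the stability/similarity defaults are illustrative, not Data Studio’s actual values):

```python
import json

ELEVENLABS_TTS_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"

def build_tts_request(api_key, voice_id, text, stability=0.5, similarity_boost=0.75):
    """Construct the URL, headers, and JSON body for an ElevenLabs
    text-to-speech request. The HTTP response body is raw audio bytes."""
    url = ELEVENLABS_TTS_URL.format(voice_id=voice_id)
    headers = {
        "xi-api-key": api_key,           # your ElevenLabs API key (yours, not ours!)
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",          # ask for MP3 audio back
    }
    body = json.dumps({
        "text": text,                    # the prompt; counts against your character quota
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    })
    return url, headers, body

# Example: build (but don't send) a request for one keyword prompt.
url, headers, body = build_tts_request("YOUR_API_KEY", "EXAMPLE_VOICE_ID", "Hey SensiML")
```

Sending the request (for example with `urllib.request`) and saving the returned bytes to a `.mp3` file yields one generated clip per prompt per voice.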

Note that ElevenLabs limits the number of text-to-speech input characters you can use in a given period of time. As of this writing, they offer monthly subscriptions ranging from a free tier to various paid tiers; check with ElevenLabs for their latest offerings.

When you click Generate, Data Studio shows in a confirmation dialog how many characters the request will consume before any are actually used.

(Optional) Add pitch, gain, echo, or reverberation adjustments to the files. These adjustments are applied natively in the Data Studio, so they do not consume any of your ElevenLabs characters. This lets you greatly expand your dataset by simulating pitch, volume, and distance variations as data augmentation, making for a more robust model.
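Data Studio applies these effects for you, and its implementation is not public, so purely to illustrate what two of the simpler augmentations do to a waveform, here is a minimal NumPy sketch of gain scaling and a single-tap echo (the delay and decay values are arbitrary examples):

```python
import numpy as np

def apply_gain(samples, gain_db):
    """Scale the waveform amplitude by a gain expressed in decibels."""
    return samples * (10.0 ** (gain_db / 20.0))

def apply_echo(samples, sample_rate, delay_s=0.15, decay=0.4):
    """Mix a single delayed, attenuated copy of the signal back into itself,
    simulating a reflection off a distant surface."""
    delay = int(delay_s * sample_rate)          # delay in samples
    out = samples.astype(float).copy()
    out[delay:] += decay * samples[:len(samples) - delay]
    return out
```

Applying several such variants to each generated clip multiplies the dataset size without spending any additional text-to-speech characters.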

Auto-Labeling and Metadata

Metadata

The Data Studio can add metadata to your files for any of the settings used when generating them. This means you can track how every file was produced; for example, you can save the VoiceName, Gender, Stability, Similarity, Pitch, and Gain percentage as metadata on your files.

Given the nature of Generative AI, you may discover that certain files have artifacts or don’t sound natural. Metadata makes it easier to see what may be causing those artifacts or unnatural voices, and it helps you organize and filter your files when you are building a model.

Metadata is saved to the file and allows you to filter/sort your files when reviewing your training dataset later.
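Data Studio stores this metadata for you internally, but if you manage generated audio outside the tool, the same idea can be sketched as a JSON “sidecar” file per clip. The sidecar format and the helper below are my own illustrative convention, not a Data Studio format; the field names match the settings listed above, and “Rachel” is just an example voice name:

```python
import json
import tempfile
from pathlib import Path

def write_metadata_sidecar(audio_path, **settings):
    """Save the generation settings next to the audio file as a JSON sidecar
    so the dataset can later be filtered and sorted by how each clip was made."""
    sidecar = Path(audio_path).with_suffix(".meta.json")
    sidecar.write_text(json.dumps(settings, indent=2))
    return sidecar

# Example: record the settings used for one generated clip.
with tempfile.TemporaryDirectory() as d:
    meta = write_metadata_sidecar(
        Path(d) / "hey_sensiml_001.wav",
        VoiceName="Rachel", Gender="female",
        Stability=0.5, Similarity=0.75, Pitch=0, Gain=100,
    )
```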

Auto-Labeling

If you open any of the files in the Project Explorer, you will see the Data Studio automatically adds labels/segments for your Prompt as it generates the files.

By default, the Data Studio uses the Prompt text for the label, but you can change this in the File Settings (or you can disable this feature completely).

You can see the labels on your files through the Project Explorer Label Distribution columns.

Tip: Add More Voices

The Data Studio will load the voices available to your account. By default, ElevenLabs automatically adds a range of voices to your account to get you started, but many more voices are available. Depending on your application, we recommend finding additional voices with different accents, ages, and use cases through the ElevenLabs Voice Library.

You can add additional community-built voices from the ElevenLabs Voice Library by visiting https://elevenlabs.io/app/voice-library.

1. Open the Voice Library

2. Click Add To My Voices to add additional voices to the available voices in your library

3. (Optional) You can also create your own using the ElevenLabs Voice Design/Voice Cloning features.

After adding new voices, click the Refresh button to load them into the Data Studio.
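Refreshing presumably re-queries the voices attached to your account; the ElevenLabs public API exposes this as a `GET /v1/voices` endpoint authenticated with the same `xi-api-key` header. A minimal sketch of fetching and parsing that list (how Data Studio does this internally is an assumption on my part):

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"

def parse_voices(payload):
    """Extract (name, voice_id) pairs from a /v1/voices response body."""
    return [(v["name"], v["voice_id"]) for v in payload["voices"]]

def list_voices(api_key):
    """Fetch all voices currently available to this API key,
    including any community voices added from the Voice Library."""
    req = urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})
    with urllib.request.urlopen(req) as resp:
        return parse_voices(json.load(resp))
```

Each `voice_id` returned here is what a text-to-speech request targets, so a newly added Voice Library voice shows up in this list once it is in “My Voices.”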

Build Your Model

From this point, the process follows the same workflow as with manually recorded audio or voice data. You’ll create a pipeline, generate a model, explore the model results, test the model with data reserved separately from training, and then generate firmware for integration onto your embedded target device of choice.