Hey, you: What’s the process for voice command data collection?

Last Updated February 9, 2022


Popular applications like Alexa, Siri, and Google Assistant require extensive voice command data collection to be effective. But how is that data collected, and how can you get your hands on it for your innovative AI solution?

The simple answer? Gather people willing to participate in the task. Voice commands for speech recognition can be collected quickly and efficiently on a small scale or from thousands of participants.

Let’s say you need voice command data in a specific language or dialect. With crowd-sourced data collection, we recruit participants online based on their language and demographic profile.

But we’re getting a bit ahead of ourselves here. Let’s first define voice commands and explain the collection process from start to finish. Then we’ll look at a prime example of how it all comes together to give you the best possible AI solution for your customers.

What do we mean by voice commands?

Speech recognition technology allows you to control smart devices with your voice.

There are two fundamental types of voice commands. Wake words are used to activate the device, and specific instructions can be uttered for tasks you want the device to accomplish.

Wake words

A wake word is a phrase used to trigger voice assistants like Amazon’s Alexa, Apple’s Siri, and Google Assistant.

If you have one of these, you’ve likely uttered ‘Hey Siri’, ‘Okay Google’, or ‘Alexa’ many times.

Specific requests

When a wake word is detected, the system starts listening, interpreting, and responding to requests you make.

For example, after saying “hey google” you can utter commands to:

  • Set an alarm: “Set an alarm for 7 AM” or “Set an alarm for every Friday morning at 7 AM”
  • Set a reminder: “Remind me to call John at 6 PM” or “Remind me to buy Belgian chocolate at Ghirardelli Square.”
  • See SMS (text) messages: “Show me my messages from Brian about dinner”
  • Create a Google Calendar event: “Create a calendar event for dinner in San Francisco, Saturday at 7 PM”
  • See your upcoming bills: “My bills” or “My phone bills 2021.”
  • Check your schedule: “What does my day look like tomorrow?” or “When’s my next meeting?”

And that’s just scratching the surface, to be honest.
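Conceptually, the wake-word-then-command flow works like a tiny state machine: the device idles until it hears the wake phrase, then treats the next utterance as a command. Here’s a minimal sketch in Python, using plain string matching on transcripts in place of a real keyword-spotting model (the function and word list are illustrative, not any vendor’s actual API):

```python
# Minimal sketch of a wake-word gate. The transcripts here stand in for
# real audio; a production system would run a lightweight keyword-spotting
# model on the microphone stream instead of string matching.
WAKE_WORDS = {"hey google", "ok google", "alexa", "hey siri"}

def handle_stream(utterances):
    """Yield only the commands spoken right after a wake word."""
    awake = False
    for text in utterances:
        normalized = text.lower().strip()
        if normalized in WAKE_WORDS:
            awake = True          # start listening for a command
        elif awake:
            yield normalized      # this utterance is the command
            awake = False         # go back to idle until the next wake word
```

In a real assistant, the gating runs on the audio stream itself, but the logic is the same: nothing is interpreted as a command until the wake word opens the gate.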

But again, the AI must first be trained to recognize voice commands in target languages and dialects to be effective. That’s where the speech data collection process comes into play.

Why You Need Voice Command Data Collection

To train your system, you need high-quality data as a reference point. Whether you’re developing a voice assistant, a video game, or an autonomous vehicle, you need all kinds of examples so your system can be accessible to people of all demographics.

It’s not enough to get different languages or dialects, either.

Karina Brunoro

Data Solutions Project Coordinator in Production, Summa Linguae Technologies

Our clients tell us they want quality voice command data to train their systems. And they need as much data as they can get to make sure their system is accessible to people of different ages and different accents.

Karina offered up a very practical example to illustrate the need for voice command data collection.

“I’m Brazilian, my boyfriend is Brazilian. He has a very strong accent when he is speaking in English, and he set up his voice assistant in English,” she explained. “Sometimes it doesn’t understand what he’s saying, and it doesn’t turn on the TV or turn off the light because of his accent. He’s not saying something wrong. I understand what he’s saying, but the device doesn’t. If it was trained with different accents, the relationship between the device and the user itself would be much better.”

Making life as easy as possible for all potential users is the end goal of smart devices, and they’re only as smart as they’re taught and trained to be. That’s where the need for comprehensive data comes in.

What’s collected for you depends entirely on the needs of your project. When companies approach us for wake word and voice command data collection, they’re typically in one of two situations.

Common Situation #1

You have a mature speech recognition program with clearly defined speech data needs. You may be looking to improve your automatic speech recognition accuracy by 1%.

Companies approach us with specific demographic requirements and quantities. For example, “We need 500 recordings of Japanese speakers saying, ‘I need an Uber’, distributed evenly across age groups with the following audio quality requirements…”

If your quality requirements aren’t incredibly strict, you may be able to secure a bulk provider who can offer the data at the lowest price.

But if you have stricter quality or demographic requirements, you’ll often need to work with a boutique collection agency that can customize your data collection project to your needs – and at scale, if necessary.

Common Situation #2

Other companies come to us in proof-of-concept mode. You may not have clear data requirements because you’re still qualifying your internal AI models or exploring the speech space for the first time.

For example, you may be looking for spontaneous, unscripted voice commands to train your application to recognize hate speech uttered by players of a voice-controlled video game. You want to train the model to flag and ban users who don’t abide by community guidelines.

In this case, you need an innovation partner who can offer clever and cost-effective ways of getting you off the ground. You’ll be leaning on your data collection partner to help you figure out what combination of languages, dialects, and accents you need and in what quantities.

But even if your data needs are smaller today, you should plan to collect data that can scale in the future should you want to grow your program. You don’t want to have to switch providers and start over from scratch.

If you’re company #1 or company #2 (or somewhere in between), you should work with a speech data provider who can customize when needed, but also scale up on demand.

So, how is voice command data collected?

For these devices to be universally accessible, huge amounts of data must be collected to train the AI to recognize wake words and commands in all kinds of languages and dialects.

It’s like how conversational data for AI is collected, but in shorter clips.


The collection process starts with the specific request of the client. What do you need to be recorded and by whom? The demographics can be narrowed in a number of ways:

  • Gender
  • Date of birth
  • Home city, region, and country
  • Current city, region, and country
  • Emigration year
  • Mother tongue
  • Education level
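As an illustration, matching participants against a client’s demographic spec amounts to filtering profiles on fields like those above. The field names and spec format below are hypothetical, a minimal sketch rather than any platform’s real schema:

```python
from datetime import date

def matches_spec(profile: dict, spec: dict) -> bool:
    """Return True if a participant profile satisfies a demographic spec.

    `spec` maps a profile field to an allowed set of values, plus an
    optional min_age/max_age range computed from date of birth.
    """
    age = date.today().year - profile["date_of_birth"].year
    if not spec.get("min_age", 0) <= age <= spec.get("max_age", 200):
        return False
    for field in ("gender", "current_country", "mother_tongue"):
        if field in spec and profile.get(field) not in spec[field]:
            return False
    return True

# Example spec: Japanese speakers aged 20-40 currently living in Japan
spec = {"min_age": 20, "max_age": 40,
        "current_country": {"Japan"}, "mother_tongue": {"Japanese"}}
```

A production system would also handle emigration year, home vs. current region, and the rest of the fields above, but the filtering principle is the same.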

Customizability also applies to the quality of the audio. Are you OK with some background noise, or do you need professional-grade, crystal-clear recordings? Do you need perfect enunciation, or are you looking for authentic speech recordings so your AI solution can absorb specific variations of voice commands? Do you require the audio at a certain bit depth and sample rate?
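Audio requirements like these can be validated automatically on submission. Here’s a minimal check using Python’s standard `wave` module, assuming WAV files and an example target of 16 kHz, 16-bit mono (your spec may differ):

```python
import wave

def meets_audio_spec(wav_path, sample_rate=16000, sample_width_bytes=2, channels=1):
    """Check a WAV file against required sample rate, bit depth, and channels."""
    with wave.open(wav_path, "rb") as wav:
        return (wav.getframerate() == sample_rate
                and wav.getsampwidth() == sample_width_bytes  # 2 bytes = 16-bit
                and wav.getnchannels() == channels)
```

A file-like object works in place of a path, and formats beyond WAV would need a third-party audio library.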

You can therefore tailor your voice command data collection needs based on any number of specifications.

Requests and Recordings

Once all the requirement information is gathered and organized, the request is put out there for participants.

An effective technology platform is the backbone of a data collection program. The more efficient the platform, the better the data quality and the greater the cost savings.

Here at Summa Linguae, we built Robson – a remote data collection and crowd management platform. We went with a hybrid platform approach composed of a mobile app, desktop interface, and backend administration platform.

Here’s how it works:

  1. Robson users are matched to simple tasks based on their profile information, drawing on the demographic options above.
  2. The user can view and sign up for all the tasks they’re eligible for.
  3. Once assigned to a task, they’ll see instructions and the sentences to record.
  4. After making a recording, the user can play it back, re-record if necessary, or move on to the next utterance.
  5. This data then enters our pipeline, where submissions are reviewed for quality, processed, and then securely shipped over to the client.
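The workflow above can be sketched as a submission moving through a handful of states. The state names and transitions are illustrative, not Robson’s actual schema:

```python
from enum import Enum

class Status(Enum):
    ASSIGNED = "assigned"      # user signed up for the task
    RECORDED = "recorded"      # utterance captured, playback available
    IN_REVIEW = "in_review"    # quality pipeline
    DELIVERED = "delivered"    # securely shipped to the client
    REJECTED = "rejected"      # failed QA, re-record requested

# Allowed transitions between states
TRANSITIONS = {
    Status.ASSIGNED: {Status.RECORDED},
    Status.RECORDED: {Status.RECORDED, Status.IN_REVIEW},  # re-record or submit
    Status.IN_REVIEW: {Status.DELIVERED, Status.REJECTED},
    Status.REJECTED: {Status.RECORDED},
}

def advance(current: Status, nxt: Status) -> Status:
    """Move a submission to its next state, rejecting illegal jumps."""
    if nxt not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {nxt}")
    return nxt
```

Modeling the pipeline this way makes the re-record loop explicit: a rejected submission can only go back to recording, never straight to delivery.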

The beauty of this solution is it’s affordable, scalable, and highly customizable to your needs. And in today’s gig economy, people are more than willing to participate because they’re compensated for successfully completing a task.


The level of quality assurance performed on the speech recognition voice command data depends wholly on your needs.




Lilo Segovia Guerron

Data Solutions Project Coordinator in Production

It depends on the client and how many audio samples they want us to check. 100 percent of the recordings can be checked to make sure they meet your requirements, or a sampling can be reviewed. Before it’s sent out as a finished product, we will request re-recordings for any that don’t meet the target standards.

To make sure it’s done right the first time, it’s important to make the task instructions as clear as possible. For example, one recent task specifies “fast speech.” If a recording is submitted where the person is speaking at a normal pace or even lazily, they’ll be asked to do it again.

The metadata – all the information you want about the user – must be clearly laid out as well. This can include participants’ birth country as well as where they currently live. We are always on the lookout for fraudulent users – people who switch up their profiles to complete more tasks (since we do compensate them, after all).

We need the right crowd at the end of the day. English recordings are easy enough, but Japanese, Chinese, Arabic? That can be a bit more of a challenge. Once we know exactly what you need and why, we can create those very detailed instructions to get the precise data you need for your project.
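The review options Lilo describes – checking 100 percent of recordings or just a sampling – can be sketched as a single selection step with a configurable review rate (names and structure are illustrative):

```python
import random

def select_for_review(submissions, review_rate=0.2, seed=None):
    """Randomly pick a fraction of submissions for manual QA.

    review_rate=1.0 reproduces the "check 100 percent" case; anything
    selected that fails review would be sent back for re-recording.
    """
    rng = random.Random(seed)  # seeded for reproducible audits
    k = max(1, round(len(submissions) * review_rate)) if submissions else 0
    return rng.sample(submissions, k)
```

Seeding the sampler means an audit can later reproduce exactly which recordings were reviewed for a given batch.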

Sonos Case Study

Sonos is one of the world’s leading sound experience brands. As the inventor of multi-room wireless home audio, Sonos gives people access to the content they love, anywhere and anytime.

Sonos was developing an integration between their wireless speakers and smart home assistants. This meant they needed speech data collected from three markets – the USA, the UK, and Germany – broken down by varying age groups. They specifically needed wake word data, similar to Amazon’s “Alexa” and Google’s “OK Google.”

This data would be used to test and tune the wake word recognition engine, ensuring that users of all demographics or dialects have an equally great voice experience on Sonos devices.

Sonos had specific accent and dialect requirements for their speech data set. The data set we built spanned different cultures and age groups. The range of data needed included several demographic identifiers, including age, sex, and language proficiency.

Participants were picked meticulously, ranging in age from 6 to 65, with a 1:1 ratio of males to females, and tracked according to their accents.

In the end, Sonos was able to extend their speakers’ voice recognition capabilities to additional English and German dialects.

And that, as they say, is how it’s done.

Voice Command Data Collection at Scale

Requirements, requests, recordings, review – these are the pillars of speech recognition voice command data collection. But these pillars aren’t built overnight.

At Summa Linguae Technologies, we’ve worked for years to develop and refine these processes and platforms.

As a result, our clients recognize our data solutions team as extremely versatile, with outside-of-the-box thinking. And as we’ve developed our crowd and our platform, we’ve gained the ability to offer custom speech data collection at scale.

To learn how we can create a speech collection program for your organization, book a consultation now.
