15 Ways to Customize Your Speech Data Collection Project

Introduction

Before we begin your speech data collection project, you have some choices to make.

While some clients come to us knowing exactly how they’d like their voice data to be structured, the vast majority come with fairly loose requirements, either because those requirements are flexible or because they haven’t yet considered all the possible variables.

As a speech data provider, it’s our role to highlight all the ways in which your speech dataset can be customized, while also steering you towards the most effective and price-conscious collection option for your solution.

In this blog post, we’ll cover how speech data collection can be customized by:

languages and demographics
collection size
script structure
audio requirements and formats
delivery and processing requirements

Why does it matter? These customizations or requirements will ultimately impact:

how the data is collected
how participants are recruited
how files are delivered
the delivery timeline
the total cost

Let’s dive into 15 different ways a speech data collection project can be customized to suit your needs.

Linguistic & Demographic Variables

Every project starts with specifying who you want to collect speech data from.

1. Target Language(s)

For which languages do you intend to collect speech data? And for each language, are the participants expected to be native, non-native, or both?

Example: Native British English speakers

2. Dialects and Non-Native Accents

Do you require a specific dialect or non-native accent? You can structure your collection to record a distribution of participants across dialects.

For example, you may want to intentionally target a range of dialects to avoid introducing systematic biases to your speech recognition algorithm.

Example: 25% Northwest US speakers, 25% Northeast US speakers, 25% Southwest US speakers, 25% Southeast US speakers

Example: Mexican Spanish-accented speakers

3. Primary Countries

For each language of the collection, should the participants be from specific countries? If not, are all countries where the language is an official language acceptable?

Example: There will be a big difference between Spanish speakers from Mexico and Spain.

For each language of the collection, should the participants currently live in a specific country?

Example: A native Brazilian Portuguese speaker may have a different accent after spending 20 years abroad.

4. Demographics

On top of geographic location, you can also customize your data collection project by demographic variables.

You can target a specific distribution by gender (male vs. female) or by age group (for example, children vs. adults).

Example: 50% between 18-35 years old, 50% over 35 years old

Example: 25% male children, 25% female children, 25% male adults, 25% female adults

These linguistic and demographic variables can usually be summed up by a short statement about your target demographic(s).

Example: Indian children speaking English, split evenly by gender.

Example: German native speakers, currently living in Germany.

Example: Native and non-native English speakers worldwide.

Collection Size

How much speech data you need to train your recognition algorithms will impact the number of participants you need to recruit, plus the number of utterances per participant.

5. Number of participants

How many total speakers will you need?

If the collection includes several languages, how many participants will you need per target language? And if your collection includes several demographics, what is the desired breakdown per demographic group?

Example: 50 American English speakers, 50 British English speakers
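A demographic breakdown expressed in percentages has to be turned into whole-number recruiting quotas at some point. The sketch below shows one way to do that; the function name `allocate_quotas`, the total of 50 participants, and the largest-remainder rounding rule are illustrative assumptions, not a description of how any particular provider does it.

```python
def allocate_quotas(total_participants, distribution):
    """Split a participant total across groups by percentage share.

    Uses largest-remainder rounding so the quotas always sum to the total.
    `distribution` maps a group name to its percentage (hypothetical format).
    """
    if abs(sum(distribution.values()) - 100) > 1e-9:
        raise ValueError("distribution percentages must sum to 100")

    # Exact (possibly fractional) share for each group.
    exact = {g: total_participants * pct / 100 for g, pct in distribution.items()}
    # Start from the rounded-down share.
    quotas = {g: int(v) for g, v in exact.items()}
    leftover = total_participants - sum(quotas.values())
    # Hand the remaining seats to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for g in by_remainder[:leftover]:
        quotas[g] += 1
    return quotas


quotas = allocate_quotas(
    50,
    {"male children": 25, "female children": 25, "male adults": 25, "female adults": 25},
)
print(quotas)
```

With 50 participants and four equal groups, the exact share is 12.5 each, so two groups end up with 13 participants and two with 12. In practice you would decide up front which groups get the extra seats.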

6. Number of utterances and repetitions

You’ll also need to consider the number of utterances you need per participant, or the total number of utterances needed for the collection as a whole.

Example: 20 participants x 50 utterances per participant = 1,000 total utterances
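The arithmetic in the example above extends naturally if each utterance is recorded more than once. This is a minimal sketch; the `repetitions` parameter and the function name are assumptions added for illustration.

```python
def total_recordings(participants, utterances_per_participant, repetitions=1):
    """Total recordings when every participant records each utterance `repetitions` times."""
    return participants * utterances_per_participant * repetitions


print(total_recordings(20, 50))      # 1000, matching the example above
print(total_recordings(20, 50, 3))   # 3000 if each utterance is repeated three times
```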

Script Considerations

You can tailor the collection’s script and workflow to the type of speech recognition data you need.

7. Scripted vs. Natural Language

Do you require scripted or natural language?

In scripted speech, each participant reads aloud what they see on a screen. This method is used to record specific voice commands or utterances.

Example: Participant reads “Alexa, turn off my music” off the screen

In unscripted speech, or natural language, participants may be given a particular scenario in which they are asked to form their own sentences. Or, they may just be asked to speak freely.

Example: Participant is asked to come up with a command to turn off the music and says “Alexa, please turn the music off now”

Example: To capture conversational speech, two participants are recorded having an unscripted phone conversation.

8. Script structure

If your collection is scripted, how many different scripts does it include? Is there a single script read by every participant, or several scripts read by different subsets of the participant pool? Or does each participant get a unique script?

Example: Half of the female participants read one script, the other half read a different script.

Does the script mix combination sentences (e.g. a wake word + a command) with regular sentences?

Example: A combination of all wake words and all commands:

Command 1

[Alexa] Play songs by the Beatles

[Okay Google] Play songs by the Beatles

[Hey Siri] Play songs by the Beatles

Command 2

[Alexa] What time is it?

[Okay Google] What time is it?

[Hey Siri] What time is it?
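A combination script like the one above can be generated rather than written by hand. The sketch below pairs every command with every wake word using the standard library's `itertools.product`; the function name `build_script` and the bracketed output format are assumptions that simply mirror the example.

```python
from itertools import product

wake_words = ["Alexa", "Okay Google", "Hey Siri"]
commands = ["Play songs by the Beatles", "What time is it?"]


def build_script(wake_words, commands):
    """Pair every command with every wake word, grouped by command."""
    # Commands vary slowest, so all wake words for one command appear together.
    return [f"[{wake}] {command}" for command, wake in product(commands, wake_words)]


script = build_script(wake_words, commands)
for line in script:
    print(line)
```

The script length grows as (number of wake words) x (number of commands), which is worth keeping in mind when estimating recording time per participant.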

9. Special script instructions

Should specific background noises be present while the script is being read?

Example: Recordings should take place in a crowded shopping mall with lots of noise in the background.

Do the special instructions relate to how the utterances must be read by participants? If so, we’ll need to specify the distribution of these special instructions in the script.

Example: “Read this sentence more quickly than usual” or “Read this sentence louder than usual”

Audio Requirements

Audio requirements have a big impact on how we collect speech recognition data.

If you require studio-quality speech recordings, we typically turn to in-field collections or a voiceover artist, which tends to increase costs.

If you have looser audio requirements (or requirements that are more easily achievable through a phone or headset), it’s likely that remote data collection through our online data collection platform can achieve your collection goals at a more affordable cost.

10. Audio Quality

Is background noise acceptable in your speech data? And do you have any requirements for sampling rate, bit rate, peak amplitude, or SNR (signal-to-noise ratio)?

Example: If you need to control background noise, your project will be better suited to an in-person collection than to a remote collection through an app.
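To make an SNR requirement concrete, here is a rough sketch of how the ratio can be estimated in decibels from a speech segment and a noise-only segment of the same recording. The function names and the toy sample values are assumptions; a production check would also need a way to decide which parts of a recording count as speech and which as noise.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Estimate SNR in dB as 20 * log10(RMS of speech / RMS of noise)."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))


# Toy data: speech amplitude is 100 times the noise amplitude.
speech = [1000, -1000, 1000, -1000]
noise = [10, -10, 10, -10]
print(round(snr_db(speech, noise), 1))  # 40.0
```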

11. Audio Formats

Do you have audio channel requirements? As in, do you require stereo or mono recordings?

Example: Many cell phones record audio only in mono, so recording in stereo often requires an in-person collection.

Do you have any file format and compression requirements?

Example: Are you looking for 8 kHz/8-bit phone conversations or higher-quality audio at 16 kHz/16-bit?

File output requirements usually depend on the end audio that the model needs to recognize. If you are training a model for phone conversations, you will likely need audio sampled at 8 kHz so the model trains on the lower audio quality of phone lines.

If the end product requires higher-quality audio, then it also makes sense to train the system on that same audio quality.
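Format requirements like these are easy to verify automatically on delivery. The sketch below uses Python's standard `wave` module to compare a WAV file's header against a required spec; the function name `check_wav_spec`, the default spec of 16 kHz/16-bit mono, and the generated demo file are assumptions for illustration only.

```python
import os
import tempfile
import wave


def check_wav_spec(path, sample_rate=16000, bit_depth=16, channels=1):
    """Return a list of mismatches between a WAV file and the required spec."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != sample_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {sample_rate} Hz"
            )
        actual_bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        if actual_bits != bit_depth:
            problems.append(f"bit depth is {actual_bits}, expected {bit_depth}")
        if wav.getnchannels() != channels:
            problems.append(
                f"channel count is {wav.getnchannels()}, expected {channels}"
            )
    return problems


# Demo: write one second of 8 kHz, 16-bit mono silence, then check it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(8000)
    wav.writeframes(b"\x00\x00" * 8000)

print(check_wav_spec(demo_path))                    # one mismatch: sample rate
print(check_wav_spec(demo_path, sample_rate=8000))  # no mismatches
```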

12. Audio Structure

Do you have any structure or post-processing requirements for the recordings or files?

For example, do you require any leading or trailing silences? Do noises like taps or clicks need to be removed? Do files need to be stitched together?
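As one illustration of a post-processing rule, the sketch below trims leading and trailing silence from a list of integer samples using a simple amplitude threshold. The function name, the threshold of 500, and the `padding` parameter are all assumptions; real pipelines often use energy-based or model-based voice activity detection instead.

```python
def trim_silence(samples, threshold=500, padding=0):
    """Drop leading and trailing samples whose absolute amplitude is below `threshold`.

    `padding` keeps that many extra samples on each side of the detected audio.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # nothing above the threshold: the whole clip is silence
    start = max(loud[0] - padding, 0)
    end = min(loud[-1] + 1 + padding, len(samples))
    return samples[start:end]


samples = [0, 3, -2, 900, -1200, 40, 1500, 1, 0, 0]
print(trim_silence(samples))             # [900, -1200, 40, 1500]
print(trim_silence(samples, padding=1))  # [-2, 900, -1200, 40, 1500, 1]
```

Quiet samples between loud ones (the 40 above) are kept, since only the edges of the recording are trimmed.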

Delivery Requirements

You can also customize how the audio files are delivered to you.

13. Transcription or labeling requirements

Do you require speech data transcription before delivery? Do you have particular labeling, noise-marking, or segmentation rules?

14. File naming conventions

Do you have a specific file naming convention that needs to be followed? If yes, do you expect your data provider to deliver files in that format?

Depending on the complexity, requiring a particular file naming convention could require extra development and therefore could result in higher costs.
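A naming convention is simplest to enforce when it is written down as a pattern that both sides can test against. The convention below (language code, zero-padded speaker ID, zero-padded utterance ID) is purely hypothetical, as are the function names; it is only meant to show the shape of such a check.

```python
import re

# Hypothetical convention: <language>_S<speaker>_U<utterance>.wav
# e.g. en-GB_S0042_U017.wav
FILENAME_PATTERN = re.compile(r"[a-z]{2}-[A-Z]{2}_S\d{4}_U\d{3}\.wav")


def build_filename(language, speaker_id, utterance_id):
    """Build a file name that follows the hypothetical convention."""
    return f"{language}_S{speaker_id:04d}_U{utterance_id:03d}.wav"


def is_valid_filename(name):
    """True if the whole name matches the convention."""
    return FILENAME_PATTERN.fullmatch(name) is not None


name = build_filename("en-GB", 42, 17)
print(name)                                 # en-GB_S0042_U017.wav
print(is_valid_filename(name))              # True
print(is_valid_filename("recording1.wav"))  # False
```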

15. Delivery method & cadence

Do you have a specific delivery method (platform or process) to be followed? Do you have specific security requirements?

Example: Delivery over cloud storage vs. an SFTP server

Do you want the data delivered in milestones, or all the data delivered at once?

Example: 25% of the data is delivered every two weeks over a period of 2 months
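A milestone cadence like the one in the example can be laid out as a simple schedule. This sketch assumes four equal milestones spaced 14 days apart; the start date, the total of 1,000 files, and the function name are made up for illustration.

```python
from datetime import date, timedelta


def delivery_schedule(start, total_files, milestones=4, interval_days=14):
    """Return a list of (due date, cumulative files delivered) per milestone."""
    schedule = []
    for i in range(1, milestones + 1):
        due = start + timedelta(days=interval_days * i)
        # Integer division keeps counts whole; the last milestone always hits the total.
        cumulative = total_files * i // milestones
        schedule.append((due, cumulative))
    return schedule


for due, count in delivery_schedule(date(2024, 1, 1), 1000):
    print(due.isoformat(), count)
```

For a start date of 2024-01-01, the milestones fall on January 15, January 29, February 12, and February 26, with 250, 500, 750, and 1,000 files delivered cumulatively.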

Choosing a Custom Speech Data Provider

When looking for a speech data provider, it’s highly recommended that you choose a data vendor that offers a customizable, flexible setup.

Even if your speech data collection needs are fairly cookie-cutter to start, they may evolve in complexity over time.

And if you’ve handcuffed yourself to an inflexible data provider with limited abilities to innovate, you could end up losing time and money when switching over to another provider or trying to customize your setup in-house.

As innovators in the speech collection space, Summa Linguae Technologies offers flexible, customizable data services that evolve with your needs.

To see how we can help you with your data collection project, learn more about our speech data collection services here.

More Data Collection Resources

Looking for resources to assist with your data collection project? Check out these helpful resources:

Free Guides

The Ultimate Guide to Data Collection – Learn how to collect data for emerging technology.
Building an Advanced Smart Home AI – What data collection is necessary to build a modern smart home AI system?

Sample Dataset Downloads

Alexa Wake Word Dataset – 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
Eye Gaze Sample Set – Get a sample of high-quality eye gaze data.
Road, Car, and People Dataset – Training a system that requires road image data? Download our sample dataset.