Before we start collecting voice data for your speech recognition project, you have choices to make.
While some clients come to us knowing exactly how they’d like their voice data structured, the vast majority arrive with fairly loose requirements, either because those requirements are flexible or because they haven’t yet considered all the possible variables.
As a speech data provider, it’s our role to highlight all the ways in which your speech dataset can be customized, while also steering you towards the most effective and price-conscious collection option for your solution.
In this blog, we’ll cover how speech data collection can be customized by:
- languages and demographics
- collection size
- script structure
- audio requirements and formats
- delivery and processing requirements
Why does it matter? These customizations or requirements will ultimately impact:
- how the data is collected
- how participants are recruited
- how files are delivered
- the delivery timeline
- the total cost
Let’s dive into 15 different ways a speech data collection project can be customized to suit your needs.
Linguistic & Demographic Variables
Every project starts with specifying who you want to collect speech data from.
1. Target Language(s)
For which languages do you intend to collect speech data? And for each language, are the participants expected to be native, non-native, or both?
Example: Native British English speakers
2. Dialects and Non-Native Accents
Do you require a specific dialect or non-native accent? You can structure your collection to record a distribution of participants across dialects.
For example, you may want to deliberately target a range of dialects to avoid introducing systematic bias into your speech recognition algorithm.
Example: 25% Northwest US speakers, 25% Northeast US speakers, 25% Southwest US speakers, 25% Southeast US speakers
Example: Mexican Spanish-accented speakers
3. Primary Countries
For each language of the collection, should the participants be from specific countries? If not, are all countries where the language is an official language acceptable?
Example: Spanish speakers from Mexico and from Spain differ noticeably in accent and vocabulary.
For each language of the collection, should the participants currently live in a specific country?
Example: A native Brazilian Portuguese speaker may have a different accent after spending 20 years abroad.
On top of geographic location, you can also customize your data collection project by demographic variables.
You can target a specific distribution between males vs. females or children vs. adults.
Example: 50% between 18-35 years old, 50% over 35 years old
Example: 25% male children, 25% female children, 25% male adults, 25% female adults
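A percentage split like the one above has to be turned into whole-number recruitment quotas at some point. The following is a minimal sketch of one way to do that, assuming a hypothetical total of 100 participants; the function name `allocate_quotas` and the largest-remainder rounding are illustrative choices, not a description of any provider's actual process.

```python
def allocate_quotas(total_participants, shares):
    """Split a participant count across groups by fractional share.

    Uses largest-remainder rounding so the quotas always sum to the total.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1.0")
    exact = {group: total_participants * share for group, share in shares.items()}
    quotas = {group: int(value) for group, value in exact.items()}
    leftover = total_participants - sum(quotas.values())
    # Give the remaining seats to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda g: exact[g] - quotas[g], reverse=True)
    for group in by_remainder[:leftover]:
        quotas[group] += 1
    return quotas


# Shares taken from the example above; the total of 100 is an assumption.
shares = {
    "male children": 0.25,
    "female children": 0.25,
    "male adults": 0.25,
    "female adults": 0.25,
}
quotas = allocate_quotas(100, shares)
print(quotas)
```

When the total does not divide evenly (say, 10 participants across four groups), the rounding step decides which groups get the extra seats, so it is worth agreeing on that rule up front.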
These linguistic and demographic variables can usually be summed up by a short statement about your target demographic(s).
Example: Indian children speaking English, split evenly by gender.
Example: German native speakers, currently living in Germany.
Example: Native and non-native English speakers worldwide.
Collection Size
The amount of speech data you need to train your recognition algorithms determines how many participants you need to recruit and how many utterances each participant records.
5. Number of participants
How many total speakers will you need?
If the collection includes several languages, how many participants will you need per target language? And if your collection includes several demographics, what is the desired breakdown per demographic group?
Example: 50 American English speakers, 50 British English speakers
6. Number of utterances and repetitions
You’ll also need to consider the number of utterances you need per participant, or the total number of utterances needed for the collection as a whole.
Example: 20 participants x 50 utterances per participant = 1,000 total utterances
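The same arithmetic extends to collections with several languages or demographic groups. Here is a small sketch, assuming every participant records the same number of utterances; the function name `plan_collection_size` and the 50-utterance figure applied to the two English groups are hypothetical.

```python
def plan_collection_size(participants_per_group, utterances_per_participant):
    """Return utterance totals per group and for the whole collection."""
    per_group = {
        group: count * utterances_per_participant
        for group, count in participants_per_group.items()
    }
    return per_group, sum(per_group.values())


# Group sizes from the participant example; 50 utterances each is an assumption.
groups = {"American English": 50, "British English": 50}
per_group, total = plan_collection_size(groups, utterances_per_participant=50)
print(per_group)
print(total)
```

Running the numbers early makes it easier to see how a change in group size or script length affects the overall volume, and therefore the timeline and cost.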
Script Structure
You can tailor the collection’s script and workflow to the type of speech recognition data you need.
7. Scripted vs. Natural Language
Do you require scripted or natural language?
In scripted speech, each participant reads aloud what they see on a screen. This method is used to record specific voice commands or utterances.
Example: Participant reads “Alexa, turn off my music” off the screen
In unscripted speech, or natural language, participants may be given a particular scenario in which they are asked to form their own sentences. Or, they may just be asked to speak freely.
Example: Participant is asked to come up with a command to turn off the music and says “Alexa, please turn the music off now”
8. Script structure
If your collection is scripted, how many different scripts does it include? Is there a single script read by every participant, or several scripts read by different subsets of the participant pool? Or does each participant get a unique script?
Example: Half of the female participants read one script, the other half read a different script.
Does the script mix combination sentences (e.g. a wake word + a command) with regular sentences?
Example: A combination of all wake words and all commands:
[Alexa] Play songs by the Beatles
[Okay Google] Play songs by the Beatles
[Hey Siri] Play songs by the Beatles
[Alexa] What time is it?
[Okay Google] What time is it?
[Hey Siri] What time is it?
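A script like this is a cross product of wake words and commands, so it can be generated rather than written by hand. The sketch below uses the wake words and commands from the example above; the function name `build_script` and the bracketed output format are illustrative assumptions.

```python
from itertools import product

wake_words = ["Alexa", "Okay Google", "Hey Siri"]
commands = ["Play songs by the Beatles", "What time is it?"]


def build_script(wake_words, commands):
    """Pair every wake word with every command, grouped by command."""
    # Commands go in the outer position so the lines group by command,
    # matching the ordering of the example above.
    return [f"[{wake}] {command}" for command, wake in product(commands, wake_words)]


script = build_script(wake_words, commands)
for line in script:
    print(line)
```

The script length grows as the number of wake words times the number of commands, which is worth checking against the utterances-per-participant budget before recording starts.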
9. Special script instructions
Are specific background noises targeted as the script is being read?
Example: Recordings should take place in a crowded shopping mall with lots of noise in the background.
Do the special instructions relate to how the utterances must be read by participants? If so, we’ll need to specify the distribution of these special instructions in the script.
Example: “Read this sentence more quickly than usual” or “Read this sentence louder than usual”
Audio Requirements and Formats
Audio requirements have a big impact on how we collect speech recognition data.
If you require studio-quality speech recordings, we typically turn to in-field collections or a voiceover artist, which tends to increase costs.
If you have looser audio requirements (or requirements that are more easily achievable through a phone or headset), it’s likely that remote data collection through our online data collection platform can achieve your collection goals at a more affordable cost.
10. Audio Quality
Is background noise acceptable in your speech data? And do you have any sampling rate, bit depth, peak amplitude, or SNR (signal-to-noise ratio) requirements?
Example: If you need to control background noise, an in-person collection will suit your project better than a remote collection through an app.
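To make SNR and peak amplitude requirements concrete, here is a rough sketch of how they can be measured on raw samples. It assumes 16-bit PCM samples held in plain Python lists and a separate noise-only segment (for example, a pause before the speaker starts); the function names are hypothetical, and the SNR figure is an approximation because the speech segment also contains some noise.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Approximate SNR in decibels from a speech segment and a noise-only segment."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))


def peak_dbfs(samples, full_scale=32768):
    """Peak level relative to full scale (16-bit PCM assumed by default)."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return float("-inf")  # Digital silence has no finite peak level.
    return 20 * math.log10(peak / full_scale)


# Synthetic values chosen only to illustrate the calculation.
speech = [1000, -1000] * 100
noise = [10, -10] * 100
print(round(snr_db(speech, noise), 1))
print(round(peak_dbfs(speech), 1))
```

A requirement phrased as "SNR of at least N dB" or "peaks below N dBFS" can then be checked automatically on each recording instead of by ear.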
11. Audio Formats
Do you have audio channel requirements? As in, do you require stereo or mono recordings?
Example: Many cell phones record audio in mono by default, so stereo recordings typically require an in-person collection.
Do you have any file format and compression requirements?
Example: Are you looking for 8 kHz/8-bit phone conversations or higher-quality audio at 16 kHz/16-bit?
File output requirements usually depend on the end audio that the model needs to recognize. If you are training a model for phone conversations, you will likely need audio sampled at 8 kHz so that the training data matches the lower quality of phone lines.
If the end product requires higher-quality audio, then it also makes sense to train the system on that same audio quality.
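Format requirements like these are easy to verify on delivered files. The sketch below uses Python's standard `wave` module, which reads uncompressed PCM WAV files; the function names, the temporary sample file, and the 16 kHz/16-bit/mono target are assumptions made for illustration.

```python
import os
import tempfile
import wave


def wav_format(path):
    """Read sample rate, bit depth, and channel count from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "sample_rate_hz": wav.getframerate(),
            "bit_depth": wav.getsampwidth() * 8,
            "channels": wav.getnchannels(),
        }


def meets_spec(path, sample_rate_hz, bit_depth, channels):
    """Check a file against the required format."""
    return wav_format(path) == {
        "sample_rate_hz": sample_rate_hz,
        "bit_depth": bit_depth,
        "channels": channels,
    }


# Write one second of 16 kHz, 16-bit, mono silence to have a file to inspect.
path = os.path.join(tempfile.mkdtemp(), "sample.wav")
with wave.open(path, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)

print(wav_format(path))
print(meets_spec(path, 16000, 16, 1))
```

Compressed formats or non-PCM telephone codecs would need a different reader, so the check above only covers plain WAV deliveries.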
12. Audio Structure
Do you have any structure or post-processing requirements for the recordings or files?
For example, do you require any leading or trailing silences? Do noises like taps or clicks need to be removed? Do files need to be stitched together?
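As one illustration of this kind of post-processing, the sketch below trims leading and trailing silence from a list of samples using a simple amplitude threshold. The function name, the threshold of 500, and the `keep` padding parameter are all hypothetical; production pipelines often use energy-based or model-based voice activity detection instead.

```python
def trim_silence(samples, threshold=500, keep=0):
    """Drop quiet samples from both ends, keeping `keep` samples of padding.

    A sample counts as quiet when its absolute value is below `threshold`.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []  # The whole recording is below the threshold.
    start = max(loud[0] - keep, 0)
    end = min(loud[-1] + 1 + keep, len(samples))
    return samples[start:end]


recording = [0, 0, 10, 900, -800, 5, 0]
print(trim_silence(recording))
print(trim_silence(recording, keep=1))
```

The `keep` parameter covers the opposite requirement as well: a spec that asks for a fixed amount of leading or trailing silence can be met by trimming first and then retaining a known number of samples.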
Delivery and Processing Requirements
You can also customize how the audio files are delivered to you.
13. Transcription or labeling requirements
Do you require speech data transcription before delivery? Do you have particular labeling, noise-marking, or segmentation rules?
14. File naming conventions
Do you have a specific file naming convention that needs to be followed? If yes, do you expect your data provider to deliver files in that format?
Depending on its complexity, a particular file naming convention could require extra development and therefore result in higher costs.
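A naming convention is easiest to enforce when it is written down as a template plus a validation rule. The following sketch assumes a made-up convention of language code, speaker ID, and utterance ID; the pattern, field names, and function names are hypothetical and would be replaced by whatever convention you specify.

```python
import re

# Hypothetical convention: <language>_spk<4-digit speaker>_utt<3-digit utterance>.wav
FILENAME_TEMPLATE = "{language}_spk{speaker_id:04d}_utt{utterance_id:03d}.wav"
FILENAME_RE = re.compile(r"[a-z]{2}-[A-Z]{2}_spk\d{4}_utt\d{3}\.wav")


def build_filename(language, speaker_id, utterance_id):
    """Render a file name from recording metadata."""
    return FILENAME_TEMPLATE.format(
        language=language, speaker_id=speaker_id, utterance_id=utterance_id
    )


def is_valid_filename(name):
    """Check that a delivered file name follows the convention."""
    return FILENAME_RE.fullmatch(name) is not None


name = build_filename("en-GB", speaker_id=12, utterance_id=7)
print(name)
print(is_valid_filename(name))
print(is_valid_filename("recording_final_v2.wav"))
```

Keeping the template and the validator side by side lets both the provider and the client check a delivery against the same rule.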
15. Delivery method & cadence
Do you have a specific delivery method (platform or process) to be followed? Do you have specific security requirements?
Example: Delivery over cloud storage vs. an SFTP server
Do you want the data delivered in milestones, or all the data delivered at once?
Example: 25% of the data is delivered every two weeks over a period of two months
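A milestone cadence like this can be laid out as a simple schedule. The sketch below assumes equal-sized batches and a hypothetical project start date of 1 January 2024; the function name `delivery_schedule` is illustrative.

```python
from datetime import date, timedelta


def delivery_schedule(start, batches, interval_days):
    """Return (delivery date, percent of data) pairs for equal-sized batches."""
    share = 100 / batches
    return [
        (start + timedelta(days=interval_days * (i + 1)), share)
        for i in range(batches)
    ]


# Four deliveries, 14 days apart; the start date is an assumption.
schedule = delivery_schedule(date(2024, 1, 1), batches=4, interval_days=14)
for delivery_date, percent in schedule:
    print(delivery_date.isoformat(), f"{percent:.0f}%")
```

Changing `batches` or `interval_days` shows quickly how a different cadence would stretch or compress the overall timeline.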
Choosing a Custom Speech Data Provider
When looking for a speech data provider, we highly recommend choosing a vendor that offers a customizable, flexible setup.
Even if your speech data collection needs are fairly cookie-cutter to start, they may grow more complex over time.
If you’ve handcuffed yourself to an inflexible data provider with limited ability to innovate, you could end up losing time and money when switching to another provider or trying to customize your setup in-house.
As innovators in the speech collection space, Summa Linguae Technologies offers flexible, customizable data services that evolve with your needs.
To see how we can help you with your data collection project, learn more about our speech data collection services.
More Data Collection Resources
Looking for resources to assist with your data collection project? Check out these helpful resources:
- The Ultimate Guide to Data Collection – Learn how to collect data for emerging technology.
- Building an Advanced Smart Home AI – What data collection is necessary to build a modern smart home AI system?
Sample Dataset Downloads
- Alexa Wake Word Dataset – 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
- Eye Gaze Sample Set – Get a sample of high-quality eye gaze data.
- Road, Car, and People Dataset – Training a system that requires road image data? Download our sample dataset.