If you’re building a voice recognition system or conversational AI, you’re going to need plenty of training and testing data.
But where can you actually find quality speech recognition data? And how do you find voice recordings that have the exact training specifications you need?
The good news: you have options.
If all you need is a generic dataset, there are hundreds of public speech datasets available online.
But if you’re like most voice developers and you need speech data that’s tailored to your solution’s exact use cases, you will need to collect your own data.
Here’s where to find or collect speech data for your machine-learning algorithms, along with the pros and cons of each method.
1. Your Customer Speech Data
The most natural place to start is your own proprietary speech data.
If your company has the legal right and sufficient user consent to collect and use your own customer data, then you may already have a speech data training set at your fingertips.
Pros
No additional cost or time – While there’s an upfront investment in obtaining and processing the data, you won’t have to take on any additional collection costs.
Data relevance – If the data is coming from customers using your application, it’s likely already tailored to your solution’s use cases.
Natural data – Data collected straight from your product or device likely offers a rich variety of acoustic environments and users.
Cons
Legal restrictions – It can be legally challenging and costly to get sufficient permissions to record and make use of your users’ speech data.
Language & demographic gaps – Limitations of your existing product, customer base, or collection methodology may exclude certain target languages or demographics, or may skew the data towards one demographic.
Processing costs – Most in-house-collected speech data still requires processing, like transcription, tagging, or bucketing, which is often outsourced to a data vendor.
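To make the language and demographic gaps concrete, here is a minimal sketch of a coverage check over recording metadata. The field names (`language`, `age_group`), the sample rows, and the target language set are all hypothetical; your own metadata will look different.

```python
from collections import Counter

# Hypothetical metadata rows, one per recording. In practice these would
# come from your own product logs or collection records.
recordings = [
    {"language": "en-US", "age_group": "18-29"},
    {"language": "en-US", "age_group": "30-44"},
    {"language": "en-US", "age_group": "30-44"},
    {"language": "es-MX", "age_group": "18-29"},
]

# Hypothetical set of languages the product is meant to support.
target_languages = {"en-US", "es-MX", "fr-CA"}


def coverage(rows, field):
    """Return each value's share of the recordings for one metadata field."""
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def missing_values(rows, field, targets):
    """Return the target values that never appear in the recordings."""
    present = {row[field] for row in rows}
    return sorted(targets - present)


language_share = coverage(recordings, "language")
missing_languages = missing_values(recordings, "language", target_languages)

print(language_share)
print(missing_languages)
```

A skewed share or a non-empty list of missing values is a sign that your existing customer data will need to be supplemented from one of the other sources below.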
But what if you don’t have your own speech data?
2. Public Speech Datasets and Corpora
There are hundreds of publicly available speech recognition datasets that can serve as a great starting point.
These datasets are gathered as part of public, open-source research projects with the goal of fostering innovation in the speech technology community.
This category also includes data scraped from publicly available sources, such as YouTube.
Some popular public speech datasets include:
Pros
Free – This is great news if you don’t have a budget for data collection.
Fast – These datasets are all available for immediate download.
Lots of data – There are hundreds of datasets available, both unscripted and scripted, so if you’re purely after quantity of speech samples, this may be the best solution for you.
Cons
Processing costs – The majority of these datasets require significant pre-processing and quality assurance before they can be fed into a machine learning algorithm.
Generic – These speech samples are generic, so while they may be helpful for building a generic speech recognition system, they won’t help you train and test on your product’s specific use cases.
Low quality – As many of these databases are collected through open-source user submissions, they vary widely in audio quality.
Limited languages – While the number of speech databases available in different languages is growing, they are typically biased towards popular languages, like English.
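As an illustration of the quality assurance step mentioned above, the sketch below checks a WAV file against a target audio specification using Python's standard `wave` module. The target values (16 kHz, mono, at least half a second) are assumptions chosen for the example, not requirements of any particular dataset or model.

```python
import wave

# Hypothetical target specification; adjust to what your model expects.
TARGET_SAMPLE_RATE = 16000
TARGET_CHANNELS = 1
MIN_DURATION_SECONDS = 0.5


def check_wav(path):
    """Return a list of problems found in one WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / sample_rate

    problems = []
    if sample_rate != TARGET_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {TARGET_SAMPLE_RATE} Hz"
        )
    if channels != TARGET_CHANNELS:
        problems.append(f"{channels} channels, expected {TARGET_CHANNELS}")
    if duration < MIN_DURATION_SECONDS:
        problems.append(f"only {duration:.2f} s long")
    return problems
```

Running `check_wav` over every file in a downloaded dataset and setting aside the files that return a non-empty list is one simple way to separate usable samples from those that need resampling or removal. It only inspects the file header, so it will not catch noisy or clipped audio.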
3. Pre-Packaged Speech Datasets
If you don’t have your own data and a public dataset doesn’t suit your needs, that’s when you’ll have to explore purchasing data or collecting your own.
Pre-packaged datasets are speech datasets that have already been collected by a data vendor for the purpose of resale to multiple clients. Their main benefit is that they are available for immediate download.
These datasets can be quite general—like a pronunciation database, where native speakers of a language read a large number of words. But they can also be created for very specific applications.
For example, Summa Linguae Technologies has pre-packaged speech datasets that include:
- Phone conversation data: Conversations in Dutch, Japanese, and Irish English
- Alexa wake words: Spoken in English, Italian, Spanish, and French
Pros
If you’re lucky – You may be fortunate enough that there’s already been a collection for your specific use case, or for the languages or demographics you’re targeting.
Price – In that case, pre-collected datasets can occasionally be more affordable than collecting new data.
Speed – These datasets can typically be delivered in a matter of days.
Cons
Not customized – Because the data is pre-packaged, you won’t be able to customize the dataset to your needs. This could mean limited languages, dialects, demographics, audio specifications, or transcription options.
Not scalable – You’re confined to the data that was already collected. Collecting any new data requires an entirely new collection project.
Lack of ownership – This data can also be purchased by any other company, meaning it’s not unique to your application.
4. Custom Remote-Collected or Crowd-Sourced Datasets
If you’re building a voice application, it’s unlikely you’ll find an existing dataset that covers all of your training use cases.
For example, if you’re building a banking voice recognition app, you’ll need speech samples relating to bank withdrawals, statement balances, and deposits. It’s unlikely any pre-made dataset will cover those cases.
That’s when you’ll have to collect your own data, or collect data through a data solutions provider. For example, at Summa Linguae, we specialize in collecting speech data for any application in a variety of languages, dialects, and accents.
When it comes to collecting speech data, you have two options: remote collection or in-person collection.
Remote-collected speech data is collected through mobile apps or web browser platforms from a trusted crowd. Participants are recruited online based on their language and demographic profile. They’re then asked to record speech samples by reading prompts off their screen or by speaking through a variety of scenarios.
For most data collection projects, remote collection is the best option, as it is affordable, scalable, and highly customizable to your needs.
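For scripted collections, the prompts that participants read are often built from templates. The following is a minimal sketch of that idea for the banking example, assuming hypothetical templates and slot values; it is not a description of any vendor's actual tooling.

```python
from itertools import product

# Hypothetical prompt templates and slot values for a banking voice app.
templates = [
    "What is my {account} balance?",
    "Deposit {amount} into my {account} account.",
    "Withdraw {amount} from my {account} account.",
]
slots = {
    "account": ["checking", "savings"],
    "amount": ["twenty dollars", "one hundred dollars"],
}


def expand(template, slots):
    """Fill a template with every combination of the slot values it uses."""
    names = [name for name in slots if "{" + name + "}" in template]
    combos = product(*(slots[name] for name in names))
    return [template.format(**dict(zip(names, combo))) for combo in combos]


prompts = [prompt for template in templates for prompt in expand(template, slots)]

for prompt in prompts:
    print(prompt)
```

Each generated prompt can then be assigned to several participants so that the same phrase is recorded across different voices, accents, and devices.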
Pros
Customizable – You can structure the collection to your exact training data specifications.
Lower cost – Remote collection is more affordable than in-person collection.
Turnaround time – After the project kicks off, remote-collected data can typically be turned around within a few weeks.
Variety of speech data – You can collect different types of speech data, including command-based, scenario-based, or unscripted speech.
Scalable and flexible – Should you need to collect additional data, the infrastructure is in place to quickly and affordably collect more.
Access to participants – Collecting from a trusted crowd allows for access to any language, dialect, accent, or demographic.
Post processing options – As part of the collection project, you can specify your exact transcription and labeling requirements before the data is delivered to you.
Data ownership – Because you’ve collected this data yourself, the data won’t be accessible by any of your competitors.
Cons
Limited audio options – Because data is collected remotely from participants’ cellphones or headsets, you have fewer choices when it comes to audio or microphone specifications.
Limited acoustic scenarios – If you require a particular acoustic scenario, like certain types of background noise, you may need to opt for in-person collection.
5. In-Person or Field-Collected Speech Datasets
In-person collection is typically a larger investment than collecting data remotely.
That said, in-person data collection is the best collection option for clients who have specific audio or equipment requirements that otherwise can’t be achieved remotely.
For example, you may want to collect voice recordings from the actual microphone used in your speech recognition device. In that case, you would send your device to us at Summa Linguae, and we would record participants in person.
Pros
Customizable – In-person data collection is the most customizable option, as you can control every factor of the collection.
Equipment flexibility – In-person collection allows you to record with any hardware device, microphone, or camera.
Audio specifications – As a result, you can achieve any audio specifications needed for your training and testing data.
Variety of data – Not only can you collect any form of speech data, you also have the opportunity to simultaneously collect video data (like participants’ faces).
Natural data – With in-person collection, you can record audio in the natural acoustic environment where your technology will be used.
Post processing options – As with a remote collection project, the data can be delivered to you fully transcribed and labeled.
Data ownership – Again, collecting your own data means you have full proprietary ownership.
Cons
Cost – In-field collection is the most expensive collection method, as it can involve travel and building or shipping specialized recording equipment.
Turnaround time – More sophisticated in-person collections take longer to deliver than remote-collected or pre-packaged data.
Participant recruitment – In-field collection doesn’t offer the participant recruitment convenience of remote collection.
Start your speech collection project
If you need high-quality speech data for a voice recognition solution, Summa Linguae is the best place to start.
We collect speech data from any country in any language, dialect, or non-native accent.