Data collection will do the heavy lifting when it comes to the future of speech recognition software.
Speech recognition software enables phones, computers, tablets, and other machines to receive, recognize, and understand human utterances.
It uses natural language as input to trigger an action, enabling our devices to respond to our spoken commands.
Speech technology is being used to replace other, more ‘tired’ methods of input like typing, texting, and clicking. It’s a slightly ironic development, seeing as texting and typing had become the preferred method of communication over voice calls just a few short years ago.
The ability to talk to your devices has expanded to encompass most of the technology that we use in our daily lives, and its success is built largely on data collection.
As we stand at the precipice of a world soon to be dominated by talking devices – and potentially, technologies with a consciousness – let’s take a look back at how it all started.
A History of Speech Recognition
The first official example of our modern speech recognition technology was “Audrey”, a system designed by Bell Laboratories in the 1950s.
Audrey, which occupied an entire room, was able to recognize only 9 digits (numbers 1-9) spoken by its developer, but it did so with an impressive 90% accuracy.
It was intended to help toll operators take more phone calls over the wire, but its high cost and inability to recognize a wide range of voices made it impractical.
The next real advancement took about a decade to develop. Premiering at the World’s Fair in 1962, IBM’s Shoebox was able to recognize and differentiate between 16 words.
Up to this point, speech recognition was still laborious. The earlier systems were set up to recognize and process bits of sound (phonemes). IBM engineers programmed the machines to use the sound and pitch of each phoneme as a clue to determine what word was being said.
Then, the system would try to match the sound as closely as it could to the preprogrammed tonal information it had. The technology was, at the time, quite advanced for what it was.
However, users had to make pauses and speak slowly to ensure the machine would pick up what was being said.
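The template-matching approach described above can be illustrated with a toy sketch. The feature values and phoneme templates below are invented for demonstration; real early systems like Shoebox relied on analog circuitry rather than code.

```python
# Illustrative sketch only: a toy version of early phoneme template matching.
# The (pitch, energy) feature values below are hypothetical.

def closest_template(sound_features, templates):
    """Return the phoneme whose stored template is nearest to the input sound."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(templates, key=lambda phoneme: distance(sound_features, templates[phoneme]))

# Hypothetical (pitch, energy) templates for three phonemes.
templates = {
    "AH": (120.0, 0.8),
    "EE": (210.0, 0.6),
    "OO": (95.0, 0.7),
}

print(closest_template((200.0, 0.65), templates))  # prints "EE"
```

The same weakness the article notes follows directly from this design: any sound that falls between templates, or any voice whose pitch differs from the developer's, matches the wrong entry.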
A Growing Vocabulary
In the early 1970s, the Department of Defense began to recognize the value of speech recognition technology. The ability for a computer to process natural human language could prove invaluable in any number of areas in the military and national defense.
So, they invested five years into DARPA’s Speech Understanding Research program—one of the largest programs of its kind in the history of speech recognition.
One of the more prominent inventions to come from this research program was Carnegie Mellon’s “Harpy”, a system that was able to recognize over 1,000 words—the vocabulary of an average toddler.
In the late 1970s and 1980s, speech recognition systems became common enough that they were making their way into children’s toys. In 1978, Texas Instruments introduced the Speak & Spell, which used a speech chip to help children spell out words. That speech chip would prove to be an important tool for the next phase in speech recognition software.
In 1987, the Worlds of Wonder “Julie” doll came out. In an impressive (if not downright terrifying) display, Julie was able to respond to a speaker and had the capacity to distinguish between speakers’ voices.
From Acoustics to Linguistics
The ability to distinguish between speakers was not the only advancement made during this time. Scientists started abandoning the notion that speech recognition had to be purely acoustically based.
Instead, they moved more towards natural language processing (NLP). Instead of just using sounds, scientists turned to algorithms to program systems with the rules of the English language.
So, if you were speaking to a system that had trouble recognizing a word you said, it would be able to give an educated guess by assessing its options against correct syntactic, semantic, and tonal rules.
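That “educated guess” can be sketched as re-ranking acoustically similar candidates with simple language knowledge. The candidate words and likelihood scores below are invented for demonstration, not taken from any real system.

```python
# Illustrative sketch: picking between acoustically similar candidates using
# a toy table of word-pair likelihoods. All scores are invented.

def best_candidate(previous_word, candidates, pair_scores):
    """Pick the candidate that most plausibly follows the previous word."""
    return max(candidates, key=lambda w: pair_scores.get((previous_word, w), 0.0))

# "speech" and "beach" can sound alike; context makes one far more likely.
pair_scores = {
    ("recognize", "speech"): 0.9,
    ("recognize", "beach"): 0.1,
}

print(best_candidate("recognize", ["speech", "beach"], pair_scores))  # prints "speech"
```

Modern systems use statistical language models learned from data rather than hand-written rules, but the core idea—letting linguistic context break acoustic ties—is the same.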
Three short years after Julie, the world was introduced to Dragon, debuting its first speech recognition system, the “Dragon Dictate”.
Around the same time, AT&T was playing with over-the-phone speech recognition software to help field their customer service calls. In 1997, Dragon released “Naturally Speaking,” which allowed for natural speech to be processed without the need for pauses.
What started out as a painfully simple and often inaccurate system is now easy for customers to use.
Developments in speech recognition software plateaued for over a decade as technology fought to catch up to our hopes for innovation. Recognition systems were limited by their processing power and memory, and still had to “guess” what words were being said based on phonemes.
This proved difficult for people around the globe with second-language accents or different regional vocabularies. Early speech recognition products were not localized or globalized by any means, and thus were only successful in specific markets.
In 2010, Google made a game-changing development that brought speech recognition technology to the forefront of innovation: the Google Voice Search app.
It aimed to reduce the hassle of typing on your phone’s tiny keyboard and was the first of its kind to utilize cloud data centers. It was also personalized to your voice and was able to ‘learn’ your speech patterns for higher accuracy.
This also paved the way for Siri, which debuted one year later.
Siri became instantly famous for her incredible ability to accurately process natural utterances, as well as her ability to respond using conversational language.
Siri’s success brought speech recognition technology to the forefront of innovation and technology.
Speech Recognition Software Today
With the ability to respond using natural language and to learn using cloud-based processing, Siri spurred the further development of voice assistants and speech recognition software.
Voice-Activated Digital Assistants
Google Assistant and Siri have since been joined by Amazon’s Alexa as modern speech recognition technologies that have become part of everything from computers and smartphones to cars, fridges, watches and video games.
Each supports a wide variety of languages, and can respond to various queries and requests with a great deal of accuracy.
These devices act as points of connection between multiple devices and corresponding apps to make life easier.
You can learn more about the future of voice assistant software here.
Efficient speech-to-text software makes it simple to convert speech into text. Transcription software, for example, has been quite useful on desktop computers, but is now even more accessible on smartphones and mobile devices.
On average, humans can speak 150 words per minute, but can only type 40 words per minute. Therefore, voice commands can create efficiency, but only if the technology can recognize the words being spoken to it.
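A quick back-of-the-envelope check of that speed claim, using the averages above and an assumed 600-word document:

```python
# Back-of-the-envelope comparison of dictation vs. typing, using the
# average rates cited above. The 600-word document length is an assumption.
SPEAK_WPM = 150  # average speaking rate, words per minute
TYPE_WPM = 40    # average typing rate, words per minute

words = 600
speaking_minutes = words / SPEAK_WPM  # 4.0 minutes
typing_minutes = words / TYPE_WPM     # 15.0 minutes
print(f"Dictation is {typing_minutes / speaking_minutes:.2f}x faster")  # prints "Dictation is 3.75x faster"
```

Of course, that 3.75x advantage only materializes if the recognizer's error rate is low enough that correcting mistakes doesn't eat the time saved.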
Progress has certainly been made, though.
Look at the Dragon programs from Nuance, for example. According to Tech Radar, Dragon’s offerings rank at the top of your speech recognition software options for 2021.
Nuance says the software can take dictation at an equivalent typing speed of 160 words per minute with a 99% accuracy rate.
Different speech-to-text programs have varying levels of ability and complexity, and machine learning is continually being used to fix errors flagged by users to help improve the technology.
Vehicle Speech Recognition
Tech companies have reduced the distraction of looking down at your mobile phone while you drive by developing in-car systems that respond to voice commands.
You can tell your car where you need to go, and it’ll give you navigation prompts or change the music you’re listening to. You can also tell it who to call or what to send in a text message, all with your hands on the wheel.
It’s important to note this is all relatively new and some kinks are still being worked out. Studies have found that voice-activated technology in cars can cause higher levels of cognitive distraction.
If self-driving cars become the norm, those risks may be reduced altogether.
The Future: Accurate, Localized, and Ubiquitous
Thanks to ongoing data collection projects and cloud-based processing, many larger speech recognition systems no longer struggle with recognizing accents.
They have improved their ability to ‘hear’ a wider variety of words, languages, and accents. This is achieved through major data collection projects and with the help of language experts from all over the globe.
Here’s an example.
Sonos was developing a link between their wireless speakers and smart home assistants, and they sought speech data from three places—the USA, the UK, and Germany—broken down by varying age groups.
They specifically needed wake word data, like Amazon’s “Alexa” and Google’s “Hey Google.” This data would be used to test and tune the wake word recognition engine, ensuring that users of all demographics or dialects have an equally great voice experience on Sonos devices.
The project required strict sampling demographics and proportions. Participants were picked meticulously, ranging in age from 6 to 65, with a 1:1 ratio of males to females, and tracked according to their accents.
In the US, this also included participants of varying ethnic descent: Southeast Asian, Indian, Hispanic, and European.
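A sampling plan like the one above can be sketched as quota allocation across demographic cells. The age brackets and total sample size below are assumptions for illustration; only the 1:1 gender ratio comes from the project description.

```python
# Illustrative quota-allocation sketch for a speech data collection project.
# Age brackets and the 240-participant total are hypothetical; the even
# male/female split reflects the 1:1 ratio described above.

def quota_plan(total, age_brackets, genders=("male", "female")):
    """Split a target sample size evenly across age brackets and genders."""
    per_cell = total // (len(age_brackets) * len(genders))
    return {(bracket, g): per_cell for bracket in age_brackets for g in genders}

plan = quota_plan(total=240, age_brackets=["6-17", "18-34", "35-50", "51-65"])
print(plan[("18-34", "female")])  # prints 30 (participants per cell)
```

Real projects typically layer further strata on top of this—accent, region, and ethnicity, as in the US collection described above—which multiplies the number of cells to fill.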
In the end, Sonos was able to extend their speakers’ voice recognition capabilities to additional English and German dialects.
Beyond what we’ve already discussed, these types of projects will further open the door to a wide variety of voice-controlled devices that can be combined with the leading digital assistants’ voice technology, including:
- household appliances
- security devices and alarm systems
- personal assistants
If you’re looking for resources to assist with your data collection project, check out these helpful resources:
- The Ultimate Guide to Data Collection – Learn how to collect data for emerging technology.
- Building an Advanced Smart Home AI – What data collection is necessary to build a modern smart home AI system?
Sample Dataset Downloads
- Alexa Wake Word Dataset – 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
- Phone Conversation Data Set – Transcribed phone conversation recordings in Dutch, Japanese, and Irish English.
Get the Most Out of Your Speech Recognition Software
As innovators in the speech collection space, Summa Linguae Technologies offers you flexible, customizable data solutions that evolve with your needs.
To see how we can help you with your data collection project, learn more about our speech data collection services here.