How Does Speech Recognition Technology Work?

Last Updated July 12, 2021

Surrounded by smartphones, TVs, tablets, speakers, laptops, automated cars and more, we take for granted how much work goes into creating speech recognition technology.

It seems straightforward to us now, but for every breakthrough that has been made in speech recognition, there have been countless failures and dead ends.

But between 2013 and 2017, Google’s word accuracy rate rose from 80% to an impressive 95%, and it was expected that 50% of all Google searches would be voice queries in 2020.

That represents staggering growth, but the apparent ease of interacting with digital assistants hides an enormous amount of complexity.

It took decades to develop speech recognition technology, and we have yet to reach its zenith.

In this article, we will outline how speech recognition technology works, and the obstacles that remain along the path of perfecting it.

The Basics of Speech Recognition Technology

At its core, speech recognition technology is the process of converting audio into text for the purpose of conversational AI and voice applications.

Speech recognition can be broken down into three stages:

1. Automatic speech recognition (ASR), which converts the spoken audio into text
2. Natural language processing (NLP), which interprets the meaning of that text
3. Text-to-speech (TTS), which verbalizes a response

Where we see this play out most commonly is with virtual assistants. Think Amazon Alexa, Apple’s Siri, and Google Home. We speak, they interpret what we are trying to ask of them, and they respond to the best of their programmed abilities.

The process begins with ASR digitizing a recorded speech sample. The speaker's audio signal is broken up into discrete segments, and the frequencies within each segment are visualized in the form of spectrograms.

The spectrograms are further divided into timesteps using the short-time Fourier transform.
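The framing-and-transform step can be sketched in a few lines of Python. This is a minimal illustration assuming a Hann window and 50% overlap; real ASR front ends typically add mel filtering and log scaling on top of the raw spectrogram:

```python
import numpy as np

# Minimal short-time Fourier transform sketch: slide a window across the
# signal and take the FFT magnitude of each windowed frame (one timestep).
def stft(signal, frame_len=256):
    hop = frame_len // 2                      # 50% overlap between frames
    window = np.hanning(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop)]
    # One column of FFT magnitudes per timestep = a spectrogram.
    return np.abs(np.fft.rfft(frames, axis=1)).T

sr = 8000                                     # sample rate in Hz
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)            # one second of a 440 Hz tone
spec = stft(tone)
print(spec.shape)                             # (frequency bins, timesteps)
```

The energy concentrates in the frequency bin nearest 440 Hz at every timestep, which is exactly the pattern a recognizer learns to map to sounds.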

Each spectrogram is analyzed and transcribed by an NLP algorithm that predicts the probability of each word in the language's vocabulary. A contextual layer is added to help correct potential mistakes: the algorithm considers both what was said and the likeliest next word, based on its knowledge of the given language.
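That contextual layer can be illustrated with a toy bigram model. The counts below are invented for illustration; real systems use far larger models trained on enormous text corpora:

```python
# Toy bigram language model: how often each word follows another.
# (Hypothetical counts, chosen only to show how context resolves
# similar-sounding candidates.)
bigram_counts = {
    ("ice", "cream"): 90,
    ("ice", "scream"): 2,
    ("i", "scream"): 40,
}

def pick(prev_word, candidates):
    """Return the candidate word most likely to follow prev_word."""
    return max(candidates, key=lambda w: bigram_counts.get((prev_word, w), 0))

print(pick("ice", ["cream", "scream"]))  # "cream"
print(pick("i", ["cream", "scream"]))    # "scream"
```

The acoustics alone can't distinguish "ice cream" from "I scream"; the preceding word is what tips the decision.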

Finally, the device will verbalize the best possible response to what it has heard and analyzed using TTS.

It’s not all that unlike how we learn language as children.

From day one of a child’s life, they hear words used all around them. Parents speak to the child knowing they can’t answer yet, but even though the child doesn’t respond, they are absorbing all kinds of verbal cues, including intonation, inflection, and pronunciation.

This is the input stage. The child’s brain is forming patterns and connections based on how their parents use language. Though humans are hardwired to listen and understand, we train our entire lives to apply this natural ability to detecting patterns in one or more languages.

It takes five or six years to be able to have a full conversation, and then we spend the next 15 years in school collecting more data and increasing our vocabulary. By the time we reach adulthood, we can interpret meaning almost instantly.

Speech recognition technology works in a similar way. The speech recognition software breaks the speech down into bits it can interpret, converts it into a digital format, and analyzes the pieces of content.

It then makes determinations based on previous data and common speech patterns, making hypotheses about what the user is saying. After determining what the user most likely said, the smart device can offer back the best possible response.

But whereas humans have refined this process over millennia, we are still figuring out the best practices for AI. We have to train machines the same way our parents and teachers trained us, and that involves a lot of manpower, research, and innovation.

Speech Recognition Technology in Action

Shazam is a great example of how speech recognition technology works. The popular app, purchased by Apple in 2018 for $400M, can identify music, movies, commercials, and TV shows based on a short audio sample using the microphone on your device.

When you hit the Shazam button, you are starting an audio recording of your surroundings. It can differentiate the ambient noise from the intended source material, identify the song’s pattern, and compare the audio recording to its database. It will then track down the specific track that was playing and supply the information to its curious end-user.
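That matching step can be sketched in miniature. This is a toy illustration: Shazam's actual system derives compact hashes from pairs of spectrogram peaks, whereas the database and hash values below are invented:

```python
from collections import Counter

# Hypothetical database: song name -> set of audio fingerprint hashes.
database = {
    "Song A": {101, 102, 103, 104, 105},
    "Song B": {201, 202, 203, 204, 205},
}

def identify(sample_hashes):
    """Return the song whose fingerprints overlap most with the sample."""
    scores = Counter()
    for song, hashes in database.items():
        scores[song] = len(hashes & sample_hashes)
    best, score = scores.most_common(1)[0]
    return best if score > 0 else None

print(identify({103, 104, 105, 999}))  # "Song A"
print(identify({777}))                 # None: no match in the database
```

The hash 999 plays the role of ambient noise: a stray fingerprint that matches nothing, which the overlap count simply ignores.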

While this is a nice, simple example among the many recent innovations in speech technology, the process isn't always that clean.

The Challenges of Accurate Speech Recognition Technology

Imagine this: You’re driving around and make a voice request to call your friend Justin, but the software misunderstands you.

Instead, it starts blasting Justin Bieber’s latest infuriatingly catchy song. As you hurriedly attempt to change the song, you are obviously not in prime condition to be watching the road.

Speech recognition technology isn’t just about helping you to answer a trivia question, nor is it solely intended to make life easier. It’s also about safety, and as beneficial as speech recognition technology may seem in an ideal scenario, it’s proven to be potentially hazardous when implemented before it has high enough accuracy.

Let’s look at the two main areas where challenges are most present.

Language & Speaker Differences

The quest for precision becomes increasingly complex when a device or software is geared towards multiple different markets around the world.

Engineers have to program the ability to understand countless more variations, including specific languages, dialects, and accents. That requires the collection of massive amounts of data.

English speech recognition technology developed for North American accents and dialects does not work well in other parts of the world. For global projects, collecting data that covers each target market's languages, dialects, and accents is essential.

Recording Challenges

Background noises can easily throw a speech recognition device off track because it doesn’t inherently have the ability to distinguish between your unique voice and sounds like a dog barking or a helicopter flying overhead.

Engineers have to program that ability into the device. They collect specific data that includes these ambient sounds and then program the device to filter them out.

The device or software separates the noise (individualistic vocal patterns, accents, ambient sounds, and so on) from the keywords and turns it into text that the software can understand.

Background noise is only one of several recording challenges engineers face.

There are a few ways around these issues. They’re typically solved through customized data collection projects.

Voiceover artists can be recruited to record specific phrases with specific intonations, or in-field collection can be used to collect speech in a more real-world scenario. For example, we collected speech data for Nuance directly from the cabin of a car to simulate the in-car audio environment.

So next time Siri fails to understand your existential questions, or your Amazon Alexa plays the wrong music, remember that this technology is mind-blowingly complicated and still impressively accurate.

A Work in Progress

Summa Linguae Technologies collects and processes training and testing data for AI-powered solutions, including voice assistants, wearables, autonomous vehicles, and more.

We have worked on a number of speech recognition-related projects, including in-car speech recognition data collection and voice-controlled fitness wearables.

Through these projects and many more, we have seen first-hand the complexities of speech recognition technology, and devised data solutions to help make the devices more usable and inclusive.

Contact us today to see how we can help your company with data solutions for your speech recognition technology.
