How Does Speech Recognition Technology Work?

Speech recognition software is an ever-increasing part of our lives, though thankfully not in the way the sci-fi movies of the ’90s led us to believe.

Whether it’s Siri, Cortana, Amazon Alexa or (on April Fools’ Day) Google Gnome or Petlexa, we are talking to digital assistants more than ever.

You may remember the first time speech recognition technology was made available to the general population and the headaches that came with it.

But between 2013 and 2017, Google’s word accuracy rate rose from 80% to an impressive 95%.

Speech recognition is becoming so easy to use and so accurate that experts predict 50% of all web searches will be made using voice by 2020. And that’s only one year away.

Safety and Usability

Speech recognition technology isn’t just about making things easier.

It’s also about safety.

Instead of texting while driving, you can now tell your car who to call or what restaurant to navigate to.

As beneficial as it may seem in an ideal scenario, it’s dangerous when implemented before it has high enough accuracy.

Studies have found that voice-activated technology in cars can actually cause higher levels of cognitive distraction.

This is because it is relatively new as a technology; engineers are still working out the software kinks.

Imagine this: you ask your connected car to call your friend Justin, and the in-car software misunderstands you. Instead, it starts blasting Justin Bieber’s infuriatingly catchy song “Baby”. As you hurriedly attempt to change the song, you are obviously not in prime condition to be watching the road.

We have worked on a number of speech recognition-related projects at Globalme, including in-car speech recognition data collection and voice-controlled fitness wearables.

Through these projects and many more, we have seen first-hand the way different languages, dialects and accents can prove too complex and individualistic for technologies to handle.

But How Does Speech Recognition Technology Work?

It seems so simple to us now in 2017.

Surrounded by smartphones, smart TVs, tablets, laptops, solar powered cars and more, it’s easy to take for granted how much research has gone into creating this futuristic world we live in.

But for every breakthrough we have made in speech recognition technology, there have been thousands of failures and hundreds of dead ends.

Why? Because the simplicity of being able to speak to digital assistants is misleading. Speech recognition is actually incredibly complicated, even now.

Analyze, Filter, Digitize

Speech recognition software processes the sounds you make by filtering what you say, digitizing it into a format it can “read”, and then analyzing it for meaning.

Then, based on algorithms and previous input, it can make a highly accurate educated guess as to what you are saying. It gets to know the speaker’s use of language.
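
To make the filter, digitize and analyze steps a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the 8 kHz sample rate, the 16-bit depth, the pre-emphasis filter and the two per-frame features (energy and zero crossings) are choices made for this example, and a synthetic 440 Hz tone stands in for a microphone. Production systems typically rely on richer spectral features and trained statistical models.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed for this sketch)


def digitize(signal, bits=16):
    """Quantize floats in [-1.0, 1.0] to signed integers, clipping out-of-range values."""
    max_int = 2 ** (bits - 1) - 1
    return [round(max(-1.0, min(1.0, s)) * max_int) for s in signal]


def pre_emphasis(samples, coeff=0.97):
    """Simple high-pass filter: subtract a scaled copy of the previous sample."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[i] - coeff * samples[i - 1] for i in range(1, len(samples))
    ]


def frame_features(samples, frame_len=200):
    """Split into fixed-size frames; return (energy, zero_crossings) for each frame."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((energy, crossings))
    return features


# A 0.1-second, 440 Hz tone stands in for recorded speech.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(800)]
features = frame_features(pre_emphasis(digitize(tone)))
```

Each tuple in `features` is a compact numeric description of one short slice of audio. The "educated guess" happens later, when a trained model compares sequences of such descriptions against patterns it has seen before.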

Unsurprisingly, if the speech recognition software is only used by one person, it will be trained specifically for how that person talks.

It becomes increasingly complex when a device or piece of software is geared towards multiple markets around the world. This is because engineers have to build in the ability to understand far more variations: languages, dialects, accents and phrasing.

But the complexities don’t stop there.

Even with hundreds of hours of input, other factors can play a huge role in whether or not the software can understand you:

Background noise can easily throw a speech recognition device off track. This is because it does not inherently have the ability to distinguish your voice from the ambient sounds it “hears”, such as a dog barking or a helicopter flying overhead.

Engineers have to program that ability into the device; they conduct data collection of these ambient sounds and “tell” the device to filter them out.

Another factor is the way humans naturally shift the pitch of their voice to compensate for noisy environments; speech recognition systems can be sensitive to these pitch changes.
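
As a rough illustration of the ambient-sound filtering described above, the sketch below "collects" some noise-only audio, measures how loud it is on average, and then keeps only the frames that are clearly louder than that noise floor. This is a simple energy-based noise gate chosen for illustration; the function names, the `margin` value and the synthetic signals are all assumptions, and real devices generally use more sophisticated, frequency-aware methods.

```python
import math
import random


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(x * x for x in frame) / len(frame)


def estimate_noise_floor(noise_frames):
    """Average energy across frames known to contain only ambient sound."""
    return sum(frame_energy(f) for f in noise_frames) / len(noise_frames)


def gate(frames, noise_floor, margin=4.0):
    """Keep frames whose energy exceeds the noise floor by `margin` times; drop the rest."""
    return [f for f in frames if frame_energy(f) > margin * noise_floor]


rng = random.Random(0)  # seeded so the example is repeatable


def noise(n, amplitude=0.05):
    """Synthetic ambient noise: low-amplitude random samples."""
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]


# "Data collection": ten frames of ambient sound only.
floor = estimate_noise_floor([noise(160) for _ in range(10)])

# One quiet frame, and one frame with a louder "voice" (a 200 Hz tone) on top of the noise.
quiet = noise(160)
loud = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) + q
        for n, q in enumerate(noise(160))]

kept = gate([quiet, loud, quiet], floor)  # only the loud frame should survive
```

A gate like this only handles noise that is quieter than the speaker. A barking dog that is as loud as your voice would pass straight through, which is one reason the problem is harder than it looks.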

Humans and Technology “Learn” in Similar Ways

You may be wondering what I mean by “training” or “input”.

Let me put it this way – think about how a child learns a language.

From day one of the child’s life, they are hearing language used all around them. Parents speak to the child knowing that they can’t answer yet. But even though the child doesn’t respond, they are absorbing all kinds of verbal cues: intonation, inflection, and pronunciation.

This is called input. Their brain is forming patterns and connections based on how their parents use language.

Though it may seem as though humans are hardwired to listen and understand, we have actually been training our entire lives to develop this so-called natural ability.

It takes five or six years for a child to be able to have a full conversation, and then we spend the next 15 years in school collecting more data and increasing our vocabulary.

By the time we reach adulthood, we can mentally turn the individual sounds we hear, known as “phonemes”, into words and then into meaning, almost instantly.

Speech recognition technology works in essentially the same way.

Whereas humans have refined our process, we are still figuring out the best practices for computers. We have to train them in the same way our parents and teachers trained us. And that training involves a lot of manpower, research, and innovative methods.

Speech Recognition Technology in Action

Shazam, an app that is used to instantly identify music, is not a speech recognizer, but it is a great analogy for how speech recognition technology works.

When you hit the Shazam button, you are effectively starting an audio recording of your surroundings.

The app filters out the ambient noise, identifies the song’s pattern, and compares the audio recording to its database.

Eventually, it tracks down the song that was playing and supplies the information to its curious end user.
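
The "identify the pattern, then compare it to a database" idea can be sketched in a few lines of Python. To be clear, this is a toy written for this article and not Shazam’s actual method: the "songs" are made-up sequences of pure tones, the fingerprint is just the set of consecutive strongest-frequency pairs, and the names `dominant_bin`, `fingerprint`, `tone_sequence` and `identify` are hypothetical. It also assumes the clip lines up neatly with the frame boundaries, which real recordings never do.

```python
import cmath
import math

FRAME = 64  # samples per analysis frame (assumed for this sketch)


def dominant_bin(frame):
    """Return the index of the strongest frequency bin, using a naive DFT."""
    best_bin, best_mag = 0, -1.0
    for k in range(1, FRAME // 2):
        mag = abs(sum(x * cmath.exp(-2j * math.pi * k * n / FRAME)
                      for n, x in enumerate(frame)))
        if mag > best_mag:
            best_bin, best_mag = k, mag
    return best_bin


def fingerprint(samples):
    """Fingerprint = set of (peak, next_peak) pairs from consecutive frames."""
    peaks = [dominant_bin(samples[i:i + FRAME])
             for i in range(0, len(samples) - FRAME + 1, FRAME)]
    return set(zip(peaks, peaks[1:]))


def tone_sequence(bins):
    """Synthesize a toy 'song': one pure tone per frame."""
    return [math.sin(2 * math.pi * b * n / FRAME) for b in bins for n in range(FRAME)]


# A made-up database of two songs.
database = {
    "Song A": fingerprint(tone_sequence([3, 7, 5, 9, 4, 8])),
    "Song B": fingerprint(tone_sequence([10, 2, 12, 6, 11, 3])),
}


def identify(samples):
    """Return the title sharing the most fingerprint pairs with the clip, or None."""
    query = fingerprint(samples)
    scores = {title: len(query & prints) for title, prints in database.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


clip = tone_sequence([7, 5, 9])  # a short excerpt from the middle of Song A
match = identify(clip)
```

The point of the toy is that the clip never has to match the whole song. A few distinctive patterns in the right order are enough to pick the most likely candidate.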

In much the same way, your voice is recognized as the input.

The device or software then separates the noise (individualistic vocal patterns, accents, ambient sounds, and so on) from the keywords and turns them into text that the software can understand.

This is why speech recognition technology developed in North America for the North American accent does not work well when non-native speakers attempt to use it. Native speakers pronounce things more or less consistently, save for individual variety.

Non-native speakers of English, by contrast, introduce intonations and phrasing that the software has not been trained on.

A Work in Progress

You might be thinking, “Is it even possible to perfect something as complicated as this?”

The answer is: not as of yet. But we are getting closer.

As time goes on, processing more and more data (audio, text, noise) improves the accuracy of speech recognition technology.

Similarly, as life goes on, humans add more and more “data” to their “servers”.

When I think back to just 15 years ago, when cellphones were still in grayscale and were actually used predominantly as, well, phones, I’m always taken aback at how far technology has come.

So, maybe the next time Siri fails to understand your existential questions, or your Amazon Alexa plays the wrong music, remember that this technology is mind-blowingly complicated and still impressively accurate.

Then smile, tell your Amazon Alexa that you forgive her, and dance along to the music she chose.

If you’d like to learn more, check out our article on the past, present and future of speech technology.

Or, head over to our in-car speech system data collection case study.
