Speech Recognition Software: Past, Present & Future

What is Speech Recognition Software?

Speech recognition software (or speech recognition technology) enables phones, computers, tablets, and other machines to receive, recognize, and understand human utterances. It uses natural language as input to trigger an action, enabling our devices to respond to our spoken commands. The technology is being used to replace other, more ‘tired’ methods of input like typing, texting, and clicking. This is a slightly ironic development, seeing as texting and typing had overtaken voice calls as the preferred method of communication just a few short years ago.

Today, speech recognition technology takes on many forms: from dictating text messages to your smartphone while driving, to asking your car to make dinner reservations at the Chinese restaurant down the road, to telling your speaker system to “please put on that new Beyoncé song”, the ability to talk to your devices has expanded to encompass the vast majority of technology that we use in our daily lives. As we stand at the precipice of a world soon to be dominated by talking devices – and potentially, technologies with a consciousness – let’s take a look back at how it all started.

From Magic 8 Balls to Talking Dolls

The first ever attempt at speech recognition technology was, astoundingly, from around the year 1000 A.D. Pope Sylvester II invented an instrument that could supposedly answer “yes” or “no” questions by “magic”. Although the details of his invention have yet to be discovered, Pope Sylvester II could never have guessed that 944 years later, we would still be captivated by the wonders of a similar technology – the Magic 8 Ball.

The first ‘official’ example of our modern speech recognition technology was “Audrey”, a system designed by Bell Laboratories in the 1950s. A trailblazer in the field, “Audrey” could recognize only nine digits (1-9) spoken by a single voice. The creation of a functional voice recognition system was originally pursued to ease the burden on secretaries taking dictation. So, while “Audrey” was an important first step, it did little to assist with transcription or dictation. The next real advancement took 12 years to develop.

Premiering at the World’s Fair in 1962, IBM’s “Shoebox” was able to recognize and differentiate between 16 words. Up to this point, speech recognition was still laborious. The earlier systems were set up to recognize and process bits of sound (‘phonemes’). IBM engineers programmed the machine to use the sound and pitch of each phoneme as a ‘clue’ to determine which word was being said, then match the sound as closely as it could against the preprogrammed tonal information it held. The technology was quite advanced for its time, but users had to pause between words and speak slowly to ensure the machine would actually pick up what was being said.
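The matching idea behind a system like “Shoebox” can be sketched in a few lines of modern code. Everything below is illustrative: the word list, the (pitch, energy) features, and the scoring are invented for the example, and the real machine was built from analog circuitry rather than software.

```python
import math

# Hypothetical acoustic "templates": each word is stored as a list of
# (pitch, energy) features, one pair per phoneme. These numbers are
# made up for illustration only.
TEMPLATES = {
    "one": [(120, 0.8), (110, 0.6)],
    "two": [(140, 0.9), (100, 0.5)],
    "three": [(150, 0.7), (130, 0.6), (105, 0.5)],
}

def distance(template, features):
    """Sum of Euclidean distances between aligned phoneme features,
    with a large penalty when the phoneme counts differ."""
    aligned = sum(math.dist(a, b) for a, b in zip(template, features))
    return aligned + 1000 * abs(len(template) - len(features))

def recognize(features):
    """Return the vocabulary word whose template is closest to the input."""
    return min(TEMPLATES, key=lambda word: distance(TEMPLATES[word], features))

print(recognize([(138, 0.85), (102, 0.5)]))  # prints "two"
```

Because every incoming sound has to be compared against every stored template, vocabularies of this era stayed tiny: adding words meant adding templates, and the comparisons quickly became the bottleneck.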

A Growing Vocabulary

After another nine years, the Department of Defense began to recognize the value of speech recognition technology. The ability of a computer to process natural human language could prove invaluable in any number of areas of the military and national defense. So, it invested five years in DARPA’s Speech Understanding Research program, one of the largest programs of its kind in the history of speech recognition. One of the more prominent inventions to come from the program was “Harpy”, a system able to recognize over 1,000 words, the vocabulary of an average toddler. “Harpy” was the most advanced speech recognition software to date.

In the late 1970s and 1980s, speech recognition systems had become accessible enough to make their way into children’s toys. In 1978, the Speak & Spell, built around a speech chip, was introduced to help children spell out words. That chip would prove to be an important tool for the next phase of speech recognition software. In 1987, the Worlds of Wonder “Julie” doll came out. In an impressive (if not downright terrifying) display, Julie was able to respond to a speaker and could distinguish between speakers’ voices.

From Acoustics to Linguistics

The ability to distinguish between speakers was not the only advancement made during this time. More and more scientists were abandoning the notion that speech recognition had to be purely acoustic, moving instead towards a linguistic approach. Rather than relying on sounds alone, scientists turned to algorithms that programmed systems with the rules of the English language. So, if a system had trouble recognizing a word you said, it could make an educated guess by assessing its options against syntactic, semantic, and tonal rules.
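That educated guess can be sketched as a toy example. The candidate words, confidence scores, and “which word may follow which” table below are all invented for illustration; real systems of the era used statistical language models rather than hand-written rules.

```python
# The acoustic front end is unsure: each candidate word carries a
# confidence score (hypothetical values for illustration).
acoustic_candidates = {"wreck": 0.5, "recognize": 0.45, "rec": 0.05}

# A toy "linguistic rule" table: which words plausibly follow the
# previous word in a sentence.
FOLLOWS = {
    "to": {"recognize"},
    "a": {"wreck"},
}

def pick_word(prev_word, candidates):
    """Boost candidates that fit the linguistic context, then take the best."""
    def score(word):
        bonus = 0.2 if word in FOLLOWS.get(prev_word, set()) else 0.0
        return candidates[word] + bonus
    return max(candidates, key=score)

print(pick_word("to", acoustic_candidates))  # prints "recognize"
```

The point of the sketch: after “to”, the linguistically plausible “recognize” overtakes “wreck” even though the acoustics alone preferred “wreck”. That is exactly the kind of ambiguity a purely acoustic system could not resolve.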

Three short years after Julie, the world was introduced to Dragon, which debuted its first speech recognition system, “DragonDictate”. Around the same time, AT&T was experimenting with over-the-phone speech recognition software to help field its customer service calls. In 1997, Dragon released “NaturallySpeaking”, which could process natural speech without the need for pauses between words. What started out as a painfully simple and often inaccurate system had become easy for everyday customers to use.

Speech Recognition Software Today

Developments in speech recognition software plateaued for over a decade as the technology fought to catch up with our hopes for innovation. Recognition systems were limited by their processing power and memory, and still had to “guess” which words were being said based on phonemes. This proved difficult for speakers around the globe with strong accents or different vocabularies. Speech recognition products were not localized or globalized by any means, and were thus only successful in specific markets.

In 2010, Google made a game-changing development that brought speech recognition technology to the forefront of innovation: the Google Voice Search app. It aimed to reduce the hassle of typing on a phone’s tiny keyboard, and was the first of its kind to utilize cloud data centers. It was also personalized to your voice and able to ‘learn’ your speech patterns for higher accuracy. All of this paved the way for Siri.

One year later, in 2011, Apple debuted ‘Siri’. ‘She’ became instantly famous for her incredible ability to accurately process natural utterances, and for her ability to respond using conversational – and often shockingly sassy – language. You’re sure to have seen a few screen captures of her pre-programmed humor floating around the internet. Her success, boosted by zealous Apple fans, brought speech recognition technology into the mainstream. With the ability to respond using natural language and to ‘learn’ using cloud-based processing, Siri catalyzed the birth of other like-minded technologies such as Amazon’s Alexa and Microsoft’s Cortana.

The Future: Accurate, Localized, and Ubiquitous

Thanks to ongoing data collection projects and cloud-based processing, many larger speech recognition systems no longer struggle with accents. They have, in a way, undergone a series of ‘brain transplants’ that have improved their ability to ‘hear’ a wider variety of words, languages, and accents. At the time of writing, Apple CarPlay is available in five languages, and Siri is available in around 20.

We have certainly made, and will continue to make, significant strides in speech recognition. However, we are still far from inventing the intelligent systems of so many of our favorite sci-fi movies. Tell Siri that you love her, and she’ll respond, “I hope you don’t say that to those other mobile phones”. But she – it – does not and cannot truly understand you. To know love and other emotions is to go beyond software; the voice we hear is merely a few lines of code.

Researchers at Bell Laboratories in the 1950s couldn’t have imagined a world in which we regularly talk to our devices. There is no doubt that we will be surprised by where this technology takes us in the future, especially when we consider the role speech recognition will play in AI and deep learning. We have already begun to nudge our way deeper into a world where we are more and more dependent on our technology. Will we even notice when our dependency turns into something more reminiscent of a conscious relationship?

But that’s a rabbit hole we’ll dive into another time.
