Call a customer support line and you’ll likely get an automated speech recognition prompt for single-word entries such as yes-or-no responses and spoken numerals. Here’s the why and how of ASR.
In the not-so-distant past, anyone in need of customer support would call their service provider and enter a series of digits on their phone’s keypad before getting the necessary assistance.
With automatic speech recognition (ASR), you provide that information using speech prompts, and customer service chatbots forward your call accordingly.
It’s a smoother process with fewer hiccups, in theory.
In recent years, ASR has become popular in the customer service departments of large corporations. It is also used by some government agencies and other organizations.
Growth in this sphere is expected to average approximately 20% until 2030.
McKinsey research found that “companies have already applied advanced analytics to reduce average handle time by up to 40 percent, increase self-service containment rates by 5 to 20 percent, cut employee costs by up to $5 million, and boost the conversion rate on service-to-sales calls by nearly 50 percent—all while improving customer satisfaction and employee engagement.”
With that in mind, let’s discuss the what, why and how of ASR.
What is automated speech recognition?
ASR is a technology that allows us to speak entries rather than punching numbers on a keypad.
These systems recognize single-word entries such as yes-or-no responses and spoken numerals. So, there’s no need to press 1 for this, press 2 for that, and enter your account number followed by the pound key.
With a keypad, sooner or later your thumb slips and you have to start the whole process over again. ASR technology also enables callers to perform self-service tasks, like checking account balances, and to authenticate their identity before speaking with an agent.
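To make the idea of single-word entries concrete, here is a minimal sketch of what might happen after the recognizer has turned audio into text. The word lists and function names are assumptions made up for illustration; they are not taken from any particular ASR product.

```python
# Hypothetical vocabulary: maps spoken words to the value a phone menu expects.
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7",
    "eight": "8", "nine": "9",
}
YES_WORDS = {"yes", "yeah", "yep", "correct"}
NO_WORDS = {"no", "nope", "incorrect"}


def parse_yes_no(transcript):
    """Return True, False, or None if the reply is not a clear yes or no."""
    word = transcript.strip().lower()
    if word in YES_WORDS:
        return True
    if word in NO_WORDS:
        return False
    return None


def parse_digits(transcript):
    """Turn 'four oh two' into '402'; return None on any unknown word."""
    words = transcript.lower().split()
    if not words or any(w not in DIGITS for w in words):
        return None
    return "".join(DIGITS[w] for w in words)


print(parse_yes_no("Yes"))           # True
print(parse_digits("four oh two"))   # 402
print(parse_digits("four hundred"))  # None
```

Returning None instead of guessing lets the system ask the caller to repeat the entry rather than act on a misheard number.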
Natural Language Processing-based ASR
A more advanced ASR makes use of natural language processing (NLP).
NLP models quickly and accurately make sense of enormous amounts of human language data. Chatbots receive that data, analyze it, understand the sentiment, and then make fast, informed decisions.
For example, NLP understands if you mean “write” or “right”. It applies context both to what we say and to what we’re trying to say.
Consider the term “wicked”. A customer might be calling a customer service line to yell about a “wicked” (i.e., evil) corporation or to get support for a “wicked” (i.e., cool, amazing) new smartphone.
NLP in ASR tells the difference and directs the call accordingly.
Once the ASR program understands what you’re trying to say, it develops an appropriate response and uses text-to-speech conversion to reply to you.
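The “wicked” example can be illustrated with a toy sketch that guesses the intended sense from the surrounding words. This is only a stand-in for the idea of using context: the cue-word lists and the function name are hypothetical, and a real NLP model learns these associations from large amounts of data rather than from hand-written lists.

```python
# Hypothetical cue words; a real NLP model learns context from data.
NEGATIVE_CUES = {"corporation", "company", "fees", "charged", "scam", "complaint"}
POSITIVE_CUES = {"phone", "smartphone", "new", "love", "camera", "upgrade"}


def sense_of_wicked(utterance):
    """Guess whether 'wicked' is a complaint or praise from nearby words."""
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    if "wicked" not in words:
        return "not_present"
    negative = len(words & NEGATIVE_CUES)
    positive = len(words & POSITIVE_CUES)
    if negative > positive:
        return "complaint"
    if positive > negative:
        return "praise"
    return "unclear"


print(sense_of_wicked("Your wicked corporation charged me twice"))   # complaint
print(sense_of_wicked("I need help with my wicked new smartphone"))  # praise
print(sense_of_wicked("That is wicked"))                             # unclear
```

An "unclear" result is a reasonable point to ask a follow-up question or hand the call to a person.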
Pros and Cons of Automated Speech Recognition
We anticipate massive growth in this sphere, but there’s still some work to do. Here are the benefits and drawbacks of the technology as it stands.
Pros of ASR
For one, we can work our way through automated menus without having to enter dozens of numerals manually.
For example, let’s say you get a flat tire on a busy highway and need to call roadside assistance. You don’t want to fumble around with your card and your phone, typing numbers in a stressful situation. Instead, you say what you need, read your number, and a tow can get there faster.
For companies, ASR identifies why a customer is calling and uses this information to route the call to the appropriate agent.
This reduces customer frustration by cutting the number of touchpoints and the time it takes to reach someone who can deal with the issue at hand.
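As a rough sketch of that routing step, the snippet below matches a transcribed request against trigger phrases and picks a queue. The queue names and phrases are hypothetical, and plain substring matching is a deliberate simplification of what an NLP-based system would do.

```python
# Hypothetical queues and trigger phrases, for illustration only.
ROUTES = {
    "billing": ("bill", "billing", "invoice", "payment", "charge"),
    "technical_support": ("not working", "broken", "error", "outage"),
    "roadside_assistance": ("flat tire", "tow", "breakdown"),
}
FALLBACK_QUEUE = "general_agent"


def route_call(transcript):
    """Return the first queue whose trigger phrase appears in the transcript."""
    text = transcript.lower()
    for queue, phrases in ROUTES.items():
        if any(phrase in text for phrase in phrases):
            return queue
    return FALLBACK_QUEUE


print(route_call("I have a question about my bill"))  # billing
print(route_call("I got a flat tire on the highway"))  # roadside_assistance
print(route_call("Hello"))                             # general_agent
```

The fallback queue matters: when nothing matches, the caller still reaches a person instead of looping through the menu. Substring matching will also produce false hits (for example, "tow" inside "town"), which is one reason production systems rely on trained models.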
You can also reduce the number of call-center employees and save your company some money in the process.
Cons of ASR
There are some serious barriers we need to discuss here.
How many of us have called these lines and hit a wall when the system doesn’t recognize what we’re saying? You put your mouth as close to the phone as possible and say you want to speak to “BILL-ING” over and over without ever reaching someone who can talk about your account.
Further to that, an ASR system can’t always correctly recognize the input from a person who speaks with a heavy accent or dialect.
A similar issue arises with virtual assistants and smartphones, and it speaks to the need to be more inclusive when building these technologies.
There are also problems with people who combine words from two languages by force of habit. That’s more of a user issue, but something to consider. Furthermore, poor cell-phone connections can cause the system to misinterpret the input.
And, while the cost is gradually diminishing, ASR systems are still too expensive for some organizations. That connects with another con. Companies can save money by reducing customer service staff and replacing them with an automated service.
Ideally, that leads to more creative customer retention and sentiment analysis positions, but the potential impact on the human workforce is significant.
The Future of ASR – How to Make It Better
Sophisticated speech recognition systems require large volumes of speech data. The data becomes valuable once you transcribe and annotate it.
When it comes to ASR, the good news is it’s not super complicated because you largely deal with single-word recordings instead of wading through hours of conversation.
Transcribed Data …
Let’s start with transcription. You label speech recordings to train AI to recognize what people say and to eliminate background noise. The annotation process consists of labeling noises, repetitions, false starts, changes in language, and who is speaking.
To begin, the transcriber—either a person or a computer—records what is said, when it’s said, and by whom. We err on the side of human speech transcription to ensure accuracy and inclusivity, and to handle complex environments and use cases.
Here’s the process:
- Annotation – One annotator works on segmentation, speaker tagging, and metadata.
- Partial QA – One of our team members QAs a sampling of the annotated files to ensure they’re ready for transcription.
- Transcription – A different transcriber inserts the transcription and any necessary tags.
- Full QA – The same QA team member reviews 100% of the transcription files.
By following this multi-step process – beginning with annotating the speech data and then performing partial QA – we ensure that the transcription step is as efficient as possible.
This helps our clients save on costs without sacrificing quality.
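The four steps above can be sketched in code. This is an illustration under stated assumptions, not our production tooling: the Segment fields, the 10% default sampling rate, and the specific checks are all hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One annotated slice of a recording (field names are assumptions)."""
    start_s: float
    end_s: float
    speaker: str
    tags: list = field(default_factory=list)  # e.g. ["noise", "false_start"]
    transcript: str = ""                      # filled in at the transcription step


def partial_qa_sample(segments, fraction=0.1, seed=0):
    """Pick a random subset of annotated segments for the partial QA pass."""
    if not segments:
        return []
    count = max(1, round(len(segments) * fraction))
    return random.Random(seed).sample(segments, count)


def full_qa_issues(segments):
    """Check every segment; return (index, problem) pairs for the reviewer."""
    issues = []
    for i, seg in enumerate(segments):
        if seg.end_s <= seg.start_s:
            issues.append((i, "end time is not after start time"))
        if not seg.speaker:
            issues.append((i, "missing speaker tag"))
        if not seg.transcript.strip():
            issues.append((i, "missing transcript"))
    return issues


segments = [
    Segment(0.0, 1.2, "caller", transcript="yes"),
    Segment(1.2, 3.0, "caller", tags=["noise"]),
]
print(len(partial_qa_sample(segments, fraction=0.5)))  # 1
print(full_qa_issues(segments))  # [(1, 'missing transcript')]
```

Sampling before transcription catches segmentation or speaker-tag problems early, so transcribers are not asked to work on files that would have to be redone.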
…And lots of it
While ASR requires limited data in terms of potential responses, it must be trained to recognize different languages, dialects, accents, pitches, volumes, and speeds.
Remote speech data collection is ideal here. You gather it through a mobile app or web browser platform from anywhere an internet connection can be found. All you need is a trusted crowd.
Let’s say, for instance, you need recordings of numbers in a specific language or dialect for the purpose of recognizing account information on customer service calls. Remote collection lets you gather short clips in any language from anywhere around the world.
We collect the data on a small scale or from thousands of participants online based on their language and demographic profile. We can ask them to record speech samples by reading prompts off their screen or by speaking through a variety of scripted scenarios.
Examples could include “read off this series of account numbers” or “read these common customer service queries”.
Any challenges can be at least partially solved with a customized data collection and annotation project.
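One practical question in a collection project is whether every language and accent group has enough clips. Here is a minimal sketch of that check; the group labels, target counts, and record fields are hypothetical examples, not figures from a real project.

```python
from collections import Counter

# Hypothetical targets: clips wanted per (language, accent) group.
TARGETS = {
    ("en", "scottish"): 3,
    ("en", "indian"): 3,
    ("es", "mexican"): 2,
}


def coverage_gaps(recordings, targets):
    """Return how many more clips each group needs to hit its target."""
    collected = Counter((r["language"], r["accent"]) for r in recordings)
    return {
        group: wanted - collected[group]
        for group, wanted in targets.items()
        if collected[group] < wanted
    }


recordings = [
    {"language": "en", "accent": "scottish", "prompt": "four oh two"},
    {"language": "en", "accent": "scottish", "prompt": "yes"},
    {"language": "es", "accent": "mexican", "prompt": "sí"},
    {"language": "es", "accent": "mexican", "prompt": "no"},
]
print(coverage_gaps(recordings, TARGETS))
# {('en', 'scottish'): 1, ('en', 'indian'): 3}
```

A report like this shows which participant profiles to recruit next, so under-represented accents are filled in before training rather than discovered afterwards.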
Automated Speech Recognition Data is our Business
Our clients recognize our data solutions team for its versatility and outside-the-box thinking.
With our crowd and our platform, we can offer custom speech data collection at scale.
To learn how we can create a speech collection program for your organization, book a consultation now.