Speech Transcription for AI: Why we still need humans

Last Updated September 29, 2021


Automatic speech transcription has reached near-human accuracy levels at a fraction of the cost and effort. But when your goal is to further improve the accuracy of automatic speech recognition, you’ll still need the help of real-life human transcribers.

Speech transcription is a simple task on the surface: write down what was said in an audio recording. But as a data provider to AI developers, the transcription projects on our plates today are anything but simple.

That’s because automatic speech recognition (ASR) already covers the simple transcription cases.

When clients come to us for speech data collection and transcription, they’re trying to solve for the edge cases where ASR still struggles—whether it’s recognizing a greater variety of accents or dealing with background noise.

Therefore, when you consider the unique (and sometimes strange) training data needs of today’s speech technology developers, a one-size-fits-all approach to speech transcription is destined for failure.

Before starting a transcription project, you have to consider factors such as use case, budget, quality requirements, necessary language expertise, and more.

In this article, we’ll discuss why human transcription is still necessary in an increasingly automated world and why we take a consultancy approach to transcription for AI.

What is speech transcription for AI?

There’s a distinction to be made between general-use-case transcription and transcription for AI.

Speech transcription for AI, specifically, is transcription that’s used in combination with speech recordings to train and validate voice recognition algorithms on a variety of applications—from voice assistants to customer service bots.

The transcriber—either a person or a computer—records what is said, when it’s said, and by whom. In some cases, transcriptions can also include non-verbal sounds and background noises.

The speech data that’s transcribed for AI can come in the form of human-to-machine speech (e.g. voice commands or wake words), or human-to-human speech (e.g. interviews or phone conversations).

Transcription for AI can be contrasted with generic speech transcription, which is used for anything from podcasts to office meetings, interviews, doctors’ appointments, legal proceedings, TV shows, and customer service phone conversations. In this case, the transcription itself is usually the end goal. The user wants an account of what was said.

What is transcribed and the type of transcription used depends entirely on the end use case. The three main types of speech transcription are:

  1. Verbatim transcription: A word-for-word transcription of spoken language. It captures everything the speaker says, including fillers like ‘ah’, ‘uh’, and ‘um’, throat clearing, and incomplete sentences.
  2. Intelligent verbatim transcription: A layer of filtering is added to the transcription process to extract the meaning from what was said. The transcriptionist performs light editing to correct sentence structure and grammar and removes irrelevant words or phrases altogether.
  3. Edited transcription: A full and accurate script is formalized and edited for readability and clarity.

Most speech recognition technology requires verbatim transcription to pick up on all the nuances of the audio recording. Intelligent verbatim transcription may also be used in cases where the overall meaning of the speech segment matters more than the exact mapping of acoustic signal to words.
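To make the difference concrete, here’s a small illustrative sketch of how the same made-up utterance might look under each style (the exact tagging conventions vary from project to project):

```python
# One made-up utterance rendered in the three transcription styles above.
# Tag conventions like [throat_clear] are illustrative, not a standard.
utterance_styles = {
    "verbatim": "uh so I- I wanted to um [throat_clear] check on my, my order status",
    "intelligent_verbatim": "So I wanted to check on my order status.",
    "edited": "I would like to check the status of my order.",
}
```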

Why are human transcriptionists still needed for AI?

While it’s cheaper and faster to use automated transcription tools for day-to-day transcription needs, human speech transcription is still needed for the use cases where automatic speech recognition falls short.

Here are a few pertinent examples.

To improve ASR accuracy for human-to-human conversations

Recent research found that word error rate (WER) for ASR used to transcribe business phone conversations still ranged from 13-23%—much higher than the previously published error rates of 2-3%. The research suggested that ASR handles “chatbot-like” interactions between human and machine quite well because people purposely speak more clearly when they’re talking to a machine, but don’t speak nearly as clearly in person-to-person phone calls.

For high-stakes fields like medicine, law, or autonomous driving, double-digit ASR error rates could have serious consequences.

Therefore, ASR developers are still keen to use human transcribers to address the edge cases where automatic transcription is failing.
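For reference, word error rate counts the word-level substitutions, deletions, and insertions needed to turn the ASR output into the human reference transcript, divided by the number of words in the reference. Here’s a minimal Python sketch; the example strings are made up:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = word-level edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("please play some jazz", "please play some jams"))  # 0.25
```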

To create more inclusive technology

The early versions of voice assistants like Siri, Alexa, and Google Home only supported a specific and clear version of US English.

Wanting to cater to their global customer bases, the major voice players have since expanded their language support to the world’s most popular languages and dialects.

Expanding to new languages doesn’t come without pain points. Despite their ongoing efforts to expand their language capabilities, the major voice assistants have received criticism for race and gender biases, for example.

This is a big challenge to solve. In the USA alone, there are 30 major dialects. Even within those dialects, there’s variation from speaker to speaker based on gender, education level, economic standing, and plenty of other demographic factors. That doesn’t even account for second-language speakers with unique accents or the evolution of language over time (e.g. new words and slang).

To create voice technology that understands everyone, speech algorithms need to be trained on speech data from people of all demographic backgrounds. In these cases, human transcriptionists are still needed to capture the edge cases where automatic speech recognition still struggles.

To handle complex environments and use cases

Beyond understanding more accents, ASR is also expected to handle increasingly complex acoustic environments and conversational scenarios.

Early ASR only had to work in a quiet bedroom or home office; it’s now expected to work in busy workplaces, in cars, and at parties.

Transcribing speech recorded in a quiet room is simple enough, but the task becomes painful when there’s background noise, poor audio quality, or multiple competing speakers. Human transcribers are better equipped to handle these challenging audio environments where ASR may still struggle.

Now that we’ve established the need for speech transcription for AI, let’s take a closer look at that process.

The Speech Transcription Process in Four Steps

There’s no one-size-fits-all strategy when it comes to speech transcription for AI. From directions to delivery, each transcription project is entirely guided by your end goals as the customer.

Despite the differences from project to project, here are the four general steps we follow.

1. We evaluate and collaborate on your needs

We start each speech transcription project by understanding your end use case. According to Rick Lin, Solutions Architect, this is often the most complex part of the project.

“We want to make sure we meet your requirements, but we may also challenge you on your requirements. Our clients sometimes ask for things that are nice to have, not realizing the complexity they’re introducing to the project.”

Lin preaches the value of collaboration.

“We calibrate ourselves so that we’re not overcharging you by working on things you don’t actually care about or that you’re not going to use. Having these conversations early on is well worth it to optimize costs and delivery timelines,” says Lin.

For example, you’ll have to decide on whether non-speech sounds need to be labeled, whether to timestamp at the word or sentence level, and the type of syntax you want to use.

Each of these choices has a major influence on the project, from who’s doing the transcription work to how much it costs, so it’s best to talk it out with our team first. We may be able to find you cost savings or uncover specifications you didn’t realize you’d need.
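As a rough illustration, those decisions often boil down to a short project specification like the hypothetical one below; the field names and values are ours for the example, not a standard format:

```python
# A hypothetical transcription spec; field names and values are illustrative only.
transcription_spec = {
    "style": "verbatim",              # verbatim, intelligent verbatim, or edited
    "timestamp_level": "sentence",    # or "word"
    "label_non_speech": True,         # e.g. [cough], [door_slam]
    "speaker_labels": True,
    "tag_syntax": "square_brackets",  # how non-speech and special tokens are marked
    "keep_fillers": True,             # keep "uh", "um" for acoustic model training
}
```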

2. We transform your needs into transcription guidelines

Once we understand your transcription needs, we turn these into transcription guidelines. To be clear, transcription guidelines are only guidelines, not hard-and-fast rules. They should be balanced with a level of flexibility.

“It’s less about defining and more about calibrating our team to be on the same wavelength,” said Lin.

The more we understand your use case, the more we can create guidelines that empower rather than restrict transcribers.

Katja Krohn, Data Solutions Project Manager, believes it’s crucial to loop transcribers in on your ultimate end goals.

“There’s no way to account for every possible speech occurrence; otherwise your guidelines will be 50 pages long. Even then, you still won’t cover everything. You need to make sure whoever you train to complete the transcription understands the purpose,” says Krohn.

If the transcriptionists understand the “why” of the project, they’ll be better equipped to make choices in the moment, requiring less QA or back-and-forth discussion.

3. We assemble a transcription dream team

So who exactly does the transcription work? This depends on how complex your project requirements are.

From most to least complex, we use a combination of:

  1. subject-matter experts
  2. linguists
  3. third-party transcription vendors
  4. crowd workers
  5. automatic transcription tools

We go with subject-matter experts for complex use cases where accuracy is key. For example, medical transcription requires precision and confidentiality in each transcription, so it’s critical to find experts in that field.

Native-language linguists are needed for less-common dialects and languages, bilingual conversations, and other outside-the-box cases.

We’ll occasionally partner with third-party transcription vendors when you need high-volume transcription that doesn’t require native-language expertise. These vendors can follow simple guidelines for English transcription, but may not have the native-language abilities, technology, or project management expertise to carry out a complex project from start to finish.

We use crowd transcribers when speed is of the essence and the task is simple enough to not need experienced transcribers—podcasts and videos, for example. In this case, we split the content into bite-sized tasks on the Robson app to make it easy for anyone to accurately transcribe the sound clip.

We may even be able to use automatic transcription tools for an initial round of transcription, then send the output to the Robson app for crowd validation, removing the need for experts altogether.

The goal here is flexibility and adaptability. A single transcription project could involve a combination of any of the above to optimize your costs and deliver transcriptions as quickly as possible.

4. We apply quality assurance

Surprise: the quality assurance process also depends on your end use case. For QA, we’ll often make use of the same types of resources as for the transcription itself.

We can apply varying levels of QA to a project. You may need:

Full QA: If it’s a highly complex recording and you have strict quality requirements, you may want expert QA on every recording from start to finish.

Partial QA: If the project is simple, the speech recordings are lengthy, or you’re on a tight budget, you could QA only a small portion of the transcription. For example, the reviewer could check 20% of the transcriptions.

Comparative QA: Alternatively, some clients give the same transcription task to two different transcribers and have QA compare the similarity of their outputs. Transcribers are then asked to review any controversial segments.

For a single project, you may use a combination or stages of QA to optimize costs.
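To illustrate the comparative approach, one simple way to surface segments for review is to score how similar the two transcribers’ outputs are and route anything below a threshold to an expert. Here’s a minimal Python sketch using the standard library; the threshold and example strings are made up:

```python
from difflib import SequenceMatcher

def flag_for_review(transcript_a: str, transcript_b: str, threshold: float = 0.9) -> bool:
    """Flag a segment for expert review if two transcribers' outputs diverge too much."""
    similarity = SequenceMatcher(None, transcript_a.split(), transcript_b.split()).ratio()
    return similarity < threshold

# A one-word disagreement on a dosage is exactly the kind of segment you want reviewed.
print(flag_for_review(
    "the patient was prescribed fifteen milligrams",
    "the patient was prescribed fifty milligrams",
))  # True
```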

Example: Call Center Data Transcription

We recently worked on a client project where we transcribed call center data in several languages with the purpose of training a conversational AI.

In collaboration with the client, we established the following speech annotation and transcription process:

  1. Annotation – One annotator works on segmentation, speaker tagging, and metadata.
  2. Partial QA – One of our team members QAs a sampling of the annotated files to ensure they’re ready for transcription.
  3. Transcription – A different transcriber inserts the transcription and any necessary tags.
  4. Full QA – The same QA reviewer checks 100% of the transcription files.

By annotating the speech data first and performing partial QA before transcription, we ensure that the transcription itself, the most labor-intensive part of the process, is as efficient as possible. This helps our client save on costs without sacrificing quality.
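To give a feel for the deliverable, here’s an illustrative record for a single call segment after annotation and transcription. The field names and tag syntax are hypothetical, not the client’s actual schema:

```python
# One illustrative segment after annotation (step 1) and transcription (step 3).
segment = {
    "audio_file": "call_0142.wav",
    "segment_id": 7,
    "start_sec": 31.2,
    "end_sec": 35.9,
    "speaker": "agent",                       # from speaker tagging
    "metadata": {"language": "en-GB", "channel": "mono"},
    "transcription": "thank you for calling how can I help you today",
    "tags": ["[keyboard_noise]"],             # non-speech events, if required
}
```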

Common Speech Transcription Challenges and Solutions

What we’ve outlined so far in this article doesn’t come without obstacles. Here are a few of the bigger challenges we’ve encountered when it comes to transcribing speech for AI.

Mono-channel, multi-speaker conversations

This involves differentiating between speakers or dealing with overlapping speech on a single audio track.

Call center data, for example, often requires the judgment of a human transcriber to differentiate the speakers. They may also be asked to label context cues or underlying sentiments being expressed.

Language changes

A single language evolves subtly over time. Every year, new words are added to the dictionary: 100 in 2020 for English alone.

There are also regional dialects, loan words, and false friends that need to be interpreted. Again, this is an instance where it’s helpful to have a language expert on board to spot something that ASR hasn’t been trained to recognize.

Code switching or bilingual conversations

This occurs when speakers alternate between two or more languages, or language varieties, in the context of a conversation.

If you need to know what’s being said in both languages, you’ll need a bilingual transcriber, or multiple transcribers, to capture everything accurately.

To cut back on costs, you may just ignore what’s said in the second language, but that can be challenging if there’s frequent flipping back and forth.

Complex subject matter

If you’re transcribing speech data to help develop a medical device, transcription errors can be devastating to a firm’s image and financial health. Your training data needs to be 100 percent accurate and provided by human transcriptionists with medical expertise.

Background noises

There’s no one rule for background noise. It can be ignored, added as a bracketed note, or tagged in detail. It all depends on your needs: how much time you want transcription to take, and whether you’re willing to pay more for an increased level of detail.

For example, let’s say you’re working on a speech recognition project for a fast-food drive-thru. Do you want only the voice placing the order transcribed? Or do you want background noises noted, like traffic, passersby, or children yelling menu items from the back seat, so you can train your AI to ignore them?
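As a quick sketch, here’s the same made-up drive-thru clip at two hypothetical levels of detail; the tag names are illustrative, not a standard convention:

```python
# The same drive-thru order at two levels of detail (tags are illustrative).
speech_only = "can I get two cheeseburgers and a small fries"
with_noise_tags = (
    "[traffic] can I get two cheeseburgers "
    "[child_speech: chicken nuggets] and a small fries"
)
```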

Transcription style

This includes, but is not limited to, abbreviations and initialisms as well as exclamation points.

Consider initialisms: abbreviations formed from the initial letters of words and pronounced as separate letters when spoken, such as BBC, UN, UK, and DVD. How these are transcribed depends entirely on the use case.

You can transcribe the abbreviation verbatim, since that’s what was actually said, but because abbreviations can obscure meaning, you may need to spell them out instead.

There may also be instances where a foreign-language speech sample includes a common English acronym like IBM. You don’t want to train your application to treat it as, say, a German word, so it needs to be tagged as an abbreviation.
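One hypothetical way to handle that case is to tag the acronym inline so it can be treated separately during training; the tag syntax below is illustrative only:

```python
# A German utterance containing an English acronym, tagged so the model
# doesn't learn "IBM" as a German word (tag syntax is illustrative).
utterance = "ich arbeite seit zehn Jahren bei [acronym:IBM]"
```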

To give another example, there’s no set rule when it comes to transcribing exclamation points. They can be added as is or written out for the purpose of capturing added emotional content or for sentiment analysis. It all comes back to your specific needs.

The Big Takeaway: Look for a speech transcription consultant

When it comes to speech data transcription for AI, there are many variables to consider in balancing your requirements against cost and delivery speed.

Therefore, when evaluating speech data transcription services providers, look for a provider that’s adaptable, flexible, and looking out for your best interests. If they’re not deep diving into your end use case and offering a variety of solutions, they’re likely not the best fit.

At Summa Linguae, our data solutions experts work with you to understand exactly what level of transcription you need. And if your requirements aren’t yet fully defined, we can help you choose the right solution.

Contact us now to get started with transcription at Summa Linguae Technologies.
