Why LSPs are Taking the Lead in Data Collection

Last Updated August 25, 2021

How language service providers made the jump from localization to data collection—and why it was a natural fit.

Analysis by Slator found that 7% of language service providers (LSPs) from the 2020 Language Service Provider Index now offer data-for-AI services.

And while the shift from localization to data solutions is still in its early stages, it’s an evolution that’s decades in the making.

The core strengths of LSPs are expertise in global workflow management, experience managing linguists and vendors at scale, and an intimate understanding of language and its nuances.

Those same pillars are necessary for data collection—especially human-focused data such as speech.

And although localization and data solutions differ substantially, LSPs are uniquely equipped to advance the data-for-AI space after decades of dealing with global content.

Seizing This Speech Data Moment

Whereas the first major clients for data-for-AI services were big tech companies at the cutting edge of speech recognition and machine learning technology, AI is now being adopted across nearly every industry.

A McKinsey survey from November 2020 found that half of respondents say their organization has already adopted AI in at least one function.

Large enterprises and small companies alike are looking at products, services, and processes in a new light, asking how AI can be harnessed to transform and enhance them to meet consumer demand.

The Opportunity

This adoption of AI by both consumers and businesses is driving the rising demand for voice technology.

Voicebot.ai reports that over 58% of online adults have used voice search. In early 2019, 33% were using it monthly, up from just over 25% in September 2018.

Over the next few years, voice control will become as ubiquitous as touch screens and buttons.

Shannon Zimmerman

CRO, Summa Linguae Technologies

"The notion of our phone being the singular device that we’re going to interact with, it’s far beyond that. We’re always looking for a better, faster, more efficient way to get things done. And all the ways we interact with those devices are relevant to the collection of data to support the AI initiative."

Devices are only as good as they are effective, and the surge in successful AI is fueled by vast amounts of high-quality data. All else being equal, the bigger the sample size, the more accurate the machine learning system tends to be.

The Task at Hand

The challenge, therefore, is collecting speech data at scale, and making sure it’s usable, as indicated in the Slator 2021 Data-for-AI Market Report.

“The quality of data used to train models is integral to performance. If an AI model is trained on poor-quality or unrepresentative data, the model will draw the wrong conclusions and the system will not work as the user expects. Data must be clean and labeled accurately. It must comprehensively represent the range and diversity of the real-world cases that the machine will encounter.”
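As a rough illustration of what "comprehensively represent" can mean in practice, here is a minimal Python sketch that flags language and accent groups with too few labeled recordings. The metadata fields, the sample records, and the threshold are all assumptions made for this example, not a description of any particular collection platform.

```python
from collections import Counter

# Hypothetical recording metadata; the field names are assumptions for
# illustration, not a real platform's schema.
RECORDINGS = [
    {"id": "r1", "language": "ja", "accent": "kansai", "transcript": "konnichiwa"},
    {"id": "r2", "language": "ja", "accent": "kansai", "transcript": "arigatou"},
    {"id": "r3", "language": "ja", "accent": "tokyo", "transcript": ""},
    {"id": "r4", "language": "en", "accent": "scottish", "transcript": "hello"},
]


def coverage_gaps(recordings, required_groups, min_per_group):
    """Return {group: usable_count} for every required (language, accent)
    group that has fewer than min_per_group labeled recordings."""
    usable = Counter(
        (r["language"], r["accent"])
        for r in recordings
        if (r.get("transcript") or "").strip()  # unlabeled items do not count
    )
    return {
        group: usable[group]
        for group in required_groups
        if usable[group] < min_per_group
    }


if __name__ == "__main__":
    required = [("ja", "kansai"), ("ja", "tokyo"), ("en", "scottish")]
    gaps = coverage_gaps(RECORDINGS, required, min_per_group=2)
    for group, count in gaps.items():
        print(group, "has", count, "usable recordings")
```

A check like this only covers counts per group. Whether a transcript is actually accurate still needs human review.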

Speech data collection and transcription have traditionally been time-consuming and labor-intensive, so having reliable and tested processes in place goes a long way towards meeting the fast-paced needs of AI development.

While there are data-only startups entering the space, LSPs are broadening their horizons by offering data solutions to their customers, built off the back of their success offering language solutions.

Why LSPs Have a Leg Up in Speech Data Collection

The shift to data-driven solutions doesn’t happen automatically, but LSPs are ahead of the curve because they recognized the opportunity and took steps to capitalize on it.

Krzysztof Zdanowski

CEO & Founder, Summa Linguae Technologies

"To genuinely pivot into the data space, you will have to invest a lot of capital expenditures, build a data collection platform, and manage a crowd on web-based annotation software. Only then can you really claim you're a data provider."

That process is made easier by what LSPs already had in place to meet translation and localization needs, and by how those processes serve similar end goals.

“It’s an evolution, not a revolution,” said Zdanowski.

1. LSPs are already working in artificial intelligence

AI has been the subject of conversation for LSPs for over three decades thanks to the arrival of machine translation.

The linguist used to be at the core of translation, but machine translation disrupted the industry. This forced LSPs to shift from being language experts to being language and technology experts.

Roles that used to be reserved for tech companies – such as engineers and solutions architects – are now commonplace at LSPs.

LSPs therefore have first-hand experience with the challenges of developing AI technology themselves, and have the necessary technical experts in-house to develop the technology necessary for data solutions.

2. LSPs have language expertise

The core goal of an LSP is to help clients make their products and content available to a wider demographic audience through linguistic and cultural adaptation.

When it comes to human-centered AI products, what begins as a localization challenge turns into a data collection challenge.

For example, Sonos came to Summa Linguae Technologies to expand their smart speaker’s voice recognition capabilities. They wanted their speaker to accurately understand a greater variety of languages, accents, and dialects.

To achieve that goal, we carried out speech data collection in multiple languages and dialects, taking care of everything from script design to crowdsourcing to quality assurance.

This requires in-house expertise in each of the target languages—something that would be extremely challenging to build as a new data startup, but comes naturally to us as a language services provider.

The net effect of this language expertise for our clients is higher-quality data with fewer errors—an absolute necessity if you’re trying to increase the accuracy of your AI solution.

3. LSPs are vendor management experts

The bulk of translation work—including verification, editing, proofreading, and quality assurance—is typically conducted by freelancers.

LSPs have pre-existing expertise with vendor management and outsourcing.

Take this TunnelBear localization project as an example. To localize their VPN app beyond US English, we assembled a team of over 50 translators, editors, QA testers, and localization engineers. With 20,000 words to translate into 16 target languages, a project of this scale is impossible without an outsourced workforce.

This outsourcing expertise lends itself perfectly to data collection, which is highly dependent on efficient outsourcing.

For example, these three data services all require some form of outsourcing:

  • Collecting data requires recruiting a crowd — e.g. finding 500 Japanese speakers to make voice recordings.
  • Processing the data requires training freelancers — e.g. training 5 native Japanese speakers to transcribe the voice recordings.
  • Performing quality assurance requires recruiting language experts — e.g. hiring a Japanese linguist to review the accuracy of transcriptions.

This vendor management expertise makes end-to-end workflows more efficient, and the resulting cost savings are passed on to our clients.

4. LSPs are data management specialists

If you remove the language aspect, localization is a data management exercise. A client provides data that needs to be stored, transformed, managed, and delivered back with certain requirements.

In that sense, LSPs have been data management experts for years.

Where localization and data solutions differ, however, is the added wrinkle of having to collect the data. Localization clients come with their source material in hand, but data clients typically need the data collected from scratch.

Many projects require huge quantities of data from all around the world, collected on a tight timeline. Remote data collection makes that lofty goal a reality.

To that end, LSPs have leveraged their existing project infrastructure and added the step of crowdsourced task completion to take advantage of a real growth opportunity and offer greater value to customers.

[Figure: data collection workflow]

It’s a workflow that allows speech data collection to be highly customizable while staying scalable.

By adding crowdsourced task completion and management, LSPs are well positioned to provide speech data solutions for AI developers.

We Can Get the Data You Need

Summa Linguae Technologies is an experienced localization provider that also specializes in speech, image, and video data collection at scale for emerging technologies.

We collect a wide variety of data for AI-powered products, including voice assistants, chatbots, autonomous vehicles, and more.

Our clients recognize our data services team for its versatility and outside-the-box thinking.

Contact us today to discuss how we can meet your speech data needs.
