Introducing the Summa Linguae Technologies Data Catalog

Our pre-collected data sets provide an affordable option for AI developers to purchase hard-to-find training data.

Summa Linguae Technologies is proud to announce the launch of our Data Catalog: a collection of off-the-shelf data sets for machine learning. These are ready-to-purchase training data sets that have already been collected, labeled, and verified by our team of experts.

At launch, we have 28 data sets, including speech, image, and video data—and we already have several more in the works.

The best way to learn about the Data Catalog is to explore it yourself. Most of our data sets include a downloadable sample file, so you can preview the data and judge whether it fits your use case.

Why off-the-shelf data?

While most of our clients come to us with custom data collection needs, others simply need data, and fast.

We can find solutions for these clients with a need for speed, but there are often consequences to moving quickly—such as higher costs or looser QA processes.

To better serve those clients, we’ve pre-collected and pre-processed data so that it’s ready to use today. Some of the data comes from past client projects (with client pre-approval, of course!). In other cases, we’ve collected our own data sets to satisfy the most frequent data requests from our customers.

Off-the-shelf data is useful when your priority is speed or low cost, or when your project is in its early proof-of-concept stages and you need data in quantity.

To learn more about the pros and cons of custom vs. pre-collected data, check out our blog on where to find speech data.

What types of data are in the Data Catalog?

Let’s take a closer look at what we are offering in each data set category.

Speech Data

Our speech and voice data sets include wake words, voice commands, and phone conversations in six languages and even more dialects.

The languages include:

English (US and Irish English)
French (Canadian French)
Spanish (EU and Mexican Spanish)
Dutch
Italian
Japanese

The French and Spanish sets also have adult and youth versions.

Wake Words

These data sets include wake words used for common voice assistants like Alexa, Siri, and Google Assistant.

For example, the Google Wake Words in US English data set contains recordings of the wake word “OK Google” in US English from 103 participants aged 19 to 68. Each participant recorded 10 utterances of “OK Google” remotely using Robson, our in-house mobile app for data collection.
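
To get a rough sense of the size of that set, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes every participant completed all 10 utterances and that none were discarded later, which the description does not state. The variable names are ours, for illustration.

```python
# Figures quoted above for the Google Wake Words in US English data set.
participants = 103
utterances_per_participant = 10

# Assumption: every participant delivered all 10 recordings and none were dropped.
total_utterances = participants * utterances_per_participant

print(f"Approximate number of wake word recordings: {total_utterances}")
```

Check the sample file or ask for the exact count before sizing a training run around this estimate.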

Voice Commands

Similarly, these data sets include common voice commands used for applications such as Alexa, Siri, and Google Assistant.

For example, the Mexican Spanish voice command data set contains recordings of voice commands without a wake word in Mexican Spanish from 106 participants aged 16 to 65 (e.g., “Oye Alexa, cuéntame un chiste.”).

Phone Conversations

We also offer natural phone conversation data for the purpose of training conversational AI.

For instance, our Japanese phone conversation data set includes 500 hours of time-stamped and transcribed unscripted speech data (i.e., natural conversation) between two speakers.

Image Data

These off-the-shelf image data sets can be used for computer vision applications such as facial recognition, object detection, and other visual recognition use cases.

Our initial eye gaze data set, for example, includes 62 different people, 187 eye gaze directions, 3 different head poses, and 347,820 eye gaze images.
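
The numbers quoted for the eye gaze set can be cross-checked with a little arithmetic. The sketch below assumes a full grid, meaning every person was captured in every gaze direction and every head pose. That is our assumption rather than a documented property of the data set. Under it, the total divides evenly into a fixed number of images per combination.

```python
# Figures quoted above for the eye gaze data set.
people = 62
gaze_directions = 187
head_poses = 3
total_images = 347_820

# Assumption: every person covers every gaze direction in every head pose.
combinations = people * gaze_directions * head_poses

# How many images per (person, direction, pose) combination, and is anything left over?
images_per_combination, remainder = divmod(total_images, combinations)

print(f"Combinations: {combinations}")
print(f"Images per combination: {images_per_combination} (remainder {remainder})")
```

A zero remainder is consistent with an even split across combinations, but it does not prove the data was collected that way.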

This data provides a thorough picture of eye gaze behavior for teams and corporations looking to build their biometric-enabled facial recognition technology for a variety of end-user genders, ethnicities, ages, and more.

Video Data

Video data sets can also be used for computer vision applications like facial recognition, object detection, and other visual recognition use cases.

Our roads, cars, and people video set serves a more specialized use case.

At 32 locations in Vancouver, 4 cameras recorded traffic (4- and 2-wheel vehicles, pedestrians) at an intersection from either a 45- or 90-degree angle.

This data set can be used to train autonomous car sensors, to build surveillance technology for pedestrians and cars, or to address issues like spatial awareness and precision on the road.

Get the Data You Need

To get high-quality data for your project, explore our data catalog, download a sample, and request a quote if you’re interested in purchasing a set.

If we don’t have an off-the-shelf data set you’re looking for, we also provide highly customizable data solutions to collect data to your exact specifications.

To make things even quicker, contact us now to ask how we can help with your project.
