You can’t create new, innovative technology without effective data collection methods to support it.
Creating and refining data collection methods for emerging technology is an evolving process.
Machine learning and Artificial Intelligence (AI) require data to learn how to analyze and respond to your requests. It’s a lot like how children need books, pictures, videos, and audio to increase their knowledge base.
A great example is autonomous cars. Vast amounts of image and speech data were collected by the auto industry to promote smarter, hands-free driving using in-car speech recognition systems. The vehicle uses this data to perform specific actions: analyzing voice requests, reading street signs, or staying in a lane.
Technology develops very quickly, and sometimes you need a little help with your data collection. We’ve written elsewhere about our “four Ps” approach to speech data collection, but here we’ll take more of a bird’s-eye view.
Data Collection Methods for Emerging Technology: The Basic Steps
Identify the Purpose of Your Emerging Technology
The process begins by establishing the objective of the emerging technology you’re working on. If objectives aren’t properly identified, your team could miss collecting the necessary data.
If you’re developing a voice assistant, for example, you’ll need to gather a bunch of voice command data, beginning with wake words and moving on to more specific voice commands in target languages and dialects.
Properly identifying the objectives leads to collecting the right data. From there, you set up the methodology in a way that fulfills the needs of the required data.
Narrow Your Use Cases
Once you’ve identified what your technology is going to do, determine how wide to cast your user net. Gather data accordingly from target languages and dialects as well as from different ages and genders.
Set the parameters for this comprehensive dataset by narrowing the use cases beforehand. You could have specific, scripted utterances you want to collect, or you may base the data collection on conversational scenarios, often called natural language data.
For an utterance-based audio data collection for a voice-activated home speaker, script the phrases. For example, get people to say “Play Elvis’ Top 10 Songs.”
In the case of natural language data collection, let users determine the output, giving them only an objective to keep in mind. The objective could be, “Play music, specific songs, genres, albums.” The data will reflect how different people ask for the same things in different ways. User one could say, “Play X song,” whereas user two may say, “I want to listen to X song.”
A home speaker, therefore, should be ready for a variety of requests.
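The difference between the two collection styles can be sketched in a few lines of Python. This is only an illustration: the `Prompt` class, its field names, and the instruction wording are assumptions made for this example, not part of any particular collection tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prompt:
    """One item shown to a participant (hypothetical structure)."""
    prompt_id: str
    mode: str                          # "scripted" or "natural"
    script: Optional[str] = None       # exact phrase to read (scripted only)
    objective: Optional[str] = None    # goal to express freely (natural only)

    def instruction(self) -> str:
        """Return the text shown to the participant."""
        if self.mode == "scripted":
            return f'Please say exactly: "{self.script}"'
        if self.mode == "natural":
            return f"In your own words, ask the speaker to: {self.objective}"
        raise ValueError(f"unknown mode: {self.mode}")


prompts = [
    Prompt("p001", "scripted", script="Play Elvis' Top 10 Songs"),
    Prompt("p002", "natural", objective="play a specific song"),
]

for p in prompts:
    print(p.prompt_id, "->", p.instruction())
```

A scripted prompt yields the same words from every participant, while a natural prompt yields a different phrasing from each one, which is the variety a home speaker has to handle.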
Select Your Subjects
Data collection needs to include the right accents, languages, dialects, and gender ratio based on the target geography.
For instance, if your target geography is the US, a diverse set of English accents with an even gender split provides a good balance of data.
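One way to keep an eye on this balance is to check the participant list as it fills up. The sketch below assumes a simple list of participant records; the field names, the sample values, and the 10% tolerance are all hypothetical choices for illustration.

```python
from collections import Counter

# Hypothetical participant metadata; fields and values are assumptions.
participants = [
    {"id": "s01", "gender": "female", "accent": "Southern US"},
    {"id": "s02", "gender": "male", "accent": "Midwestern US"},
    {"id": "s03", "gender": "female", "accent": "New York"},
    {"id": "s04", "gender": "male", "accent": "Southern US"},
]


def share_by(records, field):
    """Return each value's share of the records for the given field."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def is_balanced(shares, tolerance=0.10):
    """True if every group is within `tolerance` of an even split."""
    if not shares:
        return False
    even = 1 / len(shares)
    return all(abs(s - even) <= tolerance for s in shares.values())


gender_shares = share_by(participants, "gender")
accent_shares = share_by(participants, "accent")

print("gender:", gender_shares, "balanced:", is_balanced(gender_shares))
print("accent:", accent_shares, "balanced:", is_balanced(accent_shares))
```

An even split is only one possible target. If the target geography calls for different proportions, the check would compare against those instead.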
Consider the Recording Setup
Your data needs – audio, image, video, distance, or other – dictate the devices required to collect the data.
Assess the quality of the recording devices as well. If the purpose of the speech data collection project is generating natural speech, the audio quality may not be all that important. However, if the purpose is acoustic training, only a well-designed comprehensive setup will return the desired data.
Create the setup with the technology’s end use in mind. For an in-car voice assistant, where does the microphone need to be? Would there be several? Would it also need to see your gestures? Place the recording devices in a way that captures the real use case of the developing technology.
Clean Data vs Raw Data
Find a balance between clean data and raw data.
Clean data is the ideal situation. The data is recorded perfectly or cleaned up to be perfect. This can involve removing the background noise from a recording so the speaker is clearer, or cropping an image of a stop sign so there’s nothing else in the background.
Raw data keeps all the “dirt” in the data. Dirt could include the trees (as in the image above), or the sound of a motorbike in an audio recording. Since the data is not cleansed, it can be used to teach technologies what to ignore, or possibly even what to pay attention to, in different situations.
Label the Data
Annotation may or may not be necessary for your specific use case.
But for your speech recognition AI project to reach its full potential, it must pass through a series of machine learning processes.
Don’t Forget to Test
Collecting a preliminary set of data to test is an important step. These preliminary collections can be run in-house before bringing in real participants. Testing the data collection methods saves time by fixing mistakes before hitting the field for the real data collection phase.
When using new and unfamiliar technologies or methods, things can often go wrong. The software might fail to record, the hardware might not be optimally positioned, or there might be a better way to use the recording devices overall.
Let’s say the collection covered only voice data, but you later discover that the voice data is not usable without knowing how far the speaker was from the microphone. Now you need to collect physical data using a 3D positioning camera. Testing ensures that each real sampling scenario collects the appropriate data.
Through testing, the process improves. The number, positioning and quality of the recording devices become optimized. This means collecting the best data possible, and not settling for the minimum acceptable standard.
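A pilot run is also a good moment to check that every record contains everything you will need later. The following is a minimal sketch, assuming each pilot recording is logged as a dictionary; the field names `audio_path` and `speaker_distance_m` and the sample records are hypothetical.

```python
# Fields every record must contain; these names are assumptions.
REQUIRED_FIELDS = ("audio_path", "speaker_distance_m")


def find_problems(records, required=REQUIRED_FIELDS):
    """Return (record_id, missing_fields) pairs for incomplete records."""
    problems = []
    for rec in records:
        missing = [f for f in required if rec.get(f) is None]
        if missing:
            problems.append((rec["id"], missing))
    return problems


# Hypothetical pilot log from an in-house test session.
pilot = [
    {"id": "t01", "audio_path": "t01.wav", "speaker_distance_m": 1.2},
    {"id": "t02", "audio_path": "t02.wav", "speaker_distance_m": None},
    {"id": "t03", "audio_path": None, "speaker_distance_m": 0.8},
]

for rec_id, missing in find_problems(pilot):
    print(f"{rec_id}: missing {', '.join(missing)}")
```

Running a check like this on the pilot set surfaces a missing distance measurement or a failed recording before the real collection phase begins.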
Free Data Collection Methods and Resources
Looking for resources to assist with your data collection project? Check out these helpful resources:
The Ultimate Guide to Data Collection (PDF) – Learn how to collect data for emerging technology.
Wake Word Dataset (Audio Download) – Download 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
Eye Gaze Sample Set (Download) – Get a sample of high-quality eye gaze data.
Road, Car, and People Dataset (Download) – Training a system that requires road image data? Download our sample dataset.
Want even more? Check out our Data Collection & Localization Resources page for more guides and downloads.
Contact us today for assistance with your next data collection project.