To put it in a new perspective, how is one supposed to create said new, innovative technology without the necessary data to support it?
And, if this technology truly is cutting edge and never-seen-before, then how are we supposed to know what data to collect and how?
A Quick Example
A great example of data collection for emerging technology today is the data being gathered for autonomous cars.
Image and speech data are being collected by the auto industry to promote smarter, hands-free driving using in-car speech systems. The vehicle would use this data to perform specific actions like analyzing voice requests of passengers, reading street signs, or staying in a lane.
So, how do technology companies execute data collection campaigns for the smart devices we use today, and the technology that is currently being produced for the future?
Developing New Data Collection Methods
The Purpose of the Data
Data collection methods for emerging technology often have no pre-established steps.
When you are the first to collect a specific kind of data, it is important to analyze what needs to be done and why.
First, identify the objectives and the data needed to achieve these objectives.
Second, design the methodology so that it delivers the required data.
The skill required to begin? Good old-fashioned teamwork. The team sets specific objectives, ensuring the data needed is the data collected. Doing this early on saves time and cost on mistakes.
If objectives were not properly identified, the team could miss collecting a complete dataset.
For example, suppose the objective is “Developing a voice assistant using audio data”, and audio data from 100 users is collected for this voice recognition device.
The team then realizes that the location data of each user would be useful as well. Now 100 user recordings lack a data point they should have had, because the objectives were not properly identified.
Properly identifying the objectives means collecting the right data. The dataset will be weak if the objectives of the technology being developed and the reasons the specific data is needed to develop it are not clear.
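One way to make this concrete is to write down, for each objective, the data fields it depends on, and to check the planned collection against that list before anyone goes into the field. The sketch below is only an illustration: the objective names, the field names and the `missing_fields` helper are hypothetical and not taken from any specific tool.

```python
# Hypothetical mapping from each objective to the data fields it needs.
REQUIRED_FIELDS = {
    "recognize voice requests": {"audio"},
    "adapt answers to where the user is": {"audio", "location"},
}


def missing_fields(objectives, planned_fields):
    """Return, per objective, the required fields the plan does not cover."""
    planned = set(planned_fields)
    gaps = {}
    for objective in objectives:
        missing = REQUIRED_FIELDS[objective] - planned
        if missing:
            gaps[objective] = sorted(missing)
    return gaps


# A plan that records audio only leaves the location-dependent objective uncovered.
gaps = missing_fields(REQUIRED_FIELDS, planned_fields=["audio"])
print(gaps)
```

Running a check like this while the objectives are still being discussed is cheap. Finding the same gap after 100 sessions have been recorded is not.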
Creating Use Cases
Much of the data needs to be collected under varying circumstances. This involves emulating test cases and, for technology that interacts with people, gathering users of varying nationalities, linguistic abilities, ages and genders.
What kind of user data should be collected for a speech recognition technology, like a smart home speaker? This involves identifying the recording use cases and the subjects of the recording.
Setting the parameters for this comprehensive dataset requires the use cases to be created beforehand.
You could have specific utterances that you want to collect or you may base the collection on scenarios, which is often called “natural language data collection”.
For an utterance-based audio data collection for a voice-activated home speaker, phrases like “Play Elvis’ Top 10 Songs” need to be set up beforehand.
This ensures that each user records the same use cases consistently and accurately.
In the case of natural language data collection, you would allow users to determine the output, only giving them an objective in mind. The objective could be, “Play music, specific songs, genres, albums”. Now data can be collected for the real cases of how different people ask for the same things in different ways. User one may say, “Play X song” whereas user two may say, “I want to listen to X song”. The home speaker should be ready for a variety of requests.
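The difference between the two approaches shows up in what each participant is actually handed. Below is a minimal sketch of a prompt sheet generator, assuming prompts are kept as simple lists. The prompts, the `build_prompt_sheet` function and its parameters are hypothetical examples, not part of an existing tool.

```python
import random

# Hypothetical prompts; a real project would take these from its own use cases.
SCRIPTED_UTTERANCES = [
    "Play Elvis' Top 10 Songs",
    "Turn the volume down",
    "What is the weather today?",
]
SCENARIOS = [
    "Play music, specific songs, genres, albums",
    "Ask about the weather",
]


def build_prompt_sheet(participant_id, mode, seed=0):
    """Return the ordered list of prompts shown to one participant.

    mode="utterance": the participant reads each phrase exactly as written.
    mode="natural":   the participant only sees an objective and words the
                      request in their own way.
    """
    if mode == "utterance":
        prompts = [{"type": "read_aloud", "text": t} for t in SCRIPTED_UTTERANCES]
    elif mode == "natural":
        prompts = [{"type": "free_speech", "objective": s} for s in SCENARIOS]
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Shuffle per participant so a fixed prompt order does not bias the data,
    # seeded so that the same sheet can be regenerated later.
    rng = random.Random(f"{seed}-{participant_id}")
    rng.shuffle(prompts)
    return prompts


for prompt in build_prompt_sheet("p001", mode="natural"):
    print(prompt)
```

In utterance mode every participant produces the same phrases, which makes recordings directly comparable. In natural mode the sheet only fixes the goal, so the variety comes from the participants themselves.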
The profiles of the subjects in the recording also need to be predetermined. The data collection needs to include the right accents, languages, dialects and gender ratio based on the target geography. For instance, if your target geography is the US, a diverse set of English accents with an even gender split would be a good balance of data for a voice activated home speaker to recognize a variety of voices.
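Target ratios like these can be turned into recruitment quotas before the first participant is booked. The sketch below splits a participant total across accent and gender groups. The accent labels and share values are placeholders chosen for illustration, and `allocate_quotas` is a hypothetical helper, not an established method from this article.

```python
from itertools import product


def allocate_quotas(total, accent_shares, gender_shares):
    """Split `total` participants across accent x gender cells.

    Shares are fractions that sum to 1 within each dimension. Rounding uses
    the largest-remainder method so the cell counts add up to `total`.
    """
    exact = {
        (accent, gender): total * a_share * g_share
        for (accent, a_share), (gender, g_share) in product(
            accent_shares.items(), gender_shares.items()
        )
    }
    # Start from the rounded-down count for every cell.
    quotas = {cell: int(value) for cell, value in exact.items()}
    leftover = total - sum(quotas.values())
    # Hand the remaining seats to the cells with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda c: exact[c] - quotas[c], reverse=True)
    for cell in by_remainder[:leftover]:
        quotas[cell] += 1
    return quotas


# Placeholder accent groups with an even gender split.
quotas = allocate_quotas(
    total=100,
    accent_shares={"accent_a": 0.5, "accent_b": 0.3, "accent_c": 0.2},
    gender_shares={"female": 0.5, "male": 0.5},
)
for cell, count in quotas.items():
    print(cell, count)
```

A table like this gives recruiters a concrete target per group and makes it obvious during collection which groups are still under-represented.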
Collecting the Data
The devices used in the recording set-up are decided based on the needs of the technology. This begins with the type of data that needs to be collected, which could be audio, image, video, distance or any other data points the new technology requires.
The quality of the recording devices and data needs to be assessed as well. If the purpose of the data collection is generating natural language utterances for machine learning, the audio quality may not be all that important. However, if the purpose is acoustic training, only a well designed comprehensive setup will return the desired data.
Ensure that the setup is created with the real device in mind. If an in-car voice assistant is being developed, where would that voice assistant’s microphone be? Would there be several? Would it also need to see user gestures? Place the recording devices in ways that both emulate the real use case of the technology being developed and collect valuable data.
Clean Data vs Raw Data
Depending on the objectives, there needs to be a balance between clean data and raw data. Clean data is the ideal situation: the data is recorded perfectly or cleaned up to be perfect. This can mean removing the background noise of a recording so the speaker is clearer, or cropping an image down to a stop sign so there’s nothing in the background.
Raw data would include keeping all the “dirt” in the data. Dirt could include the trees (as in the image above), or the sound of a motorbike in an audio recording. Since the data is not cleansed, it can be used for the purpose of teaching technologies what to ignore, or possibly even pay attention to, in different situations.
It may or may not be necessary for your specific use case – see our machine learning article for more information on the purpose of labeling – but if you label, annotate, or tag the data you are collecting, make sure you plan your data collection methods accordingly. How will the annotation team consume the field data? Will you be using a specific markup? Do you need to align data from different input sources?
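Aligning data from different input sources usually comes down to matching samples by time. Here is a minimal sketch, assuming every device writes timestamps from a shared clock and each stream is already sorted by time. The `align_nearest` function, the sample values and the tolerance are hypothetical.

```python
from bisect import bisect_left


def align_nearest(reference, other, tolerance):
    """Pair each reference sample with the closest-in-time sample from `other`.

    Both inputs are lists of (timestamp_seconds, value) sorted by timestamp.
    A reference sample with no match within `tolerance` seconds is paired
    with None.
    """
    other_times = [t for t, _ in other]
    aligned = []
    for t_ref, ref_value in reference:
        i = bisect_left(other_times, t_ref)
        # The nearest neighbour is either just before or just after index i.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(other)]
        best = min(candidates, key=lambda j: abs(other_times[j] - t_ref), default=None)
        if best is not None and abs(other_times[best] - t_ref) <= tolerance:
            aligned.append((t_ref, ref_value, other[best][1]))
        else:
            aligned.append((t_ref, ref_value, None))
    return aligned


# Made-up example: utterance start times and distance readings in metres.
utterances = [(0.0, "utt_001"), (2.5, "utt_002"), (9.0, "utt_003")]
distances = [(0.1, 1.2), (2.4, 1.5), (5.0, 2.0)]
for row in align_nearest(utterances, distances, tolerance=0.5):
    print(row)
```

Samples that end up paired with `None` are worth reviewing before the data reaches the annotation team, since they point to a sensor that dropped out or a clock that drifted.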
Testing the Process
Remember that this is all new, so collecting a preliminary set of data for testing is important. These preliminary collections can be run in-house before using real participants. Testing the data collection methods means saving time by fixing mistakes before hitting the field for the real data collection phase.
When using new and unfamiliar technologies or methods, things can often go wrong. The software fails to record, the hardware isn’t optimally positioned, or maybe there’s a better way to use the recording devices overall. All of this comes with experience, and that is built with testing.
Something may also be missed. Let’s say the collection only covered voice data, but it is later discovered that the voice data alone is not usable without knowing the distance of the person speaking from the microphone. Now there is a need for collecting physical data using a 3D positioning camera. Testing ensures that each real sampling scenario is run and that it collects the appropriate data.
Through testing, the process can improve. The number, positioning and quality of the recording devices become optimized. This means collecting the best data possible, and not settling for the minimum acceptable standard.
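A pilot run is easier to learn from when every session is checked automatically once it ends. The following is a minimal sketch, assuming each session is summarized as a dictionary of stream durations. The stream names, the session records and the `check_pilot_session` function are hypothetical.

```python
# Hypothetical stream names for a voice recording setup with a distance sensor.
REQUIRED_STREAMS = ("audio", "distance")


def check_pilot_session(session):
    """Return a list of human-readable problems found in one pilot session.

    `session` is a dict with an "id" and a "streams" dict that maps each
    stream name to its recorded duration in seconds.
    """
    problems = []
    streams = session.get("streams", {})
    for name in REQUIRED_STREAMS:
        if name not in streams:
            problems.append(f"{name}: stream missing")
        elif streams[name] <= 0:
            problems.append(f"{name}: nothing was recorded")
    return problems


# Made-up pilot sessions: one good, one with silent audio, one without distance.
pilot = [
    {"id": "pilot-01", "streams": {"audio": 62.0, "distance": 61.8}},
    {"id": "pilot-02", "streams": {"audio": 0.0, "distance": 60.1}},
    {"id": "pilot-03", "streams": {"audio": 58.4}},
]
report = {s["id"]: check_pilot_session(s) for s in pilot}
failed = {session_id: p for session_id, p in report.items() if p}
print(failed)
```

Checks like these catch software that failed to record or a device that was never switched on while the team is still in-house and the fix is cheap.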
The World Getting Ready for AI
Creating data collection methods for emerging technology is becoming more and more necessary the closer we move to a world powered by AI and machine learning. Many technologies need data to learn, much like children need books, pictures, videos and audio to learn.
Technology develops very quickly, and sometimes you need a little help with your data collection.
Free Data Collection Resources
Looking for resources to assist with your data collection project? Check out these helpful resources:
The Ultimate Guide to Data Collection (PDF) – Learn how to collect data for emerging technology.
Alexa Wake Word Dataset (Audio Download) – Download 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
Eye Gaze Sample Set (Download) – Get a sample of high-quality eye gaze data.
Road, Car, and People Dataset (Download) – Training a system that requires road image data? Download our sample dataset.
Want even more? Check out our Data Collection & Localization Resources page for more guides and downloads.