Field Data Collection: How to Collect Natural Data

Introduction

You can’t build quality machine learning technology without quality training and testing data.

For most machine learning applications that end up in the hands of real-world users, you need to collect natural, real-world data that’s tailored to your product’s specific use cases. And while it would be great if a public dataset existed for every purpose, that’s rarely the case, creating the need to collect your own data.

Fortunately, you have options for where you find your data. You can collect most data types, such as voice recognition commands, remotely via smartphones and apps.

But what about cases when you need naturalistic data with particular acoustic or physical scenario requirements? In that case, field data collection is the best approach.

Here’s everything you need to know about field data collection:

What is field data collection?

Field data collection is a data collection project executed in person, in a specifically chosen physical location or environment (as opposed to remotely).

Field data is collected for the purpose of training an artificial intelligence algorithm using real-world, naturalistic data in a variety of realistic use cases.

Field data is often recorded on a particular hardware device, like a prototype camera or specific microphone.

When should you choose field data collection?

Field data collection is best suited for projects with specific environmental requirements, such as specific acoustics for sound recordings or a particular visual environment for images or videos. It is also the way to go if the collection requires specific hardware.

For example, a company developing in-car speech recognition technology may need to record participants from within the actual car where the system will be used.

Recording people in a sound booth wouldn’t capture artifacts like car horns or traffic noise that occur when a real person is trying to use the technology, so it’s important to build those artifacts into the training data.

Three Types of Field Data

You can collect three different types of field data for a machine learning application.

Training Data

Training data is used to fit the initial machine learning model.

This is the most basic form of information that data collection vendors gather, process, and deliver to their clients. AI developers, in turn, feed this training data into their algorithm(s) to detect patterns and learn from the data.

In other words, this field data helps to build the baseline functionality of the product.

An example of field training data would be recordings of people using specific voice commands collected for a speech recognition device. You can then transcribe (for natural language collections) and annotate the data before feeding it into the algorithm.

Validation Data

Validation data is used to test the model throughout the training process.

When developing machine learning models, you would typically collect validation data alongside the initial dataset. A portion of the original dataset is held out of the model training phase and used as validation data to fine-tune model parameters.

To use a human example: imagine a baby trying to learn what a circle is. You can train the baby with examples of circles cut out on pieces of paper, and the baby will eventually come to recognize those circles.

But how can you be sure the baby actually knows what a circle is?

The only way to know the baby can actually identify the concept of a circle is to test with new circles the baby has never seen before. Only then can we be sure the baby understands the round shape as the important variable, as opposed to the colors or type of paper you used.

These new circles are the validation data: a similar dataset to the original training data, but specific instances that haven’t been seen yet.

Testing Data

Testing data is used to evaluate the final performance of a model.

Whereas validation data is used to adjust your model, testing data is used to evaluate the overall quality of your model, or to compare its performance to other potential models.

As is the case with validation data, you can collect data for training and testing concurrently.
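
To make the three roles concrete, here is a minimal Python sketch of carving one collected dataset into training, validation, and testing portions. The function name `split_dataset`, the 80/10/10 proportions, and the file names are assumptions chosen for illustration, not a prescribed method.

```python
import random


def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle items and split them into train, validation, and test lists."""
    if not 0 <= val_fraction + test_fraction < 1:
        raise ValueError("val_fraction + test_fraction must be in [0, 1)")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    n_test = round(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test


# Hypothetical file names standing in for collected recordings.
recordings = [f"recording_{i:03d}.wav" for i in range(100)]
train, val, test = split_dataset(recordings)
print(len(train), len(val), len(test))  # 80 10 10
```

The important property is that the three lists never overlap: the model is adjusted on the validation portion and judged on a testing portion it has never seen.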

How to Choose the Right Environment

Regardless of which type of field data you’re collecting, it’s best to collect in the natural environment for which the product is intended to be used.

That said, you need to maintain balance between naturalistic and controlled data.

If your data is too natural, you may end up with too much noise to sufficiently train or test an algorithm. But if the data is too controlled, you may not reflect your device’s actual use cases.

Choosing the right environment means understanding your technology and its objectives.
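
One way to put a number on "too natural" versus "too controlled" for audio is the signal-to-noise ratio (SNR) of a recording. The sketch below assumes you can isolate a speech segment and a noise-only segment from the same recording (for example, a pause before the participant speaks). The helper names and sample values are hypothetical, and the speech segment still contains some background noise, so treat the result as an approximation.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of audio samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Approximate SNR in decibels: 20 * log10(rms(speech) / rms(noise))."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))


# Toy sample values: the speech is ten times louder than the background.
speech = [0.5, -0.5, 0.5, -0.5]
noise = [0.05, -0.05, 0.05, -0.05]
print(round(snr_db(speech, noise), 1))  # 20.0
```

What counts as an acceptable range depends entirely on the product and where it will be used, so any threshold should come from your own use cases.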

Collecting Data in Uncontrolled Environments

An uncontrolled environment is a natural setting where some conditions are outside of your control.

In a natural environment, we can better simulate the actual use-case environments where the technology will be used.

For example, speech recognition technology won’t always be used in a totally quiet room. It also has to be able to adapt to background noise, including other voices.

To collect uncontrolled speech data, Summa Linguae Technologies once hosted a number of “voice recognition parties.” We set up microphones and cameras in the participants’ homes in order to collect speech data in its most natural form: muddied with several other voices and sounds in the room. This data was used to help train speech recognition algorithms to detect a single voice in a noisier setting.

In most cases, noisy data is essential to the success of your product, as your machine learning model must learn to ignore or adapt to this interference.

In this way, uncontrolled environments give your team other challenges to think about and solve (before it’s too late).

Collecting Data in Controlled Environments

On the other hand, you may want to collect data in a controlled environment in order to simulate specific conditions.

This is an environment in which you have complete control over any outside disturbances.

When collecting speech data, the best example of a controlled environment is a sound studio. Sound studios keep background noise to a minimum. In this setting, we have full control over sound quality, ambient noise, the type of recording equipment, and the placement of the microphone relative to the speaker.

You may also choose a controlled environment for cost considerations. For example, recording participants from a sound booth may be more affordable than rigging up an outdoor setup.

If you do have to make sacrifices to how natural the data is, a robust testing phase becomes even more crucial.

Challenges of Field Data Collection

As with any project, there are lots of moving pieces to manage when it comes to field data collection.

That’s why data companies like Summa Linguae Technologies exist: because it’s far more cost-effective to outsource this field data collection than to discover and work through all of these challenges by yourself.

Here are some of the biggest challenges faced during the field data collection process.

Participant Recruitment

Data collection projects typically have very specific demographic requirements.

You may require hundreds, or even thousands of participants. You may require specific demographics like languages, accents, ethnicities, or age ranges. You also may need participants from a particular country, or to collect data within a particular country.

Finding enough participants that match your demographics (within a reasonable time frame) can be a big challenge unless you have established participant networks.
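
Tracking recruitment against demographic targets can be as simple as counting. The following Python sketch assumes each participant is described by a (language, age range) pair. The groups, target numbers, and the `remaining_quota` helper are made up for illustration.

```python
from collections import Counter


def remaining_quota(targets, recruited):
    """Return how many participants are still needed in each demographic group."""
    counts = Counter(recruited)
    # Over-recruited groups are clamped to zero rather than going negative.
    return {group: max(0, needed - counts[group]) for group, needed in targets.items()}


# Hypothetical targets keyed by (language, age range).
targets = {
    ("en-US", "18-30"): 3,
    ("en-US", "31-50"): 2,
    ("fr-CA", "18-30"): 2,
}
# One entry per participant recruited so far.
recruited = [
    ("en-US", "18-30"),
    ("en-US", "18-30"),
    ("fr-CA", "18-30"),
    ("en-US", "31-50"),
    ("en-US", "31-50"),
    ("en-US", "31-50"),
]
still_needed = remaining_quota(targets, recruited)
print(still_needed)
```

In this example, one slot is still open in each of the two 18-30 groups, while the 31-50 group is already full, which tells the recruiting team where to focus next.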

Equipment & Tools

Executing a data collection project in the field presents a few challenges from a logistical perspective.

First, collecting novel data in a particular environment often requires designing custom collection equipment.

We asked our team about some of the more challenging setups they’ve had to design.

“For one project, we needed to collect security camera footage. It needed to be mobile so we could move it to a different location every half hour or so.

We mounted the security camera to a mobile platform usually used in construction. Then we had to hook up all the supporting recording and network equipment to a mobile power source that allowed us to be completely self-sufficient in the field. And the entire setup had to be shielded from rain and the weather.”

Talk about complicated!

When designing collection setups, the key is to look for any chance at creating efficiencies.

“For another project, we needed to take pictures of participants’ faces from multiple angles. To lower costs and stay efficient, instead of taking images from each angle individually, we set up three cameras to take the photos simultaneously.

We also scripted it so that the files could be checked, named, and filed automatically. This created lots of efficiencies for the project.”
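
The team’s actual script isn’t shown here, but the idea of checking, naming, and filing captures automatically can be sketched in a few lines of Python. Everything specific below is an assumption for illustration: the three angle names, the `.jpg` file names each camera writes, and the `participant_takeNN_angle` naming scheme.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical angle names, one per camera.
CAMERA_ANGLES = ("left", "front", "right")


def file_capture(incoming_dir, archive_dir, participant_id, take):
    """Check that every camera produced a file, then rename and file each one."""
    incoming = Path(incoming_dir)
    sources = {}
    # Check: each angle must have a non-empty image before anything is moved.
    for angle in CAMERA_ANGLES:
        src = incoming / f"{angle}.jpg"
        if not src.is_file() or src.stat().st_size == 0:
            raise FileNotFoundError(f"missing or empty capture for angle '{angle}'")
        sources[angle] = src
    # Name and file: move each image into a per-participant folder.
    dest_dir = Path(archive_dir) / participant_id
    dest_dir.mkdir(parents=True, exist_ok=True)
    filed = []
    for angle, src in sources.items():
        dest = dest_dir / f"{participant_id}_take{take:02d}_{angle}.jpg"
        shutil.move(str(src), str(dest))
        filed.append(dest)
    return filed


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    incoming = Path(tmp) / "incoming"
    incoming.mkdir()
    for angle in CAMERA_ANGLES:
        (incoming / f"{angle}.jpg").write_bytes(b"placeholder image bytes")
    for path in file_capture(incoming, Path(tmp) / "archive", "P001", take=1):
        print(path.name)
```

Validating all three files before moving any of them means a failed capture leaves the incoming folder untouched, so the operator can simply retake the shot.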

Travel

While many projects can be executed anywhere in the world, you may need to travel to a particular city or country to complete your collection.

For example, if you were building a self-driving car, you would need to collect road video data. However, roads in North America look very different from roads in Europe, or roads in Asia.

If you want to build a self-driving car that works in all of those locations, and your collection can’t be performed remotely, it sounds like you’ll be hopping on a plane. You also have to ensure you can access your collection equipment while traveling, and that it functions reliably when your participants show up.

For example, if you’re collecting speech samples from a particular microphone, or video samples from a specific camera, you’ll have to make sure those devices can easily travel around with you.

It can be tricky convincing border patrol that we aren’t spies when we have seven laptops, boxes full of wires, and multiple microphones on-hand…

Weather

Field data collection is in many cases also dependent on the weather. Unless, of course, we’re conducting field data collection for a voice-activated umbrella.

How do we solve this problem? In one particularly rainy winter, we packed our bags and moved our data collection project to sunny Arizona.

Doing so allowed us to provide the data our client needed year-round.

Field Data Collection in Practice

Collecting rich, natural data is a necessary part of the development of many emerging technologies.

Apple wouldn’t have been able to come out with their face-recognition technology without collecting facial imaging data from thousands of different people.

That includes people at multiple stages of hair growth, of different ethnicities, and in different visual environments (in the evening, when it’s raining out, when the subject is backlit).

Speech recognition technologies such as Amazon’s Alexa wouldn’t be able to understand or respond to our voice commands without having collected thousands of hours of speech data.

That includes recordings in different languages, with different accents, with speech impediments, with multiple people in the conversation, or with music playing in the background.

Field data collection projects for emerging technologies such as voice-activated video games, autonomous cars, and a host of other smart devices are only growing in prevalence and size.

Start Your Field Data Collection Project

Field data collection, while crucial to the success of so many machine learning technologies, is no easy task to pull off.

Summa Linguae Technologies is a leader in data collection services, providing custom, real-world data to many prominent Fortune 500 companies.

In need of a customized dataset? Contact us now and we’ll follow up to chat about your project.

More Data Collection Resources

Looking for resources to assist with your in-field data collection project? Check out these helpful resources:

Sample Dataset Downloads

Alexa Wake Word Dataset – 24 custom multilingual Alexa wake word samples to hear the difference data variance makes for your voice assistant.
Eye Gaze Sample Set – Get a sample of high-quality eye gaze data.
Road, Car, and People Dataset – Training a system that requires road image data? Download our sample dataset.