Getting an AI advancement to market before the competition shouldn’t come at the expense of ethical considerations.
We talk a lot about the need for high-quality, properly classified and labeled data to drive advancements in artificial intelligence (AI). Sometimes, though, achieving that is easier said than done.
There’s a real balancing act. You want to keep costs down but get your innovation to market before the competition. And you want to do it well to boot.
You can automate your annotation, but you miss out on the important human touchpoints that ensure quality and accuracy.
You can also outsource the collection and labeling, but risk running data through the pipeline without clear direction or fair compensation for the workers involved.
Trying to do too much too soon can result in problems down the road.
Here’s how we strive to achieve that balance.
How to Use AI For Good …
In 2022, we wrote about how AI was changing the game … literally.
Video games, that is, as AI can monitor and identify bullying, profanity, hate speech, sexual harassment, and graphic abusive language.
The AI flags and warns offenders. It can also ban users after a series of offenses or after the first offense if there’s a zero-tolerance policy.
It all begins, though, by teaching the AI to recognize harmful language.
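The flag-warn-ban flow described above can be sketched in a few lines. This is a hypothetical illustration only: the keyword list stands in for a trained model that recognizes harmful language, and the function names and warning threshold are our own assumptions, not any particular game's implementation.

```python
# Hypothetical sketch of a flag-warn-ban moderation flow.
# BANNED_TERMS is an illustrative stand-in for a trained
# toxicity classifier; WARN_LIMIT is an assumed policy value.

from collections import defaultdict

BANNED_TERMS = {"slur1", "slur2"}  # placeholder examples, not a real lexicon
WARN_LIMIT = 3  # offenses allowed before a ban

offense_counts = defaultdict(int)

def is_harmful(message: str) -> bool:
    """Stand-in for a model taught to recognize harmful language."""
    return any(term in message.lower() for term in BANNED_TERMS)

def moderate(user: str, message: str, zero_tolerance: bool = False) -> str:
    """Flag and warn offenders; ban after repeat offenses,
    or immediately under a zero-tolerance policy."""
    if not is_harmful(message):
        return "ok"
    offense_counts[user] += 1
    if zero_tolerance or offense_counts[user] >= WARN_LIMIT:
        return "banned"
    return "warned"
```

The interesting part is never the `if` statements — it's the `is_harmful` stand-in, which in practice is a model trained on carefully labeled examples of harmful language.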
… And Where It Can Go Wrong
We also brought you a look at ChatGPT, a very buzzy advancement in conversational AI chatbot technology.
What sets it apart is a dialogue format that makes it possible for ChatGPT to answer follow-up questions, admit mistakes, challenge incorrect premises, and reject inappropriate requests.
It’s the latter function that recently made the news.
Similar to the video game project, you reduce toxic speech in chatbots by feeding the model examples of violence, hate speech, and abuse so it learns to detect and flag them, and to curb users from spreading more of the same.
The detector is built in to check whether the chatbot is echoing the toxicity of its training data, filtering it out before it ever reaches the user. It could also help scrub toxic text from the training datasets of future AI models.
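The two roles the detector plays — screening the chatbot's output before it reaches the user, and scrubbing training data for future models — can be sketched like this. Again, this is an illustrative assumption: `detect_toxicity` here is a keyword stand-in for the learned detector, and the function names are ours.

```python
# Hypothetical sketch of a built-in toxicity detector used two ways:
# (1) filtering model output before it reaches the user, and
# (2) scrubbing toxic text from a training dataset.
# TOXIC_MARKERS is an illustrative placeholder for a trained model.

TOXIC_MARKERS = {"hate", "abuse"}  # placeholders, not a real lexicon

def detect_toxicity(text: str) -> bool:
    """Stand-in for the trained toxicity detector."""
    return any(marker in text.lower() for marker in TOXIC_MARKERS)

def safe_reply(generate, prompt: str, fallback: str = "[response withheld]") -> str:
    """Run the model, but filter toxic output before the user sees it."""
    reply = generate(prompt)
    return fallback if detect_toxicity(reply) else reply

def scrub_dataset(examples: list[str]) -> list[str]:
    """Remove toxic text from a training dataset for future models."""
    return [ex for ex in examples if not detect_toxicity(ex)]
```

Either way, the detector is only as good as the labeled examples it was trained on — which is where the human annotators come in.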
However, as TIME reports, laborers earned a pittance while annotating thousands of examples of text data that came from some pretty dark places on the internet.
The need for humans to label data is very real, but balancing time and cost is an issue many face.
“Despite the foundational role played by these data enrichment professionals, a growing body of research reveals the precarious working conditions these workers face,” the Partnership on AI, a coalition of AI organizations, told TIME. “This may be the result of efforts to hide AI’s dependence on this large labor force when celebrating the efficiency gains of technology. Out of sight is also out of mind.”
Let’s discuss a few insights we’ve learned along the way with respect to ethical data collection and annotation.
Ethical AI: Finding the Balance
Here’s what we know for sure. Developers need high-quality data to train and test machine learning models for an increasingly global customer base.
Human transcriptionists are still needed to capture the edge cases where automated speech recognition still struggles. At the same time, edge cases like these are very sensitive because of the content and the exposure to it.
Some of the biggest tech companies in the world outsource their data collection to third-party providers who have spent years developing efficient workflows and technology.
However, that can result in issues with compensation and employee support, as described above.
So, the options seem to boil down to the following:
- If you’re looking to keep costs down and release your AI quickly, you can rely on smaller datasets or an automated solution, or cut labor costs.
- If you want high-quality, comprehensive annotation, find a company that relies on the human eye and quality assurance nets to catch it all — at higher price points.
It’s a delicate balance, right?
How We Do It
All we can really speak to is our experience.
Sometimes a client comes to us knowing exactly how they’d like to structure their data. However, the vast majority come to us with loose requirements.
That’s either because their requirements are flexible, or they haven’t thought through all the possible variables yet.
As a data solutions provider, we see it as our role to highlight all the ways your dataset can be customized, while also steering you toward the most effective and price-conscious collection option for your solution.
So, let’s say your company is trying to solve for the edge cases where automated speech recognition still struggles – abusive language, for example.
We initiate natural data collection and err on the side of human transcription. That’s to ensure accuracy and inclusivity, and to handle complex environments and use cases.
We outsource through Robson, our crowd management platform.
We work to build trust with potential crowd members, assuring them their submissions are used for good. They’re helping make AI technology more inclusive, after all, and that’s not to be taken lightly.
When looking for a speech data provider, therefore, it’s highly recommended that you choose a data vendor that offers a customizable, flexible setup.
Because even if your speech data collection needs are fairly cookie cutter to start, they may evolve in complexity over time.
And if you handcuff yourself to an inflexible data provider without an ability to innovate, you could end up losing time and money when switching over to another provider or trying to customize your setup in-house.
That’s our two cents, anyway.
Partner With Us
As innovators in the speech collection space, Summa Linguae Technologies offers flexible, customizable data services that evolve with your needs.
To see how we can help you with your data collection project, learn more about our data solutions here.