The tools are out there to gather large swaths of training data for LLMs, but human touchpoints help clean, analyse, and label the data so you get exactly what you need.
Senior data scientists at Amazon recently reported “a shocking amount of the web is machine translated” into multiple languages. Additionally, the quality of these multi-way translations is often low.
Perhaps unsurprisingly, multi-way parallel translations (those spanning a high number of languages) exhibited significantly lower quality than 2-way parallel translations.
“The more languages a sentence has been translated into, the lower quality the translations are, suggesting a higher prevalence of machine translation,” said the researchers.
And this isn’t simply an issue prevalent in translations for lower-resource languages. These translations make up a “large fraction of the total web content”.
In fact, the trend is consistent across eight language pair directions:
- English→German
- German→English
- French→German
- German→French
- English→Japanese
- Japanese→English
- English→Chinese
- Chinese→English
What do these findings mean? Well, they raise “serious concerns” about the quality of training data for large language models (LLMs) that comes from web scrapes.
If the training data comes from low-quality machine translation (MT), it’s likely that the LLMs, and therefore the AI innovations built on them, will be less effective and even untrustworthy.
The data scientists emphasize that data quality is “crucial” in LLM training, noting that modern AI is enabled by huge amounts of training data, from hundreds of billions to a few trillion tokens. Training at this scale is only possible with web-scraped data, but the prevalence of machine-translated content, especially in lower-resource languages, could lead to less fluent models with more hallucinations.
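The link between multi-way parallelism and low quality suggests a simple triage step for scraped data: count how many languages each sentence appears in and send the most widely translated ones for human review. The Python sketch below is illustrative only and is not the researchers' method. The record layout, the `flag_multiway` helper, and the `max_languages` threshold are all assumptions made for this example.

```python
from collections import defaultdict


def flag_multiway(records, max_languages=3):
    """Flag translation groups that appear in more than `max_languages` languages.

    `records` is an iterable of (group_id, language, text) tuples, where
    `group_id` ties together sentences that are translations of each other.
    Both the layout and the default threshold are hypothetical.
    """
    languages_by_group = defaultdict(set)
    for group_id, language, _text in records:
        languages_by_group[group_id].add(language)

    # True means "highly multi-way parallel: route to human review".
    return {
        group_id: len(languages) > max_languages
        for group_id, languages in languages_by_group.items()
    }


# Toy data: "s1" exists in four languages, "s2" in only two.
records = [
    ("s1", "en", "Terms and conditions apply."),
    ("s1", "de", "Es gelten die Geschäftsbedingungen."),
    ("s1", "fr", "Les conditions générales s'appliquent."),
    ("s1", "ja", "利用規約が適用されます。"),
    ("s2", "en", "Our engineers tested the new bridge design last spring."),
    ("s2", "de", "Unsere Ingenieure haben den neuen Brückenentwurf im Frühjahr getestet."),
]

flags = flag_multiway(records)
print(flags)
```

A flag here is only a hint that a sentence deserves a closer look. The threshold would need tuning against human-reviewed samples before it could be trusted to filter anything.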
Here’s what Gert Van Assche, our Chief Technology Officer, says about the study:
“Thank you, #Amazon: Finally scientific evidence of something we also noticed at #SummaLinguae: web pages available in many languages (multi-way parallel data) are rarely the result of human #translation or human review. The scientists observed this in low-resource languages, but I would not be surprised if the same would be true for all languages. The best suggestion however is in the last paragraph of the paper. Just take a peek.”
Training Data for LLMs: Human-in-the-Center Approach
There’s a reason why roles like data engineers and solutions architects are now commonplace at language solutions providers.
LSPs have first-hand experience with the challenges of developing AI technology and are adding in-house technical experts to support the necessary data solutions.
The tools are out there to gather large swaths of data, but human touchpoints help clean, analyse, and label the data so you get exactly what you need.
Of course, you want to keep costs down but get your innovation to market before the competition. And you want to do it well, but also fast.
So, you can either automate your data collection and miss out on the important human touchpoints that ensure quality and accuracy, or cheaply outsource the collection and labelling, pushing data through the process without clear direction or fair compensation.
But what you need is specialized, human-assisted data collection and annotation rather than an all-encompassing, quick solution. This saves you money in the long run and gets you exactly what you need.
Don’t Settle for Data Scrapes
As a language solutions provider with data expertise, we see it as our role to highlight all the ways we can customize your datasets while also steering you towards the most effective and price-conscious collection option for your solution.
We currently support more than 80 languages and over 200 different language pairs. We analyse large training datasets and use annotation, labelling, and tagging to detect the patterns that cause issues and to enrich the data.
Let our team of linguists and subject matter experts boost your AI with clean data for machine learning and evaluations of produced output.