Why Natural Language Processing (NLP) Needs Natural Language Data

Large medical research companies release hundreds of new pieces of medical technology every year, but not all of them actually function as intended.

Researchers have long known that they need a quicker method of detecting faulty or otherwise dangerous new devices, but there was a problem: even the best approach, human-curated review, could take years to notice that users were reporting harms from a device. One approach has reduced the time needed to detect such problems: natural language processing, or “NLP.”

Researchers used NLP to trawl a large database of user device feedback and complaints, looking through the text to find themes and assign each device a tentative rating. The researchers found that by using NLP to analyze the database, they were able to greatly reduce the time to make a judgment.

And more importantly, those quick judgments agreed very well with the more time-consuming, human-curated analyses that came months or even years later.

Natural Language Processing is poised to be one of the most disruptive areas of computer science, and all it needs to get there is data in the form of natural human language. The medical technology check was only possible because the AI had been trained on masses of real feedback in real, natural language. Thankfully for future projects, real, naturally produced language is the most abundant online resource of all.

What is natural language processing?

Natural language processing is the area of software development concerned with regular written and spoken language, including all the inaccuracies, contradictions, and inconsistent conventions that make such communication hard even for humans to understand.

The hope is that in the future, humans won’t have to tailor their speech for interaction with machines, which will be more than capable of understanding their most natural modes of communication.

That’s why the NLP medical device feedback program was able to group together types of user feedback, even when the spellings, and sometimes the words themselves, used to express these similar ideas were highly divergent.

One person’s way of expressing a problem with device-related bleeding may differ from another’s, but both patients have the same basic problem, and it’s the NLP program’s job to notice this association through differences in expression.
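
As a rough illustration of that grouping idea, here is a minimal Python sketch. It is not the researchers' actual method: the `CONCEPTS` table, the function names, and the sample reports are all made up for this example, and the hand-written lookup table stands in for associations that a real system would learn from annotated data.

```python
import re
from collections import defaultdict

# Hypothetical lexicon mapping surface forms (including a misspelling)
# to a shared concept. A real NLP system would learn these associations.
CONCEPTS = {
    "bleeding": "bleeding",
    "bleed": "bleeding",
    "bled": "bleeding",
    "bleding": "bleeding",
    "hemorrhage": "bleeding",
    "haemorrhage": "bleeding",
    "rash": "skin_reaction",
    "itching": "skin_reaction",
    "itchy": "skin_reaction",
}


def extract_concepts(text):
    """Return the set of known concepts mentioned in one report."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {CONCEPTS[token] for token in tokens if token in CONCEPTS}


def group_by_concept(reports):
    """Group reports under every concept they mention."""
    groups = defaultdict(list)
    for report in reports:
        for concept in extract_concepts(report):
            groups[concept].append(report)
    return dict(groups)


# Made-up sample reports, purely for illustration.
reports = [
    "The device caused heavy bleding at the site.",
    "I had a hemorrhage after the implant.",
    "Skin got itchy around the patch.",
]
groups = group_by_concept(reports)
```

Here the first two reports land in the same `bleeding` group even though one misspells the word and the other uses a different term entirely.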

The goal of NLP is to create computer programs that can not only understand any coherent human statement but also respond in kind, creating easy, natural interaction with an AI. Various attempts to achieve this with classical programming have seen varying levels of success, most notably via the internet-famous Cleverbot.

It turns out, though, that for computers to understand language the way we do, they have to learn languages the way we do, as well.

Why is data so crucial to NLP?

Like most extremely hard computational problems, advancement of NLP has been handed off almost entirely to machine learning, rather than direct software development.

And that means that, just as in all other machine learning projects, the whole process is powered by data. In this case, that data is naturally formed written and spoken human language. It needs to be annotated with the real meanings of each statement, so the machine learning algorithm can embed associations between real and expressed meanings.

Without a properly curated dataset, machine learning algorithms have nothing to learn from — no failed attempts and no successful guesses.
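
To make "annotated data" a little more concrete, the sketch below trains a tiny naive Bayes text classifier on six labeled sentences using only the Python standard library. Naive Bayes is simply one well-known technique chosen for brevity, not something the projects above are known to use, and the training sentences and labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Made-up annotated examples: each text is paired with its "real meaning".
TRAINING_DATA = [
    ("device stopped working after one week", "malfunction"),
    ("battery died and screen went blank", "malfunction"),
    ("sensor stopped giving readings", "malfunction"),
    ("caused bleeding and pain at the site", "injury"),
    ("painful rash and swelling after use", "injury"),
    ("bleeding would not stop after insertion", "injury"),
]


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(model, text):
    """Pick the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        words_in_label = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out a label.
            score += math.log(
                (word_counts[label][word] + 1) / (words_in_label + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
```

With no labeled examples there would be nothing to count, which is the point of the paragraph above: the labels are what the algorithm learns from.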

How does NLP work?

There are basically two areas of NLP: Natural Language Understanding (NLU) and Natural Language Generation (NLG).

NLU has to do with applying machine learning toward breaking language down into concepts with coherent relationships, while NLG is about building up natural linguistic phrases that accurately represent a series of starting concepts. The two processes are the natural inverse of one another, though both are necessary for true NLP success.
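
The inverse relationship can be sketched with a deliberately tiny Python example. Real NLU and NLG systems are learned from data; here a single hand-written pattern and template (both hypothetical, covering one narrow sentence shape) stand in for them so that the two directions are easy to see.

```python
import re

# Hypothetical pattern covering only sentences shaped like
# "The <device> caused <harm>."
PATTERN = re.compile(
    r"^the (?P<device>[\w ]+?) caused (?P<harm>[\w ]+)\.?$",
    re.IGNORECASE,
)


def understand(sentence):
    """NLU direction: break text down into concepts (or None if unparseable)."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    return {
        "device": match.group("device").lower(),
        "harm": match.group("harm").lower(),
    }


def generate(concepts):
    """NLG direction: build a sentence back up from concepts."""
    return f"The {concepts['device']} caused {concepts['harm']}."


concepts = understand("The insulin pump caused skin irritation.")
sentence = generate(concepts)
```

Running `understand` and then `generate` returns the original sentence, while any input that does not fit the pattern yields `None`, a reminder of how brittle hand-written rules are compared with learned models.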

Development of both abilities requires incredible volumes of data, both as text and as raw audio — and as we’ve gone over before, collecting and curating data is the most difficult and time-consuming part of machine learning, by far.

But every phoneme collected and curated brings the NLP space just a little bit closer to completion — to the day when users can simply speak to their computers and receive a completely natural response in return.

Why use NLP?

As it turns out, NLP is useful for more than just letting people speak to their computers in slang. With a better ability to recognize non-dictionary-standard versions of speech, computers are simply better at understanding speech in any context: about 99% accurate, by some measures, compared to 94% before NLP integration.

That’s useful for reducing the number of individual wrong responses to user queries, which cuts frustration and response times, but it’s absolutely essential for large datasets in which that five-percentage-point difference in fidelity could mean hundreds or thousands of extra errors.
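
To put the 94% and 99% figures in concrete terms, this short Python calculation shows how many errors each accuracy level implies. The dataset sizes are hypothetical, picked only to show the scale.

```python
def expected_errors(n_items, accuracy_pct):
    """Expected number of wrong results, using whole-number percentages."""
    return n_items * (100 - accuracy_pct) // 100


# Hypothetical dataset sizes.
for n_items in (10_000, 100_000):
    before = expected_errors(n_items, 94)
    after = expected_errors(n_items, 99)
    print(f"{n_items} items: {before} errors at 94%, {after} at 99%, "
          f"{before - after} fewer")
```

At 10,000 items the gap is 500 errors, and at 100,000 items it is 5,000.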

Now, even products like Facebook Messenger are offering built-in NLP solutions for developers, so they can easily understand the concepts being discussed by users. Facebook isn’t alone in this — most advanced messengers, personal assistants, and voice apps use some level of NLP to power their linguistic interface.

Other possible applications for NLP in modern consumer services include ultra-accurate spam filtering and profanity removal. NLP applications also range into sentiment analysis, not just translating natural speech but trying to understand its emotional and conceptual tenor; some NLP projects aim to have computers analyze speech in real-time to make determinations about which stock to buy or sell.

All this requires the right dataset. But how do you get the right dataset?

NLP data comes in 7 data types

Natural language processing datasets generally concern themselves with a single type of analysis, and it’s only together that they can provide for the totality of the needs of natural language.

There’s speech recognition, in which the actual words spoken in audio are converted to text for further analysis, and machine translation for when those spoken words don’t happen to be in English.

Then, there’s text classification and language modeling, which have to do with chunking and classifying speech into concepts for further analysis. Next come the finishing touches: image captioning, question answering, and document summarization.

A given NLP query might use any number of these sorts of datasets. For instance, a request to trawl Spanish-language blogs to determine the level of excitement about a particular product would first have to convert any spoken content into text (speech recognition), then translate that Spanish text into English (machine translation).

Only then would the program be able to look at the English translation and try to find conceptual and/or emotional elements that could be relevant to our question: How do these people feel about the product?
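
The chain of steps in this example can be sketched as a simple Python pipeline. Every stage below is a stand-in: the transcript lookup, the word-by-word dictionary, and the sentiment word list are all hypothetical placeholders for models that would each be trained on their own dataset.

```python
# Stand-in for speech recognition: a lookup of pre-made transcripts.
FAKE_TRANSCRIPTS = {"clip_001.wav": "me encanta este producto"}

# Stand-in for machine translation: a crude word-by-word dictionary.
ES_TO_EN = {
    "me": "i",
    "encanta": "love",
    "este": "this",
    "producto": "product",
    "odio": "hate",
}

# Stand-in for sentiment analysis: a tiny lexicon of scored words.
SENTIMENT = {"love": 1, "great": 1, "hate": -1, "terrible": -1}


def recognize_speech(audio_id):
    """Step 1: audio -> Spanish text."""
    return FAKE_TRANSCRIPTS[audio_id]


def translate(spanish_text):
    """Step 2: Spanish text -> English text (unknown words pass through)."""
    return " ".join(ES_TO_EN.get(word, word) for word in spanish_text.split())


def score_sentiment(english_text):
    """Step 3: English text -> overall sentiment score."""
    return sum(SENTIMENT.get(word, 0) for word in english_text.split())


def analyze(audio_id):
    """Run the three stages in order, each feeding the next."""
    spanish = recognize_speech(audio_id)
    english = translate(spanish)
    return english, score_sentiment(english)


english, score = analyze("clip_001.wav")
```

Each function consumes the previous one's output, so a weak stage anywhere in the chain degrades the final answer, which is why each stage needs its own well-curated dataset.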

In each case, the dataset allows developers to use machine learning techniques to add the next crucial sub-ability to the overall NLP arsenal.

Since the whole point of NLP is to make human-machine interaction more seamless, most of the possible applications have to do with the places where that interaction has historically lagged behind. Real-time audio conversations with chatbots are an NLP dream, one that brings all the challenges of understanding language up to the demanding speed of regular human conversation.

Natural Language Processing can be applied as widely as the human facility with language itself; anything that exists in the medium of natural language can be analyzed in that medium as well.

Armed with the right datasets, NLP promises to be one of the most impactful new computing innovations in recent decades. All the average person needs to do to make sure those datasets are available is keep having natural interactions, and allow the computer models of the future to learn from them.
