Natural language processing with Artificial Intelligence: what you probably meant

In his extraordinary book on the history of humanity, the Israeli professor Yuval Noah Harari tells us that humans are not the only species that communicates through language; other animals do too. The difference lies in the flexibility of our language, which allows us to transmit a great deal of information about the environment, about other people and, what really makes us unique, about intangible or fictional concepts.

We carry out most of our activities using oral and written language, from close communication with family and friends, through work and school, to communication between companies and nations. Language has enabled the expression, recording and dissemination of knowledge and ideas, first in written form and then, thanks to technology, also in oral form, which explains why almost all the information available in the world today is expressed in natural language.

Given the fundamental role of language in our interactions with others, the ability to communicate successfully with computers using natural language has always been an aspiration of scientists. This capability would make computers more useful and increase the confidence with which we use them. However, the subtleties and ambiguities of language make this one of the most complex, and most interesting, areas of research in Artificial Intelligence (AI): natural language processing (NLP).


Image by Gerd Altmann from Pixabay

NLP applications are as broad as the uses of language, and today they are at different levels of advancement. These applications include:

  • Document classification. Determines from its text which category a document belongs to
  • Information retrieval. Finds documents relevant to the user's query (search engines like Google)
  • Information extraction. Derives knowledge from a text by detecting certain concepts and the relationships between them
  • Text summarization. Automatically synthesizes the main ideas of a document
  • Sentiment analysis. Determines from its text whether a message is positive or negative
  • Text analysis. Determines the grammatical structure of a sentence
  • Speech recognition. Identifies words from sound and generates the corresponding text
  • Spelling correction. Detects spelling and even grammatical errors in a text
  • Machine translation. Generates the text corresponding to a message in another language
  • Conversational systems. Combine the previous applications to hold a conversation (virtual assistants and chatbots)

Currently, the greatest advances are in document classification (spam filters), information retrieval (search engines) and spelling correction, followed by machine translation, information extraction, text analysis and sentiment analysis, while document summarization, speech recognition and conversational systems remain works in progress.


Image by Gerd Altmann from Pixabay

As in other areas of AI, NLP owes its recent advances to the use of probabilistic models and machine learning techniques, including artificial neural networks, which use millions of examples to produce the answer most likely to be correct, according to the statistical behavior of those examples. This approach does not require explicit knowledge of a language's syntax or grammar.

Some of NLP's modern algorithms work from a probabilistic language model, that is, how often each letter, symbol, or word, or a sequence of them, known as n-grams, are present in a language sample. Letter-based or symbol-based models are used for short text, such as a spell checker, while word-based models are used to analyze long texts.

For example, the previous paragraph has 57 words, or unigrams; the most common is "a", which appears 4 times, followed by "are" and "or", which appear 3 times each. It contains 56 bigrams, the most common being "models are" and "are used", each of which appears 2 times. Finally, there are 55 trigrams, and the most common, "models are used", appears 2 times.
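Counts like these are straightforward to reproduce. The following is a minimal sketch in Python; the sample sentence and the whitespace tokenization are illustrative assumptions, not the document's exact text:

```python
from collections import Counter

def ngrams(words, n):
    """Return the list of n-grams (as tuples) in a word sequence."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Illustrative sample; any paragraph works the same way.
text = "models are used for short text while other models are used for long text"
words = text.lower().split()

for n in (1, 2, 3):
    counts = Counter(ngrams(words, n))
    gram, freq = counts.most_common(1)[0]
    print(f"{len(words) - n + 1} {n}-grams; a most common one: {' '.join(gram)} ({freq} times)")
```

A real language model would tokenize more carefully (punctuation, casing, hyphens), but the counting itself is exactly this.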

Once the probabilistic model is built, it can be used to classify texts based on how often certain n-grams appear in each category; the same approach serves to determine the sentiment of a message or to perform the grammatical analysis of a sentence. For translation, statistics of n-grams paired with their equivalents in the other language are used and, in speech recognition, the correspondence between sound features and certain n-grams.
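A minimal sketch of classification by per-category n-gram frequencies (here unigrams, with add-one smoothing so unseen words do not rule a category out entirely); the tiny spam/ham training texts are invented for illustration:

```python
import math
from collections import Counter

def train(docs_by_category):
    """Count word frequencies per category from labeled example texts."""
    return {cat: Counter(w for doc in docs for w in doc.lower().split())
            for cat, docs in docs_by_category.items()}

def classify(text, counts):
    """Pick the category whose word statistics make the text most probable
    (add-one smoothing avoids zero probabilities for unseen words)."""
    vocab = {w for c in counts.values() for w in c}
    best, best_score = None, float("-inf")
    for cat, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[w] + 1) / (total + len(vocab)))
                    for w in text.lower().split())
        if score > best_score:
            best, best_score = cat, score
    return best

# Invented toy data, echoing the spam-filter example from the text.
model = train({
    "spam": ["win money now", "free money offer"],
    "ham": ["meeting notes attached", "see you at the meeting"],
})
print(classify("free money", model))  # prints "spam"
```

Production filters use far larger corpora and longer n-grams, but the principle, comparing how often the message's n-grams occur in each category, is the same.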


Image by Lucia Grzeskiewicz from Pixabay

We can see these algorithms in action in the predictive mode of instant messaging applications such as WhatsApp. If we pay attention, as we type the application suggests words, apparently based on the previous two, so it possibly uses trigrams. The selections we make update our statistics.
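A minimal sketch of such a trigram-based suggester, trained on an invented sample (a real keyboard would learn from the user's own messages):

```python
from collections import Counter, defaultdict

def train_trigrams(text):
    """Map each pair of consecutive words to counts of the words that follow."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for a, b, c in zip(words, words[1:], words[2:]):
        model[(a, b)][c] += 1
    return model

def suggest(model, prev_two, k=3):
    """Suggest up to k likely next words given the previous two words."""
    return [w for w, _ in model[tuple(prev_two)].most_common(k)]

# Invented training sample.
model = train_trigrams(
    "see you at the office see you at the party see you at the office"
)
print(suggest(model, ["at", "the"]))  # prints ['office', 'party']
```

"office" is ranked first because it followed "at the" twice in the sample and "party" only once, which is exactly the behavior the keyboard's suggestions exhibit.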

The longer the sequence of words used, the more closely the statistics relate to the meaning of the text, as word sequences become more meaningful; but this astronomically increases the volume of documents, known as a corpus, that must be analyzed to generate useful statistics. For example, a person may use a vocabulary of 500 words, giving 500 x 500 x 500 possible trigrams, a total of 125 million, which is the minimum length in words of the corpus that should be used to generate the model: about 250 thousand pages of text.
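The arithmetic above can be checked directly; the figure of 500 words per page is an assumption implied by the numbers in the text:

```python
vocab = 500
trigrams = vocab ** 3           # possible trigrams over a 500-word vocabulary
words_per_page = 500            # assumed page density implied by the text
pages = trigrams // words_per_page

print(trigrams)  # prints 125000000
print(pages)     # prints 250000
```

Note how the count grows as vocab**n: moving from trigrams to 4-grams over the same vocabulary multiplies the space by another factor of 500, which is why techniques that avoid enumerating all n-grams matter so much.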

In instant messaging applications we are tolerant of errors, and because of constant updating, in a relatively short time the tool becomes very useful, so the corpus need not be so large. In other applications, however, highly significant statistics are essential, so great efforts and investments are made to develop language models and to create new techniques that avoid having to analyze every possible n-gram. Likewise, neural network-based techniques are being deployed to find new models for translation and to generate texts of surprising quality.

Some of the most advanced NLP applications produce results that suggest there is an intelligence behind the creation of the texts, one that captures their meaning. In reality, however, we are dealing with an exercise in statistical prediction: very accurate, but without understanding. For the foreseeable future, it appears that the richness and complexity of language that sets humans apart remains in force and, as Professor Harari says, will continue to be a key factor in the success of our species.
