Introduction to NLP (Part I)
This article is also available in French.
Since the early days of computing, humans have sought ways to communicate effectively with machines. While the many available programming languages let us interact with machines, the idea of establishing this communication in a more natural way has always been appealing. For this to be possible, the machine must first understand what the user tells it, and then be able to respond in a way that the human can understand.
The discipline behind this process is called Natural Language Processing (NLP), or Traitement Automatique du Langage Naturel (TALN) in French. It studies the understanding, manipulation and generation of natural language by machines. By natural language, we mean the language used by humans in their day-to-day communication, as opposed to artificial programming languages or mathematical notation.
NLP is often confused with NLU, a more recent term. NLU (Natural Language Understanding, or Compréhension du Langage Naturel in French) is in fact a subdomain of NLP that focuses on the understanding of written language. It includes tasks such as sentiment analysis, summarization, question answering and dialogue agents.
NLP is a research domain that emerged with the birth of modern computer science and has accelerated over the last few years. Its history goes back to the 1940s and 1950s. In its earliest forms it concentrated on the translation of simple sentences; we can cite the Georgetown-IBM experiment of 1954, which demonstrated the automated translation of more than sixty sentences from Russian to English.
Later, in the 1960s and 1970s, we saw the emergence of the first chatbots, such as ELIZA (1964). But it was only towards the end of the 1980s that NLP underwent a revolution, with the introduction of machine learning algorithms for language processing and the ability to capitalize on growing computational power.
Today, with the increase in available computing power, the availability of high-quality open data and the rise of Deep Learning, NLP is becoming ever more popular. Deep Learning is used in several NLP tasks and brings significant performance improvements. It is also used to transform words into numerical vectors, a technique called Word Embedding. This technique is a real advance because it produces quantitative vectors that provide a contextual similarity measure between words.
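To make this idea concrete, here is a minimal sketch of how such vectors can be compared. The four-dimensional vectors below are invented for illustration; real embeddings (e.g. from word2vec or GloVe) typically have hundreds of dimensions and are learned from large corpora.

```python
from math import sqrt

# Toy 4-dimensional "embeddings" (invented values for illustration only).
embeddings = {
    "king":  [0.8, 0.6, 0.1, 0.2],
    "queen": [0.7, 0.7, 0.1, 0.3],
    "apple": [0.1, 0.0, 0.9, 0.8],
}

def cosine_similarity(u, v):
    """Contextual similarity index between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Related words end up with a higher similarity than unrelated ones.
print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower
```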
Some functionalities of NLP
Writing NLP code can be fairly easy or quite complex depending on the objective. Among the most common implementations, we can cite text segmentation use cases. There are different degrees of segmentation. For splitting a text into sentences, we often speak of Sentence Boundary Disambiguation (SBD): determining where a sentence starts and where it ends. For this, we can use a collection of rules, either pre-established or learned from text. The following shows an example in French.
```
# Input
"Ceci est 1 première phrase. Puis j'en écris une seconde. pour finir en voilà une troisième sans mettre de majuscule"

# Output
Ceci est 1 première phrase.
Puis j'en écris une seconde.
pour finir en voilà une troisième sans mettre de majuscule
```
```
# Input
"Ceci est une première phrase Ceci en est une seconde mais il manque un point pour les séparer"

# Output
Ceci est une première phrase Ceci en est une seconde mais il manque un point pour les séparer
```
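A minimal rule-based splitter can reproduce the first example above. This is only a sketch: it cuts after '.', '!' or '?' followed by whitespace and, as the second example shows, no such rule can recover a boundary when the punctuation is missing.

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: cut after '.', '!' or '?'
    followed by whitespace. Real SBD systems also handle abbreviations
    ("M. Dupont"), decimal numbers, quotes, and so on."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

text = ("Ceci est 1 première phrase. Puis j'en écris une seconde. "
        "pour finir en voilà une troisième sans mettre de majuscule")
for sentence in split_sentences(text):
    print(sentence)
```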
We can also split a text into smaller units called 'tokens'; this is what is called tokenization. Tokens can be words, n-grams (groups of n consecutive tokens), numbers, symbols or punctuation.
```
# Input
"Les tokens peuvent êtres des symboles $, des chiffres 7 et des mots!"

# Output
['Les', 'tokens', 'peuvent', 'êtres', 'des', 'symboles', '$', ',', 'des', 'chiffres', '7', 'et', 'des', 'mots', '!']
```
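A tokenizer producing exactly that output can be sketched with a single regular expression: match runs of word characters (including accented letters), or any single character that is neither a word character nor whitespace.

```python
import re

def tokenize(text):
    """Naive tokenizer: runs of word characters, or any single
    non-word, non-space symbol such as '$', ',' or '!'."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Les tokens peuvent êtres des symboles $, des chiffres 7 et des mots!"))
```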
Finally, it is possible to perform topic segmentation. Several approaches are possible:
- It can be a supervised document classification approach. In this case, the topics are already known and we train a classification model on a corpus of texts labelled with those topics. The trained model is then used to classify new texts.
- We can also take an unsupervised approach, trying to discover the main abstract topics in a collection of texts; this is called Topic Modeling.
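As a rough illustration of the unsupervised idea, the sketch below simply extracts the most frequent non-stopword terms from a collection of texts. Real topic models such as LDA infer latent topics rather than counting raw frequencies; the documents and the stopword list here are invented for the example.

```python
from collections import Counter

# Tiny hand-written stopword list; real systems use curated lists.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to", "are", "with"}

def top_terms(documents, k=3):
    """Crude stand-in for topic modeling: return the k most frequent
    non-stopword terms across a collection of texts."""
    counts = Counter()
    for doc in documents:
        counts.update(w for w in doc.lower().split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(k)]

docs = [
    "the match ended with a goal in the final minute",
    "the team scored a late goal to win the match",
]
print(top_terms(docs))  # 'match' and 'goal' dominate this tiny corpus
```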
Classification is not limited to topics. It is also possible to classify texts or sentences by their polarity (positive or negative), for example. Classification is a supervised approach, so it needs a labelled training dataset, which is not always easy to find. There is also an unsupervised approach, called clustering, which groups similar sentences or texts together without requiring a labelled dataset.
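Here is a minimal sketch of supervised polarity classification, using a naive Bayes classifier with add-one smoothing. The four training sentences are invented for the example; a real system would be trained on thousands of labelled texts.

```python
from collections import Counter, defaultdict
import math

# Tiny invented training set (real classifiers need far more data).
train = [
    ("great movie I loved it", "positive"),
    ("what a wonderful film", "positive"),
    ("terrible plot I hated it", "negative"),
    ("boring and awful film", "negative"),
]

def fit(examples):
    """Count words per label to train a naive Bayes classifier."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab

def predict(model, text):
    """Pick the label with the highest (smoothed) log-probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit(train)
print(predict(model, "I loved this wonderful film"))  # positive
print(predict(model, "awful and boring"))             # negative
```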
Other NLP tasks include Part-of-Speech Tagging (POS tagging, or étiquetage morpho-syntaxique in French) and Named-Entity Recognition (NER, or reconnaissance d'entités nommées in French). The first consists of associating each word of a text with its morphosyntactic category (noun, verb, adjective…), based on its context and lexical knowledge.
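A toy lexicon-based tagger illustrates the idea while ignoring context entirely. The lexicon below is hand-written for this example; real taggers use context (e.g. hidden Markov models or neural networks) to disambiguate words that admit several tags.

```python
# Hand-written toy lexicon mapping words to part-of-speech tags.
LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "runs": "VERB",
    "on": "ADP",
    "lazy": "ADJ",
}

def pos_tag(sentence):
    """Assign each word its tag from the lexicon, 'UNK' if unknown."""
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence.split()]

print(pos_tag("The cat sat on the mat"))
```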
The second allows us to identify, in a text, certain categorizable elements such as names of organizations, people or companies, places, quantities, distances, values and dates.
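A crude NER sketch can be built from hand-written regular expressions, one per entity class. The patterns and classes below are chosen purely for illustration; production NER systems are trained on annotated corpora rather than written by hand.

```python
import re

# Hand-written patterns for a few entity classes (illustrative only).
PATTERNS = {
    "DATE":  r"\b\d{1,2}/\d{1,2}/\d{4}\b",
    "MONEY": r"\$\d+(?:\.\d{2})?",
    "ORG":   r"\b(?:Google|Microsoft|IBM)\b",
}

def extract_entities(text):
    """Return (entity_text, entity_class) pairs found by the regexes."""
    entities = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text):
            entities.append((match.group(), label))
    return entities

print(extract_entities("IBM paid $250 for the demo on 07/01/1954."))
```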
Finally, among the most complex NLP tasks, we can cite Machine Translation (MT, or traduction automatique in French). Statistical Machine Translation relies on prediction algorithms that 'learn' from a parallel corpus, in other words a collection of texts in several languages that are translations of one another. Neural Machine Translation uses Deep Learning algorithms instead.
All the tasks cited above are based on the analysis and understanding of natural language, but another important aspect of NLP is Natural Language Generation (NLG, or génération automatique de textes in French). NLG makes it possible to automate the generation of reports, summaries, paraphrases, chatbot responses and more.
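A common and simple form of NLG is template filling, sketched below. The template, the metric names and the wording are invented for the example; modern NLG also uses neural language models, but templates remain widespread for report generation.

```python
def generate_report(metrics):
    """Template-based NLG: fill slots in a canned sentence."""
    template = ("Sales reached {total} units in {month}, "
                "{trend} of {delta}% compared to the previous month.")
    trend = "an increase" if metrics["delta"] >= 0 else "a decrease"
    return template.format(total=metrics["total"], month=metrics["month"],
                           trend=trend, delta=abs(metrics["delta"]))

print(generate_report({"total": 1200, "month": "May", "delta": -8}))
```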
Real world examples of NLP based applications
All the different tasks cited above allow us to develop applications and tools that help us in our day-to-day activities. As mentioned earlier, machine translation is used by Google Translate, but also by many other translation tools, such as the Microsoft Translator mobile application, which covers 60 languages, or Skype Translator, which supports group conversations of up to 100 participants.
Today we are seeing a multiplication of chatbots and personal assistants (Google Now, Cortana, Siri). These technologies rely on several important NLP tasks to 'understand' users' requests and respond to them in natural language.
Schematically, in a dialog system, the user's utterance (speech or text) is first preprocessed by various NLP methods. The result is then analyzed by an NLU unit. The extracted semantic information is passed to the dialog manager, which keeps the history and state of the dialog and manages its general flow. The final stage generates a relevant and understandable response for the user.
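The pipeline just described can be sketched end to end. Every intent, action and reply below is invented for illustration; a real system would use trained NLU models and a far richer dialog manager.

```python
import re

def preprocess(utterance):
    """NLP preprocessing stage: lowercase and tokenize."""
    return re.findall(r"\w+", utterance.lower())

def understand(tokens):
    """Toy NLU unit: map keywords to an intent."""
    if "weather" in tokens:
        return "ask_weather"
    if "hello" in tokens or "hi" in tokens:
        return "greet"
    return "unknown"

class DialogManager:
    """Keeps the dialog history and picks the next action."""
    def __init__(self):
        self.history = []

    def next_action(self, intent):
        self.history.append(intent)
        return {"greet": "say_hello",
                "ask_weather": "give_forecast"}.get(intent, "ask_rephrase")

def generate(action):
    """NLG stage: turn the chosen action into a user-facing reply."""
    return {"say_hello": "Hello! How can I help you?",
            "give_forecast": "It should be sunny today.",
            "ask_rephrase": "Sorry, could you rephrase that?"}[action]

dm = DialogManager()
for utterance in ["Hello!", "What is the weather like?"]:
    print(generate(dm.next_action(understand(preprocess(utterance)))))
```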
Text classification allows us to build spam detectors or analyze sentiment, for example to predict the outcome of an election or the success of a movie or other product.
Clustering allows us to group similar documents together and can be used in blog or article recommendation systems.
Finally, many NLP tasks are used to extract information. Topic modeling, for example, lets us determine the main themes of a document without having to read it or extract a single keyword ourselves.
As we have seen in this first introductory post, NLP is an area of research under active development. Its great complexity, which lies in the ambiguity of natural language and the wide diversity of existing languages, makes it a fascinating domain that will continue to evolve and improve. This evolution will be positively impacted by the development of new frameworks and the availability of high-quality data. These two points will be covered in more detail in the second part of this introduction.
Further reading
- Wikipedia page on the history of NLP: https://en.wikipedia.org/wiki/History_of_natural_language_processing
- Academic paper "Natural language processing: a historical review": https://www.cl.cam.ac.uk/archive/ksj21/histdw4.pdf
- Intuitive article on Word Embedding : https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
- Simple article on statistical Machine Translation (in French): https://interstices.info/jcms/nn_72253/la-traduction-automatique-statistique-comment-ca-marche
- Academic paper “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” : https://arxiv.org/pdf/1609.08144.pdf
All the images are under licence CC0.