Simple NLP tasks tutorial

From theory to practice...

This post is also available in French.

 

Introduction and work environment

In this post, we will provide some examples of Natural Language Processing (NLP) tasks by comparing two commonly used Python libraries: NLTK and SpaCy (more information on NLP is available in these two posts: Introduction to NLP Part I and Part II).

To run these examples, it is recommended to use an Anaconda environment with Python 3.5:
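For example, a dedicated environment can be created and activated with conda (the environment name nlp-tutorial is only an illustration):

    conda create -n nlp-tutorial python=3.5
    source activate nlp-tutorial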

For convenience, we will work in a Jupyter notebook. We therefore need to install the jupyter package, along with the spacy and nltk packages, and then start the notebook server:
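A typical installation, assuming pip is available in the environment:

    pip install jupyter spacy nltk
    jupyter notebook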

The notebook will open in the web browser and we can access the dashboard. Then we create a new notebook (in Python 3):

Figure 1. Screenshot illustrating the creation of a new Python 3 notebook.

To run the examples, we can simply copy and paste them into the notebook cells.

Figure 2. Screenshot of a Jupyter notebook cell.

And run the cells.

Figure 3. Screenshot illustrating how to run a Jupyter notebook cell.

To use SpaCy models, we first need to download them. This can be done from the notebook itself:
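For instance, for the English and French models (the shortcuts below match spaCy 1.x/2.x; recent versions expect full model names such as en_core_web_sm):

    # The leading '!' runs the command in a shell from a notebook cell
    !python -m spacy download en
    !python -m spacy download fr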

Each of the following examples can be run independently of the others. To allow this, some parts are repeated each time (library imports, pipeline creation, sentence creation, …), but these repetitions are not necessary if the examples are run one after the other.

The examples are also gathered in a Jupyter notebook available on GitHub.

 

Text transformation

Sentence breaking

The first example is about a problem which may seem simple: breaking a text into sentences.

With SpaCy
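A minimal sketch, with an illustrative sample text (the 'en' shortcut matches spaCy 1.x/2.x; recent versions use full model names such as en_core_web_sm):

    import spacy

    # Create the English language processing pipeline
    nlp = spacy.load('en')

    # Illustrative text, not necessarily the one used in the original notebook
    text = "I live in Paris. My phone number is 0123456789. What is yours?"
    doc = nlp(text)

    # doc.sents is a generator over the detected sentences
    for sentence in doc.sents:
        print(sentence.text)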


With NLTK
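The equivalent with NLTK's sent_tokenize() (the punkt models must be downloaded once per machine):

    import nltk
    nltk.download('punkt')  # sentence tokenizer models

    from nltk.tokenize import sent_tokenize

    text = "I live in Paris. My phone number is 0123456789. What is yours?"
    for sentence in sent_tokenize(text, language='english'):
        print(sentence)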


 

Word tokenization

A text can also be broken into tokens. These tokens can be words, n-grams, punctuation, symbols or numbers.

With SpaCy
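A minimal sketch, again with an illustrative sentence:

    import spacy

    nlp = spacy.load('en')
    doc = nlp("Let's go to New-York! The C++ language is great.")

    # Iterating over the Doc yields Token objects
    print([token.text for token in doc])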


With NLTK
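The NLTK counterpart, with word_tokenize():

    import nltk
    nltk.download('punkt')

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("Let's go to New-York! The C++ language is great.", language='english')
    print(tokens)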


In some NLP applications, it is sometimes more meaningful to use a set of consecutive tokens rather than a single token. For example, if the text contains the word “New-York”, a classical tokenization will break it into three distinct tokens, “New”, “-” and “York”. This segmentation may distort the meaning of the word. To address this issue, it is possible to break the text into n-grams (n successive tokens). Only NLTK can generate n-grams from a list of tokens, as sketched below.
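A minimal sketch with nltk.util.ngrams(), producing bigrams and trigrams from an illustrative sentence:

    import nltk
    nltk.download('punkt')

    from nltk.tokenize import word_tokenize
    from nltk.util import ngrams

    tokens = word_tokenize("I would like to visit New-York next summer.")

    # An n-gram is a tuple of n consecutive tokens
    print(list(ngrams(tokens, 2)))  # bigrams
    print(list(ngrams(tokens, 3)))  # trigrams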


 

Information extraction

Part-Of-Speech-Tagging (POS-Tagging)

POS-Tagging aims to assign a category to every token by identifying its syntactic function (verb, adjective, noun, …).

With SpaCy
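A minimal sketch; the sample sentences are illustrative, and the French part assumes the French model has been downloaded:

    import spacy

    # English model: token.pos_ is the coarse tag, token.tag_ the fine-grained one
    nlp_en = spacy.load('en')
    for token in nlp_en("He wants to learn French."):
        print(token.text, token.pos_, token.tag_)

    # French model
    nlp_fr = spacy.load('fr')
    for token in nlp_fr("Il veut apprendre le français."):
        print(token.text, token.pos_, token.tag_)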


Links to documents explaining the French and English tags are available in the bibliography.

With NLTK
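The NLTK version relies on the pos_tag() function, whose default English model must be downloaded once:

    import nltk
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')  # default English POS tagger

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("He wants to learn French.")
    print(nltk.pos_tag(tokens))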


Named-Entity Recognition (NER)

Named-Entity Recognition aims to identify words in a text which correspond to categorized concepts (person names, locations, organizations, …).

With SpaCy

SpaCy has 17 named entity types: PERSON, NORP, FACILITY, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
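A minimal sketch, with an illustrative sentence containing a PERSON, a GPE and a DATE:

    import spacy

    nlp = spacy.load('en')
    doc = nlp("Barack Obama visited Paris in May 2016.")

    # doc.ents holds the detected entity spans
    for ent in doc.ents:
        print(ent.text, ent.label_)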


With NLTK

There are 9 entity types in NLTK: ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE. More details on these entities are available in chapter 7, paragraph 5 of the NLTK book (see the bibliography).
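With NLTK, NER is built on top of POS-Tagging via the ne_chunk() function, which returns a tree whose subtrees are the named entities:

    import nltk
    nltk.download('punkt')
    nltk.download('averaged_perceptron_tagger')
    nltk.download('maxent_ne_chunker')  # named-entity chunker model
    nltk.download('words')

    from nltk.tokenize import word_tokenize

    tokens = word_tokenize("Barack Obama visited Paris in May 2016.")
    tagged = nltk.pos_tag(tokens)
    print(nltk.ne_chunk(tagged))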


 

NLTK vs SpaCy

Mode of operation

SpaCy and NLTK deal with text in very different ways: NLTK processes strings, whereas SpaCy takes an object-oriented approach. NLTK provides a set of functions, one for each NLP task (pos_tag() for POS-Tagging, sent_tokenize() for sentence breaking, word_tokenize() for word tokenization, …). In general, these functions take a string or a list of strings as input and also return a string or a list of strings. The user therefore only has to find the function that matches the task and feed it the text as a string.

Figure 4. Illustration of the POS-Tagging operation in NLTK.

SpaCy works differently. The text, in the form of a string, has to pass through a language processing pipeline which is specific to the text's language (English, French, Spanish and German are supported by SpaCy for now). The pipeline returns a Doc object, which is a container for accessing linguistic annotations. Each Doc element, or set of elements, is itself represented by an object with attributes, methods and properties that give access to the requested linguistic information. This approach is much more modern and Pythonic.

Figure 5. Illustration of the POS-Tagging operation in SpaCy.

Examples feedback

Let us now look back at the previous examples and compare the results of the two libraries.

  • The first example is about breaking a text into sentences. With SpaCy, we can break a text written in English, French, Spanish or German. In NLTK, the nltk.tokenize package provides several tools for sentence breaking, in particular the sent_tokenize() function. It has a language parameter which supports most European languages. In our example, NLTK correctly recognizes the three sentences of the text, whereas SpaCy does not split the first two sentences correctly.

 

  • The second one concerns the segmentation of sentences into tokens. As for sentence breaking, the nltk.tokenize package is where NLTK gathers its token segmentation tools, in particular the word_tokenize() function, which takes the same language parameter as sent_tokenize(). In our example, we can notice a small difference between the two libraries: they do not treat “++” the same way. NLTK considers it a single token, whereas SpaCy splits it and returns two distinct tokens.

 

  • The next example deals with POS-Tagging. Only SpaCy supports multiple languages here: English, French, Spanish and German. For the English tags, the two libraries rely on the Penn Treebank. Once again, there is a small difference in the results of our example: the word “to” is tagged “TO” (the tag for the infinitive “to”) by NLTK and “IN” (the tag for a subordinating conjunction or preposition) by SpaCy. It seems that NLTK is wrong.

 

  • For the last example, we are interested in Named-Entity Recognition. As in the previous example, only SpaCy offers an alternative to English, with a German NER model; French and Spanish models are not yet available. A second advantage of SpaCy is the number of named entities: 17 for SpaCy versus 9 for NLTK. In our example, three entities are used and they are common to the two libraries: PERSON, DATE and GPE. SpaCy correctly recognizes all three entities, but NLTK recognizes only two of them and misses the DATE.

Other tasks

Other applications are also achievable with these libraries:

  • Lemmatization. It amounts to taking the canonical form of a word, its lemma. For now, SpaCy provides word lemmas only for the English model. NLTK also implements an English lemmatizer, based on WordNet; it is available via WordNetLemmatizer() in the nltk.stem package (a short sketch is given after this list).

 

  • Stemming. It consists of taking the root form of a word. Only NLTK offers stemming tools. Several stemmers are available in the nltk.stem package: one for Arabic (ISRI), two for English (the Lancaster and Porter algorithms), one customizable with regular expressions (regexp), one for Portuguese (RSLP) and a set of stemmers (Snowball) covering several languages (French, Danish, Dutch, English, Finnish, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish). See the sketch after this list.

 

  • Sentiment analysis. It is still in development in SpaCy. Indeed, there is a sentiment attribute, but it is empty for every language model. In NLTK, three packages are helpful for sentiment analysis. The first one is nltk.sentiment. It comprises several sub-modules:
      • nltk.sentiment.vader offers a sentiment analysis tool called VADER. Its SentimentIntensityAnalyzer exposes a polarity_scores() method which returns, for a given text, a score for each polarity (positive, negative and neutral); a minimal sketch is given after this list.

     

      • nltk.sentiment.sentiment_analyzer. This module is a tool to implement and facilitate sentiment analysis tasks using NLTK features and classifiers. It offers functions for feature extraction, classifier training and evaluation, …

     

      • nltk.sentiment.util gathers utility methods (file conversion and parsing, writing the output of an analysis to a file, …).

 

The other two packages supplement the first one in the building of classification models for sentiment analysis, by providing datasets in nltk.corpus (the available NLTK corpora are listed in the second chapter of the NLTK book given in the bibliography) and a set of classifiers in nltk.classify. Several corpora can be useful: subjectivity (it contains 5000 sentences labelled “subjective” and 5000 labelled “objective”), twitter_samples (a set of tweets annotated as positive or negative) and movie_reviews (it contains 2000 movie reviews annotated as positive or negative). The available classifiers are ConditionalExponentialClassifier, DecisionTreeClassifier, MaxentClassifier, NaiveBayesClassifier and WekaClassifier.
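As announced above, a minimal lemmatization sketch with both libraries (the WordNet data must be downloaded once; the sample words are illustrative):

    import nltk
    nltk.download('wordnet')

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    # pos='v' asks for the verb lemma; the default part of speech is the noun
    print(lemmatizer.lemmatize('running', pos='v'))

    import spacy
    nlp = spacy.load('en')
    print([token.lemma_ for token in nlp("The children were running")])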
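A minimal stemming sketch comparing three of the NLTK stemmers on illustrative words:

    from nltk.stem.porter import PorterStemmer
    from nltk.stem.lancaster import LancasterStemmer
    from nltk.stem.snowball import SnowballStemmer

    print(PorterStemmer().stem('presumably'))
    print(LancasterStemmer().stem('presumably'))

    # Snowball covers several languages; here the French stemmer
    print(SnowballStemmer('french').stem('continuellement'))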
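And a minimal VADER sketch (the vader_lexicon must be downloaded once; the sentence is illustrative):

    import nltk
    nltk.download('vader_lexicon')  # lexicon used by VADER

    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    sia = SentimentIntensityAnalyzer()
    # Returns neg / neu / pos scores plus a normalized 'compound' score
    print(sia.polarity_scores("NLTK is a really great library!"))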

 

Conclusion

These few simple examples are just an overview of what NLTK and SpaCy make possible in NLP. Even if SpaCy is used more and more, the two libraries are complementary. Each one offers tools that the other does not implement, and some features are more developed in one library than in the other, and vice versa. Even if English is favored, we can hope that support for other languages will soon grow, especially in SpaCy.

 

Bibliography

 

Image credits

All the images are under the CC0 licence.