
Simple NLP tasks tutorial

This post is a quick tutorial about some simple NLP tasks in Python, more specifically with the NLTK and SpaCy libraries.

This post is also available in French.

Introduction and work environment

In this post, we will provide some examples of Natural Language Processing (NLP) tasks by comparing two commonly used Python libraries : NLTK and SpaCy (more information on NLP is available in these two posts : Introduction to NLP Part I and Part II).

To run these examples, it is recommended to use an Anaconda environment with Python 3.5 :

$ conda create --name spacy_nltk_examples python=3.5

$ source activate spacy_nltk_examples

For convenience, we will work in a Jupyter notebook. It is therefore necessary to install the jupyter package, as well as the spacy and nltk packages, and then to start the notebook server :

$ pip install jupyter==1.0.0 spacy==1.9.0 nltk==3.2.2

$ jupyter notebook

The notebook server will open in the web browser and we can access the dashboard. Then we create a new notebook (in Python 3) :

Figure 1. Screenshot illustrating the creation of a new Python 3 notebook.

To run the examples, we can simply copy and paste them into the notebook cells.

Figure 2. Screenshot of a Jupyter notebook cell.

And run the cells.

Figure 3. Screenshot illustrating how to run a Jupyter notebook cell.

To use SpaCy models, we first need to download them. It is possible to do that via the notebook itself :

!python -m spacy download en # english model

!python -m spacy download fr # french model
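
The NLTK functions used later in this post also rely on data packages (tokenizer models, POS-tagger model, named-entity chunker and word lists) that are not installed with the library itself. They can be downloaded from the notebook as well; here is a minimal sketch fetching the resources the NLTK examples below rely on :

# We download the NLTK data used in the following examples
import nltk
nltk.download('punkt')                       # sentence and word tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS-tagger model
nltk.download('maxent_ne_chunker')           # named-entity chunker
nltk.download('words')                       # word list used by the chunker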

Each of the following examples can be run independently of the others. To allow this, some parts are repeated each time (library imports, pipeline creation, sentence creation,…), but these repetitions are not necessary if the examples are run one after the other.

The examples are also gathered in a Jupyter notebook available on GitHub.

Text transformation

Sentence breaking

The first example is about a problem which may appear simple : breaking a text into sentences.

With SpaCy

# We import SpaCy library and create the french processing pipeline
import spacy
nlp_fr = spacy.load("fr")

# We create a text with several sentences in french
text_fr = "Ceci est 1 première phrase. Puis j'en écris une seconde. pour finir en voilà une troisième sans mettre de majuscule"

# We process the text through the pipeline
doc_fr = nlp_fr(text_fr)

# We print the sentences
for sent in doc_fr.sents:
    print(sent)

Output :

Ceci est 1 première phrase. Puis j'en
écris une seconde.
pour finir en voilà une troisième sans mettre de majuscule

With NLTK

# We import the NLTK segmentation function
from nltk.tokenize import sent_tokenize

# We create a text with several sentences
text_fr = "Ceci est 1 première phrase. Puis j'en écris une seconde. pour finir en voilà une troisième sans mettre de majuscule"

# We use the segmentation function on the text
sentences = sent_tokenize(text_fr, language = 'french')

# We print the sentences
for sent in sentences:
    print(sent)

Output :

Ceci est 1 première phrase.
Puis j'en écris une seconde.
pour finir en voilà une troisième sans mettre de majuscule

Word tokenization

A text can also be broken into tokens. These tokens can be words, n-grams, punctuation, symbols or numbers.

With SpaCy

# We import SpaCy library and create the french processing pipeline
import spacy
nlp_fr = spacy.load("fr")

# We create a sentence
text_fr = "Les tokens peuvent être des symboles $ ++, des chiffres 7 99, de la ponctuation !? des mots."

# We process the sentence through the pipeline
doc_fr = nlp_fr(text_fr)

# We list and print the tokens
words = [w.text for w in doc_fr]
print(words)

Output :

['Les', 'tokens', 'peuvent', 'être', 'des', 'symboles', '$', '+', '+', ',', 'des', 'chiffres', 
 '7', '99', ',', 'de', 'la', 'ponctuation', '!', '?', 'des', 'mots', '.']

With NLTK

# We import the NLTK word tokenization function
from nltk.tokenize import word_tokenize

# We create a sentence
text_fr = "Les tokens peuvent être des symboles $ ++, des chiffres 7 99, de la ponctuation !? des mots."

# We use the tokenization function on the sentence and print the result
words = word_tokenize(text_fr, language = 'french')
print(words)

Output :

['Les', 'tokens', 'peuvent', 'être', 'des', 'symboles', '$', '++', ',', 'des', 'chiffres', 
 '7', '99', ',', 'de', 'la', 'ponctuation', '!', '?', 'des', 'mots', '.']

In some NLP applications, it is sometimes more meaningful to work with a set of consecutive tokens rather than with a single token. For example, if the text contains the word “New-York”, a classical tokenization will break it into three distinct tokens, “New”, “-” and “York”, a segmentation which risks distorting the meaning of the word. To address this issue, it is possible to break the text into n-grams (n successive tokens). Only NLTK provides a function to generate n-grams from a list of tokens.

# We import the NLTK word tokenization and n-grams generation functions
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# We tokenize a sentence
words = word_tokenize("Le NLP avec NLTK, c'est génial!")

# We generate bi-grams from the tokens
bigrams = ngrams(words, 2)

# We print the results
print(list(bigrams))

Output :

[('Le', 'NLP'), ('NLP', 'avec'), ('avec', 'NLTK'), ('NLTK', ','), (',', "c'est"), 
 ("c'est", 'génial'), ('génial', '!')]

# We import the NLTK word tokenization and n-grams generation functions
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

# We tokenize a sentence
words = word_tokenize("Le NLP avec NLTK, c'est génial!")

# We generate tri-grams from the tokens
trigrams = ngrams(words, 3)

# We print the result
print(list(trigrams))

Output :

[('Le', 'NLP', 'avec'), ('NLP', 'avec', 'NLTK'), ('avec', 'NLTK', ','), ('NLTK', ',', "c'est"),
 (',', "c'est", 'génial'), ("c'est", 'génial', '!')]

Information extraction

Part-Of-Speech-Tagging (POS-Tagging)

POS-Tagging aims to assign a category to every token by identifying its syntactic function (verb, adjective, noun,…).

With SpaCy

# We import SpaCy library and create the french processing pipeline 
import spacy
nlp_fr = spacy.load("fr")

# We create a french sentence
text_fr = "Je vais au parc avec mon chien"

# We process the sentence through the pipeline
doc_fr = nlp_fr(text_fr)

# We print each token with its POS-tag 
for token in doc_fr:
    print('Word : {0}, Tag : {1}'.format(token.text, token.tag_))

Output :

Word : Je, Tag : PRON
Word : vais, Tag : AUX
Word : au, Tag : PRON
Word : parc, Tag : NOUN
Word : avec, Tag : ADP
Word : mon, Tag : DET
Word : chien, Tag : NOUN

# We import SpaCy library and create the english processing pipeline
import spacy
nlp_en = spacy.load("en")

# We create an english sentence
text_en = "I go to the park with my dog"

# We process the sentence through the pipeline
doc_en = nlp_en(text_en)

# We print each token with its POS-tag 
for token in doc_en:
    print('Word : {0}, Tag : {1}'.format(token.text, token.tag_))

Output :

Word : I, Tag : PRP
Word : go, Tag : VBP
Word : to, Tag : IN
Word : the, Tag : DT
Word : park, Tag : NN
Word : with, Tag : IN
Word : my, Tag : PRP$
Word : dog, Tag : NN

Links to documents explaining the French and English tags are available in the bibliography.

With NLTK

# We import the NLTK word tokenizer and the POS-Tagging functions
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# We create a sentence in english
text_en = "I go to the park with my dog"

# We tokenize the sentence
tokens_en = word_tokenize(text_en)

# We tag the tokens
tags_en = pos_tag(tokens_en)

# We print the result
print (tags_en)

Output :

[('I', 'PRP'), ('go', 'VBP'), ('to', 'TO'), ('the', 'DT'), ('park', 'NN'), ('with', 'IN'), 
 ('my', 'PRP$'), ('dog', 'NN')]
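
As a side note, NLTK can print the definition of a Penn Treebank tag through its built-in help module. Here is a minimal sketch, assuming the 'tagsets' data package has also been downloaded with nltk.download('tagsets') :

# We import NLTK (the 'tagsets' data package is needed : nltk.download('tagsets'))
import nltk

# We print the definition and some examples of the matching Penn Treebank tags
nltk.help.upenn_tagset('NN')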

Named-Entity Recognition (NER)

Named-Entity Recognition aims to identify the words of a text which correspond to categorized concepts (person names, locations, organizations,…).

With SpaCy

SpaCy has 17 named entity types : PERSON, NORP, FACILITY, ORG, GPE, LOC, PRODUCT, EVENT, WORK_OF_ART, LANGUAGE, DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.

# We import SpaCy library and create the english processing pipeline
import spacy
nlp_en = spacy.load("en")

# We create a sentence
text_en = "Mark Elliot Zuckerberg (born May 14, 1984) is a co-founder of Facebook."

# We process the sentence through the pipeline
doc_en = nlp_en(text_en)

# We print each token and if an entity is recognized, we print the entity type
for token in doc_en:
    print('Word : {0}, Entity : {1}'.format(token.text, token.ent_type_))

Output :

Word : Mark, Entity : PERSON
Word : Elliot, Entity : PERSON
Word : Zuckerberg, Entity : PERSON
Word : (, Entity :
Word : born, Entity :
Word : May, Entity : DATE
Word : 14, Entity : DATE
Word : ,, Entity : DATE
Word : 1984, Entity : DATE
Word : ), Entity :
Word : is, Entity :
Word : a, Entity :
Word : co, Entity :
Word : -, Entity :
Word : founder, Entity :
Word : of, Entity :
Word : Facebook, Entity : ORG
Word : ., Entity :

With NLTK

There are 9 entity types in NLTK : ORGANIZATION, PERSON, LOCATION, DATE, TIME, MONEY, PERCENT, FACILITY and GPE. More details on these entities are available in chapter 7, paragraph 5 of the NLTK book (see the bibliography).

# We import the NLTK word tokenization, POS-Tagging and named-entity recognition functions
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk

# We create an english sentence
text_en = "Mark Elliot Zuckerberg (born May 14, 1984) is a co-founder of Facebook."

# We tokenize the sentence
tokens_en = word_tokenize(text_en)

# We tag the tokens
tags_en = pos_tag(tokens_en)

# We use the ner function
ner_en = ne_chunk(tags_en)

# We print the result
print (ner_en)

Output :

(S
  (PERSON Mark/NNP)
  (PERSON Elliot/NNP Zuckerberg/NNP)
  (/(
  born/VBN
  May/NNP
  14/CD
  ,/,
  1984/CD
  )/)
  is/VBZ
  a/DT
  co-founder/NN
  of/IN
  (GPE Facebook/NNP)
  ./.)
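
Unlike SpaCy, which exposes the entity type directly on each token, NLTK returns a tree. As a side note, here is a minimal sketch (reusing the ner_en tree from the example above) of one possible way to flatten this tree into a list of (entity, type) pairs :

# We keep the subtrees which correspond to a recognized entity (every label except the sentence root 'S')
entities = []
for subtree in ner_en.subtrees(lambda t: t.label() != 'S'):
    entity = " ".join(word for word, tag in subtree.leaves())
    entities.append((entity, subtree.label()))

# We print the list of (entity, type) pairs
print(entities)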

NLTK vs SpaCy

Mode of operation

SpaCy and NLTK deal with text in very different ways. NLTK processes strings, whereas SpaCy takes an object-oriented approach. Indeed, NLTK provides a set of functions, one for each NLP task (pos_tag() for POS-Tagging, sent_tokenize() for sentence breaking, word_tokenize() for word tokenization,…). In general, these functions take a string or a list of strings as input and also return a string or a list of strings. So the user only has to find the function which matches the task and feed it the text as a string.

Figure 4. Illustration of the POS-Tagging operation in NLTK.

SpaCy works differently. The text, in the form of a string, has to pass through a language processing pipeline which is specific to the text language (English, French, Spanish and German are supported by SpaCy for now). The pipeline returns a Doc object, which is a container for accessing the linguistic annotations. Each element or set of elements of the Doc is also represented by an object and has attributes, methods and properties which give access to the requested linguistic information. This approach is much more modern and Pythonic.

Figure 5. Illustration of the POS-Tagging operation in SpaCy.
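
To make this more concrete, here is a minimal sketch (reusing the English model downloaded earlier) showing that a single pass through the pipeline produces a Doc whose token objects expose several annotations at once :

# We import SpaCy library and create the english processing pipeline
import spacy
nlp_en = spacy.load("en")

# A single pass through the pipeline computes all the annotations
doc_en = nlp_en("I go to the park with my dog")

# Each token object gives access to several attributes at once
for token in doc_en:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.ent_type_)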

Examples feedback

Let us now look back at the previous examples and compare the results of the two libraries.

  • The first example is about breaking a text into sentences. With SpaCy, we can break a text written in English, French, Spanish or German. In NLTK, the nltk.tokenize package offers several tools for sentence breaking, in particular the sent_tokenize() function, whose language parameter supports most European languages. If we look at our example, we can see that NLTK correctly recognizes the three sentences of the text, whereas SpaCy does not break the first two sentences correctly.
  • The second one concerns the segmentation of a sentence into tokens. As for sentence breaking, it is the nltk.tokenize package which gathers the tokenization tools in NLTK. In particular, we can mention the word_tokenize() function, which has the same language parameter as sent_tokenize(). In our example, we can notice a small difference between the two libraries : they do not treat “++” the same way. NLTK considers it as a single token whereas SpaCy splits it and returns two distinct “+” tokens.
  • The next example deals with POS-Tagging. Only SpaCy offers multiple languages for it : English, French, Spanish and German. For the English tags, the two libraries rely on the Penn Treebank. Once again, there is a small difference in the results of our example : the word “to” is tagged “TO” (the tag for the infinitive “to”) by NLTK and “IN” (the tag for a subordinating conjunction or preposition) by SpaCy. Here, NLTK seems to be wrong.
  • For the last example, we are interested in Named-Entity Recognition. As in the previous example, only SpaCy offers an alternative to English, with a German NER model; French and Spanish models are not yet available. A second advantage of SpaCy is the number of named-entity types : 17 for SpaCy versus 9 for NLTK. In our example, three entity types appear, and they are common to the two libraries : PERSON, DATE and GPE. SpaCy correctly recognizes the three entities, but NLTK only recognizes two of them; it misses the DATE.

Other tasks

Other applications, not illustrated by an example above, are achievable with these libraries (a short sketch of some of these tools is given at the end of this section) :

  • Lemmatization. It amounts to reducing a word to its canonical form, its lemma. For now, SpaCy provides word lemmas only for the English model. NLTK also implements an English lemmatizer, based on WordNet; it is available via the WordNetLemmatizer class of the nltk.stem package.
  • Stemming. It consists of reducing a word to its root form. Only NLTK proposes stemming tools. Several stemmers are available in the nltk.stem package : one for Arabic (ISRI), two for English (the Lancaster and Porter algorithms), one which is customisable with regular expressions (regexp), one for Portuguese (RSLP) and a set of stemmers (Snowball) which covers several languages (French, Danish, Dutch, English, Finnish, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish).
  • Sentiment analysis. It is still under development in SpaCy. Indeed, there is a sentiment attribute but it is empty for every language model. In NLTK, three packages are helpful for sentiment analysis. The first one is nltk.sentiment. It groups several sub-modules :
    • nltk.sentiment.vader offers a sentiment analysis tool called VADER. The polarity_scores() method of its SentimentIntensityAnalyzer returns, for a given text, a score for each polarity (positive, negative and neutral).
    • nltk.sentiment.sentiment_analyzer. This module is a tool to implement and facilitate sentiment analysis tasks using NLTK features and classifiers. It offers functions for feature extraction, classification model training and evaluation,….
    • nltk.sentiment.util groups utility methods (file conversion and parsing, writing the output of an analysis to a file,…).

The other two packages complement the first one in building classification models for sentiment analysis, by providing datasets in nltk.corpus (the available NLTK corpora are listed in the second chapter of the NLTK book given in the bibliography) and a set of classifiers in nltk.classify. Several corpora can be useful : subjectivity (it contains 5000 sentences labelled “subjective” and 5000 labelled “objective”), twitter_samples (a set of tweets annotated as positive/negative) and movie_reviews (it contains 2000 movie reviews annotated as positive/negative). The available classifiers are ConditionalExponentialClassifier, DecisionTreeClassifier, MaxentClassifier, NaiveBayesClassifier and WekaClassifier.
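
To give an idea of how these tools are used, here is a minimal sketch of lemmatization, stemming and VADER sentiment scoring with NLTK; it assumes the wordnet and vader_lexicon data packages have been downloaded with nltk.download() :

# We import the NLTK lemmatization, stemming and sentiment analysis tools
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Lemmatization (english only), requires nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("dogs"))  # prints the lemma 'dog'

# Stemming with the french Snowball stemmer
stemmer = SnowballStemmer("french")
print(stemmer.stem("généralisation"))  # prints the stem of the word

# Sentiment analysis with VADER, requires nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()
print(analyzer.polarity_scores("NLTK is really great!"))  # a dict of 'neg', 'neu', 'pos' and 'compound' scores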

Conclusion

These few simple examples are just an overview of what NLTK and SpaCy can do in NLP. Even if SpaCy is more and more widely used, the two libraries are complementary. Each one offers tools that the other does not implement, and some features are more developed in one library than in the other, and vice versa. Even if English is currently favored, we can hope that features for other languages will be developed soon, especially in SpaCy.

Bibliography

Images credit

All the images are under licence CC0.