Natural Language Processing with NLTK in Python

Introduction

Natural Language Processing (NLP) is a prime sub-field of Artificial Intelligence which involves processing, analyzing, and generating human language. Modern voice assistants like Siri, Cortana, Google Allo, Alexa, etc. make use of various advanced NLP algorithms to interact with humans like a human. However, we are still far from developing an absolutely domain-independent artificially intelligent agent that can trick a person into thinking it is another human being.

Python is a go-to language for any programming task. Though it may not be the most efficient implementation language, it sure is the best prototyping alternative. Most state-of-the-art NLP work is done in Java; however, due to its greater versatility, flexibility, extensibility, and ease of use, I have chosen Python. Some of the basic NLP tasks are tokenization, stemming, lemmatization, parsing, chunking, chinking, etc. Let's look at how these can be done using a Python library named NLTK.

Setting up

  • You should ideally use virtualenv for installing any Python packages, so the system must have virtualenv, python-2.7, and pip (the Python package manager) installed. Follow this guide for setting up.
  • Create a virtualenv and activate it.
  • First install NLTK using pip install nltk.
  • Download the data from the Python shell. Note that this will take a long time if your internet connection is slow. You can also download individual packages as and when you need them (see the example after the code below).
>>> import nltk
>>> nltk.download('all',halt_on_error=False)
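Alternatively, individual data packages can be downloaded by name. A minimal sketch, using the standard NLTK data identifiers needed for the tasks covered in this post:

>>> import nltk
>>> nltk.download('punkt')                       # sentence/word tokenizer models
>>> nltk.download('stopwords')                   # stop word lists
>>> nltk.download('wordnet')                     # WordNet lexical database
>>> nltk.download('averaged_perceptron_tagger')  # POS tagger model
>>> nltk.download('maxent_ne_chunker')           # named entity chunker
>>> nltk.download('words')                       # word list used by the NE chunker
>>> nltk.download('state_union')                 # State of the Union corpus used below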

Basic Tasks

Tokenization

Tokenization refers to splitting sentences into words, or paragraphs into sentences. It may sound simple, since a rudimentary solution would be a simple sentence.split(" ") call; however, it becomes complicated once punctuation is involved. Hence it is always better to use library functions whenever possible.

from nltk.tokenize import word_tokenize
query = "John is a Computer Scientist"
word_list = word_tokenize(query)
print(word_list)

Output : ['John', 'is', 'a', 'Computer', 'Scientist']

Similarly, sent_tokenize can be used for tokenizing a text paragraph into sentences.

from nltk.tokenize import sent_tokenize
query = "John is a computer scientist. John has a sister named Mary."
print(sent_tokenize(query))

Output: ['John is a computer scientist.', 'John has a sister named Mary.']

Another type of sentence tokenizer is PunktSentenceTokenizer, which implements a sentence boundary detection algorithm. This tokenizer is capable of unsupervised machine learning, so you can actually train it on any body of text that you use (an example appears in the POS Tagging section below). Refer to this link for more details.

Stop Words Removal

Natural language is nothing but a set of words, and processing it involves extracting information from those words. Many NLP tasks require the removal of stop words, the most common words that hardly carry any information, like the, a, this, is, etc.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
word_list = ["John","is","a","computer","scientist","John","has","a","sister","named","Mary"]
filtered_words = [word for word in word_list if word not in stop_words]
print(filtered_words)

Output: ['John', 'computer', 'scientist', 'John', 'sister', 'named', 'Mary']

Stemming

Stemming means converting words to their base forms. When we want to extract information from language, we need to remove redundancy: many words express the same action in different forms. When we are only interested in the base form of a word, stemming can be used. For example, traditional, tradition, etc. are all stemmed to tradit.

Note: Stemming does not necessarily give a dictionary-searchable word. If we want a proper word that exists in the language, lemmatization should be used instead.

from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("traditional"))
print(ps.stem("tradition"))

word_list = ['caresses', 'flies', 'dies', 'mules', 'denied',
             'died', 'agreed', 'owned', 'humbled', 'sized',
             'meeting', 'stating', 'siezing', 'itemization',
             'sensational', 'traditional', 'reference', 'colonizer',
             'plotted']

stem_words = [ps.stem(word) for word in word_list]
print(' '.join(stem_words))

Output:

tradit
tradit
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot

Lemmatization

  • A very similar operation to stemming is called lemmatizing. The major difference between these is, as you saw earlier, stemming can often create non-existent words, whereas lemmas are actual words.
  • So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma.
  • Sometimes you will wind up with a very similar word, but sometimes, you will wind up with a completely different word.
>>> from nltk.stem import WordNetLemmatizer
>>> lemma = WordNetLemmatizer()
>>> print(lemma.lemmatize("cats"))
cat
>>> print(lemma.lemmatize("cacti"))
cactus
>>> print(lemma.lemmatize("geese"))
goose
>>> print(lemma.lemmatize("rocks"))
rock
  • The only major thing to note is that lemmatize takes a part of speech parameter, “pos.” If not supplied, the default is “noun.”
>>> print(lemma.lemmatize("better", pos="a"))
good
>>> print(lemma.lemmatize("better"))
better

POS Tagging

POS tagging is used to assign each word a particular part of speech. Here are some common POS tags.
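As a quick illustration, nltk.pos_tag tags a tokenized sentence (a minimal sketch; the exact tags may vary slightly between tagger versions):

import nltk
words = nltk.word_tokenize("John is a computer scientist")
print(nltk.pos_tag(words))
# e.g. [('John', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('computer', 'NN'), ('scientist', 'NN')]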

POS tag list:

  • CC coordinating conjunction
  • CD cardinal digit
  • DT determiner
  • EX existential there (like: “there is” … think of it like “there exists”)
  • IN preposition/subordinating conjunction
  • JJ adjective ‘big’
  • JJR adjective, comparative ‘bigger’
  • JJS adjective, superlative ‘biggest’
  • NN noun, singular ‘desk’
  • NNS noun plural ‘desks’
  • RB adverb very, silently
  • RBR adverb, comparative better
  • VB verb, base form take
  • VBD verb, past tense took
  • VBN verb, past participle taken
  • VBP verb, present, non-3rd person singular take
  • VBZ verb, 3rd person sing. present takes
  • WDT wh-determiner which
  • WP wh-pronoun who, what

Here is how you can get an exhaustive list of POS tags :

import nltk
nltk.help.upenn_tagset()

Output :

CC: conjunction, coordinating
	& 'n and both but either et for less minus neither nor or plus so
	therefore times v. versus vs. whether yet
CD: numeral, cardinal
	mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
	seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
	fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
	all an another any both del each either every half la many much nary
	neither no some such that the them these this those
EX: existential there
	there
FW: foreign word
	gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
	lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
	terram fiche oui corporis ...
	...

Here is example code that trains a custom sentence tokenizer and applies POS tagging.

import nltk
from nltk.corpus import state_union # importing a particular corpus
from nltk.tokenize import PunktSentenceTokenizer
train_text = state_union.raw("2005-GWBush.txt") # raw text of one document from the corpus
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text) # training PunktSentenceTokenizer on the 2005 text
tokenized = custom_sent_tokenizer.tokenize(sample_text)
def process_content():
	try:
		for i in tokenized[:5]: # POS-tag the first 5 sentences only
			words = nltk.word_tokenize(i)
			tagged = nltk.pos_tag(words)
			print(tagged)

	except Exception as e:
		print(str(e))

process_content()

The above code prints, for each sentence, a list of (word, tag) tuples, with elements looking like ('during', 'IN') and ('his', 'PRP$').

Chunking and Chinking with NLTK

  • Chunking means grouping words into meaningful phrases, most commonly noun phrases.
  • Now that we know the parts of speech, we can do what is called chunking: grouping words into hopefully meaningful chunks. One of the main goals of chunking is to group words into what are known as “noun phrases.”
  • The idea is to group nouns with the words that relate to them.

The regex guide:

  • + = match 1 or more
  • ? = match 0 or 1 repetitions.
  • * = match 0 or MORE repetitions
  • . = Any character except a new line

  • You may find that, after a lot of chunking, you still have some words in your chunk that you do not want, but have no idea how to get rid of by chunking. You may find that chinking is your solution.
  • Chinking is a lot like chunking; it is basically a way for you to remove a chunk from a chunk. The chunk that you remove from your chunk is your chink (see the sketch after this list).
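Here is a minimal chunking and chinking sketch using nltk.RegexpParser; the grammar and the example sentence are illustrative, not from the original tutorial:

import nltk

sentence = "The little yellow dog barked at the cat"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Chunking: an NP chunk is an optional determiner, any number of
# adjectives, then a noun.
chunk_parser = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN>}")
print(chunk_parser.parse(tagged))

# Chinking: first chunk everything, then remove (chink) verbs,
# prepositions and determiners from inside the chunks.
chink_parser = nltk.RegexpParser(r"""
NP:
  {<.*>+}           # chunk every sequence of tags
  }<VB.?|IN|DT>+{   # chink verbs, prepositions and determiners out
""")
print(chink_parser.parse(tagged))

Both parse calls return an nltk.Tree; calling .draw() on the result opens a graphical view of the chunks.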

Named Entity Recognition

  • One of the most major forms of chunking in natural language processing is called “Named Entity Recognition.” The idea is to have the machine immediately be able to pull out “entities” like people, places, things, locations, monetary figures, and more.
  • There are two major options with NLTK’s named entity recognition: either recognize all named entities, or recognize named entities by their respective type, like people, places, locations, etc. (a minimal sketch follows this list).
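A minimal sketch of NLTK's ne_chunk (it requires the maxent_ne_chunker and words data packages; the example sentence is illustrative):

import nltk

sentence = "Barack Obama visited the Washington Monument in June."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# binary=True labels every entity simply as NE;
# binary=False (the default) labels each entity with its type (PERSON, GPE, ...).
print(nltk.ne_chunk(tagged, binary=True))
print(nltk.ne_chunk(tagged))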

NE Type and Examples

  • ORGANIZATION - Georgia-Pacific Corp., WHO
  • PERSON - Eddy Bonte, President Obama
  • LOCATION - Murray River, Mount Everest
  • DATE - June, 2008-06-29
  • TIME - two fifty a.m, 1:30 p.m.
  • MONEY - 175 million Canadian Dollars, GBP 10.40
  • PERCENT - twenty pct, 18.75 %
  • FACILITY - Washington Monument, Stonehenge
  • GPE - South East Asia, Midlothian

Check this out for a detailed post on Named Entity Recognition.

Wordnet

Wordnet is a lexical database of English that serves as a very large corpus for NLP tasks. It can be used to find synonyms of a word and to measure the similarity between two words; similarity itself can be computed in various ways, such as path similarity.

Synsets are sets of synonyms of a given word, and each Synset object has methods such as name, lemmas, definition, and examples:

from nltk.corpus import wordnet
syns = wordnet.synsets("program")  # returns a list of Synset objects
print(syns[0].name())              # plan.n.01
print(syns[0].lemmas()[0].name())  # plan -- just the lemma name
print(syns[0].definition())        # a series of steps to be carried out or goals to be accomplished
print(syns[0].examples())          # ['they drew up a six-step plan', 'they discussed plans for a new bond issue']
Finding synonyms and antonyms:
synonyms = []
antonyms = []

for syn in wordnet.synsets("good"):
	for l in syn.lemmas():
		synonyms.append(l.name())
		if l.antonyms():
			antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))
  • There are also other similarity measures, such as path similarity and Wu-Palmer similarity (see the sketch below).
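A minimal sketch of comparing two synsets; the exact scores depend on the WordNet version installed:

from nltk.corpus import wordnet

w1 = wordnet.synset("ship.n.01")
w2 = wordnet.synset("boat.n.01")
print(w1.wup_similarity(w2))   # Wu-Palmer similarity, between 0 and 1
print(w1.path_similarity(w2))  # shortest-path similarity, between 0 and 1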

Classification

  • nltk.FreqDist(all_words) gives a frequency distribution of the words.
  • NLTK has built-in classifiers such as the naive Bayes classifier (and wrappers for others, e.g. via the sklearn module), together with functions to train and test them (see the sketch below).
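Here is a minimal sketch of a frequency distribution and of training and testing NLTK's built-in NaiveBayesClassifier; the toy feature dictionaries and labels are made up purely for illustration:

import nltk

# Frequency distribution of words, as mentioned above.
all_words = nltk.FreqDist(["the", "cat", "sat", "on", "the", "mat"])
print(all_words.most_common(2))  # e.g. [('the', 2), ('cat', 1)]

# Hypothetical toy data: each example is a (feature_dict, label) pair.
train_set = [
    ({"contains_free": True,  "contains_meeting": False}, "spam"),
    ({"contains_free": False, "contains_meeting": True},  "ham"),
    ({"contains_free": True,  "contains_meeting": True},  "spam"),
    ({"contains_free": False, "contains_meeting": False}, "ham"),
]
test_set = [({"contains_free": True, "contains_meeting": False}, "spam")]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(2)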

  • We can save the trained classifier using the pickle module, so there is no need to retrain it every time we want to use it.
import pickle

save_classifier = open("naivebayes.pickle","wb")
pickle.dump(classifier, save_classifier) # serialize the trained classifier object to disk
save_classifier.close()

And to load it back:

classifier_f = open("naivebayes.pickle", "rb")
classifier = pickle.load(classifier_f)
classifier_f.close()

Scikit-Learn

  • The people behind NLTK foresaw the value of incorporating the sklearn module into the NLTK classifier methodology. As such, they created the SklearnClassifier API. To use it, you just need to import it like: `from nltk.classify.scikitlearn import SklearnClassifier` (a minimal sketch follows this list).
  • There are loads of classifiers in sklearn; just check them out.
  • NLTK can also work with the Stanford NER and POS tagging modules; check out the tutorials for more.
  • Refer to the tutorials for various other classifiers and for how a combination of such classifiers can be used.
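A minimal sketch of wrapping a scikit-learn estimator with SklearnClassifier; the toy train_set and test_set follow the same hypothetical format as in the Classification sketch above:

import nltk
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data in NLTK's (feature_dict, label) format.
train_set = [
    ({"contains_free": True},  "spam"),
    ({"contains_free": False}, "ham"),
    ({"contains_free": True},  "spam"),
    ({"contains_free": False}, "ham"),
]
test_set = [({"contains_free": True}, "spam")]

# SklearnClassifier lets any scikit-learn estimator be trained on NLTK-style data.
mnb_classifier = SklearnClassifier(MultinomialNB())
mnb_classifier.train(train_set)
print(nltk.classify.accuracy(mnb_classifier, test_set))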

Sources

  1. Python Programming tutorials