Basics [NLTK]



Basic aspects of the Natural Language Toolkit (NLTK).

Installation and download data files


Installation commands for Anaconda and pip:

$ conda install --channel anaconda nltk

or

$ pip install nltk

To install all the data required by NLTK, first define an output directory and then download the packages by running:
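A minimal sketch using the `nltk.download` helper; the target directory below is only an example (NLTK also honors the `NLTK_DATA` environment variable), and a single package is downloaded here to keep the example light:

```python
import nltk

# Example output directory for the NLTK data files -- any writable path works.
nltk.download("punkt", download_dir="/tmp/nltk_data", quiet=True)

# To fetch the complete data collection instead (several gigabytes):
# nltk.download("all", download_dir="/tmp/nltk_data")
```

Running `nltk.download()` with no arguments opens an interactive downloader where the directory can also be set.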

Tokenize


Tokenization is the process of breaking up a text into pieces called tokens. Tokenization can happen at several different levels: paragraphs, sentences, words, syllables, or phonemes.

Given the example text:
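For instance (the passage itself is arbitrary and chosen here only for illustration):

```python
# Example text -- any short passage with more than one sentence works just as well.
text = ("Natural Language Toolkit is a platform for building Python "
        "programs to work with human language data. It is free and "
        "open source.")
```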

Tokenize sentences


Tokenize words


Penn Part of Speech Tags


Alphabetical list of part-of-speech tags used in the Penn Treebank Project.

number  tag    description
1       CC     Coordinating conjunction
2       CD     Cardinal number
3       DT     Determiner
4       EX     Existential there
5       FW     Foreign word
6       IN     Preposition or subordinating conjunction
7       JJ     Adjective
8       JJR    Adjective, comparative
9       JJS    Adjective, superlative
10      LS     List item marker
11      MD     Modal
12      NN     Noun, singular or mass
13      NNS    Noun, plural
14      NNP    Proper noun, singular
15      NNPS   Proper noun, plural
16      PDT    Predeterminer
17      POS    Possessive ending
18      PRP    Personal pronoun
19      PRP$   Possessive pronoun
20      RB     Adverb
21      RBR    Adverb, comparative
22      RBS    Adverb, superlative
23      RP     Particle
24      SYM    Symbol
25      TO     to
26      UH     Interjection
27      VB     Verb, base form
28      VBD    Verb, past tense
29      VBG    Verb, gerund or present participle
30      VBN    Verb, past participle
31      VBP    Verb, non-3rd person singular present
32      VBZ    Verb, 3rd person singular present
33      WDT    Wh-determiner
34      WP     Wh-pronoun
35      WP$    Possessive wh-pronoun
36      WRB    Wh-adverb

Chunking


Chunking uses a special regular-expression syntax for rules that delimit the chunks.

For the following example, let's find any noun or proper noun (NN, NNS, NNP, or NNPS) followed by a punctuation mark.

[figure: chunk tree]

Stemming


Stemming removes all morphological affixes from words and leaves only the word stem.
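A minimal sketch with NLTK's `PorterStemmer` (and the multilingual `SnowballStemmer` as an alternative); note that stems need not be dictionary words:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
print(porter.stem("running"))   # "run"
print(porter.stem("studies"))   # "studi" -- not a dictionary word

# Snowball ("Porter2") supports several languages
snowball = SnowballStemmer("english")
print(snowball.stem("running"))  # "run"
```

No data download is needed for the stemmers; they are rule-based.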

Lemmatization


Lemmatization is the process of converting a word to its meaningful base form.