Basic aspects of the Natural Language Toolkit (NLTK).
import numpy as np
Installation commands for Anaconda and pip:
$ conda install --channel anaconda nltk
or
$ pip install nltk
import nltk
To install all the data that NLTK requires, first define the output directory, then download the data into it by running:
PATH = 'D:/GitHub/machine-learning-notebooks/Natural-Language-Processing/nltk_data'
nltk.data.path.append(PATH)
nltk.download(download_dir=PATH)
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
Tokenization is the process of breaking a text into smaller pieces called tokens. It can happen at several levels: paragraphs, sentences, words, syllables, or phonemes.
Given the example text:
text = 'Hello everyone, how are you all? This is an example of text, which will be tokenized in several ways. Thank you!'
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)
Hello everyone, how are you all?
This is an example of text, which will be tokenized in several ways.
Thank you!
from nltk.tokenize import word_tokenize
words = word_tokenize(text)
for word in words:
    print(word)
Hello
everyone
,
how
are
you
all
?
This
is
an
example
of
text
,
which
will
be
tokenized
in
several
ways
.
Thank
you
!
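To make the sentence and word levels concrete, here is a minimal sketch that approximates both with regular expressions from the standard library. The helper names `naive_sentences` and `naive_words` are hypothetical, and this is not how `sent_tokenize` or `word_tokenize` work internally. It only reproduces their results on this simple example and would mishandle abbreviations such as "Dr." or contractions.

```python
import re

text = ('Hello everyone, how are you all? This is an example of text, '
        'which will be tokenized in several ways. Thank you!')


def naive_sentences(s):
    # Split on whitespace that directly follows '.', '!' or '?'.
    return re.split(r'(?<=[.!?])\s+', s.strip())


def naive_words(s):
    # A token is either a run of word characters or one punctuation mark.
    return re.findall(r'\w+|[^\w\s]', s)


for sentence in naive_sentences(text):
    print(sentence)

print(naive_words(text)[:8])
# ['Hello', 'everyone', ',', 'how', 'are', 'you', 'all', '?']
```

On real text the NLTK tokenizers are the better choice, since they handle the edge cases that these two patterns ignore.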
Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
| number | tag | description |
|---|---|---|
| 1 | CC | Coordinating conjunction |
| 2 | CD | Cardinal number |
| 3 | DT | Determiner |
| 4 | EX | Existential there |
| 5 | FW | Foreign word |
| 6 | IN | Preposition or subordinating conjunction |
| 7 | JJ | Adjective |
| 8 | JJR | Adjective, comparative |
| 9 | JJS | Adjective, superlative |
| 10 | LS | List item marker |
| 11 | MD | Modal |
| 12 | NN | Noun, singular or mass |
| 13 | NNS | Noun, plural |
| 14 | NNP | Proper noun, singular |
| 15 | NNPS | Proper noun, plural |
| 16 | PDT | Predeterminer |
| 17 | POS | Possessive ending |
| 18 | PRP | Personal pronoun |
| 19 | PRP\$ | Possessive pronoun |
| 20 | RB | Adverb |
| 21 | RBR | Adverb, comparative |
| 22 | RBS | Adverb, superlative |
| 23 | RP | Particle |
| 24 | SYM | Symbol |
| 25 | TO | to |
| 26 | UH | Interjection |
| 27 | VB | Verb, base form |
| 28 | VBD | Verb, past tense |
| 29 | VBG | Verb, gerund or present participle |
| 30 | VBN | Verb, past participle |
| 31 | VBP | Verb, non-3rd person singular present |
| 32 | VBZ | Verb, 3rd person singular present |
| 33 | WDT | Wh-determiner |
| 34 | WP | Wh-pronoun |
| 35 | WP\$ | Possessive wh-pronoun |
| 36 | WRB | Wh-adverb |
tags = nltk.pos_tag(words)
print(tags)
[('Hello', 'NNP'), ('everyone', 'NN'), (',', ','), ('how', 'WRB'), ('are', 'VBP'), ('you', 'PRP'), ('all', 'DT'), ('?', '.'), ('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'), ('of', 'IN'), ('text', 'NN'), (',', ','), ('which', 'WDT'), ('will', 'MD'), ('be', 'VB'), ('tokenized', 'VBN'), ('in', 'IN'), ('several', 'JJ'), ('ways', 'NNS'), ('.', '.'), ('Thank', 'NNP'), ('you', 'PRP'), ('!', '.')]
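The tagged output is a plain list of `(word, tag)` tuples, so it can be inspected with ordinary Python. The sketch below copies the list printed above into a literal (so it runs without NLTK), counts how often each tag occurs, and collects the words whose tag starts with `NN`. The variable names `counts` and `nouns` are only illustrative.

```python
from collections import Counter

# Copied from the pos_tag output above.
tags = [('Hello', 'NNP'), ('everyone', 'NN'), (',', ','), ('how', 'WRB'),
        ('are', 'VBP'), ('you', 'PRP'), ('all', 'DT'), ('?', '.'),
        ('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
        ('of', 'IN'), ('text', 'NN'), (',', ','), ('which', 'WDT'),
        ('will', 'MD'), ('be', 'VB'), ('tokenized', 'VBN'), ('in', 'IN'),
        ('several', 'JJ'), ('ways', 'NNS'), ('.', '.'), ('Thank', 'NNP'),
        ('you', 'PRP'), ('!', '.')]

# Frequency of each tag in the example sentence.
counts = Counter(tag for _, tag in tags)
print(counts['NN'], counts['DT'], counts['.'])  # 3 3 3

# All noun-like tokens: NN, NNS, NNP and NNPS share the 'NN' prefix.
nouns = [word for word, tag in tags if tag.startswith('NN')]
print(nouns)
# ['Hello', 'everyone', 'example', 'text', 'ways', 'Thank']
```

Note that the tagger labels "Hello" and "Thank" as proper nouns here, which shows that the tags are predictions and not always linguistically correct.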
rule = r'Chunk: {<NN[SP]*.?>+<.>}'
parser = nltk.RegexpParser(rule)
chunk = parser.parse(tags)
chunk.draw()
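Since `chunk.draw()` opens a separate window, it can help to see in text form what the rule selects. The rule groups one or more noun tags (`NN`, `NNS`, `NNP`, `NNPS`) that are immediately followed by a tag consisting of a single character, such as `,` or `.`. The sketch below is a plain-Python approximation of that behaviour on the tagged list above. The function `noun_chunks` is hypothetical and is not NLTK's implementation.

```python
import re

# Copied from the pos_tag output above.
tags = [('Hello', 'NNP'), ('everyone', 'NN'), (',', ','), ('how', 'WRB'),
        ('are', 'VBP'), ('you', 'PRP'), ('all', 'DT'), ('?', '.'),
        ('This', 'DT'), ('is', 'VBZ'), ('an', 'DT'), ('example', 'NN'),
        ('of', 'IN'), ('text', 'NN'), (',', ','), ('which', 'WDT'),
        ('will', 'MD'), ('be', 'VB'), ('tokenized', 'VBN'), ('in', 'IN'),
        ('several', 'JJ'), ('ways', 'NNS'), ('.', '.'), ('Thank', 'NNP'),
        ('you', 'PRP'), ('!', '.')]

# Same tag pattern as the first element of the chunk rule.
NOUN_TAG = re.compile(r'NN[SP]*.?')


def noun_chunks(tagged_tokens):
    """Return runs of noun tags that are followed by a one-character tag."""
    chunks = []
    i = 0
    while i < len(tagged_tokens):
        j = i
        # Extend the run while the tag looks like a noun tag.
        while j < len(tagged_tokens) and NOUN_TAG.fullmatch(tagged_tokens[j][1]):
            j += 1
        closes = j < len(tagged_tokens) and len(tagged_tokens[j][1]) == 1
        if j > i and closes:
            chunks.append(tagged_tokens[i:j + 1])
            i = j + 1
        else:
            i = max(j, i + 1)
    return chunks


for found in noun_chunks(tags):
    print(' '.join(word for word, _ in found))
# Hello everyone ,
# text ,
# ways .
```

"example" and "Thank" are nouns too, but they are followed by `IN` and `PRP`, which have more than one character, so they stay outside any chunk.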

from nltk.stem import PorterStemmer
words = ['tokenization', 'running', 'pythonic', 'understandable', 'avoidable', 'memorable']
PS = PorterStemmer()
for word in words:
    print(f'{word} -> {PS.stem(word)}')
tokenization -> token
running -> run
pythonic -> python
understandable -> understand
avoidable -> avoid
memorable -> memor
from nltk.stem import WordNetLemmatizer
words = ['children', 'feet', 'wolves', 'indices', 'leaves', 'mice', 'phenomena']
WL = WordNetLemmatizer()
for word in words:
    print(f'{word} -> {WL.lemmatize(word)}')
children -> child
feet -> foot
wolves -> wolf
indices -> index
leaves -> leaf
mice -> mouse
phenomena -> phenomenon