• No results found

A corpus is a structured or unstructured collection of words or expressions.

All NLTK corpora are stored in the module nltk.corpus. Some examples follow:

gutenberg—contains eighteen English texts from the Gutenberg Project, such as Moby Dick and the Bible.

names—a list of 8,000 male and female names.

words—a list of 235,000 most frequently used English words and forms.

stopwords—a list of stop words (the most frequently used words) in fourteen languages. The English list is in stopwords.words("english"). Stop words are usually eliminated from the text for most analyses types because they usually don’t add anything to our understanding of texts.

5. www.informationweek.com/software/information-management/structure-models-and-meaning/d/d-id/1030187

• cmudict—a pronunciation dictionary composed at Carnegie Mellon Univer-sity (hence the name), with over 134,000 entries. Each entry in cmu-dict.entries() is a tuple of a word and a list of syllables. The same word may have several different pronunciations. You can use this corpus to identify homophones (“soundalikes”).

The object nltk.corpus.wordnet is an interface to another corpus—an online semantic word network WordNet (Internet access is required). The network is a collection of synsets (synonym sets)—words tagged with a part-of-speech marker and a sequential number:

wn = nltk.corpus.wordnet # The corpus reader wn.synsets("cat")

[Synset('cat.n.01'), Synset('guy.n.01'),

«

more synsets

»

]

For each synset, you can look up its definition, which may be quite unexpect-ed:

wn.synset("cat.n.01").definition() wn.synset("cat.n.02").definition()

'feline mammal usually having thick soft fur

«

...

»

'

'an informal term for a youth or man'

A synset may have hypernyms (less specific synsets) and hyponyms (more specific synsets), which make synsets look like OOP classes with subclasses and superclasses.

wn.synset("cat.n.01").hypernyms() wn.synset("cat.n.01").hyponyms() [Synset('feline.n.01')]

[Synset('domestic_cat.n.01'), Synset('wildcat.n.03')]

Finally, you can use WordNet to calculate semantic similarity between two synsets. The similarity is a double number in the range [0...1]. If the similar-ity is 0, the synsets are unrelated; if it is 1, the synsets are full synonyms.

x = wn.synset("cat.n.01") y = wn.synset("lynx.n.01") x.path_similarity(y)

0.04

Processing Texts in Natural Languages

39

(x.path_similarity(y), x, y) for x in wn.synsets('cat') for y in wn.synsets('dog')

if x.path_similarity(y) # Ensure the synsets are related at all )[1:]]

['an informal term for a youth or man', 'informal term for a man']

Surprise!

In addition to using the standard corpora, you can create your own corpora through the PlaintextCorpusReader. The reader looks for the files in the root direc-tory that match the glob pattern.

myCorpus = nltk.corpus.PlaintextCorpusReader(root, glob)

The function fileids() returns the list of files included in the newly minted cor-pus. The function raw() returns the original “raw” text in the corpus. The function sents() returns the list of all sentences. The function words() returns the list of all words. You’ll find out in the next section how the magic of con-verting raw text into sentences and words happens.

myCorpus.fileids() myCorpus.raw() myCorpus.sents() myCorpus.words()

Use the latter function (described on page 42) in conjunction with a Counter object (described on page 17) to calculate word frequencies and identify the most frequently used words.

NLTK Is a Hollow Module

When you install the module NLTK, you actually install only the classes—but not the corpora. The corpora are considered too large to be included in the distribution. When you import the module for the first time, remember to call the function download() (an Internet connection is required) and install the missing parts, depending on your needs.

Normalization

Normalization is the procedure that prepares a text in a natural language for further processing. It typically involves the following steps (more or less in this order):

1. Tokenization (breaking the text into words). NLTK provides two simple and two more advanced tokenizers. The sentence tokenizer returns a list of sentences as strings. All other tokenizers return a list of words:

• word_tokenize(text)—word tokenizer

• sent_tokenize(text)—sentence tokenizer

• regexp_tokenize(text, re)—regular expression-based tokenizer; parameter re is a regular expression that describes a valid word

Depending on the quality of the tokenizer and the punctuational structure of the sentence, some “words” may contain non-alphabetic characters. For the tasks that heavily depend on punctuation analysis, such as sentiment analysis through emoticons, you need an advanced WordPunctTokenizer. Compare how WordPunctTokenizer.tokenize() and word_tokenize() parse the same text:

from nltk.tokenize import WordPunctTokenizer word_punct = WordPunctTokenizer()

text = "}Help! :))) :[ ... :D{"

word_punct.tokenize(text)

["}", "Help", "!", ":)))", ":[", "...", ":", "D", "{"]

nltk.word_tokenize(text)

["}", "Help", "!", ":", ")", ")", ")", ":", "[", "...",

"..", ":", "D", "{"]

2. Conversion of the words to all same-case characters (all uppercase or lowercase).

3. Elimination of stop words. Use the corpus stopwords and additional appli-cation-specific stop word lists as the reference. Remember that the words in stopwords are in lowercase. If you look up “THE” (definitely a stop word) in the corpus, it won’t be there.

4. Stemming (conversion of word forms to their stems). NLTK supplies two basic stemmers: a less aggressive Porter stemmer and a more aggressive Lancaster stemmer. Due to its aggressive stemming rules, the Lancaster Processing Texts in Natural Languages

41

lstemmer = nltk.LancasterStemmer() lstemmer.stem("wonderful")

'wond'

Apply either stemmer only to single words, not to complete phrases. That’s how they work!

5. Lemmatization—a slower and more conservative stemming mechanism.

The WordNetLemmatizer looks up calculated stems in WordNet and accepts them only if they exist as words or forms. (Internet access is required to use the lemmatizer.) The function lemmatize(word) returns the lemma of word.

lemmatizer = nltk.WordNetLemmatizer() lemmatizer.lemmatize("wonderful") 'wonderful'

Though technically not a part of the normalization sequence, the part-of-speech tagging (POS tagging) is nonetheless an important step in text prepro-cessing. The function nltk.pos_tag(text) assigns a part-of-speech tag to every word in the text, which is given as a list of words. The return value is a list of tuples, where the first element of a tuple is the original word and the second element is a tag.

nltk.pos_tag(["beautiful", "world"])

# An adjective and a noun

[('beautiful', 'JJ'), ('world', 'NN')]

To put everything together, let’s display the ten most frequently used non-stop word stems in the file index.html. (Note the use of BeautifulSoup previously introduced on page 30!)

from bs4 import BeautifulSoup from collections import Counter from nltk.corpus import stopwords from nltk import LancasterStemmer

# Create a new stemmer ls = nltk.LancasterStemmer()

# Read the file and cook a soup with open("index.html") as infile:

soup = BeautifulSoup(infile)

# Extract and tokenize the text

words = nltk.word_tokenize(soup.text)

# Convert to lowercase

words = [w.lower() for w in words]

# Eliminate stop words and stem the rest of the words words = [ls.stem(w) for w in text if w not in

stopwords.words("english") and w.isalnum()]

# Tally the words freqs = Counter(words) print(freqs.most_common(10))

Think of this code fragment as the first step toward topic extraction.