[Python] Trying Out NLTK (1)

This post is almost a line-by-line transcription of the tutorial linked here.

NLTK: Natural Language Toolkit

NLTK (Natural Language Toolkit) is a Python library for processing natural language, mainly English.

Official site

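Installing NLTK

Follow the official instructions.

Installing NLTK Data

Installing NLTK Data is worthwhile: it provides the corpora and word lists, such as the Gutenberg texts and the stopword list used below.

Follow the official instructions.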

Loading a bundled text

Let's load one of the texts that ship with NLTK Data.

First, check which texts are available.

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

We will load Alice's Adventures in Wonderland.

>>> alice = nltk.text.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))

Check what kind of object this is.

>>> print(type(alice))
<class 'nltk.text.Text'>

The nltk.text.Text class is documented here.

Counting words

Count the number of tokens (words and punctuation marks) in Alice's Adventures in Wonderland.

>>> print(len(alice))
34110

Use set() to remove duplicates and count the distinct tokens.

>>> print(len(set(alice)))
5447

Count how many times the word Alice appears.

>>> print(alice.count("Alice"))
396
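
To make the difference between these three counts concrete, here is a minimal sketch on a small hand-written token list. The list and the variable names are made up for illustration and are not taken from the corpus. The sketch assumes that counting a word is an exact, case-sensitive match on tokens.

```python
# Hypothetical token list standing in for the corpus; punctuation marks are tokens too.
tokens = ["Alice", "was", "tired", ".", "alice", "was", "not", "Alice", "."]

total = len(tokens)                                  # every token, duplicates included
distinct = len(set(tokens))                          # each distinct token counted once
distinct_folded = len({t.lower() for t in tokens})   # distinct tokens ignoring case
alice_exact = tokens.count("Alice")                  # exact, case-sensitive match

print(total, distinct, distinct_folded, alice_exact)
```

Note that "Alice" and "alice" count as two different tokens unless the text is lowercased first.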

The contexts in which a word is used

Check the contexts in which the word Alice appears (a concordance).

>>> alice.concordance("Alice")
Displaying 25 of 398 matches:
                                     Alice ' s Adventures in Wonderland by Lewi
] CHAPTER I . Down the Rabbit - Hole Alice was beginning to get very tired of s
what is the use of a book ,' thought Alice ' without pictures or conversation ?
so VERY remarkable in that ; nor did Alice think it so VERY much out of the way
looked at it , and then hurried on , Alice started to her feet , for it flashed
 hedge . In another moment down went Alice after it , never once considering ho
ped suddenly down , so suddenly that Alice had not a moment to think about stop
she fell past it . ' Well !' thought Alice to herself , ' after such a fall as
down , I think --' ( for , you see , Alice had learnt several things of this so
tude or Longitude I ' ve got to ?' ( Alice had no idea what Latitude was , or L
 . There was nothing else to do , so Alice soon began talking again . ' Dinah '
cats eat bats , I wonder ?' And here Alice began to get rather sleepy , and wen
dry leaves , and the fall was over . Alice was not a bit hurt , and she jumped
 not a moment to be lost : away went Alice like the wind , and was just in time
 but they were all locked ; and when Alice had been all the way down one side a
on it except a tiny golden key , and Alice ' s first thought was that it might
and to her great delight it fitted ! Alice opened the door and found that it le
ead would go through ,' thought poor Alice , ' it would be of very little use w
ay things had happened lately , that Alice had begun to think that very few thi
ertainly was not here before ,' said Alice ,) and round the neck of the bottle
ay ' Drink me ,' but the wise little Alice was not going to do THAT in a hurry
bottle was NOT marked ' poison ,' so Alice ventured to taste it , and finding i
* * ' What a curious feeling !' said Alice ; ' I must be shutting up like a tel
 for it might end , you know ,' said Alice to herself , ' in my going out altog
garden at once ; but , alas for poor Alice ! when she got to the door , she fou
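
The idea behind a concordance can be sketched in plain Python: find every position of the word and print a window of neighboring tokens around it. This is only an illustration, not NLTK's implementation; the function name `concordance_lines` and the `width` parameter are hypothetical. The sketch matches case-insensitively, which is an assumption, although it would be consistent with the 398 matches reported here versus the 396 exact matches counted earlier.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of `word` with `width` tokens of context on each side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:  # case-insensitive match (an assumption of this sketch)
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keyword lines up in a column.
            lines.append(f"{left:>20} {tok} {right}")
    return lines

tokens = "Down the Rabbit - Hole Alice was beginning to get very tired".split()
lines = concordance_lines(tokens, "alice")
for line in lines:
    print(line)
```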

How words are distributed within the text

Visualize where in the text the words Alice, Rabbit, and Queen appear.

>>> alice.dispersion_plot(["Alice", "Rabbit", "Queen"])
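
A dispersion plot marks, for each word, the token offsets at which it occurs. The data behind such a plot can be sketched without any plotting library, as below. The helper `word_offsets` and the sample tokens are hypothetical, and the text row printed at the end is only a crude stand-in for the real chart.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    wanted = set(words)
    offsets = {w: [] for w in words}
    for i, tok in enumerate(tokens):
        if tok in wanted:
            offsets[tok].append(i)
    return offsets

# Hypothetical sample text, already split into tokens.
tokens = ["Alice", "saw", "the", "Rabbit", ".", "Alice", "ran", ".", "The", "Queen", "shouted"]
offsets = word_offsets(tokens, ["Alice", "Rabbit", "Queen"])

for word, positions in offsets.items():
    # One row per word; an 'x' marks each position where the word occurs.
    row = "".join("x" if i in positions else "." for i in range(len(tokens)))
    print(f"{word:>6} {row}")
```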

Counting words into a dictionary

To count the tokens that appear in a dictionary-like form, NLTK provides the FreqDist class.

>>> fdist = nltk.FreqDist(alice)
>>> fdist
FreqDist({',': 1993, "'": 1731, 'the': 1527, 'and': 802, '.': 764, 'to': 725, 'a': 615, 'I': 543, 'it': 527, 'she': 509, ...})
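
FreqDist is a subclass of the standard library's collections.Counter, so its basic behavior can be explored with Counter alone. The sketch below uses a made-up token list rather than the corpus.

```python
from collections import Counter

# Hypothetical tokens; punctuation is counted like any other token.
tokens = ["the", "cat", ",", "the", "hat", ",", "the", "end"]
counts = Counter(tokens)

print(counts["the"])          # count for a single token
print(counts["dog"])          # a missing token yields 0 instead of raising KeyError
print(counts.most_common(2))  # the two most frequent tokens with their counts
```

This also explains why punctuation such as the comma and the apostrophe tops the list above: every token is counted, not only words.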

Plot fdist.

>>> fdist.plot(30, title="Top 30 words in Alice's Adventures in Wonderland")

Remove punctuation from fdist and plot the result.

>>> fdist_no_punc = nltk.FreqDist(
...     {word: freq for word, freq in fdist.items() if word.isalpha()})
>>> fdist_no_punc.plot(
...     30, title="Top 30 words in Alice's Adventures in Wonderland (punctuation removed)")
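
The filter relies on str.isalpha(), which is true only when every character in the token is a letter. The sketch below, on a made-up Counter, shows that this drops not just punctuation but also tokens containing digits.

```python
from collections import Counter

# Hypothetical counts; "don", "'", "t" mimic how the tokenizer splits "don't".
counts = Counter(["Alice", ",", "'", "the", "--", "don", "'", "t", "the", "42"])

# Keep only tokens made up entirely of letters.
words_only = Counter({tok: n for tok, n in counts.items() if tok.isalpha()})

print(sorted(words_only.items()))
```

Fragments such as "t" and "s" survive this filter because they are alphabetic.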

Remove stopwords and plot the result.

A stopword list is provided with NLTK.

>>> stopwords = nltk.corpus.stopwords.words('english')
>>> stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

>>> fdist_no_punc_no_stopwords = nltk.FreqDist(
...     {word: freq for word, freq in fdist.items()
...      if word.lower() not in stopwords and word.isalpha()})
>>> fdist_no_punc_no_stopwords.plot(
...     30, title="Top 30 words in Alice's Adventures in Wonderland (punctuation and stopwords removed)")
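
The same filtering can be sketched with the standard library alone. The stopword set below is a tiny hypothetical stand-in for the real list, and storing it as a set rather than a list is a design choice of this sketch: membership tests on a set do not scan every entry. Tokens are lowercased before the lookup because the stopword list is all lowercase.

```python
from collections import Counter

# Hypothetical mini stopword set; the real list is much longer.
stopword_set = {"the", "and", "she", "it", "to"}

counts = Counter(["The", "Queen", "and", "the", "Rabbit", ",", "she", "said", "Queen"])

# Keep alphabetic tokens whose lowercase form is not a stopword.
content_words = Counter({
    tok: n
    for tok, n in counts.items()
    if tok.isalpha() and tok.lower() not in stopword_set
})

print(content_words.most_common())
```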