NLTK Natural Language Toolkit
NLTK (Natural Language Toolkit) is a Python library for natural language processing, aimed mainly at English text.
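Installing NLTK
Installing NLTK Data
Installing NLTK Data gives you the bundled corpora and word lists, such as the Gutenberg texts and the stopword list used below.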
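The corpora used later in this article (the Gutenberg texts and the English stopword list) ship as NLTK Data packages. A minimal sketch of fetching just those two, assuming NLTK itself is already installed and the machine has network access. The helper name download_required is made up for this example.

```python
# NLTK Data package ids used in this article.
REQUIRED_PACKAGES = ("gutenberg", "stopwords")


def download_required(packages=REQUIRED_PACKAGES):
    """Download each package and return {package id: True if it succeeded}."""
    # Imported here so the helper can be defined before NLTK is installed.
    import nltk

    # nltk.download() returns True on success; quiet=True suppresses progress output.
    return {name: bool(nltk.download(name, quiet=True)) for name in packages}
```

Calling download_required() once should be enough; later runs find the packages in NLTK's data directory.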
Loading a bundled text
Let's load one of the texts that ship with NLTK.
First, check which ones are available.
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
We will load Alice's Adventures in Wonderland.
>>> alice = nltk.text.Text(nltk.corpus.gutenberg.words('carroll-alice.txt'))
Check what kind of object this is.
>>> print(type(alice))
<class 'nltk.text.Text'>
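Counting words
Count the number of tokens (words and punctuation marks) in Alice's Adventures in Wonderland.
>>> print(len(alice))
34110
Use set() to drop duplicates and count the distinct tokens.
>>> print(len(set(alice)))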
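Because set() compares tokens exactly, "The" and "the" count as two different entries, and punctuation marks are counted too. A small sketch of a stricter count, using a hand-made token list in place of the corpus. The tokens list and the vocabulary helper are illustrative only.

```python
def vocabulary(tokens):
    """Return the set of distinct alphabetic tokens, ignoring case."""
    return {tok.lower() for tok in tokens if tok.isalpha()}


# Hand-made sample; with NLTK loaded you would pass the `alice` text instead.
tokens = ["The", "Rabbit", "saw", "the", "rabbit", ",", "and", "ran", "."]

raw_count = len(set(tokens))            # exact-match distinct tokens
word_count = len(vocabulary(tokens))    # case-folded, punctuation dropped
print(raw_count, word_count)
```

Whether to fold case depends on the question being asked; for proper nouns such as Alice, the exact-match count may be what you want.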
Count how many times the word Alice appears.
>>> print(alice.count("Alice"))
396
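Word contexts
Check the contexts in which the word Alice appears (a concordance). The search is case-insensitive, so the number of matches can be higher than the exact-match count returned by count().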
>>> alice.concordance("Alice")
Displaying 25 of 398 matches:
Alice ' s Adventures in Wonderland by Lewi
] CHAPTER I . Down the Rabbit - Hole Alice was beginning to get very tired of s
what is the use of a book ,' thought Alice ' without pictures or conversation ?
so VERY remarkable in that ; nor did Alice think it so VERY much out of the way
looked at it , and then hurried on , Alice started to her feet , for it flashed
hedge . In another moment down went Alice after it , never once considering ho
ped suddenly down , so suddenly that Alice had not a moment to think about stop
she fell past it . ' Well !' thought Alice to herself , ' after such a fall as
down , I think --' ( for , you see , Alice had learnt several things of this so
tude or Longitude I ' ve got to ?' ( Alice had no idea what Latitude was , or L
. There was nothing else to do , so Alice soon began talking again . ' Dinah '
cats eat bats , I wonder ?' And here Alice began to get rather sleepy , and wen
dry leaves , and the fall was over . Alice was not a bit hurt , and she jumped
not a moment to be lost : away went Alice like the wind , and was just in time
but they were all locked ; and when Alice had been all the way down one side a
on it except a tiny golden key , and Alice ' s first thought was that it might
and to her great delight it fitted ! Alice opened the door and found that it le
ead would go through ,' thought poor Alice , ' it would be of very little use w
ay things had happened lately , that Alice had begun to think that very few thi
ertainly was not here before ,' said Alice ,) and round the neck of the bottle
ay ' Drink me ,' but the wise little Alice was not going to do THAT in a hurry
bottle was NOT marked ' poison ,' so Alice ventured to taste it , and finding i
* * ' What a curious feeling !' said Alice ; ' I must be shutting up like a tel
for it might end , you know ,' said Alice to herself , ' in my going out altog
garden at once ; but , alas for poor Alice ! when she got to the door , she fou
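To make clear what a concordance contains, here is a plain-Python sketch of the same idea: for every occurrence of a word, show a fixed window of tokens on either side. This is not NLTK's implementation, which cuts the context by character width rather than by token count. The kwic function and the sample tokens are invented for illustration.

```python
def kwic(tokens, word, window=3):
    """Return (left, keyword, right) context tuples for each match of `word`."""
    target = word.lower()
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


tokens = "so Alice soon began talking again . And here Alice began to get sleepy".split()
rows = kwic(tokens, "alice")
for left, keyword, right in rows:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>20} {keyword} {right}")
```

Lining the keyword up in a single column is what makes recurring patterns on either side easy to spot.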
How words are distributed within the text
Visualize where in the text the words Alice, Rabbit, and Queen appear.
alice.dispersion_plot(["Alice", "Rabbit", "Queen"])
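A dispersion plot draws one mark per occurrence, placed at that token's offset from the start of the text. The sketch below computes those offsets without plotting, which can help when you want the raw positions rather than a figure. The word_offsets helper and the sample tokens are hypothetical.

```python
def word_offsets(tokens, words):
    """Map each word to the list of token positions where it occurs."""
    offsets = {word: [] for word in words}
    for position, tok in enumerate(tokens):
        if tok in offsets:
            offsets[tok].append(position)
    return offsets


tokens = ["Alice", "saw", "the", "Rabbit", ".", "The", "Queen", "saw", "Alice", "."]
offsets = word_offsets(tokens, ["Alice", "Rabbit", "Queen"])
print(offsets)
```

This sketch matches tokens exactly, so "Queen" and "queen" would be tracked as different words.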
Counting words into a dictionary
NLTK provides the FreqDist class, which counts how often each token occurs and stores the result in a dictionary-like object.
>>> fdist = nltk.FreqDist(alice)
>>> fdist
FreqDist({',': 1993, "'": 1731, 'the': 1527, 'and': 802, '.': 764, 'to': 725, 'a': 615, 'I': 543, 'it': 527, 'she': 509, ...})
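FreqDist is a subclass of Python's collections.Counter, so the usual Counter operations work on it. The sketch below uses Counter directly on a small hand-made token list so that it runs without any corpus; the same calls (indexing by token, most_common) apply to fdist.

```python
from collections import Counter

tokens = ["the", "cat", ",", "the", "hat", ",", "and", "the", "bat"]
counts = Counter(tokens)

print(counts["the"])           # occurrences of one token
print(counts["dog"])           # missing tokens count as 0 instead of raising KeyError
print(counts.most_common(2))   # the two most frequent tokens with their counts
```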
Plot fdist.
>>> fdist.plot(30, title="Top 30 tokens in Alice's Adventures in Wonderland")
Remove punctuation from fdist and plot again.
>>> fdist_no_punc = nltk.FreqDist(
...     {word: freq for word, freq in fdist.items() if word.isalpha()})
>>> fdist_no_punc.plot(30, title="Top 30 words in Alice's Adventures in Wonderland (punctuation removed)")
Remove stopwords and plot again.
NLTK ships with stopword lists.
>>> stopwords = nltk.corpus.stopwords.words('english')
>>> stopwords
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
>>> fdist_no_punc_no_stopwords = nltk.FreqDist(
...     {word: freq for word, freq in fdist.items()
...      if word.lower() not in stopwords and word.isalpha()})
>>> fdist_no_punc_no_stopwords.plot(30, title="Top 30 words in Alice's Adventures in Wonderland (punctuation and stopwords removed)")
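The two filters above can be folded into one reusable function. A minimal sketch, assuming the stopword list is passed in as an argument; converting it to a set makes each membership test a hash lookup instead of a scan over the whole list. The content_counts function and the tiny stopword list are made up for the example; with NLTK you would pass nltk.corpus.stopwords.words('english').

```python
from collections import Counter


def content_counts(tokens, stopwords):
    """Count alphabetic tokens that are not stopwords (case-insensitive check)."""
    stop = set(stopwords)  # set lookup is O(1); a list would be scanned each time
    return Counter(
        tok for tok in tokens if tok.isalpha() and tok.lower() not in stop
    )


tokens = ["The", "Queen", "said", ",", "'", "Off", "with", "the", "Queen", "!"]
counts = content_counts(tokens, ["the", "off", "with"])
print(counts.most_common())
```

As in the session above, tokens keep their original case in the result, so "Queen" and "queen" would still be counted separately.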