[Python] Trying Out NLTK (4)

This post continues from the previous installment.

Stem (語幹)

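The stem (語幹, gokan) is the part of a word that serves as the basis for inflection. In Japanese it refers to the non-conjugating part of an inflecting word, although in adjectives and adjectival nouns the stem is highly independent. In contrast to the stem, the conjugating part at the end of a word is sometimes called the inflectional ending.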

Source: Wikipedia, the free encyclopedia (translated from the Japanese-language edition)

In English, 語幹 is called a “stem”.

Differences between root, stem, and base

Let's dig a little deeper into what “stem” means in English.

‘Root’, ‘stem’ and ‘base’ are all terms used in the literature to designate that part of a word that remains when all affixes have been removed.

A root is a form which is not further analysable, either in terms of derivational or inflectional morphology. It is that part of word-form that remains when all inflectional and derivational affixes have been removed. A root is the basic part always present in a lexeme. In the form ‘untouchables’ the root is ‘touch’, to which first the suffix ‘-able’, then the prefix ‘un-’ and finally the suffix ‘-s’ have been added. In a compound word like ‘wheelchair’ there are two roots, ‘wheel’ and ‘chair’.

A stem is of concern only when dealing with inflectional morphology.
In the form ‘untouchables’ the stem is ‘untouchable’, although in the form ‘touched’ the stem is ‘touch’; in the form ‘wheelchairs’ the stem is ‘wheelchair’, even though the stem contains two roots.

A base is any form to which affixes of any kind can be added. This means that any root or any stem can be termed a base, but the set of bases is not exhausted by the union of the set of roots and the set of stems: a derivationally analysable form to which derivational affixes are added can only be referred to as a base. That is, ‘touchable’ can act as a base for prefixation to give ‘untouchable’, but in this process ‘touchable’ could not be referred to as a root because it is analysable in terms of derivational morphology, nor as a stem since it is not the adding of inflectional affixes which is in question.

What is the difference between root word and stem word?

A loose summary follows.

Root, stem, and base are all terms for the part of a word that remains after affixes have been removed.

A root is a form that cannot be analysed any further in terms of derivational or inflectional morphology; it is the basic part of a lexeme. For example, the root of ‘untouchables’ is ‘touch’, and the roots of ‘wheelchair’ are ‘wheel’ and ‘chair’.

A stem is relevant only to inflectional morphology. The stem of ‘untouchables’ is ‘untouchable’, the stem of ‘touched’ is ‘touch’, and the stem of ‘wheelchairs’ is ‘wheelchair’.

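A base is any form to which affixes of any kind can be added. For example, ‘touchable’ is the base to which the prefix ‘un-’ is added to form ‘untouchable’. In that process, however, ‘touchable’ is neither a root (it can still be analysed derivationally) nor a stem (no inflectional affix is being added).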
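The root/stem distinction above can be sketched in a few lines of plain Python. This is only a toy illustration: the affix lists and the function names `stem_of` and `root_of` are made up for this example, and real morphological analysis needs a lexicon rather than string matching. Compounds such as ‘wheelchair’, which have two roots, are not handled.

```python
# Toy affix lists covering only the examples in the quoted passage.
# These are assumptions for illustration, not a real morphological model.
INFLECTIONAL_SUFFIXES = ("s", "ed")
DERIVATIONAL_SUFFIXES = ("able",)
DERIVATIONAL_PREFIXES = ("un",)


def strip_suffix(word, suffixes):
    """Remove the first matching suffix, if any."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def strip_prefix(word, prefixes):
    """Remove the first matching prefix, if any."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) > len(prefix):
            return word[len(prefix):]
    return word


def stem_of(word):
    """Stem: remove only the inflectional suffix."""
    return strip_suffix(word, INFLECTIONAL_SUFFIXES)


def root_of(word):
    """Root: remove inflectional, then derivational affixes."""
    form = stem_of(word)
    form = strip_prefix(form, DERIVATIONAL_PREFIXES)
    return strip_suffix(form, DERIVATIONAL_SUFFIXES)


for word in ["untouchables", "touched", "wheelchairs"]:
    print(word, "-> stem:", stem_of(word), "| root:", root_of(word))
```

In this sketch, ‘touchable’ shows up only as the intermediate form inside `root_of`, which is roughly the role the passage assigns to a base.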

Stemming with NLTK

NLTK includes the following algorithms for stemming English words.

nltk.stem package

Porter stemming
Snowball stemmers
Lancaster (Paice/Husk) stemming algorithm

The differences between the algorithms are explained clearly in the following answer.

What are the major differences and benefits of Porter and Lancaster Stemming algorithms?

A loose summary:

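Porter stemming: the most commonly used stemmer, and one of the gentlest (least aggressive).
Snowball stemmers: an improved version of Porter stemming, with faster computation than Porter. If in doubt, this is the one to use.
Lancaster (Paice/Husk) stemming algorithm: the fastest to compute, but it stems aggressively and its results are sometimes problematic.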

Usage examples for NLTK can be found in the following.

Stemmers

First, import the stemmer.

>>> from nltk.stem import PorterStemmer

Let's look at the stems of words related to ‘connect’.

>>> porter = PorterStemmer()
>>> word_list = ["connected", "connecting", "connection", "connections"]
>>>
>>> for word in word_list:
...     print(porter.stem(word))
...
connect
connect
connect
connect

Let's look at the stems of words related to ‘argue’.

>>> word_list = ["argue", "argued", "argues", "arguing", "argus"]
>>>
>>> for word in word_list:
...     print(porter.stem(word))
...
argu
argu
argu
argu
argu

Every word came out as ‘argu’ rather than ‘argue’. Let's try the other algorithms as well.

>>> from nltk.stem import LancasterStemmer
>>> from nltk.stem import SnowballStemmer
>>>
>>> lancaster = LancasterStemmer()
>>> snowball = SnowballStemmer(language='english')
>>> word_list = ["argue", "argued", "argues", "arguing", "argus"]
>>> for word in word_list:
...     print(lancaster.stem(word))
...
argu
argu
argu
argu
arg
>>> for word in word_list:
...     print(snowball.stem(word))
...
argu
argu
argu
argu
argus

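For ‘argus’, PorterStemmer, LancasterStemmer, and SnowballStemmer each output a different stem (‘argu’, ‘arg’, and ‘argus’ respectively). ‘argus’ is a word with a completely different meaning from ‘argue’, so judged by the meaning of ‘argus’, SnowballStemmer gives the best result here.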
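To spot this kind of disagreement without reading three separate listings, the stemmers can be run side by side. The following is a minimal sketch: `default_stemmers`, `compare_stemmers`, and `disagreements` are helper names invented for this example, not part of NLTK. The only NLTK calls assumed are the three constructors and the `.stem()` method already used above.

```python
def default_stemmers():
    """Build the three stemmers used in this post (requires nltk)."""
    from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

    return {
        "porter": PorterStemmer(),
        "lancaster": LancasterStemmer(),
        "snowball": SnowballStemmer("english"),
    }


def compare_stemmers(words, stemmers):
    """Return {word: {stemmer_name: stem}} for every word and stemmer."""
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


def disagreements(table):
    """Keep only the words on which the stemmers do not all agree."""
    return {
        word: stems
        for word, stems in table.items()
        if len(set(stems.values())) > 1
    }


# Usage (needs nltk installed):
#   words = ["argue", "argued", "argues", "arguing", "argus"]
#   print(disagreements(compare_stemmers(words, default_stemmers())))
```

Given the outputs listed above, only ‘argus’ should be reported for this word list, since the three stemmers agree on ‘argu’ for the other four words.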

Lemmatisation (見出し語化)

The nltk.stem package also provides a WordNet-based method for lemmatising words.

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word’s lemma, or dictionary form.^[1]
In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighboring sentences or even an entire document. As a result, developing efficient lemmatisation algorithms is an open area of research.^[2][3][4]

From Wikipedia, the free encyclopedia

A loose summary:

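Lemmatisation is the process of grouping words by the form that appears in the dictionary (the lemma). In computational linguistics, unlike stemming, it determines the lemma from the intended meaning of the word in its sentence and surrounding context.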

However, lemmatisation in the nltk.stem package is simple: based on information from WordNet, it returns the lemma of a word for the part of speech you specify.

Applying it to ‘better’ gives the following output.

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()
>>> # Adjective:
... print(lemmatizer.lemmatize('better', pos='a'))
good
>>>
>>> # Adverb:
... print(lemmatizer.lemmatize('better', pos='r'))
well
>>>
>>> # Noun:
... print(lemmatizer.lemmatize('better', pos='n'))
better
>>>
>>> # Verb:
... print(lemmatizer.lemmatize('better', pos='v'))
better
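Because the result depends on the `pos` argument, a common pattern is to derive it from a part-of-speech tag instead of typing it by hand. The sketch below assumes Penn Treebank style tags (for example ‘JJR’ or ‘VBD’), such as those a POS tagger might produce. `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helper names for this example, and the mapping is deliberately coarse.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (default: noun)."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # everything else is treated as a noun


def lemmatize_tagged(lemmatizer, tagged_words):
    """Lemmatise (word, tag) pairs using the POS implied by each tag.

    `lemmatizer` is any object with a lemmatize(word, pos=...) method,
    such as the WordNetLemmatizer instance created above.
    """
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_words
    ]


# Usage (needs nltk and the WordNet data):
#   from nltk.stem import WordNetLemmatizer
#   pairs = [("better", "JJR"), ("better", "RBR")]
#   print(lemmatize_tagged(WordNetLemmatizer(), pairs))
```

With the adjective and adverb results shown above, the two tagged occurrences of ‘better’ would be expected to come back as ‘good’ and ‘well’.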