
Lemmatization of raw text

Lemmatization is the process of reducing each word in a sentence to its lemma (its dictionary form) by identifying the word's intended part of speech (POS) and meaning.

In lemmatization, we remove the inflectional endings and convert the word into its base form, as it appears in a dictionary or vocabulary. If we properly use the vocabulary and a morphological analysis of all the words present in the raw text, we can achieve high accuracy for lemmatization.

Lemmatization transforms the words present in the raw text into their lemmas by using a tagged dictionary such as WordNet.

Lemmatization is closely related to stemming.

In lemmatization, we consider POS tags, whereas in stemming we consider neither POS tags nor the context of words.

Let's take some examples to make the concepts clear. The following are the sentences:

Sentence 1: It is better for you.
- Sentence 1 contains the word better, whose lemma is good. Stemming misses this mapping because finding it requires a dictionary lookup.
Sentence 2: Man is walking.
- The word walking is derived from the base word walk. Here, stemming and lemmatization give the same result.
Sentence 3: We are meeting tomorrow.
- Here, meet is the base form, and the word meeting is derived from it. However, meeting can be a noun or a form of the verb to meet, so the right lemma depends on the context in which the word is used. Lemmatization attempts to select the right lemma based on the word's POS tag.
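The three sentences above can be turned into a small, self-contained sketch. It is a toy illustration only: the lookup table `TOY_LEMMAS` and the helpers `toy_stem` and `toy_lemmatize` are hypothetical names invented here, and the table stands in for a real tagged dictionary such as WordNet.

```python
# Toy lookup table standing in for a tagged dictionary such as WordNet.
# Keys are (word, POS tag); the entries are illustrative assumptions.
TOY_LEMMAS = {
    ("better", "a"): "good",
    ("walking", "v"): "walk",
    ("meeting", "v"): "meet",
    ("meeting", "n"): "meeting",
}


def toy_stem(word):
    """Crude suffix stripping: no dictionary and no POS tag."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word, pos):
    """Dictionary lookup keyed by word and POS; unknown words pass through."""
    return TOY_LEMMAS.get((word.lower(), pos), word)


for word, pos in [("better", "a"), ("walking", "v"),
                  ("meeting", "v"), ("meeting", "n")]:
    print(word, pos, "stem:", toy_stem(word),
          "lemma:", toy_lemmatize(word, pos))
```

The stemmer leaves better untouched and always cuts meeting down to meet, while the lookup maps better to good and returns a different lemma for meeting depending on the POS tag it is given.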
Refer to the code snippet in Figure 4.6 for the lemmatization of raw text:

Figure 4.6: Stemming and lemmatization of raw text

The input and output of the preceding code are given as follows:

The given input is:

text = """Stemming is funnier than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is better for you. Man is walking. We are meeting tomorrow."""

The output is given as:

Stemmer 
stem is funnier than a bummer say the sushi love comput scientist. she realli want to buy cars. she told me angrily. It is better for you. man is walking. We are meet tomorrow. 
Verb lemma 
Stemming be funnier than a bummer say the sushi love computer scientist. She really want to buy cars. She tell me angrily. It be better for you. Man be walking. We be meet tomorrow. 
Noun lemma 
Stemming is funnier than a bummer say the sushi loving computer scientist. She really want to buy cars. She told me angrily. It is better for you. Man is walking. We are meeting tomorrow. 
Adjective lemma 
Stemming is funny than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is good for you. Man is walking. We are meeting tomorrow. 
Satellite adjectives lemma 
Stemming is funny than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is good for you. Man is walking. We are meeting tomorrow. 
Adverb lemma 
Stemming is funnier than a bummer says the sushi loving computer scientist. She really wants to buy cars. She told me angrily. It is well for you. Man is walking. We are meeting tomorrow.

In lemmatization, we use different POS tags. The abbreviations are as follows:

v stands for verbs
n stands for nouns
a stands for adjectives
s stands for satellite adjectives
r stands for adverbs

You can see that, inside the lemmatizer() function, I have used all the described POS tags.
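A minimal sketch of how a function might apply each of these five tags to the same text is given here. It is not the code from Figure 4.6: the names `POS_LABELS`, `lemmatize_per_pos`, and `wordnet_lemmatize` are hypothetical, and the sketch assumes that tokens are obtained by splitting on whitespace. The NLTK-backed helper additionally assumes that nltk is installed and that the WordNet corpus has been downloaded.

```python
# POS tag abbreviations paired with the labels used in the output above.
POS_LABELS = {
    "v": "Verb lemma",
    "n": "Noun lemma",
    "a": "Adjective lemma",
    "s": "Satellite adjectives lemma",
    "r": "Adverb lemma",
}


def lemmatize_per_pos(text, lemmatize):
    """Lemmatize every whitespace-separated token once per POS tag.

    `lemmatize` is any callable taking (word, pos) and returning a lemma.
    Returns a dict mapping each label to the re-joined text.
    """
    tokens = text.split()
    return {
        label: " ".join(lemmatize(token, pos) for token in tokens)
        for pos, label in POS_LABELS.items()
    }


def wordnet_lemmatize():
    """Return NLTK's WordNet lemmatizer wrapped as a (word, pos) callable.

    Assumption: nltk is installed and the WordNet corpus is available,
    for example after running nltk.download("wordnet").
    """
    from nltk.stem import WordNetLemmatizer  # imported lazily on purpose

    lemmatizer = WordNetLemmatizer()
    return lambda word, pos: lemmatizer.lemmatize(word, pos=pos)
```

With NLTK available, calling `lemmatize_per_pos(text, wordnet_lemmatize())` and printing each label followed by its text should give output in the same shape as the listing above. Passing the lemmatizer in as a callable is a design choice made for this sketch, so that the loop can also be tried with a stand-in function.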

You can download the code from the GitHub link at: https://github.com/jalajthanaki/NLPython/blob/master/ch4/4_2_rawtext_Stemmers.py.