Feature Extraction from Texts
As we already know, machine learning algorithms do not understand textual data directly. We need to represent the text data in numerical form, as vectors. To convert each textual sentence into a vector, we need to represent it as a set of features. This set of features should uniquely represent the text, although, individually, some of the features may be common across many textual sentences. Features can be classified into two different categories:
- General features: These features are statistical calculations and do not depend on the content of the text. Some examples of general features could be the number of tokens in the text, the number of characters in the text, and so on.
- Specific features: These features are dependent on the inherent meaning of the text and represent the semantics of the text. For example, the frequency of unique words in the text is a specific feature.
Let's explore these in detail.
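Before doing so, here is a minimal sketch that contrasts the two categories on a single illustrative sentence (the sentence and the feature names are examples, not part of the upcoming exercises):
from collections import Counter

sentence = "The sky is blue and the sea is blue"
tokens = sentence.lower().split()

# General feature: a statistic that does not depend on which words are used
number_of_tokens = len(tokens)

# Specific feature: depends on the actual content of the text
word_frequencies = Counter(tokens)

print(number_of_tokens)    # 9
print(word_frequencies)    # Counter({'the': 2, 'is': 2, 'blue': 2, ...})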
Extracting General Features from Raw Text
As we've already learned, general features refer to those that are not directly dependent on the individual tokens constituting a text corpus. Let's consider these two sentences: "The sky is blue" and "The pillar is yellow". Here, the sentences have the same number of words, four (a general feature), but the individual constituent tokens are different. Let's complete an exercise to understand this better.
Exercise 2.11: Extracting General Features from Raw Text
In this exercise, we will extract general features from input text. These general features include the number of words, the presence of "wh" words (words beginning with "wh", such as "what" and "why"), and the language in which the text is written. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Import the pandas library and create a DataFrame with four sentences. Add the following code to implement this:
import pandas as pd
from textblob import TextBlob
df = pd.DataFrame([['The interim budget for 2019 will '\
'be announced on 1st February.'], \
['Do you know how much expectation '\
'the middle-class working population '\
'is having from this budget?'], \
['February is the shortest month '\
'in a year.'], \
['This financial year will end on '\
'31st March.']])
df.columns = ['text']
df.head()
The preceding code generates the following output:
- Use the apply() function to iterate through each row of the column text, convert them into TextBlob objects, and extract words from them. Add the following code to implement this:
def add_num_words(df):
df['number_of_words'] = df['text'].apply(lambda x : \
len(TextBlob(str(x)).words))
return df
add_num_words(df)['number_of_words']
The preceding code generates the following output:
0 11
1 15
2 8
3 8
Name: number_of_words, dtype: int64
The preceding code line will print the number_of_words column of the DataFrame to represent the number of words in each row.
- Use the apply() function to iterate through each row of the column text, convert the text into TextBlob objects, and extract the words from them to check whether any of them belong to the list of "wh" words that has been declared. Add the following code to do so:
def is_present(wh_words, df):
"""
The line below finds the intersection between the set of
tokens of every sentence and the wh_words set, and returns
True if the intersection is non-empty.
"""
df['is_wh_words_present'] = df['text'].apply(lambda x : \
True if \
len(set(TextBlob(str(x)).\
words).intersection(wh_words))\
>0 else False)
return df
wh_words = set(['why', 'who', 'which', 'what', \
'where', 'when', 'how'])
is_present(wh_words, df)['is_wh_words_present']
The preceding code generates the following output:
0 False
1 True
2 False
3 False
Name: is_wh_words_present, dtype: bool
The preceding code line will print the is_wh_words_present column that the is_present() method added to df, showing, for every row, whether any "wh" word is present.
- Use the apply() function to iterate through each row of the column text, convert them into TextBlob objects, and detect their languages:
def get_language(df):
df['language'] = df['text'].apply(lambda x : \
TextBlob(str(x)).detect_language())
return df
get_language(df)['language']
The preceding code generates the following output:
0 en
1 en
2 en
3 en
Name: language, dtype: object
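Note that detect_language() depends on an online translation service and has been deprecated in recent TextBlob releases, so this step may raise an error in newer environments. If it does, a minimal sketch of a fallback, assuming the third-party langdetect package is installed (pip install langdetect), could look like this:
from langdetect import detect

def get_language_fallback(df):
    # detect() returns an ISO 639-1 code such as 'en'
    df['language'] = df['text'].apply(lambda x: detect(str(x)))
    return df

get_language_fallback(df)['language']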
With that, we have learned how to extract general features from text data.
Note
To access the source code for this specific section, please refer to https://packt.live/2X9jLcS.
You can also run this example online at https://packt.live/3fgrYSK.
Let's perform another exercise to get a better understanding of this.
Exercise 2.12: Extracting General Features from Text
In this exercise, we will extract various general features from documents. The dataset that we will be using here consists of random statements. Our objective is to find the frequency of various general features such as punctuation, uppercase and lowercase words, letters, digits, words, and whitespaces.
Note
The dataset that is being used in this exercise can be found at this link: https://packt.live/3k0qCPR.
- Open a Jupyter Notebook.
- Insert a new cell and add the following code to import the necessary libraries:
import pandas as pd
from string import punctuation
import nltk
nltk.download('tagsets')
from nltk.data import load
nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag
from nltk import word_tokenize
from collections import Counter
- To see the different part-of-speech (PoS) tags that nltk provides, add the following code:
def get_tagsets():
tagdict = load('help/tagsets/upenn_tagset.pickle')
return list(tagdict.keys())
tag_list = get_tagsets()
print(tag_list)
The preceding code generates the following output:
- Calculate the number of occurrences of each PoS by iterating through each document and annotating each word with the corresponding pos tag. Add the following code to implement this:
"""
This method will count the occurrence of pos
tags in each sentence.
"""
def get_pos_occurrence_freq(data, tag_list):
# Get list of sentences in text_list
text_list = data.text
# create empty dataframe
feature_df = pd.DataFrame(columns=tag_list)
for text_line in text_list:
# get pos tags of each word.
pos_tags = [j for i, j in \
pos_tag(word_tokenize(text_line))]
"""
create a dict of pos tags and their frequency
in given sentence.
"""
row = dict(Counter(pos_tags))
feature_df = feature_df.append(row, ignore_index=True)
feature_df.fillna(0, inplace=True)
return feature_df
tag_list = get_tagsets()
data = pd.read_csv('../data/data.csv', header=0)
feature_df = get_pos_occurrence_freq(data, tag_list)
feature_df.head()
The preceding code generates the following output:
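One caveat before moving on: DataFrame.append() was removed in pandas 2.0, so the loop above may fail on a recent installation. A minimal sketch of an equivalent version, an assumption rather than the book's original code, collects the rows first and builds the DataFrame in a single call (it reuses the imports from the previous step):
def get_pos_occurrence_freq_v2(data, tag_list):
    rows = []
    for text_line in data.text:
        # get pos tags of each word and count them per sentence
        pos_tags = [tag for _, tag in pos_tag(word_tokenize(text_line))]
        rows.append(dict(Counter(pos_tags)))
    # build the frame in one go and fill the missing tags with 0
    return pd.DataFrame(rows, columns=tag_list).fillna(0)

# drop-in replacement for the call above:
# feature_df = get_pos_occurrence_freq_v2(data, tag_list)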
- To calculate the number of punctuation marks, add the following code:
def add_punctuation_count(feature_df, data):
feature_df['num_of_unique_punctuations'] = data['text'].\
apply(lambda x: len(set(x).intersection\
(set(punctuation))))
return feature_df
feature_df = add_punctuation_count(feature_df, data)
feature_df['num_of_unique_punctuations'].head()
The add_punctuation_count() method finds the intersection between the set of characters in each text and the set of punctuation marks imported from the string module, then adds the size of that intersection to the num_of_unique_punctuations column of the DataFrame. The preceding code generates the following output:
0 0
1 0
2 1
3 1
4 0
Name: num_of_unique_punctuations, dtype: int64
- To calculate the number of capitalized words, add the following code:
def get_capitalized_word_count(feature_df, data):
"""
The line below tokenizes the text in every row, keeps only
the words that start with an uppercase letter, then adds the
length of that list to the 'number_of_capital_words' column
of the dataframe.
"""
feature_df['number_of_capital_words'] = data['text'].\
apply(lambda x: len([word for word in \
word_tokenize(str(x)) if word[0].isupper()]))
return feature_df
feature_df = get_capitalized_word_count(feature_df, data)
feature_df['number_of_capital_words'].head()
The preceding code tokenizes the text in every row, keeps only the words that begin with an uppercase letter, and adds the count of those words to the number_of_capital_words column of the DataFrame. It generates the following output:
0 1
1 1
2 1
3 1
4 1
Name: number_of_capital_words, dtype: int64
The last line of the preceding code prints the number_of_capital_words column, which holds the number of capitalized words in each row.
- To calculate the number of lowercase words, add the following code:
def get_small_word_count(feature_df, data):
"""
The line below tokenizes the text in every row, keeps only
the words that start with a lowercase letter, then adds the
length of that list to the 'number_of_small_words' column
of the dataframe.
"""
feature_df['number_of_small_words'] = data['text'].\
apply(lambda x: len([word for word in \
word_tokenize(str(x)) if word[0].islower()]))
return feature_df
feature_df = get_small_word_count(feature_df, data)
feature_df['number_of_small_words'].head()
The preceding code tokenizes the text in every row, keeps only the words that begin with a lowercase letter, and adds the count of those words to the number_of_small_words column of the DataFrame. It generates the following output:
0 4
1 3
2 7
3 3
4 2
Name: number_of_small_words, dtype: int64
The last line of the preceding code prints the number_of_small_words column, which holds the number of lowercase words in each row.
- To calculate the number of letters in the DataFrame, use the following code:
def get_number_of_alphabets(feature_df, data):
feature_df['number_of_alphabets'] = data['text']. \
apply(lambda x: len([ch for ch in str(x) \
if ch.isalpha()]))
return feature_df
feature_df = get_number_of_alphabets(feature_df, data)
feature_df['number_of_alphabets'].head()
The preceding code filters each row's text down to its alphabetic characters and adds their count to the number_of_alphabets column. This will produce the following output:
0 19
1 18
2 28
3 14
4 13
Name: number_of_alphabets, dtype: int64
The last line of the preceding code prints the number_of_alphabets column, which holds the number of letters in each row.
- To calculate the number of digits in the DataFrame, add the following code:
def get_number_of_digit_count(feature_df, data):
"""
The line below collects the digit characters in each row
and adds their count to the 'number_of_digits' column.
"""
feature_df['number_of_digits'] = data['text']. \
apply(lambda x: len([ch for ch in str(x) \
if ch.isdigit()]))
return feature_df
feature_df = get_number_of_digit_count(feature_df, data)
feature_df['number_of_digits'].head()
The preceding code counts the digit characters in each row and adds the count to the number_of_digits column. It generates the following output:
0 0
1 0
2 0
3 0
4 0
Name: number_of_digits, dtype: int64
- To calculate the number of words in the DataFrame, add the following code:
def get_number_of_words(feature_df, data):
"""
The line below tokenizes the text in each row into a list
of words and adds the length of that list to the
'number_of_words' column.
"""
feature_df['number_of_words'] = data['text'].\
apply(lambda x : len(word_tokenize(str(x))))
return feature_df
feature_df = get_number_of_words(feature_df, data)
feature_df['number_of_words'].head()
The preceding code will split the text in each row into a list of words and add the count of that list to the number_of_words column. We will get the following output:
0 5
1 4
2 9
3 5
4 3
Name: number_of_words, dtype: int64
- To calculate the number of whitespaces in the DataFrame, add the following code:
def get_number_of_whitespaces(feature_df, data):
"""
The line below generates a list of the whitespace characters
in each row and adds the length of that list to the
'number_of_white_spaces' column.
"""
feature_df['number_of_white_spaces'] = data['text']. \
apply(lambda x: len([ch for ch in str(x) \
if ch.isspace()]))
return feature_df
feature_df = get_number_of_whitespaces(feature_df, data)
feature_df['number_of_white_spaces'].head()
The preceding code will generate a list of the whitespace characters in each row and add the length of that list to the number_of_white_spaces column. It generates the following output:
0 4
1 3
2 7
3 3
4 2
Name: number_of_white_spaces, dtype: int64
- To view the full feature set we have just created, add the following code:
feature_df.head()
This prints the first five rows of the final DataFrame, showing all the feature columns we have created. We will get the following output:
With that, we have learned how to extract general features from the given text.
Note
To access the source code for this specific section, please refer to https://packt.live/3jSsLNh.
You can also run this example online at https://packt.live/3hPFmPA.
Now, let's explore how we can extract unique features.
Bag of Words (BoW)
The Bag of Words (BoW) model is one of the most popular methods for extracting features from raw texts.
In this technique, we convert each sentence into a vector. The length of this vector is equal to the number of unique words in all the documents. This is done in two steps:
- The vocabulary or dictionary of all the words is generated.
- The document is represented in terms of the presence or absence of all words.
A vocabulary or dictionary is created from all the unique possible words available in the corpus (all documents) and every single word is assigned a unique index number. In the second step, every document is represented by a list whose length is equal to the number of words in the vocabulary. The following exercise illustrates how BoW can be implemented using Python.
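Before that, here is a minimal hand-rolled sketch of these two steps (the two sentences are illustrative only; sklearn automates all of this in the exercise below):
docs = ['the sky is blue', 'the pillar is yellow']

# Step 1: build the vocabulary of unique words, each with its own index
unique_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: idx for idx, word in enumerate(unique_words)}

# Step 2: represent each document by the presence (1) or absence (0)
# of every vocabulary word
vectors = [[1 if word in doc.split() else 0 for word in unique_words]
           for doc in docs]

print(vocabulary)   # {'blue': 0, 'is': 1, 'pillar': 2, 'sky': 3, 'the': 4, 'yellow': 5}
print(vectors)      # [[1, 1, 0, 1, 1, 0], [0, 1, 1, 0, 1, 1]]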
Exercise 2.13: Creating a Bag of Words
In this exercise, we will create a BoW representation for all the terms in a document and ascertain the 10 most frequent terms. In this exercise, we will use the CountVectorizer module from sklearn, which performs the following tasks:
- Tokenizes the collection of documents, also called a corpus
- Builds the vocabulary of unique words
- Converts a document into vectors using the previously built vocabulary
Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Import the necessary libraries. Add the following code to implement this:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
- Use the CountVectorizer function to create the BoW model. Add the following code to do this:
def vectorize_text(corpus):
"""
Will return a dataframe in which every row is the
vector representation of a document in the corpus
:param corpus: input text corpus
:return: dataframe of vectors
"""
bag_of_words_model = CountVectorizer()
"""
performs the above described three tasks on
the given data corpus.
"""
dense_vec_matrix = bag_of_words_model.\
fit_transform(corpus).todense()
bag_of_word_df = pd.DataFrame(dense_vec_matrix)
bag_of_word_df.columns = sorted(bag_of_words_model.\
vocabulary_)
return bag_of_word_df
corpus = ['Data Science is an overlap between Arts and Science',\
'Generally, Arts graduates are right-brained and '\
'Science graduates are left-brained',\
'Excelling in both Arts and Science at a time '\
'becomes difficult',\
'Natural Language Processing is a part of Data Science']
df = vectorize_text(corpus)
df.head()
The vectorize_text method will take a document corpus as an argument and return a DataFrame in which every row will be a vector representation of a document in the corpus.
The preceding code generates the following output:
- Create a BoW model for the 10 most frequent terms. Add the following code to implement this:
def bow_top_n(corpus, n):
"""
Will return a dataframe in which every row is represented
by the counts of the n most frequently occurring words
in the data corpus
:param corpus: input text corpus
:param n: number of most frequent words to keep
:return: dataframe of vectors
"""
bag_of_words_model_small = CountVectorizer(max_features=n)
bag_of_word_df_small = pd.DataFrame\
(bag_of_words_model_small.fit_transform\
(corpus).todense())
bag_of_word_df_small.columns = \
sorted(bag_of_words_model_small.vocabulary_)
return bag_of_word_df_small
df_2 = bow_top_n(corpus, 10)
df_2.head()
In the preceding code, we count the occurrences of the 10 most frequent corpus words in each sentence and create a DataFrame from the result.
The preceding code generates the following output:
Note
To access the source code for this specific section, please refer to https://packt.live/3gdhViJ.
You can also run this example online at https://packt.live/3hPUTi8.
In this section, we learned what BoW is and how we can use it to convert a sentence or document into a vector. BoW is the easiest way to convert text into a vector; however, it has a significant drawback: it only considers the presence (or raw count) of words within a single document and ignores how frequent or rare those words are across the whole corpus. If we want to capture the semantics of a sentence, this relative importance of words plays an important role. To overcome this issue, there is another feature extraction model, called TFIDF, which we will discuss later in this chapter.
Zipf's Law
According to Zipf's law, the number of times a word occurs in a corpus is inversely proportional to its rank in the frequency table. In simple terms, if the words in a corpus are arranged in descending order of their frequency of occurrence, then the frequency of the word at the ith rank will be proportional to 1/i:
frequency(word at rank i) ∝ 1/i
This also means that the frequency of the most frequent word will be roughly twice the frequency of the second most frequent word. For example, if we look at the Brown University Standard Corpus of Present-Day American English, the word "the" is the most frequent word (its frequency is 69,971), while the word "of" is the second most frequent (with a frequency of 36,411). As we can see, the frequency of "of" is roughly half that of the most frequently occurring word.
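A quick back-of-the-envelope check of this claim, a minimal sketch using only the two Brown corpus counts quoted above:
# Brown corpus counts quoted above
freq_rank_1 = 69971    # "the", the most frequent word
freq_rank_2 = 36411    # "of", the second most frequent word

# Zipf's law predicts the rank-2 frequency to be about half the rank-1 frequency
print(freq_rank_1 / 2)    # 34985.5, the predicted value
print(freq_rank_2)        # 36411, the observed value, close to the prediction
To get a better understanding of this, let's perform a simple exercise.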
Exercise 2.14: Zipf's Law
In this exercise, we will plot both the expected and actual ranks and frequencies of tokens with the help of Zipf's law. We will be using the 20newsgroups dataset provided by the sklearn library, which is a collection of newsgroup documents. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Import the necessary libraries:
from pylab import *
import nltk
nltk.download('stopwords')
from sklearn.datasets import fetch_20newsgroups
from nltk import word_tokenize
from nltk.corpus import stopwords
import matplotlib.pyplot as plt
import re
import string
from collections import Counter
- Add two methods: one for loading the stop words and one for loading the 20newsgroups data into the newsgroups_data_sample variable:
def get_stop_words():
stop_words = stopwords.words('english')
stop_words = stop_words + list(string.printable)
return stop_words
def get_and_prepare_data(stop_words):
"""
This method will load the 20newsgroups data and
remove stop words from it using the given stop word list.
:param stop_words:
:return:
"""
newsgroups_data_sample = \
fetch_20newsgroups(subset='train')
tokenized_corpus = [word.lower() for sentence in \
newsgroups_data_sample['data'] \
for word in word_tokenize\
(re.sub(r'([^\s\w]|_)+', ' ', sentence)) \
if word.lower() not in stop_words]
return tokenized_corpus
In the preceding code, there are two methods: get_stop_words() loads the stop word list from the nltk data, while get_and_prepare_data() loads the 20newsgroups data and removes stop words from it using the given stop word list.
- Add the following method to calculate the frequency of each token:
def get_frequency(corpus, n):
token_count_di = Counter(corpus)
return token_count_di.most_common(n)
The preceding method uses the Counter class to count the frequency of tokens in the corpus and then returns the n most common tokens.
- Now, call all the preceding methods to calculate the frequency of the top 50 most frequent tokens:
stop_word_list = get_stop_words()
corpus = get_and_prepare_data(stop_word_list)
get_frequency(corpus, 50)
The preceding code generates the following output:
- Plot the actual ranks of the words that we got from the frequency dictionary against the ranks expected as per Zipf's law. Calculate the frequencies of the top 1,000 words using the preceding get_frequency() method and the expected frequencies of the same list using Zipf's law. For this, create two lists, actual_frequencies and expected_frequencies. Use the log of the actual frequencies to downscale the numbers. After getting the actual and expected frequencies, plot them using matplotlib:
def get_actual_and_expected_frequencies(corpus):
freq_dict = get_frequency(corpus, 1000)
actual_frequencies = []
expected_frequencies = []
for rank, tup in enumerate(freq_dict):
actual_frequencies.append(log(tup[1]))
rank = 1 if rank == 0 else rank
# expected frequency 1/rank as per zipf's law
expected_frequencies.append(1 / rank)
return actual_frequencies, expected_frequencies
def plot(actual_frequencies, expected_frequencies):
plt.plot(actual_frequencies, 'g*', \
expected_frequencies, 'ro')
plt.show()
# We will plot the actual and expected frequencies
actual_frequencies, expected_frequencies = \
get_actual_and_expected_frequencies(corpus)
plot(actual_frequencies, expected_frequencies)
The preceding code generates the following output:
As we can see from the preceding output, both sets of points follow almost the same trend. In other words, the plot shows that the actual frequencies are roughly proportional to the frequencies predicted by Zipf's law.
Note
To access the source code for this specific section, please refer to https://packt.live/30ZnKtD.
You can also run this example online at https://packt.live/3f9ZFoT.
Term Frequency–Inverse Document Frequency (TFIDF)
Term Frequency-Inverse Document Frequency (TFIDF) is another method of representing text data in a vector format. Here, once again, we'll represent each document as a list whose length is equal to the number of unique words/tokens in all documents (corpus), but the vector here not only represents the presence and absence of a word, but also the frequency of the word—both in the current document and the whole corpus.
This technique is based on the idea that rarely occurring words are better representatives of a document than frequently occurring ones. Hence, this representation gives more weight to the rarer or less frequent words than to frequently occurring words. It does so with the following formula:
TFIDF(term, document) = term frequency x inverse document frequency
Here, term frequency is the frequency of the word in the given document. Inverse document frequency can be defined as log(D/df), where df is the document frequency of the word (the number of documents in which it appears) and D is the total number of documents in the background corpus.
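As a quick illustration of this formula, here is a hand-rolled sketch on a toy corpus (note that sklearn's TfidfVectorizer, used in the next exercise, additionally applies smoothing and normalization, so its numbers will differ):
import math

toy_corpus = [doc.lower().split() for doc in
              ['the sky is blue',
               'the sun is bright',
               'the sun in the sky is bright']]
D = len(toy_corpus)    # total number of documents

def tf_idf(term, doc_tokens):
    tf = doc_tokens.count(term)                          # term frequency
    df = sum(1 for doc in toy_corpus if term in doc)     # document frequency
    return tf * math.log(D / df)

print(tf_idf('sky', toy_corpus[0]))    # rarer word, non-zero weight (about 0.41)
print(tf_idf('the', toy_corpus[0]))    # occurs in every document, weight 0.0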
Now, let's complete an exercise and learn how TFIDF can be implemented in Python.
Exercise 2.15: TFIDF Representation
In this exercise, we will represent the input texts with their TFIDF vectors. We will use a sklearn module named TfidfVectorizer, which converts text into TFIDF vectors. Follow these steps to implement this exercise:
- Open a Jupyter Notebook.
- Import all the necessary libraries and create a method to calculate the TFIDF of the corpus. Add the following code to implement this:
from sklearn.feature_extraction.text import TfidfVectorizer
def get_tf_idf_vectors(corpus):
tfidf_model = TfidfVectorizer()
vector_list = tfidf_model.fit_transform(corpus).todense()
return vector_list
- To create a TFIDF model, write the following code:
corpus = ['Data Science is an overlap between Arts and Science',\
'Generally, Arts graduates are right-brained and '\
'Science graduates are left-brained',\
'Excelling in both Arts and Science at a '\
'time becomes difficult',\
'Natural Language Processing is a part of Data Science']
vector_list = get_tf_idf_vectors(corpus)
print(vector_list)
In the preceding code, the get_tf_idf_vectors() method generates TFIDF vectors from the corpus; we then call this method on the given corpus. The preceding code generates the following output:
The preceding output shows the TFIDF vector for each document. As you can see from the results, each document is represented by a list whose length is equal to the number of unique words in the corpus, and each list (vector) contains the TFIDF values of the words at their corresponding indices.
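To see which column corresponds to which word, you can fit a vectorizer on the same corpus and inspect its vocabulary, just as we did for CountVectorizer earlier (a small sketch reusing the corpus defined above):
tfidf_model = TfidfVectorizer()
tfidf_model.fit(corpus)
# sorted(vocabulary_) lists the words in the same order as the vector columns
print(sorted(tfidf_model.vocabulary_))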
Note
To access the source code for this specific section, please refer to https://packt.live/3gdzsHA.
You can also run this example online at https://packt.live/3fdP5gS.
In the next section, we will solve an activity to extract specific features from texts.