Replacing synonyms
It is often useful to reduce the vocabulary of a text by replacing words with common synonyms. By compressing the vocabulary without losing meaning, you can save memory in cases such as frequency analysis and text indexing. More details about these topics are available at https://en.wikipedia.org/wiki/Frequency_analysis and https://en.wikipedia.org/wiki/Full_text_search. Vocabulary reduction can also increase the occurrence of significant collocations, which was covered in the Discovering word collocations recipe of Chapter 1, Tokenizing Text and WordNet Basics.
Getting ready
You will need a defined mapping of a word to its synonym. This is a simple controlled vocabulary. We will start by hardcoding the synonyms as a Python dictionary, and then explore other options to store synonym maps.
How to do it...
We'll first create a WordReplacer class in replacers.py that takes a word replacement mapping:
class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)
Then, we can demonstrate its usage for simple word replacement:
>>> from replacers import WordReplacer
>>> replacer = WordReplacer({'bday': 'birthday'})
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'
How it works...
The WordReplacer class is simply a class wrapper around a Python dictionary. The replace() method looks up the given word in its word_map dictionary and returns the replacement synonym if it exists. Otherwise, the given word is returned as is.
If you were only using the word_map dictionary, you wouldn't need the WordReplacer class and could instead call word_map.get() directly. However, WordReplacer can act as a base class for other classes that construct the word_map dictionary from various file formats. Read on for more information.
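As a rough illustration of the vocabulary reduction this recipe is aiming for, the following sketch applies a replacer to a list of tokens and counts distinct words before and after. The class is repeated so the example runs on its own. The sample tokens, the extra 'b-day' mapping, and the replace_all() helper are assumptions made for illustration and are not part of replacers.py.

```python
from collections import Counter


class WordReplacer(object):
    """Same class as above, repeated so the sketch is self-contained."""

    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)


def replace_all(replacer, tokens):
    """Hypothetical helper: run replace() over every token in a list."""
    return [replacer.replace(token) for token in tokens]


# Illustrative data only
replacer = WordReplacer({'bday': 'birthday', 'b-day': 'birthday'})
tokens = ['happy', 'bday', 'to', 'you', 'happy', 'b-day', 'birthday']

before = Counter(tokens)
after = Counter(replace_all(replacer, tokens))

print(len(before), len(after))  # distinct words before and after
print(after['birthday'])        # occurrences merged under one key
```

With this sample data, the three spellings collapse into a single entry, so the vocabulary shrinks from six distinct tokens to four and 'birthday' is counted three times.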
There's more...
Hardcoding synonyms in a Python dictionary is not a good long-term solution. Two better alternatives are to store the synonyms in a CSV file or in a YAML file. Choose whichever format is easiest for those who maintain your synonym vocabulary. Both of the classes outlined in the following section inherit the replace() method from WordReplacer.
CSV synonym replacement
The CsvWordReplacer class extends WordReplacer in replacers.py in order to construct the word_map dictionary from a CSV file:
import csv

class CsvWordReplacer(WordReplacer):
    def __init__(self, fname):
        word_map = {}
        # Each row holds a word and the synonym that replaces it
        with open(fname) as f:
            for line in csv.reader(f):
                word, syn = line
                word_map[word] = syn
        super(CsvWordReplacer, self).__init__(word_map)
Your CSV file should consist of two columns, where the first column is the word and the second column is the synonym meant to replace it. If this file is called synonyms.csv and the first line is bday,birthday, then you can perform the following:
>>> from replacers import CsvWordReplacer
>>> replacer = CsvWordReplacer('synonyms.csv')
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'
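Hand-maintained CSV files often pick up blank lines or a stray space after the comma, either of which would trip up a strict two-column unpacking. The sketch below shows one possible, more forgiving variant of CsvWordReplacer, assuming Python 3. Skipping malformed rows and stripping whitespace are assumptions about what you might want, not behavior of the class shown above, and the temporary file stands in for a real synonyms.csv.

```python
import csv
import os
import tempfile


class WordReplacer(object):
    def __init__(self, word_map):
        self.word_map = word_map

    def replace(self, word):
        return self.word_map.get(word, word)


class CsvWordReplacer(WordReplacer):
    """Variant that ignores malformed rows and strips surrounding spaces."""

    def __init__(self, fname):
        word_map = {}
        with open(fname, newline='') as f:
            for row in csv.reader(f):
                if len(row) != 2:
                    continue  # assumption: silently skip blank or malformed rows
                word, syn = (cell.strip() for cell in row)
                word_map[word] = syn
        super(CsvWordReplacer, self).__init__(word_map)


# Write a throwaway synonyms file containing a blank line and a stray space
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='') as tmp:
    tmp.write('bday,birthday\n\ncongrats, congratulations\n')
    path = tmp.name

try:
    replacer = CsvWordReplacer(path)
    print(replacer.replace('bday'))
    print(replacer.replace('congrats'))
    print(replacer.replace('happy'))
finally:
    os.remove(path)  # clean up the temporary file
```

Whether to skip bad rows quietly or raise an error is a design choice. Raising is usually safer when the vocabulary is curated and a malformed row signals a mistake.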
YAML synonym replacement
If you have PyYAML installed, you can create YamlWordReplacer in replacers.py as shown in the following:
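import yaml

class YamlWordReplacer(WordReplacer):
    def __init__(self, fname):
        # safe_load() parses plain data only; recent PyYAML versions
        # require an explicit Loader argument for yaml.load()
        with open(fname) as f:
            word_map = yaml.safe_load(f)
        super(YamlWordReplacer, self).__init__(word_map)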
Note
Download and installation instructions for PyYAML are located at http://pyyaml.org/wiki/PyYAML. You can also type pip install pyyaml at the command prompt.
Your YAML file should be a simple mapping of word: synonym, such as bday: birthday. Note that the YAML syntax is very particular, and the space after the colon is required. If the file is named synonyms.yaml, then you can perform the following:
>>> from replacers import YamlWordReplacer
>>> replacer = YamlWordReplacer('synonyms.yaml')
>>> replacer.replace('bday')
'birthday'
>>> replacer.replace('happy')
'happy'
See also
You can use the WordReplacer class to perform any kind of word replacement, including spelling correction for more complicated words that the automatic correction in the previous recipe can't handle. In the next recipe, we will cover antonym replacement.