Removing repeating characters
In everyday language, people are often not strictly grammatical. They will write things such as "I looooooove it" in order to emphasize the word "love". However, computers don't know that "looooooove" is a variation of "love" unless they are told. This recipe presents a method to remove these annoying repeating characters in order to end up with a proper English word.
Getting ready
As in the previous recipe, we will be making use of the re module, and more specifically, backreferences. A backreference is a way to refer to a previously matched group in a regular expression. This will allow us to match and remove repeating characters.
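As a minimal illustration of a backreference, the sketch below uses \1 to require that a character be followed by a second copy of itself. The pattern name doubled and the sample words are chosen only for this example.

```python
import re

# (\w) captures one word character as group 1; \1 then matches only
# that same character again, so the pattern finds a doubled character.
doubled = re.compile(r'(\w)\1')

for text in ['love', 'loove', 'goose']:
    match = doubled.search(text)
    # group(0) is the doubled pair; there is no match object for 'love'
    print(text, match.group(0) if match else None)
```

Only the words that contain two identical adjacent characters produce a match.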
How to do it...
We will create a class that has the same form as the RegexpReplacer class from the previous recipe. It will have a replace() method that takes a single word and returns a more correct version of that word, with the dubious repeating characters removed. This code can be found in replacers.py in the book's code bundle and is meant to be imported:
import re

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
And now some example use cases:
>>> from replacers import RepeatReplacer
>>> replacer = RepeatReplacer()
>>> replacer.replace('looooove')
'love'
>>> replacer.replace('oooooh')
'oh'
>>> replacer.replace('goose')
'gose'
How it works...
The RepeatReplacer class starts by compiling a regular expression for matching and defining a replacement string with backreferences. The repeat_regexp pattern matches three groups:
- Zero or more starting characters (\w*)
- A single character (\w) that is followed by another instance of that character (\2)
- Zero or more ending characters (\w*)
The replacement string is then used to keep all the matched groups, while discarding the character matched by the backreference to the second group. So, the word looooove gets split into (looo)(o)o(ve) and then recombined as loooove, discarding the last o. Because replace() calls itself whenever the word has changed, this continues until only one o remains, at which point repeat_regexp no longer matches the string and no more characters are removed.
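To make the pass-by-pass behavior visible, here is a small sketch that records every intermediate string. It reuses the pattern and replacement string from the recipe, but the helper trace_replace() is hypothetical and is not part of replacers.py. It uses a loop instead of recursion, which should give the same final result.

```python
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
repl = r'\1\2\3'

def trace_replace(word):
    """Return every intermediate string, starting with the input word."""
    steps = [word]
    while True:
        # One pass drops a single repeated character from the word
        shorter = repeat_regexp.sub(repl, steps[-1])
        if shorter == steps[-1]:
            # Nothing changed, so no doubled character is left
            return steps
        steps.append(shorter)

for step in trace_replace('looooove'):
    print(step)
```

Each printed line is one character shorter than the previous one, and the last line is the final result of replace().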
There's more...
In the preceding examples, you can see that the RepeatReplacer class is a bit too greedy and ends up changing goose into gose. To correct this issue, we can augment the replace() method with a WordNet lookup. If WordNet recognizes the word, then we can stop replacing characters. Here is the WordNet-augmented version:
import re
from nltk.corpus import wordnet

class RepeatReplacer(object):
    def __init__(self):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        if wordnet.synsets(word):
            return word

        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word
Now, goose will be found in WordNet, and no character replacement will take place. Also, oooooh will become ooh instead of oh because ooh is actually a word in WordNet, defined as an expression of admiration or pleasure.
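The early-exit logic can also be tried without the WordNet corpus installed. In the sketch below, the class name LookupRepeatReplacer, the is_known parameter, and the small known_words set are all hypothetical stand-ins for wordnet.synsets(), introduced only for illustration. They are not part of replacers.py.

```python
import re

class LookupRepeatReplacer(object):
    """Hypothetical variant that stops as soon as is_known(word) is true."""

    def __init__(self, is_known):
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'
        self.is_known = is_known  # stand-in for the wordnet.synsets() check

    def replace(self, word):
        # A recognized word is returned untouched, even if it has repeats
        if self.is_known(word):
            return word

        repl_word = self.repeat_regexp.sub(self.repl, word)

        if repl_word != word:
            return self.replace(repl_word)
        else:
            return repl_word

# Toy vocabulary standing in for WordNet (an assumption for this sketch)
known_words = {'goose', 'ooh', 'love'}
replacer = LookupRepeatReplacer(lambda w: w in known_words)

for word in ['goose', 'oooooh', 'looooove']:
    print(word, replacer.replace(word))
```

With NLTK and its WordNet data available, passing a callable such as lambda w: bool(wordnet.synsets(w)) should reproduce the behavior of the WordNet-augmented class above.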
See also
Read the next recipe to learn how to correct misspellings. For more information on WordNet, refer to the WordNet recipes in Chapter 1, Tokenizing Text and WordNet Basics. We will also be using WordNet for antonym replacement later in this chapter.