Python 3 Text Processing with NLTK 3 Cookbook

Removing repeating characters

In everyday language, people are often not strictly grammatical. They will write things such as "I looooooove it" to emphasize the word "love". However, computers don't know that "looooooove" is a variation of "love" unless they are told. This recipe presents a method for removing these repeating characters so that we end up with a proper English word.

Getting ready

As in the previous recipe, we will be making use of the re module, and more specifically, backreferences. A backreference is a way to refer to a previously matched group in a regular expression. This will allow us to match and remove repeating characters.
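As a minimal sketch of what a backreference does, the following snippet uses only the standard re module. The pattern name doubled is our own choice for illustration and does not come from the book's code bundle:

```python
import re

# (\w) captures one word character as group 1;
# \1 then requires that exact same character to appear again.
doubled = re.compile(r'(\w)\1')

match = doubled.search('hello')
print(match.group(0))            # ll  (the whole match)
print(match.group(1))            # l   (the captured group)
print(doubled.search('world'))   # None, no character is doubled

# In a replacement string, \1 inserts the captured text,
# so each doubled pair is reduced to a single character.
print(doubled.sub(r'\1', 'hello'))  # helo
```

The same idea, with a different group number, is what the pattern in this recipe relies on.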

How to do it...

We will create a class that has the same form as the RegexpReplacer class from the previous recipe. It will have a replace() method that takes a single word and returns a more correct version of that word, with the dubious repeating characters removed. This code can be found in replacers.py in the book's code bundle and is meant to be imported:

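import re

class RepeatReplacer(object):
  def __init__(self):
    # Group 2 is a single character; \2 requires the same character right after it.
    self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    # Keep all three groups and drop the extra character matched by \2.
    self.repl = r'\1\2\3'

  def replace(self, word):
    repl_word = self.repeat_regexp.sub(self.repl, word)

    # Each match removes one repeated character; recurse until nothing changes.
    if repl_word != word:
      return self.replace(repl_word)
    else:
      return repl_word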

And now some example use cases:

>>> from replacers import RepeatReplacer
>>> replacer = RepeatReplacer()
>>> replacer.replace('looooove')
'love'
>>> replacer.replace('oooooh')
'oh'
>>> replacer.replace('goose')
'gose'

How it works...

The RepeatReplacer class starts by compiling a regular expression that matches repeating characters and defining a replacement string with backreferences. The repeat_regexp pattern matches three groups:

- Zero or more starting characters (\w*)
- A single character (\w) that is followed by another instance of that character (\2)
- Zero or more ending characters (\w*)

The replacement string is then used to keep all the matched groups, while discarding the extra character matched by the backreference to the second group. So, the word looooove gets split into (looo)(o)o(ve) and then recombined as loooove, discarding the last o. Because the first group is greedy, each pass removes the last repeated character in the word. This continues until only one o remains, at which point repeat_regexp no longer matches the string and no more characters are removed.
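To make the step-by-step reduction visible, here is a small sketch that applies the same pattern and replacement string in a loop and records every intermediate form. The trace() helper is hypothetical and written only for this illustration; it is not part of replacers.py:

```python
import re

repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')

def trace(word):
    """Return every intermediate form of word, one per substitution pass."""
    steps = [word]
    while True:
        reduced = repeat_regexp.sub(r'\1\2\3', steps[-1])
        if reduced == steps[-1]:
            # Nothing changed, so no repeated character is left.
            return steps
        steps.append(reduced)

# The three groups captured on the first pass.
print(repeat_regexp.match('looooove').groups())
# ('looo', 'o', 've')

# One repeated character is dropped on each pass.
print(trace('looooove'))
# ['looooove', 'loooove', 'looove', 'loove', 'love']
```

A loop is used here instead of recursion only so that the intermediate strings can be collected; the final result is the same as with the recursive replace() method.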

There's more...

In the preceding examples, you can see that the RepeatReplacer class is a bit too greedy and ends up changing goose into gose. To correct this issue, we can augment the replace() method with a WordNet lookup. If WordNet recognizes the word, we stop replacing characters and return it as is. Here is the WordNet-augmented version:

import re
from nltk.corpus import wordnet

class RepeatReplacer(object):
  def __init__(self):
    self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
    self.repl = r'\1\2\3'

  def replace(self, word):
    # Stop as soon as WordNet recognizes the current form of the word.
    if wordnet.synsets(word):
      return word
    repl_word = self.repeat_regexp.sub(self.repl, word)

    if repl_word != word:
      return self.replace(repl_word)
    else:
      return repl_word

Now, goose will be found in WordNet, and no character replacement will take place. Also, oooooh will become ooh instead of oh because ooh is actually a word in WordNet, defined as an expression of admiration or pleasure.
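The early-exit logic can also be tried without WordNet. The sketch below swaps the wordnet.synsets() lookup for a plain set of known words; the VocabRepeatReplacer class and its known_words parameter are assumptions made for this illustration and do not appear in the book's code:

```python
import re

class VocabRepeatReplacer:
    """Same recursion as RepeatReplacer, but checks a set of known words
    instead of WordNet (a hypothetical stand-in for illustration)."""

    def __init__(self, known_words):
        self.known_words = set(known_words)
        self.repeat_regexp = re.compile(r'(\w*)(\w)\2(\w*)')
        self.repl = r'\1\2\3'

    def replace(self, word):
        # Stop as soon as the current form is a known word.
        if word in self.known_words:
            return word
        repl_word = self.repeat_regexp.sub(self.repl, word)
        if repl_word != word:
            return self.replace(repl_word)
        return repl_word

replacer = VocabRepeatReplacer({'goose', 'ooh', 'love'})
print(replacer.replace('goose'))     # goose
print(replacer.replace('oooooh'))    # ooh
print(replacer.replace('looooove'))  # love
```

With this tiny vocabulary the results line up with the WordNet behavior described above, but only because the three words were added by hand. The real version depends on the WordNet corpus being installed, which is typically done once with nltk.download('wordnet').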