
Challenges for word tokenization

If you analyze the preceding output, you can observe that the word don't is tokenized as do and n't. Handling these kinds of contractions is pretty painful using nltk's word_tokenize.
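
For example, a minimal snippet like the following reproduces this behavior (it assumes the punkt tokenizer data has already been downloaded):

    from nltk.tokenize import word_tokenize

    # word_tokenize splits the contraction don't into two tokens: do and n't
    print(word_tokenize("I don't know."))
    # Output: ['I', 'do', "n't", 'know', '.']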

To solve the preceding problem, you can write exception-handling code and improve the accuracy. You need to write pattern-matching rules that solve the defined challenge, but they are highly customized and vary from application to application; a sketch of one such rule is shown below.
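
As a minimal sketch of such a rule, the following hypothetical regular expression keeps common contractions such as don't together as single tokens; the pattern and function name are illustrative and not part of nltk:

    import re

    # Illustrative pattern: match word'word contractions first, then plain
    # words, then any single non-word, non-space character (punctuation).
    CONTRACTION_PATTERN = re.compile(r"\w+'\w+|\w+|[^\w\s]")

    def tokenize_keep_contractions(text):
        # Return all non-overlapping matches, left to right
        return CONTRACTION_PATTERN.findall(text)

    print(tokenize_keep_contractions("I don't know."))
    # Output: ['I', "don't", 'know', '.']

Such a rule works for this specific case, but every new exception (possessives, hyphenated words, abbreviations, and so on) would need its own pattern, which is why this approach stays application-specific.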

Another challenge involves languages such as Urdu, Hebrew, Arabic, and so on, where it is quite difficult to decide on word boundaries and find meaningful tokens in the sentences.