How to Tokenize Japanese in Python
DRANK

Over the past several years there's been a welcome trend in NLP projects to be broadly multi-lingual. However, even when many languages are supported, there's a few that tend to be left out. One of these is Japanese. Japanese is written without spaces, and deciding where one word ends and another begins is not trivial. While highly accurate tokenizers are available, they can be hard to use, and English documentation is scarce. This is a short guide to tokenizing Japanese in Python that should be enough to get you started adding Japanese support to your application.

dampfkraft.com
Related Topics: Python