
Mecab Installation Guide#

A brief introduction to how to install the Japanese natural language processing tool Mecab.

Introduction#

If you only need to analyze a small amount of data, ready-made online tools such as Web ちゃまめ can parse it for you.

There are many Mecab installation tutorials online, but on closer inspection most are written rather carelessly. The article Japanese Word Segmenter Mecab Documentation on I Love Natural Language Processing, which translates the official documentation, is the one worth reading in detail.

The discussion in Simple Use of Mecab Japanese Word Segmentation Tool - FreeMdict Forum is also worth reading.

If there are other good tutorials, feel free to add them.

Before getting into the main topic, let's briefly mention the two key factors in "morphological analysis": the morphological analyzer, built on some particular algorithm, and the morphological analysis dictionary.

There are plenty of analyzers to choose from; awesome-japanese-nlp-resources, for example, lists a plethora of them, written in various programming languages and optimized for different use cases.

Morphological analysis dictionaries, by contrast, offer far fewer options; the main ones today are the ChaSen, JUMAN, and UniDic dictionaries. ChaSen's latest version is 2.4.5, released June 25, 2012; JUMAN's is 1.02, released January 12, 2017; only UniDic has kept up a yearly release cadence over the last five years.

Currently the most widely used open-source morphological analyzer is Mecab, and below I explain how to install and use it on Windows. As long as you remember that "morphological analysis = morphological analyzer + morphological analysis dictionary", installing any other analyzer should not pose much trouble.

Here is a backup installation file:
https://www.123pan.com/s/iGz0Vv-svEVh.html

Installation via Installation Package#

Let me start with the safest method, suitable for those who only need the default parsing format.

First, go to the official homepage MeCab: Yet Another Part-of-Speech and Morphological Analyzer and download the installer provided there: https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7WElGUGt6ejlpVXc (corresponding path in the backup: Morphological Analysis > Morphological Analyzers > mecab).

(Various third-party libraries found online bundle a Mecab installation package of their own, so you would not need to install it separately, but the bundled version may not be the latest 0.996; installing from scratch as described here is the safer choice.)

During installation, be sure to select the UTF-8 encoding; everything else can be left at the defaults by clicking Next.


Note: you can change the program path, but if you do, you may need to add it to the environment variables manually. The Mecab main program does not take up much space anyway, and I personally see no need to change the path. (Mainly because if anything goes wrong, you may end up reinstalling it anyway.)

At this point the installation is essentially complete, and you can call Mecab directly from the command line. For the specific command-line usage, refer to Japanese Word Segmenter Mecab Documentation on I Love Natural Language Processing.
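As a quick sanity check, something along these lines should work in a terminal (a sketch assuming the installer put MeCab's bin directory on your PATH; if the console encoding garbles the Japanese, put the sentence in a UTF-8 text file and run mecab input.txt -o output.txt instead):

mecab --version
echo 天気が良いから、散歩しましょう。| mecab -Owakati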

Installation via mecab-python3#

The above method is not flexible enough for scenarios that involve processing large amounts of data in a custom format, and the official site provides no installer for macOS, so here is another method that goes through Python.

First, install the third-party library mecab-python3:

pip install mecab-python3

Then use the following commands to install the unidic-lite dictionary and switch to it (the second command installs mecab-python3 from source rather than from a prebuilt wheel).

pip install unidic-lite
pip install --no-binary :all: mecab-python3
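To confirm where the dictionary actually landed, you can print the data directory that unidic-lite exposes (a minimal check; DICDIR is the attribute the package documents for this purpose):

import unidic_lite
print(unidic_lite.DICDIR)  # path of the bundled dictionary files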

Then run the following code as a test; if no errors are raised, the installation is complete.

import MeCab

# -Owakati: output the sentence as space-separated tokens
tagger = MeCab.Tagger("-Owakati")
print(tagger.parse("天気が良いから、散歩しましょう。").split())

# Default format: one token per line, followed by its features
tagger = MeCab.Tagger()
print(tagger.parse("天気が良いから、散歩しましょう。"))


Possible Issues#

Mecab is developed in C++, and many tools that wrap it can be found online. However, unexpected issues often arise at the environment-configuration step; some recorded feedback follows:

The following feedback is from amob:

First, when pip-installing certain C++-based Python libraries, you need to run the 'Native (or Cross) Tools Command Prompt' that ships with Visual Studio, not the system's default cmd.
I also forget which misleading Mecab tutorial I had read before: because Mecab's default encoding would not display correctly in the command line, I added an autorun entry to the registry to make UTF-8 the default, which then broke the Visual Studio environment...
After that I still got the error 'Microsoft Visual C++ 14.0 is required', and only then did I realize that all I needed to run was: pip install --upgrade setuptools
Mission accomplished.
Reference pages:
visual studio: x64 Native Tools Command Prompt for VS 2019 initialization failed_script “vsdevcmd\ext\active” could not be found.-CSDN Blog
python pip on Windows - command ‘cl.exe’ failed - Stack Overflow
‘Microsoft Visual C++ 14.0 is required’ in Windows 10 - Microsoft Community

Custom Dictionary#

The dictionary that ships with the installer-based Mecab is ipadic, which was last updated in May 2003.

The unidic-lite installed via mecab-python3 is, according to its README, version 2.1.2 from 2013:

At the moment it uses Unidic 2.1.2, from 2013, which is the most recent release of UniDic that's small enough to be distributed via PyPI.

If parsing accuracy matters to you, it is better to install the UniDic dictionary maintained by the National Institute for Japanese Language and Linguistics.

If you have no special requirements, just download the latest "Contemporary Written Japanese" UniDic: https://clrd.ninjal.ac.jp/unidic_archive/2302/unidic-cwj-202302.zip (note: released March 24, 2023, and confirmed still the latest as of February 23, 2024) (backup path: Morphological Analysis > Morphological Analysis Dictionary > UniDic).

When extracting the archive, pay attention to the folder name; it is best to name it unidic-cwj-3.1.1 (if you name it something else, adjust dic_path in the code below accordingly).


Then you can test with the following code.

import MeCab

# Make sure this matches your actual path; to stay consistent with the
# screenshot, I renamed the folder. Use a raw string so that backslashes
# (e.g. the \0 in \00temp) are not interpreted as escape sequences.
dic_path = r"D:\00temp\unidic-cwj-3.1.1"

# -r nul: skip the mecabrc resource file (Windows), -d: dictionary directory
tagger = MeCab.Tagger(
    '-r nul -d {} -Ochasen'.format(dic_path).replace('\\', '/'))

text = "天気が良いから、散歩しましょう。"
print(type(tagger.parse(text)))
print(tagger.parse(text).split("\n"))
print(tagger.parse(text))
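If you need per-token fields rather than one big string, mecab-python3 also exposes MeCab's node interface; a minimal sketch reusing the tagger and text from above:

# Walk the result node by node instead of parsing to a single string
node = tagger.parseToNode(text)
while node:
    if node.surface:  # skip the BOS/EOS sentinel nodes (empty surface)
        print(node.surface, node.feature.split(",")[:4])
    node = node.next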

mecab-ipadic-NEologd#

mecab-ipadic-NEologd: Neologism dictionary for MeCab
Project address: https://github.com/neologd/mecab-ipadic-neologd/blob/master/README.ja.md
License: Apache License, Version 2.0

The "Neologd" in the project name stands for "neologism dictionary", i.e. a dictionary of new words, so this morphological analysis dictionary parses newly coined words well. However, it must be compiled from source, which I have not attempted; I hope someone will provide a tutorial (the steps from its README are sketched below).
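For reference, the project's README describes an installation roughly along these lines (untested by me; it assumes a Unix-like environment with git and a working mecab / mecab-config on the PATH):

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n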

References#

Simple Use of Mecab Japanese Word Segmentation Tool - FreeMdict Forum: provides very detailed explanations and sample code.

Other Morphological Analyzers#

As mentioned earlier, many morphological analyzers can be found through awesome-japanese-nlp-resources. Beyond matching a specific use case, the evaluations of other morphological analyzers in The Development History of MeCab are also worth a look:

Commercial morphological analyzers distributed before JUMAN were fixed in their dictionaries and part-of-speech systems, and users could not define these freely. JUMAN externalized all of these definitions so that they could be specified freely.
Dictionaries are relatively easy to obtain, but connection costs and word occurrence costs had to be defined by hand. Every time a parsing error was found, the connection costs had to be adjusted within a range that caused no side effects, so development costs were high.
Also, since JUMAN was developed for Japanese morphological analysis, its unknown-word handling was specialized for Japanese and users could not define their own. Furthermore, the part-of-speech system was fixed at two levels, which imposed a certain limitation on the part-of-speech system.

One of ChaSen's contributions is that it estimates connection costs and word occurrence costs through statistical processing (an HMM). Thanks to this, cost values can be estimated automatically simply by accumulating parsing errors. Furthermore, the part-of-speech hierarchy became unrestricted, allowing (truly) free definition, including of the part-of-speech system itself.
However, the more complex the part-of-speech system becomes, the more severe the data-sparseness problem gets. With an HMM, each part of speech has to be mapped onto a single internal state (hidden class). Simply assigning each part of speech its own internal state would be enough, but once all parts of speech are expanded to include inflections, their number can reach about 500, and reliable estimates can no longer be obtained for the low-frequency ones. Conversely, high-frequency parts of speech such as particles must be given their own internal states to achieve high accuracy. The more complex the part-of-speech system, the harder defining the internal states becomes. In other words, an HMM cannot cope with the current (complex) part-of-speech systems, and the manual cost of compensating for that keeps growing.
Additionally, ChaSen does not ship with a cost-estimation module. One seems to be usable internally at NAIST, but for the reasons above it has many parameters to set and is hard to master.
Furthermore, ChaSen's unknown-word handling is also hard-coded and cannot be freely defined.

The concerns raised in these evaluations broadly match how morphological analyzers have since developed:

  1. Analyzers are abandoning grammar-rule-based approaches and shifting entirely to statistics-based algorithms;
  2. They are abandoning home-grown morphological analysis dictionaries in favor of resources such as UniDic that are built and maintained by authoritative institutions;
  3. They are beginning to attempt to support parsing multiple languages at once.

Here are a few morphological analyzers that I have personally researched a bit:

GiNZA - Japanese NLP Library:
Development Language: Python
License: MIT license
Last Updated: 2023-09-25

Note: This morphological analysis tool was open-sourced in 2019 by Megagon Labs, an AI research institution under the Japanese company Recruit (リクルート).
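GiNZA rides on top of spaCy, so trying it out takes only a few lines; a minimal sketch, assuming the model has been installed with pip install ginza ja_ginza:

import spacy

# Load the GiNZA Japanese model provided by the ja_ginza package
nlp = spacy.load("ja_ginza")
doc = nlp("天気が良いから、散歩しましょう。")
for token in doc:
    print(token.text, token.lemma_, token.pos_)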

Kuromoji
Development Language: Java
License: Apache-2.0 license
Last Updated: 5 years ago

Note: judging by strings extracted from the MOJi Android APK, this is likely the parsing tool it uses. Elasticsearch also uses it by default.
