Abstract: This article analyzes the morphological rules of GoldenDict, using the files from the hunspell_ja_JP project as examples. (Well, this article is quite technical, and I don't really know how to write an abstract for it, lol.)
Introduction#
This article uses the morphological files from the MrCorn0-0/hunspell_ja_JP project ("Hunspell morphology dictionary for Japanese used in GoldenDict", github.com; hereinafter the "original project") as examples, and references the man 4 hunspell PDF provided on the Linux hunspell official website to explain the basic morphological rules.
A preliminary note: I am not an expert and have only roughly worked out some of the rules in the original project's morphological files, so there may be errors; please read critically and discuss kindly. Also, the original project covers only Japanese; if you have morphology questions about other languages, I may not be able to answer them, so please refer to the manual.
The morphological functionality of GoldenDict primarily uses Hunspell, which was originally a spell-checking tool for Linux, so there are slight differences from what is described in Introducing Hunspell to German Language Students. GoldenDict's morphology uses only the two files with the extensions `.dic` and `.aff`, and does not use files with the extensions `.morph`, `.good`, `.wrong`, or `.sug`. However, the rules for the `.dic` and `.aff` files are completely consistent with those in the Hunspell manual.
Regarding the functions of these two files, I quote from Introducing Hunspell to German Language Students:
- The `.dic` dictionary file contains entries similar to the headwords (lexemes) in a paper dictionary.
- The `.aff` restoration rule set contains many groups of rules describing how to convert a complex form, with prefixes, suffixes, or compounding, back into a headword that exists in the dictionary.
Basic Format of the dic File#
The dic file is quite simple; it looks like the following (the initial number indicates the number of entries in the dic file):
```
450851
あ
亜
亞
吾
我
高い/XA
```
Some entries are followed by a `/` and what looks like gibberish. This notation can be loosely understood as marking the word's "part of speech": it says the word has a transformation rule named `XA` attached, which will be explained in detail later. For now, it is enough to know that the dic file can tag each word this way.
Additionally, the words included in GoldenDict's dic file will affect the final lookup results when the morphological function is enabled, so it is best to treat the words in this file with caution and not modify them recklessly.
aff File#
The rules for the aff file are very complex. Below, I will only explain the rules used in the original project; those studying other languages or with other needs should refer to the Linux `man 4 hunspell` manual. (To repeat an important point: my English is only at CET-4 level, and reading the manual was very challenging. My conclusions come entirely from guessing and verification, not from fully understanding the manual. Rather than relying on a Japanese major with poor English, it might be better to try it yourself. :)
MAP#
Let's start with an example of MAP, the easiest rule to understand:
```
# Spelling Variants
MAP 89
MAP あア
MAP かカ
MAP さサ
MAP たタ
```
`MAP 89` indicates that 89 MAP rules follow. The MAP rule can be simply understood as ignoring the specified character differences: for example, `MAP あア` means that an input ア will also be treated as あ. The original project uses this rule to handle the writing habits of Japanese onomatopoeia. For example, チョコチョコ does not appear in most authoritative dictionaries, while the replaced form ちょこちょこ appears in many dictionaries.
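To make this concrete, here is a minimal, hedged sketch of my own (the file names, the single headword, and the reduced MAP count are hypothetical, not taken from the original project), assuming GoldenDict's suggestion-based lookup:

```
# mini.aff (hypothetical sketch, not a file from the original project)
SET UTF-8
LANG ja

# Each MAP line declares a set of interchangeable characters.
MAP 3
MAP ちチ
MAP ょョ
MAP こコ

# mini.dic (the second file, shown here as a comment):
# 1
# ちょこちょこ
```

With these two files loaded, an input of チョコチョコ should be mapped to, and jump to, the headword ちょこちょこ.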
However, there is a form near the end of the original project that I have not fully understood:

```
MAP (ゃ)(ャ)
```

According to the manual, "Use parenthesized groups for character sequences (eg. for composed Unicode characters)", i.e. parenthesized groups handle units made of multiple characters, such as composed Unicode symbols. So these rules should be for contracted sounds: a word like チョロチョロ is represented by the computer with the individual characters チ, ョ, and ロ, so perhaps the grouped characters need to be replaced together as a unit? But I do not fully understand this.
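If that reading is correct, a parenthesized group should also be able to treat a whole digraph as a single replacement unit. The following is my own hypothetical sketch, not a line from the original project:

```
# Hypothetical: treat the contracted-sound digraph as one MAP unit
MAP 1
MAP (ちょ)(チョ)
```

Under this reading, ちょ and チョ would be interchanged as two-character sequences rather than character by character.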
Additionally, rules like the following seem to have little practical significance, since many dictionaries do not include entries like 腕きゞ anyway, so whether they are replaced hardly matters. If we really want to solve the problems caused by iteration marks (踊り字), we probably still have to rely on regular expressions (heh, just showing off~ the previous version of the "Japanese Non-Dictionary Form Dictionary v3" already handles this perfectly):

```
MAP (ゝ)(ヽ)
MAP (ゞ)(ヾ)
```

(The difference may not be obvious here; it is clearer in an image: the symbols in the first column of each pair have a small hook.)
REP#
Next, I will introduce the REP rule, which is quite similar to MAP:

```
REP 12
REP かけ 掛け
REP かか 掛か
```

(The original project actually wrote `REP かけ 掛け かけ`, which may have been a mistake...)
Like MAP, `REP 12` indicates that 12 REP rules follow. But unlike MAP, in `REP かけ 掛け` the first field かけ is the input and the second field 掛け is the replacement result (the order is reversed compared to MAP). Additionally, the REP rule can replace multi-character strings (the parameters are simply separated by tabs, while MAP can only use the `()` grouping, which is quite limiting).
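As a hedged sketch of REP in action (again my own hypothetical files, under the same suggestion-based lookup assumption):

```
# sketch.aff (hypothetical REP example)
SET UTF-8
LANG ja

# The input substring かけ may be rewritten as 掛け.
REP 1
REP かけ 掛け

# sketch.dic:
# 1
# 呼び掛ける
```

With this, an input like 呼びかける should produce the suggestion 呼び掛ける, which does exist in the dic file.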
Basic Format of aff#
Next, I will go back to introduce the basic format of the aff file. All aff files should start like this:
```
SET UTF-8
LANG ja
FLAG long
# https://github.com/MrCorn0-0/hunspell_ja_JP/
```
`SET` sets the file encoding; `LANG` specifies the language the rules apply to (refer to the manual for other languages); `FLAG long` indicates that rule names consist of two extended ASCII characters. For example, in a rule that will be used later, XA is the rule name:

```
SFX XA Y 1
SFX XA い く い
```

If you prefer to number the rules directly, the manual allows writing `FLAG num` at the beginning of the file ("The `long` value sets the double extended ASCII character flag type, the `num` sets the decimal number flag type."), and you can then name rules in the following style:
```
# Adjective く
SFX 001 Y 1
SFX 001 い く い
```
To make future modifications easier, it is best to explain numeric names with comments. Note that `#` must be at the beginning of a line for that line to be treated as a comment. However, naming with numbers raises a problem: how do you apply multiple rules to a single word? (After all, a word rarely conjugates in only one way, right?)
With `long`, we simply write the rule names one after another (each name is a fixed 2 characters, so the program can tell them apart):

```
高い/XAXB
```

With numbers, you must separate them with half-width commas:

```
高い/001,002
```
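Putting the pieces together, here is a minimal hypothetical dic/aff pair using `FLAG long` with two rules on one headword; the XA body matches the example above, while XB is a second rule I invented for illustration:

```
# mini.aff (hypothetical two-flag sketch)
SET UTF-8
LANG ja
FLAG long

# XA: restore 高く -> 高い
SFX XA Y 1
SFX XA い く い

# XB (invented): restore 高さ -> 高い
SFX XB Y 1
SFX XB い さ い

# mini.dic:
# 1
# 高い/XAXB
```

Both 高く and 高さ should then resolve to the single headword 高い.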
SFX#
The reason for going back to the beginning of the aff file was mainly to emphasize that the dic file can attach multiple rules to a word. These multiple rules do not refer to MAP and REP, but to the SFX rules introduced next, which support custom naming. I emphasize "custom naming" because the naming affects both the aff file and the dic file.
It was mentioned earlier that the dic file may contain a notation like this:

```
高い/XA
```
To some extent, SFX rules are the true morphological rules. Through them we can define affixes and restore word forms, which makes it possible to derive Japanese conjugations and recover dictionary forms. (The following content mainly refers to the AFFIX FILE OPTIONS FOR AFFIX CREATION section of the manual.)
First, let’s illustrate with a simple example:
```
SFX XA Y 1
SFX XA い く い
```
In the first line, `XA` is the name of our custom affix; Y is a fixed parameter; and per the manual, the 1 is the number of affix entries under the name `XA`. In the second line, the first い means this rule only applies to dic words ending in い (the manual calls this "stripping characters from beginning (at prefix rules) or end (at suffix rules) of the word"; my understanding is that we have defined an affix named XA whose actual content is い, and this affix is the part the program processes). The く means the rule takes effect when the input word ends in く. The final い is a condition that must be met before the morphological rule can apply: the restored word must end in い; if this condition is not satisfied, no derivation result is shown.
For example, when we input 高く, the program strips the く and replaces it with い (this い is the first い in the second line), then checks whether the dic file contains a word ending in い, namely 高い (the "ending in い" requirement comes from the second い in that line). If such a word exists, GoldenDict jumps directly to the corresponding entry.
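The lookup direction can be written out as a step-by-step trace; the commented sketch below is my own reading of the process under the interpretation just given:

```
# SFX XA い く い   -- fields: (stripping) (affix) (condition)
#
# Input:  高く
# 1. The input ends in the affix く, so the rule is a candidate.
# 2. Remove the affix:        高く -> 高
# 3. Restore the stripping:   高 + い = 高い
# 4. Check the condition:     高い ends in い -> OK
# 5. Look up 高い in the dic file -> found, so GoldenDict jumps to it.
```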
I call that example simple because, in the original project, the rule actually looks like this:
```
# Adjective 文く
SFX XA Y 2
SFX XA し く/BaTe し
SFX XA い く/BaTe い
```
This is because in Japanese, 高く can itself continue to conjugate, so the user may select forms such as 高くば or 高くて. The original author fully considered this feature of Japanese and used the `/` notation to handle nested, chained transformations. (Does that look familiar? In the dic file, the same symbol indicates which rules a word can take; you can think of it as a part-of-speech marker.)
It is important to note that `BaTe` is two independent rules, Ba and Te, both of which can be found in the original author's aff file:
```
SFX Ba Y 1
SFX Ba 0 ば .
```
(I am not entirely sure 高くば is real grammar, but when I isolated this rule for testing, inputting 高くば did make the software display the entry for 高い.)
```
SFX Te Y 3
SFX Te 0 て [っいしく]
SFX Te 0 で ん
SFX Te 0 て .
```
(The key line is `SFX Te 0 て .`; from the perspective of Japanese grammar, the other two lines are unrelated to `SFX XA し く/BaTe し` and `SFX XA い く/BaTe い`; the original author may simply have grouped them together out of personal habit.)
Here the special character `.`, mentioned earlier, appears in the condition position. The manual says "Zero condition is indicated by dot.", so `.` means there is no condition on the result; likewise, the `0` in the stripping position means nothing is stripped. Therefore `SFX Te 0 て .` means that any て at the end of a word can simply be deleted.
The function of the rules may be hard to grasp in isolation, so let's return to `SFX XA い く/BaTe い` and put it together with `SFX Te 0 て .` for a clearer example: the original author designed this combination to handle the input 高くて. (Removing the `/BaTe` would make text selection more demanding for the user, so the original project really is well designed.)
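Written out as a trace (again my own reading of the twofold stripping, under the same assumptions):

```
# Rules involved:
#   SFX XA い く/BaTe い    (outer rule, continuation flags Ba and Te)
#   SFX Te 0 て .           (inner rule: a trailing て may be deleted)
#
# Input:  高くて
# 1. Apply Te (first stripping):  remove て -> 高く
# 2. Apply XA (second stripping): く -> い  -> 高い
# 3. The condition い holds and 高い is in the dic file -> jump to 高い.
# "Twofold suffix stripping" means at most two suffix rules can chain this way.
```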
The previous example involves double nesting, which may still be hard to follow. If you have questions, refer to the "Twofold suffix stripping" section and the AFFIX FILE OPTIONS FOR AFFIX CREATION part of the manual. (Honestly, I didn't fully understand it either.)
Here’s another example:
```
SFX To Y 1
SFX To 0 とも .
```
`To` is the rule name; the `0` in the stripping position means nothing is restored after the affix is removed; とも is the affix that appears at the end of the actual input; and `.` means there is no condition on the result. The function of the rule named `To` is therefore simply to remove a とも from the end of the input.
Additionally, `SFX Te 0 て [っいしく]` carries the condition `[っいしく]`, which according to the manual means the character immediately before the final て (that is, the last character of the restored word) must be one of っ, い, し, or く.
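Here is a hypothetical sketch of how such a character-class condition narrows a rule; the flag name Ta and the rule itself are my own, modeled on the Te rule above:

```
# Hypothetical: た attaches only after っ/い/し/く (e.g. 高かっ + た)
SFX Ta Y 1
SFX Ta 0 た [っいしく]
#
# Input 高かった: remove た -> 高かっ, which ends in っ -> condition met.
# Input 食べた:   remove た -> 食べ, which ends in べ -> condition fails,
# so this particular rule produces no result for 食べた.
```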
Below are some more complex custom rules, which I will only briefly introduce:
`[^行]く` works the same way as in regular expressions: the first rule applies only to words ending in く whose character before the く is not 行, while the second covers 行く itself:

```
SFX 5T く い/TeTaTrTm [^行]く
SFX 5T く っ/TeTaTrTm 行く
```
This rule is slightly longer but has no special significance beyond its very long chain of continuation flags:

```
SFX KN く け/RUTeTaTrTmf1f2f3f4f5f6m0m1m2m3m4m5m6m7TiTItiSuNgQq1Eba1M1myo く
```

This rule could have been written as two, but the original author flexibly used the `[]` syntax:

```
SFX xU い かる/bs [しじ]い
```
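For clarity, here is the two-rule expansion that I believe the `[]` form is equivalent to (my own rewrite, not from the original project):

```
# Equivalent expansion of: SFX xU い かる/bs [しじ]い
SFX xU Y 2
SFX xU い かる/bs しい
SFX xU い かる/bs じい
```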
Side Note#
My interest in morphology stems entirely from running into tricky problems in the "Japanese Non-Dictionary Form Dictionary" project; I wanted to see whether other solutions existed, so I spent nearly a week reading the obscure manual. Although my understanding is only rough, I have already felt the power of GoldenDict's Hunspell morphology. ~~Linguistics is the best!~~ In the spirit of teaching people to fish rather than giving them fish, I am sharing this summary in the hope of helping everyone understand the feature a little, and I look forward to improving GoldenDict's Hunspell morphology together. Go submit issues and PRs at hunspell_ja_JP!
There is an even more important reason, though: I had just joined FreeMdict, and @epistularum had created a GoldenDict morphology demo (on GitHub) using an approach similar to the "Japanese Non-Dictionary Form Dictionary". To communicate with them, I decided to finally put this matter, delayed for months, on the agenda (my poor English really is a struggle...).
Additionally, I would like to announce in advance that GoldenDict can easily solve the previously mentioned issue with the kanji spellings of Japanese compound verbs. Interestingly, GoldenDict's Hunspell feature can return multiple results, while Eudic's similar feature supports returning only one. The manual mentions only one feature that may not work on mobile ("BUG: UTF-8 flag type doesn't work on ARM platform."), and surely Eudic wouldn't have avoided the technology for that reason alone, would it... In any case, lookup ought to support multiple restored spellings, such as:

```
雨が降ります。
バスから降ります。
```
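This ambiguity is easy to reproduce in a hedged sketch; the headwords are real verbs, but the flag names M5 and M1 and the files themselves are my own invention:

```
# ambi.aff (hypothetical): one surface form, two stems
SET UTF-8
LANG ja
FLAG long

# M5: godan 降る (ふる) -> 降ります
SFX M5 Y 1
SFX M5 る ります る

# M1: ichidan 降りる (おりる) -> 降ります
SFX M1 Y 1
SFX M1 る ます る

# ambi.dic:
# 2
# 降る/M5
# 降りる/M1
```

An input of 降ります can then be restored to either 降る or 降りる, which is exactly the multiple-result behavior the two example sentences above require.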
Similar Functionality of Eudic Dictionary#
I discussed the similar functionality of the Eudic dictionary with friends on the FreeMdict Forum. No major optimizations were found, but there are some minor ones:

- Some elided sentence patterns, for example 言わざる in 言わざるを得ない (though selecting 言わ also yields results)
- Colloquial contractions such as ん, と, ちゃ, etc.

Hunspell seems to handle only twofold nesting, so I suspect that complex patterns like 食べたければ may not be solvable (which means you still need to think before selecting; you can't just point at anything you don't understand).
Additionally, I may not have expressed myself clearly: Eudic does have a "conjugation restoration" function, but the technology behind it does not appear to be Hunspell:

- Multiple nesting (the original project seems unable to achieve this)
- Selection only at the end of the word (the original project can do this)
- Japanese writing habits are ignored
- Adjectives do not even support the simplest transformations
Judging from the results, Eudic may have built its own closed-source conjugation-derivation tool, but it does not allow user customization, so it might be worth giving feedback to Eudic's official team to help them improve.
References#
Tutorials#
- Linux hunspell official website
  - The most critical document is the man 4 hunspell file; the other documents cover technical details of the Linux implementation.
  - Here is a copy of my annotated version:
- Introducing Hunspell to German Language Students - Xu Yinuo's article - Zhihu
  - Organizes the material around the man 4 hunspell PDF and explains some simple concepts.
Open Source Projects#
- MrCorn0-0/hunspell_ja_JP: Hunspell morphology dictionary for Japanese used in GoldenDict. (github.com): nearly 400 morphological rules written according to Japanese grammar, by a Chinese author.
- epistularum/hunspell-ja-deinflection: Hunspell dictionary to deinflect all Japanese conjugated verbs to the dictionary form and suggest correct spelling. (github.com): rules written by replacing word endings, not very complete, by a non-Chinese author.
- https://github.com/wooorm/dictionaries: Includes morphological rules written in JavaScript, but does not include Japanese.