Summary: This article is a FreeMdict forum discussion about Hunspell, involving xiaoyifang, the maintainer of a GoldenDict fork.
Morphology dictionaries (Hunspell) do not work with EPWING dictionaries. Is there a way to fix this? I have already opened an issue on the original GoldenDict GitHub. Should I open another one on xiaoyifang's fork?
Sorry, even though the issue has been resolved, I still want to clarify: the results of GoldenDict's Hunspell morphology feature are determined by the words listed in the .dic files; they are not affected by which dictionaries are loaded.
The mechanism on the Android side of Youdao Dictionary is different; it is influenced by the dictionaries loaded by the user (your suspicion holds true for this software).
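To make the .dic point concrete, here is a minimal sketch, assuming the pyhunspell binding and an en_US dictionary at the hypothetical paths shown; GoldenDict's own code path is different, but the underlying Hunspell behaviour is the same: a surface form can only be traced back to a base form that actually appears in the .dic/.aff pair.

```python
# A minimal sketch, assuming the pyhunspell package and en_US dictionary files
# at these hypothetical paths. Stems come only from the .dic/.aff pair;
# the dictionaries loaded in the application play no role at this step.
import hunspell

h = hunspell.HunSpell("/usr/share/hunspell/en_US.dic",
                      "/usr/share/hunspell/en_US.aff")

print(h.stem("eating"))      # e.g. [b'eating', b'eat'] because "eat" is in en_US.dic
print(h.stem("tabeteimasu")) # [] because romaji forms are not in the .dic at all
```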
Additionally, I believe that for Japanese to be looked up from the clipboard as completely as English can be, some issues may not be solvable through Hunspell at all: Japanese inflection is much more complex than English, and Japanese has its own typesetting conventions. That is why I created the "Japanese Non-dictionary Morphological Dictionary"; you can refer to these posts and their discussion threads for the general ideas:
This discusses the characteristics of Japanese inflection
https://forum.freemdict.com/t/topic/11523
This discusses the unique typesetting rules of Japanese
https://forum.freemdict.com/t/topic/14241/17
If you would like to know more, I will take the time to organize an article to systematically and thoroughly explain what special aspects Japanese has (it may take about a month; if I finish organizing it, I will specifically notify you).
If possible, I hope GoldenDict can handle the issues above natively, rather than relying on the workaround discussed here [https://forum.freemdict.com/t/topic/14241], which uses scripts and tools like Quicker as an intermediary.
Black Book Style#
I'm not familiar with Japanese; is this related?
It should not matter much for Chinese learners of Japanese. What you mentioned mainly affects the romanization of Japanese (you can think of it as pinyin; both are written with Latin letters).
Specifically, what needs handling is similar to the difference between the Hanyu Pinyin scheme in common use today and the Wade-Giles romanization of Chinese. With it enabled, the effect is like this:
Tsinghua University processed as Qinghua University,
Tsingtao processed as Qingdao,
Peking University processed as Beijing University
(These examples are not entirely rigorous, since some of the spellings are based on pronunciations other than modern Mandarin.)
In other words, it mainly resolves differences between romanization (Latin-letter) spelling schemes. Chinese learners of Japanese rarely look words up by romaji (like taberu); they generally use kana (like たべる) or kanji (like 食べる). However, I have seen dictionary websites designed by foreigners that support lookup by romaji, which is why such a feature exists (Chinese users probably won't use it).
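As a purely illustrative sketch of what such a feature does, the snippet below maps a few Kunrei-shiki spellings to their Hepburn equivalents before lookup; the table and function are my own examples, not anything taken from GoldenDict:

```python
# A minimal sketch of normalizing between Japanese romanization schemes.
# The table covers only a few well-known Kunrei-shiki vs. Hepburn differences;
# a real implementation would need the complete correspondence table.
KUNREI_TO_HEPBURN = {
    "si": "shi", "ti": "chi", "tu": "tsu", "hu": "fu", "zi": "ji",
    "sya": "sha", "syu": "shu", "syo": "sho", "tya": "cha",
}

def normalize_romaji(word: str) -> str:
    # Replace longer keys first so "syo" is handled before "si".
    for kunrei in sorted(KUNREI_TO_HEPBURN, key=len, reverse=True):
        word = word.replace(kunrei, KUNREI_TO_HEPBURN[kunrei])
    return word

print(normalize_romaji("tukue"))  # tsukue (desk)
print(normalize_romaji("zisyo"))  # jisho (dictionary)
```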
What I want to solve are issues analogous to those caused by English tense inflection, like the following (for ease of comparison and explanation, the example sentences are my own):
私はご飯を食べている(I am having dinner)
I am having dinner
私はご飯を食べていた(At that time, I was having dinner)
I was having dinner
私はご飯を食べた。(I had dinner)
I had dinner.
私はご飯を食べなかった(I didn't have dinner)
I didn't have dinner.
母親は私をご飯を食べさせる。(Mom lets me have supper)
Mom lets me have supper
母親は私をご飯を食べさせない。(Mom won't let me have dinner)
Mom won't let me have dinner.
The bolded parts are the verbs in both languages (and also the parts that a conventional morphology function would need to recognize). You can see that when English expresses these different meanings, the changes do not pile up on a single verb (so verbs inflect far less, with only a few forms each), whereas when Japanese expresses different meanings, the changes can be nested on one verb several times (so every sentence above is a distinct inflection, and there are far more than these). This makes morphology files for Japanese very complex, which is why I wanted to try other solutions.
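To see how quickly the forms multiply, here is a toy illustration (my own construction, not from the original post) that composes just two suffix slots on a single ichidan verb:

```python
# A toy illustration of how Japanese verb suffixes stack. Only two slots and a
# handful of endings already yield a dozen distinct surface forms for one verb.
from itertools import product

stem = "食べ"
causative = ["", "させ"]                                      # plain vs. causative
endings = ["る", "ない", "た", "なかった", "ている", "ていた"]  # tense / polarity / aspect

forms = [stem + c + e for c, e in product(causative, endings)]
print(forms)
print(len(forms))  # 12 forms, and real usage adds passive, potential, polite, ...
```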
My solution is not very academic (this time the bolded parts are the parts that my proposed solution needs to recognize):
私はご飯を食べている(I am having dinner)
I am having dinner
私はご飯を食べていた(At that time, I was having dinner)
I was having dinner
私はご飯を食べた。(I had dinner)
I had dinner.
私はご飯を食べなかった(I didn't have dinner)
I didn't have dinner.
母親は私をご飯を食べさせる。(Mom lets me have supper)
Mom lets me have supper
母親は私をご飯を食べさせない。(Mom won't let me have dinner)
Mom won't let me have dinner.
You can see that the final kana of 食べる, which is る, shows a repeating pattern of changes, so I built an mdx file by exhaustively listing the inflections of that last kana (this became "Japanese Non-dictionary Morphological Dictionary v1" and v2): entries like 食べら, 食べり, 食べれ, 食べさ, 食べま, 食べろ all point to 食べる. Then, in "Japanese Non-dictionary Morphological Dictionary v3", I wrote two scripts, one in Python and one in JavaScript, that reverse-derive the original form from the same exhaustive rules, and compared their output with the results of v2. (It is a bit circular, but it still has some validation value.)
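As a rough sketch of how such redirect entries can be generated (assuming the common MDict source-text convention in which an entry whose body is @@@LINK=目标词 redirects to that headword; the kana list is exactly the six forms mentioned above and nothing more):

```python
# A rough sketch: for an ichidan verb like 食べる, emit MDict source-text records
# that redirect each "stem + altered final kana" form back to the dictionary form.
# The kana list below is only the small subset mentioned in the post.
ALTERED_FINAL_KANA = ["ら", "り", "れ", "さ", "ま", "ろ"]

def redirect_entries(dictionary_form: str) -> str:
    stem = dictionary_form[:-1]              # drop the final る
    lines = []
    for kana in ALTERED_FINAL_KANA:
        lines.append(stem + kana)            # headword, e.g. 食べら
        lines.append(f"@@@LINK={dictionary_form}")
        lines.append("</>")                  # record terminator in MDict source text
    return "\n".join(lines)

print(redirect_entries("食べる"))
```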
In summary, after putting the idea into practice and testing it for six months (with feedback collected on the forum), I have not found any serious issues. So next I will first submit a PR to Salad Dictionary (communication there is a bit easier, and I don't know the C++ and C that GoldenDict uses) and observe the results in practice.
Possibly useless reference: Hepburn Romanization - Wikipedia, the free encyclopedia (wikipedia.org)
MeCab#
Japanese morphological analysis, such as extracting the base form, already has established libraries. The most popular open-source ones all analyze against dictionaries (ipadic/unidic). There are also some that use custom rules, but those rules are machine-trained from dictionaries as well. Will hand-written rules run into problems? It is best to test against a larger sample: https://clrd.ninjal.ac.jp/unidic/
(Deleting it won't help; I have archived it via email)
The tool you recommended is meant for analyzing whole texts; looking up individual words is probably not what it was designed for, and the two scenarios differ (for example, when looking up a single word the surrounding context is basically lost, and the text sent for lookup has not been cleaned and needs special preprocessing).
However, we can study its processing details and adapt them (we don't need to worry about segmentation; we only need the derivation step that happens after segmentation).
Below are my half-finished notes, just to give everyone some ideas (I am not a computer science major and only know Python, so please don't let me mislead you):
The developer provides source code for other languages here (you can only scroll down slowly; for some reason the page cannot be searched...).
But after downloading it, I found the file was tiny.
Can three Python files really do Japanese NLP? lol. It presumably still has to call the packaged executable (but I want to study the processing details, and I can't exactly read binary code...). Also, it uses Python 2 syntax...
So I didn't dig any further.
Not giving up, I found another one:
(SamuraiT/mecab-python3: mecab-python; the original version is at https://taku910.github.io/mecab/)
It is an unofficial binding: although it exposes a Python interface, the actual processing is (presumably) not done in Python.
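For reference, a minimal sketch of extracting base forms with that binding (assumptions: the mecab-python3 and ipadic packages from PyPI are installed, and the IPAdic output format puts the base form, 基本形, in the seventh comma-separated feature field):

```python
# A minimal sketch of getting (surface, base form) pairs via mecab-python3.
# Assumes the ipadic PyPI package; with IPAdic, feature field 6 is 基本形.
import MeCab
import ipadic

tagger = MeCab.Tagger(ipadic.MECAB_ARGS)

def base_forms(sentence: str):
    pairs = []
    for line in tagger.parse(sentence).splitlines():
        if line == "EOS":
            break
        surface, features = line.split("\t")
        fields = features.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        pairs.append((surface, base))
    return pairs

print(base_forms("私はご飯を食べなかった"))
# Roughly: [('私', '私'), ('は', 'は'), ('ご飯', 'ご飯'), ('を', 'を'),
#           ('食べ', '食べる'), ('なかっ', 'ない'), ('た', 'た')]
```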
About Large Sample Validation#
I don't want to write code, nor do I want to ask others to write code and then delete it. Modern search engines basically use this set of dictionaries and morphological analysis tools, but they are not suitable for client-side use. It’s great that you can summarize and improve, but it’s best to test with a large sample so that client developers will have confidence in using it.
Yes, I agree with your point; we need to validate against a large sample. Relying solely on manual collection is too slow (in fact, the idea of consciously collecting inflections has been around for two years, but I only started working on it six months ago and found I had still missed a lot).
With the MeCab you recommended, validation can be done by comparing two columns of its segmentation output, but I do not currently have a segmented corpus, so earlier I only described the idea briefly.
[ ] Anyone with a MeCab-segmented corpus is welcome to send it to [email protected]. I only need the two columns 書字形 and 書字形基本形, and I would be very grateful :)
Two months have passed and, sure enough, I haven't received a single file (maybe I should open a dedicated post for this, lol).
However, I have found an old computer and will take the time to build a segmented corpus myself; I expect to start large-sample validation around National Day.
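When a corpus does become available, the comparison itself is simple. Below is a sketch, assuming a hypothetical tab-separated file corpus.tsv with the two columns 書字形 and 書字形基本形; derive_base() stands in for the real reverse-derivation rules from the v3 scripts, reduced here to a tiny illustrative suffix table:

```python
# A sketch of large-sample validation. corpus.tsv is a hypothetical two-column
# TSV (書字形, 書字形基本形); derive_base() is a stand-in for the actual
# reverse-derivation rules, shrunk to a few illustrative ichidan suffixes.
import csv

ILLUSTRATIVE_RULES = {
    "なかった": "る",   # 食べなかった -> 食べる
    "ていた": "る",     # 食べていた   -> 食べる
    "ている": "る",     # 食べている   -> 食べる
    "た": "る",         # 食べた       -> 食べる
}

def derive_base(surface: str) -> str:
    # Try longer suffixes first so なかった wins over た.
    for suffix in sorted(ILLUSTRATIVE_RULES, key=len, reverse=True):
        if surface.endswith(suffix):
            return surface[: -len(suffix)] + ILLUSTRATIVE_RULES[suffix]
    return surface

total = correct = 0
with open("corpus.tsv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) < 2:
            continue
        surface, lemma = row[0], row[1]
        total += 1
        correct += derive_base(surface) == lemma

print(f"agreement: {correct}/{total} = {correct / max(total, 1):.2%}")
```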