The online seminar on Japanese corpus and computational linguistics "Language Resources Workshop 2024" is here!
Language Resources Workshop 2024#
An online seminar on corpora and computational linguistics hosted by the National Institute for Japanese Language and Linguistics. Attendance is free, but please fill out the registration form on the official website beforehand: https://clrd.ninjal.ac.jp/lrw2024.html
Below I list some of the presentations I am interested in; the full programme can be found on the official website: https://clrd.ninjal.ac.jp/lrw2024-programme.html.
Additionally, the schedule for the academic conference "68th Annual Meeting of the Society for Quantitative Linguistics" hosted by the National Institute for Japanese Language and Linguistics has also been released. It will be held in person, so visit the official website for more information if you are interested.
https://sites.google.com/view/mathling2024/%E3%83%9B%E3%83%BC%E3%83%A0
Day 1: August 28 (Wednesday)#
09:30〜10:45#
o01: [[The Occurrence of "Inclusion of Sentences" in Conversation Data]]
What is "Inclusion of Sentences"? Expressions such as "Hurry up aura," "I'm trying my best appeal," and "Let's start the Pokémon card game campaign" contain an element equivalent to a sentence inside a word. This is a distinctive linguistic phenomenon that deviates from the general word-formation rule that larger units cannot occur inside a word (this presentation refers to it as "Inclusion of Sentences").
I collected a large number of example sentences from anime subtitles while researching [[non-dictionary]], many of which do not conform to standard Japanese grammar and are somewhat similar to the "Inclusion of Sentences" discussed in this presentation. I would like to see how academia views these less standard examples.
10:55〜12:10#
o04s: [[Verification of the Effectiveness of Large-scale Language Models for Meaning Classification of Katakana Words]]
This paper reports on the methods and results of classifying the meanings of katakana words in context using LLMs.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o04s-paper.pdf
Meaning classification? I'm curious how it is done. I once designed a prompt along similar lines:
# Role: Dictionary Query Assistant
## Profile
- Author: NoHeartPen
- Version: 0.1
- Description: The dictionary query assistant searches for the closest meaning item to the context from the complete explanations provided by authoritative dictionaries.
## Rules
1. Respect the original text; do not translate the complete explanations provided by the dictionary, and do not modify the complete explanations provided by the dictionary.
2. When the context contains usages not included in the dictionary, return "The dictionary has not recorded this usage." At other times, no additional explanation is needed; just return the dictionary explanation.
## Workflow
1. Ask the user to provide context and the word to be queried in the format "Context: [], Word to query: [], Complete dictionary explanation: []".
2. Analyze the closest explanation item in the complete dictionary explanation provided by the user to the context, based on the context, the word to be queried, and the complete dictionary explanation.
3. Only return the relevant explanation of the closest meaning item to the context; do not return other explanations unrelated to the context.
4. No need to translate the dictionary explanation, and no additional explanation is required.
## Initialization
As the role <Role>, strictly adhere to <Rules>, and warmly welcome the user. Then introduce yourself and inform the user of <Workflow>.
## Example
Context: [全部さらけ出して], Word to query: [さらけ出して], Complete dictionary explanation: [さらけ‐だ・す【×曝け出す】
[動サ五(四)]
① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
② 追い出す。
「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉]
Your answer: ① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
(Note: This prompt performs poorly on GPT-3.5 and many Chinese-developed AI models, but works well on GPT-4o mini, making it quick to find the closest sense in authoritative dictionaries like "Daijisen." With a slightly modified example, Chinese-developed AI models also work well for looking up English words in the "Oxford Advanced Learner's English-Chinese Dictionary.")
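A minimal sketch of how a lookup could be assembled in the "Context: [], Word to query: [], Complete dictionary explanation: []" format that the Workflow asks for. The function names `build_query` and `is_unrecorded` are hypothetical helpers for illustration; no particular LLM API is assumed, and sending the message to a model is left out on purpose.

```python
# The fallback reply defined in rule 2 of the prompt.
NOT_RECORDED = "The dictionary has not recorded this usage."


def build_query(context: str, word: str, dictionary_entry: str) -> str:
    """Format one lookup in the shape the prompt's Workflow step 1 requests."""
    return (
        f"Context: [{context}], "
        f"Word to query: [{word}], "
        f"Complete dictionary explanation: [{dictionary_entry}]"
    )


def is_unrecorded(reply: str) -> bool:
    """Return True when the model reports that no sense in the entry fits."""
    return reply.strip() == NOT_RECORDED


if __name__ == "__main__":
    # The dictionary entry from the example above, passed through unchanged.
    entry = (
        "さらけ‐だ・す【×曝け出す】\n"
        "[動サ五(四)]\n"
        "① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」\n"
        "② 追い出す。\n"
        "「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉"
    )
    print(build_query("全部さらけ出して", "さらけ出して", entry))
```

Keeping the fallback sentence as a constant makes it easy to detect the "not recorded" case in the reply without parsing free text.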
o06s: Analysis of Structural Patterns of Noun Clauses Containing Sino-Japanese Verbal Nouns - Based on BCCWJ Data
When Sino-Japanese verbal nouns are used in noun clauses, there are at least three structural patterns: verb type ("Sino-Japanese word + suru/shita"), noun type ("Sino-Japanese word + no"), and adjective type ("Sino-Japanese word + teki/teki na/na"). The results confirmed that (1) the verb-type pattern is by far the most typical, (2) there are constraints on the noun-type pattern, and (3) the adjective-type pattern is exceptional. It was also revealed that factors such as the part of speech of the Sino-Japanese verbal noun, the usage environment, the semantic category, and the era influence which pattern is selected.
The author's article was among the few recommended by my advisor when I was writing my thesis. I didn't expect to encounter it this time; the direction and conclusions are quite interesting.
14:10〜15:50#
o07s: Construction of the "Chinese Video Audio Corpus" - Aiming for Accurate Transcription through Multiple Modalities
I originally planned to write something similar to [[Conan Bilingual Corpus]], but I really didn't have time to work on it before finishing [[Easy to Check]]. I want to see what technology stack they used and what their needs are.
Chinese videos uploaded to video-sharing sites typically have subtitles embedded as image data within the video frames. To enable the collection of a wider range of texts when creating a Chinese corpus, it is necessary to use text recognition or speech recognition methods on the videos. In this study, we will implement an application that allows simultaneous display and search of text obtained from multiple resources, such as OCR for embedded subtitles, speech recognition for audio, and subtitles prepared by video creators. We will also attempt to collect several genres and conduct language analysis.
16:15〜17:15#
i1_A3s (Room A): An Attempt at Readable Accent Notation for a Japanese-Slovene Dictionary for Japanese Learners
I didn't expect scholars to share their experience of building a Japanese-Slovene dictionary, and the talk covers how they processed UniDic data, so it is a must-see! (I also hadn't noticed before that UniDic contains accent information.)
i1_B3s: An Attempt to Extract Onomatopoeia Candidate Words through Pattern Matching - Using an Onomatopoeia Morphological Transformation Program
It has been revealed that there are 61 types of morphological patterns of onomatopoeia appearing in modern Japanese written and spoken language, with about 2200 concrete forms.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_B3s-paper.pdf
Research related to input methods...? Building my [[non-dictionary]] data and typing text are actually very similar processes. I had only vaguely noticed that Japanese speakers use hiragana very flexibly, and I didn't expect that onomatopoeia could be divided into 61 patterns.
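To get a feel for what extracting candidates by pattern matching might look like, here is a small sketch. The three patterns (`ABAB`, `ABっ(と)`, `AっBり`) are common textbook shapes chosen by me for illustration; they are not the 61 patterns from the paper, and the function name `candidate_patterns` is hypothetical. Each kana character is treated as one unit, so words containing small kana such as きゃ are not handled properly.

```python
import re

# One hiragana or katakana character (small kana are not treated specially).
KANA = "[ぁ-ゖァ-ヶ]"

# Illustrative shapes only; the paper's actual pattern inventory is not reproduced here.
PATTERNS = {
    "ABAB": re.compile(rf"({KANA}{{2}})\1"),              # きらきら, ドキドキ
    "ABっ(と)": re.compile(rf"{KANA}{{2}}[っッ]と?"),      # ぴかっと
    "AっBり": re.compile(rf"{KANA}[っッ]{KANA}[りリ]"),    # ゆっくり
}


def candidate_patterns(word: str) -> list[str]:
    """Return the names of all patterns that the whole word matches."""
    return [name for name, rx in PATTERNS.items() if rx.fullmatch(word)]


if __name__ == "__main__":
    for w in ["きらきら", "ぴかっと", "ゆっくり", "たべる", "いろいろ"]:
        print(w, candidate_patterns(w))
```

Shape alone only yields candidates: いろいろ matches `ABAB` even though it is not onomatopoeia, so a filtering step would still be needed afterwards.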
i1_C2: Characteristics of English Vocabulary Not Adopted as Loanwords in Japanese
This presentation focuses on English words that have not been adopted into Japanese as loanwords and clarifies some of their characteristics. It is well known that there are many English loanwords in modern Japanese. However, not all English words have become loanwords in Japanese; for example, frequently used articles like "a," adverbs like "as," and pronouns like "he" are not loanwords in Japanese (they are not listed in Japanese-language dictionaries). ... Among the top 100 words, 49 were listed in "Digital Daijisen," while 51 were not, roughly half and half. By part of speech, all 8 nouns were listed, while 5 out of 6 auxiliary verbs and 9 out of 12 pronouns were not.
I previously answered a question on Zhihu [[Zhihu Answer: What are the Japanese words derived from English?]] https://www.zhihu.com/question/544356324/answer/2609385955. I originally planned to take it easy for my thesis: analyze the intersection of Japanese loanwords and vocabulary from exams like the Chinese CET-4, CET-6, IELTS, and TOEFL, but in the end, I couldn't resist choosing the direction of [[non-dictionary]] morphological analysis (it's a pity I only wrote half of it 2333).
Day 2: August 29 (Thursday)#
09:20〜10:40#
i2_A1: Interim Report on the Construction of the "Japanese Game Corpus (JGC)" - Quantitative Features Observed in Early Action Games
A game corpus?! A must-see! Also, the selected games are all console games from Japanese manufacturers, both new and old (unfortunately, there is no Genshin Impact, what a pity).
i2_A2: (Tentative) An Attempt at Japanese Research Using "National Diet Library Digitalized Materials Full Text Data"
I'm curious how academia searches for what they want using publicly available databases.
i2_A3: Examination of the "Classification Vocabulary Table" as a Polysemous Code - Using the Most Important Verbs from the "Basic Dictionary of Japanese for Computers IPAL"
Several presentations at this seminar have utilized this "Classification Vocabulary Table," and I'm curious about the issues considered during the numbering process.
i2_B3: Design, Implementation, and Operation of a Japanese Morphological Analysis System for Popup Dictionaries
It is said that displaying a dictionary entry when the mouse hovers over a word can improve reading efficiency. To realize this function, however, the string under the mouse pointer must be converted into its dictionary form. Using a morphological analysis system such as MeCab is one solution, but such systems often place performance demands on the user's computer, so they are usually run on servers. The morphological analysis needed here differs from that for language research, machine translation, or full-text search, because its main purpose is to convert the input string into dictionary form. It is therefore possible to reduce the size of the morphological analysis system and implement it more efficiently. This paper discusses the design, implementation, and operation of NonJishoKei, a morphological analysis system specialized for dictionary lookup and aimed at popup dictionaries.
It has been proven that automatically displaying dictionary explanations when the mouse hovers over a word can effectively improve reading efficiency. However, to achieve this function, one problem needs to be solved: converting the text near the mouse pointer into dictionary form. Using a morphological analyzer like MeCab is one solution, but such systems often place high demands on the user's device, so they are usually run on servers. Unlike language research, machine translation, or full-text search, the main goal in this scenario is only to convert the text near the mouse pointer into dictionary form. This means that a streamlined morphological analyzer can be designed specifically for this use case. The Japanese Non-Dictionary Form Dictionary (NonJishoKei) is based on this idea and is a morphological analyzer designed specifically for popup dictionary lookup. This paper will discuss its algorithmic principles and engineering implementation.
My own presentation (the truth is revealed 2333), the translation is quite different from the original as I rewrote it after submitting the original text (囧).
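As a rough illustration of the general idea of "convert the string under the pointer into dictionary form without a full morphological analyzer", here is a toy sketch based on suffix rewriting plus a headword check. To be clear, this is not the algorithm NonJishoKei actually uses: the rule table `RULES`, the functions `candidates` and `lookup`, and the tiny headword set are all made up for this example.

```python
# Hypothetical suffix rules: inflected ending -> possible dictionary-form endings.
# A real system would need far more rules and exception handling.
RULES = [
    ("して", ["す", "する"]),
    ("って", ["う", "つ", "る"]),
    ("いて", ["く"]),
    ("んで", ["む", "ぶ", "ぬ"]),
    ("て", ["る"]),
    ("た", ["る"]),
]


def candidates(surface: str) -> list[str]:
    """Generate possible dictionary forms, starting with the surface string itself."""
    result = [surface]
    for suffix, endings in RULES:
        if surface.endswith(suffix):
            stem = surface[: -len(suffix)]
            result.extend(stem + ending for ending in endings)
    return result


def lookup(surface: str, headwords: set[str]) -> list[str]:
    """Keep only the candidates that actually exist as dictionary headwords."""
    return [c for c in candidates(surface) if c in headwords]


if __name__ == "__main__":
    headwords = {"さらけ出す", "読む", "食べる"}  # stand-in for a real dictionary index
    for text in ["さらけ出して", "読んで", "食べて"]:
        print(text, "->", lookup(text, headwords))
```

Over-generating candidates and letting the dictionary's headword list filter them keeps the rule table small, which is the kind of trade-off that makes sense when the only goal is dictionary lookup.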
i2_C2: TEachOtherS: A Composition Education Support System as a Learner Corpus Construction Mechanism
(a) Provides learners with a web-based environment for composition, comments, and reflection, (b) Allows teachers to manage accounts for the entire class and control activity phases such as composition, comments, and reflection, applicable to all students at once. In addition, it is assumed that learners will revise their compositions based on comments received from others, and it has a version management function for compositions. The results of composition education activities can also be output in HTML format.
I am very interested in the implementation details of this system.
i2_C4: (Tentative) Trends in Writing Errors in Handwritten Kanji by High School Students
In the first year, about 70% of students' compositions showed kanji writing errors, but as the grade increased, the errors decreased, reaching about 50% in the third year. Among the kanji used in more than 20 compositions, the kanji with the highest error rate was "達," with about 40% of compositions containing errors in the character's form.
Both the question being asked and the conclusions are intriguing.
10:50〜12:05#
o12: (Tentative) Characteristics of Anime and Game Vocabulary from the Perspective of Misanalysis - Towards the Creation of a Vocabulary List
Anime and games are resources for Japanese learners, but the vocabulary used differs from that learned in the classroom. However, there are no vocabulary lists showing vocabulary and its frequency by genre that are easy for both learners and teachers to use. Therefore, we decided to create a vocabulary list as a linguistic resource for Japanese education. Scripts from anime and games tend to be misanalyzed when subjected to morphological analysis. Aiming to provide accurate data, we first checked where and to what extent misanalysis occurs by running morphological analysis on four anime works and one game. As a result, the misanalysis rate was found to be about 10%, and most of the misanalyzed items reflected characteristics of anime and game vocabulary, including proper nouns, interjections, colloquial language, and hesitations. This presentation will organize the morphological analysis procedures carried out towards the creation of the vocabulary list and explore analysis methods that retain the characteristics of anime and games as much as possible.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o12-paper.pdf
I am very interested in the direction and the issues pointed out regarding "misanalysis," and also, the anime studied includes "Oshi no Ko" and "The Quintessential Quintuplets" (laughs).
o13: Overview of the Monitor Public Version of the "Children's Daily Conversation Corpus"
A children's dialogue corpus? Looking forward to it!
13:00〜14:00#
Deepening Linguistics through Dialogue with Generative AI
Presenter: Taiki Sano (Google LLC)
Wow, Google is impressive!
14:25〜15:25#
i3_A1: The Relationship Between Rising and Falling Intonation and Conversational Form - Using the "Japanese Daily Conversation Corpus"
Presenter: Li Haiqi (Zhejiang University, Japanese Department)
There are differing opinions regarding the usage scenarios of rising and falling intonation, which is a sentence-final intonation. According to a summary based on introspection and materials, rising and falling intonation tends to be used in somewhat formal settings. However, based on impression evaluations and usage rate statistics from data of monologues, rising and falling intonation is frequently used in casual speech.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A1-paper.pdf
The conclusion is very interesting.
i3_A2: (Tentative) Differences in Speech Rate by Conversational Context
This presentation will report on the results of investigating how speech rate can vary depending on the conversational context and interlocutor.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A2-paper.pdf
The title caught my interest.
i3_A3: Pronunciation of /ei/ Vowel Sequences in Japanese
Presenter: Katarina Hitomi Gerl (University of Ljubljana, Faculty of Arts, Japanese Studies)
According to various dictionaries, the /ei/ vowel sequence in Japanese is pronounced as a long "e" when it is not between meaningful breaks.
The issues of interest are very intriguing.
i3_B3: Construction of a Slovene-Japanese Learning Dictionary Based on Dictionary Reversal and Open Data
Presenter: Kristina Hmeljak Sangawa (University of Ljubljana), Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Jan Hrastnik, Nina Kališnik, Sara Kleč, Eva Kovač, Nina Sangawa Hmeljak, Jure Tomše, and Tomaž Erjavec
Japanese language learning is popular in Slovenia, but reference books are still scarce. Therefore, we attempted to reverse the data of a previously edited Japanese-Slovene dictionary and utilize open data to construct a Slovene-Japanese learning dictionary. First, we extracted equivalent words from the Japanese-Slovene dictionary by meaning, rearranged them with Slovene as the headword, then manually removed duplicates and inappropriate headwords, and automatically assigned part of speech and CEFR-compliant difficulty levels to the headwords, with some including example sentences. Using collaborative editing software Lexonomy, we manually assigned meaning hints and positional labels to polysemous headwords, and some headwords also included example sentences from parallel corpora. The approximately 8500 words of dictionary data constructed in this way were made publicly available as TEI Lex0 compliant XML data. Learners who participated in the project reported that they gained knowledge about the dictionary's structure, and they plan to continue editing in the same manner in the future.
The introduction is very appealing to me; I look forward to the presentation at that time.
i3_C2: Personal Emergencies: Analysis of "Wait" on X (Twitter)
This analysis focuses on the usage and characteristics of the imperative "wait" written without accompanying other elements representing the subject or object in the same sentence, recorded as the sender's (writer's) own words. Observations of examples posted in the last 60 minutes revealed that such "wait" is used more frequently than similar expressions like "look" and "listen," and is often used in "tweets" (posts) that do not have a specific recipient. Furthermore, since such "wait" often co-occurs with the sender's (writer's) emotions or evaluations, it is thought to express "an event that shakes emotions or evaluations, and it is an emergency situation that literally requires the sender (writer) to wait." Additionally, comparisons were made with examples from Yahoo! Blogs and LINE chats, suggesting that such "wait" is particularly likely to be used on X (Twitter).
The analytical subject is very interesting.
15:35〜16:50#
o15: A Corpus-Based Cognitive Semantic Analysis of the Polysemy of the Japanese Temperature Adjective Tsumetai
Presenters: Wang Haitao (Kyoto University), Huang Haihong (Kyoto University), Zhong Yong (Nanjing University of Aeronautics and Astronautics)
A paper about Japanese, written in English by Chinese authors...? I'm curious what language will be used for the presentation 2333.
o16: The Use of Sentence-Final Forms in Distinguishing Characters' Dialogue in Novels
This paper attempts to collect, organize, and analyze the sentence-final forms from the dialogues of 24 characters appearing in 10 works of entertainment novels and light novels.
From the title, I thought it would analyze some classic Japanese literature, but the introduction says it is "analyzing the language styles of different characters in 10 light novels," which instantly piqued my interest. Upon opening the paper, I found that one of the analyzed works is "Rascal Does Not Dream of Bunny Girl Senpai"! Moreover, there are also newer works like "Frieren: Beyond Journey's End" ... so can I expect someone to analyze "MyGO" at next year's seminar? (just kidding)