The online seminar on Japanese corpus linguistics "Language Resources Workshop 2024" is here!
Language Resources Workshop 2024
An online seminar on corpora and computational linguistics hosted by the National Institute for Japanese Language and Linguistics (NINJAL). Attendance is free, but please fill out the registration form on the official website beforehand: https://clrd.ninjal.ac.jp/lrw2024.html
Below I list some of the presentations I am interested in; the complete programme is on the official website: https://clrd.ninjal.ac.jp/lrw2024-programme.html
Additionally, the schedule for the academic conference "68th Annual Meeting of the Society for Quantitative Linguistics," hosted by the National Institute for Japanese Language and Linguistics, has also been released. It will be held in person, so visit the official website for more information if you are interested:
https://sites.google.com/view/mathling2024/%E3%83%9B%E3%83%BC%E3%83%A0
Day 1: August 28 (Wednesday)
09:30〜10:45
o01: [[The Occurrence of "Inclusion of Sentences" in Conversational Data]]
What is 【Inclusion of Sentences】: expressions such as "Hurry up aura," "I'm trying hard appeal," and "Let's start the Pokémon card game campaign" contain sentence-like elements inside a word. This violates the general rule of word formation that units larger than a word cannot occur within a word (the presentation calls this phenomenon "Inclusion of Sentences").
While researching [[Non-dictionary]], I collected a large number of example sentences from anime subtitles. Many of them do not conform to standard Japanese grammar and closely resemble the "Inclusion of Sentences" discussed in this presentation. I want to see how academia treats these less standard examples.
10:55〜12:10
o04s: [[Verification of the Effectiveness of Large-scale Language Models for Meaning Classification of Katakana Words]]
This paper reports on the methods and results of classifying the meanings of Katakana words in context using LLMs.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o04s-paper.pdf
Meaning classification? I'm curious how it's done; I once designed a prompt along similar lines:
# Role: Dictionary Query Assistant
## Profile
- Author: NoHeartPen
- Version: 0.1
- Description: The Dictionary Query Assistant searches for the meaning closest to the context from the complete explanations provided by authoritative dictionaries.
## Rules
1. Respect the original text; do not translate the complete explanations provided by the dictionary, and do not modify them.
2. When a usage not included in the dictionary appears in the context, return "The dictionary has not recorded this usage." At other times, no additional explanation is needed; just return the dictionary explanation.
## Workflow
1. Ask the user to provide context in the format "Context: [], Word to query: [], Complete dictionary explanation: []".
2. Analyze the closest explanation in the user's provided complete dictionary explanation to the given context and word.
3. Only return the relevant explanation closest to the context; do not return unrelated explanations.
4. No need to translate the dictionary explanation or provide any additional notes.
## Initialization
As the role <Role>, strictly adhere to <Rules>, and warmly welcome the user. Then introduce yourself and explain <Workflow>.
## Example
Context: [全部さらけ出して], Word to query: [さらけ出して], Complete dictionary explanation: [さらけ‐だ・す【×曝け出す】
[動サ五(四)]
① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
② 追い出す。
「おらあ女房を―・してしまって」〈滑・膝栗毛・発端〉]
Your answer: ① 隠すところなく、すべてを現す。ありのままを見せる。「内情を―・す」「弱点を―・す」
(Note: This prompt performs poorly on GPT-3.5 and many domestic (Chinese) AI models, but works well on GPT-4o mini, making it quick to find the closest sense in authoritative dictionaries like "Daijisen." With a slightly adjusted example, it also works well when using domestic AI models to look up English words in the "Oxford Advanced Learner's English-Chinese Dictionary.")
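The prompt expects each lookup in a fixed "Context / Word to query / Complete dictionary explanation" format. As a minimal sketch, the helper below assembles that message from three strings. The function name `build_query` and the shortened sample entry are my own assumptions for illustration; sending the message to a model is left out because that depends on the API you use.

```python
def build_query(context: str, word: str, explanation: str) -> str:
    """Format one lookup request in the shape the Workflow section asks for."""
    return (
        f"Context: [{context}], "
        f"Word to query: [{word}], "
        f"Complete dictionary explanation: [{explanation}]"
    )


# Shortened sample entry (assumption: abbreviated from the example in the prompt).
entry = (
    "さらけ‐だ・す【×曝け出す】\n"
    "[動サ五(四)]\n"
    "① 隠すところなく、すべてを現す。\n"
    "② 追い出す。"
)

message = build_query("全部さらけ出して", "さらけ出して", entry)
print(message)
```

The dictionary text is passed through untouched, which matches Rule 1 of the prompt (do not translate or modify the explanation).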
o06s: Analysis of Structural Patterns of Noun Clauses Containing Sino-Japanese (kango) Verbal Nouns - Based on BCCWJ Data
When Sino-Japanese verbal nouns are used within noun clauses, there are at least three structural patterns: verb type ("kango + suru/shita"), noun type ("kango + no"), and adjective type ("kango + teki/teki na/na"). The results confirmed that (1) the verb-type pattern is by far the most typical, (2) the noun-type pattern is subject to constraints, and (3) the adjective-type pattern is exceptional. Additionally, factors such as the part of speech of the verbal noun, usage environment, semantic category, and era also influence which pattern is selected.
I came across this author's work among the papers my advisor recommended while I was writing my thesis, and I didn't expect to encounter it again here; the direction and conclusions are quite interesting.
14:10〜15:50
o07s: Construction of the "Chinese Video Audio Corpus" - Aiming for Accurate Transcription through Multiple Modalities
I originally planned to write something similar to [[Conan Bilingual Corpus]], but I really didn't have time to do it before finishing [[Easy to Check]], so I want to see what technology stack they used and what their needs are.
Chinese videos uploaded to video-sharing sites typically have subtitles embedded as image data within the video frames. To enable the collection of a wider range of texts when creating a Chinese corpus, it is necessary to use text recognition or speech recognition methods on the videos. In this study, we will implement an application that can simultaneously display and search text obtained from multiple resources, such as OCR for embedded subtitles, speech recognition for audio, and subtitles prepared by video creators. We will also attempt to collect some genres and conduct language analysis.
16:15 〜 17:15
i1_A3s (Room A): An Attempt at Readable Accent Notation for a Japanese-Slovene Dictionary for Japanese Learners
I didn't expect scholars to share their experience of building a Japanese-Slovene dictionary, and since that experience involves processing UniDic, this is a must-see! (Also, I hadn't noticed that UniDic contains pitch accent information.)
i1_B3s: An Attempt to Extract Candidate Words for Onomatopoeia Using Pattern Matching - Using an Onomatopoeia Morphological Transformation Program
It has been revealed that there are 61 types of morphological patterns for onomatopoeia appearing in modern Japanese written and spoken language, with approximately 2200 concrete forms.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i1_B3s-paper.pdf
Researching input methods...? My own [[Non-dictionary]] and text input are actually very similar processes. I had only vaguely realized that Japanese speakers use hiragana very flexibly, and I didn't expect that onomatopoeia could be divided into 61 types.
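To get a feel for what extracting onomatopoeia candidates by pattern matching could look like, here is a minimal sketch with regular expressions. The two patterns below (reduplicated ABAB, and AっBり) are hypothetical examples I picked for illustration; they are not the 61 patterns described in the paper, and the names `KANA`, `PATTERNS`, and `classify` are my own.

```python
import re

# One hiragana or katakana character (including the long-vowel mark).
KANA = "[ぁ-ゖァ-ヺー]"

# Hypothetical patterns for illustration only; NOT the paper's 61 patterns.
PATTERNS = {
    "ABAB": re.compile(rf"^({KANA}{{2}})\1$"),             # e.g. きらきら
    "AっBり": re.compile(rf"^{KANA}[っッ]{KANA}[りリ]$"),  # e.g. にっこり
}


def classify(word: str) -> list[str]:
    """Return the names of all patterns that the whole word matches."""
    return [name for name, pattern in PATTERNS.items() if pattern.match(word)]


for w in ["きらきら", "キラキラ", "にっこり", "さくら"]:
    print(w, classify(w))
```

Matching on shape alone only yields candidates: an ordinary word that happens to fit a pattern would still need to be filtered out afterwards.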
i1_C2: Characteristics of English Vocabulary Not Adopted as Loanwords in Japanese
This presentation focuses on English words that have not been adopted as loanwords in Japanese and clarifies some of their characteristics. It is well known that many English loanwords exist in modern Japanese. However, not all English words have become loanwords; for example, frequently used articles like "a," adverbs like "as," and pronouns like "he" have not (they are not listed in Japanese-language dictionaries). ... Among the top 100 words, 49 were listed as headwords in "Digital Daijisen," while 51 were not, making it roughly half and half. By part of speech, all 8 nouns were headwords, while 5 out of 6 auxiliary verbs and 9 out of 12 pronouns were not listed as headwords.
I previously answered a question on Zhihu [[What are the Japanese words derived from English]] https://www.zhihu.com/question/544356324/answer/2609385955, and I originally planned to coast through my thesis: analyzing the intersection of Japanese loanwords and vocabulary from exams like the Chinese CET-4, IELTS, and TOEFL, but ultimately I couldn't resist choosing the morphological analysis direction of [[Non-dictionary]] (it's a pity that I only ended up writing half of it 2333).
Day 2: August 29 (Thursday)
9:20 〜 10:40
i2_A1: Interim Report on the Construction of the "Japanese Game Corpus (JGC)" - Quantitative Characteristics Observed in Early Action Games
A game corpus?! A must-see! Also, the selected games are all console games from Japanese manufacturers, both new and old (unfortunately, no Genshin Impact, what a shame).
i2_A2: (Tentative) An Attempt at Japanese Research Using "Full Text Data of Digitized Materials from the National Diet Library"
I'm curious about how academia searches for what they want using publicly available databases.
i2_A3: Examination of the "Classification Vocabulary List" as a Polysemous Code - Using the Most Important Verbs from the "Basic Dictionary of Japanese for Computers IPAL"
Several presentations at this seminar have utilized this "Classification Vocabulary List," and I'm curious about the issues considered during numbering.
i2_B3: Design, Implementation, and Operation of a Japanese Morphological Analysis System for Popup Dictionaries
It is said that hovering the mouse over a word to display the dictionary can enhance reading efficiency. However, to achieve this function, it is necessary to solve the problem of converting the string of characters where the mouse is hovering into dictionary form. Using morphological analysis systems like MeCab is one solution, but such systems often have specific requirements for the user's computer performance, so they are usually run on servers. However, the morphological analysis in this process differs from that for language research, machine translation, or full-text search; the main purpose is to convert the input string into dictionary form. Therefore, it is possible to reduce the size of the morphological analysis system and enable more efficient implementation. This paper discusses the design, implementation, and operation of a morphological analysis system specialized for dictionary searches, NonJishoKei.
It has been proven that automatically displaying dictionary explanations when the mouse hovers over a word can effectively improve reading efficiency. However, to achieve this function, a problem must be solved: converting the text near the mouse pointer into a form included in the dictionary. Using morphological analyzers like MeCab is one solution, but such systems often have high requirements for the user's device, so they are usually run on servers. However, unlike language research, machine translation, or full-text search, this scenario only requires converting the text near the mouse pointer into a form included in the dictionary. In other words, a streamlined morphological analyzer can be specifically designed for such use cases. The Japanese Non-dictionary Morphological Dictionary (NonJishoKei) is based on this idea, and this paper will discuss its algorithm principles and engineering implementation.
My own presentation (the cat is out of the bag, 2333). The translation differs quite a bit from the original abstract because I rewrote it after submitting (囧).
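The core task in the abstract, turning the string under the mouse pointer into a form a dictionary actually lists, can be illustrated with a tiny rule-based sketch. This is only an illustration of the general idea under my own assumptions: the suffix table `RULES`, the headword set `HEADWORDS`, and the function names are hypothetical and far smaller than anything a real system would need, and this is not a description of how NonJishoKei is actually implemented.

```python
# Hypothetical suffix rules (inflected ending -> dictionary-form ending).
# A real system needs a much larger table; this is only a toy example.
RULES = [
    ("して", "す"),  # さらけ出して -> さらけ出す
    ("いて", "く"),  # 書いて -> 書く
    ("んで", "む"),  # 読んで -> 読む
    ("って", "る"),  # 取って -> 取る
    ("た", "る"),    # 食べた -> 食べる
    ("て", "る"),    # 食べて -> 食べる
]


def candidates(surface):
    """Yield the surface form itself, then every rule-derived candidate."""
    yield surface
    for suffix, replacement in RULES:
        if surface.endswith(suffix):
            yield surface[: -len(suffix)] + replacement


def to_dictionary_form(surface, headwords):
    """Keep only the candidates that exist as dictionary headwords."""
    return [c for c in candidates(surface) if c in headwords]


# Hypothetical headword list standing in for a real dictionary index.
HEADWORDS = {"さらけ出す", "食べる", "読む"}

print(to_dictionary_form("さらけ出して", HEADWORDS))
```

Over-generating candidates and then filtering them against the headword list keeps the rules simple: a wrong guess such as さらけ出しる is discarded because no dictionary lists it.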
i2_C2: TEachOtherS: A Composition Education Support System as a Learner Corpus Construction Mechanism
(a) Provides learners with a web-based environment for composition, comments, and reflections, (b) Allows teachers to manage accounts for the entire class and control activity phases such as composition, comments, and reflections, which can be applied to the entire class at once. In addition, it is assumed that learners will revise their compositions based on comments received from others, and it has a version management function for compositions. Furthermore, the results of composition education activities can be output in HTML format.
I am very interested in the implementation details of this system.
i2_C4: (Tentative) Trends in Writing Errors in Handwritten Kanji by High School Students
In the first year, about 70% of students' compositions showed kanji writing errors, but as the grade increased, the errors decreased, and in the third year, they decreased to about 50%. Among the kanji used in more than 20 compositions, the kanji with the highest error rate was "達," with about 40% of the compositions containing errors in the character form of "達."
The findings on this question are intriguing.
10:50〜12:05
o12: (Tentative) Characteristics of Anime and Game Vocabulary from the Perspective of Misanalysis - Towards the Creation of a Vocabulary List
Anime and games are one of the resources for Japanese learners, but the vocabulary used differs from that learned in the classroom. However, there is no vocabulary list that is easy for both learners and teachers to utilize, showing the vocabulary by genre and its frequency. Therefore, we decided to create a vocabulary list as a linguistic resource that can be utilized in Japanese education. Scripts from anime and games tend to produce misanalysis when subjected to morphological analysis directly. Aiming for accurate data provision, we first conducted morphological analysis on four anime works and one game to confirm where and to what extent misanalysis occurs. As a result, it was found that about 10% of misanalysis occurs, most of which reflects the characteristics of vocabulary in anime and games, including unique nouns, interjections, colloquial speech, and hesitations. This presentation will organize the procedures of morphological analysis conducted for the creation of the vocabulary list and explore methods to analyze while retaining the characteristics of anime and games as much as possible.
https://clrd.ninjal.ac.jp/lrw/lrw2024/o12-paper.pdf
I am very interested in the direction and in the issue it raises, 【misanalysis】. Also, the anime studied include 【Oshi no Ko】 and 【The Quintessential Quintuplets】 (laughs).
o13: Overview of the "Children's Daily Conversation Corpus" Monitor Public Version
A children's dialogue corpus? Looking forward to it!
13:00〜14:00
Deepening Linguistics Through Dialogue with Generative AI
Presenter: Taiki Sano (Google LLC)
Wow, Google is impressive!
14:25〜15:25
i3_A1: The Relationship Between Rising and Falling Intonation and Conversational Form - Using the "Japanese Daily Conversation Corpus"
Presenter: Li Haiqi (Zhejiang University Japanese Department)
There are differing opinions regarding the usage of the rising and falling intonation at the end of sentences. According to a summary based on introspection and data, rising and falling intonation tends to be used more in somewhat formal situations. However, based on impression evaluation and usage rate statistics using solo speech data, rising and falling intonation is frequently used in casual speech.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A1-paper.pdf
The conclusions are very interesting.
i3_A2: (Tentative) Differences in Speech Speed by Daily Conversation Situations
This presentation will report on the results of how speech speed can vary depending on the conversation situation and conversation partner.
https://clrd.ninjal.ac.jp/lrw/lrw2024/i3_A2-paper.pdf
The title caught my interest.
i3_A3: Pronunciation of the /ei/ Vowel Sequence in Japanese
Presenter: Katarina Hitomi Gerl (University of Ljubljana, Faculty of Arts, Japanese Studies)
According to various dictionaries, the /ei/ vowel sequence in Japanese is pronounced as a long "e" unless it spans a break in meaning.
The research question is intriguing.
i3_B3: Construction of a Slovene-Japanese Learning Dictionary Based on Dictionary Reversal and Open Data
Presenter: Kristina Hmeljak Sangawa (University of Ljubljana), Laura Barovič Božjak, Nadja Bostič, Katarina Hitomi Gerl, Jan Hrastnik, Nina Kališnik, Sara Kleč, Eva Kovač, Nina Sangawa Hmeljak, Jure Tomše, and Tomaž Erjavec
Japanese language learning is popular in Slovenia, but there are still few reference books. Therefore, we attempted to reverse the data of the previously edited Japanese-Slovene dictionary and utilize open data to construct a Slovene-Japanese learning dictionary. First, we extracted equivalent words from the Japanese-Slovene dictionary for each meaning, rearranged them with Slovene as the headword, then manually removed duplicates and inappropriate headwords, and automatically assigned parts of speech and CEFR-compliant difficulty levels to the headwords, with some example sentences attached. Using collaborative editing software Lexonomy, we manually added meaning hints and positional labels to polysemous headwords, and some headwords were also accompanied by example sentences from parallel corpora. The approximately 8500 vocabulary data constructed in this way was made publicly available as TEI Lex0 compliant XML data. Learners who participated in the project responded that they gained knowledge about the structure of the dictionary, and we plan to continue editing in the same manner in the future.
The introduction is very appealing to me, and I look forward to the upcoming presentation.
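The "dictionary reversal" step in the abstract (take the Slovene equivalents listed under each Japanese sense, regroup them with Slovene as the headword, and remove duplicates) can be sketched as follows. The data structure, the four sample entries, and the function name `reverse_dictionary` are my own assumptions for illustration; the project's real data is TEI Lex0 XML, and its manual cleanup, part-of-speech, and CEFR steps are not modeled here.

```python
from collections import defaultdict

# Hypothetical sample: Japanese headword -> list of senses,
# each sense being a list of Slovene equivalents.
JA_SL = {
    "本": [["knjiga"]],
    "書籍": [["knjiga"]],
    "水": [["voda"]],
    "犬": [["pes"]],
}


def reverse_dictionary(ja_sl):
    """Regroup entries with the Slovene equivalent as headword, without duplicates."""
    reversed_entries = defaultdict(list)
    for ja_headword, senses in ja_sl.items():
        for equivalents in senses:
            for sl_word in equivalents:
                if ja_headword not in reversed_entries[sl_word]:
                    reversed_entries[sl_word].append(ja_headword)
    # Sort by Slovene headword so the result reads like a dictionary.
    return dict(sorted(reversed_entries.items()))


SL_JA = reverse_dictionary(JA_SL)
print(SL_JA)
```

Note how "knjiga" ends up with two Japanese equivalents; such merged, polysemous headwords are where the abstract says meaning hints had to be added by hand.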
i3_C2: Personal Emergencies: Analysis of "Wait" on X (Twitter)
This study focuses on the usage and characteristics of the imperative "wait" recorded as the writer's own words without accompanying other elements representing the subject or object in the same sentence on X (Twitter). Observations of examples posted within the last 60 minutes revealed that such "wait" is used more frequently than similar expressions like "look" and "listen," and is often used in tweets (posts) that do not have a specific addressee. Furthermore, since such "wait" often co-occurs with the expression of the writer's emotions or evaluations, it is considered to represent "an event that shakes emotions and evaluations, and is an emergency situation that literally requires the writer to wait." Additionally, comparisons were made with examples from Yahoo! Blogs and LINE chats, suggesting that such "wait" is particularly likely to be used on X (Twitter).
The analytical subject is very interesting.
15:35〜16:50
o15: A corpus-based cognitive semantic analysis of the polysemy of the Japanese temperature adjective tsumetai
Presenters: Wang Haitao (Kyoto University), Huang Haihong (Kyoto University), Zhong Yong (Nanjing University of Aeronautics and Astronautics)
Chinese researchers presenting a paper in English about Japanese...? I'm curious which language the presentation itself will be given in 2333.
o16: The Use of Sentence-final Forms in Distinguishing Characters' Dialogue in Novels
This paper attempts to collect, organize, and analyze the sentence-final forms from the dialogues of 24 characters appearing in 10 works of entertainment novels and light novels.
From the title I assumed this would analyze some classic Japanese literature, but the introduction says it analyzes the speech styles of different characters in 10 light novels, which instantly piqued my interest. Upon opening the paper, I found that one of the analyzed works is "Rascal Does Not Dream of Bunny Girl Senpai"! Moreover, there are also newer works like "Frieren: Beyond Journey's End" ... Can I expect someone to analyze "MyGO" at next year's seminar? (just kidding)