Commit fd8c064

fix: Prevent long hiragana verb renyokei from absorbing onomatopoeia+する
Add a penalty for pure-hiragana verb renyokei of 5+ characters that are not in the dictionary, mirroring the existing kPenaltyVeryLongHiraganaVerb logic. Without this penalty, onomatopoeia followed by a する conjugation (e.g., つるつるしている) could be parsed with a single spurious renyokei token (つるつるし) instead of the correct split つるつる(ADV) + し(する).

The 15-byte threshold (5 hiragana chars * 3 bytes each in UTF-8) applies only to VerbRenyokei, to avoid penalizing legitimate long base forms like づけられる derived from dictionary verbs.

- Add regression test for 肌がつるつるしている in onomatopoeia.json
1 parent 9c2cc23 commit fd8c064

2 files changed

Lines changed: 42 additions & 0 deletions

src/analysis/scorer.cpp

Lines changed: 10 additions & 0 deletions
@@ -467,6 +467,16 @@ float Scorer::wordCost(const core::LatticeEdge& edge) const {
     cost += sc::kPenaltyVeryLongHiraganaVerb;
   }

+  // Penalty for 5+ char pure-hiragana verb renyokei not in dictionary
+  // E.g., "つるつるし" as godan-sa renyokei; should be つるつる(ADV) + し(する)
+  // Only renyokei: base forms like "づけられる" (from づける) are legitimate
+  if (!edge.fromDictionary() && edge.pos == core::PartOfSpeech::Verb &&
+      edge.extended_pos == core::ExtendedPOS::VerbRenyokei &&
+      grammar::isPureHiragana(edge.surface) &&
+      edge.surface.size() >= 15) {  // 5+ hiragana chars (5*3=15 bytes)
+    cost += sc::kPenaltyVeryLongHiraganaVerb;
+  }
+
   // Penalty for kanji+hiragana verb renyokei ending in いし pattern
   // E.g., "願いし" as renyokei of "願いす" is spurious
   // Should be 願い + し (願う renyokei + する renyokei)
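
The byte-length reasoning behind the new condition can be checked in isolation. The following is a minimal sketch, not the project's code: `isPureHiraganaSketch`, `Form`, and `longHiraganaRenyokeiPenaltyApplies` are hypothetical stand-ins for `grammar::isPureHiragana`, `core::ExtendedPOS`, and the `if` condition in the hunk above. It assumes UTF-8 source and surfaces, and assumes "pure hiragana" means code points U+3041..U+309F; the real helper may define it differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical stand-in for grammar::isPureHiragana (assumption, not the real
// helper): true when `s` is non-empty UTF-8 text whose code points all fall in
// the Hiragana block U+3041..U+309F. Every such code point encodes to exactly
// 3 bytes with lead byte 0xE3.
bool isPureHiraganaSketch(std::string_view s) {
  if (s.empty() || s.size() % 3 != 0) return false;
  for (std::size_t i = 0; i < s.size(); i += 3) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    const unsigned char b2 = static_cast<unsigned char>(s[i + 2]);
    if (b0 != 0xE3 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) return false;
    const unsigned cp =
        ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
    if (cp < 0x3041 || cp > 0x309F) return false;
  }
  return true;
}

// Hypothetical simplification of the verb's extended POS.
enum class Form { Base, Renyokei };

// 5 hiragana characters * 3 UTF-8 bytes each.
constexpr std::size_t kMinSurfaceBytes = 15;

// Mirrors the shape of the condition in the diff for a verb edge: not from the
// dictionary, renyokei form, pure hiragana, and at least 15 bytes long.
bool longHiraganaRenyokeiPenaltyApplies(std::string_view surface, Form form,
                                        bool from_dictionary) {
  return !from_dictionary && form == Form::Renyokei &&
         isPureHiraganaSketch(surface) &&
         surface.size() >= kMinSurfaceBytes;
}

// Exercises the two examples named in the commit message.
void checkCommitExamples() {
  // Spurious renyokei: 5 hiragana, 15 bytes, so the penalty applies.
  assert(longHiraganaRenyokeiPenaltyApplies("つるつるし", Form::Renyokei, false));
  // Same length, but a base form, so it is left alone.
  assert(!longHiraganaRenyokeiPenaltyApplies("づけられる", Form::Base, false));
}
```

Because `std::string_view::size()` counts bytes rather than characters, comparing against 15 is equivalent to "5 or more hiragana" only once the surface is known to be pure hiragana, which is why the hiragana check comes before the length check in the condition.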

tests/data/tokenization/onomatopoeia.json

Lines changed: 32 additions & 0 deletions
@@ -959,6 +959,38 @@
         "cat_speech",
         "W-007"
       ]
+    },
+    {
+      "description": "onomatopoeia+する should split as つるつる+し, not つるつるし(verb)",
+      "expected": [
+        {
+          "pos": "Noun",
+          "surface": "肌"
+        },
+        {
+          "pos": "Particle",
+          "surface": "が"
+        },
+        {
+          "pos": "Adverb",
+          "surface": "つるつる"
+        },
+        {
+          "lemma": "する",
+          "pos": "Verb",
+          "surface": "し"
+        },
+        {
+          "pos": "Particle",
+          "surface": "て"
+        },
+        {
+          "pos": "Auxiliary",
+          "surface": "いる"
+        }
+      ],
+      "id": "gatsurutsurushiteiru",
+      "input": "肌がつるつるしている"
     }
   ],
   "description": "Onomatopoeia and mimetic words",
