Commit fd8c064
fix: Prevent long hiragana verb renyokei from absorbing onomatopoeia+する
Add penalty for 5+ character pure-hiragana verb renyokei not in
dictionary, mirroring the existing kPenaltyVeryLongHiraganaVerb logic.
Without this penalty, onomatopoeia followed by する conjugations (e.g.,
つるつるしている) could be parsed as a single spurious renyokei token
(つるつるし) instead of the correct split つるつる(ADV) + し(する).
The 15-byte threshold (5 hiragana chars * 3 bytes each) only applies to
VerbRenyokei to avoid penalizing legitimate long renyokei forms like
づけられる derived from dictionary verbs.
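
The rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `CandidateKind`, `isPureHiragana`, `longHiraganaRenyokeiPenalty`, and the penalty value `3.0f` are all hypothetical. Only the 15-byte threshold, the restriction to renyokei candidates, and the "not in dictionary" condition come from the commit message.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical candidate kinds; the real analyzer's types are not shown here.
enum class CandidateKind { VerbRenyokei, VerbOther };

// 5 hiragana characters * 3 UTF-8 bytes each (from the commit message).
constexpr std::size_t kLongHiraganaRenyokeiBytes = 15;
// Assumed value for illustration only; the actual penalty is not stated.
constexpr float kPenaltyLongHiraganaRenyokei = 3.0f;

// True if s is non-empty UTF-8 made up only of code points in the
// hiragana block U+3041..U+309F (each encoded as exactly 3 bytes).
bool isPureHiragana(std::string_view s) {
  if (s.empty() || s.size() % 3 != 0) return false;
  for (std::size_t i = 0; i < s.size(); i += 3) {
    const auto b0 = static_cast<unsigned char>(s[i]);
    const auto b1 = static_cast<unsigned char>(s[i + 1]);
    const auto b2 = static_cast<unsigned char>(s[i + 2]);
    if (b0 != 0xE3) return false;
    // U+3041..U+307F encode as E3 81 81..BF; U+3080..U+309F as E3 82 80..9F.
    const bool low = (b1 == 0x81 && b2 >= 0x81 && b2 <= 0xBF);
    const bool high = (b1 == 0x82 && b2 >= 0x80 && b2 <= 0x9F);
    if (!low && !high) return false;
  }
  return true;
}

// Penalty for a long, pure-hiragana renyokei candidate that is not backed
// by a dictionary verb; every other candidate is left unpenalized.
float longHiraganaRenyokeiPenalty(std::string_view surface,
                                  CandidateKind kind,
                                  bool in_dictionary) {
  if (kind != CandidateKind::VerbRenyokei) return 0.0f;
  if (in_dictionary) return 0.0f;
  if (surface.size() < kLongHiraganaRenyokeiBytes) return 0.0f;
  return isPureHiragana(surface) ? kPenaltyLongHiraganaRenyokei : 0.0f;
}
```

Under these assumptions, the spurious candidate つるつるし (15 bytes, pure hiragana, not in the dictionary) is penalized, while the 12-byte つるつる and any dictionary-backed renyokei are not, which lets the split つるつる + し win.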
- Add regression test for 肌がつるつるしている in onomatopoeia.json

1 parent 9c2cc23 commit fd8c064
2 files changed
Lines changed: 42 additions & 0 deletions
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 467-469 | 467-469 | unchanged context |
| | 470-479 | + (10 added lines) |
| 470-472 | 480-482 | unchanged context |

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 959-961 | 959-961 | unchanged context |
| | 962-993 | + (32 added lines) |
| 962-964 | 994-996 | unchanged context |