Changed patterns to correctly capture UTF8 superscripts #134
Open
larsbarring wants to merge 4 commits into Unidata:main from
This was referenced Oct 28, 2025
d626d3c to
8e93650
1. Explicit UTF-8 superscript number, not range (line 114)
2. Update letter pattern to exclude superscript characters,
UTF-8 patterns excluding superscripts (line 126-128)
The new utf8_2bytes_no_super pattern excludes:
\xc2\xb1 (±, plus-minus, not relevant but close)
\xc2\xb2 (² superscript 2)
\xc2\xb3 (³ superscript 3)
\xc2\xb6 (¶ pilcrow)
\xc2\xb7 (· middle dot)
\xc2\xb8 (¸ cedilla)
\xc2\xb9 (¹ superscript 1)
The new utf8_3bytes_no_super pattern excludes:
\xe2\x81\xb0 through \xe2\x81\xb9 (superscript digits ⁰ and ⁴-⁹; ¹, ² and ³ are the two-byte sequences above)
\xe2\x81\xba and \xe2\x81\xbb (superscript + and -)
This ensures that UTF-8 superscript characters are not captured
by the {id} rule and can instead be properly matched by the
{utf8_exponent} rule.
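The byte sequences listed above can be checked against their Unicode code points with a small encoder. This is a minimal sketch, not code from the PR: the helper names `utf8_encode` and `print_utf8_escapes` are hypothetical, and only code points below U+10000 are handled, which covers every character named in the lists.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Encode a code point below U+10000 as UTF-8.
 * Returns the number of bytes written (1-3), or 0 if cp is out of range.
 * Hypothetical helper for checking the lists; not part of scanner.l. */
static size_t utf8_encode(unsigned cp, unsigned char out[3])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    return 0;
}

/* Print a code point in the \xNN notation used in the lists. */
static void print_utf8_escapes(unsigned cp)
{
    unsigned char bytes[3];
    size_t n = utf8_encode(cp, bytes);

    printf("U+%04X ", cp);
    for (size_t i = 0; i < n; i++)
        printf("\\x%02x", bytes[i]);
    printf("\n");
}
```

For example, `print_utf8_escapes(0x207B)` prints `U+207B \xe2\x81\xbb`, the superscript minus.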
8e93650 to
36595e9
Pull request overview
This PR updates the Flex lexer patterns in lib/scanner.l to ensure UTF-8 superscript exponent characters (notably U+207B superscript minus) are not tokenized as identifier “letters”, so they can be recognized by the {utf8_exponent} rule during ut_parse().
Changes:
- Replaced the UTF-8 exponent digit definition with an explicit list of supported superscript digit byte sequences.
- Introduced utf8_2bytes_no_super / utf8_3bytes_no_super and updated {letter} to exclude superscript-related characters so exponent lexing can win.
- (Recommended) Add a regression test to confirm ut_parse(..., UT_UTF8) accepts specs containing superscript minus.
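Why the exclusion lets exponent lexing win can be illustrated outside Flex. Flex picks the rule with the longest match, so if superscript bytes count as letters, an identifier rule can swallow them before an exponent rule is tried. The sketch below imitates that greedy scan in plain C. Its letter definition is a simplified assumption (ASCII letters, underscore, and any well-formed 2- or 3-byte sequence), not the actual patterns in lib/scanner.l, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Length in bytes of a UTF-8 superscript sign or digit at s, or 0 if none. */
static size_t superscript_len(const unsigned char *s)
{
    /* U+00B2, U+00B3, U+00B9: superscript 2, 3 and 1. */
    if (s[0] == 0xC2 && (s[1] == 0xB2 || s[1] == 0xB3 || s[1] == 0xB9))
        return 2;
    /* U+2070 and U+2074..U+2079: superscript 0 and 4..9;
     * U+207A and U+207B: superscript plus and minus. */
    if (s[0] == 0xE2 && s[1] == 0x81 &&
        (s[2] == 0xB0 || (s[2] >= 0xB4 && s[2] <= 0xBB)))
        return 3;
    return 0;
}

/* Length of one "letter" at s under the toy definition described above.
 * With exclude_super set, superscript characters are not letters. */
static size_t letter_len(const unsigned char *s, bool exclude_super)
{
    if ((s[0] >= 'a' && s[0] <= 'z') || (s[0] >= 'A' && s[0] <= 'Z') ||
        s[0] == '_')
        return 1;
    if (exclude_super && superscript_len(s) != 0)
        return 0;
    if (s[0] >= 0xC2 && s[0] <= 0xDF && (s[1] & 0xC0) == 0x80)
        return 2;
    if (s[0] >= 0xE0 && s[0] <= 0xEF &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
        return 3;
    return 0;
}

/* Bytes consumed by a greedy run of letters at the start of text,
 * mimicking the longest match an identifier rule would take. */
static size_t id_len(const char *text, bool exclude_super)
{
    const unsigned char *s = (const unsigned char *)text;
    size_t n = 0, k;

    assert(text != NULL);
    while ((k = letter_len(s + n, exclude_super)) != 0)
        n += k;
    return n;
}
```

With the input `"m\xe2\x81\xbb\xc2\xb9"` (m⁻¹), `id_len(text, false)` consumes all 6 bytes, leaving nothing for an exponent rule, while `id_len(text, true)` stops after the single byte `m`.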
WardF
previously approved these changes
Mar 12, 2026
WardF
approved these changes
Apr 1, 2026
Closes issue #128.
Explicit UTF-8 superscript number, not range (line 114)
Update the letter pattern to exclude superscript characters, UTF-8 patterns excluding superscripts (line 126-128)
The utf8_2bytes_no_super and utf8_3bytes_no_super letter patterns exclude the superscript-related byte sequences listed above.
This ensures that UTF-8 superscript characters are not captured as letters by the {id} rule and can instead be properly matched by the {utf8_exponent} rule.