Changed patterns to correctly capture UTF8 superscripts #134

Open
larsbarring wants to merge 4 commits into Unidata:main from larsbarring:issue-128
@larsbarring larsbarring commented Oct 27, 2025

Closes issue #128.

  1. List the UTF-8 superscript digits explicitly instead of as a range (line 114)

  2. Update the letter pattern to exclude superscript characters by using UTF-8 patterns that exclude superscripts (lines 126-128)

  • The new utf8_2bytes_no_super letter pattern excludes:
    • \xc2\xb1 (±, plus-minus, not relevant but close)
    • \xc2\xb2 (² superscript 2)
    • \xc2\xb3 (³ superscript 3)
    • \xc2\xb6 (¶ pilcrow)
    • \xc2\xb7 (· middle dot)
    • \xc2\xb8 (¸ cedilla)
    • \xc2\xb9 (¹ superscript 1)
  • The new utf8_3bytes_no_super letter pattern excludes:
    • \xe2\x81\xb0 through \xe2\x81\xb9 (covers superscript digits ⁰ and ⁴ to ⁹)
    • \xe2\x81\xba and \xe2\x81\xbb (superscript + and -)

This ensures that UTF-8 superscript characters are not captured as letters by the {id} rule and can instead be properly matched by the {utf8_exponent} rule.
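The byte sequences listed above can be cross-checked by encoding the characters directly. The following is a minimal Python sketch for that purpose only; the names `SUPERSCRIPT_DIGITS`, `SUPERSCRIPT_SIGNS`, `utf8_escape` and `superscript_table` are hypothetical helpers and are not part of lib/scanner.l or of the PR's test scripts.

```python
# Hypothetical helper for checking the byte sequences quoted in the PR text.
# Superscript digits 0-9 followed by superscript plus and minus.
SUPERSCRIPT_DIGITS = (
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
)
SUPERSCRIPT_SIGNS = "\u207a\u207b"


def utf8_escape(ch: str) -> str:
    """Return the UTF-8 encoding of ch written in \\xNN notation."""
    return "".join(f"\\x{b:02x}" for b in ch.encode("utf-8"))


def superscript_table() -> dict:
    """Map each superscript character to its escaped UTF-8 byte sequence."""
    return {ch: utf8_escape(ch) for ch in SUPERSCRIPT_DIGITS + SUPERSCRIPT_SIGNS}


if __name__ == "__main__":
    for ch, esc in superscript_table().items():
        print(f"U+{ord(ch):04X}  {ch}  {esc}")
```

Superscript 1, 2 and 3 live in the Latin-1 Supplement block and encode to two bytes, while superscript 0, 4 to 9, plus and minus live in the U+2070 block and encode to three bytes, which is why the exclusion is split across a 2-byte and a 3-byte pattern.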

@larsbarring

A simple bash script and C code for testing are available here; they are shared by PR #134 (issue #128), PR #135 (issue #129), and PR #136 (issue #132).

1. Explicit UTF-8 superscript number, not range (line 114)

2. Update letter pattern to exclude superscript characters,
   UTF-8 patterns excluding superscripts (line 126-128)

   The new utf8_2bytes_no_super pattern excludes:
         \xc2\xb1 (±, plus-minus, not relevant but close)
         \xc2\xb2 (² superscript 2)
         \xc2\xb3 (³ superscript 3)
         \xc2\xb6 (¶ pilcrow)
         \xc2\xb7 (· middle dot)
         \xc2\xb8 (¸ cedilla)
         \xc2\xb9 (¹ superscript 1)
   The new utf8_3bytes_no_super pattern excludes:
         \xe2\x81\xb0 through \xe2\x81\xb9 (covers superscript digits ⁰ and ⁴ to ⁹)
         \xe2\x81\xba and \xe2\x81\xbb (superscript + and -)
   This ensures that UTF-8 superscript characters are not captured
   by the {id} rule and can instead be properly matched by the
   {utf8_exponent} rule.

Copilot AI left a comment

Pull request overview

This PR updates the Flex lexer patterns in lib/scanner.l to ensure UTF-8 superscript exponent characters (notably U+207B superscript minus) are not tokenized as identifier “letters”, so they can be recognized by the {utf8_exponent} rule during ut_parse().

Changes:

  • Replaced the UTF-8 exponent digit definition with an explicit list of supported superscript digit byte sequences.
  • Introduced utf8_2bytes_no_super / utf8_3bytes_no_super and updated {letter} to exclude superscript-related characters so exponent lexing can win.
  • (Recommended) Add a regression test to confirm ut_parse(..., UT_UTF8) accepts specs containing superscript minus.
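The lexing behaviour described in this overview can be illustrated outside Flex with a small longest-match tokenizer. This is a sketch under stated assumptions, not a copy of lib/scanner.l: `EXPONENT`, `LOOSE_LETTER`, `STRICT_LETTER` and `tokenize` are hypothetical names, the loose pattern is an assumed stand-in for a letter rule that accepts any 2-byte or 3-byte sequence, and the strict pattern only approximates the exclusions named in the PR description (±, ², ³, ¶, ·, ¸, ¹ and the \xe2\x81\xb0 to \xe2\x81\xbb range).

```python
import re

# Assumed exponent rule: optional superscript sign, then superscript digits.
EXPONENT = rb"(?:\xe2\x81[\xba\xbb])?(?:\xc2[\xb9\xb2\xb3]|\xe2\x81[\xb0\xb4-\xb9])+"

# Assumed permissive letter: ASCII letters plus ANY 2- or 3-byte sequence.
LOOSE_LETTER = rb"[A-Za-z_]|[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}"

# Approximation of the stricter letter: same, minus the excluded sequences.
STRICT_LETTER = (
    rb"[A-Za-z_]"
    rb"|\xc2[\x80-\xb0\xb4\xb5\xba-\xbf]|[\xc3-\xdf][\x80-\xbf]"
    rb"|\xe2\x81[\x80-\xaf\xbc-\xbf]|\xe2[\x80\x82-\xbf][\x80-\xbf]"
    rb"|[\xe0\xe1\xe3-\xef][\x80-\xbf]{2}"
)


def tokenize(data: bytes, letter: bytes) -> list:
    """Split data into tokens, picking the longest match at each position."""
    rules = [
        ("EXPONENT", re.compile(EXPONENT)),
        ("ID", re.compile(rb"(?:" + letter + rb")+")),
    ]
    pos, tokens = 0, []
    while pos < len(data):
        best = None
        for name, rx in rules:
            m = rx.match(data, pos)
            # Keep the longest match; on a tie the earlier rule wins.
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (name, m.end())
        if best is None:
            raise ValueError(f"no rule matches at byte {pos}")
        tokens.append((best[0], data[pos:best[1]].decode("utf-8")))
        pos = best[1]
    return tokens


spec = "m\u207b\u00b2".encode("utf-8")  # "m" + superscript minus + superscript 2
loose_tokens = tokenize(spec, LOOSE_LETTER)
strict_tokens = tokenize(spec, STRICT_LETTER)
```

With the permissive letter the whole spec is swallowed as a single identifier, while the stricter letter stops after `m` and leaves the superscript minus and digit for the exponent rule. A real regression test would instead call `ut_parse()` with `UT_UTF8` against the built library, as the review recommends.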

WardF previously approved these changes Mar 12, 2026