Changed patterns to correctly capture UTF8 superscripts by larsbarring · Pull Request #134 · Unidata/UDUNITS-2

larsbarring · 2025-10-27T23:08:50Z

Closes issue #128.

1. List the UTF-8 superscript digits explicitly instead of as a range (line 114).
2. Update the letter pattern to exclude superscript characters, using UTF-8 patterns that leave out the superscripts (lines 126-128).

The new utf8_2bytes_no_super letter pattern excludes:
- \xc2\xb1 (±, plus-minus, not relevant but close)
- \xc2\xb2 (² superscript 2)
- \xc2\xb3 (³ superscript 3)
- \xc2\xb6 (¶ pilcrow)
- \xc2\xb7 (· middle dot)
- \xc2\xb8 (¸ cedilla)
- \xc2\xb9 (¹ superscript 1)

The new utf8_3bytes_no_super letter pattern excludes:
- \xe2\x81\xb0 through \xe2\x81\xb9 (superscript digits ⁰ and ⁴ through ⁹)
- \xe2\x81\xba and \xe2\x81\xbb (superscript + and -)

This ensures that UTF-8 superscript characters are not captured as letters by the {id} rule and can instead be properly matched by the {utf8_exponent} rule.
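The byte sequences in the two lists can be checked by encoding the corresponding code points. The following is a minimal sketch for that check only; the helper name `utf8_escape` and the two constant lists are made up for this example and are not taken from lib/scanner.l.

```python
def utf8_escape(code_point):
    """Return the UTF-8 encoding of a code point written as \\xNN escapes."""
    return "".join(f"\\x{b:02x}" for b in chr(code_point).encode("utf-8"))

# Code points named in the two exclusion lists (hypothetical constants,
# typed in by hand, not read from lib/scanner.l).
EXCLUDED_2BYTE = [0x00B1, 0x00B2, 0x00B3, 0x00B6, 0x00B7, 0x00B8, 0x00B9]
EXCLUDED_3BYTE = [0x2070, *range(0x2074, 0x207C)]  # digits 0 and 4-9, plus, minus

for cp in EXCLUDED_2BYTE + EXCLUDED_3BYTE:
    print(f"U+{cp:04X} {chr(cp)} {utf8_escape(cp)}")
```

Each printed line pairs a code point with its escape sequence, which makes a mistyped byte in a hand-written pattern easy to spot.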

larsbarring · 2025-10-28T00:23:22Z

A simple bash script and C code for testing are available here; they are common to PR #134 (issue #128), PR #135 (issue #129), and PR #136 (issue #132).


Copilot

Pull request overview

This PR updates the Flex lexer patterns in lib/scanner.l to ensure UTF-8 superscript exponent characters (notably U+207B superscript minus) are not tokenized as identifier “letters”, so they can be recognized by the {utf8_exponent} rule during ut_parse().

Changes:

- Replaced the UTF-8 exponent digit definition with an explicit list of supported superscript digit byte sequences.
- Introduced utf8_2bytes_no_super / utf8_3bytes_no_super and updated {letter} to exclude superscript-related characters so exponent lexing can win.
- (Recommended) Add a regression test to confirm ut_parse(..., UT_UTF8) accepts specs containing superscript minus.
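The effect of excluding superscripts from the letter class can be sketched outside Flex. The example below is an illustration under stated assumptions, not the scanner's code: all pattern names and the helper `split_id` are hypothetical, the real definitions in lib/scanner.l are not reproduced, and a negative lookahead (which Flex does not offer) stands in for the explicit byte ranges the PR uses. Only the superscript characters are excluded here, not ±, ¶, · or ¸.

```python
import re

# Illustrative byte patterns only; the actual definitions live in lib/scanner.l.
SUPERSCRIPT = (
    rb"(?:\xc2[\xb2\xb3\xb9]"      # U+00B2, U+00B3, U+00B9
    rb"|\xe2\x81[\xb0\xb4-\xbb])"  # U+2070, U+2074..U+2079, U+207A, U+207B
)
MULTIBYTE = rb"(?:[\xc2-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2})"

# Before: any 2- or 3-byte sequence counts as a letter.
PERMISSIVE_ID = re.compile(rb"(?:[A-Za-z_]|" + MULTIBYTE + rb")+")
# After: superscript sequences are rejected as letters.
RESTRICTED_ID = re.compile(
    rb"(?:[A-Za-z_]|(?!" + SUPERSCRIPT + rb")" + MULTIBYTE + rb")+"
)
EXPONENT = re.compile(SUPERSCRIPT + rb"+")

def split_id(spec, id_pattern):
    """Split spec into (leading identifier, remaining text)."""
    data = spec.encode("utf-8")
    m = id_pattern.match(data)
    end = m.end() if m else 0
    return data[:end].decode("utf-8"), data[end:].decode("utf-8")

spec = "m\u207b\u00b2"  # "m" followed by superscript minus and superscript 2
print(split_id(spec, PERMISSIVE_ID))  # identifier swallows the superscripts
print(split_id(spec, RESTRICTED_ID))  # identifier stops at "m"
```

With the restricted pattern the trailing superscripts are left over for an exponent rule, while a non-superscript character such as the micro sign (U+00B5) is still accepted as a letter.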



This was referenced Oct 28, 2025

Disallow NaN and Inf in unit strings #136

Open

Division/ratio: in regex require space around "PER", optional for "/" #135

Merged

larsbarring force-pushed the issue-128 branch from d626d3c to 8e93650 Compare November 30, 2025 15:39

larsbarring force-pushed the issue-128 branch from 8e93650 to 36595e9 Compare February 26, 2026 16:45

larsbarring requested a review from a team February 26, 2026 16:45

WardF requested a review from Copilot March 11, 2026 20:40

Copilot started reviewing on behalf of WardF March 11, 2026 20:40

Copilot AI reviewed Mar 11, 2026


Comment thread lib/scanner.l Outdated

Comment thread lib/scanner.l

Comment thread lib/scanner.l Outdated

WardF previously approved these changes Mar 12, 2026


larsbarring added 2 commits March 13, 2026 08:47

Updated UTF-8 ranges, \xff -> \xbf, as per code review.

562b546
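The review comment behind this commit is not shown, so the following rests on an assumption: that the change tightens the upper bound of the continuation-byte ranges. In well-formed UTF-8 every byte after the lead byte lies in \x80 to \xbf, which a short check over all 2- and 3-byte code points confirms.

```python
# Collect every non-leading byte used by 2- and 3-byte UTF-8 encodings.
trailing = set()
for cp in range(0x80, 0x10000):
    if 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be encoded as UTF-8
        continue
    trailing.update(chr(cp).encode("utf-8")[1:])

print(hex(min(trailing)), hex(max(trailing)))  # 0x80 0xbf
```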

Add CUnit tests for UTF-8 superscript exponent parsing (issue Unidata…

2ad5f8a

…#128)

larsbarring dismissed WardF’s stale review via 2ad5f8a March 13, 2026 08:00

WardF approved these changes Apr 1, 2026


Merge branch 'main' into issue-128

f98434a

larsbarring wants to merge 4 commits into Unidata:main from larsbarring:issue-128
