A C11 phone number parser modelled on Google's libphonenumber, rebuilt around three design rules. Named after the Turkish word for "digit".
- No runtime regex engine. Every pattern in `PhoneNumberMetadata.xml` is expanded offline into flat rule tables (nibble-packed prefixes, length windows, transform recipes). The parser is a linear scan over generated C arrays.
- No heap allocation. All state lives on the caller's stack or in the compiled `.data` section. Buffers are fixed-size, matched to the ITU-defined upper bounds (17-digit national, 8-char extension, …).
- SIMD-accelerated sanitization. The ASCII hot path runs on a NEON / SSSE3 8-byte vectorised digit stripper with a 256-entry compaction LUT; non-ASCII input falls back to the decoder UTF-8 pipeline.

1276 tests green over 254 regions.

```c
#include <stdio.h>   /* printf */
#include <rakam.h>

int main(void) {
    rakam_init();

    rakam_number_t n;
    if (rakam_parse("+90 (532) 123-45.67", 19, NULL, &n) == RAKAM_OK) {
        char buf[RAKAM_MAX_FORMATTED];
        size_t len = 0;
        rakam_format(&n, RAKAM_FMT_INTERNATIONAL, buf, sizeof(buf), &len);
        printf("%s\n", buf);   /* +90 532 123 45 67 */
    }

    rakam_cleanup();
}
```

```sh
$ cd rakam
$ make -C ../decoder   # build the decoder dependency once
$ make                 # produces build/librakam.a
$ make test            # 1276/1276 passing
$ make bench           # micro-benchmarks
```

- International form (`+<cc><nsn>`) with 254 region calling codes.
- National form with `default_region` and a region's `nationalPrefixForParsing` regex (compiled offline into `(match_prefix → replacement)` tuples: AR `0 + area + 15 → 9 + area`, BB `7-digit local → 246 prepended`, GB `180020` alt-prefix strip).
- Extensions: `ext.`, `Ext`, `EXT`, `x`, `xt`, `#`, `;ext=` (RFC 3966).
- Letterphone: `A-Z` / `a-z` → keypad digits (`1-800-FLOWERS` → `+18003569377`). Disambiguated from extension markers by checking that the post-marker tail contains no letters.
- `tel:` URI prefix stripped before letterphone runs.
- Unicode digits across 24 blocks: Arabic-Indic, Persian, N'Ko, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala Lith, Thai, Lao, Tibetan, Myanmar, Khmer, Mongolian, Fullwidth, Osmanya, Mathematical Bold / Double / Sans. UTF-8 validation via decoder.
- NANPA + 11 other shared-cc disambiguation (US, RU, IT, GB main + 40 secondary NPAs). Supplemented with CA area codes mined from libphonenumber's geocoding data (50 NPAs) and GG / JE fixed-line 4-digit prefixes.
- Italian-leading-zero bit + zero count preserved through round-trip.
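
The letterphone step in the list above can be illustrated with a small keypad mapper. This is a minimal sketch, not rakam's code: `demo_keypad_digit` and `demo_letterphone` are hypothetical names, and the sketch only maps letters to digits (it does not add a country code or handle extension markers).

```c
#include <assert.h>
#include <stddef.h>

/* Map one ASCII letter to its keypad digit; returns 0 for anything else.
 * Hypothetical helper, not part of the rakam API. */
static char demo_keypad_digit(char c)
{
    /* Index 0 is 'A'. PQRS and WXYZ are the two four-letter keys. */
    static const char map[] = "22233344455566677778889999";
    if (c >= 'a' && c <= 'z') c = (char)(c - 'a' + 'A');
    if (c < 'A' || c > 'Z') return 0;
    return map[c - 'A'];
}

/* Keep digits and a leading '+', convert letters, drop every other byte.
 * Returns the number of characters written (excluding the NUL),
 * or 0 if `out` is too small. */
static size_t demo_letterphone(const char *in, char *out, size_t cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && cap > 0);
    for (size_t i = 0; in[i] != '\0'; i++) {
        char c = in[i], d;
        if (c >= '0' && c <= '9')      d = c;
        else if (c == '+' && n == 0)   d = c;
        else                           d = demo_keypad_digit(c);
        if (d == 0) continue;            /* separator or unsupported byte */
        if (n + 1 >= cap) return 0;      /* no room for digit plus NUL */
        out[n++] = d;
    }
    out[n] = '\0';
    return n;
}
```

Feeding `1-800-FLOWERS` through `demo_letterphone` yields the digit string `18003569377`; the `+1` in the README's example would come from region handling, which this sketch leaves out.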
- `E164`: strict `+<cc><nsn>`, ASCII only.
- `NATIONAL`: per-region grouping (`(415) 555-2671`, `0532 123 45 67`, `020 7946 0958`).
- `INTERNATIONAL`: `+<cc> <grouping>`.
- `RFC3966`: `tel:+<cc>-<dashed-grouping>;ext=<ext>`.
- `OutOfCountryCallingNumber`: prepends the caller region's dial-out prefix (`011` US, `00` DE, `0011` AU, `8~10` RU).
- `WithCarrierCode`: BR/AR carrier-select dialing (`0 15 11 96123-4567`).
- `InOriginalFormat`: echoes the user's raw input when preserved (via `parseAndKeepRawInput`), otherwise `NATIONAL` / `INTERNATIONAL` fallback.
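
The `E164`, `INTERNATIONAL` and `RFC3966` shapes differ only in prefix and separator, which the following sketch makes concrete. It is an illustration under assumptions: `demo_number` and `demo_format` are hypothetical (they are not `rakam_number_t` or `rakam_format`), the grouping is passed in by hand rather than looked up from metadata, and extensions are omitted. Keeping the national number as a digit string is one simple way to preserve leading zeros.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified number; not rakam_number_t. */
typedef struct {
    unsigned    cc;    /* country calling code, e.g. 90 */
    const char *nsn;   /* national significant number, digits only */
} demo_number;

/* Writes "<prefix>+<cc>" followed by the NSN split into `groups`
 * (group lengths that must sum to strlen(nsn)), each group preceded by
 * `sep` unless sep is '\0'. Returns the length written, or 0 on
 * overflow or a grouping that does not cover the NSN exactly. */
static size_t demo_format(const demo_number *n, const char *prefix, char sep,
                          const unsigned char *groups, size_t ngroups,
                          char *out, size_t cap)
{
    assert(n != NULL && n->nsn != NULL && prefix != NULL && out != NULL);
    int w = snprintf(out, cap, "%s+%u", prefix, n->cc);
    if (w < 0 || (size_t)w >= cap) return 0;

    size_t len = (size_t)w, pos = 0, total = strlen(n->nsn);
    for (size_t g = 0; g < ngroups; g++) {
        size_t glen = groups[g];
        if (pos + glen > total) return 0;
        if (len + glen + 2 > cap) return 0;   /* separator + group + NUL */
        if (sep != '\0') out[len++] = sep;
        memcpy(out + len, n->nsn + pos, glen);
        len += glen;
        pos += glen;
    }
    if (pos != total) return 0;
    out[len] = '\0';
    return len;
}
```

With `cc = 90`, `nsn = "5321234567"` and groups `{3, 3, 2, 2}`, a space separator gives `+90 532 123 45 67`, and prefix `tel:` with `-` gives `tel:+90-532-123-45-67`. A single group covering the whole NSN with no separator gives the E.164 form.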
- `is_possible`: length fits the region's envelope.
- `is_valid`: full type-aware check; regex-free via 67,697 compiled `(prefix, min_len, max_len, type_tag)` rules, 44,642 unique after cross-region deduplication.
- `type`: 10 libphonenumber types + `FIXED_OR_MOBILE` when a region's fixed and mobile patterns coincide (NANPA, …).
- `is_valid_for_region`: stricter check; the region must match the hint.
- `get_example_number(region, type)`: 982-row table compiled from XML `<exampleNumber>` data.
- `country_code_for_region`, `main_region_for_country_code`, `supported_regions_count`, `iterate_regions`.
- `length_of_national_destination_code`: NDC derived from the matched format spec's first group.
- `is_number_match(a, b)`: libphonenumber-style EXACT / NSN / SHORT_NSN / NO_MATCH comparison.
- `is_valid_short_number(digits, region)`: 4,501 rules from `ShortNumberMetadata.xml`.
- `is_emergency_number(digits, region)`: emergency subset (911 US, 112 EU, 110 DE/JP, 000 AU, …).
- Progressive national format as the user types.
- One-shot national-prefix strip (0 for TR, 1 for US, …).
- `+` switches to international echo mode.
- Ignores separators typed by the user.
- Single-region scope; no cursor tracking (UI framework handles that).
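
The as-you-type behaviour listed above can be sketched as a tiny fixed-buffer state machine. Everything here is an assumption for illustration: `demo_aytf`, `demo_aytf_reset` and `demo_aytf_input` are hypothetical names, the grouping is supplied by the caller, the `+` echo mode is not modelled, and the stripped national prefix is simply dropped from the output rather than re-rendered.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_DIGITS 17

/* Hypothetical formatter state; lives entirely on the caller's stack. */
typedef struct {
    char   digits[DEMO_MAX_DIGITS];      /* digits kept after the prefix strip */
    size_t ndigits;
    int    prefix_checked;               /* the prefix is examined only once */
    char   out[2 * DEMO_MAX_DIGITS + 1]; /* digits + separators + NUL */
} demo_aytf;

static void demo_aytf_reset(demo_aytf *s)
{
    s->ndigits = 0;
    s->prefix_checked = 0;
    s->out[0] = '\0';
}

/* Feed one typed character and return the text to display so far. */
static const char *demo_aytf_input(demo_aytf *s, char c, char national_prefix,
                                   const unsigned char *groups, size_t ngroups)
{
    assert(s != NULL && groups != NULL);
    if (c >= '0' && c <= '9') {
        if (!s->prefix_checked) {
            s->prefix_checked = 1;
            if (c == national_prefix) c = 0;   /* one-shot prefix strip */
        }
        if (c != 0 && s->ndigits < DEMO_MAX_DIGITS)
            s->digits[s->ndigits++] = c;
    }
    /* Any other character (space, dash, parenthesis) is ignored. */

    size_t o = 0, d = 0;
    for (size_t g = 0; g < ngroups && d < s->ndigits; g++) {
        if (groups[g] == 0) continue;
        if (o > 0) s->out[o++] = ' ';
        for (unsigned k = 0; k < groups[g] && d < s->ndigits; k++)
            s->out[o++] = s->digits[d++];
    }
    while (d < s->ndigits)                     /* digits beyond the template */
        s->out[o++] = s->digits[d++];
    s->out[o] = '\0';
    return s->out;
}
```

Typing `0532 123-45 67` with national prefix `'0'` and groups `{3, 3, 2, 2}` ends with the display text `532 123 45 67`; after only `0532 1` the display is `532 1`.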
- `find_numbers(text, region, cb)`: scans free text, emits one callback per parseable number with `{start, length, rakam_number_t}`.
- Regex-free scanner: character classifier + state machine.
- Tight mode (no letterphone runs). Minimum 3 digits per candidate.
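
A character classifier plus a two-state scan, as described above, might look like the following. This is a sketch only: the separator set, the start conditions and the names (`demo_find_candidates`, `demo_collect`, `demo_matches`) are assumptions, and the callback reports raw candidate spans without parsing them into a `rakam_number_t`.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*demo_match_cb)(size_t start, size_t length, void *user);

enum { CH_OTHER, CH_DIGIT, CH_SEP, CH_PLUS };

/* Character classifier; the separator set is an assumption. */
static int demo_classify(char c)
{
    if (c >= '0' && c <= '9') return CH_DIGIT;
    if (c == '+') return CH_PLUS;
    if (c == ' ' || c == '-' || c == '.' || c == '(' || c == ')') return CH_SEP;
    return CH_OTHER;
}

/* Emits one callback per candidate with at least 3 digits.
 * Returns the number of candidates found. */
static size_t demo_find_candidates(const char *text, size_t len,
                                   demo_match_cb cb, void *user)
{
    size_t found = 0, i = 0;
    assert(text != NULL);
    while (i < len) {
        int k = demo_classify(text[i]);
        /* Outside a candidate: only '+', '(' or a digit may start one. */
        if (k != CH_DIGIT && k != CH_PLUS && text[i] != '(') { i++; continue; }

        /* Inside a candidate: consume digits and separators. */
        size_t start = i, end = i, digits = 0;
        size_t j = (k == CH_PLUS) ? i + 1 : i;
        for (; j < len; j++) {
            int kj = demo_classify(text[j]);
            if (kj == CH_DIGIT) { digits++; end = j + 1; }  /* trims trailing separators */
            else if (kj != CH_SEP) break;
        }
        if (digits >= 3) {
            if (cb != NULL) cb(start, end - start, user);
            found++;
        }
        i = (j > i) ? j : i + 1;
    }
    return found;
}

/* Example collector that records up to 8 spans. */
typedef struct { size_t start[8], length[8], count; } demo_matches;

static void demo_collect(size_t start, size_t length, void *user)
{
    demo_matches *m = user;
    if (m->count < 8) {
        m->start[m->count] = start;
        m->length[m->count] = length;
        m->count++;
    }
}
```

On the text `Call 112 or +90 (532) 123-45.67 now.` the scan reports two spans: offset 5, length 3 and offset 12, length 19. A real matcher would then hand each span to the parser and discard those that fail validation.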
Features from libphonenumber that live outside the current scope:
- Geocoder: per-language city/region name lookup (~12 MB raw data). No integration yet.
- Carrier name mapper: per-MCC/MNC operator names.
- Timezone mapper: per-prefix IANA timezone.
- AsYouType cursor tracking (`remember_position`).
- Matcher leniency levels (`STRICT_GROUPING` / `EXACT_GROUPING`) and letterphone mode.
- ShortNumberInfo extras: `isCarrierSpecific`, `getExpectedCostForRegion`, `connectsToEmergencyNumber`.
- A few small wrappers: `truncateTooLongNumber`, `isPossibleNumberWithReason`, `isAlphaNumber`, `convertAlphaCharactersInNumber`, `formatNationalNumberWithPreferredCarrierCode`.

See API.md for the full public interface.

```
libphonenumber/resources/
  PhoneNumberMetadata.xml            254 regions, formats, patterns
  PhoneNumberAlternateFormats.xml    158 extra specs, 46 regions
  ShortNumberMetadata.xml            4,501 rules
  geocoding/en/1.txt                 CA area codes (NANPA)
        │
        ▼
tools/gen_metadata.py                regex subset parser
        │  ├─ expand_pattern_to_rules    (validation)
        │  ├─ expand_leading_digits      (format selection)
        │  ├─ compile_np_pattern         (prefix stripping)
        │  └─ _expand_class              (char classes)
        ▼
generated/metadata.{h,c}             ~99 K lines of C data
        ▼
librakam.a                           ~900 KB
```

All XML regexes are expanded at generation time. The runtime never sees a regex string; it walks pre-computed tables.
Validation rules originally compiled to 67,697 × 13 bytes = ~1.1 MB.
Nibble-packing the prefix (4 bits × 8 positions in a uint32_t)
drops the struct to 8 bytes (~540 KB). Cross-region deduplication
adds a uint16_t indirection and trims another ~50 KB for a final
~490 KB validation footprint.
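
To make the nibble-packing concrete, here is a minimal sketch of one possible 8-byte rule layout. The field order, the first-digit-in-the-top-nibble convention and the `0xF` wildcard encoding are all assumptions, and `demo_rule`, `demo_pack_prefix` and `demo_prefix_matches` are hypothetical names rather than rakam's generated structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NIB_WILDCARD 0xFu   /* assumed encoding for "any digit" */

/* Hypothetical 8-byte rule: 4 bytes of packed prefix + 4 single-byte fields. */
typedef struct {
    uint32_t prefix;       /* up to 8 nibbles, first digit in the top nibble */
    uint8_t  prefix_len;   /* nibbles in use, 0..8 */
    uint8_t  min_len;
    uint8_t  max_len;
    uint8_t  type_tag;
} demo_rule;

_Static_assert(sizeof(demo_rule) == 8, "rule is expected to occupy 8 bytes");

/* Pack a prefix such as "53." ('.' is a wildcard) into nibbles.
 * At most the first 8 characters are used. */
static uint32_t demo_pack_prefix(const char *s, uint8_t *len_out)
{
    uint32_t packed = 0;
    uint8_t n = 0;
    assert(s != NULL && len_out != NULL);
    for (; s[n] != '\0' && n < 8; n++) {
        uint32_t nib = (s[n] == '.') ? DEMO_NIB_WILDCARD : (uint32_t)(s[n] - '0');
        packed |= nib << (28 - 4 * n);
    }
    *len_out = n;
    return packed;
}

/* Compare a packed prefix against the leading digits of a number. */
static int demo_prefix_matches(uint32_t packed, uint8_t plen,
                               const char *digits, size_t ndigits)
{
    if (ndigits < plen) return 0;
    for (uint8_t i = 0; i < plen; i++) {
        uint32_t nib = (packed >> (28 - 4 * i)) & 0xFu;
        if (nib != DEMO_NIB_WILDCARD && nib != (uint32_t)(digits[i] - '0'))
            return 0;
    }
    return 1;
}
```

The quoted sizes are consistent with this kind of layout: 67,697 × 8 = 541,576 bytes (~540 KB), and 44,642 unique rules × 8 bytes plus 67,697 `uint16_t` indices comes to 492,530 bytes (~490 KB). Note that 67,697 × 13 bytes is about 880 KB; the ~1.1 MB starting figure would match a struct padded to 16 bytes, which is a guess the source does not confirm.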

```c
if (rakam_sanitize_ascii(input, len, out)) { /* NEON/SSSE3 hot path */ }
else { /* decoder UTF-8 fallback */ }
```

- 8-byte chunks, compaction LUT indexed by an 8-bit digit mask.
- NEON: `vld1_u8` → `vcge` / `vcle` → `vtbl1_u8`.
- SSSE3: `_mm_loadl_epi64` → `_mm_cmpgt_epi8` → `_mm_shuffle_epi8`.
- High-bit byte in the chunk → bail to scalar UTF-8 path.
- Benchmark on Apple Silicon: ~59 Mops/s (17 ns/call) for ASCII; UTF-8 fallback ~14 Mops/s. Full `parse()` ~5–12 Mops/s (depends on region's validate rule count).

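
The mask-plus-LUT idea can be modelled in portable scalar C, which shows what the vector shuffle accomplishes without any intrinsics. This is a sketch under assumptions, not the library's sanitizer: the names are hypothetical, the LUT is built at runtime rather than generated, and only digits are kept (how the real path treats `+` is not covered here).

```c
#include <stddef.h>
#include <stdint.h>

/* demo_lut[mask][k] is the position of the k-th set bit in `mask`;
 * demo_lut_count[mask] is the number of set bits. */
static uint8_t demo_lut[256][8];
static uint8_t demo_lut_count[256];

static void demo_build_lut(void)
{
    for (unsigned mask = 0; mask < 256; mask++) {
        uint8_t n = 0;
        for (uint8_t bit = 0; bit < 8; bit++)
            if (mask & (1u << bit)) demo_lut[mask][n++] = bit;
        demo_lut_count[mask] = n;
    }
}

/* Copies the ASCII digits of `in` to `out` (which must hold `len` bytes).
 * Returns the digit count, or (size_t)-1 as soon as a byte with the
 * high bit set is seen, signalling "use the UTF-8 path instead".
 * demo_build_lut() must have been called first. */
static size_t demo_strip_digits(const char *in, size_t len, char *out)
{
    size_t o = 0;
    for (size_t base = 0; base < len; base += 8) {
        size_t chunk = (len - base < 8) ? len - base : 8;
        unsigned mask = 0;
        for (size_t i = 0; i < chunk; i++) {
            unsigned char c = (unsigned char)in[base + i];
            if (c & 0x80u) return (size_t)-1;           /* non-ASCII: bail */
            if (c >= '0' && c <= '9') mask |= 1u << i;  /* bit i = "byte i is a digit" */
        }
        /* Compaction: gather the flagged bytes in order. In the SIMD
         * version this gather is a single table-lookup shuffle. */
        for (uint8_t k = 0; k < demo_lut_count[mask]; k++)
            out[o++] = in[base + demo_lut[mask][k]];
    }
    return o;
}
```

For the 19-byte input `+90 (532) 123-45.67` the function writes the 12 digits `905321234567`; an input containing an Arabic-Indic digit returns `(size_t)-1` so the caller can switch to the decoder pipeline.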
Each region's `nationalNumberPattern` is compiled into a list of
`(prefix, min_len, max_len, type_tag)` tuples. A number is valid
iff some rule's length window contains its national length AND the
prefix matches (with `.` wildcards for `\d`). Type comes from the
first per-type rule that fires; `FIXED_OR_MOBILE` triggers when
both the fixed-line and mobile tags match.
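
The validity test and the `FIXED_OR_MOBILE` case can be sketched with a plain-string rule table. The rules below are invented for illustration and are not taken from the metadata; `demo_vrule`, `demo_rule_matches` and `demo_number_type` are hypothetical names, and only two type tags are modelled (as bit flags, so that both firing yields the combined value).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    DEMO_INVALID         = 0,
    DEMO_FIXED           = 1,
    DEMO_MOBILE          = 2,
    DEMO_FIXED_OR_MOBILE = DEMO_FIXED | DEMO_MOBILE
};

typedef struct {
    const char *prefix;    /* digits; '.' matches any single digit */
    uint8_t     min_len;
    uint8_t     max_len;
    uint8_t     type_tag;
} demo_vrule;

/* Illustrative rules only; not real region data. */
static const demo_vrule demo_rules[] = {
    { "5",   10, 10, DEMO_MOBILE },
    { "2.6", 10, 10, DEMO_FIXED  },
    { "415", 10, 10, DEMO_FIXED  },
    { "415", 10, 10, DEMO_MOBILE },
};

/* A rule fires when the length is inside its window and every
 * non-wildcard prefix position equals the corresponding digit. */
static int demo_rule_matches(const demo_vrule *r, const char *nsn, size_t len)
{
    if (len < r->min_len || len > r->max_len) return 0;
    for (size_t i = 0; r->prefix[i] != '\0'; i++) {
        if (i >= len) return 0;
        if (r->prefix[i] != '.' && r->prefix[i] != nsn[i]) return 0;
    }
    return 1;
}

/* Returns DEMO_INVALID when no rule fires, otherwise the OR of the
 * tags that fired (both tags together mean FIXED_OR_MOBILE). */
static int demo_number_type(const demo_vrule *rules, size_t nrules,
                            const char *nsn, size_t len)
{
    int tags = DEMO_INVALID;
    assert(rules != NULL && nsn != NULL);
    for (size_t i = 0; i < nrules; i++)
        if (demo_rule_matches(&rules[i], nsn, len))
            tags |= rules[i].type_tag;
    return tags;
}
```

A number is valid in this model exactly when `demo_number_type` returns a non-zero value, so the validity check and the type lookup share one linear pass over the rule list.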
`rakam_init` is guarded by `pthread_once`. Once initialised, all
query functions are read-only over static tables and are therefore
thread-safe. No shared mutable state lives in the parser.
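
The one-time initialisation pattern looks roughly like this. It is a generic `pthread_once` sketch, not rakam's source: `demo_init`, `demo_init_body` and the run counter are hypothetical and exist only to show that the body executes once however often the entry point is called.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t demo_once = PTHREAD_ONCE_INIT;
static int demo_init_runs;   /* how many times the body actually ran */

/* One-time setup; a real library would prepare its tables here. */
static void demo_init_body(void)
{
    demo_init_runs++;
}

/* Safe to call from any thread, any number of times. */
static void demo_init(void)
{
    int rc = pthread_once(&demo_once, demo_init_body);
    assert(rc == 0);   /* pthread_once returns 0 on success */
    (void)rc;          /* keeps NDEBUG builds warning-free */
}
```

Calling `demo_init()` repeatedly leaves `demo_init_runs` at 1. Depending on the toolchain, linking may require `-pthread`.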
- decoder: UTF-8 validation and UTF-32 conversion. Built separately; linked as `libunicode.a`.
- C11 compiler + libc. `pthread` for `pthread_once`.
- `<arm_neon.h>` on AArch64, `<tmmintrin.h>` on x86 with SSSE3. Otherwise the SIMD fast path compiles to an equivalent scalar loop.

| Path | Input | Throughput | ns/op |
|---|---|---|---|
| ASCII sanitize (SIMD) | `+90 (532) 123-45.67` | 59.0 Mops/s | 17.0 |
| UTF-8 sanitize (decoder) | same with trailing `١` | 13.9 Mops/s | 71.9 |
| Full `parse()` | TR input | 5.6 Mops/s | 179 |

Full parse() includes sanitize → country-code extract → region
dispatch → NANPA override → strict validation → type detection.
CN has ~12k rules; TR ~550.

```
test_parse            86   national + intl + transform + unicode digits
test_format           24   E164 / NATIONAL / INTL / RFC3966
test_validate         19   per-type validation, FIXED_OR_MOBILE
test_extension        19   ext/x/xt/# separators
test_letterphone       9   A-Z → 2-9
test_leading_zero      8   IT, VA, SM, CI
test_out_of_country   11   011 / 00 / 0011 / 8~10
test_api              31   example_number, region APIs, raw input, OOC
test_wrappers         16   NDC length, for_region, match_type, init
test_short_numbers    18   emergency + valid short
test_aytf             39   keystroke-by-keystroke progressive format
test_matcher          11   findNumbers over free text
test_bulk            985   every XML exampleNumber, round-tripped E.164
                           + is_valid + type consistency
                    ────────────────
total               1276
```