toprakdeviren/rakam

rakam

A C11 phone number parser modelled on Google's libphonenumber, rebuilt around three design rules. Named after the Turkish word for "digit".

  1. No runtime regex engine. Every pattern in PhoneNumberMetadata.xml is expanded offline into flat rule tables (nibble-packed prefixes, length windows, transform recipes). The parser is a linear scan over generated C arrays.
  2. No heap allocation. All state lives on the caller's stack or in the compiled .data section. Buffers are fixed-size, matched to the ITU-defined upper bounds (17-digit national, 8-char extension, …).
  3. SIMD-accelerated sanitization. The ASCII hot path runs on a NEON / SSSE3 8-byte vectorised digit stripper with a 256-entry compaction LUT; non-ASCII input falls back to the decoder UTF-8 pipeline.

1276 tests green over 254 regions.


Quick start

#include <stdio.h>
#include <rakam.h>

int main(void) {
    rakam_init();

    rakam_number_t n;
    if (rakam_parse("+90 (532) 123-45.67", 19, NULL, &n) == RAKAM_OK) {
        char buf[RAKAM_MAX_FORMATTED];
        size_t len = 0;
        rakam_format(&n, RAKAM_FMT_INTERNATIONAL, buf, sizeof(buf), &len);
        printf("%s\n", buf);   /* +90 532 123 45 67 */
    }

    rakam_cleanup();
    return 0;
}

$ cd rakam
$ make -C ../decoder           # build the decoder dependency once
$ make                         # produces build/librakam.a
$ make test                    # 1276/1276 passing
$ make bench                   # micro-benchmarks

What it supports

Parsing

  • International form (+<cc><nsn>) with 254 region calling codes.
  • National form with default_region and a region's nationalPrefixForParsing regex (compiled offline into (match_prefix → replacement) tuples — AR 0 + area + 15 → 9 + area, BB 7-digit local → 246 prepended, GB 180020 alt-prefix strip).
  • Extensions: ext., Ext, EXT, x, xt, #, ;ext= (RFC 3966).
  • Letterphone: A-Z / a-z → keypad digits (1-800-FLOWERS → +18003569377). Disambiguated from extension markers by checking that the post-marker tail contains no letters.
  • tel: URI prefix stripped before letterphone runs.
  • Unicode digits across 24 blocks — Arabic-Indic, Persian, N'Ko, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala Lith, Thai, Lao, Tibetan, Myanmar, Khmer, Mongolian, Fullwidth, Osmanya, Mathematical Bold / Double / Sans. UTF-8 validation via decoder.
  • NANPA + 11 other shared-cc disambiguation (US, RU, IT, GB main + 40 secondary NPAs). Supplemented with CA area codes mined from libphonenumber's geocoding data (50 NPAs) and GG / JE fixed-line 4-digit prefixes.
  • Italian-leading-zero bit + zero count preserved through round-trip.
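
The letterphone step can be illustrated with a minimal sketch. The names keypad_digit and letterphone_to_digits are hypothetical and not part of rakam's API; the mapping is the standard ITU E.161 keypad layout. The sketch only normalises the string, so it yields 18003569377; the leading +1 in the example above would come from region resolution, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: E.161 keypad digit for an ASCII letter, 0 otherwise. */
char keypad_digit(char c)
{
    /* Index 0 = 'A' ... 25 = 'Z': ABC=2 DEF=3 GHI=4 JKL=5 MNO=6 PQRS=7 TUV=8 WXYZ=9 */
    static const char map[] = "22233344455566677778889999";
    if (c >= 'a' && c <= 'z') c = (char)(c - 'a' + 'A');
    if (c >= 'A' && c <= 'Z') return map[c - 'A'];
    return 0;
}

/* Keeps digits and a leading '+', maps letters to digits, drops everything
 * else. Returns the number of characters written (excluding the NUL), or 0
 * if the output buffer is too small. */
size_t letterphone_to_digits(const char *in, char *out, size_t cap)
{
    assert(in != NULL && out != NULL);
    size_t len = strlen(in), w = 0;
    if (cap == 0) return 0;
    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        char d = 0;
        if (c >= '0' && c <= '9')      d = c;
        else if (c == '+' && w == 0)   d = c;
        else                           d = keypad_digit(c);
        if (d == 0) continue;          /* separator or unsupported byte */
        if (w + 1 >= cap) return 0;    /* leave room for the terminator */
        out[w++] = d;
    }
    out[w] = '\0';
    return w;
}
```

Calling letterphone_to_digits("1-800-FLOWERS", out, sizeof out) fills out with 18003569377. Extension disambiguation (checking the tail after a marker for letters) would have to run before this step and is left out of the sketch.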

Formatting

  • E164 — strict +<cc><nsn>, ASCII only.
  • NATIONAL — per-region grouping ((415) 555-2671, 0532 123 45 67, 020 7946 0958).
  • INTERNATIONAL — +<cc> <grouping>.
  • RFC3966 — tel:+<cc>-<dashed-grouping>;ext=<ext>.
  • OutOfCountryCallingNumber — prepends the caller region's dial-out prefix (011 US, 00 DE, 0011 AU, 8~10 RU).
  • WithCarrierCode — BR/AR carrier-select dialing (0 15 11 96123-4567).
  • InOriginalFormat — echoes the user's raw input when preserved (via parseAndKeepRawInput), otherwise NATIONAL / INTERNATIONAL fallback.
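
As a rough illustration of the E164 form and of why the leading-zero count has to survive parsing, here is a heap-free formatting sketch. The function format_e164 and its parameters are hypothetical; it assumes the national number is held as an integer plus a separate zero count, which is how libphonenumber models it, and may not match rakam_number_t.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch: writes "+<cc><zeros><national>" into buf.
 * Returns the string length, or -1 if buf is too small. */
int format_e164(unsigned cc, uint64_t national, unsigned leading_zeros,
                char *buf, size_t cap)
{
    static const char zeros[] = "0000000000000000";   /* 16 zeros */
    assert(buf != NULL && cap > 0);
    assert(leading_zeros < sizeof zeros);

    /* "%.*s" copies exactly leading_zeros characters from the zero run. */
    int n = snprintf(buf, cap, "+%u%.*s%" PRIu64,
                     cc, (int)leading_zeros, zeros, national);
    return (n > 0 && (size_t)n < cap) ? n : -1;
}
```

With cc = 39, national = 236618300 and one leading zero, the buffer holds +390236618300; dropping the zero count would produce a different, invalid Italian number.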

Validation

  • is_possible — length fits the region's envelope.
  • is_valid — full type-aware check; regex-free via 67,697 compiled (prefix, min_len, max_len, type_tag) rules, 44,642 unique after cross-region deduplication.
  • type — 10 libphonenumber types + FIXED_OR_MOBILE when a region's fixed and mobile patterns coincide (NANPA, …).
  • is_valid_for_region — stricter: region must match the hint.

Metadata queries

  • get_example_number(region, type) — 982-row table compiled from XML <exampleNumber> data.
  • country_code_for_region, main_region_for_country_code, supported_regions_count, iterate_regions.
  • length_of_national_destination_code — NDC derived from the matched format spec's first group.
  • is_number_match(a, b) — libphonenumber-style EXACT / NSN / SHORT_NSN / NO_MATCH comparison.

Short numbers

  • is_valid_short_number(digits, region) — 4,501 rules from ShortNumberMetadata.xml.
  • is_emergency_number(digits, region) — emergency subset (911 US, 112 EU, 110 DE/JP, 000 AU, …).

AsYouTypeFormatter

  • Progressive national format as the user types.
  • One-shot national-prefix strip (0 for TR, 1 for US, …).
  • + switches to international echo mode.
  • Ignores separators typed by the user.
  • Single-region scope; no cursor tracking (UI framework handles that).

PhoneNumberMatcher

  • find_numbers(text, region, cb) — scans free text, emits one callback per parseable number with {start, length, rakam_number_t}.
  • Regex-free scanner: character classifier + state machine.
  • Tight mode (no letterphone runs). Minimum 3 digits per candidate.
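
A character classifier plus state machine of this kind might look like the sketch below. It is an assumption-laden illustration, not rakam's scanner: the names classify, span and scan_candidates are hypothetical, the separator set is a guess, and it fills an array of spans instead of invoking a callback. Each span would still have to pass through the parser before being reported.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CH_DIGIT, CH_PLUS, CH_SEP, CH_OTHER } ch_class;
typedef struct { size_t start, length; } span;

/* Hypothetical classifier: ASCII digits, '+', and a few common separators. */
ch_class classify(unsigned char c)
{
    if (c >= '0' && c <= '9') return CH_DIGIT;
    if (c == '+') return CH_PLUS;
    if (c == ' ' || c == '-' || c == '.' || c == '(' || c == ')') return CH_SEP;
    return CH_OTHER;
}

/* Two-state scanner (idle / in candidate). A candidate starts at a digit or
 * '+', continues over digits and separators, and ends at anything else.
 * Only candidates with at least 3 digits are kept. Returns the number of
 * candidates found; at most cap of them are stored in out. */
size_t scan_candidates(const char *text, size_t len, span *out, size_t cap)
{
    assert(text != NULL && (out != NULL || cap == 0));
    size_t found = 0, start = 0, end = 0, digits = 0;
    int in_candidate = 0;

    for (size_t i = 0; i <= len; i++) {
        /* One virtual CH_OTHER past the end flushes a trailing candidate. */
        ch_class k = (i < len) ? classify((unsigned char)text[i]) : CH_OTHER;

        if (in_candidate && (k == CH_PLUS || k == CH_OTHER)) {
            if (digits >= 3) {
                if (found < cap) {
                    out[found].start = start;
                    out[found].length = end - start;  /* trailing separators trimmed */
                }
                found++;
            }
            in_candidate = 0;
        }
        if (!in_candidate) {
            if (k != CH_DIGIT && k != CH_PLUS) continue;
            in_candidate = 1;
            start = i;
            end = i;
            digits = 0;
        }
        if (k == CH_DIGIT) {
            digits++;
            end = i + 1;
        }
    }
    return found;
}
```

On the text "Call +90 532 123 45 67 or 12 now" the sketch reports one span starting at offset 5 with length 17; the run "12" is dropped by the 3-digit minimum.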

What it doesn't

Features from libphonenumber that live outside the current scope:

  • Geocoder — per-language city/region name lookup (~12 MB raw data). No integration yet.
  • Carrier name mapper — per-MCC/MNC operator names.
  • Timezone mapper — per-prefix IANA timezone.
  • AsYouType cursor tracking (remember_position).
  • Matcher leniency levels (STRICT_GROUPING / EXACT_GROUPING) and letterphone mode.
  • ShortNumberInfo extras: isCarrierSpecific, getExpectedCostForRegion, connectsToEmergencyNumber.
  • A few small wrappers: truncateTooLongNumber, isPossibleNumberWithReason, isAlphaNumber, convertAlphaCharactersInNumber, formatNationalNumberWithPreferredCarrierCode.

See API.md for the full public interface.


Design notes

Metadata pipeline

libphonenumber/resources/
    PhoneNumberMetadata.xml          254 regions, formats, patterns
    PhoneNumberAlternateFormats.xml  158 extra specs, 46 regions
    ShortNumberMetadata.xml          4,501 rules
    geocoding/en/1.txt               CA area codes (NANPA)
                │
                ▼
    tools/gen_metadata.py            regex subset parser
                │      ├─ expand_pattern_to_rules  (validation)
                │      ├─ expand_leading_digits    (format selection)
                │      ├─ compile_np_pattern       (prefix stripping)
                │      └─ _expand_class            (char classes)
                ▼
    generated/metadata.{h,c}         ~99 K lines of C data
                ▼
    librakam.a                      ~900 KB

All XML regexes are expanded at generation time. The runtime never sees a regex string; it walks pre-computed tables.

Rule compression

Validation rules originally compiled to 67,697 × 13 bytes = ~1.1 MB. Nibble-packing the prefix (4 bits × 8 positions in a uint32_t) drops the struct to 8 bytes (~540 KB). Cross-region deduplication adds a uint16_t indirection and trims another ~50 KB for a final ~490 KB validation footprint.
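
The 8-byte layout and the footprint figures can be checked with a short sketch. The struct fields, the wildcard nibble value 0xA, and the function names are assumptions made for illustration; only the nibble packing (4 bits × 8 positions), the 8-byte size, the rule counts, and the uint16_t indirection come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 8-byte rule: one packed prefix word plus four byte fields. */
typedef struct {
    uint32_t prefix;      /* 8 nibbles, first prefix digit in the top nibble */
    uint8_t  prefix_len;  /* number of nibbles in use (assumed field) */
    uint8_t  min_len;
    uint8_t  max_len;
    uint8_t  type_tag;
} packed_rule;

_Static_assert(sizeof(packed_rule) == 8, "rule must stay at 8 bytes");

enum { NIB_WILDCARD = 0xA };  /* assumed encoding for a '.' position */

/* Packs up to 8 prefix characters ('0'..'9' or '.') into one uint32_t. */
uint32_t pack_prefix(const char *prefix, size_t n)
{
    assert(prefix != NULL && n <= 8);
    uint32_t packed = 0;
    for (size_t i = 0; i < n; i++) {
        char c = prefix[i];
        assert(c == '.' || (c >= '0' && c <= '9'));
        uint32_t nib = (c == '.') ? (uint32_t)NIB_WILDCARD : (uint32_t)(c - '0');
        packed |= nib << (28 - 4 * i);
    }
    return packed;
}

/* Bytes used when every rule is stored inline. */
size_t flat_footprint(size_t rules)
{
    return rules * sizeof(packed_rule);
}

/* Bytes used with a shared pool of unique rules plus one uint16_t index
 * per original rule. */
size_t dedup_footprint(size_t rules, size_t unique)
{
    return unique * sizeof(packed_rule) + rules * sizeof(uint16_t);
}
```

flat_footprint(67697) gives 541,576 bytes (the ~540 KB figure), and dedup_footprint(67697, 44642) gives 357,136 + 135,394 = 492,530 bytes (the ~490 KB figure). A uint16_t index is sufficient because 44,642 unique rules fit below 65,536.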

SIMD sanitizer

if (rakam_sanitize_ascii(input, len, out)) { /* NEON/SSSE3 hot path */ }
else                                         { /* decoder UTF-8 fallback */ }

  • 8-byte chunks, compaction LUT indexed by an 8-bit digit mask.
  • NEON: vld1_u8 → vcge/vcle → vtbl1_u8.
  • SSSE3: _mm_loadl_epi64 → _mm_cmpgt_epi8 → _mm_shuffle_epi8.
  • High-bit byte in the chunk → bail to scalar UTF-8 path.
  • Benchmark on Apple Silicon: ~59 Mops/s (17 ns/call) for ASCII; UTF-8 fallback ~14 Mops/s. Full parse() ~5–12 Mops/s (depends on region's validate rule count).
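
The mask-and-compact idea can be shown without intrinsics. The sketch below is a portable scalar emulation of one 8-byte step, assuming the LUT stores, for each 8-bit digit mask, the lane indices of the set bits; in a vector build those indices would feed vtbl1_u8 or _mm_shuffle_epi8. The names are hypothetical, and whether the real sanitizer keeps a leading + is not covered here: this version keeps digits only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t lut_index[256][8];  /* lanes to keep, in order, per mask */
static uint8_t lut_count[256];     /* number of digits per mask */

/* Must be called once before sanitize_ascii_sketch. */
void build_compaction_lut(void)
{
    for (unsigned mask = 0; mask < 256; mask++) {
        unsigned n = 0;
        for (unsigned lane = 0; lane < 8; lane++)
            if (mask & (1u << lane))
                lut_index[mask][n++] = (uint8_t)lane;
        lut_count[mask] = (uint8_t)n;
    }
}

/* Copies the ASCII digits of in[0..len) to out and NUL-terminates.
 * Returns false as soon as a byte with the high bit set is seen, so the
 * caller can switch to a UTF-8 path. Requires cap > len. */
bool sanitize_ascii_sketch(const char *in, size_t len,
                           char *out, size_t cap, size_t *out_len)
{
    assert(in != NULL && out != NULL && out_len != NULL && cap > len);
    size_t i = 0, w = 0;

    for (; i + 8 <= len; i += 8) {
        unsigned mask = 0, high = 0;
        for (unsigned lane = 0; lane < 8; lane++) {   /* vector compare */
            unsigned char c = (unsigned char)in[i + lane];
            high |= c & 0x80u;
            if (c >= '0' && c <= '9') mask |= 1u << lane;
        }
        if (high) return false;                       /* non-ASCII: bail */
        for (unsigned j = 0; j < lut_count[mask]; j++)  /* vector shuffle */
            out[w + j] = in[i + lut_index[mask][j]];
        w += lut_count[mask];
    }
    for (; i < len; i++) {                            /* scalar tail */
        unsigned char c = (unsigned char)in[i];
        if (c & 0x80u) return false;
        if (c >= '0' && c <= '9') out[w++] = (char)c;
    }
    out[w] = '\0';
    *out_len = w;
    return true;
}
```

For the 19-byte input "+90 (532) 123-45.67" the two full chunks contribute 90532 and 12345, and the 3-byte tail contributes 67, giving 905321234567.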

Regex-free validation

Each region's nationalNumberPattern is compiled into a list of (prefix, min_len, max_len, type_tag) tuples. A number is valid iff some rule's length window contains its national length AND the prefix matches (with . wildcards for \d). Type comes from the first per-type rule that fires; FIXED_OR_MOBILE triggers when both the fixed-line and mobile tags match.
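
The matching rule described above can be sketched as follows. For readability the prefix is kept as a string with '.' wildcards rather than in packed form, and the type names, the rule struct, and the toy_rules table are invented for illustration; they are not real metadata for any region.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    T_UNKNOWN, T_FIXED_LINE, T_MOBILE, T_TOLL_FREE, T_FIXED_OR_MOBILE
} num_type;

typedef struct {
    const char *prefix;        /* digits, with '.' matching any digit */
    uint8_t     min_len;
    uint8_t     max_len;
    num_type    type_tag;
} rule;

/* A rule fires when the length window contains the national length AND
 * every non-wildcard prefix position equals the corresponding digit. */
bool rule_matches(const rule *r, const char *nsn)
{
    size_t n = strlen(nsn), p = strlen(r->prefix);
    if (n < r->min_len || n > r->max_len || p > n) return false;
    for (size_t i = 0; i < p; i++)
        if (r->prefix[i] != '.' && r->prefix[i] != nsn[i]) return false;
    return true;
}

/* Returns T_UNKNOWN when no rule fires (number invalid), the tag of the
 * first firing rule otherwise, or T_FIXED_OR_MOBILE when both a fixed-line
 * and a mobile rule fire. */
num_type classify_number(const rule *rules, size_t count, const char *nsn)
{
    assert(rules != NULL && nsn != NULL);
    num_type first = T_UNKNOWN;
    bool fixed = false, mobile = false;
    for (size_t i = 0; i < count; i++) {
        if (!rule_matches(&rules[i], nsn)) continue;
        if (first == T_UNKNOWN) first = rules[i].type_tag;
        if (rules[i].type_tag == T_FIXED_LINE) fixed = true;
        if (rules[i].type_tag == T_MOBILE)     mobile = true;
    }
    return (fixed && mobile) ? T_FIXED_OR_MOBILE : first;
}

/* Invented rules for demonstration only. */
const rule toy_rules[] = {
    { "5",   10, 10, T_MOBILE     },
    { "2",   10, 10, T_FIXED_LINE },
    { "4",   10, 10, T_FIXED_LINE },
    { "4",   10, 10, T_MOBILE     },
    { "8.0", 10, 10, T_TOLL_FREE  },
};
```

Against toy_rules, 5321234567 classifies as mobile, 4155552671 as FIXED_OR_MOBILE because two rules with different tags fire, and 532123 as unknown because no length window contains 6.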

Thread safety

rakam_init is guarded by pthread_once, so concurrent first calls initialise the tables exactly once. After initialisation, all query functions only read static tables and are therefore thread-safe; the parser holds no shared mutable state.
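
The once-only pattern looks roughly like this. It is a generic pthread_once sketch with hypothetical names (demo_init, do_init), not rakam's actual initialiser; what rakam_init sets up internally is not shown in this README.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t init_once = PTHREAD_ONCE_INIT;
static int init_calls;             /* counts how often do_init really ran */

/* One-time setup; pthread_once guarantees it runs at most once even when
 * several threads race on the first call. */
static void do_init(void)
{
    init_calls++;
}

/* Hypothetical public entry point: safe to call from any thread, any
 * number of times. Returns 0 on success, as pthread_once does. */
int demo_init(void)
{
    return pthread_once(&init_once, do_init);
}

int demo_init_calls(void)
{
    return init_calls;
}
```

Calling demo_init() repeatedly leaves demo_init_calls() at 1. On toolchains where pthread lives in a separate library, link with -pthread.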


Dependencies

  • decoder — UTF-8 validation and UTF-32 conversion. Built separately; linked as libunicode.a.
  • C11 compiler + libc. pthread for pthread_once.
  • <arm_neon.h> on AArch64, <tmmintrin.h> on x86 with SSSE3. On other targets the fast path compiles to a scalar loop with identical behaviour.

Benchmarks (Apple Silicon, clang -O3)

Path                      Input                   Throughput    ns/op
ASCII sanitize (SIMD)     +90 (532) 123-45.67     59.0 Mops/s   17.0
UTF-8 sanitize (decoder)  same with trailing ١    13.9 Mops/s   71.9
Full parse()              TR input                5.6 Mops/s    179

Full parse() includes sanitize → country-code extract → region dispatch → NANPA override → strict validation → type detection. CN has ~12k rules; TR ~550.


Test inventory

test_parse         86   national + intl + transform + unicode digits
test_format        24   E164 / NATIONAL / INTL / RFC3966
test_validate      19   per-type validation, FIXED_OR_MOBILE
test_extension     19   ext/x/xt/# separators
test_letterphone    9   A-Z → 2-9
test_leading_zero   8   IT, VA, SM, CI
test_out_of_country 11  011 / 00 / 0011 / 8~10
test_api           31   example_number, region APIs, raw input, OOC
test_wrappers      16   NDC length, for_region, match_type, init
test_short_numbers 18   emergency + valid short
test_aytf          39   keystroke-by-keystroke progressive format
test_matcher       11   findNumbers over free text
test_bulk         985   every XML exampleNumber, round-tripped E.164
                        + is_valid + type consistency
────────────────
total            1276
