fix: support CJK characters in MinHash fuzzy deduplication by Lucas5357 · Pull Request #1357 · getzep/graphiti

Lucas5357 · 2026-03-29T06:19:14Z

Summary

- `_normalize_name_for_fuzzy()` uses the regex `[^a-z0-9' ]`, which strips all non-ASCII characters, so MinHash/Jaccard similarity always returns 0.0 for CJK (Chinese, Japanese, Korean) entity names
- Replace it with `[^\w' ]` to preserve Unicode word characters
- Add adaptive n-grams: 2-gram shingles for CJK text (each character carries more semantic weight) vs. unchanged 3-gram shingles for Latin text
- Add 8 unit tests covering CJK normalization, shingle generation, similarity scoring, and entity resolution
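
The change summarized above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the two regex patterns are quoted from the summary, but the function names (`normalize`, `has_cjk`, `shingles`), the Unicode ranges used for CJK detection, and the handling of short or mixed-script strings are all assumptions made for the example.

```python
import re

# Hypothetical names throughout; a sketch, not the graphiti source.
ASCII_ONLY = re.compile(r"[^a-z0-9' ]")  # old pattern quoted in the PR
UNICODE_WORD = re.compile(r"[^\w' ]")    # new pattern quoted in the PR

# Assumed detection ranges; the PR text does not say which blocks are checked.
CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def normalize(name: str, pattern: re.Pattern) -> str:
    """Lowercase, replace disallowed characters with spaces, collapse whitespace."""
    cleaned = pattern.sub(" ", name.lower())
    return " ".join(cleaned.split())


def has_cjk(text: str) -> bool:
    """Return True if any character falls inside one of the assumed CJK ranges."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in CJK_RANGES)


def shingles(normalized: str) -> set[str]:
    """Character n-grams: n=2 when CJK is present, n=3 otherwise."""
    compact = normalized.replace(" ", "")
    n = 2 if has_cjk(compact) else 3
    if len(compact) < n:
        # Assumption: a string shorter than n is kept as a single shingle.
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


print(repr(normalize("北京大学", ASCII_ONLY)))    # ''
print(repr(normalize("北京大学", UNICODE_WORD)))  # '北京大学'
print(sorted(shingles("alice")))                  # ['ali', 'ice', 'lic']
```

Note that in Python `\w` also matches the underscore, and that this sketch applies 2-gram shingles to the whole string as soon as one CJK character is present; the PR may treat mixed-script names differently.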

Context

The existing Embedding + LLM layers already handle CJK entity resolution end-to-end, so the overall system still works. However, the MinHash fast-path silently falls through to LLM for every CJK entity pair, adding unnecessary latency and cost. This fix restores the deterministic fast-path for CJK scripts.

Test plan

- All 31 existing + new tests pass (`pytest tests/utils/maintenance/test_node_operations.py`)
- Latin text behaviour unchanged (3-gram shingles, same normalization)
- CJK normalization preserves characters (previously stripped to an empty string)
- CJK 2-gram shingles produce correct sets
- End-to-end Jaccard similarity > 0 for similar CJK names
- Exact CJK name match resolves deterministically
- Short CJK names correctly defer to LLM (entropy filter)
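
The similarity checks in the test plan can be illustrated with a small worked example. It is a sketch under stated assumptions: it computes exact Jaccard similarity over 2-gram shingle sets as a stand-in for the MinHash estimate, the helper names (`normalize`, `bigrams`, `jaccard`, `similarity`) are hypothetical, and the sample names are made up for illustration rather than taken from the PR's tests.

```python
import re

OLD = r"[^a-z0-9' ]"  # pattern before the fix, as quoted in the PR
NEW = r"[^\w' ]"      # pattern after the fix, as quoted in the PR


def normalize(name: str, pattern: str) -> str:
    """Lowercase, blank out disallowed characters, collapse whitespace."""
    return " ".join(re.sub(pattern, " ", name.lower()).split())


def bigrams(normalized: str) -> set[str]:
    """2-gram character shingles (the size the PR uses for CJK text)."""
    compact = normalized.replace(" ", "")
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity; MinHash would only approximate this value."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(x: str, y: str, pattern: str) -> float:
    return jaccard(bigrams(normalize(x, pattern)), bigrams(normalize(y, pattern)))


# Illustrative names only.
print(similarity("北京大学", "北京大学校", OLD))  # 0.0  (both names normalize to '')
print(similarity("北京大学", "北京大学校", NEW))  # 0.75 (3 shared bigrams out of 4)
print(similarity("北京大学", "北京大学", NEW))    # 1.0
```

With the old pattern both names collapse to the empty string, so no shingles exist and the score is 0.0 regardless of how similar the names are; with the new pattern the shared bigrams produce a non-zero score.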

The `_normalize_name_for_fuzzy()` regex `[^a-z0-9' ]` strips all non-ASCII characters, making MinHash/Jaccard similarity always return 0.0 for CJK entity names (Chinese, Japanese, Korean).

Changes:
- Replace `[^a-z0-9' ]` with `[^\w' ]` to preserve Unicode word chars
- Add `_has_cjk()` helper to detect CJK text
- Use 2-gram shingles for CJK (higher per-char entropy) vs 3-gram for Latin (unchanged behaviour)
- Add 8 unit tests covering CJK normalization, shingle generation, Jaccard similarity, and entity resolution

Note: the existing Embedding + LLM layers already handle CJK entity resolution end-to-end; this fix restores the MinHash fast-path that was silently broken for non-Latin scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

danielchalef · 2026-03-29T06:19:25Z

Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.

I have read the CLA Document and I hereby sign the CLA

潘彥廷 does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Port confirmed bug fixes from upstream getzep/graphiti PRs:
- PR getzep#1356: Fix label_propagation infinite loop by updating community_map before break check in bounded for-loop
- PR getzep#1362/getzep#1291: Strip markdown code fences from OpenAI generic client JSON responses before parsing
- PR getzep#1357: CJK character support in MinHash fuzzy dedup - use Unicode \w instead of [a-z0-9], detect CJK for 2-gram shingles vs 3-gram
- PR getzep#1332: Guard against null/invalid embeddings in similarity search by adding size() checks to Neo4j and FalkorDB vector queries
- PR getzep#1303: Search both edge directions during dedup resolution using bidirectional RELATES_TO match
- PR getzep#1330: Fix FalkorDB default_group_id from escaped '\\_' to '_'

Not applicable (TS port doesn't have the code):
- PR getzep#1281: No Gemini LLM client in TS port
- PR getzep#1276: TS resolver uses client-side scoring, not LLM context
- PR getzep#1289: TS reranker already returns 0 for empty logprobs
- PR getzep#1351: TS episode_mentions_reranker already sorts DESC
- PR getzep#1312: TS already validates node labels
- PR getzep#1212: TS addTripletFull already checks UUID collision
- PRs getzep#1326,1305,1295,1270,1222,1249,1272: FalkorDB RediSearch query building not present in TS port (uses Cypher CONTAINS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Lucas5357 wants to merge 1 commit into getzep:main from Lucas5357:fix/cjk-minhash-support
