fix: support CJK characters in MinHash fuzzy deduplication #1357

Open
Lucas5357 wants to merge 1 commit into getzep:main from Lucas5357:fix/cjk-minhash-support

@Lucas5357

Summary

  • _normalize_name_for_fuzzy() uses regex [^a-z0-9' ] which strips all non-ASCII characters, making MinHash/Jaccard similarity always return 0.0 for CJK (Chinese, Japanese, Korean) entity names
  • Replace with [^\w' ] to preserve Unicode word characters
  • Add adaptive n-gram: 2-gram shingles for CJK text (each character carries more semantic weight) vs unchanged 3-gram for Latin
  • Add 8 unit tests covering CJK normalization, shingle generation, similarity scoring, and entity resolution
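
The normalization and adaptive n-gram changes listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's code: the function names `normalize_name_sketch`, `has_cjk_sketch`, and `shingles_sketch`, the chosen CJK code point ranges, and the handling of whitespace and very short names are all assumptions.

```python
from __future__ import annotations

import re

# Assumed CJK ranges for illustration only; the PR's _has_cjk() may cover
# a different set of Unicode blocks.
_CJK_PATTERN = re.compile(
    "[\u3040-\u30ff"  # Hiragana and Katakana
    "\u3400-\u4dbf"   # CJK Unified Ideographs Extension A
    "\u4e00-\u9fff"   # CJK Unified Ideographs
    "\uac00-\ud7af]"  # Hangul Syllables
)


def normalize_name_sketch(name: str, pattern: str = r"[^\w' ]") -> str:
    """Lowercase, replace disallowed characters with spaces, collapse whitespace."""
    # For str patterns in Python 3, \w is Unicode-aware, so CJK characters survive.
    cleaned = re.sub(pattern, " ", name.lower())
    return " ".join(cleaned.split())


def has_cjk_sketch(text: str) -> bool:
    """Return True if the text contains at least one character in the assumed ranges."""
    return _CJK_PATTERN.search(text) is not None


def shingles_sketch(normalized: str) -> set[str]:
    """Character n-grams: 2-grams if any CJK is present, otherwise 3-grams."""
    compact = normalized.replace(" ", "")
    n = 2 if has_cjk_sketch(compact) else 3
    if len(compact) < n:
        # Assumption: a name shorter than n yields itself as its only shingle.
        return {compact} if compact else set()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}


# The old ASCII-only pattern reduces a CJK name to an empty string.
old_result = normalize_name_sketch("北京大學", pattern=r"[^a-z0-9' ]")
new_result = normalize_name_sketch("北京大學")
```

With the old pattern `old_result` is empty, so no shingles can be generated from it; with the new pattern the name is preserved and produces three 2-gram shingles.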

Context

The existing Embedding + LLM layers already handle CJK entity resolution end-to-end, so the overall system still works. However, the MinHash fast-path silently falls through to LLM for every CJK entity pair, adding unnecessary latency and cost. This fix restores the deterministic fast-path for CJK scripts.
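
For readers unfamiliar with the fast-path, the sketch below shows one generic way to build MinHash signatures and estimate Jaccard similarity from them. It is not the project's implementation: the salted `blake2b` hashing, the `num_perm` default, and the choice to return 0.0 for empty signatures are assumptions made for illustration. It shows why an empty shingle set (the result of stripping every CJK character) can never clear a similarity threshold.

```python
from __future__ import annotations

import hashlib


def minhash_signature(shingles: set[str], num_perm: int = 32) -> tuple[int, ...]:
    """For each salted hash function, keep the minimum hash over all shingles."""
    if not shingles:
        return ()
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # distinct salt gives a distinct hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(
                        s.encode("utf-8"), digest_size=8, salt=salt
                    ).digest(),
                    "big",
                )
                for s in shingles
            )
        )
    return tuple(signature)


def estimated_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """Fraction of signature positions that agree; 0.0 if either side is empty."""
    if not sig_a or not sig_b or len(sig_a) != len(sig_b):
        return 0.0
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Identical shingle sets always agree in every position.
same = estimated_jaccard(
    minhash_signature({"東京", "京大", "大学"}),
    minhash_signature({"東京", "京大", "大学"}),
)
# Empty shingle sets (names stripped to "") yield no signature and score 0.0.
empty = estimated_jaccard(minhash_signature(set()), minhash_signature(set()))
```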

Test plan

  • All 31 existing + new tests pass (pytest tests/utils/maintenance/test_node_operations.py)
  • Latin text behaviour unchanged (3-gram shingles, same normalization)
  • CJK normalization preserves characters (was stripped to empty string before)
  • CJK 2-gram shingles produce correct sets
  • End-to-end Jaccard similarity > 0 for similar CJK names
  • Exact CJK name match resolves deterministically
  • Short CJK names correctly defer to LLM (entropy filter)
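
As a worked example of the similarity check in the test plan, the following computes exact Jaccard similarity over 2-gram shingles for two similar CJK names. The helper names `bigrams` and `jaccard` and the sample names are hypothetical. The PR's tests exercise the project's own functions, which this sketch does not reproduce.

```python
from __future__ import annotations


def bigrams(text: str) -> set[str]:
    """Overlapping 2-character shingles of the text."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined here as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


name_a = "東京大学"    # shingles: 東京, 京大, 大学
name_b = "東京大学院"  # shingles: 東京, 京大, 大学, 学院

similarity = jaccard(bigrams(name_a), bigrams(name_b))  # 3 shared / 4 total = 0.75

# If normalization strips both names to "", there are no shingles to compare.
stripped_similarity = jaccard(bigrams(""), bigrams(""))  # 0.0
```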

The `_normalize_name_for_fuzzy()` regex `[^a-z0-9' ]` strips all
non-ASCII characters, making MinHash/Jaccard similarity always return
0.0 for CJK entity names (Chinese, Japanese, Korean).

Changes:
- Replace `[^a-z0-9' ]` with `[^\w' ]` to preserve Unicode word chars
- Add `_has_cjk()` helper to detect CJK text
- Use 2-gram shingles for CJK (higher per-char entropy) vs 3-gram for
  Latin (unchanged behaviour)
- Add 8 unit tests covering CJK normalization, shingle generation,
  Jaccard similarity, and entity resolution

Note: the existing Embedding + LLM layers already handle CJK entity
resolution end-to-end; this fix restores the MinHash fast-path that
was silently broken for non-Latin scripts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@danielchalef
Member


Thank you for your submission; we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a pull request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


潘彥廷 does not appear to be a GitHub user. You need a GitHub account to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

2b3pro pushed a commit to 2b3pro/graphiti that referenced this pull request Mar 31, 2026
Port confirmed bug fixes from upstream getzep/graphiti PRs:

- PR getzep#1356: Fix label_propagation infinite loop by updating community_map
  before break check in bounded for-loop
- PR getzep#1362/getzep#1291: Strip markdown code fences from OpenAI generic client
  JSON responses before parsing
- PR getzep#1357: CJK character support in MinHash fuzzy dedup - use Unicode
  \w instead of [a-z0-9], detect CJK for 2-gram shingles vs 3-gram
- PR getzep#1332: Guard against null/invalid embeddings in similarity search
  by adding size() checks to Neo4j and FalkorDB vector queries
- PR getzep#1303: Search both edge directions during dedup resolution using
  bidirectional RELATES_TO match
- PR getzep#1330: Fix FalkorDB default_group_id from escaped '\\_' to '_'

Not applicable (TS port doesn't have the code):
- PR getzep#1281: No Gemini LLM client in TS port
- PR getzep#1276: TS resolver uses client-side scoring, not LLM context
- PR getzep#1289: TS reranker already returns 0 for empty logprobs
- PR getzep#1351: TS episode_mentions_reranker already sorts DESC
- PR getzep#1312: TS already validates node labels
- PR getzep#1212: TS addTripletFull already checks UUID collision
- PRs getzep#1326,1305,1295,1270,1222,1249,1272: FalkorDB RediSearch query
  building not present in TS port (uses Cypher CONTAINS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>